Solved – How would you categorize / extract information out of job descriptions

I have a bunch of job descriptions entered by users. They contain all sorts of misspellings and bad data, e.g.:

... tulane univ hospital tulip tullett prebon ...  weik investment weill cornell university medical center weis weiss waldee hohimer dds welded constrction l.p. welder welder welder ... 

What steps would you take to 'augment' these values with job-related insights?

The best I can think of is to give it to Wolfram Alpha, but I wonder whether there are other accessible techniques that I can use from Python.

Update:
I found out that there is a Standard Occupational Classification (SOC). I would really like to match each name to an SOC code, and each SOC code to a range of average salaries.

A potential way to start is Python's Natural Language Toolkit (NLTK), which is used for text and topic analysis and also has useful functions for extracting particular words from strings. For instance, you could extract words such as "medical" or "hospital" from the job descriptions in order to identify broad occupations and sectors. Given the spelling mistakes and the quality of the data, I don't think this can be fully automated, so you may end up coding the SOCs yourself. Nonetheless, having the broad occupations and sectors identified in this way already makes that task a lot easier.
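As a rough illustration of the keyword step, here is a minimal sketch that uses only the standard library (`difflib` for approximate matching) rather than NLTK, so it runs without extra downloads. The keyword list, the sector labels, and the `tag_sectors` helper are hypothetical and chosen only for this example; they are not SOC codes.

```python
import difflib
import re

# Hypothetical keyword-to-sector map for illustration only (not SOC codes).
SECTOR_KEYWORDS = {
    "hospital": "healthcare",
    "medical": "healthcare",
    "dds": "healthcare",
    "university": "education",
    "investment": "finance",
    "welder": "construction/manufacturing",
    "construction": "construction/manufacturing",
}


def tag_sectors(text, cutoff=0.85):
    """Return the sectors whose keywords approximately match a token in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vocabulary = list(SECTOR_KEYWORDS)
    sectors = set()
    for token in tokens:
        # get_close_matches tolerates small misspellings such as "constrction".
        match = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if match:
            sectors.add(SECTOR_KEYWORDS[match[0]])
    return sectors


print(tag_sectors("welded constrction l.p."))
print(tag_sectors("tulane univ hospital"))
```

The cutoff is a judgment call: a lower value catches more misspellings but also produces more false matches, and abbreviations such as "univ" are too short to match "university" at this threshold, which is one reason some manual coding is likely to remain.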

If you are interested in natural language processing, text and topic analysis, or text mining beyond this, a fairly inexpensive but useful book is "Natural Language Processing with Python" by Bird, Klein, and Loper (2009).

Occupational titles have been linked to salaries by David Autor. He linked data from the Current Population Survey (the survey that is also used to produce U.S. unemployment figures) to the SOC titles, from which you can get salaries for each occupation. From these you can easily compute mean salaries in each occupation, and you can even get an idea of the variance (within-occupation earnings inequality). David makes his data sets available in his data archive at MIT.
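Once each record carries an occupation label, the mean and variance per occupation are a short group-by. The sketch below uses made-up records and a hypothetical `summarize` helper; it does not reflect the layout of the actual data sets, and a real analysis of survey data would normally also apply the survey weights.

```python
from collections import defaultdict
from statistics import mean, variance

# Made-up (occupation, annual earnings) records, for illustration only.
records = [
    ("Welders", 40000),
    ("Welders", 44000),
    ("Dentists", 150000),
    ("Dentists", 170000),
]


def summarize(records):
    """Map each occupation to (mean earnings, sample variance of earnings)."""
    by_occupation = defaultdict(list)
    for occupation, earnings in records:
        by_occupation[occupation].append(earnings)
    summary = {}
    for occupation, values in by_occupation.items():
        # The sample variance needs at least two observations.
        spread = variance(values) if len(values) > 1 else None
        summary[occupation] = (mean(values), spread)
    return summary


for occupation, (avg, spread) in summarize(records).items():
    print(occupation, avg, spread)
```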
