A Recurrent Neural Network learns Indeed.com Job Postings
A few months ago Andrej Karpathy wrote an excellent introductory article on recurrent neural networks, The Unreasonable Effectiveness of Recurrent Neural Networks. With this article, he released some code (and a larger version) that lets anyone train character-level language models. While RNNs have been around for a long time (Jeff Elman from UCSD Cognitive Science did pioneering work in this field), the current trend is to use deep learning techniques to build differently organized networks that attain higher performance, such as long short-term memory (LSTM) networks. Andrej demonstrated the model’s ability to learn the writing styles of Paul Graham and Shakespeare. He also demonstrated that this model could learn the structure of documents, allowing it to learn and then produce Wikipedia articles, LaTeX documents, and Linux source code.
Others used this tutorial to produce some pretty cool projects, modeling lyrics, speeches, and music (Eminem lyrics, Obama speeches, Irish folk music, and music in ABC notation) as well as learning and producing text that resembles biblical texts (RNN Bible and Learning Holiness).
Tom Brewe’s project to learn and generate cooking recipes, along with Karpathy’s demonstration that the network can learn basic document syntax, inspired me to do the same with job postings. Once we’ve learned a model, we can see what dream jobs come out of its internal workings.
To do this, I performed the following:
- Obtain training data from indeed.com:
  - Create a function that takes a city, state, and job title and returns the indeed.com results
  - Gather the job posting results, scrape the HTML from each, and clean up the HTML
  - Save each simplified HTML file to disk
  - Gather all the simplified HTML files and compile them into one text file
- Use a recurrent neural network to learn the structure of the job postings
- Use the learned network to generate imaginary job postings
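The last data step, compiling the simplified files into one text file, can be sketched roughly as follows. This is a minimal illustration rather than the code I actually ran: the function name, the `*.html` pattern, and the blank-line separator between postings are all assumptions.

```python
from pathlib import Path


def compile_corpus(src_dir, out_path, pattern="*.html"):
    """Concatenate the non-empty files matching `pattern` into one text file.

    Returns the number of postings written. The names and the blank-line
    separator are illustrative assumptions, not the original script.
    """
    written = 0
    with Path(out_path).open("w", encoding="utf-8") as out:
        # Sort so the corpus is built in a reproducible order.
        for path in sorted(Path(src_dir).glob(pattern)):
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if text:
                out.write(text + "\n\n")
                written += 1
    return written
```

Karpathy's char-rnn reads its training text from a file named input.txt inside the data directory, so a call along the lines of compile_corpus('postings/', 'data/indeed/input.txt') would produce a file the training step can use (the postings/ folder name is hypothetical).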
Obtaining Training Data
To obtain my training data, I scraped job postings from the popular indeed.com for several major U.S. cities (San Francisco Bay Area, Seattle, New York, and Chicago). The code used to scrape the website came from this great tutorial by Jesse Steinweg-Woods. My modified code, available here, explicitly checked whether each posting was hosted on indeed.com (and not on another website, where the job posting structure was different) and stripped the page down to a bare-bones structure. I thought this more uniform structure would help reduce the training time for the recurrent neural network. Putting these 1001 jobs into one text document gives us a 4.2MB text file, or about 4 million characters.
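To make the two modifications concrete, here is a rough sketch of what a host check and a "bare-bones" HTML cleanup could look like using only the Python standard library. It is not the modified scraper itself: the function and class names are hypothetical, and the choice to keep tag names and visible text while dropping attributes, scripts, styles, and comments is my assumption about what a reasonable simplification looks like.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_indeed_url(url):
    """True if the URL's host is indeed.com or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "indeed.com" or host.endswith(".indeed.com")


class BareBonesHTML(HTMLParser):
    """Keep tag names and visible text; drop attributes, scripts, styles, comments."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, without a close tag.
        if tag not in self.SKIP and not self.skip_depth:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth:
            self.parts.append(html.escape(text, quote=False))


def simplify_html(raw_html):
    """Return a stripped-down version of raw_html, one tag or text run per line."""
    parser = BareBonesHTML()
    parser.feed(raw_html)
    parser.close()
    return "\n".join(parser.parts)
```

Removing attributes and scripts shrinks the character vocabulary and the length of each posting, which is the kind of reduction I was hoping would shorten training.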
Training the Recurrent Neural Network
Training the RNN was pretty straightforward. I used Karpathy’s code and the text document generated from all the job postings. I set up the network in the same manner as the network Karpathy outlined for the writings of Shakespeare:
th train.lua -data_dir data/indeed/ -rnn_size 512 -num_layers 3 -dropout 0.5
I trained this network overnight on my machine, which has a Titan Z GPU (here is more info on acquiring a GPU for academic use).
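For a sense of scale, the size of a network with -rnn_size 512 and -num_layers 3 can be estimated with the textbook LSTM parameter count. This is only a back-of-the-envelope sketch: it assumes one-hot character inputs and one bias vector per gate, the exact count in a given implementation may differ slightly, and the vocabulary size of 100 is a made-up placeholder, since I did not record the number of distinct characters in the corpus.

```python
def lstm_layer_params(input_size, hidden_size):
    """Textbook LSTM layer: four gates, each with input weights,
    recurrent weights, and one bias vector."""
    return 4 * hidden_size * (input_size + hidden_size + 1)


def char_lstm_params(vocab_size, rnn_size=512, num_layers=3):
    """Rough parameter count for a stacked character-level LSTM."""
    # The first layer sees one-hot characters (assumption).
    total = lstm_layer_params(vocab_size, rnn_size)
    # Every later layer takes the previous layer's hidden state as input.
    total += (num_layers - 1) * lstm_layer_params(rnn_size, rnn_size)
    # Output layer mapping the top hidden state to character scores.
    total += (rnn_size + 1) * vocab_size
    return total


# vocab_size=100 is a placeholder, not a measured value.
print(char_lstm_params(vocab_size=100))
```

Under these assumptions the model has a few million parameters, roughly the same order of magnitude as the 4 million characters of training text.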
Imaginary Job Postings
The training procedure produces a set of files that represent checkpoints in training. Let’s take a look at the loss over training:
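import os
import numpy as np
import matplotlib.pyplot as plt

# Each checkpoint file name encodes the epoch and the loss at that point.
loss_files = sorted(os.listdir('cv/'))
lfs = []
for lf in loss_files:
    # Drop the 13-character prefix and the 3-character extension,
    # leaving '<epoch>_<loss>'.
    lfs.append([float(v) for v in lf[13:-3].split('_')])
lfs = np.array(sorted(lfs))  # order the rows by epoch

plt.plot(lfs[:, 0], lfs[:, 1])
plt.title('Training Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.show()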
It looks like the model achieved pretty good results around epoch 19. After this, the loss rose but then came back down again. Let’s use the checkpoint that had the lowest validation loss (epoch 19) and the last checkpoint (epoch 50) to produce samples from the model. These samples will demonstrate some of the relationships that the model has learned. While none of these jobs actually exist, the model produces valid HTML code that represents imaginary dream job postings.
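Picking the checkpoint with the lowest loss can also be done programmatically instead of reading it off the plot. The sketch below assumes checkpoint names of the form lm_lstm_epoch<epoch>_<loss>.t7, which is what the slicing in the plotting code implies; the helper names are hypothetical.

```python
import re

# Assumed naming scheme: lm_lstm_epoch<epoch>_<loss>.t7
CKPT_RE = re.compile(r"^lm_lstm_epoch(\d+(?:\.\d+)?)_(\d+(?:\.\d+)?)\.t7$")


def parse_checkpoint(name):
    """Return (epoch, loss) for a checkpoint file name, or None if it does not match."""
    match = CKPT_RE.match(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def best_checkpoint(names):
    """Return (name, epoch, loss) for the checkpoint with the lowest loss."""
    scored = []
    for name in names:
        parsed = parse_checkpoint(name)
        if parsed is not None:  # ignore unrelated files in the folder
            epoch, loss = parsed
            scored.append((loss, epoch, name))
    if not scored:
        raise ValueError("no checkpoint files found")
    loss, epoch, name = min(scored)
    return name, epoch, loss
```

Calling best_checkpoint(os.listdir('cv/')) would then return the file to sample from, along with its epoch and loss.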
Below is one of the jobs that was produced when sampling from the saved model at epoch 19. It’s for a job at Manager Persons Inc. and it’s located in San Francisco, CA. It looks like there is a need for the applicant to be “pizza friendly” and the job requires purchasing and data entry. Not too shabby. Here it is as a web page.
At epoch 50, the model has learned a few more things and the job postings are typically much longer. Here is a job for Facetionteal Agency (as a website). As you can see, more training can be done to improve the language model (“Traininging calls”) but it does a pretty good job of looking like a job posting. Some items are fun to note, like that the job requires an education of Mountain View, CA.
Below is another longer one (and as a website). Turns out the model wants to provide jobs that pay $1.50 an hour. The person needs to be a team player.
This was a pretty fun experiment! We could keep going with this as well. There are several knobs to turn to get different performance and to see what kind of results this model could produce. I encourage you to grab an interesting dataset and see what kind of fun things you can do with recurrent neural networks!