1 Intro
Have you ever wondered how LinkedIn recommends jobs that fit your profile, or how Netflix keeps pushing new content that matches your taste? Behind the scenes they use something called a recommender. Let’s build a simple recommender from scratch.
1.1 Content-Based Filtering
Definition from Wikipedia
Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. These methods are best suited to situations where there is known data on an item (name, location, description, etc.), but not on the user. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user’s likes and dislikes based on an item’s features.
2 Getting Started
I’ll use a dataset of job post records from Kaggle, the Monster.com Job Posting sample.
2.1 Loading data, filtering software-related posts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import pprint
### Loading Data, filtering something with software
postings = pd.read_csv("monster_com-job_sample.csv")
# contains Software in title
softwareRelated = postings.loc[postings['job_title'].str.contains("[Ss]oftware", na=False)]
print(softwareRelated.job_title)
2.2 Vectorizer
2.2.1 Count Vectorizer
Let’s rewind a little in case you have never heard of this. A count vectorizer converts a sentence into an array of word counts. For instance:
vectorizer = CountVectorizer()
# How different are these two sentences?
strings = ["A fox jump over another fox", "A fox jump over the lazy dog"]
matrix = vectorizer.fit_transform(strings)
table = [vectorizer.get_feature_names_out().tolist()] + matrix.toarray().tolist()
similarity = cosine_similarity(matrix)
pprint.pprint(table)
pprint.pprint(similarity)
We get the result:
[['another', 'dog', 'fox', 'jump', 'lazy', 'over', 'the'],
[ 1, 0, 2, 1, 0, 1, 0],
[ 0, 1, 1, 1, 1, 1, 1]]
array([[1. , 0.6172134],
[0.6172134, 1. ]])
There are two “fox” tokens in the first sentence, and each one jumps over something different, yet the similarity is 0.6172134. Now we have a sense of how CountVectorizer can help analyze textual similarity.
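As a sanity check, the 0.6172134 figure can be reproduced by hand from the two count vectors in the table above, using the definition of cosine similarity (dot product divided by the product of the vector norms):

```python
import numpy as np

# Count vectors from the table above
# (column order: another, dog, fox, jump, lazy, over, the)
v1 = np.array([1, 0, 2, 1, 0, 1, 0])  # "A fox jump over another fox"
v2 = np.array([0, 1, 1, 1, 1, 1, 1])  # "A fox jump over the lazy dog"

# cosine similarity = dot product / (|v1| * |v2|)
cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(cos, 7))  # 0.6172134
```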
2.2.2 Vectorizing Job Description
### Vectorizing
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(softwareRelated['job_description'])
counts = matrix.toarray()
2.3 Sorting
We now have the matrix of word counts. Suppose a developer has applied to the first position; its vector is counts[0]. Which other, similar jobs might they also want to apply to? There are many ways to answer this, and cosine similarity is one of the simplest.
similarity = cosine_similarity([counts[0]], counts[1:])
# similarity[0][i] scores job i + 1 (job 0 was excluded), so offset the indices
jobSimilarity = list(zip(range(1, len(softwareRelated)), similarity[0].tolist()))
jobSorted = sorted(jobSimilarity, key=lambda t: t[1], reverse=True)
print("The first job:", softwareRelated.job_description.iloc[0])
print("Top 10 Similar jobs:")
pprint.pprint(softwareRelated.job_description.iloc[[x[0] for x in jobSorted[:10]]])
2.4 Not that simple (of course)
Clearly it’s not that simple, and the results we got from the previous steps weren’t even close. In fact, most of the content in a job description isn’t even relevant: a company may spend most of it explaining its ideology and how many days of PTO it offers. Unfortunately I won’t be able to provide a complete solution to this problem due to space limits.
2.4.1 Approach A, Blacklisting (aka stop_words)
One way of filtering out these common words is the stop_words parameter of CountVectorizer, which makes it ignore the listed terms when counting, for example “pto” or “teamwork”. (Note that CountVectorizer lowercases tokens by default, so the stop words should be lowercase too.)
vectorizer = CountVectorizer(stop_words = [
    'pto', 'and', 'at', 'can', 'copy', 'for', 'from',
    'in', 'is', 'no', 'not', 'of', 'on', 'or', ])
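To see the effect, here is a minimal sketch on two made-up one-line “job descriptions” (the documents and the stop list are illustrative, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up toy documents to show the effect of a stop list
docs = ["great pto and python experience",
        "python and java experience"]

# Without stop words, filler terms inflate the vocabulary
plain = CountVectorizer().fit(docs)
print(sorted(plain.vocabulary_))
# ['and', 'experience', 'great', 'java', 'pto', 'python']

# With a stop list, only the informative terms survive
filtered = CountVectorizer(stop_words=['and', 'pto', 'great']).fit(docs)
print(sorted(filtered.vocabulary_))
# ['experience', 'java', 'python']
```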
2.4.2 Approach B, Whitelisting (aka vocabulary)
The other built-in parameter of CountVectorizer is vocabulary: only the terms listed in it will be counted. In our case of job postings, it’s definitely the answer!
# the default token_pattern drops single-letter tokens, so relax it for 'c'
vectorizer = CountVectorizer(vocabulary = [
    'python', 'c', 'java', 'go', 'django', 'flask'],
    token_pattern = r"(?u)\b\w+\b")
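Here is a quick sketch of what such a whitelist produces on two made-up job descriptions (the documents are illustrative, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up job descriptions for illustration
docs = ["Senior Python developer, Django and Flask experience required",
        "Java engineer with some Go exposure"]

skills = ['python', 'c', 'java', 'go', 'django', 'flask']
# relaxed token_pattern so the single-letter skill 'c' survives tokenization
vectorizer = CountVectorizer(vocabulary=skills, token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform(docs)

# columns follow the order of the skills list
print(matrix.toarray())
# [[1 0 0 0 1 1]
#  [0 0 1 1 0 0]]
```

Every column now corresponds to one skill, so the resulting vectors compare jobs purely by the technologies they mention.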
3 Conclusion
Now you know that building a recommender can be simple, while tweaking the model can be really time-consuming.
Now you understand why LinkedIn asks you to ENDORSE someone else’s skills? Those skill terms are exactly the perfect terms for vocabulary. Content-based filtering usually has the following steps:
– 1. Understand features of your data.
– 2. Vectorize features into counts (maybe with CountVectorizer).
– 3. Calculate a similarity score.
– 4. Recommend the highest score!
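The four steps above can be sketched end to end in a few lines; the job descriptions and the recommend helper here are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy job descriptions (step 1: the feature is the description text)
jobs = [
    "python django web developer",
    "java backend engineer",
    "python flask api developer",
    "go microservices engineer",
]

def recommend(jobs, liked_index, top_n=2):
    """Return indices of the jobs most similar to the one at liked_index."""
    matrix = CountVectorizer().fit_transform(jobs)              # step 2: vectorize
    scores = cosine_similarity(matrix[liked_index], matrix)[0]  # step 3: score
    ranked = sorted(range(len(jobs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != liked_index][:top_n]      # step 4: recommend

print(recommend(jobs, 0))  # the Flask job (index 2) ranks first
```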