Building a recommender system
Agenda:
- What is a recommender system?
- Defining evaluation metrics [3 possible approaches]
- The popularity model
- Content based filtering
- Collaborative filtering
The problem
Where might you find or use a recommendation engine? Some examples:
Global examples
* Predicting next cities to visit ( booking.com )
* Predicting next meetups to attend ( meetup.com )
* Predicting people to befriend ( facebook.com )
* Predicting what ads to show you ( google.com )
etc..
Local examples
* Predicting future products to buy ( emag ;) )
* Predict next meals to have ( hipmenu ;) )
* Predicting next teams to follow ( betfair ;) )
etc..
Our data and use case
Let’s just say that I know many people from Amazon :D
We will be using a book dataset found here. It contains 10k books and 6 million ratings (scores from 1 to 5).
Our task is to take those ratings and suggest to each user new things to read, based on their previous feedback.
Formally
Given
* a set of users [U1, U2, …]
* a set of possible elements [E1, E2, …]
* some prior interactions (a relation) between Us and Es ({seen, clicked, subscribed, bought}, a rating, some feedback, etc.),

for a given user U, predict a list of the top N elements from E such that U maximizes the defined relation.
As I've said, the relation is usually something numeric and business-defined (amount of money, click-through rate, churn, etc.). A toy instance of the problem is sketched below.
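For example (made-up numbers, just to fix ideas):

# A toy instance of the formal problem: two users, three elements, and
# ratings as the relation. All numbers are hypothetical.
interactions = {
    ('U1', 'E1'): 5, ('U1', 'E2'): 2,
    ('U2', 'E1'): 4, ('U2', 'E3'): 5,
}
# For U1, the engine must rank the unseen elements (here only E3) by the
# predicted value of the relation and return the top N.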
Loading the data
import pandas as pd
books = pd.read_csv("./Building a recommender system/books.csv")
ratings = pd.read_csv("./Building a recommender system/ratings.csv")
tags = pd.read_csv("./Building a recommender system/tags.csv")
tags = tags.set_index('tag_id')
book_tags = pd.read_csv("./Building a recommender system/book_tags.csv")
Data exploration
This wrapper function tells the pandas library to display all the fields
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)
Books are sorted by their popularity, as measured by number of ratings
display_all(books.head().T)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
book_id | 1 | 2 | 3 | 4 | 5 |
goodreads_book_id | 2767052 | 3 | 41865 | 2657 | 4671 |
best_book_id | 2767052 | 3 | 41865 | 2657 | 4671 |
work_id | 2792775 | 4640799 | 3212258 | 3275794 | 245494 |
books_count | 272 | 491 | 226 | 487 | 1356 |
isbn | 439023483 | 439554934 | 316015849 | 61120081 | 743273567 |
isbn13 | 9.78044e+12 | 9.78044e+12 | 9.78032e+12 | 9.78006e+12 | 9.78074e+12 |
authors | Suzanne Collins | J.K. Rowling, Mary GrandPré | Stephenie Meyer | Harper Lee | F. Scott Fitzgerald |
original_publication_year | 2008 | 1997 | 2005 | 1960 | 1925 |
original_title | The Hunger Games | Harry Potter and the Philosopher's Stone | Twilight | To Kill a Mockingbird | The Great Gatsby |
title | The Hunger Games (The Hunger Games, #1) | Harry Potter and the Sorcerer's Stone (Harry P... | Twilight (Twilight, #1) | To Kill a Mockingbird | The Great Gatsby |
language_code | eng | eng | en-US | eng | eng |
average_rating | 4.34 | 4.44 | 3.57 | 4.25 | 3.89 |
ratings_count | 4780653 | 4602479 | 3866839 | 3198671 | 2683664 |
work_ratings_count | 4942365 | 4800065 | 3916824 | 3340896 | 2773745 |
work_text_reviews_count | 155254 | 75867 | 95009 | 72586 | 51992 |
ratings_1 | 66715 | 75504 | 456191 | 60427 | 86236 |
ratings_2 | 127936 | 101676 | 436802 | 117415 | 197621 |
ratings_3 | 560092 | 455024 | 793319 | 446835 | 606158 |
ratings_4 | 1481305 | 1156318 | 875073 | 1001952 | 936012 |
ratings_5 | 2706317 | 3011543 | 1355439 | 1714267 | 947718 |
image_url | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361975680m... | https://images.gr-assets.com/books/1490528560m... |
small_image_url | https://images.gr-assets.com/books/1447303603s... | https://images.gr-assets.com/books/1474154022s... | https://images.gr-assets.com/books/1361039443s... | https://images.gr-assets.com/books/1361975680s... | https://images.gr-assets.com/books/1490528560s... |
We have 10k books
len(books)
10000
Ratings are sorted chronologically, oldest first.
display_all(ratings.head())
user_id | book_id | rating | |
---|---|---|---|
0 | 1 | 258 | 5 |
1 | 2 | 4081 | 4 |
2 | 2 | 260 | 5 |
3 | 2 | 9296 | 5 |
4 | 2 | 2318 | 3 |
ratings.rating.min(), ratings.rating.max()
(1, 5)
ratings.rating.hist( bins = 5, grid=False)
It appears that 4 is the most popular rating. There are relatively few ones and twos.
len(ratings)
5976479
Most books have a few hundred reviews, but some have as few as eight.
reviews_per_book = ratings.groupby( 'book_id' ).book_id.apply( lambda x: len( x ))
reviews_per_book.to_frame().describe()
book_id | |
---|---|
count | 10000.000000 |
mean | 597.647900 |
std | 1267.289788 |
min | 8.000000 |
25% | 155.000000 |
50% | 248.000000 |
75% | 503.000000 |
max | 22806.000000 |
Train test split
Of course, we first need to follow best practices and split the data into training and testing sets.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(ratings,
stratify=ratings['user_id'],
test_size=0.20,
)
len(train_df), len(test_df)
(4781183, 1195296)
Evaluation metrics
Let's think a bit about how we would measure a recommendation engine…
Any ideas?
If you said any of {precision, recall, f-score, accuracy}, you're wrong
Top-N accuracy metrics
… are a class of metrics which evaluate the accuracy of the top recommendations provided to a user, compared to the items the user has actually interacted with in the test set.
- Recall@N
  - given a ranked result list such as [n, n, n, p, n, n, …], where p marks a relevant item
  - measures how often p lands among the top N returned values
  - has variants Recall@5, Recall@10, etc.
- NDCG@N (Normalized Discounted Cumulative Gain @ N)
  - a recommender returns some items and we'd like to compute how good the list is. Each item has a relevance score, usually a non-negative number; that's gain
  - we then add up those scores; that's cumulative gain
  - we'd prefer to see the most relevant items at the top of the list, therefore before summing the scores we divide each by a growing number (usually the logarithm of the item's position); that's discounting
  - DCGs are not directly comparable between users, so we normalize them
- MAP@N (Mean Average Precision @ N): the mean, over all users, of the average precision of the top N recommendations
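To make Recall@N and NDCG@N concrete, here is a minimal sketch in plain Python/numpy; these helper functions are my own illustration, not from any library:

import numpy as np

def recall_at_n(recommended, relevant, n):
    # fraction of the user's relevant items that show up in the top-n
    # recommendations; assumes the user has at least one relevant item
    top_n = set(recommended[:n])
    return len(top_n & set(relevant)) / len(relevant)

def ndcg_at_n(recommended, relevance, n):
    # relevance maps item -> gain (e.g. its rating); missing items have gain 0
    gains = np.array([relevance.get(item, 0) for item in recommended[:n]], dtype=float)
    discounts = np.log2(np.arange(2, len(gains) + 2))  # position i is discounted by log2(i + 1)
    dcg = (gains / discounts).sum()
    ideal = np.sort(list(relevance.values()))[::-1][:n]  # best possible ordering
    idcg = (ideal / np.log2(np.arange(2, len(ideal) + 2))).sum()
    return dcg / idcg if idcg > 0 else 0.0

# e.g. recall_at_n([9, 2, 5, 7], relevant=[2, 7], n=2) == 0.5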
Popularity model
A common baseline approach is the Popularity model.
This model is not actually personalized - it simply recommends to a user the most popular items that the user has not previously consumed.
As the popularity accounts for the “wisdom of the crowds”, it usually provides good recommendations, generally interesting for most people.
book_ratings = ratings.groupby('book_id').size().reset_index(name='users')
book_popularity = ratings.groupby('book_id')['rating'].sum().sort_values(ascending=False).reset_index()
book_popularity = pd.merge(book_popularity, book_ratings, how='inner', on=['book_id'])
book_popularity = pd.merge(book_popularity, books[['book_id', 'title', 'authors']], how='inner', on=['book_id'])
book_popularity = book_popularity.sort_values(by=['rating'], ascending=False)
book_popularity.head()
book_id | rating | users | title | authors | |
---|---|---|---|---|---|
0 | 1 | 97603 | 22806 | The Hunger Games (The Hunger Games, #1) | Suzanne Collins |
1 | 2 | 95077 | 21850 | Harry Potter and the Sorcerer's Stone (Harry P... | J.K. Rowling, Mary GrandPré |
2 | 4 | 82639 | 19088 | To Kill a Mockingbird | Harper Lee |
3 | 18 | 70059 | 15855 | Harry Potter and the Prisoner of Azkaban (Harr... | J.K. Rowling, Mary GrandPré, Rufus Beck |
4 | 25 | 69265 | 15304 | Harry Potter and the Deathly Hallows (Harry Po... | J.K. Rowling, Mary GrandPré |
Unfortunately, this strategy depends on what we rank on…
Above, books are sorted by their popularity, as measured by the sum of all their ratings. Let's imagine we want another way to rank this list: sorting books by their average rating (i.e. sum(ratings) / len(ratings)).
book_popularity.rating = book_popularity.rating / book_popularity.users
book_popularity = book_popularity.sort_values(by=['rating'], ascending=False)
book_popularity.head(n=20)
book_id | rating | users | title | authors | |
---|---|---|---|---|---|
2108 | 3628 | 4.829876 | 482 | The Complete Calvin and Hobbes | Bill Watterson |
9206 | 7947 | 4.818182 | 88 | ESV Study Bible | Anonymous, Lane T. Dennis, Wayne A. Grudem |
6648 | 9566 | 4.768707 | 147 | Attack of the Deranged Mutant Killer Monster S... | Bill Watterson |
4766 | 6920 | 4.766355 | 214 | The Indispensable Calvin and Hobbes | Bill Watterson |
5702 | 8978 | 4.761364 | 176 | The Revenge of the Baby-Sat | Bill Watterson |
3904 | 6361 | 4.760456 | 263 | There's Treasure Everywhere: A Calvin and Hobb... | Bill Watterson |
4228 | 6590 | 4.757202 | 243 | The Authoritative Calvin and Hobbes: A Calvin ... | Bill Watterson |
2695 | 4483 | 4.747396 | 384 | It's a Magical World: A Calvin and Hobbes Coll... | Bill Watterson |
3627 | 3275 | 4.736842 | 285 | Harry Potter Boxed Set, Books 1-5 (Harry Potte... | J.K. Rowling, Mary GrandPré |
1579 | 1788 | 4.728528 | 652 | The Calvin and Hobbes Tenth Anniversary Book | Bill Watterson |
4047 | 5207 | 4.722656 | 256 | The Days Are Just Packed: A Calvin and Hobbes ... | Bill Watterson |
9659 | 8946 | 4.720000 | 75 | The Divan | Hafez |
1067 | 1308 | 4.718114 | 933 | A Court of Mist and Fury (A Court of Thorns an... | Sarah J. Maas |
5938 | 9141 | 4.711765 | 170 | The Way of Kings, Part 1 (The Stormlight Archi... | Brandon Sanderson |
681 | 862 | 4.702840 | 1373 | Words of Radiance (The Stormlight Archive, #2) | Brandon Sanderson |
4445 | 3753 | 4.699571 | 233 | Harry Potter Collection (Harry Potter, #1-6) | J.K. Rowling |
3900 | 5580 | 4.689139 | 267 | The Calvin and Hobbes Lazy Sunday Book | Bill Watterson |
6968 | 8663 | 4.680851 | 141 | Locke & Key, Vol. 6: Alpha & Omega | Joe Hill, Gabriel Rodríguez |
5603 | 8109 | 4.677596 | 183 | The Absolute Sandman, Volume One | Neil Gaiman, Mike Dringenberg, Chris Bachalo, ... |
7743 | 8569 | 4.669355 | 124 | Styxx (Dark-Hunter, #22) | Sherrilyn Kenyon |
?! Maybe this Bill Watterson pays people for good reviews…
Of course, a raw average is a poor way to find good books. Fans of a niche title may give it high scores, and since they are often the only ones reviewing it, the book ends up with a suspiciously high average.
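One common mitigation, sketched below, is to damp each book's average toward the global mean, so that books with few ratings cannot dominate the ranking. This is not used in the rest of the post, and the damping weight m = 500 is an arbitrary choice:

# Damped (Bayesian) average: pretend every book also received m "virtual"
# ratings equal to the global mean rating. Note that book_popularity.rating
# currently holds the per-book average, so rating * users recovers the sum.
global_mean = ratings.rating.mean()
m = 500
damped = (book_popularity.rating * book_popularity.users + m * global_mean) \
    / (book_popularity.users + m)
book_popularity.assign(damped=damped).sort_values(by='damped', ascending=False).head()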
Content based filtering
The aim of this approach is to group similar objects together and recommend new objects from the same categories as the ones the user has already consumed.
We already have tags for books in this dataset, let’s use them!
book_tags.head()
goodreads_book_id | tag_id | count | |
---|---|---|---|
0 | 1 | 30574 | 167697 |
1 | 1 | 11305 | 37174 |
2 | 1 | 11557 | 34173 |
3 | 1 | 8717 | 12986 |
4 | 1 | 33114 | 12716 |
def get_tag_name(tag_id):
    return {word for word in tags.loc[tag_id].tag_name.split('-') if word}
get_tag_name(20000)
{'in', 'midnight', 'paris'}
We're going to accumulate all the tags of a book in a single data structure
from tqdm import tqdm_notebook as tqdm
book_tags_dict = dict()
for book_id, tag_id, _ in tqdm(book_tags.values):
    tags_of_book = book_tags_dict.setdefault(book_id, set())
    tags_of_book |= get_tag_name(tag_id)
Let’s see the tags for one book
" ".join(book_tags_dict[105])
'up read literature reading ficción place i time speculative audible favorites ebooks adult 20th ciencia book chronicles owned sciencefiction audio general kindle opera gave sci bought fiction currently imaginary stories not future books herbert short fantascienza favourites sf space scifi paperback it bookshelf religion get buy re default fantasy 1980s series ficcion audiobook frank novels century fi my adventure philosophy classic home to dune and calibre in e novel on science classics american ebook shelfari unread politics f finished epic scanned s all audiobooks english own library sff'
And the book is…
books.loc[books.goodreads_book_id == 105][['book_id', 'title', 'authors']]
book_id | title | authors | |
---|---|---|---|
2816 | 2817 | Chapterhouse: Dune (Dune Chronicles #6) | Frank Herbert |
There are two types of ids in this dataset: goodreads_book_id and book_id. We will make two dicts to switch from one to the other.
goodread2id = {goodreads_book_id: book_id for book_id, goodreads_book_id in books[['book_id', 'goodreads_book_id']].values}
id2goodread = dict(zip(goodread2id.values(), goodread2id.keys()))
id2goodread[2817], goodread2id[105]
(105, 2817)
Then we're going to convert the tags into a plain numpy array that we will process later. The row position of each tag entry should match the book_id. Because book ids start from 1, we add a DUMMY padding element at position 0.
import numpy as np
np_tags = np.array(sorted([[0, "DUMMY"]] + [[goodread2id[id], " ".join(tags)] for id, tags in book_tags_dict.items()]))
np_tags[:5]
array([['0', 'DUMMY'],
['1',
'read age reading i time speculative 2014 the favorites loved adult ebooks than of book owned 5 audio thriller kindle suzanne club faves favourite teen sci love currently fiction dystopian stars drama suspense action reviewed dystopia future 2011 young books ya futuristic favourites sf post 2013 borrowed trilogy scifi it games once buy distopian re default fantasy series distopia triangle audiobook novels 2010 fi my adventure romance favs contemporary lit to reads 2012 in e novel dystopias science favorite ebook shelfari finished star collins reread hunger more survival all audiobooks english apocalyptic completed coming own library'],
['2',
'read 2016 literature reading own 2015 i time 2014 favorites 2017 jk adult than childhood owned 5 audio england kindle faves favourite teen sci kids j currently fiction mystery friendship stars youth childrens young books ya k potter favourites 2013 urban scifi it bookshelf once buy re default fantasy series audiobook novels fi my grade adventure wizards shelf favs contemporary witches classic to reads children harry in rereads on novel science classics rowling favorite juvenile british ebook shelfari magic middle reread more s supernatural paranormal all audiobooks english lit library'],
['3',
'read stephenie reading finish i time high favorites 2008 adult than meh book owned 5 werewolves meyer kindle pleasures club faves teen sci guilty love fiction currently not stars drama stephanie young books ya already favourites urban scifi it bookshelf horror once re default fantasy saga series triangle audiobook novels youngadult fi my movie vampires séries romance shelf contemporary lit to vampire chick again did in on twilight movies science favorite pleasure american adults 2009 ebook dnf shelfari vamps never finished school abandoned more reread supernatural pnr first paranormal all audiobooks english romantic completed have own library'],
['4',
'crime read 2016 age lee literature reading own 2015 i time required historical 1001 the 2014 high history favorites prize family adult 20th racism you of book bookclub childhood before list owned 5 audio general kindle wish club banned faves usa favourite fiction currently mystery challenge stars drama young books ya favourites modern it buy re default race for audiobook die novels century harper my contemporary classic to reads pulitzer clàssics southern again in novel gilmore classics american favorite literary ebook shelfari realistic school rory reread must all audiobooks english coming lit library']],
dtype='<U970')
Next up we want to process the tags, but if you look closely there are many words that have the same meaning yet are slightly different (because of the context in which they are used).
We'd like to normalize them as much as possible so as to keep the overall vocabulary small.
This can be accomplished through stemming and lemmatization. Let me show you an example:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmer.stem('autobiographical')
'autobiograph'
Lemmatisation is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
Stemming is the process of reducing inflected words to their word stem: the base or root form, which is generally itself a written word form.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        self.stm = PorterStemmer()

    def __call__(self, doc):
        return [self.stm.stem(self.wnl.lemmatize(t)) for t in word_tokenize(doc)]
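A quick usage example (the exact output depends on your NLTK version, and the punkt and wordnet data must be downloaded first):

tok = LemmaTokenizer()
tok("fantasy books")  # I'd expect something like ['fantasi', 'book']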
After this we're ready to process the data.
We will build a sklearn pipeline to process the tags.
We will first use a tf-idf vectorizer, customized to tokenize words with the LemmaTokenizer class implemented above. Then we will use a StandardScaler transform (with with_mean=False, which preserves sparsity) to scale each feature of the resulting matrix to unit variance.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
p = Pipeline([
('vectorizer', TfidfVectorizer(
tokenizer=LemmaTokenizer(),
strip_accents='unicode',
ngram_range=(1, 1),
max_features=1000,
min_df=0.005,
max_df=0.5,
)),
('normalizer', StandardScaler(with_mean=False))
])
trans = p.fit_transform(np_tags[:,1])
trans.shape
(10001, 1000)
After this point, the trans variable contains a row for each book; each such row is a 1000-dimensional vector. Each of its elements is the score of one of the 1000 most important words that the TfidfVectorizer decided to keep (chosen from all the book tags provided).
This is the vectorized representation of the book: a mostly-zero vector holding the scores of the most important words (accumulated across all users) with which people tagged the book.
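As a quick sanity check (a small sketch of my own; recall from above that row 2817 is Chapterhouse: Dune), we can peek at the heaviest-weighted tokens of one book vector:

feature_names = p.named_steps['vectorizer'].get_feature_names()
row = trans[2817].toarray().flatten()  # trans is sparse; densify just this one row
top10 = row.argsort()[-10:][::-1]      # indices of the 10 largest scores
[(feature_names[i], round(row[i], 2)) for i in top10]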
Next up, let's see how many users we have and extract them all into a single list.
users = ratings.set_index('user_id').index.unique().values
len(users)
53424
That’s how we get all the book ratings of a single user
ratings.loc[ratings.user_id == users[0]][['book_id', 'rating']].head()
book_id | rating | |
---|---|---|
0 | 258 | 5 |
75 | 268 | 3 |
76 | 5556 | 3 |
77 | 3638 | 3 |
78 | 1796 | 5 |
We’ll actually write a function for this because it’s rather obscure. We want all the book_ids that the user rated, along with the given rating for each book.
def books_and_ratings(user_id):
    books_and_ratings_df = ratings.loc[ratings.user_id == user_id][['book_id', 'rating']]
    u_books, u_ratings = zip(*books_and_ratings_df.values)
    return np.array(u_books), np.array(u_ratings)
u_books, u_ratings = books_and_ratings(users[0])
u_books.shape, trans[u_books].shape
((117,), (117, 1000))
We then multiply each book's rating with the features of that book, to boost those features' importance for this user, and then average everything together into a single user-specific feature vector.
user_vector = (u_ratings * trans[u_books]) / len(u_ratings)
user_vector.shape
(1000,)
If we take all the features of the books a user read, scale each book by the user's rating, and then average all the scaled book features as above, we obtain a condensed form of that user's preferences.
So, doing the above, we have just obtained a user_vector: a 1000-dimensional vector that expresses what the user likes, combining their prior ratings on the books they read with the respective book_vectors.
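In formula form, for a user u who rated the set of books B_u:
\[userVector(u) = \frac{1}{|B_u|} \sum_{b \in B_u} rating(u, b) \cdot bookVector(b)\]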
def get_user_vector(user_id):
    u_books, u_ratings = books_and_ratings(user_id)
    u_books_features = trans[u_books]
    u_vector = (u_ratings * u_books_features) / len(u_ratings)
    return u_vector
def get_user_vectors():
    user_vectors = np.zeros((len(users), 1000))
    # only the first 1000 users, to keep the demo fast; rows are indexed by the
    # raw user_id, which matches how we look the vectors up below
    for user_id in tqdm(users[:1000]):
        u_vector = get_user_vector(user_id)
        user_vectors[user_id, :] = u_vector
    return user_vectors
user_vectors = get_user_vectors()
The pipeline transformation also keeps the most important 1000 words. Let’s see a sample of them now..
trans_feature_names = p.named_steps['vectorizer'].get_feature_names()
np.random.permutation(np.array(trans_feature_names))[:100]
array(['guilti', 'america', 'roman', 'witch', '1970', 'scott', 'hous',
'occult', 'michael', 'winner', 'nonfic', 'improv', 'australia',
'pre', 'britain', 'london', 'tear', 'comicbook', 'steami', 'arab',
'young', 'alex', 'literatura', 'man', 'altern', 's', 'sport',
'warfar', 'california', 'americana', 'keeper', '311', 'era',
'life', 'heroin', 'urban', 'sexi', '18th', 'were', 'black',
'super', 'goodread', 'nativ', 'روايات', 'novella', 'great',
'youth', 'pleasur', 'mayb', 'canadiana', 'childhood', 'realli',
'be', 'home', 'pictur', 'for', 'all', 'guardian', 'race',
'investig', 'earli', 'easi', 'latin', '314', 'long', 'seen', '20',
'polit', 'group', 'fave', 'abus', 'lendabl', 'clasico', 'essay',
'punk', 'town', 'biblic', 'mental', 'oprah', 'fantasia', 'tween',
'dean', 'asia', 'jane', 'epub', 'hilari', 'as', 'singl', '33',
'new', 'ministri', 'psycholog', 'sweet', 'jennif', 'orphan', '8th',
'idea', 'neurosci', 'natur', 'boyfriend'], dtype='<U16')
Once we have the user_vectors computed, we can show the 20 most important words for a given user.
user_id = 801
pd.DataFrame(
sorted(
zip(
trans_feature_names,
user_vectors[user_id].flatten().tolist()
),
key=lambda x: -x[1]
)[:20],
columns=['token', 'relevance']
).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
token | club | group | bookclub | literatur | abandon | literari | gener | t | centuri | didn | memoir | drama | shelfari | histori | biographi | recommend | non | did | borrow | not |
relevance | 6.32432 | 6.20145 | 5.94722 | 5.46231 | 5.01933 | 4.88025 | 4.53921 | 4.39075 | 4.36124 | 4.25427 | 4.21645 | 4.1113 | 4.10304 | 4.07408 | 4.07243 | 3.99759 | 3.99382 | 3.89373 | 3.82424 | 3.80188 |
The last piece of the puzzle at this stage is making a link between the book_vectors and the user_vectors.
Since both of them are computed in the same feature space, we simply need a metric that ranks a given user_vector against all the book_vectors.
One such metric is the cosine_similarity that we're using below. It compares two n-dimensional vectors and returns a score between -1 and 1, where:
- a value close to -1 means the vectors have opposite orientations (e.g. comedy and drama)
- a value close to 0 means the vectors have no relation between them
- a value close to 1 means the vectors are quite similar
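For reference, the formula behind the score:
\[cosineSimilarity(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}\]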
The code below will compute a recommendation for a single user.
from sklearn.metrics.pairwise import cosine_similarity
user_id = 100
cosine_similarities = cosine_similarity(np.expand_dims(user_vectors[user_id], 0), trans)
similar_indices = cosine_similarities.argsort().flatten()[-20:]
similar_indices
array([ 194, 172, 7810, 3085, 213, 180, 35, 872, 3508, 5374, 8456,
397, 3913, 1015, 8233, 162, 629, 2926, 4531, 323])
Which translates to the following books:
books.loc[books.book_id.isin(similar_indices)][['title', 'authors']]
title | authors | |
---|---|---|
34 | The Alchemist | Paulo Coelho, Alan R. Clarke |
161 | The Stranger | Albert Camus, Matthew Ward |
171 | Anna Karenina | Leo Tolstoy, Louise Maude, Leo Tolstoj, Aylmer... |
179 | Siddhartha | Hermann Hesse, Hilda Rosner |
193 | Moby-Dick or, The Whale | Herman Melville, Andrew Delbanco, Tom Quirk |
212 | The Metamorphosis | Franz Kafka, Stanley Corngold |
322 | The Unbearable Lightness of Being | Milan Kundera, Michael Henry Heim |
396 | Perfume: The Story of a Murderer | Patrick Süskind, John E. Woods |
628 | Veronika Decides to Die | Paulo Coelho, Margaret Jull Costa |
871 | The Plague | Albert Camus, Stuart Gilbert |
1014 | Steppenwolf | Hermann Hesse, Basil Creighton |
2925 | The Book of Laughter and Forgetting | Milan Kundera, Aaron Asher |
3084 | Narcissus and Goldmund | Hermann Hesse, Ursule Molinaro |
3507 | Swann's Way (In Search of Lost Time, #1) | Marcel Proust, Simon Vance, Lydia Davis |
3912 | Immortality | Milan Kundera |
4530 | The Joke | Milan Kundera |
5373 | Laughable Loves | Milan Kundera, Suzanne Rappaport |
7809 | Slowness | Milan Kundera, Linda Asher |
8232 | The Book of Disquiet | Fernando Pessoa, Richard Zenith |
8455 | Life is Elsewhere | Milan Kundera, Aaron Asher |
This approach gives far better predictions than the popularity model, but on the other hand it relies on us having content for the books (tags, book_tags, etc.), which we might not have, or which might add a new layer of complexity (parsing, cleaning, summarizing, etc.).
Collaborative filtering
The idea of this approach is that users fall into interest buckets.
If we are able to say that user A and user B fall into the same bucket (both may like history books), then whatever A liked, B might also like.
books.head()
book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... |
1 | 2 | 3 | 3 | 4640799 | 491 | 439554934 | 9.780440e+12 | J.K. Rowling, Mary GrandPré | 1997.0 | Harry Potter and the Philosopher's Stone | ... | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1474154022s... |
2 | 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9.780316e+12 | Stephenie Meyer | 2005.0 | Twilight | ... | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361039443s... |
3 | 4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9.780061e+12 | Harper Lee | 1960.0 | To Kill a Mockingbird | ... | 3198671 | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m... | https://images.gr-assets.com/books/1361975680s... |
4 | 5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9.780743e+12 | F. Scott Fitzgerald | 1925.0 | The Great Gatsby | ... | 2683664 | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m... | https://images.gr-assets.com/books/1490528560s... |
5 rows × 23 columns
Our goal here is to express each user and each book in some semantic representation derived from the ratings we have.
We will model each id (both user_id and book_id) as a hidden latent-variable sequence (also called an embedding). The user_id embedding represents that user's personal tastes. The book_id embedding represents the book's characteristics.
We then assume that a user's rating is the product between their personal tastes (the user's embedding) and the book's characteristics (the book's embedding), plus two bias terms.
Basically, this means we will try to model the formula
rating = user_preferences * book_characteristics + user_bias + book_bias
- user_bias is the tendency of a user to give higher or lower scores.
- book_bias is the tendency of a book to be better known, publicized, and talked about, and therefore rated higher because of it.
We expect that while training, the ratings will back-propagate enough information into the embeddings so as to jointly learn both the user_preferences vectors and the book_characteristics vectors.
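Before building this in Keras, here is the same formula as a minimal numpy sketch (toy sizes, random values, and names of my own choosing), just to make the shapes explicit:

import numpy as np

rng = np.random.RandomState(0)
toy_user_prefs = rng.randn(3, 2)   # 3 users, 2 hidden factors each
toy_book_chars = rng.randn(4, 2)   # 4 books, 2 hidden factors each
toy_user_bias = rng.randn(3)
toy_book_bias = rng.randn(4)

ui, bi = 1, 2  # predict user 1's rating of book 2
rating = toy_user_prefs[ui] @ toy_book_chars[bi] + toy_user_bias[ui] + toy_book_bias[bi]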
from keras.layers import Input, Dense, Embedding, Flatten
from keras.layers.merge import dot, add
from keras.engine import Model
from keras.regularizers import l2
from keras.optimizers import Adam
hidden_factors = 10

user = Input(shape=(1,))
emb_u_w = Embedding(input_length=1, input_dim=len(users), output_dim=hidden_factors)  # user tastes
emb_u_b = Embedding(input_length=1, input_dim=len(users), output_dim=1)               # user bias

book = Input(shape=(1,))
emb_b_w = Embedding(input_length=1, input_dim=len(books), output_dim=hidden_factors)  # book characteristics
emb_b_b = Embedding(input_length=1, input_dim=len(books), output_dim=1)               # book bias

# rating = user_embedding . book_embedding + user_bias + book_bias
merged = dot([
    Flatten()(emb_u_w(user)),
    Flatten()(emb_b_w(book))
], axes=-1)
merged = add([merged, Flatten()(emb_u_b(user))])
merged = add([merged, Flatten()(emb_b_b(book))])
model = Model(inputs=[user, book], outputs=merged)
model.summary()
model.compile(optimizer='adam', loss='mse')
model.optimizer.lr=0.001
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_32 (InputLayer) (None, 1) 0
__________________________________________________________________________________________________
input_33 (InputLayer) (None, 1) 0
__________________________________________________________________________________________________
embedding_62 (Embedding) (None, 1, 10) 534240 input_32[0][0]
__________________________________________________________________________________________________
embedding_64 (Embedding) (None, 1, 10) 100000 input_33[0][0]
__________________________________________________________________________________________________
flatten_60 (Flatten) (None, 10) 0 embedding_62[0][0]
__________________________________________________________________________________________________
flatten_61 (Flatten) (None, 10) 0 embedding_64[0][0]
__________________________________________________________________________________________________
embedding_63 (Embedding) (None, 1, 1) 53424 input_32[0][0]
__________________________________________________________________________________________________
dot_9 (Dot) (None, 1) 0 flatten_60[0][0]
flatten_61[0][0]
__________________________________________________________________________________________________
flatten_62 (Flatten) (None, 1) 0 embedding_63[0][0]
__________________________________________________________________________________________________
embedding_65 (Embedding) (None, 1, 1) 10000 input_33[0][0]
__________________________________________________________________________________________________
add_15 (Add) (None, 1) 0 dot_9[0][0]
flatten_62[0][0]
__________________________________________________________________________________________________
flatten_63 (Flatten) (None, 1) 0 embedding_65[0][0]
__________________________________________________________________________________________________
add_16 (Add) (None, 1) 0 add_15[0][0]
flatten_63[0][0]
==================================================================================================
Total params: 697,664
Trainable params: 697,664
Non-trainable params: 0
__________________________________________________________________________________________________
train_df.head()
user_id | book_id | rating | |
---|---|---|---|
2492982 | 12364 | 889 | 4 |
1374737 | 5905 | 930 | 3 |
4684686 | 49783 | 2398 | 3 |
5951422 | 27563 | 438 | 4 |
5588313 | 44413 | 4228 | 4 |
raw_data = train_df[['user_id', 'book_id', 'rating']].values
raw_valid = test_df[['user_id', 'book_id', 'rating']].values

# subtract 1 because the embedding rows are 0-indexed while the ids start at 1
u = raw_data[:,0] - 1
b = raw_data[:,1] - 1
r = raw_data[:,2]

# note: the validation arrays must come from raw_valid, not raw_data
vu = raw_valid[:,0] - 1
vb = raw_valid[:,1] - 1
vr = raw_valid[:,2]
model.fit(x=[u, b], y=r, validation_data=([vu, vb], vr), epochs=1)
Train on 100000 samples, validate on 30000 samples
Epoch 1/1
100000/100000 [==============================] - 36s 365us/step - loss: 8.6191 - val_loss: 7.1418
After the training is done, we can retrieve the embedding values for the books, the users and the biases in order to reproduce the computations ourselves for a single user.
book_embeddings = emb_b_w.get_weights()[0]
book_embeddings.shape, "10000 books each with 10 hidden features (the embedding)"
((10000, 10), '10000 books each with 10 hidden features (the embedding)')
user_embeddings = emb_u_w.get_weights()[0]
user_embeddings.shape, "54k users each with 10 hidden preferences"
((53424, 10), '54k users each with 10 hidden preferences')
user_bias = emb_u_b.get_weights()[0]
book_bias = emb_b_b.get_weights()[0]
user_bias.shape, book_bias.shape, "every user and book has a specific bias"
((53424, 1), (10000, 1), 'every user and book has a specific bias')
Now, let's recompute the formula
\[bookRating(b, u) = userEmbedding(u) \cdot bookEmbedding(b) + bookBias(b) + userBias(u)\]
and do this for every book, for a given user.
this_user = 220
books_ranked_for_user = (np.dot(book_embeddings, user_embeddings[this_user]) + user_bias[this_user] + book_bias.flatten())
books_ranked_for_user.shape
(10000,)
We get back 10000 ratings, one for each book, scores computed specifically for this user’s tastes.
We can now sort the ratings and get the 10 most “interesting” ones.
best_book_ids = np.argsort(books_ranked_for_user)[-10:]  # 0-based rows into book_embeddings, i.e. book_id - 1
best_book_ids
array([19, 17, 22, 23, 24, 16, 3, 26, 0, 1])
Which decode to…
books.loc[books.book_id.isin(best_book_ids)][['title', 'authors']]
title | authors | |
---|---|---|
0 | The Hunger Games (The Hunger Games, #1) | Suzanne Collins |
2 | Twilight (Twilight, #1) | Stephenie Meyer |
15 | The Girl with the Dragon Tattoo (Millennium, #1) | Stieg Larsson, Reg Keeland |
16 | Catching Fire (The Hunger Games, #2) | Suzanne Collins |
18 | The Fellowship of the Ring (The Lord of the Ri... | J.R.R. Tolkien |
21 | The Lovely Bones | Alice Sebold |
22 | Harry Potter and the Chamber of Secrets (Harry... | J.K. Rowling, Mary GrandPré |
23 | Harry Potter and the Goblet of Fire (Harry Pot... | J.K. Rowling, Mary GrandPré |
25 | The Da Vinci Code (Robert Langdon, #2) | Dan Brown |
Again, you can imagine all of this code being tied up into a single system.
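For instance, a minimal sketch reusing the variables computed above (note the + 1 that maps the 0-based embedding rows back to book_ids):

def recommend_cf(user_id, n=10):
    # score every book against this user's embedding, then add the biases
    scores = (np.dot(book_embeddings, user_embeddings[user_id])
              + user_bias[user_id]
              + book_bias.flatten())
    top = np.argsort(scores)[-n:][::-1]  # 0-based rows, i.e. book_id - 1
    return books.loc[books.book_id.isin(top + 1)][['title', 'authors']]

recommend_cf(220)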
Conclusions
- Recommender systems are (needed) everywhere
- The metrics used to evaluate one are more exotic (e.g. NDCG@N, MAP@N)
- We’ve shown how to implement 3 recommendation engine models:
- Popularity model
- Content based model
- Collaboration based model
- The collaboration-based model needs the least data (information) of the three: just the raw ratings, with no content about the items