The Internet Movie Database (IMDB) maintains a chart called the IMDB Top 250, which is a ranking of the top 250 movies according to a certain scoring metric. All the movies in this list are non-documentary, theatrical releases with a runtime of at least 45 minutes and at least 25,000 ratings.
This chart can be considered the simplest of recommenders. It doesn’t take into consideration the tastes of a particular user nor does it try to deduce similarities between different movies. It simply calculates a score for every movie based on a predefined metric and outputs a sorted list of movies based on that score.
This article covers the following:
- Building a clone of the IMDB Top 250 chart (henceforth referred to as the simple recommender).
- Taking the functionality of the chart one step further and building a knowledge-based recommender. This model takes user preferences with regard to genre, timeframe, runtime, and language, and recommends movies that satisfy all conditions.
You’ll need Python installed on your system. To use the Git repository, you also need to install Git. The code files of this article can be found on GitHub at https://github.com/PacktPublishing/Hands-On-Recommendation-Systems-with-Python/tree/master/Chapter3. You can also see the code in action at http://bit.ly/2v7SZD4.
The simple recommender
The first step in building your simple recommender is setting up your workspace. Create a new directory named IMDB. Create a Jupyter Notebook in this directory named Simple Recommender and open it in the browser.
Now load the dataset available at https://www.kaggle.com/rounakbanik/the-movies-dataset/downloads/movies_metadata.csv/7:
Upon running the cell, you should see a familiar table-like structure output in the notebook.
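The loading step can be sketched as follows. This is a minimal sketch that assumes the CSV has been downloaded next to the notebook as movies_metadata.csv; the path and the helper name load_movies are assumptions for illustration, not taken from the article's repository.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the downloaded file; adjust to wherever you saved it.
DATA_PATH = Path("movies_metadata.csv")


def load_movies(path):
    """Read the movies metadata CSV into a DataFrame."""
    # low_memory=False lets pandas look at the whole file when inferring
    # column types, which avoids DtypeWarning on columns with mixed values.
    return pd.read_csv(path, low_memory=False)


if DATA_PATH.exists():
    df = load_movies(DATA_PATH)
    print(df.head())
else:
    print(f"{DATA_PATH} not found; download it from Kaggle first.")
```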
Building the simple recommender is fairly straightforward. The steps are as follows:
- Choose a metric (or score) to rate the movies on
- Decide on the prerequisites for the movie to be featured on the chart
- Calculate the score for every movie that satisfies the conditions
- Output the list of movies in decreasing order of their scores
The metric is the numeric quantity based on which you rank movies. A movie is considered to be better than another movie if it has a higher metric score than the other movie. It is very important that you have a robust and reliable metric to build your chart upon to ensure a good quality of recommendations.
The choice of a metric is arbitrary. One of the simplest metrics that can be used is the movie rating. However, this suffers from a variety of disadvantages. In the first place, the movie rating does not take the popularity of a movie into consideration. Therefore, a movie rated 9 by 100,000 users will be placed below a movie rated 9.5 by 100 users. This is not desirable as it is highly likely that a movie watched and rated only by 100 people caters to a very specific niche and may not appeal as much to the average person as the former.
It is also a well-known fact that as the number of voters increases, the rating of a movie normalizes and approaches a value that is reflective of the movie’s quality and popularity with the general populace. To put it another way, ratings based on very few votes are not very reliable. A movie rated 10/10 by five users isn’t necessarily a good movie.
Therefore, what you need is a metric that can, to an extent, take into account the movie rating and the number of votes it has garnered (a proxy for popularity). This would give a greater preference to a blockbuster movie rated 8 by 100,000 users over an art house movie rated 9 by 100 users.
Fortunately, you do not have to brainstorm a mathematical formula for the metric. You can use IMDB’s weighted rating formula as your metric. Mathematically, it can be represented as follows:
Weighted Rating (WR) = $\frac{v}{v + m} \cdot R + \frac{m}{v + m} \cdot C$
The following apply:
- v is the number of votes garnered by the movie
- m is the minimum number of votes required for the movie to be in the chart (the prerequisite)
- R is the mean rating of the movie
- C is the mean rating of all the movies in the dataset
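To see how the formula trades off rating against vote count, here is a small worked sketch using the blockbuster and art house figures mentioned earlier. The values chosen for m and C are illustrative assumptions, not numbers computed from the dataset.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating.

    Blends a movie's own mean rating R with the dataset-wide mean C,
    giving R more weight as the vote count v grows relative to m.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Illustrative values only (assumptions, not computed from the dataset).
m = 50    # minimum number of votes required
C = 5.6   # mean rating across all movies

blockbuster = weighted_rating(v=100_000, R=8.0, m=m, C=C)
art_house = weighted_rating(v=100, R=9.0, m=m, C=C)

print(round(blockbuster, 2), round(art_house, 2))
```

With these values the blockbuster ends up with the higher score even though its raw rating is lower, because the art house film's 100 votes pull its score noticeably toward C.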
You already have the values for v and R for every movie in the form of the vote_count and vote_average features, respectively. Calculating C is trivial: it is the mean of vote_average over all the movies.
The IMDB weighted formula also has a variable m, which it requires to compute its score. This variable is in place to make sure that only movies that are above a certain threshold of popularity are considered for the rankings. Therefore, the value of m determines the movies that qualify to be in the chart and also, by being part of the formula, determines the final value of the score.
Just like the metric, the choice of the value of m is arbitrary. In other words, there is no right value for m. It is a good idea to experiment with different values of m and then choose the one that you (and your audience) think gives the best recommendations. The only thing to be kept in mind is that the higher the value of m, the higher the emphasis on the popularity of a movie, and therefore the higher the selectivity.
For your recommender, use the number of votes garnered by the 80th percentile movie as your value for m. In other words, for a movie to be considered in the rankings, it must have garnered more votes than at least 80% of the movies present in your dataset. Additionally, the number of votes garnered by the 80th percentile movie is used in the weighted formula described previously to come up with the value for the scores.
Now calculate the value of m.
You can see that only 20% of the movies have gained more than 50 votes. Therefore, your value of m is 50.
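The percentile calculation can be sketched with the quantile method of a pandas Series. The vote counts below are made up so that the example runs without the dataset file; on the real data you would call the same method on the vote_count column of your DataFrame.

```python
import pandas as pd

# Made-up vote counts standing in for the dataset's vote_count column.
toy = pd.DataFrame(
    {"vote_count": [3, 8, 12, 20, 35, 50, 90, 400, 1500, 9000]}
)

# 80th percentile of the vote counts (linear interpolation by default).
m = toy["vote_count"].quantile(0.80)
print(m)

# Share of movies with more votes than m.
print((toy["vote_count"] > m).mean())
```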
Another prerequisite that you want in place is the runtime. Only consider movies that are greater than 45 minutes and less than 300 minutes in length. Define a new DataFrame, q_movies, which will hold all the movies that qualify to appear in the chart.
From your dataset of 45,000 movies, approximately 9,000 movies (or 20%) made the cut.
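The filtering step might look like the sketch below. The rows are made up, and treating "qualifies" as having at least m votes is an assumption about the exact comparison used.

```python
import pandas as pd

# Made-up rows using the dataset's column names.
toy = pd.DataFrame({
    "title": ["Short", "Feature A", "Feature B", "Marathon", "Obscure"],
    "runtime": [20, 110, 95, 480, 100],
    "vote_count": [900, 1200, 60, 700, 4],
})

m = 50  # the vote threshold quoted in the text

# Keep movies longer than 45 and shorter than 300 minutes that also have
# at least m votes. .copy() gives an independent DataFrame, so adding a
# column to q_movies later does not touch toy.
q_movies = toy[
    (toy["runtime"] > 45)
    & (toy["runtime"] < 300)
    & (toy["vote_count"] >= m)
].copy()

print(q_movies["title"].tolist())
```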
Calculating the score
The final value that you need to discover before you calculate your scores is C, the mean rating for all the movies in the dataset:
The average rating of a movie is approximately 5.6/10. It seems that IMDB happens to be particularly strict with their ratings. Now that you have the value of C, you can go about calculating your score for each movie.
First, define a function that computes the rating for a movie, given its features and the values of m and C.
Next, use the familiar apply function on your q_movies DataFrame to construct a new feature, score. Since the calculation is done for every row, set the axis to 1 to denote row-wise operation.
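A minimal sketch of the scoring function and the row-wise apply is shown below. The two movies are made up, and m and C are set by hand; on the real data C would be the mean of the vote_average column.

```python
import pandas as pd

# Made-up qualifying movies; column names follow the dataset.
q_movies = pd.DataFrame({
    "title": ["Popular", "Niche"],
    "vote_count": [5000, 60],
    "vote_average": [8.0, 9.0],
})

m = 50    # vote threshold from the text
C = 5.6   # assumed here; on the real data: df["vote_average"].mean()


def weighted_rating(x, m=m, C=C):
    """Compute the IMDB weighted rating for one row of the DataFrame."""
    v = x["vote_count"]
    R = x["vote_average"]
    return (v / (v + m)) * R + (m / (v + m)) * C


# axis=1 passes each row (rather than each column) to weighted_rating.
q_movies["score"] = q_movies.apply(weighted_rating, axis=1)
print(q_movies[["title", "score"]])
```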
Sorting and output
There is just one step left. You now need to sort your DataFrame on the basis of the score you just computed and output the list of top movies:
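The sorting step can be sketched with sort_values. The titles and scores below are placeholders, not results from the dataset.

```python
import pandas as pd

# Placeholder scored movies standing in for q_movies.
q_movies = pd.DataFrame({
    "title": ["B", "A", "C"],
    "vote_count": [300, 5000, 80],
    "vote_average": [7.1, 8.0, 6.4],
    "score": [6.89, 7.98, 6.09],
})

# Highest score first; head(25) then keeps at most the top 25 rows.
top = q_movies.sort_values("score", ascending=False)
print(top[["title", "vote_count", "vote_average", "score"]].head(25))
```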
And voila! You have just built your recommender. Congratulations!
You can see that the Bollywood film Dilwale Dulhania Le Jayenge figures at the top of the list. It has a noticeably smaller number of votes than the other Top 25 movies. This strongly suggests that you should probably explore a higher value of m. Experiment with different values of m and observe how the movies in the chart change.
The knowledge-based recommender
Now, you’ll learn to build a knowledge-based recommender on top of your IMDB Top 250 clone. This will be a simple function that will perform the following tasks:
- Ask the user for the genres of movies they are looking for
- Ask the user for the duration
- Ask the user for the timeline of the movies recommended
- Using the information collected, recommend movies to the user that have a high weighted rating (according to the IMDB formula) and that satisfy the preceding conditions
The data that you have has information on the duration, genres, and timelines, but it isn’t currently in a form that is directly usable. Your data needs to be wrangled before it can be put to use to build this recommender.
In your IMDB folder, create a new Jupyter Notebook named Knowledge Recommender. This notebook will contain all the code that you write as part of this section.
Load your packages and the data into your notebook. Also, take a look at the features that you have and decide on the ones that will be useful for this task:
From your output, it is quite clear which features you do and do not require. Now, reduce your DataFrame to only contain features that you need for your model:
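Inspecting and trimming the columns could look like the sketch below. The single row is made up, and the list of columns to keep is an assumed selection based on the filters and the score described in this section.

```python
import pandas as pd

# One made-up row with a handful of the dataset's columns.
toy = pd.DataFrame([{
    "title": "Toy Story",
    "genres": "[{'id': 16, 'name': 'Animation'}]",
    "release_date": "1995-10-30",
    "runtime": 81.0,
    "vote_average": 7.7,
    "vote_count": 5415,
    "budget": "0",
    "tagline": None,
}])

# List every available feature.
print(toy.columns.tolist())

# Assumed selection: what the genre, duration, and timeline filters and
# the weighted rating need.
keep = ["title", "genres", "release_date", "runtime",
        "vote_average", "vote_count"]
toy = toy[keep]
print(toy.head())
```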
Extract the year of release from your release_date feature.
The year feature is still an object and is riddled with NaT values, which are a type of null value used by Pandas. Convert these values to an integer, 0, and convert the datatype of the year feature into int.
To do this, define a helper function, convert_int, and apply it to the year feature.
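One way the year extraction and the convert_int helper might be written is sketched below on made-up rows. Parsing with pd.to_datetime and errors="coerce" is an assumption about how the dates are read; only the helper's name comes from the text.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Dated", "Garbled", "Missing"],
    "release_date": ["1995-10-30", "not a date", None],
})

# Unparseable or missing dates become NaT instead of raising an error.
dates = pd.to_datetime(toy["release_date"], errors="coerce")

# Take the text before the first '-' of each value: '1995' for a real
# date, the literal string 'NaT' for a null one.
toy["year"] = dates.apply(lambda d: str(d).split("-")[0])


def convert_int(x):
    """Return x as an int, or 0 when it cannot be converted."""
    try:
        return int(x)
    except (TypeError, ValueError):
        return 0


toy["year"] = toy["year"].apply(convert_int)
print(toy[["title", "year"]])
```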
The runtime feature is already in a form that is usable. It doesn’t require any additional wrangling. Now, turn your attention to genres.
You may observe that the genres are in a format that looks like a JSON object (or a Python dictionary). Take a look at the genres object of one of your movies:
Observe that the output is a stringified list of dictionaries. In order for this feature to be usable, it is important that you convert this string into a native Python object. Fortunately, Python gives you access to a function called literal_eval (available in the ast library) which does exactly that. literal_eval parses a string containing a Python literal (such as a list, dictionary, number, or string) and converts it into its corresponding Python object.
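A short sketch of literal_eval in action follows. The input string is an illustrative value written in the same shape as the genres field.

```python
from ast import literal_eval

# Illustrative string in the shape of the dataset's genres field.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(type(raw))

# literal_eval only evaluates Python literals, so unlike eval it will not
# run arbitrary code found in the string.
parsed = literal_eval(raw)
print(type(parsed), type(parsed[0]))
print(parsed[0]["name"])
```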
You now have all the tools required to convert the genres feature into native Python objects.
Each dictionary in the list represents a genre and has two keys: id and name. However, for this exercise, you only require the name. Therefore, convert your list of dictionaries into a list of strings, where each string is a genre name.
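The conversion can be sketched as below on two made-up rows. Replacing missing values with an empty list and lowercasing the genre names are assumptions made for this sketch.

```python
from ast import literal_eval

import pandas as pd

toy = pd.DataFrame({
    "title": ["Toy Story", "No Genres"],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, "
        "{'id': 10751, 'name': 'Family'}]",
        None,
    ],
})

# Treat missing values as an empty list so literal_eval always gets a string.
toy["genres"] = toy["genres"].fillna("[]").apply(literal_eval)

# Keep only each genre's name, lowercased (an assumed convention).
toy["genres"] = toy["genres"].apply(
    lambda gs: [g["name"].lower() for g in gs] if isinstance(gs, list) else []
)
print(toy["genres"].tolist())
```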
Printing the head of the DataFrame should show you a new genres feature, which is a list of genre names. However, you’re not done yet. The last step is to explode the genres column. In other words, if a particular movie has multiple genres, create multiple copies of the movie, with each copy having one of the genres.
For example, if there is a movie called Just Go With It that has romance and comedy as its genres, explode this movie into two rows. One row will be Just Go With It as a romance movie. The other will be Just Go With It as a comedy movie.
You should be able to see three Toy Story rows now, one each to represent animation, family, and comedy. This gen_df DataFrame is what you will use to build your knowledge-based recommender.
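One way to explode the column is pandas' built-in DataFrame.explode (available from pandas 0.25), sketched below on made-up rows. This is a swapped-in technique and may differ from the approach used in the article's repository; renaming the column to genre is also an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Toy Story", "Just Go With It"],
    "genres": [["animation", "comedy", "family"], ["romance", "comedy"]],
})

# explode repeats each row once per element of its genres list.
gen_df = (
    toy.explode("genres")
       .rename(columns={"genres": "genre"})
       .reset_index(drop=True)
)
print(gen_df)
```

Note that explode turns an empty genres list into a single row whose genre is NaN, so such movies never match a genre filter.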
The build_chart function
You are finally in a position to write the function that will act as your recommender. You cannot use the values of m and C computed earlier, because you will no longer be considering every movie, just the ones that match the user’s preferences. In other words, there are three main steps:
- Get user input on their preferences
- Extract all movies that match the conditions set by the user
- Calculate the values of m and C for only these movies and proceed to build the chart as in the previous section
The build_chart function will accept only two inputs: your gen_df DataFrame and the percentile used to calculate the value of m. By default, set this to 80%, or 0.8.
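A minimal sketch of such a function follows. It assumes gen_df has the columns genre, runtime, year, vote_count, and vote_average, and that preferences are collected with input(). The prompt wording and the inclusive range checks are assumptions.

```python
import pandas as pd


def build_chart(gen_df, percentile=0.8):
    """Ask for preferences, filter gen_df, and rank matches by weighted rating."""
    genre = input("Preferred genre: ").strip().lower()
    low_time = int(input("Shortest duration (minutes): "))
    high_time = int(input("Longest duration (minutes): "))
    low_year = int(input("Earliest year: "))
    high_year = int(input("Latest year: "))

    # Keep only the movies that satisfy every condition (bounds inclusive).
    movies = gen_df[
        (gen_df["genre"] == genre)
        & gen_df["runtime"].between(low_time, high_time)
        & gen_df["year"].between(low_year, high_year)
    ]

    # m and C are recomputed on the filtered subset only.
    C = movies["vote_average"].mean()
    m = movies["vote_count"].quantile(percentile)

    # Apply the vote threshold, then score the survivors column-wise.
    q_movies = movies[movies["vote_count"] >= m].copy()
    v = q_movies["vote_count"]
    R = q_movies["vote_average"]
    q_movies["score"] = (v / (v + m)) * R + (m / (v + m)) * C

    return q_movies.sort_values("score", ascending=False)
```

Calling build_chart(gen_df).head() in a notebook cell prompts for the five answers in turn and returns the top rows of the resulting chart.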
Time to put your model into action!
You may want recommendations for animated movies between 30 minutes and 2 hours in length, and released anywhere between 1990 and 2005. See the results:
You can see that the movies that it outputs satisfy all the conditions you passed in as input. Since you applied IMDB’s metric, you can also observe that your movies are very highly rated and popular at the same time.
If you found this article interesting, you can explore Rounak Banik’s Hands-On Recommendation Systems with Python. All you need to get started with building recommendation systems is a familiarity with Python. With Hands-On Recommendation Systems with Python, learn the tools and techniques required to build various kinds of powerful recommendation systems (collaborative, knowledge-based, and content-based) and deploy them to the web.