Movies on Netflix, Prime Video, Hulu and Disney+

Technology is used more today than ever before, and its use is likely to keep growing. With the outbreak of Covid-19, the world came to a halt. More than a year later, many parts of the world are still in lockdown. As a result, movie and TV show streaming platforms have seen increased use as a form of entertainment, since places like cinemas have been shut down. Because these platforms are available as apps, users can access them whenever they want and watch their favourite movies on the go. The platforms also let users filter by their preferences, which aids the search for movies. It is fascinating how these platforms are able to make suggestions based on a user’s previous streaming history. The purpose of this blog is to try to understand the workings of these popular streaming platforms and see if we can make predictions about the content that is streamed on them.

Introduction

Using machine learning models, we try to predict which platform is more likely to stream a movie that will be released in the upcoming years. The four platforms available in our dataset are Netflix, Hulu, Prime Video and Disney+. We start off by cleaning the movies dataset and performing Exploratory Data Analysis (EDA) on it. To support our main question, we will try to answer the following subquestions.

• Which platform has the highest number and the most diverse range of movies available on it?

• Which platform appeals to which age group the most?

• Which country is producing the most multilingual content?

• Is there a correlation between IMDb ratings and Rotten Tomatoes ratings?

• Does having more than one director have any effect on the Rotten Tomatoes/IMDb ratings?

• Is the USA streaming more movies on these platforms than other countries?

Next, we will try to build different classifiers using the movie features available in the dataset. We will discuss this more later.

Data Cleaning and EDA

A visualisation of the dataset is available on the following link:

Original/Uncleaned movies dataset

We will now explain the steps taken to clean the dataset. The first step was to set the “ID” column as the index of our data. Our dataset consists of 16 columns, each describing a movie feature. The “Type” column contains the value “0” in every row, so it carries no information about our dataset and we drop it. The “Unnamed: 0” column does not signify anything either and is dropped as well.

The data now consists of 14 columns. Next, we observe that the % sign attached to every value in the “Rotten Tomatoes” column makes the column difficult to work with, so we remove the % signs and convert the column type to float.

Movies dataset after dropping the Type and Unnamed: 0 columns
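The steps so far can be sketched with pandas. This is a minimal illustration on a few made-up rows, not our original notebook code; the toy values are assumptions, while the column names follow the ones mentioned above.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
movies = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "ID": [1, 2, 3],
    "Title": ["A", "B", "C"],
    "Rotten Tomatoes": ["87%", None, "64%"],
    "Type": [0, 0, 0],
})

# Use the ID column as the index and drop the two uninformative columns.
movies = movies.set_index("ID")
movies = movies.drop(columns=["Type", "Unnamed: 0"])

# Strip the trailing % sign and convert to float; missing values stay NaN.
movies["Rotten Tomatoes"] = pd.to_numeric(
    movies["Rotten Tomatoes"].str.rstrip("%")
).astype(float)

print(movies)
```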

Similarly, every value in the “Age” column carries a + sign, e.g. 7+ means that a movie is suitable for people aged 7 and above. We remove this + sign and convert the column to float.

An additional step that we performed on the “Age” column was to replace the value “all” with “3”. The value “all” implies that a movie is suitable for all age groups, but we change it to 3 to keep it consistent with the numerical values in the rest of the column. We chose 3 on the assumption that this is a common age at which children start watching and interpreting digital content.

Replacing ‘all’ values in the Age column with ‘3’
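A short sketch of the “Age” clean-up on a toy column (the sample values are assumptions for illustration):

```python
import pandas as pd

age = pd.Series(["7+", "all", "18+", None, "13+"], name="Age")

# Map "all" to "3", strip the trailing + sign, then convert to float.
age = age.replace("all", "3").str.rstrip("+")
age = pd.to_numeric(age).astype(float)

print(age.tolist())
```

The missing entry is left as NaN at this point; it is dealt with in the null-handling step.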

We need to look at the null values present in each column of the data. It is good practice to identify and act upon the null values in a dataset so that our analysis is more efficient and the results less biased. The null values could be a result of corrupt data or invalid information in the dataset. The snippet below shows the number of null values in each column.

Null values before cleaning the dataset

The next step is to get rid of the null values. One way to do this would be to eliminate all the rows that contain null values. Doing so, however, would significantly decrease the size of our dataset, and predictions drawn from such a small sample might not be accurate when generalised to the entire population.

Therefore, we employ different measures to fill in the null values in our columns of interest. Firstly, we replace the null values in our “Age” column with “18”. We chose 18 because we do not know which age group these movies are suitable for, so 18 is the safe choice that ensures younger age groups are not exposed to inappropriate content.

The next step deals with the “IMDb ratings” and “Rotten Tomatoes” columns. We have already removed the % sign from all the values in the “Rotten Tomatoes” column. Next we divide all the values in this column by 10 so that both columns are in the same range, 0–10, and are easier to compare. Both columns have the type float. Now the null values in the “Rotten Tomatoes” column are filled with the corresponding values in the “IMDb ratings” column. Next, the null values in the “IMDb ratings” column are filled with the corresponding values in the “Rotten Tomatoes” column. Finally, the rows in which both ratings are still null are removed from the dataset.
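The null handling described above could look roughly like this in pandas. It is a sketch on toy data; the column name "IMDb" is an assumption about how the IMDb ratings column is labelled.

```python
import numpy as np
import pandas as pd

movies = pd.DataFrame({
    "Age": [7.0, np.nan, 18.0, 3.0],
    "IMDb": [8.1, 6.4, np.nan, np.nan],
    "Rotten Tomatoes": [87.0, np.nan, 64.0, np.nan],
})

# Number of null values per column before cleaning.
print(movies.isnull().sum())

# Unknown age ratings are treated as 18+ (the cautious choice).
movies["Age"] = movies["Age"].fillna(18)

# Put Rotten Tomatoes on the same 0-10 scale as IMDb.
movies["Rotten Tomatoes"] = movies["Rotten Tomatoes"] / 10

# Fill each rating from the other one, then drop rows where both were missing.
movies["Rotten Tomatoes"] = movies["Rotten Tomatoes"].fillna(movies["IMDb"])
movies["IMDb"] = movies["IMDb"].fillna(movies["Rotten Tomatoes"])
movies = movies.dropna(subset=["IMDb", "Rotten Tomatoes"])

print(movies)
```

One consequence worth keeping in mind is that a filled-in rating is a copy of the other rating, so filled rows agree perfectly between the two columns.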

Cleaned movies dataset

There are 4 other columns in the movies data that contain null values: Directors, Genres, Country and Language. These columns do not contain numerical values, so there is no sensible value to substitute for the nulls, and we decide to drop the rows that contain them.

Null values after cleaning the dataset

Which country is producing the most multilingual content?

Is there a correlation between IMDb ratings and Rotten Tomatoes?

IMDb ratings show what the audience thinks of a movie. However, if one only trusts the opinions of top critics then Rotten Tomatoes is a better choice. Since both platforms are known for their ratings it would be interesting to see if there is any relationship between them.

We start by checking whether there is any correlation between the ratings of these two platforms. For this we can use a correlation matrix.

As we can see from the matrix, the correlation between IMDb ratings and Rotten Tomatoes is 0.62, which implies a positive relationship. Since 0.62 is greater than 0.5, we can say that the ratings of the two platforms are moderately correlated.
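A correlation matrix like this can be produced with the pandas `corr` method, which computes the Pearson correlation by default. The ratings below are made-up values for illustration, and "IMDb" is an assumed column name.

```python
import pandas as pd

ratings = pd.DataFrame({
    "IMDb": [6.1, 7.4, 5.0, 8.2, 6.8],
    "Rotten Tomatoes": [5.5, 8.0, 4.1, 9.0, 6.0],
})

# Pairwise Pearson correlation between the two rating columns.
corr = ratings.corr()
r = corr.loc["IMDb", "Rotten Tomatoes"]

print(corr)
print("Correlation between the two ratings:", round(r, 2))
```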

For better understanding of the relationship between the two ratings, we can observe a scatterplot visualisation.

The scatterplot shows a positive correlation between IMDb ratings and Rotten Tomatoes, as the points follow an upward trend.

So, in most cases the two rating platforms will point in the same direction, since their ratings are positively correlated. With a correlation of 0.62, however, they can still disagree on individual movies.

Does having more than one director have any effect on the Rotten Tomatoes/IMDb ratings?

Let’s visualise the average IMDb ratings of a movie and the number of directors it has side by side.

From the above plot we can see that the average IMDb rating of movies with multiple directors is slightly higher than the average IMDb rating of movies that only have one director.

Now, let’s see if this holds true for the Rotten Tomatoes.

Again we can see that the average Rotten Tomatoes rating of movies with multiple directors is higher than the average Rotten Tomatoes rating of movies with just one director.

Based on the bar plots above, movies with multiple directors have higher average IMDb and Rotten Tomatoes ratings than movies with a single director. One possible explanation is that having more than one director provides more supervision, which reduces mistakes during production and so leads to higher ratings. The plots only show an association, though, so they cannot confirm this explanation.
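The comparison behind these bar plots can be sketched as a group-by. The snippet assumes that the “Directors” column lists names separated by commas, which is an assumption about the data format, and uses made-up ratings.

```python
import pandas as pd

movies = pd.DataFrame({
    "Directors": ["A", "B,C", "D", "E,F,G"],
    "IMDb": [6.0, 7.0, 6.5, 7.5],
})

# Count the directors per movie and label each movie as single or multiple.
n_directors = movies["Directors"].str.split(",").str.len()
movies["Director group"] = n_directors.gt(1).map({True: "multiple", False: "single"})

# Average rating per group; the same call works for Rotten Tomatoes.
avg_rating = movies.groupby("Director group")["IMDb"].mean()
print(avg_rating)
```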

Which platform has the highest number and diverse range of movies available on it?

Prime Video has the highest number of movies available on it

To find which platform had the highest number of movies, we added the number of movies on each platform and displayed the one which had the highest count. The code above shows that Prime Video had the highest number of movies available on it with a count of 11348.

The next step is to figure out which platform has the most diverse range of movies. Since Prime Video has the highest number of movies, simply counting the different genres available on each platform would have been biased in its favour.

To ensure that the difference in the number of movies does not affect the diversity measure, we decided to take means. We call these means the diversity count of a platform. This is calculated by dividing the total number of movies by total number of genres for each platform to get a better idea of which platform has the most diverse range of content. We observe that although Prime Video has the highest number of movies available on it, it has the smallest diversity count, 2.34. The platform with the most diverse range of movies is Disney+, with a diversity count of 3.7. The bar graph below shows the diversity count of the 4 platforms.

Diversity count of each platform
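The exact code behind the diversity count is not shown here, so the following is only one possible reading, not a reproduction of it. It assumes a comma-separated “Genres” column and averages the number of genre tags per movie on each platform, which gives values on the same scale as the 2.34 and 3.7 reported above. The toy rows are made up.

```python
import pandas as pd

movies = pd.DataFrame({
    "Genres": ["Action,Comedy", "Drama", "Animation,Family,Fantasy", "Comedy"],
    "Netflix": [1, 1, 0, 0],
    "Disney+": [0, 0, 1, 1],
})

# Number of genre tags attached to each movie (assumed comma-separated).
movies["Genre_Count"] = movies["Genres"].str.split(",").str.len()

# Mean genre count over the movies available on each platform.
diversity = {
    platform: movies.loc[movies[platform] == 1, "Genre_Count"].mean()
    for platform in ["Netflix", "Disney+"]
}
print(diversity)
```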

Is the US streaming more movies on these platforms as compared to the other countries?

We have a total of 162 different countries in our data, and on every platform the United States streams the most movies. For a better comparison of the share of content streamed, we look at the top 5 streaming countries for each platform by applying the sort_values function.

For Netflix, the top 5 streaming countries are France, Canada, United Kingdom, India and United States. The US streams the most movies on Netflix, with a count of 1622.

For Disney+, the top 5 streaming countries are France, Australia, Canada, United Kingdom and United States. Again, the US streams the most movies on Disney+, with a count of 533.

For Hulu, the top 5 streaming countries are Germany, France, Canada, United Kingdom and United States. The US streams the most movies on Hulu, with a count of 639.

For Prime Video, the top 5 streaming countries are France, India, Canada, United Kingdom and United States. The US again streams the most movies, with a count of 7508.

Each platform’s pie chart shows how many more movies the United States streams than the other countries in the top 5. The United States is a leading movie-producing country and has the most movies on each platform.

It can be argued that, because we have limited our analysis to 5 countries out of the 162 present in the data, the percentages in the pie charts do not reflect the full dataset. However, plotting all 162 countries would only have cluttered the visualisations. The pie charts compare the top 5 countries on each of the 4 platforms by their share of movies, and show that the USA has the highest share on all four platforms.
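The per-platform country counts discussed in this section could be computed roughly as follows. This is a sketch on toy rows; splitting multi-country entries on commas is an assumption about how the “Country” column is formatted.

```python
import pandas as pd

movies = pd.DataFrame({
    "Country": [
        "United States", "United States,Canada", "India",
        "France", "United States", "United Kingdom,France",
    ],
    "Netflix": [1, 1, 1, 0, 1, 1],
})

# Keep only the movies on the platform, then give each country its own row.
on_netflix = movies[movies["Netflix"] == 1]
countries = on_netflix["Country"].str.split(",").explode()

# Count movies per country and keep the five largest counts.
counts = countries.value_counts().sort_values(ascending=False)
top5 = counts.head(5)
print(top5)
```

The same steps can be repeated with the Hulu, Prime Video and Disney+ columns to get the other three rankings.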

Which platform appeals to what age group the most?

We use separate data for each of the four platforms to find which platform appeals to which age group the most. For this purpose we generated a table for each platform. In the tables given below, 1 refers to the number of movies that are available for the specific age group on the given platform and 0 refers to the number of movies that are not available for the specific age group on the given platform.

Four tables for Hulu, Netflix, Prime Video and Disney+ are given below.

The table above shows the movies available (1) and not available (0) for the 5 age groups on Hulu. Focusing only on the available movies, the highest number (569) is available to ages 18+.

The table above shows the same counts for Netflix, where the highest number of available movies (2351) is for ages 18+.

The table above shows the same counts for Prime Video, where the highest number of available movies (9106) is for ages 18+.

The table above shows the same counts for Disney+, where the highest number of available movies (273) is for ages 3+.

Focusing only on the movies available to the 5 age groups, the tables show that Hulu, Netflix and Prime Video target the 18+ audience the most, while Disney+ targets the 3+ audience the most. This makes sense: Disney+ is known as a kids’ platform, so it would target ages 3+ the most, whereas Netflix, Hulu and Prime Video are mostly known for content aimed at the 18+ audience. Note that missing ages were filled with 18 during cleaning, which adds to the 18+ counts.

We will be using bar plots for the visualisations of the age groups across all 4 platforms.

The bar graphs for Hulu, Netflix and Prime Video have their tallest bars at (18.0, 1), indicating that these platforms stream the most content for ages 18+. The bar graph for Disney+, on the other hand, has its tallest bar at (3.0, 1), indicating that most of its content is targeted at ages 3+.
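Tables of this shape, with one row per age group and one column each for 0 and 1, can be built with `pd.crosstab`. The sketch below uses made-up rows for a single platform.

```python
import pandas as pd

movies = pd.DataFrame({
    "Age": [18.0, 18.0, 3.0, 7.0, 18.0, 3.0],
    "Hulu": [1, 1, 0, 0, 1, 1],
})

# Rows are age groups, columns are 0 (not available) and 1 (available).
table = pd.crosstab(movies["Age"], movies["Hulu"])
print(table)

# Age group with the most movies available on the platform.
top_age = table[1].idxmax()
print("Most movies available for age:", top_age)
```

Calling `table.plot(kind="bar")` on such a table is one way to get a bar plot per (age, availability) pair, assuming matplotlib is installed.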

Feature Engineering

Feature engineering is the art of extracting useful features from the raw dataset. Machine learning models take these features as inputs and use them to make predictions, so the accuracy of a model depends on the features selected. For the movies dataset, we had already got rid of all the null values. We now only needed to convert the non-numeric data to numeric data, as the machine learning models we use require numeric inputs. For this purpose, we used ordinal and one-hot encoding.

We one-hot encoded the genre column. The issue we faced was that most movies had more than one genre, so we first created a separate column for each genre and then one-hot encoded each of these genre columns. Age, IMDb ratings, Rotten Tomatoes, Netflix, Hulu, Prime Video and Disney+ were already in numeric form. For the year column we used ordinal encoding, grouping the years by decade, e.g. 1900–1910 is 1, 1910–1920 is 2 and so on. After that we dropped the columns we no longer needed, i.e. “Title”, “Directors”, “Genres”, “Country”, “Language” and “Genre_Count”, and our data was ready to be used in machine learning algorithms.
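A compact way to sketch both encodings in pandas is shown below. It is an illustration rather than our exact code: the comma separator in “Genres” is an assumption, and so is the choice that a boundary year such as 1910 falls into the later decade.

```python
import pandas as pd

movies = pd.DataFrame({
    "Genres": ["Action,Comedy", "Drama", "Comedy,Drama"],
    "Year": [1905, 1999, 2015],
})

# One 0/1 column per genre, even when a movie lists several genres.
genre_dummies = movies["Genres"].str.get_dummies(sep=",")

# Ordinal decade code: 1900-1909 -> 1, 1910-1919 -> 2, and so on.
movies["Decade"] = (movies["Year"] - 1900) // 10 + 1

encoded = pd.concat([movies[["Decade"]], genre_dummies], axis=1)
print(encoded)
```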

Final/Encoded Dataset

Linear Regression Modelling on IMDb ratings

It is believed that an effective way to determine if a movie is worth watching is to use its IMDb ratings. Luckily we have an IMDb ratings column in our dataset. We will apply a Multiple Linear Regression Model on the IMDb ratings to study the relationship between the dependent variable Y (IMDb ratings) and the independent variables x.

Our linear regression model will be trained to predict the IMDb ratings for a movie given the test data and to calculate the accuracy of its prediction.

The first step was to create a correlation matrix to observe the correlation between the different features. The darker the shade of blue, the closer the number is to 1 and hence the stronger the correlation between the two variables.

Correlation Matrix

Next we divide the data into features (X) and label (y), where the features are the independent variables and the label is the dependent variable (IMDb ratings) whose value we predict from the model. The next step was to scale the data. We standardise the X features and the y label, i.e. rescale each one so that the mean of the observed values is 0 and the standard deviation is 1. Note that standardised values are not confined to a 0–1 range; that would be min-max scaling. Then, we split the data such that 80% of it is in the training set while 20% of it is in the test set.

Using Scikit-learn to make a regression model, we then fit the model on the training set and test the model on the test set.
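The workflow can be sketched as follows on synthetic data standing in for the encoded movie features. One deliberate difference from the description above: the scalers here are fitted on the training split only, a common precaution against leaking test-set statistics, rather than on the full data before splitting. All numbers in the snippet are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features (X) and the IMDb ratings (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 6.0 + X @ np.array([0.8, -0.3, 0.1]) + rng.normal(scale=0.5, size=500)

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardise to mean 0 and standard deviation 1 (training statistics only).
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
X_train_s, X_test_s = x_scaler.transform(X_train), x_scaler.transform(X_test)
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
y_test_s = y_scaler.transform(y_test.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X_train_s, y_train_s)
y_pred = model.predict(X_test_s)

print("R^2:", r2_score(y_test_s, y_pred))
print("MSE:", mean_squared_error(y_test_s, y_pred))
print("MAE:", mean_absolute_error(y_test_s, y_pred))
# For regressors, model.score returns R^2, not a classification accuracy.
print("score:", model.score(X_test_s, y_test_s))
```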

Results of Linear Regression Model

The R-squared value, also called the coefficient of determination, tells us how well the regression model fits the observed data. Our value of 0.679 implies that the model explains 67.9% of the variance in the IMDb ratings. The greater the R-squared value, the better the fit of the model.

Our model has an MSE (mean squared error) of 0.323. MSE is the average of the squared differences between the real data and the predicted data. Squaring penalises large errors heavily, and because MSE is a smooth function it is easier to minimise with numerical methods. A disadvantage of MSE is that it squares the units of the data as well, so its value cannot be read directly in the units of the ratings.

The MAE (mean absolute error) of our model is 0.440. MAE is the average of the absolute differences between the real data and the predicted data. MAE does not punish large errors as heavily as MSE does, and it is not a smooth function: it has a ‘kink’ where the error is 0, which makes error minimisation harder. On the other hand, MSE is very sensitive to outliers because the errors are squared, whereas MAE weights each error in proportion to its size, so outliers in the data do not impact MAE as strongly (it is more robust to outliers).

Both the MSE and MAE have a low value, and the lower these errors, the better the fit of the regression model. Hence the low values of MAE and MSE point to a reasonably accurate regression model. Since the label was scaled, both errors are measured in scaled units rather than in IMDb rating points.
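To make the difference between the two metrics concrete, here is a small worked example in plain Python. The ratings are made up; the point is how a single large error moves each metric.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [6.0, 7.0, 8.0, 5.0]

# Every prediction is off by exactly 0.5.
y_pred = [6.5, 6.5, 7.5, 5.5]
print(mse(y_true, y_pred), mae(y_true, y_pred))

# Same predictions, except the last one is now off by 4.0.
y_pred_outlier = [6.5, 6.5, 7.5, 9.0]
print(mse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```

With the single large error, MSE grows from 0.25 to 4.1875 while MAE grows from 0.5 to 1.375, which is the outlier sensitivity described above.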

Another evaluation metric we are using is a histogram representing the test residual data. Residual is defined as the difference between the observed value and the predicted value, hence a residual of 0 means that the predicted value is correct.

It can be seen from our plot that the residuals are distributed roughly symmetrically with the centre around 0. This is consistent with the underlying assumption that the errors are normally distributed. The assumption of homoscedasticity (i.e. the variance of the error terms is the same across all values of the independent variables x) is better checked with a plot of residuals against predicted values. Overall, the residuals suggest that the predicted values are a reasonable fit.

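The score for our model comes out to be 67.92%. For a regression model this score is the R-squared value reported above, not a classification accuracy: it means that our model explains 67.92% of the variance in the IMDb ratings, not that it predicts the correct rating 67.92% of the time.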

KNN Classifier

K-Nearest Neighbours (KNN) is one of the simplest algorithms used in Machine Learning for both regression and classification problems. KNN uses the training data and classifies new data points (test data) based on similarity measures (e.g. distance function).

We implement a KNN classifier for each of the four platforms and test its accuracy. KNN simply calculates the Euclidean distance between a new data point and all the training data points. It then selects the K nearest data points, where K can be any positive integer. Finally, it assigns the new data point to the class to which the majority of those K training data points belong.
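The three steps just described (measure distances, pick the K nearest points, take a majority vote) fit in a few lines of plain Python. This is a teaching sketch with made-up points, not the Scikit-learn implementation; the function name `knn_predict` is our own.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = sorted(
        (math.dist(x, point), label) for x, label in zip(train_X, train_y)
    )
    # Labels of the k closest training points.
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among them wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Two small clusters: label 1 = on the platform, label 0 = not on it.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = [1, 1, 1, 0, 0]

near_first_cluster = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
near_second_cluster = knn_predict(train_X, train_y, (5.1, 5.0), k=3)
print(near_first_cluster, near_second_cluster)
```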

KNN Classifier on Netflix

Applying the KNN classifier with Netflix as the response variable and the value of k = 12, the accuracy comes out to be 78.34%. This accuracy implies that given the features of a movie, KNN will correctly predict if a movie is available on Netflix with a probability of 78.34%.

This accuracy is fairly high, but since it is less than 100% we try to find a value of k for which the accuracy increases further. For this, we calculate the error for k values in the range 1 to 40. Observing the plot below, we see that the error decreases as the k value increases.

By observing the plot above we can see that there is no value of k for which the error is 0. However, we obtain the minimum error at k = 38, so within the range we tested this is the value of k that gives the highest accuracy of KNN for Netflix.

The Confusion Matrix for Netflix comes out to be:

KNN Metrics of Netflix
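The search over k and the confusion matrix can be sketched with Scikit-learn as below. The data is synthetic (generated with `make_classification`) rather than our movies dataset, so the best k it finds has no bearing on the values reported here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for "is the movie on the platform?".
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Test error (share of wrong predictions) for k = 1 .. 40.
k_values = range(1, 41)
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(np.mean(knn.predict(X_test) != y_test))

best_k = k_values[int(np.argmin(errors))]
print("k with the lowest test error:", best_k)

# Confusion matrix for the selected k (rows: true class, columns: predicted).
best_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
cm = confusion_matrix(y_test, best_model.predict(X_test))
print(cm)
```

Because k is picked using the test set, the resulting accuracy is somewhat optimistic; choosing k with cross-validation on the training data would give a cleaner estimate.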

Similar to the approach we have used on Netflix, we now apply the KNN on the other 3 platforms. The results are presented below.

KNN Classifier on Hulu

Applying the KNN classifier with Hulu as the response variable and the value of k = 12, the accuracy comes out to be 95.27%. This accuracy implies that given the features of a movie, KNN will correctly predict if a movie is available on Hulu with a probability of 95.27%.

This accuracy is higher than that of Netflix, but since it is less than 100% we try to find a value of k for which the accuracy increases further. For this, we calculate the error for k values in the range 1 to 40. Observing the plot below, we see that the error decreases as the k value increases until k = 10. From k = 10 onwards, the error stays constant, as shown by the horizontal line in the plot.

By observing the plot above we can see that there is no value of k for which the error is 0. However, the minimum error is first reached at k = 10, so this is the smallest value of k that gives the highest accuracy of KNN for Hulu.

The Confusion Matrix for Hulu comes out to be:

KNN Metrics of Hulu

KNN Classifier on Prime Video

Applying the KNN classifier with Prime Video as the response variable and the value of k = 12, the accuracy comes out to be 75.35%. This accuracy implies that given the features of a movie, KNN will correctly predict if a movie is available on Prime Video with a probability of 75.35%.

This accuracy is less than 100%, so we try to find a value of k for which the accuracy increases further. For this, we calculate the error for k values in the range 1 to 40. Observing the plot below, we see that the error decreases overall as the k value increases, although it follows a zigzag pattern.

By observing the plot above we can see that there is no value of k for which the error is 0. However, we obtain the minimum error at k = 39, so within the range we tested this is the value of k that gives the highest accuracy of KNN for Prime Video.

The Confusion Matrix for Prime Video comes out to be:

KNN Metrics of Prime Video

KNN Classifier on Disney+

Applying the KNN classifier with Disney+ as the response variable and the value of k = 12, the accuracy comes out to be 97.67%. This accuracy implies that given the features of a movie, KNN will correctly predict if a movie is available on Disney+ with a probability of 97.67%.

This accuracy is less than 100%, so we try to find a value of k for which the accuracy increases further. For this, we calculate the error for k values in the range 1 to 40. Observing the plot below, we see that the error decreases overall as the k value increases, although it follows a zigzag pattern.

By observing the plot above we can see that there is no value of k for which the error is 0. However, we obtain the minimum error at k = 36, so within the range we tested this is the value of k that gives the highest accuracy of KNN for Disney+.

The Confusion Matrix for Disney+ comes out to be:

KNN Metrics of Disney+

Random Forest Classifier

Machine learning provides a plethora of classification algorithms, and among them is the Random Forest classifier. Random Forest is an easy-to-use algorithm that often provides high accuracy with little hyperparameter tuning compared to other machine learning algorithms. It grows multiple decision trees, each trained on a random sample of the training data and considering a random subset of the features at each split, and then combines the predictions of the individual trees by majority vote. Because of this randomness, Random Forest can give a slightly different result every time it runs unless the random seed is fixed.

We have used the Scikit-learn library to implement the Random Forest algorithm. For each platform, we report the accuracy with which it predicts the availability of a movie on that platform. We have also plotted the feature importance graphs for each platform. Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.
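A minimal sketch of this step with Scikit-learn is shown below. The features and the target are synthetic: the target is constructed to depend mostly on “Age”, purely to show how the importance scores are read, and does not reproduce any of the results that follow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 800
features = pd.DataFrame({
    "Runtime": rng.normal(100, 20, n),
    "Age": rng.choice([3.0, 7.0, 13.0, 16.0, 18.0], n),
    "IMDb": rng.uniform(1, 10, n),
})
# Hypothetical target: mostly movies rated 7+ or lower are "on the platform".
on_platform = ((features["Age"] <= 7) & (rng.random(n) < 0.9)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features, on_platform, test_size=0.2, random_state=0
)

# random_state fixes the seed so repeated runs give the same forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)

print("Test accuracy:", accuracy)
print(importances)
```

The importance scores sum to 1, so each one can be read as a feature's share of the total impurity reduction across the trees.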

The results of the 4 platforms are given below.

Random Forest Classifier on Netflix

The plot below shows that Runtime of a movie has the highest score for predicting its availability on Netflix and is hence the most important feature of Netflix movies.

Feature importance of Netflix

Applying the Random Forest classifier with Netflix as the response variable, the accuracy comes out to be 78.90%. This accuracy implies that given the features of a movie, Random Forest will correctly predict if a movie is available on Netflix with a probability of 78.90%.

Random Forest Classifier on Hulu

The plot below shows that Rotten Tomatoes of a movie has the highest score for predicting its availability on Hulu and is hence the most important feature of Hulu movies.

Feature importance of Hulu

Applying the Random Forest classifier with Hulu as the response variable, the accuracy comes out to be 94.13%. This accuracy implies that given the features of a movie, Random Forest will correctly predict if a movie is available on Hulu with a probability of 94.13%.

Random Forest Classifier on Prime Video

The plot below shows that Runtime of a movie has the highest score for predicting its availability on Prime Video and is hence the most important feature of Prime Video movies.

Feature importance of Prime Video

Applying the Random Forest classifier with Prime Video as the response variable, the accuracy comes out to be 75.97%. This accuracy implies that given the features of a movie, Random Forest will correctly predict if a movie is available on Prime Video with a probability of 75.97%.

Random Forest Classifier on Disney+

The plot below shows that Age of a movie has the highest score for predicting its availability on Disney+ and is hence the most important feature of Disney+ movies.

Feature importance for Disney+

Applying the Random Forest classifier with Disney+ as the response variable, the accuracy comes out to be 97.70%. This accuracy implies that given the features of a movie, Random Forest will correctly predict if a movie is available on Disney+ with a probability of 97.70%.

Conclusion

Our blog begins with data cleaning and exploratory data analysis. The cleaned data allowed us to answer the subquestions. We discovered that the US is producing the most multilingual content on the streaming platforms. Using the correlation matrix and scatterplot, we found that the Rotten Tomatoes and IMDb ratings are moderately correlated. We also found that movies with multiple directors had higher average IMDb and Rotten Tomatoes ratings. We uncovered that Prime Video has the highest number of movies available on it but Disney+ has a more diverse range of movies. Another important finding is that the US streams more content on these platforms than any other country. We also analysed which platform is suitable for which age group and found that while Hulu, Netflix and Prime Video target the 18+ audience, Disney+ targets the 3+ audience.

The main question that we tried to answer is how accurately we can predict whether a particular movie will be streamed on a specific platform. For this we applied 3 different Machine Learning models to our dataset. Firstly, we applied a Multiple Linear Regression model with the IMDb ratings of the movies as the response variable. Its score of 67.92% is an R-squared value, which means that the model explains 67.92% of the variance in the IMDb ratings.

Next we applied both the KNN and Random Forest classifiers to the 4 platforms. The reason we used 2 different classifiers on the same platforms was to compare the accuracies of the 2 models and decide which model serves as a better predictor for our dataset. As summarised in the table below, both models give almost the same accuracies. The accuracy of both KNN and Random Forest is highest for Disney+, followed by Hulu, then Netflix and then Prime Video. We can therefore conclude that either of these models can be used to predict the availability of a movie on Netflix, Hulu, Prime Video and Disney+.

We end this blog with a quote that might inspire the readers to invest their time in exploring the fascinating world of Machine Learning.

“A breakthrough in Machine Learning would be worth ten Microsofts.” -Bill Gates
