# Movies on Netflix, Prime Video, Hulu and Disney+

We live in a world where technology is used more than ever before, and its use is likely to keep growing. With the outbreak of Covid-19, the world came to a halt. More than a year on, many parts of the world are still in lockdown. As a result, the use of movie and TV show streaming platforms as a form of entertainment has increased, since places like cinemas have been shut down. Because these platforms are available as applications on devices, users can access them whenever they want and watch their favourite movies on the go. The platforms also have features that let users filter by their movie preferences, which aids the search for movies. It is fascinating how these platforms are able to make suggestions based on a user’s previous streaming history. The purpose of this blog is to try to understand the workings of these popular streaming platforms and see if we can make predictions about the content that is being streamed on them.

## Introduction

Throughout this project, we will look into different features that affect the popularity of a movie. Our primary question for the project is: **if we have information about a set of features related to a movie, can we accurately predict which platform it will be streamed on?**

Using machine learning models, we will try to predict which platform is more likely to stream a movie that will be released in the upcoming years. The four platforms available in our dataset are Netflix, Hulu, Prime Video and Disney+. We start off by cleaning the movies dataset and performing Exploratory Data Analysis (EDA) on it. To support our main question, we will try to answer the following sub-questions.

• Which platform has the highest number and the most diverse range of movies available on it?

• Which platform appeals to what age group the most?

• Which country is producing the most multilingual content?

• Is there a correlation between IMDb ratings and Rotten Tomatoes?

• Does having more than one director have any effect on the Rotten Tomatoes/IMDb ratings?

• Is the USA streaming more movies on these platforms than other countries?

Next, we will try to build different classifiers using the movie features available in the dataset. We will discuss this more later.

*Data Cleaning and EDA*


The dataset provides us with a record of movies from 1902 to 2020 and categorises them by the platforms they are available on. It also contains a comprehensive list of movie features such as titles, genres, IMDb ratings and language that will be used to make predictions and draw conclusions in the subsequent sections of this blog. The following link directs to the Kaggle page that contains the movies dataset:

A visualisation of the dataset is available on the following link:

We will now explain the steps taken to clean the dataset. The first step we took is to set the “ID” column as the index of our data. Our dataset consists of 16 columns where each column describes a movie feature. A column labelled “Type” contains a value of “0” in each row. We observe that it does not represent anything valuable regarding our dataset and hence we drop this column. There is a column named “Unnamed: 0” in the dataset which does not signify anything and is hence dropped as well.

The data now consists of 14 columns. Next, we observe that the % sign attached to every value in the “Rotten Tomatoes” column makes the column difficult to work with, so we remove these % signs and convert the column type to float.

Similarly, every value in the “Age” column carries a + sign, e.g. 7+ means that a movie is suitable for people aged 7 and above. We remove this + sign and convert this column to float.

An additional step that we performed on the “Age” column is to replace the value “all” with “3”. The value “all” implies that a movie is suitable for all age groups, but we change it to 3 to make it consistent with the numerical values in the rest of the column. We chose 3 on the assumption that this is a common age at which children start watching and interpreting digital content.
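The cleaning steps so far can be sketched in pandas as follows. This is a minimal illustration on three made-up rows rather than the actual Kaggle file, and the exact column names are assumptions based on the names quoted above.

```python
import pandas as pd

# Made-up rows standing in for the Kaggle file; column names are assumed
# from the ones quoted in the text.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "ID": [1, 2, 3],
    "Title": ["A", "B", "C"],
    "Age": ["7+", "all", None],
    "Rotten Tomatoes": ["87%", None, "64%"],
    "Type": [0, 0, 0],
})

# Use "ID" as the index and drop the two uninformative columns.
movies = raw.set_index("ID").drop(columns=["Type", "Unnamed: 0"])

# "87%" -> 87.0; missing values stay NaN.
movies["Rotten Tomatoes"] = pd.to_numeric(
    movies["Rotten Tomatoes"].str.rstrip("%")
).astype(float)

# "all" -> "3", then "7+" -> 7.0.
movies["Age"] = pd.to_numeric(
    movies["Age"].replace("all", "3").str.rstrip("+")
).astype(float)

print(movies)
```

Missing values are deliberately left as NaN at this stage, since they are handled in a later step.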

We need to look at the null values present in each column of the data. It is good practice to identify and act upon null values so that our analysis is more reliable and the results less biased. Null values could be the result of corrupt data or missing information in the dataset. The snippet below shows the number of null values in each column.
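As an illustration of how such a count can be obtained, here is a small sketch on a made-up frame. The column names, including “IMDb” for the IMDb ratings, are assumptions, and the counts are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing entries (not the real dataset).
toy = pd.DataFrame({
    "Age": [7.0, np.nan, 18.0],
    "IMDb": [6.5, 7.1, np.nan],
    "Language": ["English", None, None],
})

# isnull() marks missing cells; sum() counts them column by column.
null_counts = toy.isnull().sum()
print(null_counts)
```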

The next step is to get rid of the null values. One way to do this would be to eliminate all the rows that contain null values. Doing so would, however, significantly decrease the size of our dataset, and predictions drawn from the reduced dataset might not generalise well to the full set of movies.

Therefore, we employ different measures to fill in the null values in our columns of interest. Firstly, we replace the null values in the “Age” column with 18. Since we do not know which age group such a movie is suitable for, we choose 18 as the safe option so that younger age groups are not exposed to inappropriate content.

The next step deals with the “IMDb ratings” and “Rotten Tomatoes” columns. We have already removed the % sign from all the values in the “Rotten Tomatoes” column. Next, we divide all the values in this column by 10 so that “IMDb ratings” and “Rotten Tomatoes” are on the same 0–10 scale and easier to compare. Both columns have the type float. The null values in the “Rotten Tomatoes” column are then filled with the corresponding values in the “IMDb ratings” column, and the null values in the “IMDb ratings” column are filled with the corresponding values in the “Rotten Tomatoes” column. Finally, rows in which both ratings are still null are removed from the dataset.
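The filling strategy described above can be sketched as follows. The rows are made up and the column name “IMDb” is an assumption; the point is only to show the order of the operations.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each case: both ratings present, one missing, both missing.
toy = pd.DataFrame({
    "Age": [7.0, np.nan, 18.0, 13.0],
    "IMDb": [8.0, np.nan, 6.4, np.nan],
    "Rotten Tomatoes": [90.0, 70.0, np.nan, np.nan],
})

# Unknown age rating -> 18, the cautious choice described in the text.
toy["Age"] = toy["Age"].fillna(18)

# Bring Rotten Tomatoes onto the same 0-10 scale as IMDb.
toy["Rotten Tomatoes"] = toy["Rotten Tomatoes"] / 10

# Fill each rating from the other one where it is missing.
toy["Rotten Tomatoes"] = toy["Rotten Tomatoes"].fillna(toy["IMDb"])
toy["IMDb"] = toy["IMDb"].fillna(toy["Rotten Tomatoes"])

# Rows where both ratings were missing cannot be filled and are dropped.
toy = toy.dropna(subset=["IMDb", "Rotten Tomatoes"])
print(toy)
```

In this toy example only the last row is dropped, because it is the only one with neither rating available.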

There are 4 other columns in the movies data that contain null values: Directors, Genres, Country and Language. These columns do not contain numerical values, so there is no sensible value to substitute for the nulls, and we drop the rows that contain them.

## Which country is producing the most multilingual content?

We consider a movie to be multi-lingual if it has more than one language in the “Language” column. To find out which country has produced the most multi-lingual movies, we first separated the multi-lingual and non-multi-lingual movies using the “Language” column. We had only 2672 multi-lingual movies in our dataset. Next, we split the countries of each movie into separate rows so that counting the multi-lingual movies produced by each country would be easier. In the final step, we counted the movies for each country and chose the country with the highest number of multi-lingual movies. The bar graph below shows that the United States has the highest number of multi-lingual movies.
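A minimal sketch of these steps is shown below. It assumes that the “Language” and “Country” columns hold comma-separated values, and it uses a handful of made-up movies rather than the real data.

```python
import pandas as pd

# Made-up movies; comma-separated languages and countries are an assumption.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Language": ["English,Spanish", "English", "Hindi,English", "French,German"],
    "Country": ["United States,Mexico", "United States", "India", "France,United States"],
})

# A movie is multi-lingual if its Language entry lists more than one language.
is_multilingual = toy["Language"].str.contains(",")
multilingual = toy[is_multilingual]

# One row per (movie, country), then count multi-lingual movies per country.
per_country = (
    multilingual.assign(Country=multilingual["Country"].str.split(","))
    .explode("Country")["Country"]
    .value_counts()
)
top_country = per_country.idxmax()
print(per_country)
```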

**Is there a correlation between IMDb ratings and Rotten Tomatoes?**

IMDb ratings show what the audience thinks of a movie. However, if one only trusts the opinions of top critics then Rotten Tomatoes is a better choice. Since both platforms are known for their ratings it would be interesting to see if there is any relationship between them.

We start by checking whether there is any correlation between the ratings of these two platforms. For this we can use a correlation matrix.

As we can see from the matrix, the value of correlation between IMDb ratings and Rotten Tomatoes is 0.62 which implies a positive relationship. Also, the value 0.62 is greater than 0.5 so we can say that the ratings of both the platforms are moderately correlated.
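For readers who want to reproduce this kind of matrix, the sketch below computes a Pearson correlation matrix with pandas. The five rating pairs are made up, so the resulting coefficient is not the 0.62 reported above, and the column name “IMDb” is an assumption.

```python
import pandas as pd

# Made-up rating pairs, both on a 0-10 scale.
toy = pd.DataFrame({
    "IMDb": [5.0, 6.0, 7.0, 8.0, 6.5],
    "Rotten Tomatoes": [4.0, 7.0, 6.5, 9.0, 5.0],
})

# DataFrame.corr() uses the Pearson correlation coefficient by default.
corr_matrix = toy.corr()
r = corr_matrix.loc["IMDb", "Rotten Tomatoes"]
print(corr_matrix)
```

The diagonal of the matrix is always 1, since every column is perfectly correlated with itself; the off-diagonal entry is the value of interest.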

For better understanding of the relationship between the two ratings, we can observe a scatterplot visualisation.

The scatterplot shows a positive correlation between IMDb ratings and Rotten Tomatoes, as the points follow an upward trend.

So, in many cases the two rating platforms will point in the same direction, since their ratings are positively correlated, although a correlation of 0.62 still leaves room for individual movies to be rated quite differently on each.

**Does having more than one director have any effect on the Rotten Tomatoes/IMDb ratings?**

We can also observe if there is any correlation between the number of directors and the rating of a movie.

Let’s visualise the average IMDb ratings of a movie and the number of directors it has side by side.

From the above plot we can see that the average IMDb rating of movies with multiple directors is slightly higher than the average IMDb rating of movies that only have one director.

Now, let’s see if this holds true for Rotten Tomatoes.

Again we can see that the average Rotten Tomatoes rating of movies with multiple directors is higher than the average Rotten Tomatoes rating of movies with just one director.

Using the results of the bar plots above, we can conclude that movies with multiple directors tend to have higher IMDb and Rotten Tomatoes ratings. One possible explanation is that having more than one director provides more supervision, so that fewer mistakes are made during production, although the plots show an association rather than a cause.
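The comparison behind the bar plots can be sketched as a simple group-by. The example assumes that the “Directors” column is comma-separated; the rows, ratings and the helper column names `Director_Count` and `Director_Group` are made up for illustration.

```python
import pandas as pd

# Made-up movies; comma-separated directors are an assumption.
toy = pd.DataFrame({
    "Directors": ["A", "B,C", "D", "E,F,G"],
    "IMDb": [6.0, 7.0, 6.5, 7.5],
    "Rotten Tomatoes": [5.5, 7.0, 6.0, 8.0],
})

# Count the directors per movie and label each movie accordingly.
toy["Director_Count"] = toy["Directors"].str.split(",").str.len()
toy["Director_Group"] = (toy["Director_Count"] > 1).map(
    {True: "multiple", False: "single"}
)

# Average rating per group, the quantity shown in the bar plots.
avg_ratings = toy.groupby("Director_Group")[["IMDb", "Rotten Tomatoes"]].mean()
print(avg_ratings)
```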

**Which platform has the highest number and diverse range of movies available on it?**

To find which platform had the highest number of movies, we added up the number of movies on each platform and displayed the one with the highest count. Prime Video had the highest number of movies available on it, with a count of 11348.
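Since each platform column holds a 1 when a movie is available and a 0 otherwise, the count per platform is just a column sum. A minimal sketch on made-up rows, assuming the four platform column names:

```python
import pandas as pd

# Made-up availability flags: 1 = available on the platform, 0 = not available.
toy = pd.DataFrame({
    "Netflix": [1, 0, 0, 1],
    "Hulu": [0, 0, 1, 0],
    "Prime Video": [1, 1, 1, 0],
    "Disney+": [0, 0, 0, 1],
})

# Summing a 0/1 column gives the number of movies on that platform.
platform_counts = toy[["Netflix", "Hulu", "Prime Video", "Disney+"]].sum()
largest = platform_counts.idxmax()
print(platform_counts)
```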

The next step is to figure out which platform has a more diverse range of movies available on it. Since Prime Video has the highest number of movies, simply adding up the different genres available on each platform would have been biased in its favour.

To ensure that the difference in the number of movies does not affect the diversity measure, we decided to take means. We call this mean the diversity count of a platform. It is calculated by dividing the total number of movies by the total number of genres for each platform, which gives a better idea of which platform has the most diverse range of content. We observe that although Prime Video has the highest number of movies, it has the smallest diversity count, i.e. 2.34. The platform with the most diverse range of movies is Disney+, with a diversity count of 3.7. The bar graph below shows the diversity count of the 4 platforms.

**Is the US streaming more movies on these platforms than other countries?**

It is believed that the US streams more digital content than any other country. To see if that is really the case, we start by calculating the number of movies streamed by each country.

We have a total of 162 different countries in our data, and on each platform the United States streams the most movies. For a better comparison of the percentages of content streamed, we look at the top 5 streaming countries for each platform by applying the `sort_values` function.
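One way to obtain such a top 5 for a single platform is sketched below. It assumes a comma-separated “Country” column and a 0/1 “Netflix” column, and uses made-up rows, so the counts are illustrative only.

```python
import pandas as pd

# Made-up movies; the Netflix column flags availability on the platform.
toy = pd.DataFrame({
    "Country": [
        "United States", "United States,Canada", "India",
        "France,United States", "Canada", "United States",
    ],
    "Netflix": [1, 1, 1, 0, 1, 1],
})

# Keep only movies on Netflix, split multi-country entries into separate rows,
# and count the movies per country.
on_netflix = toy[toy["Netflix"] == 1]
country_counts = on_netflix["Country"].str.split(",").explode().value_counts()

# Sort in descending order and keep the five largest countries.
top5 = country_counts.sort_values(ascending=False).head(5)
print(top5)
```

The same steps can be repeated with the Hulu, Prime Video and Disney+ columns in place of Netflix.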

For Netflix, the top 5 streaming countries are France, Canada, the United Kingdom, India and the United States. The US streams the most movies on Netflix, with a count of 1622.

For Disney+, the top 5 streaming countries are France, Australia, Canada, the United Kingdom and the United States. Again, the US streams the most movies on Disney+, with a count of 533.

For Hulu, the top 5 streaming countries are Germany, France, Canada, the United Kingdom and the United States. The US streams the most movies on Hulu, with a count of 639.

For Prime Video, the top 5 streaming countries are France, India, Canada, the United Kingdom and the United States. The US again streams the most movies, with a count of 7508.

Each streaming platform’s pie chart shows by how much the United States exceeds the other top 5 countries in movies streamed. The United States is a leading movie-producing country and has the most movies on each platform.

It can be argued that, because we have limited our analysis to 5 countries out of the 162 present in the data, the percentages in the pie charts do not reflect each country’s share of the full catalogue. However, plotting all 162 countries on a pie chart would only have cluttered the visualisations. The pie charts compare the 4 platforms across their top 5 countries by movie streaming percentage, and show that the USA has the highest percentage of movies streamed on all four platforms.

**Which platform appeals to what age group the most?**

There is a general perception that most of the content available on digital platforms targets an audience that is 18+. Let’s find out if that’s actually the case.

We use separate data for each of the four platforms to find out which platform appeals to which age group the most. For this purpose we generated a table for each platform. In the tables given below, **1** refers to the number of movies that are available for the specific age group on the given platform and **0** refers to the number of movies that are not available for the specific age group on the given platform.

Four tables for Hulu, Netflix, Prime Video and Disney+ are given below.

Focusing only on the movies available in each of the 5 age groups, the tables show that Hulu, Netflix and Prime Video target the 18+ audience the most, while Disney+ targets the 3+ audience the most. This makes sense, because Disney+ is known as a kids’ platform and would therefore target the 3+ age group the most. Netflix, Hulu and Prime Video, on the other hand, are mostly known for content aimed at the 18+ audience.
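Tables of this shape can be produced with a cross-tabulation of the “Age” column against a platform column. The sketch below uses made-up rows and assumes the column names “Netflix” and “Disney+”; the counts are not those of the real dataset.

```python
import pandas as pd

# Made-up movies with a numeric age rating and 0/1 availability flags.
toy = pd.DataFrame({
    "Age": [18.0, 18.0, 7.0, 3.0, 3.0, 13.0],
    "Netflix": [1, 1, 0, 0, 0, 1],
    "Disney+": [0, 0, 1, 1, 1, 0],
})

# Rows are age groups; columns 0 and 1 count unavailable and available movies.
netflix_table = pd.crosstab(toy["Age"], toy["Netflix"])
disney_table = pd.crosstab(toy["Age"], toy["Disney+"])

# The age group with the most available movies on each platform.
netflix_top_age = netflix_table[1].idxmax()
disney_top_age = disney_table[1].idxmax()
print(netflix_table)
print(disney_table)
```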

We will be using bar plots for the visualisations of the age groups across all 4 platforms.

The bar graphs for Hulu, Netflix and Prime Video have their tallest bar at (18.0, 1), indicating that these platforms stream the most content for ages 18+. On the other hand, the bar graph of Disney+ has its tallest bar at (3.0, 1), indicating that it streams content targeted towards an audience of ages 3+.

**Feature Engineering**

Feature engineering is the art of extracting useful features from the raw dataset. Machine learning models use these features as inputs to make predictions, so the accuracy of a model depends on the features selected. For the movies dataset, we had already got rid of all the null values. We now only needed to convert non-numeric data to numeric data, as the machine learning models we use expect numeric inputs. For this purpose, we have used ordinal and one-hot encoding.

We one-hot encoded the genre column. The issue we faced was that most movies had more than one genre. To deal with that, we first made a separate column for each genre and then one-hot encoded each of these genre columns. Age, IMDb ratings, Rotten Tomatoes, Netflix, Hulu, Prime Video and Disney+ were already in numeric form. For the year column we used ordinal encoding, dividing the years into decades, e.g. the decade starting in 1900 is 1, the decade starting in 1910 is 2 and so on. After that we dropped the unnecessary columns, i.e. “Title”, “Directors”, “Genres”, “Country”, “Language” and “Genre_Count”, and our data was ready to be used in machine learning algorithms.
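The two encodings can be sketched as follows. The example assumes that “Genres” is a comma-separated string and that a year ending in 0 opens a new decade (so 1910 falls in decade 2); the `Decade` column name and the rows are made up for illustration.

```python
import pandas as pd

# Made-up movies; comma-separated genres are an assumption.
toy = pd.DataFrame({
    "Year": [1902, 1910, 1995, 2020],
    "Genres": ["Short,Fantasy", "Drama", "Comedy,Drama", "Action"],
})

# One 0/1 column per genre, even when a movie lists several genres.
genre_dummies = toy["Genres"].str.get_dummies(sep=",")

# Ordinal decade code: 1900-1909 -> 1, 1910-1919 -> 2, and so on
# (the exact boundary handling is an assumption).
toy["Decade"] = (toy["Year"] - 1900) // 10 + 1

features = pd.concat([toy[["Decade"]], genre_dummies], axis=1)
print(features)
```

`str.get_dummies` handles the multi-genre case in one step, which avoids having to split the genres into separate columns by hand.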

**Linear Regression Modelling on IMDb ratings**

It is believed that an effective way to determine if a movie is worth watching is to use its IMDb ratings. Luckily we have an IMDb ratings column in our dataset. We will apply a Multiple Linear Regression Model on the IMDb ratings to study the relationship between the dependent variable Y (IMDb ratings) and the independent variables x.

Our linear regression model will be trained to predict the IMDb ratings for a movie given the test data and to calculate the accuracy of its prediction.

The first step was to create a correlation matrix to observe the correlation between the different features. The darker the shade of blue (i.e. the closer the number is to 1), the stronger the correlation between two variables.

Next, we divide the data into features (X) and label (y), where the features are the independent variables and the label is the dependent variable (IMDb ratings) whose value we predict from the model. The next step was to scale the data. Scaling here means standardising: the distribution of each variable is rescaled so that the mean of the observed values is 0 and the standard deviation is 1. Note that this does not confine the values to a 0–1 range, which would be min-max scaling instead. Then, we split the data such that 80% of it is in the training set while 20% of it is in the test set.

Using Scikit-learn to make a regression model, we then fit the model on the training set and test the model on the test set.
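A minimal sketch of this pipeline with Scikit-learn is given below. The data is synthetic (random features and a made-up linear relationship), only the features are standardised, and the `random_state` values are arbitrary choices, so the numbers it prints are unrelated to the results reported in this blog.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movie features (X) and IMDb ratings (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 6.0 + X @ np.array([0.8, -0.3, 0.1]) + rng.normal(scale=0.5, size=200)

# Standardise each feature to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

# 80% of the rows go to the training set, 20% to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Fit on the training set, then predict and score on the test set.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = model.score(X_test, y_test)  # score() returns R-squared for regressors
print(round(r2, 3))
```

The sketch follows the order described above (scale, then split). A common alternative is to split first and fit the scaler on the training set only, so that no information from the test set influences the scaling.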

The R-squared value, also called the coefficient of determination, tells us how well the regression model fits the observed data. Our value of 0.679 means that the model explains 67.9% of the variance in the IMDb ratings. The greater the R-squared value, the better the fit of the model.

Our model has an MSE (mean squared error) of 0.323. MSE is the average of the squared differences between the real data and the predicted data. Because the errors are squared, larger errors are penalised more heavily, and because MSE is a smooth (differentiable) function, it is easier to minimise using numerical methods. A disadvantage of MSE is that it is expressed in the squared units of the data, so it cannot be compared directly with quantities in the original units.

The MAE (mean absolute error) of our model is 0.440. MAE is the average of the absolute differences between the real data and the predicted data. Unlike MSE, MAE does not give extra weight to large errors, and it is not differentiable at zero (the ‘kink’), which makes error minimisation harder. On the other hand, MSE is very sensitive to outliers because it squares each error, whereas MAE weights every error linearly, so outliers in the data have a smaller impact on MAE (it is more robust to outliers).
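To make the three metrics concrete, the sketch below computes them by hand with NumPy on four made-up observations. The values are purely illustrative and unrelated to our model.

```python
import numpy as np

# Made-up observed and predicted values.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred

# MSE: mean of the squared residuals.
mse = np.mean(residuals ** 2)

# MAE: mean of the absolute residuals.
mae = np.mean(np.abs(residuals))

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mae, round(r2, 4))  # 0.375 0.5 0.9486
```

The single largest residual (1.0) contributes 1.0 of the 1.5 total in the squared sum but only 1.0 of the 2.0 total in the absolute sum, which shows how squaring lets large errors dominate MSE.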

The lower the values of these errors, the better the fit of the regression model. Both our MSE and MAE are fairly low, which suggests that the regression model fits the data reasonably well.

Another diagnostic we use is a histogram of the residuals on the test data. A residual is defined as the difference between the observed value and the predicted value, hence a residual of 0 means that the prediction matches the observed value exactly.

It can be seen from our plot that the residuals are distributed roughly symmetrically around a centre of 0. This is consistent with the underlying assumption of normally distributed errors. The assumption of homoscedasticity (i.e. the variance of the error terms is the same across all values of the independent variables x) is usually checked with a plot of residuals against predicted values rather than a histogram. Overall, the residuals suggest that the predicted values are a reasonably good fit.

The score for our model comes out to be 67.92%. This is the R-squared value expressed as a percentage, so it measures the share of variance in IMDb ratings explained by the model rather than the probability of predicting the exact IMDb rating of a movie.

**KNN Classifier**

K-Nearest Neighbours (KNN) is one of the simplest algorithms used in Machine Learning for both regression and classification problems. KNN uses the training data and classifies new data points (test data) based on similarity measures (e.g. distance function).

We implement a KNN classifier for each of the four platforms and test its accuracy on them. KNN simply calculates the Euclidean distance between a new data point and all the training data points. It then selects the K nearest data points, where K can be any positive integer. Finally, it assigns the new data point to the class to which the majority of these K training data points belong.
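The sketch below shows how such a classifier can be trained and evaluated with Scikit-learn. The features and the 0/1 target are synthetic stand-ins for the movie features and a platform column, and the `random_state` is an arbitrary choice, so the accuracy it prints is not one of the figures reported below.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features and a 0/1 target standing in for "available on a platform".
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# k = 12 neighbours; the default metric is Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy is the share of test points classified correctly.
accuracy = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(accuracy)
print(cm)
```

The diagonal of the confusion matrix holds the correctly classified test points, so the accuracy equals the sum of the diagonal divided by the total number of test points.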

**KNN Classifier on Netflix**

Applying the KNN classifier with Netflix as the response variable and k = 12, the accuracy comes out to be 78.34%. This means that, given the features of a movie, KNN correctly predicted whether the movie is available on Netflix for 78.34% of the movies in the test set.

This accuracy is fairly high, but since it is below 100% we look for a value of k that increases it further. For this, we calculate the error for k values in the range 1 to 40. Observing the plot below, we see that the error decreases as k increases.

From the plot above we can see that there is no value of k for which the error is 0. However, we obtain the minimum error at k = 38, so among the values tested this is the value of k that gives the highest accuracy of KNN for Netflix.
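The search over k can be sketched as a simple loop. As before, the data is synthetic and the `random_state` is arbitrary, so the best k found here has no bearing on the values reported for the four platforms.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features and a 0/1 target standing in for a platform column.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Error rate (share of misclassified test points) for k = 1 ... 40.
k_values = range(1, 41)
error_rates = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates.append(np.mean(knn.predict(X_test) != y_test))

# The first k with the lowest error rate.
best_k = k_values[int(np.argmin(error_rates))]
print(best_k, min(error_rates))
```

Plotting `error_rates` against `k_values` gives an error curve of the kind discussed above. Note that choosing k by its error on the test set tends to give a slightly optimistic estimate; cross-validation on the training set is a common alternative.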

The Confusion Matrix for Netflix comes out to be:

Similar to the approach we have used on Netflix, we now apply the KNN on the other 3 platforms. The results are presented below.

**KNN Classifier on Hulu**

Applying the KNN classifier with Hulu as the response variable and k = 12, the accuracy comes out to be 95.27%. This means that, given the features of a movie, KNN correctly predicted whether the movie is available on Hulu for 95.27% of the movies in the test set.

This accuracy is higher than that of Netflix, but since it is below 100% we look for a value of k that increases it further. For this, we calculate the error for k values in the range 1 to 40. Observing the plot below, we see that the error decreases as k increases until k = 10. From k = 10 onwards, the error stays constant, as shown by the horizontal line in the plot below.

From the plot above we can see that there is no value of k for which the error is 0. However, we obtain the minimum error at k = 10, so among the values tested this is the value of k that gives the highest accuracy of KNN for Hulu.

The Confusion Matrix for Hulu comes out to be:

**KNN Classifier on Prime Video**

Applying the KNN classifier with Prime Video as the response variable and k = 12, the accuracy comes out to be 75.35%. This means that, given the features of a movie, KNN correctly predicted whether the movie is available on Prime Video for 75.35% of the movies in the test set.

This accuracy is below 100%, so we look for a value of k that increases it further. For this, we calculate the error for k values in the range 1 to 40. Observing the plot below, we see that the error generally decreases as k increases, although the curve follows a zig-zag pattern.

From the plot above we can see that there is no value of k for which the error is 0. However, we obtain the minimum error at k = 39, so among the values tested this is the value of k that gives the highest accuracy of KNN for Prime Video.

The Confusion Matrix for Prime Video comes out to be:

**KNN Classifier on Disney+**

Applying the KNN classifier with Disney+ as the response variable and k = 12, the accuracy comes out to be 97.67%. This means that, given the features of a movie, KNN correctly predicted whether the movie is available on Disney+ for 97.67% of the movies in the test set.

This accuracy is below 100%, so we look for a value of k that increases it further. For this, we calculate the error for k values in the range 1 to 40. Observing the plot below, we see that the error generally decreases as k increases, although the curve follows a zig-zag pattern.

From the plot above we can see that there is no value of k for which the error is 0. However, we obtain the minimum error at k = 36, so among the values tested this is the value of k that gives the highest accuracy of KNN for Disney+.

The Confusion Matrix for Disney+ comes out to be:

**Random Forest Classifier**

Machine learning provides a plethora of classification algorithms, and among them is the Random Forest classifier. Random Forest is a relatively easy algorithm to use and often provides high accuracy with little hyperparameter tuning compared to other machine learning algorithms. It grows multiple decision trees, each trained on a random sample of the training data and each considering a random subset of the features when splitting a branch. The forest then predicts the class chosen by the majority of its trees. Because of this randomness, Random Forest can give a slightly different result every time it runs unless the random seed is fixed.

We have used the Scikit-learn library to implement the Random Forest algorithm. It gives the accuracy of correctly predicting the availability of a movie on a particular platform. We have also plotted the feature importance graphs for each platform. Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.
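A minimal sketch of this workflow is shown below. The three features and the target are synthetic: the target is deliberately constructed from the “Age” column alone, so that the feature importances have an obvious expected ranking. The feature names, `n_estimators` and `random_state` are illustrative choices, not the settings used for our results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic features; the target depends only on Age by construction.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Runtime": rng.normal(100, 20, 400),
    "Age": rng.choice([3, 7, 13, 16, 18], 400),
    "IMDb": rng.uniform(1, 10, 400),
})
y = (X["Age"] <= 7).astype(int)  # hypothetical "available on a kids platform"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fixing random_state makes the forest reproducible between runs.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)

# Impurity-based importances; they sum to 1 across the features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(accuracy)
print(importances)
```

A bar plot of `importances` gives a feature importance graph of the kind shown for each platform below.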

The results of the 4 platforms are given below.

**Random Forest Classifier on Netflix**

The plot below shows that the Runtime of a movie has the highest score for predicting its availability on Netflix and is hence the most important feature for Netflix movies.

Applying the Random Forest classifier with Netflix as the response variable, the accuracy comes out to be 78.90%. This means that, given the features of a movie, Random Forest correctly predicted whether the movie is available on Netflix for 78.90% of the movies in the test set.

**Random Forest Classifier on Hulu**

The plot below shows that the Rotten Tomatoes rating of a movie has the highest score for predicting its availability on Hulu and is hence the most important feature for Hulu movies.

Applying the Random Forest classifier with Hulu as the response variable, the accuracy comes out to be 94.13%. This means that, given the features of a movie, Random Forest correctly predicted whether the movie is available on Hulu for 94.13% of the movies in the test set.

**Random Forest Classifier on Prime Video**

The plot below shows that the Runtime of a movie has the highest score for predicting its availability on Prime Video and is hence the most important feature for Prime Video movies.

Applying the Random Forest classifier with Prime Video as the response variable, the accuracy comes out to be 75.97%. This means that, given the features of a movie, Random Forest correctly predicted whether the movie is available on Prime Video for 75.97% of the movies in the test set.

**Random Forest Classifier on Disney+**

The plot below shows that the Age rating of a movie has the highest score for predicting its availability on Disney+ and is hence the most important feature for Disney+ movies.

Applying the Random Forest classifier with Disney+ as the response variable, the accuracy comes out to be 97.70%. This means that, given the features of a movie, Random Forest correctly predicted whether the movie is available on Disney+ for 97.70% of the movies in the test set.

**Conclusion**

Our blog begins with data cleaning and exploratory data analysis. The cleaned data allowed us to answer the sub-questions. We discovered that the US is producing the most multi-lingual content on the streaming platforms. Using the correlation matrix and scatterplot, we found that the Rotten Tomatoes and IMDb ratings are moderately correlated. We also found that movies with multiple directors had higher average IMDb and Rotten Tomatoes ratings. We uncovered that Prime Video has the highest number of movies available on it, but Disney+ has a more diverse range of movies. Another important finding is that the US is streaming more content on these platforms than any other country. We also analysed which platform is suitable for which age group and found that while Hulu, Netflix and Prime Video target the 18+ audience, Disney+ targets the 3+ audience.

The main question that we tried to answer is how accurately we can predict whether a particular movie will be streamed on a specific platform. For this we applied 3 different Machine Learning models to our dataset. Firstly, we applied a Multiple Linear Regression model with the IMDb ratings of the movies as the response variable. Its score of 67.92% is the R-squared value, meaning that the model explains 67.92% of the variance in the IMDb ratings.

Next, we applied both the KNN and Random Forest classifiers to the 4 platforms. The reason we used 2 different classifiers on the same platforms was to compare their accuracies and decide which model can serve as a better predictor for our dataset. As summarised in the table below, both models give almost the same accuracies. The accuracy of both KNN and Random Forest is highest for Disney+, followed by Hulu, then Netflix and then Prime Video. We can therefore conclude that either of these models can be used to predict the availability of a movie on Netflix, Hulu, Prime Video and Disney+.

| Platform | KNN (k = 12) | Random Forest |
| --- | --- | --- |
| Netflix | 78.34% | 78.90% |
| Hulu | 95.27% | 94.13% |
| Prime Video | 75.35% | 75.97% |
| Disney+ | 97.67% | 97.70% |

We end this blog with a quote that might inspire the readers to invest their time in exploring the fascinating world of Machine Learning.

*“A breakthrough in Machine Learning would be worth ten Microsofts.” -Bill Gates*