To put the data analytics skills I've acquired recently to use, I've tried to find interesting insights in movies released between 1916 and 2016, using Python. I downloaded a movie dataset and wrote Python code to explore the data, gaining insights into the movies, actors, directors, and collections. You can use the following links to go to a particular section, or scroll through the document for the complete analysis.
If you are short of time to go through the entire analysis, here are the important conclusions.
# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')
# Importing the numpy and pandas packages
import numpy as np
import pandas as pd
import seaborn as sns
movies = pd.read_csv('./Movie+Assignment+Data.csv')
movies
Inspecting the dataframe's columns, shape, variable types, etc.
# Number of rows and columns in the dataset as a tuple : (rows, columns)
shape = movies.shape
print('There are {} rows and {} columns in the movies dataframe'.format(shape[0], shape[1]),'\n\n' )
# dataframe info: column names, non-null counts, and dtypes
# info() prints directly; wrapping it in print() would also print its None return value
movies.info()
The above output gives the names of the columns, the number of non-null values in each column, and the datatype of each.
# Summary Statistics of the data.
movies.describe()
# No of rows containing null values in each column.
print('Column Name \t: \tTotal Null Rows \n', movies.isnull().sum().sort_values(ascending=False))
# No of columns containing null values in each row.
print('Row Index : \tTotal Null Columns \n', movies.isnull().sum(axis=1).sort_values(ascending=False))
Note that some rows have more than half of the values missing.
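To see these rows explicitly, here is a small sketch I've added (the half-of-columns threshold simply mirrors the observation above):
# rows in which more than half of the column values are missing
half = movies.shape[1] / 2
mostly_null_rows = movies[movies.isnull().sum(axis=1) > half]
print(mostly_null_rows.shape[0], 'rows have more than half of their values missing')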
# (Total null rows per column / Total rows in the data frame) * 100 , rounded to 2 decimals
column_nulls = np.round(((movies.isnull().sum()/movies.shape[0])*100).sort_values(ascending=False),2)
print('Column Name : \t\tNull Columns (%) \n', column_nulls)
In this analysis, we will mostly be analyzing the movies with respect to ratings, gross collections, popularity, etc., so many of the columns in this dataframe are not required. We drop the following columns.
columns_to_drop = ['color','director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes',
'actor_3_facebook_likes','actor_2_name','cast_total_facebook_likes','actor_3_name',
'duration','facenumber_in_poster','content_rating','country','movie_imdb_link',
'aspect_ratio','plot_keywords']
#dropping columns in place
movies.drop(columns=columns_to_drop, inplace=True)
movies.shape
Now, on inspection you might notice that some columns have a large percentage (greater than 5%) of null values. Dropping all the rows that have null values in such columns.
# columns with null (%) > 5
column_nulls = np.round((movies.isnull().sum()/movies.shape[0])*100,2)
high_null_columns = column_nulls[column_nulls > 5]
print(high_null_columns)
# dropping rows that have nulls in the high-null columns
movies.dropna(axis=0, subset=high_null_columns.index, inplace=True)
movies
You might notice that the `language` column has some NaN values. Here, on inspection, you will see that it is safe to replace all the missing values with 'English'.
# 12 NAs in language column
# Filling NAs with 'English'
movies.fillna({'language' : 'English'}, inplace=True)
movies
You might notice that two of the columns, viz. `num_critic_for_reviews` and `actor_1_name`, have small percentages of NaN values left. We can leave these columns as they are for now.
print('Column Name : \t\tTotal Null Rows \n', movies.isnull().sum(),'\n\n')
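If you preferred to impute these instead, here is a minimal sketch (my addition; the analysis below leaves them as-is, and both the median fill and the 'Unknown' placeholder are assumptions, not part of the original workflow):
# optional: impute the remaining NaNs instead of leaving them
movies['num_critic_for_reviews'] = movies['num_critic_for_reviews'].fillna(
    movies['num_critic_for_reviews'].median())  # median is robust to outliers
movies['actor_1_name'] = movies['actor_1_name'].fillna('Unknown')  # 'Unknown' is a hypothetical placeholder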
# Retained row % = (Current rows / Initial rows) * 100
#shape contains the shape of the original imported movies dataset
retained = (movies.shape[0]/shape[0])*100
print('Retained rows w.r.t. the original dataframe :', retained, '%\n\n')
# Converting the budget and gross columns to millions (not rounding off, to preserve accuracy)
movies['budget'] = movies['budget']/10**6
movies['gross'] = movies['gross']/10**6
movies.head(10)
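As a quick sanity check (a sketch I've added), the summary statistics of the two converted columns should now be on a scale of millions:
# sanity check: budget and gross should now read in millions
print(movies[['budget', 'gross']].describe())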
1. Creating a new column called `profit` which contains the difference of the two columns: `gross` and `budget`.
2. Sorting the dataframe using the `profit` column as reference.
3. Plotting `profit` (y-axis) vs `budget` (x-axis) and observing the outliers using the appropriate chart type.
4. Extracting the top ten most profitable movies in descending order and storing them in a new dataframe: `top10`.
# creating the profit column
movies['profit'] = movies['gross'] - movies['budget']
movies.head()
# sorting the dataframe
movies.sort_values(by='profit', inplace=True, ascending=False)
movies.head()
# profit vs budget plot
import matplotlib.pyplot as plt
movies.plot.scatter('budget', 'profit')
plt.title('Profit vs Budget')
plt.show()
From the above plot, we can see that some movies have incurred very large losses compared to the rest. These are outliers/exceptions.
# outliers: movies with profit below -$2000M and budget above $2000M (per the converted units)
movies.loc[(movies['profit'] < -2000) & (movies['budget'] > 2000), 'movie_title']
The above movies have incurred huge losses, far outside the range of a typical movie's earnings.
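To see the bulk of the data more clearly, one could re-plot with these extreme points excluded (a sketch I've added, reusing the same thresholds as above):
# re-plotting profit vs budget without the extreme outliers
typical = movies[(movies['profit'] >= -2000) & (movies['budget'] <= 2000)]
typical.plot.scatter('budget', 'profit')
plt.title('Profit vs Budget (outliers excluded)')
plt.show()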
#top 10 movies by profit
top10 = movies.iloc[:10]
top10
Notice that James Cameron has made two of the top 10 most profitable movies: 'Avatar' and 'Titanic'. He has made more profits than the most profitable Steven Spielberg and Christopher Nolan movies.
Out of the top 10 most profitable movies, you might have noticed a duplicate value. So it seems the dataframe has duplicate rows as well. We drop the duplicates from the dataframe and repeat the previous task. Note that the same `movie_title` can appear in different languages.
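Before dropping, one can inspect which titles occur more than once (a sketch I've added; note that `drop_duplicates` below removes only fully identical rows, so the same title in a different language survives):
# inspecting duplicated titles before dropping fully identical rows
dupes = movies[movies.duplicated(subset='movie_title', keep=False)]
dupes[['movie_title', 'language', 'title_year']].sort_values('movie_title').head()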
# dropping duplicates
movies.drop_duplicates(inplace=True)
# repeating the previous task
movies_by_profit = movies.sort_values(by='profit', ascending=False)
top10 = movies_by_profit.iloc[:10]
top10
1. Creating a new dataframe `IMDb_Top_250` and storing the top 250 movies with the highest IMDb rating (corresponding to the column: `imdb_score`). We consider only those movies with more than 25,000 votes. We then add a `Rank` column containing the values 1 to 250, indicating the ranks of the corresponding films.
2. We shall also extract the movies in the `IMDb_Top_250` dataframe which are not in the English language and store them in a new dataframe named `Top_Foreign_Lang_Film`.
# extracting the top 250 movies as per the IMDb score.
# selecting movies where the number of voted users is greater than 25000.
movies_by_imdb_score = movies[movies['num_voted_users'] > 25000].sort_values(by='imdb_score', ascending=False)
# selecting the first 250 movies; .copy() avoids a SettingWithCopyWarning when adding the Rank column below
IMDb_Top_250 = movies_by_imdb_score.iloc[:250].copy()
#adding a new column "Rank" which contains rank of the movie
IMDb_Top_250['Rank'] = np.arange(1,251)
IMDb_Top_250.tail()
# Out of the top 250 movies, selecting movies which are not in 'English' language
Top_Foreign_Lang_Film = IMDb_Top_250[IMDb_Top_250['language'] != 'English'].sort_values(by='Rank')
Top_Foreign_Lang_Film
Well, our very own Veer-Zaara has made the list. So has Bahubali!
#extracting the top 10 directors
#Grouping 'movies' dataframe by 'director_name'.
# Calculating mean of 'imdb_score' for each 'director_name'
mean_imdb_score = movies.groupby('director_name')['imdb_score'].mean()
#Creating a new dataframe directors_imdb_score
# mean imdb score is rounded to 1 decimal because the dataset provides scores to 1 decimal place
directors_mean_imdb_score = pd.DataFrame(np.round((mean_imdb_score),1))
# creating a new index and converting director_name into a column
directors_mean_imdb_score = directors_mean_imdb_score.reset_index()
# sorting in descending order of mean imdb scores
directors_mean_imdb_score.sort_values(by=['imdb_score','director_name'], ascending=[False,True],inplace=True)
# top 10 directors by mean imdb scores
top10director = directors_mean_imdb_score.iloc[:10]
top10director
No surprise that Damien Chazelle (director of Whiplash and La La Land) is in this list.
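One caveat: a raw mean favours directors with a single highly rated film. Here is a sketch (my addition) that additionally requires a minimum number of films per director; the threshold of 3 is an arbitrary assumption:
# requiring at least 3 films per director before ranking by mean imdb score
director_stats = movies.groupby('director_name')['imdb_score'].agg(['mean', 'count'])
experienced = director_stats[director_stats['count'] >= 3]
experienced.sort_values('mean', ascending=False).round(1).head(10)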
# splitting genre into genre_1 and genre_2
movies['genre_1'] = movies['genres'].apply(lambda x : x.split('|')[0])
def genre_2(x):
    # second genre if present, otherwise fall back to the first
    split = x.split('|')
    if len(split) > 1:
        return split[1]
    else:
        return split[0]
movies['genre_2'] = movies['genres'].apply(genre_2)
# sanity check: movies with a single genre, where genre_2 falls back to genre_1
movies[movies['genre_1'] == movies['genre_2']]
# grouping movies by genre_1 and genre_2
movies_by_segment = movies.groupby(['genre_1','genre_2'])
movies_by_segment=movies_by_segment['gross'].mean()
movies_by_segment.sort_values(ascending=False, inplace=True)
movies_by_segment
movies_by_segment.head()
Looks like the highest-grossing genre combination (by mean gross) is Family + Sci-Fi, and it earns more than twice as much as the next combination: Adventure + Sci-Fi.
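Another way to view the same numbers (a sketch I've added) is as a genre_1 x genre_2 pivot table of mean gross:
# mean gross (in $M) for each genre pair, laid out as a pivot table
genre_pivot = movies.pivot_table(index='genre_1', columns='genre_2', values='gross', aggfunc='mean')
genre_pivot.round(1)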
Meryl_Streep = movies[movies['actor_1_name'] == 'Meryl Streep']# Include all movies in which Meryl_Streep is the lead
Leo_Caprio = movies[movies['actor_1_name'] == 'Leonardo DiCaprio'] # Include all movies in which Leo_Caprio is the lead
Brad_Pitt = movies[movies['actor_1_name'] == 'Brad Pitt'] # Include all movies in which Brad_Pitt is the lead
# combining the three dataframes
Combined = pd.concat([Meryl_Streep,Leo_Caprio,Brad_Pitt])
Combined.head()
# grouping the combined dataframe
reviews_by_actor = Combined.groupby('actor_1_name')
# Finding the mean of critic reviews and audience reviews
# actors vs mean critic reviews in descending order
critic_reviews = reviews_by_actor['num_critic_for_reviews'].mean().sort_values(ascending=False)
# actors vs mean user reviews in descending order
user_reviews = reviews_by_actor['num_user_for_reviews'].mean().sort_values(ascending=False)
print(critic_reviews,'\n\n',user_reviews,'\n\n')
print('Actor with highest mean critic reviews : ',critic_reviews.index.to_list()[0] )
print('Actor with highest mean user reviews : ',user_reviews.index.to_list()[0] )
Leonardo has aced both lists. He is both the most liked and the most critically acclaimed actor!
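A quick visual comparison of the two metrics (a sketch I've added, reusing the groupby from above):
# side-by-side bars of mean critic and user reviews per lead actor
review_means = reviews_by_actor[['num_critic_for_reviews', 'num_user_for_reviews']].mean()
review_means.plot.bar()
plt.ylabel('Mean number of reviews')
plt.title('Mean Critic and User Reviews per Lead Actor')
plt.show()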
# calculating decade: flooring each title_year to the start of its decade
# (plain vectorized pandas arithmetic; np.vectorize is not needed here)
movies['decade'] = (movies['title_year'] // 10 * 10).astype(int)
# number of voters grouped by decade
votes_grouped_by_decade = movies.groupby('decade')['num_voted_users'].sum()
# creating the dataframe df_by_decade from the grouped series
df_by_decade = pd.DataFrame(votes_grouped_by_decade)
df_by_decade.reset_index(inplace=True)
df_by_decade
# Plotting number of voted users vs decade
sns.barplot(x='decade',y='num_voted_users',data = df_by_decade)
plt.xlabel('Decade')
plt.ylabel('No of votes')
plt.title('User Votes vs Movie Release Decade')
plt.show()
This plot shows the number of votes cast for movies of each decade. You can notice that there aren't many votes for older movies. This could be because most people are simply more familiar with recent movies. But it's interesting that movies made during 2000-2009 have drawn more votes than those released in or after 2010; note, though, that the dataset ends in 2016, so the 2010s decade is incomplete.
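Since later decades also contain more films, one could normalize by the number of movies per decade (a sketch I've added):
# mean votes per movie in each decade, adjusting for how many films each decade contains
mean_votes = movies.groupby('decade')['num_voted_users'].mean().reset_index()
sns.barplot(x='decade', y='num_voted_users', data=mean_votes)
plt.xlabel('Decade')
plt.ylabel('Mean votes per movie')
plt.title('Mean User Votes per Movie by Decade')
plt.show()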
Finally, this analysis has been an endeavour to apply the Python skills I've acquired to draw insights from data.