IMDB Movie Analysis

To put my recently acquired data analytics skills to use, I looked for interesting insights in movies released between 1916 and 2016, using Python. I downloaded a movie dataset, wrote Python code to explore it, and gained insights into the movies, actors, directors, and collections. The complete analysis follows, section by section.

If you are short of time to go through the entire analysis, here are the important conclusions.

  • We imported movie data from IMDB and inspected the attributes.
  • We cleaned the data of missing values and unnecessary columns.
  • We found the five worst performing movies to be:
    • The Host
    • Lady Vengeance
    • Fateless
    • Princess Mononoke
    • Steamboy
  • We also found the movies which made the most profit. James Cameron's Avatar is the most profitable movie so far, and his Titanic is also in the top 3. Christopher Nolan's The Dark Knight is the 9th most profitable. One can easily notice that the most profitable movies are predominantly Sci-Fi.
  • We ranked the top 250 movies by IMDB rating.
  • Among the best foreign language films, we see 'Baahubali: The Beginning' and 'Veer-Zaara', which are quite popular in India.
  • It's no wonder that Charles Chaplin is the best director. Other noteworthy mentions are Damien Chazelle, the director of Whiplash and La La Land, Christopher Nolan and Alfred Hitchcock.
  • The most popular genres are Family + Sci-Fi and Sci-Fi Adventure. Looks like a lot of people are still in the 'Back to the Future' clan.
  • We compared the most liked contemporary actors: Meryl Streep, Leonardo DiCaprio and Brad Pitt. It turns out that, among them, Leonardo DiCaprio is both critically acclaimed and an audience favourite.
  • At the end, we compared the popularity of movies by the decade in which they were released.
In [58]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')
In [59]:
# Importing the numpy, pandas and seaborn packages
import numpy as np
import pandas as pd
import seaborn as sns

Reading and Inspection

Importing and reading

Importing and reading the movie database.

In [60]:
movies = pd.read_csv('./Movie+Assignment+Data.csv')
movies
Out[60]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5038 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama ... 6.0 English Canada NaN NaN 2013.0 470.0 7.7 NaN 84
5039 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller ... 359.0 English USA TV-14 NaN NaN 593.0 7.5 16.00 32000
5040 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller ... 3.0 English USA NaN 1400.0 2013.0 0.0 6.3 NaN 16
5041 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance ... 9.0 English USA PG-13 NaN 2012.0 719.0 6.3 2.35 660
5042 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary ... 84.0 English USA PG 1100.0 2004.0 23.0 6.6 1.85 456

5043 rows × 28 columns

Dataframe Inspection

Inspecting the dataframe's columns, shapes, variable types etc.

In [61]:
# Number of rows and columns in the dataset as a tuple : (rows, columns)
shape = movies.shape
print('There are {} rows and {} columns in the movies dataframe'.format(shape[0], shape[1]),'\n\n' )
There are 5043 rows and 28 columns in the movies dataframe 


In [62]:
# dataframe info (info() prints its report directly and returns None, so it is not wrapped in print())
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      5024 non-null   object 
 1   director_name              4939 non-null   object 
 2   num_critic_for_reviews     4993 non-null   float64
 3   duration                   5028 non-null   float64
 4   director_facebook_likes    4939 non-null   float64
 5   actor_3_facebook_likes     5020 non-null   float64
 6   actor_2_name               5030 non-null   object 
 7   actor_1_facebook_likes     5036 non-null   float64
 8   gross                      4159 non-null   float64
 9   genres                     5043 non-null   object 
 10  actor_1_name               5036 non-null   object 
 11  movie_title                5043 non-null   object 
 12  num_voted_users            5043 non-null   int64  
 13  cast_total_facebook_likes  5043 non-null   int64  
 14  actor_3_name               5020 non-null   object 
 15  facenumber_in_poster       5030 non-null   float64
 16  plot_keywords              4890 non-null   object 
 17  movie_imdb_link            5043 non-null   object 
 18  num_user_for_reviews       5022 non-null   float64
 19  language                   5031 non-null   object 
 20  country                    5038 non-null   object 
 21  content_rating             4740 non-null   object 
 22  budget                     4551 non-null   float64
 23  title_year                 4935 non-null   float64
 24  actor_2_facebook_likes     5030 non-null   float64
 25  imdb_score                 5043 non-null   float64
 26  aspect_ratio               4714 non-null   float64
 27  movie_facebook_likes       5043 non-null   int64  
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB

The output above lists each column's name, its number of non-null values, and its datatype.

In [63]:
# Summary Statistics of the data.
movies.describe()
Out[63]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 4993.000000 5028.000000 4939.000000 5020.000000 5036.000000 4.159000e+03 5.043000e+03 5043.000000 5030.000000 5022.000000 4.551000e+03 4935.000000 5030.000000 5043.000000 4714.000000 5043.000000
mean 140.194272 107.201074 686.509212 645.009761 6560.047061 4.846841e+07 8.366816e+04 9699.063851 1.371173 272.770808 3.975262e+07 2002.470517 1651.754473 6.442138 2.220403 7525.964505
std 121.601675 25.197441 2813.328607 1665.041728 15020.759120 6.845299e+07 1.384853e+05 18163.799124 2.013576 377.982886 2.061149e+08 12.474599 4042.438863 1.125116 1.385113 19320.445110
min 1.000000 7.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
25% 50.000000 93.000000 7.000000 133.000000 614.000000 5.340988e+06 8.593500e+03 1411.000000 0.000000 65.000000 6.000000e+06 1999.000000 281.000000 5.800000 1.850000 0.000000
50% 110.000000 103.000000 49.000000 371.500000 988.000000 2.551750e+07 3.435900e+04 3090.000000 1.000000 156.000000 2.000000e+07 2005.000000 595.000000 6.600000 2.350000 166.000000
75% 195.000000 118.000000 194.500000 636.000000 11000.000000 6.230944e+07 9.630900e+04 13756.500000 2.000000 326.000000 4.500000e+07 2011.000000 918.000000 7.200000 2.350000 3000.000000
max 813.000000 511.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 1.221550e+10 2016.000000 137000.000000 9.500000 16.000000 349000.000000

Cleaning the Data

Inspecting Null values

Finding the number of null values in each column and in each row, along with the percentage of null values in each column, rounded to two decimal places.

In [64]:
# No of rows containing null values in each column.
print('Column Name   \t: \tTotal Null Rows  \n', movies.isnull().sum().sort_values(ascending=False))
Column Name   	: 	Total Null Rows  
 gross                        884
budget                       492
aspect_ratio                 329
content_rating               303
plot_keywords                153
title_year                   108
director_name                104
director_facebook_likes      104
num_critic_for_reviews        50
actor_3_name                  23
actor_3_facebook_likes        23
num_user_for_reviews          21
color                         19
duration                      15
facenumber_in_poster          13
actor_2_name                  13
actor_2_facebook_likes        13
language                      12
actor_1_name                   7
actor_1_facebook_likes         7
country                        5
movie_facebook_likes           0
genres                         0
movie_title                    0
num_voted_users                0
movie_imdb_link                0
imdb_score                     0
cast_total_facebook_likes      0
dtype: int64
In [65]:
# No of columns containing null values in each row.
print('Row Index : \tTotal Null Columns  \n', movies.isnull().sum(axis=1).sort_values(ascending=False))
Row Index : 	Total Null Columns  
 279     15
4       14
4945    11
2241    11
2342    10
        ..
2708     0
2707     0
2706     0
2705     0
0        0
Length: 5043, dtype: int64

Note that some rows have more than half of the values missing.
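To list such rows explicitly, one option is to compare each row's null count against half the number of columns. The sketch below runs on a small made-up frame rather than the real `movies` data; the names `toy`, `nulls_per_row` and `mostly_empty` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `movies`; the values are made up for illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, np.nan, 9.0],
    "d": [1.0, 2.0, 3.0],
})

# Number of null cells in each row.
nulls_per_row = toy.isnull().sum(axis=1)

# Keep only the rows where more than half of the columns are null.
mostly_empty = nulls_per_row[nulls_per_row > toy.shape[1] / 2]
print(mostly_empty)
```

On the real dataframe the same comparison would use `movies.shape[1] / 2`, i.e. 14 of the 28 columns.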

In [66]:
# (Total null rows per column / Total rows in the data frame) * 100 , rounded to 2 decimals
column_nulls = np.round(((movies.isnull().sum()/movies.shape[0])*100).sort_values(ascending=False),2)

print('Column Name : \t\tNull Rows (%)  \n', column_nulls)
Column Name : 		Null Rows (%)  
 gross                        17.53
budget                        9.76
aspect_ratio                  6.52
content_rating                6.01
plot_keywords                 3.03
title_year                    2.14
director_name                 2.06
director_facebook_likes       2.06
num_critic_for_reviews        0.99
actor_3_name                  0.46
actor_3_facebook_likes        0.46
num_user_for_reviews          0.42
color                         0.38
duration                      0.30
facenumber_in_poster          0.26
actor_2_name                  0.26
actor_2_facebook_likes        0.26
language                      0.24
actor_1_name                  0.14
actor_1_facebook_likes        0.14
country                       0.10
movie_facebook_likes          0.00
genres                        0.00
movie_title                   0.00
num_voted_users               0.00
movie_imdb_link               0.00
imdb_score                    0.00
cast_total_facebook_likes     0.00
dtype: float64
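The percentage calculation above divides the null count by the row count. An equivalent and slightly shorter form takes the mean of the boolean null mask, since the mean of True/False values is the fraction of True. This is a minimal sketch on a made-up frame (`toy` and `null_pct` are illustrative names), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `movies`; the values are made up for illustration.
toy = pd.DataFrame({
    "gross": [1.0, np.nan, 3.0, np.nan],
    "budget": [10.0, 20.0, np.nan, 40.0],
    "movie_title": ["a", "b", "c", "d"],
})

# isnull() gives booleans; the column-wise mean is the fraction of missing values.
null_pct = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(null_pct)
```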

Dropping unnecessary columns

In this analysis, we will mostly be analyzing the movies with respect to ratings, gross collection, popularity, and so on. Many of the columns in this dataframe are not required for that, so we drop the following columns.

  • color
  • director_facebook_likes
  • actor_1_facebook_likes
  • actor_2_facebook_likes
  • actor_3_facebook_likes
  • actor_2_name
  • cast_total_facebook_likes
  • actor_3_name
  • duration
  • facenumber_in_poster
  • content_rating
  • country
  • movie_imdb_link
  • aspect_ratio
  • plot_keywords
In [67]:
columns_to_drop = ['color','director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes',
                   'actor_3_facebook_likes','actor_2_name','cast_total_facebook_likes','actor_3_name',
                   'duration','facenumber_in_poster','content_rating','country','movie_imdb_link',
                   'aspect_ratio','plot_keywords']

#dropping columns in place
movies.drop(columns=columns_to_drop, inplace=True)
movies.shape
Out[67]:
(5043, 13)
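One practical caveat with an in-place drop in a notebook: running the same cell a second time raises a KeyError, because the columns are already gone. A hedged sketch of one way around this is to pass `errors="ignore"` and assign the result instead of mutating in place. The frame `toy` and the name `trimmed` are illustrative; the single row reuses the Avatar values shown earlier.

```python
import pandas as pd

# One-row toy frame; aspect_ratio is deliberately absent.
toy = pd.DataFrame({
    "color": ["Color"],
    "duration": [178.0],
    "gross": [760505847.0],
    "movie_title": ["Avatar"],
})
columns_to_drop = ["color", "duration", "aspect_ratio"]

# errors="ignore" skips labels that are not present instead of raising KeyError,
# so the same statement can be run repeatedly without failing.
trimmed = toy.drop(columns=columns_to_drop, errors="ignore")
print(trimmed.columns.tolist())
```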

Dropping unnecessary rows using columns with high Null percentages

On inspection, some columns still have a large percentage (greater than 5%) of null values. We drop every row that has a null value in any of these columns.

In [68]:
# columns with null (%) > 5
column_nulls = np.round((movies.isnull().sum()/movies.shape[0])*100,2)
high_null_columns = column_nulls[column_nulls > 5]
print(high_null_columns)

# dropping rows that are null in any of the high-null columns
movies.dropna( axis = 0, subset=high_null_columns.index, inplace=True)
movies
gross     17.53
budget     9.76
dtype: float64
Out[68]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes
0 James Cameron 723.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 3054.0 English 237000000.0 2009.0 7.9 33000
1 Gore Verbinski 302.0 309404152.0 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 1238.0 English 300000000.0 2007.0 7.1 0
2 Sam Mendes 602.0 200074175.0 Action|Adventure|Thriller Christoph Waltz Spectre 275868 994.0 English 245000000.0 2015.0 6.8 85000
3 Christopher Nolan 813.0 448130642.0 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 2701.0 English 250000000.0 2012.0 8.5 164000
5 Andrew Stanton 462.0 73058679.0 Action|Adventure|Sci-Fi Daryl Sabara John Carter 212204 738.0 English 263700000.0 2012.0 6.6 24000
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5033 Shane Carruth 143.0 424760.0 Drama|Sci-Fi|Thriller Shane Carruth Primer 72639 371.0 English 7000.0 2004.0 7.0 19000
5034 Neill Dela Llana 35.0 70071.0 Thriller Ian Gamazon Cavite 589 35.0 English 7000.0 2005.0 6.3 74
5035 Robert Rodriguez 56.0 2040920.0 Action|Crime|Drama|Romance|Thriller Carlos Gallardo El Mariachi 52055 130.0 Spanish 7000.0 1992.0 6.9 0
5037 Edward Burns 14.0 4584.0 Comedy|Drama Kerry Bishé Newlyweds 1338 14.0 English 9000.0 2011.0 6.4 413
5042 Jon Gunn 43.0 85222.0 Documentary John August My Date with Drew 4285 84.0 English 1100.0 2004.0 6.6 456

3891 rows × 13 columns
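The row-dropping step above can be wrapped in a small helper so that the 5% threshold becomes a parameter. This is only a sketch under stated assumptions: the function name `drop_rows_for_sparse_columns`, its `threshold_pct` argument and the toy data are all hypothetical, and it returns a new frame rather than modifying the input in place.

```python
import numpy as np
import pandas as pd

def drop_rows_for_sparse_columns(df, threshold_pct=5.0):
    """Drop rows that are null in any column whose null share exceeds threshold_pct."""
    null_pct = df.isnull().mean() * 100
    sparse_cols = null_pct[null_pct > threshold_pct].index.tolist()
    # dropna only looks at the listed columns; nulls elsewhere are left alone.
    return df.dropna(axis=0, subset=sparse_cols), sparse_cols

# Toy frame with 25 rows (made-up values).
toy = pd.DataFrame({
    "gross": [np.nan] * 5 + [1.0] * 20,       # 20% missing: above the threshold
    "language": ["English"] * 24 + [None],    # 4% missing: below the threshold
})

cleaned, sparse_cols = drop_rows_for_sparse_columns(toy)
print(sparse_cols, cleaned.shape)
```

Rows with a missing `language` survive here because that column stays under the threshold, which mirrors how `language` is handled separately in the next step.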

Filling NaN values

You might notice that the language column has some NaN values. Here, on inspection, you will see that it is safe to replace all the missing values with 'English'.

In [69]:
# 12 NAs in language column

# Filling NAs with 'English' 
movies.fillna({'language' : 'English'}, inplace=True)
movies
Out[69]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes
0 James Cameron 723.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 3054.0 English 237000000.0 2009.0 7.9 33000
1 Gore Verbinski 302.0 309404152.0 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 1238.0 English 300000000.0 2007.0 7.1 0
2 Sam Mendes 602.0 200074175.0 Action|Adventure|Thriller Christoph Waltz Spectre 275868 994.0 English 245000000.0 2015.0 6.8 85000
3 Christopher Nolan 813.0 448130642.0 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 2701.0 English 250000000.0 2012.0 8.5 164000
5 Andrew Stanton 462.0 73058679.0 Action|Adventure|Sci-Fi Daryl Sabara John Carter 212204 738.0 English 263700000.0 2012.0 6.6 24000
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5033 Shane Carruth 143.0 424760.0 Drama|Sci-Fi|Thriller Shane Carruth Primer 72639 371.0 English 7000.0 2004.0 7.0 19000
5034 Neill Dela Llana 35.0 70071.0 Thriller Ian Gamazon Cavite 589 35.0 English 7000.0 2005.0 6.3 74
5035 Robert Rodriguez 56.0 2040920.0 Action|Crime|Drama|Romance|Thriller Carlos Gallardo El Mariachi 52055 130.0 Spanish 7000.0 1992.0 6.9 0
5037 Edward Burns 14.0 4584.0 Comedy|Drama Kerry Bishé Newlyweds 1338 14.0 English 9000.0 2011.0 6.4 413
5042 Jon Gunn 43.0 85222.0 Documentary John August My Date with Drew 4285 84.0 English 1100.0 2004.0 6.6 456

3891 rows × 13 columns
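The "inspection" that justifies filling with 'English' can be made explicit by looking at the share of each language before filling. The sketch below uses a made-up `language` column, and the 0.5 cut-off for "clearly dominant" is an arbitrary assumption for illustration, not something taken from the analysis.

```python
import numpy as np
import pandas as pd

# Toy column standing in for movies['language'] (made-up values).
toy = pd.DataFrame({"language": ["English", "English", "French", np.nan, "English"]})

# Share of each language among the non-null rows.
shares = toy["language"].value_counts(normalize=True)
most_common = shares.idxmax()
print(shares)

# Fill only if one language clearly dominates (0.5 is an assumed cut-off).
if shares[most_common] > 0.5:
    toy = toy.fillna({"language": most_common})
print(toy["language"].tolist())
```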

Checking the number of retained rows

You might notice that two of the columns, num_critic_for_reviews and actor_1_name, still have a few NaN values. We can leave these columns as they are for now.

In [70]:
print('Column Name : \t\tTotal Null Rows  \n', movies.isnull().sum(),'\n\n')

# Retained rows % = (Current rows / Initial rows) * 100
# shape holds the shape of the originally imported movies dataset
retained = (movies.shape[0]/shape[0])*100
print('Retained rows wrt to origin dataframe :',retained,'%\n\n')
Column Name : 		Total Null Rows  
 director_name             0
num_critic_for_reviews    1
gross                     0
genres                    0
actor_1_name              3
movie_title               0
num_voted_users           0
num_user_for_reviews      0
language                  0
budget                    0
title_year                0
imdb_score                0
movie_facebook_likes      0
dtype: int64 


Retained rows wrt to origin dataframe : 77.15645449137418 %


Data Analysis

Changing the unit of columns

Converting the unit of the budget and gross columns from $ to million $.

In [71]:
#Not rounding off to preserve accuracy
movies['budget'] = movies['budget']/10**6
movies['gross'] = movies['gross']/10**6
movies.head(10)
Out[71]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes
0 James Cameron 723.0 760.505847 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 3054.0 English 237.0 2009.0 7.9 33000
1 Gore Verbinski 302.0 309.404152 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 1238.0 English 300.0 2007.0 7.1 0
2 Sam Mendes 602.0 200.074175 Action|Adventure|Thriller Christoph Waltz Spectre 275868 994.0 English 245.0 2015.0 6.8 85000
3 Christopher Nolan 813.0 448.130642 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 2701.0 English 250.0 2012.0 8.5 164000
5 Andrew Stanton 462.0 73.058679 Action|Adventure|Sci-Fi Daryl Sabara John Carter 212204 738.0 English 263.7 2012.0 6.6 24000
6 Sam Raimi 392.0 336.530303 Action|Adventure|Romance J.K. Simmons Spider-Man 3 383056 1902.0 English 258.0 2007.0 6.2 0
7 Nathan Greno 324.0 200.807262 Adventure|Animation|Comedy|Family|Fantasy|Musi... Brad Garrett Tangled 294810 387.0 English 260.0 2010.0 7.8 29000
8 Joss Whedon 635.0 458.991599 Action|Adventure|Sci-Fi Chris Hemsworth Avengers: Age of Ultron 462669 1117.0 English 250.0 2015.0 7.5 118000
9 David Yates 375.0 301.956980 Adventure|Family|Fantasy|Mystery Alan Rickman Harry Potter and the Half-Blood Prince 321795 973.0 English 250.0 2009.0 7.5 10000
10 Zack Snyder 673.0 330.249062 Action|Adventure|Sci-Fi Henry Cavill Batman v Superman: Dawn of Justice 371639 3018.0 English 250.0 2016.0 6.9 197000

Finding the movies with highest profit

1. Creating a new column called `profit` which contains the difference of the two columns: `gross` and `budget`.
2. Sorting the dataframe using the `profit` column as reference.
3. Plotting `profit` (y-axis) vs `budget` (x-axis) with a scatter plot and observing the outliers.
4. Extracting the ten most profitable movies in descending order and storing them in a new dataframe, `top10`.
In [72]:
# creating the profit column
movies['profit'] = movies['gross'] - movies['budget']
movies.head()
Out[72]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes profit
0 James Cameron 723.0 760.505847 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 3054.0 English 237.0 2009.0 7.9 33000 523.505847
1 Gore Verbinski 302.0 309.404152 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 1238.0 English 300.0 2007.0 7.1 0 9.404152
2 Sam Mendes 602.0 200.074175 Action|Adventure|Thriller Christoph Waltz Spectre 275868 994.0 English 245.0 2015.0 6.8 85000 -44.925825
3 Christopher Nolan 813.0 448.130642 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 2701.0 English 250.0 2012.0 8.5 164000 198.130642
5 Andrew Stanton 462.0 73.058679 Action|Adventure|Sci-Fi Daryl Sabara John Carter 212204 738.0 English 263.7 2012.0 6.6 24000 -190.641321
In [73]:
# sorting the dataframe
movies.sort_values(by='profit', inplace=True, ascending=False)
movies.head()
Out[73]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes profit
0 James Cameron 723.0 760.505847 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 3054.0 English 237.0 2009.0 7.9 33000 523.505847
29 Colin Trevorrow 644.0 652.177271 Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard Jurassic World 418214 1290.0 English 150.0 2015.0 7.0 150000 502.177271
26 James Cameron 315.0 658.672302 Drama|Romance Leonardo DiCaprio Titanic 793059 2528.0 English 200.0 1997.0 7.7 26000 458.672302
3024 George Lucas 282.0 460.935665 Action|Adventure|Fantasy|Sci-Fi Harrison Ford Star Wars: Episode IV - A New Hope 911097 1470.0 English 11.0 1977.0 8.7 33000 449.935665
3080 Steven Spielberg 215.0 434.949459 Family|Sci-Fi Henry Thomas E.T. the Extra-Terrestrial 281842 515.0 English 10.5 1982.0 7.9 34000 424.449459
In [74]:
# profit vs budget plot
import matplotlib.pyplot as plt 

movies.plot.scatter('budget', 'profit')
plt.title('Profit vs Budget')
plt.show()

From the above plot, we can see that a few movies made far larger losses than the rest. These are outliers.

In [75]:
# outliers: losses beyond 2000 million $ on budgets above 2000 million $
movies.loc[(movies['profit'] < -2000) & (movies['budget'] > 2000),'movie_title']
Out[75]:
2334             Steamboy 
2323    Princess Mononoke 
3005             Fateless 
3859       Lady Vengeance 
2988             The Host 
Name: movie_title, dtype: object

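The above movies show huge losses compared to any movie with average earnings. Note that all five are non-US productions, so their budgets may have been recorded in local currency rather than US dollars; the size of these losses should be read with that caveat in mind.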
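The cut-off of 2000 used above was read off the plot. As an alternative, a data-driven lower bound can be derived from the interquartile range (Tukey's rule). This is a swapped-in technique shown on made-up numbers, not the method used in this analysis; `toy`, `lower_fence` and `loss_outliers` are illustrative names.

```python
import pandas as pd

# Toy profits in million $ (made-up values) with one extreme loss.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E", "F"],
    "profit": [10.0, 12.0, 8.0, 15.0, 11.0, -2400.0],
})

# First and third quartiles of profit.
q1, q3 = toy["profit"].quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional Tukey fence; the 1.5 multiplier is a convention, not a law.
lower_fence = q1 - 1.5 * iqr

loss_outliers = toy.loc[toy["profit"] < lower_fence, "movie_title"].tolist()
print(loss_outliers)
```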

In [76]:
#top 10 movies by profit
top10 = movies.iloc[:10]
top10
Out[76]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes profit
0 James Cameron 723.0 760.505847 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 3054.0 English 237.0 2009.0 7.9 33000 523.505847
29 Colin Trevorrow 644.0 652.177271 Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard Jurassic World 418214 1290.0 English 150.0 2015.0 7.0 150000 502.177271
26 James Cameron 315.0 658.672302 Drama|Romance Leonardo DiCaprio Titanic 793059 2528.0 English 200.0 1997.0 7.7 26000 458.672302
3024 George Lucas 282.0 460.935665 Action|Adventure|Fantasy|Sci-Fi Harrison Ford Star Wars: Episode IV - A New Hope 911097 1470.0 English 11.0 1977.0 8.7 33000 449.935665
3080 Steven Spielberg 215.0 434.949459 Family|Sci-Fi Henry Thomas E.T. the Extra-Terrestrial 281842 515.0 English 10.5 1982.0 7.9 34000 424.449459
794 Joss Whedon 703.0 623.279547 Action|Adventure|Sci-Fi Chris Hemsworth The Avengers 995415 1722.0 English 220.0 2012.0 8.1 123000 403.279547
17 Joss Whedon 703.0 623.279547 Action|Adventure|Sci-Fi Chris Hemsworth The Avengers 995415 1722.0 English 220.0 2012.0 8.1 123000 403.279547
509 Roger Allers 186.0 422.783777 Adventure|Animation|Drama|Family|Musical Matthew Broderick The Lion King 644348 656.0 English 45.0 1994.0 8.5 17000 377.783777
240 George Lucas 320.0 474.544677 Action|Adventure|Fantasy|Sci-Fi Natalie Portman Star Wars: Episode I - The Phantom Menace 534658 3597.0 English 115.0 1999.0 6.5 13000 359.544677
66 Christopher Nolan 645.0 533.316061 Action|Crime|Drama|Thriller Christian Bale The Dark Knight 1676169 4667.0 English 185.0 2008.0 9.0 37000 348.316061

Notice that James Cameron has made two of the top 10 most profitable movies: 'Avatar' and 'Titanic'. Each of them earned more profit than the most profitable Steven Spielberg and Christopher Nolan movies.

Dropping duplicate values

Among the top 10 most profitable movies, you might have noticed a duplicate entry ('The Avengers' appears twice), so the dataframe contains duplicate rows as well. We drop the duplicates from the dataframe and repeat the previous task. Note that the same movie_title can legitimately appear in different languages.

In [77]:
# dropping duplicates
movies.drop_duplicates(inplace=True)
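`drop_duplicates()` with no arguments removes only rows that match in every column. Near-duplicates whose counters differ slightly survive it; the two 'Halloween' rows in the IMDb Top 250 output differ only in `num_voted_users`. One possible refinement, sketched below on a toy frame, is to key the comparison on title and language so that same-title films in different languages are still kept. The choice of key columns is an assumption, and 'Film X' is a made-up title.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_title": ["Halloween", "Halloween", "Film X", "Film X"],
    "language": ["English", "English", "Korean", "English"],
    "num_voted_users": [157857, 157863, 100, 200],
})

# Exact-row matching keeps both Halloween rows because the vote counts differ.
exact = toy.drop_duplicates()

# Keying on title + language merges them but keeps Film X in both languages.
by_key = toy.drop_duplicates(subset=["movie_title", "language"], keep="first")
print(len(exact), len(by_key))
```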
In [78]:
# repeating the previous task
movies_by_profit = movies.sort_values(by='profit', ascending=False)
top10 = movies_by_profit.iloc[:10]
top10
Out[78]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes profit
0 James Cameron 723.0 760.505847 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 3054.0 English 237.0 2009.0 7.9 33000 523.505847
29 Colin Trevorrow 644.0 652.177271 Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard Jurassic World 418214 1290.0 English 150.0 2015.0 7.0 150000 502.177271
26 James Cameron 315.0 658.672302 Drama|Romance Leonardo DiCaprio Titanic 793059 2528.0 English 200.0 1997.0 7.7 26000 458.672302
3024 George Lucas 282.0 460.935665 Action|Adventure|Fantasy|Sci-Fi Harrison Ford Star Wars: Episode IV - A New Hope 911097 1470.0 English 11.0 1977.0 8.7 33000 449.935665
3080 Steven Spielberg 215.0 434.949459 Family|Sci-Fi Henry Thomas E.T. the Extra-Terrestrial 281842 515.0 English 10.5 1982.0 7.9 34000 424.449459
794 Joss Whedon 703.0 623.279547 Action|Adventure|Sci-Fi Chris Hemsworth The Avengers 995415 1722.0 English 220.0 2012.0 8.1 123000 403.279547
509 Roger Allers 186.0 422.783777 Adventure|Animation|Drama|Family|Musical Matthew Broderick The Lion King 644348 656.0 English 45.0 1994.0 8.5 17000 377.783777
240 George Lucas 320.0 474.544677 Action|Adventure|Fantasy|Sci-Fi Natalie Portman Star Wars: Episode I - The Phantom Menace 534658 3597.0 English 115.0 1999.0 6.5 13000 359.544677
66 Christopher Nolan 645.0 533.316061 Action|Crime|Drama|Thriller Christian Bale The Dark Knight 1676169 4667.0 English 185.0 2008.0 9.0 37000 348.316061
439 Gary Ross 673.0 407.999255 Adventure|Drama|Sci-Fi|Thriller Jennifer Lawrence The Hunger Games 701607 1959.0 English 78.0 2012.0 7.3 140000 329.999255
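The sort-then-slice pattern used for `top10` can also be written with `DataFrame.nlargest`, which orders by a column and takes the first n rows in one call. A minimal sketch on made-up figures (the frame `toy` and the name `top2` are illustrative):

```python
import pandas as pd

# Toy frame in million $ (illustrative values).
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [760.5, 200.0, 658.7, 73.0],
    "budget": [237.0, 245.0, 200.0, 263.7],
})
toy["profit"] = toy["gross"] - toy["budget"]

# Equivalent to sort_values(by='profit', ascending=False).iloc[:2]
top2 = toy.nlargest(2, "profit")
print(top2[["movie_title", "profit"]])
```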

IMDb Top 250

1. Creating a new dataframe `IMDb_Top_250` that stores the 250 movies with the highest IMDb rating (the `imdb_score` column). Only movies with more than 25,000 votes are considered. We also add a `Rank` column containing the values 1 to 250, indicating the rank of each film.
2. Extracting the movies in `IMDb_Top_250` that are not in English and storing them in a new dataframe named `Top_Foreign_Lang_Film`.
In [79]:
# extracting the top 250 movies as per the IMDb score.

# selecting movies where the number of voted users is greater than 25000.

movies_by_imdb_score = movies[movies['num_voted_users'] > 25000].sort_values(by='imdb_score', ascending=False)

# selecting the first 250 movies; .copy() gives an independent frame so the Rank column can be added safely
IMDb_Top_250 = movies_by_imdb_score.iloc[:250].copy()

#adding a new column "Rank" which contains rank of the movie
IMDb_Top_250['Rank'] = np.arange(1,251)
IMDb_Top_250.tail()
Out[79]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes profit Rank
4640 Cristian Mungiu 233.0 1.185783 Drama Anamaria Marinca 4 Months, 3 Weeks and 2 Days 44763 172.0 Romanian 0.59 2007.0 7.9 14000 0.595783 246
2492 John Carpenter 318.0 47.000000 Horror|Thriller Jamie Lee Curtis Halloween 157857 1191.0 English 0.30 1978.0 7.9 12000 46.700000 247
4821 John Carpenter 318.0 47.000000 Horror|Thriller Jamie Lee Curtis Halloween 157863 1191.0 English 0.30 1978.0 7.9 12000 46.700000 248
639 Michael Mann 209.0 28.965197 Biography|Drama|Thriller Al Pacino The Insider 133526 521.0 English 68.00 1999.0 7.9 0 -39.034803 249
3029 David O. Russell 410.0 93.571803 Biography|Drama|Sport Christian Bale The Fighter 275869 389.0 English 25.00 2010.0 7.9 36000 68.571803 250
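The ranking steps can be gathered into one helper. This is a sketch under assumptions: the function name `top_n_by_score` and its parameters are hypothetical, and the toy data is made up. Two design choices are worth noting: taking a `.copy()` of the slice before adding a column avoids pandas' chained-assignment warning, and sizing the rank range from the result keeps the code working when fewer than n movies pass the vote filter.

```python
import numpy as np
import pandas as pd

def top_n_by_score(df, n=250, min_votes=25000):
    """Return the n highest-scored movies with more than min_votes votes, plus a Rank column."""
    eligible = df[df["num_voted_users"] > min_votes]
    # .copy() makes the slice an independent frame, so adding a column is safe.
    top = eligible.sort_values(by="imdb_score", ascending=False).iloc[:n].copy()
    # Size the ranks from the result so fewer than n qualifying rows do not break it.
    top["Rank"] = np.arange(1, len(top) + 1)
    return top

# Toy data (made-up values).
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "num_voted_users": [30000, 10000, 50000, 26000],
    "imdb_score": [8.0, 9.5, 9.0, 7.0],
})

top = top_n_by_score(toy, n=2)
print(top[["movie_title", "imdb_score", "Rank"]])
```

Movie B has the highest score in the toy data but is excluded because it falls below the vote threshold.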

Best Foreign Language Films

In [80]:
# Out of the top 250 movies, selecting movies which are not in 'English' language

Top_Foreign_Lang_Film = IMDb_Top_250[IMDb_Top_250['language'] != 'English'].sort_values(by='Rank')
Top_Foreign_Lang_Film
Out[80]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes profit Rank
4498 Sergio Leone 181.0 6.100000 Western Clint Eastwood The Good, the Bad and the Ugly 503509 780.0 Italian 1.200000 1966.0 8.9 20000 4.900000 7
4029 Fernando Meirelles 214.0 7.563397 Crime|Drama Alice Braga City of God 533200 749.0 Portuguese 3.300000 2002.0 8.7 28000 4.263397 15
4747 Akira Kurosawa 153.0 0.269061 Action|Adventure|Drama Takashi Shimura Seven Samurai 229012 596.0 Japanese 2.000000 1954.0 8.7 11000 -1.730939 17
2373 Hayao Miyazaki 246.0 10.049886 Adventure|Animation|Family|Fantasy Bunta Sugawara Spirited Away 417971 902.0 Japanese 19.000000 2001.0 8.6 28000 -8.950114 26
4921 Majid Majidi 46.0 0.925402 Drama|Family Bahare Seddiqi Children of Heaven 27882 130.0 Persian 0.180000 1997.0 8.5 0 0.745402 43
4259 Florian Henckel von Donnersmarck 215.0 11.284657 Drama|Thriller Sebastian Koch The Lives of Others 259379 407.0 German 2.000000 2006.0 8.5 39000 9.284657 46
1329 S.S. Rajamouli 44.0 6.498000 Action|Adventure|Drama|Fantasy|War Tamannaah Bhatia Baahubali: The Beginning 62756 410.0 Telugu 18.026148 2015.0 8.4 21000 -11.528148 47
4659 Asghar Farhadi 354.0 7.098492 Drama|Mystery Shahab Hosseini A Separation 151812 264.0 Persian 0.500000 2011.0 8.4 48000 6.598492 49
1298 Jean-Pierre Jeunet 242.0 33.201661 Comedy|Romance Mathieu Kassovitz Amélie 534262 1314.0 French 77.000000 2001.0 8.4 39000 -43.798339 52
4105 Chan-wook Park 305.0 2.181290 Drama|Mystery|Thriller Min-sik Choi Oldboy 356181 809.0 Korean 3.000000 2003.0 8.4 43000 -0.818710 57
2323 Hayao Miyazaki 174.0 2.298191 Adventure|Animation|Fantasy Minnie Driver Princess Mononoke 221552 570.0 Japanese 2400.000000 1997.0 8.4 11000 -2397.701809 58
2970 Wolfgang Petersen 96.0 11.433134 Adventure|Drama|Thriller|War Jürgen Prochnow Das Boot 168203 426.0 German 14.000000 1981.0 8.4 11000 -2.566866 60
2734 Fritz Lang 260.0 0.026435 Drama|Sci-Fi Brigitte Helm Metropolis 111841 413.0 German 6.000000 1927.0 8.3 12000 -5.973565 68
4033 Thomas Vinterberg 349.0 0.610968 Drama Thomas Bo Larsen The Hunt 170155 249.0 Danish 3.800000 2012.0 8.3 60000 -3.189032 70
2829 Oliver Hirschbiegel 192.0 5.501940 Biography|Drama|History|War Thomas Kretschmann Downfall 248354 564.0 German 13.500000 2004.0 8.3 14000 -7.998060 74
3550 Denis Villeneuve 226.0 6.857096 Drama|Mystery|War Lubna Azabal Incendies 80429 156.0 French 6.800000 2010.0 8.2 37000 0.057096 88
4000 Juan José Campanella 262.0 20.167424 Drama|Mystery|Thriller Ricardo Darín The Secret in Their Eyes 131831 231.0 Spanish 2.000000 2009.0 8.2 33000 18.167424 96
2551 Guillermo del Toro 406.0 37.623143 Drama|Fantasy|War Ivana Baquero Pan's Labyrinth 467234 1083.0 Spanish 13.500000 2006.0 8.2 27000 24.123143 104
2047 Hayao Miyazaki 212.0 4.710455 Adventure|Animation|Family|Fantasy Christian Bale Howl's Moving Castle 214091 330.0 Japanese 24.000000 2004.0 8.2 13000 -19.289545 107
3553 José Padilha 142.0 0.008060 Action|Crime|Drama|Thriller Wagner Moura Elite Squad 81644 107.0 Portuguese 4.000000 2007.0 8.1 11000 -3.991940 109
3423 Katsuhiro Ôtomo 150.0 0.439162 Action|Animation|Sci-Fi Mitsuo Iwata Akira 106160 430.0 Japanese 1100.000000 1988.0 8.1 0 -1099.560838 112
2914 Je-kyu Kang 86.0 1.110186 Action|Drama|War Min-sik Choi Tae Guk Gi: The Brotherhood of War 31943 224.0 Korean 12.800000 2004.0 8.1 0 -11.689814 123
4461 Thomas Vinterberg 98.0 1.647780 Drama Ulrich Thomsen The Celebration 65951 258.0 Danish 1.300000 1998.0 8.1 5000 0.347780 125
4267 Alejandro G. Iñárritu 157.0 5.383834 Drama|Thriller Adriana Barraza Amores Perros 173551 361.0 Spanish 2.000000 2000.0 8.1 11000 3.383834 147
2830 Alejandro Amenábar 157.0 2.086345 Biography|Drama|Romance Belén Rueda The Sea Inside 64556 140.0 Spanish 10.000000 2004.0 8.1 0 -7.913655 152
4284 Ari Folman 231.0 2.283276 Animation|Biography|Documentary|Drama|History|War Ari Folman Waltz with Bashir 46107 156.0 Hebrew 1.500000 2008.0 8.0 0 0.783276 157
3456 Vincent Paronnaud 242.0 4.443403 Animation|Biography|Drama|War Catherine Deneuve Persepolis 70194 158.0 French 7.300000 2007.0 8.0 14000 -2.856597 163
3344 Karan Johar 210.0 4.018695 Adventure|Drama|Thriller Shah Rukh Khan My Name Is Khan 69759 235.0 Hindi 12.000000 2010.0 8.0 27000 -7.981305 166
4897 Sergio Leone 122.0 3.500000 Action|Drama|Western Clint Eastwood A Fistful of Dollars 147566 235.0 Italian 0.200000 1964.0 8.0 0 3.300000 182
4144 Walter Salles 71.0 5.595428 Drama Fernanda Montenegro Central Station 28951 257.0 Portuguese 2.900000 1998.0 8.0 0 2.695428 200
3264 Michael Haneke 447.0 0.225377 Drama|Romance Isabelle Huppert Amour 70382 190.0 French 8.900000 2012.0 7.9 33000 -8.674623 206
2863 Clint Eastwood 251.0 13.753931 Drama|History|War Yuki Matsuzaki Letters from Iwo Jima 132149 316.0 Japanese 19.000000 2006.0 7.9 5000 -5.246069 208
2605 Ang Lee 287.0 128.067808 Action|Drama|Romance Chen Chang Crouching Tiger, Hidden Dragon 217740 1641.0 Mandarin 15.000000 2000.0 7.9 0 113.067808 211
3510 Yash Chopra 29.0 2.921738 Drama|Musical|Romance Shah Rukh Khan Veer-Zaara 34449 119.0 Hindi 7.000000 2004.0 7.9 2000 -4.078262 227
4415 Fabián Bielinsky 94.0 1.221261 Crime|Drama|Thriller Ricardo Darín Nine Queens 38215 125.0 Spanish 1.500000 2000.0 7.9 0 -0.278739 238
3677 Christophe Barratier 112.0 3.629758 Drama|Music Jean-Baptiste Maunier The Chorus 44151 110.0 French 5.500000 2004.0 7.9 0 -1.870242 241
2493 Yimou Zhang 283.0 0.084961 Action|Adventure|History Jet Li Hero 149414 841.0 Mandarin 31.000000 2002.0 7.9 0 -30.915039 242
4640 Cristian Mungiu 233.0 1.185783 Drama Anamaria Marinca 4 Months, 3 Weeks and 2 Days 44763 172.0 Romanian 0.590000 2007.0 7.9 14000 0.595783 246

Well, our very own Veer-Zaara has made the list. So has Bahubali!

Finding the best directors

In [81]:
#extracting the top 10 directors

#Grouping 'movies' dataframe by 'director_name'. 
# Calculating mean of 'imdb_score' for each 'director_name'
mean_imdb_score = movies.groupby('director_name')['imdb_score'].mean()

#Creating a new dataframe directors_imdb_score
# mean imdb score is rounded to 1 decimal because that is the precision of the scores in the provided dataset
directors_mean_imdb_score = pd.DataFrame(np.round((mean_imdb_score),1))

# creating a new index and converting director_name into a column
directors_mean_imdb_score = directors_mean_imdb_score.reset_index()

# sorting in descending order of mean imdb scores
directors_mean_imdb_score.sort_values(by=['imdb_score','director_name'], ascending=[False,True],inplace=True)

# top 10 directors by mean imdb scores
top10director = directors_mean_imdb_score.iloc[:10]

top10director
Out[81]:
director_name imdb_score
216 Charles Chaplin 8.6
1675 Tony Kaye 8.6
45 Alfred Hitchcock 8.5
302 Damien Chazelle 8.5
1017 Majid Majidi 8.5
1440 Ron Fricke 8.5
103 Asghar Farhadi 8.4
260 Christopher Nolan 8.4
1035 Marius A. Markevicius 8.4
1371 Richard Marquand 8.4

No surprise that Damien Chazelle (director of Whiplash and La La Land) is on this list.
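The ranking above is a group-by, a mean, a rounding step and a two-key sort. As a minimal sketch on toy data (the director names and scores below are made up for illustration and are not from the IMDB dataset), the same steps can be chained, with a count of movies per director added as an assumption of mine to show how many films each average rests on:

```python
import pandas as pd

# Toy data: hypothetical directors and scores, NOT from the IMDB dataset
toy = pd.DataFrame({
    'director_name': ['Bo', 'Bo', 'Al', 'Cy', 'Cy'],
    'imdb_score':    [8.0, 9.0, 8.5, 7.0, 7.4],
})

top_directors = (
    toy.groupby('director_name')['imdb_score']
       # mean score plus the number of movies behind each mean
       .agg(mean_score='mean', n_movies='count')
       # round only the mean, to the 1-decimal precision of the scores
       .round({'mean_score': 1})
       .reset_index()
       # highest mean first; ties are broken alphabetically by name
       .sort_values(by=['mean_score', 'director_name'], ascending=[False, True])
)

print(top_directors)
```

Here 'Al' and 'Bo' tie on the mean, so the alphabetical tie-break decides the order, which is also why Charles Chaplin appears before Tony Kaye in the output above.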

In [82]:
# splitting genre into genre_1 and genre_2
movies['genre_1'] = movies['genres'].apply(lambda x : x.split('|')[0])

def genre_2(x) : 
    split = x.split('|')
    if len(split) > 1 : 
        return split[1]
    else : 
        return split[0]
    
movies['genre_2'] = movies['genres'].apply(genre_2)
movies[movies['genre_1'] == movies['genre_2']]
Out[82]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes profit genre_1 genre_2
1397 Todd Phillips 334.0 277.313371 Comedy Bradley Cooper The Hangover 583341 626.0 English 35.0 2009.0 7.8 24000 242.313371 Comedy Comedy
2916 William Friedkin 304.0 204.565000 Horror Ellen Burstyn The Exorcist 284252 1058.0 English 8.0 1973.0 8.0 18000 196.565000 Horror Horror
440 Todd Phillips 383.0 254.455986 Comedy Bradley Cooper The Hangover Part II 375879 402.0 English 80.0 2011.0 6.5 56000 174.455986 Comedy Comedy
1868 Barry Levinson 100.0 172.825435 Drama Tom Cruise Rain Man 383784 331.0 English 25.0 1988.0 8.0 12000 147.825435 Drama Drama
1875 Tate Taylor 373.0 169.705587 Drama Emma Stone The Help 318955 460.0 English 25.0 2011.0 8.1 75000 144.705587 Drama Drama
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
934 Adam McKay 272.0 2.175312 Comedy Harrison Ford Anchorman 2: The Legend Continues 131227 346.0 English 50.0 2013.0 6.3 41000 -47.824688 Comedy Comedy
4397 Romesh Sharma 4.0 0.129319 Romance Annabelle Wallis Dil Jo Bhi Kahey... 257 4.0 English 70.0 2005.0 5.1 9 -69.870681 Romance Romance
1782 Jacques Perrin 100.0 10.762178 Documentary Jacques Perrin Winged Migration 10369 153.0 English 160.0 2001.0 8.0 1000 -149.237822 Documentary Documentary
2740 Tony Jaa 110.0 0.102055 Action Nirut Sirichanya Ong-bak 2 24570 72.0 Thai 300.0 2008.0 6.2 0 -299.897945 Action Action
3075 Karan Johar 20.0 3.275443 Drama Shah Rukh Khan Kabhi Alvida Naa Kehna 13998 264.0 Hindi 700.0 2006.0 6.0 659 -696.724557 Drama Drama

382 rows × 16 columns

In [83]:
# grouping movies by genre_1 and genre_2 and taking the mean gross of each segment
movies_by_segment = movies.groupby(['genre_1','genre_2'])['gross'].mean()
# sorting segments in descending order of mean gross
movies_by_segment = movies_by_segment.sort_values(ascending=False)
movies_by_segment.head()
Out[83]:
genre_1    genre_2  
Family     Sci-Fi       434.949459
Adventure  Sci-Fi       228.627758
           Family       118.919540
           Animation    116.998550
Action     Adventure    109.595465
Name: gross, dtype: float64

Looks like the most popular genre combination, going by mean gross, is Family + Sci-Fi. Its mean gross is nearly twice that of the next category: Adventure + Sci-Fi.
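The genre split and segment ranking can also be sketched with pandas string methods instead of `apply`. This is only an illustrative alternative on toy data (the genre strings and gross values below are invented), using the same fallback rule as `genre_2`: a movie with a single genre gets that genre in both columns.

```python
import pandas as pd

# Toy data: invented genre strings and gross values, NOT from the IMDB dataset
toy = pd.DataFrame({
    'genres': ['Family|Sci-Fi', 'Adventure|Sci-Fi|Action', 'Comedy', 'Adventure|Sci-Fi'],
    'gross':  [400.0, 200.0, 50.0, 100.0],
})

# Split each genre string into a list of genres
parts = toy['genres'].str.split('|')

# First genre is always present
toy['genre_1'] = parts.str[0]

# Second genre is NaN when only one genre is listed; fall back to genre_1
toy['genre_2'] = parts.str[1].fillna(toy['genre_1'])

# Mean gross per (genre_1, genre_2) segment, highest first
mean_gross = (
    toy.groupby(['genre_1', 'genre_2'])['gross']
       .mean()
       .sort_values(ascending=False)
)

print(mean_gross)
```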

Critic favorite and Audience favorite actors

In [84]:
Meryl_Streep = movies[movies['actor_1_name'] == 'Meryl Streep']  # Include all movies in which Meryl Streep is the lead
In [85]:
Leo_Caprio = movies[movies['actor_1_name'] == 'Leonardo DiCaprio']  # Include all movies in which Leonardo DiCaprio is the lead
In [86]:
Brad_Pitt = movies[movies['actor_1_name'] == 'Brad Pitt']  # Include all movies in which Brad Pitt is the lead
In [87]:
# combining the three dataframes
Combined = pd.concat([Meryl_Streep,Leo_Caprio,Brad_Pitt])
Combined.head()
Out[87]:
director_name num_critic_for_reviews gross genres actor_1_name movie_title num_voted_users num_user_for_reviews language budget title_year imdb_score movie_facebook_likes profit genre_1 genre_2
1408 David Frankel 208.0 124.732962 Comedy|Drama|Romance Meryl Streep The Devil Wears Prada 286178 631.0 English 35.0 2006.0 6.8 0 89.732962 Comedy Drama
1575 Sydney Pollack 66.0 87.100000 Biography|Drama|Romance Meryl Streep Out of Africa 52339 200.0 English 31.0 1985.0 7.2 0 56.100000 Biography Drama
1204 Nora Ephron 252.0 94.125426 Biography|Drama|Romance Meryl Streep Julie & Julia 79264 277.0 English 40.0 2009.0 7.0 13000 54.125426 Biography Drama
1618 David Frankel 234.0 63.536011 Comedy|Drama|Romance Meryl Streep Hope Springs 34258 178.0 English 30.0 2012.0 6.3 0 33.536011 Comedy Drama
410 Nancy Meyers 187.0 112.703470 Comedy|Drama|Romance Meryl Streep It's Complicated 69860 214.0 English 85.0 2009.0 6.6 0 27.703470 Comedy Drama
In [88]:
# grouping the combined dataframe
reviews_by_actor = Combined.groupby('actor_1_name')
In [89]:
# Finding the mean of critic reviews and audience reviews

# actors vs mean critic reviews in descending order
critic_reviews = reviews_by_actor['num_critic_for_reviews'].mean().sort_values(ascending=False)

# actors vs mean user reviews in descending order
user_reviews = reviews_by_actor['num_user_for_reviews'].mean().sort_values(ascending=False)
print(critic_reviews,'\n\n',user_reviews,'\n\n')

print('Actor with highest mean critic reviews : ',critic_reviews.index.to_list()[0] )
print('Actor with highest mean user reviews : ',user_reviews.index.to_list()[0] )
actor_1_name
Leonardo DiCaprio    330.190476
Brad Pitt            245.000000
Meryl Streep         181.454545
Name: num_critic_for_reviews, dtype: float64 

 actor_1_name
Leonardo DiCaprio    914.476190
Brad Pitt            742.352941
Meryl Streep         297.181818
Name: num_user_for_reviews, dtype: float64 


Actor with highest mean critic reviews :  Leonardo DiCaprio
Actor with highest mean user reviews :  Leonardo DiCaprio

Leonardo has aced both lists. Going by the mean number of reviews, he is both the most liked and the most critically acclaimed actor of the three!
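The filter, concatenate and group steps above can be condensed into a single selection with `isin` followed by one group-by over both review columns. The sketch below uses toy data (actor labels and review counts are invented for illustration); only the column names are taken from the dataset.

```python
import pandas as pd

# Toy data: invented actors and review counts, NOT from the IMDB dataset
toy = pd.DataFrame({
    'actor_1_name':           ['X', 'X', 'Y', 'Z', 'Y'],
    'num_critic_for_reviews': [100.0, 300.0, 150.0, 500.0, 350.0],
    'num_user_for_reviews':   [900.0, 1100.0, 300.0, 50.0, 500.0],
})

# Keep only the actors being compared (replaces three filters + concat)
actors = ['X', 'Y']
subset = toy[toy['actor_1_name'].isin(actors)]

# Mean critic and user review counts per actor in one table
summary = subset.groupby('actor_1_name')[
    ['num_critic_for_reviews', 'num_user_for_reviews']
].mean()

# Actor with the highest mean in each column
top_critic = summary['num_critic_for_reviews'].idxmax()
top_user = summary['num_user_for_reviews'].idxmax()

print(summary)
print('Highest mean critic reviews:', top_critic)
print('Highest mean user reviews:', top_user)
```

Keeping both means in one table makes it easy to see when the two rankings disagree, as they do in this toy example.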

In [90]:
# calculating decade

# Function to convert years into decades
decade_fn = np.vectorize(lambda x : int((x//10)*10 ))

# new column decade added to movies dataframe
movies['decade'] = decade_fn(movies['title_year'])

# number of voters grouped by decade
votes_grouped_by_decade = movies.groupby('decade')['num_voted_users'].sum()
In [91]:
# Creating the data frame df_by_decade from the grouped votes
df_by_decade = pd.DataFrame(votes_grouped_by_decade)
df_by_decade.reset_index(inplace=True)
df_by_decade
Out[91]:
decade num_voted_users
0 1920 116392
1 1930 804839
2 1940 230838
3 1950 678336
4 1960 2983442
5 1970 8524102
6 1980 19987476
7 1990 69735679
8 2000 170908676
9 2010 120640994

Movie Votes by Decade

In [92]:
# Plotting number of voted users vs decade
sns.barplot(x='decade',y='num_voted_users',data = df_by_decade)
plt.xlabel('Decade')
plt.ylabel('No of votes')
plt.title('User Votes vs Movie Release Decade')
plt.show()

This plot shows the number of votes cast for movies of a certain decade. You can notice that there aren't many votes for older movies. This could be because most people are just familiar with recent movies. But it's interesting to notice that movies made during 2000-2009 have more votes than those made in or after 2010. Keep in mind that the 2010s bar only covers movies released up to 2016, so that decade is incomplete.
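The decade bucketing used for this plot can also be written without `np.vectorize`, using integer division directly on the column. This is a minimal sketch on toy data (the years and vote counts are invented), and it assumes `title_year` has no missing values, since `astype(int)` would fail on NaN.

```python
import pandas as pd

# Toy data: invented years and vote counts, NOT from the IMDB dataset
toy = pd.DataFrame({
    'title_year':      [1927.0, 1999.0, 2000.0, 2009.0, 2016.0],
    'num_voted_users': [10, 20, 30, 40, 50],
})

# Floor each year to the start of its decade, e.g. 2009 -> 2000
# (assumes no missing years; astype(int) raises on NaN)
toy['decade'] = (toy['title_year'] // 10 * 10).astype(int)

# Total votes per decade, with decade kept as a regular column
votes = toy.groupby('decade', as_index=False)['num_voted_users'].sum()

print(votes)
```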

Conclusion

  • We imported movie data from IMDB and inspected the attributes.
  • We cleaned the data of missing values and unnecessary columns.
  • We found the five worst performing movies to be
    • The Host
    • Lady Vengeance
    • Fateless
    • Princess Mononoke
    • Steamboy
  • We also found the movies which made the most profit. James Cameron's Avatar is the most profitable movie so far. In fact, his Titanic is in the top 3. Christopher Nolan's The Dark Knight is the 10th most profitable. One can easily notice that the most profitable movies are predominantly Sci-Fi.
  • We ranked the top 250 movies by IMDB rating.
  • Among the best foreign language films, we see 'Bahubali' and 'Veer-Zaara', which are quite popular in India.
  • Charles Chaplin tops the list of directors by mean IMDB score, tied with Tony Kaye at 8.6. Other noteworthy mentions are Damien Chazelle, the director of Whiplash and La La Land, Christopher Nolan and Alfred Hitchcock.
  • The most popular genre combinations, by mean gross, are Family + Sci-Fi and Adventure + Sci-Fi. Looks like a lot of people are still part of the 'Back to the Future' clan.
  • We compared the most liked contemporary actors: Meryl Streep, Leonardo DiCaprio and Brad Pitt. It turns out that, among them, Leonardo DiCaprio is both critically acclaimed and an audience favourite.
  • At the end, we compared the popularity of movies by the decade in which they were released.

Finally, this analysis has been an endeavour to apply the Python skills I acquired to draw insights from data.
