What Makes a Movie?

Authors: Madison Kohls, Bonnie Liu, Polina Pranovich, Danielle Goldwirth, Isha Shah

In today’s world, movies are all around us, whether we watch them on popular streaming services such as Netflix or Hulu, or in traditional movie theaters with big boxes of popcorn. There is a huge assortment of movies that have rapidly evolved over the past century, varying by different genres, budgets, languages, runtimes, and audiences. But there are also status quos that films have abided by for better or for worse. To explore these commonalities and differences and see how movies have changed throughout the years, we decided to analyze a Kaggle dataset containing information on over 45,000 movies released between the years 1906 and 2017. With this exploratory data analysis, we hope to better understand the nuances and traditions established in the movie industry, as well as critically examine areas of the industry that still seem outdated.

Better Understanding our Data: Movie Language/Location

Movies come from all over the world. To better understand the geographic and language distribution of our dataset, we visualized where our movies came from and what languages they were filmed in.

Looking at the visuals, we notice that 45% of the movies in our dataset were from the United States. However, there was also a large assortment of movies from Europe, Asia, and parts of South America and Africa.

We also see that 99% of movies filmed in the US were filmed in English. While English is the most common language in the United States, only 78.1% of US citizens consider English to be their main language, meaning roughly 71 million people may be underserved by the limited language diversity in the movie industry. Most surprising was that 53.2% of movies from countries other than the US were filmed in English, despite only 20% of the world (including the US) actually speaking English. These findings likely reflect how dominant the English language is. However, it’s important to note that a more diverse range of languages can make films accessible to larger audiences.

Movie Evolution

The film industry has evolved significantly over the past century. With developments in technology and shifts in viewers’ preferences, movies have changed in many respects: budgets, revenues, languages, and even runtimes. The treemap below gives us a sense of how the number of movies released has greatly increased over the years. The top number represents the year in which movies were released, and the bottom number shows the number of movies released in that year.

The 2000s and the 2010s had the greatest number of movies released; for example, there were 1,912 movies in 2014 and 1,851 movies in 2015. The earliest years in the dataset had the fewest: 1906 and 1911 each had only 1 movie. This drastic change is likely due to the increase in resources that have become available for the production of movies.
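As a rough illustration of how per-year counts like these can be tallied, here is a minimal sketch using only the Python standard library. The rows and the name `movies_per_year` are hypothetical placeholders, not values or code from our dataset or repository.

```python
from collections import Counter

# Hypothetical sample rows of (title, release_year); not real dataset entries.
movies = [
    ("Film A", 1906),
    ("Film B", 2014),
    ("Film C", 2014),
    ("Film D", 2015),
]


def movies_per_year(rows):
    """Count how many movies were released in each year."""
    return Counter(year for _title, year in rows)


counts = movies_per_year(movies)

# most_common(1) returns a list holding the single (year, count) pair
# with the largest count.
busiest_year, busiest_count = counts.most_common(1)[0]
print(busiest_year, busiest_count)
```

A treemap could then be drawn from the `counts` mapping, with each tile sized by its count.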

With the improvements in technology and production resources, we expected the ratings of movies to increase as well. However, as you can see in the graph below, the movie ratings have remained quite similar throughout the years, mostly fluctuating between an average of 3 to 4. The movie ratings were obtained from GroupLens, an organization that does research on various topics, including movies. Between the years 1910 and 1920, some years did have an average rating of below 3, but this did not happen often.

Looking at average runtimes over the years, it seems as though there has been a slight increase. Between the years 1906 and 1960, there was an apparent increase in movie runtimes, but from 1960 onwards, the runtimes have remained steady, ranging from about 100 minutes to 110. In 2017, the average runtime was 100 minutes, while in 1906 the average runtime was 60 minutes.
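Yearly averages like the ones plotted for ratings and runtimes can be computed by grouping values by release year. The sketch below is one way to do it with the standard library; the sample rows and the helper name `average_by_year` are assumptions made for illustration, and missing values are simply skipped.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (release_year, runtime_minutes) rows; None marks a missing runtime.
rows = [
    (1906, 60),
    (1960, 100),
    (1960, 110),
    (2017, 95),
    (2017, 105),
    (2017, None),
]


def average_by_year(rows):
    """Return {year: mean value}, ignoring rows whose value is missing."""
    grouped = defaultdict(list)
    for year, value in rows:
        if value is not None:
            grouped[year].append(value)
    # Sort by year so the result is ready to plot as a time series.
    return {year: mean(values) for year, values in sorted(grouped.items())}


average_runtime = average_by_year(rows)
print(average_runtime)
```

The same helper would work for ratings by swapping the second column.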

As different countries have learned more about movie production throughout the years, many movies have been released in various languages.

Each dash in this graph indicates that at least one movie was released in that language in that year. For example, we can see that a movie in English was released in 1906, followed by another one in 1912. The languages that appear in the earliest years are English, Italian, French, German and Russian. From 1954 onwards, movies have been produced every year in these five languages! Movies in Inuktitut, Maltese and Punjabi only appear very recently, in 2014. It is important to note that this information is specific to the dataset we are working with. For example, one of the first Punjabi movies was actually released in 1928, but this movie was not included in our dataset.


The financial aspects of movie production have also significantly changed. In the most recent year within the dataset (2017), the average budget for movies was over 38.5 million dollars, while the average revenue was over 153.5 million dollars. In 1939, the average budget was only around 2.7 million. In the graph illustrating how movie revenues have changed over time, we can see a lot of fluctuation in the earlier years, but these values are still significantly lower than they are today; in 1973, for example, the average revenue was only 55.7 million. Budgets are likely much higher now because it has become more expensive to produce movies, taking into account costs for production equipment, cast and crew salaries, and editing software. Revenues have likely increased as well because production quality has improved and people have more ways to watch movies.

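The average revenue-to-budget ratio from our dataset was 1391.73. Taken at face value, this would mean that for every dollar spent on a movie’s budget, $1391.73 is made in revenue. However, because this figure is an average of per-movie ratios, it is very sensitive to outliers: a few movies with very small reported budgets can inflate it dramatically, so it should not be read as the typical return on a movie.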
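To show why an average of ratios can be so large, here is a small sketch comparing three summaries of revenue-to-budget: the mean of per-movie ratios, the median of per-movie ratios, and total revenue divided by total budget. The numbers are invented for illustration and are not from our dataset; the variable names are likewise our own assumptions.

```python
from statistics import mean, median

# Hypothetical (budget, revenue) pairs in dollars; the last row mimics a
# movie with a tiny reported budget.
films = [
    (100_000_000, 250_000_000),  # ratio 2.5
    (50_000_000, 100_000_000),   # ratio 2.0
    (20_000_000, 30_000_000),    # ratio 1.5
    (1_000, 4_000_000),          # ratio 4000.0
]

# Per-movie ratios, skipping zero budgets to avoid division by zero.
ratios = [revenue / budget for budget, revenue in films if budget > 0]

mean_ratio = mean(ratios)      # pulled far upward by the single outlier
median_ratio = median(ratios)  # much less affected by the outlier

# Ratio of totals: every dollar of budget carries equal weight.
aggregate_ratio = sum(r for _b, r in films) / sum(b for b, _r in films)

print(mean_ratio, median_ratio, aggregate_ratio)
```

In this toy example one outlier pushes the mean ratio above 1000 while the median and the ratio of totals stay between 2 and 3, which is why the median or the aggregate ratio may be a more representative summary.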

Now let’s examine the above graph in a little bit more detail. Notice that the movie that made the most revenue (Avatar) didn’t have the highest budget. Avatar only had a budget of $237 million and raked in about $2.788 billion in revenue. Compare that with Pirates of the Caribbean: On Stranger Tides, which had the largest budget of all the movies in our dataset. Despite having a whopping budget of $380 million, it fell below the linear regression line, making less than $1.046 billion in revenue.
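Whether a movie falls above or below the regression line can be checked by fitting an ordinary least-squares line and looking at the sign of each residual (actual revenue minus predicted revenue). The sketch below assumes a simple one-variable fit; the data points are hypothetical values in arbitrary units, and `fit_line` is an illustrative helper rather than the code used for our graph.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical budgets and revenues (arbitrary units), not real movie data.
budgets = [1, 2, 3, 4]
revenues = [2, 4, 6, 4]

slope, intercept = fit_line(budgets, revenues)

# Residual = actual - predicted. Negative means the movie sits below the
# line, i.e. it earned less than the fit predicts for its budget.
residuals = [y - (slope * x + intercept) for x, y in zip(budgets, revenues)]

for x, res in zip(budgets, residuals):
    position = "above" if res > 0 else "below"
    print(f"budget={x}: {position} the line (residual {res:+.2f})")
```

Here the movie with the largest budget has a negative residual, which mirrors the situation described for the highest-budget movie in the graph.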

Based on the scattered points in the graph above, there doesn’t seem to be much correlation between movie budget and rating. To double check, we computed the covariance between movie budget and rating in Python: -290285.92. While this may seem like a big number, covariance is expressed in the units of both variables (here, dollars times rating points), so it is small relative to budget values that run into the hundreds of millions of dollars. A unit-free measure such as the Pearson correlation coefficient would make the comparison more direct. On this evidence, a movie’s budget appears to have very little relationship with its rating.
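Because covariance depends on the units of the data, its raw size is hard to interpret on its own. The following sketch, written with hypothetical budgets and ratings, shows that rescaling budgets from dollars to millions of dollars changes the covariance by a factor of one million while leaving the Pearson correlation coefficient unchanged. The helper names are our own and the data are invented.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: budgets in dollars and average viewer ratings.
budgets_dollars = [1_000_000, 5_000_000, 20_000_000, 100_000_000]
ratings = [3.5, 3.0, 3.6, 3.2]

# The same budgets expressed in millions of dollars.
budgets_millions = [b / 1_000_000 for b in budgets_dollars]

cov_dollars = covariance(budgets_dollars, ratings)
cov_millions = covariance(budgets_millions, ratings)
r_dollars = pearson_r(budgets_dollars, ratings)
r_millions = pearson_r(budgets_millions, ratings)

print("covariance (dollars): ", cov_dollars)
print("covariance (millions):", cov_millions)
print("Pearson r (either unit):", r_dollars, r_millions)
```

The correlation coefficient always lies between -1 and 1, so a value near 0 can be read as a weak linear relationship regardless of how the budget is measured.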


Now that we’ve examined the numbers behind a movie, let’s take a closer look at the elements that actually make up a movie.

This visual shows the genre distribution of the 45,000+ movies in our dataset, with the 20 most prominent genres listed below. The genre with the most movies by far was drama, with thriller, comedy, romance, and action movies also fairly popular. This makes sense, as most of the staple movies that come to mind when we think of our favorites fall under one or more of these genres. The fact that most movies had multiple genres could also be a big factor in why drama was such a prominent genre; many movies fall under the “drama” genre regardless of the other genres they could be categorized under (e.g. romance, action).

We also wanted to gain more clarity on the popularity of each genre relative to the others. To do so, we averaged the viewer ratings (which range from 1–5) for all movies under each genre, and then plotted the average rating of each genre on this bar graph. Although the drama genre was the most popular in terms of movies made, there doesn’t seem to be any correlation between the number of movies made in a genre and its viewer rating; the average rating received by each genre was between 3 and 3.5, with the range in average rating only being 0.22. Animation had the lowest average rating and Foreign had the highest average rating, but the overall differences in averages were very slight.

By visualizing the top movie plot keywords in various genres, we can get a better understanding of what themes and ideas drive storylines for certain genres. What’s interesting to note is that while these genres have the usual keywords we would expect (“martial arts” for action and “love” for romance), we also see some seemingly unrelated keywords such as “death” in romance and “female nudity” in horror. However, this makes sense: movies usually don’t have one-note storylines. They have diverse and wide-ranging themes that can appeal to larger audiences.
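Since a movie can belong to several genres, its rating counts toward the average of every genre it is tagged with. A minimal sketch of that calculation follows; the movie records and the function name `genre_averages` are hypothetical and only illustrate the approach.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical movie records; ratings are on the 1-5 scale.
movies = [
    {"title": "Film A", "genres": ["Drama", "Romance"], "rating": 4.0},
    {"title": "Film B", "genres": ["Drama"], "rating": 3.0},
    {"title": "Film C", "genres": ["Action", "Drama"], "rating": 3.5},
    {"title": "Film D", "genres": ["Action"], "rating": 3.0},
]


def genre_averages(movies):
    """Average rating per genre; multi-genre movies count once per genre."""
    ratings_by_genre = defaultdict(list)
    for movie in movies:
        for genre in movie["genres"]:
            ratings_by_genre[genre].append(movie["rating"])
    return {genre: mean(values) for genre, values in ratings_by_genre.items()}


averages = genre_averages(movies)

# Spread between the best- and worst-rated genres.
rating_range = max(averages.values()) - min(averages.values())

print(averages)
print("range:", rating_range)
```

One consequence of this design is that genres overlap heavily, so their averages are not independent of each other, which may partly explain why they end up so close together.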

Looking at the most popular keywords over all the movies, words such as “woman director” and “independent film” are observed the most. This brings up the question: are there actually a substantial number of female directors and independent films in the movie industry, or are these just popular “buzzwords” to attract viewers?


Diving deeper into the movie casts and crews reveals that there is an imbalance in gender both on screen and behind-the-scenes.

In the movie casts, the ratio of male to female is about 2:1. In movie crews, this ratio is even greater, at about 5:1. It is clear that there is a greater proportion of males than females in movie casts and crews, but it is important to note that the entries with unknown gender in both charts may include male- and female-identifying individuals as well as members of the LGBTQ+ community, so the true gender proportions may differ from what is seen in the pie charts. Despite this possible discrepancy, empirical evidence from the movie industry today and this analysis of the gender imbalance in movie casts and crews both emphasize the need to encourage gender diversity in the movie industry.
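One way to gauge how much the unknown entries could shift these ratios is to compute the male-to-female ratio under two extreme assumptions: every unknown entry is female, or every unknown entry is male. The sketch below uses invented counts and a hypothetical `ratio_bounds` helper; it assumes there is at least one female entry and, for simplicity, treats unknowns as either male or female only.

```python
from collections import Counter

# Hypothetical gender labels for cast credits; not real dataset values.
credits = ["male"] * 6 + ["female"] * 3 + ["unknown"] * 3
counts = Counter(credits)


def ratio_bounds(counts):
    """Return (lowest, observed, highest) male-to-female ratios.

    lowest assumes all unknown entries are female,
    highest assumes all unknown entries are male.
    """
    males = counts["male"]
    females = counts["female"]
    unknown = counts["unknown"]
    observed = males / females
    lowest = males / (females + unknown)
    highest = (males + unknown) / females
    return lowest, observed, highest


lowest, observed, highest = ratio_bounds(counts)
print(f"observed {observed:.1f}:1, between {lowest:.1f}:1 and {highest:.1f}:1")
```

The wider the gap between the two bounds, the less the observed ratio can be trusted on its own.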


Through our data analysis, we found that 45% of the movies in our dataset are from the U.S., and that English dominates: 99% of US movies and just over half of movies from other countries were filmed in English. Since the early 1900s, the movie industry has really taken off, and over time, more and more money has been poured into making movies as revenues have increased as well. We also found that in general, there is a positive correlation between budget and revenue but little to no correlation between budget and rating. In terms of genre, we found that drama and comedy were two of the most popular categories among the movies in our dataset, but there was only a very slight difference in the average ratings of movies across different genres. Additionally, we noticed that while movies contained keywords that suited their genres, they also sometimes contained seemingly unrelated keywords, suggesting dynamic storylines and appeal to wider audiences. Lastly, we examined the gender proportion of cast and crew members, which revealed that the majority of both movie cast and crew members were male. Despite the disparities we noticed in gender and language, the movie industry is becoming more diverse, with an increase in female actors and actors of color reported in 2020. There has yet to be a comparable increase in female writers and executives, but we believe that with the increasing number of female directors, there will be more gender inclusivity in front of and behind the camera in the near future!

The link to our GitHub repository with code can be found here.



