Behind the Scenes: Unraveling the Mysteries of the Film Industry

DataRes at UCLA


Authors: Mingyang Li (Project Lead), Jeffrey Liu, Kelly Cheng, Larsen Bier, Wei Han Chua

Source: GR Stocks on Unsplash

Introduction

Suppose we want to become the next Steven Spielberg or Christopher Nolan, directing the next movie which will go down in history as a generational classic. To do this, let’s figure out what it is that makes movies successful and decipher the enigma that is the film industry.

There are many features we can explore to understand the intricacies of movies, whether it be geographic region of release, the starring actors, or the budget. Of course, each feature affects the performance of the film in a different way. There are predictive metrics like budget cost, as well as performance metrics like gross income. Let’s take a look at how the features affect each other through a correlation matrix.

Our correlation matrix gives an overview of the numerical features: IMDb Votes, IMDb Score, Budget Cost, Gross Income, Year Released, Month Released, and Movie Runtime in minutes. The darker the square, the higher the correlation between the two corresponding features; the lighter the square, the lower the correlation.
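For readers who want to see what sits behind each square, here is a minimal sketch of a Pearson correlation matrix in plain Python. The column names and values are made-up toy numbers, not rows from the dataset, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map each (feature, feature) pair to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Made-up toy values (assumption): five imaginary movies, three features.
toy = {
    "budget": [10, 20, 30, 40, 50],
    "gross": [12, 25, 31, 47, 55],
    "score": [7.1, 6.0, 6.8, 5.9, 6.5],
}
matrix = correlation_matrix(toy)
for (a, b), r in matrix.items():
    print(f"{a:>6} vs {b:<6} {r:+.2f}")
```

Each value lies between -1 and 1, and the diagonal is always 1 because every feature is perfectly correlated with itself.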

Several interesting observations follow from this matrix. Firstly, the budget has an extremely low correlation with performance features like IMDb Score and Gross Income. This is counter-intuitive, because most of us assume that the more money you put into a movie, the higher its quality and therefore its probability of success. Yet the matrix does not show this. One possible explanation is that a simple production is sometimes the better choice, whereas a large-scale production and marketing effort can be overwhelming and bring unexpected challenges.

For instance, Paranormal Activity (2007), despite having a budget of a mere $15,000, grossed around $193,000,000 worldwide. On the other hand, The Lone Ranger (2013), starring Johnny Depp, had an astronomical budget of $215,000,000 but grossed only about $260,000,000, leading to a considerable financial loss for Disney.

Let’s note some other interesting results. The number of IMDb votes has a relatively high correlation with Gross Income, indicating that popular movies make more money on average. Furthermore, the year that a movie is released has minimal impact on its success. However, as we will see later, this is not the case for months of release.

Through this matrix, we drew some general conclusions and gained a high-level picture of the film industry, but this is just the tip of the iceberg. Buckle up as we dive deeper into the relationships between these numerical features.

Score, Votes and Gross: Relationship between Numerical Features

Most of us rely on reviews and ratings when deciding which movies to watch. We seek out highly-rated films, assuming they promise a better viewing experience. Naturally, this should mean movies with higher ratings and more votes would perform better at the box office, right? We set out to explore this idea, hypothesizing that movies with higher box office earnings would attract more IMDb votes, and that higher-rated movies would rake in more cash. We analyzed data from the year 2000 onwards, a time when online ratings began to take off, with an average of around 100,000 IMDb votes per year and climbing.

Our scatter plot paints a surprising picture. At first glance, we see something unexpected: the movies at the higher end of box office earnings do not have the most IMDb votes. On the flip side, the movies with the most votes come nowhere near the top box office earnings. This contradicts our initial hypothesis: at the top of the chart, there is no straightforward link between a movie’s box office success and its number of votes.

Take Avatar and The Dark Knight for example. In our dataset, Avatar has the highest box office, raking in an astounding $2.9 billion, and 1.4 million votes on IMDb. In contrast, The Dark Knight holds the record for the most votes at 2.9 million, but achieved a comparatively modest gross of $1 billion. That is about a third of the box office despite having over twice as many votes! We noticed that Avatar had a score of 7.9, while The Dark Knight scored 8.4. Does this imply that higher-rated movies consistently generate lower box office revenues?

To answer this question, we need to visualize the effect of score on the box office. The plot below color-codes each movie by its score range and adds a trendline for each range.

We notice that highly-rated movies cluster towards the lower end of the box office spectrum, while generally receiving a greater quantity of votes. This tells us that high ratings do not necessarily translate to blockbuster earnings. In fact, highly-rated movies tend to earn less than their lower-rated counterparts for the same number of votes.
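A per-range trendline of this kind can be sketched as an ordinary least-squares fit within each score band. The band cut-offs, the helper names, and the rows below are all assumptions made up for illustration; the toy numbers are deliberately constructed so that the highest band has the flattest slope, mirroring the pattern described above rather than proving it.

```python
from collections import defaultdict

def score_band(score):
    """Bucket a rating into a coarse band (cut-offs are illustrative assumptions)."""
    if score >= 8.0:
        return "8+"
    if score >= 6.0:
        return "6-8"
    return "<6"

def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up rows: (score, votes in thousands, gross in $ millions).
movies = [
    (8.4, 500, 200), (8.8, 1500, 500), (8.1, 1000, 350),
    (7.0, 200, 300), (6.5, 400, 600), (7.4, 600, 900),
    (5.2, 50, 100), (5.8, 150, 400),
]
by_band = defaultdict(list)
for score, votes, gross in movies:
    by_band[score_band(score)].append((votes, gross))

trendlines = {band: fit_line(points) for band, points in by_band.items()}
for band, (slope, intercept) in sorted(trendlines.items()):
    print(f"{band}: gross ~ {slope:.2f} * votes + {intercept:.1f}")
```

A flatter slope in a band means that each additional vote is associated with less additional gross for movies in that band.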

This was an interesting insight: it seems that while ratings are important, they are not the golden ticket to box office success we thought they were. Other factors such as marketing, release timing, cast, and franchises most likely play larger roles in a movie’s financial success. As we will soon see, categorical features like genre have a significant impact too.

Popular Genres

When we think about movies, we often categorize them by genre: action, comedy, drama, and so forth. These categories help us decide what we might enjoy watching. But have you ever wondered which genres tend to receive higher ratings and why?

Our bar chart shows that the three highest-rated genres are historical, musical, and music. There are several possible reasons for this.

Historical films often delve into real events and significant narratives that resonate deeply with audiences. Additionally, they tend to be well-researched and add educational value, which contributes to their appeal and high ratings. Musicals often evoke strong emotional responses through their combination of music, dance, and storytelling. Similarly, films centered around music or musicians often benefit from their soundtracks and the emotional journeys of their characters.

Interestingly enough, we expected genres like action and comedy to rank higher, since so many trending movies, like any Avengers movie or Kung Fu Panda, belong to them. However, these genres score rather low in comparison to their counterparts.

Taking a look at our box plot, we can see that these genres show a wide range of scores, indicating a mixed reception of these particular movies. We will dive into the nature of these films as possible reasons for the phenomenon.

Action films are often a hit or miss. Although some movies, like Avengers: Endgame, deliver spectacular performances and engaging plots, others fall flat due to weak storylines or overreliance on special effects. Therefore, the movies in this genre often have mixed reviews.

As for comedy, humor is oftentimes subjective. What one person finds hilarious, another might find unfunny, or even offensive. This subjectivity can lead to varied critical reception, which can be observed through our wide range of scores in our box plot.

In comparison, our top three genres (historical, musical, and music) tend to have more consistent ratings. Action and comedy exhibit a wider range of scores because of their inherent variability and subjective nature, whereas historical films, musicals, and music-centric movies tend to maintain higher ratings due to their consistency in emotional depth and cultural impact.
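The two things a box plot lets us compare, the typical score and the spread, can be sketched with a per-genre mean and interquartile range (IQR). The scores below are made up to contrast a tight genre with a spread-out one, and `summarize_by_genre` is a hypothetical helper, not the code behind the charts.

```python
from collections import defaultdict
from statistics import mean, quantiles

def summarize_by_genre(ratings):
    """Return the mean score and interquartile range (IQR) for each genre."""
    by_genre = defaultdict(list)
    for genre, score in ratings:
        by_genre[genre].append(score)
    summary = {}
    for genre, scores in by_genre.items():
        q1, _median, q3 = quantiles(scores, n=4)  # the three quartile cut points
        summary[genre] = {"mean": mean(scores), "iqr": q3 - q1}
    return summary

# Made-up scores (assumption), five imaginary movies per genre.
ratings = [
    ("history", 7.2), ("history", 7.4), ("history", 7.5), ("history", 7.6), ("history", 7.8),
    ("action", 4.5), ("action", 5.5), ("action", 6.5), ("action", 7.5), ("action", 8.4),
]
summary = summarize_by_genre(ratings)
for genre, stats in summary.items():
    print(f"{genre}: mean={stats['mean']:.2f}, IQR={stats['iqr']:.2f}")
```

A larger IQR corresponds to a taller box in the box plot, which is the "mixed reception" pattern discussed for action and comedy.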

By understanding these patterns, filmmakers can better anticipate how their work might be received and tailor their projects to meet audience expectations. For viewers, this information can guide them toward genres that consistently deliver high-quality entertainment.

“Dump Months” Phenomenon

Star Wars, The Dark Knight, Harry Potter. Curiously enough, all of these blockbusters were released either between May and July, or between November and December. In contrast, blockbusters released in January and February are few and far between. These two months are commonly known as the “Dump Months”, a period when studios tend to release films they expect to perform poorly. To investigate this phenomenon, we filtered the 10 highest-grossing movies annually from 1980 to 2023 and plotted the gross revenue of these blockbusters against their release months.

As seen from the dot plot, the “Dump Months” phenomenon has been consistent since the 1980s. Most of the blockbusters were released either during summer or at the end of the year, with only a meager 5% released during the “Dump Months”. Furthermore, the average gross revenue of blockbusters released during the “Dump Months” is significantly lower than that of blockbusters released in other months.
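The filtering step described above (top grossers per year, then the share released in January or February) can be sketched as follows. The rows are invented, the field names `year`, `month`, and `gross` are assumptions, and the toy run keeps the top 2 per year instead of the top 10 purely to stay short.

```python
from collections import defaultdict

DUMP_MONTHS = {1, 2}  # January and February

def top_n_per_year(movies, n=10):
    """Keep the n highest-grossing movies of each release year."""
    by_year = defaultdict(list)
    for movie in movies:
        by_year[movie["year"]].append(movie)
    top = []
    for year_movies in by_year.values():
        ranked = sorted(year_movies, key=lambda m: m["gross"], reverse=True)
        top.extend(ranked[:n])
    return top

def dump_month_share(movies):
    """Fraction of the given movies released in a dump month."""
    in_dump = sum(1 for m in movies if m["month"] in DUMP_MONTHS)
    return in_dump / len(movies)

# Made-up rows (assumption); gross in $ millions.
movies = [
    {"year": 2000, "month": 6, "gross": 500},
    {"year": 2000, "month": 12, "gross": 400},
    {"year": 2000, "month": 1, "gross": 50},
    {"year": 2001, "month": 2, "gross": 700},
    {"year": 2001, "month": 7, "gross": 600},
    {"year": 2001, "month": 1, "gross": 30},
]
blockbusters = top_n_per_year(movies, n=2)
share = dump_month_share(blockbusters)
print(f"{share:.0%} of the toy blockbusters were released in a dump month")
```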

Several factors contribute to this phenomenon. The summer months coincide with school vacations in many countries, providing families more free time for movie outings. In addition, to qualify for awards such as the Oscars, movies must be released by the end of the previous year. Hence, studios gunning for prestigious awards tend to release their best films towards the year’s end so that they remain fresh in the minds of critics and award judges. At the turn of the year, school and work life resume in full swing, while the chilly weather further discourages people from visiting the theaters.

However, a few movies have defied the “Dump Months” trend. Black Panther, a notable outlier, was released in February 2018 and became an overwhelming hit, grossing $1.35 billion worldwide. It earned critical acclaim for its representation of African culture, with the hashtag #WakandaForever becoming a cultural phenomenon. This success demonstrates that with limited competition during the “Dump Months,” well-made movies can stand out more easily.

Race among Directors

Another question that we set out to investigate was whether a director’s level of experience had an effect on his or her film’s reception. To determine whether this question even made sense to ask, we first had to ask ourselves: how many movies can a director work on during his or her career?

Distribution of the total number of movies a person directs over the course of their career

Most directors’ careers don’t progress beyond their first movie. The movie industry is notoriously competitive, so this makes sense. Despite that, there are still many successful directors. Over 1,500 directors in our dataset directed more than one movie and almost 500 directed 5 or more. Out of these, there are a few superstars whose names you probably recognize from one of your favorite films:

A race between famous directors depicting how many movies they’ve directed up to that year.

Out of all the incredible talent in our database, Clint Eastwood is our most prolific director. Congratulations! Clearly, a relatively large proportion of our dataset is composed of directors with advanced careers. Does that depth of experience translate to improvement over time?

The distribution of scores for the 1st, 2nd, 3rd, … , 9th movies a person has directed, with the mean for each level depicted in red.

When we examine the distribution of scores for each director’s 1st, 2nd, 3rd, and so forth movie, it is difficult to discern a strong trend. The mean score gradually increases (notice the red circle wandering slightly to the right), suggesting experience may yield some improvement. However, this may be because skilled directors are given more opportunities to work on the more prominent movies that would be in our dataset, increasing the proportion of highly successful directors in the upper levels of our diagram.
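To make the "1st, 2nd, 3rd movie" grouping concrete, here is a minimal sketch that numbers each director's movies in release order and averages the scores at each career step. The directors, years, and scores are invented, and `mean_score_by_career_step` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_career_step(movies):
    """movies: iterable of (director, year, score) tuples.

    Returns {career step: mean score}, where step 1 is a director's first movie.
    """
    movies_so_far = defaultdict(int)
    scores_at_step = defaultdict(list)
    # Sort by director, then by year, so each director's movies are counted in order.
    for director, _year, score in sorted(movies, key=lambda m: (m[0], m[1])):
        movies_so_far[director] += 1
        scores_at_step[movies_so_far[director]].append(score)
    return {step: mean(scores) for step, scores in sorted(scores_at_step.items())}

# Made-up rows (assumption).
movies = [
    ("A", 1990, 6.0), ("A", 1995, 6.5), ("A", 2001, 7.0),
    ("B", 2000, 5.0), ("B", 2004, 6.5),
    ("C", 2010, 5.5),
]
step_means = mean_score_by_career_step(movies)
print(step_means)
```

Note that in the toy data only director A reaches step 3, so the mean at that step reflects a single, relatively successful director. This is the same selection effect that makes the upward drift of the mean hard to interpret.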

Furthermore, we observe a small decrease in the left-skew of the distributions in the upper levels, which can be interpreted in a number of ways that are consistent with our prior observations. First, this may support the hypothesis that less skilled directors are not given subsequent opportunities to make movies, as people who’ve scored well on the lower levels are presented with more opportunities than those who scored poorly. Alternatively, movies with higher budgets, better marketing, and an overall better team may be more likely to hire renowned directors for their projects. These characteristics could be important contributors to boosting the score of the movie, which would predispose directors to receive superior scores for their later projects. Now that we have rounded off our analysis of directors, let’s proceed to actors.

Who is at the Center of Hollywood?

To cap it off, we investigated the relationships between actors. Using our dataset, we constructed a multi-graph network for all movies from 1980 to 2024, as seen in the visualization above. Each node represents an actor, while each edge linking two nodes indicates that the two corresponding actors have acted in the same movie. Since two actors could have acted in several movies together, there could be multiple edges between the same pair of nodes. As we traverse the graph, several questions arise. Which actors have acted together most often? What is the shortest path between any two given actors? Most importantly, who is at the center of the film industry?
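The first two questions can be sketched on a toy network. Below, the multi-graph is represented as a counter of actor pairs (the count is the number of parallel edges), and the shortest path is found with a breadth-first search. The movie titles and actor names are invented, and this is an illustrative sketch rather than the code used to build the network.

```python
from collections import Counter, defaultdict, deque
from itertools import combinations

# Made-up casts (assumption).
casts = {
    "Movie 1": ["Ann", "Bob", "Cy"],
    "Movie 2": ["Ann", "Bob"],
    "Movie 3": ["Cy", "Dee"],
}

# One edge per shared movie: the count is the multiplicity of the edge.
edge_counts = Counter()
for cast in casts.values():
    for pair in combinations(sorted(set(cast)), 2):
        edge_counts[pair] += 1

# Plain adjacency sets are enough for path finding.
neighbors = defaultdict(set)
for a, b in edge_counts:
    neighbors[a].add(b)
    neighbors[b].add(a)

def shortest_path(source, target):
    """Breadth-first search; returns a list of actors, or None if unreachable."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        actor = queue.popleft()
        if actor == target:
            path = []
            while actor is not None:
                path.append(actor)
                actor = previous[actor]
            return path[::-1]
        for other in neighbors[actor]:
            if other not in previous:
                previous[other] = actor
                queue.append(other)
    return None

print(edge_counts.most_common(1))    # the pair that acted together most often
print(shortest_path("Ann", "Dee"))   # a chain of co-stars linking the two
```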

Degree Centrality

Degree centrality is measured by dividing the degree of an actor by the maximum possible degree, which in this case is the total number of actors minus one. It provides a simple and intuitive way to identify the actor who is the most connected to all other actors. We have displayed here the top 10 actors in terms of degree centrality, which reflects the number of people that the actor has worked with.

As seen from the bar chart, Samuel L. Jackson has a degree centrality of 0.018%, which means that he has worked with about 0.018% of the actors in the network, the highest proportion of any actor.
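As a minimal sketch of the definition above, the snippet below counts each actor's distinct co-stars and divides by the number of actors minus one. Counting distinct co-stars rather than parallel edges is an assumption based on the "number of people worked with" reading; the casts are invented and the sketch assumes at least two actors.

```python
from collections import defaultdict

def degree_centrality(casts):
    """Distinct co-stars of each actor divided by (number of actors - 1)."""
    co_stars = defaultdict(set)
    for cast in casts:
        members = set(cast)
        for actor in members:
            co_stars[actor] |= members - {actor}
    n = len(co_stars)  # assumes n >= 2, otherwise the denominator is zero
    return {actor: len(others) / (n - 1) for actor, others in co_stars.items()}

# Made-up casts (assumption).
casts = [
    ["Ann", "Bob", "Cy"],
    ["Ann", "Dee"],
    ["Ann", "Bob"],
]
centrality = degree_centrality(casts)
for actor, value in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{actor}: {value:.2f}")
```

In this toy network Ann has worked with every other actor, so her degree centrality is 1.0; in a network with many thousands of actors, even the best-connected actor reaches only a small fraction.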

Betweenness Centrality

Betweenness centrality (BC) is another way to determine which node in a graph is the most connected, by measuring how often a node lies on the shortest paths between two other nodes. Given a node, or in our case an actor (v), it considers each possible pair of other actors (s and t) in the graph and computes the proportion of shortest paths between them that pass through v. The final result is the sum of this value over all possible pairs of actors:

BC(v) = Σ_{s ≠ v ≠ t} σ(s, t | v) / σ(s, t)

Here, σ(s, t) is the number of shortest paths between s and t, while σ(s, t | v) is the number of shortest paths between s and t that pass through v. For computational efficiency, s and t were sampled from a random subset of 1000 actors generated for each v.
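The definition can be sketched exactly on a toy graph. The snippet below counts shortest paths with a breadth-first search and uses the fact that a shortest path from s to t passes through v only when dist(s, v) + dist(v, t) = dist(s, t), in which case σ(s, t | v) = σ(s, v) · σ(v, t). It sums over every unordered pair instead of sampling 1000 actors, so it is an illustrative sketch and not the sampled computation described above. The graph and function names are made up.

```python
from collections import deque
from itertools import combinations

def bfs_counts(graph, source):
    """Distances and numbers of shortest paths from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:              # first time we reach w
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:     # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(graph, v):
    """Sum over unordered pairs {s, t} (both != v) of sigma(s,t|v) / sigma(s,t)."""
    dist_v, sigma_v = bfs_counts(graph, v)
    others = [node for node in graph if node != v]
    total = 0.0
    for s, t in combinations(others, 2):
        dist_s, sigma_s = bfs_counts(graph, s)
        if t not in dist_s or v not in dist_s:
            continue                       # s cannot reach t or v
        if dist_s[v] + dist_v[t] == dist_s[t]:
            total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total

# Made-up undirected graph (assumption): a square Ann-Bob-Dee-Cy, with Eve attached to Dee.
graph = {
    "Ann": {"Bob", "Cy"},
    "Bob": {"Ann", "Dee"},
    "Cy": {"Ann", "Dee"},
    "Dee": {"Bob", "Cy", "Eve"},
    "Eve": {"Dee"},
}
scores = {actor: betweenness(graph, actor) for actor in graph}
print(scores)
```

Dee scores highest because every path to Eve has to pass through her, while Eve, who sits at the edge of the network, scores zero.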

A high BC indicates that an actor is an important connection through which many people know each other. When we examine BC scores overall, it appears that there are few actors with high BC scores, with the vast majority of them achieving a BC score negligibly greater than zero.

Normalized histogram of BC scores. Probability density decreases rapidly as BC increases.

Those clustered around zero are extremely obscure actors. Hollywood is notoriously competitive, and many actors perform niche roles or only ever get to play a few roles, leading to the mode BC being incredibly close to zero. The story is much more interesting when we focus on well-connected actors. We see an interesting positive relationship between BC and degree centrality: six of the ten actors with the highest degree centrality are also in the top ten actors sorted by BC:

Top ten actors sorted by BC in descending order. Red Arrows indicate actors who are also among the top ten actors when sorted by degree centrality in descending order.

Even more interestingly, when we consider the sum of the BC across a movie’s entire cast, it is positively correlated with the movie’s gross revenue after accounting for the movie’s budget, score, and number of votes on IMDb. This is consistent with our intuition that famous, “marketable” actors draw in audiences.

Closeness Centrality

Finally, we present our last centrality score: closeness centrality. As the name suggests, it determines how close a node is to all other nodes by measuring the average shortest path length from one actor to all other actors in our network. Nodes with a higher closeness centrality have, on average, a shorter path to any other node in the network.
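One common way to turn "average shortest path length" into a score where higher means closer is to take its reciprocal. The sketch below does this with a breadth-first search on a made-up chain of four actors; the handling of unreachable actors (they are simply ignored) is an assumption, and the exact normalization behind the chart may differ.

```python
from collections import deque

def closeness_centrality(graph, actor):
    """Reciprocal of the average shortest-path length from actor to reachable nodes."""
    dist = {actor: 0}
    queue = deque([actor])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    # (number of other reachable nodes) / (sum of distances); 0.0 for an isolated node.
    return (len(dist) - 1) / total if total else 0.0

# Made-up chain (assumption): Ann - Bob - Cy - Dee.
graph = {
    "Ann": {"Bob"},
    "Bob": {"Ann", "Cy"},
    "Cy": {"Bob", "Dee"},
    "Dee": {"Cy"},
}
closeness = {actor: closeness_centrality(graph, actor) for actor in graph}
print(closeness)
```

Bob and Cy sit in the middle of the chain, so their average distance to everyone else is shorter and their closeness is higher than that of Ann and Dee at the ends.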

Our results should not come as a surprise. We see many familiar names such as Willem Dafoe, Samuel L. Jackson, and Nicolas Cage. However, we also notice some new names like Richard Jenkins, Robert Downey Jr., and Gary Oldman. These actors did not rank in the top 10 for either degree or betweenness centrality. This suggests that they may not have the highest number of direct connections, nor be the primary intermediaries that other actors rely on to connect with one another, yet they still possess high levels of overall connectivity in the network. In other words, these actors are efficiently positioned within the network, but do not play as significant a role in bridging different parts of it.

Conclusion

In unraveling the mysteries of the film industry, our analysis has affirmed preconceived hypotheses like the dump month phenomenon. At the same time, it unveiled unexpected trends such as the low correlation between budget and gross income, which underscores that financial investment alone does not translate to guaranteed success. The disparity between high-grossing movies and IMDb votes further illustrates that audience preferences and box office earnings do not always align, suggesting that a movie’s financial success can diverge significantly from its popularity among viewers. Together, these insights underscore the multi-faceted nature of the film industry, where success is a delicate balance of creative vision, strategic decisions and timing, rather than a straightforward formula.
