Are You As Funny As You Think?

DataRes at UCLA
Mar 14, 2024

Exploring the subjective experience of humor using machine learning

Helen Liu (Project Lead), Queenie Wu, Alexis Adzich, Anya Smolentseva

Introduction

Humor brings everyone together, whether it’s a joke that leaves the whole table laughing or a post a friend sent that prompts a smile. But what is humor? Is it the context? The words? The timing? Humor is incredibly subjective: plenty of nuance and many individual factors go into someone finding a joke “funny.”

Our project takes a statistical approach to humor and asks whether it can be evaluated on a broader scale. We turned to Reddit jokes to gain comedic insight into a very large population of users, using upvotes and downvotes as a metric of humor.

To answer this question, we used data compiled by Weller and Seppi in “The rJokes Dataset: a Large Scale Humor Collection” (2020), which contains over 500,000 jokes spanning 11 years of the Reddit r/Jokes subreddit, along with their upvote and downvote scores. We broke our approach into four main steps: introductory statistical analysis of the data, K-Means clustering to group similar jokes, a joke-scoring algorithm, and finally a CNN trained on tokenized joke data.
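As a rough sketch of the setup (the file name and column layout below are hypothetical stand-ins; the actual rJokes distribution may be organized differently):

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names for the rJokes data.
df = pd.read_csv("rjokes.tsv", sep="\t", names=["score", "joke"])

# Upvote counts are heavy-tailed, so we also keep a log-scaled score.
df["log_score"] = np.log10(1 + df["score"])
```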

Quantifying Quips: Understanding Parameters of Jokes

To begin, we conducted some exploratory data analysis to develop a general understanding of our data set. We explored the relationship between a joke’s word count and its average score. The distribution generally made sense: jokes with higher average scores were neither too long nor too short.
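A sketch of that analysis, under the loading assumptions above:

```python
# Average score as a function of joke length, binned so that
# each point averages over many jokes.
df["word_count"] = df["joke"].str.split().str.len()
bins = pd.cut(df["word_count"], bins=range(0, 310, 10))
print(df.groupby(bins, observed=True)["score"].mean())
```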

We then examined the most common words in our dataset. Words like “man,” “one,” and “like” appeared frequently. However, upon closer inspection, a significant portion of these frequent words carried little semantic weight and contributed minimally to the overall contextual meaning of the collected jokes.

Consequently, we refined our dataset by filtering out a list of filler words. We then repeated the word-extraction process, this time comparing each word directly to the original score of the joke it appeared in. We found it particularly intriguing that many of the words used in high-scoring jokes related to sex or gender.
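As a sketch of this filtering and extraction step, using NLTK’s English stopword list as a stand-in for our filler-word filter:

```python
from collections import Counter

from nltk.corpus import stopwords  # requires nltk.download("stopwords")

stop = set(stopwords.words("english"))

def content_words(text):
    """Lowercase, split, and drop filler words and non-alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in stop]

# Most common content words across the dataset.
counts = Counter(w for joke in df["joke"] for w in content_words(joke))
print(counts.most_common(20))
```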

Silly Similarities: Grouping Jokes With Meaning

In order to truly understand humor and what makes Redditors enjoy a joke, we decided to analyze the kinds of jokes Redditors like to make and how common they are among the Reddit population as a whole.

This analysis started with pre-processing each joke in the dataset, removing common words that add little to a joke’s meaning, such as “the”, “and”, and “there”. After applying the TF-IDF algorithm to measure the importance of each word in each joke, we used PCA to keep only the dimensions with the highest variance. This let us retain the most important information in each joke, making it significantly easier to apply K-Means clustering and visualize the similar jokes within the dataset.
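A sketch of this pipeline with scikit-learn; note that TruncatedSVD stands in for PCA here because it handles sparse TF-IDF matrices directly, and the number of components is an illustrative choice:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weights each word by its importance within and across jokes.
tfidf = TfidfVectorizer(stop_words="english", max_features=10_000)
X = tfidf.fit_transform(df["joke"])

# TruncatedSVD plays the role of PCA on sparse matrices, keeping
# only the directions of highest variance. 100 is illustrative.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)

# Three clusters, matching the three groups discussed below.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
df["cluster"] = kmeans.fit_predict(X_reduced)

# The most frequent content words per cluster drive the interpretations.
for c in range(3):
    cluster_counts = Counter(
        w for joke in df.loc[df["cluster"] == c, "joke"]
        for w in content_words(joke))
    print(c, cluster_counts.most_common(10))
```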

Here, we can see three distinct groups of jokes within the dataset. Using these clustered jokes, we sought to investigate the most common kinds of jokes within the dataset by analyzing the most frequent words in each cluster of jokes.

In this first cluster of jokes, we see words like “told”, “wife”, “friend”, “people”, and “son” which suggests that this group might contain jokes revolving around personal experiences, relationships, and anecdotes. In addition, words such as “year,” “time,” and “day” hint at jokes related to events, passing time, or daily occurrences. Lastly, terms like “people,” “friend,” and “guy” may imply jokes about social interactions, gatherings, or everyday encounters. In other words, this cluster mostly has jokes about everyday occurrences with friends and family.

In this second cluster of jokes, we observe the presence of words like “bartender,” “bar,” “drink,” and “walk” which suggests that this group might contain jokes set in bars or social gatherings, and keywords like “wife,” “woman,” and “man” hint at jokes revolving around relationships. It seems like the most common joke in this cluster follows the format of the “A man walks into a bar” joke.

In this third cluster, keywords like “wife,” “husband,” “woman,” and “guy” suggest jokes centered on relationships, marriage, and gender dynamics, and the presence of words like “sex” and “girl” may imply jokes with sexual innuendos or themes.

A common theme throughout these jokes is the frequency of words like “woman”, “girl”, and “wife”, indicating that no matter the topic, women were a common underlying subject of humor. Further analysis is necessary to determine whether these jokes referred to women in a positive or negative light.

A Word Grammarly

To model how “funny” a joke is, our first approach was to develop an algorithm that calculates the humor score of common words. As we indexed through the dataset, we computed a score for each word.
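In spirit, a word scores highly when the jokes containing it score highly. A minimal sketch, assuming the word score is the mean normalized joke score over the jokes that contain it (one plausible reading of the behavior described below):

```python
from collections import defaultdict

# Normalize joke scores so word scores land on a 0-to-1 scale.
norm = df["log_score"] / df["log_score"].max()

totals = defaultdict(float)
appearances = defaultdict(int)
for joke, s in zip(df["joke"], norm):
    for w in set(content_words(joke)):  # count each word once per joke
        totals[w] += s
        appearances[w] += 1

# Keep only the ~3,000 most common words so that a rare word appearing
# in one viral joke doesn't receive an inflated score.
common = sorted(appearances, key=appearances.get, reverse=True)[:3000]
word_score = {w: totals[w] / appearances[w] for w in common}
```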

We calculated word scores for the 3,000 most common words in the dataset. Some words appear only a few times, and if the jokes they happen to appear in have many likes, those words end up with insanely high scores. By restricting to the top 3,000 words, we ensured that every word appeared at least 40 times. Here’s what some top-rated words look like:

Some popular words are “devil”, “soldiers”, “testicles”, “gorillas”, and “satan”… Okay, it makes sense that people love these topics. One interesting feature is that “upvote” and “thank” also score very high: people are nice, and if you ask for an upvote and thank them for it, they are probably gonna give you one.

With this word-score list, we developed a joke Grammarly (here’s the link on GitHub). When you input a joke, it calculates how funny you are based on the words you use. For example, let’s input “Why did the skeleton go to dinner alone? No body!”

The joke has a score of 0.181… hmm, it’s okay, but what if we make some minor changes:

Here we changed the word “dinner” to “party”, and the score increased. It seems that Reddit users find parties more interesting than dinners.
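Under the same assumptions, the scorer itself can be sketched as the average of a joke’s known word scores (the exact aggregation in our tool may differ):

```python
def joke_score(joke: str) -> float:
    """Average the scores of a joke's recognized words."""
    scores = [word_score[w] for w in content_words(joke) if w in word_score]
    return sum(scores) / len(scores) if scores else 0.0

print(joke_score("Why did the skeleton go to dinner alone? No body!"))
print(joke_score("Why did the skeleton go to a party alone? No body!"))
```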

However, you can find ways to cheat the system with some magical phrases:

Hence, we have a brief understanding of this method’s pros and cons. The algorithm works as a tool to fine-tune a joke, helping the user pick funny words and popular topics. However, it cannot take in an entire joke and evaluate it holistically, and it fails to identify funny puns or interesting sarcasm. For that, it’s time to borrow some help from the big boss AI himself.

Wait, AI Knows Humor?

Why CNN?

We chose a Convolutional Neural Network (CNN) model due to its strength with sequential patterns such as language. Jokes are interpreted in phrases, and order matters: a punchline that comes before the premise ruins the joke. CNNs also scale well to our very large dataset, with its jokes of varied lengths and extensive vocabulary, as they can adapt to different inputs and reduce dimensionality through convolutional layers. We tokenized our jokes into numerical data that the machine can interpret, letting it learn to make increasingly accurate score predictions as it trains on the dataset.

Baseline Model

Our baseline was inspired by the article “LMAONet: LSTM Model for Automated Objective Humor Scoring and Joke Generation” (2019) by Simone and Cruz, whose final approach involved a CNN model classifying jokes into five categories defined by 20th-percentile thresholds of the same weighted average joke score. One limitation we found in the article was its small kernel size of 3. While this creates a more local receptive field, focusing on only three words at a time, we wanted to explore a larger kernel size, since jokes often build up a lot of context before the final punchline.

Our Improved Model

To begin with, we developed a tokenizer for the 20,000 most common words in the dataset. We tokenized all the jokes, truncating or padding each one to exactly 65 values. Out-of-vocabulary words were assigned an unknown token (<unk>), and a padding token (<pad>) was used as filler to ensure each joke had the same length.
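A sketch of such a tokenizer (reserving ids 0 and 1 for <pad> and <unk> is our illustrative choice, not necessarily the exact scheme we used):

```python
import torch

VOCAB_SIZE, MAX_LEN = 20_000, 65
PAD, UNK = 0, 1

# Vocabulary over raw (unfiltered) words; ids 0 and 1 are reserved.
raw_counts = Counter(w for joke in df["joke"] for w in joke.lower().split())
vocab = {w: i + 2 for i, (w, _) in
         enumerate(raw_counts.most_common(VOCAB_SIZE - 2))}

def tokenize(joke: str) -> torch.Tensor:
    """Map a joke to exactly MAX_LEN token ids, truncating or padding."""
    ids = [vocab.get(w, UNK) for w in joke.lower().split()][:MAX_LEN]
    ids += [PAD] * (MAX_LEN - len(ids))
    return torch.tensor(ids)
```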

Improving on the Stanford model, we included 2 kernel sizes and wrote an algorithm to combine them: a kernel size of 4 to identify humorous phrases and funny puns, and a kernel size of 7 to detect contextualized sarcasm and meaning. As for using binary instead of numerical prediction: we originally trained our model on numerical prediction, but after running it several times, the accuracy was suboptimal. Consistent with our findings, the Stanford model also slices its prediction into ranges instead of giving a precise value. This could be because humor is more random and subjective, so a simpler model fits better. For the binary model, we took the log of the upvotes and examined its distribution.
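A minimal sketch of this two-kernel architecture in PyTorch, including the binary labeling described next; the embedding size, channel count, and learning rate are illustrative placeholders rather than the exact values we trained with:

```python
import torch.nn as nn

class JokeCNN(nn.Module):
    """Two parallel 1-D convolutions: kernel size 4 for local phrases
    and puns, kernel size 7 for longer-range context, concatenated."""

    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=128, channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD)
        self.conv4 = nn.Conv1d(embed_dim, channels, kernel_size=4)
        self.conv7 = nn.Conv1d(embed_dim, channels, kernel_size=7)
        self.fc = nn.Linear(2 * channels, 1)

    def forward(self, x):                  # x: (batch, MAX_LEN)
        e = self.embed(x).transpose(1, 2)  # (batch, embed_dim, MAX_LEN)
        h4 = torch.relu(self.conv4(e)).max(dim=2).values  # pool over time
        h7 = torch.relu(self.conv7(e)).max(dim=2).values
        return torch.sigmoid(self.fc(torch.cat([h4, h7], dim=1))).squeeze(1)

# Binary labels: log10 score above 2, i.e. roughly 100+ upvotes = funny.
kept = df[df["log_score"] > 0]             # drop near-zero-upvote jokes
labels = torch.tensor((kept["log_score"] > 2).to_numpy(),
                      dtype=torch.float32)

model, loss_fn = JokeCNN(), nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```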

Next, we removed all jokes whose log values were less than or equal to 0. With a large enough dataset (n = 500,000), we split the remaining data in half, labeling jokes with log values below 2 as false and those above 2 as true. Converting back to upvotes, this means any joke with more than 100 upvotes is classified as funny, and the rest as unfunny. We trained our model with the loss function BCELoss, which is commonly used for binary classification tasks. Finally, we ran the training loop on an online server, which produced a trained model ready to use for predictions. In the end, our model achieved an AUROC score of 0.91, a relatively accurate result. Here is a sample prediction for one batch:

As shown in the table, despite the relatively accurate results, the main error is false negatives: we predict a joke to be unfunny when it actually has a high rating. This could be because, as mentioned later in the analysis section, people are drawn to popular topics and usually don’t have the patience to read an entire joke without a hook. The model learns to put more emphasis on diction and phrases (the hooks) rather than contextualized meaning. Hence, jokes that are not about a popular topic but carry humorous contextualized meaning, like a well-phrased old British joke, may be labeled not funny even though they have a relatively high score.

With the trained model, we wrote an algorithm for joke prediction (here’s the link on GitHub). When using the algorithm, be sure to read the User Handbook section first to avoid potential errors when dealing with jokes that are too short, joke titles, and uncommon words.
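For illustration, the prediction step boils down to something like the following sketch, reusing the tokenizer and model defined above:

```python
@torch.no_grad()
def predict(joke: str) -> float:
    """Return the model's estimated probability that a joke is funny."""
    model.eval()
    return model(tokenize(joke).unsqueeze(0)).item()

print(predict("Why did the skeleton go to a party alone? No body!"))
```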

Now, it’s time to read some jokes! These are two highly upvoted jokes from the r/Jokes section of Reddit; both are recent, so they could not have been seen during training:

Prediction by model: 0.9129990339279175

Prediction by model: 0.9971850514411926

Here is a random text generated by ChatGPT:

Prediction by model: 0.000017453021428082138

Here are two not-that-funny jokes, published around the same time as the funny ones:

Prediction by model: 0.1872854381799698

Prediction by model: 0.010045167990028858

As we can see, deep learning models are capable of understanding humor to some extent. Ours is able to (a) distinguish jokes (both funny and unfunny ones) from random paragraphs, and (b) distinguish funny jokes from unfunny jokes. However, there are several limitations to this model:

  1. Because we used the binary prediction method, we removed all jokes with log values less than or equal to 0. These jokes actually make up about half of the dataset. Since the model was never exposed to them, it has limited ability to understand and predict unfunny jokes. A potential improvement is to design a criticizing mechanism to help the model recognize bad jokes.
  2. By using binary instead of numeric prediction, we sacrificed precision for accuracy. As a result, even though the model handles the big picture well, like distinguishing jokes from random paragraphs, it fails to predict which joke is “funnier”: it cannot recognize the difference between two popular jokes even when their upvotes differ widely.
  3. Due to limited computing power, we wrote our own tokenizer; a better-developed tokenizer from an established library could perform better. In addition, CNN models shine with images, whereas transformers are better at contextualizing information. If we were to approach this task again, we might use a transformer instead of a CNN model.

Despite these limitations, we can draw some conclusions from this NLP exploration:

  1. Humor, at least in Reddit jokes, is quite random.
    Neither our model nor the Stanford model can give an accurate numeric prediction of a Reddit joke’s score. Even with the binary approach, it’s hard to predict a difference even when one joke is significantly more “funny”. This suggests that Reddit upvotes are relatively random and subjective.
  2. Scores depend more on diction, puns, and syntax than on contextualized meaning.
    The model gives a high score to sentences on popular topics that are not funny at all, for example, “satan goes to a party man just get yourself a hot girlfriend, yeh”. The model also produces false negatives, failing to recognize jokes whose humor is contextual and hook-free. Finally, the model performs poorly at separating everyday conversations about popular topics from actual jokes. This suggests that the parameters that actually matter for predicting humor are popular topics and diction, funny phrases and puns, and easy-to-understand syntax and sentence structure.
  3. Can AI understand humor?
    Overall, our model was able to interpret humor to some extent, but its predictions are far from “intelligent”. It’s not just our model: when we gave ChatGPT an unfunny Reddit joke and a funny one from our testing dataset and asked which one is funnier, ChatGPT simply stated: “They are both funny to me”. (In this sense, perhaps our model is more down-to-earth than ChatGPT!) Even if ChatGPT can write reports like a 60-year-old lawyer, ask it to generate a joke and it does worse than a high school student. Perhaps humor is, after all, a subjective experience that cannot be reduced to vector calculations and big data. As humans, we beat AI when it comes to being funny.

Conclusion

Going back to our main topic: how can you be funnier? Here are some tips based on our exploration:

  1. Sex beats classy British jokes.
    Based on the Quantifying Quips section and the list of high-scoring words, Reddit jokes with many upvotes tend to be sex- and gender-related.
  2. Popular topics, eye-catching diction, and easy-to-read syntax are more important than contextualized humor.
    From the Silly Similarities section, we saw three kinds of common jokes: personal-relationship jokes, sex jokes, and bartender jokes. Yet some words keep appearing across these meaning groups, such as “wife”, “guy”, and “girl”. From our machine-learning model, we saw that upvotes depend more on popular topics, funny puns, and simple sentence structure than on contextualized meaning. You can’t expect people to do literary analysis on jokes: pick a popular topic, use catchy words, be straightforward, and you’ve got this.
  3. Be confident. Humor is subjective, and you are beating AI at it.
    When we used AI to interpret humor, the predictions weren’t that accurate, and this seems to be a common issue for large language models like ChatGPT. Look, the subjective experience of humor cannot be reduced to mathematical terms. You are a human; you’ve got this!

Citations

GfG. (2024, January 31). NLP: How tokenizing text, sentence, Words Works. GeeksforGeeks. https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

He, K., Zhang, X., Ren, S., & Sun, J. (2015, December 10). Deep residual learning for image recognition. arXiv.org. https://arxiv.org/abs/1512.03385

Navlani, A. (2023, February 23). Python decision tree classification tutorial: Scikit-Learn Decisiontreeclassifier. DataCamp. https://www.datacamp.com/tutorial/decision-tree-classification-python

Simone, Z., & Cruz, C. (2019). LMAONet: LSTM Model for Automated Objective Humor Scoring and Joke Generation. Stanford CS224n. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15791516.pdf

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2023, August 2). Attention is all you need. arXiv.org. https://arxiv.org/abs/1706.03762
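
Weller, O., & Seppi, K. (2020). The rJokes Dataset: a Large Scale Humor Collection. Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC 2020).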
