“When Dating Met the Internet”

Jun 21, 2021

By: Sivaji Turimella (Project Lead), Brandon Zhao, Robi Chatterjee, Kevin Hahn, Matthew Maemura

Introduction

Gone are the old days of hoping to meet your soulmate at your local bar or coffee shop. Instead, online dating apps like Tinder, Bumble, and OkCupid have become increasingly popular as a way of meeting people, and the number of people using these apps continues to grow. Rather paradoxically, though, as more and more people start using these apps, it could actually become harder to meet other people. After all, how can you make your profile stand out among millions of others?

To answer this question and others, we decided to analyze the online dating profiles of various individuals. Specifically, we used a dataset from Kaggle of dating profiles from the popular dating app OkCupid. In this Kaggle dataset, there are “60k records containing structured information such as age, sex, orientation as well as text data from open-ended descriptions.”

We hope that through our analysis, dating app companies can better understand their users and how to create the best matches between them, and that users of these dating apps can understand who else is on the apps and how to make themselves stand out.

Essay Choices

Users are given the option to answer up to 10 essay prompts. Not everyone fills out each prompt, so to understand what people generally care about more in their profile, we compared the rates at which different prompts were left empty.
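As a rough illustration of this comparison, the sketch below computes the share of profiles that left each prompt empty. The column names (“essay0”, “essay1”, “essay2”) and the toy profiles are assumptions made for the example, not rows from the dataset.

```python
# Toy profiles; the keys "essay0".."essay2" are assumed column names,
# and None or "" stands for a prompt the user left empty.
profiles = [
    {"essay0": "i love music", "essay1": "", "essay2": None},
    {"essay0": "food and friends", "essay1": "working in tech", "essay2": None},
    {"essay0": None, "essay1": "student", "essay2": "hiking"},
    {"essay0": "movies", "essay1": "", "essay2": ""},
]

def empty_rate(rows, column):
    """Fraction of rows whose value in `column` is missing or blank."""
    empty = sum(1 for row in rows if not (row.get(column) or "").strip())
    return empty / len(rows)

rates = {col: empty_rate(profiles, col) for col in ("essay0", "essay1", "essay2")}

# Print prompts from most-skipped to least-skipped.
for col, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col}: {rate:.0%} left empty")
```

Sorting by the empty rate makes it easy to see which prompts people skip most often.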

Dealbreakers: To Show or Not To Show

We all have our own standards of what we want, and don’t want, in the people we date. On the flip side, we are all image-conscious to varying degrees: we have an idea of what we want, and don’t want, possible matches to know about us at first glance. OkCupid allows users to skip questions when they first set up their profile. However, the questions that they do decide to answer contribute to a compatibility percentage score with each potential match, and users can see which questions they and a match agree or disagree on. As a result, we were able to examine how guarded users are about certain topics.

Drug use transparency in particular was an interesting topic to explore because this is often a dealbreaker for many people. Drinking and smoking are quite commonplace. Generally speaking, there is not that much negative judgment for doing so. But drug use is where many begin to draw a line in the sand. It can easily sour romantic aspirations with an otherwise compatible match.

In the entire dataset, nearly 1 in 4 users (23.5%) chose not to disclose their drug use habits. Since this proportion seemed much larger than expected, we examined how consistent it was across meaningful segments, such as sexual orientation and gender. Heterosexuals and non-heterosexuals appear to have the same proportion of users who did not report their drug habits, but differ in their overall drug use (“Often”, “Sometimes”).

Fascinatingly enough, in the graphs below, we can see that for any gender-orientation combination, at least 20% of the users in that group remained tight-lipped, so the overall proportion was not heavily skewed by any particular subgroup. The overall distribution of drug use for a given orientation differs considerably from that of the other orientations within the same gender category. For both males and females, though, bisexual users appear more transparent about occasional drug use (“Sometimes”) than homosexual and heterosexual users.
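The segment comparison above can be sketched as a simple group-by. The field names (“sex”, “orientation”, “drugs”), the use of None for an undisclosed answer, and the toy rows are all assumptions for illustration.

```python
from collections import defaultdict

# Toy rows; the keys and the use of None for "not disclosed" are assumptions.
rows = [
    {"sex": "f", "orientation": "straight", "drugs": "never"},
    {"sex": "f", "orientation": "straight", "drugs": None},
    {"sex": "f", "orientation": "bisexual", "drugs": "sometimes"},
    {"sex": "f", "orientation": "bisexual", "drugs": None},
    {"sex": "m", "orientation": "straight", "drugs": "never"},
    {"sex": "m", "orientation": "straight", "drugs": "never"},
    {"sex": "m", "orientation": "straight", "drugs": None},
    {"sex": "m", "orientation": "gay", "drugs": "often"},
]

def nondisclosure_by_segment(rows):
    """Share of users in each (sex, orientation) group with no drug answer."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        key = (row["sex"], row["orientation"])
        totals[key] += 1
        if row["drugs"] is None:
            missing[key] += 1
    return {key: missing[key] / totals[key] for key in totals}

shares = nondisclosure_by_segment(rows)
for segment, share in sorted(shares.items()):
    print(segment, f"{share:.0%} undisclosed")
```

Comparing each segment’s share with the overall share is what tells us whether one subgroup is driving the total.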

It would be very naive to assume that such non-disclosures are trivial. Regardless of gender and orientation, users would love to be swiped right on by as many suitable matches as possible. And that starts with their profile making a good first impression, at least from their point of view. They wouldn’t want their stated drug use habits to dissuade potential matches, so it makes sense to omit that detail from their profile and avoid that possible debate.

All things considered, choose your cards wisely.

Word Frequency

Included in this dataset are 10 essays that each individual had the option of filling out for their profile. These were the main focus of our analysis: we found the essay sections to be the richest source of information on the different types and behaviors of individuals and how people present themselves on OkCupid. First, we created a word cloud to determine some of the most used words in these essay sections.

Through this word cloud and investigation into word frequency, we discovered many interesting insights from the data. Firstly, words such as “like” and “love” are the most frequently used words. This is to be expected: it makes sense that users would use these words to describe themselves and who they are interested in. The next most common are words such as “Music”, “Food”, “Friends”, “Time”, and “Movies”, which are broad categories of common interests. This, combined with the fact that “like” and “love” are most frequently used, seems to indicate that most users use their essay sections to describe their most common interests.
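A word-frequency count of this kind can be sketched with the standard library alone. The toy essays, the regular-expression tokenizer, and the tiny stop-word list are assumptions made for the example; a real analysis would likely use a fuller stop-word list.

```python
import re
from collections import Counter

# Toy essays standing in for the profile text.
essays = [
    "I love music and I like good food",
    "I like movies, music, and time with friends",
    "Love to cook food for friends",
]

# A deliberately tiny, assumed stop-word list for the example.
STOP_WORDS = {"i", "and", "to", "for", "with", "good"}

def word_counts(texts):
    """Count lowercase word tokens across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

counts = word_counts(essays)
print(counts.most_common(5))
```

The resulting counts are also what a word cloud visualizes: more frequent words are drawn larger.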

This result might not be too surprising, but it does have some interesting implications. It shows how users tend to avoid controversial topics and avoid coming off too strong, preferring instead to talk about things that are commonly enjoyed. It also shows that users like using their personal interests as a proxy for how compatible other individuals are: people tend to think that if others share the same taste in music, food, movies, etc., then they will be compatible to date as well.

On the other hand, what’s missing from these most frequent words are words that describe deeper thoughts or personality traits. Words like “shy”, “outgoing”, “extroverted”, and “introverted” are all missing from the most common words, even though one might think that they should be there. This indicates that people avoid exposing their true selves or personalities on their profile at first, even though doing so might help make a good match. If you want to stand out on your profile and find better matches, you might want to consider talking more about your personality and who you are as a person, along with your other interests.

K-Means Clustering

A point of interest we wanted to investigate further was whether there are different categories or types of behavior people exhibit when using the essay sections and presenting themselves online. To do this, we used a K-Means Clustering model to cluster people into different groups based on the words they use in their essays.

To create the model, we turned the 200 most common words into variables, recording for each individual the number of times they used each of those words. For example, if an individual used “like” three times and “love” two times, a three and a two were recorded for those variables. Together, these counts define a point for the individual.
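The feature construction described above might look like the following sketch, which keeps a vocabulary of the most common words and counts each word per essay. The helper names (tokenize, build_features) and the toy essays are hypothetical, and the vocabulary size is shrunk from 200 to 2 to keep the example readable.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_features(essays, vocab_size):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in essay i."""
    totals = Counter()
    for essay in essays:
        totals.update(tokenize(essay))
    vocabulary = [word for word, _ in totals.most_common(vocab_size)]
    matrix = []
    for essay in essays:
        counts = Counter(tokenize(essay))
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

essays = [
    "like like like love love",  # "like" three times, "love" twice
    "love hiking",
    "like food",
]
vocabulary, features = build_features(essays, vocab_size=2)
print(vocabulary)
print(features)
```

Each row of the matrix is the point that represents one individual in the clustering step.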

The K-Means algorithm works by first specifying a number of clusters to generate. The algorithm then randomly picks points to serve as the initial centroid of each cluster, assigns every point to its closest centroid, recomputes each centroid as the mean of the points assigned to it, and repeats these two steps until the assignments stop changing.
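To make the loop concrete, here is a minimal pure-Python sketch of the algorithm. It illustrates the general technique under simple assumptions (squared Euclidean distance, random starting points, toy two-dimensional data) and is not necessarily the implementation used for this analysis.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random points as starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = mean_point(members)
    return labels, centroids

# Toy word-count points forming two obvious groups.
points = [[3, 2], [4, 2], [3, 3], [0, 9], [1, 8], [0, 8]]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can number the clusters differently, and on harder data they can also lead to different groupings.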

To determine how many clusters to generate, multiple models with different numbers of clusters can be created, and the inertia for each model can be calculated. The inertia of a model is the sum of the squared distances from each point to its closest centroid. The ideal number of clusters can be determined by finding the point from which the inertia starts decreasing linearly with the number of clusters. As shown below, this was four clusters for our model: after four clusters, the graph begins decreasing approximately linearly.
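The inertia calculation can be sketched as follows, using the sum-of-squares convention (dividing by the number of points would give the average squared distance instead, without changing where the bend in the curve falls). The toy points and the hand-picked centroids for each number of clusters are assumptions for illustration, not fitted models.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inertia(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Two tight pairs of points, far apart from each other.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]

# Hand-picked centroids for k = 1, 2, 3 (assumed, not learned).
candidates = {
    1: [[5, 1]],
    2: [[0, 1], [10, 1]],
    3: [[0, 0], [0, 2], [10, 1]],
}

inertias = {k: inertia(points, cs) for k, cs in candidates.items()}
for k, value in inertias.items():
    print(k, value)
```

Going from one to two clusters removes almost all of the inertia, while the third cluster barely helps. That sharp bend is the elbow one looks for when choosing the number of clusters.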

Through the K-Means model, we were able to determine that there were four general types of behavior individuals exhibit based on what words they use. To gain insight into what each of these behaviors entails, we created logistic regression classification models for each of the clusters using the individuals’ word usage variables and the labels provided by the K-Means model. For the logistic model of each cluster, we interpreted the variables with the highest coefficients as being the most positively associated with belonging to the cluster, and the variables with the lowest coefficients as being the most negatively associated with belonging to the cluster. Our results are shown below.
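The coefficient-reading step can be illustrated with a small one-vs-rest logistic regression trained by plain gradient descent. Everything here is an assumption for the example: the three-word vocabulary, the toy counts, the cluster labels, and the training settings. A library implementation would normally be used in practice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, targets, lr=0.1, epochs=500):
    """Plain batch gradient descent on the log loss; returns (weights, bias)."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy word counts per user and the cluster each user was assigned to.
vocabulary = ["like", "love", "sports"]
features = [[3, 2, 0], [2, 3, 0], [0, 0, 2], [0, 1, 3]]
cluster_labels = [0, 0, 1, 1]

# One-vs-rest target: 1 if the user is in the cluster of interest, else 0.
target_cluster = 0
targets = [1 if label == target_cluster else 0 for label in cluster_labels]

weights, _ = fit_logistic(features, targets)
ranked = sorted(zip(vocabulary, weights), key=lambda pair: pair[1], reverse=True)
for word, weight in ranked:
    print(f"{word}: {weight:+.2f}")
```

Words at the top of the ranking are positively associated with the chosen cluster, and words at the bottom are negatively associated with it.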

First, Cluster 1 is the least common group, not even making up 10 percent of total users. This group seems to be the “uncontroversial” group- they tend to use words such as “like” and “love” very often, which are relatively positive but not too strong. On the other hand, they avoid words such as “hot”, “eyes”, and “sex” which are stronger and more dominant. In general, they seem to be trying to appear appealing to everyone.

Next, Cluster 2 makes up almost 18 percent of total users. These users seem to be individuals who are not looking for long-term relationships and are mainly looking to date around casually. Words like “sex”, “friday”, “watch”, and “sports” support this and likely give their essays a more casual feel, while the avoidance of words like “love”, “life”, and “people” suggests they steer clear of the more serious topics that would indicate a search for a long-term relationship.

After that, Cluster 3 is by far the most common group, making up more than 55 percent of all users. These users seem to be looking for something more serious and possibly longer-term partners, with words like “honest”, “hot”, and “funny” indicating that they are looking for specific attributes in their potential significant others. On the other hand, the avoidance of words like “love”, “like”, and “life” indicates that these users are also not taking the approach that Cluster 1 is, and they are not simply trying to appeal to everyone.

Finally, Cluster 4 makes up almost the same number of users as Cluster 2, at almost 17 percent. They also share many of the same attributes with Cluster 3, talking more about activities in their essays. However, words such as “sushi”, “company”, “conversation”, and “laughing” indicate that they are looking to go on more classy, business-style dates than the casual dates of the Cluster 2 individuals, and they could potentially be looking for more serious relationships as well.

Sentiment Analysis

The type of language people use may be divided into three dimensions: arousal, dominance, and valence [Mohammad]. Arousal is sluggish vs stimulated language. Dominance is weak vs powerful language. Valence is negative vs positive language. The type of language we use is a reflection of how we want to convey ourselves and how we want to be perceived.

OkCupid users may answer many different free-response questions when creating their profiles. An example is “What I value is …”. We analyzed the language used in our dataset’s essays through a process called sentiment analysis. The NRC VAD Lexicon assigns each word it covers three scores, one for each dimension. We divided our dataset by gender, calculated the scores for all applicable words, and plotted their average scores in arousal, dominance, and valence.
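A lexicon-based scoring step of this kind can be sketched as below. The three-word lexicon and its scores are placeholders invented for the example and are not values from the NRC VAD Lexicon. The function name average_vad is likewise hypothetical.

```python
import re

# Placeholder scores for illustration only; these are NOT values from the
# NRC VAD Lexicon. Each entry is (valence, arousal, dominance).
TOY_LEXICON = {
    "love": (0.9, 0.6, 0.6),
    "calm": (0.7, 0.1, 0.5),
    "angry": (0.1, 0.9, 0.6),
}

def average_vad(text, lexicon):
    """Average the scores of the words found in the lexicon; None if no word matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scored = [lexicon[t] for t in tokens if t in lexicon]
    if not scored:
        return None
    return tuple(sum(dim) / len(scored) for dim in zip(*scored))

result = average_vad("I love calm evenings", TOY_LEXICON)
print(result)
```

Words that do not appear in the lexicon are skipped, which is why only the “applicable” words contribute to each average. Running the same function over the essays of each gender and averaging again would give one score per dimension per group.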

We can see that across genders there is minimal difference in the degree of language for all three dimensions. This is evidence that the way we present ourselves does not differ between men and women, even though our initial belief was that men may be more dominant or aroused compared to women. In these two areas, the scores are middle-ground, and for valence, the scores lean more positive. Perhaps people do not want to come off too strong, as it may put off their potential match, but do want to come off as a positive person in their dating profiles. Staying positive is typical dating advice: people don’t want to talk to a downer. To stand out more, we may suggest using words with more arousal, like “excessively”, “rigorous”, “overpower”, and “overflowing”.

One issue with our sentiment analysis is that a group of words may have a completely different meaning than the sum of its individual words, and word-level scores misrepresent this. One example is “one-night stand”. Scored word by word, the phrase receives low arousal and dominance scores; taken as a whole, however, it has an entirely different meaning. Sexual words tend to have high scores in arousal and dominance, so when people say they are looking for a one-night stand, our sentiment analysis misinterprets their energy.

Challenges

When conducting the clustering analysis, we only looked at the top 200 most frequent words used, and we only looked at the words individually instead of as possible phrases. While we were able to settle on a clear number of clusters, we could possibly have created better clusters if we had considered more words, as well as groups of words together.

Also in the clustering analysis, we created logistic regression classification models and used the magnitude and sign of the coefficients to determine which words were positively and negatively associated with each cluster. While this is usually a good method, it is not an entirely robust way of determining the strength and direction of association of words with different clusters, and it can lead to some incorrect interpretations. For example, “working” was very positively associated with Cluster 2 and very negatively associated with Cluster 4, while “work” was very negatively associated with Cluster 2 and very positively associated with Cluster 4. However, “work” and “working” are essentially the same word and should have at least the same direction of association for any given cluster.

Sources

http://saifmohammad.com/WebPages/nrc-vad.html

GitHub link:

https://github.com/brndnzh/datares-spring-team-10