Will You Get a Second Date? — A Predictive Model of Dating Match Rates
By: Alan Wang (Lead), Zoeb Jamal, Justin Gong, Brandon Pan, Kowoon Jeong, and Yashas Jain
When selecting a romantic partner, many attributes come into play. On a first date especially, which depends heavily on initial impressions, one's decision is often skewed by superficial factors such as physical characteristics and occupation. Our data looks closely at how these factors influence attraction. Imported from Kaggle, the data set was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar from an experimental speed dating event conducted between 2002 and 2004. Participants were randomly assigned to one another to speed date for four minutes, and this process was repeated. Before the event, participants answered numerous questions, ranging from personal questions to rating the importance of different attributes in a romantic partner: attractiveness, sincerity, intelligence, fun, ambition, and shared interests. At the end of the four minutes, the pair was considered a match if both parties agreed to a second date. The data set gives a holistic overview of how match rates are affected by a number of factors, allowing us to predict an approximate match rate for any given individual.
Exploratory Data Analysis
Before constructing a model, we analyzed specific relationships between match rate and attributes to get an idea of what may influence a match. We first looked at more explicit characteristics of the participants, specifically attributes that are uncontrollable by an individual. The following visualizes one of these variables: ethnicity.
With an average match rate of 16.47% across all participants, Figure 1 shows a clear deviation from the average depending on the ethnicity of the individual. Asians yielded a match rate of 13.47%, Africans 20.23%, Latinos 18.52%, Europeans and Caucasians 16.67%, and others 19.73%. This demonstrates that an individual's race factors into whether or not they receive a match.
While we observed physical characteristics of an individual, we also looked at deeper attributes:
Figure 2 compares match rate to profession, revealing a clear disparity between occupations. Those in medical fields tend to achieve higher match rates, whereas those still pursuing an education achieve much lower ones. Notably, how lucrative a profession is does not play a major role: those in language-related fields receive a much higher match rate than those in higher-paying fields such as math and engineering. After fitting a linear regression of match rate on income, we confirmed the lack of correlation, as the coefficient of determination, R², was less than 0.05.
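A check of this kind can be sketched in Python with scikit-learn (the authors worked in R; the data below is a synthetic stand-in for the actual income and match-rate columns, generated so that the two are independent). A near-zero R² from the simple linear fit indicates income explains almost none of the variance in match rate:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the survey's income and match-rate columns.
rng = np.random.default_rng(0)
income = rng.uniform(20_000, 120_000, size=200).reshape(-1, 1)
match_rate = rng.uniform(0.0, 0.4, size=200)  # independent of income by construction

model = LinearRegression().fit(income, match_rate)
r_squared = model.score(income, match_rate)  # coefficient of determination
print(f"R^2 = {r_squared:.3f}")
```

With independent variables, R² stays close to zero, matching the article's threshold of 0.05.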
Lastly, we also recognize that match rate relies on two conditions: an individual's partner being willing to date them again, and the individual themselves being willing to date their partner again. For example, if an individual heavily values sincerity but the average participant in the experiment was very insincere, that individual would reasonably have a low match rate even if they were attractive. Thus, it is important to analyze what impacts an individual's "pickiness" and how that may influence their match rate. The following graphs look at notable variables that impact "pickiness":
Figure 4 demonstrates a positive relationship between match rate and number of previous dates. Those who had been on the most dates yielded a considerably higher match rate of 30.95%, with a steep decline as the number of previous dates decreased. This could be explained by participants being more particular about what they want in a partner when they have been on fewer dates, whereas those who have been on more dates are less selective. Figure 5 shows how gender may impact whether an individual decides to match with someone, as males and females value different qualities in their partners. The attributes participants rated were attractiveness, sincerity, intelligence, ambition, fun, and shared interests. Participants were given 100 points to distribute among the six categories and were instructed to give more points to attributes they value more in a romantic partner. Most notably, males tend to value attractiveness more highly than females do.
Due to the magnitude of the data set, we ran into multiple problems when cleaning and organizing the data. Many values were missing or recorded as strings that cannot be processed numerically. Because the data came from an open-ended survey, there were many inconsistencies between data points. For the location of residency, for example, each participant formatted their answer in their own way: one may have written "NJ" to represent New Jersey, another "New Jersey", and another "New Jersey, USA". To fix these inconsistencies, we searched for patterns in the text and assigned common keywords to their respective categories. Some columns, such as 'career,' 'from,' and 'field,' were categorical and cannot be compared with numerical columns in the same manner, so when building models we either removed a variable if it had low importance or encoded its values numerically; the field of study/profession category, for instance, was encoded 1–18. To address NA values, we cleaned the data set to exclude empty fields, which reduced the number of viable participants to analyze. Lastly, we converted string values into numeric values where possible: the income values were all character strings because the survey recorded commas in the numbers, so we used string manipulation to remove the commas and coerce the strings into numeric values.
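The cleaning steps above can be sketched in pandas (the authors worked in R; the column names and example rows here are illustrative assumptions, not the data set's actual schema):

```python
import pandas as pd

# Illustrative rows; column names are assumptions, not the dataset's exact schema.
df = pd.DataFrame({
    "from": ["NJ", "New Jersey", "New Jersey, USA", "California"],
    "income": ["45,000", "62,500", None, "38,000"],
    "field": ["Law", "Engineering", "Law", "Medicine"],
})

# Collapse free-text locations by keyword matching.
df["from"] = df["from"].str.lower().apply(
    lambda s: "new jersey" if s == "nj" or "new jersey" in s else s
)

# Strip thousands separators and coerce income to numeric (bad values become NaN).
df["income"] = pd.to_numeric(df["income"].str.replace(",", ""), errors="coerce")

# Encode the categorical field of study as integer codes (the article used 1-18).
df["field_code"] = df["field"].astype("category").cat.codes + 1

# Exclude rows with missing values, as described above.
df = df.dropna()
print(df)
```

Each step mirrors one fix from the paragraph: keyword matching for locations, comma removal and coercion for income, integer encoding for categorical fields, and NA exclusion.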
Model Building and Final Model
Deciding on our model led us through multiple methods: we considered multiple linear regression, random forests, support vector machines, and k-nearest neighbors.
Our first step in feature selection was encoding the categorical predictors for use in our model. We created dummy variables for each level of each categorical predictor, where 1 indicated the presence of that level and 0 otherwise. After one-hot encoding, we normalized and scaled the remaining predictors, since some were on a scale of 1–10 (attractiveness) while others ranged from 0 to 2400 (intelligence). We also removed predictor columns with more than half of their observations missing. This was accomplished through the caret library in R, where we used the train function to center and scale the predictors before building our models.
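These preprocessing steps can be sketched in pandas (the authors used caret in R; the toy frame and column names below are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the predictors; names and values are illustrative only.
df = pd.DataFrame({
    "attractiveness": [7, 4, 9, 6],              # 1-10 scale
    "intelligence":   [1400, 2100, 1800, 1250],  # 0-2400 scale
    "gender":         ["male", "female", "female", "male"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Drop columns with more than half of their observations missing.
df = df.loc[:, df.isna().mean() <= 0.5]

# One-hot encode categorical predictors (1 = level present, 0 = absent).
df = pd.get_dummies(df, columns=["gender"])

# Center and scale numeric columns, mirroring caret's preProcess = c("center", "scale").
for col in ["attractiveness", "intelligence"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

print(df.round(2))
```

After this, every numeric predictor has mean 0 and unit standard deviation, so no single scale (1–10 versus 0–2400) dominates the distance calculations a kNN model relies on.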
To begin the prediction process, we created a new variable, match_rate, to serve as our response. The data recorded an individual match indicator (0 or 1) for each speed-dating round a contestant had, so to get each contestant's overall match rate we grouped the data by contestant ID, summed their matches, and divided by the number of unique observations for that contestant.
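The match_rate construction might look like the following pandas sketch (the `iid` and `match` column names are assumptions; summing matches and dividing by the round count is equivalent to taking the per-contestant mean):

```python
import pandas as pd

# One row per speed-dating round; `match` is 1 if both parties agreed to a second date.
rounds = pd.DataFrame({
    "iid":   [1, 1, 1, 1, 2, 2],   # contestant ID (column name assumed)
    "match": [1, 0, 0, 1, 0, 0],
})

# Sum of matches per contestant divided by their number of rounds.
match_rate = rounds.groupby("iid")["match"].mean().rename("match_rate")
print(match_rate)
```

Here contestant 1 matched in 2 of 4 rounds (rate 0.5) and contestant 2 in 0 of 2 (rate 0.0).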
After centering and scaling, we settled on a k-nearest neighbors model: compared with the other methods we tried (linear regression, support vector machines, and random forests), k-nearest neighbors produced the best evaluation metrics, with an RMSE of 0.018, an R-squared of 0.984, and an MAE of 0.003. By comparison, our linear regression models reached an R-squared of 0.53 and our support vector machine models 0.48, a result that held across each of the kernels we tried for the support vector machine. The second-best performer was the random forest model, which topped out at an R-squared of 0.97, but we ultimately went with k-nearest neighbors.
For our k-nearest neighbors model, we used cross-validation to determine the best value of k, the number of nearest neighbors used in the final model. We ran 10-fold cross-validation with 3 repeats and arrived at a k-value of 5. The metric table shows an exhaustive set of evaluation metrics for a range of different k's:
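The tuning procedure described above, 10-fold cross-validation with 3 repeats over candidate values of k, can be sketched with scikit-learn (the authors used caret in R; the data below is a synthetic stand-in for the scaled predictors and match_rate response):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the scaled predictor matrix and match_rate response.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=200)

# 10-fold cross-validation repeated 3 times, as in the article, over a grid of k.
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
```

GridSearchCV averages the RMSE over all 30 folds for each candidate k and refits the model with the winner, which mirrors how caret's train selects a final k.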
Ultimately, our data visualizations show several variables that could influence the probability of matching with a partner in a speed dating round. Using data on over 8,000 individuals, notable observations came from variables such as gender, ethnicity, age, and profession. Furthermore, the kNN algorithm let us estimate an individual's match rate from the participants most similar to them. However, many attributes that contribute to a person's perception of attraction were not included in this data set. It is important to understand that attraction is subjective and cannot be measured precisely, even with big data; we can only predict how certain characteristics have, on average, caught the attention of the opposite gender. In the end, every couple is different, and we all have different ideal types.