You Got a Friend in Me — A Predictive Model of Ideal Friendships

Jun 21, 2022

By: Alan Wang (Lead), Justin Gong, Yashas Jain, Kowoon Jeong, & Derek Nakagawa

Source: Vitality Magazine

Introduction

Whether it’s high school, college, or the workplace, humans are constantly placed in environments where they are surrounded by new faces. In these situations, we gravitate towards those with a character type we prefer. Oftentimes, this happens subconsciously as people seek out those with personality types that complement their own. But what if there were a way to quantify these subconscious decisions and predict who would be friends with whom in a crowd of strangers? In this article, we explore the relationships between personality traits measured by a self-evaluated survey and use them to predict ideal friendships.

Imported from Kaggle, the dataset that we use was collected by Open Psychometrics from 2016–2018 through an online personality test. This test asks a range of questions relating to the Big Five personality traits: extraversion, agreeableness, emotional stability (neuroticism), conscientiousness, and openness (intellect). Our project replicates this survey among members of DataRes at UCLA and uses the responses to construct a predictive model that takes in participants’ answers and outputs ideal friendships amongst them.

Data Processing

The size of the dataset and the way the survey was conducted raised a few issues. For one, we have to assume that participants’ answers accurately reflect how they behave in real life. The survey was also voluntary, so the respondents are not a random sample. Finally, for many of the models we worked with random samples drawn from the data rather than the entire dataset of over one million participants.

Another problem was that some of the questions were worded in opposite directions, like “EXT1: I am the life of the party” (positive) and “EXT2: I don’t talk a lot” (negative), so a 5 on one means the opposite of a 5 on the other. To fix this, for all of the even-numbered questions, which were negative like EXT2 above, we subtracted the score from 6 to get the equivalent positive score. We then added all ten scores for a category together, divided by the maximum score of 50, and multiplied by 100 to get each participant’s score for that category.
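The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code we ran: the function name `trait_score` and the example answers are made up, and it assumes ten questions per trait answered on a 1 to 5 scale, with every even-numbered question reversed as described.

```python
def trait_score(answers):
    """Turn ten 1-5 answers for one trait into a percentage score.

    Assumes the even-numbered questions (2nd, 4th, ...) are the
    negatively worded ones, as described in the text.
    """
    if len(answers) != 10:
        raise ValueError("expected 10 answers per trait")
    total = 0
    for number, answer in enumerate(answers, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("answers must be on a 1-5 scale")
        # Reverse negatively worded items: 5 becomes 1, 4 becomes 2, ...
        total += 6 - answer if number % 2 == 0 else answer
    # Divide by the maximum of 50 and multiply by 100.
    return 100 * total / 50


# Hypothetical answers to EXT1..EXT10 for one participant.
example_answers = [4, 2, 5, 1, 3, 3, 4, 2, 5, 1]
print(trait_score(example_answers))
```

Because every answer contributes at least 1 point after reversal, the lowest possible score under this rule is 20 rather than 0.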

Exploratory Data Analysis

Analyzing the relationship between demographic variables and personality

Figure 1

Figure 1 shows that age, ethnicity, and gender are not significant factors in determining an individual’s personality type. All groups, regardless of age, ethnicity, or gender, follow a similar pattern: a high intellect score and a low extraversion score. This suggests that personality type is largely independent of demographics, and that people from different demographic groups can get along just as easily as people from the same one.

Relationship between specific character traits

Figure 2

As displayed in Figure 2, the relationship between traits varies based on which two are being compared. In these density maps, extraversion and agreeableness are compared to openness in blue and to neuroticism in orange. The blue density maps show a stronger linear relationship, suggesting that those who are more open also tend to be more agreeable and extroverted. In the orange maps on the right, there is no clear relationship between neuroticism and agreeableness or extraversion. This provides insight into how personality traits correlate within a given person. Additionally, it suggests that openness may have a higher variance inflation factor (VIF) than neuroticism when constructing our model.
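One way to put a number on the linear relationships seen in the density maps is Pearson’s correlation coefficient. The sketch below is illustrative only: the function names and the sample scores are invented, and the VIF helper covers just the simplest case of two predictors, where VIF = 1 / (1 - r²). With more predictors, VIF comes from regressing each predictor on all of the others.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(r):
    """VIF when the model has exactly two predictors with correlation r."""
    return 1 / (1 - r ** 2)


# Hypothetical trait scores for six people (not taken from the dataset).
openness = [55, 60, 72, 80, 85, 90]
agreeableness = [50, 58, 70, 74, 88, 86]
r = pearson_r(openness, agreeableness)
print(round(r, 3), round(vif_two_predictors(r), 3))
```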

Comparing DataRes vs All

Figure 3

Using over 690,000 observations from our dataset, we created five histograms to visualize the distribution of each personality trait. These give us a clear baseline against which to compare the score frequencies within our club of over 30 people. Working from each person’s trait scores, we found histograms to be the most suitable way to analyze the distributions.

The average score for each personality trait was calculated and is compared below:

From the comparison visualizations, we can see that DataRes members somewhat resemble the general public, with similar characteristics and histogram shapes. The average DataRes member is more extroverted (we can definitely confirm this), less neurotic (and this), more agreeable, more conscientious, and less open than the general public.

Matching People

We started with a lookup table of the score categories that match well for each trait and turned it into a diagonal matrix. We then built a grid with each person as the row index and each potential match as the column index, and rated how well each pair matched on a trait (for example agreeableness, extraversion, or conscientiousness) on a scale of low, fair, good, and high, which we converted into a numeric value. We did this for each of the five categories and added the values up to get a corresponding compatibility score. Thus, the highest score a person could

When observing the personality scores and match rates of our team members, it was surprising to compare our compatibilities. For instance, Alan, who is placed in cluster 1, received Justin, who is in cluster 5, as one of his worst matches. This is partially due to the fact that Alan scores higher on agreeableness and openness than Justin. According to the matrix of indices showing which scores match well in each of the five personality categories, Alan’s and Justin’s agreeableness scores did not correspond to a good match. Moreover, although Kowoon and Justin scored differently and were placed in separate clusters, they received similar lists of well-matched people. This signifies that different personality types can get the same compatibility score with certain individuals. In sum, there is no single personality profile that guarantees a good match, since multiple combinations of scores correspond to a good match.
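The matching procedure in this section can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the numeric values for low, fair, good, and high, the trait abbreviations, the names and scores, and above all the `match_level` rule, which simply treats closer scores as a better match and stands in for our actual lookup table of well-matched categories.

```python
TRAITS = ("EXT", "EST", "AGR", "CSN", "OPN")

# Hypothetical numeric values for the four match levels.
LEVEL_VALUE = {"low": 0, "fair": 1, "good": 2, "high": 3}


def match_level(score_a, score_b):
    """Placeholder rule (an assumption): closer scores match better.

    The real project used a lookup table of categories that match
    well, which is not reproduced here.
    """
    gap = abs(score_a - score_b)
    if gap <= 10:
        return "high"
    if gap <= 25:
        return "good"
    if gap <= 40:
        return "fair"
    return "low"


def compatibility(person_a, person_b):
    """Sum the numeric match level over all five traits."""
    return sum(
        LEVEL_VALUE[match_level(person_a[t], person_b[t])] for t in TRAITS
    )


def compatibility_matrix(people):
    """Nested dict: row person -> column person -> compatibility score."""
    return {
        a: {b: compatibility(people[a], people[b]) for b in people if b != a}
        for a in people
    }


def ranked_matches(people, name):
    """Other participants sorted from best to worst match for `name`."""
    row = compatibility_matrix(people)[name]
    return sorted(row, key=row.get, reverse=True)


# Hypothetical participants with percentage scores per trait.
people = {
    "Ana": {"EXT": 80, "EST": 60, "AGR": 90, "CSN": 50, "OPN": 70},
    "Ben": {"EXT": 78, "EST": 62, "AGR": 85, "CSN": 55, "OPN": 72},
    "Cy": {"EXT": 30, "EST": 20, "AGR": 40, "CSN": 95, "OPN": 20},
}
print(ranked_matches(people, "Ana"))
```

Swapping in a different `match_level` rule, for example one that rewards complementary rather than similar scores, changes the rankings without touching the rest of the pipeline.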

Analysis

Clustering graph:

Figure 4

K-means clustering, then PCA

Once we matched our participants, we were also interested in seeing which participants were similar to each other: we wanted to see if there was any discernible structure in the data. To accomplish this, we used k-means clustering. We set the number of clusters at 5, an arbitrary choice that we felt was a solid middle ground between having enough clusters to distinguish our 31 participants and not so many that the clusters became meaningless. Using the ClusterR library, we ran the k-means algorithm until it converged, ending up with cluster groups that we defined as:

1. Worried, agreeable extravert
2. Thorough, chill, and amicable
3. All-rounder
4. Thorough, even-keel introvert
5. Worried, agreeable extravert
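For readers who have not seen k-means before, here is a minimal, dependency-free Python sketch of the standard algorithm (Lloyd’s algorithm). It is not the ClusterR code we ran in R. The function names, the random initialization, and the sample points are assumptions chosen to keep the example short.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical (extraversion, agreeableness) scores for six people.
scores = [[20, 30], [25, 35], [22, 28], [80, 85], [78, 90], [85, 88]]
labels, centroids = k_means(scores, k=2)
print(labels)
```

In practice each participant would be a point with five coordinates, one per trait, and k would be 5 as in our analysis.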

We then used dimension reduction to visualize our data on the graph shown above. Principal component analysis (PCA) takes the five trait dimensions and decomposes the data matrix. Using the eigenvalues and eigenvectors, we get the top two principal components, which explain the most variance in the data. These two components are shown on the axes and explain 26.8% and 35.6% of the variance, the largest shares of any components.
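The PCA step can be illustrated with a small pure-Python sketch. It finds the top two eigenvectors of the covariance matrix by power iteration with deflation, a simple stand-in chosen so the example needs no libraries. Real analyses would normally call a linear-algebra or statistics package. The function names and the sample scores are hypothetical, and the sketch does not standardize the scores first, which a real pipeline might do.

```python
import math


def covariance(rows):
    """Return (mean-centered rows, sample covariance matrix)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    return centered, cov


def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    dims = len(matrix)
    vec = [float(i + 1) for i in range(dims)]  # arbitrary starting vector
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        product = mat_vec(matrix, vec)
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0:
            break
        vec = [p / norm for p in product]
    value = sum(v * p for v, p in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def pca_top_two(rows):
    """Project rows onto the top two components; also return the
    share of total variance each component explains."""
    centered, cov = covariance(rows)
    dims = len(cov)
    total_variance = sum(cov[i][i] for i in range(dims))
    components, shares = [], []
    for _ in range(2):
        value, vec = dominant_eigenpair(cov)
        components.append(vec)
        shares.append(value / total_variance)
        # Deflation: remove the component just found from the matrix.
        cov = [
            [cov[i][j] - value * vec[i] * vec[j] for j in range(dims)]
            for i in range(dims)
        ]
    projected = [
        [sum(c * w for c, w in zip(row, comp)) for comp in components]
        for row in centered
    ]
    return projected, shares


# Hypothetical five-trait scores for five people.
data = [
    [80, 60, 90, 50, 70],
    [30, 20, 40, 95, 20],
    [55, 65, 60, 70, 85],
    [70, 40, 75, 45, 60],
    [25, 35, 50, 90, 40],
]
coords, explained = pca_top_two(data)
print([round(share, 3) for share in explained])
```

Each row of `coords` is one person’s position on the two axes of a scatter plot like Figure 4, and `explained` holds the variance shares that label those axes.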

Analyzing each cluster group:

Since each of these clusters groups together people with similar personality traits, we took the average score of each of the Big Five personality traits for each cluster. The results are displayed below.

As seen above, each cluster had unique characteristics that defined its personality. For example, cluster group 2 had particularly low neuroticism, but had the highest conscientiousness score. Meanwhile, cluster group 1 was clearly the most agreeable, but scored the lowest in conscientiousness.
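Computing the per-cluster averages is a simple group-and-mean operation. The sketch below shows one way to do it in plain Python. The function name, trait abbreviations, scores, and cluster labels are all invented for the example.

```python
def cluster_means(scores, labels):
    """Average each trait within each cluster.

    scores: list of dicts mapping trait name -> score, one per person
    labels: list of cluster ids, in the same order as `scores`
    """
    grouped = {}
    for person, label in zip(scores, labels):
        grouped.setdefault(label, []).append(person)
    return {
        label: {
            trait: sum(member[trait] for member in members) / len(members)
            for trait in members[0]
        }
        for label, members in grouped.items()
    }


# Hypothetical scores and cluster assignments for four people.
example_scores = [
    {"AGR": 90, "CSN": 40},
    {"AGR": 86, "CSN": 50},
    {"AGR": 60, "CSN": 92},
    {"AGR": 64, "CSN": 88},
]
example_labels = [1, 1, 2, 2]
print(cluster_means(example_scores, example_labels))
```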

Comparing cluster group’s compatibility scores

Figure 5

Figure 5 graphs each cluster group against its average compatibility. The average compatibility was calculated by taking each member of a cluster group and averaging their compatibility scores with the other members of DataRes. As seen in the bar graph, those in cluster group five tend to have higher compatibility scores with other people than those in cluster group one. However, this could be a result of the number of individuals in cluster group one. Only four participants were placed in cluster group one, so its compatibility score is heavily affected by each individual.
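The calculation behind Figure 5 can be sketched as two rounds of averaging: first each person’s mean compatibility with everyone else, then the mean of those values within each cluster. The function name, the names, and the numbers below are hypothetical.

```python
def average_compatibility_by_cluster(compat, labels):
    """Mean compatibility per cluster.

    compat: dict person -> dict other person -> compatibility score
    labels: dict person -> cluster id
    """
    # Step 1: each person's average score with all other participants.
    per_person = {
        name: sum(others.values()) / len(others)
        for name, others in compat.items()
    }
    # Step 2: average those per-person values within each cluster.
    by_cluster = {}
    for name, avg in per_person.items():
        by_cluster.setdefault(labels[name], []).append(avg)
    return {cluster: sum(v) / len(v) for cluster, v in by_cluster.items()}


# Hypothetical compatibility scores among three participants.
example_compat = {
    "Ana": {"Ben": 10, "Cy": 4},
    "Ben": {"Ana": 10, "Cy": 6},
    "Cy": {"Ana": 4, "Ben": 6},
}
example_clusters = {"Ana": 1, "Ben": 1, "Cy": 2}
print(average_compatibility_by_cluster(example_compat, example_clusters))
```

With only a handful of people in a cluster, as in the single-member cluster here, one person’s scores determine the whole bar, which is the small-sample caveat raised above.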

Conclusion

Through rigorous data processing and modeling, we were able to create a predictive model that allows people to find their best matches. Although it is hard to quantify and measure someone’s personality, we analyzed a simple representation of it by scoring each person on five personality traits. It was notable to see that certain personality types showed more compatibility with one another. However, in the end, every individual is different in who they become friends with, and many people find good relationships with those who are considered to have “incompatible” personalities. We did not intend to create a perfect model that would predict who your best friend is going to be, but hoped to estimate which personality types people would best get along with and form lasting friendships with.