The Maze of Amazon Reviews

DataRes at UCLA
Dec 21, 2023

Authors: Sia Phulambrikar (Project Lead), Nathan Chen, Srivarsha Rayasam, Michelle Sun, Aashna Sibal, Shaashwath Sivakumar, Arely Aguirre, Katrina Iguban

INTRODUCTION

This article delves into the vast sea of Amazon reviews, seeking to uncover the factors that influence how customers rate products. We used a dataset spanning over 10 years of customer reviews for Amazon Fashion products, gathered from Jianmo Ni, Jiacheng Li, and Julian McAuley's research, "Justifying recommendations using distantly-labeled reviews and fine-grained aspects."

Our group focused on five pivotal relationships:

  • Brand: does brand influence Amazon review ratings?
  • Price dynamics: do people rate highly priced products higher?
  • Upvoting trends: do people vote more for reviews with higher ratings?
  • Time: are ratings for products time-variant?
  • Language choice and ratings: how does review word choice relate to a product's star rating? Is it possible to predict a product's star rating from a single word in a review?

First, we conducted exploratory research to get an overview of the factors mentioned above. Then, we homed in on language choice and ratings, building deep learning models that try to predict a product's rating from the language used in its reviews.

Brand: does Brand influence Amazon Review ratings?

One of our first hypotheses about reviews on Amazon was that people would write reviews if they felt strongly about a product and had something either highly positive or highly negative to say about it. We decided to test this hypothesis by visualizing the volume of reviews by rating.

From the bar graph above, we can see that the majority of reviews rated products 4 or 5 stars.

Based on these findings, we wanted to research if the brand of the item was an influential factor in a product receiving a high rating. To start, we filtered the data to determine the top 3 brands within the dataset, which were Boomer Eyeware, Kidorable, and MJ Metals Jewelry.
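As a rough illustration, a pandas sketch along these lines could reproduce the filtering step. The file name and the assumption that the reviews have already been merged with product metadata (which holds the brand and price columns) are ours, not part of the original pipeline:

```python
import pandas as pd

# Reviews assumed to be merged with product metadata; column names such as
# 'brand' and 'overall' (the star rating) follow the Amazon Fashion schema.
df = pd.read_json("AMAZON_FASHION.json.gz", lines=True)

# Keep only rows that have both a brand and an overall rating
df = df.dropna(subset=["brand", "overall"])

# The three most frequently listed brands in the dataset
top_brands = df["brand"].value_counts().head(3).index
top_df = df[df["brand"].isin(top_brands)]
print(top_brands.tolist())
```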

The boxplot below visualizes the distribution of the ratings these brands received. Based on the graph, it seems that Boomer Eyeware tends to receive lower ratings, while Kidorable's ratings are typically higher. Additionally, the MJ Metals Jewelry box plot collapsed to a single black line, likely due to the limited amount of data for that brand.
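A boxplot like the one described can be drawn with seaborn, reusing the top_df subset from the sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of star ratings for each of the three top brands
sns.boxplot(data=top_df, x="brand", y="overall")
plt.xlabel("Brand")
plt.ylabel("Star rating")
plt.show()
```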

Subsequently, we created a linear regression model to investigate the relationship between brand and overall rating amongst the top 3 brands.
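A minimal statsmodels sketch of this kind of regression, under the assumption that brand enters as a categorical (dummy-coded) predictor with a default baseline level, looks like this:

```python
import statsmodels.formula.api as smf

# Regress the star rating on brand, treated as a categorical predictor
# ('top_df' is the subset restricted to the three most common brands)
brand_model = smf.ols("overall ~ C(brand)", data=top_df).fit()
print(brand_model.params)   # one coefficient per brand level, relative to the baseline
print(brand_model.summary())
```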

Based on the results of the linear regression model, we created the above coefficient plot to visualize the coefficient for each brand, indicated by the blue dots. As you can see, all of the coefficients are positive, meaning that when an item belongs to one of these brands, its rating tends to increase.

One of the metrics we used to evaluate this regression model was an ANOVA table. ANOVA tables break down the total variance into two components: the variance explained by the regression model (explained variance) and the variance left unexplained (residual variance).

The PRE value, or the proportional reduction in error achieved by this model compared to the empty model, is 0.7148. This is a fairly large value and provides evidence that our model fits the data well.
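For reference, both the ANOVA breakdown and the PRE can be pulled from the fitted model; here is a sketch using statsmodels. For a single model compared against the empty (intercept-only) model, PRE reduces to R-squared:

```python
import statsmodels.api as sm

# ANOVA table: variance explained by the brand model vs. residual variance
print(sm.stats.anova_lm(brand_model, typ=2))

# PRE compares the brand model's error against the empty model's error
sse_model = brand_model.ssr            # residual (unexplained) sum of squares
sst_empty = brand_model.centered_tss   # total sum of squares around the grand mean
pre = 1 - sse_model / sst_empty        # for a single model this equals R-squared
print(f"PRE = {pre:.4f}")
```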

Overall, based on this analysis, brand appears to positively influence a product's ratings. However, one should take into consideration the limitations of this analysis: we only considered the 3 most frequently listed brands in the dataset, and we only used data points that had values for both the brand and overall variables.

The relationship between Ratings and Price

As part of our exploratory analysis, we also wanted to explore the relationship between the maximum price people are willing to pay for a product and the rating they award that product. The above bar graph shows the results of our exploration: in general, highly rated items also cost the most. The maximum price for a product rated 5 stars was over $6000, while the maximum price paid for a product rated 1 star was under $1000. However, it is important to note that the maximum prices for one-star and three-star products were quite similar, even though three stars is a substantially better rating than one star. Following our examination of pricing dynamics, we turned our attention to voting trends, aiming to uncover patterns in how customers engage with and endorse reviews based on their ratings.
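A hedged sketch of how this kind of summary could be produced, assuming the price column from the product metadata has already been cleaned to a numeric type:

```python
import matplotlib.pyplot as plt

# Highest listed price observed for each star rating
# ('df' is the merged frame from the first sketch; 'price' assumed numeric)
max_price = df.dropna(subset=["price"]).groupby("overall")["price"].max()

max_price.plot(kind="bar")
plt.xlabel("Star rating")
plt.ylabel("Maximum price ($)")
plt.show()
```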

Voting Trends

Who is most likely to drop a vote on an Amazon review? Our team hypothesized that people would likely vote on reviews for products they felt strongly about; in other words, reviews that either gave highly positive ratings (such as 4 or 5 stars) or low ratings (like 1 star). To test this hypothesis, we first plotted the number of votes against the rating for each review.
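A minimal sketch of that first plot, assuming the raw vote field needs to be coerced from text (it is often missing and sometimes formatted with commas) to a number:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Coerce the 'vote' field to a numeric count; missing values become 0
df["vote_num"] = pd.to_numeric(
    df["vote"].astype(str).str.replace(",", ""), errors="coerce"
).fillna(0)

plt.scatter(df["overall"], df["vote_num"], alpha=0.3)
plt.xlabel("Star rating of the review")
plt.ylabel("Number of votes on the review")
plt.show()
```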

The initial scatter plot reveals that voters tend to vote more for reviews granting 5-star ratings, aligning with our expectations. Notable outliers include values for 3 and 4-star ratings, hinting at some deviations in voting behavior. Overall, Amazon users appear more engaged when endorsing positive reviews.

However, we realized that we had overlooked the fact that some products might accumulate more votes simply due to a higher volume of reviews. To address this limitation, we developed an enhanced scatterplot, as shown below. Here, each data point represents an individual product: we plotted the total number of votes against the total number of reviews for that product, to see whether the number of reviews influences the number of votes. Each data point is colored by the product's average rating.
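A sketch of the per-product aggregation behind this plot might look like the following, where the asin column is assumed to identify products and vote_num comes from the previous sketch:

```python
import matplotlib.pyplot as plt

# One point per product: total votes, number of reviews, and average rating
per_product = df.groupby("asin").agg(
    total_votes=("vote_num", "sum"),
    n_reviews=("overall", "size"),
    avg_rating=("overall", "mean"),
)

points = plt.scatter(per_product["n_reviews"], per_product["total_votes"],
                     c=per_product["avg_rating"], cmap="coolwarm", alpha=0.6)
plt.colorbar(points, label="Average rating")
plt.xlabel("Number of reviews for the product")
plt.ylabel("Total votes across the product's reviews")
plt.show()
```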

This modification allows for a more comprehensive analysis by accounting for the effect of review frequency on voting patterns. We had expected a concentration of red points (corresponding to an average rating of 5) in the top right corner of the graph, which would mean that the highest numbers of votes and reviews go to products consumers feel positively about. However, we do not see this pattern, indicating that there is no clear relationship between the number of votes and the number of reviews for Amazon Fashion products.

Are ratings time-variant?

To understand whether reviews for product categories varied with time, we first needed to create these categories using Natural Language Processing. We used various columns in the data, such as the title, review text, item description, and styles, to divide the items into categories. Once we graphed the average category ratings by year from 2010–2018, sudden drops and sharp increases in ratings became evident. For example, one can see a drop in the ratings of socks around 2017 and a gradual increase in pants’ ratings around 2015. To dive deeper into these abnormalities, we found the top words in reviews of those categories before and after the changes, as shown in the word clouds. Words that were more common have a larger weight, or size, in the clouds, while less common words are smaller.
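A sketch of how the yearly category averages could be computed, assuming a category column produced by the NLP step described above and taking the review year from the Unix timestamp column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# 'category' stands in for the NLP-derived product categories
df["year"] = pd.to_datetime(df["unixReviewTime"], unit="s").dt.year

yearly = (df[df["year"].between(2010, 2018)]
          .groupby(["year", "category"])["overall"].mean()
          .unstack("category"))

yearly.plot(marker="o")
plt.ylabel("Average rating")
plt.title("Average category ratings by year, 2010-2018")
plt.show()
```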

Further delving into our analysis, we explored the temporal aspects of product ratings. Focusing on products with the highest review volumes, we applied a moving average scale from January 2016 to examine how reviewers’ opinions evolved over time.

On repeating this process over many sets of products, we noticed a sharp drop in rating as the moving average begins to accumulate, but the rating plateaus around 4.5 on average. However, we think this could be an artifact of how the data was collected, since the data is denser after 2018. Following our examination of temporal dynamics, we turned our attention to language choice and its impact on star ratings.
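A sketch of the smoothing step for a single product follows; the placeholder ASIN and the choice of a cumulative (expanding) average are our assumptions, since the exact smoothing window is not stated in the analysis above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rating trend for one heavily reviewed product from January 2016 onward
# ('B000EXAMPLE' is a placeholder ASIN chosen by review volume)
one = (df[df["asin"] == "B000EXAMPLE"]
       .assign(date=lambda d: pd.to_datetime(d["unixReviewTime"], unit="s"))
       .loc[lambda d: d["date"] >= "2016-01-01"]
       .sort_values("date"))

# Running (expanding) average of the star rating over time
one["rating_ma"] = one["overall"].expanding().mean()

one.plot(x="date", y="rating_ma", legend=False)
plt.ylabel("Running average rating")
plt.show()
```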

Word Choice and Star Rating

We built a deep learning model to find relationships between specific words and star ratings. Then, we used a smaller subset of data to test our model, using word choices in reviews to make predictions about the number of stars a product receives.

To determine which words are most closely related to certain star ratings, we created a neural network model using TensorFlow, achieving 76% test accuracy. This model included several layers, including an embedding layer, which helped the model make predictions on unseen data. The embedding layer is what assigns weights to words, helping the model distinguish which rating a word is most closely related to.
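The exact architecture and hyperparameters are not reproduced here, but a minimal TensorFlow/Keras sketch of a model of this shape (text vectorization, an embedding layer, pooling, and a 5-way softmax) looks like this; the vocabulary size, sequence length, and layer widths are assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Review text and star ratings shifted to the range 0-4 for classification
texts = np.array(df["reviewText"].fillna(""))
labels = (df["overall"] - 1).astype(int).values

# Map raw text to integer word indices
vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=100)
vectorizer.adapt(texts)

rating_model = tf.keras.Sequential([
    vectorizer,
    layers.Embedding(input_dim=20000, output_dim=64),  # learned weight vector per word
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),             # one probability per star rating
])

rating_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
rating_model.fit(texts, labels, validation_split=0.2, epochs=5)
```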

The above multiple bar chart depicts the results of our modeling procedure. On the x-axis, for each rating, we present the words with the strongest associations to that rating. The y-axis, Rating Likelihood (%), corresponds to the likelihood that a review will receive that star rating given that the word appears in the review.

Let us try to understand the modeling process for a specific word found in the dataset, say “perfect”.

Since “perfect” is located in the 5 star rating subplot, the model associated the word “perfect” with a five star rating. The bar for “perfect” has a y-value of about 65. This means that if “perfect” is in a given review, that review has a 65% chance of receiving a 5 star rating.
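One simple way to read off this kind of per-word likelihood is to feed a single-word "review" to the trained model and inspect the softmax probabilities. This is only an illustrative sketch, not necessarily how the chart above was generated:

```python
import tensorflow as tf

def rating_likelihoods(word: str) -> dict:
    """Softmax probabilities (as percentages) over the five star ratings
    for a single-word review fed to the trained model."""
    probs = rating_model.predict(tf.constant([word]), verbose=0)[0]
    return {stars: round(float(p) * 100, 1) for stars, p in enumerate(probs, start=1)}

print(rating_likelihoods("perfect"))  # the 5-star entry should dominate if the
                                      # model behaves as the chart describes
```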

In order to claim that we have found a word that helps us effectively predict a star rating, the likelihood of a review having the associated star rating, given that the word is in the review, must be greater than 20%. We treat 20% as the baseline score because there are 5 different categories, so a completely random model would be correct 20% of the time when predicting star rating.

As shown by the graph, the top 5 words associated with 3-star, 4-star, and 5-star ratings all have a likelihood greater than 20%. Therefore, the words the model associates with 3, 4, and 5-star ratings are very helpful in predicting a review's rating based solely on a single word.

Unfortunately, the same doesn't hold true for the top 5 words associated with 1-star and 2-star ratings. Although the words that the model associated with these ratings seem logical, the Rating Likelihood for these words is much lower than the baseline score. This could possibly be explained by significant overlap between these two categories. Therefore, the words that the model associates with 1 and 2-star ratings are not very helpful in predicting a review's rating based solely on a single word. To resolve this issue, future work would include using more 1 and 2-star reviews when training the model.

Bigrams of words used in reviews, distributed by rating

After analyzing the words most commonly used in reviews for certain ratings, we wanted to delve deeper into the context in which these words are used. One technique that was helpful here was bigram analysis. Bigrams identify pairs of words that frequently appear together, which can help capture meaningful phrases and sentiments that might be missed when focusing on individual words. For instance, the word "quality" was one of the most common words throughout reviews, but on its own it does not indicate much about the context of its usage: a reviewer could have discussed the "low quality" of a product and given it a rating of 1, or its "high quality" and given it a rating of 5.
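A minimal sketch of such a bigram count, using scikit-learn's CountVectorizer on the 1-star reviews (removing English stop words is our choice here, so that pairs like "poor quality" surface rather than filler like "it was"):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Most common bigrams among 1-star reviews
one_star = df.loc[df["overall"] == 1, "reviewText"].fillna("")

vec = CountVectorizer(ngram_range=(2, 2), stop_words="english")
counts = vec.fit_transform(one_star)

# Sum counts across all reviews and list the ten most frequent bigrams
totals = counts.sum(axis=0).A1
top10 = sorted(zip(vec.get_feature_names_out(), totals), key=lambda t: -t[1])[:10]
print(top10)
```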

The following graphs show the most common bigrams, distributed by rating.

For 1-star reviews, the most common bigram was "poor quality", followed by "cheap material". This was similar to the reviews with 2-star ratings, whose most common bigrams were "poor quality" followed by "size chart".

Comparing this with bigrams of reviews for the highest rated products, the most common bigrams were "fits perfectly", followed by "super cute". The bigrams "nice quality" and "excellent quality" also appear in the plots for ratings 4 and 5. Thus, this technique allows us to understand the context of the generic but most frequently used words in reviews.

Conclusion

As we wrap up our exploration into Amazon reviews, the findings shed light on key aspects of how customers rate products. Brand influence emerged as a positive factor in ratings, backed by strong statistical evidence. The link between price and ratings showed a general trend, and voting behaviors leaned towards endorsing positive reviews. Moreover, we did not find much temporal variation in ratings for particular products, since most products settled at an average rating of around 4.5 throughout the period the data covers.

We also explored the connection between language use and star ratings. While our deep learning model was able to associate specific words with the ratings likely to accompany them, further analysis of bigrams suggested that we should refine the model to use combinations of words to predict review rating, adding more context to the analysis and helping the model make better predictions. Our research still leaves plenty of room for improvement: future work could involve brand analysis across a wider spectrum of brands, and could include customer analytics to explore demographics in relation to products purchased and ratings given.
