By Yupeng Chen, Sylvia Ma, Hana Lim, Lu Cheng, Anish Dulla
Our team’s analysis compares Uber and Lyft rides in Boston, MA, using a sample of 750,000 rides. We took on this challenge because it is relevant to us as college students in Los Angeles: we often take Uber and Lyft to get around the huge city, and understanding how their pricing models work and vary by circumstance gives us a better sense of which service to use. What we learned from this dataset also applies to real-world problem-solving with data science and machine learning techniques. In our analysis, we predicted and compared the prices of Uber and Lyft rides based on predictors such as distance, hour of the day, and surge multiplier (demand-based pricing). We obtained our data from Kaggle and used this rich dataset to build a price prediction model.
II. Data Source
The contributors of the dataset queried both Lyft and Uber prices in the Boston Area. The queries were done on the apps every 5 minutes for 22 days from late November through mid-December in 2018. The weather data were queried from the Dark Sky API every hour.
We cleaned up the variables of interest. The timestamps were in Unix format, so we converted them to the local time zone to make it easier to assess how time of day influences price. We also noticed that rows with “cab_type” taxi had no associated prices. After removing unusable or irrelevant data, we were left with 637,976 observations.
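The timestamp conversion described above can be sketched as follows. This is a minimal illustration, not our exact cleaning code: it assumes the Unix timestamps are in seconds, and the function name `to_boston_time` is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

# Boston uses the US Eastern time zone.
BOSTON_TZ = ZoneInfo("America/New_York")


def to_boston_time(unix_seconds: float) -> datetime:
    """Convert a Unix timestamp (assumed to be in seconds, UTC) to Boston local time."""
    utc_dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return utc_dt.astimezone(BOSTON_TZ)


local = to_boston_time(1543622400)  # 2018-12-01 00:00:00 UTC
print(local.isoformat())            # 2018-11-30T19:00:00-05:00
print(local.hour, local.strftime("%A"))  # 19 Friday
```

Once the timestamp is in local time, the hour and the day of the week can be read off directly, which is what the later hour-based comparisons rely on.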
1. Price Comparison on Comparable Products between Uber and Lyft
To satisfy different customer needs, both Uber and Lyft offer a range of products, from normal to luxury cars. To find out how product type influences price, we compared the product descriptions for Uber and Lyft and grouped comparable products into 5 categories:
“Share” = (“UberPool”, “Shared”),
“Normal” = (“UberX”, “Lyft”),
“SUV” = (“UberXL”, “Lyft XL”),
“Lux” = (“Black”, “Lux Black”),
“Lux SUV” = (“Black SUV”, “Lux Black XL”)
Since Uber’s “WAV” and Lyft’s “Lux” have no counterpart in the other service, we removed them from this graph. The resulting dataset has a similar number of observations in each product type across Uber and Lyft.
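The grouping above amounts to a lookup table from product name to category. Here is a minimal sketch; the row layout and the `name` column are assumptions for illustration, and the sample rows are made up.

```python
# Map each product name to its comparable category (from the list above).
PRODUCT_CATEGORY = {
    "UberPool": "Share", "Shared": "Share",
    "UberX": "Normal", "Lyft": "Normal",
    "UberXL": "SUV", "Lyft XL": "SUV",
    "Black": "Lux", "Lux Black": "Lux",
    "Black SUV": "Lux SUV", "Lux Black XL": "Lux SUV",
}


def categorize(product_name):
    """Return the comparable category, or None for unmatched products such as WAV."""
    return PRODUCT_CATEGORY.get(product_name)


# Made-up example rows; the real dataset has many more columns.
rides = [
    {"cab_type": "Uber", "name": "UberX", "price": 9.5},
    {"cab_type": "Lyft", "name": "Lux Black XL", "price": 32.5},
    {"cab_type": "Uber", "name": "WAV", "price": 9.5},
    {"cab_type": "Lyft", "name": "Lux", "price": 16.5},
]

# Keep only rides whose product has a counterpart in the other service.
comparable = [
    {**ride, "category": categorize(ride["name"])}
    for ride in rides
    if categorize(ride["name"]) is not None
]
print([r["category"] for r in comparable])  # ['Normal', 'Lux SUV']
```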
The box plot indicates that Lyft is cheaper than Uber for shared, normal, and SUV cars. As college students, we find that this aligns with our daily experience. However, it is interesting that Lyft’s luxury products (Lux and Lux SUV) are more expensive than Uber’s. From the graph, we can see in which scenarios Lyft or Uber is the cheaper option. The result also guides the variable selection for our machine learning model for price.
2. Uber VS Lyft: Price Comparison on Distances
Distance clearly plays an important role in determining price, but we wanted to see how Uber and Lyft weigh this factor differently in their pricing models. In our dataset, the queried distances for Lyft cover a smaller range than those for Uber. This difference is an artifact of how the queries were made, so we cannot draw any conclusion from it.
However, it is noticeable that the price variation of “Lux” and “Lux SUV” is much higher in Lyft than in Uber, and that price increases more in Lyft than in Uber as distance increases. This suggests that the prices of Lyft’s luxury products are more sensitive to changes in distance than Uber’s. It also suggests that factors other than distance contribute to the price of Lyft’s luxury products.
Interestingly, the price estimation in Lyft’s app has a larger gap between price levels than that of Uber, which might imply a larger difference between the estimation and the real price in Lyft.
3. Uber VS Lyft: Heat Map for Product Types and Weather
From the box plot of all products, there is no clear pattern between weather and price for either Uber or Lyft. However, by looking at the heat map broken down by product type, we can draw some interesting insights.
For shared and normal products, weather does not change the price for either Uber or Lyft. For luxury products, Uber’s “Lux SUV” price in “drizzle” is slightly higher than in “clear”, but the price in “rain” is almost the same as in “clear”. Moreover, a similar pattern does not appear in its “Lux” product. Therefore, there is no apparent and consistent relationship between weather and price for Uber.
As for Lyft, the price in “Mostly Cloudy” and “Rain” is slightly higher than in “clear” for both the “Lux SUV” and “Lux” products. This might partly explain the large price variation in Lyft’s luxury products, and it indicates that bad weather affects the price of Lyft’s luxury cars to some extent. However, we cannot explain why the average price in “Possible Drizzle” is the lowest of all weather conditions for Lyft’s “Lux”. Hence, the relationship between weather and price in Lyft still needs further investigation to decide whether it is valid and consistent.
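Each cell of such a heat map is an average price for one combination of product type and weather. The aggregation behind it can be sketched as below; the field names `category`, `weather`, and `price` and the sample rows are assumptions for illustration, not values from our dataset.

```python
from collections import defaultdict


def mean_price_table(rows):
    """Average price for each (product category, weather) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of prices, count]
    for row in rows:
        key = (row["category"], row["weather"])
        totals[key][0] += row["price"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up rows, only to show the shape of the result.
rows = [
    {"category": "Lux", "weather": "Clear", "price": 20.0},
    {"category": "Lux", "weather": "Clear", "price": 22.0},
    {"category": "Lux", "weather": "Rain", "price": 24.0},
    {"category": "Share", "weather": "Clear", "price": 6.0},
]

table = mean_price_table(rows)
print(table[("Lux", "Clear")])  # 21.0
```

Plotting the resulting table with product types on one axis and weather conditions on the other gives one heat map per service.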
4. Uber VS Lyft: Heat Map for Product Types and Hours
Another factor that may influence price is rush hour, which in Boston runs from 7am to 9am and from 3pm to 6pm. Since there is no rush hour on weekends, we removed weekend data from the heat map.
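The rush hour rule above can be written as a small helper. This is a sketch under one assumption: the windows are treated as half-open, so 7am to 9am covers hours 7 and 8, and 3pm to 6pm covers hours 15 through 17. The function name is hypothetical.

```python
def is_weekday_rush_hour(hour, weekday):
    """Return True if the ride falls in a weekday rush hour window.

    hour:    local hour of day, 0-23
    weekday: 0 = Monday ... 6 = Sunday (same convention as datetime.weekday())
    """
    if weekday >= 5:  # Saturday or Sunday: no rush hour
        return False
    return 7 <= hour < 9 or 15 <= hour < 18


print(is_weekday_rush_hour(8, 2))   # True  (Wednesday, 8am)
print(is_weekday_rush_hour(8, 6))   # False (Sunday, 8am)
print(is_weekday_rush_hour(13, 2))  # False (Wednesday, 1pm)
```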
For Lyft, the price of “Share” products does not change with the hour. For “Normal” products, however, the price at 1pm is slightly higher than at other hours. For “SUV”, 5am is higher; for “Lux”, 1–2pm is higher; and for “Lux SUV”, 10am is higher. There is no consistent pattern in Lyft in which certain hours have higher prices for all product types. Therefore, for Lyft, the relationship between hour and price may not follow a universal rule but instead depends on specific circumstances.
As for Uber, the price of “Share” products is higher at night, from 10pm to 12am. For “Normal” products, the prices at 10am, 1pm, and 3pm are higher than at other hours, and it is interesting to see that the price at 1pm is also higher for “SUV”, “Lux”, and “Lux SUV”.
Surge pricing is based on the logic of fairness and economic equilibrium. By adjusting prices, the company tries to match driver supply to rider demand at any given time. During periods of excessive demand, or when there aren’t enough drivers on the road, Uber multiplies its normal price by a “surge multiplier”. The higher price increases supply by attracting drivers who want to earn more, and it reduces demand by steering riders who value the trip less toward other means of transportation.
However, reports from The Washington Post show that surge prices fluctuate drastically and changes occur every 3 to 5 minutes. Furthermore, surge prices are also location-specific and may be several times higher in one neighborhood than an adjoining one. This might explain why we couldn’t find a simple and universal relationship between hours and price.
1. Variable Selection
Since we believed that most variables have a linear relationship with price, we fitted a lasso model to both the Uber and the Lyft data, using cross-validation to find the best shrinkage parameter lambda and the variables that remain important at that value. We found that only Cab Type and Distance remained important in either model.
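The idea of lasso-based variable selection with cross-validation can be sketched with scikit-learn’s `LassoCV`. This is an illustration on synthetic data, not our actual pipeline: the data, the feature names, and the threshold for calling a coefficient nonzero are all assumptions. Note that scikit-learn calls the shrinkage parameter `alpha` rather than lambda.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: price depends on distance only; hour and temperature are noise.
rng = np.random.default_rng(0)
n = 500
distance = rng.uniform(0.5, 8.0, n)
hour = rng.integers(0, 24, n).astype(float)
temperature = rng.normal(40.0, 5.0, n)
price = 5.0 + 2.5 * distance + rng.normal(0.0, 1.0, n)

feature_names = ["distance", "hour", "temperature"]
X = np.column_stack([distance, hour, temperature])

# Standardize so the penalty treats all features on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# 5-fold cross-validation picks the shrinkage parameter (alpha_ after fitting).
model = LassoCV(cv=5, random_state=0).fit(X_scaled, price)

# Features whose coefficients were not shrunk to (near) zero.
selected = [
    name for name, coef in zip(feature_names, model.coef_) if abs(coef) > 1e-6
]
print("chosen alpha:", model.alpha_)
print("coefficients:", dict(zip(feature_names, model.coef_.round(3))))
print("selected:", selected)
```

On this synthetic data the coefficient on distance dominates, while the noise features are shrunk toward zero, which is the behavior we relied on when selecting variables.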
2. Best Predictive Model Selection
We trained Linear Regression, LASSO, Random Forest, and Gradient Boosting models and evaluated their performance on the validation data. There are various metrics for measuring the performance of a regression model; we used R-squared and RMSE (root-mean-square error) to find the best model.
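For reference, the two metrics can be computed by hand as shown in this small sketch. The numbers are made up purely to show the arithmetic.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error: typical size of a prediction error, in price units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [10.0, 20.0, 30.0]      # made-up true prices
predicted = [12.0, 18.0, 30.0]   # made-up predictions

print(round(rmse(actual, predicted), 3))       # 1.633
print(round(r_squared(actual, predicted), 3))  # 0.96
```

A lower RMSE and an R-squared closer to 1 both indicate a better fit on the validation data.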
Linear regression is one of the best-known and simplest ways to predict an output. It fits a linear model that minimizes the residual sum of squares between the predicted values and the true values. Its main disadvantage is that it assumes a linear relationship between the predictors and the response variable, and real-world data rarely follow a perfectly linear relationship.
LASSO regression is a special type of linear model that adds a penalty on the size of the coefficients, which shrinks the coefficients of unnecessary variables toward zero and can remove them from the model entirely. This regularization reduces overfitting and helps with feature selection. LASSO is also well known for the interpretability of its models. We will discuss this later in the article.
Random Forest Regression
Random Forest combines the predictions of many decision trees, each trained on a random sample of the data, to make more accurate predictions than any individual tree. Each tree has low bias but high variance, and averaging across the trees reduces the variance and therefore the error. If you want to check out how Random Forest works in-depth, check out this website!
Gradient Boosting Regression
Like Random Forest, the Gradient Boosting Regressor is an ensemble machine learning algorithm that uses decision trees. We chose to test Gradient Boosting Regression because it reduces error in the opposite way from Random Forest. Boosting builds simple trees one after another, each with high bias and low variance, and each new tree corrects the errors of the previous ones, so it reduces error mainly by reducing bias.
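The two ensemble methods can be compared side by side with scikit-learn. The sketch below uses synthetic data in which price is a base fare plus a per-mile rate, scaled by a surge multiplier; the data-generating formula and all numbers are assumptions for illustration, and both models use default settings apart from the fixed random seed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic rides: price = (base + rate * distance) * surge + noise.
rng = np.random.default_rng(42)
n = 1000
distance = rng.uniform(0.5, 8.0, n)
surge = rng.choice([1.0, 1.25, 1.5, 2.0], size=n, p=[0.85, 0.08, 0.05, 0.02])
price = (3.0 + 2.5 * distance) * surge + rng.normal(0.0, 0.5, n)

X = np.column_stack([distance, surge])
X_train, X_val, y_train, y_val = train_test_split(
    X, price, test_size=0.25, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training split and score it on the validation split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_val, model.predict(X_val))

for name, score in scores.items():
    print(f"{name}: validation R-squared = {score:.3f}")
```

Which of the two comes out ahead depends on the data and the tuning, so a comparison like this is best read as a starting point rather than a verdict.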
3. Importance of the significant predictors
Plot 1 shows the difference between the predicted and the true values of the response variable for a model without the surge multiplier, and plot 2 shows the same for a model with the surge multiplier. These plots illustrate the importance of including significant features in predictive models.
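The effect of leaving out an important predictor can be reproduced on synthetic data. In this sketch, price is generated from distance and a surge multiplier, and the same Random Forest is fitted with and without the surge column; the data and every number in it are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic rides in which the surge multiplier genuinely drives price.
rng = np.random.default_rng(7)
n = 1000
distance = rng.uniform(0.5, 8.0, n)
surge = rng.choice([1.0, 1.25, 1.5, 2.0], size=n, p=[0.85, 0.08, 0.05, 0.02])
price = (3.0 + 2.5 * distance) * surge + rng.normal(0.0, 0.5, n)


def validation_r2(features):
    """Fit a Random Forest on the given feature columns; return validation R-squared."""
    X = np.column_stack(features)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, price, test_size=0.25, random_state=0
    )
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return r2_score(y_va, model.predict(X_va))


r2_without_surge = validation_r2([distance])
r2_with_surge = validation_r2([distance, surge])
print(f"without surge multiplier: {r2_without_surge:.3f}")
print(f"with surge multiplier:    {r2_with_surge:.3f}")
```

Because the model without the surge column has no way to explain surged rides, its predictions miss those rides, and its validation score is lower.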
Although Gradient Boosting Regression is known for minimizing prediction error, Random Forest performed best in our case. We applied the basic Random Forest model without tuning any parameters, and even so it turned out to be the best model.
While Random Forest gave us the best predictive power, the Lasso linear model told a better story. To better understand how price is calculated for Uber and Lyft rides, we derived coefficients from the Lasso model. We see that Lyft has a lower rate for distance than Uber, which implies that in general Lyft charges less for every additional mile, so long-distance rides will be cheaper on Lyft than on Uber. Also, in Lyft, the shared cab has a lower base price, while the high-end products “Lux Black” and “Lux Black XL” have higher base prices.
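To see why a lower per-mile rate matters more on long rides, consider two linear price formulas of the form base price plus rate times distance. The sketch below finds the distance at which they cost the same. The coefficients are hypothetical placeholders, not the values from our Lasso model, and the services are deliberately labeled A and B.

```python
def predicted_price(base, rate_per_mile, distance_miles):
    """Price under a simple linear model: base + rate * distance."""
    return base + rate_per_mile * distance_miles


def breakeven_distance(base_a, rate_a, base_b, rate_b):
    """Distance at which the two linear price lines cross, or None if parallel."""
    if rate_a == rate_b:
        return None
    return (base_b - base_a) / (rate_a - rate_b)


# Hypothetical coefficients: A has the higher base but the lower per-mile rate.
base_a, rate_a = 6.0, 2.0
base_b, rate_b = 5.0, 2.5

d = breakeven_distance(base_a, rate_a, base_b, rate_b)
print(d)  # 2.0

# Past the break-even distance, the service with the lower rate is cheaper.
print(predicted_price(base_a, rate_a, 5.0))  # 16.0
print(predicted_price(base_b, rate_b, 5.0))  # 17.5
```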
V. Future Work
There are many ways to approach this problem from different perspectives. We can refine the models to improve prediction accuracy. For example, we can consider interactions between variables, such as between the distance and cab type predictors. We can also tune the parameters of Random Forest and compare the prediction error at their optimal values. Finally, we plan to bring in outside data on traffic conditions and travel times.
More information about our code and analyses can be found in our project GitHub repository.