CMSC320 Final Project: Predicting Car Prices

Neel Tejwani


1. Introduction

The used car market is an important segment of the United States auto industry. In 2019, 40.42 million used cars were sold versus only 17 million new cars [1]. Pricing a used car is much harder than pricing a new one, since each car has its own unique history. Many people do not know how much their used car is worth, and often browse the market for similar cars to gauge a fair price, with no formal process. What if there was a way to enter your car's info and see what price you should list it at? This project aims to provide a rudimentary solution to that by observing a dataset, examining trends, and picking a regressor that can make a reasonable prediction. We will gather data, process it, perform EDA, and compare model performance to produce something that could be used in real life. I have always been interested in cars, and I am looking forward to running the data science pipeline on something I care about!

1: https://fortunly.com/statistics/us-car-sales-statistics

2. Imports

Make sure everything below is imported so that the rest of the notebook runs without issues:

  1. Pandas & Numpy: Needed for dataframe manipulation
  2. Matplotlib & Seaborn: Used to plot graphs
  3. Regressors and everything below: Used in the ML model analysis
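The original import cell is not shown, so here is a sketch of what it might contain, based on the libraries and models used throughout this notebook:

```python
# Dataframe manipulation
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# ML model analysis (scikit-learn)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
```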

3. Data Collection

In order to predict used car sale prices, we need data with a description and a price for each car. I found a dataset on Kaggle that has 12 information points on each car: https://www.kaggle.com/doaaalsenani/usa-cers-dataset. The data was scraped from auctionexport.com and covers cars being exported from the US. Here is a description of the 12 columns, adapted from the dataset description at the link above:

  1. Price: The sale price of the vehicle in the ad
  2. Brand: The brand of the car
  3. Model: The model of the vehicle
  4. Year: The vehicle registration year
  5. Title: The title status, a binary feature: "clean vehicle" or "salvage insurance"
  6. Mileage: Miles traveled by the vehicle
  7. Color: The color of the vehicle
  8. Vin: The vehicle identification number, a string of 17 characters (digits and capital letters)
  9. Lot: A lot number is an identification number assigned to a particular quantity or lot of material from a single manufacturer. For cars, a lot number is combined with a serial number to form the vehicle identification number.
  10. State: The state in which the car is available for purchase
  11. Country: The country in which the car is available for purchase
  12. Condition: The time remaining in the auction
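The loading cell is not shown; in the notebook the Kaggle CSV would be read with something like `pd.read_csv("USA_cars_datasets.csv")` (the filename is an assumption). The same call is demonstrated here on a tiny inline sample with the same 13 columns as the file ("Unnamed: 0" plus the 12 described above); all values are made up for illustration:

```python
import io
import pandas as pd

# Two made-up rows mimicking the CSV's layout (column names match the
# dataset; every value here is illustrative, not real data)
csv_sample = """Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
0,6300,toyota,corolla,2015,clean vehicle,74117.0,black,vin0000000000001,159340001,new jersey,usa,10 days left
1,2899,ford,fusion,2013,clean vehicle,90552.0,silver,vin0000000000002,159340002,tennessee,usa,6 days left
"""

data = pd.read_csv(io.StringIO(csv_sample))
print(data.shape)  # (2, 13)
```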

4. Data Processing

You might have noticed the first column above, called "Unnamed: 0." This is simply an index column from the csv file, but it is redundant in a dataframe (a dataframe already keeps track of indices) and may have unintended effects on our analysis. Additionally, the 'vin' and 'lot' columns are unique to specific cars and hence cannot have any correlation with price. Let's delete these three columns.
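The dropping step can be sketched as follows (shown on a tiny synthetic dataframe so it is self-contained; in the notebook `data` is the full loaded dataset):

```python
import pandas as pd

# Synthetic stand-in for the loaded dataframe
data = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "price": [6300, 2899],
    "vin": ["vin0001", "vin0002"],
    "lot": [159340001, 159340002],
    "year": [2015, 2013],
})

# "Unnamed: 0" duplicates the index; "vin" and "lot" are per-car
# identifiers with no pricing signal, so all three are dropped
data = data.drop(columns=["Unnamed: 0", "vin", "lot"])
print(list(data.columns))  # ['price', 'year']
```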

Let's look at our numerical columns next using the describe function
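The call in question is pandas' `describe`, demonstrated here on synthetic values:

```python
import pandas as pd

# Made-up numerical columns for illustration
data = pd.DataFrame({"price": [6300, 2899, 0],
                     "year": [2015, 2013, 2018],
                     "mileage": [74117.0, 90552.0, 9590.0]})

# describe() summarizes each numerical column: count, mean, std, min,
# the quartiles, and max
print(data.describe())
```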

We can observe the following:

  1. The min price is 0, which is an outlier and can skew our results. To get a more appropriate dataset without outliers, let's drop all rows that have a price less than $1000.
  2. The median of the car model years is 2018, with a standard deviation of 3.44. So let's also drop all cars that are older than 2012 to avoid outliers.
  3. The max mileage is a million, which is very rare in real life, so let's drop all cars with a mileage over 350,000.
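The three filters above can be applied in one boolean mask (sketched on made-up rows chosen so each filter removes something or passes something):

```python
import pandas as pd

# Synthetic rows: one fails the price filter and the year/mileage filters,
# the other two pass everything
data = pd.DataFrame({
    "price": [0, 6300, 25000],
    "year": [2008, 2015, 2018],
    "mileage": [1000000.0, 90552.0, 9590.0],
})

# Keep prices of at least $1000, model years 2012 or newer,
# and mileage of at most 350,000
data = data[(data["price"] >= 1000) &
            (data["year"] >= 2012) &
            (data["mileage"] <= 350000)]
print(len(data))  # 2
```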

Now let's see if any of our columns have any null values:
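A null check is a one-liner with `isnull`, shown here on a synthetic dataframe:

```python
import pandas as pd

data = pd.DataFrame({"price": [6300, 2899], "year": [2015, 2013]})

# isnull() flags missing entries; sum() counts them per column
print(data.isnull().sum())
```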

We can see that no columns have any null entries. This is good!

Next, let's standardize the 'condition' column to an amount in hours, as it is currently a string and isn't usable for further analysis. To do this, we create a new column called 'hours_left' and convert each string to an integer representing the hours left in the auction. Finally, we drop 'condition'.
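The original conversion cell is not shown; one way to implement it is sketched below. It assumes the 'condition' strings come in the forms "N days left", "N hours left", and "Listing Expired" (expired listings are mapped to 0 hours here; the notebook may handle them differently):

```python
import pandas as pd

# Synthetic examples of the assumed string formats
data = pd.DataFrame({"condition": ["10 days left", "22 hours left",
                                   "Listing Expired"]})

def to_hours(cond):
    """Convert a 'condition' string to integer hours remaining."""
    parts = cond.split()
    if parts[0].isdigit():
        n = int(parts[0])
        return n * 24 if parts[1] == "days" else n
    return 0  # expired listings

data["hours_left"] = data["condition"].apply(to_hours)
data = data.drop(columns=["condition"])
print(data["hours_left"].tolist())  # [240, 22, 0]
```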

We can see that the average time remaining on a listing is 48 hours, or two days. We also have some expired listings, but that is ok.

How about the colors? How many do we have of each?

This is bad. Many of these colors are near-duplicate shades of each other, so we need to combine them. The code below does so:
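The notebook's mapping cell is not shown; one simple approach is to match each shade against a list of base colors, sketched here on made-up shade names (the real mapping would need to cover every shade that appears in the dataset):

```python
import pandas as pd

# Made-up shade names for illustration
data = pd.DataFrame({"color": ["shadow black", "off-white",
                               "dark blue", "red"]})

base_colors = ["black", "white", "silver", "gray", "blue", "red",
               "green", "gold", "orange", "yellow", "brown"]

def simplify(color):
    """Collapse a shade name onto its base color, e.g. 'shadow black' -> 'black'."""
    for base in base_colors:
        if base in color:
            return base
    return "other"

data["color"] = data["color"].apply(simplify)
print(data["color"].tolist())  # ['black', 'white', 'blue', 'red']
```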

This looks much better!

The final step in processing this data is making another dataframe that can be used later for running Machine Learning models. ML models require numerical values to make sense of data. So let's convert all categorical variables to numerical.
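The notebook's actual encoding scheme is not shown; label encoding each categorical column is one common choice (one-hot encoding via `pd.get_dummies` is another), sketched here on synthetic rows:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic rows with a mix of numeric and categorical columns
data = pd.DataFrame({
    "price": [6300, 2899],
    "brand": ["toyota", "ford"],
    "title_status": ["clean vehicle", "salvage insurance"],
})

# Replace every string column with integer codes so ML models can use it
data_ml = data.copy()
for col in data_ml.select_dtypes(include="object").columns:
    data_ml[col] = LabelEncoder().fit_transform(data_ml[col])
print(data_ml)
```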

We can now proceed to EDA.

5. Exploratory Data Analysis

Let's now explore some trends in our data. First, we plot mileage over time
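The plotting cell is not shown; a boxplot of mileage per model year is one way to draw it, sketched here on synthetic rows (the output filename is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import pandas as pd
import seaborn as sns

# Made-up rows: older years get higher mileages
data = pd.DataFrame({
    "year": [2012, 2012, 2015, 2015, 2018, 2018],
    "mileage": [120000, 95000, 70000, 60000, 20000, 15000],
})

ax = sns.boxplot(x="year", y="mileage", data=data)
ax.figure.savefig("mileage_by_year.png")

# The per-year medians capture the downward trend numerically
medians = data.groupby("year")["mileage"].median()
print(medians)
```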

The trend seems reasonable: the older a car is, the more miles it has. We can also note that 2017 has more outliers than the rest of the years.

Next, let's look at price vs brand

We can see that luxury brands such as Mercedes-Benz, BMW, Cadillac, Lincoln, Lexus, Audi, and more all have higher average prices than mainstream brands such as Hyundai, Jeep, Kia, Mazda, and others. Some anomalies: Honda is considerably cheaper than other mainstream brands; Jaguar is a lot cheaper than expected since it's a luxury brand.

Next, let's look at price vs year, but let's also observe how the title status affects this price.
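A grouped bar plot with `hue` set to the title status is one way to show both variables at once; this sketch uses made-up prices in which salvage-title cars are cheaper each year:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import pandas as pd
import seaborn as sns

# Synthetic rows: one clean and one salvage listing per year
data = pd.DataFrame({
    "year": [2014, 2014, 2016, 2016, 2018, 2018],
    "price": [9000, 4000, 14000, 7000, 21000, 11000],
    "title_status": ["clean vehicle", "salvage insurance"] * 3,
})

ax = sns.barplot(x="year", y="price", hue="title_status", data=data)
ax.figure.savefig("price_by_year_title.png")

# Average price per title status, for reference
means = data.groupby("title_status")["price"].mean()
print(means)
```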

This is also expected, as cars with salvage titles are going to be cheaper (a salvage title means that the insurance company has deemed the vehicle a loss because it was damaged significantly). Additionally, the price trend over time also makes sense.

Next, let's see the popularity of different brands

We can see that Ford, Dodge, Nissan, Chevy, and GMC are the most popular brands. This makes sense as this is a US dataset.

So far, we have only observed variables that we know should have an effect on the price by pure intuition. However, we don't know what trends to expect with a vehicle color, location, or time remaining in auction. Let's now pivot to observe these variables.

From this plot, we can see that neutral/mainstream colors such as silver, black, white, red, and blue have higher prices than unusual colors such as gold, green, orange, and yellow. This makes sense because people generally don't buy these unusual colors, and most mainstream cars don't even offer them as options.

Let's look at if the time remaining on auction has an effect on price:

Notice that as an auction nears its end, prices vary widely, whereas earlier in the auction prices are more stable.

Lastly, let's see the average price in each location

There seems to be a relation between price and state, as there is an upward trend when sorted by average sale prices.

6. Machine Learning & Analysis

In this section, we will use a few different models to develop a prediction, and pick the best one at the end. First, let's define X and y, the predictors (everything but price) and value to predict (price). Note we are using the data_ml dataframe defined above that has all numerical values. We also need to standardize this data before we use it, and we can simply use the StandardScaler to do so.
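The setup described above can be sketched as follows. Since `data_ml` itself is not reproduced here, a small random all-numeric stand-in is generated in its place; the `test_size` and `random_state` values are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in for data_ml (the all-numeric dataframe built earlier)
rng = np.random.default_rng(0)
data_ml = pd.DataFrame({
    "price": rng.integers(1000, 40000, 50),
    "year": rng.integers(2012, 2021, 50),
    "mileage": rng.integers(0, 200000, 50),
})

X = data_ml.drop(columns=["price"])   # predictors: everything but price
y = data_ml["price"]                  # target: price

# Standardize each predictor to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```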

Let's start off with a simple Linear Regression:
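The fit/score pattern used for every model in this section looks like this (shown on synthetic data with a known linear relationship, since the real split is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a mostly linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
# score() returns R^2 on the held-out set: 1.0 is perfect, 0.0 is no
# better than always predicting the mean
print(round(lr.score(X_test, y_test), 3))
```

The same `fit`/`score` calls apply to each regressor tried below; only the estimator class changes.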

The linear regressor gives us an R² score of around 0.305, which is not good, so it is not a good fit for our data.

Let's try Logistic Regression:

Logistic Regression is even worse! This is unsurprising: logistic regression is a classifier, so it is ill-suited to predicting a continuous target like price. Seems like these two models do not fit our data well.

Next, let's try k-nearest neighbors (KNN).

This regressor predicts a price by averaging the prices of the most similar observations (the nearest neighbors in feature space).

We can see that the KNN regressor only has an R² score of 0.499, which is not great. We need to explore more models.

Let's try Gradient Boosting. It is an ensemble model that combines many weak learners into a strong one by fitting each new learner to the errors of the previous ones.

Gradient Boosting gives us an R² score of 0.6217, which is much better!

Random Forests?

We could stop here, as Gradient Boosting did give us a reasonable model. However, let's see how Random Forests perform on our dataset. This model is widely used because it averages many decision trees, which reduces the overfitting of any single tree. Due to this, I predict it will have the best score. Let's see if the hypothesis holds true:
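A minimal random forest fit, again on synthetic data (here with a nonlinear target, which tree ensembles handle well); `n_estimators=100` and the random seed are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with a nonlinear target
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree trains on a bootstrap sample; averaging their predictions
# reduces the variance (overfitting) of any single tree
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(round(rf.score(X_test, y_test), 3))  # R^2 on held-out data
```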

And it did - we got an R² score of 0.6829! Since Random Forests performed the best, let's conduct hyperparameter tuning on it using Grid Search to see if we can get an even better score.

Hyperparameter tuning
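Grid search cross-validates every combination of candidate parameters and keeps the best. The notebook's actual grid is not shown; the small illustrative grid below runs on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Illustrative grid: 2 x 2 = 4 candidate parameter combinations,
# each evaluated with 3-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

After fitting, `grid.best_estimator_` is the retrained model with the winning parameters.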

We can see that with these new parameters we have increased our R² score from 0.6829 to 0.6864! While this score is not amazing, it is reasonable, since we are able to predict the price within $4,380 of a mean price of $19,780.

7. Conclusion

That's it! We have walked through the data science pipeline of data collection, data processing, exploratory data analysis, machine learning, and providing insight. It was great to look into the factors that affect car prices and build a model that makes a reasonable prediction. Big thanks to the people on Kaggle who scraped this data from auctionexport.com and to Prof. John Dickerson for a great intro to data science class. Hopefully this tutorial provided you with some insight into how different factors affect a car's price, and the model could be of use for further predictions.

Here are some links for further reading into some of the topics discussed:

  1. https://cars.usnews.com/cars-trucks/car-depreciation-what-factors-affect-car-values
  2. https://fortunly.com/statistics/us-car-sales-statistics
  3. https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
  4. https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
  5. https://towardsdatascience.com/understanding-gradient-boosting-machines-9be756fe76ab

Thanks for reading!
