The used car market is an important segment of the United States auto industry: in 2019, 40.42 million used cars were sold versus only about 17 million new cars. Pricing a used car is much harder than pricing a new one, since each car has its own unique history. Many people do not know how much their used car is worth, and often browse the market for similar cars to gauge the going price, with no formal process. What if there were a way to enter your car's info and see what price you should list it at? This project aims to provide a rudimentary solution to that by examining a dataset, looking at trends, and picking a regressor that can make a reasonable prediction. We will gather data, process it, perform EDA, and compare model performance to produce something that could be used in real life. I have always been interested in cars, and running the data science pipeline on something I am interested in is something I am looking forward to!
Make sure everything here is imported in order to run the rest of the notebook without issues
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.metrics import mean_absolute_error as mae
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
In order to predict used car sale prices, we need data describing each car along with its price. I found a suitable dataset on Kaggle with 12 different pieces of information on each car, available here: https://www.kaggle.com/doaaalsenani/usa-cers-dataset. The data was scraped from auctionexport.com and covers cars listed for export from the US. Briefly, the 12 columns are:

- price: the listed price in USD
- brand: the manufacturer
- model: the model name
- year: the model year
- title_status: whether the car has a clean or salvage title
- mileage: miles driven
- color: exterior color
- vin: the vehicle identification number
- lot: the auction lot number
- state: the US state where the car is listed
- country: the listing country
- condition: the time remaining in the auction
data = pd.read_csv("USA_cars_datasets.csv")
data
 | Unnamed: 0 | price | brand | model | year | title_status | mileage | color | vin | lot | state | country | condition
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 6300 | toyota | cruiser | 2008 | clean vehicle | 274117.0 | black | jtezu11f88k007763 | 159348797 | new jersey | usa | 10 days left |
1 | 1 | 2899 | ford | se | 2011 | clean vehicle | 190552.0 | silver | 2fmdk3gc4bbb02217 | 166951262 | tennessee | usa | 6 days left |
2 | 2 | 5350 | dodge | mpv | 2018 | clean vehicle | 39590.0 | silver | 3c4pdcgg5jt346413 | 167655728 | georgia | usa | 2 days left |
3 | 3 | 25000 | ford | door | 2014 | clean vehicle | 64146.0 | blue | 1ftfw1et4efc23745 | 167753855 | virginia | usa | 22 hours left |
4 | 4 | 27700 | chevrolet | 1500 | 2018 | clean vehicle | 6654.0 | red | 3gcpcrec2jg473991 | 167763266 | florida | usa | 22 hours left |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2494 | 2494 | 7800 | nissan | versa | 2019 | clean vehicle | 23609.0 | red | 3n1cn7ap9kl880319 | 167722715 | california | usa | 1 days left |
2495 | 2495 | 9200 | nissan | versa | 2018 | clean vehicle | 34553.0 | silver | 3n1cn7ap5jl884088 | 167762225 | florida | usa | 21 hours left |
2496 | 2496 | 9200 | nissan | versa | 2018 | clean vehicle | 31594.0 | silver | 3n1cn7ap9jl884191 | 167762226 | florida | usa | 21 hours left |
2497 | 2497 | 9200 | nissan | versa | 2018 | clean vehicle | 32557.0 | black | 3n1cn7ap3jl883263 | 167762227 | florida | usa | 2 days left |
2498 | 2498 | 9200 | nissan | versa | 2018 | clean vehicle | 31371.0 | silver | 3n1cn7ap4jl884311 | 167762228 | florida | usa | 21 hours left |
2499 rows × 13 columns
You might have noticed the first column above called "Unnamed: 0." This is simply an index column from the CSV file, but it is redundant in a dataframe (which keeps track of its own index) and may have unintended effects on our analysis. Additionally, the 'vin' and 'lot' columns are unique to each specific car, so they cannot carry any useful relationship with price. Let's delete these three columns.
del data['Unnamed: 0']
del data['vin']
del data['lot']
Let's look at our numerical columns next using the describe function
data.describe()
 | price | year | mileage
---|---|---|---|
count | 2499.000000 | 2499.000000 | 2.499000e+03 |
mean | 18767.671469 | 2016.714286 | 5.229869e+04 |
std | 12116.094936 | 3.442656 | 5.970552e+04 |
min | 0.000000 | 1973.000000 | 0.000000e+00 |
25% | 10200.000000 | 2016.000000 | 2.146650e+04 |
50% | 16900.000000 | 2018.000000 | 3.536500e+04 |
75% | 25555.500000 | 2019.000000 | 6.347250e+04 |
max | 84900.000000 | 2020.000000 | 1.017936e+06 |
We can observe the following from the summary above:

- The minimum price is 0, which most likely corresponds to listings that received no bids or to bad entries.
- Model years go back as far as 1973, even though 75% of the cars are from 2016 or newer.
- Mileage ranges from 0 up to over 1,000,000 miles.

To keep the analysis focused on typical listings, let's restrict the data to cars priced at $1,000 or more, from model year 2012 or newer, and with no more than 350,000 miles:
data = data[(data.price >= 1000) & (data.year >= 2012) & (data.mileage <= 350000)]
Now let's see if any of our columns have any null values:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2348 entries, 2 to 2498
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   price         2348 non-null   int64
 1   brand         2348 non-null   object
 2   model         2348 non-null   object
 3   year          2348 non-null   int64
 4   title_status  2348 non-null   object
 5   mileage       2348 non-null   float64
 6   color         2348 non-null   object
 7   state         2348 non-null   object
 8   country       2348 non-null   object
 9   condition     2348 non-null   object
dtypes: float64(1), int64(2), object(7)
memory usage: 201.8+ KB
We can see that no columns have any null entries. This is good!
Next, let's standardize the 'condition' column to a number of hours, since it is currently a string and isn't usable for further analysis. To do this, we create a new column called 'hours_left' and convert each string to a number representing the hours left in the auction. Finally, we drop 'condition' at the end.
data['condition'].unique() # see all unique entries
array(['2 days left', '22 hours left', '20 hours left', '19 hours left', '3 days left', '21 hours left', '17 hours left', '2 hours left', '3 hours left', '34 minutes', '16 hours left', '18 hours left', '1 days left', '32 minutes', '14 hours left', '6 days left', '5 hours left', '10 days left', '4 days left', '9 days left', '7 days left', '5 days left', '9 minutes', '1 minutes', '7 hours left', '16 minutes', '6 hours left', '1 hours left', 'Listing Expired', '13 days left', '24 hours left', '8 days left', '23 hours left', '15 hours left', '53 minutes', '27 minutes', '12 days left', '15 days left', '30 minutes', '29 minutes', '28 minutes', '48 minutes', '11 days left', '4 hours left', '47 minutes', '12 hours left', '36 minutes'], dtype=object)
# replace "Listing Expired" with "0 hours left" to have one format to work with
data['condition'].replace('Listing Expired', '0 hours left',inplace=True)
# create new column
data['hours_left'] = 0.0
# fill in new column
for i, row in data.iterrows():
    s = row['condition']
    number = s.split()[0]
    time_type = s.split()[1]
    val = 0
    if (time_type == 'minutes'):
        val = round(float(number)/60, 6)
    elif (time_type == 'hours'):
        val = round(float(number), 6)
    else:
        val = round(float(number)*24, 6)
    data.loc[i, 'hours_left'] = val
# delete old column
del data['condition']
data['hours_left'].describe()
count    2348.000000
mean       55.735619
std        61.833545
min         0.000000
25%        21.000000
50%        48.000000
75%        48.000000
max       360.000000
Name: hours_left, dtype: float64
We can see that the median time remaining on a listing is 48 hours (two days), with a mean of about 56 hours. We also have some expired listings (0 hours left), but that is ok.
How about the colors? How many do we have of each?
data['color'].value_counts()
white                                      669
black                                      491
gray                                       375
silver                                     287
red                                        177
blue                                       141
no_color                                    58
charcoal                                    18
green                                       17
orange                                      16
gold                                        13
brown                                       12
magnetic metallic                            6
yellow                                       6
shadow black                                 5
beige                                        5
color:                                       4
oxford white                                 4
ingot silver metallic                        4
billet silver metallic clearcoat             3
super black                                  3
triple yellow tri-coat                       3
black clearcoat                              2
white platinum tri-coat metallic             2
off-white                                    2
bright white clearcoat                       2
ruby red metallic tinted clearcoat           2
tuxedo black metallic                        2
cayenne red                                  2
morningsky blue                              1
tan                                          1
glacier white                                1
purple                                       1
burgundy                                     1
jazz blue pearlcoat                          1
competition orange                           1
kona blue metallic                           1
royal crimson metallic tinted clearcoat      1
phantom black                                1
ingot silver                                 1
guard                                        1
pearl white                                  1
ruby red                                     1
turquoise                                    1
lightning blue                               1
toreador red                                 1
Name: color, dtype: int64
This is bad: many of these entries are just manufacturer-specific names or specific shades of the same basic color. We need to combine the colors that are very similar. The code below does so:
# consolidate similar shades into basic colors; note that more specific names
# (e.g. "off-white") are checked before generic ones (e.g. "white")
for i, row in data.iterrows():
    color = row['color']
    if ("black" in color):
        data.loc[i, 'color'] = "black"
    elif ("off-white" in color):
        data.loc[i, 'color'] = "beige"
    elif ("silver" in color):
        data.loc[i, 'color'] = "silver"
    elif ("yellow" in color):
        data.loc[i, 'color'] = "yellow"
    elif ("white" in color):
        data.loc[i, 'color'] = "white"
    elif ("red" in color):
        data.loc[i, 'color'] = "red"
    elif ("tan" in color):
        data.loc[i, 'color'] = "brown"
    elif ("blue" in color):
        data.loc[i, 'color'] = "blue"
    elif ("crimson" in color):
        data.loc[i, 'color'] = "red"
    elif ("burgundy" in color):
        data.loc[i, 'color'] = "red"
    elif ("charcoal" in color):
        data.loc[i, 'color'] = "gray"
    elif ("orange" in color):
        data.loc[i, 'color'] = "orange"
    elif ("turquoise" in color):
        data.loc[i, 'color'] = "blue"
    elif ("metallic" in color):
        data.loc[i, 'color'] = "silver"
    elif ("color:" in color or "guard" in color):
        data.loc[i, 'color'] = "no_color"
data['color'].value_counts()
white       679
black       504
gray        393
silver      301
red         185
blue        146
no_color     63
green        17
orange       17
gold         13
brown        13
yellow        9
beige         7
purple        1
Name: color, dtype: int64
This looks much better!
The final step in processing this data is making another dataframe that can be used later for running Machine Learning models. These models require numerical inputs, so let's convert all categorical variables to integer category codes.
data_ml = data.copy()
data_ml
 | price | brand | model | year | title_status | mileage | color | state | country | hours_left
---|---|---|---|---|---|---|---|---|---|---|
2 | 5350 | dodge | mpv | 2018 | clean vehicle | 39590.0 | silver | georgia | usa | 48.0 |
3 | 25000 | ford | door | 2014 | clean vehicle | 64146.0 | blue | virginia | usa | 22.0 |
4 | 27700 | chevrolet | 1500 | 2018 | clean vehicle | 6654.0 | red | florida | usa | 22.0 |
5 | 5700 | dodge | mpv | 2018 | clean vehicle | 45561.0 | white | texas | usa | 48.0 |
7 | 13350 | gmc | door | 2017 | clean vehicle | 23525.0 | gray | california | usa | 20.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2494 | 7800 | nissan | versa | 2019 | clean vehicle | 23609.0 | red | california | usa | 24.0 |
2495 | 9200 | nissan | versa | 2018 | clean vehicle | 34553.0 | silver | florida | usa | 21.0 |
2496 | 9200 | nissan | versa | 2018 | clean vehicle | 31594.0 | silver | florida | usa | 21.0 |
2497 | 9200 | nissan | versa | 2018 | clean vehicle | 32557.0 | black | florida | usa | 48.0 |
2498 | 9200 | nissan | versa | 2018 | clean vehicle | 31371.0 | silver | florida | usa | 21.0 |
2348 rows × 10 columns
# convert each categorical column to integer category codes
for col in ['brand', 'model', 'title_status', 'color', 'state', 'country']:
    data_ml[col] = data_ml[col].astype('category').cat.codes
data_ml
 | price | brand | model | year | title_status | mileage | color | state | country | hours_left
---|---|---|---|---|---|---|---|---|---|---|
2 | 5350 | 7 | 70 | 2018 | 0 | 39590.0 | 11 | 7 | 1 | 48.0 |
3 | 25000 | 8 | 27 | 2014 | 0 | 64146.0 | 2 | 38 | 1 | 22.0 |
4 | 27700 | 5 | 0 | 2018 | 0 | 6654.0 | 10 | 6 | 1 | 22.0 |
5 | 5700 | 7 | 70 | 2018 | 0 | 45561.0 | 12 | 35 | 1 | 48.0 |
7 | 13350 | 9 | 27 | 2017 | 0 | 23525.0 | 5 | 3 | 1 | 20.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2494 | 7800 | 24 | 108 | 2019 | 0 | 23609.0 | 10 | 3 | 1 | 24.0 |
2495 | 9200 | 24 | 108 | 2018 | 0 | 34553.0 | 11 | 6 | 1 | 21.0 |
2496 | 9200 | 24 | 108 | 2018 | 0 | 31594.0 | 11 | 6 | 1 | 21.0 |
2497 | 9200 | 24 | 108 | 2018 | 0 | 32557.0 | 1 | 6 | 1 | 48.0 |
2498 | 9200 | 24 | 108 | 2018 | 0 | 31371.0 | 11 | 6 | 1 | 21.0 |
2348 rows × 10 columns
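One thing worth noting about this encoding: .cat.codes assigns each category an arbitrary integer, which implies an ordering that doesn't really exist. The tree-based models we rely on later are fairly robust to this; for linear models, one-hot encoding is usually the safer choice. Here is a small sketch of that alternative (not used in the rest of this notebook):

# Alternative encoding (not used below): expand each categorical column into
# 0/1 indicator columns, stored under a separate name so the pipeline is unchanged.
data_onehot = pd.get_dummies(data, columns=['brand', 'model', 'title_status',
                                            'color', 'state', 'country'])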
We can now proceed to EDA.
Let's now explore some trends in our data. First, we plot mileage against model year.
fig = plt.figure(figsize=(12,8))
sns.boxplot(data = data, x = "year", y = "mileage").set_title('Model Year vs Mileage')
[Figure: boxplot titled 'Model Year vs Mileage']
The trend seems reasonable: the older a car is, the more miles it tends to have. We can also note that 2017 has more outliers than the other years.
Next, let's look at price vs brand
fig = plt.figure(figsize=(12,8))
sns.barplot(data = data, y = "brand", x = "price").set_title('Price vs Brand')
[Figure: bar plot titled 'Price vs Brand']
We can see that luxury brands such as Mercedes-Benz, BMW, Cadillac, Lincoln, Lexus, and Audi all have higher average prices than mainstream brands such as Hyundai, Jeep, Kia, and Mazda. There are some anomalies: Honda is considerably cheaper than other mainstream brands, and Jaguar is a lot cheaper than expected for a luxury brand.
Next, let's look at price vs year, but let's also observe how the title status affects this price.
fig = plt.figure(figsize=(12,8))
sns.lineplot(data = data, x = "year", y = "price", hue = "title_status").set_title('Price over time with effect of title status')
[Figure: line plot titled 'Price over time with effect of title status']
This is also expected, as cars with salvage titles are going to be cheaper (a salvage title means an insurance company has deemed the vehicle a total loss after significant damage). Additionally, the upward price trend over model year also makes sense.
Next, let's see the popularity of different brands
fig = plt.figure(figsize=(14,8))
sns.countplot(data = data, x = "brand", order = data['brand'].value_counts().iloc[:10].index) # see top 10
[Figure: count plot of listings for the top 10 brands]
We can see that Ford, Dodge, Nissan, Chevy, and GMC are the most popular brands. This makes sense as this is a US dataset.
So far, we have only observed variables that we know should have an effect on the price by pure intuition. However, we don't know what trends to expect with a vehicle color, location, or time remaining in auction. Let's now pivot to observe these variables.
fig = plt.figure(figsize=(12,8))
g = sns.relplot(data = data, y = "color", x = "price")
g.fig.suptitle("Price vs Color")
[Figure: scatter plot titled 'Price vs Color']
From this plot, we can see that neutral/mainstream colors such as silver, black, white, red, and blue reach higher prices than unusual colors such as gold, green, orange, and yellow. This makes sense because people generally don't buy these unusual colors, and most mainstream cars don't even offer them as options.
Let's look at if the time remaining on auction has an effect on price:
fig = plt.figure(figsize=(12,8))
sns.lineplot(data = data, x = "hours_left", y = "price").set_title('Price vs auction time remaining')
[Figure: line plot titled 'Price vs auction time remaining']
Notice that as the auction nears its end, the average price fluctuates considerably, while listings with more time remaining show a more stable average price.
Lastly, let's see the average price in each location
df = data.groupby(by=["state"]).mean().reset_index()
df = df.sort_values(["price"]).reset_index(drop=True)
fig = plt.figure(figsize=(12,8))
sns.barplot(data = df, x = "state", y = "price")
[Figure: bar plot of average price by state]
Average prices clearly vary from state to state (the bars above are sorted by average sale price), so the state a car is listed in may carry some signal for predicting its price.
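As a last quick look before modeling, we can also check how strongly each encoded feature correlates with price. This is just a rough guide (a small sketch using the data_ml frame), since correlation only captures linear relationships:

# Sketch: linear correlation of each encoded feature with price.
print(data_ml.corr()['price'].sort_values())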
In this section, we will use a few different models to develop a prediction and pick the best one at the end. First, let's define X and y: the predictors (everything except price) and the target to predict (price). Note that we are using the data_ml dataframe defined above, which has all numerical values. We also need to standardize this data before we use it, and we can simply use StandardScaler to do so.
X = data_ml.drop(["price"], axis = 1)
X = StandardScaler().fit_transform(X)
y = data_ml.price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
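A quick aside on the standardization step: above, the scaler is fit on the full dataset before splitting, which keeps things simple. A slightly more careful variant (just a sketch, and not what was used for the results below) fits the scaler on the training split only, so that no statistics from the test set leak into it:

# Sketch: fit the scaler on the training split only (avoids test-set leakage).
# The *_alt names keep this separate from the variables used in the rest of the notebook.
X_raw = data_ml.drop(["price"], axis=1)
X_train_raw, X_test_raw, y_train_alt, y_test_alt = train_test_split(
    X_raw, data_ml.price, test_size=0.2, random_state=1)
scaler_alt = StandardScaler().fit(X_train_raw)
X_train_alt = scaler_alt.transform(X_train_raw)
X_test_alt = scaler_alt.transform(X_test_raw)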
Let's start off with a simple Linear Regression:
# Create Linear Regressor
lr = LinearRegression()
# fit
lr.fit(X_train, y_train)
# predict
pred = lr.predict(X_test)
print("Mean: " + str(data_ml['price'].mean()))
print("Mean absolute error: " + str(mae(y_test, pred)))
print("r^2: " + str(lr.score(X_test,y_test)))
Mean: 19783.816865417375 Mean absolute error: 7649.118911510767 r^2: 0.30466295151847744
The Linear Regressor gives us an r^2 score of around 0.305 on the test set, which is not good, so it is not a good fit for our data.
Let's try Logistic Regression:
# Create Logistic Regressor
lr = LogisticRegression()
# fit
lr.fit(X_train, y_train)
# predict
pred = lr.predict(X_test)
print("Mean: " + str(data_ml['price'].mean()))
print("Mean absolute error: " + str(mae(y_test, pred)))
print("r^2: " + str(lr.score(X_test,y_test)))
Mean: 19783.816865417375 Mean absolute error: 9099.872340425532 r^2: 0.01702127659574468
Logistic Regression is even worse! This is not too surprising: logistic regression is really a classifier, so here it treats each distinct price as its own class rather than modeling price as a continuous value. It seems like these two models do not fit our data well.
Next, let's try k-nearest neighbors (KNN).
This regressor predicts a car's price by averaging the prices of the k most similar cars, i.e. its nearest neighbors in feature space.
# Create KNN Regressor
knn = KNeighborsRegressor()
# fit
knn.fit(X_train, y_train)
# predict
pred = knn.predict(X_test)
print("Mean: " + str(data_ml['price'].mean()))
print("Mean absolute error: " + str(mae(y_test, pred)))
print("r^2: " + str(knn.score(X_test,y_test)))
Mean: 19783.816865417375 Mean absolute error: 5881.34170212766 r^2: 0.49945672152215004
We can see that the KNN regressor only has an r^2 score of about 0.499, which is not great. We need to explore more models.
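Before moving on, here is a small sketch (not part of the results above) showing how the choice of k, the number of neighbors, affects this score; the cell above used scikit-learn's default of k = 5:

# Sketch: sweep the neighborhood size k and report the test r^2 for each.
for k in [3, 5, 10, 20]:
    knn_k = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print("k = " + str(k) + ": r^2 = " + str(round(knn_k.score(X_test, y_test), 3)))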
Let's try Gradient Boosting. It is an ensemble model that combines many weak learners (shallow decision trees) into a strong one, with each new tree correcting the errors of those before it.
# Create GB Regressor
gb = GradientBoostingRegressor(random_state = 1)
# fit
gb.fit(X_train, y_train)
# predict
pred = gb.predict(X_test)
print("Mean: " + str(data_ml['price'].mean()))
print("Mean absolute error: " + str(mae(y_test, pred)))
print("r^2: " + str(gb.score(X_test,y_test)))
Mean: 19783.816865417375 Mean absolute error: 5227.092008745825 r^2: 0.6216778914826813
Gradient Boosting gives us an r^2 score of 0.6217, which is much better!
Random Forests?
We could stop here, as Gradient Boosting did give us a reasonable model. However, let's see how Random Forests perform on our dataset. This model is widely used because it averages many decision trees trained on random subsets of the data, which helps prevent the overfitting that a single tree is prone to. Due to this, I predict it will have the best score. Let's see if the hypothesis holds true:
# Create Random Forest Regressor
rf = RandomForestRegressor(random_state = 1)
# fit
rf.fit(X_train, y_train)
# predict
pred = rf.predict(X_test)
print("Mean: " + str(data_ml['price'].mean()))
print("Mean absolute error: " + str(mae(y_test, pred)))
print("r^2: " + str(rf.score(X_test,y_test)))
Mean: 19783.816865417375 Mean absolute error: 4414.9401276595745 r^2: 0.6829084942628927
And it did: we got an r^2 score of 0.6829! Since Random Forests performed the best, let's conduct hyperparameter tuning on it using Grid Search to see if we can get an even better score.
Hyperparameter tuning
# Create Random Forest Regressor
rf = RandomForestRegressor()
# parameters for GridSearch
param_grid = {'max_depth' : [15,20,25,30],
              'n_estimators' : [100,150,170],
              'random_state': [1]}
# Grid Search
rf_gscv = GridSearchCV(rf, param_grid)
# fit data
rf_gscv.fit(X_train, y_train)
# print best params
print("Optimal parameters using GridSearch: " + str(rf_gscv.best_params_))
Optimal parameters using GridSearch: {'max_depth': 30, 'n_estimators': 170, 'random_state': 1}
pred = rf_gscv.predict(X_test)
print("Mean: " + str(data_ml['price'].mean()))
print("Mean absolute error: " + str(mae(y_test, pred)))
# score
rf_gscv.score(X_test,y_test)
Mean: 19783.816865417375 Mean absolute error: 4380.687312265332
0.686376151395311
We can see that using these new parameters we have increased our r^2 from 0.6829 to 0.6864! While this score is not amazing, it is reasonable: on average the model predicts the price to within about $4,380 of the true value, against a mean price of roughly $19,784.
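As one extra sanity check, here is a sketch (not run as part of the results above) that uses the cross_val_score import from the setup cell to cross-validate a forest with the tuned parameters over the full dataset, rather than relying on a single train/test split:

# Sketch: 5-fold cross-validated r^2 for a forest with the tuned parameters.
tuned_rf = RandomForestRegressor(max_depth=30, n_estimators=170, random_state=1)
cv_scores = cross_val_score(tuned_rf, X, y, cv=5)
print("Cross-validated r^2 scores: " + str(cv_scores))
print("Mean r^2: " + str(cv_scores.mean()))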
That's it! We have walked through the data science pipeline of data collection, data processing, exploratory data analysis, machine learning, and providing insight. It was great being able to look into the factors that affect car prices and build a model that makes a reasonable prediction. Big thanks to the people on Kaggle who scraped this data from auctionexport.com and to Prof. John Dickerson for a great intro to data science class. Hopefully this tutorial provided you with some insight into how different factors affect a car's price, and the model could be something of use to make further predictions.
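For example, here is a rough sketch of how the tuned model could price a single new listing. The example car below is entirely made up, the scaler is re-fit on the same columns (the one used earlier was not kept as a variable), and each categorical value is looked up against the same sorted categories that .cat.codes used above:

# Sketch: estimate a price for one hypothetical listing with the tuned forest.
X_cols = data_ml.drop(["price"], axis=1)
full_scaler = StandardScaler().fit(X_cols)      # same scaling as used for training

example_car = {                                 # made-up listing for illustration
    "brand": "ford", "model": "door", "year": 2018,
    "title_status": "clean vehicle", "mileage": 45000.0,
    "color": "white", "state": "florida", "country": "usa",
    "hours_left": 48.0,
}

row = []
for col in X_cols.columns:
    if data[col].dtype == object:               # categorical: map the value to its code
        row.append(data[col].astype('category').cat.categories.get_loc(example_car[col]))
    else:                                       # numeric: use the value directly
        row.append(example_car[col])

estimate = rf_gscv.predict(full_scaler.transform([row]))[0]
print("Estimated price: $" + str(round(estimate, 2)))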
Here are some links for further reading into some of the topics discussed:
Thanks for reading!
Word count and code line count check:
# word count requirement:
import io
from IPython.nbformat import current  # legacy notebook-format API
with io.open("Final Project.ipynb", 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')
word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Word count: " + str(word_count))
code_line_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "code":
        lines = cell['input'].splitlines()
        for l in lines:
            if not l.startswith("#"): # not a comment
                code_line_count += 1
print("Code line count: " + str(code_line_count))
Word count: 1622 Code line count: 189