Car Price Prediction: A Regression Analysis Model

Published: 2023/12/14

Overview:

This article uses Python's sklearn library to run a regression analysis on historical car price data. It covers data preprocessing and feature correlation analysis, and finally builds a price prediction model with Lasso regression.

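Dataset overview

The data contains both categorical and continuous variables, which fall into three main groups:
1. Various characteristics of the car.
2. Insurance risk rating (symboling): (-3, -2, -1, 0, 1, 2, 3).
3. Normalized losses: the relative average loss payment per insured vehicle year.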

Categorical attributes
- make: car brand (Audi, BMW, ...)
- fuel-type: gas (gasoline) or diesel
- aspiration: standard or turbo
- num-of-doors: two or four doors
- body-style: hardtop, sedan, hatchback, convertible
- drive-wheels: driven wheels
- engine-location: engine location
- engine-type: engine type
- num-of-cylinders: number of cylinders
- fuel-system: fuel system

Continuous attributes
- bore: continuous from 2.54 to 3.94.
- stroke: continuous from 2.07 to 4.17.
- compression-ratio: continuous from 7 to 23.
- horsepower: continuous from 48 to 288.
- peak-rpm: continuous from 4150 to 6600.
- city-mpg: continuous from 13 to 49.
- highway-mpg: continuous from 16 to 54.
- price: continuous from 5118 to 45400.

1. Data Loading and Exploration

# loading packages
import numpy as np
import pandas as pd

# data visualization and missing values
import matplotlib.pyplot as plt
import seaborn as sns  # advanced vizs
import missingno as msno  # visualizes missing values
%matplotlib inline

# stats
from statsmodels.distributions.empirical_distribution import ECDF
from sklearn.metrics import mean_squared_error, r2_score

# machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
seed = 123

# importing data ('?' marks missing values)
data = pd.read_csv("Auto-Data.csv", na_values = '?')

# inspect the data
data.columns  # column names
data.dtypes   # data type of each column
data.shape    # number of rows and columns
data.head(5)  # first 5 rows

Check the data type of each column: data.dtypes
Check the shape of the data and the first 5 rows:
print("In total: ",data.shape)
data.head(5)
The dataset has 205 rows and 26 columns.
Descriptive statistics: data.describe()

2. Handling Missing Values (visualized with missingno)

sns.set(style = "ticks")  # set the plot style
msno.matrix(data)         # draw the missing-value matrix

Blank areas in the plot indicate missing values. Seven features contain missing values, and normalized-losses is the most severely affected.

sns.set(style = "ticks")
plt.figure(figsize = (12, 5))
c = '#366DE8'

# ECDF
plt.subplot(121)
cdf = ECDF(data['normalized-losses'].dropna())  # empirical cumulative distribution, NaNs dropped
plt.plot(cdf.x, cdf.y, label = "statsmodels", color = c);
plt.xlabel('normalized losses'); plt.ylabel('ECDF');

# overall distribution
plt.subplot(122)
plt.hist(data['normalized-losses'].dropna(),
         bins = int(np.sqrt(len(data['normalized-losses']))), color = c);

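The distribution of the observed values shows that 80% of normalized losses are below 200, and the vast majority are below 125.
With a distribution like this, filling the missing values with the overall median or mean may not be precise enough. It is worth asking instead which other factors this feature is related to.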
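
For reference, an ECDF value at x is just the fraction of observations less than or equal to x. The following is a minimal sketch of that idea with NumPy; the helper name `ecdf` and the sample values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the cumulative fraction of points at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

sample = np.array([65.0, 90.0, 110.0, 125.0, 256.0])  # hypothetical loss values
x, y = ecdf(sample)

# Share of observations at or below a threshold, read directly from the data
share_below_125 = np.mean(sample <= 125)
print(x, y, share_below_125)
```

Reading a plot like the one above amounts to looking up such a share for a chosen threshold.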

Next, group the rows by risk rating (symboling) and fill the missing normalized-losses values with the mean of each group.

# inspect each group
data.groupby('symboling')['normalized-losses'].describe()

# drop and fill missing values
# columns with only a few missing values: drop the affected rows
data = data.dropna(subset = ['price', 'bore', 'stroke', 'peak-rpm', 'horsepower', 'num-of-doors'])
# fill normalized-losses with the mean of its symboling group
data['normalized-losses'] = data.groupby('symboling')['normalized-losses'].transform(lambda x: x.fillna(x.mean()))

# check the result
print('In total:', data.shape)
data.head()

After the missing values are handled, the dataset has 193 rows and 26 columns.
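
To make the group-wise fill concrete, here is a small sketch on a toy table. The column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two risk groups, each with one missing loss value (values are made up)
toy = pd.DataFrame({
    "symboling": [0, 0, 0, 1, 1],
    "normalized-losses": [100.0, np.nan, 120.0, 150.0, np.nan],
})

# Within each symboling group, replace NaN with that group's mean
toy["filled"] = (
    toy.groupby("symboling")["normalized-losses"]
       .transform(lambda s: s.fillna(s.mean()))
)
print(toy)
```

The missing value in group 0 becomes 110 (the mean of 100 and 120), while the one in group 1 becomes 150, so each gap is filled from cars with the same risk rating rather than from the whole table.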

3. Feature Correlation

3.1 Computing and displaying correlations

cormatrix = data.corr(numeric_only = True)  # numeric columns only; pandas >= 2.0 raises on text columns otherwise
# cormatrix  # view the result

# a different presentation: keep only the strict upper triangle,
# zeroing the diagonal (self-correlations of 1) and the duplicated lower half
cormatrix *= np.tri(*cormatrix.values.shape, k = -1).T
cormatrix = cormatrix.stack()  # one row per pair of variables

# find the ten most strongly correlated pairs
cormatrix = cormatrix.reindex(cormatrix.abs().sort_values(ascending = False).index).reset_index()
cormatrix.columns = ["FirstVariable", "SecondVariable", "Correlation"]
cormatrix.head(10)
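
The triangular-mask step is easier to follow on a tiny example. The sketch below uses a hand-written 3 x 3 correlation matrix with hypothetical variable names `a`, `b`, `c`; it is only meant to show what the mask and the ranking do.

```python
import numpy as np
import pandas as pd

cols = ["a", "b", "c"]  # hypothetical variable names
corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, -0.5],
     [0.2, -0.5, 1.0]],
    index=cols, columns=cols,
)

# Strict upper triangle: 1 above the diagonal, 0 on and below it
mask = np.tri(3, 3, k=-1).T
pairs = (corr * mask).stack()  # one entry per (row, column) pair

# Rank pairs by absolute correlation, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked.head(3))
```

Each pair now appears once, the self-correlations no longer dominate the ranking, and negative correlations are ranked by their strength rather than their sign.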

In the first row, city-mpg and highway-mpg have a correlation as high as 0.97, so one of them should be dropped.
Length, width, and height are presumably related to one another, so these features can be combined into a single new feature.

data['volume'] = data.length * data.width * data.height

data.drop(['width', 'length', 'height', 'curb-weight', 'city-mpg'],
          axis = 1,  # 1 for columns
          inplace = True)

3.2 Heatmap

# Compute the correlation matrix
corr_all = data.corr(numeric_only = True)

# Generate a mask for the upper triangle
mask = np.zeros_like(corr_all, dtype = bool)  # np.bool is gone from recent NumPy releases
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize = (11, 9))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_all, mask = mask, square = True, linewidths = .5, ax = ax, cmap = "BuPu")
plt.show()

price appears to be most strongly correlated with wheel-base, engine-size, bore, and horsepower.
- seaborn can also show the individual variables in detail: sns.pairplot(data, hue = 'fuel-type', palette = 'plasma')

3.3 Further regression analysis

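The heatmap above shows that price is strongly correlated with several other variables. Next, lmplot is used to look more closely at the relationship between two variables. lmplot fits a simple (one-variable) linear regression to the selected data and draws the best-fit line.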

print('fuel_type:', data['fuel-type'].unique(), '\ndoors:', data['num-of-doors'].unique())

fuel-type and num-of-doors are both categorical variables, so the analysis can be split into groups by these two variables.

# recent seaborn versions require x, y and data to be passed as keywords
sns.lmplot(x = 'price', y = 'horsepower', data = data,
           hue = 'fuel-type', col = 'fuel-type', row = 'num-of-doors',
           palette = 'plasma', fit_reg = True);

Splitting by fuel type and number of doors gives 4 groups, that is, 4 plots. In every combination, a car's horsepower is positively correlated with its price.

4. Data Preprocessing

4.1 Standardization

Standardize the continuous values.

# target and features
target = data.price

regressors = [x for x in data.columns if x not in ['price']]
features = data.loc[:, regressors]

num = ['symboling', 'normalized-losses', 'volume', 'horsepower', 'wheel-base',
       'bore', 'stroke', 'compression-ratio', 'peak-rpm']

# scale the data
standard_scaler = StandardScaler()
features[num] = standard_scaler.fit_transform(features[num])

# glimpse
features.head()
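
As a rough illustration of what standardization does to a single column, the sketch below computes z-scores by hand with NumPy. The horsepower values are made up, and the population standard deviation (ddof = 0) is used because that is what StandardScaler uses.

```python
import numpy as np

horsepower = np.array([50.0, 100.0, 150.0])  # hypothetical values

mean = horsepower.mean()
std = horsepower.std()           # population standard deviation (ddof=0)
z = (horsepower - mean) / std    # z-score: distance from the mean in units of std

print(z)
print(z.mean(), z.std())         # approximately 0 and 1 after scaling
```

After this transformation every scaled column has mean 0 and standard deviation 1, which matters for Lasso because its penalty treats all coefficients on the same scale.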

4.2 One-hot encoding

Apply one-hot encoding to the categorical attributes.

# categorical vars
classes = ['make', 'fuel-type', 'aspiration', 'num-of-doors',
           'body-style', 'drive-wheels', 'engine-location',
           'engine-type', 'num-of-cylinders', 'fuel-system']

# create new dataset with only continuous vars (categoricals become dummy columns)
dummies = pd.get_dummies(features[classes])
features = features.join(dummies).drop(classes, axis = 1)

# new dataset
print('In total:', features.shape)
features.head()
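
The effect of the encoding step can be seen on a toy column. This is a minimal sketch with invented rows; only the column name `fuel-type` is borrowed from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})  # made-up rows

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(toy[["fuel-type"]])
print(dummies)

# Each row has exactly one active indicator
row_totals = dummies.sum(axis=1)
print(row_totals.tolist())
```

A categorical column with k distinct values therefore turns into k indicator columns, which is why the feature count grows sharply after this step.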

5. Model Building: Lasso Regression

5.1 Splitting the dataset

# hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = seed)
print("Train", X_train.shape, "and test", X_test.shape)

- The data is split into a training set of 135 rows and a test set of 58 rows.
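
The 135/58 split follows from simple arithmetic on the 193 remaining rows. The sketch below assumes that a fractional test size is rounded up and that the remainder goes to the training set, which is consistent with the sizes reported here.

```python
import math

n_samples = 193   # rows left after the missing-value step
test_size = 0.3

n_test = math.ceil(test_size * n_samples)  # 57.9 rounded up (assumed rounding rule)
n_train = n_samples - n_test               # the rest is used for training
print(n_train, n_test)
```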

5.2 Lasso regression

Lasso builds on linear regression by adding an absolute-value (L1) penalty term that penalizes large coefficients.
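
One way to see why an absolute-value penalty eliminates features: in the idealized case of orthonormal feature columns, minimizing 0.5 * ||y - Xw||^2 + lam * ||w||_1 reduces to soft-thresholding the ordinary least-squares coefficients. The sketch below assumes that special case; `soft_threshold`, `lam`, and the coefficient values are illustrative and not part of the original analysis.

```python
import numpy as np

def soft_threshold(w_ols, lam):
    """Shrink each coefficient toward zero by lam; values within lam of zero become exactly zero."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

w_ols = np.array([3.0, -0.5, 1.5, 0.2])   # made-up least-squares coefficients
w_lasso = soft_threshold(w_ols, lam=1.0)

kept = int(np.sum(w_lasso != 0))
print(w_lasso)
print(kept, "of", len(w_ols), "coefficients kept")
```

Large coefficients are shrunk by a fixed amount while small ones are set exactly to zero, so raising the penalty strength removes more variables from the model.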

# logarithmic scale: log base 2
# high values to zero-out more variables
alphas = 2. ** np.arange(2, 12)  # range of alphas to try
scores = np.empty_like(alphas)

for i, a in enumerate(alphas):
    lasso = Lasso(random_state = seed)
    lasso.set_params(alpha = a)
    lasso.fit(X_train, y_train)
    scores[i] = lasso.score(X_test, y_test)

# cross validation
lassocv = LassoCV(cv = 10, random_state = seed)
lassocv.fit(features, target)
lassocv_score = lassocv.score(features, target)
lassocv_alpha = lassocv.alpha_

plt.figure(figsize = (10, 4))
plt.plot(alphas, scores, '-ko')
plt.axhline(lassocv_score, color = c)
plt.xlabel(r'$\alpha$')
plt.ylabel('CV Score')
plt.xscale('log', base = 2)  # newer Matplotlib uses 'base' instead of 'basex'
sns.despine(offset = 15)

print('CV results:', lassocv_score, lassocv_alpha)

5.3 Feature importance analysis

# lassocv coefficients
coefs = pd.Series(lassocv.coef_, index = features.columns)

# prints out the number of picked/eliminated features
print("Lasso picked " + str(sum(coefs != 0)) + " features and eliminated the other " +
      str(sum(coefs == 0)) + " features.")

# show the 5 smallest and the 5 largest coefficients
coefs = pd.concat([coefs.sort_values().head(5), coefs.sort_values().tail(5)])

plt.figure(figsize = (10, 4))
coefs.plot(kind = "barh", color = c)
plt.title("Coefficients in the Lasso Model")
plt.show()

# plug in the alphas defined above
model_l1 = LassoCV(alphas = alphas, cv = 10, random_state = seed).fit(X_train, y_train)
y_pred_l1 = model_l1.predict(X_test)

model_l1.score(X_test, y_test)

5.4 Evaluating the results

5.4.1 Residual plot

Plot the difference between the actual and the predicted values.

plt.rcParams['figure.figsize'] = (6.0, 6.0)

preds = pd.DataFrame({"preds": model_l1.predict(X_train), "true": y_train})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x = "preds", y = "residuals", kind = "scatter", color = c)

5.4.2 MSE and R2

# metrics: MSE and R2
def MSE(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    print('MSE: %2.3f' % mse)
    return mse

def R2(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    print('R2: %2.3f' % r2)
    return r2

MSE(y_test, y_pred_l1); R2(y_test, y_pred_l1);
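
For readers who want to see what these two metrics compute, here is a small sketch that evaluates them by hand with NumPy on made-up numbers. MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # made-up actual values
y_pred = np.array([2.0, 5.0, 8.0])   # made-up predictions

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                    # mean squared error

ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print('MSE: %2.3f' % mse)
print('R2: %2.3f' % r2)
```

Because MSE is expressed in squared price units, it can look very large on this dataset even when R2 indicates a reasonable fit.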

5.4.3 Comparing actual and predicted values

# predictions
d = {'true': list(y_test), 'predicted': pd.Series(y_pred_l1)}

pd.DataFrame(d).head()

Source material: https://edu.csdn.net/learn/7380/149737
missingno: https://github.com/ResidentMario/missingno