Used Car Transaction Price Prediction

Published: 2023/12/14

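(A note up front: I am a complete beginner. Getting started was a bit hard, and I still need to work through the code piece by piece to understand it.)
The data comes from used-car transaction records on a trading platform. The full dataset has more than 400,000 records and 31 columns, 15 of which are anonymized variables. 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B. Fields such as name, model, brand, and regionCode are desensitized.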

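Problem analysis

1. Price prediction is a classic data mining problem: a model is built with data science, machine learning, and deep learning methods to produce the result. The task is a typical regression problem.
2. The main tools are xgb, lgb, and catboost, together with common data mining libraries and frameworks such as pandas, numpy, matplotlib, seaborn, sklearn, and keras.
3. EDA is used to explore relationships in the data and to become familiar with it.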

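Python libraries and functions

1. XGBoost (eXtreme Gradient Boosting) is widely used in competitions, where it often performs well. It is a large-scale parallel boosted tree tool and one of the fastest open-source boosted tree packages. Its algorithm is an improved form of GBDT (gradient boosting decision tree) and can be used for both classification and regression.
2. LightGBM is a gradient boosting framework that uses decision-tree-based learning algorithms. It supports distributed training and is efficient; compared with many common machine learning algorithms it is very fast.
https://www.cnblogs.com/jiangxinyang/p/9337094.html
3. CatBoost is a machine learning library open-sourced in 2017 by Yandex, the Russian search company. Like XGBoost and LightGBM, it is an improved implementation within the GBDT framework. It is based on symmetric decision trees (oblivious trees), has few parameters, supports categorical variables, and aims for high accuracy. The main pain point it addresses is handling categorical features efficiently and sensibly, as the name suggests: CatBoost combines "Categorical" and "Boosting". It also addresses gradient bias and prediction shift to improve accuracy and generalization.
https://www.cnblogs.com/dudumiaomiao/p/9693711.html
4. pandas (Python Data Analysis Library) is a tool built on NumPy and created for data analysis tasks. It incorporates a number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets, including many functions and methods for processing data quickly and conveniently. It is one of the main reasons Python is a powerful and efficient environment for data analysis.
https://www.cnblogs.com/misswangxing/p/7903595.html
5. NumPy is a Python package; the name stands for "Numerical Python". It is a library made up of multidimensional array objects and a collection of routines for processing arrays.
https://blog.csdn.net/a373595475/article/details/79580734
6. Matplotlib is a plotting library for Python. Used together with NumPy, it provides an effective open-source alternative to MATLAB. It can also be used with GUI toolkits such as PyQt and wxPython.
https://www.runoob.com/numpy/numpy-matplotlib.html
7. seaborn is an enhancement layer on top of matplotlib; matplotlib must be installed before seaborn can be used.
https://blog.csdn.net/weixin_38331049/article/details/89462338
8. Sklearn (full name Scikit-Learn) is a machine learning toolkit for Python. It is built on NumPy and SciPy and works well alongside pandas and Matplotlib. Its API is well designed and every object has a simple interface, which makes it suitable for beginners.
https://blog.csdn.net/algorithmPro/article/details/103045824
9. Keras is a deep learning framework written in Python that runs on top of backends such as Theano and TensorFlow. It is a high-level neural network API that supports fast experimentation, so an idea can be turned into a result quickly.
https://www.cnblogs.com/lc1217/p/7132364.html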

Code implementation

#1. Import the toolbox

    ## Basic tools
    import numpy as np
    import pandas as pd
    import warnings
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.special import jn
    from IPython.display import display, clear_output
    import time

    warnings.filterwarnings('ignore')
    %matplotlib inline

    ## Models for prediction
    from sklearn import linear_model
    from sklearn import preprocessing
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

    ## Dimensionality reduction
    from sklearn.decomposition import PCA, FastICA, FactorAnalysis, SparsePCA

    import lightgbm as lgb
    import xgboost as xgb

    ## Parameter search and evaluation
    from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, train_test_split
    from sklearn.metrics import mean_squared_error, mean_absolute_error

#2. Read the data

    ## Read the data with pandas (a very friendly library for loading data)
    Train_data = pd.read_csv('D:/Anaconda/lib/site-packages/pandas/io/used_car_train_20200313.csv', sep=' ')
    TestA_data = pd.read_csv('D:/Anaconda/lib/site-packages/pandas/io/used_car_testA_20200313.csv', sep=' ')

    ## Print the size of each dataset
    print('Train data shape:', Train_data.shape)
    print('TestA data shape:', TestA_data.shape)

Train data shape: (150000, 31)
TestA data shape: (50000, 30)

#2.1 Quick look at the data

    ## .head() gives a quick look at the form of the loaded data
    Train_data.head()

#2.2 Inspect data information

    ## .info() gives a quick view of the column names and of missing (NaN) values
    Train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB

    ## .columns lists the column names
    Train_data.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
'v_13', 'v_14'],
dtype='object')

    TestA_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID 50000 non-null int64
name 50000 non-null int64
regDate 50000 non-null int64
model 50000 non-null float64
brand 50000 non-null int64
bodyType 48587 non-null float64
fuelType 47107 non-null float64
gearbox 48090 non-null float64
power 50000 non-null int64
kilometer 50000 non-null float64
notRepairedDamage 50000 non-null object
regionCode 50000 non-null int64
seller 50000 non-null int64
offerType 50000 non-null int64
creatDate 50000 non-null int64
v_0 50000 non-null float64
v_1 50000 non-null float64
v_2 50000 non-null float64
v_3 50000 non-null float64
v_4 50000 non-null float64
v_5 50000 non-null float64
v_6 50000 non-null float64
v_7 50000 non-null float64
v_8 50000 non-null float64
v_9 50000 non-null float64
v_10 50000 non-null float64
v_11 50000 non-null float64
v_12 50000 non-null float64
v_13 50000 non-null float64
v_14 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
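
The two `.info()` listings show that bodyType, fuelType, and gearbox have fewer non-null entries than rows, so they contain missing values. A minimal sketch of counting missing values per column directly; the small frame below is a made-up stand-in for `Train_data`, not the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for Train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "bodyType": [1.0, np.nan, 2.0, 0.0],
    "fuelType": [0.0, 1.0, np.nan, np.nan],
    "power": [60, 75, 110, 150],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

On the real data, the same two lines applied to `Train_data` should agree with the non-null counts printed by `.info()`.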

#2.3 Browse data statistics

    ## .describe() shows summary statistics for the numerical feature columns
    Train_data.describe()

    TestA_data.describe()

#3. Feature and label construction
#3.1 Extract the names of the numerical feature columns

    numerical_cols = Train_data.select_dtypes(exclude='object').columns
    print(numerical_cols)

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
dtype='object')

    categorical_cols = Train_data.select_dtypes(include='object').columns
    print(categorical_cols)

Index(['notRepairedDamage'], dtype='object')
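
notRepairedDamage is the only object-typed column, and it is left out of the features used later. A rough sketch of how such a column might be inspected and coerced to numbers; the toy values, including the `'-'` placeholder, are assumptions and are not taken from the article:

```python
import pandas as pd

# Toy column; the '-' placeholder is a hypothetical non-numeric entry.
col = pd.Series(["0.0", "-", "1.0", "0.0"], name="notRepairedDamage")

# See which raw values occur and how often.
print(col.value_counts())

# Convert to numbers; anything that cannot be parsed becomes NaN.
numeric = pd.to_numeric(col, errors="coerce")
print(numeric.isnull().sum())
```

Whether the real column needs this treatment depends on what `value_counts()` shows for it.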

#3.2 Build the training and test samples

    ## Select the feature columns
    feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']]
    feature_cols = [col for col in feature_cols if 'Type' not in col]

    ## Extract the feature columns and the label column to build the training and test samples
    X_data = Train_data[feature_cols]
    Y_data = Train_data['price']

    X_test = TestA_data[feature_cols]

    print('X train shape:', X_data.shape)
    print('X test shape:', X_test.shape)

X train shape: (150000, 18)
X test shape: (50000, 18)
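
The 18 in the shapes above follows from the two filters: nine ID-like or label columns are dropped by name, then every column whose name contains `Type` is dropped. A small sketch that replays the filters on the column names alone (the list is copied from the `numerical_cols` output; no data is needed):

```python
# Column names as printed for numerical_cols.
numerical_cols = [
    "SaleID", "name", "regDate", "model", "brand", "bodyType", "fuelType",
    "gearbox", "power", "kilometer", "regionCode", "seller", "offerType",
    "creatDate", "price",
] + [f"v_{i}" for i in range(15)]

# First filter: drop identifiers, dates, the label, and coded fields by name.
excluded = {"SaleID", "name", "regDate", "creatDate", "price",
            "model", "brand", "regionCode", "seller"}
feature_cols = [c for c in numerical_cols if c not in excluded]

# Second filter: drop bodyType, fuelType, and offerType.
feature_cols = [c for c in feature_cols if "Type" not in c]

print(len(feature_cols))   # 18
print(feature_cols[:3])    # ['gearbox', 'power', 'kilometer']
```

What remains is gearbox, power, kilometer, and the 15 anonymized `v_*` columns.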

    ## Define a helper that prints basic statistics, for later use
    def Sta_inf(data):
        print('_min', np.min(data))
        print('_max:', np.max(data))
        print('_mean', np.mean(data))
        print('_ptp', np.ptp(data))
        print('_std', np.std(data))
        print('_var', np.var(data))

#3.3 Basic distribution statistics of the label

    print('Sta of label:')
    Sta_inf(Y_data)

Sta of label:
_min 11
_max: 99999
_mean 5923.32733333
_ptp 99988
_std 7501.97346988
_var 56279605.9427
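
Two of these numbers are derived from the others: `_ptp` (peak to peak) is the maximum minus the minimum, and `_var` is the square of `_std`. A short check of both relations, first on a made-up array and then on the figures reported above:

```python
import numpy as np

# Made-up prices, used only to illustrate the relations.
prices = np.array([11, 500, 3250, 7700, 99999])

# np.ptp is max - min; np.var is np.std squared (both use ddof=0 by default).
ptp_matches = np.ptp(prices) == prices.max() - prices.min()
var_matches = np.isclose(np.var(prices), np.std(prices) ** 2)
print(ptp_matches, var_matches)

# The same relations applied to the reported label statistics.
print(99999 - 11)            # 99988
print(7501.97346988 ** 2)    # close to the reported _var
```

The mean (about 5923) sits far below the maximum (99999), which suggests a strongly right-skewed price distribution.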

    ## Plot a histogram of the label to see its distribution
    plt.hist(Y_data)
    plt.show()
    plt.close()

#3.4 Fill missing values with -1

    X_data = X_data.fillna(-1)
    X_test = X_test.fillna(-1)
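
A minimal sketch of what `fillna(-1)` does, on a made-up frame. It assumes -1 is not a legitimate value in the columns that have gaps, so the filled entries stay distinguishable:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing gearbox value (values are invented).
toy = pd.DataFrame({"gearbox": [0.0, np.nan, 1.0], "power": [60, 75, 110]})

# fillna returns a new frame; the original is left unchanged.
filled = toy.fillna(-1)
print(filled["gearbox"].tolist())   # [0.0, -1.0, 1.0]
```

Tree-based models such as xgb and lgb can split on the sentinel value, which is one reason this simple fill is common in baselines.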

#4. Model training and prediction
#4.1 Use xgb with 5-fold cross-validation to check the effect of the model parameters

    ## xgb-Model
    xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,
                           colsample_bytree=0.9, max_depth=7)  # , objective='reg:squarederror'

    scores_train = []
    scores = []

    ## 5-fold cross-validation
    sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_ind, val_ind in sk.split(X_data, Y_data):
        train_x = X_data.iloc[train_ind].values
        train_y = Y_data.iloc[train_ind]
        val_x = X_data.iloc[val_ind].values
        val_y = Y_data.iloc[val_ind]

        xgr.fit(train_x, train_y)
        pred_train_xgb = xgr.predict(train_x)
        pred_xgb = xgr.predict(val_x)

        score_train = mean_absolute_error(train_y, pred_train_xgb)
        scores_train.append(score_train)
        score = mean_absolute_error(val_y, pred_xgb)
        scores.append(score)

    ## Average the MAE over all folds (scores_train is the list, score_train only the last fold)
    print('Train mae:', np.mean(scores_train))
    print('Val mae', np.mean(scores))

Train mae: 628.086664863
Val mae 715.990013454
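
The cross-validation loop can be sketched in a self-contained form. This version is an illustration under assumptions, not the article's setup: it uses synthetic data from `make_regression`, scikit-learn's `GradientBoostingRegressor` in place of xgb, and plain `KFold`, which is the usual splitter for a continuous target (`StratifiedKFold` is designed for class labels).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data standing in for X_data / Y_data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)

train_scores, val_scores = [], []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    # Refit on the training part of this fold, then score both parts.
    model.fit(X[train_idx], y[train_idx])
    train_scores.append(mean_absolute_error(y[train_idx], model.predict(X[train_idx])))
    val_scores.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

print("Train MAE:", np.mean(train_scores))
print("Val MAE:", np.mean(val_scores))
```

As in the output above, the training MAE is expected to be lower than the validation MAE; the gap gives a rough sense of how much the model overfits.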

#4.2 Define the xgb and lgb model functions

    def build_model_xgb(x_train, y_train):
        model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,
                                 colsample_bytree=0.9, max_depth=7)  # , objective='reg:squarederror'
        model.fit(x_train, y_train)
        return model

    def build_model_lgb(x_train, y_train):
        estimator = lgb.LGBMRegressor(num_leaves=127, n_estimators=150)
        ## Search over the learning rate; GridSearchCV refits the best setting on all of x_train
        param_grid = {
            'learning_rate': [0.01, 0.05, 0.1, 0.2],
        }
        gbm = GridSearchCV(estimator, param_grid)
        gbm.fit(x_train, y_train)
        return gbm
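
`build_model_lgb` relies on `GridSearchCV` to pick the learning rate. Without a `scoring` argument, `GridSearchCV` ranks candidates by the estimator's own `score` method, which for regressors is R² rather than MAE. A hedged sketch of the same search with MAE as the criterion; the synthetic data and the use of `GradientBoostingRegressor` as a stand-in for `LGBMRegressor` are assumptions made to keep it self-contained:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for x_train / y_train.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

param_grid = {"learning_rate": [0.01, 0.05, 0.1, 0.2]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print("CV MAE of best setting:", -search.best_score_)
```

After `fit`, calling `search.predict(...)` uses the best estimator refitted on all the data passed in, which is how the returned `gbm` object is used in the next section.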

#4.3 Split the data (Train, Val) for model training, evaluation, and prediction

    ## Split data with val
    x_train, x_val, y_train, y_val = train_test_split(X_data, Y_data, test_size=0.3)

    print('Train lgb...')
    model_lgb = build_model_lgb(x_train, y_train)
    val_lgb = model_lgb.predict(x_val)
    MAE_lgb = mean_absolute_error(y_val, val_lgb)
    print('MAE of val with lgb:', MAE_lgb)

    print('Predict lgb...')
    model_lgb_pre = build_model_lgb(X_data, Y_data)
    subA_lgb = model_lgb_pre.predict(X_test)
    print('Sta of Predict lgb:')
    Sta_inf(subA_lgb)

Train lgb...
MAE of val with lgb: 689.084070621
Predict lgb...
Sta of Predict lgb:
_min -519.150259864
_max: 88575.1087721
_mean 5922.98242599
_ptp 89094.259032
_std 7377.29714126
_var 54424513.1104

    print('Train xgb...')
    model_xgb = build_model_xgb(x_train, y_train)
    val_xgb = model_xgb.predict(x_val)
    MAE_xgb = mean_absolute_error(y_val, val_xgb)
    print('MAE of val with xgb:', MAE_xgb)

    print('Predict xgb...')
    model_xgb_pre = build_model_xgb(X_data, Y_data)
    subA_xgb = model_xgb_pre.predict(X_test)
    print('Sta of Predict xgb:')
    Sta_inf(subA_xgb)

Train xgb...
MAE of val with xgb: 715.37757816
Predict xgb...
Sta of Predict xgb:
_min -165.479
_max: 90051.8
_mean 5922.9
_ptp 90217.3
_std 7361.13
_var 5.41862e+07

#4.4 Weighted fusion of the two models' results

    ## A simple weighted fusion is used here
    val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
    val_Weighted[val_Weighted<0]=10  # Some predictions are negative, but a real price cannot be negative, so apply a post-correction
    print('MAE of val with Weighted ensemble:', mean_absolute_error(y_val, val_Weighted))

MAE of val with Weighted ensemble: 687.275745703
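
The fusion weights come straight from the two validation MAEs: each model's weight is one minus its share of the total error, so the model with the lower MAE gets the larger weight and the two weights sum to 1. A small sketch using the MAE values reported above; the two prediction arrays are made up for illustration:

```python
import numpy as np

mae_lgb = 689.084070621   # validation MAE reported for lgb
mae_xgb = 715.37757816    # validation MAE reported for xgb

# Weight = 1 - (own MAE / total MAE), so lower error means higher weight.
w_lgb = 1 - mae_lgb / (mae_lgb + mae_xgb)
w_xgb = 1 - mae_xgb / (mae_lgb + mae_xgb)
print(round(w_lgb, 4), round(w_xgb, 4))

# Toy predictions from the two models (hypothetical values).
pred_lgb = np.array([1000.0, 5000.0])
pred_xgb = np.array([1200.0, 4800.0])
blend = w_lgb * pred_lgb + w_xgb * pred_xgb
print(blend)
```

Because the two MAEs are close, the weights are both near 0.5 and the blend is close to a plain average, which matches the modest improvement in the fused MAE.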

    sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb

    ## Check the distribution of the predicted values
    plt.hist(sub_Weighted)
    plt.show()
    plt.close()
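
The negative-price correction is applied to `val_Weighted` only, while both models also produced negative minimum predictions on the test set. A minimal sketch of the same correction for an arbitrary prediction array; the array is made up, and the replacement value 10 simply mirrors the one used for `val_Weighted`:

```python
import numpy as np

# Made-up predictions containing two negative prices.
preds = np.array([-165.5, 386.1, 7792.0, -3.2])

# Replace every negative prediction with a small positive floor value.
floor_value = 10
corrected = np.where(preds < 0, floor_value, preds)
print(corrected.tolist())   # [10.0, 386.1, 7792.0, 10.0]
```

`np.where` returns a new array, so the raw predictions remain available for comparison.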

#4.5 Write the results

    sub = pd.DataFrame()
    sub['SaleID'] = TestA_data.SaleID
    sub['price'] = sub_Weighted
    sub.to_csv('./sub_Weighted.csv', index=False)

    sub.head()

SaleID price
0 0 39533.727414
1 1 386.081960
2 2 7791.974571
3 3 11835.211966
4 4 585.420407
