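Used Car Transaction Price Prediction
(A note up front: I am a complete beginner, so getting started was a bit hard, and I still need to work through the code piece by piece.)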
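The data come from used-car transaction records on a trading platform. The full dataset has more than 400,000 records and 31 columns, 15 of which are anonymized variables. From it, 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B. Fields such as name, model, brand, and regionCode are desensitized.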
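Problem analysis
1. Price prediction is a classic data mining problem: we model it with data science, machine learning, and deep learning methods to produce results. Because the target (price) is continuous, it is a typical regression problem.
2. The main tools are xgb, lgb, and catboost, plus common data mining libraries and frameworks such as pandas, numpy, matplotlib, seaborn, sklearn, and keras.
3. EDA is used to uncover relationships in the data and to get familiar with it.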
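Python libraries and functions
1. XGBoost (eXtreme Gradient Boosting) is widely used in competitions, where it performs very well. It is a tool for large-scale parallel boosted trees and is regarded as one of the fastest open-source boosted-tree toolkits. Its algorithm is an improvement on GBDT (gradient boosting decision tree) and can be used for both classification and regression.
2. LightGBM is a gradient boosting framework that uses decision-tree-based learning algorithms. It is distributed and efficient, and it trains very fast compared with common machine learning algorithms.
https://www.cnblogs.com/jiangxinyang/p/9337094.html
3. CatBoost is a machine learning library open-sourced in 2017 by Yandex, the Russian search giant. Like XGBoost and LightGBM above, it belongs to the boosting family and is an improved implementation within the GBDT framework. It is built on symmetric decision trees (oblivious trees), has few parameters, supports categorical variables, and is highly accurate. The main pain point it addresses is handling categorical features efficiently and sensibly, as the name suggests: CatBoost combines "categorical" and "boosting". It also deals with gradient bias and prediction shift, which improves accuracy and generalization.
https://www.cnblogs.com/dudumiaomiao/p/9693711.html
4. pandas (Python Data Analysis Library) is a tool built on NumPy, created for data analysis tasks. It incorporates a number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets, along with many functions and methods for processing data quickly and conveniently. It is one of the main reasons Python is a powerful and efficient data analysis environment.
https://www.cnblogs.com/misswangxing/p/7903595.html
5. NumPy is a Python package; the name stands for "Numerical Python". It is a library consisting of multidimensional array objects and a collection of routines for processing those arrays.
https://blog.csdn.net/a373595475/article/details/79580734
6. Matplotlib is a plotting library for Python. Used together with NumPy, it offers an effective open-source alternative to MATLAB. It can also be used with GUI toolkits such as PyQt and wxPython.
https://www.runoob.com/numpy/numpy-matplotlib.html
7. seaborn is an enhanced layer on top of matplotlib; matplotlib must be installed before seaborn can be used.
https://blog.csdn.net/weixin_38331049/article/details/89462338
8. Sklearn (full name scikit-learn) is a machine learning toolkit for Python. It is built on NumPy and SciPy and works well alongside pandas and Matplotlib. Its API is very well designed: every object exposes a simple interface, which makes it well suited to beginners.
https://blog.csdn.net/algorithmPro/article/details/103045824
9. Keras is a deep learning framework written in pure Python on top of Theano/TensorFlow. It is a high-level neural network API that supports fast experimentation, turning ideas into results quickly.
https://www.cnblogs.com/lc1217/p/7132364.html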
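XGBoost, LightGBM, and CatBoost are all described above as refinements of GBDT. As a rough illustration of the shared idea, the sketch below fits each new tree to the residuals of the current ensemble. It uses scikit-learn's DecisionTreeRegressor on synthetic data; the function names fit_gbdt and predict_gbdt and every hyperparameter are assumptions made for illustration, not how any of the three libraries is implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbdt(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared error (illustrative only)."""
    base = float(np.mean(y))            # start from the mean of the target
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred             # negative gradient of 0.5 * squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees, learning_rate


def predict_gbdt(model, X):
    """Sum the base value and the shrunken contribution of every tree."""
    base, trees, learning_rate = model
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Synthetic one-feature regression problem
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

model = fit_gbdt(X_toy, y_toy)
mse_mean = float(np.mean((y_toy - y_toy.mean()) ** 2))
mse_boost = float(np.mean((y_toy - predict_gbdt(model, X_toy)) ** 2))
print('MSE of the mean predictor:', mse_mean)
print('MSE after boosting:', mse_boost)
```

The training error after boosting is lower than that of the constant mean predictor, because each tree removes part of what the previous trees left unexplained.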
Code implementation
#1. Import the toolbox
    ## Basic tools
    import numpy as np
    import pandas as pd
    import warnings
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.special import jn
    from IPython.display import display, clear_output
    import time
    warnings.filterwarnings('ignore')
    %matplotlib inline
    ## Models for prediction
    from sklearn import linear_model
    from sklearn import preprocessing
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
    ## Dimensionality reduction
    from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA
    import lightgbm as lgb
    import xgboost as xgb
    ## Parameter search and evaluation
    from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
    from sklearn.metrics import mean_squared_error, mean_absolute_error
#2. Read the data
    ## Read the data with pandas (a very friendly library for loading data)
    Train_data = pd.read_csv('D:/Anaconda/lib/site-packages/pandas/io/used_car_train_20200313.csv', sep=' ')
    TestA_data = pd.read_csv('D:/Anaconda/lib/site-packages/pandas/io/used_car_testA_20200313.csv', sep=' ')
    ## Print the size of each dataset
    print('Train data shape:',Train_data.shape)
    print('TestA data shape:',TestA_data.shape)
Train data shape: (150000, 31)
TestA data shape: (50000, 30)
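The sep=' ' argument tells read_csv that fields are separated by single spaces rather than commas. A small sketch on an in-memory string shows the effect; the three rows are made up and only three of the column names are used.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same space-separated layout
raw = "SaleID name price\n0 736 1850\n1 2262 3600\n2 14874 6222\n"

toy = pd.read_csv(io.StringIO(raw), sep=' ')
print(toy.shape)
print(list(toy.columns))
```

Without sep=' ', pandas would look for commas and read each row as a single column.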
#2.1 Quick look at the data
    ## Use .head() to preview the loaded data
    Train_data.head()
#2.2 Inspect the data information
    ## Use .info() to see the column names and the NaN (missing value) information
    Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 149999 non-null float64
brand 150000 non-null int64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
power 150000 non-null int64
kilometer 150000 non-null float64
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null int64
creatDate 150000 non-null int64
price 150000 non-null int64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 150000 non-null float64
v_13 150000 non-null float64
v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
    ## Column names of the training set
    Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')
    ## The same check on test set A
    TestA_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID 50000 non-null int64
name 50000 non-null int64
regDate 50000 non-null int64
model 50000 non-null float64
brand 50000 non-null int64
bodyType 48587 non-null float64
fuelType 47107 non-null float64
gearbox 48090 non-null float64
power 50000 non-null int64
kilometer 50000 non-null float64
notRepairedDamage 50000 non-null object
regionCode 50000 non-null int64
seller 50000 non-null int64
offerType 50000 non-null int64
creatDate 50000 non-null int64
v_0 50000 non-null float64
v_1 50000 non-null float64
v_2 50000 non-null float64
v_3 50000 non-null float64
v_4 50000 non-null float64
v_5 50000 non-null float64
v_6 50000 non-null float64
v_7 50000 non-null float64
v_8 50000 non-null float64
v_9 50000 non-null float64
v_10 50000 non-null float64
v_11 50000 non-null float64
v_12 50000 non-null float64
v_13 50000 non-null float64
v_14 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
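The non-null counts in the .info() output translate directly into missing-value counts. In the training set, for example, bodyType has 150000 - 145494 = 4506 missing entries. A minimal sketch on a toy frame (the values are made up) gets the same kind of count with isnull().sum():

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    'bodyType': [1.0, np.nan, 2.0, np.nan],
    'fuelType': [0.0, 1.0, np.nan, 0.0],
    'power': [60, 0, 163, 193],
})

missing = toy.isnull().sum()      # number of NaN values per column
print(missing[missing > 0])       # keep only columns that have gaps
```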
#2.3 Browse summary statistics
    TestA_data.describe()
#3. Build features and labels
#3.1 Extract the names of the numeric feature columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
       'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
       'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
      dtype='object')
Index(['notRepairedDamage'], dtype='object')
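The two Index objects above (the numeric columns, then the single object-typed column) are the kind of result that select_dtypes returns. A minimal sketch on a toy frame, assuming that was the method used; numerical_cols is the name used in 3.2, while categorical_cols and all values are hypothetical.

```python
import pandas as pd

# Hypothetical values; the text column is forced to object dtype on purpose
toy = pd.DataFrame({
    'power': [60, 0, 163],
    'kilometer': [12.5, 15.0, 12.5],
    'notRepairedDamage': pd.Series(['0.0', '-', '0.0'], dtype=object),
})

numerical_cols = toy.select_dtypes(exclude='object').columns
categorical_cols = toy.select_dtypes(include='object').columns
print(numerical_cols)
print(categorical_cols)
```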
#3.2 Build training and test samples
    ## Select the feature columns
    feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']]
    feature_cols = [col for col in feature_cols if 'Type' not in col]
    ## Extract the feature columns and the label column to build the training and test samples
    X_data = Train_data[feature_cols]
    Y_data = Train_data['price']
    X_test = TestA_data[feature_cols]
    print('X train shape:',X_data.shape)
    print('X test shape:',X_test.shape)
X train shape: (150000, 18)
X test shape: (50000, 18)
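To see why both shapes end with 18 columns, the two list comprehensions can be replayed on the 30 numeric column names listed in 3.1. The snippet below hard-codes those names instead of reading the data; the variable excluded is introduced here only for readability.

```python
# The 30 numeric column names from section 3.1
numerical_cols = ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
                  'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
                  'creatDate', 'price'] + ['v_%d' % i for i in range(15)]

excluded = ['SaleID', 'name', 'regDate', 'creatDate', 'price',
            'model', 'brand', 'regionCode', 'seller']

# Step 1: drop identifiers, dates, the label, and high-cardinality codes
feature_cols = [col for col in numerical_cols if col not in excluded]
# Step 2: drop bodyType, fuelType, and offerType
feature_cols = [col for col in feature_cols if 'Type' not in col]

print(len(feature_cols))
print(feature_cols[:3])
```

Nine columns go in the first step and three in the second, leaving gearbox, power, kilometer, and the 15 anonymous v_ features.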
#3.3 Basic distribution statistics of the label
    print('Sta of label:')
    Sta_inf(Y_data)
Sta of label:
_min 11
_max: 99999
_mean 5923.32733333
_ptp 99988
_std 7501.97346988
_var 56279605.9427
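Sta_inf is called above, but its definition does not appear in the text. Judging by the printed labels (_min, _max, _mean, _ptp, _std, _var), it is presumably a thin wrapper around the corresponding NumPy functions. The version below is a reconstruction under that assumption, run on made-up prices:

```python
import numpy as np


def Sta_inf(data):
    """Print basic distribution statistics (assumed reconstruction)."""
    print('_min', np.min(data))
    print('_max:', np.max(data))
    print('_mean', np.mean(data))
    print('_ptp', np.ptp(data))    # peak to peak: max - min
    print('_std', np.std(data))
    print('_var', np.var(data))


# Hypothetical prices spanning the same min and max as the label
toy_prices = np.array([11, 1850, 3600, 6222, 99999])
Sta_inf(toy_prices)
```

The _ptp value is simply the maximum minus the minimum, which matches the label output above: 99999 - 11 = 99988.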
#3.4 Fill missing values with -1
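A minimal sketch of this step, assuming it is done with pandas fillna on the feature frames (X_data and X_test in the source; a toy frame with made-up values here):

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with one missing gearbox value
X_toy = pd.DataFrame({
    'gearbox': [0.0, np.nan, 1.0],
    'power': [60, 0, 163],
})

X_filled = X_toy.fillna(-1)   # returns a new frame; X_toy is left unchanged
print(X_filled)
```

Tree-based models such as xgb and lgb can treat -1 as its own branch, provided -1 is not a legitimate value of the column.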
#4 Model training and prediction
#4.1 Use xgb with five-fold cross-validation to check how the parameters perform
Train mae: 628.086664863
Val mae 715.990013454
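The Train mae and Val mae figures above are averages over five folds. The loop below sketches how such numbers can be produced; it substitutes scikit-learn's GradientBoostingRegressor and synthetic data for xgb.XGBRegressor and the real features, and it uses KFold for the split, so the model, the data, and the splitter are all stand-ins rather than the source's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)

scores_train, scores_val = [], []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_toy):
    # Fit on four folds, then score on those folds and on the held-out fold
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    scores_train.append(mean_absolute_error(y_toy[train_idx], model.predict(X_toy[train_idx])))
    scores_val.append(mean_absolute_error(y_toy[val_idx], model.predict(X_toy[val_idx])))

print('Train mae:', np.mean(scores_train))
print('Val mae', np.mean(scores_val))
```

A gap between the two averages, as in 628 versus 716 above, is the usual sign that the model fits the training folds more closely than unseen data.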
#4.2 Define the xgb and lgb model functions
    def build_model_xgb(x_train,y_train):
        model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,
                                 colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
        model.fit(x_train, y_train)
        return model

    def build_model_lgb(x_train,y_train):
        estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
        param_grid = {
            'learning_rate': [0.01, 0.05, 0.1, 0.2],
        }
        # Grid-search the learning rate; the returned object predicts with the best estimator
        gbm = GridSearchCV(estimator, param_grid)
        gbm.fit(x_train, y_train)
        return gbm
#4.3 Split the dataset (Train, Val) for model training, evaluation, and prediction
    ## Split data with val
    x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
    print('Train lgb...')
    model_lgb = build_model_lgb(x_train,y_train)
    val_lgb = model_lgb.predict(x_val)
    MAE_lgb = mean_absolute_error(y_val,val_lgb)
    print('MAE of val with lgb:',MAE_lgb)
    print('Predict lgb...')
    model_lgb_pre = build_model_lgb(X_data,Y_data)
    subA_lgb = model_lgb_pre.predict(X_test)
    print('Sta of Predict lgb:')
    Sta_inf(subA_lgb)
Train lgb...
MAE of val with lgb: 689.084070621
Predict lgb...
Sta of Predict lgb:
_min -519.150259864
_max: 88575.1087721
_mean 5922.98242599
_ptp 89094.259032
_std 7377.29714126
_var 54424513.1104
Train xgb...
MAE of val with xgb: 715.37757816
Predict xgb...
Sta of Predict xgb:
_min -165.479
_max: 90051.8
_mean 5922.9
_ptp 90217.3
_std 7361.13
_var 5.41862e+07
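build_model_lgb in 4.2 leaves the choice of learning_rate to GridSearchCV. With no scoring argument, GridSearchCV ranks candidates by the estimator's own score method, which for regressors is R^2 rather than MAE. The sketch below shows how the search could be pointed at MAE instead; it uses scikit-learn's GradientBoostingRegressor on synthetic data as a stand-in for lgb.LGBMRegressor, so the model, the data, and cv=3 are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.3, size=200)

param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring='neg_mean_absolute_error',  # higher is better, so MAE is negated
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)   # best cross-validated MAE
```

Selecting on the same metric that is reported afterwards keeps the search and the evaluation consistent.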
#4.4 Weighted fusion of the two models' results
    ## Here we use a simple weighted fusion
    val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
    val_Weighted[val_Weighted<0]=10 # Some predictions are negative, but a real price can never be negative, so we apply this post-correction
    print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
MAE of val with Weighted ensemble: 687.275745703
    sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb
    ## Histogram of the training labels (Y_data), as a reference for the distribution of the predictions
    plt.hist(Y_data)
    plt.show()
    plt.close()
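The weights in this formula always sum to 1, and the model with the lower validation MAE gets the larger one. Plugging in the two MAE values reported in 4.3 gives roughly 0.509 for lgb and 0.491 for xgb. The sketch below checks that arithmetic and applies the same clipping to a few made-up predictions (the two prediction arrays are hypothetical):

```python
import numpy as np

MAE_lgb = 689.084070621   # validation MAE reported in 4.3 for lgb
MAE_xgb = 715.37757816    # validation MAE reported in 4.3 for xgb

w_lgb = 1 - MAE_lgb / (MAE_xgb + MAE_lgb)
w_xgb = 1 - MAE_xgb / (MAE_xgb + MAE_lgb)
print(round(w_lgb, 3), round(w_xgb, 3))

# Hypothetical predictions from the two models
pred_lgb = np.array([-519.0, 6000.0, 12000.0])
pred_xgb = np.array([-165.0, 5800.0, 12500.0])

blend = w_lgb * pred_lgb + w_xgb * pred_xgb
blend[blend < 0] = 10     # same post-correction as in the source
print(blend)
```

Because the two MAE values are close, the blend is nearly a plain average, which is consistent with the small gain over lgb alone (687.28 versus 689.08).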
#4.5 Output the results
   SaleID         price
0       0  39533.727414
1       1    386.081960
2       2   7791.974571
3       3  11835.211966
4       4    585.420407
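A table like the one above is typically produced by putting the SaleID values and the fused predictions into a DataFrame and writing it out as CSV. A minimal sketch, assuming that approach: the prices are the five rows shown above rounded to two decimal places, the variable name sub is hypothetical, and the CSV goes to an in-memory buffer rather than a file.

```python
import io
import pandas as pd

# First five rows of the result table, rounded
sub = pd.DataFrame({
    'SaleID': [0, 1, 2, 3, 4],
    'price': [39533.73, 386.08, 7791.97, 11835.21, 585.42],
})

buffer = io.StringIO()
sub.to_csv(buffer, index=False)   # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
```

Passing a file path instead of the buffer would write the same content to disk.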
Summary