Kaggle: House Prices: Advanced Regression Techniques
Notebook from https://www.kaggle.com/neviadomski/how-to-get-to-top-25-with-simple-model-sklearn
Workflow:
1. Import the data and inspect its structure and missing values.
The key point is how the missing-value check is written:
NAs = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['train', 'test'])
NAs[NAs.sum(axis=1) > 0]
2. Preprocess the data (drop useless features, transform features, fill missing values, engineer new features, standardize feature values, convert to dummies).
Q: Which features need transforming?
A: Integer-coded features whose values only label categories, with no numeric meaning of their own, should be converted to dummies — MSSubClass, for example (a minimal sketch follows this list).
Focus on the manual dummy-conversion technique (the situation here is slightly trickier, because one feature can span two columns, e.g. Condition1 and Condition2 — In[19] below handles the same pattern for Exterior1st/Exterior2nd).
3. Shuffle the data and split it into training and test sets.
4. Build several individual models.
5. Blend the models.
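As a minimal sketch of the simple single-column case (MSSubClass really is integer-coded in this dataset, but the snippet itself is illustrative, not taken from the notebook):

import pandas as pd

df = pd.DataFrame({'MSSubClass': [60, 20, 60, 70]})
# the codes are categories, so cast to string first; get_dummies then
# produces one 0/1 column per category
dummies = pd.get_dummies(df['MSSubClass'].astype(str), prefix='MSSubClass')
df = pd.concat([df.drop('MSSubClass', axis=1), dummies], axis=1)
print(df.columns.tolist())  # ['MSSubClass_20', 'MSSubClass_60', 'MSSubClass_70']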
Questions:
1. How do you decide that a feature is useless?
2. How does the blending work? Why is the final prediction (np.exp(GB_model.predict(test_features)) + np.exp(ENS_model.predict(test_features_std))) / 2?
3. Why does a skewed label distribution need a transform? (A short note on 2 and 3 follows below.)
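A note on questions 2 and 3 (my own reading, not spelled out in the notebook): the competition scores RMSE on log(SalePrice), and SalePrice is strongly right-skewed, so the label is trained as its log — the log pulls in the long right tail, making the target roughly normal and stopping a handful of expensive houses from dominating the squared error. The np.exp in the blending formula then simply inverts that transform back to price space before submission. A tiny sketch of the round trip:

import numpy as np
import pandas as pd

prices = pd.Series([120000, 150000, 180000, 755000])  # toy, right-skewed prices
train_label = np.log(prices)        # what the models are fit against
preds = train_label.values          # stand-in for model predictions
submission = np.exp(preds)          # invert the log: back to dollars
assert np.allclose(submission, prices)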
In [33]:
#Kaggle: House Prices: Advanced Regression Techniques
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble, linear_model, tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.utils import shuffle

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

train = pd.read_csv('downloads/train.csv')
test = pd.read_csv('downloads/test.csv')

In [8]:
train.head()

Out[8] (column headers restored from the dataset's documented column order):
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
In [9]:
#check missing values
NAs = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['train', 'test'])  # sum() defaults to axis=0, i.e. down the rows
NAs[NAs.sum(axis=1) > 0]  # show only the features that have missing values

Out[9] (the feature-name index was lost in extraction and is restored here from the dataset's known NA counts):
| Feature | train | test |
| Alley | 1369 | 1352.0 |
| BsmtCond | 37 | 45.0 |
| BsmtExposure | 38 | 44.0 |
| BsmtFinSF1 | 0 | 1.0 |
| BsmtFinSF2 | 0 | 1.0 |
| BsmtFinType1 | 37 | 42.0 |
| BsmtFinType2 | 38 | 42.0 |
| BsmtFullBath | 0 | 2.0 |
| BsmtHalfBath | 0 | 2.0 |
| BsmtQual | 37 | 44.0 |
| BsmtUnfSF | 0 | 1.0 |
| Electrical | 1 | 0.0 |
| Exterior1st | 0 | 1.0 |
| Exterior2nd | 0 | 1.0 |
| Fence | 1179 | 1169.0 |
| FireplaceQu | 690 | 730.0 |
| Functional | 0 | 2.0 |
| GarageArea | 0 | 1.0 |
| GarageCars | 0 | 1.0 |
| GarageCond | 81 | 78.0 |
| GarageFinish | 81 | 78.0 |
| GarageQual | 81 | 78.0 |
| GarageType | 81 | 76.0 |
| GarageYrBlt | 81 | 78.0 |
| KitchenQual | 0 | 1.0 |
| LotFrontage | 259 | 227.0 |
| MSZoning | 0 | 4.0 |
| MasVnrArea | 8 | 15.0 |
| MasVnrType | 8 | 16.0 |
| MiscFeature | 1406 | 1408.0 |
| PoolQC | 1453 | 1456.0 |
| SaleType | 0 | 1.0 |
| TotalBsmtSF | 0 | 1.0 |
| Utilities | 0 | 2.0 |
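The preprocessing cells themselves (roughly the notebook's In[10]–In[18]) did not survive this copy; only their outputs below did. From those outputs and from how later cells use the results — features.loc['train'] implies a concat with keys=['train', 'test'], the preview shows Alley filled with 'NOACCESS' plus an engineered total-area column, and In[22] expects a frame named num_features_standarized — the lost steps plausibly looked like this sketch (an assumption, not the original code):

# assumed reconstruction of the lost preprocessing cells
train_label = np.log(train.pop('SalePrice'))  # log-label, see question 3 above

# stack train and test so every transform hits both at once;
# the keys make features.loc['train'] / features.loc['test'] work later
features = pd.concat([train, test], keys=['train', 'test'])

features['Alley'].fillna('NOACCESS', inplace=True)  # NaN here means "no alley access"

# engineered feature visible in the preview below (856 + 856 + 854 = 2566 for house 1);
# the name TotalSF is a guess
features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']

# z-score a few numeric columns into the separate frame used by In[22];
# the exact column list here is also a guess
num_cols = ['LotArea', 'GrLivArea', 'TotalBsmtSF', 'TotalSF']
num_features_standarized = (features[num_cols] - features[num_cols].mean()) / features[num_cols].std()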
[preview of the standardized numeric features (z-scores), 5 rows × 4 columns; column headers lost in extraction]
[features.head() preview after preprocessing: the leading columns are Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley (NaN now filled as NOACCESS), LotShape, LandContour, LotConfig; the trailing columns — an engineered total-area value (2566.0, 2524.0, 2706.0, 2473.0, 3343.0) followed by 0/1 dummy columns — lost their headers in extraction]
5 rows × 61 columns
In [19]:
#convert Exterior to dummies
#Exterior1st and Exterior2nd share one category set, so the dummy columns
#are built manually over the union of both
Exterior = set([x for x in features['Exterior1st']] + [x for x in features['Exterior2nd']])
dummies = pd.DataFrame(data=np.zeros([len(features.index), len(Exterior)]),
                       index=features.index, columns=Exterior)
for i, ext in enumerate(zip(features['Exterior1st'], features['Exterior2nd'])):
    dummies.ix[i, ext] = 1  # .ix is removed in modern pandas; dummies.loc[dummies.index[i], list(ext)] = 1 is the modern equivalent
features = pd.concat([features, dummies.add_prefix('Ext_')], axis=1)
features.drop(['Exterior1st', 'Exterior2nd', 'Ext_nan'], axis=1, inplace=True)
print(features.shape)

(2919, 78)

In [20]:
features.dtypes[features.dtypes == 'object'].index

Out[20]:
Index(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
       'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 'HouseStyle',
       'OverallCond', 'RoofStyle', 'MasVnrType', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenAbvGr',
       'KitchenQual', 'FireplaceQu', 'GarageType', 'GarageFinish',
       'GarageQual', 'PavedDrive', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition'],
      dtype='object')

In [21]:
#pattern for iterating over the columns of one dtype: for col in features.dtypes[features.dtypes == 'object'].index
#convert all other categorical vars to dummies
for col in features.dtypes[features.dtypes == 'object'].index:
    for_dummy = features.pop(col)
    features = pd.concat([features, pd.get_dummies(for_dummy, prefix=col)], axis=1)
print(features.shape)

(2919, 263)

In [22]:
#update features with the standardized columns from earlier
features_standardized = features.copy()
features_standardized.update(num_features_standarized)
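The train_test_score helper called by the model cells below is defined in another cell that didn't survive this copy. A minimal sketch consistent with the printed R2/RMSE blocks (an assumption, not the notebook's exact code; r2_score and mean_squared_error are already imported in In[33]):

# assumed reconstruction of the missing train_test_score helper
def train_test_score(model, x_train, x_test, y_train, y_test):
    print('------------train-----------')
    print('R2:', r2_score(y_train, model.predict(x_train)))
    print('RMSE:', np.sqrt(mean_squared_error(y_train, model.predict(x_train))))
    print('------------test------------')
    print('R2:', r2_score(y_test, model.predict(x_test)))
    print('RMSE:', np.sqrt(mean_squared_error(y_test, model.predict(x_test))))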
In [23]:
#re-split the training and test sets

#first the non-standardized features
train_features = features.loc['train'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
test_features = features.loc['test'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values

#then the standardized ones
train_features_std = features_standardized.loc['train'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
test_features_std = features_standardized.loc['test'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
print(train_features.shape)
print(train_features_std.shape)

(1460, 262)
(1460, 262)

In [24]:
#shuffle train dataset
train_features_std, train_features, train_label = shuffle(train_features_std, train_features, train_label, random_state=5)

In [25]:
#split train and test data
x_train, x_test, y_train, y_test = train_test_split(train_features, train_label, test_size=0.1, random_state=200)
x_train_std, x_test_std, y_train_std, y_test_std = train_test_split(train_features_std, train_label, test_size=0.1, random_state=200)

In [26]:
#first model: ElasticNet
ENSTest = linear_model.ElasticNetCV(alphas=[0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10],
                                    l1_ratio=[.01, .1, .5, .9, .99], max_iter=5000).fit(x_train_std, y_train_std)
train_test_score(ENSTest, x_train_std, x_test_std, y_train_std, y_test_std)

------------train-----------
R2: 0.9009283127352861
RMSE: 0.11921419084690392
------------test------------
R2: 0.8967299522701895
RMSE: 0.11097042840114624

In [27]:
#cross-validation score of the ElasticNet model
score = cross_val_score(ENSTest, train_features_std, train_label, cv=5)
print('Accuracy: %0.2f +/- %0.2f' % (score.mean(), score.std()*2))

Accuracy: 0.88 +/- 0.10

In [28]:
#second model: GradientBoosting
GB = ensemble.GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3,
                                        max_features='sqrt', min_samples_leaf=15,
                                        min_samples_split=10, loss='huber').fit(x_train_std, y_train_std)
train_test_score(GB, x_train_std, x_test_std, y_train_std, y_test_std)

------------train-----------
R2: 0.9607778449577035
RMSE: 0.07698826081848897
------------test------------
R2: 0.9002871760789876
RMSE: 0.10793269100940146

In [29]:
#cross-validation score for GradientBoosting (the source pasted In[28]'s code here by mistake; this cell is reconstructed from its printed output)
score = cross_val_score(GB, train_features_std, train_label, cv=5)
print('Accuracy: %0.2f +/- %0.2f' % (score.mean(), score.std()*2))

Accuracy: 0.90 +/- 0.04

In [30]:
#blend the models: refit each one on the full training set
GB_model = GB.fit(train_features, train_label)
ENS_model = ENSTest.fit(train_features_std, train_label)

In [31]:
#why is the blending formula written this way? (question 2 above)
Final_score = (np.exp(GB_model.predict(test_features)) + np.exp(ENS_model.predict(test_features_std))) / 2

In [32]:
#write the submission csv
pd.DataFrame({'Id': test.Id, 'SalePrice': Final_score}).to_csv('submit.csv', index=False)

Reposted from: https://www.cnblogs.com/RB26DETT/p/11566650.html
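Back to question 2: both models predict log-prices, so each prediction is mapped through np.exp back to dollars and the two price estimates get equal weight — an arithmetic mean in price space. Averaging before exponentiating instead gives a geometric mean of the prices, a common alternative in blending; a toy comparison with made-up log-price predictions:

import numpy as np

p1 = np.array([11.9, 12.1])  # hypothetical log-price predictions, model 1
p2 = np.array([12.0, 12.3])  # hypothetical log-price predictions, model 2

arith = (np.exp(p1) + np.exp(p2)) / 2  # the notebook's blend: mean of prices
geom = np.exp((p1 + p2) / 2)           # alternative: mean of logs = geometric mean
print(arith, geom)  # geom <= arith elementwise (AM-GM inequality)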