sklearn Machine Learning Pipeline Template
Contents
- 1. Import packages
- 2. Read the data
- 3. Separate numeric and text features
- 4. Data processing Pipeline
- 5. Try different models
- 6. Parameter search
- 7. Feature importance selection
- 8. Final complete Pipeline
This post builds a machine learning workflow with sklearn's Pipeline.
The example is the [Kesci] newcomer competition: employee satisfaction prediction.
Reference: [Hands On ML] 2. A complete machine learning project (California housing price prediction)
1. Import packages

    import numpy as np
    import pandas as pd
    # Jupyter/IPython magic; remove this line when running as a plain script
    %matplotlib inline
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import FeatureUnion
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import cross_val_score

2. Read the data
    data = pd.read_csv("../competition/Employee_Satisfaction/train.csv")
    test = pd.read_csv("../competition/Employee_Satisfaction/test.csv")
    data.columns

Output:

    Index(['id', 'last_evaluation', 'number_project', 'average_monthly_hours',
           'time_spend_company', 'Work_accident', 'package',
           'promotion_last_5years', 'division', 'salary', 'satisfaction_level'],
          dtype='object')

- Separate the training data from the labels
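The code for this step is missing from this copy of the post. A minimal sketch of what it might look like, assuming the label is the `satisfaction_level` column and everything else (including `id`) stays in the feature table. The small DataFrame here is a made-up stand-in for `train.csv`, not real competition data.

```python
import pandas as pd

# Hypothetical stand-in for train.csv, using a few of the column names listed above
data = pd.DataFrame({
    "id": [1, 2, 3],
    "last_evaluation": [0.5, 0.7, 0.9],
    "number_project": [2, 4, 5],
    "salary": ["low", "medium", "high"],
    "satisfaction_level": [0.38, 0.80, 0.11],
})

# Assumption: satisfaction_level is the prediction target
y = data["satisfaction_level"]
X = data.drop(columns=["satisfaction_level"])

print(X.columns.tolist())
```

After this step `X` holds only features and `y` holds only the target, which is the shape the later steps expect.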
3. Separate numeric and text features

    def num_cat_splitor(X):
        # Columns with dtype object are treated as text (categorical) features
        s = (X.dtypes == 'object')
        object_cols = list(s[s].index)
        # object_cols
        # ['package', 'division', 'salary']
        # Note: the set difference does not preserve the original column order
        num_cols = list(set(X.columns) - set(object_cols))
        # num_cols
        # ['Work_accident', 'time_spend_company', 'promotion_last_5years', 'id',
        #  'average_monthly_hours', 'last_evaluation', 'number_project']
        return num_cols, object_cols

    num_cols, object_cols = num_cat_splitor(X)
    # print(num_cols)
    # print(object_cols)
    # X[object_cols].values

- Feature value selector
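The selector code is also missing here. A sketch in the style of the Hands On ML reference: a small transformer that takes a list of column names and returns those columns as a NumPy array, so that it can be the first step of a Pipeline. The class name `DataFrameSelector` is an assumption borrowed from that book, not confirmed by this post.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Pick the named DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        # Nothing to learn; the column names are given up front
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Tiny made-up frame to show the behaviour
X = pd.DataFrame({"number_project": [2, 4], "salary": ["low", "high"]})
selected = DataFrameSelector(["number_project"]).fit_transform(X)
print(selected.shape)  # (2, 1)
```

Inheriting from `TransformerMixin` provides `fit_transform`, and `BaseEstimator` provides `get_params`/`set_params`, which the parameter search later relies on.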
4. Data processing Pipeline
- Numeric features
- Text features
- Combine numeric and text features
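The three pieces above can be sketched roughly as follows. This is a guess at the structure based on the imports in section 1 (`SimpleImputer`, `StandardScaler`, `OneHotEncoder`, `FeatureUnion`), not the author's exact code: the median imputation strategy, the step names, the `DataFrameSelector` helper, and the toy data are all assumptions. The name `full_pipeline` is the one used later in section 8.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: select DataFrame columns by name."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Made-up data with one missing numeric value
X = pd.DataFrame({
    "last_evaluation": [0.5, np.nan, 0.9, 0.7],
    "number_project": [2, 4, 5, 3],
    "salary": ["low", "medium", "high", "low"],
})
num_cols = ["last_evaluation", "number_project"]
object_cols = ["salary"]

# Numeric features: select, fill missing values, standardize
num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_cols)),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Text features: select, then one-hot encode
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(object_cols)),
    ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Combine both branches side by side
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

X_prepared = full_pipeline.fit_transform(X)
# OneHotEncoder returns a sparse matrix by default, so densify if needed
if sparse.issparse(X_prepared):
    X_prepared = X_prepared.toarray()
print(X_prepared.shape)  # (4, 5): 2 numeric columns + 3 one-hot columns
```

On the real data, `num_cols` and `object_cols` would come from `num_cat_splitor(X)` in section 3.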
5. Try different models

    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor()
    # 3-fold cross-validation; sklearn reports negated MSE, so flip the sign
    forest_scores = cross_val_score(forest_reg, X_prepared, y,
                                    scoring='neg_mean_squared_error', cv=3)
    forest_rmse_scores = np.sqrt(-forest_scores)
    print(forest_rmse_scores)
    print(forest_rmse_scores.mean())
    print(forest_rmse_scores.std())

Other models can be tried in the same way.
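As an illustration of trying other models, here is a sketch that runs the same cross-validation over a linear model and a decision tree. The two models are arbitrary picks, and the random `X_prepared` and `y` are synthetic stand-ins so the block runs on its own; the helper `rmse_scores` is a made-up name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and labels
rng = np.random.RandomState(42)
X_prepared = rng.rand(90, 5)
y = X_prepared @ np.array([0.3, -0.2, 0.1, 0.0, 0.4]) + 0.05 * rng.randn(90)


def rmse_scores(model, X, y, cv=3):
    """Cross-validated RMSE per fold (hypothetical helper)."""
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-scores)


candidates = [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=42)),
]

results = {}
for name, model in candidates:
    results[name] = rmse_scores(model, X_prepared, y)
    print(name, results[name].mean(), results[name].std())
```

Comparing the mean and standard deviation of the per-fold RMSE gives a quick first ranking before spending time on parameter search.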
6. Parameter search

    param_grid = [
        {'n_estimators': [3, 10, 30, 50, 80], 'max_features': [2, 4, 6, 8]},
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(X_prepared, y)

- Best parameters
- Best model
- Search results
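The three items above correspond to the `best_params_`, `best_estimator_`, and `cv_results_` attributes of a fitted `GridSearchCV`. A small sketch of reading them, with a reduced grid and synthetic data (both assumptions, chosen only to keep the example fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared features and labels
rng = np.random.RandomState(0)
X_prepared = rng.rand(60, 4)
y = X_prepared[:, 0] + 0.1 * rng.randn(60)

# Reduced grid so the example runs quickly
param_grid = [{"n_estimators": [3, 10], "max_features": [2, 4]}]
grid_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                           cv=3, scoring="neg_mean_squared_error")
grid_search.fit(X_prepared, y)

# Best parameters
print(grid_search.best_params_)

# Best model (refit on the full training data by default)
print(grid_search.best_estimator_)

# Search results: RMSE for every parameter combination
cv_results = grid_search.cv_results_
for mean_score, params in zip(cv_results["mean_test_score"],
                              cv_results["params"]):
    print(np.sqrt(-mean_score), params)
```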
7. Feature importance selection

    feature_importances = grid_search.best_estimator_.feature_importances_

- Select the top k most important features
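The selector itself is not shown in this copy. A sketch following the Hands On ML reference: `TopFeatureSelector` (the name used in section 8) keeps the columns whose importance ranks in the top `k`. The helper `indices_of_top_k` and the sample importances are assumptions for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


def indices_of_top_k(arr, k):
    """Return the sorted indices of the k largest values in arr."""
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])


class TopFeatureSelector(BaseEstimator, TransformerMixin):
    """Keep only the k columns with the highest importance scores."""

    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances,
                                                 self.k)
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]


# Made-up importances for five prepared features
feature_importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
k = 2
X_prepared = np.arange(15).reshape(3, 5)

selector = TopFeatureSelector(feature_importances, k)
X_top = selector.fit_transform(X_prepared)
print(selector.feature_indices_)  # [1 3]
```

Computing the indices in `fit` rather than `__init__` means that changing `k` through `set_params` (as a grid search does) takes effect on the next fit.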
8. Final complete Pipeline

    prepare_select_and_predict_pipeline = Pipeline([
        ('preparation', full_pipeline),
        ('feature_selection', TopFeatureSelector(feature_importances, k)),
        ('forst_reg', RandomForestRegressor())
    ])

- Parameter search
- Training
- Prediction
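The last three steps can be sketched as one grid search over the whole pipeline. Step parameters are addressed as `<step name>__<parameter>`, so the `forst_reg` step name from the pipeline above is reused here. To keep the block self-contained, a `StandardScaler` stands in for `full_pipeline`, the data is synthetic, and the grid values are arbitrary; none of these are the author's actual settings.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class TopFeatureSelector(BaseEstimator, TransformerMixin):
    """Keep only the k columns with the highest importance scores."""

    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        top = np.argpartition(np.array(self.feature_importances), -self.k)
        self.feature_indices_ = np.sort(top[-self.k:])
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]


# Synthetic stand-ins for the training and test sets
rng = np.random.RandomState(0)
X = rng.rand(80, 5)
y = 0.6 * X[:, 1] + 0.3 * X[:, 3] + 0.05 * rng.randn(80)
X_test = rng.rand(10, 5)

# Importances from a preliminary forest (section 7 takes them from grid_search)
prelim = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
feature_importances = prelim.feature_importances_

pipeline = Pipeline([
    ("preparation", StandardScaler()),  # stand-in for full_pipeline
    ("feature_selection", TopFeatureSelector(feature_importances, k=3)),
    ("forst_reg", RandomForestRegressor(random_state=0)),
])

# Parameter search: tune the number of kept features and the forest size
param_grid = [{
    "feature_selection__k": [2, 3, 5],
    "forst_reg__n_estimators": [10, 30],
}]
grid_search = GridSearchCV(pipeline, param_grid, cv=3,
                           scoring="neg_mean_squared_error")

# Training: fit runs the search, then refits the best pipeline on all data
grid_search.fit(X, y)
print(grid_search.best_params_)

# Prediction: the raw test features go through the same preparation steps
y_pred = grid_search.predict(X_test)
print(y_pred[:3])
```

One thing to watch on the real data: the length of `feature_importances` has to match the number of columns produced by the preparation step, otherwise the selected indices point at the wrong features.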
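The above is only a rough overall framework, and many details are left out. Comments and corrections are welcome!
My CSDN blog: https://michael.blog.csdn.net/
You can also follow my official account (Michael阿明). Let's keep learning and improving together!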