
sklearn Machine Learning Pipeline Template

Published: 2024/7/5


Contents

- 1. Import packages
- 2. Read the data
- 3. Separate numeric and text features
- 4. Data processing Pipeline
- 5. Try different models
- 6. Parameter search
- 7. Feature importance selection
- 8. Final complete Pipeline

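This article uses sklearn's pipeline to build a machine learning workflow.
The example is the [Kesci] newcomer competition: Employee Satisfaction Prediction.
Reference: [Hands On ML] 2. An End-to-End Machine Learning Project (California housing price prediction)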

1. Import packages

    import numpy as np
    import pandas as pd
    %matplotlib inline  # Jupyter/IPython magic; remove when running as a plain script
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import FeatureUnion
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import cross_val_score

2. Read the data

    data = pd.read_csv("../competition/Employee_Satisfaction/train.csv")
    test = pd.read_csv("../competition/Employee_Satisfaction/test.csv")
    data.columns

    Index(['id', 'last_evaluation', 'number_project', 'average_monthly_hours',
           'time_spend_company', 'Work_accident', 'package',
           'promotion_last_5years', 'division', 'salary', 'satisfaction_level'],
          dtype='object')

Separate the label from the training data

    y = data['satisfaction_level']
    X = data.drop(['satisfaction_level'], axis=1)

3. Separate numeric and text features

    def num_cat_splitor(X):
        # Columns with dtype 'object' are treated as text features
        s = (X.dtypes == 'object')
        object_cols = list(s[s].index)
        # object_cols
        # ['package', 'division', 'salary']
        num_cols = list(set(X.columns) - set(object_cols))
        # num_cols
        # ['Work_accident', 'time_spend_company', 'promotion_last_5years', 'id',
        #  'average_monthly_hours', 'last_evaluation', 'number_project']
        return num_cols, object_cols

    num_cols, object_cols = num_cat_splitor(X)
    # print(num_cols)
    # print(object_cols)
    # X[object_cols].values
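One side effect of building `num_cols` from a `set` is that the column order is not guaranteed to match the DataFrame. The sketch below shows an order-preserving variant based on `DataFrame.select_dtypes`. The toy frame and the helper name `split_columns` are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the competition data (values are made up).
toy = pd.DataFrame({
    "last_evaluation": [0.5, 0.7, 0.9],
    "number_project": [3, 5, 4],
    "salary": ["low", "high", "medium"],
    "division": ["sales", "IT", "sales"],
})

def split_columns(df):
    """Return (numeric columns, remaining columns), both in DataFrame order."""
    num_cols = df.select_dtypes(include="number").columns.tolist()
    other_cols = [c for c in df.columns if c not in num_cols]
    return num_cols, other_cols

toy_num_cols, toy_cat_cols = split_columns(toy)
print(toy_num_cols)  # ['last_evaluation', 'number_project']
print(toy_cat_cols)  # ['salary', 'division']
```

Keeping a stable column order makes it easier to map feature importances back to column names later on.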

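Selector that extracts feature values by column name

    class DataFrameSelector(BaseEstimator, TransformerMixin):
        def __init__(self, attribute_names):
            self.attribute_names = attribute_names

        def fit(self, X, y=None):
            # Nothing to learn; the selector is stateless
            return self

        def transform(self, X):
            # Return the selected columns as a NumPy array
            return X[self.attribute_names].values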

4. Data processing Pipeline

Numeric features

    num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_cols)),
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

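Text features

    cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(object_cols)),
        # scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False
        ('cat_encoder', OneHotEncoder(sparse_output=False)),
    ])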

Combine the numeric and text features

    full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
    X_prepared = full_pipeline.fit_transform(X)
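As a possible alternative to pairing a custom selector with `FeatureUnion`, scikit-learn also provides `sklearn.compose.ColumnTransformer`, which routes columns to transformers by name. The following is a minimal sketch on a made-up toy frame. The column lists and variable names are assumptions for illustration, and this is not the approach the article itself uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing numeric value (illustrative only).
toy = pd.DataFrame({
    "last_evaluation": [0.5, np.nan, 0.9, 0.7],
    "number_project": [3, 5, 4, 2],
    "salary": ["low", "high", "medium", "low"],
})
toy_num_cols = ["last_evaluation", "number_project"]
toy_cat_cols = ["salary"]

# Same numeric steps as num_pipeline, minus the selector step.
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns it applies to).
column_prep = ColumnTransformer([
    ("num", numeric_steps, toy_num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), toy_cat_cols),
])

toy_prepared = column_prep.fit_transform(toy)
print(toy_prepared.shape)  # (4, 5): 2 scaled numeric columns + 3 one-hot columns
```

With this layout no `DataFrameSelector` class is needed, at the cost of nested parameter names that start with the `ColumnTransformer` entry names (`num__imputer__strategy` here).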

5. Try different models

    from sklearn.ensemble import RandomForestRegressor

    forest_reg = RandomForestRegressor()
    forest_scores = cross_val_score(forest_reg, X_prepared, y,
                                    scoring='neg_mean_squared_error', cv=3)
    # Scores are negative MSE, so negate before taking the square root
    forest_rmse_scores = np.sqrt(-forest_scores)
    print(forest_rmse_scores)
    print(forest_rmse_scores.mean())
    print(forest_rmse_scores.std())

Other models can be tried in the same way.
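To compare several candidate models, the same cross-validation call can be wrapped in a loop. The sketch below uses synthetic data from `make_regression` in place of `X_prepared` and `y`. The choice of candidate models and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (X_prepared, y).
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=0.1, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

rmse_by_model = {}
for name, model in candidates.items():
    neg_mse = cross_val_score(model, X_demo, y_demo,
                              scoring="neg_mean_squared_error", cv=3)
    rmse = np.sqrt(-neg_mse)  # per-fold RMSE
    rmse_by_model[name] = rmse.mean()
    print(f"{name}: mean RMSE = {rmse.mean():.3f}, std = {rmse.std():.3f}")
```

On the real data the ranking may differ; the synthetic target here is linear by construction, which favours `LinearRegression`.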

6. Parameter search

    param_grid = [
        {'n_estimators': [3, 10, 30, 50, 80], 'max_features': [2, 4, 6, 8]},
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(X_prepared, y)

Best parameters

    grid_search.best_params_

Best model

    grid_search.best_estimator_

Search results

    cv_result = grid_search.cv_results_
    for mean_score, params in zip(cv_result['mean_test_score'], cv_result['params']):
        print(np.sqrt(-mean_score), params)

    0.2129252723367584 {'max_features': 2, 'n_estimators': 3}
    0.19276874697889504 {'max_features': 2, 'n_estimators': 10}
    0.1865548358477794 {'max_features': 2, 'n_estimators': 30}
    .......
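The search results can also be loaded into a DataFrame and sorted by rank, which is easier to scan than printed pairs. This is a minimal sketch on synthetic data with a deliberately small grid. `demo_search` and the other names are illustrative assumptions. Note that the `rmse` column is the square root of the mean MSE across folds, matching the loop above, and not the mean of per-fold RMSE values.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (X_prepared, y).
X_demo, y_demo = make_regression(n_samples=120, n_features=6,
                                 noise=0.1, random_state=0)

demo_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame.
results = pd.DataFrame(demo_search.cv_results_)
results["rmse"] = np.sqrt(-results["mean_test_score"])
summary = results[["params", "rmse", "rank_test_score"]].sort_values("rank_test_score")
print(summary.to_string(index=False))
```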

7. Feature importance selection

    feature_importances = grid_search.best_estimator_.feature_importances_

Select the k most important features

    k = 3

    def indices_of_top_k(arr, k):
        # Indices of the k largest values, returned in ascending index order
        return np.sort(np.argpartition(np.array(arr), -k)[-k:])

    class TopFeatureSelector(BaseEstimator, TransformerMixin):
        def __init__(self, feature_importances, k):
            self.feature_importances = feature_importances
            self.k = k

        def fit(self, X, y=None):
            self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
            return self

        def transform(self, X):
            # Keep only the columns of the top-k features
            return X[:, self.feature_indices_]
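A small worked example may help show what `indices_of_top_k` returns. `np.argpartition(arr, -k)` places the k largest values in the last k positions (in no particular order), and `np.sort` then orders the resulting indices. The importance values below are made up for illustration.

```python
import numpy as np

def indices_of_top_k(arr, k):
    # Same helper as in the article: indices of the k largest values, sorted ascending.
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

# Hypothetical importances for five features.
importances = [0.05, 0.40, 0.10, 0.30, 0.15]
top3 = indices_of_top_k(importances, 3)
print(top3)  # [1 3 4]

# Selecting those columns from a 2-D array works the same way as in transform().
X_toy = np.arange(10).reshape(2, 5)
print(X_toy[:, top3])  # [[1 3 4]
                       #  [6 8 9]]
```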

8. Final complete Pipeline

    prepare_select_and_predict_pipeline = Pipeline([
        ('preparation', full_pipeline),
        ('feature_selection', TopFeatureSelector(feature_importances, k)),
        ('forst_reg', RandomForestRegressor()),
    ])

Parameter search

    param_grid = [{
        'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'feature_selection__k': list(range(5, len(feature_importances) + 1)),
        'forst_reg__n_estimators': [200, 250, 300, 310, 330],
        'forst_reg__max_features': [2, 4, 6, 8],
    }]
    grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=10,
                                    scoring='neg_mean_squared_error', verbose=2, n_jobs=-1)
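The keys in `param_grid` follow scikit-learn's `<step name>__<parameter>` convention, with one `__` per level of nesting. When unsure which names are valid, `get_params()` lists them. The sketch below uses a small stand-in pipeline. `demo_pipeline` and its step names are assumptions for illustration and not the article's pipeline.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Two-step stand-in pipeline; step names become prefixes of the parameter names.
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("forest_reg", RandomForestRegressor()),
])

# Every key returned here can be used in a GridSearchCV param_grid.
tunable = sorted(demo_pipeline.get_params().keys())
for name in tunable:
    if "__" in name:
        print(name)

# Nested parameters can also be set directly with the same names.
demo_pipeline.set_params(imputer__strategy="mean", forest_reg__n_estimators=50)
```

A misspelled key makes `GridSearchCV.fit` fail with an invalid-parameter error, so checking the names against `get_params()` first can save a long wait.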

Training

    grid_search_prep.fit(X, y)
    grid_search_prep.best_params_
    final_model = grid_search_prep.best_estimator_

Prediction

    y_pred_test = final_model.predict(test)
    result = pd.DataFrame()
    result['id'] = test['id']
    result['satisfaction_level'] = y_pred_test
    result.to_csv('rf_ML_pipeline.csv', index=False)

The above is only a rough overall framework and leaves out many details; comments and corrections are welcome!

My CSDN blog: https://michael.blog.csdn.net/

