

Chapter 7: scikit-learn and Machine Learning in Practice

Published: 2023/12/10


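Contents

1 scikit-learn
2 A hands-on project
- 2.1 Project goal
- 2.2 Framing the problem
- 2.3 Choosing a performance metric
- 2.4 Checking the assumptions
- 2.5 Getting the data
- 2.6 Exploring and visualizing the data to find patterns
- 2.7 Preparing the data for machine learning algorithms
- 2.8 Selecting and training a model
- 2.9 Fine-tuning the model
- 2.10 Evaluating on the test set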

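1 scikit-learn

Navigation page and algorithm guide
API: data preprocessing (Preprocessing and Normalization), feature extraction (Feature Extraction), feature selection (Feature Selection); models: Generalized linear models (GLM) for regression, Naive Bayes, Support Vector Machines, Decision Trees, Clustering; model tuning and hyperparameter selection (Model Selection); model combination and boosting (Ensemble Methods); model evaluation (Metrics).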

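2 A hands-on project

2.1 Project goal

Use California census data to build a model of California housing prices. For each district, the data includes metrics such as population, median income, and median house value.
The model should learn from this data and predict the median house value of any district from the other metrics.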

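2.2 Framing the problem

The first question to ask your boss is: what is the business objective?
Building a model is not the end goal. What matters is how the company will use the model and how it will benefit from it. Your boss may tell you that the model's output will be fed into another system.
The next question is: how well does the current solution perform?
Then start designing the system. Is it supervised, unsupervised, or reinforcement learning? Is it classification or regression?
This is a supervised learning problem, because the training samples are labeled (with the median house value). It is a regression problem, because the value to predict is a number.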

2.3 Choosing a performance metric

Regression problems usually use the root mean squared error (RMSE) as the performance metric.
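As a minimal sketch of the metric: RMSE is the square root of the mean squared difference between predictions and labels, $\sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$. The helper name `rmse` and the sample values below are illustrative and do not come from the original article.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical median house values and model predictions (in dollars).
labels = [200_000, 350_000, 120_000]
predictions = [210_000, 330_000, 150_000]
error = rmse(labels, predictions)
print(f"RMSE: {error:.2f}")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.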

2.4 Checking the assumptions

Confirm with the downstream team what kind of value they need: a number or a category.

2.5 Getting the data

Download the data

Inspect the data structure

housing.head()
housing.info()
housing.describe()
housing["ocean_proximity"].value_counts()
housing.hist(bins=50, figsize=(20,15))

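Review the data summary

Focus on the number of samples, the type of each attribute, and the number of non-null values. Also check whether any values have been capped.

Create a test set

If the test set is re-sampled on every run, the results are unstable. Either save the test set, or fix the random seed so that the same test set is generated every time.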
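A minimal sketch of the fixed-seed approach, assuming scikit-learn's `train_test_split` is used for the split. The small `data` frame is a hypothetical stand-in for the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing DataFrame.
data = pd.DataFrame({"median_income": np.linspace(0.5, 15.0, 100)})

# The same random_state yields the same split on every run.
train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=42)

same_split = test_a.index.equals(test_b.index)
print(len(train_a), len(test_a), same_split)
```

A fixed seed only reproduces the split while the dataset itself is unchanged; if rows are added later, saving the test set is the more robust choice.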
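One more caveat: a purely random test set may not be representative. In this example it could fail to cover the high-income districts. Instead, sample in strata that follow the distribution of median income in the original dataset. StratifiedShuffleSplit does this: split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)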

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Label those above 5 as 5 (assigned back instead of using inplace=True on a column)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

# Stratified sampling keeps the proportion of each income category in both sets
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Compare category proportions in the test set and in the full dataset
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
housing["income_cat"].value_counts() / len(housing)

# Remove the helper column once the split is done
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

The income_cat distribution in the test set turns out to match that of the original dataset.

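2.6 Exploring and visualizing the data to find patterns

Get an overall feel for the data you are about to process.
First, put the test set aside and do not look at it. If the training set is very large, sample a smaller exploration set from it.
Second, visualize the geographical data.
Third, look for correlations.
Use the pandas call corr_matrix = housing.corr() to inspect correlations (on pandas 2.0 and later, pass numeric_only=True so that the text column ocean_proximity is skipped).

Median income correlates fairly strongly with median house value. Closer analysis also shows the points lining up along clear horizontal lines at levels such as 500,000, 480,000, and 350,000. It may be worth removing those districts so that the algorithm does not learn to reproduce these quirks.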
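The correlation check can be sketched on synthetic data. The frame below is hypothetical (it is not the California dataset): house value is generated as a noisy linear function of income, and a pure-noise column is included for contrast.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=500)

# Synthetic sample: value depends on income, "noise" is unrelated.
sample = pd.DataFrame({
    "median_income": income,
    "median_house_value": 40_000 * income + rng.normal(0, 30_000, size=500),
    "noise": rng.normal(size=500),
})

# Pearson correlation of every attribute with the target, strongest first.
corr_matrix = sample.corr()
ranked = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranked)
```

Keep in mind that `corr()` measures linear correlation only, so an attribute with a strong nonlinear relationship to the target can still score close to zero.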

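Fourth, experiment with attribute combinations.

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

Attribute combination is the time to get creative. Ratios worth trying include rooms per household, bedrooms per room, and population per household.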

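2.7 Preparing the data for machine learning algorithms

1 Write functions for the preparation steps
These functions can be reused.
2 Data cleaning
Missing values can be handled in three ways: 1) drop the rows that have missing values, 2) drop the whole attribute (column), 3) fill in a value (0, the mean, the median, and so on).
3 Handling text and categorical attributes
For a text attribute, first convert it to a numeric attribute and then apply one-hot encoding, using OrdinalEncoder and OneHotEncoder. LabelBinarizer can also do both in one step. (Since scikit-learn 0.20, OneHotEncoder also accepts string categories directly.)
4 Custom transformers
Inherit from BaseEstimator and TransformerMixin.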

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions in the numeric array. These values are an assumption based on
# the usual column order of the housing data; adjust them to your own columns.
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
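Steps 2 and 3 above (data cleaning and categorical encoding) can be sketched on a tiny frame. The `toy` DataFrame and its values are hypothetical; `SimpleImputer` is used here as one way to fill in the median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the housing data with one missing value.
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# Option 1: drop the rows that have a missing value.
dropped_rows = toy.dropna(subset=["total_bedrooms"])
# Option 2: drop the whole attribute.
dropped_column = toy.drop("total_bedrooms", axis=1)
# Option 3: fill missing values with the median learned from the data.
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy[["total_bedrooms"]])

# One-hot encode the text attribute; the result has one column per category.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(toy[["ocean_proximity"]]).toarray()
print(encoder.categories_[0], one_hot.shape)
```

Option 3 keeps every row and column, which is why the median stored in the fitted imputer should also be applied to the test set and to new data.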

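5 Feature scaling
The target value is usually not scaled. Here the target is the median house value, so it is left unscaled.
Scaling is fitted on the training set; the same transformation must then be applied to the test set and at prediction time.
There are two scaling methods: normalization and standardization.
Normalization: $x' = \dfrac{x - \min}{\max - \min}$, which maps values into the range [0, 1]. It is strongly affected by outliers. API: MinMaxScaler.
Standardization: $x' = \dfrac{x - \text{mean}}{\text{std}}$, where std is the standard deviation (not the variance). The result has no fixed range and is less affected by outliers. API: StandardScaler.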
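The two scalers can be compared on a small column that contains one outlier. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with a single outlier (100.0).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Normalization: (x - min) / (max - min); the outlier squeezes the rest near 0.
min_max = MinMaxScaler().fit_transform(values)

# Standardization: (x - mean) / std; the result has zero mean and unit variance.
std_scaler = StandardScaler().fit(values)
standard = std_scaler.transform(values)

# New data is transformed with the statistics learned from the training data.
new_scaled = std_scaler.transform(np.array([[5.0]]))
print(min_max.ravel(), standard.ravel(), new_scaled.ravel())
```

Calling `fit` once and `transform` afterwards is what keeps the training set, the test set, and later predictions on the same scale.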

6 Transformation pipelines

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Create a class to select numerical or categorical columns
# from a DataFrame and hand them on as a NumPy array
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values

# housing_num: the DataFrame of numerical attributes (defined earlier, not shown here)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    # SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    # use sparse=False instead on scikit-learn versions before 1.2
    ('cat_encoder', OneHotEncoder(sparse_output=False)),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
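As an alternative sketch, not the approach used above: recent scikit-learn versions provide `ColumnTransformer`, which routes DataFrame columns to different transformers without a custom selector class. The `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the housing data.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})
num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numerical columns: fill missing values with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns); sparse_threshold=0 forces a dense array.
preprocess = ColumnTransformer(
    [
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
    ],
    sparse_threshold=0,
)
prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```

The output has two scaled numerical columns followed by one column per category, the same layout a FeatureUnion of a numerical and a categorical pipeline produces.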

2.8 Selecting and training a model

2.9 Fine-tuning the model

2.10 Evaluating on the test set
