sklearn pipeline: An Introduction to Sklearn
A quick review of basic concepts
Supervised learning vs. unsupervised learning
The biggest difference is whether labels are available; industrial applications mainly use supervised learning.
Classification tasks and regression tasks
Use a linear model whenever it works; avoid non-linear models unless necessary (they overfit easily and are computationally expensive).
Model evaluation
accuracy: rarely used; it breaks down easily when classes are imbalanced
recall and precision: there is a trade-off between the two
F1-score: a balanced combination of recall and precision
AUC: the area under the ROC curve
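A minimal sketch of computing these metrics with sklearn.metrics, using made-up labels and predicted probabilities:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Made-up true labels, hard predictions, and predicted probabilities for the positive class
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7, 0.3]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("auc      :", roc_auc_score(y_true, y_prob))  # AUC uses scores/probabilities, not hard labels
```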
Feature processing (feature engineering)
The core factor that determines how well a machine learning model performs; it relies heavily on domain knowledge and on familiarity with the relevant tools.
Overview of Sklearn's design
Official documentation:
https://scikit-learn.org/stable/
Classification
Regression
Clustering
Dimensionality reduction
Model selection
Preprocessing
The machine learning workflow
Acquire data
Web crawlers
Databases
Data files (csv, excel, txt)
Process the data
Text processing
Consistent scales/units
Dimensionality reduction
Build a model
Classification
Regression
Clustering
Evaluate the model
Hyperparameter tuning
Which model is better
Simple, commonly used sklearn APIs
fit: train the model
transform: convert the data into the transformer's processed output (labels are placed after the test set)
predict: return the model's predictions
predict_proba: return predicted probabilities
score: model accuracy (the default accuracy is rarely used; it is usually set to f1)
get_params: get the model's parameters
Prepare the data
Dataset split: Training data (70%), Validation data, Testing data (30%)
In practice, the split is usually not fully random. Data that has already occurred (earlier in time) is used as the training set to predict future (later in time) data. Doing the opposite, using future data to predict the past, leaks prior information from the future into the model, which is unreasonable and easily causes overfitting.
There are other splitting schemes as well, for example splitting by region.
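A minimal sketch of such a time-based split, assuming a hypothetical pandas DataFrame with a date column and an arbitrarily chosen cutoff date:

```python
import pandas as pd

# Hypothetical data: 100 daily records with one feature and a label
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=100, freq="D"),
    "feature": range(100),
    "label": [i % 2 for i in range(100)],
})

cutoff = "2020-03-01"                 # illustrative cutoff only
train = df[df["date"] < cutoff]       # past data -> training set
test = df[df["date"] >= cutoff]       # future data -> test set
```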
Data processing
Dataset: ML DATASETS
Standardization, or mean removal and variance scaling
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
[Should I normalize/standardize/rescale the data](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html "Should I normalize/standardize/rescale the data")
StandardScaler
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.
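A minimal sketch of the fit-on-train, reapply-on-test pattern described above, with made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])
X_test = np.array([[-1.0, 1.0, 0.0]])

scaler = StandardScaler().fit(X_train)    # mean and std are computed on the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # the same transformation is reapplied to the test set
```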
MinMaxScaler
Scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size.
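A similar sketch for MinMaxScaler (MaxAbsScaler handles the unit-maximum-absolute-value variant), again with made-up data:

```python
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

X = [[1.0, -1.0, 2.0],
     [2.0, 0.0, 0.0],
     [0.0, 1.0, -1.0]]

X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)  # each feature rescaled into [0, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)                      # each feature divided by its maximum absolute value
```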
Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.
The Normalizer class also exposes the usual transformer methods such as fit and transform, but fit and transform have no real effect for it: normalization transforms each sample independently, so there is no statistical fitting step over all samples. The methods exist only so that the object behaves consistently with other transformers when passed to APIs such as sklearn's pipeline, which makes it convenient to use inside a Pipeline.
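A small sketch showing that Normalizer rescales each row to unit norm and that fit learns nothing from the data:

```python
from sklearn.preprocessing import Normalizer

X = [[4.0, 1.0, 2.0, 2.0],
     [1.0, 3.0, 9.0, 3.0]]

normalizer = Normalizer(norm="l2").fit(X)  # fit is effectively a no-op; kept only for API consistency
X_normalized = normalizer.transform(X)     # every sample (row) now has unit L2 norm
```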
Binarization (discretization)
Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that make assumption that the input data is distributed according to a multi-variate Bernoulli distribution. For instance, this is the case for the sklearn.neural_network.BernoulliRBM.
Return indices of half-open bins to which each value of x belongs.
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
pandas.cut bins values by value ranges; to bin by quantiles, use pandas.qcut instead.
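A minimal sketch of both kinds of discretization with made-up values; the 0.5 threshold and the age bins are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])
X_binary = Binarizer(threshold=0.5).fit_transform(X)  # values above 0.5 become 1, the rest 0

ages = pd.Series([3, 17, 25, 40, 63, 80])
age_bins = pd.cut(ages, bins=[0, 18, 35, 60, 100])    # fixed, value-based intervals
age_quartiles = pd.qcut(ages, q=4)                    # quantile-based intervals
```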
Encoding categorical features
We could encode categorical features as integers, but such integer representation can not be used directly with scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired.
One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
OneHotEncoder (rarely used directly in practice)
class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=numpy.float64, sparse=True, handle_unknown='error')
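The signature above is from an older sklearn version; newer versions drop n_values and categorical_features. A minimal sketch, assuming a recent sklearn that accepts string categories directly:

```python
from sklearn.preprocessing import OneHotEncoder

X = [["red", "S"],
     ["green", "M"],
     ["blue", "L"]]

# handle_unknown="ignore" keeps transform from failing on categories unseen during fit
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(X).toarray()  # each categorical column expands into m binary columns
```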
Convert categorical variable into dummy/indicator variables
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
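A minimal sketch of get_dummies on a made-up DataFrame; numeric columns pass through unchanged:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "M", "L", "M"],
    "price": [10.0, 12.5, 9.9, 12.0],
})

# Only the listed categorical columns are expanded into indicator columns
df_encoded = pd.get_dummies(df, columns=["color", "size"], drop_first=False)
```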
Imputation of missing values
For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.
The SimpleImputer class provides basic strategies for imputing missing values, either using the mean, the median or the most frequent value of the row or column in which the missing values are located. This class also allows for different missing values encodings.
The imputation strategy:
If “mean”, then replace missing values using the mean along the axis.
If “median”, then replace missing values using the median along the axis.
If “most_frequent”, then replace missing using the most frequent value along the axis.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
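A minimal sketch of SimpleImputer, again fitting on training data and reusing the learned statistics on the test data (values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0],
                    [np.nan, 3.0],
                    [7.0, 6.0]])
X_test = np.array([[np.nan, 5.0],
                   [4.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train)                        # column means are learned from the training set
X_test_imputed = imputer.transform(X_test)  # NaNs in the test set are filled with those means
```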
[Some practical tips]
Try not to delete samples just because a few of their features are missing; in practice it is best to fill in reasonable inferred values based on domain knowledge and make good use of every sample.
If there is no suitable way to infer a value, fill in a clearly meaningless placeholder such as -999 or -1.
Other methods that may come in handy: np.nan, np.inf, df.fillna, df.replace
Feature selection
SelectFromModel
This can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets. SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed, if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
class sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)
L1-based feature selection (used less often than tree models in practice; here we demonstrate selecting features with an L1-regularized model)
Tree-based feature selection (the first choice in practice; here we demonstrate with a random forest), as in the sketch below
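A hedged sketch of both selection strategies on a synthetic dataset (the dataset, the C value and the threshold are illustrative, not from the original notes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# L1-based selection: the L1 penalty drives some coefficients exactly to zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)

# Tree-based selection: features are ranked by feature_importances_
rf = RandomForestClassifier(n_estimators=100, random_state=0)
X_rf = SelectFromModel(rf, threshold="median").fit_transform(X, y)

print(X_l1.shape, X_rf.shape)  # typically fewer than the original 20 columns
```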
estimator: object. The base estimator used to build the feature selector. If prefit is True, it may be an already fitted estimator; if prefit is False, an unfitted estimator instance must be passed in.
threshold: string or float, optional, default None. The threshold used for feature selection: features whose coefficients in the fitted model are above this value are kept, otherwise they are removed. If it is a string, it can be "mean" (the mean of the coefficient values), "median" (their median), or a scaled version such as "0.x*mean" or "0.x*median". If None, the threshold is 1e-5 when the estimator has an l1 penalty, and "mean" otherwise.
prefit: boolean, default False. Whether the base estimator passed in has already been fitted. If True, a fitted estimator must be provided; if False, passing an (unfitted) estimator instance is enough.
Note: in practice we usually do not drop the columns produced by one-hot encoding; they are only dropped when all of their coefficients are 0.
※ Each feature is considered independently
[Note] Rarely used in practice, because it is hard to decide the threshold / number / proportion of features to keep.
Removing features with low variance
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
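A minimal sketch with made-up data containing two constant columns:

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

# With the default threshold of 0, the constant first and last columns are dropped
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X)
```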
Univariate feature selection
SelectKBest removes all but the k highest scoring features
SelectPercentile removes all but a user-specified highest scoring percentage of features
These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):
For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif
Dimensionality reduction (takes the joint contribution of all features into account)
[Note] This is also rarely used in real applications, because the reduced components rarely correspond to the business features we actually need; in machine learning, collinearity is usually handled with regularization terms instead.
```python
class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)
```
Principal Component Analysis (PCA)
PCA works by projecting the original dataset into a new space in which the new column vectors are mutually orthogonal. From a data analysis point of view, PCA turns the covariance matrix of the data into column vectors that "explain" a certain proportion of the variance.
Maximum-variance interpretation (the more components retained, the more variance is explained): http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020209.html
Minimum squared-error interpretation: http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020216.html
[Note] PCA is actually rarely used on ordinary datasets; it is used most often for image compression (e.g., when only the main part of an image, such as a face, needs to be captured).
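A minimal sketch of PCA on made-up data, keeping two components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                   # made-up data: 100 samples, 5 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # project onto the top-2 principal components
print(pca.explained_variance_ratio_)   # fraction of variance explained by each component
```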
Truncated SVD (truncated singular value decomposition)
TruncatedSVD is very similar to PCA, except that it operates directly on the sample matrix X rather than on its covariance matrix.
Truncated SVD differs from ordinary SVD in that the factorization it produces has exactly the number of columns we specify as the truncation. For example, given an n×n matrix, ordinary SVD produces matrices with n columns, whereas truncated SVD produces the number of columns we specify.
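A minimal sketch of TruncatedSVD on a made-up sparse matrix; the column count of the result equals the requested n_components:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

X = sparse_random(100, 50, density=0.01, random_state=42)  # made-up sparse matrix

svd = TruncatedSVD(n_components=5, random_state=42)
X_reduced = svd.fit_transform(X)   # works directly on X, sparse input included
print(X_reduced.shape)             # (100, 5): exactly the number of columns requested
```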
Model evaluation (I): parameter selection
Cross-validation: evaluating estimator performance
Learning the parameters of a prediction model and testing the model on the same data is a methodological mistake:
A model that would just repeat the labels of the samples it has just seen would get a perfect score, yet fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice in (supervised) machine learning experiments to hold out part of the available data as a test set (X_test, y_test). When evaluating different settings ("hyperparameters") of a model, for example the C parameter that must be set manually for an SVM, there is still a risk of overfitting on the test set if we use the test set to pick the best hyperparameters: we keep tweaking the hyperparameters until the model performs best on the test set. In this way, knowledge about the test set "leaks" into the model, and the evaluation metrics no longer report generalization performance.

To solve this problem, yet another part of the dataset can be held out as a so-called "validation set": we train on the training set, evaluate on the validation set to select the model and its hyperparameters, and once we have the model we believe is "best", run a final evaluation on the test set. However, by partitioning the available data into three sets we drastically reduce the number of samples that can be used to learn the model, and the result may depend on the particular random choice of the (train, validation) pair of sets.

The solution is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but a separate validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). For each of the k "folds", the following steps are performed:
* A model is trained using the other k-1 folds as training data
* The resulting model is validated on the remaining fold (i.e., it is used as a validation set to compute a performance metric such as accuracy)
The performance figure reported by k-fold cross-validation is then the average of the metric values computed in each of the k iterations (for the details of k-fold cross-validation, see the sklearn documentation: https://scikit-learn.org/stable/modules/cross_validation.html ). This approach can be computationally expensive, but it does not waste much data (compared with fixing a single validation set), which is a major advantage when the number of samples is small.
Computing cross-validated metrics
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
Parameters:
estimator: which model to use
X: the input data
y: the labels
scoring: which metric to use for evaluation
cv: number of cross-validation folds (default 5, usually set between 5 and 10; a KFold or Stratified iterator can also be passed, but when an integer is passed a stratified iterator is used by default for classifiers)
n_jobs: run n processes in parallel, default 1 (when the machine is otherwise idle, it is recommended to set it to -1 so the computation uses all available resources)
verbose: whether to print the training process (e.g. 0, 1, 2 or 3; the larger the number, the more detailed the output; some models, such as the perceptron here, have no iterative training process to print)
error_score: whether to raise an error when invalid parameters are encountered
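A minimal sketch of cross_val_score, using the iris dataset and a logistic regression purely as placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 5-fold CV (stratified by default for classifiers) with a macro-averaged F1 score
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro", n_jobs=-1)
print(scores.mean(), scores.std())
```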
Cross validation iterators
cv: int, cross-validation generator or an iterable, optional
In sklearn's various cross-validation APIs, the cv parameter determines the splitting strategy. Possible inputs for cv are:
None, to use the default 5-fold cross validation,
integer, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
The cv parameter can also be given one of the CV iterators built into sklearn:
K-fold
Stratified k-fold
Label k-fold
Leave-One-Out - LOO
Leave-P-Out - LPO
...
ShuffleSplit, by contrast, randomly samples from the data in its original order and assembles training and test sets of the specified train_size and test_size for cross-validation.
A schematic illustration is shown below.
ShuffleSplit randomly samples the whole dataset in each iteration to generate a training set and a test set. In each cross-validation iteration, the test_size and train_size parameters control how large the test and training sets should be. Because it samples from the entire dataset in every iteration (i.e., sampling with replacement across iterations), ShuffleSplit may pick, in one iteration, samples that were already chosen in a previous one. (Note that KFold, even with shuffle=True, still produces folds with no overlapping samples; this is the biggest difference between the two.)
ShuffleSplit's splits have nothing to do with the classes or groups shown in the figure (the class proportions). In practice we usually use StratifiedKFold (stratified sampling by class label) to split samples for cross-validation, to ensure that the proportion of each class in the training and test sets matches that of the original dataset.
As with the KFold vs. ShuffleSplit distinction above, StratifiedKFold and StratifiedShuffleSplit also differ in how they split or sample, only with the stratification condition added: StratifiedKFold performs the K-fold split within each class, while StratifiedShuffleSplit randomly samples (with replacement across iterations) a specified proportion of samples within each class to form the validation set, using the rest as the training set. Every split therefore respects the class proportions. A schematic illustration is shown below.
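A minimal sketch contrasting the two stratified iterators on a small, imbalanced made-up label vector:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)        # imbalanced labels: 70% class 0, 30% class 1

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for _, test_idx in skf.split(X, y):
    print("StratifiedKFold test labels:", y[test_idx])         # folds never overlap, ratio preserved

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for _, test_idx in sss.split(X, y):
    print("StratifiedShuffleSplit test labels:", y[test_idx])   # test sets may overlap across iterations
```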
Grid Search: Searching for estimator parameters
A search consists of:
an estimator (regressor or classifier such as sklearn.svm.SVC());
a parameter space;
a method for searching or sampling candidates;
a cross-validation scheme; and
a score function.
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
Parameters:
estimator: which model to use
param_grid: a dictionary of parameters (keys are the names of the parameters to tune, values are lists of candidate values to try)
scoring: which metric to use for evaluation (classifiers default to accuracy; it can be changed to 'f1', 'roc_auc', etc.)
cv: number of cross-validation folds (default 5, usually set between 5 and 10; a KFold or Stratified iterator can also be passed, but when an integer is passed a stratified iterator is used by default)
n_jobs: run n processes in parallel, default 1 (recommended to set it to -1 for parallel computation)
verbose: whether to print the training process (e.g. 0, 1, 2 or 3; the larger the number, the more detailed the output; some models, such as the perceptron, have no iterative training process to print)
iid: whether the samples are assumed to be independent and identically distributed (default True)
refit: whether to refit the best estimator on the whole training set and return it; default True, in which case the GridSearchCV instance can be used directly for predict
error_score: whether to raise an error when invalid parameters are encountered (default 'nan')
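A minimal sketch of GridSearchCV, using the breast cancer dataset and a small SVC grid purely as placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_, search.best_score_)
pred = search.predict(X)  # refit=True by default, so the search object predicts with the best model
```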
Model evaluation (II): evaluation metrics
For details, see the scoring metrics listed in the sklearn documentation:
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
For example, the documentation lists the scoring metrics available for classification tasks.
Summary