Credit Risk Management: Feature Scaling and Selection
Following feature engineering, this part moves to the next step in the data preparation process: feature scaling and selection, which transforms the dataset to a more digestible one prior to modelling.
As a reminder, this end-to-end project aims to solve a classification problem in Data Science, particularly in the finance industry, and is divided into 3 parts:
Feature Scaling and Selection (Bonus: Imbalanced Data Handling)
If you have missed the 1st part, feel free to check it out here before going through this 2nd part, for a better understanding of the context.
A. Feature Scaling
What is feature scaling and why do we need it prior to modelling?
According to Wikipedia,
Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
If you recall from the 1st part, we have completed engineering all of our features on both datasets (A & B) as below:
Dataset A (encoded without target) | Dataset B (encoded with target)

As seen above, the data ranges and distributions of the features are quite different from one another, not to mention that some variables carry outliers. That being said, it is highly recommended to apply feature scaling consistently to the entire dataset in order to make it more digestible to machine learning algorithms.
In fact, there are a number of different scaling methods out there, but I will only focus on three that I believe are relatively distinctive: StandardScaler, MinMaxScaler and RobustScaler. In brief,
StandardScaler: this method removes the mean and scales the data to unit variance (mean = 0 and standard deviation = 1). However, it is highly influenced by outliers, especially extreme ones, which can spread the scaled data range beyond 1 standard deviation.
MinMaxScaler: this method subtracts the minimum value of the feature and divides by the range (the difference between the original maximum and minimum values). Essentially, it rescales the dataset to the range of 0 to 1. However, this method is relatively limited as it compresses all data points into a narrow range, and it doesn’t help much in the presence of outliers.
RobustScaler: this method is based on percentiles: it subtracts the median and divides by the interquartile range (75% — 25%). It is generally preferable to the other two scalers since it is not greatly influenced by large marginal outliers, if any.
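To make this concrete, below is a minimal sketch on a made-up toy column (hypothetical values, not from our dataset) that applies each formula by hand; note how the single outlier stretches the standardized and min-max results far more than the robust one.

import numpy as np

# Toy income-like column with one extreme outlier (hypothetical values)
x = np.array([30_000, 35_000, 40_000, 45_000, 50_000, 500_000], dtype=float)

# StandardScaler: (x - mean) / std
standard = (x - x.mean()) / x.std()

# MinMaxScaler: (x - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# RobustScaler: (x - median) / IQR, where IQR = Q3 - Q1
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(standard.round(2))  # the outlier dominates the spread
print(minmax.round(2))    # non-outliers squeezed towards 0
print(robust.round(2))    # bulk of the data stays on a comparable scale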
Let’s see how the three scalers differ in our dataset:
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# StandardScaler
x_a_train_ss = pd.DataFrame(StandardScaler().fit_transform(x_a_train), columns=x_a_train.columns)

# MinMaxScaler
x_a_train_mm = pd.DataFrame(MinMaxScaler().fit_transform(x_a_train), columns=x_a_train.columns)

# RobustScaler
x_a_train_rs = pd.DataFrame(RobustScaler().fit_transform(x_a_train), columns=x_a_train.columns)

Scaled distributions: StandardScaler | MinMaxScaler | RobustScaler
As seen above, the scaled data range of RobustScaler looks more digestible than that of the other two scalers, which might help the machine learning algorithms run faster and more efficiently. However, this is just my assumption prior to modelling; we can put it to the test once we reach that phase.
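If you want to quantify that impression rather than eyeball the plots, a quick sketch like the one below (assuming the three scaled DataFrames from the snippet above are still in memory) summarises the overall min, max and spread of each version:

# Compare the overall range and spread of the three scaled training sets
# (assumes x_a_train_ss, x_a_train_mm and x_a_train_rs from the snippet above)
summary = pd.DataFrame({
    'StandardScaler': [x_a_train_ss.values.min(), x_a_train_ss.values.max(), x_a_train_ss.values.std()],
    'MinMaxScaler':   [x_a_train_mm.values.min(), x_a_train_mm.values.max(), x_a_train_mm.values.std()],
    'RobustScaler':   [x_a_train_rs.values.min(), x_a_train_rs.values.max(), x_a_train_rs.values.std()],
}, index=['min', 'max', 'std'])
print(summary.round(2))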
B. Imbalanced Data Handling
What is imbalanced data and how should we handle it?
In short, imbalanced datasets are those where there is a severe skew in the target distribution which might not be ideal for modelling. As a clearer example, let’s see if our dataset is imbalanced or not:
a_target_0 = df_a[df_a.target == 0].target.count() / df_a.target.count()
a_target_1 = df_a[df_a.target == 1].target.count() / df_a.target.count()
The result is that 76% of the data is classified as target 0 while the remaining 24% as target 1.
In fact, there’s no crystal clear benchmark that we should rely on to accurately determine whether our dataset is imbalanced or not. Some says that the ratio of 9:1 while others of 8:2, which indeed really depends on the nature of your dataset as well as the context/problem you are solving. In my case, I take the above result as imbalanced and will “resample” the dataset to make it relatively balanced.
實際上,沒有精確的基準可用來準確確定我們的數據集是否不平衡。 有人說比例為9:1,而其他人則為8:2,這確實取決于數據集的性質以及要解決的上下文/問題。 就我而言,我將上述結果視為不平衡,并將對數據集進行“重新采樣”以使其相對平衡。
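Since the cut-off is a judgment call, a small hypothetical helper like the one below (the function name and the 0.75 default threshold are my own choices, not from any library) makes that judgment explicit and reusable:

# Hypothetical helper: flag a binary target as imbalanced when the majority
# class exceeds a chosen share of the data (the threshold is subjective)
def is_imbalanced(target, threshold=0.75):
    majority_share = target.value_counts(normalize=True).max()
    return majority_share, majority_share >= threshold

share, flag = is_imbalanced(df_a.target)
print(f"Majority class share: {share:.0%} -> imbalanced: {flag}")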
Just a disclaimer: the pre-processing steps I take here are not all must-dos, nor will they necessarily improve the accuracy of our models later on. Rather, I was aiming to test all possible scenarios that might help improve my models.
Back to resampling, there are two common methodologies that we might have heard of: over-sampling and under-sampling. In brief,
Over-sampling: this method duplicates samples from the minority class and adds them to the dataset (training set).
Under-sampling: this is the opposite of the above; it deletes some samples from the majority class.
Visually, you can imagine something like this:
Image credit: https://towardsdatascience.com/having-an-imbalanced-dataset-here-is-how-you-can-solve-it-1640568947eb

Let’s test both out:
# Under-sampling
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

undersample = RandomUnderSampler()
x_a_train_rs_under, y_a_train_under = undersample.fit_resample(x_a_train_rs, y_a_train)
print(Counter(y_a_train_under))

# Over-sampling
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
x_a_train_rs_over, y_a_train_over = oversample.fit_resample(x_a_train_rs, y_a_train)
print(Counter(y_a_train_over))
For each methodology, there is a variety of options to resample your dataset, but I’ve chosen the most common ones to execute: RandomUnderSampler (for under-sampling) and SMOTE (for over-sampling). The class distribution after resampling is:
- RandomUnderSampler: {0: 5814, 1: 5814}
- SMOTE: {1: 18324, 0: 18324}
Both have their pros and cons, but as the name suggests, RandomUnderSampler “randomly” selects data from the majority class to remove, which might result in information loss since not all of the dataset is taken into modelling. That said, I’ve opted for SMOTE instead.
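For intuition, SMOTE does not simply copy minority rows; it synthesises new ones by interpolating between a minority sample and one of its nearest minority-class neighbours. A stripped-down sketch of that core idea (an illustration only, not the actual imblearn implementation) could look like this:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, random_state=42):
    """Generate n_new synthetic minority samples by interpolation (toy version)."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: each point is its own nearest neighbour
    _, neighbour_idx = nn.kneighbors(X_minority)

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))        # pick a random minority sample
        j = rng.choice(neighbour_idx[i][1:])     # pick one of its k nearest neighbours
        gap = rng.random()                       # random point along the segment between them
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Toy usage with random data standing in for the minority class
X_min = np.random.rand(20, 4)
print(smote_sketch(X_min, n_new=10).shape)  # (10, 4)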
C. Feature Selection
What is feature selection and what are the options available for our perusal?
In short, feature selection is the process of selecting the variables in the dataset that have the greatest correlation with, or impact on, the target variable. Particularly with bigger datasets, we might face up to hundreds of features, some of which might not be relevant or even related to the output. Hence, we need to conduct feature selection prior to modelling to achieve the highest accuracy.
Indeed, there are a handful of different techniques, which can be grouped under two big categories: (1) Feature Selection and (2) Dimensionality Reduction. The names probably sound familiar, and while they essentially serve the same purpose, the technique behind each is quite different.
1. Feature Selection
If you are looking for a complete list of techniques, feel free to refer to this blog article, which lists all the possible methods you could try. In this project, I will apply just two for the sake of simplicity: (a) Feature Importance and (b) Correlation Matrix.
For Feature Importance, as the name suggests, we select the top features that have the highest correlation with the target variable (e.g. top 10 or 15, depending on the total number of features). Particularly, this technique uses the feature_importances_ attribute of the ExtraTreesClassifier algorithm:
from sklearn.ensemble import ExtraTreesClassifier

fi = ExtraTreesClassifier()
fi_a = fi.fit(x_a_train_rs_over, y_a_train_over)
df_fi_a = pd.DataFrame(fi_a.feature_importances_, index=x_a_train_rs_over.columns)
df_fi_a.nlargest(10, df_fi_a.columns).plot(kind='barh')
plt.show()
As you can see, annual income is the most important feature, followed by age and years of employment. The number of features to keep, though, is really up to you.
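If we were to act on that ranking, a minimal sketch (keeping the top 10 as an arbitrary cut-off; df_fi_a and x_a_train_rs_over come from the snippet above) could be:

# Keep the 10 most important features according to the ExtraTreesClassifier ranking
top_features = df_fi_a.nlargest(10, df_fi_a.columns).index
x_a_train_top = x_a_train_rs_over[top_features]
print(x_a_train_top.shape)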
Moving on to the 2nd method of Feature Selection: a Correlation Matrix is a table showing the correlation coefficients between the variables in the dataset. Essentially, the higher the number, the more correlated any two variables are.
Let’s see it more visually for better illustration:
# Combine processed features with their target
df_b_train_processed = pd.concat([x_b_train_rs_over, y_b_train_over], axis=1)

cm_b = df_b_train_processed.corr()
print(cm_b.target.sort_values().tail(10))

plt.figure(figsize=(20,20))
sns.heatmap(cm_b, xticklabels=df_b_train_processed.columns, yticklabels=df_b_train_processed.columns, annot=True)
As seen above, we only need to consider the last column of the table, which holds the correlation of each independent variable with the target. It looks like all features share similar coefficients with the target, as indicated by the similar colour shades. This might differ slightly from the (1) Feature Importance results we just ran above. However, there’s no definitive right or wrong answer, as each technique is designed and functions differently.
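As with feature importance, we could turn that last column into an actual selection, for example by keeping the features with the highest absolute correlation with the target (the cut-off of 10 below is arbitrary; cm_b and x_b_train_rs_over come from the snippet above):

# Rank features by absolute correlation with the target and keep the strongest ones
target_corr = cm_b.target.drop('target').abs().sort_values(ascending=False)
selected = target_corr.head(10).index
x_b_train_selected = x_b_train_rs_over[selected]
print(x_b_train_selected.shape)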
2. Dimensionality Reduction
Dimensionality Reduction is broadly similar to Feature Selection, but it has its own techniques. The common options I often use can be grouped into Component-based (PCA) and Projection-based (t-SNE and UMAP).
Component-based (PCA): as the name suggests, it is based on the original features in the dataset, which are transformed into a new, smaller set of components that retain most of the variance in the data.
Projection-based (t-SNE and UMAP): the mathematical concept behind this technique is complicated, but in short, it refers to multi-dimensional data projected to a lower-dimensional space, which helps reduce the number of features in the dataset.
Remember, feature scaling is required when using these techniques!
from sklearn.decomposition import PCA

pca = PCA(.95)
pca_a_train = pca.fit(x_a_train_rs_over, y_a_train_over)
print(pca_a_train.n_components_)

plt.plot(np.cumsum(pca_a_train.explained_variance_ratio_))
plt.show()

x_a_train_rs_over_pca = pd.DataFrame(pca_a_train.transform(x_a_train_rs_over))
x_a_train_rs_over_pca.head()
As for PCA, I set the explained variance to .95 when calling it, meaning I wanted a new set of variables that retains 95% of the variance of the original ones. In this case, after fitting the training data to PCA, it turns out we only need 24 out of the 46 features. Furthermore, looking at the explained_variance_ratio_ chart, the curve stops increasing meaningfully after 24 components, which could be the ideal number of features after applying PCA.
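That cut-off can also be computed explicitly from the cumulative explained variance, which is essentially the rule PCA applies when given a float n_components — a quick sketch, fitting a full PCA on the same resampled training set:

# Fit PCA with all components, then count how many are needed for 95% of the variance
pca_full = PCA().fit(x_a_train_rs_over)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = int(np.argmax(cum_var >= 0.95)) + 1
print(n_components_95)  # should match pca_a_train.n_components_ (24 here)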
As for projection-based techniques, t-SNE works well for large datasets but is known to have limitations, specifically slow computation and large-scale information loss, whereas UMAP performs better in terms of runtime while preserving information.
In short, how UMAP works is that it first calculates the distance between the points in high dimensional space, projects them onto the low dimensional space, and calculates the distance between points in this low dimensional space. Then, it uses Stochastic Gradient Descent to minimize the difference between these distances.
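To make that distance-matching idea tangible, here is a heavily simplified toy sketch: it nudges low-dimensional points so their pairwise distances approach the high-dimensional ones, using plain full-batch gradient descent. Real UMAP instead works on a fuzzy nearest-neighbour graph with stochastic updates, so treat this as intuition only.

import numpy as np

def distance_matching_sketch(X, n_components=2, lr=0.001, n_iter=500, random_state=42):
    """Toy embedding: match low-dimensional pairwise distances to high-dimensional ones.
    An illustration of the idea described above, not real UMAP."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    d_high = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # high-dimensional distances
    Y = rng.normal(scale=0.1, size=(n, n_components))                # random low-dimensional start

    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d_low = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(d_low, 1.0)  # avoid division by zero on the diagonal
        # gradient of sum_ij (d_low - d_high)^2 w.r.t. Y (up to a constant factor)
        grad = ((d_low - d_high) / d_low)[:, :, None] * diff
        Y -= lr * grad.sum(axis=1)
    return Y

X_toy = np.random.rand(50, 10)
print(distance_matching_sketch(X_toy).shape)  # (50, 2)

In practice, though, we simply call the library: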
import umap

um = umap.UMAP(n_components=24)
umap_a_train = um.fit_transform(x_a_train_rs_over)
x_a_train_rs_over_umap = pd.DataFrame(umap_a_train)
x_a_train_rs_over_umap.head()
When it comes to comparing PCA and UMAP in terms of performance, which one to use depends on the scale and complexity of your dataset. However, for the sake of simplicity and better runtime, I’ve opted to apply PCA across the datasets and leverage it in the modelling phase.
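If you want to check that runtime trade-off on your own data, a rough timing sketch (reusing the imports and the resampled training set from above; absolute numbers will obviously vary by machine) could be:

import time

# Rough runtime comparison of PCA vs UMAP on the resampled training set
start = time.perf_counter()
PCA(n_components=24).fit_transform(x_a_train_rs_over)
print(f"PCA:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
umap.UMAP(n_components=24).fit_transform(x_a_train_rs_over)
print(f"UMAP: {time.perf_counter() - start:.2f}s")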
Voila! We’re done with the 2nd part of this end-to-end project, focusing on Feature Scaling and Selection!
I do hope you find it informative and easy to follow, so feel free to leave comments here! Do look out for the 3rd and final part of this project, which utilizes all of these data preparation steps to apply Machine Learning algorithms.
In the meantime, let’s connect:
Github: https://github.com/andrewnguyen07
LinkedIn: www.linkedin.com/in/andrewnguyen07
Thanks!
Translated from: https://towardsdatascience.com/credit-risk-management-feature-scaling-selection-b734049867ea