Credit Card Fraud Detection
Credit card fraud detection is a Kaggle project based on 2013 transactions made by European credit-card holders; details at https://www.kaggle.com/mlg-ulb/creditcardfraud
The goal is to predict, for each transaction, whether it is fraudulent. What distinguishes it from most machine-learning projects is the extreme imbalance between positive and negative samples, so handling that imbalance is the first problem the feature engineering must solve.
On the other hand, preprocessing is lighter than usual: there are no missing values to handle, and most features have already been centered.
Project Background and a First Look at the Data
```python
# Import the basic libraries; model libraries are imported later as needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Widen the display so long rows are not truncated
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width', 10000)

# Load the data and check its basic structure
data = pd.read_csv('D:/數據分析/kaggle/信用卡欺詐/creditcard.csv')
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns): Time, V1-V28 and Amount (float64),
Class (int64); every column is 284807 non-null
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
```

```python
data.shape      # (284807, 31)
data.describe()
```

[data.describe() output omitted: count, mean, std, min, quartiles and max for all 31 columns. V1-V28 have means on the order of 1e-15 and standard deviations near 1; Amount has mean 88.35, std 250.12 and max 25691.16; Class has mean 0.001727.]
[data.head() and data.tail() output omitted: the first five transactions (Time 0-2 s, e.g. Amount 149.62, Class 0) and the last five (Time 172786-172792 s).]
- The data contain 284807 samples with 30 feature attributes plus a class label. The data are complete, with no missing values, so no imputation is needed. The non-anonymized features are the time, the transaction amount and the class; the 28 anonymized features V1-V28 have means essentially equal to 0 and standard deviations around 1, which shows they have already been normalized. (The Kaggle description says the anonymized features were de-identified and transformed with PCA. The Time feature is the number of seconds between each transaction and the first one in the dataset, i.e. time since collection started; the data cover two days, and the maximum of 172792 is indeed roughly 48 hours.)
- The class imbalance: in the descriptive statistics the Class column has mean 0.001727, so the vast majority of samples are 0. This can also be checked with data.Class.value_counts(): 284315 normal transactions versus only 492 fraudulent ones. Handling this extreme imbalance is the main task of the feature engineering.
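A quick check of the counts quoted above:

```python
print(data.Class.value_counts())
# 0    284315
# 1       492
```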
Exploratory Data Analysis (EDA)
* Univariate analysis
All attributes are numeric, so there is no need to separate numeric from categorical attributes or to re-encode categories. Below we analyze single attributes, starting with the target class.
The class distribution plot, together with the kurtosis and skewness, shows the sample imbalance more directly.
```python
# Bar plot of fraud vs non-fraud counts
sns.countplot('Class', data=data, color='blue')
plt.xlabel('values')
plt.ylabel('Counts')
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)')

print('Kurtosis:', data.Class.kurt())
print('Skewness:', data.Class.skew())
```

```
Kurtosis: 573.887842782971
Skewness: 23.99757931064749
```

Next we analyze the two attributes that were not standardized: Time and Amount.
```python
# Distributions of transaction time and amount; the amount is fitted with both
# a normal and a lognormal distribution
import scipy.stats as st

fig, ax = plt.subplots(1, 3, figsize=(18, 4))
sns.distplot(data.Amount, color='blue', ax=ax[0], kde=False, fit=st.norm)
ax[0].set_title('Distribution of transaction amount_normal')
sns.distplot(data.Amount, color='blue', ax=ax[1], fit=st.lognorm)
ax[1].set_title('Distribution of transaction amount_lognorm')
sns.distplot(data.Time, color='r', ax=ax[2])
ax[2].set_title('Distribution of transaction time')

print(data.Amount.value_counts())
```

```
1.00     13688
1.98      6044
0.89      4872
9.99      4747
15.00     3280
         ...
1080.06      1
Name: Amount, Length: 32767, dtype: int64
```

```python
# Share of transactions below/above various amounts
for limit in [5, 10, 20, 30, 50, 100]:
    print('the ratio of Amount<%d:' % limit,
          (data.Amount < limit).sum() / len(data.Amount))
print('the ratio of Amount>5000:', (data.Amount > 5000).sum() / len(data.Amount))
```

```
the ratio of Amount<5: 0.2368726892246328
the ratio of Amount<10: 0.3416840175978821
the ratio of Amount<20: 0.481476929991187
the ratio of Amount<30: 0.562022703093674
the ratio of Amount<50: 0.6660791342909409
the ratio of Amount<100: 0.7985126770058321
the ratio of Amount>5000: 0.00019311323106524768
```

For Amount, the vast majority of transactions are small, under 50 dollars, with a few large values such as 1080 dollars. On the time axis there are two troughs near 15000 s and 100000 s, i.e. about 4 and 27 hours after collection began; presumably these correspond to 3-4 a.m., which matches reality, since few people shop in the middle of the night. Because the other, anonymized attributes have already been centered, and the normal fit to Amount is better than the log fit, we standardize Amount as well; likewise, Time is converted to hours and then standardized.
```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# reshape(-1, 1): -1 infers the number of rows, 1 means a single column
data['Amount'] = sc.fit_transform(data.Amount.values.reshape(-1, 1))
# Hours since collection started, then the hour on a 24-hour clock,
# since transaction density is periodic by hour of day
data['Hour'] = data.Time.apply(lambda x: divmod(x, 3600)[0])
data['Hour'] = data.Hour.apply(lambda x: divmod(x, 24)[1])
data['Hour'] = sc.fit_transform(data['Hour'].values.reshape(-1, 1))
data.drop(columns='Time', inplace=True)
data.head().append(data.tail())
```

[head/tail preview omitted: V1-V28 are unchanged, Amount is now standardized (e.g. 149.62 becomes 0.244964), and the new standardized Hour column replaces Time.]
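To make the two divmod steps concrete, a small worked example: 96000 seconds is 26 full hours after collection began, i.e. hour 2 on the 24-hour clock.

```python
seconds = 96000
hour_since_start = divmod(seconds, 3600)[0]    # 26 full hours since collection began
hour_of_day = divmod(hour_since_start, 24)[1]  # hour 2 on the 24-hour clock
```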
Because the data have been standardized, the anonymized features have relatively small skewness but large kurtosis. V5, V7, V8, V20, V21, V23, V27 and V28 have particularly high kurtosis, meaning their distributions are very concentrated; the remaining features are spread more evenly.
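The kurtosis figures behind this claim can be checked directly, e.g.:

```python
# Kurtosis of the most peaked anonymized features named above
print(data[['V5', 'V7', 'V8', 'V20', 'V21', 'V23', 'V27', 'V28']].kurt())
```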
* Pairwise relationships
```python
# Correlation analysis
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), square=True, cmap='coolwarm_r', annot_kws={'size': 20})
plt.show()
data.corr()
```

[31×31 correlation matrix omitted: the off-diagonal correlations among V1-V28 are on the order of 1e-16, i.e. effectively zero, and the correlations between each feature and Class are all below 0.36 in absolute value.]
Numerically, there is no obvious correlation among the feature variables, and because of the class imbalance the correlations between the features and the target class are also weak. This is not what we want, so we first balance the positive and negative samples.
Before balancing, however, we must split the data into training and test sets. This keeps the evaluation fair: the test must run on original data, since we cannot test a synthetic training set against a synthetic test set.
To keep the class distributions of the training and test sets consistent, we use StratifiedKFold for the split, as sketched below.
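The split itself is not shown in the original code; a minimal sketch consistent with the class counts reported later (227452/394 in training, 56863/98 in test, i.e. one fold out of five) might look like this. The exact call is an assumption.

```python
from sklearn.model_selection import StratifiedKFold

X = data.drop(columns='Class')
y = data['Class']

# One fold of a 5-fold stratified split gives an 80/20 train/test split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
train_idx, test_idx = next(skf.split(X, y))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```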
For extreme class imbalance there are two main strategies: undersampling and oversampling.
- Undersampling randomly draws from the majority class as many samples as the minority class has. With extreme imbalance this tends to underfit, because the resulting dataset is tiny.
- Oversampling generates from the minority class as many samples as the majority class has. There are two methods: random oversampling and SMOTE.
- Random oversampling draws samples from the minority class with replacement until it matches the majority class in size; this easily overfits.
- SMOTE (Synthetic Minority Oversampling Technique) improves on the overfitting of random oversampling. Split the samples into the majority and minority classes; pick a random sample from the minority class, find its nearest neighbors within the minority class, and generate new samples at points on the line segments between the chosen sample and those neighbors; repeat until the minority class reaches the majority-class size. A toy sketch of this interpolation step follows the list.
Counter(y_smotesampled) can be used to verify that the two classes end up with equal counts.
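To make the interpolation step concrete, here is a toy sketch (not the imblearn implementation; X_minority is assumed to be a NumPy array of minority-class rows):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one(X_minority, k=5, rng=np.random.default_rng(1)):
    """Generate one synthetic minority sample by linear interpolation."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.integers(len(X_minority))            # pick a random minority sample
    _, idx = nn.kneighbors(X_minority[i:i + 1])
    j = rng.choice(idx[0][1:])                   # one of its k nearest neighbors (skip itself)
    lam = rng.random()                           # random position on the segment
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])
```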
Below we process the samples with two schemes, random undersampling alone and SMOTE combined with random undersampling, and compare the resulting distributions.
```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from collections import Counter

X = X_train.copy()
y = y_train.copy()
print('Imbalanced samples: ', Counter(y))

rus = RandomUnderSampler(random_state=1)
X_rus, y_rus = rus.fit_resample(X, y)
print('Random under sample: ', Counter(y_rus))

ros = RandomOverSampler(random_state=1)
X_ros, y_ros = ros.fit_resample(X, y)
print('Random over sample: ', Counter(y_ros))

# SMOTE up to a 1:2 minority/majority ratio, then undersample to parity
smote = SMOTE(random_state=1, sampling_strategy=0.5)
X_smote, y_smote = smote.fit_resample(X, y)
print('SMOTE: ', Counter(y_smote))
under = RandomUnderSampler(sampling_strategy=1)
X_smote, y_smote = under.fit_resample(X_smote, y_smote)
print('SMOTE + under: ', Counter(y_smote))
```

```
Imbalanced samples:  Counter({0: 227452, 1: 394})
Random under sample:  Counter({0: 394, 1: 394})
Random over sample:  Counter({0: 227452, 1: 227452})
SMOTE:  Counter({0: 227452, 1: 113726})
SMOTE + under:  Counter({0: 113726, 1: 113726})
```

As the counts show, undersampling brings both classes down to 394 samples, while random oversampling brings both up to 227452; SMOTE at ratio 0.5 followed by undersampling yields 113726 of each. Below we analyze each balanced sample.
Random undersampling
```python
data_rus = pd.concat([X_rus, y_rus], axis=1)

# Distribution of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

# Box plot of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(sns.boxplot, 'value', color='lightskyblue')

# Violin plot of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(sns.violinplot, 'value', color='lightskyblue')
```

The distribution plots show that most values are concentrated, but outliers exist (the box and violin plots make this clearer). Outliers can mislead the model, so we remove them. There are two ways to flag outliers: assuming normality and using the quartiles. The normality approach applies the $3\sigma$ rule; the quartile approach flags values beyond a fixed multiple of the interquartile range above the third quartile or below the first.
```python
def outlier_process(data, column):
    Q1 = data[column].quantile(q=0.25)
    Q3 = data[column].quantile(q=0.75)
    low_whisker = Q1 - 3 * (Q3 - Q1)
    high_whisker = Q3 + 3 * (Q3 - Q1)
    # Drop the outliers
    data_drop = data[(data[column] >= low_whisker) & (data[column] <= high_whisker)]
    # Box plots before and after removal
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    sns.boxplot(y=data[column], ax=ax1, color='lightskyblue')
    ax1.set_title('before deleting outlier ' + column)
    sns.boxplot(y=data_drop[column], ax=ax2, color='lightskyblue')
    ax2.set_title('after deleting outlier ' + column)
    return data_drop

numerical_columns = data_rus.columns.drop('Class')
for col_name in numerical_columns:
    data_rus = outlier_process(data_rus, col_name)
```
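For comparison, the $3\sigma$ rule mentioned above could look like this (a sketch; the notebook itself uses the IQR whiskers):

```python
def outlier_3sigma(data, column):
    # Keep rows within three standard deviations of the mean (assumes rough normality)
    mu, sigma = data[column].mean(), data[column].std()
    return data[(data[column] - mu).abs() <= 3 * sigma]
```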
Some attributes, such as Hour, have fairly concentrated distributions and barely change after the cleanup.
With the dirty data removed, we probe the correlations between variables again.
```python
# Heat maps before and after resampling
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))
sns.heatmap(data.corr(), cmap='coolwarm_r', ax=ax1, vmax=0.8)
ax1.set_title('the relationship on imbalanced samples')
sns.heatmap(data_rus.corr(), cmap='coolwarm_r', ax=ax2, vmax=0.8)
ax2.set_title('the relationship on random under samples')

# Correlation between each numeric attribute and Class
data_rus.corr()['Class'].sort_values(ascending=False)
```

```
Class     1.000000
V4        0.722609
V11       0.694975
V2        0.481190
V19       0.245209
V20       0.151424
V21       0.126220
Amount    0.100853
V26       0.090654
V27       0.074491
V8        0.059647
V28       0.052788
V25       0.040578
V22       0.016379
V23      -0.003117
V15      -0.009488
V13      -0.055253
V24      -0.070806
Hour     -0.196789
V5       -0.383632
V6       -0.407577
V1       -0.423665
V18      -0.462499
V7       -0.468273
V17      -0.561113
V3       -0.561767
V9       -0.562542
V16      -0.592382
V10      -0.629362
V12      -0.691652
V14      -0.751142
Name: Class, dtype: float64
```

The comparison shows that after resampling the correlations between the numeric attributes and the class label increase markedly. By rank, the features most positively correlated with Class are V4, V11 and V2, and the most negatively correlated include V14, V10, V12, V3, V7 and V9. We plot these features against the target below.
```python
# Positively correlated attributes vs Class
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))
for ax, fea in zip((ax1, ax2, ax3), ['V4', 'V11', 'V2']):
    sns.violinplot(x='Class', y=fea, data=data_rus,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax)
    ax.set_title(fea + ' vs Class Positive Correlation')

# Negatively correlated attributes vs Class
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(24, 12))
for ax, fea in zip((ax1, ax2, ax3, ax4, ax5, ax6), ['V14', 'V10', 'V12', 'V3', 'V7', 'V9']):
    sns.violinplot(x='Class', y=fea, data=data_rus,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax)
    ax.set_title(fea + ' vs Class Negative Correlation')

# Remaining attributes vs Class
other_fea = list(data_rus.columns.drop(['V11', 'V4', 'V2', 'V17', 'V14',
                                        'V12', 'V10', 'V7', 'V3', 'V9', 'Class']))
fig, ax = plt.subplots(5, 4, figsize=(24, 36))
for i, fea in enumerate(other_fea):
    row, col = divmod(i, 4)
    sns.violinplot(x='Class', y=fea, data=data_rus,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax[row, col])
    ax[row, col].set_title(fea)
```

The violin plots show that the value distributions of these attributes really do differ between the fraud and non-fraud classes, while the remaining attributes differ relatively little between the two.
Out of curiosity about how amount and time relate to Class: normal transactions involve smaller, more concentrated amounts, while fraudulent ones tend to be larger and more spread out; and normal transactions span a narrower range of hours than fraudulent ones, so fraud is more likely while people are asleep.
Having looked at the features against the class, we now analyze the relationships among the features themselves. The heat map suggests the attributes are correlated; below we check specifically for multicollinearity.
```python
sns.set()
sns.pairplot(data[list(data_rus.columns)], kind='scatter', diag_kind='kde')
plt.show()
```

The pair plot shows these attributes are not mutually independent; some exhibit strong linear correlation. We check further with the variance inflation factor (VIF).
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(data_rus.values, data_rus.columns.get_loc(i))
       for i in data_rus.columns]
vif
```

```
[12.662373162589994, 20.3132501576979, 26.78027354608725, 9.970255022795625,
 23.531563683157597, 3.4386660732204946, 67.84989394913013, 5.76519495696649,
 7.129002458395831, 23.226754020950764, 11.753104213590975, 29.49673779700361,
 1.3365891898690718, 21.57973674600878, 1.2669840488461022, 27.61485162786757,
 31.081940593780782, 14.088642210869459, 2.502857511412321, 4.96077803555917,
 5.169599871511768, 3.1235143157354583, 2.828445638986856, 1.1937601054384332,
 1.628451339236206, 1.1966413137632343, 1.959903999050125, 1.4573293665681395,
 6.314999796714301, 2.0990707198901117, 4.802392100187543]
```

A common rule of thumb: when $VIF < 10$ the variable has no multicollinearity with the others; when $10 < VIF < 100$ there is fairly strong multicollinearity; and when $VIF > 100$ the multicollinearity is severe. By these values the variables do exhibit multicollinearity, i.e. redundant information, so feature extraction or feature selection is needed.
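For reference, the VIF of feature $i$ comes from regressing that feature on all the others, where $R_i^2$ is the coefficient of determination of that regression:

$$VIF_i = \frac{1}{1 - R_i^2}$$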
Feature extraction and feature selection are two approaches to dimensionality reduction: extracted features are mappings of the original features, while selected features are a subset of them. Principal component analysis and linear discriminant analysis are two classic extraction methods.
Feature selection: once the data are processed, meaningful features must be chosen as inputs to the learning algorithms. Two criteria guide the choice:
- Whether the feature varies: if a feature's variance is close to 0, the samples barely differ on it, so it contributes nothing to discriminating between them.
- Correlation with the target: features highly correlated with the target should be preferred.
By form, feature selection methods fall into three kinds:
1. Filter methods: score each feature by dispersion or correlation, then keep features above a threshold or the top-k (variance threshold, correlation coefficient, chi-squared test, mutual information).
2. Wrapper methods: repeatedly select or drop groups of features according to an objective function, usually predictive performance (recursive feature elimination).
3. Embedded methods: first train a model to obtain weight coefficients for the features, then select features by coefficient magnitude; similar to filter methods, but feature quality is determined by training (penalty-based selection, tree-based selection).
Ridge regression and Lasso are two feature-selection approaches for linear models; both add a regularization term to prevent overfitting. Ridge adds an L2-norm penalty, while Lasso adds an L1-norm penalty. Lasso can drive the coefficients of weak features to exactly 0, yielding a sparse solution, i.e. dimensionality reduction happens during training.
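The two objectives, for comparison:

$$\text{Ridge:}\ \min_w \|y - Xw\|_2^2 + \alpha \|w\|_2^2 \qquad\qquad \text{Lasso:}\ \min_w \|y - Xw\|_2^2 + \alpha \|w\|_1$$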
For tree models, a random forest classifier can rank the feature importances, which also serves for screening.
We first apply the two selection methods separately and compare the features they pick; a sketch of the calls involved follows.
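The selection code for the undersampled data is not shown in the original; a minimal sketch mirroring the calls applied later to the SMOTE data (X_fea and y_fea are illustrative names) would be:

```python
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier

X_fea = data_rus.drop(columns='Class')
y_fea = data_rus['Class']

# Lasso keeps the features with non-zero coefficients
model_lasso = LassoCV(alphas=[0.1, 0.01, 0.005, 1], random_state=1, cv=5).fit(X_fea, y_fea)
print(pd.Series(model_lasso.coef_, index=X_fea.columns).loc[lambda s: s != 0])

# Random forest ranks features by impurity-based importance
rfc = RandomForestClassifier(random_state=1).fit(X_fea, y_fea)
importances = pd.Series(rfc.feature_importances_, index=X_fea.columns)
print(importances.sort_values(ascending=False).cumsum())  # tabulated below
```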
| Feature added | Cumulative importance |
| --- | --- |
| V12 | 0.134515 |
| V10 | 0.268117 |
| V17 | 0.388052 |
| V14 | 0.503750 |
| V4 | 0.615973 |
| V11 | 0.687904 |
| V2 | 0.731179 |
| V16 | 0.774247 |
| V3 | 0.813352 |
| V7 | 0.846050 |
| V19 | 0.860559 |
| V18 | 0.873657 |
| V21 | 0.886416 |
| Amount | 0.896986 |
| V27 | 0.906369 |
| V20 | 0.915172 |
| V15 | 0.923700 |
| V23 | 0.931990 |
| V13 | 0.939298 |
| V8 | 0.946163 |
| V6 | 0.952952 |
| V26 | 0.959418 |
| V9 | 0.965869 |
| V1 | 0.971975 |
| V5 | 0.977297 |
| V22 | 0.982325 |
| V28 | 0.986990 |
| V24 | 0.991551 |
| Hour | 0.995913 |
| V25 | 1.000000 |
From these results the two rankings differ substantially. To retain as much information as possible, we combine the two: keep the random-forest features up to 95% cumulative importance, then add any feature Lasso kept that the forest cut. This selects 29 features, ['V14','V12','V10','V17','V4','V11','V3','V2','V7','V16','V18','Amount','V19','V20','V23','V21','V15','V9','V6','V27','V25','V5','V13','V22','Hour','V28','V1','V8','V26']; only V24 is dropped.
```python
# Train and evaluate with the selected features
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix
from sklearn.metrics import classification_report, roc_curve
from sklearn.linear_model import LogisticRegression        # logistic regression
from sklearn.neighbors import KNeighborsClassifier         # KNN
from sklearn.naive_bayes import GaussianNB                 # naive Bayes
from sklearn.svm import SVC                                # support vector classifier
from sklearn.tree import DecisionTreeClassifier            # decision tree
from sklearn.ensemble import RandomForestClassifier        # random forest
from sklearn.ensemble import AdaBoostClassifier            # AdaBoost
from sklearn.ensemble import GradientBoostingClassifier    # GBDT
from xgboost import XGBClassifier                          # XGBoost
from lightgbm import LGBMClassifier                        # LightGBM

Classifiers = {'LG': LogisticRegression(random_state=1),
               'KNN': KNeighborsClassifier(),
               'Bayes': GaussianNB(),
               'SVC': SVC(random_state=1, probability=True),
               'DecisionTree': DecisionTreeClassifier(random_state=1),
               'RandomForest': RandomForestClassifier(random_state=1),
               'Adaboost': AdaBoostClassifier(random_state=1),
               'GBDT': GradientBoostingClassifier(random_state=1),
               'XGboost': XGBClassifier(random_state=1),
               'LightGBM': LGBMClassifier(random_state=1)}

def train_test(Classifiers, X_train, y_train, X_test, y_test):
    y_pred = pd.DataFrame()
    Accuracy_Score = pd.DataFrame()
    for model_name, model in Classifiers.items():
        model.fit(X_train, y_train)
        y_pred[model_name] = model.predict(X_test)
        y_pred_pra = model.predict_proba(X_test)
        Accuracy_Score[model_name] = pd.Series(model.score(X_test, y_test))
        # Per-class precision, recall and F1
        print(model_name, '\n', classification_report(y_test, y_pred[model_name]))
        # Confusion matrix
        fig, ax = plt.subplots(1, 1)
        plot_confusion_matrix(model, X_test, y_test, labels=[0, 1], cmap='Blues', ax=ax)
        ax.set_title(model_name)
        # ROC curve
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        fpr, tpr, thres = roc_curve(y_test, y_pred_pra[:, -1])
        ax1.plot(fpr, tpr)
        ax1.set_title(model_name + ' ROC')
        ax1.set_xlabel('fpr')
        ax1.set_ylabel('tpr')
        # KS curve
        ax2.plot(thres[1:], tpr[1:])
        ax2.plot(thres[1:], fpr[1:])
        ax2.plot(thres[1:], tpr[1:] - fpr[1:])
        ax2.set_xlabel('threshold')
        ax2.legend(['tpr', 'fpr', 'tpr-fpr'])
        plt.sca(ax2)
        plt.gca().invert_xaxis()
        ax2.set_title(model_name + ' KS')
    return y_pred, Accuracy_Score

test_cols = X_rus.columns.drop('V24')
Y_pred, Accuracy_score = train_test(Classifiers, X_rus[test_cols], y_rus,
                                    X_test[test_cols], y_test)
Accuracy_score
```

[Per-model classification reports, confusion matrices and ROC/KS plots omitted; the key figures on the 56961 test samples (56863 non-fraud, 98 fraud) are:]

| Model | Accuracy | Fraud precision | Fraud recall |
| --- | --- | --- | --- |
| LG | 0.959446 | 0.04 | 0.94 |
| KNN | 0.973561 | 0.06 | 0.93 |
| Bayes | 0.962887 | 0.04 | 0.88 |
| SVC | 0.981479 | 0.08 | 0.91 |
| DecisionTree | 0.877899 | 0.01 | 0.95 |
| RandomForest | 0.967451 | 0.05 | 0.93 |
| Adaboost | 0.953407 | 0.03 | 0.95 |
| GBDT | 0.960482 | 0.04 | 0.92 |
| XGboost | 0.965678 | 0.04 | 0.93 |
| LightGBM | 0.969611 | 0.05 | 0.94 |
From these results, accuracy is high across the board, with SVC reaching 98.1479%. For imbalanced samples, however, recall matters more: SVC identifies fraudulent transactions with 91% recall and non-fraud with about 98%, misclassifying 1046 non-fraud and 9 fraud samples respectively.
Before tuning, we try a model ensemble with default parameters, combining the three most accurate models: KNN, SVC and LightGBM. The ensemble reaches 98% accuracy, misclassifying 6 fraud and 1114 non-fraud samples; fraud detection improves, but more non-fraud transactions are flagged as fraud.
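The ensemble code is not shown in the original; one plausible sketch, soft voting over the three models named above (the voting scheme is an assumption), reusing the classifiers imported earlier:

```python
from sklearn.ensemble import VotingClassifier

voting = VotingClassifier(
    estimators=[('KNN', KNeighborsClassifier()),
                ('SVC', SVC(random_state=1, probability=True)),
                ('LightGBM', LGBMClassifier(random_state=1))],
    voting='soft')  # average predicted probabilities across the three members
voting.fit(X_rus[test_cols], y_rus)
print(classification_report(y_test, voting.predict(X_test[test_cols])))
```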
Since the models above use default parameters, we consider a grid search for the best parameters, although a grid search will not necessarily find them.
With the initial models built, we now tune the hyperparameters. There are three tuning approaches: random search, grid search and Bayesian optimization. Random and grid search become extremely slow with many hyperparameters, because they try every combination; Bayesian optimization uses information from previous trials to keep updating a prior, needs fewer iterations, and is faster.
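As an illustration of the grid-search variant, a sketch for one model (the parameter grid here is illustrative; the actual grids are not shown in the original):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='recall')
grid.fit(X_rus[test_cols], y_rus)
print(grid.best_params_, grid.best_score_)
```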
Accuracy after tuning (same model order as before):

| LG | KNN | Bayes | SVC | DecisionTree | RandomForest | Adaboost | GBDT | XGboost | LightGBM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.972508 | 0.963747 | 0.962887 | 0.983129 | 0.897087 | 0.974474 | 0.962185 | 0.966152 | 0.960043 | 0.969769 |
In accuracy, most models improve after tuning, though a few regress. Recall shows no clear improvement and in some cases degrades noticeably, meaning fraud detection is still weak. The training set here is far smaller than the test set, which also hurts test performance.
Combining SMOTE oversampling with random undersampling
Again we analyze the training samples before fitting the models.
```python
data_smote = pd.concat([X_smote, y_smote], axis=1)

# Distribution of each numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

# Box plot of each numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(sns.boxplot, 'value', color='lightskyblue')

# Violin plot of each numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(sns.violinplot, 'value', color='lightskyblue')

# Remove outliers column by column, printing the shrinking shape
numerical_columns = data_smote.columns.drop('Class')
for col_name in numerical_columns:
    data_smote = outlier_process(data_smote, col_name)
    print(data_smote.shape)
```

```
(214362, 31)
(213002, 31)
(212497, 31)
...
(166755, 31)
(156423, 31)
(156423, 31)
```
By rank, the features most positively correlated with Class are again V4, V11 and V2, and the most negatively correlated include V14, V10, V12, V3, V7 and V9. We plot these features against the target below.
```python
# Positively correlated attributes vs Class
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))
for ax, fea in zip((ax1, ax2, ax3), ['V4', 'V11', 'V2']):
    sns.violinplot(x='Class', y=fea, data=data_smote,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax)
    ax.set_title(fea + ' vs Class Positive Correlation')

# Negatively correlated attributes vs Class
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(24, 14))
for ax, fea in zip((ax1, ax2, ax3, ax4, ax5, ax6), ['V14', 'V10', 'V12', 'V3', 'V7', 'V9']):
    sns.violinplot(x='Class', y=fea, data=data_smote,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax)
    ax.set_title(fea + ' vs Class Negative Correlation')

# Multicollinearity check on the SMOTE-balanced data
vif = [variance_inflation_factor(data_smote.values, data_smote.columns.get_loc(i))
       for i in data_smote.columns]
vif
```

```
[8.91488639686892, 41.95138589208644, 29.439659166987383, 9.321076032190051,
 18.073107065112527, 7.88968653431388, 38.13243240821064, 2.61807436913295,
 4.202219415722627, 20.898417802753006, 5.976908659263689, 10.856930462897152,
 1.2514060420970867, 20.23958581764367, 1.176425772463202, 6.444784613229281,
 6.980222815257359, 2.7742773520511372, 2.4906782119059176, 4.348667463801223,
 3.409678717638936, 1.9626453781659197, 2.1167419900555884, 1.1352046295467655,
 1.9935984979230046, 1.1029041559046275, 3.084861887885401, 1.9565486505075638,
 13.535498930988794, 1.7451075607624895, 4.64505815138509]
```

Compared with random undersampling alone, the multicollinearity has improved.
```python
# Feature selection with Lasso, using cross-validated LassoCV
from sklearn.linear_model import LassoCV

model_lasso = LassoCV(alphas=[0.1, 0.01, 0.005, 1], random_state=1, cv=5).fit(X_smote, y_smote)
# Features the model keeps (non-zero coefficients)
coef = pd.Series(model_lasso.coef_, index=X_smote.columns)
print(coef[coef != 0])
```

```
V1    -0.019851
V2     0.004587
V3     0.000523
V4     0.052236
V5    -0.000597
V6    -0.012474
V7     0.030960
V8    -0.012043
V9     0.007895
V10   -0.023509
V11    0.005633
V12    0.009648
V13   -0.036565
V14   -0.053919
V15    0.012297
V17   -0.009149
V18    0.030941
V20    0.010266
V21    0.013880
V22    0.019031
V23   -0.009253
V26   -0.068311
V27   -0.003680
V28    0.008911
dtype: float64
```

```python
# Rank feature importances with a random forest
from sklearn.ensemble import RandomForestClassifier

rfc_fea_model = RandomForestClassifier(random_state=1)
rfc_fea_model.fit(X_smote, y_smote)
a = pd.DataFrame()
a['feature'] = X_smote.columns
a['importance'] = rfc_fea_model.feature_importances_
a = a.sort_values('importance', ascending=False)
plt.figure(figsize=(20, 10))
plt.bar(a['feature'], a['importance'])
plt.title('the importance orders sorted by random forest')
plt.show()
a.cumsum()
```

| Feature added | Cumulative importance |
| --- | --- |
| V14 | 0.140330 |
| V12 | 0.274028 |
| V10 | 0.392515 |
| V17 | 0.501863 |
| V4 | 0.592110 |
| V11 | 0.680300 |
| V2 | 0.728448 |
| V3 | 0.770334 |
| V16 | 0.808083 |
| V7 | 0.842341 |
| V18 | 0.857924 |
| V8 | 0.869777 |
| V21 | 0.880431 |
| Amount | 0.890861 |
| V1 | 0.900571 |
| V9 | 0.909978 |
| V13 | 0.918623 |
| V19 | 0.926978 |
| V27 | 0.935072 |
| V5 | 0.942688 |
| Hour | 0.950275 |
| V20 | 0.957287 |
| V15 | 0.963873 |
| V6 | 0.970225 |
| V26 | 0.976519 |
| V28 | 0.982202 |
| V23 | 0.987770 |
| V22 | 0.992335 |
| V25 | 0.996295 |
| V24 | 1.000000 |
Again the two rankings differ substantially. To retain as much information as possible, we combine them with the same criterion: keep the random-forest features up to 95% cumulative importance, and add any feature Lasso kept that the forest cut. This selects 28 features, i.e. everything except V24 and V25.
```python
test_cols = X_smote.columns.drop(['V24', 'V25'])
Classifiers = {'LG': LogisticRegression(random_state=1),
               'KNN': KNeighborsClassifier(),
               'Bayes': GaussianNB(),
               'SVC': SVC(random_state=1, probability=True),
               'DecisionTree': DecisionTreeClassifier(random_state=1),
               'RandomForest': RandomForestClassifier(random_state=1),
               'Adaboost': AdaBoostClassifier(random_state=1),
               'GBDT': GradientBoostingClassifier(random_state=1),
               'XGboost': XGBClassifier(random_state=1),
               'LightGBM': LGBMClassifier(random_state=1)}
Y_pred, Accuracy_score = train_test(Classifiers, X_smote[test_cols], y_smote,
                                    X_test[test_cols], y_test)
print(Accuracy_score)
Y_pred.head()
```

[Per-model classification reports and plots omitted; the key figures on the 56961 test samples (56863 non-fraud, 98 fraud) are:]

| Model | Accuracy | Fraud precision | Fraud recall |
| --- | --- | --- | --- |
| LG | 0.977862 | 0.06 | 0.86 |
| KNN | 0.996138 | 0.29 | 0.83 |
| Bayes | 0.976335 | 0.06 | 0.84 |
| SVC | 0.985113 | 0.09 | 0.86 |
| DecisionTree | 0.995874 | 0.27 | 0.80 |
| RandomForest | 0.999350 | 0.81 | 0.82 |
| Adaboost | 0.977054 | 0.06 | 0.89 |
| GBDT | 0.989800 | 0.13 | 0.85 |
| XGboost | 0.999070 | 0.69 | 0.84 |
| LightGBM | 0.998508 | 0.54 | 0.83 |

Y_pred.head():

| LG | KNN | Bayes | SVC | DecisionTree | RandomForest | Adaboost | GBDT | XGboost | LightGBM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
The accuracies show a clear improvement after combining SMOTE oversampling with random undersampling: RandomForest reaches 99.94% accuracy, missing 13 fraud samples and flagging 18 non-fraud samples as fraud, and others such as XGBoost, LightGBM and KNN also exceed 99%. Next we build an ensemble from the most accurate models.
Because the sample is now large, parameter tuning is time-consuming, and the results are already good, we skip the grid search here; with enough time, the models can be tuned as before.
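The ensemble code is again not shown; a hedged sketch, assuming soft voting over the three most accurate models above (the member list is an assumption based on the accuracy table):

```python
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(
    estimators=[('RandomForest', RandomForestClassifier(random_state=1)),
                ('XGboost', XGBClassifier(random_state=1)),
                ('LightGBM', LGBMClassifier(random_state=1))],
    voting='soft')  # average predicted probabilities across the three members
ensemble.fit(X_smote[test_cols], y_smote)
print(confusion_matrix(y_test, ensemble.predict(X_test[test_cols])))
```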
[Prediction preview omitted: ten further test rows, predicted 0 by nearly every model.]
In the end, the chosen models reach almost 100% accuracy: 17 fraud samples go undetected and only 33 non-fraud samples are flagged as fraudulent, a large improvement in accuracy over random undersampling alone. There is still room to improve the detection of fraud samples!
Summary