Credit Card Fraud Detection
Credit card fraud detection is a Kaggle project based on 2013 transactions made by European credit-card holders; details at https://www.kaggle.com/mlg-ulb/creditcardfraud
The goal is to predict, for each transaction, whether it is fraudulent. What distinguishes it from most machine-learning projects is the extreme imbalance between positive and negative samples, so handling that imbalance is the first problem the feature engineering must solve.
On the other hand, preprocessing is lighter than usual: there are no missing values to handle, and most features have already been centered.
Project Background and a First Look at the Data
```python
# Import the basic libraries; model libraries are imported later as needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Widen the display so long rows are not truncated
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width', 10000)

# Load the data and check its basic structure
data = pd.read_csv('D:/數據分析/kaggle/信用卡欺詐/creditcard.csv')
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns): Time, V1-V28 and Amount (float64),
Class (int64); every column is 284807 non-null
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
```

```python
data.shape      # (284807, 31)
data.describe()
```

[data.describe() output omitted: count, mean, std, min, quartiles and max for all 31 columns. V1-V28 have means on the order of 1e-15 and standard deviations near 1; Amount has mean 88.35, std 250.12 and max 25691.16; Class has mean 0.001727.]
[data.head() and data.tail() output omitted: the first five transactions (Time 0-2 s, e.g. Amount 149.62, Class 0) and the last five (Time 172786-172792 s).]
- The data contain 284807 samples with 30 feature attributes plus a class label. The data are complete, with no missing values, so no imputation is needed. The non-anonymized features are the time, the transaction amount and the class; the 28 anonymized features V1-V28 have means essentially equal to 0 and standard deviations around 1, which shows they have already been normalized. (The Kaggle description says the anonymized features were de-identified and transformed with PCA. The Time feature is the number of seconds between each transaction and the first one in the dataset, i.e. time since collection started; the data cover two days, and the maximum of 172792 is indeed roughly 48 hours.)
- The class imbalance: in the descriptive statistics the Class column has mean 0.001727, so the vast majority of samples are 0. This can also be checked with data.Class.value_counts(): 284315 normal transactions versus only 492 fraudulent ones. Handling this extreme imbalance is the main task of the feature engineering.
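A quick check of the counts quoted above:

```python
print(data.Class.value_counts())
# 0    284315
# 1       492
```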
Exploratory Data Analysis (EDA)
* Univariate analysis
All attributes are numeric, so there is no need to separate numeric from categorical attributes or to re-encode categories. Below we analyze single attributes, starting with the target class.
The class distribution plot, together with the kurtosis and skewness, shows the sample imbalance more directly.
```python
# Bar plot of fraud vs non-fraud counts
sns.countplot('Class', data=data, color='blue')
plt.xlabel('values')
plt.ylabel('Counts')
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)')

print('Kurtosis:', data.Class.kurt())
print('Skewness:', data.Class.skew())
```

```
Kurtosis: 573.887842782971
Skewness: 23.99757931064749
```

Next we analyze the two attributes that were not standardized: Time and Amount.
```python
# Distributions of transaction time and amount; the amount is fitted with both
# a normal and a lognormal distribution
import scipy.stats as st

fig, ax = plt.subplots(1, 3, figsize=(18, 4))
sns.distplot(data.Amount, color='blue', ax=ax[0], kde=False, fit=st.norm)
ax[0].set_title('Distribution of transaction amount_normal')
sns.distplot(data.Amount, color='blue', ax=ax[1], fit=st.lognorm)
ax[1].set_title('Distribution of transaction amount_lognorm')
sns.distplot(data.Time, color='r', ax=ax[2])
ax[2].set_title('Distribution of transaction time')

print(data.Amount.value_counts())
```

```
1.00     13688
1.98      6044
0.89      4872
9.99      4747
15.00     3280
         ...
1080.06      1
Name: Amount, Length: 32767, dtype: int64
```

```python
# Share of transactions below/above various amounts
for limit in [5, 10, 20, 30, 50, 100]:
    print('the ratio of Amount<%d:' % limit,
          (data.Amount < limit).sum() / len(data.Amount))
print('the ratio of Amount>5000:', (data.Amount > 5000).sum() / len(data.Amount))
```

```
the ratio of Amount<5: 0.2368726892246328
the ratio of Amount<10: 0.3416840175978821
the ratio of Amount<20: 0.481476929991187
the ratio of Amount<30: 0.562022703093674
the ratio of Amount<50: 0.6660791342909409
the ratio of Amount<100: 0.7985126770058321
the ratio of Amount>5000: 0.00019311323106524768
```

For Amount, the vast majority of transactions are small, under 50 dollars, with a few large values such as 1080 dollars. On the time axis there are two troughs near 15000 s and 100000 s, i.e. about 4 and 27 hours after collection began; presumably these correspond to 3-4 a.m., which matches reality, since few people shop in the middle of the night. Because the other, anonymized attributes have already been centered, and the normal fit to Amount is better than the log fit, we standardize Amount as well; likewise, Time is converted to hours and then standardized.
```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# reshape(-1, 1): -1 infers the number of rows, 1 means a single column
data['Amount'] = sc.fit_transform(data.Amount.values.reshape(-1, 1))
# Hours since collection started, then the hour on a 24-hour clock,
# since transaction density is periodic by hour of day
data['Hour'] = data.Time.apply(lambda x: divmod(x, 3600)[0])
data['Hour'] = data.Hour.apply(lambda x: divmod(x, 24)[1])
data['Hour'] = sc.fit_transform(data['Hour'].values.reshape(-1, 1))
data.drop(columns='Time', inplace=True)
data.head().append(data.tail())
```

[head/tail preview omitted: V1-V28 are unchanged, Amount is now standardized (e.g. 149.62 becomes 0.244964), and the new standardized Hour column replaces Time.]
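To make the two divmod steps concrete, a small worked example: 96000 seconds is 26 full hours after collection began, i.e. hour 2 on the 24-hour clock.

```python
seconds = 96000
hour_since_start = divmod(seconds, 3600)[0]    # 26 full hours since collection began
hour_of_day = divmod(hour_since_start, 24)[1]  # hour 2 on the 24-hour clock
```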
Because the data have been standardized, the anonymized features have relatively small skewness but large kurtosis. V5, V7, V8, V20, V21, V23, V27 and V28 have particularly high kurtosis, meaning their distributions are very concentrated; the remaining features are spread more evenly.
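The kurtosis figures behind this claim can be checked directly, e.g.:

```python
# Kurtosis of the most peaked anonymized features named above
print(data[['V5', 'V7', 'V8', 'V20', 'V21', 'V23', 'V27', 'V28']].kurt())
```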
* Pairwise relationships
```python
# Correlation analysis
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), square=True, cmap='coolwarm_r', annot_kws={'size': 20})
plt.show()
data.corr()
```

[31×31 correlation matrix omitted: the off-diagonal correlations among V1-V28 are on the order of 1e-16, i.e. effectively zero, and the correlations between each feature and Class are all below 0.36 in absolute value.]
Numerically, there is no obvious correlation among the feature variables, and because of the class imbalance the correlations between the features and the target class are also weak. This is not what we want, so we first balance the positive and negative samples.
Before balancing, however, we must split the data into training and test sets. This keeps the evaluation fair: the test must run on original data, since we cannot test a synthetic training set against a synthetic test set.
To keep the class distributions of the training and test sets consistent, we use StratifiedKFold for the split, as sketched below.
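The split itself is not shown in the original code; a minimal sketch consistent with the class counts reported later (227452/394 in training, 56863/98 in test, i.e. one fold out of five) might look like this. The exact call is an assumption.

```python
from sklearn.model_selection import StratifiedKFold

X = data.drop(columns='Class')
y = data['Class']

# One fold of a 5-fold stratified split gives an 80/20 train/test split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
train_idx, test_idx = next(skf.split(X, y))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```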
For extreme class imbalance there are two main strategies: undersampling and oversampling.
- Undersampling randomly draws from the majority class as many samples as the minority class has. With extreme imbalance this tends to underfit, because the resulting dataset is tiny.
- Oversampling generates from the minority class as many samples as the majority class has. There are two methods: random oversampling and SMOTE.
- Random oversampling draws samples from the minority class with replacement until it matches the majority class in size; this easily overfits.
- SMOTE (Synthetic Minority Oversampling Technique) improves on the overfitting of random oversampling. Split the samples into the majority and minority classes; pick a random sample from the minority class, find its nearest neighbors within the minority class, and generate new samples at points on the line segments between the chosen sample and those neighbors; repeat until the minority class reaches the majority-class size. A toy sketch of this interpolation step follows the list.
Counter(y_smotesampled) can be used to verify that the two classes end up with equal counts.
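To make the interpolation step concrete, here is a toy sketch (not the imblearn implementation; X_minority is assumed to be a NumPy array of minority-class rows):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one(X_minority, k=5, rng=np.random.default_rng(1)):
    """Generate one synthetic minority sample by linear interpolation."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.integers(len(X_minority))            # pick a random minority sample
    _, idx = nn.kneighbors(X_minority[i:i + 1])
    j = rng.choice(idx[0][1:])                   # one of its k nearest neighbors (skip itself)
    lam = rng.random()                           # random position on the segment
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])
```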
Below we process the samples with two schemes, random undersampling alone and SMOTE combined with random undersampling, and compare the resulting distributions.
```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from collections import Counter

X = X_train.copy()
y = y_train.copy()
print('Imbalanced samples: ', Counter(y))

rus = RandomUnderSampler(random_state=1)
X_rus, y_rus = rus.fit_resample(X, y)
print('Random under sample: ', Counter(y_rus))

ros = RandomOverSampler(random_state=1)
X_ros, y_ros = ros.fit_resample(X, y)
print('Random over sample: ', Counter(y_ros))

# SMOTE up to a 1:2 minority/majority ratio, then undersample to parity
smote = SMOTE(random_state=1, sampling_strategy=0.5)
X_smote, y_smote = smote.fit_resample(X, y)
print('SMOTE: ', Counter(y_smote))
under = RandomUnderSampler(sampling_strategy=1)
X_smote, y_smote = under.fit_resample(X_smote, y_smote)
print('SMOTE + under: ', Counter(y_smote))
```

```
Imbalanced samples:  Counter({0: 227452, 1: 394})
Random under sample:  Counter({0: 394, 1: 394})
Random over sample:  Counter({0: 227452, 1: 227452})
SMOTE:  Counter({0: 227452, 1: 113726})
SMOTE + under:  Counter({0: 113726, 1: 113726})
```

As the counts show, undersampling brings both classes down to 394 samples, while random oversampling brings both up to 227452; SMOTE at ratio 0.5 followed by undersampling yields 113726 of each. Below we analyze each balanced sample.
Random undersampling
```python
data_rus = pd.concat([X_rus, y_rus], axis=1)

# Distribution of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

# Box plot of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(sns.boxplot, 'value', color='lightskyblue')

# Violin plot of each numeric feature
f = pd.melt(data_rus, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(sns.violinplot, 'value', color='lightskyblue')
```

The distribution plots show that most values are concentrated, but outliers exist (the box and violin plots make this clearer). Outliers can mislead the model, so we remove them. There are two ways to flag outliers: assuming normality and using the quartiles. The normality approach applies the $3\sigma$ rule; the quartile approach flags values beyond a fixed multiple of the interquartile range above the third quartile or below the first.
```python
def outlier_process(data, column):
    Q1 = data[column].quantile(q=0.25)
    Q3 = data[column].quantile(q=0.75)
    low_whisker = Q1 - 3 * (Q3 - Q1)
    high_whisker = Q3 + 3 * (Q3 - Q1)
    # Drop the outliers
    data_drop = data[(data[column] >= low_whisker) & (data[column] <= high_whisker)]
    # Box plots before and after removal
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    sns.boxplot(y=data[column], ax=ax1, color='lightskyblue')
    ax1.set_title('before deleting outlier ' + column)
    sns.boxplot(y=data_drop[column], ax=ax2, color='lightskyblue')
    ax2.set_title('after deleting outlier ' + column)
    return data_drop

numerical_columns = data_rus.columns.drop('Class')
for col_name in numerical_columns:
    data_rus = outlier_process(data_rus, col_name)
```
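For comparison, the $3\sigma$ rule mentioned above could look like this (a sketch; the notebook itself uses the IQR whiskers):

```python
def outlier_3sigma(data, column):
    # Keep rows within three standard deviations of the mean (assumes rough normality)
    mu, sigma = data[column].mean(), data[column].std()
    return data[(data[column] - mu).abs() <= 3 * sigma]
```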
Some attributes, such as Hour, have fairly concentrated distributions and barely change after the cleanup.
With the dirty data removed, we probe the correlations between variables again.
```python
# Heat maps before and after resampling
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))
sns.heatmap(data.corr(), cmap='coolwarm_r', ax=ax1, vmax=0.8)
ax1.set_title('the relationship on imbalanced samples')
sns.heatmap(data_rus.corr(), cmap='coolwarm_r', ax=ax2, vmax=0.8)
ax2.set_title('the relationship on random under samples')

# Correlation between each numeric attribute and Class
data_rus.corr()['Class'].sort_values(ascending=False)
```

```
Class     1.000000
V4        0.722609
V11       0.694975
V2        0.481190
V19       0.245209
V20       0.151424
V21       0.126220
Amount    0.100853
V26       0.090654
V27       0.074491
V8        0.059647
V28       0.052788
V25       0.040578
V22       0.016379
V23      -0.003117
V15      -0.009488
V13      -0.055253
V24      -0.070806
Hour     -0.196789
V5       -0.383632
V6       -0.407577
V1       -0.423665
V18      -0.462499
V7       -0.468273
V17      -0.561113
V3       -0.561767
V9       -0.562542
V16      -0.592382
V10      -0.629362
V12      -0.691652
V14      -0.751142
Name: Class, dtype: float64
```

The comparison shows that after resampling the correlations between the numeric attributes and the class label increase markedly. By rank, the features most positively correlated with Class are V4, V11 and V2, and the most negatively correlated include V14, V10, V12, V3, V7 and V9. We plot these features against the target below.
```python
# Positively correlated attributes vs Class
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))
for ax, fea in zip((ax1, ax2, ax3), ['V4', 'V11', 'V2']):
    sns.violinplot(x='Class', y=fea, data=data_rus,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax)
    ax.set_title(fea + ' vs Class Positive Correlation')

# Negatively correlated attributes vs Class
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(24, 12))
for ax, fea in zip((ax1, ax2, ax3, ax4, ax5, ax6), ['V14', 'V10', 'V12', 'V3', 'V7', 'V9']):
    sns.violinplot(x='Class', y=fea, data=data_rus,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax)
    ax.set_title(fea + ' vs Class Negative Correlation')

# Remaining attributes vs Class
other_fea = list(data_rus.columns.drop(['V11', 'V4', 'V2', 'V17', 'V14',
                                        'V12', 'V10', 'V7', 'V3', 'V9', 'Class']))
fig, ax = plt.subplots(5, 4, figsize=(24, 36))
for i, fea in enumerate(other_fea):
    row, col = divmod(i, 4)
    sns.violinplot(x='Class', y=fea, data=data_rus,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax[row, col])
    ax[row, col].set_title(fea)
```

The violin plots show that the value distributions of these attributes really do differ between the fraud and non-fraud classes, while the remaining attributes differ relatively little between the two.
Out of curiosity about how amount and time relate to Class: normal transactions involve smaller, more concentrated amounts, while fraudulent ones tend to be larger and more spread out; and normal transactions span a narrower range of hours than fraudulent ones, so fraud is more likely while people are asleep.
Having looked at the features against the class, we now analyze the relationships among the features themselves. The heat map suggests the attributes are correlated; below we check specifically for multicollinearity.
```python
sns.set()
sns.pairplot(data[list(data_rus.columns)], kind='scatter', diag_kind='kde')
plt.show()
```

The pair plot shows these attributes are not mutually independent; some exhibit strong linear correlation. We check further with the variance inflation factor (VIF).
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(data_rus.values, data_rus.columns.get_loc(i))
       for i in data_rus.columns]
vif
```

```
[12.662373162589994, 20.3132501576979, 26.78027354608725, 9.970255022795625,
 23.531563683157597, 3.4386660732204946, 67.84989394913013, 5.76519495696649,
 7.129002458395831, 23.226754020950764, 11.753104213590975, 29.49673779700361,
 1.3365891898690718, 21.57973674600878, 1.2669840488461022, 27.61485162786757,
 31.081940593780782, 14.088642210869459, 2.502857511412321, 4.96077803555917,
 5.169599871511768, 3.1235143157354583, 2.828445638986856, 1.1937601054384332,
 1.628451339236206, 1.1966413137632343, 1.959903999050125, 1.4573293665681395,
 6.314999796714301, 2.0990707198901117, 4.802392100187543]
```

A common rule of thumb: when $VIF < 10$ the variable has no multicollinearity with the others; when $10 < VIF < 100$ there is fairly strong multicollinearity; and when $VIF > 100$ the multicollinearity is severe. By these values the variables do exhibit multicollinearity, i.e. redundant information, so feature extraction or feature selection is needed.
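For reference, the VIF of feature $i$ comes from regressing that feature on all the others, where $R_i^2$ is the coefficient of determination of that regression:

$$VIF_i = \frac{1}{1 - R_i^2}$$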
Feature extraction and feature selection are two approaches to dimensionality reduction: extracted features are mappings of the original features, while selected features are a subset of them. Principal component analysis and linear discriminant analysis are two classic extraction methods.
Feature selection: once the data are processed, meaningful features must be chosen as inputs to the learning algorithms. Two criteria guide the choice:
- Whether the feature varies: if a feature's variance is close to 0, the samples barely differ on it, so it contributes nothing to discriminating between them.
- Correlation with the target: features highly correlated with the target should be preferred.
By form, feature selection methods fall into three kinds:
1. Filter methods: score each feature by dispersion or correlation, then keep features above a threshold or the top-k (variance threshold, correlation coefficient, chi-squared test, mutual information).
2. Wrapper methods: repeatedly select or drop groups of features according to an objective function, usually predictive performance (recursive feature elimination).
3. Embedded methods: first train a model to obtain weight coefficients for the features, then select features by coefficient magnitude; similar to filter methods, but feature quality is determined by training (penalty-based selection, tree-based selection).
Ridge regression and Lasso are two feature-selection approaches for linear models; both add a regularization term to prevent overfitting. Ridge adds an L2-norm penalty, while Lasso adds an L1-norm penalty. Lasso can drive the coefficients of weak features to exactly 0, yielding a sparse solution, i.e. dimensionality reduction happens during training.
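The two objectives, for comparison:

$$\text{Ridge:}\ \min_w \|y - Xw\|_2^2 + \alpha \|w\|_2^2 \qquad\qquad \text{Lasso:}\ \min_w \|y - Xw\|_2^2 + \alpha \|w\|_1$$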
For tree models, a random forest classifier can rank the feature importances, which also serves for screening.
We first apply the two selection methods separately and compare the features they pick; a sketch of the calls involved follows.
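The selection code for the undersampled data is not shown in the original; a minimal sketch mirroring the calls applied later to the SMOTE data (X_fea and y_fea are illustrative names) would be:

```python
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier

X_fea = data_rus.drop(columns='Class')
y_fea = data_rus['Class']

# Lasso keeps the features with non-zero coefficients
model_lasso = LassoCV(alphas=[0.1, 0.01, 0.005, 1], random_state=1, cv=5).fit(X_fea, y_fea)
print(pd.Series(model_lasso.coef_, index=X_fea.columns).loc[lambda s: s != 0])

# Random forest ranks features by impurity-based importance
rfc = RandomForestClassifier(random_state=1).fit(X_fea, y_fea)
importances = pd.Series(rfc.feature_importances_, index=X_fea.columns)
print(importances.sort_values(ascending=False).cumsum())  # tabulated below
```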
| Feature added | Cumulative importance |
| --- | --- |
| V12 | 0.134515 |
| V10 | 0.268117 |
| V17 | 0.388052 |
| V14 | 0.503750 |
| V4 | 0.615973 |
| V11 | 0.687904 |
| V2 | 0.731179 |
| V16 | 0.774247 |
| V3 | 0.813352 |
| V7 | 0.846050 |
| V19 | 0.860559 |
| V18 | 0.873657 |
| V21 | 0.886416 |
| Amount | 0.896986 |
| V27 | 0.906369 |
| V20 | 0.915172 |
| V15 | 0.923700 |
| V23 | 0.931990 |
| V13 | 0.939298 |
| V8 | 0.946163 |
| V6 | 0.952952 |
| V26 | 0.959418 |
| V9 | 0.965869 |
| V1 | 0.971975 |
| V5 | 0.977297 |
| V22 | 0.982325 |
| V28 | 0.986990 |
| V24 | 0.991551 |
| Hour | 0.995913 |
| V25 | 1.000000 |
From these results the two rankings differ substantially. To retain as much information as possible, we combine the two: keep the random-forest features up to 95% cumulative importance, then add any feature Lasso kept that the forest cut. This selects 29 features, ['V14','V12','V10','V17','V4','V11','V3','V2','V7','V16','V18','Amount','V19','V20','V23','V21','V15','V9','V6','V27','V25','V5','V13','V22','Hour','V28','V1','V8','V26']; only V24 is dropped.
```python
# Train and evaluate with the selected features
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix
from sklearn.metrics import classification_report, roc_curve
from sklearn.linear_model import LogisticRegression        # logistic regression
from sklearn.neighbors import KNeighborsClassifier         # KNN
from sklearn.naive_bayes import GaussianNB                 # naive Bayes
from sklearn.svm import SVC                                # support vector classifier
from sklearn.tree import DecisionTreeClassifier            # decision tree
from sklearn.ensemble import RandomForestClassifier        # random forest
from sklearn.ensemble import AdaBoostClassifier            # AdaBoost
from sklearn.ensemble import GradientBoostingClassifier    # GBDT
from xgboost import XGBClassifier                          # XGBoost
from lightgbm import LGBMClassifier                        # LightGBM

Classifiers = {'LG': LogisticRegression(random_state=1),
               'KNN': KNeighborsClassifier(),
               'Bayes': GaussianNB(),
               'SVC': SVC(random_state=1, probability=True),
               'DecisionTree': DecisionTreeClassifier(random_state=1),
               'RandomForest': RandomForestClassifier(random_state=1),
               'Adaboost': AdaBoostClassifier(random_state=1),
               'GBDT': GradientBoostingClassifier(random_state=1),
               'XGboost': XGBClassifier(random_state=1),
               'LightGBM': LGBMClassifier(random_state=1)}

def train_test(Classifiers, X_train, y_train, X_test, y_test):
    y_pred = pd.DataFrame()
    Accuracy_Score = pd.DataFrame()
    for model_name, model in Classifiers.items():
        model.fit(X_train, y_train)
        y_pred[model_name] = model.predict(X_test)
        y_pred_pra = model.predict_proba(X_test)
        Accuracy_Score[model_name] = pd.Series(model.score(X_test, y_test))
        # Per-class precision, recall and F1
        print(model_name, '\n', classification_report(y_test, y_pred[model_name]))
        # Confusion matrix
        fig, ax = plt.subplots(1, 1)
        plot_confusion_matrix(model, X_test, y_test, labels=[0, 1], cmap='Blues', ax=ax)
        ax.set_title(model_name)
        # ROC curve
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        fpr, tpr, thres = roc_curve(y_test, y_pred_pra[:, -1])
        ax1.plot(fpr, tpr)
        ax1.set_title(model_name + ' ROC')
        ax1.set_xlabel('fpr')
        ax1.set_ylabel('tpr')
        # KS curve
        ax2.plot(thres[1:], tpr[1:])
        ax2.plot(thres[1:], fpr[1:])
        ax2.plot(thres[1:], tpr[1:] - fpr[1:])
        ax2.set_xlabel('threshold')
        ax2.legend(['tpr', 'fpr', 'tpr-fpr'])
        plt.sca(ax2)
        plt.gca().invert_xaxis()
        ax2.set_title(model_name + ' KS')
    return y_pred, Accuracy_Score

test_cols = X_rus.columns.drop('V24')
Y_pred, Accuracy_score = train_test(Classifiers, X_rus[test_cols], y_rus,
                                    X_test[test_cols], y_test)
Accuracy_score
```

[Per-model classification reports, confusion matrices and ROC/KS plots omitted; the key figures on the 56961 test samples (56863 non-fraud, 98 fraud) are:]

| Model | Accuracy | Fraud precision | Fraud recall |
| --- | --- | --- | --- |
| LG | 0.959446 | 0.04 | 0.94 |
| KNN | 0.973561 | 0.06 | 0.93 |
| Bayes | 0.962887 | 0.04 | 0.88 |
| SVC | 0.981479 | 0.08 | 0.91 |
| DecisionTree | 0.877899 | 0.01 | 0.95 |
| RandomForest | 0.967451 | 0.05 | 0.93 |
| Adaboost | 0.953407 | 0.03 | 0.95 |
| GBDT | 0.960482 | 0.04 | 0.92 |
| XGboost | 0.965678 | 0.04 | 0.93 |
| LightGBM | 0.969611 | 0.05 | 0.94 |
From these results, accuracy is high across the board, with SVC reaching 98.1479%. For imbalanced samples, however, recall matters more: SVC identifies fraudulent transactions with 91% recall and non-fraud with about 98%, misclassifying 1046 non-fraud and 9 fraud samples respectively.
Before tuning, we try a model ensemble with default parameters, combining the three most accurate models: KNN, SVC and LightGBM. The ensemble reaches 98% accuracy, misclassifying 6 fraud and 1114 non-fraud samples; fraud detection improves, but more non-fraud transactions are flagged as fraud.
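The ensemble code is not shown in the original; one plausible sketch, soft voting over the three models named above (the voting scheme is an assumption), reusing the classifiers imported earlier:

```python
from sklearn.ensemble import VotingClassifier

voting = VotingClassifier(
    estimators=[('KNN', KNeighborsClassifier()),
                ('SVC', SVC(random_state=1, probability=True)),
                ('LightGBM', LGBMClassifier(random_state=1))],
    voting='soft')  # average predicted probabilities across the three members
voting.fit(X_rus[test_cols], y_rus)
print(classification_report(y_test, voting.predict(X_test[test_cols])))
```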
Since the models above use default parameters, we consider a grid search for the best parameters, although a grid search will not necessarily find them.
With the initial models built, we now tune the hyperparameters. There are three tuning approaches: random search, grid search and Bayesian optimization. Random and grid search become extremely slow with many hyperparameters, because they try every combination; Bayesian optimization uses information from previous trials to keep updating a prior, needs fewer iterations, and is faster.
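As an illustration of the grid-search variant, a sketch for one model (the parameter grid here is illustrative; the actual grids are not shown in the original):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='recall')
grid.fit(X_rus[test_cols], y_rus)
print(grid.best_params_, grid.best_score_)
```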
Accuracy after tuning (same model order as before):

| LG | KNN | Bayes | SVC | DecisionTree | RandomForest | Adaboost | GBDT | XGboost | LightGBM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.972508 | 0.963747 | 0.962887 | 0.983129 | 0.897087 | 0.974474 | 0.962185 | 0.966152 | 0.960043 | 0.969769 |
In accuracy, most models improve after tuning, though a few regress. Recall shows no clear improvement and in some cases degrades noticeably, meaning fraud detection is still weak. The training set here is far smaller than the test set, which also hurts test performance.
Combining SMOTE oversampling with random undersampling
Again we analyze the training samples before fitting the models.
```python
data_smote = pd.concat([X_smote, y_smote], axis=1)

# Distribution of each numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, 'value')

# Box plot of each numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(sns.boxplot, 'value', color='lightskyblue')

# Violin plot of each numeric feature
f = pd.melt(data_smote, value_vars=X_train.columns)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(sns.violinplot, 'value', color='lightskyblue')

# Remove outliers column by column, printing the shrinking shape
numerical_columns = data_smote.columns.drop('Class')
for col_name in numerical_columns:
    data_smote = outlier_process(data_smote, col_name)
    print(data_smote.shape)
```

```
(214362, 31)
(213002, 31)
(212497, 31)
...
(166755, 31)
(156423, 31)
(156423, 31)
```
By rank, the features most positively correlated with Class are again V4, V11 and V2, and the most negatively correlated include V14, V10, V12, V3, V7 and V9. We plot these features against the target below.
```python
# Positively correlated attributes vs Class
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 6))
for ax, fea in zip((ax1, ax2, ax3), ['V4', 'V11', 'V2']):
    sns.violinplot(x='Class', y=fea, data=data_smote,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax)
    ax.set_title(fea + ' vs Class Positive Correlation')

# Negatively correlated attributes vs Class
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(24, 14))
for ax, fea in zip((ax1, ax2, ax3, ax4, ax5, ax6), ['V14', 'V10', 'V12', 'V3', 'V7', 'V9']):
    sns.violinplot(x='Class', y=fea, data=data_smote,
                   palette=['lightsalmon', 'lightskyblue'], ax=ax)
    ax.set_title(fea + ' vs Class Negative Correlation')

# Multicollinearity check on the SMOTE-balanced data
vif = [variance_inflation_factor(data_smote.values, data_smote.columns.get_loc(i))
       for i in data_smote.columns]
vif
```

```
[8.91488639686892, 41.95138589208644, 29.439659166987383, 9.321076032190051,
 18.073107065112527, 7.88968653431388, 38.13243240821064, 2.61807436913295,
 4.202219415722627, 20.898417802753006, 5.976908659263689, 10.856930462897152,
 1.2514060420970867, 20.23958581764367, 1.176425772463202, 6.444784613229281,
 6.980222815257359, 2.7742773520511372, 2.4906782119059176, 4.348667463801223,
 3.409678717638936, 1.9626453781659197, 2.1167419900555884, 1.1352046295467655,
 1.9935984979230046, 1.1029041559046275, 3.084861887885401, 1.9565486505075638,
 13.535498930988794, 1.7451075607624895, 4.64505815138509]
```

Compared with random undersampling alone, the multicollinearity has improved.
```python
# Feature selection with Lasso, using cross-validated LassoCV
from sklearn.linear_model import LassoCV

model_lasso = LassoCV(alphas=[0.1, 0.01, 0.005, 1], random_state=1, cv=5).fit(X_smote, y_smote)
# Features the model keeps (non-zero coefficients)
coef = pd.Series(model_lasso.coef_, index=X_smote.columns)
print(coef[coef != 0])
```

```
V1    -0.019851
V2     0.004587
V3     0.000523
V4     0.052236
V5    -0.000597
V6    -0.012474
V7     0.030960
V8    -0.012043
V9     0.007895
V10   -0.023509
V11    0.005633
V12    0.009648
V13   -0.036565
V14   -0.053919
V15    0.012297
V17   -0.009149
V18    0.030941
V20    0.010266
V21    0.013880
V22    0.019031
V23   -0.009253
V26   -0.068311
V27   -0.003680
V28    0.008911
dtype: float64
```

```python
# Rank feature importances with a random forest
from sklearn.ensemble import RandomForestClassifier

rfc_fea_model = RandomForestClassifier(random_state=1)
rfc_fea_model.fit(X_smote, y_smote)
a = pd.DataFrame()
a['feature'] = X_smote.columns
a['importance'] = rfc_fea_model.feature_importances_
a = a.sort_values('importance', ascending=False)
plt.figure(figsize=(20, 10))
plt.bar(a['feature'], a['importance'])
plt.title('the importance orders sorted by random forest')
plt.show()
a.cumsum()
```

| Feature added | Cumulative importance |
| --- | --- |
| V14 | 0.140330 |
| V12 | 0.274028 |
| V10 | 0.392515 |
| V17 | 0.501863 |
| V4 | 0.592110 |
| V11 | 0.680300 |
| V2 | 0.728448 |
| V3 | 0.770334 |
| V16 | 0.808083 |
| V7 | 0.842341 |
| V18 | 0.857924 |
| V8 | 0.869777 |
| V21 | 0.880431 |
| Amount | 0.890861 |
| V1 | 0.900571 |
| V9 | 0.909978 |
| V13 | 0.918623 |
| V19 | 0.926978 |
| V27 | 0.935072 |
| V5 | 0.942688 |
| Hour | 0.950275 |
| V20 | 0.957287 |
| V15 | 0.963873 |
| V6 | 0.970225 |
| V26 | 0.976519 |
| V28 | 0.982202 |
| V23 | 0.987770 |
| V22 | 0.992335 |
| V25 | 0.996295 |
| V24 | 1.000000 |
Again the two rankings differ substantially. To retain as much information as possible, we combine them with the same criterion: keep the random-forest features up to 95% cumulative importance, and add any feature Lasso kept that the forest cut. This selects 28 features, i.e. everything except V24 and V25.
```python
test_cols = X_smote.columns.drop(['V24', 'V25'])
Classifiers = {'LG': LogisticRegression(random_state=1),
               'KNN': KNeighborsClassifier(),
               'Bayes': GaussianNB(),
               'SVC': SVC(random_state=1, probability=True),
               'DecisionTree': DecisionTreeClassifier(random_state=1),
               'RandomForest': RandomForestClassifier(random_state=1),
               'Adaboost': AdaBoostClassifier(random_state=1),
               'GBDT': GradientBoostingClassifier(random_state=1),
               'XGboost': XGBClassifier(random_state=1),
               'LightGBM': LGBMClassifier(random_state=1)}
Y_pred, Accuracy_score = train_test(Classifiers, X_smote[test_cols], y_smote,
                                    X_test[test_cols], y_test)
print(Accuracy_score)
Y_pred.head()
```

[Per-model classification reports and plots omitted; the key figures on the 56961 test samples (56863 non-fraud, 98 fraud) are:]

| Model | Accuracy | Fraud precision | Fraud recall |
| --- | --- | --- | --- |
| LG | 0.977862 | 0.06 | 0.86 |
| KNN | 0.996138 | 0.29 | 0.83 |
| Bayes | 0.976335 | 0.06 | 0.84 |
| SVC | 0.985113 | 0.09 | 0.86 |
| DecisionTree | 0.995874 | 0.27 | 0.80 |
| RandomForest | 0.999350 | 0.81 | 0.82 |
| Adaboost | 0.977054 | 0.06 | 0.89 |
| GBDT | 0.989800 | 0.13 | 0.85 |
| XGboost | 0.999070 | 0.69 | 0.84 |
| LightGBM | 0.998508 | 0.54 | 0.83 |

Y_pred.head():

| LG | KNN | Bayes | SVC | DecisionTree | RandomForest | Adaboost | GBDT | XGboost | LightGBM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
The accuracies show a clear improvement after combining SMOTE oversampling with random undersampling: RandomForest reaches 99.94% accuracy, missing 13 fraud samples and flagging 18 non-fraud samples as fraud, and others such as XGBoost, LightGBM and KNN also exceed 99%. Next we build an ensemble from the most accurate models.
Because the sample is now large, parameter tuning is time-consuming, and the results are already good, we skip the grid search here; with enough time, the models can be tuned as before.
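The ensemble code is again not shown; a hedged sketch, assuming soft voting over the three most accurate models above (the member list is an assumption based on the accuracy table):

```python
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(
    estimators=[('RandomForest', RandomForestClassifier(random_state=1)),
                ('XGboost', XGBClassifier(random_state=1)),
                ('LightGBM', LGBMClassifier(random_state=1))],
    voting='soft')  # average predicted probabilities across the three members
ensemble.fit(X_smote[test_cols], y_smote)
print(confusion_matrix(y_test, ensemble.predict(X_test[test_cols])))
```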
[Prediction preview omitted: ten further test rows, predicted 0 by nearly every model.]
In the end, the chosen models reach almost 100% accuracy: 17 fraud samples go undetected and only 33 non-fraud samples are flagged as fraudulent, a large improvement in accuracy over random undersampling alone. There is still room to improve the detection of fraud samples!
Summary