Data Analysis Case Study: Credit Card Fraud Detection with a Random Forest Model

Published: 2023/12/20

Project background

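Credit card fraud is the deliberate use of forged or voided credit cards, the use of another person's card to obtain property, or malicious overdrawing on one's own card. It takes three main forms: impostor use of lost cards, fraudulent applications, and counterfeit cards. More than 60% of fraud cases involve counterfeit cards. These are typically the work of organized groups that cover the whole chain, from stealing card data and manufacturing fake cards to selling them and using them to commit fraud for large profits. Credit card fraud detection is therefore an important way for banks to reduce losses.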

Data description

A sample of the data is shown in the data.head() output in the next step.

Importing the data

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
    import numpy as np
    import warnings
    warnings.filterwarnings('ignore')

    data = pd.read_csv("creditcard.csv")
    data.head()

       Time         V1         V2         V3         V4         V5         V6         V7         V8         V9  ...        V21        V22        V23        V24        V25        V26        V27        V28  Amount  Class
    0   0.0  -1.359807  -0.072781   2.536347   1.378155  -0.338321   0.462388   0.239599   0.098698   0.363787  ...  -0.018307   0.277838  -0.110474   0.066928   0.128539  -0.189115   0.133558  -0.021053  149.62      0
    1   0.0   1.191857   0.266151   0.166480   0.448154   0.060018  -0.082361  -0.078803   0.085102  -0.255425  ...  -0.225775  -0.638672   0.101288  -0.339846   0.167170   0.125895  -0.008983   0.014724    2.69      0
    2   1.0  -1.358354  -1.340163   1.773209   0.379780  -0.503198   1.800499   0.791461   0.247676  -1.514654  ...   0.247998   0.771679   0.909412  -0.689281  -0.327642  -0.139097  -0.055353  -0.059752  378.66      0
    3   1.0  -0.966272  -0.185226   1.792993  -0.863291  -0.010309   1.247203   0.237609   0.377436  -1.387024  ...  -0.108300   0.005274  -0.190321  -1.175575   0.647376  -0.221929   0.062723   0.061458  123.50      0
    4   2.0  -1.158233   0.877737   1.548718   0.403034  -0.407193   0.095921   0.592941  -0.270533   0.817739  ...  -0.009431   0.798278  -0.137458   0.141267  -0.206010   0.502292   0.219422   0.215153   69.99      0

Checking the data size

The original data has 284,807 rows and 31 columns.

Data cleaning

Drop missing values and duplicate rows:

    # Drop rows with missing values
    data.dropna(inplace=True)
    # Drop duplicate rows
    data.drop_duplicates(inplace=True)
    data.shape

    (283726, 31)

After cleaning, 283,726 rows and 31 columns remain.

Data visualization

Plot the number of fraud and non-fraud records:

    # Count the records in each class (Series.value_counts replaces the deprecated pd.value_counts)
    count_classes = data['Class'].value_counts().sort_index()
    count_classes.plot(kind='bar')
    plt.title("Fraud class histogram")
    plt.xlabel("Class")
    plt.ylabel("Frequency")
    # Annotate each bar with its count
    for x, y in enumerate(count_classes.values):
        plt.text(x, y + 100, '%s' % y, ha='center', va='bottom')
    plt.show()

The raw data contains very few fraud cases: only 492, against 284,315 non-fraud samples. The two classes are heavily imbalanced.
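To put the imbalance in numbers, the short sketch below computes the fraud share and the ratio of non-fraud to fraud rows from the two counts quoted above. The variable names are illustrative only and do not come from the original notebook.

```python
# Class counts quoted in the text for the raw dataset
n_fraud = 492
n_normal = 284315
n_total = n_fraud + n_normal

fraud_share = n_fraud / n_total        # fraction of all transactions that are fraud
imbalance_ratio = n_normal / n_fraud   # non-fraud rows per fraud row

print(f"total rows: {n_total}")
print(f"fraud share: {fraud_share:.4%}")
print(f"non-fraud rows per fraud row: {imbalance_ratio:.1f}")
```

The total matches the 284,807 rows reported when the data was loaded.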

Data preprocessing

Because the two classes are so imbalanced, we use undersampling: we randomly draw from class 0 as many samples as there are in class 1 and train the model on that balanced subset. We then compare the results obtained without and with undersampling.

    # Prepare the features and labels
    X = data.drop('Class', axis=1)
    y = data['Class']

    # Count the fraud records and collect their index labels
    number_records_fraud = len(data[data.Class == 1])
    fraud_indices = np.array(data[data.Class == 1].index)
    # Index labels of the non-fraud records
    normal_indices = data[data.Class == 0].index
    # Randomly pick as many non-fraud labels as there are fraud records
    random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
    random_normal_indices = np.array(random_normal_indices)
    # Combine the two equally sized sets of labels
    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
    # Select rows by label: .loc, not .iloc, because drop_duplicates left gaps in the index
    under_sample_data = data.loc[under_sample_indices, :]
    # Undersampled features and labels (axis must be passed by keyword in recent pandas)
    X_undersample = under_sample_data.drop('Class', axis=1)
    y_undersample = under_sample_data['Class']

This step can also be done with an existing library routine instead of by hand, for example sklearn.utils.resample in scikit-learn.
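As a minimal sketch of the library route, the example below undersamples the majority class with `sklearn.utils.resample`. The small `toy` DataFrame is a made-up stand-in for the credit card data, and this is only one possible approach; the separate imbalanced-learn package also offers a `RandomUnderSampler` for the same purpose.

```python
import pandas as pd
from sklearn.utils import resample

# Small synthetic stand-in for the credit card data (hypothetical values)
toy = pd.DataFrame({
    "Amount": range(100),
    "Class": [1] * 5 + [0] * 95,
})

fraud = toy[toy["Class"] == 1]
normal = toy[toy["Class"] == 0]

# Draw, without replacement, as many non-fraud rows as there are fraud rows
normal_down = resample(normal, replace=False, n_samples=len(fraud), random_state=42)

# Stack the two equally sized groups into one balanced frame
balanced = pd.concat([fraud, normal_down])
class_counts = balanced["Class"].value_counts().to_dict()
print(class_counts)
```

Fixing `random_state` makes the draw reproducible, which the hand-written `np.random.choice` version does not do unless a seed is set separately.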

Building the model

First, split the data into training and test sets:

    # Split the original dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)
    # Split the undersampled dataset
    X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
        X_undersample, y_undersample, test_size=0.3, random_state=42)

Next, build the random forest model:

    # Train the model on the original data
    rfc1 = RandomForestClassifier(n_estimators=100)
    rfc1.fit(X_train, y_train)
    y_pred = rfc1.predict(X_test)
    print('Model accuracy:', accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

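This first model is trained on the original data, without undersampling. Its accuracy is 99.9%, which looks excellent, but the number is misleading. The classification report shows that because class 0 has so many samples, its precision and recall are both 100%, while the scores for class 1 are lower. Recall for class 1 is especially weak, at only 74%. Next we train on the undersampled data.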
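The sketch below illustrates why high accuracy can coexist with poor fraud recall. The confusion-matrix counts are hypothetical, chosen only so that fraud recall lands near the 74% mentioned above; they are not the article's actual results.

```python
# Hypothetical confusion-matrix counts (NOT the article's results),
# picked so that fraud recall comes out near 74%
tn, fp = 85000, 10   # actual non-fraud: predicted non-fraud / predicted fraud
fn, tp = 35, 100     # actual fraud:     predicted non-fraud / predicted fraud

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_fraud = tp / (tp + fp)   # of the predicted frauds, how many are real
recall_fraud = tp / (tp + fn)      # of the real frauds, how many are caught

# Accuracy of a trivial model that labels every transaction as non-fraud
baseline_accuracy = (tn + fp) / total

print(f"accuracy:          {accuracy:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"fraud precision:   {precision_fraud:.4f}")
print(f"fraud recall:      {recall_fraud:.4f}")
```

Under these assumed counts, even the trivial all-non-fraud model scores above 99.8% accuracy, so accuracy alone says very little about how well fraud is detected.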

    # Train the model on the undersampled data
    rfc2 = RandomForestClassifier(n_estimators=100)
    rfc2.fit(X_train_undersample, y_train_undersample)
    y_pred = rfc2.predict(X_test_undersample)
    print('Model accuracy:', accuracy_score(y_test_undersample, y_pred))
    print(confusion_matrix(y_test_undersample, y_pred))
    print(classification_report(y_test_undersample, y_pred))

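The undersampled model reaches 95% accuracy, which is still high, and the classification report shows high precision and recall for both classes, so the model handles the fraud class far better. Note that these figures are measured on the balanced undersampled test split, so they are not directly comparable with the scores on the original, imbalanced test set.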
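One way to make the two models more comparable, offered here only as a sketch and not as the original author's method, is to split the data first, undersample the training part alone, and evaluate on the untouched imbalanced test part. The example uses synthetic data from `make_classification` as a stand-in for creditcard.csv; all names and parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for creditcard.csv (small minority of positives)
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.98, 0.02], random_state=42)

# Split first, keeping the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Undersample the training part only
rng = np.random.default_rng(42)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
train_idx = np.concatenate([pos_idx, neg_keep])

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train[train_idx], y_train[train_idx])

# Evaluate on the untouched, still-imbalanced test part
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

Because the test part keeps its original class ratio, the precision reported for the minority class reflects how many false alarms the model would raise on realistic data.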

總結(jié)

在分類問題中，目標(biāo)分類的種類比例差別過大，就必須要進(jìn)行處理，要么欠采樣，要么過采樣，這里我們講的是欠采樣。這樣我們訓(xùn)練出來的模型才更有價(jià)值，適用性才更高。?

總結(jié)

以上是生活随笔為你收集整理的数据分析案例-基于随机森林模型对信用卡欺诈检测的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇： Linux—文件系统与磁盘管理（后）
下一篇：实验：温湿度数据oled显示