Loan Default Prediction: An End-to-End ML Project with Real Bank Data (Part 1)
Table of Contents
· Introduction · About the Dataset · Import Dataset into the Database · Connect Python to MySQL Database · Feature Extraction · Feature Transformation · Modeling · Conclusion and Future Directions · About Me
Note: If you are interested in the details beyond this post, the Berka Dataset, all the code, and notebooks can be found in my GitHub Page.
Introduction
For banks, it is always an interesting and challenging problem to predict how likely a client is to default on a loan when only a handful of information is available. In the modern era, the data science teams at banks build predictive models using machine learning. The datasets they use are most likely proprietary and are usually collected internally through their daily business. In other words, there are not many real-world datasets we can use if we want to work on such financial projects. Fortunately, there is an exception: the Berka Dataset.
About the Dataset
The Berka Dataset, or the PKDD’99 Financial Dataset, is a collection of real anonymized financial information from a Czech bank, used for PKDD’99 Discovery Challenge. The dataset can be accessed from my GitHub page.
The dataset consists of 8 raw files, each containing one table:
account (4500 objects in the file ACCOUNT.ASC) — each record describes static characteristics of an account.
client (5369 objects in the file CLIENT.ASC) — each record describes characteristics of a client.
disposition (5369 objects in the file DISP.ASC) — each record relates together a client with an account i.e. this relation describes the rights of clients to operate accounts.
permanent order (6471 objects in the file ORDER.ASC) — each record describes characteristics of a payment order.
transaction (1056320 objects in the file TRANS.ASC) — each record describes one transaction on an account.
loan (682 objects in the file LOAN.ASC) — each record describes a loan granted for a given account.
credit card (892 objects in the file CARD.ASC) — each record describes a credit card issued to an account.
demographic data (77 objects in the file DISTRICT.ASC) — each record describes demographic characteristics of a district.
- Each account has both static characteristics (e.g. date of creation, address of the branch) given in the relation “account” and dynamic characteristics (e.g. payments debited or credited, balances) given in the relations “permanent order” and “transaction”.
- The relation “client” describes the characteristics of the persons who can manipulate the accounts.
- One client can have multiple accounts, and multiple clients can operate a single account; clients and accounts are related together by the relation “disposition”.
- The relations “loan” and “credit card” describe some services which the bank offers to its clients.
- More than one credit card can be issued to an account.
- At most one loan can be granted for an account.
- The relation “demographic data” gives some publicly available information about the districts (e.g. the unemployment rate); additional information about the clients can be deduced from this.
Import Dataset into the Database
This is an optional step, since the raw files contain only delimiter-separated values and can therefore be imported directly into data frames using pandas.
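For reference, a minimal pandas-only sketch might look like the snippet below; the data/ path is an assumption, and the ';' separator matches the field delimiter used in the SQL load statements later in this post.

import pandas as pd

# Read one of the semicolon-delimited .asc files directly into a DataFrame.
# The path is assumed to point at the unpacked Berka files.
account = pd.read_csv('data/account.asc', sep=';')
print(account.head())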
Here I wrote SQL queries to import the raw data files into a MySQL database, which makes simple and fast data manipulations (e.g. select, join, and aggregation) convenient.
/* Create Bank Database */
CREATE DATABASE IF NOT EXISTS bank;
USE bank;

/* Create Account Table */
CREATE TABLE IF NOT EXISTS Account (
    account_id INT,
    district_id INT,
    frequency VARCHAR(20),
    `date` DATE
);

/* Load Data into the Account Table */
LOAD DATA LOCAL INFILE '~/Documents/DataScience/ds_projects/loan_default_prediction/data/account.asc'
INTO TABLE Account
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(account_id, district_id, frequency, @c4)
SET `date` = STR_TO_DATE(@c4, '%y%m%d');

Above is a code snippet showing how to create the bank database and import the Account table. It includes three steps:
- Create and use the database
- Create a table
- Load data into the table
There should not be any trouble in the first two steps if you are familiar with MySQL and database systems. For the “Load data” step, you need to make sure that you have enabled LOCAL_INFILE in MySQL. Detailed instructions can be found in this thread.
By repeating step 2 and step 3 on each table, all the data can be imported into the database.
Connect Python to MySQL Database
Again, if you choose to import the data directly into Python using Pandas, this step is optional. But if you have created the database and become familiar with the dataset through some SQL data manipulations, the next step is to transfer the prepared tables into Python and perform data analysis there. One way is to use the MySQL Connector for Python to execute SQL queries in Python and make Pandas DataFrames using the results. Here is my approach:
import mysql.connector
import pandas as pd


class MysqlIO:
    """Connect to the MySQL server with Python and execute SQL commands."""

    def __init__(self, database='test'):
        try:
            # Change the host, user and password as needed
            connection = mysql.connector.connect(host='localhost',
                                                 database=database,
                                                 user='Zhou',
                                                 password='jojojo',
                                                 use_pure=True)
            if connection.is_connected():
                db_info = connection.get_server_info()
                print("Connected to MySQL Server version", db_info)
                print("You're connected to database:", database)
                self.connection = connection
        except Exception as e:
            print("Error while connecting to MySQL", e)

    def execute(self, query, header=False):
        """Execute SQL commands and return the retrieved records."""
        cursor = self.connection.cursor(buffered=True)
        cursor.execute(query)
        try:
            record = cursor.fetchall()
            if header:
                header = [i[0] for i in cursor.description]
                return {'header': header, 'record': record}
            else:
                return record
        except:
            pass

    def to_df(self, query):
        """Return the retrieved SQL records as a pandas DataFrame."""
        res = self.execute(query, header=True)
        df = pd.DataFrame(res['record'])
        df.columns = res['header']
        return df

After modifying the database info such as host, database, user, and password, we can initiate a connection instance, execute a query, and convert the result into a Pandas DataFrame:
# Create a connection instance
db = MysqlIO()

# Call the .to_df method to execute the query and make a dataframe from the results.
query = """
select *
from Loan
join Account using(account_id);
"""
df = db.to_df(query)

Even though this is an optional step, it is advantageous in terms of speed and convenience, and it is well suited for experimentation compared to importing the files directly into Pandas DataFrames. Unlike other ML projects where we are only given a csv file (1 table), this dataset is quite complicated, and a lot of useful information is hidden in the connections between tables, which is another reason why I wanted to introduce loading the data into a database first.
Now the data is in the MySQL server and we have connected it to Python, so we can smoothly access the data in data frames. The next steps are to extract features from the tables, transform the variables, load them into one array, and train a machine learning model.
Feature Extraction
Since predicting loan default is a binary classification problem, we first need to know how many instances are in each class. Looking at the status variable in the Loan table, there are 4 distinct values: A, B, C, and D.
- A: Contract finished, no problems.
- B: Contract finished, loan not paid.
- C: Running contract, okay so far.
- D: Running contract, client in debt.
According to the definitions from the dataset description, we can make them into binary classes: good (A or C) and bad (B or D). There are 606 loans that fall into the “good” class and 76 of them are in the “bad” class.
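As a minimal sketch (assuming the Loan table has been pulled into a DataFrame df_loan with a status column, as in the earlier query), the binary label can be built with a simple mapping:

# Map the four status codes onto a binary label:
# A, C -> 0 (good), B, D -> 1 (bad / default)
df_loan['default'] = df_loan['status'].map({'A': 0, 'C': 0, 'B': 1, 'D': 1})
print(df_loan['default'].value_counts())   # expected: 606 good vs 76 bad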
With the two distinct classes defined, we can look into the variables and plot the histograms to see if they correspond to different distributions.
The loan amount shown below is a good example to see the difference between the two classes. Even though both are right-skewed, it still shows an interesting pattern that loans with a higher amount tend to default.
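The comparison below can be produced with a sketch along these lines (assuming the joined DataFrame df_loan carries the amount column and the default label defined above):

import matplotlib.pyplot as plt

# Overlay the loan-amount distributions of the two classes.
fig, ax = plt.subplots(figsize=(8, 4))
for label, name in [(0, 'good'), (1, 'bad')]:
    ax.hist(df_loan.loc[df_loan['default'] == label, 'amount'],
            bins=30, alpha=0.5, density=True, label=name)
ax.set_xlabel('Loan amount')
ax.set_ylabel('Density')
ax.legend()
plt.show()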
Histogram of Loan Amount (Good vs Bad)

When extracting features, they don't have to be the existing variables provided in the tables. Instead, we can always be creative and come up with some out-of-the-box solutions for creating our own features. For example, when joining the Loan table and the Account table, we can get both the date of loan issuance and the date of account creation. We may wonder if the time gap between creating the account and applying for the loan plays a role, so a simple subtraction gives us a new variable consisting of the number of days between the two activities on the same account. The histograms are shown below, where a clear trend can be seen: people who apply for a loan right after creating their bank account tend to default.
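A rough sketch of that subtraction (the loan_date and account_date aliases are just illustrative names for the two date columns returned by the join, and the Loan table schema is assumed to mirror the Account table shown earlier):

import pandas as pd

# Join Loan and Account, then compute the gap in days between
# account creation and loan issuance.
df = db.to_df("""
    select l.account_id, l.`date` as loan_date, a.`date` as account_date
    from Loan l join Account a using(account_id);
""")
df['loan_date'] = pd.to_datetime(df['loan_date'])
df['account_date'] = pd.to_datetime(df['account_date'])
df['days_between'] = (df['loan_date'] - df['account_date']).dt.days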
Histogram of Days between Account Creation and Loan Issuance

By repeating this process of experimenting with existing and created features, I finally prepared a table that consists of 18 feature columns and 1 label column. The selected features are listed below; a short sketch after the list shows how the transaction-based aggregates can be pulled.
- amount: Loan amount
- duration: Loan duration
- payments: Loan payments
- days_between: Days between account creation and loan issuance
- frequency: Frequency of issuance of statements
- average_order_amount: Average amount of the permanent orders made by the account
- average_trans_amount: Average amount of the transactions made by the account
- average_trans_balance: Average balance after transactions made by the account
- n_trans: Number of transactions on the account
- card_type: Type of credit card associated with the account
- n_inhabitants: Number of inhabitants in the district of the account
- average_salary: Average salary in the district of the account
- average_unemployment: Average unemployment rate in the district of the account
- entrepreneur_rate: Number of entrepreneurs per 1000 inhabitants in the district of the account
- average_crime_rate: Average crime rate in the district of the account
- owner_gender: Account owner's gender
- owner_age: Account owner's age
- same_district: A boolean that represents whether the owner has the same district information as the account
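For example, the transaction-based aggregates above can be pulled with one grouped query through the same connection; the table name Trans and its column names are assumptions based on the raw TRANS.ASC file:

# Per-account aggregates over the transaction table (schema assumed).
df_trans_agg = db.to_df("""
    select account_id,
           avg(amount)  as average_trans_amount,
           avg(balance) as average_trans_balance,
           count(*)     as n_trans
    from Trans
    group by account_id;
""")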
Feature Transformation
After the features are extracted and put into a big table, it is necessary to transform the data so that they can be fed into the machine learning model in an organic way. In our case, we have two types of features. One is numerical, such as amount, duration, and n_trans. The other one is categorical, such as card_type and owners_gender.
Our dataset is pretty clean and there are no missing values, so we can skip imputation and jump directly into scaling the numerical values. There are several scaler options in scikit-learn, such as StandardScaler, MinMaxScaler, and RobustScaler. Here, I used MinMaxScaler to rescale the numerical values to between 0 and 1. On the other hand, the typical strategy for dealing with categorical variables is to use OneHotEncoder to transform the features into binary 0 and 1 values.
The code below is a representation of the feature transformation steps:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Define the numerical and categorical columns
num_cols = df_ml.columns[:-5]
cat_cols = df_ml.columns[-5:]

# Build the column transformer and transform the dataframe
col_trans = ColumnTransformer([
    ('num', MinMaxScaler(), num_cols),
    ('cat', OneHotEncoder(drop='if_binary'), cat_cols)
])
df_transformed = col_trans.fit_transform(df_ml)

Modeling
The first thing in training a machine learning model is to split the train and test sets. It is tricky in our dataset because it is not balanced: there are almost 10 times more good loans than bad loans. A stratified split is a good option here because it preserves the ratio between classes in both train and test sets.
from sklearn.model_selection import train_test_split

# Stratified split of the train and test sets with a train-test ratio of 7:3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=10)

There are many good machine learning models for binary classification tasks. Here, the Random Forest model is used for its decent performance and quick-prototyping capability. An initial RandomForestClassifier model is fit, and three distinct measures are used to represent the model performance: Accuracy, F1 Score, and ROC AUC.
It is noticeable that Accuracy is not sufficient for this unbalanced dataset. If we fine-tune the model purely by accuracy, it would favor predicting every loan as a “good loan”. The F1 score is the harmonic mean of precision and recall, and ROC AUC is the area under the ROC curve. These two are better metrics for evaluating model performance on unbalanced data.
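A tiny illustration of why accuracy alone is misleading here (the 90/10 split roughly mirrors the class ratio in this dataset):

from sklearn.metrics import accuracy_score, f1_score

# A "model" that predicts every loan as good still looks accurate,
# but its F1 score for the bad-loan class collapses to zero.
y_true = [0] * 90 + [1] * 10   # 0 = good loan, 1 = bad loan
y_pred = [0] * 100             # always predict "good"
print(accuracy_score(y_true, y_pred))             # 0.9
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0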
The code below shows how to apply 5-fold stratified cross-validation on the training set, and calculate the average of each score:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# See the initial model performance
clf = RandomForestClassifier(random_state=10)
print('Acc:', cross_val_score(clf, X_train, y_train, cv=StratifiedKFold(n_splits=5), scoring='accuracy').mean())
print('F1:', cross_val_score(clf, X_train, y_train, cv=StratifiedKFold(n_splits=5), scoring='f1').mean())
print('ROC AUC:', cross_val_score(clf, X_train, y_train, cv=StratifiedKFold(n_splits=5), scoring='roc_auc').mean())

Acc: 0.8973
F1: 0.1620
ROC AUC: 0.7253
It is clearly seen that the accuracy is high, almost 0.9, but the F1 score is very low because of low recall. There is room for the model to be fine-tuned for better performance, and one of the methods is Grid Search. By assigning different values to the hyperparameters of the RandomForestClassifier, such as n_estimators, max_depth, min_samples_split, and min_samples_leaf, it will iterate through the combinations of hyperparameters and output the one with the best performance on the score that we are interested in. A code snippet is shown below:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Assign different values for the hyperparameters
params = {'n_estimators': [10, 50, 100, 200],
          'max_depth': [None, 10, 20, 30],
          'min_samples_split': [2, 5, 10],
          'min_samples_leaf': [1, 2, 5]}

# Grid search with 5-fold cross-validation on the F1 score
clf = GridSearchCV(RandomForestClassifier(random_state=10),
                   param_grid=params,
                   cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
                   scoring='f1')
clf.fit(X_train, y_train)

print(clf.best_params_)

Refitting the model with the best parameters, we can take a look at the model performance on the whole train set and on the test set:
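A sketch of how such numbers can be produced (GridSearchCV refits the best estimator on the full training set by default, so clf.best_estimator_ can be used directly):

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

best_clf = clf.best_estimator_   # refit with the best hyperparameters

for name, X_part, y_part in [('Train', X_train, y_train), ('Test', X_test, y_test)]:
    y_pred = best_clf.predict(X_part)
    y_prob = best_clf.predict_proba(X_part)[:, 1]
    print(f'Performance on {name} Set:')
    print('Acc:', accuracy_score(y_part, y_pred))
    print('F1:', f1_score(y_part, y_pred))
    print('ROC AUC:', roc_auc_score(y_part, y_prob))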
Performance on Train Set:
Acc: 0.9706
F1: 0.8478
ROC AUC: 0.9952

Performance on Test Set:
Acc: 0.8927
F1: 0.2667
ROC AUC: 0.6957
The performance on the train set is great: more than 2/3 of the bad loans and all of the good loans are correctly classified, and all three performance measures are above 0.84. On the other hand, when the model is used on the test set, the result is not quite satisfying: most of the bad loans are labeled as “good” and the F1 score is only 0.267. There is evidence of overfitting, so more effort should be put into such iterative processes in order to get better model performance.
With the model built, we can now rank the features based on their importance (a short sketch of how to read the importances follows the list). The top 5 features with the most predictive power are:
- Average Transaction Balance
- Average Transaction Amount
- Loan Amount
- Average Salary
- Days between account creation and loan application
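One way to obtain such a ranking is from the fitted forest's impurity-based feature importances; a minimal sketch (feature_names is assumed to hold the transformed column names in training order):

import numpy as np

importances = best_clf.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking[:5]:
    print(feature_names[idx], round(importances[idx], 3))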
There is not too much of a surprise here, since for many of these features we have already seen unusual behaviors that could be related to loan default, such as the loan amount and the days between account creation and loan application.
Conclusion and Future Directions
In this post, I introduced the whole pipeline of an end-to-end machine learning model for a banking application, loan default prediction, using the real-world Berka banking dataset. I described the Berka dataset and the relationships between the tables. Steps and code were demonstrated for importing the dataset into a MySQL database, connecting to Python, and converting the processed records into Pandas DataFrames. Features were extracted and transformed into an array, ready for feeding into machine learning models. As the last step, I fit a Random Forest model using the data, evaluated the model performance, and generated the list of the top 5 features that play roles in predicting loan default.
This machine learning pipeline is just a gentle touch on one application that could be built with the Berka dataset. It could go deeper, since there is more useful information hidden in the intricate relationships among the tables; it could also go wider, since it can be extended to other applications such as credit card usage and clients' transaction behaviors. But focusing just on loan default prediction, there are three directions to dive into further in the future:
Extract more features: Due to the time limit, it is not possible to conduct a thorough study and develop a deep understanding of the dataset. There are still many unused features in the dataset, and a lot of the information has not been fully digested with banking-industry knowledge.
Try other models: Only the Random Forest model is used here, but there are many good alternatives, such as Logistic Regression, XGBoost, SVM, or even neural networks. The models can also be improved further by finer tuning of the hyperparameters or by using ensemble methods such as bagging, boosting, and stacking.
Deal with the unbalanced data: It is important to note that the default loans are only about 10% of the total loans, so during the training process the model will favor predicting negatives over positives. We have already used the F1 score and ROC AUC instead of just accuracy. However, the performance is still not as good as it could be. To address this problem, other methods such as collecting more data or resampling can be used in the future.
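One low-effort option along those lines (not used in this project) is to re-weight the minority class directly inside the Random Forest:

from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' scales sample weights inversely to class frequency,
# pushing the trees to pay more attention to the rare "bad loan" class.
clf_balanced = RandomForestClassifier(class_weight='balanced', random_state=10)
clf_balanced.fit(X_train, y_train)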
About Me
I am a data scientist with engineering backgrounds. I embrace technology and learn new skills every day. Currently, I am seeking career opportunities in Toronto. You are welcome to reach me from Medium Blog, LinkedIn, or GitHub.
Originally published at: https://towardsdatascience.com/loan-default-prediction-an-end-to-end-ml-project-with-real-bank-data-part-1-1405f7aecb9e