當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

客户细分模型_Avarto金融解决方案的客户细分和监督学习模型

發(fā)布時間：2023/11/29 编程问答 29 豆豆

生活随笔收集整理的這篇文章主要介紹了客户细分模型_Avarto金融解决方案的客户细分和监督学习模型小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

客戶細分模型

Lets assume that you are a CEO of a company which have some X amount of customers in a city with 1000 *X population. Analyzing the trends/features of your customer and segmenting the population of the city to land new potential customers would be a very important step for any company.

假設(shè)您是一家公司的CEO，該公司在人口為1000 * X的城市中有大約X個客戶。對于任何公司來說，分析客戶的趨勢/特征并劃分城市人口以吸引新的潛在客戶都是非常重要的一步。

This is very interesting work for an aspiring Data Scientist like me who would love to work with such real world datasets. The following article is a similar work on data provide by Avarto Financial Solutions for Udacity Nano-Degree Capstone Project.

對于像我這樣有抱負的數(shù)據(jù)科學(xué)家來說，這是一件非常有趣的工作，他們愿意使用這樣的現(xiàn)實世界數(shù)據(jù)集。以下文章是Avarto Financial Solutions為Udacity Nano-Degree Capstone項目提供的數(shù)據(jù)的類似工作。

This project can we divided into three parts:

這個項目我們可以分為三個部分：

Customer Segmentation using unsupervised techniques

使用無監(jiān)督技術(shù)進行客戶細分

Supervised Learning Model

監(jiān)督學(xué)習(xí)模型

Kaggle Competition

Kaggle比賽

Data Overview:

數(shù)據(jù)概述：

The following the files which are given by Avarto Financial Services:

Avarto Financial Services提供的以下文件：

Azdias.csv : Demographics data for the general population of Germany 891,211 persons (rows) x 366 features (columns)

Azdias.csv：德國總?cè)丝诘娜丝诮y(tǒng)計數(shù)據(jù)891,211人(行)x 366個特征(列)

Description of AZDIAS dataAZDIAS數(shù)據(jù)的描述

2. Customers.csv: Demographics data for customers in a mail-order company 191652 persons (rows) x 369 features (columns)

2. Customers.csv：郵購公司中客戶的人口統(tǒng)計數(shù)據(jù)191652人(行)x 369個特征(列)

Description of Customers data客戶數(shù)據(jù)描述

3. Train.csv: Demographics data for individuals who were targets of a marketing campaign (train) — 42,982 persons (rows) x 367 (columns)

3. Train.csv：作為營銷活動目標的個人的人口統(tǒng)計數(shù)據(jù)(火車)-42,982人(行)x 367(列)

4. Test.csv: Demographics data for individuals who were targets of a marketing campaign (test) — 42,833 persons (rows) x 366 (columns)

4. Test.csv：作為營銷活動目標的個人的人口統(tǒng)計數(shù)據(jù)(測試)-42,833人(行)x 366(列)

An excel for describing all the features and the explanations of each of their values was given in a seperate file. This excel was read and a dictionary was made out it for mapping all these values to NaNs.

在單獨的文件中提供了一個用于描述所有功能以及每個值的解釋的excel。讀取了此excel，并制作了字典以將所有這些值映射到NaN。

Screenshot of the Dictionary字典的屏幕截圖

Pre-processing Data:

預(yù)處理數(shù)據(jù)：

Initially, the number of NaNs in the Azdias dataset were less. After mapping, the unknown values of each column increased. The following figures shows the top 15 columns which had maximum number of null values after mapping of the unknown values.

最初，Azdias數(shù)據(jù)集中的NaN數(shù)量較少。映射后，每列的未知值增加。下圖顯示了映射未知值后具有最大空值數(shù)量的前15列。

Null values Proportion空值比例

On checking the proportion of null values, there were 42 columns which had more than 50% of their values as NaNs. These columns were subsequently dropped from the dataset. One thing to note here is to store these column names which can be applied later to the customers dataset.

在檢查空值的比例時，有42列的NaN占其值的50％以上。這些列隨后從數(shù)據(jù)集中刪除。這里要注意的一件事是存儲這些列名，這些列名以后可以應(yīng)用于客戶數(shù)據(jù)集。

Dropping columns with more than 50% of null values刪除具有超過50％的空值的列

There 6 features which were categorical in nature. Features such as ‘EINGEFUEGT_AM’ , ’D19_LETZTER_KAUF_BRANCHE’ were dropped because these were too detailed. Dummies were created for column ‘CAMEO_DEU_2015’ for all its individual values.

本質(zhì)上有6個特征是分類的。刪除了諸如“ EINGEFUEGT_AM”，“ D19_LETZTER_KAUF_BRANCHE”之類的功能，因為這些功能過于詳細。為“ CAMEO_DEU_2015”列創(chuàng)建了所有虛擬值的虛擬變量。

Categorical Columns分類欄

The columns ‘CAMEO_INTL_2015’ and ‘CAMEO_DEUG_2015’ had arbitrary values such as ‘XX’ and ‘X’ which needed to be replaced by NaN. These values also present in ‘CAMEO_DEU_2015’ feature and was subsequently dropped after creating dummies.

列“ CAMEO_INTL_2015”和“ CAMEO_DEUG_2015”具有任意值，例如“ XX”和“ X”，需要用NaN替換。這些值也出現(xiàn)在“ CAMEO_DEU_2015”功能中，并在創(chuàng)建虛擬變量后被刪除。

Unknown entries in ‘CAMEO_INTL_2015’“ CAMEO_INTL_2015”中的未知條目

Pre-processing on three particular columns involved manipulation of data.

在三個特定列上的預(yù)處理涉及數(shù)據(jù)處理。

‘CAMEO_INTL_2015’ : column for international topology. This column had two internal features such as ‘WEALTH’ and ‘LIFE_CYCLE’ combined in the dataset. Thus, two independent columns were generated from this column.

'CAMEO_INTL_2015'：國際拓撲列。該列在數(shù)據(jù)集中具有兩個內(nèi)部功能，例如“ WEALTH”和“ LIFE_CYCLE”。因此，從該列中生成了兩個獨立的列。

‘PRAEGENDE_JUGENDJAHRE’: column for dominating movement in the person’s youth (avantgarde or mainstream). This column was dropped and another column named ‘MOVEMENT’ was generated from this column with two values. {1:mainstream,2:avantgarde}

'PRAEGENDE_JUGENDJAHRE'：專欄，用于支配青年(前衛(wèi)或主流)青年的運動。刪除該列，并從該列中生成另一個名為“ MOVEMENT”的列，其中包含兩個值。 {1：主流，2：前衛(wèi)}

The ‘OST_WEST_KZ’ is converted into integer values by mapping ‘o’ to 1 and ‘w’ to 2.

通過將'o'映射為1并將'w'映射為2，將'OST_WEST_KZ'轉(zhuǎn)換為整數(shù)值。

Converting Mixed attributes into individual features將混合屬性轉(zhuǎn)換為單個特征

Using the sklearn library, the SimpleImputer and StandardScaler for every column was used. Similar, process of pre-processing, imputation and standardization was applied to the customers data as well.

使用sklearn庫，每列都使用SimpleImputer和StandardScaler。同樣，預(yù)處理，插補和標準化過程也應(yīng)用于客戶數(shù)據(jù)。

Azdias data after cleaning, imputing and standarding清潔，插補和標定后的Azdias數(shù)據(jù)

Before jumping into the 1st part of the project, the number of features were reduced to 363 columns. All the columns represented float values and the data was ready to be utilize further for the customer segmentation.

在進入項目的第一部分之前，要素數(shù)量已減少到363列。所有列均表示浮點值，數(shù)據(jù)已準備好進一步用于客戶細分。

Part 1: Customer Segment

第1部分：客戶群

Principal Component Analysis:

主成分分析：

It is not feasible to work with 363 features. Each feature would represent some amount of variance. Thus, PCA was applied to reduce the dimensionality of the data. After applying PCA for many different n_components, I decided to finalize 181 features which is exactly the half number of features of clean data.

使用363個功能不可行。每個特征將代表一定數(shù)量的方差。因此，應(yīng)用PCA來減少數(shù)據(jù)的維數(shù)。在將PCA應(yīng)用于許多不同的n_components之后，我決定完成181個特征，這些特征恰好是干凈數(shù)據(jù)特征的一半。

PCA applied on the cleaned dataPCA應(yīng)用于清除的數(shù)據(jù)

As it can be seen, 181 features roughly can give 90% of the information of the entire data. The principal components were studied in detail regarding the features which were most important.

可以看出，大約181個要素可以提供全部數(shù)據(jù)的90％的信息。對主要組成部分進行了詳細的最重要的功能研究。

Following figures can give a rough estimation about the feature importance for the first three principal components.

下圖可以粗略地估計出前三個主要組件的功能重要性。

Feature mapping for principal components主要組件的特征映射

For cluster segmentation, initially the Elbow method was implemented to identify the ideal number of clusters for k-means clustering. The sum of squared errors was calculated for each number of cluster using the MiniBatchKmeans.

對于聚類分割，最初采用Elbow方法來確定k均值聚類的理想聚類數(shù)。使用MiniBatchKmeans為每個簇數(shù)計算平方誤差總和。

Clustering using K-Means使用K均值聚類

Using the same pipeline, the customers data set was transformed. The number of clusters used was 30. Although, number of clusters=26 would also suffice the needs. The Kmeans algorithm with number of clusters as 30 was used to fit_predict the PCA fitted azdias data. The same model was used to transform the PCA transformed customers data. The below plot shows the similarity of both the datasets for the 30 clusters.

使用相同的管道，轉(zhuǎn)換了客戶數(shù)據(jù)集。使用的群集數(shù)為30。盡管，群集數(shù)= 26也可以滿足需求。使用簇數(shù)為30的Kmeans算法來fit_預(yù)測PCA擬合的azdias數(shù)據(jù)。使用相同的模型來轉(zhuǎn)換PCA轉(zhuǎn)換的客戶數(shù)據(jù)。下圖顯示了30個聚類的兩個數(shù)據(jù)集的相似性。

Proportion of azdias and customers for each kmeans cluster每個kmeans集群的Azdias和客戶比例

From the above figure, it would be ideal to plot the differences between the proportion of customers and azdias. This difference in proportion would be useful to predict the group of general population which can be considered as potential future customers.

根據(jù)上圖，繪制客戶和azdias比例之間的差異是理想的。這種比例上的差異將有助于預(yù)測可被視為潛在未來客戶的普通人群。

Representation of clusters for customers and azdias代表客戶和Azdias的集群

It can be clearly found out that, general audience in the clusters of #18, #14, and #0 would be an ideal choice or have a potential to become future customers. The azdias data points of people in groups #10,#9,#19, and #3 would be considered as non target audience for the mail order company.

可以清楚地發(fā)現(xiàn)，在＃18，＃14和＃0集群中的一般受眾將是理想的選擇，或者具有成為未來客戶的潛力。＃10，＃9，＃19和＃3組中人員的azdias數(shù)據(jù)點將被視為郵購公司的非目標受眾。

Part 2: Supervised Learning Model

第2部分：監(jiān)督學(xué)習(xí)模型

In order to predict whether a person would be a target or non_target for the mail_order company, it would be ideal to model a supervised model. The training data provides the demographic data for such individuals which can be classified as target and non target. The labels for these data points are given to train the supervised model.

為了預(yù)測某個人是郵購公司的目標還是非目標，對受監(jiān)管模型進行建模是理想的。訓(xùn)練數(shù)據(jù)為此類個人提供了人口統(tǒng)計數(shù)據(jù)，可以將其分類為目標和非目標。這些數(shù)據(jù)點的標簽用于訓(xùn)練監(jiān)督模型。

Taking a look at the responses, it can be found that the training data is imbalanced in terms its responses.Only 12% of the total responses are 1 (i.e. targets) while others are non-targets. Thus, the data is subjected to predicting non targets more often the targets.

看一下響應(yīng)，可以發(fā)現(xiàn)訓(xùn)練數(shù)據(jù)的響應(yīng)是不平衡的?？傢憫?yīng)中只有12％是1(即目標)，而其他不是目標。因此，更經(jīng)常地對數(shù)據(jù)進行非目標預(yù)測。

Proportion of Responses in train_datatrain_data中的響應(yīng)比例

In order to tackle this problem, two steps were taken.

為了解決這個問題，采取了兩個步驟。

Using the Stratified Cross Fold Validation : This type of cross validation technique was used because Stratified Kfold maintains the proportion of each class in every fold. In other words, for two fold Stratified cross fold validation, each fold would get 0.06% of target data points and 0.44% of non target data points.

使用分層交叉折疊驗證：之所以使用這種類型的交叉驗證技術(shù)，是因為分層Kfold保持每個折疊中每個類別的比例。換句話說，對于兩次分層交叉折疊驗證，每一次折疊將獲得0.06％的目標數(shù)據(jù)點和0.44％的非目標數(shù)據(jù)點。

Using advanced bagging/boosting techniques: Ensemble learning is method of building a model on top of other models. Adaboost Regressor, Gradient Boosting Regressor, and XGBoost Regressor in Scikit-learn are all ensemble classifiers built on top of Decision tree model (by default).

使用高級裝袋/增強技術(shù)：集成學(xué)習(xí)是在其他模型之上構(gòu)建模型的方法。 Scikit-learn中的Adaboost Regressor，Gradient Boosting Regressor和XGBoost Regressor都是基于決策樹模型(默認情況下)構(gòu)建的集成分類器。

These ensemble models work great both for classification and for regression problems and in general has the following benefits:

這些集成模型對于分類和回歸問題均適用，并且通常具有以下優(yōu)點：

improves the stability and accuracy of machine learning algorithms;
提高機器學(xué)習(xí)算法的穩(wěn)定性和準確性；
reduces variance of a single estimate from one model as they combine several estimates from several models;
減少來自一個模型的單個估計的方差，因為它們組合了來自多個模型的多個估計；
helps to avoid overfitting.
有助于避免過度擬合。

The models are initially fitted on the training data which was splitted into training and validation data using the Stratified K-fold technique with n_fold=7.

首先將模型擬合到訓(xùn)練數(shù)據(jù)上，然后使用n_fold = 7的分層K折技術(shù)將其分為訓(xùn)練和驗證數(shù)據(jù)。

Comparison of models before hyper-parameter tuning.超參數(shù)調(diào)整之前的模型比較。

Model Evaluation:

模型評估：

ROC_AUC was selected to be the metric. To evaluate the models, the training data was split into training and validation data with test_size=0.3. Then, the training data was split into percentages from 30 to 100 of the training size and the corresponding roc_auc for training and validation were plotted.

選擇ROC_AUC作為度量標準。為了評估模型，將訓(xùn)練數(shù)據(jù)分為test_size = 0.3的訓(xùn)練和驗證數(shù)據(jù)。然后，將訓(xùn)練數(shù)據(jù)分成訓(xùn)練大小的30％到100％的百分比，并繪制相應(yīng)的roc_auc進行訓(xùn)練和驗證。

Model evaluation模型評估

As it can be seen that, the XGBoost Regressor does not perform well with the max val roc_auc of 0.65. The remaining models have similar roc_auc. These models were further fined tuned by hyper-paramater tuning using grid search.

可以看出，在最大val roc_auc為0.65的情況下，XGBoost回歸器的性能不佳。其余模型具有類似的roc_auc。通過使用網(wǎng)格搜索的超參數(shù)調(diào)整對這些模型進行了進一步的調(diào)整。

Grid Search

網(wǎng)格搜索

The Adaptive Boosting was further fined tuned by tweaking learning_rate and n_estimators. It was found that smaller learning rates and larger n_estimators were giving the best results. Gradient Boosting Regression was also fined similarly and was able to interpret similar trends. However, it takes longer time to run this model. Although, XG Boosting (Extreme Boosting) was not performing well enough, it was fined tuned further by tweaking learning rate and gamma.

通過調(diào)整learning_rate和n_estimators進一步調(diào)整了自適應(yīng)提升。結(jié)果發(fā)現(xiàn)，較小的學(xué)習(xí)率和較大的n_estimator給出最佳結(jié)果。梯度提升回歸也被類似地罰款，并且能夠解釋類似的趨勢。但是，運行此模型需要更長的時間。盡管XG Boosting(Extreme Boosting)的性能不夠好，但通過調(diào)整學(xué)習(xí)率和伽瑪系數(shù)進一步對其進行了微調(diào)。

Grid Search for Gradient Boosting網(wǎng)格搜索以實現(xiàn)梯度提升 Grid Search for Adaptive Boosting網(wǎng)格搜索以實現(xiàn)自適應(yīng)提升 Grid Search for XG BoostingXG Boosting的網(wǎng)格搜索

The final models after the grid search and fine tuning the hyper parameters are as follows:

在網(wǎng)格搜索和微調(diào)超參數(shù)之后的最終模型如下：

Final models after gridsearch網(wǎng)格搜索后的最終模型

All the three models were able to get 76% accuracy on the training data.

這三個模型在訓(xùn)練數(shù)據(jù)上都能獲得76％的準確性。

Final Scores for each model每個模型的最終成績

These models were traced back to figure out which feature stands as the most important out of the 363 clean features. It is found out that, the feature ‘D19_SOZIALES’ is the most important feature for two out the three models.

追溯這些模型以找出363個清潔功能中最重要的功能。結(jié)果發(fā)現(xiàn)，對于三個模型中的兩個，特征“ D19_SOZIALES”是最重要的特征。

Top ten features for each model每種型號的十大功能

Part 3: Kaggle Competition

第3部分：Kaggle比賽

The cleaning data process was applied to the test data as well. The training data was fit on all the three models and predictions on test data for each model was submitted to Kaggle. As we were allowed to submit 5 submissions per, all the three predictions were submitted. The Gradient Boosting submission was the best, getting an accuracy of 79.959 on Kaggle.

清潔數(shù)據(jù)處理也適用于測試數(shù)據(jù)。訓(xùn)練數(shù)據(jù)適合所有三個模型，并且每個模型的測試數(shù)據(jù)預(yù)測已提交給Kaggle。由于我們每個人可以提交5份意見書，因此所有三個預(yù)測均已提交。梯度增強提交是最好的，在Kaggle上獲得了79.959的準確性。

In order to improve the accuracy, one can refine all the parameters of the models. Other models such as DecisionTreeRegressor can also be tried. One advice would be to try data augmentation technique. Try adding customer dataset points to the training data to handle the imbalance nature.

為了提高精度，可以對模型的所有參數(shù)進行細化。也可以嘗試使用其他模型，例如DecisionTreeRegressor。一種建議是嘗試使用數(shù)據(jù)增強技術(shù)。嘗試將客戶數(shù)據(jù)集點添加到訓(xùn)練數(shù)據(jù)中以處理不平衡性。

Improvements:

改進之處：

Different cut off for dropping the columns in the datasets can be used. Eg 0.30 null values.

可以使用不同的截距來刪除數(shù)據(jù)集中的列。例如0.30空值。

Different Imputation techniques can be used like Median, Max etc. The features can also be divided into log attributes which are skewed, into mixed attributes and numerical attributes.

可以使用不同的插補技術(shù)，例如中位數(shù)，最大值等。特征還可以分為偏斜的對數(shù)屬性，混合屬性和數(shù)值屬性。

Standardization can also changed like MinMaxScaler etc.

標準化也可以更改，例如MinMaxScaler等。

Different Models like SVR, DecisionTreeClassifier can be test

可以測試SVR，DecisionTreeClassifier等不同模型

Data Augmentation technique can be implemented to handle the imbalance nature of the dataset. Data points from customers data can be added to the training data to increase the positive responses in the training data.

可以實施數(shù)據(jù)增強技術(shù)來處理數(shù)據(jù)集的不平衡性質(zhì)。來自客戶數(shù)據(jù)的數(shù)據(jù)點可以添加到培訓(xùn)數(shù)據(jù)中，以增加培訓(xùn)數(shù)據(jù)中的積極響應(yīng)。

Summary:

摘要：

In this project, provided by Udacity partners at Bertelsmann Arvato Analytics, the real-life demographics data of Germany population and customer segment was analyzed.

在這個由Bertelsmann Arvato Analytics的Udacity合作伙伴提供的項目中，分析了德國人口和客戶群的真實人口統(tǒng)計數(shù)據(jù)。

In the first part of the project, pre-processing of the data was done. This was the most challenging part of the project. Conversion of mixed attributes and categorical features into numerical attributes consumes most time in this project. Following the data cleaning part was the imputation and standardization of the dataset.

在項目的第一部分，對數(shù)據(jù)進行了預(yù)處理。這是該項目最具挑戰(zhàn)性的部分。在此項目中，將混合屬性和分類特征轉(zhuǎn)換為數(shù)字屬性會花費大量時間。數(shù)據(jù)清理部分之后是數(shù)據(jù)集的插補和標準化。

For the Unsupervised learning part, PCA was applied to the cleaned dataset containing 363 features. The number of features were reduced to 181 after applying PCA which represents approx 90% of the data. Elbow method was used to determine the number of clusters in the kmeans algorithm. After applying Kmeans to both the datasets, it can be seen that the general population in cluster number #18, #14, and #0 would be an ideal choice or have a potential to become future customers. The azdias data points of people in groups #10,#9,#19, and #3 would be considered as non target audience for the mail order company.

對于無監(jiān)督學(xué)習(xí)部分，將PCA應(yīng)用于包含363個特征的清理數(shù)據(jù)集。應(yīng)用PCA后，特征數(shù)量減少到181，這代表了大約90％的數(shù)據(jù)。 kmeans算法中使用肘方法確定聚類數(shù)。將Kmeans應(yīng)用于這兩個數(shù)據(jù)集后，可以看出群集編號18、14和0的總體人口將是理想選擇，或者具有成為未來客戶的潛力。＃10，＃9，＃19和＃3組中人員的azdias數(shù)據(jù)點將被視為郵購公司的非目標受眾。

A supervised learning algorithm was built using advanced regression models like AdaBoost, GradientBoost and XGBoost. Techniques to handle imbalanced data like Stratified cross fold techniques were also learnt. ROC_AUC curves for these models were plotted to evaluate them. These models were further fined tuning using pipelines and gridsearch methods as learnt from the nano degree program.

使用高級回歸模型(如AdaBoost，GradientBoost和XGBoost)構(gòu)建了監(jiān)督學(xué)習(xí)算法。還學(xué)習(xí)了處理不平衡數(shù)據(jù)的技術(shù)，例如分層交叉折疊技術(shù)。繪制了這些模型的ROC_AUC曲線以對其進行評估。這些模型通過使用管道和網(wǎng)格搜索方法(從納米程度程序中獲悉)進行了進一步的優(yōu)化調(diào)整。

A Kaggle score of 0.79959 was obtained from fine-tuned GradientBoost Regressor model. These results could be improved more by data augmentation techniques and further fine tuning other parameters of the model.

從微調(diào)的GradientBoost回歸模型獲得了0.79959的Kaggle分數(shù)。通過數(shù)據(jù)增強技術(shù)和進一步微調(diào)模型的其他參數(shù)，可以進一步改善這些結(jié)果。

5. Improvements like different imputation, standardization and data augmentation techniques were also discussed which can augment the performance of this project further.

5.還討論了諸如不同歸因，標準化和數(shù)據(jù)增強技術(shù)之類的改進，這些改進可以進一步提高該項目的性能。

This project be very useful as it could be applied to various fields. I would like to thank again Avarto Financial Solutions for providing this real world datasets. A big thanks to all the instructors and team members of Udacity for their constant support during the nano degree journey.

該項目非常有用，因為它可以應(yīng)用于各個領(lǐng)域。我還要再次感謝Avarto Financial Solutions提供了這個真實的數(shù)據(jù)集。非常感謝Udacity的所有講師和團隊成員在納米學(xué)位之旅中的不斷支持。

The project workbooks can be found here.[https://github.com/RuchitDoshi/Avarto_Customer_Segmentation]

項目工作簿可以在這里找到。[ https://github.com/RuchitDoshi/Avarto_Customer_Segmentation]

翻譯自: https://medium.com/@ruchitsd/customer-segmentation-and-supervised-learning-model-for-avarto-financial-solutions-374c946c99ac

客戶細分模型

總結(jié)

以上是生活随笔為你收集整理的客户细分模型_Avarto金融解决方案的客户细分和监督学习模型的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。