[Hands-On] Using Machine Learning to Boost Your User Growth (Part 2)

Published: 2025/3/8

Author: Barış Karaman. Translated by: ronghuaiyang

Length: 9,230 characters, 18 figures

Estimated reading time: 27 minutes

Introduction

This post covers customer segmentation: why customers should be segmented, how to segment them, on what basis, and with which methods.

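Part 2: Customer Segmentation

In the previous article, we analyzed the major metrics of our online retail business. We now know what to track and how to track it with Python. It is time to focus on customers and segment them.

First of all, why do we segment customers?

Because you cannot treat every customer the same way, with the same content, the same channel, and the same importance. If you do, they will go somewhere that understands them better.

Customers who use your platform have different needs and their own profiles. You should adapt your actions accordingly.

You can do many different segmentations depending on what you are trying to achieve. If you want to increase the retention rate, you can segment users by churn probability and take action. But there are also some very common and useful segmentation methods. We are now going to implement one of them for our business: RFM.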

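RFM stands for Recency-Frequency-Monetary Value. In theory, we will end up with segments like these:

Low Value: customers who are less active than others, not very frequent buyers/visitors, and who generate zero, very low, or even negative revenue.
Mid Value: customers who use our platform often (but not as much as our High Value customers), are fairly frequent, and generate moderate revenue.
High Value: the group we do not want to lose. High revenue, frequent activity.

As the methodology, we need to calculate Recency, Frequency, and Monetary Value (we will call it Revenue from now on) and apply unsupervised machine learning to identify the different groups (clusters) for each. Let's jump into the code and see how to do RFM clustering.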

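Recency

To calculate recency, we need to find the most recent purchase date of each customer and see how many days they have been inactive. Once we have the number of inactive days for each customer, we will apply K-means clustering to assign each customer a recency score.

For this example, we will keep using the same dataset: https://www.kaggle.com/vijayuv/onlineretail. Before jumping into the recency calculation, let's recap the work we have done so far.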

    # import libraries
    from datetime import datetime, timedelta
    import pandas as pd
    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    import plotly.offline as pyoff
    import plotly.graph_objs as go

    # initiate Plotly for notebook output
    pyoff.init_notebook_mode()

    # load our data from CSV
    tx_data = pd.read_csv('data.csv')

    # convert the string date field to datetime
    tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate'])

    # we will be using only UK data
    tx_uk = tx_data.query("Country=='United Kingdom'").reset_index(drop=True)

Now we can calculate recency:

    # create a generic user dataframe to keep CustomerID and new segmentation scores
    tx_user = pd.DataFrame(tx_data['CustomerID'].unique())
    tx_user.columns = ['CustomerID']

    # get the max purchase date for each customer and create a dataframe with it
    tx_max_purchase = tx_uk.groupby('CustomerID').InvoiceDate.max().reset_index()
    tx_max_purchase.columns = ['CustomerID', 'MaxPurchaseDate']

    # we take our observation point as the max invoice date in our dataset
    tx_max_purchase['Recency'] = (tx_max_purchase['MaxPurchaseDate'].max() - tx_max_purchase['MaxPurchaseDate']).dt.days

    # merge this dataframe to our new user dataframe
    tx_user = pd.merge(tx_user, tx_max_purchase[['CustomerID', 'Recency']], on='CustomerID')

    tx_user.head()

    # plot a recency histogram
    plot_data = [
        go.Histogram(x=tx_user['Recency'])
    ]
    plot_layout = go.Layout(title='Recency')
    fig = go.Figure(data=plot_data, layout=plot_layout)
    pyoff.iplot(fig)

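Our new dataframe tx_user now contains the recency data:

To get a rough picture of recency, we can use the pandas .describe() method. It shows the mean, minimum, maximum, count, and percentiles of our data.

We see that the mean is 90 days while the median is 49 days.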
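As a small illustration of reading .describe() output, here is a sketch on made-up recency values (the numbers below are hypothetical, not taken from the dataset). The median appears under the "50%" label, and a mean well above the median, as in the article's 90 versus 49 days, points to a right-skewed distribution.

```python
import pandas as pd

# Hypothetical recency values (days since last purchase), for illustration only
recency = pd.Series([0, 3, 10, 49, 120, 200, 365], name="Recency")

# describe() returns count, mean, std, min, quartiles, and max as a Series
summary = recency.describe()
print(summary)

# The median is reported under the "50%" label
median_days = summary["50%"]
mean_days = summary["mean"]
print("mean > median:", mean_days > median_days)
```

On the real data, the equivalent call would be tx_user.Recency.describe().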

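The code above also outputs a histogram showing how recency is distributed across our customers.

Now for the fun part. We are going to apply K-means clustering to assign a recency score. But we have to tell the K-means algorithm how many clusters we want. To find out, we will apply the Elbow Method. It plots inertia (the within-cluster sum of squared distances) against the number of clusters; the "elbow", where adding another cluster stops reducing inertia substantially, suggests a reasonable cluster count. The code and the inertia graph are as follows: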

    from sklearn.cluster import KMeans

    sse = {}
    tx_recency = tx_user[['Recency']]
    for k in range(1, 10):
        kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tx_recency)
        # labels are not stored on tx_recency: an extra column there
        # would be fed into the next fit as a feature
        # inertia_: sum of squared distances of samples to their closest cluster center
        sse[k] = kmeans.inertia_

    plt.figure()
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.xlabel("Number of cluster")
    plt.show()

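Inertia graph:

Here it looks like 3 is the optimal number. Depending on business requirements, we can go with fewer or more clusters. We choose 4 for this example: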
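Reading an elbow off a chart is somewhat subjective. One way to make it more concrete is to look at how much of the remaining inertia each extra cluster removes. The sketch below does this on synthetic one-dimensional data with three well-separated groups; the data, the fixed random_state, and the relative_drop helper name are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D data with three well-separated groups (illustrative only)
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(10, 5, 50),
    rng.normal(100, 5, 50),
    rng.normal(300, 5, 50),
]).reshape(-1, 1)

# Inertia for k = 1..6; random_state and n_init are fixed for repeatability
inertia = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(values).inertia_
    for k in range(1, 7)
}

# Fraction of the remaining inertia removed by going from k-1 to k clusters
relative_drop = {k: 1 - inertia[k] / inertia[k - 1] for k in range(2, 7)}
for k, drop in relative_drop.items():
    print(k, round(drop, 3))
```

On data like this, the drop should stay large up to three clusters and shrink afterwards, which is the pattern the elbow plot shows visually.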

    # build 4 clusters for recency and add it to dataframe
    kmeans = KMeans(n_clusters=4)
    kmeans.fit(tx_user[['Recency']])
    tx_user['RecencyCluster'] = kmeans.predict(tx_user[['Recency']])

    # function for ordering cluster numbers
    def order_cluster(cluster_field_name, target_field_name, df, ascending):
        new_cluster_field_name = 'new_' + cluster_field_name
        df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
        df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
        df_new['index'] = df_new.index
        df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
        df_final = df_final.drop([cluster_field_name], axis=1)
        df_final = df_final.rename(columns={"index": cluster_field_name})
        return df_final

    tx_user = order_cluster('RecencyCluster', 'Recency', tx_user, False)

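We have calculated the clusters and assigned them to each customer in our dataframe tx_user.

We can see how our recency clusters have different characteristics. Customers in cluster 1 are much more recent than those in cluster 2. We added a function to the code, order_cluster(). K-means assigns clusters as numbers, but not in an ordered way. We cannot say that cluster 0 is the worst and cluster 3 is the best. order_cluster() takes care of that for us, and our new dataframe looks much neater:

Great! Cluster 3 covers the most recent customers, whereas cluster 0 has the most inactive ones. We apply the same approach to Frequency and Revenue.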
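To see what the relabeling does, here is a minimal sketch on a hand-made table. The values and the raw cluster labels are hypothetical, and the map-based approach is a compact variant of the same idea rather than the article's order_cluster() function: rank the clusters by their mean, then replace each raw label with its rank.

```python
import pandas as pd

# Hypothetical data: raw K-means labels are arbitrary identifiers
df = pd.DataFrame({
    "Recency": [300, 280, 5, 10, 150, 160, 60, 70],
    "RecencyCluster": [2, 2, 0, 0, 3, 3, 1, 1],
})

# Mean recency per raw cluster, sorted so the least recent cluster comes first
means = df.groupby("RecencyCluster")["Recency"].mean().sort_values(ascending=False)

# Map each raw label to its rank: 0 = least recent, 3 = most recent
rank_of = {label: rank for rank, label in enumerate(means.index)}
df["RecencyCluster"] = df["RecencyCluster"].map(rank_of)

print(df)
```

After the mapping, a higher cluster number consistently means a more recent customer, which is what makes the three scores summable later on.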

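Frequency

To create frequency clusters, we need to find the total number of orders for each customer. Let's calculate this first and see what frequency looks like in our customer database: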

    # get order counts for each user and create a dataframe with it
    tx_frequency = tx_uk.groupby('CustomerID').InvoiceDate.count().reset_index()
    tx_frequency.columns = ['CustomerID', 'Frequency']

    # add this data to our main dataframe
    tx_user = pd.merge(tx_user, tx_frequency, on='CustomerID')

    # plot the histogram
    plot_data = [
        go.Histogram(x=tx_user.query('Frequency < 1000')['Frequency'])
    ]
    plot_layout = go.Layout(title='Frequency')
    fig = go.Figure(data=plot_data, layout=plot_layout)
    pyoff.iplot(fig)

Apply the same logic to get frequency clusters and assign them to each customer:

    # k-means
    kmeans = KMeans(n_clusters=4)
    kmeans.fit(tx_user[['Frequency']])
    tx_user['FrequencyCluster'] = kmeans.predict(tx_user[['Frequency']])

    # order the frequency cluster
    tx_user = order_cluster('FrequencyCluster', 'Frequency', tx_user, True)

    # see details of each cluster
    tx_user.groupby('FrequencyCluster')['Frequency'].describe()

This is how our frequency clusters look:

As with the recency clusters, a higher frequency cluster number indicates better customers.
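One detail worth checking in the frequency calculation: counting InvoiceDate per customer counts rows, and in a line-item dataset each order can span several rows. If the goal is the number of distinct orders, a possible variant is to count unique invoice numbers instead. The sketch below assumes the dataset's InvoiceNo column identifies an order and uses a tiny hypothetical table to show the difference.

```python
import pandas as pd

# Hypothetical rows shaped like the dataset: one row per invoice line item
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-02-01", "2011-03-01", "2011-03-01"]
    ),
})

# Row count per customer: this is what counting InvoiceDate measures
line_items = tx.groupby("CustomerID")["InvoiceDate"].count()

# Distinct invoices per customer: closer to "number of orders"
orders = tx.groupby("CustomerID")["InvoiceNo"].nunique()

print(pd.DataFrame({"line_items": line_items, "orders": orders}))
```

Either definition can work as a frequency signal; the point is to know which one the clusters are built on.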

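Revenue

Let's see what our customer database looks like when we cluster customers by revenue. We will calculate the revenue for each customer, plot a histogram, and apply the same clustering method.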

    # calculate revenue for each customer
    tx_uk['Revenue'] = tx_uk['UnitPrice'] * tx_uk['Quantity']
    tx_revenue = tx_uk.groupby('CustomerID').Revenue.sum().reset_index()

    # merge it with our main dataframe
    tx_user = pd.merge(tx_user, tx_revenue, on='CustomerID')

    # plot the histogram
    plot_data = [
        go.Histogram(x=tx_user.query('Revenue < 10000')['Revenue'])
    ]
    plot_layout = go.Layout(title='Monetary Value')
    fig = go.Figure(data=plot_data, layout=plot_layout)
    pyoff.iplot(fig)

We also have some customers with negative revenue. Let's continue and apply k-means clustering:

    # apply clustering
    kmeans = KMeans(n_clusters=4)
    kmeans.fit(tx_user[['Revenue']])
    tx_user['RevenueCluster'] = kmeans.predict(tx_user[['Revenue']])

    # order the cluster numbers
    tx_user = order_cluster('RevenueCluster', 'Revenue', tx_user, True)

    # show details of the dataframe
    tx_user.groupby('RevenueCluster')['Revenue'].describe()

Overall Score

Awesome! We now have scores (cluster numbers) for recency, frequency, and revenue. Let's create an overall score out of them:

    # calculate overall score and use mean() to see details
    tx_user['OverallScore'] = tx_user['RecencyCluster'] + tx_user['FrequencyCluster'] + tx_user['RevenueCluster']
    # select several columns with a list (double brackets)
    tx_user.groupby('OverallScore')[['Recency', 'Frequency', 'Revenue']].mean()

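The scores above clearly show that customers with a score of 8 are our best customers, while those with a score of 0 are the worst. (Each of the three cluster scores ranges from 0 to 3, so the theoretical maximum is 9.)

To keep things simple, it is better to name these scores:

0 to 2: Low Value
3 to 4: Mid Value
5+: High Value

We can easily apply this naming to our dataframe: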

    tx_user['Segment'] = 'Low-Value'
    tx_user.loc[tx_user['OverallScore'] > 2, 'Segment'] = 'Mid-Value'
    tx_user.loc[tx_user['OverallScore'] > 4, 'Segment'] = 'High-Value'
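The same score-to-segment mapping can also be written as a single binning step. The sketch below uses pd.cut on a hypothetical table of scores; the bin edges encode the 0 to 2, 3 to 4, and 5+ ranges, with 9 as the upper edge because that is the largest sum three 0-to-3 scores can reach.

```python
import pandas as pd

# Hypothetical overall scores, for illustration only
scores = pd.DataFrame({"OverallScore": [0, 2, 3, 4, 5, 8]})

# Right-closed bins: (-1, 2] -> Low, (2, 4] -> Mid, (4, 9] -> High
scores["Segment"] = pd.cut(
    scores["OverallScore"],
    bins=[-1, 2, 4, 9],
    labels=["Low-Value", "Mid-Value", "High-Value"],
)

print(scores)
```

A side effect of this form is that Segment becomes an ordered categorical column, so sorting and grouping follow the Low, Mid, High order.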

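Now let's see how our segments are distributed on scatter plots:

You can see how clearly the segments are differentiated from each other in terms of RFM. The code for the plots is below: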

    # Revenue vs Frequency
    tx_graph = tx_user.query("Revenue < 50000 and Frequency < 2000")

    plot_data = [
        go.Scatter(
            x=tx_graph.query("Segment == 'Low-Value'")['Frequency'],
            y=tx_graph.query("Segment == 'Low-Value'")['Revenue'],
            mode='markers',
            name='Low',
            marker=dict(size=7, line=dict(width=1), color='blue', opacity=0.8)
        ),
        go.Scatter(
            x=tx_graph.query("Segment == 'Mid-Value'")['Frequency'],
            y=tx_graph.query("Segment == 'Mid-Value'")['Revenue'],
            mode='markers',
            name='Mid',
            marker=dict(size=9, line=dict(width=1), color='green', opacity=0.5)
        ),
        go.Scatter(
            x=tx_graph.query("Segment == 'High-Value'")['Frequency'],
            y=tx_graph.query("Segment == 'High-Value'")['Revenue'],
            mode='markers',
            name='High',
            marker=dict(size=11, line=dict(width=1), color='red', opacity=0.9)
        ),
    ]
    plot_layout = go.Layout(
        yaxis={'title': "Revenue"},
        xaxis={'title': "Frequency"},
        title='Segments'
    )
    fig = go.Figure(data=plot_data, layout=plot_layout)
    pyoff.iplot(fig)

    # Revenue vs Recency
    tx_graph = tx_user.query("Revenue < 50000 and Frequency < 2000")

    plot_data = [
        go.Scatter(
            x=tx_graph.query("Segment == 'Low-Value'")['Recency'],
            y=tx_graph.query("Segment == 'Low-Value'")['Revenue'],
            mode='markers',
            name='Low',
            marker=dict(size=7, line=dict(width=1), color='blue', opacity=0.8)
        ),
        go.Scatter(
            x=tx_graph.query("Segment == 'Mid-Value'")['Recency'],
            y=tx_graph.query("Segment == 'Mid-Value'")['Revenue'],
            mode='markers',
            name='Mid',
            marker=dict(size=9, line=dict(width=1), color='green', opacity=0.5)
        ),
        go.Scatter(
            x=tx_graph.query("Segment == 'High-Value'")['Recency'],
            y=tx_graph.query("Segment == 'High-Value'")['Revenue'],
            mode='markers',
            name='High',
            marker=dict(size=11, line=dict(width=1), color='red', opacity=0.9)
        ),
    ]
    plot_layout = go.Layout(
        yaxis={'title': "Revenue"},
        xaxis={'title': "Recency"},
        title='Segments'
    )
    fig = go.Figure(data=plot_data, layout=plot_layout)
    pyoff.iplot(fig)

    # Frequency vs Recency
    tx_graph = tx_user.query("Revenue < 50000 and Frequency < 2000")

    plot_data = [
        go.Scatter(
            x=tx_graph.query("Segment == 'Low-Value'")['Recency'],
            y=tx_graph.query("Segment == 'Low-Value'")['Frequency'],
            mode='markers',
            name='Low',
            marker=dict(size=7, line=dict(width=1), color='blue', opacity=0.8)
        ),
        go.Scatter(
            x=tx_graph.query("Segment == 'Mid-Value'")['Recency'],
            y=tx_graph.query("Segment == 'Mid-Value'")['Frequency'],
            mode='markers',
            name='Mid',
            marker=dict(size=9, line=dict(width=1), color='green', opacity=0.5)
        ),
        go.Scatter(
            x=tx_graph.query("Segment == 'High-Value'")['Recency'],
            y=tx_graph.query("Segment == 'High-Value'")['Frequency'],
            mode='markers',
            name='High',
            marker=dict(size=11, line=dict(width=1), color='red', opacity=0.9)
        ),
    ]
    plot_layout = go.Layout(
        yaxis={'title': "Frequency"},
        xaxis={'title': "Recency"},
        title='Segments'
    )
    fig = go.Figure(data=plot_data, layout=plot_layout)
    pyoff.iplot(fig)

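We can start taking action with this segmentation. The main strategies are quite clear:

High Value: improve retention
Mid Value: improve retention + increase frequency
Low Value: increase frequency

It's getting more exciting! In the next part, we will calculate and predict customer lifetime value.

Ideally, what we did here could be achieved easily with quantiles or simple binning. K-means clustering was used here in order to get familiar with it.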
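For comparison, a quantile-based version of the scoring might look like the sketch below. It uses pd.qcut on hypothetical values to assign each customer to one of four equal-sized bins, scored 0 to 3. For recency the score is flipped, since fewer inactive days should earn a higher score. This is an illustrative alternative, not the method used in the article.

```python
import pandas as pd

# Hypothetical per-customer values, for illustration only
revenue = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], name="Revenue")
recency = pd.Series([1, 5, 20, 40, 90, 150, 250, 370], name="Recency")

# labels=False returns the integer bin index (0 = lowest quartile)
revenue_score = pd.qcut(revenue, q=4, labels=False)

# Lower recency is better, so reverse the quartile index
recency_score = 3 - pd.qcut(recency, q=4, labels=False)

print(pd.DataFrame({
    "Revenue": revenue, "RevenueScore": revenue_score,
    "Recency": recency, "RecencyScore": recency_score,
}))
```

On heavily skewed real data, several quantile edges can coincide; pd.qcut accepts duplicates="drop" for that case, at the cost of producing fewer than four bins.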

Original English article:

https://towardsdatascience.com/data-driven-growth-with-python-part-2-customer-segmentation-5c019d150444
