Customer Segmentation: K-Means Clustering and A/B Testing
Context
I have been working in Advertising, specifically Digital Media and Performance, for nearly 3 years, and customer behaviour analysis is one of the core focuses of my day-to-day job. With the help of different analytics platforms (e.g. Google Analytics, Adobe Analytics), my life has been made easier, since these platforms come with a built-in segmentation function that analyses user behaviour across dimensions and metrics.
However, despite that convenience, I wanted to leverage Machine Learning to build a customer segmentation that is scalable and applicable to other optimisations in Data Science (e.g. A/B Testing). I then came across a dataset provided by Google Analytics for a Kaggle competition and decided to use it for this project.
Feel free to check out the dataset here if you’re keen! Beware that the dataset has several sub-datasets and each has more than 900k rows!
A. Exploratory Data Analysis (EDA)
This always remains an essential step in every Data Science project: ensuring the dataset is clean and properly pre-processed before it is used for modelling.
First of all, let’s import all the necessary libraries and read the csv file:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df_raw = pd.read_csv("google-analytics.csv")
df_raw.head()
1. Flatten JSON Fields
As you can see, the raw dataset above is a bit “messy” and not digestible at all, since some variables are formatted as JSON fields that compress different values of different sub-variables into one field. For example, for the geoNetwork variable, we can tell that several sub-variables, such as continent, subContinent, etc., are grouped together.
Thanks to the help of a Kaggler, I was able to convert these variables into a more digestible form by flattening those JSON fields:
import os
import json
from pandas import json_normalize

def load_df(csv_path="google-analytics.csv", nrows=None):
    json_columns = ['device', 'geoNetwork', 'totals', 'trafficSource']
    df = pd.read_csv(csv_path,
                     converters={column: json.loads for column in json_columns},
                     dtype={'fullVisitorId': 'str'}, nrows=nrows)
    for column in json_columns:
        column_converted = json_normalize(df[column])
        column_converted.columns = [f"{column}_{subcolumn}" for subcolumn in column_converted.columns]
        df = df.drop(column, axis=1).merge(column_converted, right_index=True, left_index=True)
    return df
After flattening those JSON fields, we are able to see a much cleaner dataset, with the JSON variables split into sub-variables (e.g. device split into device_browser, device_browserVersion, etc.).
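If you want to sanity-check the flattening yourself, a quick hypothetical usage of load_df might look like the sketch below; the printed column names are just the kind of sub-variables you should expect to see.

# Hypothetical usage of load_df defined above; pass nrows=... to read a subset while experimenting
df = load_df("google-analytics.csv")
print(df.shape)
print([col for col in df.columns if col.startswith("device_")][:5])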
2. Data Re-formatting & Grouping
For this project, I have chosen the variables that I believe have the strongest impact on, or correlation with, user behaviour:
df = df.loc[:, ['channelGrouping', 'date', 'fullVisitorId', 'sessionId', 'visitId', 'visitNumber',
                'device_browser', 'device_operatingSystem', 'device_isMobile', 'geoNetwork_country',
                'trafficSource_source', 'totals_visits', 'totals_hits', 'totals_pageviews',
                'totals_bounces', 'totals_transactionRevenue']]
df = df.fillna(value=0)
df.head()
Moving on, although the new dataset has fewer variables, they vary in data type, so I took some time to analyse each and every variable to ensure the data was “clean enough” prior to modelling. Below are some quick examples of unclean data to be cleaned:
# Format the values
df.channelGrouping.unique()
df.channelGrouping = df.channelGrouping.replace("(Other)", "Others")

# Convert boolean type to string
df.device_isMobile.unique()
df.device_isMobile = df.device_isMobile.astype(str)
df.loc[df.device_isMobile == "False", "device"] = "Desktop"
df.loc[df.device_isMobile == "True", "device"] = "Mobile"

# Categorize similar values
df['traffic_source'] = df.trafficSource_source
main_traffic_source = ["google", "baidu", "bing", "yahoo", ...., "pinterest", "yandex"]
df.traffic_source[df.traffic_source.str.contains("google")] = "google"
df.traffic_source[df.traffic_source.str.contains("baidu")] = "baidu"
df.traffic_source[df.traffic_source.str.contains("bing")] = "bing"
df.traffic_source[df.traffic_source.str.contains("yahoo")] = "yahoo"
.....
df.traffic_source[~df.traffic_source.isin(main_traffic_source)] = "Others"
After re-formatting, I found that fullVisitorId has fewer unique values than the total number of rows in the dataset, meaning some visitors were recorded multiple times. Hence, I proceeded to group the variables by fullVisitorId and sort by Revenue.
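As a side note, the check for repeat visitors is a one-liner; a quick sketch using df as prepared above:

# Fewer unique visitor IDs than rows means some visitors appear more than once
print(df.fullVisitorId.nunique(), len(df))

With that confirmed, here is the grouping and sorting: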
df_groupby = df.groupby(['fullVisitorId', 'channelGrouping', 'geoNetwork_country', 'traffic_source',
                         'device', 'device_browser', 'device_operatingSystem']).agg(
    {'totals_hits': 'sum', 'totals_pageviews': 'sum', 'totals_bounces': 'sum',
     'totals_transactionRevenue': 'sum'}).reset_index()
df_groupby = df_groupby.sort_values(by='totals_transactionRevenue', ascending=False).reset_index(drop=True)
3. Outlier Handling
The last step of any EDA process that cannot be overlooked is detecting and handling outliers in the dataset. The reason is that outliers, especially the extreme ones, impact the performance of a machine learning model, mostly negatively. That said, we need to either remove those outliers from the dataset or convert them (using the mean or mode) so that they fall within the range where the majority of the data points lie:
# Seaborn boxplot to see how far outliers lie compared to the rest
sns.boxplot(df_groupby.totals_transactionRevenue)

As you can see, most of the data points in Revenue lie below USD 200,000 and there is only one extreme outlier that hits nearly USD 600,000. If we don't remove this outlier, the model will take it into consideration as well and produce a less objective picture.
So let's go ahead and remove it, and do the same for the other variables. Just a quick note: there are several methods for dealing with outliers (such as the interquartile range). In my case, however, there is only one, so I simply defined the range that I believe fits well:
df_groupby = df_groupby.loc[df_groupby.totals_transactionRevenue < 200000]

B. K-Means Clustering
What is K-Means Clustering and how does it help with customer segmentation?
Clustering is the most well-known unsupervised learning technique: it finds structure in unlabelled data by identifying similar groups/clusters, particularly with the help of K-Means.
K-Means tries to address two questions: (1) K: the number of clusters (groups) we expect to find in the dataset and (2) Means: the average distance of data to each cluster center (centroid) which we try to minimize.
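To make the “Means” part concrete, here is a small illustrative sketch (not from the original article) of the quantity K-Means minimises, written out with numpy; it is the same value scikit-learn later reports as inertia_.

import numpy as np

def kmeans_inertia(X, centroids, labels):
    # X: (n_samples, n_features), centroids: (k, n_features), labels: cluster index per sample
    # Sum of squared distances from each point to its assigned centroid
    diffs = X - centroids[labels]
    return np.sum(diffs ** 2)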
Also, one thing to note is that K-Means comes with several initialisation variants, typically:
init='random': randomly selects the initial centroids of the clusters
init='k-means++': selects only the first centroid at random, while the remaining centroids are placed as far away from the first as possible (both options are sketched below)
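For reference, a short sketch of how the two options are written with scikit-learn; this is illustrative only, and the cluster count is tuned later:

from sklearn.cluster import KMeans

km_random = KMeans(n_clusters=3, init="random", n_init=10)   # purely random starting centroids
km_plus = KMeans(n_clusters=3, init="k-means++", n_init=10)  # spread-out starting centroids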
In this project, I'll use the second option to ensure that the clusters are well separated from one another:
from sklearn.cluster import KMeans

data = df_groupby.iloc[:, 7:]
kmeans = KMeans(n_clusters=3, init="k-means++")
kmeans.fit(data)
labels = kmeans.predict(data)
labels = pd.DataFrame(data=labels, index=df_groupby.index, columns=["labels"])
Before applying the algorithm, we need to define n_clusters, the number of groups we expect to get out of the modelling. In this case, I arbitrarily set n_clusters=3. Then I visualised how the dataset was grouped using two variables, Revenue and PageViews:
# df_kmeans combines the grouped data with the cluster labels (this join is assumed; the article does not show it)
df_kmeans = df_groupby.join(labels)

plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 0],
            df_kmeans.totals_pageviews[df_kmeans.labels == 0], c='blue')
plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 1],
            df_kmeans.totals_pageviews[df_kmeans.labels == 1], c='green')
plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 2],
            df_kmeans.totals_pageviews[df_kmeans.labels == 2], c='orange')
plt.show()

As you can see, the x-axis stands for Revenue while the y-axis stands for PageViews. After modelling, we can see a certain degree of separation between the 3 clusters. However, I was not sure whether 3 is the “right” number of clusters. That said, we can rely on an estimator provided by the K-Means algorithm, inertia_, the sum of squared distances from each sample to its closest centroid. In particular, we will compare the inertia for a range of cluster counts (1 through 9 in the code below) and see how low it gets and how far we should go:
# Find the best number of clusters
num_clusters = [x for x in range(1, 10)]
inertia = []
for i in num_clusters:
    model = KMeans(n_clusters=i, init="k-means++")
    model.fit(data)
    inertia.append(model.inertia_)

plt.plot(num_clusters, inertia)
plt.show()
From the chart above, inertia starts to fall more slowly from the 4th or 5th cluster onwards, meaning additional clusters bring diminishing returns, so I decided to go with n_clusters=4 and re-fit the model before plotting again:
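The re-fit itself is not shown in the original article; a minimal sketch of what it presumably looks like, with the variable names assumed to match the plotting code below:

# Hypothetical re-fit with 4 clusters (assumed step)
kmeans_n4 = KMeans(n_clusters=4, init="k-means++")
labels_n4 = pd.DataFrame(kmeans_n4.fit_predict(data), index=df_groupby.index, columns=["labels"])
df_kmeans_n4 = df_groupby.join(labels_n4)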
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 0],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 0], c='blue')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 1],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 1], c='green')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 2],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 2], c='orange')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 3],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 3], c='red')
plt.xlabel("Page Views")
plt.ylabel("Revenue")
plt.show()

Note that PageViews is now on the x-axis and Revenue on the y-axis.
The clusters now look a lot more distinguishable from one another:
Except for clusters 0 and 3 (no clear pattern), which are beyond our control, clusters 1 and 2 can tell a story here, as they seem to share some similarities.
To understand which factors might impact each cluster, I segmented each cluster by Channel, Device Browser and Operating System.
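The breakdowns themselves are simple group-bys; a hypothetical sketch, reusing df_kmeans_n4 from the sketch above (swap 'channelGrouping' for 'device_browser' or 'device_operatingSystem' to get the other views):

channel_breakdown = (df_kmeans_n4
                     .groupby(['labels', 'channelGrouping'])['totals_transactionRevenue']
                     .sum()
                     .unstack(fill_value=0))  # one row per cluster, one column per channel
print(channel_breakdown)

The Cluster 1 and Cluster 2 charts in the original post were built from exactly this kind of table.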
As seen from the Cluster 1 and Cluster 2 breakdowns above, the Referral channel contributed the highest Revenue in Cluster 1, followed by Direct and Organic Search. In contrast, it is Direct that made the highest contribution in Cluster 2. Similarly, while Macintosh is the dominant operating system in Cluster 1, it is Windows in Cluster 2 that achieved higher revenue. The only similarity between the two clusters is the Device Browser: Chrome is widely used in both.
Voila! This further segmentation helps us tell which factors (in this case Channel, Device Browser and Operating System) work better for each cluster, so we can better evaluate our investment moving forward!
C. A/B Testing through Hypothesis Testing
What is A/B Testing and how can Hypothesis Testing complement the process?
A/B Testing is no stranger to those who work in Advertising and Media, since it is one of the most powerful techniques for improving performance cost-efficiently. In particular, A/B Testing divides the audience into two groups: Test vs Control. We then expose the ads or show a different design to the Test group only, and check whether there is any significant discrepancy between the two groups: exposed vs unexposed.
Image credit: https://productcoalition.com/are-you-segmenting-your-a-b-test-results-c5512c6def65?gi=7b445e5ef457

In Advertising, there are a number of automated tools on the market that can run an A/B Test at one click. However, I still wanted to try a different method from Data Science that can do the same: Hypothesis Testing. The methodology is much the same, as Hypothesis Testing compares the Null Hypothesis (H0) with the Alternative Hypothesis (H1) to see whether there is any significant discrepancy between the two!
Assume that I run a promotion campaign that exposes an ad to the Test group. Here’s a quick summary of steps that need to be followed to test the result with Hypothesis Testing:
For the 1st step, we can rely on Power Analysis, which helps determine the sample size to draw from a population. Power Analysis requires 3 parameters: (1) effect size, (2) power and (3) alpha. If you are looking for details on how Power Analysis works, please refer to the in-depth article I wrote some time ago.
Below is a quick note on each parameter for your quick understanding:
# Effect size: (expected mean - actual mean) / actual std
# df_group1_ab is the revenue dataframe of the group under test (prepared earlier, not shown here)
effect_size = (280000 - df_group1_ab.revenue.mean()) / df_group1_ab.revenue.std()  # expected mean set to $280,000
print(effect_size)

# Power
power = 0.9  # the probability of rejecting the null hypothesis when it is false

# Alpha
alpha = 0.05  # the significance level (accepted Type I error rate)
After having the 3 parameters ready, we use TTestPower() to determine the sample size:
import statsmodels.stats.power as sms

n = sms.TTestPower().solve_power(effect_size=effect_size, power=power, alpha=alpha)
print(n)

The result is 279, meaning we need to draw 279 data points from each group, Test and Control. As I don't have real campaign data, I used np.random.normal to generate a list of revenue data, with a sample size of 279 for each group.
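One caveat: control_rev and test_rev are never defined in the article; they are presumably the revenue series of the unexposed and exposed visitors. A placeholder sketch, purely an assumption so the snippet below runs:

# Placeholder assumption: split the grouped revenue into two halves to stand in for control vs test
rev = df_groupby.totals_transactionRevenue
control_rev = rev.sample(frac=0.5, random_state=42)
test_rev = rev.drop(control_rev.index)

With those placeholders in place, the sampling from each group looks like this: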
# Take the samples out of each group: control vs test
control_sample = np.random.normal(control_rev.mean(), control_rev.std(), size=279)
test_sample = np.random.normal(test_rev.mean(), test_rev.std(), size=279)
Moving to the 2nd step, we need to ensure the samples are (1) normally distributed and (2) independent (not correlated). Again, if you want a refresher on the tests used in this step, refer to my article above. In short, we are going to use (1) the Shapiro test for normality and (2) the Pearson correlation test for independence.
# Step 2. Prerequisites: normality, correlation
from scipy.stats import shapiro, pearsonr

stat1, p1 = shapiro(control_sample)
stat2, p2 = shapiro(test_sample)
print(p1, p2)

stat3, p3 = pearsonr(control_sample, test_sample)
print(p3)
The Shapiro p-values are 0.129 and 0.539 for the Control and Test groups respectively, both > 0.05. Hence, we do not reject the null hypothesis and can say that the two groups are normally distributed.
The Pearson p-value is 0.98, which is > 0.05, meaning there is no significant correlation between the two groups, so we can treat them as independent.
The final step is here! As there are two groups to be tested against each other (Test vs Control), we use an independent two-sample t-test to see whether there is any significant discrepancy in Revenue after running the A/B Test:
# Step 3. Hypothesis Testing
from scipy.stats import ttest_ind

tstat, p4 = ttest_ind(control_sample, test_sample)
print(p4)
The resulting p-value is 0.35, which is > 0.05. Hence, the A/B Test indicates that the Test group exposed to the ads does not show any significant improvement over the Control group with no ad exposure.
Voila! That’s the end of this project — Customer Segmentation & A/B Testing! I hope you find this article useful and easy to follow.
Do look out for my upcoming projects in Data Science and Machine Learning in the near future! In the meantime feel free to check out my Github here for the complete repository:
Github: https://github.com/andrewnguyen07
LinkedIn: www.linkedin.com/in/andrewnguyen07
Thanks!
Original article: https://towardsdatascience.com/customer-segmentation-k-means-clustering-a-b-testing-bd26a94462dd