Amazon Review Data Analysis and Visualization (KNN Rating Prediction, Word Clouds)
Contents
- 1. Project Source Code
- 2. Data
- 2.1. Data Description
- 2.2. Data Preprocessing
- 2.3. Text Cleaning
- 3. Text Feature Extraction
- 4. Finding Similar Products with a KNN Classifier
- 5. Clustering-Based Word Association
1. Project Source Code
Available for download on GitHub:
https://github.com/chenshunpeng/Amazon-Product-Recommender-System
The project is adapted from a GitHub repository: link
2. Data
2.1. Data Description
The dataset is a subset of the Amazon product data (the full site dataset contains product reviews and metadata from Amazon, 142.8 million reviews spanning May 1996 through July 2014), specifically the Clothing, Shoes and Jewelry category (278,677 reviews, 45.1 MB). It is available at: Amazon Review Data (ucsd.edu). The data used is shown below:
For experiments, download one of the "small" subsets (45.1 MB in this case). For the larger multi-gigabyte files the download is very slow and can stall at 0 KB/s, and they take too long to process interactively, so the sample I chose is: "Small" subsets for experimentation.
Fields in the review dataset:
- asin: ID of the product, e.g. 0000013714
- helpful: helpfulness rating of the review, e.g. 2/3
- overall: rating of the product
- reviewText: text of the review
- reviewTime: time of the review (raw)
- reviewerID: ID of the reviewer, e.g. A2SUAM1J3GNN3B
- reviewerName: name of the reviewer
- summary: summary of the review
- unixReviewTime: time of the review (Unix time)
A sample review from the dataset:
```
{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}
```
For example, the first three records of my reviews_merged.json (278,677 records in total) are:
```
{"reviewerID": "A1KLRMWW2FWPL4", "asin": "0000031887", "reviewerName": "Amazon Customer \"cameramom\"", "helpful": [0, 0], "reviewText": "This is a great tutu and at a really great price. It doesn't look cheap at all. I'm so glad I looked on Amazon and found such an affordable tutu that isn't made poorly. A++", "overall": 5.0, "summary": "Great tutu- not cheaply made", "unixReviewTime": 1297468800, "reviewTime": "02 12, 2011"}
{"reviewerID": "A2G5TCU2WDFZ65", "asin": "0000031887", "reviewerName": "Amazon Customer", "helpful": [0, 0], "reviewText": "I bought this for my 4 yr old daughter for dance class, she wore it today for the first time and the teacher thought it was adorable. I bought this to go with a light blue long sleeve leotard and was happy the colors matched up great. Price was very good too since some of these go for over $15.00 dollars.", "overall": 5.0, "summary": "Very Cute!!", "unixReviewTime": 1358553600, "reviewTime": "01 19, 2013"}
{"reviewerID": "A1RLQXYNCMWRWN", "asin": "0000031887", "reviewerName": "Carola", "helpful": [0, 0], "reviewText": "What can I say... my daughters have it in orange, black, white and pink and I am thinking to buy for they the fuccia one. It is a very good way for exalt a dancer outfit: great colors, comfortable, looks great, easy to wear, durables and little girls love it. I think it is a great buy for costumer and play too.", "overall": 5.0, "summary": "I have buy more than one", "unixReviewTime": 1357257600, "reviewTime": "01 4, 2013"}
```
2.2. Data Preprocessing
First, read the data from the JSON file (reviews_merged.json). Using a relative path to the dataset is more convenient; otherwise a copy would have to sit next to every .ipynb file.
I ran this in Jupyter. First, import the data-analysis libraries:
```python
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame
import nltk
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from scipy.spatial.distance import cosine
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import re
import string
from wordcloud import WordCloud, STOPWORDS
from sklearn.metrics import mean_squared_error

# Suppress warning output
import warnings
warnings.filterwarnings("ignore")
```
Plotting with matplotlib directly under Jupyter can spawn a background process while no figure appears on the page, so the line %matplotlib inline must be added at the top; for details see:
A complete fix for garbled Chinese characters in matplotlib plots under Python
A quick introduction to JSON files:
- Any supported type can be represented in JSON, e.g. strings, numbers, objects, arrays. Objects and arrays are the two special and most commonly used types.
- Object: in JS, an object is content wrapped in curly braces {}, with the key/value structure {key1: value1, key2: value2, ...}. In object-oriented languages, key is a property of the object and value is the corresponding value. Keys can be integers or strings; values can be of any type.
- Array: in JS, an array is content wrapped in square brackets [], with the indexed structure ["java", "javascript", "vb", ...]. In JS, arrays are a special data type that can also hold key/value pairs like objects, but indexed access is far more common. Again, values can be of any type.
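As a quick illustration of these two structures, Python's built-in json module maps JSON objects to dicts and JSON arrays to lists (the record below is a trimmed version of the sample review above):

```python
import json

# A JSON object (dict) containing an array (list), mirroring the review records
raw = '{"asin": "0000013714", "helpful": [2, 3], "overall": 5.0}'
record = json.loads(raw)

print(record["asin"])        # key/value lookup on the object
print(record["helpful"][0])  # index lookup on the array
```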
I then tried to read the file with pandas' read_json(), but it raised ValueError: Trailing data. The cause turned out to be the JSON format: the file stores the dicts as list elements, one complete dict per line, so the call must be pd.read_json('data.json', lines=True). lines defaults to False; set to True it reads one JSON object per line. Reference:
Pandas read_json() raising ValueError: Trailing data
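A minimal reproduction of the fix, using a toy two-line JSON Lines string (my own illustrative data) in place of the real file:

```python
import io
import pandas as pd

# Two JSON objects, one per line (JSON Lines), like reviews_merged.json
data = io.StringIO(
    '{"asin": "A", "overall": 5.0}\n'
    '{"asin": "B", "overall": 4.0}\n'
)

# Without lines=True this raises ValueError: Trailing data
df_toy = pd.read_json(data, lines=True)
print(df_toy.shape)  # (2, 2)
```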
```python
# df = pd.read_csv('reviews.csv')
df = pd.read_json('reviews_merged.json', lines=True)
# Show the initial data
df
```
Result:
A quick note: a pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); arithmetic operations align on both row and column labels. It can be thought of as a dict-like container of Series objects, and it is the primary pandas data structure. For usage see:
The basic attributes of a pandas DataFrame, explained
Getting started with pandas: creating, reading, writing, inserting into and deleting from a DataFrame
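The dict-of-Series view can be seen directly by constructing a tiny frame (toy values modeled on the review data):

```python
import pandas as pd

# A DataFrame behaves like a dict of Series sharing one row index
frame = pd.DataFrame({"asin": ["0000031887", "0123456479"],
                      "overall": [5.0, 4.0]})

print(frame.shape)              # (2, 2)
print(frame["overall"].mean())  # 4.5
```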
```python
# Get the column index
print(df.columns)
# df.shape returns a tuple: its first element is the row count,
# its second the column count, i.e. the basic shape and size of the data
print(df.shape)
```
Result:
```
Index(['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText',
       'overall', 'summary', 'unixReviewTime', 'reviewTime'],
      dtype='object')
(278677, 9)
```
Count the number of reviews per product (products are distinguished by asin) and append the counts:
```python
count = df.groupby("asin", as_index=False).count()
# mean computes averages, but it is not used here
mean = df.groupby("asin", as_index=False).mean()
# Join count onto df
dfMerged = pd.merge(df, count, how='right', on=['asin'])
dfMerged
```
Usage notes for these functions:
Applying Python's groupby
The as_index parameter (True vs. False) of pandas groupby
The pd.merge function of the pandas library, explained
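On a toy frame (my own illustrative data), the count-then-merge pattern produces the _x/_y suffixed columns seen in the full dfMerged output below:

```python
import pandas as pd

df_small = pd.DataFrame({"asin": ["A", "A", "B"],
                         "overall": [5.0, 4.0, 3.0]})
count = df_small.groupby("asin", as_index=False).count()
merged = pd.merge(df_small, count, how='right', on=['asin'])

# Shared column names get _x (left) and _y (right) suffixes;
# overall_y now holds the per-product review count
print(merged["overall_y"].tolist())  # [2, 2, 1]
```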
Result:
```
   reviewerID_x  asin  reviewerName_x  helpful_x  reviewText_x  overall_x  summary_x  unixReviewTime_x  reviewTime_x  reviewerID_y  reviewerName_y  helpful_y  reviewText_y  overall_y  summary_y  unixReviewTime_y  reviewTime_y
0  A1KLRMWW2FWPL4  0000031887  Amazon Customer "cameramom"  [0, 0]  This is a great tutu and at a really great pri...  5  Great tutu- not cheaply made  1297468800  02 12, 2011  23  23  23  23  23  23  23  23
1  A2G5TCU2WDFZ65  0000031887  Amazon Customer  [0, 0]  I bought this for my 4 yr old daughter for dan...  5  Very Cute!!  1358553600  01 19, 2013  23  23  23  23  23  23  23  23
...
```
Of course, if we display the intermediate count table, every column of a row holds the same number, namely that product's review count:
Finally, add three columns to dfMerged:
```python
dfMerged["totalReviewers"] = dfMerged["reviewerID_y"]
dfMerged["overallScore"] = dfMerged["overall_x"]
dfMerged["summaryReview"] = dfMerged["summary_x"]
```
Result:
```
   reviewerID_x  asin  reviewerName_x  helpful_x  reviewText_x  overall_x  summary_x  unixReviewTime_x  reviewTime_x  reviewerID_y  reviewerName_y  helpful_y  reviewText_y  overall_y  summary_y  unixReviewTime_y  reviewTime_y  totalReviewers  overallScore  summaryReview
0  A1KLRMWW2FWPL4  0000031887  Amazon Customer "cameramom"  [0, 0]  This is a great tutu and at a really great pri...  5  Great tutu- not cheaply made  1297468800  02 12, 2011  23  23  23  23  23  23  23  23  23  5  Great tutu- not cheaply made
1  A2G5TCU2WDFZ65  0000031887  Amazon Customer  [0, 0]  I bought this for my 4 yr old daughter for dan...  5  Very Cute!!  1358553600  01 19, 2013  23  23  23  23  23  23  23  23  23  5  Very Cute!!
...
```
Sort dfMerged by totalReviewers (descending) and store the products with 100 or more reviews in dfCount:
```python
dfMerged = dfMerged.sort_values(by='totalReviewers', ascending=False)
dfCount = dfMerged[dfMerged.totalReviewers >= 100]
dfCount
```
Result:
```
        reviewerID_x  asin  reviewerName_x  helpful_x  reviewText_x  overall_x  summary_x  unixReviewTime_x  reviewTime_x  reviewerID_y  reviewerName_y  helpful_y  reviewText_y  overall_y  summary_y  unixReviewTime_y  reviewTime_y  totalReviewers  overallScore  summaryReview
161700  A205ZO9KZY2ZD2  B005LERHD8  Winnie  [0, 0]  I was expecting it to be more of a gold tint w...  4  It's ok  1357776000  01 10, 2013  441  441  441  441  441  441  441  441  441  4  It's ok
161269  A1HFSY6W8LJNJM  B005LERHD8  Alicia7tommy "Alicia Andrews"  [0, 0]  The owl necklace is really cute but made real ...  4  Really Cute  1343001600  07 23, 2012  441  441  441  441  441  441  441  441  441  4  Really Cute
...
```
Next, compute the mean overall rating for each product:
First, look at df:
```python
df
```
Result:
```
   reviewerID  asin  reviewerName  helpful  reviewText  overall  summary  unixReviewTime  reviewTime
0  A1KLRMWW2FWPL4  0000031887  Amazon Customer "cameramom"  [0, 0]  This is a great tutu and at a really great pri...  5  Great tutu- not cheaply made  1297468800  02 12, 2011
1  A2G5TCU2WDFZ65  0000031887  Amazon Customer  [0, 0]  I bought this for my 4 yr old daughter for dan...  5  Very Cute!!  1358553600  01 19, 2013
...
```
Then compute the means into dfProductReview (only the columns on which a mean is defined are kept):
```python
dfProductReview = df.groupby("asin", as_index=False).mean()
dfProductReview
```
Result:
```
         asin   overall  unixReviewTime
0  0000031887  4.608696    1.370064e+09
1  0123456479  4.166667    1.382947e+09
...
```
Group the review summaries (summaryReview) by asin into ProductReviewSummary and save them to ProductReviewSummary.csv:
```python
ProductReviewSummary = dfCount.groupby("asin")["summaryReview"].apply(list)
ProductReviewSummary = pd.DataFrame(ProductReviewSummary)
ProductReviewSummary.to_csv("ProductReviewSummary.csv")
ProductReviewSummary
```
Result:
```
asin        summaryReview
B000072UMJ  [Love it, Weird sizing on the tag..., Great Sh...
B0000ANHST  [It's a carhartt what more can you say, Nice, ...
...
```
Read the summaries from ProductReviewSummary into df3, merge in dfProductReview (which carries the mean ratings), and drop the irrelevant unixReviewTime column, i.e. keep only the three columns 'asin', 'summaryReview' and 'overall':
```python
df3 = pd.read_csv("ProductReviewSummary.csv")
df3 = pd.merge(df3, dfProductReview, on="asin", how='inner')
df3 = df3[['asin', 'summaryReview', 'overall']]
df3
```
Result:
```
         asin                                      summaryReview   overall
0  B000072UMJ  ['Love it', 'Weird sizing on the tag...', 'Gre...  4.594595
1  B0000ANHST  ["It's a carhartt what more can you say", 'Nic...  4.487179
...
```
2.3. Text Cleaning
定義文本清理函數cleanReviews:
#用于文本清理的函數 #匹配以a-z開頭的字符串 regEx = re.compile('[^a-z]+') def cleanReviews(reviewText):reviewText = reviewText.lower()#刪除空格reviewText = regEx.sub(' ', reviewText).strip()return reviewTextre.sub (pattern, replacement, string)將所有出現的 pattern 替換為提供的字符串中的 replacement。 這個方法的行為類似于 Python 字符串方法 str.sub,但是使用正則表達式來匹配模式,具體可看:
【python】Regex相關函數的使用
Python學習,python的re模塊,正則表達式用法詳解,正則表達式中括號的用法
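A quick check of the cleaner on one of the sample summaries shows punctuation and case being normalized away:

```python
import re

# Same cleaning logic as cleanReviews above
regEx = re.compile('[^a-z]+')

def cleanReviews(reviewText):
    reviewText = reviewText.lower()
    reviewText = regEx.sub(' ', reviewText).strip()
    return reviewText

print(cleanReviews("Great tutu- not cheaply made!!"))
# great tutu not cheaply made
```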
Reset the index and drop duplicate rows (see: a summary of pandas reset_index() usage):
```python
df3["summaryClean"] = df3["summaryReview"].apply(cleanReviews)
# drop_duplicates removes duplicate rows; here one product is kept per distinct mean rating
df3 = df3.drop_duplicates(['overall'], keep='last')
# reset_index adds the old index as a column and installs a new sequential index
df3 = df3.reset_index()
df3
```
Result:
```
   index        asin                                      summaryReview   overall                                       summaryClean
0      0  B0000ANHST  ["It's a carhartt what more can you say", 'Nic...  4.487179  it s a carhartt what more can you say nice hea...
1      1  B0000C321X  ['NIce fit, nice wash', 'nice', 'nada mejor', ...  4.263415  nice fit nice wash nice nada mejor levi s orig...
...
```
3. Text Feature Extraction
Extract the cleaned summaries (summaryClean) from df3 into reviews, then run sklearn's CountVectorizer for text feature extraction; for each document it simply counts how often each vocabulary term occurs:
```python
reviews = df3["summaryClean"]
# max_features=300: sort all terms by corpus-wide frequency (descending) and keep only the top 300
# stop_words='english': common English stop words are skipped entirely
countVector = CountVectorizer(max_features=300, stop_words='english')
transformedReviews = countVector.fit_transform(reviews)

# Note: on scikit-learn >= 1.2, get_feature_names() is gone; use get_feature_names_out()
dfReviews = DataFrame(transformedReviews.A, columns=countVector.get_feature_names())
dfReviews = dfReviews.astype(int)
dfReviews
```
Note that some words carry no signal for sentiment classification. Such words are called stop words and can be skipped entirely when counting: ignoring them shrinks the term-frequency matrix and speeds up its construction.
The CSDN articles on stop_words are vague; stackoverflow offers better insight (see: scikit learn classifies stopwords).
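The effect of stop_words can be seen concretely on a toy corpus (my own illustrative sentences); the stop words 'and', 'the' and 'is' vanish from the vocabulary while content words are counted:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great fit and great price", "the fit is poor", "great price"]
cv = CountVectorizer(stop_words='english')  # drops 'and', 'the', 'is'
X_counts = cv.fit_transform(docs)

print(sorted(cv.vocabulary_))  # ['fit', 'great', 'poor', 'price']
print(X_counts.toarray())      # row 0 counts 'great' twice
```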
Some other useful posts:
sklearn CountVectorizer, explained
Machine learning with Python (5): text feature extraction and vectorization
[SKLEARN] Using CountVectorizer to extract term-frequency features and compute TF-IDF features (with runnable code)
Result:
Save dfReviews to dfReviews.csv:
4. Finding Similar Products with a KNN Classifier
Create the training and test sets:
```python
# Build a feature matrix X
X = np.array(dfReviews)
# Split into training and test sets (90% / 10%)
tpercent = 0.9
tsize = int(np.floor(tpercent * len(dfReviews)))
dfReviews_train = X[:tsize]
dfReviews_test = X[tsize:]
# Sizes of the training and test sets
lentrain = len(dfReviews_train)
lentest = len(dfReviews_test)
print(lentrain)
print(lentest)
```
Result:
```
80
9
```
Next, use the k-nearest-neighbors algorithm (reputedly the simplest machine-learning algorithm) to find the most related products.
For the k-nearest-neighbors algorithm, see: Scikit-learn KNN learning
The English material is also worth reading; it is more detailed than most Chinese write-ups and forms a self-contained reference, so for serious study read the original documentation (sklearn.impute.KNNImputer):
As my teacher mentioned in class, there are two commonly used machine-learning references (scikit-learn, in English and Chinese) that are worth consulting often:
1: scikit-learn Machine Learning in Python
2: scikit-learn (sklearn) official documentation, Chinese edition
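Before the full code, the kneighbors interface can be seen on a tiny one-dimensional toy set (illustrative data only): it returns the distances to, and row indices of, the k closest training points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four points on a line; the neighbors of a query are the closest rows
X_toy = np.array([[0.0], [1.0], [5.0], [6.0]])
nn = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X_toy)

distances, indices = nn.kneighbors([[0.8]])
print(indices[0].tolist())  # [1, 0]: rows 1 and 0 are nearest
```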
Code:
```python
neighbor = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(dfReviews_train)
# kneighbors() returns the k nearest neighbors of each point in X
distances, indices = neighbor.kneighbors(dfReviews_train)

# Find the most related products
for i in range(lentest):
    a = neighbor.kneighbors([dfReviews_test[i]])
    related_product_list = a[1]

    first_related_product = [item[0] for item in related_product_list]
    first_related_product = str(first_related_product).strip('[]')
    first_related_product = int(first_related_product)
    second_related_product = [item[1] for item in related_product_list]
    second_related_product = str(second_related_product).strip('[]')
    second_related_product = int(second_related_product)

    print("Based on product reviews, for ", df3["asin"][lentrain + i], " average rating is ", df3["overall"][lentrain + i])
    print("The first similar product is ", df3["asin"][first_related_product], " average rating is ", df3["overall"][first_related_product])
    print("The second similar product is ", df3["asin"][second_related_product], " average rating is ", df3["overall"][second_related_product])
    print("-----------------------------------------------------------")
```
Result:
```
Based on product reviews, for  B008RUOCJU  average rating is  3.973684210526316
The first similar product is  B007WAEBPQ  average rating is  4.333333333333333
The second similar product is  B004R1II48  average rating is  4.055555555555555
-----------------------------------------------------------
Based on product reviews, for  B008WYDP1C  average rating is  4.257028112449799
The first similar product is  B007WA3K4Y  average rating is  4.209424083769633
The second similar product is  B0083S18LQ  average rating is  3.9565217391304346
-----------------------------------------------------------
Based on product reviews, for  B008X0EW44  average rating is  3.874125874125874
The first similar product is  B007WAEBPQ  average rating is  4.333333333333333
The second similar product is  B0083S18LQ  average rating is  3.9565217391304346
-----------------------------------------------------------
Based on product reviews, for  B009DNWFD0  average rating is  3.8446601941747574
The first similar product is  B0053XF2U2  average rating is  3.8684210526315788
The second similar product is  B004R1II48  average rating is  4.055555555555555
-----------------------------------------------------------
Based on product reviews, for  B009ZDEXQK  average rating is  4.7254901960784315
The first similar product is  B000EIJG0I  average rating is  4.594594594594595
The second similar product is  B001Q5QLP6  average rating is  4.673913043478261
-----------------------------------------------------------
Based on product reviews, for  B00BNB3A0W  average rating is  3.4414414414414414
The first similar product is  B004Z1CZDK  average rating is  3.1923076923076925
The second similar product is  B0053XF2U2  average rating is  3.8684210526315788
-----------------------------------------------------------
Based on product reviews, for  B00CIBCJ62  average rating is  4.2164179104477615
The first similar product is  B004R1II48  average rating is  4.055555555555555
The second similar product is  B007WAEBPQ  average rating is  4.333333333333333
-----------------------------------------------------------
Based on product reviews, for  B00CKGB85I  average rating is  4.066666666666666
The first similar product is  B004R1II48  average rating is  4.055555555555555
The second similar product is  B0074T7TY0  average rating is  4.255474452554744
-----------------------------------------------------------
Based on product reviews, for  B00CN47GXA  average rating is  3.4634146341463414
The first similar product is  B007WAU1VY  average rating is  3.551470588235294
The second similar product is  B007WAEBPQ  average rating is  4.333333333333333
-----------------------------------------------------------
Based on product reviews, for  B00D1MR8YU  average rating is  3.83739837398374
The first similar product is  B004R1II48  average rating is  4.055555555555555
The second similar product is  B0053XF2U2  average rating is  3.8684210526315788
-----------------------------------------------------------
Based on product reviews, for  B00DMWQK0W  average rating is  4.298076923076923
The first similar product is  B0078FXHNM  average rating is  4.26056338028169
The second similar product is  B007WAEBPQ  average rating is  4.333333333333333
-----------------------------------------------------------
Based on product reviews, for  B00DMWQOYY  average rating is  4.119718309859155
The first similar product is  B0067GUM2W  average rating is  4.174863387978142
The second similar product is  B0078FXHNM  average rating is  4.26056338028169
-----------------------------------------------------------
Based on product reviews, for  B00DNQIIE8  average rating is  4.228758169934641
The first similar product is  B0078FXHNM  average rating is  4.26056338028169
The second similar product is  B0067GUM2W  average rating is  4.174863387978142
-----------------------------------------------------------
Based on product reviews, for  B00DQYNS3I  average rating is  4.526315789473684
The first similar product is  B003YBHF82  average rating is  4.21
The second similar product is  B000FH4JJQ  average rating is  4.536363636363636
-----------------------------------------------------------
```
Print the data in the same format:
```python
# Print the data in the same format
# print("Based on product reviews, for ", df3["asin"][260], " average rating is ", df3["overall"][260])
print("The first similar product is ", df3["asin"][first_related_product], " average rating is ", df3["overall"][first_related_product])
print("The second similar product is ", df3["asin"][second_related_product], " average rating is ", df3["overall"][second_related_product])
print("-----------------------------------------------------------")
```
Result:
```
The first similar product is  B003YBHF82  average rating is  4.21
The second similar product is  B000FH4JJQ  average rating is  4.536363636363636
-----------------------------------------------------------
```
Predict the ratings:
```python
df5_train_target = df3["overall"][:lentrain]
df5_test_target = df3["overall"][lentrain:lentrain + lentest]
df5_train_target = df5_train_target.astype(int)
df5_test_target = df5_test_target.astype(int)

n_neighbors = 3
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knnclf.fit(dfReviews_train, df5_train_target)
knnpreds_test = knnclf.predict(dfReviews_test)

print(classification_report(df5_test_target, knnpreds_test))
```
Result:
```
             precision    recall  f1-score   support

          3       1.00      1.00      1.00         3
          4       1.00      1.00      1.00         6

avg / total       1.00      1.00      1.00         9
```
Model accuracy:
```python
print(accuracy_score(df5_test_target, knnpreds_test))
print(mean_squared_error(df5_test_target, knnpreds_test))
```
Result:
```
1.0
0.0
```
5. Clustering-Based Word Association
First, look at df:
```python
df
```
Result:
Then group the review summaries by rating:
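The grouping code itself is missing from the post; given the output that follows and the later use of the name cluster, it was presumably df.groupby("overall")["summary"].apply(list). A self-contained sketch of that step (the toy frame is my own stand-in for the full review data):

```python
import pandas as pd

# Toy stand-in for the full review frame
df_demo = pd.DataFrame({"overall": [5, 5, 1],
                        "summary": ["Great tutu- not cheaply made",
                                    "Very Cute!!",
                                    "Never GOT IT...."]})

# Presumed grouping step: collect the summaries for each rating into a list
cluster = df_demo.groupby("overall")["summary"].apply(list)
print(cluster[5])  # ['Great tutu- not cheaply made', 'Very Cute!!']
```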
Result:
```
overall
1    [Never GOT IT...., DO NOT BUY IF YOU EVER WANT...
2    [too short, I'm glad i bought back up straps, ...
3    [Came apart in 2weeks!, Arrived with a defect,...
4    [It's ok, Good, Practically Perfect in every w...
5    [Great tutu- not cheaply made, Very Cute!!,  I...
Name: summary, dtype: object
```
Convert the grouped data into a DataFrame:
See: standardizing DataFrame data, pandas usage and preprocessing examples
```python
cluster = pd.DataFrame(cluster)
cluster
```
Result:
Save it to cluster.csv, then load the data from cluster.csv into cluster1 and clean it:
Result:
Visualize a word cloud for each rating group:
Note: to render the Chinese titles in the figures, add #coding:utf-8, select a backend with matplotlib.use('qt4agg') and a default font, and set matplotlib.rcParams['axes.unicode_minus'] = False so the minus sign '-' is not drawn as a box; otherwise errors occur.
```python
#coding:utf-8
import matplotlib
matplotlib.use('qt4agg')
# Set the default font
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
# Keep the minus sign '-' from being drawn as a box
matplotlib.rcParams['axes.unicode_minus'] = False
```
Then show the word clouds for the different rating groups:
```python
show_wordcloud(cluster1["summaryClean"][0], title="1-star reviews")
show_wordcloud(cluster1["summaryClean"][1], title="2-star reviews")
show_wordcloud(cluster1["summaryClean"][2], title="3-star reviews")
show_wordcloud(cluster1["summaryClean"][3], title="4-star reviews")
show_wordcloud(cluster1["summaryClean"][4], title="5-star reviews")
show_wordcloud(cluster1["summaryClean"][:], title="All reviews, ratings 1-5")
```
Summary