[Machine Learning] Building a Recommender System with Singular Value Decomposition (SVD)
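Today we will learn how to build a recommender system using nothing but singular value decomposition. If you are not familiar with singular value decomposition, the companion article "This Time, Finally Understanding the Principles and Applications of Singular Value Decomposition (SVD)" is recommended reading.
Singular value decomposition is a very popular linear algebra technique for breaking a matrix down into the product of a few smaller matrices. It has many uses. In particular, SVD can uncover relationships between items, and a recommender system can be built from those relationships.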
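This article covers:
How to compute the singular value decomposition of a matrix
How to interpret the result of a singular value decomposition
What data a recommender system needs, and how SVD can be used to analyze it
How to use the result of SVD to make recommendations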
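Introduction to Singular Value Decomposition
Just as the integer 24 can be decomposed into the factors 24 = 2×3×4, a matrix can be expressed as a product of other matrices. Because matrices are arrays of numbers, they have their own rules of multiplication, and consequently there are several ways to factor them, known as decompositions. QR decomposition and LU decomposition are common examples. Another is the singular value decomposition, which places no restriction on the shape or properties of the matrix being decomposed.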
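Suppose a matrix $M$ (say, an $m \times n$ matrix) is decomposed as

$$M = U \Sigma V^T$$

where $U$ is an $m \times m$ matrix, $\Sigma$ is an $m \times n$ diagonal matrix, and $V^T$ is an $n \times n$ matrix. The diagonal matrix $\Sigma$ may be non-square, but only the entries on its diagonal can be nonzero. The matrices $U$ and $V^T$ are orthonormal matrices: the columns of $U$ and the rows of $V^T$ are unit vectors that are orthogonal to one another. Two vectors are orthogonal if their dot product is zero, and a vector is a unit vector if its L2 norm is 1. An orthonormal matrix has the property that its transpose is its inverse. In other words, since $U$ is an orthonormal matrix, $U^T = U^{-1}$, or equivalently $U U^T = U^T U = I$, where $I$ is the identity matrix.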
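The definition above can be checked numerically. The following is a minimal sketch using NumPy's `numpy.linalg.svd`; the 4×3 matrix is a made-up example and is not taken from the dataset used later in the article.

```python
import numpy as np

# Hypothetical 4x3 matrix, chosen only for illustration
M = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 2.0, 4.0]])
m, n = M.shape

# full_matrices=True returns U as m x m and V^T as n x n
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# NumPy returns the singular values as a 1-D array; place them on the
# diagonal of an m x n matrix to build Sigma
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

print(U.shape, Sigma.shape, Vt.shape)      # (4, 4) (4, 3) (3, 3)
print(np.allclose(U @ Sigma @ Vt, M))      # does U Sigma V^T reproduce M?
print(np.allclose(U.T @ U, np.eye(m)))     # is U^T the inverse of U?
print(np.allclose(Vt @ Vt.T, np.eye(n)))   # is V^T orthonormal as well?
```

Note that NumPy hands back the singular values as a vector rather than as the matrix $\Sigma$, which is why the sketch rebuilds $\Sigma$ explicitly before multiplying.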
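Singular value decomposition takes its name from the diagonal entries of $\Sigma$, which are called the singular values of the matrix $M$. The nonzero singular values are in fact the square roots of the nonzero eigenvalues of the matrix $M M^T$. Much like a number factored into primes, the singular value decomposition of a matrix reveals a great deal about the structure of that matrix.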
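What is described above is actually called the full SVD. There is another version called the reduced SVD or compact SVD. The formula is still $M = U \Sigma V^T$, but now $\Sigma$ is an $r \times r$ square diagonal matrix, where $r$ is the rank of $M$, which is at most the smaller of $m$ and $n$. The matrix $U$ is an $m \times r$ matrix and $V^T$ is an $r \times n$ matrix. Because $U$ and $V^T$ are non-square, they are called semi-orthonormal, meaning $U^T U = I$ and $V^T V = I$, where $I$ in both cases is the $r \times r$ identity matrix.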
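As a rough illustration of the reduced form, the sketch below uses a made-up 4×3 matrix of rank 2. One assumption to be aware of: `numpy.linalg.svd(..., full_matrices=False)` keeps min(m, n) components rather than exactly r, so the sketch drops the components whose singular values are numerically zero to reach the r-component form described above. The tolerance formula is a common convention, not something prescribed by the article.

```python
import numpy as np

# Hypothetical 4x3 matrix of rank 2: row 2 is twice row 1,
# and row 1 equals row 3 plus twice row 4
M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# NumPy's reduced form keeps min(m, n) = 3 components
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Numerical rank: count the singular values above a small tolerance
tol = s.max() * max(M.shape) * np.finfo(M.dtype).eps
r = int(np.sum(s > tol))

# Keep only the first r components
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

print(r, U_r.shape, Vt_r.shape)
print(np.allclose(U_r @ np.diag(s_r) @ Vt_r, M))   # still reproduces M
print(np.allclose(U_r.T @ U_r, np.eye(r)))         # U^T U = I (r x r)
print(np.allclose(Vt_r @ Vt_r.T, np.eye(r)))       # V^T V = I (r x r)
```

The products in the other order, `U_r @ U_r.T` and `Vt_r.T @ Vt_r`, are generally not identity matrices, which is what "semi-orthonormal" refers to.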
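The Meaning of Singular Value Decomposition in Recommender Systems
If the matrix $M$ has rank $r$, it can be shown that the matrices $M M^T$ and $M^T M$ both have rank $r$. In the (reduced) singular value decomposition, the columns of $U$ are eigenvectors of $M M^T$ and the rows of $V^T$ are eigenvectors of $M^T M$. Interestingly, $M M^T$ and $M^T M$ may have different sizes (because $M$ can be non-square), yet they share the same set of nonzero eigenvalues, namely the squares of the values on the diagonal of $\Sigma$.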
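These eigenvalue relationships can be verified on a small example. The sketch below assumes a random 5×3 matrix (which has full column rank with probability 1) and uses `numpy.linalg.eigvalsh`, which is intended for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((5, 3))   # hypothetical 5x3 matrix, for illustration only

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# M M^T is 5x5 and M^T M is 3x3, so they have different sizes
eig_big = np.linalg.eigvalsh(M @ M.T)     # 5 eigenvalues, ascending
eig_small = np.linalg.eigvalsh(M.T @ M)   # 3 eigenvalues, ascending

# The nonzero eigenvalues of both matrices equal the squared singular values
print(np.allclose(eig_small[::-1], s**2))
print(np.allclose(eig_big[::-1][:3], s**2))
print(np.allclose(eig_big[:2], 0))        # the two extra eigenvalues are ~0

# Columns of U are eigenvectors of M M^T; rows of V^T are eigenvectors of M^T M
print(np.allclose((M @ M.T) @ U, U * s**2))
print(np.allclose((M.T @ M) @ Vt.T, Vt.T * s**2))
```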
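This is why the result of a singular value decomposition can reveal so much about the matrix $M$.
Suppose we have collected some book reviews, arranged so that books are columns, people are rows, and each entry is the rating one person gave one book. In that case, $M M^T$ is a person-to-person table whose entry for a pair of people is the sum, over all books, of the product of the ratings the two people gave. Similarly, $M^T M$ is a book-to-book table whose entry for a pair of books is the sum, over all people, of the product of the ratings the two books received. What is the hidden connection between people and books? It could be the genre, the author, or something of a similar nature.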
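To make the person-to-person and book-to-book tables concrete, here is a small sketch with invented ratings for 3 people and 4 books. The numbers are hypothetical, and 0 stands for "not rated".

```python
import numpy as np

# Hypothetical ratings: 3 people (rows) x 4 books (columns), 0 = not rated
M = np.array([[5, 0, 3, 0],
              [4, 0, 0, 1],
              [0, 2, 0, 5]])

person_to_person = M @ M.T   # 3x3: one row and one column per person
book_to_book = M.T @ M       # 4x4: one row and one column per book

print(person_to_person)
print(book_to_book)

# Persons 0 and 1 share only book 0, so their entry is 5 * 4
print(person_to_person[0, 1])   # 20
# Books 0 and 2 share only person 0, so their entry is 5 * 3
print(book_to_book[0, 2])       # 15
```

A large off-diagonal entry means the two people (or two books) have overlapping, highly rated items in common, which is the kind of relationship the decomposition picks up on.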
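Building the Recommender System
The Dataset
Next, let us see how the result of SVD can be used to build a recommender system. First, download the dataset (note: it is about 600 MB).
The dataset is the "Social Recommendation Data"[2] from the "Recommender Systems and Personalization Datasets"[1]. It contains the reviews users gave to books on Librarything[3]. What we are interested in is the number of "stars" a user gave to a book.
If you unpack the tar file, you will see a large file named "reviews.json". You can either extract it or read the contained file on the fly.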
import tarfile

# Peek at the first four records without extracting the archive to disk
with tarfile.open("lthing_data.tar.gz") as tar:
    print("Files in tar archive:")
    tar.list()
    with tar.extractfile("lthing_data/reviews.json") as file:
        count = 0
        for line in file:
            print(line)
            count += 1
            if count > 3:
                break

The above prints:
Files in tar archive:
?rwxr-xr-x julian/julian 0 2016-09-30 17:58:55 lthing_data/
?rw-r--r-- julian/julian 4824989 2014-01-02 13:55:12 lthing_data/edges.txt
?rw-rw-r-- julian/julian 1604368260 2016-09-30 17:58:25 lthing_data/reviews.json
b"{'work': '3206242', 'flags': [], 'unixtime': 1194393600, 'stars': 5.0, 'nhelpful': 0, 'time': 'Nov 7, 2007', 'comment': 'This a great book for young readers to be introduced to the world of Middle Earth. ', 'user': 'van_stef'}\n"
b"{'work': '12198649', 'flags': [], 'unixtime': 1333756800, 'stars': 5.0, 'nhelpful': 0, 'time': 'Apr 7, 2012', 'comment': 'Help Wanted: Tales of On The Job Terror from Evil Jester Press is a fun and scary read. This book is edited by Peter Giglio and has short stories by Joe McKinney, Gary Brandner, Henry Snider and many more. As if work wasnt already scary enough, this book gives you more reasons to be scared. Help Wanted is an excellent anthology that includes some great stories by some master storytellers.\\nOne of the stories includes Agnes: A Love Story by David C. Hayes, which tells the tale of a lawyer named Jack who feels unappreciated at work and by his wife so he starts a relationship with a photocopier. They get along well until the photocopier starts wanting the lawyer to kill for it. The thing I liked about this story was how the author makes you feel sorry for Jack. His two co-workers are happily married and love their jobs while Jack is married to a paranoid alcoholic and he hates and works at a job he cant stand. You completely understand how he can fall in love with a copier because he is a lonely soul that no one understands except the copier of course.\\nAnother story in Help Wanted is Work Life Balance by Jeff Strand. In this story a man works for a company that starts to let their employees do what they want at work. It starts with letting them come to work a little later than usual, then the employees are allowed to hug and kiss on the job. Things get really out of hand though when the company starts letting employees carry knives and stab each other, as long as it doesnt interfere with their job. This story is meant to be more funny then scary but still has its scary moments. Jeff Strand does a great job mixing humor and horror in this story.\\nAnother good story in Help Wanted: On The Job Terror is The Chapel Of Unrest by Stephen Volk. This is a gothic horror story that takes place in the 1800s and has to deal with an undertaker who has the duty of capturing and embalming a ghoul who has been eating dead bodies in a graveyard. Stephen Volk through his use of imagery in describing the graveyard, the chapel and the clothes of the time, transports you into an 1800s gothic setting that reminded me of Bram Stokers Dracula.\\nOne more story in this anthology that I have to mention is Expulsion by Eric Shapiro which tells the tale of a mad man going into a office to kill his fellow employees. This is a very short but very powerful story that gets you into the mind of a disgruntled employee but manages to end on a positive note. Though there were stories I didnt like in Help Wanted, all in all its a very good anthology. I highly recommend this book ', 'user': 'dwatson2'}\n"
b"{'work': '12533765', 'flags': [], 'unixtime': 1352937600, 'nhelpful': 0, 'time': 'Nov 15, 2012', 'comment': 'Magoon, K. (2012). Fire in the streets. New York: Simon and Schuster/Aladdin. 336 pp. ISBN: 978-1-4424-2230-8. (Hardcover); $16.99.\\nKekla Magoon is an author to watch (http://www.spicyreads.org/Author_Videos.html- scroll down). One of my favorite books from 2007 is Magoons The Rock and the River. At the time, I mentioned in reviews that we have very few books that even mention the Black Panther Party, let alone deal with them in a careful, thorough way. Fire in the Streets continues the story Magoon began in her debut book. While her familys financial fortunes drip away, not helped by her mothers drinking and assortment of boyfriends, the Panthers provide a very real respite for Maxie. Sam is still dealing with the death of his brother. Maxies relationship with Sam only serves to confuse and upset them both. Her friends, Emmalee and Patrice, are slowly drifting away. The Panther Party is the only thing that seems to make sense and she basks in its routine and consistency. She longs to become a full member of the Panthers and constantly battles with her Panther brother Raheem over her maturity and ability to do more than office tasks. Maxie wants to have her own gun. When Maxie discovers that there is someone working with the Panthers that is leaking information to the government about Panther activity, Maxie investigates. Someone is attempting to destroy the only place that offers her shelter. Maxie is determined to discover the identity of the traitor, thinking that this will prove her worth to the organization. However, the truth is not simple and it is filled with pain. Unfortunately we still do not have many teen books that deal substantially with the Democratic National Convention in Chicago, the Black Panther Party, and the social problems in Chicago that lead to the civil unrest. Thankfully, Fire in the Streets lives up to the standard Magoon set with The Rock and the River. Readers will feel like they have stepped back in time. Magoons factual tidbits add journalistic realism to the story and only improves the atmosphere. Maxie has spunk. Readers will empathize with her Atlas-task of trying to hold onto her world. Fire in the Streets belongs in all middle school and high school libraries. While readers are able to read this story independently of The Rock and the River, I strongly urge readers to read both and in order. Magoons recognition by the Coretta Scott King committee and the NAACP Image awards are NOT mistakes!', 'user': 'edspicer'}\n"
b'{\'work\': \'12981302\', \'flags\': [], \'unixtime\': 1364515200, \'stars\': 4.0, \'nhelpful\': 0, \'time\': \'Mar 29, 2013\', \'comment\': "Well, I definitely liked this book better than the last in the series. There was less fighting and more story. I liked both Toni and Ricky Lee and thought they were pretty good together. The banter between the two was sweet and often times funny. I enjoyed seeing some of the past characters and of course it\'s always nice to be introduced to new ones. I just wonder how many more of these books there will be. At least two hopefully, one each for Rory and Reece. ", \'user\': \'amdrane2\'}\n'

Extracting the Data
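Each line of reviews.json is one record. We will extract the "user", "work", and "stars" fields of each record, provided none of the three is missing. Despite the file name, the records are not strictly well-formed JSON strings; most notably, they use single quotes rather than double quotes. We therefore cannot use Python's json package here and instead use ast to decode these strings.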
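The difference between the two parsers can be seen on a single record. The sketch below uses a shortened, hand-written record in the same single-quoted style as the dataset; the variable names are illustrative only.

```python
import ast
import json

# A shortened record in the same style as the dataset (single-quoted strings)
line = b"{'work': '3206242', 'flags': [], 'stars': 5.0, 'user': 'van_stef'}\n"
text = line.decode("utf8")

# JSON requires double-quoted strings, so json.loads rejects this line
try:
    json.loads(text)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False
    print("json.loads rejects the record")

# ast.literal_eval parses it as a Python dict literal without executing code
record = ast.literal_eval(text)
print(record["user"], record["work"], record["stars"])
```

`ast.literal_eval` only accepts Python literals (strings, numbers, lists, dicts and so on), so it is a safer choice than `eval` for data that comes from a file.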
import ast
import tarfile

reviews = []
with tarfile.open("lthing_data.tar.gz") as tar:
    with tar.extractfile("lthing_data/reviews.json") as file:
        for line in file:
            # Each line is a Python-style dict literal, not strict JSON
            record = ast.literal_eval(line.decode("utf8"))
            # Skip records that lack any of the three fields we need
            if any(x not in record for x in ['user', 'work', 'stars']):
                continue
            reviews.append([record['user'], record['work'], record['stars']])
print(len(reviews), "records retrieved")

1387209 records retrieved

Building the DataFrame
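We now need a matrix that records how each user rated each book. As a first step, we use the pandas library to turn the collected records into a table: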
import pandas as pd

reviews = pd.DataFrame(reviews, columns=["user", "work", "stars"])
print(reviews.head())

            user      work  stars
0       van_stef   3206242    5.0
1       dwatson2  12198649    5.0
2       amdrane2  12981302    4.0
3  Lila_Gustavus   5231009    3.0
4      skinglist    184318    2.0

Filtering the Data
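To save time and memory, we do not use all the data here. We consider only users who reviewed at least 50 books and books that were reviewed by at least 50 users. This trims the dataset to less than 15% of its original size:
Find the users who reviewed at least 50 books: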
usercount = reviews[["work", "user"]].groupby("user").count()
usercount = usercount[usercount["work"] >= 50]
print(usercount.head())

            work
user
              84
-Eva-        602
06nwingert   370
1983mk        63
1dragones    194

Find the books reviewed by at least 50 users:
workcount = reviews[["work", "user"]].groupby("work").count()
workcount = workcount[workcount["user"] >= 50]
print(workcount.head())

          user
work
10000      106
10001       53
1000167    186
10001797    53
10005525   134

Keep only the popular books and the active users:
reviews = reviews[reviews["user"].isin(usercount.index) & reviews["work"].isin(workcount.index)]
print(reviews)

                user     work  stars
0           van_stef  3206242    5.0
6            justine     3067    4.5
18           stephmo  1594925    4.0
19         Eyejaybee  2849559    5.0
35       LisaMaria_C   452949    4.5
...              ...      ...    ...
1387161     connie53     1653    4.0
1387177   BruderBane    24623    4.5
1387192  StuartAston  8282225    4.0
1387202      danielx  9759186    4.0
1387206     jclark88  8253945    3.0

[205110 rows x 3 columns]

Transforming the Data
We can then use the pivot function in pandas to convert this table into a matrix:

reviewmatrix = reviews.pivot(index="user", columns="work", values="stars").fillna(0)

The result is a matrix with 5593 rows and 2898 columns.
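To see what the pivot step does, here is a sketch on a tiny made-up table. The user names and book IDs are hypothetical and stand in for the real "user" and "work" values.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, work) pair
toy = pd.DataFrame(
    [["alice", "b1", 5.0],
     ["alice", "b2", 3.0],
     ["bob",   "b1", 4.0],
     ["carol", "b3", 2.0]],
    columns=["user", "work", "stars"],
)

# One row per user, one column per book; missing ratings become 0
toy_matrix = toy.pivot(index="user", columns="work", values="stars").fillna(0)
print(toy_matrix)
print(toy_matrix.shape)   # (3, 3)
```

Two things are worth keeping in mind. First, `fillna(0)` makes "not rated" indistinguishable from a rating of 0, which is a simplification. Second, `DataFrame.pivot` raises a `ValueError` if the same (user, work) pair appears more than once; in that case `pivot_table` with an aggregation function such as the mean is one possible alternative.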
Applying Singular Value Decomposition
We have now represented 5593 users and 2898 books in a single matrix. Next we apply SVD (this takes a while):
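from numpy.linalg import svd

matrix = reviewmatrix.values
u, s, vh = svd(matrix, full_matrices=False)

By default, svd() returns a full singular value decomposition. Passing full_matrices=False selects the reduced version, which uses smaller matrices and saves memory. The columns of vh correspond to the books, so we can use the vector space model to find which book is most similar to the one we are looking at: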
import numpy as np

def cosine_similarity(v, u):
    # Cosine of the angle between vectors v and u
    return (v @ u) / (np.linalg.norm(v) * np.linalg.norm(u))

highest_similarity = -np.inf
highest_sim_col = -1
# Compare column 0 against every other column of vh
for col in range(1, vh.shape[1]):
    similarity = cosine_similarity(vh[:, 0], vh[:, col])
    if similarity > highest_similarity:
        highest_similarity = similarity
        highest_sim_col = col

print("Column %d is most similar to column 0" % highest_sim_col)

Column 906 is most similar to column 0

The code above looks for the book that best matches the one in the first column; the result is column 906.
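In a recommender system, when a user picks a book, we can show her a few other books that are most similar to the one she picked, based on the cosine similarity computed above.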
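Showing "a few other books" means ranking all columns rather than keeping only the best one. The sketch below is one possible way to do that; `top_k_similar` is a hypothetical helper that is not part of the original code, and a small random matrix stands in for the real (possibly truncated) `vh`.

```python
import numpy as np

def top_k_similar(vh, col, k=5):
    """Return the indices of the k columns of vh most similar to column `col`."""
    norms = np.linalg.norm(vh, axis=0)                # norm of every column
    sims = (vh[:, col] @ vh) / (norms[col] * norms)   # cosine similarity to all columns
    sims[col] = -np.inf                               # never recommend the query itself
    return np.argsort(sims)[::-1][:k]                 # highest similarity first

# Random stand-in for vh: 6 latent features x 20 books
rng = np.random.default_rng(42)
demo_vh = rng.standard_normal((6, 20))
print(top_k_similar(demo_vh, col=0, k=3))
```

With the real data, the returned column indices could be mapped back to book IDs through `reviewmatrix.columns`, since the columns of `vh` follow the column order of the pivoted matrix.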
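Depending on the dataset, we may use a truncated SVD to reduce the dimension of the matrix vh. In essence, before computing the similarity we remove from vh the rows whose corresponding singular values in s are small, which shortens every column of vh. This may make the prediction more accurate, because the less important features of a book are then excluded from consideration.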
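A minimal sketch of the truncation step follows, assuming a small random matrix in place of the real ratings and an arbitrary choice of k = 3 components. It also illustrates one reason truncation matters: when vh is square, as it is for the reduced SVD of a matrix with more rows than columns, its columns are orthonormal, so the cosine similarity between any two different columns of the untruncated vh is zero up to rounding error.

```python
import numpy as np

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
# Hypothetical stand-in for the ratings matrix: 30 users x 8 books
demo = rng.integers(0, 6, size=(30, 8)).astype(float)

u, s, vh = np.linalg.svd(demo, full_matrices=False)   # vh is 8 x 8 here

# Untruncated: columns of the square vh are orthonormal, so this is ~0
full_sim = cosine_similarity(vh[:, 0], vh[:, 1])
print(abs(full_sim) < 1e-10)

# s is sorted in descending order, so the first k rows of vh belong to
# the k largest singular values
k = 3                      # hypothetical number of components to keep
vh_k = vh[:k, :]
print(vh_k.shape)          # (3, 8)

# Similarity between book 0 and book 1 using only the top-k features
trunc_sim = cosine_similarity(vh_k[:, 0], vh_k[:, 1])
print(trunc_sim)
```

How many components to keep is a tuning decision; the value used here is only for illustration.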
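Note that in the decomposition $M = U \Sigma V^T$, we know that the rows of $U$ are the users and the columns of $V^T$ are the books, but we cannot tell what the columns of $U$ or the rows of $V^T$ mean. They could, for example, be genres that provide some underlying connection between the users and the books, but we cannot be sure exactly what they are. This does not, however, prevent us from using them as features in a recommender system.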
References
[1] Recommender Systems and Personalization Datasets: https://gitee.com/yunduodatastudio/picture/raw/master/data.png
[2] Social Recommendation Data: https://gitee.com/yunduodatastudio/picture/raw/master/data.png
[3] Librarything: https://www.librarything.com/