Surprise: A Python Recommender System Library (1)
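Hands-On Recommender System with Surprise
- This article uses the MovieLens dataset to test and practice recommendation. Each MovieLens record has the format `user item rating timestamp`. Only the first three columns are really used; when working with your own dataset, the timestamp column can be replaced by another feature, which is not covered in detail here.
- The article is based on the open-source recommender framework Surprise.
- The example on the official site loads the dataset directly with Dataset.load_builtin('ml-100k'), but as a beginner I could never get that to work, so I downloaded the dataset myself and read it from a local file.
- The approach used here is item-based collaborative filtering, where the items are movies. Similarity is computed with the Pearson correlation coefficient.
- The one slightly tricky part of the code is the ID conversion name <-> rid <-> inner_id, where rid acts as the bridge. rid is the raw_id, that is, the original ID of each movie in the dataset. When the Pearson similarity matrix is computed during training, every movie is mapped to a new ID: to_inner_iid() in the code converts a raw_id into the inner_id used by the similarity matrix. After the nearest neighbors have been computed, the resulting inner_ids have to be converted back into movie names, again going through the raw_id. It sounds convoluted when described in words; reading the code makes it clear.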

----------

- Surprise
- Easy to use, with support for multiple recommendation algorithms
- Neighborhood-based methods (collaborative filtering) can use different similarity measures
- Support for different evaluation metrics
- Usage examples
- Basic usage
- Loading your own dataset
- Tuning algorithm parameters for better recommendations
- Training a model on your own dataset
- First, load the data
- Comparing different recommender algorithms
- Building and saving models
- Building a collaborative filtering model and making predictions
- 1 MovieLens example
- 2 Music prediction example
- Prediction with SVD matrix factorization
- Building a collaborative filtering model and making predictions
Surprise
For the modelling part of the recommender system we use the Python library Surprise (Simple Python RecommendatIon System Engine), a member of the scikit family (many readers will have used scikit-learn, scikit-image and similar libraries). The Surprise User Guide has detailed explanations.
It is easy to use and supports a range of recommendation algorithms:
- Baseline algorithms
- Neighborhood methods (collaborative filtering)
- Matrix factorization-based methods (SVD, PMF, SVD++, NMF)
| Algorithm | Description |
| --- | --- |
| random_pred.NormalPredictor | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal. |
| baseline_only.BaselineOnly | Algorithm predicting the baseline estimate for given user and item. |
| knns.KNNBasic | A basic collaborative filtering algorithm. |
| knns.KNNWithMeans | A basic collaborative filtering algorithm, taking into account the mean ratings of each user. |
| knns.KNNBaseline | A basic collaborative filtering algorithm taking into account a baseline rating. |
| matrix_factorization.SVD | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. |
| matrix_factorization.SVDpp | The SVD++ algorithm, an extension of SVD taking into account implicit ratings. |
| matrix_factorization.NMF | A collaborative filtering algorithm based on Non-negative Matrix Factorization. |
| slope_one.SlopeOne | A simple yet accurate collaborative filtering algorithm. |
| co_clustering.CoClustering | A collaborative filtering algorithm based on co-clustering. |
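To make two rows of the table more concrete, here is a minimal sketch of the prediction rules usually associated with BaselineOnly and SVD: a baseline estimate (global mean plus user and item biases), to which SVD adds the dot product of a user factor vector and an item factor vector. The helper names and all numbers are made up for illustration; this is not Surprise's implementation.

```python
def baseline_estimate(mu, b_u, b_i):
    """Baseline estimate: global mean + user bias + item bias."""
    return mu + b_u + b_i


def svd_estimate(mu, b_u, b_i, p_u, q_i):
    """SVD-style estimate: baseline + dot product of user and item factors."""
    dot = sum(p * q for p, q in zip(p_u, q_i))
    return baseline_estimate(mu, b_u, b_i) + dot


# Illustrative (made-up) values for one user/item pair
mu = 3.5            # global mean rating
b_u = 0.2           # this user rates slightly above average
b_i = -0.4          # this item is rated slightly below average
p_u = [0.5, 1.0]    # user latent factors
q_i = [1.0, 0.25]   # item latent factors

baseline = baseline_estimate(mu, b_u, b_i)
svd = svd_estimate(mu, b_u, b_i, p_u, q_i)
print(round(baseline, 2), round(svd, 2))
```

In a trained model the biases and factor vectors are learned from the ratings; here they are fixed by hand only to show how the terms combine.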
Neighborhood-based methods (collaborative filtering) can be configured with different similarity measures.
| Similarity measure | Description |
| --- | --- |
| cosine | Compute the cosine similarity between all pairs of users (or items). |
| msd | Compute the Mean Squared Difference similarity between all pairs of users (or items). |
| pearson | Compute the Pearson correlation coefficient between all pairs of users (or items). |
| pearson_baseline | Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means. |
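As a rough illustration of the first three measures, the sketch below computes them for a single pair of items from the ratings given by users who rated both. It assumes the two lists are already aligned by user; the function names and ratings are invented, and the MSD similarity is taken as 1 / (MSD + 1). It is a simplified sketch, not the library's code.

```python
import math


def cosine_sim(x, y):
    """Cosine similarity between two aligned rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def msd_sim(x, y):
    """Similarity derived from the mean squared difference: 1 / (msd + 1)."""
    msd = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return 1.0 / (msd + 1.0)


def pearson_sim(x, y):
    """Pearson correlation: like cosine, but each vector is mean-centered."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denom = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
             * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return cov / denom if denom else 0.0


# Made-up ratings from three users who rated both items
item_a = [5, 3, 4]
item_b = [4, 2, 3]

cos_ab = cosine_sim(item_a, item_b)
msd_ab = msd_sim(item_a, item_b)
pearson_ab = pearson_sim(item_a, item_b)
print(cos_ab, msd_ab, pearson_ab)
```

Item B is always rated exactly one point below item A, so the Pearson correlation is at its maximum even though the ratings differ, which is why mean-centered measures are popular for ratings data.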
Different evaluation metrics are supported.
| Metric | Description |
| --- | --- |
| rmse | Compute RMSE (Root Mean Squared Error). |
| mae | Compute MAE (Mean Absolute Error). |
| fcp | Compute FCP (Fraction of Concordant Pairs). |
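The three metrics can be sketched in a few lines of plain Python. The predictions below are invented `(user, true_rating, estimated_rating)` triples, and the FCP function is a simplified variant that pools pairs over all users (the library aggregates per-user counts differently), so treat it as an illustration of the idea only.

```python
import math
from collections import defaultdict


def rmse(predictions):
    """Root mean squared error over (user, true, estimate) triples."""
    return math.sqrt(sum((true - est) ** 2 for _, true, est in predictions)
                     / len(predictions))


def mae(predictions):
    """Mean absolute error over (user, true, estimate) triples."""
    return sum(abs(true - est) for _, true, est in predictions) / len(predictions)


def fcp(predictions):
    """Simplified fraction of concordant pairs, pooled over all users.

    For every pair of items a user rated differently, the pair is concordant
    when the estimates are ordered the same way as the true ratings.
    """
    by_user = defaultdict(list)
    for user, true, est in predictions:
        by_user[user].append((true, est))
    concordant = discordant = 0
    for pairs in by_user.values():
        for true_i, est_i in pairs:
            for true_j, est_j in pairs:
                if true_i > true_j:
                    if est_i > est_j:
                        concordant += 1
                    else:
                        discordant += 1
    total = concordant + discordant
    return concordant / total if total else 0.0


# Made-up predictions: (user, true rating, estimated rating)
predictions = [
    ('u1', 3, 2.5), ('u1', 4, 4.5), ('u1', 5, 4.0),
    ('u2', 2, 3.0), ('u2', 4, 3.5),
]

print(rmse(predictions), mae(predictions), fcp(predictions))
```

RMSE and MAE measure how far the estimates are from the true ratings, while FCP only looks at whether items are ranked in the right order for each user.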
Usage examples
Basic usage
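
    # Any of the algorithms listed above can be used here
    from surprise import SVD
    from surprise import Dataset
    # cross_validate lives in surprise.model_selection; older Surprise releases
    # used evaluate() and print_perf() for the same purpose
    from surprise.model_selection import cross_validate

    # Load the built-in MovieLens dataset. You are asked whether to download it
    # if it is not present yet. ml-100k is one of the classic public
    # recommender-system datasets from MovieLens.
    data = Dataset.load_builtin('ml-100k')
    # Try SVD matrix factorization
    algo = SVD()
    # Evaluate with k-fold cross-validation (k=3); verbose=True prints the results
    perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Loading your own dataset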
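
    import os
    from surprise import Dataset, Reader
    from surprise.model_selection import KFold

    # Path to the data file
    file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
    # Tell the reader how each line of the file is laid out
    reader = Reader(line_format='user item rating timestamp', sep='\t')
    # Load the data
    data = Dataset.load_from_file(file_path, reader=reader)
    # Define a 5-fold split (convenient for cross-validation)
    kf = KFold(n_splits=5)

Tuning algorithm parameters (for better recommendations)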
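The algorithms implemented here are trained with methods such as SGD, so they have hyperparameters that affect the final result. As is common in sklearn, we can use grid search with cross-validation (GridSearchCV) to pick the best parameters. The simple example in this section searches over n_epochs, lr_all and reg_all for SVD.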
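To show what a grid search actually iterates over, the sketch below expands a parameter grid into every combination and picks the one with the lowest score. The grid values are the ones used in this section; `expand_grid` and `toy_score` are hypothetical helpers, and `toy_score` is an arbitrary stand-in for a cross-validated error, not a real measurement.

```python
from itertools import product

# Same grid as in this section: 2 x 2 x 2 = 8 combinations
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def toy_score(params):
    """Hypothetical stand-in for a cross-validated error (lower is better)."""
    return params['reg_all'] / (params['n_epochs'] * params['lr_all'])


combinations = expand_grid(param_grid)
best_params = min(combinations, key=toy_score)
print(len(combinations), best_params)
```

A real grid search replaces `toy_score` with a full cross-validation run per combination, which is why the cost grows quickly with the size of the grid.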

    from surprise import SVD, Dataset
    from surprise.model_selection import GridSearchCV

    # Define the grid of parameters to search
    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]}
    # Grid search with 3-fold cross-validation
    grid_search = GridSearchCV(SVD, param_grid, measures=['rmse', 'fcp'], cv=3)
    # Find the best parameters on the dataset
    data = Dataset.load_builtin('ml-100k')
    grid_search.fit(data)

    # Print the tuned parameter sets
    # Best RMSE score
    print(grid_search.best_score['rmse'])
    # e.g. 0.96117566386
    # Parameters that gave the best RMSE
    print(grid_search.best_params['rmse'])
    # e.g. {'reg_all': 0.4, 'lr_all': 0.005, 'n_epochs': 10}
    # Best FCP score
    print(grid_search.best_score['fcp'])
    # e.g. 0.702279736531
    # Parameters that gave the best FCP
    print(grid_search.best_params['fcp'])
    # e.g. {'reg_all': 0.6, 'lr_all': 0.005, 'n_epochs': 10}

Training a model on your own dataset
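The loader used in this section expects a plain text file with one rating per line in the `user item rating timestamp` layout, separated by commas. As a rough illustration of that layout, the sketch below parses a few invented rows; `parse_line` and the sample values are hypothetical, and this is not how Surprise reads files internally.

```python
# Made-up rows in the "user item rating timestamp" layout, comma separated
sample_lines = [
    "u1,p10,5,1500000000",
    "u1,p11,3,1500000100",
    "u2,p10,4,1500000200",
]


def parse_line(line, sep=','):
    """Split one line into (user, item, rating); the timestamp is ignored."""
    user, item, rating, _timestamp = line.strip().split(sep)
    return user, item, float(rating)


ratings = [parse_line(line) for line in sample_lines]
print(ratings)
```

User and item IDs stay as strings, while the rating is converted to a number; the fourth column is only a placeholder if your data has no timestamp.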
First, load the data

    import os
    from surprise import Reader, Dataset

    # Path to the data file
    file_path = os.path.expanduser('./popular_music_suprise_format.txt')
    # File format
    reader = Reader(line_format='user item rating timestamp', sep=',')
    # Read the data from the file
    music_data = Dataset.load_from_file(file_path, reader=reader)
    # The data is evaluated with 5-fold cross-validation (cv=5 in cross_validate)

Comparing different recommender algorithms

    from surprise.model_selection import cross_validate

    ### NormalPredictor
    from surprise import NormalPredictor
    algo = NormalPredictor()
    perf = cross_validate(algo, music_data, measures=['RMSE', 'MAE'], cv=5)

    ### BaselineOnly
    from surprise import BaselineOnly
    algo = BaselineOnly()
    perf = cross_validate(algo, music_data, measures=['RMSE', 'MAE'], cv=5)

    ### Basic collaborative filtering
    from surprise import KNNBasic
    algo = KNNBasic()
    perf = cross_validate(algo, music_data, measures=['RMSE', 'MAE'], cv=5)

    ### Collaborative filtering with means
    from surprise import KNNWithMeans
    algo = KNNWithMeans()
    perf = cross_validate(algo, music_data, measures=['RMSE', 'MAE'], cv=5)

    ### Collaborative filtering with baseline
    from surprise import KNNBaseline
    algo = KNNBaseline()
    perf = cross_validate(algo, music_data, measures=['RMSE', 'MAE'], cv=5)

    ### SVD
    from surprise import SVD
    algo = SVD()
    perf = cross_validate(algo, music_data, measures=['RMSE', 'MAE'], cv=5)

    ### SVD++
    from surprise import SVDpp
    algo = SVDpp()
    perf = cross_validate(algo, music_data, measures=['RMSE', 'MAE'], cv=5)

    ### NMF
    from surprise import NMF
    algo = NMF()
    # verbose=True prints the per-fold results for this last run
    perf = cross_validate(algo, music_data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Building and saving models
1. Building a collaborative filtering model and making predictions
1.1 MovieLens example

    # Any of the algorithms listed above can be used here
    from surprise import SVD
    from surprise import Dataset
    from surprise.model_selection import cross_validate

    # Load the built-in MovieLens dataset
    data = Dataset.load_builtin('ml-100k')
    # Try SVD matrix factorization
    algo = SVD()
    # Evaluate with k-fold cross-validation (k=3) and print the results
    perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

    """
    The code below shows how, once a collaborative filtering model has been built,
    to retrieve the items most similar to a given item.
    It relies mainly on algo.get_neighbors().
    """
    import os
    import io

    from surprise import KNNBaseline
    from surprise import Dataset


    def read_item_names():
        """Build the movie name -> movie ID and movie ID -> movie name mappings."""
        file_name = (os.path.expanduser('~') +
                     '/.surprise_data/ml-100k/ml-100k/u.item')
        rid_to_name = {}
        name_to_rid = {}
        with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
            for line in f:
                line = line.split('|')
                rid_to_name[line[0]] = line[1]
                name_to_rid[line[1]] = line[0]
        return rid_to_name, name_to_rid


    # First, train the algorithm to compute the item-item similarities
    data = Dataset.load_builtin('ml-100k')
    trainset = data.build_full_trainset()
    sim_options = {'name': 'pearson_baseline', 'user_based': False}
    algo = KNNBaseline(sim_options=sim_options)
    algo.fit(trainset)

    # Get the movie name -> movie ID and movie ID -> movie name mappings
    rid_to_name, name_to_rid = read_item_names()

    # Retrieve the inner id of the movie Toy Story
    toy_story_raw_id = name_to_rid['Toy Story (1995)']
    toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

    # Retrieve the inner ids of the nearest neighbors of Toy Story
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

    # Convert the inner ids of the neighbors into names
    toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                           for inner_id in toy_story_neighbors)
    toy_story_neighbors = (rid_to_name[rid]
                           for rid in toy_story_neighbors)

    print()
    print('The 10 nearest neighbors of Toy Story are:')
    for movie in toy_story_neighbors:
        print(movie)

1.2 Music prediction example

    import os
    import pickle

    from surprise import KNNBaseline
    from surprise import Dataset, Reader

    # Rebuild the playlist ID -> playlist name dictionary
    with open("popular_playlist.pkl", "rb") as f:
        id_name_dic = pickle.load(f)
    print("Loaded the playlist ID -> playlist name dictionary...")
    # Rebuild the playlist name -> playlist ID dictionary
    name_id_dic = {}
    for playlist_id in id_name_dic:
        name_id_dic[id_name_dic[playlist_id]] = playlist_id
    print("Loaded the playlist name -> playlist ID dictionary...")

    file_path = os.path.expanduser('./popular_music_suprise_format.txt')
    # File format
    reader = Reader(line_format='user item rating timestamp', sep=',')
    # Read the data from the file
    music_data = Dataset.load_from_file(file_path, reader=reader)
    # Compute the similarity between songs
    print("Building the dataset...")
    trainset = music_data.build_full_trainset()
    #sim_options = {'name': 'pearson_baseline', 'user_based': False}

- current_playlist => playlist name
- playlist_id => playlist ID (the ID assigned by NetEase)
- playlist_inner_id => inner ID (all playlist IDs re-encoded as consecutive integers; Surprise numbers them starting from 0)
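The three identifiers above form a chain: name <-> raw ID <-> inner ID. The sketch below imitates that chain with plain dictionaries, assuming inner IDs are handed out as consecutive integers from 0 in the order items first appear in the ratings. The playlist names, IDs and helper functions are all invented for illustration; in Surprise the last step is done by the trainset's to_inner_iid() and to_raw_iid().

```python
# Hypothetical playlist name <-> raw ID mapping (raw IDs come from the data source)
name_to_raw = {'Morning Jazz': '306948', 'Late Night Rock': '519201', 'Study Piano': '88104'}
raw_to_name = {raw: name for name, raw in name_to_raw.items()}

# Made-up ratings: (user, raw playlist ID, rating)
ratings = [
    ('u1', '88104', 4.0),
    ('u1', '306948', 5.0),
    ('u2', '306948', 3.0),
    ('u2', '519201', 2.0),
]

# Assign inner IDs 0, 1, 2, ... in order of first appearance
raw_to_inner = {}
for _user, raw_id, _rating in ratings:
    if raw_id not in raw_to_inner:
        raw_to_inner[raw_id] = len(raw_to_inner)
inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}


def name_to_inner(name):
    """name -> raw ID -> inner ID."""
    return raw_to_inner[name_to_raw[name]]


def inner_to_name(inner_id):
    """inner ID -> raw ID -> name."""
    return raw_to_name[inner_to_raw[inner_id]]


print(name_to_inner('Morning Jazz'), inner_to_name(0))
```

The raw ID is the only key shared by the name dictionary and the trained model, which is why every lookup in either direction has to pass through it.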
2. Prediction with SVD matrix factorization

    ### SVD++
    import os
    from surprise import SVDpp
    from surprise import Dataset, Reader

    file_path = os.path.expanduser('./popular_music_suprise_format.txt')
    # File format
    reader = Reader(line_format='user item rating timestamp', sep=',')
    # Read the data from the file
    music_data = Dataset.load_from_file(file_path, reader=reader)
    # Build the training set and fit the model
    algo = SVDpp()
    trainset = music_data.build_full_trainset()
    algo.fit(trainset)