Machine Learning for Building Recommender System in Python
Recommender systems are widely used to recommend products such as music, movies, books, news, research articles, and restaurants [1][5][9][10].
There are two popular methods for building recommender systems:
- Collaborative filtering [3][4][5][10]
- Content-based filtering [6][9]
The collaborative filtering method [5][10] predicts (filters) a user's interest in a product by collecting preference information from many other users (collaborating). The underlying assumption is that if a person P1 has the same opinion as another person P2 on one issue, then P1 is more likely to share P2's opinion on a different issue than the opinion of a randomly chosen person [5].
The content-based filtering method [6][9] uses product features/attributes to recommend other products similar to what the user likes, based on that user's previous actions or explicit feedback such as product ratings.
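As a rough illustration of the content-based idea, the sketch below ranks unseen items by how close their feature vectors are to the average feature vector of the items a user liked. The movie names, the three genre flags, and the helper names `cosine` and `recommend_similar` are all hypothetical and are not part of the article's dataset or code.

```python
from math import sqrt

# Hypothetical item features (genre flags): [action, comedy, drama]
ITEM_FEATURES = {
    "Movie A": [1, 0, 0],
    "Movie B": [1, 0, 1],
    "Movie C": [0, 1, 0],
    "Movie D": [0, 1, 1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend_similar(liked_items, features, top_n=2):
    """Rank unseen items by similarity to the average feature vector of the liked items."""
    n_features = len(next(iter(features.values())))
    profile = [
        sum(features[item][k] for item in liked_items) / len(liked_items)
        for k in range(n_features)
    ]
    candidates = [item for item in features if item not in liked_items]
    ranked = sorted(candidates, key=lambda item: cosine(profile, features[item]), reverse=True)
    return ranked[:top_n]

content_recs = recommend_similar(["Movie A"], ITEM_FEATURES)
```

Here the only other action movie ends up first in `content_recs`, because it is the only candidate that shares a feature with the liked item.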
A recommender system may use either or both of these two methods.
In this article, I use the Kaggle Netflix prize data [2] to demonstrate how to use a model-based collaborative filtering method to build a recommender system in Python.
The rest of the article is arranged as follows:
- Overview of collaborative filtering
- Build recommender system in Python
- Summary
1. Overview of Collaborative Filtering
As described in [5], the main idea behind collaborative filtering is that one person often gets the best recommendations from another with similar interests. Collaborative filtering uses various techniques to match people with similar interests and make recommendations based on shared interests.
The high-level workflow of a collaborative filtering system can be described as follows:
- A user rates items (e.g., movies, books) to express his or her preferences for the items
- The system treats the ratings as an approximate representation of the user's interest in items
- The system matches this user's ratings with other users' ratings and finds the people with the most similar ratings
- The system recommends items that the similar users have rated highly but that this user has not yet rated
Typically, a collaborative filtering system recommends products to a given user in two steps [5]:
- Step 1: Look for people who share the same rating patterns as the given user
- Step 2: Use the ratings from the people found in step 1 to predict the rating the given user would give to a product
This is called user-based collaborative filtering. One specific implementation of this method is the user-based Nearest Neighbor algorithm.
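The two steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the algorithm of any particular library: the toy ratings, the function names, and the choice of cosine similarity on raw ratings are all made up for the example (mean-centered or Pearson similarity is a common alternative).

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
RATINGS = {
    "u1": {"m1": 5, "m2": 4, "m3": 1},
    "u2": {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "u3": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}

def user_similarity(a, b, ratings):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    dot = sum(ratings[a][i] * ratings[b][i] for i in common)
    norm_a = sqrt(sum(ratings[a][i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings[b][i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_user_based(user, item, ratings, k=2):
    """Step 1: find the k most similar users who rated `item`.
    Step 2: return their similarity-weighted average rating."""
    neighbours = [
        (user_similarity(user, other, ratings), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbours)
    if total_sim == 0:
        return None  # no usable neighbours
    return sum(sim * r for sim, r in neighbours) / total_sim

predicted_u1_m4 = predict_user_based("u1", "m4", RATINGS)
```

Because u1's ratings resemble u2's much more than u3's, the prediction for m4 is pulled toward u2's rating of 5 rather than u3's rating of 1.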
As an alternative, item-based collaborative filtering (e.g., users who are interested in x are also interested in y) works in an item-centric manner:
- Step 1: Build an item-item matrix of the rating relationships between pairs of items
- Step 2: Predict the rating of the current user on a product by examining the matrix and matching that user's rating data
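The item-centric steps above can be sketched in the same spirit. The toy data and function names below are hypothetical, and cosine similarity on raw ratings is only one possible choice for the item-item matrix; with all-positive ratings it yields similarities between 0 and 1, so practical systems often mean-center the ratings first.

```python
from math import sqrt
from itertools import combinations

# Hypothetical ratings: user -> {item: rating}
USER_RATINGS = {
    "u1": {"x": 5, "y": 4},
    "u2": {"x": 4, "y": 5, "z": 1},
    "u3": {"x": 1, "y": 2, "z": 5},
    "u4": {"x": 2, "z": 4},
}

def build_item_matrix(ratings):
    """Step 1: cosine similarity for every pair of items, over users who rated both."""
    items = sorted({i for r in ratings.values() for i in r})
    sim = {i: {} for i in items}
    for a, b in combinations(items, 2):
        users = [u for u in ratings if a in ratings[u] and b in ratings[u]]
        if not users:
            continue
        dot = sum(ratings[u][a] * ratings[u][b] for u in users)
        norm_a = sqrt(sum(ratings[u][a] ** 2 for u in users))
        norm_b = sqrt(sum(ratings[u][b] ** 2 for u in users))
        sim[a][b] = sim[b][a] = dot / (norm_a * norm_b)
    return sim

def predict_item_based(user, item, ratings, sim):
    """Step 2: weight the user's own ratings by each rated item's similarity to `item`."""
    pairs = [(sim[item].get(j, 0.0), r) for j, r in ratings[user].items() if j != item]
    total = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / total if total else None

item_sim = build_item_matrix(USER_RATINGS)
predicted_u1_z = predict_item_based("u1", "z", USER_RATINGS, item_sim)
```

Note that the prediction is a weighted average of the user's own ratings, so it always falls between the lowest and highest rating that user has given.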
There are two types of collaborative filtering systems:
- Model-based
- Memory-based
In a model-based system, we develop models using different machine learning algorithms to predict users' ratings of unrated items [5]. There are many model-based collaborative filtering algorithms, such as matrix factorization algorithms (e.g., singular value decomposition (SVD) and the Alternating Least Squares (ALS) algorithm [8]), Bayesian networks, and clustering models [5].
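To make the matrix factorization idea concrete, here is a minimal sketch that learns user and item factor vectors by stochastic gradient descent so that their dot product approximates the known ratings. It is a simplified, bias-free variant written for illustration only; the toy ratings, hyperparameters, and function names are assumptions, and it is not the implementation used by Surprise or Spark.

```python
import random
from math import sqrt

# Hypothetical toy data: (user_index, item_index, rating)
TOY_RATINGS = [
    (0, 0, 5), (0, 1, 4),
    (1, 0, 4), (1, 1, 5), (1, 2, 1),
    (2, 0, 1), (2, 1, 2), (2, 2, 5),
]

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def rmse(ratings, P, Q):
    """Root mean squared error of the factor model on the given ratings."""
    return sqrt(sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings) / len(ratings))

def init_factors(n_rows, n_factors, rng):
    """Small random starting values for each factor vector."""
    return [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_rows)]

def train(ratings, P, Q, epochs=500, lr=0.01, reg=0.02):
    """SGD on squared error with L2 regularisation (updates P and Q in place)."""
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])
            for f in range(len(P[u])):
                p_uf, q_if = P[u][f], Q[i][f]
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)

rng = random.Random(0)
P = init_factors(3, 2, rng)  # user factors
Q = init_factors(3, 2, rng)  # item factors
rmse_before = rmse(TOY_RATINGS, P, Q)
train(TOY_RATINGS, P, Q)
rmse_after = rmse(TOY_RATINGS, P, Q)
estimate_u0_i2 = dot(P[0], Q[2])  # estimate for a rating missing from the toy data
```

After training, `rmse_after` should be well below `rmse_before`, and `estimate_u0_i2` is the model's guess for the one user-item pair that has no rating in the toy data.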
A memory-based system uses users' rating data to compute the similarity between users or items. Typical examples of this type of system are neighbourhood-based methods and item-based/user-based top-N recommendations [5].
This article describes how to build a model-based collaborative filtering system using the SVD model.
2. Build Recommender System in Python
This section describes how to build a recommender system in Python.
2.1 Installing Library
Multiple Python libraries are available for building recommender systems (e.g., Python scikit Surprise [7] and the Spark RDD-based API for collaborative filtering [8]). I use the Python scikit Surprise library in this article for demonstration purposes.
The Surprise library can be installed as follows:
pip install scikit-surprise
2.2 Loading Data
As described before, I use the Kaggle Netflix prize data [2] in this article. There are multiple data files for different purposes. The following data files are used in this article:
Training data files:
- combined_data_1.txt
- combined_data_2.txt
- combined_data_3.txt
- combined_data_4.txt
Movie titles data file:
- movie_titles.csv
The training dataset is too big to be handled on a laptop. Thus I only load the first 100,000 records from each of the training data files for demonstration purposes.
Once training data files have been downloaded onto a local machine, the first 100,000 records from each of the training data files can be loaded into memory as Pandas DataFrames as follows:
import pandas as pd

def readFile(file_path, rows=100000):
    """Read the first `rows` lines of a Netflix data file into a DataFrame."""
    data_dict = {'Cust_Id': [], 'Movie_Id': [], 'Rating': [], 'Date': []}
    movieId = None
    with open(file_path, "r") as f:
        for count, line in enumerate(f, start=1):
            if count > rows:
                break
            line = line.strip()
            if not line:
                continue
            if line.endswith(':'):
                # A line such as "1:" starts the block of ratings for that movie
                movieId = int(line[:-1])
            else:
                customerID, rating, date = line.split(',')
                # Store IDs as integers so they match the integer IDs passed to svd.predict() later
                data_dict['Cust_Id'].append(int(customerID))
                data_dict['Movie_Id'].append(movieId)
                data_dict['Rating'].append(float(rating))
                data_dict['Date'].append(date)
    return pd.DataFrame(data_dict)

df1 = readFile('./data/netflix/combined_data_1.txt', rows=100000)
df2 = readFile('./data/netflix/combined_data_2.txt', rows=100000)
df3 = readFile('./data/netflix/combined_data_3.txt', rows=100000)
df4 = readFile('./data/netflix/combined_data_4.txt', rows=100000)
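The loader relies on the layout of the training files: a line holding a movie ID followed by a colon opens a block, and each following line holds a customer ID, a rating, and a date separated by commas. The sketch below parses a small in-memory sample in that layout so the logic can be checked without downloading the data; the sample values and the helper name `parse_ratings` are illustrative assumptions.

```python
import io

# Illustrative sample in the same layout as the training files
SAMPLE = (
    "1:\n"
    "1488844,3,2005-09-06\n"
    "822109,5,2005-05-13\n"
    "2:\n"
    "2059652,4,2005-09-05\n"
)

def parse_ratings(lines):
    """Yield (customer_id, movie_id, rating, date) tuples from Netflix-style lines."""
    movie_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A movie header line: remember the ID for the rating lines that follow
            movie_id = int(line[:-1])
        else:
            customer_id, rating, date = line.split(",")
            yield int(customer_id), movie_id, float(rating), date

rows = list(parse_ratings(io.StringIO(SAMPLE)))
```

Each tuple in `rows` carries the movie ID of the most recent header line, which is how the flat file is turned into one record per rating.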
The resulting DataFrames, one for each portion of the training data, are combined into one as follows:
# DataFrame.append() was removed in pandas 2.0, so pd.concat() is used instead
df = pd.concat([df1, df2, df3, df4], ignore_index=True)
df.head(10)
The movie titles file can be loaded into memory as a Pandas DataFrame:
df_title = pd.read_csv('./data/netflix/movie_titles.csv', encoding="ISO-8859-1", header=None, names=['Movie_Id', 'Year', 'Name'])
df_title.head(10)
2.3 Training and Evaluating Model
The Dataset module in Surprise provides different methods for loading data from files, Pandas DataFrames, or built-in datasets such as ml-100k (MovieLens 100k) [4]:
- Dataset.load_builtin()
- Dataset.load_from_file()
- Dataset.load_from_df()
I use the load_from_df() method to load data from a Pandas DataFrame in this article.
The Reader class in Surprise is used to parse a file containing users, items, and users' ratings of items. By default, each rating is stored on a separate line, with fields separated by a space, in the following order: user item rating
This order and the separator are configurable using the following parameters:
- line_format is a string such as "item user rating" that indicates the order of the fields, with field names separated by a space
- sep specifies the separator between fields, such as a space or ','
- rating_scale specifies the rating scale. The default value is (1, 5)
- skip_lines indicates the number of lines to skip at the beginning of the file. The default is 0
I use the default settings in this article. The user, item, and rating fields correspond to the Cust_Id, Movie_Id, and Rating columns of the DataFrame, respectively.
The Surprise library [7] contains the implementation of multiple models/algorithms for building recommender systems such as SVD, Probabilistic Matrix Factorization (PMF), Non-negative Matrix Factorization (NMF), etc. The SVD model is used in this article.
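For reference, the Surprise documentation describes the rating estimated by its SVD model as a global mean plus user and item biases plus a dot product of latent factor vectors, learned by minimizing a regularized squared error over the training ratings. The notation below follows that documentation as I understand it: \mu is the global mean rating, b_u and b_i are the user and item biases, p_u and q_i are the user and item factor vectors, and \lambda is the regularization weight.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b_*,\, q_*,\, p_*} \sum_{r_{ui} \in R_{\text{train}}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
```

Despite the name, this is a factorization fitted by gradient descent on the observed ratings only, which differs from the classical singular value decomposition of a full matrix.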
The following code loads data from the Pandas DataFrame and creates an SVD model instance:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

reader = Reader()
data = Dataset.load_from_df(df[['Cust_Id', 'Movie_Id', 'Rating']], reader)
svd = SVD()
Once the data and model for product recommendation are ready, the model can be evaluated using cross-validation as follows:
# Run 5-fold cross-validation and print results
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
The following are the results of the cross validation of the SVD model:
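When reading the RMSE and MAE values reported for each fold, it may help to recall how the two error measures are defined. The sketch below computes both from scratch on made-up actual and predicted ratings; the numbers and function names are illustrative and are not output from the model in this article.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of the error, in rating units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up ratings and predictions for four user-movie pairs
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]

rmse_value = rmse(actual, predicted)  # 0.75
mae_value = mae(actual, predicted)    # 0.625
```

For both measures, lower is better, and RMSE is never smaller than MAE on the same predictions.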
Once the model has been evaluated to our satisfaction, we can re-train the model using the entire training dataset:
trainset = data.build_full_trainset()
svd.fit(trainset)
2.4 Recommending Products
After a recommendation model has been trained appropriately, it can be used for prediction.
For example, given a user (e.g., Customer Id 785314), we can use the trained model to predict the ratings given by the user on different products (i.e., Movie titles):
titles = df_title.copy()
titles['Estimate_Score'] = titles['Movie_Id'].apply(lambda x: svd.predict(785314, x).est)
To recommend products (i.e., movies) to the given user, we can sort the list of movies in decreasing order of predicted ratings and take the top N movies as recommendations:
titles = titles.sort_values(by=['Estimate_Score'], ascending=False)
titles.head(10)
The following are the top 10 movies to be recommended to the user with Customer Id 785314:
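Note that the ranking code scores every title, including movies the user has already rated. A possible refinement, sketched here in plain Python with made-up scores, is to drop already-rated movies before taking the top N. The function name `top_n_unseen` and the sample values are hypothetical; in the article's setting, the set of rated movies could be derived from the rows of df that belong to the given customer.

```python
def top_n_unseen(predicted_scores, already_rated, n=10):
    """Return the n highest-scoring (movie_id, score) pairs the user has not rated yet."""
    unseen = [(m, s) for m, s in predicted_scores.items() if m not in already_rated]
    return sorted(unseen, key=lambda pair: pair[1], reverse=True)[:n]

# Made-up predicted scores (movie_id -> estimated rating) and rated movies
scores = {1: 3.9, 2: 4.6, 3: 4.8, 4: 2.1}
rated = {3}

top2 = top_n_unseen(scores, rated, n=2)
```

Movie 3 has the highest predicted score but is excluded because the user has already rated it, so `top2` holds movies 2 and 1.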
3. Summary
In this article, I used the scikit Surprise library [7] and the Kaggle Netflix prize data [2] to demonstrate how to use a model-based collaborative filtering method to build a recommender system in Python.
As described in Section 2.2, the dataset is too big to be handled on a laptop or any typical single personal computer. Thus I only loaded the first 100,000 records from each of the training data files for demonstration purposes.
In real applications, I would recommend using Surprise with Koalas, or using the ALS algorithm in Spark MLlib, to implement a collaborative filtering system and run it on a Spark cluster [8].
The Jupyter notebook with all of the source code used in this article is available on GitHub [11].
Translated from: https://towardsdatascience.com/machine-learning-for-building-recommender-system-in-python-9e4922dd7e97