ML / K-means: Document clustering of the top 100 films with the K-means algorithm on a movie dataset
Contents
Output
Implementation code
輸出結(jié)果
先看文檔分類(lèi)后的結(jié)果,一共得到五類(lèi)電影:
?
實(shí)現(xiàn)代碼
# -*- coding: utf-8 -*-
from __future__ import print_function

import numpy as np
import pandas as pd
import nltk
from bs4 import BeautifulSoup
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3  # optional, only needed if you want interactive plots
# (some of these imports are kept from the original listing even though not all are used below)

# Import three lists: titles, links, and Wikipedia synopses
titles = open('document_cluster_master/title_list.txt').read().split('\n')
# Ensure that only the first 100 are read in
titles = titles[:100]

links = open('document_cluster_master/link_list_imdb.txt').read().split('\n')
links = links[:100]

synopses_wiki = open('document_cluster_master/synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]

synopses_clean_wiki = []
for text in synopses_wiki:
    # strip HTML formatting and convert to unicode
    text = BeautifulSoup(text, 'html.parser').getText()
    synopses_clean_wiki.append(text)
synopses_wiki = synopses_clean_wiki

genres = open('document_cluster_master/genres_list.txt').read().split('\n')
genres = genres[:100]

print(str(len(titles)) + ' titles')
print(str(len(links)) + ' links')
print(str(len(synopses_wiki)) + ' synopses')
print(str(len(genres)) + ' genres')

synopses_imdb = open('document_cluster_master/synopses_list_imdb.txt').read().split('\n BREAKS HERE')
synopses_imdb = synopses_imdb[:100]

synopses_clean_imdb = []
for text in synopses_imdb:
    # strip HTML formatting and convert to unicode
    text = BeautifulSoup(text, 'html.parser').getText()
    synopses_clean_imdb.append(text)
synopses_imdb = synopses_clean_imdb

# Concatenate the Wikipedia and IMDb synopses of each film into one document
synopses = []
for i in range(len(synopses_wiki)):
    item = synopses_wiki[i] + synopses_imdb[i]
    synopses.append(item)

# Generate an index for each item in the corpus (here it is just the rank); used later for scoring
ranks = []
for i in range(0, len(titles)):
    ranks.append(i)

# Define some functions to process the synopses.
# First, load NLTK's English stopword list. Stopwords are words such as "a", "the", or "in"
# that carry little meaning on their own.
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])  # take a quick look

# Next, load NLTK's Snowball stemmer. Stemming reduces a word to its root form,
# so that closely related surface forms end up mapped to the same token.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# Two helpers that return the token set of a given text:
# tokenize_and_stem: tokenize each synopsis (split it into a list of words/tokens) and stem each token
# tokenize_only:     tokenize only
def tokenize_and_stem(text):
    # First split into sentences, then into words; punctuation also shows up as tokens
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # Filter out tokens that contain no letters (e.g. numbers, bare punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # First split into sentences, then into words; punctuation also shows up as tokens
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # Filter out tokens that contain no letters (e.g. numbers, bare punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
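# --- Optional quick check (not part of the original post) ---------------------
# A minimal, hypothetical sanity check of the two functions above; the sample
# sentence is made up and assumes the NLTK 'punkt' data is installed.
sample_sentence = "Running through the streets, they kept running until dawn."
print(tokenize_and_stem(sample_sentence))  # stemmed tokens: e.g. "running" is reduced to "run"
print(tokenize_only(sample_sentence))      # lower-cased surface forms: "running" stays "running"
# -------------------------------------------------------------------------------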
# 擴(kuò)充列表后變成了非常龐大的二維(flat)詞匯表 totalvocab_stemmed = [] totalvocab_tokenized = [] for i in synopses:allwords_stemmed = tokenize_and_stem(i) #對(duì)每個(gè)電影的劇情簡(jiǎn)介進(jìn)行分詞和詞干化totalvocab_stemmed.extend(allwords_stemmed) # 擴(kuò)充“totalvocab_stemmed”列表allwords_tokenized = tokenize_only(i)totalvocab_tokenized.extend(allwords_tokenized)#一個(gè)可查詢(xún)的stemm詞表,以下是詞干化后的詞變回原詞例是一對(duì)多(one to many)的過(guò)程:詞干化后的“run”能夠關(guān)聯(lián)到“ran”,“runs”,“running”等等。 vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed) print ('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame') print (vocab_frame.head())#利用Tf-idf計(jì)算文本相似度,利用 tf-idf 矩陣,你可以跑一長(zhǎng)串聚類(lèi)算法來(lái)更好地理解劇情簡(jiǎn)介集里的隱藏結(jié)構(gòu) from sklearn.feature_extraction.text import TfidfVectorizer# 定義向量化參數(shù) tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,min_df=0.2, stop_words='english',use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3)) tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) # 向量化劇情簡(jiǎn)介文本 print(tfidf_matrix.shape) #(100, 563),100個(gè)電影記錄,每個(gè)電影后邊有563個(gè)詞terms = tfidf_vectorizer.get_feature_names() #terms” 這個(gè)變量只是 tf-idf 矩陣中的特征(features)表,也是一個(gè)詞匯表#dist 變量被定義為 1 – 每個(gè)文檔的余弦相似度。余弦相似度用以和 tf-idf 相互參照評(píng)價(jià)。可以評(píng)價(jià)全文(劇情簡(jiǎn)介)中文檔與文檔間的相似度。被 1 減去是為了確保我稍后能在歐氏(euclidean)平面(二維平面)中繪制余弦距離。 # dist 可以用以評(píng)估任意兩個(gè)或多個(gè)劇情簡(jiǎn)介間的相似度 from sklearn.metrics.pairwise import cosine_similarity dist = 1 - cosine_similarity(tfidf_matrix)#1、首先采用k-means算法 #每個(gè)觀測(cè)對(duì)象(observation)都會(huì)被分配到一個(gè)聚類(lèi),這也叫做聚類(lèi)分配(cluster assignment)。這樣做是為了使組內(nèi)平方和最小。接下來(lái),聚類(lèi)過(guò)的對(duì)象通過(guò)計(jì)算來(lái)確定新的聚類(lèi)質(zhì)心(centroid)。然后,對(duì)象將被重新分配到聚類(lèi),在下一次迭代操作中質(zhì)心也會(huì)被重新計(jì)算,直到算法收斂。 from sklearn.cluster import KMeans#跑了幾次這個(gè)算法以后我發(fā)現(xiàn)得到全局最優(yōu)解(global optimum)的幾率要比局部最優(yōu)解(local optimum)大 num_clusters = 5 #需要先設(shè)定聚類(lèi)的數(shù)目 km = KMeans(n_clusters=num_clusters) km.fit(tfidf_matrix) clusters = km.labels_.tolist()from sklearn.externals import joblibjoblib.dump(km, 'doc_cluster.pkl') # 注釋語(yǔ)句用來(lái)存儲(chǔ)你的模型, 因?yàn)槲乙呀?jīng)從 pickle 載入過(guò)模型了 km = joblib.load('doc_cluster.pkl') clusters = km.labels_.tolist()#進(jìn)行可視化 import pandas as pd#創(chuàng)建一個(gè)字典,包含片名,排名,簡(jiǎn)要?jiǎng)∏?#xff0c;聚類(lèi)分配,還有電影類(lèi)型(genre)(排名和類(lèi)型是從 IMDB 上爬下來(lái)的)。 為了方便起見(jiàn),將這個(gè)字典轉(zhuǎn)換成了 Pandas DataFrame。 films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres } frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre']) print(frame['cluster'].value_counts())grouped = frame['rank'].groupby(frame['cluster']) #為了凝聚(aggregation),由聚類(lèi)分類(lèi)。 print(grouped.mean()) #每個(gè)聚類(lèi)的平均排名(1 到 100),clusters 4 和 clusters 0 的排名最低,說(shuō)明它們包含的影片在 top 100 列表中相對(duì)沒(méi)那么棒。#選取n(我選6個(gè))離聚類(lèi)質(zhì)心最近的詞對(duì)聚類(lèi)進(jìn)行一些好玩的索引(indexing)和排列(sorting)。這樣可以更直觀觀察聚類(lèi)的主要主題。 # from __future__ import print_functionprint("Top terms per cluster:") print() #按離質(zhì)心的距離排列聚類(lèi)中心,由近到遠(yuǎn) order_centroids = km.cluster_centers_.argsort()[:, ::-1] for i in range(num_clusters):print("Cluster %d words:" % i, end='')for ind in order_centroids[i, :6]: #每個(gè)聚類(lèi)選6個(gè)詞print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')print()print()print("Cluster %d titles:" % i, end='')for title in frame.ix[i]['title'].values.tolist():print(' %s,' % title, 
# This is purely to help export tables to HTML and to correct for the 0-based rank
# (so that The Godfather is 1, not 0)
frame['Rank'] = frame['rank'] + 1
frame['Title'] = frame['title']
# Export one cluster's table to HTML
print(frame[['Rank', 'Title']].loc[frame['cluster'] == 1].to_html(index=False))

# Dimensionality reduction, purely for visualization: data of this dimensionality
# cannot be plotted directly.
import os  # for os.path.basename
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.manifold import MDS

# Two components, since the points are plotted in a two-dimensional plane.
# dissimilarity="precomputed" because a distance matrix is supplied,
# and random_state is set so the plot is reproducible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]

# Strip any proper nouns (NNP) or plural proper nouns (NNPS) from a text
from nltk.tag import pos_tag

def strip_proppers_POS(text):
    tagged = pos_tag(text.split())  # use NLTK's part-of-speech tagger
    non_propernouns = [word for word, pos in tagged if pos != 'NNP' and pos != 'NNPS']
    return non_propernouns

# Visualize the document clusters.
# Set up a color per cluster using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a', 4: '#66a61e'}

# Set up cluster names using a dict
cluster_names = {0: 'Family, home, war',
                 1: 'Police, killed, murders',
                 2: 'Father, New York, brothers',
                 3: 'Dance, singing, love',
                 4: 'Killed, soldiers, captain'}

# Create a data frame that holds the MDS result plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
# Group by cluster
groups = df.groupby('label')

# Set up the plot
fig, ax = plt.subplots(figsize=(17, 9))  # set size
ax.margins(0.05)  # optional, just adds 5% padding to the autoscaling

# Iterate through the groups to layer the plot.
# The cluster_names and cluster_colors dicts are looked up by 'name' to get the right color/label.
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label=cluster_names[name], color=cluster_colors[name], mec='none')
    ax.set_aspect('auto')
    ax.tick_params(axis='x',           # changes apply to the x-axis
                   which='both',       # both major and minor ticks are affected
                   bottom=False,       # ticks along the bottom edge are off
                   top=False,          # ticks along the top edge are off
                   labelbottom=False)  # tick labels along the bottom edge are off
    ax.tick_params(axis='y',           # changes apply to the y-axis
                   which='both',       # both major and minor ticks are affected
                   left=False,         # ticks along the left edge are off
                   labelleft=False)    # tick labels along the left edge are off

ax.legend(numpoints=1)  # show a legend with only 1 point per entry

# Add a label at each (x, y) position with the film title as the label text
for i in range(len(df)):
    ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'title'], size=8)

plt.show()  # show the plot

# Uncomment the line below to save the plot if need be
# plt.savefig('clusters_small_noaxes.png', dpi=200)
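One small note on the listing: strip_proppers_POS is defined but never called above; it appears to be carried over from the source tutorial for a later preprocessing step. A minimal, hypothetical illustration of what it does (the sample sentence is invented, and NLTK's 'averaged_perceptron_tagger' data is assumed to be installed):

sample = "Michael Corleone runs the family business in New York"
print(strip_proppers_POS(sample))
# Tokens tagged NNP/NNPS (the proper names) are dropped; the remaining words are kept.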
相關(guān)文章推薦
Document Clustering with Python
用Python實(shí)現(xiàn)文檔聚類(lèi)
Summary

This post loads the titles, links, genres, and synopses of the top 100 films, builds a tf-idf matrix over the stemmed synopses, clusters the films into five groups with K-means, inspects the top terms and titles of each cluster, and finally projects the cosine-distance matrix to two dimensions with MDS to visualize the clusters.