Clustering Documents with Python
Natural Language Processing has made huge advances in recent years. Currently, various implementations of neural networks are cutting edge, and it seems that everybody talks about them. But sometimes a simpler solution might be preferable. After all, one should try to walk before running. In this short article, I am going to demonstrate a simple method for clustering documents with Python. All code is available on GitHub (please note that it might be better to view the code in nbviewer).
We are going to cluster Wikipedia articles using the k-means algorithm. The steps for doing that are the following:
1. fetch Wikipedia articles,
2. represent each article as a vector,
3. perform k-means clustering,
4. evaluate the result.
1. Fetch Wikipedia articles
Using the wikipedia package, it is very easy to download content from Wikipedia. For this example, we will use the content of the articles on:
- Data Science
- Artificial intelligence
- Machine Learning
- European Central Bank
- Bank
- Financial technology
- International Monetary Fund
- Basketball
- Swimming
- Tennis
The content of each Wikipedia article is stored in wiki_lst, while the title of each article is stored in the variable title.
import pandas as pd
import wikipedia

articles = ['Data Science', 'Artificial intelligence', 'Machine Learning',
            'European Central Bank', 'Bank', 'Financial technology',
            'International Monetary Fund', 'Basketball', 'Swimming', 'Tennis']
wiki_lst = []
title = []
for article in articles:
    print("loading content: ", article)
    wiki_lst.append(wikipedia.page(article).content)
    title.append(article)
print("examine content")
wiki_lst
2. Represent each article as a vector
Since we are going to use k-means, we need to represent each article as a numeric vector. A popular method is to use term frequency-inverse document frequency (tf-idf). Put simply, with this method, for each word w and document d we calculate:
tf(w,d): the number of appearances of w in d divided by the total number of words in d.
idf(w): the logarithm of the total number of documents divided by the number of documents that contain w.
tfidf(w,d) = tf(w,d) × idf(w)
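To make the formula concrete, here is a small worked example (an illustration added here, not from the original article). Suppose the corpus has 10 documents, the word w appears 5 times in a document d of 100 words, and 2 of the documents contain w. Then tf(w,d) = 5/100 = 0.05 and idf(w) = log(10/2) ≈ 1.61, so tfidf(w,d) ≈ 0.08. The same arithmetic in Python:

import math

# hypothetical counts, for illustration only
n_docs = 10          # documents in the corpus
docs_with_w = 2      # documents that contain w
count_in_d = 5       # appearances of w in d
words_in_d = 100     # total words in d

tf = count_in_d / words_in_d          # 0.05
idf = math.log(n_docs / docs_with_w)  # ln(5) ≈ 1.609
print(tf * idf)                       # ≈ 0.0805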
It is recommended that common stop words be excluded. All the calculations are easily done with sklearn's TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

# pass the string 'english' (not the set {'english'}) to get the built-in stop word list
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(wiki_lst)
(To be honest, sklearn's calculation is slightly different. You can read about it in the documentation.)
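In particular, with the default smooth_idf=True, sklearn uses idf(w) = ln((1 + n) / (1 + df(w))) + 1, where n is the number of documents and df(w) is the number of documents containing w, and then normalizes each row to unit length. If you want to peek at what the vectorizer produced, something like the following works (a quick sketch; get_feature_names_out assumes scikit-learn 1.0 or later):

# each article becomes one row of tf-idf weights
print(X.shape)  # (10, vocabulary size)

# the learned idf weight of the first few terms
for term, idf in zip(vectorizer.get_feature_names_out()[:5], vectorizer.idf_[:5]):
    print(term, idf)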
3. Perform k-means clustering
Each row of variable X is a vector representation of a Wikipedia article. Hence, we can use X as input for the k-means algorithm.
First, we must decide on the number of clusters. Here, we will use the elbow method.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Sum_of_squared_distances = []
K = range(2, 10)
for k in K:
    km = KMeans(n_clusters=k, max_iter=200, n_init=10)
    km = km.fit(X)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
The plot is almost a straight line, probably because we have too few articles. But on closer examination, a bend appears at k=4 or k=6. We will try to cluster into 6 groups.
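Since the elbow is subtle here, the silhouette score offers a useful cross-check (an addition to the original notebook, not part of it); higher average scores indicate better-separated clusters:

from sklearn.metrics import silhouette_score

# average silhouette score for each candidate k (higher is better)
for k in range(2, 10):
    km = KMeans(n_clusters=k, max_iter=200, n_init=10).fit(X)
    print(k, silhouette_score(X, km.labels_))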
true_k = 6
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=200, n_init=10)
model.fit(X)
labels = model.labels_
wiki_cl = pd.DataFrame(list(zip(title, labels)), columns=['title', 'cluster'])
print(wiki_cl.sort_values(by=['cluster']))

Result of clustering
4. Evaluate the result
Since we have used only 10 articles, it is fairly easy to evaluate the clustering just by examining what articles are contained in each cluster. That would be difficult for a large corpus. A nice way is to create a word cloud from the articles of each cluster.
from wordcloud import WordCloud

result = {'cluster': labels, 'wiki': wiki_lst}
result = pd.DataFrame(result)
for k in range(0, true_k):
    s = result[result.cluster == k]
    text = s['wiki'].str.cat(sep=' ')
    text = text.lower()
    text = ' '.join([word for word in text.split()])
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
    print('Cluster: {}'.format(k))
    print('Titles')
    titles = wiki_cl[wiki_cl.cluster == k]['title']
    print(titles.to_string(index=False))
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

Word clouds of the clusters
- Cluster 0 consists of the articles on European Central Bank, Bank and International Monetary Fund
- Cluster 1 consists of the articles on Artificial intelligence and Machine Learning
- Cluster 2 has the article on Swimming
- Cluster 3 has the article on Data Science
- Cluster 4 has the articles on Basketball and Tennis
- Cluster 5 has the article on Financial technology
It might seem odd that Swimming is not in the same cluster as Basketball and Tennis, or that Artificial intelligence and Machine Learning are not in the same group as Data Science. That is because we chose to create 6 clusters. But by looking at the word clouds, we can see that the articles on basketball and tennis share words like game, player, team and ball, while the article on swimming does not.
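If generating images is inconvenient, the same check can be done in text by printing the highest-weighted terms of each cluster centroid; a short sketch along these lines (again an addition, not from the original article):

# top terms per cluster: sort each centroid's tf-idf weights in descending order
terms = vectorizer.get_feature_names_out()
order = model.cluster_centers_.argsort()[:, ::-1]
for k in range(true_k):
    top = [terms[i] for i in order[k, :8]]
    print('Cluster {}: {}'.format(k, ', '.join(top)))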
At the top of the GitHub page, there is a button that will allow you to execute the notebook in Google’s Colab. You can easily try a different number of clusters. The results might surprise you!
Translated from: https://towardsdatascience.com/clustering-documents-with-python-97314ad6a78d