Latent Semantic Analysis in Practice with sklearn.decomposition.TruncatedSVD
Contents

- 1. sklearn.decomposition.TruncatedSVD
- 2. sklearn.feature_extraction.text.TfidfVectorizer
- 3. Code walkthrough
- 4. References

See also my notes on Latent Semantic Analysis (LSA) from 《統計學習方法》 (Statistical Learning Methods).
1. sklearn.decomposition.TruncatedSVD

Official documentation: sklearn.decomposition.TruncatedSVD

class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)

Main parameters:

- n_components: default = 2, the number of topics (components) to keep
- algorithm: default = "randomized", which SVD solver to use
- n_iter: optional (default = 5), number of iterations for the randomized SVD solver; not used by ARPACK
Attributes:

- components_, shape (n_components, n_features)
- explained_variance_, shape (n_components,): the variance of the training samples transformed by a projection to each component
- explained_variance_ratio_, shape (n_components,): the percentage of variance explained by each of the selected components
- singular_values_, shape (n_components,): the singular values corresponding to each of the selected components
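To make these parameters and attributes concrete, here is a minimal sketch; the 6×5 random matrix stands in for a document-term matrix and is illustrative only, not data from this article:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# toy 6x5 matrix standing in for a document-term matrix (random, illustrative)
rng = np.random.RandomState(0)
X = rng.rand(6, 5)

svd = TruncatedSVD(n_components=2, algorithm="randomized", n_iter=5, random_state=42)
X_reduced = svd.fit_transform(X)  # each row: one "document" in the 2-topic space

print(X_reduced.shape)             # (6, 2)
print(svd.components_.shape)       # (2, 5): topic-by-feature matrix
print(svd.singular_values_.shape)  # (2,)
```

Unlike PCA, TruncatedSVD does not center the data first, which is what makes it suitable for large sparse TF-IDF matrices.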
2. sklearn.feature_extraction.text.TfidfVectorizer

Official documentation: sklearn.feature_extraction.text.TfidfVectorizer

Converts a collection of raw documents into a matrix of TF-IDF features. The parameters are explained clearly in this blog post.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.shape)
print(X)
```

```
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
  (0, 8)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045
  (1, 8)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 6)	0.281088674033753
  (1, 1)	0.6876235979836938
  (1, 5)	0.5386476208856763
  (2, 8)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 6)	0.267103787642168
  (2, 0)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 4)	0.511848512707169
  (3, 8)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 6)	0.38408524091481483
  (3, 2)	0.5802858236844359
  (3, 1)	0.46979138557992045
```
3. Code walkthrough

```python
# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/5/1 10:27
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: 17.LSA.py
# @Reference: https://cloud.tencent.com/developer/article/1530432
import numpy as np
from sklearn.decomposition import TruncatedSVD  # LSA (latent semantic analysis)
from sklearn.feature_extraction.text import TfidfVectorizer  # text collection -> weight matrix

# 5 documents
docs = ["Love is patient, love is kind. It does not envy, it does not boast, it is not proud.",
        "It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.",
        "Love does not delight in evil but rejoices with the truth.",
        "It always protects, always trusts, always hopes, always perseveres.",
        "Love never fails. But where there are prophecies, they will cease; where there are tongues, "
        "they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # convert to a TF-IDF weight matrix
print("-------- TF-IDF weights ---------")
print(X)
print("-------- features (words) ---------")
words = vectorizer.get_feature_names()
print(words)
print(len(words), "features (words)")  # 52 words

topics = 4
lsa = TruncatedSVD(n_components=topics)  # latent semantic analysis with 4 topics
X1 = lsa.fit_transform(X)  # fit and transform
print("-------- LSA singular values ---------")
print(lsa.singular_values_)
print("-------- 5 documents in the 4-topic vector space ---------")
print(X1)

pick_docs = 2  # pick the 2 most representative documents per topic
# argsort returns the indices that would sort ascending;
# the reversed slice keeps the top `pick_docs` indices
topic_docid = [X1[:, t].argsort()[:-(pick_docs + 1):-1] for t in range(topics)]
print("-------- top 2 documents per topic ---------")
print(topic_docid)

# print("-------- lsa.components_ ---------")
# print(lsa.components_)  # 4 topics x 52 words: the topic vector space
pick_keywords = 3  # pick 3 keywords per topic
topic_keywdid = [lsa.components_[t].argsort()[:-(pick_keywords + 1):-1] for t in range(topics)]
print("-------- top 3 keywords per topic ---------")
print(topic_keywdid)

print("-------- LSA results ---------")
for t in range(topics):
    print("Topic {}".format(t))
    print("\t keywords: {}".format(", ".join(words[topic_keywdid[t][j]]
                                             for j in range(pick_keywords))))
    for i in range(pick_docs):
        print("\t\t Document {}".format(i))
        print("\t\t", docs[topic_docid[t][i]])
```
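The extended slice `argsort()[:-(k + 1):-1]` used twice above is the idiom for picking the indices of the k largest values; a tiny standalone illustration with toy scores:

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7])
k = 2
# argsort sorts ascending, so the last k positions hold the largest values;
# the reversed slice [:-(k + 1):-1] reads them from the end, largest first
top_k = scores.argsort()[:-(k + 1):-1]
print(top_k)  # [1 3]
```

An equivalent spelling is `scores.argsort()[::-1][:k]`; the article's form avoids reversing the whole array first.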
Run results:

```
-------- TF-IDF weights ---------
  (0, 24)	0.3031801002944161
  (0, 19)	0.4547701504416241
  (0, 32)	0.2263512201359201
  (0, 22)	0.2263512201359201
  (0, 20)	0.3825669873635752
  (0, 12)	0.3031801002944161
  (0, 28)	0.4547701504416241
  (0, 14)	0.2263512201359201
  (0, 6)	0.2263512201359201
  (0, 36)	0.2263512201359201
  (1, 19)	0.28327311337182914
  (1, 20)	0.4765965465346523
  (1, 12)	0.14163655668591457
  (1, 28)	0.42490967005774366
  (1, 11)	0.21148886348790247
  (1, 30)	0.21148886348790247
  (1, 40)	0.21148886348790247
  (1, 39)	0.21148886348790247
  (1, 13)	0.21148886348790247
  (1, 2)	0.21148886348790247
  (1, 21)	0.21148886348790247
  (1, 27)	0.21148886348790247
  (1, 37)	0.21148886348790247
  (1, 29)	0.21148886348790247
  (1, 51)	0.21148886348790247
  :	:
  (3, 46)	0.22185332169737518
  (3, 17)	0.22185332169737518
  (3, 33)	0.22185332169737518
  (4, 24)	0.09483932399667956
  (4, 19)	0.09483932399667956
  (4, 20)	0.0797818291938777
  (4, 7)	0.1142518110942895
  (4, 25)	0.14161217495916
  (4, 16)	0.14161217495916
  (4, 48)	0.42483652487747997
  (4, 43)	0.42483652487747997
  (4, 3)	0.28322434991832
  (4, 34)	0.14161217495916
  (4, 44)	0.28322434991832
  (4, 49)	0.42483652487747997
  (4, 8)	0.14161217495916
  (4, 45)	0.14161217495916
  (4, 5)	0.14161217495916
  (4, 41)	0.14161217495916
  (4, 23)	0.14161217495916
  (4, 31)	0.14161217495916
  (4, 4)	0.14161217495916
  (4, 9)	0.14161217495916
  (4, 0)	0.14161217495916
  (4, 26)	0.14161217495916
-------- features (words) ---------
['13', 'always', 'angered', 'are', 'away', 'be', 'boast', 'but', 'cease', 'corinthians', 'delight', 'dishonor', 'does', 'easily', 'envy', 'evil', 'fails', 'hopes', 'in', 'is', 'it', 'keeps', 'kind', 'knowledge', 'love', 'never', 'niv', 'no', 'not', 'of', 'others', 'pass', 'patient', 'perseveres', 'prophecies', 'protects', 'proud', 'record', 'rejoices', 'seeking', 'self', 'stilled', 'the', 'there', 'they', 'tongues', 'trusts', 'truth', 'where', 'will', 'with', 'wrongs']
52 features (words)
-------- LSA singular values ---------
[1.29695724 1.00165234 0.98752651 0.94862686]
-------- 5 documents in the 4-topic vector space ---------
[[ 0.85667347 -0.00334881 -0.11274158 -0.14912237]
 [ 0.80868148  0.09220662 -0.16057627 -0.33804609]
 [ 0.46603522 -0.3005665  -0.06851382  0.82322097]
 [ 0.13423034  0.92315127  0.22573307  0.2806665 ]
 [ 0.24297388 -0.22857306  0.9386499  -0.08314939]]
-------- top 2 documents per topic ---------
[array([0, 1], dtype=int64), array([3, 1], dtype=int64), array([4, 3], dtype=int64), array([2, 3], dtype=int64)]
-------- top 3 keywords per topic ---------
[array([28, 20, 19], dtype=int64), array([ 1, 46, 33], dtype=int64), array([49, 48, 43], dtype=int64), array([10, 42, 18], dtype=int64)]
-------- LSA results ---------
Topic 0
	 keywords: not, it, is
		 Document 0
		 Love is patient, love is kind. It does not envy, it does not boast, it is not proud.
		 Document 1
		 It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 1
	 keywords: always, trusts, perseveres
		 Document 0
		 It always protects, always trusts, always hopes, always perseveres.
		 Document 1
		 It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 2
	 keywords: will, where, there
		 Document 0
		 Love never fails. But where there are prophecies, they will cease; where there are tongues, they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)
		 Document 1
		 It always protects, always trusts, always hopes, always perseveres.
Topic 3
	 keywords: delight, the, in
		 Document 0
		 Love does not delight in evil but rejoices with the truth.
		 Document 1
		 It always protects, always trusts, always hopes, always perseveres.
```
4. References

This article mainly draws on the post below; many thanks to the author!

sklearn: 利用TruncatedSVD做文本主題分析 (sklearn: using TruncatedSVD for text topic analysis)