Data Analysis: Word Count Statistics and Word Importance Statistics

Published: 2024/1/23 · Programming Q&A · 豆豆

This article, collected by 生活随笔, shows how to count words in a set of documents and how to measure how important each word is. It is shared here as a reference.

1. Word count code

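    # -*- coding: utf-8 -*-
    import jieba
    from sklearn.feature_extraction.text import CountVectorizer

    # Build a small corpus by hand
    content = ['This i is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document? i x y', 'i']
    # content = ['今天陽光真好', '我要去看北京天安門', '逛完天安門之后我要去王府井', '']

    # Segment Chinese text into words.
    # Only needed for the Chinese corpus; in that case pass content_list to fit_transform below.
    content_list = []
    for tmp in content:
        # Use precise mode
        res = jieba.cut(tmp, cut_all=False)
        res_str = ','.join(res)
        content_list.append(res_str)

    # 1. Create the vectorizer
    con_vet = CountVectorizer()

    # 2. Extract the words
    # With the default settings, English text is lowercased and split into runs of
    # two or more word characters, so whitespace and punctuation act as separators.
    # Single-character words are assumed to have no effect on document
    # classification, so they are not extracted.
    X = con_vet.fit_transform(content)

    # Get the extracted words
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    names = con_vet.get_feature_names_out()
    print(names)
    print(X)            # sparse matrix: (document index, word index) count
    print(X.toarray())  # dense matrix: one row per document, one column per word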
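The comment above about single-character words can be checked directly. The following is a minimal sketch, not part of the original article: it assumes a shortened version of the English corpus, and the `token_pattern` override is one possible way to keep single-character tokens if you do want them counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened illustrative corpus (an assumption, not the article's full list)
docs = ['This i is the first document.', 'Is this the first document? i x y', 'i']

# Default token pattern: only tokens of two or more word characters are kept
default_vec = CountVectorizer()
default_vec.fit(docs)
default_vocab = sorted(default_vec.vocabulary_)

# Relaxed pattern: tokens of one or more word characters are kept
keep_single = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_single.fit(docs)
single_vocab = sorted(keep_single.vocabulary_)

print(default_vocab)  # 'i', 'x' and 'y' are absent
print(single_vocab)   # 'i', 'x' and 'y' are present
```

Whether to keep single-character tokens depends on the task. For segmented Chinese text, many meaningful words are a single character, so the relaxed pattern may be worth trying there.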

2. Word importance (TF-IDF) code

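    # -*- coding: utf-8 -*-
    from sklearn.feature_extraction.text import TfidfVectorizer
    import jieba

    # Build a small corpus by hand
    # content = ['This i is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document? i x y', 'i']
    content = ['今天陽光真好', '我要去看北京天安門', '逛完天安門之后我要去王府井', '']

    # Segment Chinese text into words
    content_list = []
    for tmp in content:
        # Use precise mode
        res = jieba.cut(tmp, cut_all=False)
        res_str = ','.join(res)
        content_list.append(res_str)

    # 1. Create the vectorizer
    # min_df=1 (the default): a word must appear in at least one document to be kept
    # stop_words: words to exclude from the vocabulary
    tf_vec = TfidfVectorizer(stop_words=['之后', '今天'])

    # 2. Compute the importance of each word
    X = tf_vec.fit_transform(content_list)

    # Get the extracted words
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    names = tf_vec.get_feature_names_out()
    print(names)
    print(X.toarray())  # one row per document, one TF-IDF weight per word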
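To see where the weights printed by `TfidfVectorizer` come from, it can help to recompute them by hand. The sketch below is an illustration under stated assumptions, not the article's own code: it uses a small English corpus so that it runs without `jieba`, and it assumes the vectorizer's default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative corpus (an assumption; any non-empty documents would do)
docs = ['the first document', 'the second second document', 'the third one']

# Term frequency: raw counts per document
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Document frequency: in how many documents each word appears
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
sk = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, sk)
print(matches)
```

A word that appears in every document, such as "the" here, gets the smallest IDF, so it contributes less to each row than rarer words with the same count.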
