Python LDA Topic Modeling in Practice

This post walks through training an LDA topic model on the Reuters corpus bundled with the Python lda package.
import numpy as np
import lda

X = lda.datasets.load_reuters()  # bundled Reuters document-term count matrix
X.shape

Output:

(395, 4258)
This tells us X is a 395 × 4258 matrix: there are 395 training documents over a vocabulary of 4258 words.
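As a quick aside (not in the original post), a minimal sanity check on the loaded matrix: lda expects non-negative integer counts, and the total count should match the n_words value reported in the training log below.

print(X.dtype)     # should be an integer dtype; lda requires count data
print(X.sum())     # total token count across the corpus: 84010
print(X[0].sum())  # token count of the first document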
vocab = lda.datasets.load_reuters_vocab()
len(vocab)

Output:

4258
title = lda.datasets.load_reuters_titles()
title[:10]

Output:

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
 '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21',
 "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23",
 '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25',
 '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25',
 "5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25",
 '6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26',
 "7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25",
 '8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26',
 '9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26')
Now train the model; here the number of topics is set to 20 and the number of iterations to 1500.
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)
Console output:
INFO:lda:n_documents: 395
INFO:lda:vocab_size: 4258
INFO:lda:n_words: 84010
INFO:lda:n_topics: 20
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -1051748
INFO:lda:<10> log likelihood: -719800
INFO:lda:<20> log likelihood: -699115
INFO:lda:<30> log likelihood: -689370
INFO:lda:<40> log likelihood: -684918
...
INFO:lda:<1450> log likelihood: -654884
INFO:lda:<1460> log likelihood: -655493
INFO:lda:<1470> log likelihood: -655415
INFO:lda:<1480> log likelihood: -655192
INFO:lda:<1490> log likelihood: -655728
INFO:lda:<1499> log likelihood: -655858
<lda.lda.LDA at 0x7effa0508550>
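The log likelihood climbs steeply over the first few dozen iterations and then flattens out around -655000, the usual sign that the Gibbs sampler has converged. If you want to inspect this trace programmatically rather than reading the log, the fitted model keeps it; a minimal sketch (assumes matplotlib is installed; loglikelihoods_ holds one value per refresh interval, 10 iterations by default):

import matplotlib.pyplot as plt

# Trace of the log likelihood, one entry per `refresh` iterations
plt.plot(model.loglikelihoods_)
plt.xlabel("iteration (x10)")
plt.ylabel("log likelihood")
plt.show()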
topic_word = model.topic_word_  # topic-word distribution: one row per topic
print(topic_word.shape)
topic_word
Output:
(20, 4258)
array([[3.62505347e-06, 3.62505347e-06, 3.62505347e-06, ...,
        3.62505347e-06, 3.62505347e-06, 3.62505347e-06],
       [1.87498968e-02, 1.17916463e-06, 1.17916463e-06, ...,
        1.17916463e-06, 1.17916463e-06, 1.17916463e-06],
       [1.52206232e-03, 5.05668544e-06, 4.05040504e-03, ...,
        5.05668544e-06, 5.05668544e-06, 5.05668544e-06],
       ...,
       [4.17266923e-02, 3.93610908e-06, 9.05698699e-03, ...,
        3.93610908e-06, 3.93610908e-06, 3.93610908e-06],
       [2.37609835e-06, 2.37609835e-06, 2.37609835e-06, ...,
        2.37609835e-06, 2.37609835e-06, 2.37609835e-06],
       [3.46310752e-06, 3.46310752e-06, 3.46310752e-06, ...,
        3.46310752e-06, 3.46310752e-06, 3.46310752e-06]])
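Each row of topic_word is a probability distribution over the 4258 vocabulary words, so every row should sum to 1. A quick check (not in the original post):

# Every topic's word distribution sums to 1, up to floating-point error
print(np.allclose(topic_word.sum(axis=1), 1.0))  # True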
Print the 8 highest-probability words for each of the 20 topics:

for i, topic_dist in enumerate(topic_word):
    print(np.array(vocab)[np.argsort(topic_dist)][:-9:-1])
['british' 'churchill' 'sale' 'million' 'major' 'letters' 'west' 'britain']
['church' 'government' 'political' 'country' 'state' 'people' 'party' 'against']
['elvis' 'king' 'fans' 'presley' 'life' 'concert' 'young' 'death']
['yeltsin' 'russian' 'russia' 'president' 'kremlin' 'moscow' 'michael' 'operation']
['pope' 'vatican' 'paul' 'john' 'surgery' 'hospital' 'pontiff' 'rome']
['family' 'funeral' 'police' 'miami' 'versace' 'cunanan' 'city' 'service']
['simpson' 'former' 'years' 'court' 'president' 'wife' 'south' 'church']
['order' 'mother' 'successor' 'election' 'nuns' 'church' 'nirmala' 'head']
['charles' 'prince' 'diana' 'royal' 'king' 'queen' 'parker' 'bowles']
['film' 'french' 'france' 'against' 'bardot' 'paris' 'poster' 'animal']
['germany' 'german' 'war' 'nazi' 'letter' 'christian' 'book' 'jews']
['east' 'peace' 'prize' 'award' 'timor' 'quebec' 'belo' 'leader']
["n't" 'life' 'show' 'told' 'very' 'love' 'television' 'father']
['years' 'year' 'time' 'last' 'church' 'world' 'people' 'say']
['mother' 'teresa' 'heart' 'calcutta' 'charity' 'nun' 'hospital' 'missionaries']
['city' 'salonika' 'capital' 'buddhist' 'cultural' 'vietnam' 'byzantine' 'show']
['music' 'tour' 'opera' 'singer' 'israel' 'people' 'film' 'israeli']
['church' 'catholic' 'bernardin' 'cardinal' 'bishop' 'wright' 'death' 'cancer']
['harriman' 'clinton' 'u.s' 'ambassador' 'paris' 'president' 'churchill' 'france']
['city' 'museum' 'art' 'exhibition' 'century' 'million' 'churches' 'set']
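To see the probabilities behind these words rather than just the words themselves, the same argsort trick works one topic at a time. A minimal sketch (top_n and topic_id are example values; topic 8 is the 'charles'/'diana' topic above):

# Top words of a single topic together with their probabilities
top_n = 8
topic_id = 8
top_idx = np.argsort(topic_word[topic_id])[:-(top_n + 1):-1]
for word, prob in zip(np.array(vocab)[top_idx], topic_word[topic_id][top_idx]):
    print(f"{word}: {prob:.4f}")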
Next, get each document's distribution over the topics, and from that distribution the most likely topic for each document.
doc_topic = model.doc_topic_  # document-topic distribution: one row per document
print(doc_topic.shape)
print("Topic distribution of the first document:", doc_topic[0])
print("Most likely topic of the first document:", doc_topic[0].argmax())
(395, 20)
Topic distribution of the first document: [4.34782609e-04 3.52173913e-02 4.34782609e-04 9.13043478e-03
 4.78260870e-03 4.34782609e-04 9.13043478e-03 3.08695652e-02
 5.04782609e-01 4.78260870e-03 4.34782609e-04 4.34782609e-04
 3.08695652e-02 2.17826087e-01 4.34782609e-04 4.34782609e-04
 4.34782609e-04 3.95652174e-02 4.34782609e-04 1.09130435e-01]
Most likely topic of the first document: 8
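Topic 8 is indeed the 'charles' / 'prince' / 'diana' topic listed above, which fits the first title, "0 UK: Prince Charles spearheads British royal revolution." To pair every document with its most likely topic in one pass, a small sketch (not in the original post):

# Most likely topic for the first 10 documents, next to their titles
most_likely = doc_topic.argmax(axis=1)
for n in range(10):
    print(f"topic {most_likely[n]}: {title[n]}")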
Reposted from: https://blog.csdn.net/jiangzhenkang/article/details/84335646