[NLP] Text Sentiment Analysis
Last night it got too late and the code still hadn't finished running, and on top of that the PSO-LSTM accuracy couldn't be reproduced /(ㄒoㄒ)/, so today I'm filling in the details.
Text Sentiment Analysis
- 1. Introduction to Sentiment Analysis
- 2. Text Description and Corpus Analysis
- 3. Dataset Analysis
- 4. The LSTM Model
- 5. Key Functions Explained
- plot_model
- np_utils.to_categorical
- model.summary()
- Special Thanks
1. Introduction to Sentiment Analysis
Sentiment analysis is the computational study of people's opinions, sentiments, emotions, appraisals, and attitudes toward products, services, organizations, individuals, issues, events, and topics, together with their attributes. Text sentiment analysis is a common application of natural language processing (NLP) and an interesting fundamental task, especially classification aimed at distilling the emotional content of a text. It is the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional coloring.
This article focuses on sentiment polarity (orientation) analysis, that is, judging whether a text is positive, negative, or neutral. In most application scenarios only two classes are used. For example, the words 喜愛 (like) and 厭惡 (loathe) express opposite sentiment orientations.
This article walks through text-data preprocessing in detail and uses an LSTM, a deep learning model, to perform sentiment analysis on the text.
2. Text Description and Corpus Analysis
This project uses the reviews of one product on an e-commerce site as its corpus (corpus.csv, saved locally as data/data_single.csv in the code below; a download link is provided in the original post). The dataset contains 4,310 reviews, each labeled with one of two sentiments, 正面 (positive) or 反面 (negative). The first few rows look like this:
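Since the preview of the first rows is an image in the original post, here is a minimal sketch of inspecting the file with pandas (assuming, as the code below does, that the two columns are named evaluation and label):

```python
import pandas as pd

# Load the review corpus; two columns are assumed:
# 'evaluation' (the review text) and 'label' (正面 or 反面)
df = pd.read_csv('data/data_single.csv')
print(df.shape)                    # expected: (4310, 2)
print(df.head())                   # peek at the first few reviews
print(df['label'].value_counts())  # balance of the two sentiment classes
```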
3. Dataset Analysis
The following code computes the sentiment distribution of the dataset and the distribution of review sentence lengths:
```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import font_manager
from itertools import accumulate

# Use a Chinese font so the plot labels render correctly
my_font = font_manager.FontProperties(fname=r"C:\Windows\Fonts\simhei.ttf")

df = pd.read_csv('data/data_single.csv')
# Sentiment distribution: number of reviews per label
print(df.groupby('label')['label'].count())

# Sentence length distribution
df['length'] = df['evaluation'].apply(lambda x: len(x))
len_df = df.groupby('length').count()
sent_length = len_df.index.tolist()
sent_freq = len_df['evaluation'].tolist()

# Bar chart: sentence length vs. frequency
plt.bar(sent_length, sent_freq)
plt.title('句子長度及出現頻數統計圖', fontproperties=my_font)
plt.xlabel('句子長度', fontproperties=my_font)
plt.ylabel('句子長度出現的頻數', fontproperties=my_font)
plt.show()
plt.close()

# Cumulative frequency of sentence lengths
sent_pentage_list = [(count / sum(sent_freq)) for count in accumulate(sent_freq)]
plt.plot(sent_length, sent_pentage_list)

# Find the sentence length at the 0.91 quantile
quantile = 0.91
print(list(sent_pentage_list))
for length, per in zip(sent_length, sent_pentage_list):
    if round(per, 2) == quantile:
        index = length
        break
print('\n分位點為%s的句子長度:%d.' % (quantile, index))
plt.show()
plt.close()

# CDF plot with dashed guide lines at the chosen quantile
plt.plot(sent_length, sent_pentage_list)
plt.hlines(quantile, 0, index, colors='c', linestyles='dashed')
plt.vlines(index, 0, quantile, colors='c', linestyles='dashed')
plt.text(0, quantile, str(quantile))
plt.text(index, 0, str(index))
plt.title('句子長度累計分布函數圖', fontproperties=my_font)
plt.xlabel('句子長度', fontproperties=my_font)
plt.ylabel('句子長度累積頻率', fontproperties=my_font)
plt.show()
plt.close()
```
The output is as follows:
The bar chart of sentence length vs. frequency:
The cumulative distribution function of sentence length:
As the plots show, most sentences are between 1 and 200 characters long; taking the 0.91 quantile of the cumulative length frequency gives a sentence length of about 183.
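Note that the matching loop above only fires when some cumulative frequency rounds to exactly 0.91, so it can silently miss the quantile. As a more robust alternative (a sketch, not the original code, assuming the same data/data_single.csv file), pandas can compute the quantile directly:

```python
import pandas as pd

df = pd.read_csv('data/data_single.csv')
df['length'] = df['evaluation'].apply(len)

# Query the 0.91 quantile of the length distribution directly,
# avoiding the exact round(per, 2) == quantile match used above
cutoff = int(df['length'].quantile(0.91))
print('Sentence length at the 0.91 quantile: %d' % cutoff)
```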
4. The LSTM Model
The model architecture is as follows:
The code:
```python
import pickle
import numpy as np
import pandas as pd
from keras.utils import np_utils
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, Dense, Embedding, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def load_data(filepath, input_shape=20):
    df = pd.read_csv(filepath)

    # Unique labels and unique review texts
    labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())

    # Build a character-level vocabulary from all review texts
    string = ''
    for word in vocabulary:
        string += word
    vocabulary = set(string)

    # Map each character to an integer index (0 is reserved for padding)
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    with open('word_dict.pk', 'wb') as f:
        pickle.dump(word_dictionary, f)
    inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}

    label_dictionary = {label: i for i, label in enumerate(labels)}
    with open('label_dict.pk', 'wb') as f:
        pickle.dump(label_dictionary, f)
    output_dictionary = {i: labels for i, labels in enumerate(labels)}

    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    # Encode every review as a sequence of character indices, padded to input_shape
    x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[sent]] for sent in df['label']]
    # np_utils.to_categorical turns the integer labels into a one-hot matrix of
    # shape (nb_samples, nb_classes); see Section 5 for details
    y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
    y = np.array([list(_[0]) for _ in y])

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary


def create_LSTM(n_units, input_shape, output_dim, filepath):
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = load_data(filepath)

    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    # input_shape here is ignored by Keras because the Embedding layer
    # above already fixes the input dimensions
    model.add(LSTM(n_units, input_shape=(x.shape[0], x.shape[1])))
    model.add(Dropout(0.2))
    model.add(Dense(label_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # If this raises ImportError ('You must install pydot (`pip install pydot`)
    # and install graphviz ...'), see the plot_model notes in Section 5 and
    # https://www.pianshen.com/article/6746984081/
    plot_model(model, to_file='./model_lstm.png', show_shapes=True)
    model.summary()
    return model


def model_train(input_shape, filepath, model_save_path):
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = load_data(filepath, input_shape)
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.1, random_state=42)

    # Hyperparameters
    n_units = 100
    batch_size = 32
    epochs = 5
    output_dim = 20

    lstm_model = create_LSTM(n_units, input_shape, output_dim, filepath)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)
    lstm_model.save(model_save_path)

    # Predict the test samples one at a time and collect the results
    N = test_x.shape[0]
    predict = []
    label = []
    for start, end in zip(range(0, N, 1), range(1, N + 1, 1)):
        print(f'start:{start}, end:{end}')
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        print('y_predict:', y_predict)
        label_predict = output_dictionary[np.argmax(y_predict[0])]
        label_true = output_dictionary[np.argmax(test_y[start:end])]
        print(f'label_predict:{label_predict}, label_true:{label_true}')
        print(''.join(sentence), label_true, label_predict)
        predict.append(label_predict)
        label.append(label_true)

    acc = accuracy_score(predict, label)
    print('模型在測試集上的準確率:%s' % acc)  # accuracy on the test set


if __name__ == '__main__':
    filepath = 'data/data_single.csv'
    input_shape = 180
    model_save_path = 'data/corpus_model.h5'
    model_train(input_shape, filepath, model_save_path)
```
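As a follow-up usage sketch (not part of the original post), the artifacts saved above, word_dict.pk, label_dict.pk, and corpus_model.h5, can be reloaded to score a new review; the sample sentence is hypothetical:

```python
import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# Reload the dictionaries and model produced by the training script above
with open('word_dict.pk', 'rb') as f:
    word_dictionary = pickle.load(f)
with open('label_dict.pk', 'rb') as f:
    label_dictionary = pickle.load(f)
model = load_model('data/corpus_model.h5')
inverse_label_dictionary = {v: k for k, v in label_dictionary.items()}

# Encode a new review with the same character dictionary and padding length
sentence = '質量很好,非常滿意'  # hypothetical sample review
x = [[word_dictionary[ch] for ch in sentence if ch in word_dictionary]]
x = pad_sequences(maxlen=180, sequences=x, padding='post', value=0)

y_predict = model.predict(x)
print(inverse_label_dictionary[int(np.argmax(y_predict[0]))])
```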
5. Key Functions Explained
plot_model
If from keras.utils import plot_model raises an error, change it to from keras.utils.vis_utils import plot_model.
In my case the error persisted even after this change: ImportError: ('You must install pydot (pip install pydot) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')
The fix is as follows:
- (1) pip install pydot_ng
- (2) pip install graphviz — though rather than installing it with pip directly, it is better to download Graphviz from the official site (I downloaded the version shown below), unzip it into the site-packages directory of the corresponding Anaconda environment, and copy the path of its bin directory.
- (3) Edit site-packages\pydot_ng\__init__.py and, under "Method 3", add: path = r"D:\App\tech\Anaconda3\envs\nlp\Lib\site-packages\Graphviz\bin" (this path should point to the bin directory you just copied, as shown in the figure). A runtime alternative that avoids editing site-packages is sketched below.
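If editing files inside site-packages feels too invasive, a common alternative (a sketch; the directory must match your own Graphviz install) is to extend PATH at runtime before calling plot_model:

```python
import os

# Make the Graphviz executables visible to pydot without touching site-packages;
# replace the path below with your own Graphviz bin directory
os.environ['PATH'] += os.pathsep + r'D:\App\tech\Anaconda3\envs\nlp\Lib\site-packages\Graphviz\bin'
```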
np_utils.to_categorical
np_utils.to_categorical converts integer labels into a binary (one-hot) matrix of shape (nb_samples, nb_classes).
For example, with num_classes = 8, converting the labels [1, 2, 3, …, 4] produces:
[[0, 1, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0]
……
[0, 0, 0, 0, 1, 0, 0, 0]]
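A quick runnable check of this behavior (same keras.utils import as in the training script):

```python
from keras.utils import np_utils

# One-hot encode the labels 1..4 into an 8-class binary matrix
y = np_utils.to_categorical([1, 2, 3, 4], num_classes=8)
print(y.shape)  # (4, 8)
print(y)        # each row has a single 1 at its label's index
```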
model.summary()
model.summary() prints each layer's configuration and parameter counts, as shown in the figure:
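The parameter counts in the summary can be checked by hand: the Embedding layer holds (vocab_size + 1) × output_dim weights, and the LSTM layer holds 4 × ((output_dim + n_units + 1) × n_units) weights for its four gates. A small sketch rebuilding the same architecture (vocab_size here is a placeholder; the real value comes from load_data):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, Dropout

vocab_size, output_dim, input_shape = 2000, 20, 180  # placeholder sizes
n_units, label_size = 100, 2

model = Sequential()
model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                    input_length=input_shape))
model.add(LSTM(n_units))   # params: 4 * ((20 + 100 + 1) * 100) = 48,400
model.add(Dropout(0.2))    # Dropout adds no trainable parameters
model.add(Dense(label_size, activation='softmax'))  # (100 + 1) * 2 = 202 params
model.summary()            # prints each layer's output shape and parameter count
```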
Special Thanks
This article drew on the blog of 農夫三拳有點疼 and on the reference link for resolving the error above.