Published: 2023/12/20

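Twitter Sentiment Analysis with Python and NLTK

Author: 宋彤彤

1. Introduction

NLTK is a natural language processing library for Python that includes an implementation of the naive Bayes classification algorithm. In this article, Mo shows how to use Python and the nltk module to classify tweets as expressing positive or negative sentiment.

The project contains a runnable code tutorial, naive_code.ipynb, and a deployment file, Deploy.ipynb, that has been tidied up for deployment. Read them alongside the previously published article introducing deployment on the Mo platform to learn how to deploy your own application. You can also open the project link and test the deployed application, for example by entering 'My house is great.' and 'My house is not great.' to see whether each is classified as positive or negative.

2. Preparation

2.1 Importing the toolkit

First, import the nltk package that we will use.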

import nltk

# If the package is not installed, follow the steps below:
# pip install nltk
# import nltk
# nltk.download()  # download dependent resources; nltk.download('popular') is usually enough

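2.2 Preparing the data

Training a model that performs well requires a large amount of labeled data. Here we start with a small amount of data to understand the overall workflow and the underlying principles; for better results, increase the amount of training data.

Because this is a binary classification model, we need two classes of data, labeled 'positive' and 'negative'. Once the model is trained, we also need test data to check how well it works.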

# Tweets labeled positive
pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

# Tweets labeled negative
neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative')]

# Test data, kept for later
test_tweets = [('I feel happy this morning', 'positive'),
               ('Larry is my friend', 'positive'),
               ('I do not like that man', 'negative'),
               ('My house is not great', 'negative'),
               ('Your song is annoying', 'negative')]

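2.3 Feature extraction

To train the model we need to extract useful features from the training data. Here, the features are the meaningful words in each tweet, paired with the tweet's label. So how do we extract those words?

First, split each tweet into words, convert them to lowercase, and keep only the words longer than 2 characters; the resulting list represents one tweet. Then merge the words from all the tweets in the training data.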

# Merge the data and split each tweet into words, dropping words shorter than 3 characters
tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))

print(tweets)

# Collect every word in the training data; the word feature list is built from these words
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

words_in_tweets = get_words_in_tweets(tweets)
print(words_in_tweets)
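As a quick check on the merged word list, a frequency count shows which words dominate it. The following is a minimal sketch that uses the standard library's collections.Counter instead of anything from NLTK; `sample_tweets`, `tokenize`, and `word_counts` are hypothetical names, and the three sentences are copied from the training data above.

```python
from collections import Counter

# Three sentences copied from the training data; all names here are illustrative.
sample_tweets = [
    "I love this car",
    "This view is amazing",
    "I do not like this car",
]

def tokenize(text, min_len=3):
    """Lowercase, split on whitespace, and keep words of at least min_len characters."""
    return [w.lower() for w in text.split() if len(w) >= min_len]

# Count how often each kept word occurs across all sample tweets.
word_counts = Counter(w for text in sample_tweets for w in tokenize(text))
print(word_counts.most_common(2))
```

Words such as 'this' appear under both labels, so a high count alone does not make a word useful for separating the classes.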

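To train the classifier we need a uniform feature representation: whether or not a tweet contains each word in our vocabulary. The feature extractor below takes the word list of a tweet and produces these features.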

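# Vocabulary of distinct training words, in order of first appearance
# (assumed definition; extract_features below relies on word_features)
word_features = list(dict.fromkeys(words_in_tweets))

# Extract features from one tweet; the resulting dict records which vocabulary words it contains
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

print(extract_features(['love', 'this', 'car']))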
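To make the shape of the feature dictionary easier to see, here is a small standalone sketch of the same idea with the vocabulary passed in explicitly instead of read from a global. `extract_features_from` and the four-word `vocabulary` are hypothetical and only serve as an illustration.

```python
def extract_features_from(words, vocabulary):
    """Map a tokenized tweet to {'contains(word)': bool} over a fixed vocabulary."""
    present = set(words)
    return {"contains({})".format(w): (w in present) for w in vocabulary}

# Tiny illustrative vocabulary; the real one is built from the training tweets.
vocabulary = ["love", "this", "car", "not"]

# 'rocks' is not in the vocabulary, so it contributes no feature at all.
features = extract_features_from(["this", "car", "rocks"], vocabulary)
print(features)
```

Every tweet yields a dictionary with the same keys, one per vocabulary word, which is what lets the classifier compare tweets of different lengths.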

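2.4 Building the training set and training the classifier

Build the training set with the apply_features method of nltk's classify module.

# Build the training set with apply_features
training_set = nltk.classify.apply_features(extract_features, tweets)
print(training_set)

Train the naive Bayes classifier.

# Train the naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(training_set)

At this point our classifier has had its initial training and is ready to use.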
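To show roughly what training and classification involve, the sketch below implements a small naive Bayes classifier in plain Python on the same ten training tweets. It is an illustration, not NLTK's code: it assumes add-0.5 smoothing (expected likelihood estimation, which is NLTK's default estimator) and two possible values per feature, and every name in it (`train`, `tokenize`, `featurize`, `classify`) is hypothetical.

```python
import math
from collections import Counter

# The ten training tweets from section 2.2.
train = [
    ("I love this car", "positive"),
    ("This view is amazing", "positive"),
    ("I feel great this morning", "positive"),
    ("I am so excited about the concert", "positive"),
    ("He is my best friend", "positive"),
    ("I do not like this car", "negative"),
    ("This view is horrible", "negative"),
    ("I feel tired this morning", "negative"),
    ("I am not looking forward to the concert", "negative"),
    ("He is my enemy", "negative"),
]

def tokenize(text):
    """Lowercase, split on whitespace, keep words of 3+ characters."""
    return [w.lower() for w in text.split() if len(w) >= 3]

vocabulary = sorted({w for text, _ in train for w in tokenize(text)})

def featurize(text):
    """Boolean presence feature for every vocabulary word."""
    words = set(tokenize(text))
    return {w: (w in words) for w in vocabulary}

# "Training" is just counting labels and (label, word, present?) combinations.
label_counts = Counter(label for _, label in train)
value_counts = Counter()
for text, label in train:
    for word, present in featurize(text).items():
        value_counts[(label, word, present)] += 1

def classify(text, gamma=0.5):
    """Return the label with the highest smoothed log posterior score."""
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        # Smoothed log prior of the label.
        score = math.log((n + gamma) / (total + gamma * len(label_counts)))
        for word, present in featurize(text).items():
            count = value_counts[(label, word, present)]
            # Smoothed likelihood; each feature has two values (True/False).
            score += math.log((count + gamma) / (n + 2 * gamma))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("My house is great"))
print(classify("My house is not great"))
```

Working in log space avoids multiplying many small probabilities together; the label with the larger summed log score wins.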

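3. Testing

How well does our classifier work? Let's check it against the test set we prepared earlier. It reaches an accuracy of 0.8, that is, 4 of the 5 test tweets are classified correctly.

count = 0
for (tweet, sentiment) in test_tweets:
    # Lowercase before splitting so test tweets are tokenized like the training data
    if classifier.classify(extract_features(tweet.lower().split())) == sentiment:
        print('Yes, it is ' + sentiment + ' - ' + tweet)
        count = count + 1
    else:
        print('No, it is ' + sentiment + ' - ' + tweet)

rate = count / len(test_tweets)
print('Our correct rate is:', rate)

The sentence 'Your song is annoying' is misclassified because our vocabulary contains no information about the word 'annoying'. This also shows how much the dataset matters.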
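One way to see why a sentence fails is to list which of its words the vocabulary actually knows. The sketch below is a hypothetical helper, not part of the original project; the `vocabulary` set is the list of distinct training words of three or more characters, written out by hand from the data in section 2.2.

```python
def known_words(text, vocabulary):
    """Return the words of text that appear in the vocabulary."""
    return [w for w in text.lower().split() if w in vocabulary]

# Distinct training words of three or more characters, copied from section 2.2.
vocabulary = {
    "love", "this", "car", "view", "amazing", "feel", "great", "morning",
    "excited", "about", "the", "concert", "best", "friend", "not", "like",
    "horrible", "tired", "looking", "forward", "enemy",
}

for tweet in ["My house is not great", "Your song is annoying"]:
    print(tweet, "->", known_words(tweet, vocabulary))
```

When the list is empty, every contains(...) feature is False, so the prediction rests only on what the absence of the training words suggests and not on anything the sentence says.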

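4. Analysis and summary

The classifier's _label_probdist holds the prior probabilities of the labels. In our example, the positive and negative labels each have a probability of 0.5.

The classifier's _feature_probdist is the feature/value probability dictionary. Together with _label_probdist it is used to create the classifier. It associates an expected likelihood estimate with each feature and label. For example, the estimated probability that a tweet labeled negative contains the word 'best' is about 0.083.

The show_most_informative_features() method displays the most informative features of the classifier. We can see that if the input does not contain 'not', positive is about 1.6 times as likely as negative; if it does not contain 'best', negative is about 1.2 times as likely as positive.

print(classifier._label_probdist.prob('positive'))
print(classifier._label_probdist.prob('negative'))
print(classifier._feature_probdist)
print(classifier._feature_probdist[('negative', 'contains(best)')].prob(True))
# show_most_informative_features() prints its table itself and returns None
classifier.show_most_informative_features()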
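The numbers in this section can be reproduced by hand. The sketch below assumes the classifier uses NLTK's default estimator, expected likelihood estimation, which adds 0.5 to every count before normalizing; `ele` is a hypothetical helper and the counts are read off the ten training tweets.

```python
def ele(count, total, bins, gamma=0.5):
    """Expected likelihood estimate: add gamma to every bin before normalizing."""
    return (count + gamma) / (total + gamma * bins)

# Counts read off the training data: 5 positive and 5 negative tweets.
prior_positive = ele(5, 10, bins=2)      # 5 of 10 tweets are positive
p_best_given_neg = ele(0, 5, bins=2)     # no negative tweet contains 'best'
p_no_not_given_pos = ele(5, 5, bins=2)   # all 5 positive tweets lack 'not'
p_no_not_given_neg = ele(3, 5, bins=2)   # 3 of 5 negative tweets lack 'not'
p_no_best_given_pos = ele(4, 5, bins=2)  # 4 of 5 positive tweets lack 'best'
p_no_best_given_neg = ele(5, 5, bins=2)  # all 5 negative tweets lack 'best'

print(round(prior_positive, 3))
print(round(p_best_given_neg, 4))
print(round(p_no_not_given_pos / p_no_not_given_neg, 2))
print(round(p_no_best_given_neg / p_no_best_given_pos, 2))
```

The last two values are likelihood ratios between the two labels for a single feature value, which is the quantity show_most_informative_features() ranks by.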


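About us

Mo (https://momodel.cn) is an online AI modeling platform that supports Python and helps you quickly develop, train, and deploy models.

Mo also runs ongoing introductory machine learning courses and paper-sharing sessions. Follow our official account for the latest news!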
