Analyzing Novels with Python: Sentiment Analysis of the Harry Potter Series
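Preparing the data
Each novel is stored in a single txt file. We want to analyze the text chapter by chapter, with the chapters held in a list (the first item is the content of Chapter 1, the second item is the content of Chapter 2, and so on), so we need regular expressions to split the data.
As an example, open "01-Harry Potter and the Sorcerer's Stone.txt" and look at how its chapters are laid out.
Searching through the file shows that every chapter heading follows the same pattern:
[Chapter][space][integer][newline \n][English title that may contain spaces][newline \n]
To get familiar with regular expressions, we first use this layout to design a pattern that extracts the chapter headings.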
import re
import nltk
raw_text = open("data/01-Harry Potter and the Sorcerer's Stone.txt").read()
# "Chapter", a space, digits, a newline, a title made of letters and spaces, a newline
pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
re.findall(pattern, raw_text)
['Chapter 1\nThe Boy Who Lived\n',
 'Chapter 2\nThe Vanishing Glass\n',
 'Chapter 3\nThe Letters From No One\n',
 'Chapter 4\nThe Keeper Of The Keys\n',
 'Chapter 5\nDiagon Alley\n',
 'Chapter 7\nThe Sorting Hat\n',
 'Chapter 8\nThe Potions Master\n',
 'Chapter 9\nThe Midnight Duel\n',
 'Chapter 10\nHalloween\n',
 'Chapter 11\nQuidditch\n',
 'Chapter 12\nThe Mirror Of Erised\n',
 'Chapter 13\nNicholas Flamel\n',
 'Chapter 14\nNorbert the Norwegian Ridgeback\n',
 'Chapter 15\nThe Forbidden Forest\n',
 'Chapter 16\nThrough the Trapdoor\n',
 'Chapter 17\nThe Man With Two Faces\n']
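One thing worth noticing in the list above: Chapter 6 is missing. The most likely reason is that its title ("The Journey from Platform Nine and Three-Quarters") contains a hyphen, which the character class [a-zA-Z ] does not accept. Below is a minimal sketch of a more permissive pattern. The sample string and the name loose_pattern are illustrative assumptions, not part of the original data or code, and the loose pattern assumes that every heading is "Chapter N" on one line followed by a non-empty title line.

```python
import re

# Illustrative sample that mimics the heading layout; not taken from the data files.
sample = (
    "Chapter 5\nDiagon Alley\nSome text.\n"
    "Chapter 6\nThe Journey from Platform Nine and Three-Quarters\nMore text.\n"
    "Chapter 7\nThe Sorting Hat\nEven more text.\n"
)

strict_pattern = r'Chapter \d+\n[a-zA-Z ]+\n'  # the pattern used in the article
loose_pattern = r'Chapter \d+\n[^\n]+\n'       # assumption: any non-empty title line

strict_hits = re.findall(strict_pattern, sample)
loose_hits = re.findall(loose_pattern, sample)

print(len(strict_hits))  # 2: the hyphenated title is skipped
print(len(loose_hits))   # 3: all three headings are found
```

If the chapter counts matter for the analysis, it may be worth checking each book for titles with punctuation before settling on a pattern.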
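Now that the regular expression is familiar, we want to be more precise. I prepared a test text whose chapter headings look like those in the actual novel, only shorter and easier to follow. The test data contains just 5 chapters, so the resulting list should have length 5: the first item is the content of Chapter 1, the second item is the content of Chapter 2, and so on.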
import re
test = """Chapter 1\nThe Boy Who Lived\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,
Chapter 2\nThe Vanishing Glass\nFor a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.
Chapter 3\nThe Letters From No One\nThe traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.
Chapter 4\nThe Keeper Of The Keys\nHe didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin.
Chapter 5\nDiagon Alley\nIt was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. """
# Build the list of chapter contents (first item is Chapter 1's content, second item is Chapter 2's)
# The `if c` filter drops empty strings so the list length matches the expected number of chapters
chapter_contents = [c for c in re.split(r'Chapter \d+\n[a-zA-Z ]+\n', test) if c]
chapter_contents
['Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings,\n',
 'For a second, Mr. Dursley didn’t realize what he had seen — then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn’t a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat.\n',
 'The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.\nMr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn’t, he might have found it harder to concentrate on drills that morning.\n',
 'He didn’t know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn’t see a single collecting tin.\n',
 'It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. ']
We can now get the list of chapter contents for Harry Potter,
which means we can start on the real text analysis.
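If the chapter titles are needed later (for labeling plots, say), re.split can keep them: when the pattern contains capturing groups, the captured text is returned alongside the pieces. The following is a small sketch under that idea. The helper name split_chapters and the sample text are hypothetical, introduced only for illustration.

```python
import re

def split_chapters(text):
    """Return a list of (number, title, body) tuples, one per chapter."""
    # Capturing groups make re.split include the number and title in its result.
    pattern = r'Chapter (\d+)\n([a-zA-Z ]+)\n'
    parts = re.split(pattern, text)
    # parts[0] is whatever precedes the first heading; after that the items
    # come in groups of three: number, title, body.
    chapters = []
    for i in range(1, len(parts), 3):
        number = int(parts[i])
        title = parts[i + 1]
        body = parts[i + 2].strip()
        chapters.append((number, title, body))
    return chapters

# Hypothetical two-chapter sample in the same layout as the novel files.
sample = (
    "Chapter 1\nThe Boy Who Lived\nMr. and Mrs. Dursley were proud.\n"
    "Chapter 2\nThe Vanishing Glass\nNearly ten years had passed.\n"
)

for number, title, body in split_chapters(sample):
    print(number, title, len(body))
```

Keeping the number next to the body also makes it easier to spot a heading that the pattern failed to match, because the numbers will skip.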
Data analysis
Comparing chapter counts
import os
import re
import matplotlib.pyplot as plt
colors = ['#78C850', '#A8A878','#F08030','#C03028','#6890F0', '#A890F0','#A040A0']
harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]
# x-axis: book names (strip the common prefix and the ".txt" suffix)
harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                      for n in harry_potters]
# y-axis: number of chapters
chapter_nums = []
for harry_potter in harry_potters:
    file = "data/"+harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapter_contents = [c for c in re.split(pattern, raw_text) if c]
    chapter_nums.append(len(chapter_contents))
# figure size
plt.figure(figsize=(20, 10))
# title, font size, bold
plt.title('Chapter Number of Harry Potter', fontsize=25, weight='bold')
# colored bar chart
plt.bar(harry_potter_names, chapter_nums, color=colors)
# tick label font size and rotation
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# axis labels
plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
plt.ylabel('Chapter Number', rotation=25, fontsize=20, weight='bold')
plt.show()
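The chart shows that the last four books of the series have more chapters than the first three (this analysis is not very useful in itself; it is mainly practice).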
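Vocabulary richness
Richness is measured here as the total number of words divided by the number of distinct words. If a 100-word sentence never repeats a word, the ratio is 100/100 = 1.
If a sentence of the same length uses only 20 distinct words, the ratio is 100/20 = 5. A lower value therefore means a more varied vocabulary.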
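To make the ratio concrete, here is a tiny sketch of the same calculation that the code in this section performs with len(words)/len(wordset). The function name richness and the synthetic word lists are assumptions made for the example. No tokenizer or stemmer is involved here, so the numbers only illustrate the arithmetic.

```python
def richness(words):
    """Total number of words divided by the number of distinct words."""
    return len(words) / len(set(words))

# 100 words, all different.
all_different = [f"w{i}" for i in range(100)]
# 100 words drawn from only 20 distinct ones.
repetitive = [f"w{i % 20}" for i in range(100)]

print(richness(all_different))  # 1.0
print(richness(repetitive))     # 5.0
```

Because longer texts inevitably reuse words, this ratio tends to grow with text length, so comparisons between books of very different sizes should be read with some caution.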
import os
import re
import matplotlib.pyplot as plt
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer
plt.style.use('fivethirtyeight')
colors = ['#78C850', '#A8A878','#F08030','#C03028','#6890F0', '#A890F0','#A040A0']
harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]
# x-axis: book names
harry_potter_names = [n.replace('Harry Potter and the ', '')[:-4]
                      for n in harry_potters]
# vocabulary richness of each book
richness_of_words = []
stemmer = SnowballStemmer("english")
for harry_potter in harry_potters:
    file = "data/"+harry_potter
    raw_text = open(file).read()
    words = word_tokenize(raw_text)
    # lowercase and stem so that inflected forms count as one word
    words = [stemmer.stem(w.lower()) for w in words]
    wordset = set(words)
    # total words / distinct words
    richness = len(words)/len(wordset)
    richness_of_words.append(richness)
# figure size
plt.figure(figsize=(20, 10))
# title, font size, bold
plt.title('The Richness of Word in Harry Potter', fontsize=25, weight='bold')
# colored bar chart
plt.bar(harry_potter_names, richness_of_words, color=colors)
# tick label font size and rotation
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# axis labels
plt.xlabel('Harry Potter Series', fontsize=20, weight='bold')
plt.ylabel('Richness of Words', rotation=25, fontsize=20, weight='bold')
plt.show()
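Sentiment analysis
To trace how sentiment develops across the Harry Potter series we use VADER, which is available as the ready-made library vaderSentiment. Its polarity_scores function returns four values:
neg: negative score
neu: neutral score
pos: positive score
compound: overall sentiment score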
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
test = 'i am so sorry'
analyzer.polarity_scores(test)
{'neg': 0.443, 'neu': 0.557, 'pos': 0.0, 'compound': -0.1513}
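The compound value is the one used in the rest of this article. As a rough guide to reading it, the sketch below turns a compound score into a coarse label. The helper label_from_compound is hypothetical, and the cutoffs of 0.05 and -0.05 are an assumption based on the convention commonly suggested for VADER, not something the original article specifies.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score (between -1 and 1) to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# The scores shown above for 'i am so sorry'.
scores = {'neg': 0.443, 'neu': 0.557, 'pos': 0.0, 'compound': -0.1513}
print(label_from_compound(scores['compound']))  # negative
```

For the chapter-level curves that follow, the raw compound values are averaged directly, so no labeling step is needed there.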
import os
import re
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]
# x-axis: running chapter index across all seven books
chapter_indexes = []
# y-axis: average sentiment score of each chapter
compounds = []
analyzer = SentimentIntensityAnalyzer()
chapter_index = 1
for harry_potter in harry_potters:
    file = "data/"+harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapters = [c for c in re.split(pattern, raw_text) if c]
    # compute the sentiment score of each chapter
    for chapter in chapters:
        compound = 0
        sentences = sent_tokenize(chapter)
        for sentence in sentences:
            score = analyzer.polarity_scores(sentence)
            compound += score['compound']
        # mean compound score over the chapter's sentences
        compounds.append(compound/len(sentences))
        chapter_indexes.append(chapter_index)
        chapter_index+=1
# figure size
plt.figure(figsize=(20, 10))
# title, font size, bold
plt.title('Average Sentiment of the Harry Potter', fontsize=25, weight='bold')
# line chart
plt.plot(chapter_indexes, compounds, color='#A040A0')
# tick label font size and rotation
plt.xticks(rotation=25, fontsize=16, weight='bold')
plt.yticks(fontsize=16, weight='bold')
# axis labels
plt.xlabel('Chapter', fontsize=20, weight='bold')
plt.ylabel('Average Sentiment', rotation=25, fontsize=20, weight='bold')
plt.show()
The curve is not smooth enough, so to iron out the fluctuations we define a custom moving-average function.
import numpy as np
import os
import re
import matplotlib.pyplot as plt
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# curve smoothing: moving average via convolution with a uniform window
def movingaverage(value_series, window_size):
    window = np.ones(int(window_size))/float(window_size)
    return np.convolve(value_series, window, 'same')
harry_potters = ["Harry Potter and the Sorcerer's Stone.txt",
                 "Harry Potter and the Chamber of Secrets.txt",
                 "Harry Potter and the Prisoner of Azkaban.txt",
                 "Harry Potter and the Goblet of Fire.txt",
                 "Harry Potter and the Order of the Phoenix.txt",
                 "Harry Potter and the Half-Blood Prince.txt",
                 "Harry Potter and the Deathly Hallows.txt"]
# x-axis: running chapter index across all seven books
chapter_indexes = []
# y-axis: average sentiment score of each chapter
compounds = []
analyzer = SentimentIntensityAnalyzer()
chapter_index = 1
for harry_potter in harry_potters:
    file = "data/"+harry_potter
    raw_text = open(file).read()
    pattern = r'Chapter \d+\n[a-zA-Z ]+\n'
    chapters = [c for c in re.split(pattern, raw_text) if c]
    # compute the sentiment score of each chapter
    for chapter in chapters:
        compound = 0
        sentences = sent_tokenize(chapter)
        for sentence in sentences:
            score = analyzer.polarity_scores(sentence)
            compound += score['compound']
        # mean compound score over the chapter's sentences
        compounds.append(compound/len(sentences))
        chapter_indexes.append(chapter_index)
        chapter_index+=1
# figure size
plt.figure(figsize=(20, 10))
# title, font size, bold
plt.title('Average Sentiment of the Harry Potter',
          fontsize=25,
          weight='bold')
# raw line chart
plt.plot(chapter_indexes, compounds,
         color='red')
# smoothed curve, plotted against the same chapter indexes so both lines align
plt.plot(chapter_indexes, movingaverage(compounds, 10),
         color='black',
         linestyle=':')
# tick label font size and rotation
plt.xticks(rotation=25,
           fontsize=16,
           weight='bold')
plt.yticks(fontsize=16,
           weight='bold')
# axis labels
plt.xlabel('Chapter',
           fontsize=20,
           weight='bold')
plt.ylabel('Average Sentiment',
           rotation=25,
           fontsize=20,
           weight='bold')
plt.show()
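One caveat about the smoothing: np.convolve in 'same' mode behaves as if the series were padded with zeros, so the first and last few smoothed values are pulled toward zero. If that matters, one alternative is to average only over the points that actually exist near the edges. The pure-Python sketch below shows the idea. The function name moving_average_shrinking is hypothetical, it is not the method used in the article, and it takes window_size // 2 points on each side of the current one, so an even window_size covers one point more than the convolution version.

```python
def moving_average_shrinking(values, window_size):
    """Centered moving average whose window shrinks at the edges of the series."""
    half = window_size // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)                 # clip the window at the start
        hi = min(len(values), i + half + 1)   # clip the window at the end
        segment = values[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

# A constant series stays constant, including at the edges.
flat = [0.5] * 8
print(moving_average_shrinking(flat, 4))

# A short rising series: edge values use two points, the middle uses three.
print(moving_average_shrinking([1, 2, 3], 3))  # [1.5, 2.0, 2.5]
```

The trade-off is that edge values are averaged over fewer chapters and are therefore noisier than the values in the middle of the series.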