Counting Character Appearances in Romance of the Three Kingdoms (《三国演义》): A Worked Example
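Having just covered word-frequency counting for English text, we now turn to counting character appearances in a Chinese text.
We use Romance of the Three Kingdoms as the example.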
I. Approach
1. Using the jieba library
jieba is an excellent third-party library for Chinese text. With it we can segment Chinese text into individual words.
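As a minimal sketch of the segmentation step, the helper below wraps `jieba.lcut()`, which returns the words of a string as a list in jieba's default precise mode. The function name `segment` and its `cutter` parameter are hypothetical, introduced here only so the step can be tried with a stand-in tokenizer when jieba is not installed.

```python
def segment(text, cutter=None):
    """Return the words of `text` as a list.

    `cutter` is a hypothetical hook (not part of jieba): any callable that
    maps a string to a list of words. By default jieba.lcut is used.
    """
    if cutter is None:
        import jieba  # third-party package: pip install jieba
        cutter = jieba.lcut  # precise mode, returns a list of words
    return cutter(text)


# Stand-in tokenizer: split on whitespace, for text that is already segmented.
pre_segmented = "曹操 引兵 追赶 刘备"
words = segment(pre_segmented, cutter=str.split)
print(words)
```

With jieba installed, calling `segment(txt)` without a second argument segments raw Chinese text. The exact word boundaries depend on jieba's dictionary.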
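2. Filtering words
The goal is to count how many times each character appears in Romance of the Three Kingdoms, so the segmented words have to be filtered:
- Drop single-character words (they cannot be a character's name).
- Inspect the output, exclude the words that are not names, and repeat until the output matches what we expect.
- Some characters go by several names, which need to be merged into one.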
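The three filtering rules above can be sketched as a single function. This is one possible arrangement, not the article's own code: the alias pairs and excluded words are a small subset of those used in the implementation section, the names `ALIASES`, `EXCLUDES` and `count_names` are hypothetical, and excluded words are skipped during counting rather than deleted afterwards.

```python
# Hypothetical alias table: each alternative name maps to one canonical name.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
# Words found not to be names after inspecting earlier output (subset only).
EXCLUDES = {"将军", "却说", "荆州"}


def count_names(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count appearances per canonical name in an already segmented word list."""
    counts = {}
    for word in words:
        if len(word) == 1:          # rule 1: skip single-character words
            continue
        name = aliases.get(word, word)  # rule 3: merge alternative names
        if name in excludes:        # rule 2: skip words known not to be names
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts


# Made-up word list standing in for the output of the segmentation step.
sample = ["曹操", "曰", "丞相", "将军", "孔明曰", "诸葛亮", "孔明", "却说", "玄德"]
print(count_names(sample))
```

Keeping the aliases in a dictionary means a new name can be merged by adding one entry instead of another `elif` branch.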
3. Sorting by appearance count
Sort the entries by their dictionary values and print the 20 characters with the most appearances.
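A minimal sketch of the ranking step follows. The helper name `top_n` and the sample counts are made up for illustration. Slicing the sorted list keeps the code safe when the dictionary holds fewer than `n` entries.

```python
def top_n(counts, n=20):
    """Return up to n (name, count) pairs, highest count first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]  # a slice never raises, even if len(ranked) < n


# Made-up counts, only to show the call and the output format.
counts = {"孔明": 3, "曹操": 5, "刘备": 2}
for name, count in top_n(counts, n=2):
    print("{0:<10}{1:>5}".format(name, count))
```

The standard library offers an equivalent shortcut: `collections.Counter(counts).most_common(n)` returns the same kind of list.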
II. Implementation
1. CalThreeKingdomsV1
Code

    # CalThreeKingdomsV1.py
    import jieba

    # Read the whole novel and segment it into a list of words.
    txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
    words = jieba.lcut(txt)

    # Count every word longer than one character.
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        else:
            counts[word] = counts.get(word, 0) + 1

    # Sort by count, highest first, and print the top 15.
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

Notes:
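- Pass encoding='utf-8' when opening the Chinese text file (assuming the file is saved as UTF-8). Otherwise Python falls back to the platform's default encoding, and reading may fail or produce garbled text.
- jieba.lcut() segments the text in precise mode and returns a list with no redundant words.
- A dictionary tallies the appearance counts, and the list's sort() method orders them.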
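The encoding note can be checked with a small self-contained sketch. It writes a temporary UTF-8 file as a stand-in for threekingdoms.txt (the sample sentence is made up), reads it back with an explicit encoding inside a `with` block so the file is closed afterwards, and shows that decoding the same bytes with a wrong codec fails.

```python
import os
import tempfile

# Write a small UTF-8 file standing in for threekingdoms.txt.
sample_text = "却说曹操引兵追赶"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as f:
    f.write(sample_text)
    path = f.name

# Read it back with an explicit encoding; the with-block closes the file.
with open(path, "r", encoding="utf-8") as f:
    txt = f.read()

# Decoding the same bytes with a codec that cannot represent them fails.
with open(path, "rb") as f:
    raw = f.read()
try:
    raw.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False

os.remove(path)  # clean up the temporary file
print(txt == sample_text, decoded_as_ascii)
```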
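Output
As the output shows, the result is not what we wanted:
- “将军”, “却说”, “二人”, “不可”, “不能”, “如此” and “荆州” are not personal names (“荆州”, for instance, is a place).
- “曹操” and “丞相” (his title, "Prime Minister") refer to the same person, as do “孔明” and “孔明曰” ("Kongming said").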
2. CalThreeKingdomsV2
Remove the unwanted words from the dictionary and merge the different names of the same character.
Code

    # CalThreeKingdomsV2.py
    import jieba

    # Words that are not personal names. The string literals are written in
    # Simplified Chinese, assuming threekingdoms.txt is a Simplified text.
    excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
    txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Merge the different names of one character into a single key.
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    for word in excludes:
        counts.pop(word, None)  # unlike del, no KeyError if the word never occurred
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

3. CalThreeKingdomsV3
After repeated rounds of filtering, we finally obtain the 20 names with the most appearances:
Code

    # CalThreeKingdomsV3.py
    import jieba

    # Words that are not personal names, collected over repeated runs. The
    # literals are Simplified Chinese, assuming threekingdoms.txt is Simplified.
    excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "商议", "如何",
                "主公", "军士", "左右", "军马", "引兵", "次日", "大喜", "天下", "东吴",
                "于是", "今日", "不敢", "魏兵", "陛下", "一人", "都督", "人马", "不知",
                "汉中", "只见", "众将", "蜀兵", "上马", "大叫", "太守", "此人", "夫人",
                "后人", "背后", "城中", "一面", "何不", "大军", "忽报", "先生", "百姓",
                "何故", "然后", "先锋", "不如", "赶来", "原来", "令人", "江东", "下马",
                "喊声", "正是", "徐州", "忽然", "因此", "成都", "不见", "未知", "大败",
                "大事", "之后", "一军", "引军", "起兵", "军中", "接应", "进兵", "大惊", "可以"}
    txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Merge the different names of one character into a single key.
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰" or word == "先主":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        elif word == "后主":
            rword = "刘禅"
        elif word == "天子":
            rword = "刘协"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    for word in excludes:
        counts.pop(word, None)  # unlike del, no KeyError if the word never occurred
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(20):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

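Note: some of the excluded words are ambiguous, such as “先生” and “夫人”, which can also be forms of address for a person.
In the final result, the character with the most appearances is Cao Cao (曹操). Does that surprise you?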