1. Using the NLTK Toolkit
@Author: Runsen
Table of contents
- Natural language processing
- NLP applications
- NLTK
- Installing the corpora
- Understanding Tokenize
- Tagging text
- Loading built-in corpora
- Tokenization (English only)
- Stopwords
- Usage
- Filtering stopwords
- Part-of-speech tagging
- Chunking
- Named entity recognition
Natural language processing
Natural language processing (NLP) is an important direction within computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers using natural language. NLP is a science that blends linguistics, computer science, and mathematics.
NLP applications
- Search engines, such as Google and Yahoo. Through NLP, Google learns that you are a technology enthusiast, so it returns technology-related results.
- Social feeds, such as Facebook's News Feed. The feed algorithm uses NLP to learn your interests and show you relevant ads and posts instead of unrelated content.
- Voice assistants, such as Apple's Siri.
- Spam filters, such as Google's. Beyond ordinary spam filtering, modern filters analyze an email's content to decide whether it is spam.
NLTK
NLTK is the leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), together with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK is a well-known Python NLP library: it ships with corpora and a part-of-speech tagger, and supports classification, tokenization, and more. It has been called "a wonderful tool for teaching and working in computational linguistics using Python" and "an amazing library to play with natural language."
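As a taste of one capability listed above, here is a minimal stemming sketch using NLTK's PorterStemmer (it needs no extra data download, only the `nltk` package itself):

```python
# A minimal stemming sketch: reduce inflected forms to their stems.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['running', 'cats', 'classes', 'played']
print([stemmer.stem(w) for w in words])  # → ['run', 'cat', 'class', 'play']
```

Note that a stemmer only chops suffixes by rule; it does not produce dictionary lemmas.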
Installing the corpora

```shell
pip install nltk
```

Note that this only installs the library itself; it comes with no data.

```python
# In a new IPython session:
import nltk
nltk.download()  # opens the download manager
```

Downloading the book and popular collections is enough to get started.
(Feature overview table)
Once everything is installed, let's have some fun.
Understanding Tokenize
Tokenization splits a long sentence into meaningful pieces, using nltk.word_tokenize:

```python
>>> import nltk
>>> sentence = "hello,,world"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['hello', ',', ',world']
```

Tagging text
```python
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)  # tag parts of speech
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
```

Loading built-in corpora
Tokenization (note: English only)
```python
>>> from nltk.tokenize import word_tokenize
>>> from nltk.text import Text
>>> input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
>>> tokens = word_tokenize(input_str)
>>> tokens[:5]
['Today', "'s", 'weather', 'is', 'good']
>>> tokens = [word.lower() for word in tokens]  # lowercase
>>> tokens[:5]
['today', "'s", 'weather', 'is', 'good']
```

Check a given word's position and count:
```python
>>> t = Text(tokens)
>>> t.count('good')
1
>>> t.index('good')
4
```
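Text's count and index sit on top of a frequency distribution; a minimal sketch using nltk.FreqDist directly (essentially a Counter over tokens, no corpus download needed):

```python
# Count token frequencies directly with FreqDist.
from nltk import FreqDist

fdist = FreqDist(['today', 'weather', 'is', 'good', 'is', 'good'])
print(fdist['good'])         # → 2 (frequency of one token)
print(fdist.most_common(2))  # the two most frequent tokens with counts
```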
You can also plot a frequency chart:

```python
t.plot(8)
```

Stopwords
```python
from nltk.corpus import stopwords
stopwords.fileids()  # supported languages; no Chinese, as expected
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']

# Look at the English stopwords; the raw text contains many '\n', so replace them
stopwords.raw('english').replace('\n', ' ')
"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "
```

Usage
```python
test_words = [word.lower() for word in tokens]  # tokens from the sentence above
test_words_set = set(test_words)
test_words_set.intersection(set(stopwords.words('english')))
{'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'}
```

So "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow." contains these stopwords: 'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'.
Filtering stopwords
```python
filtered = [w for w in test_words_set if w not in stopwords.words('english')]
filtered
['today', 'good', 'windy', 'sunny', 'afternoon', 'play', 'basketball', 'tomorrow', 'weather', 'classes', ',', '.', "'s"]
```

Part-of-speech tagging
```python
from nltk import pos_tag
tags = pos_tag(tokens)
tags
[('Today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'),
 ('good', 'JJ'), (',', ','), ('very', 'RB'), ('windy', 'JJ'),
 ('and', 'CC'), ('sunny', 'JJ'), (',', ','), ('we', 'PRP'),
 ('have', 'VBP'), ('no', 'DT'), ('classes', 'NNS'), ('in', 'IN'),
 ('the', 'DT'), ('afternoon', 'NN'), (',', ','), ('We', 'PRP'),
 ('have', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('basketball', 'NN'),
 ('tomorrow', 'NN'), ('.', '.')]
```

Chunking
```python
from nltk.chunk import RegexpParser
sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('died', 'VBD')]
grammer = "MY_NP: {<DT>?<JJ>*<NN>}"
cp = RegexpParser(grammer)   # build the rule
result = cp.parse(sentence)  # chunk the tagged sentence
print(result)
(S (MY_NP the/DT little/JJ yellow/JJ dog/NN) died/VBD)
result.draw()  # draw the parse tree in a pop-up window
```
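To use the chunks programmatically rather than just printing them, you can walk the resulting tree. A minimal sketch, reusing the same grammar and label as above (no corpus download needed):

```python
# Extract the text of every MY_NP chunk from the parse tree.
from nltk import RegexpParser

tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'),
          ('dog', 'NN'), ('died', 'VBD')]
tree = RegexpParser("MY_NP: {<DT>?<JJ>*<NN>}").parse(tagged)
phrases = [' '.join(tok for tok, tag in subtree.leaves())
           for subtree in tree.subtrees(lambda t: t.label() == 'MY_NP')]
print(phrases)  # → ['the little yellow dog']
```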
Named entity recognition
Named entity recognition (NER) is a basic NLP task: identifying named mentions in text, as groundwork for tasks such as relation extraction. In the narrow sense, it means recognizing three classes of named entities: person names, place names, and organization names (entity types with obvious patterns, such as times and currency amounts, can be recognized with regular expressions). In specific domains, domain-specific entity types can of course be defined as well.
```python
from nltk import ne_chunk, pos_tag, word_tokenize
sentence = "Edison went to Tsinghua University today."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
(S
  (PERSON Edison/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Tsinghua/NNP University/NNP)
  today/NN
  ./.)
```
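ne_chunk returns a tree in which entities are labeled subtrees; a minimal sketch of collecting (entity, label) pairs, using a hand-built Tree shaped like the output above so that no model download is required:

```python
# Collect (entity text, label) pairs from an ne_chunk-style tree.
from nltk import Tree

def extract_entities(tree):
    """Return (entity text, label) for every labeled subtree below the root."""
    return [(' '.join(tok for tok, tag in st.leaves()), st.label())
            for st in tree.subtrees() if st.label() != 'S']

# Hand-built tree mirroring the ne_chunk output shown above:
tree = Tree('S', [
    Tree('PERSON', [('Edison', 'NNP')]),
    ('went', 'VBD'), ('to', 'TO'),
    Tree('ORGANIZATION', [('Tsinghua', 'NNP'), ('University', 'NNP')]),
    ('today', 'NN'), ('.', '.'),
])
print(extract_entities(tree))
# → [('Edison', 'PERSON'), ('Tsinghua University', 'ORGANIZATION')]
```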