NLTK 2: Part-of-Speech Tagging

Published: 2023/12/18

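1. POS tagging English text with NLTK
- 1.1 A POS tagging example
- 1.2 Pre-tagged data in corpora
2 Taggers
- 2.1 Default tagger
- 2.2 Regular expression tagger
- 2.3 Lookup tagger
3 Training N-gram taggers
- 3.1 General N-gram tagging
- 3.2 Combining taggers
4. Going further
5. Training a Chinese tagger
6. Brown corpus methods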

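Natural language is a system of rules that people have formed through communication. Some rules are strict and some are loose: spoken language is used in informal settings, written language in formal ones. Processing natural language has to respect these rules as well, or the conclusions drawn will be unintelligible. A few related terms are distinguished briefly below.
Grammar: the rules for writing text, used to describe a language and its structure; it covers both syntactic and lexical rules.
Syntax: the rules governing the structure of sentences and the composition of, and relations between, their constituents.
Lexical rules: the rules governing word formation and inflection.

Part-of-speech (POS) tagging is a method of analyzing the constituents of a sentence by identifying the part of speech of each word.

The table below briefly lists what the tags in a POS tagset mean; for details, see nltk.help.brown_tagset().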

Tag	Part of speech	Examples

ADJ	adjective	new, good, high, special, big, local
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
EX	existential "there"	there, there's
MOD	modal verb	will, can, would, may, must, should
NN	noun	year, home, costs, time
NNP	proper noun	April, China, Washington
NUM	numeral	fourth, 2016, 09:30
PRON	pronoun	he, they, us
P	preposition	on, over, with, of
TO	the word "to"	to
UH	interjection	ah, ha, oops
VB	verb	
VBD	verb, past tense	made, said, went
VBG	present participle	going, lying, playing
VBN	past participle	taken, given, gone
WH	wh-determiner	who, where, when, what

1. POS tagging English text with NLTK

1.1 A POS tagging example

import nltk

sent = "I am going to Beijing tomorrow."

# nltk.sent_tokenize(text) splits a text into sentences.
# nltk.word_tokenize(sentence) splits a sentence into tokens.
# NLTK tokenizes at the sentence level, so a whole document should first be
# split into sentences, and each sentence then tokenized.

# Tokenize the sentence into words
words = nltk.word_tokenize(sent)
print(words)
# ['I', 'am', 'going', 'to', 'Beijing', 'tomorrow', '.']

# POS tagging
tagged_sent = nltk.pos_tag(words)
print(tagged_sent)
# [('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('Beijing', 'NNP'), ('tomorrow', 'NN'), ('.', '.')]

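1.2 Pre-tagged data in corpora

Corpus reader classes provide the following methods for returning pre-tagged data.

Method	Description

tagged_words(fileids, categories)	Returns the tagged data as a list of (word, tag) pairs
tagged_sents(fileids, categories)	Returns the tagged data as a list of sentences
tagged_paras(fileids, categories)	Returns the tagged data as a list of paragraphs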
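To make the three return shapes concrete, here is a minimal sketch on hand-made toy data; it does not call NLTK, and the variable names simply mirror the method names above. It assumes the nesting paragraphs > sentences > (word, tag) pairs, so each view is the previous one flattened by one level.

```python
from itertools import chain

# Toy stand-in for a tagged corpus (invented data, not taken from Brown):
# a list of paragraphs, each a list of sentences, each a list of (word, tag).
tagged_paras = [
    [
        [("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")],
        [("It", "PPS"), ("urged", "VBD"), ("action", "NN"), (".", ".")],
    ],
    [
        [("The", "AT"), ("election", "NN"), ("was", "BEDZ"),
         ("conducted", "VBN"), (".", ".")],
    ],
]

# Flatten one level for the sentence view, one more for the word view.
tagged_sents = list(chain.from_iterable(tagged_paras))
tagged_words = list(chain.from_iterable(tagged_sents))

print(len(tagged_paras), len(tagged_sents), len(tagged_words))
```

The sentence view is the one the taggers in the following sections train and evaluate on.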

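2 Taggers

2.1 Default tagger

The simplest POS tagger tags every word as a noun (NN). Such a tagger is of little practical use because its accuracy is very low, about 13% on the Brown news category as the evaluation below shows. The following demonstrates NLTK's default tagger.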

import nltk
from nltk.corpus import brown

# Load the data
brown_tagged_sents = brown.tagged_sents(categories='news')
# [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...
brown_sents = brown.sents(categories='news')

# The simplest tagger assigns the same tag to every token. This may seem
# rather banal, but it establishes an important baseline for tagger performance.
# To get the best result, we tag every word with the most likely tag.
# The following finds out which tag that is.
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
# ['AT','NP-TL','NN-TL','JJ-TL','NN-TL','VBD','NR','AT','NN','IN','NP$','JJ','NN','NN','VBD','``','AT','NN',"''",'CS','DTI','NNS','VBD','NN', ...,
#  'IN','NN','.','NP','NPS','BER','VBG','JJ','NN','TO','VB','AT','NN','IN','AT','CD','NN$',...]
tag = nltk.FreqDist(tags).max()
# 'NN'

# Now we can create a tagger that tags every word as NN.
default_tagger = nltk.DefaultTagger('NN')
sent = "I am going to Beijing tomorrow."
default_tagger.tag(nltk.word_tokenize(sent))
# [('I', 'NN'), ('am', 'NN'), ('going', 'NN'), ('to', 'NN'), ('Beijing', 'NN'), ('tomorrow', 'NN'), ('.', 'NN')]

# Note: recent NLTK releases deprecate evaluate() in favor of accuracy().
default_tagger.evaluate(brown_tagged_sents)
# 0.13089484257215028
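The baseline above amounts to two counts: which tag is most frequent, and what fraction of tokens carry it. The sketch below reproduces that arithmetic in plain Python on a small invented sample; `most_frequent_tag` and `default_accuracy` are hypothetical helper names, not NLTK functions.

```python
from collections import Counter

# Invented toy sample; the tags follow the Brown style used above.
tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("praised", "VBD"), ("the", "AT"),
     ("election", "NN"), ("process", "NN"), (".", ".")],
    [("City", "NN"), ("hall", "NN"), ("needs", "VBZ"), ("a", "AT"),
     ("plan", "NN"), (".", ".")],
]

def most_frequent_tag(sents):
    """Return the tag that occurs most often across all tokens."""
    counts = Counter(tag for sent in sents for (_, tag) in sent)
    return counts.most_common(1)[0][0]

def default_accuracy(sents, default_tag):
    """Accuracy of tagging every token with default_tag."""
    total = sum(len(sent) for sent in sents)
    correct = sum(1 for sent in sents for (_, tag) in sent if tag == default_tag)
    return correct / total

best = most_frequent_tag(tagged_sents)
print(best, default_accuracy(tagged_sents, best))
```

The accuracy of a default tagger is therefore just the relative frequency of the chosen tag in the evaluation data, which is why it makes a natural floor for the taggers that follow.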

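2.2 Regular expression tagger

A regular expression tagger assigns tags to tokens based on matching patterns. For example, we might guess that any word ending in "ed" is a past-tense verb and that any word ending in "'s" is a possessive noun. Such guesses can be written as a list of regular expressions: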

patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers (the unescaped '.' matches any character, e.g. the comma in 1,000)
    (r'.*', 'NN'),                    # nouns (default)
]

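The patterns are processed in order, and the first one that matches is used. Now we build a tagger from them and use it to tag a sentence.

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
regexp_tagger.evaluate(brown_tagged_sents)
# 0.20326391789486245  (about one fifth of the tokens are tagged correctly)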
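The "first match wins" behaviour is easy to see in a few lines of plain Python. This is only a sketch of the idea using the standard `re` module, not NLTK's implementation; `regexp_tag` is a hypothetical helper, and the pattern list is a shortened copy of the one above.

```python
import re

patterns = [
    (r".*ing$", "VBG"),   # gerunds
    (r".*ed$", "VBD"),    # simple past
    (r".*'s$", "NN$"),    # possessive nouns
    (r".*s$", "NNS"),     # plural nouns
    (r".*", "NN"),        # default: noun
]

def regexp_tag(tokens, patterns):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(rx), tag) for rx, tag in patterns]
    tagged = []
    for tok in tokens:
        for rx, tag in compiled:
            if rx.match(tok):
                tagged.append((tok, tag))
                break
    return tagged

tokens = ["jury's", "findings", "praised", "voting", "today"]
print(regexp_tag(tokens, patterns))

# Moving the plural rule ahead of the possessive rule changes the result for
# "jury's", because the more general pattern now matches first.
swapped = [patterns[0], patterns[1], patterns[3], patterns[2], patterns[4]]
print(regexp_tag(["jury's"], swapped))
```

This is why the catch-all pattern has to come last and why more specific suffixes should precede more general ones.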

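2.3 Lookup tagger

Many high-frequency words do not carry the NN tag. We find the 100 most frequent words and store the most likely tag for each; this mapping can then serve as the model for a "lookup tagger" (NLTK's UnigramTagger), as in the following example: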

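# Get the words first
fd = nltk.FreqDist(brown.words(categories='news'))
# brown.words(categories='news'): ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# A ConditionalFreqDist is a collection of frequency distributions for a single
# experiment run under different conditions. It records how often each sample
# occurred under a given condition; for example, it can record the frequency of
# each word (type) of a given length in a document. Formally, it can be defined
# as a function that maps each condition to the FreqDist for that condition.
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# print(cfd.items())
# dict_items([('The', FreqDist({'AT': 775, 'AT-TL': 28, 'AT-HL': 3})), ('Fulton', FreqDist({'NP-TL': 10, 'NP': 4})), ...

# "Top 100" words
# NOTE: fd.keys() is ordered by first appearance, so the slice below is the
# first 100 distinct words, not the 100 most frequent. For the latter use:
#   most_freq_words = [w for (w, _) in fd.most_common(100)]
most_freq_words = fd.keys()
# In Python 3 this is a dict_keys view and must be converted with list()
most_freq_words = list(most_freq_words)[:100]
# ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', ...

# For each of these words, take the most frequent tag in its distribution
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
# {'The': 'AT', 'Fulton': 'NP-TL', 'County': 'NN-TL', 'Grand': 'JJ-TL', 'Jury': 'NN-TL', 'said': 'VBD', 'Friday': 'NR', ...

# UnigramTagger finds the most likely tag for each word in the training corpus
# and then uses that information to assign tags to new tokens.
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown_tagged_sents)
# 0.3329355371243312

# brown.tagged_words(categories='news') is a flat list [('The', 'AT'), ('Fulton', 'NP-TL'), ...],
# so it has to be wrapped in a list to form the nested structure evaluate() expects.
baseline_tagger.evaluate([brown.tagged_words(categories='news')])
# 0.3329355371243312

# Individual sentences can score very high
baseline_tagger.evaluate([brown.tagged_sents(categories='news')[3]])
# 0.972972972972973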

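This result differs from the book, which reports about 0.45. The most likely cause is that list(fd.keys())[:100] returns the first 100 distinct words in order of first appearance rather than the 100 most frequent words; [w for (w, _) in fd.most_common(100)] selects the latter. The book's point is that knowing the tags of just the 100 most frequent words is enough to tag a large share of tokens correctly.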

Let us see how it does on untagged input text:

sent = brown.sents(categories='news')[10]
baseline_tagger.tag(sent)
# [('It', 'PPS'), ('urged', None), ('that', 'CS'), ('the', 'AT'), ('city', 'NN'), ('``', '``'), ('take', None),
#  ('steps', None), ('to', 'TO'), ('remedy', None), ("''", "''"), ('this', 'DT'), ('problem', None), ('.', '.')]

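Many words are assigned None because they are not among the 100 words in the lookup table. In these cases we would like to assign the default tag NN. In other words, we use the lookup table first and, if it cannot assign a tag, fall back to the default tagger; this process is called backoff.

# Set a default tagger to use when the lookup finds no match
baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))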
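Backoff for a lookup tagger boils down to a dictionary lookup with a fallback value. The following is a minimal plain-Python sketch of that idea, not NLTK's code; `lookup_tag` and the tiny `model` dictionary are invented for illustration.

```python
def lookup_tag(tokens, model, default_tag=None):
    """Tag tokens from a word->tag table; unknown words get default_tag."""
    return [(tok, model.get(tok, default_tag)) for tok in tokens]

# Hypothetical four-entry model; a real one would hold the most likely tag
# for each of the most frequent words.
model = {"the": "AT", "to": "TO", "that": "CS", ".": "."}
sent = ["It", "urged", "that", "the", "city", "take", "steps", "."]

without_backoff = lookup_tag(sent, model)
with_backoff = lookup_tag(sent, model, default_tag="NN")

print(without_backoff)
print(with_backoff)
```

Without a default, every out-of-table word comes back as None; with one, those same positions receive NN, while the words found in the table are unaffected.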

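Finally, with the lookup tagger and the default tagger combined, let us see how it performs with models of different sizes:

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    # most_common() yields (word, count) pairs sorted by descending frequency;
    # list(FreqDist) alone would give order of first appearance instead.
    fd = nltk.FreqDist(brown.words(categories='news'))
    words_by_freq = [w for (w, _) in fd.most_common()]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(16)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

display()

Performance rises quickly at first as the model grows and then levels off; beyond that point, further increases in model size yield only small gains.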

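3 Training N-gram taggers

3.1 General N-gram tagging

The previous section already used a 1-gram tagger, the Unigram tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively called N-gram taggers here. Note that a longer context does not by itself improve accuracy.
Besides supplying an N-gram tagger with a lookup model, another way to build one is to train it. The constructor of the N-gram taggers is __init__(train=None, model=None, backoff=None), so tagged corpus data can be passed in as training data to build a tagger.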

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]

tagger = nltk.UnigramTagger(train=x_train)
print(tagger.evaluate(x_test))
# 0.8121200039868434

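With a Unigram tagger trained on 90% of the data, accuracy on the remaining 10% is 81%. Switching to a Bigram tagger drops accuracy to about 10%. A bigram tagger can only tag a word it has seen in training together with the same preceding tag; as soon as it meets an unseen combination it assigns None, and every following word in the sentence then has a None context that was never seen in training either.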
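The collapse of a bigram tagger on unseen data can be reproduced with a toy model. The sketch below is a simplified plain-Python stand-in, not NLTK's BigramTagger: it keys a table on (previous tag, word), uses the made-up marker "<s>" for the start of a sentence, and has no backoff.

```python
from collections import Counter, defaultdict

START = "<s>"  # assumed sentence-start marker for this sketch

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def bigram_tag(tokens, table):
    """Tag left to right; an unseen context yields None."""
    tagged, prev = [], START
    for tok in tokens:
        tag = table.get((prev, tok))
        tagged.append((tok, tag))
        prev = tag  # a None here poisons the context of the next word
    return tagged

train = [
    [("the", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")],
    [("the", "AT"), ("city", "NN"), ("said", "VBD"), (".", ".")],
]
table = train_bigram(train)

seen = bigram_tag(["the", "jury", "said", "."], table)
unseen = bigram_tag(["the", "mayor", "said", "."], table)
print(seen)
print(unseen)
```

One unknown word ("mayor") is enough to leave the rest of the sentence untagged, even though "said" and "." were both in the training data. This is the sparse-data effect behind the low bigram score, and the motivation for combining taggers.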

3.2 Combining taggers

The backoff parameter lets us chain several taggers together to improve tagging accuracy.

import nltk
from nltk.corpus import brown

pattern = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),   # anything unmatched is still tagged NN
]

brown_tagged_sents = brown.tagged_sents(categories='news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]

# Bigram backs off to unigram, which backs off to the regexp tagger
t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train, backoff=t0)
t2 = nltk.BigramTagger(x_train, backoff=t1)
print(t2.evaluate(x_test))
# 0.8627529153792485

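As the above shows, POS tagging can be done well enough with statistics alone, without any linguistic knowledge.
For Chinese, as long as a tagged corpus is available, an N-gram tagger can be trained by the same procedure.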
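A backoff chain can be pictured as a list of taggers tried in order for each token, where the first one that returns a tag wins. The sketch below illustrates this in plain Python under simplifying assumptions: each tagger is a function of (previous tag, word), the tables are tiny hand-written dictionaries, and "<s>" is a made-up sentence-start marker. It is not how NLTK implements backoff internally.

```python
START = "<s>"  # assumed sentence-start marker for this sketch

def make_bigram(table):
    return lambda prev_tag, tok: table.get((prev_tag, tok))

def make_unigram(table):
    return lambda prev_tag, tok: table.get(tok)

def make_default(tag):
    return lambda prev_tag, tok: tag

def tag_with_backoff(tokens, taggers):
    """For each token, use the first tagger in the chain that returns a tag."""
    tagged, prev = [], START
    for tok in tokens:
        tag = None
        for tagger in taggers:
            tag = tagger(prev, tok)
            if tag is not None:
                break
        tagged.append((tok, tag))
        prev = tag
    return tagged

# Hand-written toy tables standing in for trained models.
bigram_table = {(START, "the"): "AT", ("AT", "jury"): "NN",
                ("NN", "said"): "VBD", ("VBD", "."): "."}
unigram_table = {"the": "AT", "jury": "NN", "said": "VBD", ".": "."}

chain = [make_bigram(bigram_table), make_unigram(unigram_table), make_default("NN")]
result = tag_with_backoff(["the", "mayor", "said", "."], chain)
print(result)
```

Here the unknown word "mayor" falls through to the default tag, which gives the bigram table a usable context again for "said". The most specific model is used wherever it has evidence, and the coarser ones only fill the gaps.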

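4. Going further

nltk.tag.BrillTagger implements transformation-based tagging: starting from the output of a baseline tagger, it applies rule-based corrections to reach higher accuracy.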

import nltk
import nltk.tag.brill
import nltk.tag.brill_trainer
from nltk.corpus import brown

pattern = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),   # anything unmatched is still tagged NN
]

# Split the data set
brown_tagged_sents = brown.tagged_sents(categories=['news'])
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[:train_num]
x_test = brown_tagged_sents[train_num:]

# Baseline tagger whose output the Brill rules will correct
baseline_tagger = nltk.UnigramTagger(x_train, backoff=nltk.RegexpTagger(pattern))
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, nltk.tag.brill.brill24())
brill_tagger = tt.train(x_train, max_rules=20, min_acc=0.99)

# Evaluate
print(brill_tagger.evaluate(x_test))
# 0.8683344961626632

# Compare the gold tags with the tagger's output for one test sentence;
# in the output below the two differ only at 'so' (CS vs. QL).
brown_sents = brown.sents(categories="news")
print(brown_tagged_sents[2007])
print(brill_tagger.tag(brown_sents[2007]))
# [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
# [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
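To show what "rule-based correction" means here, the sketch below applies one hand-written transformation rule of the form "change tag A to B when the previous tag is C" to a baseline tagging. It is a simplified illustration in plain Python: the `Rule` type, the example rule, and the sentence are invented, and a real Brill tagger learns its rules from training data and supports many more templates.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    from_tag: str   # tag to be replaced
    to_tag: str     # replacement tag
    prev_tag: str   # the rule fires only if the previous token has this tag

def apply_rules(tagged, rules):
    """Apply each rule in order to a list of (word, tag) pairs."""
    tags = [tag for _, tag in tagged]
    for rule in rules:
        # Find all matching positions first, then rewrite them.
        hits = [i for i in range(1, len(tags))
                if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag]
        for i in hits:
            tags[i] = rule.to_tag
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]

# Hypothetical baseline output: "increase" was tagged as a noun after "to".
baseline = [("They", "PPSS"), ("want", "VB"), ("to", "TO"),
            ("increase", "NN"), ("taxes", "NNS")]
rules = [Rule(from_tag="NN", to_tag="VB", prev_tag="TO")]

corrected = apply_rules(baseline, rules)
print(corrected)
```

Each rule is cheap to apply and easy to read, which is a large part of the appeal of transformation-based tagging compared with a large N-gram table.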

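5. Training a Chinese tagger

Below we train a Chinese POS tagger based on Unigram, using the tagged People's Daily corpus for January 1998, which can be downloaded online.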

import nltk

# Each line of the corpus is treated as one sentence of word/tag tokens
with open('./詞性標注人民日報199801.txt', encoding='utf-8') as f:
    lines = f.readlines()

all_tagged_sents = []
for line in lines:
    sent = line.split()
    tagged_sent = []
    for item in sent:
        pair = nltk.str2tuple(item)   # 'word/tag' -> ('word', 'TAG')
        tagged_sent.append(pair)
    if len(tagged_sent) > 0:
        all_tagged_sents.append(tagged_sent)

train_size = int(len(all_tagged_sents) * 0.8)
x_train = all_tagged_sents[:train_size]
x_test = all_tagged_sents[train_size:]

# nltk.str2tuple upper-cases the tag, so the gold tags are 'N', 'V', ...
# The default tag must therefore be 'N'; a lowercase 'n' can never match.
tagger = nltk.UnigramTagger(train=x_train, backoff=nltk.DefaultTagger('N'))
print(tagger.evaluate(x_test))
# The original run used DefaultTagger('n') and printed 0.8714095491725319;
# with 'N' the score can only be equal or higher.

# Example of one input line and the tagged sentence built from it:
# line:
#   19980101-01-001-001/m 邁向/v 充滿/v 希望/n 的/u 新/a 世紀/n ——/w 一九九八年/t 新年/t 講話/n （/w 附/v 圖片/n １/m 張/q ）/w
# tagged_sent:
#   [('19980101-01-001-001', 'M'), ('邁向', 'V'), ('充滿', 'V'), ('希望', 'N'), ('的', 'U'), ('新', 'A'), ('世紀', 'N'), ('——', 'W'), ('一九九八年', 'T'), ('新年', 'T'), ('講話', 'N'), ('（', 'W'), ('附', 'V'), ('圖片', 'N'), ('１', 'M'), ('張', 'Q'), ('）', 'W')]
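The parsing step is the only corpus-specific part of this recipe, so it is worth seeing in isolation. The sketch below mimics what nltk.str2tuple does to each token (split on the last "/", upper-case the tag) in plain Python; `str2tuple_sketch` and `parse_line` are hypothetical names, and the sample line is a shortened version of the one shown above.

```python
def str2tuple_sketch(item, sep="/"):
    """Split 'word/tag' on the last separator and upper-case the tag."""
    word, found, tag = item.rpartition(sep)
    if not found:
        return (item, None)   # no separator: untagged token
    return (word, tag.upper())

def parse_line(line):
    """Turn one whitespace-separated line of word/tag tokens into pairs."""
    return [str2tuple_sketch(item) for item in line.split()]

line = "邁向/v 充滿/v 希望/n 的/u 新/a 世紀/n"
tagged_sent = parse_line(line)
print(tagged_sent)
```

Because the tags come out upper-cased, anything compared against them later, such as a default tag, has to use the same case.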

6. Brown corpus methods

# List of file ids in the corpus
brown.fileids()
# ['ca01', 'ca02', 'ca03', 'ca04', ..., 'cp17', 'cp18', 'cp19', 'cp20', 'cp21', 'cp22', 'cp23', 'cp24', 'cp25', 'cp26', 'cp27', 'cp28', 'cp29', 'cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09']

# List of file ids for a given category ('news')
brown.fileids('news')
# ['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', ..., 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44']

# Raw text of a given category
brown.raw(categories=['news'])
# "\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, ..."

# Raw text of the given files
brown.raw(fileids=['ca01', 'ca02'])
# "\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd ... Research/nn projects/nns as/ql soon/rb as/cs possible/jj on/in the/at causes/nns and/cc prevention/nn of/in dependency/nn and/cc illegitimacy/nn ./.\n\n"

# List of sentences for the given files
brown.sents(fileids=['ca01', 'ca02'])
# [['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', ...], ...]

# List of sentences for a given category
brown.sents(categories=['news'])
# [['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', ...], ...]

# List of words for a given file
brown.words('ca01')
# ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# List of words for a given category
brown.words(categories=['news'])
# ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# Tagged sentences as a nested list: one list of (word, tag) pairs per sentence
brown.tagged_sents(categories=['news'])
# [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ...], ...]
