2.3. The NLTK Toolkit: Installation, Tokenization, the Text Object, Stopwords, Filtering Out Stopwords, POS Tagging, Chunking, Named Entity Recognition, a Data Cleaning Example, and References
2.3. Installing the NLTK Toolkit
2.3.1. Tokenization
2.3.2. The Text Object
2.3.3. Stopwords
2.3.4. Filtering Out Stopwords
2.3.5. Part-of-Speech Tagging
2.3.6. Chunking
2.3.7. Named Entity Recognition
2.3.8. Data Cleaning Example
2.3.9. References
2.3. Installing the NLTK Toolkit

NLTK is a very practical text-processing toolkit. It is mainly used for English data and has a long history.
(base) C:\Users\toto>pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: nltk in d:\installed\anaconda3\lib\site-packages (3.5)
Requirement already satisfied: joblib in d:\installed\anaconda3\lib\site-packages (from nltk) (0.17.0)
Requirement already satisfied: tqdm in d:\installed\anaconda3\lib\site-packages (from nltk) (4.50.2)
Requirement already satisfied: regex in d:\installed\anaconda3\lib\site-packages (from nltk) (2020.10.15)
Requirement already satisfied: click in d:\installed\anaconda3\lib\site-packages (from nltk) (7.1.2)

(base) C:\Users\toto>

The most troublesome part of NLTK is that using it requires some fairly large data packages. If you are confident in your network speed, you can switch to the environment where NLTK is installed, start the Python interpreter with the python command, and enter:
import nltk
nltk.download()

Then simply download the packages in the GUI that appears.
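If you already know which packages you need, the downloader can also be driven without the GUI (the same network caveats apply). A minimal sketch, listing the data packages used later in this section:

import nltk

# Download only the data packages used in the examples below.
# nltk.download() returns True when the package is installed successfully.
for pkg in ['punkt', 'stopwords', 'averaged_perceptron_tagger',
            'maxent_ne_chunker', 'words']:
    nltk.download(pkg)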
However, downloading through nltk.download() is not only slow but also prone to all kinds of download problems, so you can instead fetch the data packages directly from the nltk_data repository on GitHub: https://github.com/nltk/nltk_data
After downloading, the files need to be placed in a directory that NLTK scans. Those paths can be found as follows:
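The directories NLTK searches can be printed from Python via nltk.data.path; a minimal sketch:

import nltk

# NLTK looks for data packages in each of these directories in order,
# e.g. the user's AppData\Roaming\nltk_data and <python install dir>\nltk_data.
for path in nltk.data.path:
    print(path)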
The solution is to put the contents of the packages folder downloaded from GitHub into D:\installed\Anaconda\nltk_data, which ends up looking like this:
Note, however, that inside this archive downloaded from GitHub some of the subfolders still contain zipped content. For example, tokenizing a sentence with NLTK uses this function: word_tokenize().
Calling it may nevertheless raise an error (it did in my case), and the error message shows that the punkt data was not found:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

For more information see: https://www.nltk.org/data.html

For an error like this, if you look in the search path (the place where we put the data packages above), you can in fact find punkt under the tokenizers folder. The cause is simply that it has not been unzipped. Extract punkt.zip into that folder and the tokenization code runs without problems. Some other data packages behave the same way, so if you see a "resource not found" error for one of them, try unzipping it.
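If you prefer, the unzipping can also be done from Python. A minimal sketch, assuming the nltk_data directory used above (adjust the paths to your own installation):

import os
import zipfile

# Hypothetical location; use one of the directories listed in nltk.data.path.
nltk_data_dir = r'D:\installed\Anaconda\nltk_data'
tokenizers_dir = os.path.join(nltk_data_dir, 'tokenizers')

# Extract punkt.zip in place so that tokenizers/punkt/ exists.
with zipfile.ZipFile(os.path.join(tokenizers_dir, 'punkt.zip')) as zf:
    zf.extractall(tokenizers_dir)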
After unzipping, the folder looks like this:

Finally, run the code again; the result is as follows:
2.3.1. Tokenization
import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]
print(tokens)
'''
Output:
['today', "'s", 'weather', 'is', 'good', ',', 'very', 'windy', 'and', 'sunny', ',', 'we', 'have', 'no', 'classes', 'in', 'the', 'afternoon', ',', 'we', 'have', 'to', 'play', 'basketball', 'tomorrow', '.']
'''

print(tokens[:5])
'''
Output:
['today', "'s", 'weather', 'is', 'good']
'''
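word_tokenize gives word-level tokens. If you also need to split text into sentences (the punkt data mentioned above is the model behind this), nltk.tokenize.sent_tokenize can be used; a small sketch:

from nltk.tokenize import sent_tokenize

text = "Today's weather is good. We have no classes in the afternoon. We will play basketball tomorrow."
# sent_tokenize splits the string into a list of sentences using the punkt model.
print(sent_tokenize(text))
# ["Today's weather is good.", 'We have no classes in the afternoon.', 'We will play basketball tomorrow.']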
2.3.2. The Text Object

import nltk
# from nltk.tokenize import word_tokenize
from nltk.text import Text

help(nltk.text)

Output:
D:\installed\Anaconda\python.exe E:/workspace/nlp/nltk/demo.py
Help on module nltk.text in nltk:

NAME
    nltk.text

DESCRIPTION
    This module brings together a variety of NLTK functionality for
    text analysis, and provides simple, interactive interfaces.
    Functionality includes: concordancing, collocation discovery,
    regular expression search over tokenized strings, and
    distributional similarity.

CLASSES
    builtins.object
        ConcordanceIndex
        ContextIndex
        Text
            TextCollection
        TokenSearcher

    class Text(builtins.object)
     |  Text(tokens, name=None)
     |
     |  A wrapper around a sequence of simple (string) tokens, which is
     |  intended to support initial exploration of texts (via the
     |  interactive console). Its methods perform a variety of analyses
     |  on the text's contexts (e.g., counting, concordancing, collocation
     |  discovery), and display the results. If you wish to write a
     |  program which makes use of these analyses, then you should bypass
     |  the ``Text`` class, and use the appropriate analysis function or
     |  class directly instead.
     |
     |  A ``Text`` is typically initialized from a given document or
     |  corpus. E.g.:
     |
     |  >>> import nltk.corpus
     |  >>> from nltk.text import Text
     |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
     |
     |  Methods defined here:
     |  ...

    (the rest of the per-class documentation for ConcordanceIndex,
    ContextIndex, Text, TextCollection and TokenSearcher is omitted here)

DATA
    __all__ = ['ContextIndex', 'ConcordanceIndex', 'TokenSearcher', 'Text'...

FILE
    d:\installed\anaconda\lib\site-packages\nltk\text.py

Create a Text object to make the subsequent operations easier:
import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]

t = Text(tokens)

print(t.count('good'))
'''
Output:
1
'''

print(t.index('good'))
'''
Output:
4
'''

t.plot(8)
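Continuing with the t object above: t.plot(8) plots the eight most frequent tokens, and the underlying frequency distribution can be accessed directly through vocab(), which returns an nltk.FreqDist. A small sketch:

# vocab() returns a FreqDist (a Counter-like frequency table) over the tokens.
fdist = t.vocab()
print(fdist.most_common(5))
# e.g. [(',', 3), ('we', 2), ('have', 2), ('today', 1), ("'s", 1)]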
2.3.3. Stopwords

You can take a look at the description in the stopwords corpus README:
import nltk
from nltk.corpus import stopwords

print(stopwords.readme().replace('\n', ' '))

Output:
Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  A Nepali list has been added https://github.com/nltk/nltk_data/pull/83  An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100  A Greek list has been added https://github.com/nltk/nltk_data/pull/103  An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# print(stopwords.readme().replace('\n', ' '))

print(stopwords.fileids())
'''
Output:
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
'''

print(stopwords.raw('english').replace('\n', ' '))
'''
Output:
i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't
'''

# Prepare the data
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]

test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)
print(test_words_set)
'''
Output:
{'no', 'good', 'windy', 'in', 'afternoon', 'very', '.', 'have', 'to', 'basketball', 'classes', 'and', 'the', 'we', 'weather', 'tomorrow', 'is', ',', 'today', "'s", 'play', 'sunny'}
'''

# The stopwords that occur in test_words_set
print(test_words_set.intersection(set(stopwords.words('english'))))
'''
Output:
{'no', 'to', 'and', 'is', 'very', 'the', 'we', 'have', 'in'}
'''
2.3.4. Filtering Out Stopwords

filtered = [w for w in test_words_set if (w not in stopwords.words('english'))]
print(filtered)
'''
Output:
['.', 'play', 'windy', 'tomorrow', 'today', 'weather', 'afternoon', 'classes', 'sunny', 'good', "'s", 'basketball', ',']
'''
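Note that the comprehension above filters test_words_set, so the original word order and any duplicates are lost. Continuing from the code above, you can instead filter the ordered token list; converting the stopword list to a set also makes the membership test faster. A small sketch:

stop_set = set(stopwords.words('english'))

# Filter the original (ordered) token list instead of the set of unique tokens.
filtered_tokens = [w for w in tokens if w not in stop_set]
print(filtered_tokens)
# ['today', "'s", 'weather', 'good', ',', 'windy', 'sunny', ',', 'classes', 'afternoon', ...]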
2.3.5. Part-of-Speech Tagging

nltk.download()  # download the tagger data here (pos_tag needs the averaged_perceptron_tagger package)
'''
Output:
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
'''

from nltk import pos_tag

tags = pos_tag(tokens)
print(tags)
'''
Output:
[('today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), (',', ','), ('very', 'RB'), ('windy', 'JJ'), ('and', 'CC'), ('sunny', 'JJ'), (',', ','), ('we', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('classes', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('afternoon', 'NN'), (',', ','), ('we', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('basketball', 'NN'), ('tomorrow', 'NN'), ('.', '.')]
'''
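The tags are Penn Treebank tags. If the tagsets data package has been downloaded (nltk.download('tagsets')), the meaning of any tag can be looked up with nltk.help.upenn_tagset; a small sketch:

import nltk

# Prints the definition and example words for the NN (singular noun) tag.
nltk.help.upenn_tagset('NN')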
2.3.6. Chunking

from nltk.chunk import RegexpParser

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('died', 'VBD')]
grammar = "MY_NP: {<DT>?<JJ>*<NN>}"
cp = RegexpParser(grammar)     # build the chunking rule
result = cp.parse(sentence)    # run the chunker
print(result)

result.draw()                  # draw the tree with matplotlib
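print(result) shows the chunks inline, and result is an nltk.Tree, so the matched noun-phrase chunks can also be extracted programmatically by walking the subtrees labelled MY_NP; a small sketch continuing from the code above:

# Iterate over the subtrees produced by our MY_NP rule.
for subtree in result.subtrees(filter=lambda t: t.label() == 'MY_NP'):
    # subtree.leaves() is the list of (word, tag) pairs inside the chunk.
    print(subtree.label(), [word for word, tag in subtree.leaves()])
# MY_NP ['the', 'little', 'yellow', 'dog']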
2.3.7. Named Entity Recognition

nltk.download()  # the maxent_ne_chunker and words packages are needed here
'''
Output:
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
'''

from nltk import ne_chunk

sentence = "Edison went to Tsinghua University today"
print(ne_chunk(pos_tag(word_tokenize(sentence))))
'''
Output:
(S
  (PERSON Edison/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Tsinghua/NNP University/NNP)
  today/NN)
'''
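The value returned by ne_chunk is again an nltk.Tree, in which named entities appear as labelled subtrees and ordinary words as plain (word, tag) tuples, so the entities can be collected into a list; a small sketch:

from nltk import Tree, ne_chunk, pos_tag, word_tokenize

sentence = "Edison went to Tsinghua University today"
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))

entities = []
for node in ne_tree:
    # Named-entity nodes are subtrees (PERSON, ORGANIZATION, ...); other tokens are tuples.
    if isinstance(node, Tree):
        entities.append((node.label(), ' '.join(word for word, tag in node.leaves())))

print(entities)
# [('PERSON', 'Edison'), ('ORGANIZATION', 'Tsinghua University')]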
2.3.8. Data Cleaning Example

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Input data
s = ' RT @Amila #Test\nTom\'s newly listed Co &amp; Mary\'s unlisted Group to supply tech for nlTK.\nh $TSLA $AAPL https:// t.co/x34afsfQsh'

# The stopword list
cache_english_stopwords = stopwords.words('english')


def text_clean(text):
    print('Original data:', text, '\n')

    # Remove HTML entities (e.g. &amp;), hashtags and @mentions
    text_no_special_entities = re.sub(r'\&\w*;|#\w*|@\w*', '', text)
    print('After removing special tags:', text_no_special_entities, '\n')

    # Remove ticker symbols such as $TSLA
    text_no_tickers = re.sub(r'\$\w*', '', text_no_special_entities)
    print('After removing ticker symbols:', text_no_tickers, '\n')

    # Remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', text_no_tickers)
    print('After removing hyperlinks:', text_no_hyperlinks, '\n')

    # Remove abbreviations, i.e. words with only one or two letters
    text_no_small_words = re.sub(r'\b\w{1,2}\b', '', text_no_hyperlinks)
    print('After removing short abbreviations:', text_no_small_words, '\n')

    # Collapse extra whitespace
    text_no_whitespace = re.sub(r'\s\s+', ' ', text_no_small_words)
    text_no_whitespace = text_no_whitespace.lstrip(' ')
    print('After removing extra whitespace:', text_no_whitespace, '\n')

    # Tokenize
    tokens = word_tokenize(text_no_whitespace)
    print('Tokenization result:', tokens, '\n')

    # Remove stopwords
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print('After removing stopwords:', list_no_stopwords, '\n')

    # Final filtered result
    text_filtered = ' '.join(list_no_stopwords)  # ''.join() would join without spaces between words.
    print('After filtering:', text_filtered)


text_clean(s)

Output:
D:\installed\Anaconda\python.exe E:/workspace/nlp/nltk/demo2.py
Original data:  RT @Amila #Test Tom's newly listed Co &amp; Mary's unlisted Group to supply tech for nlTK. h $TSLA $AAPL https:// t.co/x34afsfQsh

After removing special tags:  RT  Tom's newly listed Co  Mary's unlisted Group to supply tech for nlTK. h $TSLA $AAPL https:// t.co/x34afsfQsh

After removing ticker symbols:  RT  Tom's newly listed Co  Mary's unlisted Group to supply tech for nlTK. h   https:// t.co/x34afsfQsh

After removing hyperlinks:  RT  Tom's newly listed Co  Mary's unlisted Group to supply tech for nlTK. h

After removing short abbreviations:  Tom' newly listed   Mary' unlisted Group  supply tech for nlTK.

After removing extra whitespace: Tom' newly listed Mary' unlisted Group supply tech for nlTK.

Tokenization result: ['Tom', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'for', 'nlTK', '.']

After removing stopwords: ['Tom', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'nlTK', '.']

After filtering: Tom ' newly listed Mary ' unlisted Group supply tech nlTK .

Process finished with exit code 0
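As written, text_clean only prints the intermediate stages. For batch use (for example, cleaning a whole list of tweets) it is handier to return the cleaned string; the variant below is a hypothetical rewrite of the same pipeline, not part of the original code:

def text_clean_return(text):
    # Same steps as text_clean, but returning the result instead of printing it.
    text = re.sub(r'\&\w*;|#\w*|@\w*', '', text)     # HTML entities, hashtags, @mentions
    text = re.sub(r'\$\w*', '', text)                # ticker symbols
    text = re.sub(r'https?:\/\/.*\/\w*', '', text)   # hyperlinks
    text = re.sub(r'\b\w{1,2}\b', '', text)          # words of 1-2 characters
    text = re.sub(r'\s\s+', ' ', text).lstrip(' ')   # extra whitespace
    tokens = word_tokenize(text)
    return ' '.join(w for w in tokens if w not in cache_english_stopwords)

print([text_clean_return(tweet) for tweet in [s]])
# ["Tom ' newly listed Mary ' unlisted Group supply tech nlTK ."]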
2.3.9. References

https://pypi.org/project/nltk/#files
https://blog.csdn.net/sinat_34328764/article/details/94830948