Website Data Cleaning in Python for NLP
The most important step of any data-driven project is obtaining quality data. Without proper preprocessing, the results of a project can easily be biased or completely misinterpreted. Here, we will focus on cleaning data that is composed of scraped web pages.
Obtaining the data
There are many tools to scrape the web. If you are looking for something quick and simple, the URL handling module in Python called urllib might do the trick for you. Otherwise, I recommend scrapyd because it is customizable and robust.
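For readers who want a concrete starting point, the sketch below (not part of the original article) shows one minimal way to save pages locally with urllib; the URLs and output folder are hypothetical placeholders, and a framework such as scrapyd is preferable for anything beyond a handful of pages.

# Minimal sketch: download a few pages with urllib and save them as .html files.
# The URLs and the output directory are hypothetical placeholders.
import os
from urllib.request import urlopen

urls = ['https://example.com/article-1', 'https://example.com/article-2']
os.makedirs('./scraped_pages', exist_ok=True)

for i, url in enumerate(urls):
    with urlopen(url) as response:
        html = response.read().decode('utf-8', errors='ignore')
    with open(os.path.join('./scraped_pages', f'page_{i}.html'), 'w', encoding='utf-8') as fw:
        fw.write(html)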
It is important to ensure that the pages you are scraping contain rich text data that is suitable for your use case.
From HTML to text
Once we have obtained our scraped web pages, we begin by extracting the text out of each web page. Websites have lots of tags that don’t contain useful information when it comes to NLP, such as <script> and <button>. Thankfully, there is a Python module called boilerpy3 that makes text extraction easy.
We use the ArticleExtractor to extract the text. This extractor is tuned for news articles and works well for most HTML pages. You can try out the other extractors listed in the boilerpy3 documentation and see what works best for your dataset.
Next, we condense all repeated newline characters (\n and \r) into a single \n character. This is done so that when we later split the text into sentences on \n and periods, we don't end up with empty sentences.
import os
import re
from boilerpy3 import extractors

# Condenses all repeating newline characters into one single newline character
def condense_newline(text):
    return '\n'.join([p for p in re.split('\n|\r', text) if len(p) > 0])

# Returns the text from a HTML file
def parse_html(html_path):
    # Text extraction with boilerpy3
    html_extractor = extractors.ArticleExtractor()
    return condense_newline(html_extractor.get_content_from_file(html_path))

# Extracts the text from all html files in a specified directory
def html_to_text(folder):
    parsed_texts = []
    filepaths = os.listdir(folder)
    for filepath in filepaths:
        filepath_full = os.path.join(folder, filepath)
        if filepath_full.endswith(".html"):
            parsed_texts.append(parse_html(filepath_full))
    return parsed_texts

# Your directory to the folder with scraped websites
scraped_dir = './scraped_pages'
parsed_texts = html_to_text(scraped_dir)

If the extractors from boilerpy3 are not working for your web pages, you can use beautifulsoup to build your own custom text extractor. Below is an example replacement of the parse_html method.
from bs4 import BeautifulSoup

# Returns the text from a HTML file based on specified tags
def parse_html(html_path):
    with open(html_path, 'r') as fr:
        html_content = fr.read()
    soup = BeautifulSoup(html_content, 'html.parser')

    # Check that file is valid HTML
    if not soup.find():
        raise ValueError("File is not a valid HTML file")

    # Check the language of the file
    tag_meta_language = soup.head.find("meta", attrs={"http-equiv": "content-language"})
    if tag_meta_language:
        document_language = tag_meta_language["content"]
        if document_language and document_language not in ["en", "en-us", "en-US"]:
            raise ValueError("Language {} is not english".format(document_language))

    # Get text from the specified tags. Add more tags if necessary.
    TAGS = ['p']
    return ' '.join([condense_newline(tag.text) for tag in soup.findAll(TAGS)])

Large N-Gram Cleaning
Once the text has been extracted, we want to continue with the cleaning process. It is common for web pages to contain repeated information, especially if you scrape multiple articles from the same domain. Elements such as website titles, company slogans, and page footers can be present in your parsed text. To detect and remove these phrases, we analyze our corpus by looking at the frequency of large n-grams.
N-grams is a concept from NLP where the “gram” is a contiguous sequence of words from a body of text, and “N” is the size of these sequences. This is frequently used to build language models which can assist in tasks ranging from text summarization to word prediction. Below is an example for trigrams (3-grams):
input = 'It is quite sunny today.'
output = ['It is quite', 'is quite sunny', 'quite sunny today.']
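If you want to generate these yourself, a small sketch like the following (assuming nltk is installed; not part of the original article) reproduces the trigram example with nltk.util.ngrams, using a plain whitespace split for simplicity:

from nltk.util import ngrams

# Simple whitespace tokenization, just for this illustration
tokens = 'It is quite sunny today.'.split()
trigrams = [' '.join(gram) for gram in ngrams(tokens, 3)]
print(trigrams)  # ['It is quite', 'is quite sunny', 'quite sunny today.']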
When we read articles, there are many single words (unigrams) that are repeated, such as “the” and “a”. However, as we increase our n-gram size, the probability of the n-gram repeating decreases. Trigrams start to become more rare, and it is almost impossible for the articles to contain the same sequence of 20 words. By searching for large n-grams that occur frequently, we are able to detect the repeated elements across websites in our corpus, and manually filter them out.
We begin this process by breaking our dataset up into sentences, splitting the text chunks on newline characters and periods. Next, we tokenize our sentences (break each sentence into an array of single-word strings). With these tokenized sentences, we are able to generate n-grams of a specific size (we want to start large, around 15). We then sort the n-grams by frequency using the FreqDist function provided by nltk. Once we have our frequency dictionary, we print the top 10 n-grams. If the frequency of an n-gram is higher than 1 or 2, its sentence might be something you would consider removing from the corpus. To remove the sentence, copy the entire sentence and add it as a single string in the filter_strs array. Copying the entire sentence can be accomplished by increasing the n-gram size until the entire sentence is captured in one n-gram and printed on the console, or simply by printing parsed_texts and searching for the sentence. If there are multiple unwanted sentences with slightly different words, you can copy the common substring into filter_strs, and the regular expression will filter out all sentences containing that substring.
import nltk
nltk.download('punkt')
import matplotlib.pyplot as plt
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Helper method for generating n-grams
def extract_ngrams_sentences(sentences, num):
    all_grams = []
    for sentence in sentences:
        n_grams = ngrams(sentence, num)
        all_grams += [' '.join(grams) for grams in n_grams]
    return all_grams

# Splits text up by newline and period
def split_by_newline_and_period(pages):
    sentences = []
    for page in pages:
        sentences += re.split('\n|\. ', page)
    return sentences

# Break the dataset up into sentences, split by newline characters and periods
sentences = split_by_newline_and_period(parsed_texts)

# Add unwanted strings into this array
filter_strs = []

# Filter out unwanted strings
sentences = [x for x in sentences
             if not any([re.search(filter_str, x, re.IGNORECASE)
                         for filter_str in filter_strs])]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

# Adjust NGRAM_SIZE to capture unwanted phrases
NGRAM_SIZE = 15
ngrams_all = extract_ngrams_sentences(tokenized_sentences, NGRAM_SIZE)

# Sort the n-grams by most common
n_gram_all = nltk.FreqDist(ngrams_all).most_common()

# Print out the top 10 most common n-grams
print(f'{NGRAM_SIZE}-Gram Frequencies')
for gram, count in n_gram_all[:10]:
    print(f'{count}\t"{gram}"')

# Plot the distribution of n-grams
plt.plot([count for _, count in n_gram_all])
plt.xlabel('n-gram')
plt.ylabel('frequency')
plt.title(f'{NGRAM_SIZE}-Gram Frequencies')
plt.show()

If you run the code above on your dataset without adding any filters to filter_strs, you might get a graph similar to the one below. In my dataset, you can see that there are several 15-grams that are repeated 6, 3, and 2 times.
Once we go through the process of populating filter_strs with unwanted sentences, our plot of 15-grams flattens out.
Keep in mind there is no optimal threshold for n-gram size and frequency that determines whether or not a sentence should be removed, so play around with these two parameters. Sometimes you will need to lower the n-gram size to 3 or 4 to pick up a repeated title, but be careful not to remove valuable data. This block of code is designed to be an iterative process, where you slowly build the filter_strs array after many different experiments.
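As an illustration only (these are hypothetical strings, not the filters used for the dataset in this article), a populated filter_strs and the re-run filtering step from the previous code block might look like this:

# Hypothetical repeated boilerplate discovered in the 15-gram frequencies
filter_strs = [
    r'Subscribe to our newsletter for weekly updates',
    r'All rights reserved',
]

# Re-run the filtering step with the populated filters
sentences = [x for x in sentences
             if not any([re.search(filter_str, x, re.IGNORECASE)
                         for filter_str in filter_strs])]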
Punctuation, Capitalization, and Tokenization
After we clean the corpus, the next step is to process the words of our corpus. We want to remove punctuation, lowercase all words, and break each sentence up into arrays of individual words (tokenization). To do this, I like to use the simple_preprocess library method from gensim. This function accomplishes all three of these tasks in one go and has a few parameters that allow some customization. By setting deacc=True, punctuation will be removed. When punctuation is removed, the punctuation itself is treated as a space, and the two substrings on each side of the punctuation are treated as two separate words. In most cases, this splits a word into one normal substring and one substring with a length of one. For example, “don’t” ends up as “don” and “t”. As a result, the default min_len value is 2, so words with one letter are not kept.

If this is not suitable for your use case, you can also create a text processor from scratch. Python’s string class contains a punctuation attribute that lists all commonly used punctuation. Using this set of punctuation marks, you can use str.maketrans to remove all punctuation from a string while keeping each word that contained punctuation as one single word (“don’t” becomes “dont”). Keep in mind this does not catch punctuation as thoroughly as gensim’s simple_preprocess. For example, there are three types of dashes (‘—’ em dash, ‘–’ en dash, ‘-’ hyphen), and while simple_preprocess removes them all, string.punctuation does not contain the em dash, and therefore does not remove it.
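A quick comparison of the two behaviours described above, as a sketch (assuming gensim and nltk are installed; not part of the original article):

import string
from gensim.utils import simple_preprocess

text = "Don't stop"
# gensim splits on the apostrophe; 't' is then dropped by min_len=2
print(simple_preprocess(text, deacc=True, min_len=2))  # ['don', 'stop']
# str.maketrans strips the apostrophe but keeps the word in one piece
print(text.translate(str.maketrans('', '', string.punctuation)).lower().split())  # ['dont', 'stop']

The full preprocessing functions used in the pipeline follow.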
import gensim
import string

# Uses gensim to process the sentences
def sentence_to_words(sentences):
    for sentence in sentences:
        sentence_tokenized = gensim.utils.simple_preprocess(sentence,
                                                            deacc=True,
                                                            min_len=2,
                                                            max_len=15)
        # Make sure we don't yield empty arrays
        if len(sentence_tokenized) > 0:
            yield sentence_tokenized

# Process the sentences manually
def sentence_to_words_from_scratch(sentences):
    for sentence in sentences:
        sentence_tokenized = [token.lower() for token in word_tokenize(
            sentence.translate(str.maketrans('', '', string.punctuation)))]
        # Make sure we don't yield empty arrays
        if len(sentence_tokenized) > 0:
            yield sentence_tokenized

sentences = list(sentence_to_words(sentences))

Stop Words
Once we have our corpus nicely tokenized, we will remove all stop words from it. Stop words are words that don’t provide much additional meaning to a sentence; in English, these include words such as “the”, “a”, and “in”. nltk contains a list of English stop words, so we use that to filter our lists of tokens.
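To see what is being filtered out, you can inspect the list directly (a small sketch, assuming the nltk stopwords corpus has been downloaded; not part of the original article):

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
print(len(stop_words))   # a list of roughly 180 common English words
print(stop_words[:10])   # e.g. 'i', 'me', 'my', ...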
Lemmatization and Stemming
Lemmatization is the process of grouping together different forms of the same word and replacing these instances with the word’s lemma (dictionary form). For example, “functions” is reduced to “function”. Stemming is the process of reducing a word to its root (without any suffixes or prefixes). For example, “running” is reduced to “run”. These two steps decrease the vocabulary size, making it easier for the machine to understand our corpus.
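The two examples above can be verified with a quick sketch (assuming nltk’s wordnet data has been downloaded; not part of the original article):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, SnowballStemmer

print(WordNetLemmatizer().lemmatize('functions'))   # 'function'
print(SnowballStemmer('english').stem('running'))   # 'run'

The complete stop word removal, lemmatization, and stemming pipeline is below.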
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
nltk.download('stopwords')
nltk.download('wordnet')

# Remove all stopwords
stop_words = stopwords.words('english')
def remove_stopwords(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield([token for token in sentence if token not in stop_words])

# Lemmatize all words
wordnet_lemmatizer = WordNetLemmatizer()
def lemmatize_words(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield([wordnet_lemmatizer.lemmatize(token) for token in sentence])

snowball_stemmer = SnowballStemmer('english')
def stem_words(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield([snowball_stemmer.stem(token) for token in sentence])

sentences = list(remove_stopwords(sentences))
sentences = list(lemmatize_words(sentences))
sentences = list(stem_words(sentences))

Now that you know how to extract and preprocess your text data, you can begin the data analysis. Best of luck with your NLP adventures!
Notes
- If you are tagging the corpus with parts-of-speech tags, stop words should be kept in the dataset and lemmatization should not be done prior to tagging.
The GitHub repository for the Jupyter Notebook can be found here.
Translated from: https://towardsdatascience.com/website-data-cleaning-in-python-for-nlp-dda282a7a871