Removing HTML Tags with Regular Expressions (Regex)
Most of the world's data is unstructured, because human communication happens in words rather than in tables or other structured formats. Every day we produce unstructured data from emails, SMS, tweets, feedback, social media posts, blogs, articles, documents, and more.
As we all know, text is the most unstructured form of all available data, and extracting meaning from text is hard. Computers cannot yet truly understand text, even in English, the way humans do, but they can already do a lot with it. In some areas, what you can do with NLP on a computer or machine already seems like magic.
NLP helps us organize massive chunks of text data and solve a wide range of problems, such as machine translation, text summarization, named entity recognition (NER), topic modeling and topic segmentation, semantic parsing, question answering (Q&A), relationship extraction, sentiment analysis, and speech recognition.
NLP algorithms are based on machine learning algorithms. Doing anything complicated in machine learning usually means building a pipeline. The idea is to break up your problem into very small pieces and then use machine learning to solve each smaller piece separately. Then by chaining together several machine learning models that feed into each other, you can do very complicated things.
You might be able to solve lots of problems, and save a lot of time, by applying NLP techniques to your own projects. Using NLP, we'll break the process of understanding (English) text into small pieces and see how each one works.
Sentence Hierarchy:
A sentence typically follows a hierarchical structure: a sentence is built from clauses, clauses from phrases, and phrases from words.
Standard NLP Workflow
CRISP-DM (Cross-Industry Standard Process for Data Mining) is an open standard process model that describes common approaches used by data mining experts, and it is the most widely used analytics model. Typically, any NLP-based problem can be solved by a methodical workflow with a sequence of steps. The major steps are depicted in the following figure.
We usually start with a corpus of text documents and follow standard processes of text wrangling and pre-processing, parsing, and basic exploratory data analysis. Based on the initial insights, we usually represent the text using relevant feature engineering techniques. Depending on the problem at hand, we either focus on building predictive supervised models or unsupervised models, which usually focus more on pattern mining and grouping. Finally, we evaluate the model and the overall success criteria with relevant stakeholders or customers and deploy the final model for future usage.
NLP Pipeline:
The steps mentioned above are used in a typical NLP pipeline, but you may skip or re-order steps depending on what you want to do and how your NLP library is implemented. For example, some libraries like spaCy perform sentence segmentation much later in the pipeline, using the results of the dependency parse.
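As a small, hedged illustration (assuming spaCy and its en_core_web_sm model are installed; the exact component names vary by model version), you can inspect which components a spaCy pipeline runs and in what order:

import spacy

# load a pretrained English pipeline (assumes en_core_web_sm has been downloaded)
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # component order, e.g. tagger, parser, ner (depends on the model version)

doc = nlp("This is a sentence. Here is another one.")
for sent in doc.sents:  # sentence boundaries are derived from the parse
    print(sent.text)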
NLP Pipeline: Step-by-step
Converting text to lowercase:
In the text normalization process, the very first step is to convert all text data to lowercase, which puts every word on a level playing field. With this step, we make sure the same word is treated identically regardless of its original casing.
Removing HTML Tags:
HTML tags are one of those components that do not add much value towards understanding and analyzing text. Text data collected through techniques like web scraping or screen scraping typically contains a lot of this noise. We can remove the unnecessary HTML tags and retain the useful textual information for further processing.
Remove HTML Tags using Regular Expressions (Regex)
In [01]:

# Import the regular expression library
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    # define the remove-tags function: strip anything that looks like an HTML tag
    return TAG_RE.sub('', text)

In [02]:

text = """<div> <h1>Title</h1> <p>A long text........ </p> <a href=""> a link </a> </div>"""

In [03]:

text = remove_tags(text)
text

Out[04]:

' Title A long text........  a link '

Removing accented characters
Usually, in any text corpus you may be dealing with accented characters/letters, especially if you only want to analyze the English language. Hence, we need to make sure these characters are converted and standardized into ASCII characters. A simple example: converting é to e.
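A common way to do this (a sketch, not the author's exact code) uses Python's standard unicodedata module to decompose accented characters and drop the combining marks:

import unicodedata

def remove_accents(text):
    # NFKD splits 'é' into 'e' plus a combining accent; encoding to ASCII drops the accent
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

print(remove_accents("Sómě Áccěntěd těxt"))  # Some Accented text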
Expanding Contractions
Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe. Let's have a look at some examples:
we’re = we are; we’ve = we have; I’d = I would
We can say that contractions are shortened versions of words or syllables. Or simply, a contraction is an abbreviation for a sequence of words.
In NLP, we can deal with contractions by converting each one to its expanded, original form, which helps with text standardization.
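One minimal way to do this is a regex lookup against a contraction map. The dictionary below is a small illustrative subset (an assumption, not a complete mapping):

import re

# illustrative subset; a real application would use a much larger map
CONTRACTION_MAP = {"we're": "we are", "we've": "we have", "i'd": "i would", "can't": "cannot"}

def expand_contractions(text, mapping=CONTRACTION_MAP):
    pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in mapping) + r')\b', flags=re.IGNORECASE)
    # replace each matched contraction with its expanded form
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

print(expand_contractions("We're sure we've seen it, but I'd check again."))
# we are sure we have seen it, but i would check again.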
Removing Special Characters
Special characters and symbols are usually non-alphanumeric characters, or occasionally numeric characters (depending on the problem), which add extra noise to unstructured text. Usually, simple regular expressions (regexes) can be used to remove them.
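For instance, a simple regex sketch (whether digits are kept depends on the problem at hand):

import re

def remove_special_characters(text, remove_digits=False):
    # keep letters and whitespace; optionally keep digits as well
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

print(remove_special_characters("Well this was fun! What do you think? 123#@!"))
# Well this was fun What do you think 123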
Stemming
To understand stemming, we first need to know what a word stem is. A word stem is the base form of a word, and we can create new words by attaching affixes to it in a process known as inflection.
Consider the word JUMP: you can add affixes to it and form new words like JUMPS, JUMPED, and JUMPING. In this case, the base word JUMP is the word stem.
The figure shows how the word stem is present in all of its inflections, since it forms the base on which each inflection is built using affixes. The reverse process of obtaining the base form of a word from its inflected form is known as stemming. Stemming helps us standardize words to their base or root stem, irrespective of their inflections, which helps many applications such as text classification, clustering, and even information retrieval.
Different stemmers available in NLTK include PorterStemmer, LancasterStemmer, and SnowballStemmer.
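A short sketch using NLTK's PorterStemmer (assuming the nltk package is installed):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ["jumps", "jumped", "jumping"]:
    # reduce each inflected form to its stem
    print(word, "->", ps.stem(word))
# jumps -> jump, jumped -> jump, jumping -> jump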
Lemmatization:
In NLP, lemmatization is the process of figuring out the root form, root word (most basic form), or lemma of each word in a sentence. Lemmatization is very similar to stemming, where we remove word affixes to get to the base form of a word. The difference is that the root word (the lemma) is always a lexicographically correct word present in the dictionary, whereas the root stem may not be. Lemmatization uses a knowledge base called WordNet, and thanks to that knowledge it can even handle irregular forms that stemmers cannot, for example converting "came" to "come".
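A minimal sketch with NLTK's WordNetLemmatizer (assuming the WordNet corpus has been fetched with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos='v' tells the lemmatizer to treat the word as a verb
print(lemmatizer.lemmatize("came", pos="v"))     # come
print(lemmatizer.lemmatize("running", pos="v"))  # run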
StopWords
Words that have little or no significance, especially when constructing meaningful features from text, are known as stopwords or stop words. These are usually the words that end up with the highest counts if you compute a simple term or word frequency over a corpus. Consider words like a, an, the, and be: they don't add any extra information to a sentence.
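A short sketch using NLTK's English stopword list (assuming nltk.download('stopwords') and nltk.download('punkt') have been run):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a sample sentence, showing off the stop words filtration.")
# keep only the tokens that are not in the stopword list
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # content words and punctuation remain, e.g. ['sample', 'sentence', ',', 'showing', ...]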
Bringing it all together: Building a Text Normalizer
Text normalization includes:
文本規范化包括:
- Converting text (all letters) into lower case
- Removing HTML tags
- Expanding contractions
- Converting numbers into words or removing numbers
- Removing special characters (punctuation, accent marks, and other diacritics)
- Removing white spaces
- Word tokenization
- Stemming and lemmatization
- Removing stop words, sparse terms, and particular words
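Bringing the pieces above together, here is a minimal normalizer sketch. It reuses the helpers sketched earlier (remove_tags, remove_accents, expand_contractions, remove_special_characters, all of which are illustrative names from this article's examples) and assumes the NLTK stopword and punkt data are available:

import nltk

stop_words = set(nltk.corpus.stopwords.words('english'))

def normalize(text):
    text = remove_tags(text)                 # strip HTML tags
    text = remove_accents(text)              # standardize accented characters
    text = text.lower()                      # convert to lowercase
    text = expand_contractions(text)         # expand contractions
    text = remove_special_characters(text)   # drop punctuation and other symbols
    tokens = nltk.word_tokenize(text)        # word tokenization
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    return ' '.join(tokens)

print(normalize("<p>We're testing the Normalizér, can't we?</p>"))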
https://www.nltk.org
Translated from: https://medium.com/@suneelpatel.in/nlp-pipeline-building-an-nlp-pipeline-step-by-step-7f0576e11d08