No Fear of Machine Learning: Classify Your Textual Data in Less Than 10 Lines of Code
This article builds upon my previous two articles where I share some tips on how to get started with data analysis in Python (or R) and explain some basic concepts on text analysis in Python. In this article, I want to go a step further and talk about how to get started with text classification with the help of machine learning.
The motivation behind writing this article is the same as for the previous ones: there are plenty of people out there stuck with tools that are not optimal for the task at hand, e.g. using MS Excel for text analysis. I want to encourage people to use Python, not be scared of programming, and automate as much of their work as possible.
Speaking of automation, in my last article I presented some methods on how to extract information out of textual data, using railroad incident reports as an example. Instead of having to read each incident text, I showed how to extract the most common words from a dataset (to get an overall idea of the data), how to look for specific words like “damage” or “hazardous materials” in order to log whether damages occurred or not, and how to extract the cost of the damages with the help of regular expressions.
However, what if there are too many words to look for? Using crime reports as an example: what if you are interested in whether minors were involved? There are many different words you can use to describe minors: underaged person, minor, youth, teenager, juvenile, adolescent, etc. Keeping track of all these words would be rather tedious. But if you already have labelled data from the past, there is a simple solution. Assuming you have a dataset of previous crime reports that you have labelled manually, you can train a classifier that learns the patterns of the labelled crime reports and matches them to the labels (whether a minor was involved, whether the person was intoxicated, etc.). If you get a new batch of unlabelled crime reports, you could run the classifier on the new data and label the reports automatically, without ever having to read them. Sounds great, doesn't it? And what if I told you that this can be done in as little as 7 lines of code? Mind-blowing, I know.
In this article, I will go over some main concepts in machine learning and natural language processing (NLP) and link articles for further reading. Some knowledge of text preprocessing is required, like stopword removal and lemmatisation, which I described in my previous article. I will then show, using a real-world example with a dataset that I acquired during my latest PhD study, how to train a classifier.
Classification Problem
Classification is the task of identifying to which of a set of categories a new observation belongs. An easy example would be the classification of emails into spam and ham (binary classification). If you have more than two categories it’s called multi-class classification. There are several popular classification algorithms which I will not discuss in depth but invite you to check out the provided links and do your own research:
Logistic regression (check out this video and this article for a great explanation)
K-nearest neighbour (KNN) (video, article)
Support-vector machines (SVM) (video, article)
Naive Bayes (video, article)
Decision Trees (video, article)
Neural Networks (video, article)
Training and Testing Data
When you receive a new email, it is not labelled as "spam" or "ham" by the sender. Your email provider has to label the incoming email as either ham and send it to your inbox, or spam and send it to the spam folder. To tell how well your classifier (machine learning model) works in the real world, you split your data into a training and a testing set during development.
Let's say you have 1,000 emails that are labelled as spam or ham. If you train your classifier on all of the data, you have no data left to tell how accurate your classifier is, because you do not have any email examples that the model has not seen. So you want to leave some emails out of the training set and let your trained model predict the labels of those left-out emails (the testing set). By comparing the predicted labels with the actual labels, you will be able to tell how well your model generalises to new data and how it would perform in the real world.
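As a minimal sketch of the idea (the toy emails and labels below are made up purely for illustration), scikit-learn's train_test_split handles the bookkeeping:

```python
# A minimal sketch of holding out a testing set; the toy emails and labels
# below are made up purely for illustration.
from sklearn.model_selection import train_test_split

emails = ["win a free prize now", "meeting at 10am tomorrow",
          "cheap pills online", "lunch later?"]
labels = ["spam", "ham", "spam", "ham"]

# Keep 20% of the emails aside so the trained model can later be evaluated
# on data it has never seen.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    emails, labels, test_size=0.2, random_state=42
)

print(len(train_texts), "training emails,", len(test_texts), "testing emails")
```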
Representing text in computer-readable format (numbers)
OK, so now you have your data and have divided it into a training and a testing set; let's start training the classifier, no? Given that we are dealing with text data, there is one more thing we need to do. If we were dealing with numerical data, we could feed those numbers into the classifier directly. But computers don't understand text. The text has to be converted into numbers, which is known as text vectorisation. There are several ways of doing it, but I will only go through one. Check out these articles for further reading on tf-idf and word embeddings.
A popular and simple method of feature extraction with text data is called the bag-of-words model of text. Let’s take the following sentence as an example:
“apples are great but so are pears, however, sometimes I feel like oranges and on other days I like bananas”
This quote is used as our vocabulary. The sentence contains 17 distinct words. The following three sentences can then be represented as vectors using our vocabulary:
“I like apples”
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
“Bananas are great. Bananas are awesome”
[0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
or
[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] if you don’t care how often a word occurs.
“She eats kiwis”
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Below you can see a representation of the sentence vectors in table format.
You can see that we count how often each word of the vocabulary shows up in the vector representation of the sentence. Words that are not in our initial vocabulary are of course not known, and therefore the last sentence contains only zeros. You should now be able to see how a classifier can identify patterns when trying to predict the label of a sentence. Imagine a training set that contains sentences about different fruits. The sentences that talk about apples will have a value higher than 0 at index 1 of the vector representation, whereas sentences that talk about bananas will have a value higher than 0 at index 17.
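If you want to reproduce these counts yourself, scikit-learn's CountVectorizer does exactly this. A small sketch (note that, by default, CountVectorizer lowercases the text, sorts the vocabulary alphabetically, and drops single-character tokens such as "I", so its output differs slightly from the hand-built vectors above):

```python
# A bag-of-words sketch with scikit-learn's CountVectorizer. By default it
# lowercases text, sorts the vocabulary alphabetically and drops
# single-character tokens such as "I", so the output differs slightly from
# the hand-built 17-word example above.
from sklearn.feature_extraction.text import CountVectorizer

vocabulary_sentence = [
    "apples are great but so are pears, however, sometimes I feel like "
    "oranges and on other days I like bananas"
]
new_sentences = ["I like apples", "Bananas are great. Bananas are awesome"]

vectorizer = CountVectorizer()                 # CountVectorizer(binary=True) ignores how often a word occurs
vectorizer.fit(vocabulary_sentence)            # learn the vocabulary from the reference sentence
vectors = vectorizer.transform(new_sentences)  # encode the new sentences as count vectors

print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(vectors.toarray())                       # one row of word counts per sentence
```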
Measuring accuracy
Great, now we have trained our classifier; how do we know how accurate it is? That depends on your type of data. The simplest way to measure the accuracy of your classifier is by comparing the predictions on the test set with the actual labels. If the classifier got 80/100 correct, the accuracy is 80%. There are, however, a few things to consider. I will not go into great detail but will give an example to demonstrate my point: imagine you have a dataset about credit card fraud. This dataset contains transactions made by credit cards over two days, where there were 492 frauds out of 284,807 transactions. You can see that the dataset is highly unbalanced and frauds account for only 0.172% of all transactions. You could assign "not fraud" to all transactions and get about 99.8% accuracy! This sounds great but completely misses the point of detecting fraud, as we have identified exactly 0% of the fraud cases.
Here is an article that explains different ways to evaluate machine learning models. And here is another one that describes techniques for dealing with unbalanced data.
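To make the pitfall concrete, here is a small sketch with made-up, heavily imbalanced labels: a "classifier" that always predicts the majority class gets an impressive plain accuracy while catching none of the rare cases.

```python
# Why plain accuracy misleads on unbalanced data (labels are made up):
# 990 legitimate transactions, 10 frauds, and a model that always says "not fraud".
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [0] * 990 + [1] * 10   # 1 = fraud, 0 = not fraud
y_pred = [0] * 1000             # always predict "not fraud"

print(accuracy_score(y_true, y_pred))    # 0.99 -- looks great
print(recall_score(y_true, y_pred))      # 0.0  -- not a single fraud detected
print(confusion_matrix(y_true, y_pred))  # all 10 frauds end up as false negatives
```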
Cross-Validation
There is one more concept I want to introduce before going through a coding example: cross-validation. There are many great videos and articles that explain cross-validation so, again, I will only quickly touch upon it with regard to accuracy evaluation. Cross-validation is usually used to avoid over- and underfitting the model on the training data. This is an important concept in machine learning, so I invite you to read up on it.
When splitting your data into a training and a testing set, what if you got lucky with the testing set and the model's performance on it overestimates how well it would perform in real life? Or the opposite: what if you got unlucky and your model would perform better on real-world data than on the testing set? To get a more general idea of your model's accuracy, it therefore makes sense to perform several different training/testing splits and average the accuracies that you get for each iteration.
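A minimal sketch of what that looks like in scikit-learn (the toy texts and labels are placeholders for your own data); cross_val_score performs the repeated splits and returns one accuracy per fold:

```python
# A minimal cross-validation sketch: five different train/test splits,
# one accuracy per split, then the average. The toy data below is a placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free money click here", "see you at lunch", "win cash prizes today",
         "project meeting notes", "cheap pills online", "dinner tonight?"] * 10
labels = ["spam", "ham", "spam", "ham", "spam", "ham"] * 10

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")

print(scores)         # one accuracy per fold
print(scores.mean())  # the averaged, more honest estimate
```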
Classifying people's concerns about a potential COVID-19 vaccine
Initially, I wanted to use a public dataset from the BBC comprising 2,225 articles, each labelled under one of five categories (business, entertainment, politics, sport, or tech), and train a classifier that can predict the category of a news article. However, popular datasets often give you very high accuracy scores without much preprocessing of the data. The notebook is available here, where I demonstrate that you can get 97% accuracy without doing anything with the textual data. This is great but, unfortunately, not how it works for most problems in real life.
So, I used my own small dataset instead, which I collected during my latest PhD study. Once I have finished writing the paper, I will link it here. I crowdsourced arguments from people who are opposed to taking a COVID-19 vaccine should one be developed. I labelled the data partially automatically and partially manually, by the "concern" that each argument addresses. The most popular concern is, not surprisingly, potential side effects of the vaccine, followed by a lack of trust in the government and concerns about insufficient testing due to the speed of the vaccine's development.
I am using Python and the scikit-learn library for this task. Scikit-learn can do a lot of the heavy lifting for you which, as stated in the title of this article, allows you to code up classifiers for textual data in as little as 7 lines of code. So, let’s start!
After reading in the data, which is a CSV file that contains two columns, the argument and its assigned concern, I first preprocess the data. I describe the preprocessing methods in my last article. I add a new column that contains the preprocessed arguments to the data frame and will use the preprocessed arguments to train the classifier(s).
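A sketch of what that step might look like; the file name arguments.csv, the column names argument and concern, and the lightweight preprocessing function are my assumptions for illustration, not the author's actual notebook code:

```python
# A sketch of loading and preprocessing the data. The file name, the column
# names and the simple preprocessing below are assumptions for illustration.
import re

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and digits
    return " ".join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)

df = pd.read_csv("arguments.csv")                        # assumed columns: "argument", "concern"
df["argument_clean"] = df["argument"].apply(preprocess)  # new column with the preprocessed arguments
```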
I split the arguments and their labels into a training and a testing set at an 80:20 ratio. Scikit-learn has a train_test_split function, which can also stratify and shuffle the data for you.
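Continuing the sketch above (same assumed data frame and column names), the split could look like this; stratify keeps the proportion of each concern roughly the same in both sets, and random_state makes the split reproducible:

```python
# An 80:20 stratified split; `df` is the (assumed) data frame from the
# previous sketch, with columns "argument_clean" and "concern".
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["argument_clean"],      # the preprocessed arguments
    df["concern"],             # the labels
    test_size=0.2,             # 80:20 ratio
    stratify=df["concern"],    # keep the label proportions similar in both sets
    shuffle=True,              # shuffle before splitting (the default)
    random_state=42,           # reproducible split
)
```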
Next, I instantiate a CountVectorizer to transform the arguments into sparse vectors, as described above. Then I call the fit() function to learn the vocabulary from the arguments in the training set, and afterwards I call the transform() function on the arguments to encode each one as a vector (scikit-learn provides a function that does both steps at the same time). Note that I only call transform() on the testing arguments. Why? Because I do not want to learn the vocabulary from the testing arguments and introduce bias into the classifier.
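The important detail is which data the vectoriser is fitted on; here is a self-contained sketch with toy stand-ins for the training and testing arguments:

```python
# A sketch of vectorising the text: learn the vocabulary from the training
# arguments only, then encode both sets with it. The toy texts are stand-ins.
from sklearn.feature_extraction.text import CountVectorizer

train_args = ["worried about side effects", "no trust in the government",
              "not enough testing before release"]
test_args = ["side effects scare me"]

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_args)  # fit (learn vocabulary) + transform in one call
X_test_vec = vectorizer.transform(test_args)        # transform only -- the test set must not shape the vocabulary
```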
Finally, I instantiate a multinomial Naive Bayes classifier, fit it on the training arguments, test it on the testing arguments, and print out the accuracy score: almost 74%. But what if we use cross-validation? Then accuracy drops to 67%. So we were just lucky with the testing data that the random state created. (The random state ensures that the splits you generate are reproducible.)
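Strung together, the whole thing really does fit in a handful of lines. A sketch under the same assumptions as above (made-up file and column names), with a cross-validated score as the more honest estimate:

```python
# A compact end-to-end sketch; file and column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("arguments.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["argument"], df["concern"], test_size=0.2, stratify=df["concern"], random_state=42)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                                         # single-split accuracy
print(cross_val_score(model, df["argument"], df["concern"], cv=5).mean())  # cross-validated accuracy
```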
Out of curiosity, I print out the predictions in the notebook and check which arguments the classifier got wrong, and we can see that the wrongly labelled arguments are mostly assigned the concern "side_effects". This makes sense because the dataset is unbalanced: there are more than twice as many arguments addressing side effects as there are for the second most popular concern, lack of trust in the government. So assigning side effects as the concern has a higher chance of being correct than assigning any other. I tried undersampling and dropped some side-effects arguments, but the results were not great. I encourage you to experiment with it.
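A sketch of how such an error inspection might look, reusing model, X_test, and y_test from the sketch above:

```python
# Inspect the misclassified arguments by lining up predictions with the true
# labels; `model`, `X_test` and `y_test` come from the previous sketch.
import pandas as pd

results = pd.DataFrame({
    "argument": X_test,
    "actual": y_test,
    "predicted": model.predict(X_test),
})
errors = results[results["actual"] != results["predicted"]]

print(errors["predicted"].value_counts())  # which concern gets over-assigned to wrong predictions?
print(errors.head())                       # a few of the misclassified arguments
```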
I coded up a few more classifiers, and we can see that the accuracy for all of them is around 70%. I managed to get it up to 74% using a support vector machine and dropping around 30 side_effect arguments. I have also coded up a neural network using TensorFlow, which is beyond the scope of this article and which increased accuracy by almost 10%.
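Because all of these scikit-learn classifiers share the same fit/predict interface, comparing them is a short loop. A sketch under the same assumed file and column names (the neural network, being a TensorFlow model, is left out here):

```python
# Compare a few classifiers with cross-validated accuracy; the file and
# column names are the same assumptions as in the earlier sketches.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("arguments.csv")

classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Decision tree": DecisionTreeClassifier(),
}

for name, clf in classifiers.items():
    model = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(model, df["argument"], df["concern"], cv=5)
    print(f"{name}: {scores.mean():.2f}")
```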
I hope this article provided you with the necessary resources to get started with machine learning on textual data. Feel free to use any of the code from the notebook and I encourage you to play around with different preprocessing methods, classification algorithms, text vectorisations and, of course, datasets.
Original article: https://towardsdatascience.com/no-fear-of-machine-learning-classify-your-textual-data-in-less-than-10-lines-of-code-2360d5cec798