算法题指南书_分类算法指南
算法題指南書
Today, we will see how popular classification algorithms work and help us, for example, to pick out and sort wonderful, juicy tomatoes.
I will remind you that no algorithm is optimal over the set of all possible situations. Machine learning algorithms are delicate instruments that you tune based on the problem set, especially in supervised machine learning.
How Classification Works
We predict whether a thing can be referred to as a particular class every day. To give an example, classification helps us make decisions when picking tomatoes in a supermarket (“green,” “perfect,” “rotten”). In machine learning terms, we assign a label of one of the classes to every tomato we hold in our hands.
The efficiency of your tomato picking contest (some would call it a classification model) depends on the accuracy of its results. The more often you go to the supermarket yourself (instead of sending your parents or your significant other), the better you will become at picking out tomatoes that are fresh and yummy.
Computers are just the same. For a classification model to learn to predict outcomes accurately, it needs a lot of training examples.
4 Types of Classification
Image source: Author

Binary
Binary classification means there are two classes to work with that relate to one another as true and false. Imagine you have a huge lug box in front of you with yellow and red tomatoes. But your fancy Italian pasta recipe says that you only need the red ones.
What do you do? Obviously, you use label-encoding and, in this case, assign 1 to “red” and 0 to “not red.” Sorting tomatoes has never been easier.
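The label-encoding step can be sketched in a couple of lines of Python; the color values below are invented for illustration:

```python
# A minimal sketch of label-encoding tomato colors for binary
# classification: "red" -> 1, anything else ("not red") -> 0.
colors = ["red", "yellow", "red", "yellow", "red"]
labels = [1 if c == "red" else 0 for c in colors]
print(labels)  # [1, 0, 1, 0, 1]
```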
Image source: Author

Multiclass
What do you see in this photo?
Source: freepik.com

Red beefsteak tomatoes. Cherry tomatoes. Cocktail tomatoes. Heirloom tomatoes.
There is no black and white here, no normal and abnormal like in binary classification. We welcome all sorts of wonderful vegetables (or berries) to our table.
If you are not a fan of cooking with tomatoes, you probably don’t know that not all tomatoes are equally good for the same dish. Red beefsteak tomatoes are perfect for salsa, but you do not pickle them. Cherry tomatoes work for salads but not for pasta. So it’s important to know what you are dealing with.
Multiclass classification helps us to sort all the tomatoes from the box regardless of how many classes there are.
Image source: Author

Multi-label
Multi-label classification is applied when one input can belong to more than one class, like a person who is a citizen of two countries.
To work with this type of classification, you need to build a model that can predict multiple outputs.
You need a multi-label classification for object recognition in photos, for example, when you need to identify not only tomatoes but also other, different kinds of objects in the same image: apples, zucchinis, onions, etc.
Important note for all tomato lovers: You cannot just take a binary or multiclass classification algorithm and apply it directly to multi-label classification. But you can use:
Image source: Author

You can also try to use a separate algorithm for each class to predict labels for each category.
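This one-classifier-per-label strategy can be sketched with scikit-learn's `MultiOutputClassifier`, which fits a separate copy of the base estimator for each label column. The features (redness, roundness) and the label columns (has_tomato, has_apple) below are invented for illustration:

```python
# A minimal sketch of multi-label classification: each sample gets
# one binary label per object type, and one classifier is trained
# per label column. All feature and label values are made up.
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy features [redness, roundness]; labels [has_tomato, has_apple].
X = [[0.9, 0.8], [0.9, 0.9], [0.1, 0.9], [0.1, 0.8], [0.9, 0.1], [0.1, 0.1]]
y = [[1, 1], [1, 1], [0, 1], [0, 1], [1, 0], [0, 0]]

clf = MultiOutputClassifier(LogisticRegression()).fit(X, y)
print(clf.predict([[0.95, 0.85]]))  # one 0/1 prediction per label
```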
Imbalanced
We work with imbalanced classification when examples in each class are unequally distributed.
Imbalanced classification is used for fraud detection software and medical diagnosis. Finding rare and exquisite biologically grown tomatoes that are accidentally spilled in a large pile of supermarket tomatoes is an example of imbalanced classification.
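A quick sketch of what an imbalanced dataset looks like, and one common mitigation in scikit-learn, `class_weight="balanced"`, which makes errors on the rare class cost more. The 1-in-20 "organic tomato" split and the features are invented:

```python
# One rare positive ("organic") among many negatives: an imbalanced
# label distribution. class_weight="balanced" reweights the loss so
# the rare class is not drowned out. All data here is made up.
from collections import Counter
from sklearn.linear_model import LogisticRegression

X = [[i % 10, i % 3] for i in range(20)]  # made-up features
y = [1] + [0] * 19                        # heavily imbalanced labels
print(Counter(y))                         # Counter({0: 19, 1: 1})

clf = LogisticRegression(class_weight="balanced").fit(X, y)
```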
Image source: Author

I recommend you visit the fantastic blog of Machine Learning Mastery, where you can read about the different types of classification and study many more machine learning materials.
Steps to Build a Classification Model
Once you know what kind of classification task you are dealing with, it’s time to build a model.
Let us now take a look at the most widely-used classification algorithms.
The Most Popular Classification Algorithms
Image source: Author

Scikit-learn is one of the top ML libraries for Python. So if you want to build your model, check it out. It provides access to widely-used classifiers.
Logistic regression
Logistic regression is used for binary classification.
This algorithm employs a logistic function to model the probability of an outcome happening. It is most useful when you want to understand how several independent variables affect a single outcome variable.
Example question: Will the precipitation levels and the soil composition lead to a tomato’s prosperity or its untimely death?
Logistic regression has limitations: all predictors should be independent, and there should be no missing values. The algorithm will also fail when the classes are not linearly separable.
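The example question above can be sketched with scikit-learn's `LogisticRegression`; the precipitation and soil-quality numbers are invented for illustration:

```python
# A hypothetical sketch: predicting whether a tomato plant thrives (1)
# or dies (0) from precipitation (mm) and a soil-quality score.
# All feature values are made up.
from sklearn.linear_model import LogisticRegression

# [precipitation_mm, soil_score]
X = [[10, 0.2], [15, 0.3], [20, 0.4], [60, 0.8], [70, 0.9], [80, 0.7]]
y = [0, 0, 0, 1, 1, 1]  # 0 = died, 1 = thrived

model = LogisticRegression().fit(X, y)
print(model.predict([[75, 0.85]]))        # likely [1] on this toy data
print(model.predict_proba([[75, 0.85]]))  # modeled class probabilities
```

`predict_proba` exposes the logistic function's output directly, which is what makes this algorithm useful for reasoning about how each input variable shifts the probability of the outcome.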
Naive Bayes
The Naive Bayes algorithm is based on Bayes’ theorem. You can apply this algorithm for binary and multiclass classification and classify data based on historical results.
Example task: I need to separate rotten tomatoes from the fresh ones based on their look.
The advantage of Naive Bayes is that these algorithms are fast to build: They do not require an extensive training set and are fast compared to other methods. However, since the performance of Naive Bayes depends on the accuracy of its strong independence assumptions, the results can potentially turn out very bad.
Using Bayes’ theorem, it is possible to tell how the occurrence of an event impacts the probability of another event.
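Bayes' theorem can be put to work with a few invented numbers, for example to estimate how likely a tomato is to be rotten given that it has dark spots:

```python
# A toy application of Bayes' theorem (all probabilities invented):
# P(rotten | spots) = P(spots | rotten) * P(rotten) / P(spots).
p_rotten = 0.10               # prior: 10% of tomatoes are rotten
p_spots_given_rotten = 0.80   # 80% of rotten tomatoes have dark spots
p_spots_given_fresh = 0.05    # 5% of fresh tomatoes have them too

# Total probability of observing dark spots on a random tomato.
p_spots = (p_spots_given_rotten * p_rotten
           + p_spots_given_fresh * (1 - p_rotten))
p_rotten_given_spots = p_spots_given_rotten * p_rotten / p_spots
print(round(p_rotten_given_spots, 3))  # 0.64
```

Seeing dark spots raised the probability of "rotten" from the 10% prior to 64%, which is exactly the kind of update a Naive Bayes classifier performs for every feature it observes.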
k-nearest neighbors
kNN stands for “k-nearest neighbor” and is one of the simplest classification algorithms.
The algorithm assigns objects to the class that most of its nearest neighbors in the multidimensional feature space belong to. The number k is the number of neighboring objects in the feature space that are compared with the classified object.
Example: I want to predict the species of the tomato from the species of tomatoes similar to it.
To classify the inputs using k-nearest neighbors, you need to perform a set of actions:
- Calculate the distance to each of the objects in the training sample.
- Select the k objects of the training sample with the smallest distances.
- Assign the class that occurs most frequently among those k nearest neighbors.
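The three steps above can be sketched in plain Python; the two-feature tomato data (say, weight and redness, both scaled to 0-1) is invented for illustration:

```python
# A from-scratch kNN classifier implementing the three steps above.
from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    # 1. Distance from the query to every object in the training sample.
    dists = [math.dist(x, query) for x in train]
    # 2. Indices of the k objects with the smallest distances.
    nearest = sorted(range(len(train)), key=lambda i: dists[i])[:k]
    # 3. Majority class among those k nearest neighbors.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Invented [weight, redness] features for two tomato species.
train = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = ["cherry", "cherry", "beefsteak", "beefsteak"]
print(knn_predict(train, labels, [0.85, 0.85]))  # cherry
```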
Decision tree
Decision trees are probably the most intuitive way to visualize a decision-making process. To predict a class label of input, we start from the root of the tree. You need to divide the possibility space into smaller subsets based on a decision rule that you have for each node.
Here is an example:
Image source: Author

You keep breaking up the possibility space until you reach the bottom of the tree. Every decision node has two or more branches. The leaves in the model above contain the decision about whether a person is or isn’t fit.
Example: You have a basket of different tomatoes and want to choose the correct one to enhance your dish.
Types of decision trees
There are two types of trees. They are based on the nature of the target variable:
- Categorical variable decision tree
- Continuous variable decision tree
Therefore, decision trees work quite well with both numerical and categorical data. Another plus of using decision trees is that they require little data preparation.
However, decision trees can become too complicated, which leads to overfitting. A significant disadvantage of these algorithms is that small variations in training data make them unstable and lead to entirely new trees.
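The tomato-basket example can be sketched with scikit-learn's `DecisionTreeClassifier`; the weight and diameter values below are invented:

```python
# A minimal decision tree picking a tomato type from invented
# [weight_g, diameter_cm] features.
from sklearn.tree import DecisionTreeClassifier

X = [[15, 2.0], [20, 2.5], [180, 8.0], [220, 9.0]]
y = ["cherry", "cherry", "beefsteak", "beefsteak"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[200, 8.5]]))  # ['beefsteak'] on this toy data
```

With only four training points the tree needs a single split, which illustrates the instability mentioned above: changing one sample can move that split and produce an entirely different tree.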
Random forest
Random forest classifiers use several different decision trees on various sub-samples of datasets. The average result is taken as the model’s prediction, which improves the predictive accuracy of the model in general and combats overfitting.
Consequently, random forests can be used to solve complex machine learning problems without compromising the accuracy of the results. Nonetheless, they demand more time to form a prediction and are more challenging to implement.
Read more about how random forests work in the Towards Data Science blog.
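In scikit-learn, moving from a single tree to a random forest is mostly a matter of swapping the estimator; `n_estimators` sets how many trees are combined. The tomato features below are invented:

```python
# A sketch of a random forest on invented [weight_g, diameter_cm]
# tomato data; predictions aggregate the votes of 100 trees, each
# trained on a bootstrap sub-sample of the dataset.
from sklearn.ensemble import RandomForestClassifier

X = [[15, 2.0], [20, 2.5], [18, 2.2], [180, 8.0], [220, 9.0], [200, 8.5]]
y = ["cherry", "cherry", "cherry", "beefsteak", "beefsteak", "beefsteak"]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[190, 8.2]]))  # ['beefsteak'] on this toy data
```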
Support vector machine
Support vector machines use a hyperplane in an N-dimensional space to classify the data points. N here is the number of features. It can be, basically, any number, but the bigger it is, the harder it becomes to build a model.
One can imagine the hyperplane as a line (for a two-dimensional space). Once you pass three-dimensional space, it becomes hard for us to visualize the model.
Data points that fall on different sides of the hyperplane are attributed to different classes.
Example: An automatic system that sorts tomatoes based on their shape, weight, and color.
The hyperplane that we choose directly affects the accuracy of the results. So we search for the plane that has the maximum distance between data points of both classes.
SVMs show accurate results with minimal computation power when you have a lot of features.
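The tomato-sorting example can be sketched with scikit-learn's `SVC`; the shape, weight, and color values are invented, and a linear kernel is used so the separating hyperplane is literally a flat plane in the three-dimensional feature space:

```python
# A sketch of an SVM sorting tomatoes by invented
# [roundness, weight_g, redness] features.
from sklearn.svm import SVC

X = [[0.90, 15, 0.90], [0.95, 20, 0.80],
     [0.60, 200, 0.90], [0.70, 220, 0.85]]
y = ["cherry", "cherry", "beefsteak", "beefsteak"]

# A linear kernel finds the maximum-margin hyperplane between classes.
svm = SVC(kernel="linear").fit(X, y)
print(svm.predict([[0.92, 18, 0.85]]))  # ['cherry'] on this toy data
```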
Summing Up
As you can see, machine learning can be as simple as picking up vegetables in the shop. But there are many details to keep in mind if you don’t want to mess it up.
Translated from: https://medium.com/better-programming/a-guide-to-classification-algorithms-fdaabb538b26