In this blog, we'll look at how one of the most widely used classification algorithms, K-Nearest Neighbors (K-NN), behaves in various real-world situations. K-NN is a simple, easy-to-understand, and versatile algorithm that finds applications in a wide variety of fields.
內(nèi)容 (Contents)
To learn how K-NN works, please read our previous blog; you can find it here.
1. Case of Imbalanced vs. Balanced Datasets
First, what is an imbalanced dataset?
Consider two-class classification: if there is a very large difference between the number of positive-class and negative-class points, we say the dataset is imbalanced.
Imbalanced dataset數(shù)據(jù)集不平衡If the number of positive classes and Negative classes is approximately the same. in the given data set. Then we can say our dataset in balance Dataset.
如果肯定類別和否定類別的數(shù)量近似相同。 在給定的數(shù)據(jù)集中。 然后,我們可以說我們的數(shù)據(jù)集處于余額數(shù)據(jù) 集中 。
[Image: Balanced dataset]

K-NN is strongly affected by an imbalanced dataset because the majority vote can be dominated by the majority class.
如何解決數(shù)據(jù)集不平衡的問題? (How to work-around an imbalanced dataset issue?)
Imbalanced data is not always a bad thing, and in real data sets, there is always some degree of imbalance. That said, there should not be any big impact on your model performance if the level of imbalance is relatively low.
數(shù)據(jù)不平衡并不總是一件壞事,在實(shí)際數(shù)據(jù)集中,總會(huì)存在一定程度的不平衡。 就是說,如果不平衡程度相對(duì)較低,則對(duì)模型性能不會(huì)有太大影響。
Now, let’s cover a few techniques to solve the class imbalance problem.
現(xiàn)在,讓我們介紹一些解決類不平衡問題的技術(shù)。
Under-Sampling
欠采樣
Let's assume I have a dataset N with 1000 data points, and N has two classes, n1 and n2, corresponding to positive and negative reviews. Here n1 is the positive class with 900 data points and n2 is the negative class with 100 data points, so n1 is the majority class (it has many more data points) and n2 is the minority class. To handle this imbalanced dataset I create a new dataset m: I keep all 100 n2 points as they are and randomly pick 100 points from n1. This sampling trick is called under-sampling.
Instead of using all of n1 and n2, we use the balanced dataset m for modeling.
But in this approach we throw away data and lose information, which is not a good idea. To address this drawback of under-sampling, we introduce another method called over-sampling.
Over-Sampling
過度采樣
This technique modifies the unequal class distribution to create a balanced dataset. When the quantity of minority data is insufficient, over-sampling tries to balance the classes by increasing the number of rare samples.
Over-sampling increases the number of minority class members in the training set. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept.
過度采樣會(huì)增加培訓(xùn)集中少數(shù)群體成員的數(shù)量。 過度采樣的優(yōu)點(diǎn)是不會(huì)保留原始訓(xùn)練集中的信息,因?yàn)闀?huì)保留少數(shù)和多數(shù)類別的所有觀察結(jié)果。
Over-sampling reduces the domination of any one class in the dataset.
Instead of simply repeating points, we can also create artificial (synthetic) points in the minority-class region by sampling characteristics from existing minority-class instances; this technique is called the Synthetic Minority Over-sampling Technique (SMOTE).
A "dumb model" that always predicts the majority class can achieve high accuracy on imbalanced data, so accuracy is not a suitable performance measure when the dataset is imbalanced.
2.多類別分類 (2. Multi-Class Classification)
Consider an MNIST-style dataset where the class label Y ∈ {0, 1}; this is called binary classification.
A classification task with more than two classes is called multi-class classification; for example, in an MNIST dataset the class label Y ∈ {0, 1, 2, 3, 4, 5}.
[Image: Multi-Class Classification]

K-NN extends easily to multi-class classification because it simply takes the majority vote among the neighbors.
But some classification algorithms, such as logistic regression, are inherently binary and do not extend directly to multi-class classification.
As such, they cannot be used for multi-class classification tasks, at least not directly.
Instead, heuristic methods can be used to split a multi-class classification problem into multiple binary classification datasets and train a binary classification model on each. One such technique is One-vs-Rest.
One-vs-rest is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem and predictions are made using the model that is the most confident.
一對(duì)多休息是一種使用二進(jìn)制分類算法進(jìn)行多分類的啟發(fā)式方法。 它涉及將多類數(shù)據(jù)集拆分為多個(gè)二進(jìn)制分類問題。 然后針對(duì)每個(gè)二進(jìn)制分類問題訓(xùn)練一個(gè)二進(jìn)制分類器,并使用最有信心的模型進(jìn)行預(yù)測(cè)。
For example, given a multi-class classification problem with examples for the classes 'red', 'blue', and 'green', it could be divided into three binary classification datasets as follows:
Binary Classification Problem 1: red vs [blue, green]
二進(jìn)制分類問題1 :紅色與[藍(lán)色,綠色]
Binary Classification Problem 2: blue vs [red, green]
二進(jìn)制分類問題2 :藍(lán)色與[紅色,綠色]
Binary Classification Problem 3: green vs [red, blue]
二進(jìn)制分類問題3 :綠色與[紅色,藍(lán)色]
A possible downside of this approach is that it requires one model to be created for each class. For example, three classes require three models. This could be an issue for large datasets.
這種方法的可能缺點(diǎn)是,需要為每個(gè)類創(chuàng)建一個(gè)模型。 例如,三個(gè)類需要三個(gè)模型。 對(duì)于大型數(shù)據(jù)集,這可能是個(gè)問題。
To learn about One-vs-Rest in scikit-learn, visit here.
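A small sketch of both points in scikit-learn: K-NN consumes the ten digit classes directly, while a binary learner such as logistic regression is wrapped in the One-vs-Rest meta-estimator (the digits dataset is just a convenient example):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)               # 10-class toy problem
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# K-NN supports multi-class labels directly through the majority vote
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# a binary learner is turned multi-class with the One-vs-Rest wrapper
ovr = OneVsRestClassifier(LogisticRegression(max_iter=5000)).fit(X_tr, y_tr)

print(knn.score(X_te, y_te), ovr.score(X_te, y_te))
```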
3. K-NN作為給定的距離(或)相似度矩陣 (3. K-NN as a given Distance (or) similarity matrix)
Distance-based classification is a popular approach that classifies instances using point-to-point distances, as in the nearest-neighbor (K-NN) method.
The distance measure can be any of the various measures available, e.g. Euclidean distance, Manhattan distance, Mahalanobis distance, or another domain-specific measure.
Even if, instead of raw data and labels, someone only gives us the similarities between products (or the distances between vectors), K-NN still works very well, because internally K-NN only ever needs the distances between points.
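A minimal sketch with scikit-learn's metric='precomputed' option: the classifier is fit on a train-vs-train distance matrix and predicts from a test-vs-train distance matrix, so the raw feature vectors are never needed (the random data below only stands in for whatever produced the matrices):

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_test = rng.normal(size=(20, 10))

# pretend these two matrices are all we were given
D_train = pairwise_distances(X_train, X_train)   # shape (200, 200)
D_test = pairwise_distances(X_test, X_train)     # shape (20, 200)

knn = KNeighborsClassifier(n_neighbors=5, metric="precomputed")
knn.fit(D_train, y_train)
print(knn.predict(D_test))
```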
4.訓(xùn)練和測(cè)試集差異 (4. Train and Test set differences)
When the data changes over time, the distributions of the train and test sets can differ. If the train and test distributions are different, the model cannot be expected to perform well.
We therefore want to compare the distributions of the train and test sets before we build a model.
But how can we tell whether the train and test sets have different distributions?
Consider our dataset split into train and test sets, where both contain x and y: x is the given data point and y is its label.
To compare the distributions of the train and test sets, we create a new dataset from the existing one.
In this new dataset, every original training point becomes x' = concat(x, y) with the new label y' = 1, and every original test point becomes x' = concat(x, y) with y' = 0. We then apply a binary classifier such as K-NN to this new dataset (a minimal sketch follows the three cases below) and interpret the results as follows:
Case 1:
情況1:
If the model's accuracy is low, the train and test points are almost indistinguishable, so the train and test distributions are very similar.
Case 2:
情況2:
If the model's accuracy is medium, the train and test points overlap less, so the train and test distributions are not very similar.
Case 3:
情況3:
If the model's accuracy is high, the train and test points barely overlap, so the train and test distributions are very different.
If the train and test sets come from the same distribution, we are fine; otherwise, if the features change over time, we can design new, more stable features.
5.異常值的影響 (5. Impact of outliers)
The behavior of the model can be easily understood by looking at its decision surface.
Consider the image above: with K = 1 (1-NN), a single outlier changes the decision surface around it. When K is small, outliers have a large impact on the model; when K is large, the model is much less affected by outliers.
Techniques to remove outliers in K-NN
Local Outlier Factor (LOF): The local outlier factor is based on a concept of a local density, where locality is given by k nearest neighbors, whose distance is used to estimate the density.
局部離群因子(LOF):局部離群因子基于局部密度的概念,其中局部性由k個(gè)最近的鄰居給出,其距離用于估計(jì)密度。
To understand LOF let’s see some basic definitions.
K-distance(Xi): the distance from Xi to its k-th nearest neighbor.
Neighborhood of Xi: the set of all points that belong to the k nearest neighbors of Xi.
For example, if K = 5, the neighborhood of Xi is the set of its 5 nearest neighbors, i.e., {x1, x2, x3, x4, x5}.
Reachability-distance (Xi, Xj):
可達(dá)距離(Xi,Xj):
reach-dist(Xi, Xj) = max{k-distance(Xj), dist(Xi, Xj)}
到達(dá)距離(Xi,Xj)= max {k-距離(Xj),dist(Xi,Xj)}
Basically, if point Xi is within the k neighbors of point Xj, the reach-dist(Xi, Xj) will be the k-distance of Xj. Otherwise, it will be the real distance between Xi and Xj. This is just a “smoothing factor”. For simplicity, consider this the usual distance between two points.
Local reachability density (LRD): to get the LRD of a point Xi, we first calculate the reachability distances from Xi to all of its k nearest neighbors and take their average. The LRD of Xi is then simply the inverse of that average reachability distance.
Intuitively, the local reachability density tells us how far we have to travel from our point to reach the next point or cluster of points. The lower it is, the less dense the region is and the farther we have to travel.
Local Outlier Factor (LOF): the LOF of Xi is basically the average LRD of the points in the neighborhood of Xi multiplied by the inverse of the LRD of Xi.
This means that if the density of Xi's neighborhood is large while the density around Xi itself is small, Xi is considered an outlier.
When we apply LOF, points with a large LOF value are considered outliers; otherwise they are inliers.
6.規(guī)模和專欄標(biāo)準(zhǔn)化 (6. Scale and Column Standardization)
All distance-based algorithms are affected by the scale of the variables. K-NN is a distance-based algorithm: it classifies data based on proximity to the K nearest neighbors. Often, the features of the data we use are not on the same scale or in the same units.
An example is when we have the features age and height. These two features obviously have different units: age is in years and height is in centimeters.
This difference in units causes distance-based algorithms such as K-NN to perform sub-optimally, so it is necessary to rescale features with different units onto the same scale. Many techniques can be used for rescaling; here I will discuss two of them, Min-Max scaling and Standard scaling.
Before we build a model, we need to rescale or standardize the features.
7.模型的可解釋性 (7. Model Interpretability)
A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model.
如果一個(gè)人的決策比另一個(gè)模型的決策更容易理解,那么該模型比另一個(gè)模型的解釋性更好。
A model that can explain its results is called an interpretable model.
K-NN is interpretable when both the dimensionality d and the number of neighbors k are small: the prediction can then be explained by pointing to the few neighbors that voted for it.
8. Feature Importance
Feature importance tells us which features matter most in our model, but K-NN does not provide feature importances internally.
To know which features are important we have to apply separate techniques; two popular members of the stepwise-selection family are forward selection and backward selection (also known as backward elimination), both sketched in code after the two procedures below.
Forward Selection: The procedure starts with an empty set of features. The best of the original features is determined and added to the reduced set. At each subsequent iteration, the best of the remaining original attributes is added to the set.
正向選擇 :該過程從一組空白功能開始。 確定最佳原始功能并將其添加到精簡(jiǎn)集中。 在每個(gè)后續(xù)迭代中,將剩余的最佳原始屬性添加到集合中。
- First, the best single feature is selected (using some criterion function like accuracy, AUC, etc.).
- Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected.
- Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected.
- This procedure continues until a predefined number of features is selected.
Backward Elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set: starting from all existing features, it drops one feature per iteration based on some performance metric.
- First, the criterion function is computed for all n features.
- Then, each feature is deleted one at a time, the criterion function is computed for all subsets with n-1 features, and the worst feature is discarded (using some criterion function like accuracy, AUC, etc.).
- Next, each feature among the remaining n-1 is deleted one at a time, and the worst feature is discarded to form a subset with n-2 features.
- This procedure continues until a predefined number of features is left.
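A hedged sketch of both procedures using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the breast-cancer dataset and the choice of five features are arbitrary illustrations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# forward selection: start empty and greedily add the feature that most
# improves cross-validated accuracy
forward = SequentialFeatureSelector(
    knn, n_features_to_select=5, direction="forward", cv=5).fit(X, y)

# backward elimination: start with all features and greedily drop the worst
backward = SequentialFeatureSelector(
    knn, n_features_to_select=5, direction="backward", cv=5).fit(X, y)

print(forward.get_support(indices=True))
print(backward.get_support(indices=True))
```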
9.分類特征 (9. Categorical Features)
In many practical Data Science activities, the data set will contain categorical variables. These variables are typically stored as text values that represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”) or geographic designations (State or Country).
在許多實(shí)際的數(shù)據(jù)科學(xué)活動(dòng)中,數(shù)據(jù)集將包含類別變量。 這些變量通常存儲(chǔ)為代表各種特征的文本值。 一些示例包括顏色(“紅色”,“黃色”,“藍(lán)色”),大小(“小”,“中”,“大”)或地理名稱(州或國(guó)家)。
Label encoding: label encoders transform non-numerical labels into numerical labels. Each category is assigned a unique integer from 0 up to n_categories - 1 per feature. Label encoders are suitable for encoding variables where the ordering or numerical value of the labels is meaningful. However, if you have nominal data, using label encoders may not be such a good idea.
OneHot encoding: One-hot encoding is the most widely used encoding scheme. It works by creating a column for each category present in the feature and assigning a 1 or 0 to indicate the presence of a category in the data.
OneHot編碼: One-hot編碼是使用最廣泛的編碼方案。 它通過為功能中存在的每個(gè)類別創(chuàng)建一列并分配1或0來指示數(shù)據(jù)中類別的存在來工作。
Binary encoding: Binary encoding is not as intuitive as the above two approaches. Binary encoding works like this:
二進(jìn)制編碼:二進(jìn)制編碼不如上述兩種方法直觀。 二進(jìn)制編碼如下:
- The categories are encoded as ordinal, for example, categories like red, yellow, green are assigned labels 1, 2, 3 (let's assume).
- These integers are then converted into binary code, so for example 1 becomes 001 and 2 becomes 010, and so on.
Binary encoding is good for high cardinality data as it creates very few new columns. Most similar values overlap with each other across many of the new columns. This allows many machine learning algorithms to learn the similarity of the values.
二進(jìn)制編碼對(duì)于高基數(shù)數(shù)據(jù)很有用,因?yàn)樗鼊?chuàng)建的新列很少。 大多數(shù)相似的值在許多新列中彼此重疊。 這允許許多機(jī)器學(xué)習(xí)算法學(xué)習(xí)值的相似性。
If text features are present in the dataset, use natural language processing techniques such as Bag of Words, TF-IDF, or Word2vec.
To know more about categorical features, visit here.
10.缺失價(jià)值估算 (10. Missing Value Imputation)
Missing values occur in a dataset due to data-collection errors or data corruption.
如何解決缺失值 (How to work around missing values)
Imputation techniques: fill missing values with the mean, median, or mode of the given data.
Imputation by class label: if a value is missing for a positive-class point, impute it using the mean computed over positive-class points only; if it is missing for a negative-class point, use the mean over negative-class points only.
Model-based imputation: To prepare a dataset for machine learning we need to fix missing values, and we can fix missing values by applying machine learning to that dataset! If we consider a column with missing data as our target variable, and existing columns with complete data as our predictor variables, then we may construct a machine learning model using complete records as our train and test datasets and the records with incomplete data as our generalization target.
基于模型的歸因:要準(zhǔn)備用于機(jī)器學(xué)習(xí)的數(shù)據(jù)集,我們需要修正缺失值,并且可以通過將機(jī)器學(xué)習(xí)應(yīng)用于該數(shù)據(jù)集來修正缺失值! 如果我們將缺少數(shù)據(jù)的列作為目標(biāo)變量,將具有完整數(shù)據(jù)的現(xiàn)有列視為預(yù)測(cè)變量,那么我們可以使用完整記錄作為訓(xùn)練和測(cè)試數(shù)據(jù)集,而將不完整數(shù)據(jù)的記錄作為廣義來構(gòu)造機(jī)器學(xué)習(xí)模型目標(biāo)。
This is a fully scoped-out machine learning problem. K-NN is often used for model-based imputation because it fills a missing value from the point's nearest neighbors.
11.維度詛咒 (11. Curse of Dimensionality)
In machine learning, it is important to know that as the dimensionality increases, the number of data points required to build a good classification model grows exponentially.
Hughes phenomenon: when the size of the dataset is fixed, performance decreases as dimensionality increases.
Distance functions (e.g., Euclidean distance): our intuition about distance in 3-D space does not carry over to high-dimensional spaces.
As dimensionality increases, carefully choosing the number of dimensions (features) to use is the responsibility of the data scientist training the model. In general, the smaller the training set, the fewer features she should use, keeping in mind that each additional feature increases the data requirement exponentially.
As dimensionality increases, dist_max(Xi) ≈ dist_min(Xi): the farthest and nearest neighbors of a point become almost equally far away.
In K-NN, when the dimensionality d is high, Euclidean distance is not a good choice of distance measure; use cosine distance instead in high-dimensional spaces.
When the dimensionality d is high, the impact of dimensionality is high if the data points are dense; if the data points are sparse, the impact of dimensionality is lower.
As dimensionality increases, the chance of overfitting the model also increases.
12. Bias-Variance Trade-Off
In machine learning theory, the bias-variance trade-off is the mathematical way to understand whether a model is underfitting or overfitting.
A model is good when its error on future unseen data is low; this error is given by:
Generalization error = Bias² + Variance + Irreducible error
Generalization error is the error of the model on future unseen data; the bias term is due to underfitting, the variance term is due to overfitting, and the irreducible error is the error that cannot be reduced further for the given problem.
High bias means underfitting: error due to overly simple assumptions in the model.
High variance means overfitting: it measures how much the model changes as the training data changes; small changes in the training data result in a very different model and a different decision surface.
A good model has low generalization error: no underfitting, no overfitting, and only some amount of irreducible error.
↓ Generalization error = ↓ Bias² + ↓ Variance + Irreducible error
High bias (underfitting): as the training error increases, the bias also increases.
High variance (overfitting): when the test error increases while the training error decreases, the variance increases.
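In K-NN itself, the number of neighbors k controls this trade-off: a very small k tends to overfit (high variance) and a very large k tends to underfit (high bias). A small sketch with scikit-learn's validation_curve on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
ks = [1, 5, 15, 51, 201]

train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=ks, cv=5)

for k, tr, te in zip(ks, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    # small k: train accuracy near 1 but lower CV accuracy (overfitting)
    # very large k: both accuracies drop together (underfitting)
    print(k, round(tr, 3), round(te, 3))
```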
翻譯自: https://medium.com/analytics-vidhya/k-nearest-neighbor-algorithm-in-various-real-world-cases-113c1dc75e91
內(nèi)容管理系統(tǒng)
總結(jié)
- 上一篇: python 梯度下降_Python解释
- 下一篇: opencv图像深度-1_OpenCV空