Dimensionality Reduction: Novice to Ninja (Part 1)
What is high-dimensional data?
When the number of features exceeds the number of observations, the data is known as high-dimensional data, and it increases the computational complexity. Generally, any dataset with more than 10 features is considered a high-dimensional dataset. One of the most frequently used high-dimensional datasets is the MNIST dataset.
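As a quick illustration (my example, not from the original article), even scikit-learn's small built-in digits dataset, a miniature version of MNIST, already has far more than 10 features per observation:

from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
print(X.shape)  # (1797, 64): 1797 observations, 64 features -> high-dimensional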
Why do we need to reduce the dimensionality of a dataset?
The greater the dimension, the more complex the computation, so it requires more powerful hardware. Reducing the dimension can give better and more efficient results even on a computer with low computational power. Some benefits of dimensionality reduction are:
i> It makes computation less complex.
ii> It requires less disk space.
iii> It lowers the chance of model overfitting.
Why do we avoid model overfitting?
When we build a model, it shouldn't be "super accurate" on the training data. A super-accurate (overfitted) model no longer generalizes to new datasets and, as a result, does not give efficient and effective forecasts.
fig: 1
Dimensionality Reduction Techniques:
i> t-SNE: t-Distributed Stochastic Neighbor Embedding
ii> PCA: Principal Component Analysis
1. t-SNE
Let's understand each piece of t-SNE in detail. It is the most commonly used technique for high-dimensional data visualization and gives a clear and precise picture of high-dimensional data. It uses a feature-extraction technique.
fig: t-SNE
t: t-Distribution
The t-distribution, or Student's t-distribution, is a bell-shaped curve quite similar to the Gaussian distribution, but it has heavier tails, which give extreme values a much higher probability.
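As a quick numerical illustration (my example, not from the article), the probability of landing more than three standard units out in the tail is far larger for a heavy-tailed t-distribution than for the Gaussian; a minimal check with scipy:

from scipy.stats import t, norm

# Tail probability P(X > 3)
print(t.sf(3, df=1))   # ~0.102 for Student's t with 1 degree of freedom (heavy tails)
print(norm.sf(3))      # ~0.00135 for the standard normal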
N: Neighbourhood
fig: 2
Consider the rectangular box as a high-dimensional space. Inside it, consider a spherical region and take a point “xi”; any remaining point “xj” inside that spherical region is a neighbourhood point of “xi”. How can we determine whether “xi” and “xj” are geometrically close or not?
If “xi” and “xj” satisfy the above formula, both are considered neighbourhood points; points lying outside the spherical region are not neighbourhood points of “xi”. Here we will use the embedding concept.
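The formula referred to above appears only as an image in the original post; for reference, in standard t-SNE (van der Maaten and Hinton, 2008) the closeness of “xj” to “xi” in the high-dimensional space is measured by a Gaussian conditional probability, written here in LaTeX:

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

Nearby points get a high probability of being chosen as neighbours, distant points a vanishingly small one; the bandwidth σ_i is set from the chosen perplexity.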
E: Embedding
An embedding takes the points from the high-dimensional space and places them into a low-dimensional space. With the help of embedding, we resolve the neighbourhood question mentioned above.
fig: 3
S: Stochastic
Broadly speaking, stochastic means probabilistic. How does it affect the data points in a t-SNE model?
If we run the t-SNE model multiple times with the same parameter values, it can give different visual results. So when doing high-dimensional visualization, we cannot reach a conclusion from a single plot; we have to produce multiple visualizations with multiple parameter settings before drawing any conclusion.
For these multiple runs, we change the perplexity each time, and its value should not be more than the number of data points:
Perplexity < number of data points.
What is perplexity?
Let's take perplexity = 5; it means roughly five neighbourhood points will be considered for each point “xi”.
Coding implementation of t-SNE:
from sklearn.manifold import TSNE
df = TSNE(n_components=2, random_state=0, perplexity=6, n_iter=500)
n_components converts the data into the required number of dimensions; with n_components=2, the n-dimensional data is converted into a 2-D dataset. random_state fixes the random seed: setting random_state=0 makes the run reproducible, whereas leaving it unset means each run with the same parameter values can give a different result. n_iter=500 sets the number of optimization iterations; more iterations mean more computation but usually a better-converged embedding.
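A minimal end-to-end sketch (my own illustration, not code from the article) that applies these settings to scikit-learn's small digits dataset and, as discussed above, tries several perplexity values before drawing any conclusion; variable names are illustrative:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)        # 1797 samples, 64 features

# One embedding is never enough: repeat with several perplexities.
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, random_state=0,
                perplexity=perplexity, n_iter=500)  # n_iter is named max_iter in newer scikit-learn
    X_2d = tsne.fit_transform(X)           # shape (1797, 2)
    plt.figure()
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
    plt.title(f"t-SNE on digits, perplexity={perplexity}")
plt.show()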
2. PCA
PCA is one of the most commonly used dimensionality-reduction techniques. It is used for machine-learning modeling and high-dimensional data visualization, and it uses a feature-selection style of technique: it keeps the most important features and removes all the less important ones. Variance plays a major role here: features along which the data has high spread are considered more important than features with less spread.
In PCA, features are converted into components, i.e. once all the duplicate and missing data have been removed, the features are known as components, which gives a precise model of the data for further use.
fig: PCA
Let's take an example of converting two-dimensional data, with features f1 and f2, into one-dimensional data.
fig: 4
According to fig: 4, the spread is high along the y-axis and low along the x-axis, and as we know, the direction with low variance is removed by PCA. Hence, in the process of converting 2-D into 1-D, we will lose feature “f1”. Now our new dataset will look like this;
Let's Analyse PCA in Detail:
fig: 5
As we see in fig: 5, the first image indicates that both features (f1 and f2) have the same variance; then how would we convert the 2-D dataset into a 1-D dataset?
If you observe the second image, you will see that I have rotated f1 in such a way that the projections of the points xi onto it have maximum variance. The rotated feature f1 is now known as f1', and we take f2' perpendicular to f1'.
Why have we taken f1 ⊥ f2?
We are rotating f1 towards the direction of maximum variance, and we cannot rotate f1 alone and leave f2 fixed; both must be rotated. We want a rotation in which the change to “f1” does not affect the relationship between “f1” and “f2”, and for that we keep f1 ⊥ f2. Now, when we rotate f1 by an angle “Θ”, f2 is also rotated by “Θ”, which results in no change in the relationship between f1' and f2'.
Now, after observing the above process, we can define PCA as follows;
We want to find f1' such that the variance of the points “xi” projected onto f1' is maximum.
Understanding projection in detail;
fig: 6
Here U1 = unit vector, ||U1|| = 1;
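Writing this out (my notation, based on the definitions above and assuming the data has been mean-centered): since ||U1|| = 1, the projection length of a point xi onto U1 is simply U1ᵀxi, so the definition of PCA above becomes the optimization

u_1^{*} = \arg\max_{\lVert u_1 \rVert = 1} \; \frac{1}{n} \sum_{i=1}^{n} \left( u_1^{T} x_i \right)^2

i.e. find the unit direction along which the projected points have maximum variance.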
Now, considering the definition of PCA, let's build a mathematical formulation from it;
PCA Using Distance Minimization:
Before using the distance-minimization technique, we have to consider the following;
All the features should be in the same units, so that calculated distances reflect comparable variation. For example, if we have two weight features, one in kilograms and one in pounds, the final outcome will mix different measurement scales.
So before applying the distance-minimization technique, we rescale the features to zero mean and unit variance, which makes all the features unit-independent; this process is called feature standardization.
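A small sketch of that standardization step (illustrative; it uses scikit-learn's StandardScaler, which is not mentioned in the article):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two weight-like features on very different scales (kg vs pounds)
X = np.array([[60.0, 132.0],
              [72.5, 159.8],
              [81.0, 178.6]])

X_std = StandardScaler().fit_transform(X)   # each column -> mean 0, std 1
print(X_std.mean(axis=0), X_std.std(axis=0))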
According to the figure:
d = perpendicular, ||xi|| = hypotenuse, proj = base; now we apply the Pythagorean theorem;
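Spelling that out (my reconstruction from the labels in the figure): the Pythagorean theorem gives, for each point,

d_i^2 = \lVert x_i \rVert^2 - \left( u_1^{T} x_i \right)^2

and since ||xi||² does not depend on u1, minimizing the total squared distance Σ d_i² over unit vectors u1 is equivalent to maximizing Σ (u1ᵀxi)², the variance-maximization objective from before; the two views of PCA therefore agree.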
Solution for the Optimization Problem (λ, V):
This section gives us an approach to solve any dimensionality-reduction problem using PCA. Here;
λ = eigenvalue and V = eigenvector.
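To connect this with the objective above (a standard PCA result, stated in my notation): maximizing u1ᵀ S u1 subject to ||u1|| = 1 with a Lagrange multiplier λ leads to

S u_1 = \lambda u_1, \qquad \operatorname{Var}\left( u_1^{T} x \right) = u_1^{T} S u_1 = \lambda

so the best direction u1 is the eigenvector of the covariance matrix S with the largest eigenvalue, and that eigenvalue is exactly the variance captured along it. The remaining components are the next eigenvectors in decreasing order of eigenvalue.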
Suppose we have a matrix dataset X;
STEP I:
Take the matrix dataset “X” and standardize it.
STEP II:
Compute the covariance matrix of the standardized matrix “X”. The covariance matrix is denoted by S.
STEP III:
Compute the eigenvalues and eigenvectors of S.
STEP IV:
This step decides how many dimensions you want to extract from the n-dimensional data. In my case, I am extracting 3 dimensions, hence;
CODE:
from sklearn.decomposition import PCA
df2 = PCA(n_components=3)
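For comparison, a minimal NumPy sketch of STEPS I–IV (my own illustration, not from the article; X is assumed to be a samples-by-features array):

import numpy as np

def pca_top_k(X, k=3):
    # STEP I: standardize each feature (zero mean, unit variance)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # STEP II: covariance matrix S of the standardized data
    S = np.cov(X_std, rowvar=False)

    # STEP III: eigenvalues and eigenvectors of S (eigh, since S is symmetric)
    eig_vals, eig_vecs = np.linalg.eigh(S)

    # STEP IV: keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(eig_vals)[::-1][:k]
    components = eig_vecs[:, order]

    # Project the standardized data onto the chosen components
    return X_std @ components

# Illustrative use on random data with 10 features
X = np.random.default_rng(0).normal(size=(100, 10))
print(pca_top_k(X, k=3).shape)   # (100, 3)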
Translated from: https://medium.com/swlh/dimensionality-reduction-novice-to-ninja-part1-fcbcb7f59d8c