Manifold Learning [t-SNE, LLE, Isomap, etc.] Made Easy
Principal Component Analysis is a powerful method, but it often fails because it assumes that the data can be modelled linearly. PCA expresses new features as linear combinations of existing ones, multiplying each by a coefficient. To address the limitations of PCA, various techniques have been created, each designed for data with a different specific structure. Manifold learning, however, seeks a method that can generalize to all structures of data.
Different structures of data refer to different attributes within the data. For instance, it may be linearly separable, or it may be very sparse. Relationships within the data may be tangent, parallel, enveloping, or orthogonal to one another. PCA works well on a very specific subset of data structures, since it operates on the assumption of linearity.
To put things in context, consider 300 by 300 pixel headshots. Under perfect conditions, each of the images would be centered perfectly, but in reality there are many additional degrees of freedom to consider, such as lighting or the tilt of the face. If we were to treat a headshot as a point in 90,000-dimensional space, changes such as tilting the head or looking in a different direction move that point nonlinearly through the space, even though it is the same object with the same class.
Created by author.
This kind of data appears very often in real-world datasets. In addition to this effect, PCA struggles when presented with skewed distributions, extreme values, and multiple dummy (one-hot encoded) variables (see Nonlinear PCA for a solution to this). There is a need for a generalizable method of dimensionality reduction.
Manifold learning refers to this very task. It includes many approaches you may have seen before, such as t-SNE and Locally Linear Embeddings (LLE). There are many articles and papers that go into the technical and mathematical details of these algorithms, but this one will focus on the general intuition and implementation.
Note that while there have been a few variants of dimensionality reduction that are supervised (e.g. Linear/Quadratic Discriminant Analysis), manifold learning generally refers to an unsupervised reduction, where the class is not presented to the algorithm (but may exist).
Whereas PCA attempts to create several linear hyperplanes to represent dimensions, much like multiple regression constructs an estimate of the data, manifold learning attempts to learn manifolds, which are smooth, curved surfaces within the multidimensional space. As in the image below, these are often formed by subtle transformations on an image that would otherwise fool PCA.
Source: Aren Jansen. Image free to share.
Then, ‘local linear patches’ that are tangent to the manifold can be extracted. These patches are usually abundant enough to represent the manifold accurately. Because the manifold is modelled not by any one mathematical function but by several small linear patches, these linear neighborhoods can model any manifold. Although this may not be exactly how certain algorithms approach the modelling of manifolds, the fundamental ideas are very similar.
The following are fundamental assumptions or aspects of manifold learning algorithms:
- There exist nonlinear relationships in the data that can be modelled through manifolds: surfaces that span multiple dimensions, are smooth, and are not too ‘wiggly’ (too complex). The manifolds are continuous.
- It is not important to maintain the multi-dimensional shape of the data. Instead of ‘flattening’ or ‘projecting’ it in specific directions (as with PCA) to maintain the general shape of the data, it is alright to perform more complex manipulations, like unfolding a coiled strip or flipping a sphere inside out.
- The best way to model manifolds is to treat the curved surface as being composed of several neighborhoods. If each data point preserves its distance not to all the other points but only to those close to it, the geometric relationships in the data can be maintained.
This idea can be understood well by comparing different approaches to unravelling this coiled dataset. On the left is a more PCA-like approach to preserving the shape of the data, where each point is connected to every other point. On the right, however, is an approach in which only the distances between neighboring data points are valued.
Source: Jake VanderPlas. Image free to share.
This relative disregard for points outside of a neighborhood leads to interesting results. For instance, consider this Swiss Roll dataset, which is coiled in three dimensions and reduced to a strip in two dimensions. In some scenarios, this effect would not be desirable. However, if this curve were the result of, say, the tilting of the camera in an image or external effects on audio quality, manifold learning does us a huge favor by delicately unravelling these complex nonlinear relationships.
Source: Data Science Heroes. Image free to share.
On the Swiss Roll dataset, PCA and even specialized variations like Kernel PCA fail to capture the gradient of values. Locally Linear Embeddings (LLE), a manifold learning algorithm, on the other hand, is able to.
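A minimal sketch of that comparison on a synthetic Swiss Roll generated with sklearn is shown below; the kernel width and neighbor count are illustrative choices, not tuned values.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

# Generate a 3-D Swiss Roll; t is the position along the roll, used only for coloring.
X, t = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

reducers = {
    "PCA": PCA(n_components=2),
    "Kernel PCA (RBF)": KernelPCA(n_components=2, kernel="rbf", gamma=0.04),
    "LLE": LocallyLinearEmbedding(n_neighbors=12, n_components=2),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, reducer) in zip(axes, reducers.items()):
    X_2d = reducer.fit_transform(X)
    # If the roll has been unrolled properly, the color varies smoothly across the strip.
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=t, cmap="viridis", s=5)
    ax.set_title(name)
plt.show()
```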
Source: Jennifer Chu. Image free to share.
Let’s get into more detail about three popular manifold learning algorithms: Isomap, Locally Linear Embeddings, and t-SNE.
One of the first explorations in manifold learning was the Isomap algorithm, short for Isometric Mapping. Isomap seeks a lower-dimensional representation that maintains ‘geodesic distances’ between the points. A geodesic distance is a generalization of distance to curved surfaces. Hence, instead of measuring pure Euclidean distance with the Pythagorean-theorem-derived distance formula, Isomap optimizes distances along a discovered manifold.
Source: sklearn. Image free to share.
Isomap performs better than PCA when trained on the MNIST dataset, showing a proper sectioning-off of the different types of digits. The proximity and distance between certain groups of digits reveals something about the structure of the data. For instance, the ‘5’ and the ‘3’ that lie close to each other (in the bottom left) indeed look similar.
Below is the implementation of Isomap in Python. Since MNIST is a very large dataset, you may want to train Isomap only on the first 100 training examples with .fit_transform(X[:100]).
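A minimal sketch of such an implementation with sklearn, using the built-in digits dataset as a lightweight stand-in for MNIST (the neighbor count here is an illustrative default):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

# 1,797 8x8 digit images, each flattened into a 64-dimensional row.
X, y = load_digits(return_X_y=True)

# Reduce to 2 dimensions while preserving geodesic (along-the-manifold) distances.
embedding = Isomap(n_neighbors=5, n_components=2)
X_iso = embedding.fit_transform(X)  # on full MNIST, try .fit_transform(X[:100]) first

# Color each embedded point by its digit class to inspect the grouping.
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y, cmap="tab10", s=10)
plt.colorbar(label="digit")
plt.show()
```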
Locally Linear Embeddings use a variety of tangent linear patches (as demonstrated in the diagram above) to model a manifold. It can be thought of as performing a PCA locally on each of these neighborhoods, producing a linear hyperplane, then comparing the results globally to find the best nonlinear embedding. The goal of LLE is to ‘unroll’ or ‘unpack’ the structure of the data in a distorted fashion, so LLE outputs often have a high density in the center, with rays extending outward.
Source: sklearn. Image free to share.
Note that LLE’s performance on the MNIST dataset is relatively poor. This is likely because the MNIST dataset consists of multiple manifolds, whereas LLE is designed to work on simpler datasets (like the Swiss Roll). It performs similarly to, or even worse than, PCA in this case. This makes sense; its ‘represent one function as several small linear ones’ strategy likely does not work well with large and complex dataset structures.
The implementation for LLE is as follows, assuming the dataset (X) has already been loaded.
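A minimal sketch of what that might look like with sklearn, assuming X holds the loaded feature matrix (the neighbor count is an illustrative choice):

```python
from sklearn.manifold import LocallyLinearEmbedding

# Model the manifold as locally linear neighborhoods of 10 points each,
# then find the 2-D embedding that best preserves those local reconstructions.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
X_lle = lle.fit_transform(X)

print(X_lle.shape)  # (n_samples, 2)
```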
t-SNE is one of the most popular choices for high-dimensional visualization, and stands for t-distributed Stochastic Neighbor Embedding. The algorithm converts relationships in the original space into t-distributions, which resemble normal distributions but arise with small sample sizes and relatively unknown standard deviations. This makes t-SNE very sensitive to the local structure, a common theme in manifold learning. It is considered the go-to visualization method because of the many advantages it possesses:
- It is able to reveal the structure of the data at many scales.
- It reveals data that lies in multiple manifolds and clusters.
- It has a smaller tendency to cluster points at the center.
Isomap and LLE are best used to unfold a single, continuous, low-dimensional manifold. On the other hand, t-SNE focuses on the local structure of the data and attempts to ‘extract’ clustered local groups instead of trying to ‘unroll’ or ‘unfold’ it. This gives t-SNE an upper hand in detangling high-dimensional data with multiple manifolds. It is trained using gradient descent and tries to minimize the Kullback-Leibler divergence between distributions. In this sense, it is almost like a simplified, unsupervised neural network.
t-SNE is very powerful because of this ‘clustering’ vs. ‘unrolling’ approach to manifold learning. With a high-dimensional and multiple-manifold dataset like MNIST, where rotations and shifts cause nonlinear relationships, t-SNE performs even better than LDA, which was given the labels.
Source: sklearn. Image free to share.
However, t-SNE does have some disadvantages:
- t-SNE is very computationally expensive (compare the runtimes in the diagrams above). It can take several hours on a million-sample dataset, whereas PCA can finish in seconds or minutes.
- The algorithm relies on randomness (it is stochastic) to pick seeds for constructing the embedding, which can increase its runtime and decrease performance if the seeds happen to be placed poorly.
- The global structure is not explicitly preserved (i.e. more emphasis is placed on clustering than on demonstrating global structure). However, in sklearn’s implementation this problem can be solved by initializing the points with PCA, which is built especially to preserve the global structure.
t-SNE can be implemented in sklearn as well:
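A minimal sketch, again assuming X is already loaded; the init='pca' argument reflects the point above about preserving global structure, and the perplexity and random seed are illustrative choices:

```python
from sklearn.manifold import TSNE

# PCA initialization helps preserve global structure; the stochastic
# gradient-descent optimization then refines the local neighborhoods.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_tsne = tsne.fit_transform(X)
```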
Laurens van der Maaten, t-SNE’s author, says to consider the following when t-SNE yields a poor output:
As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but t-SNE doesn’t, I am fairly sure you did something wrong.
Why does he say so? As an additional reminder to drive the point home, manifold learning is not another variation of PCA but a generalization of it. Something that performs well with PCA is almost guaranteed to perform well with t-SNE or another manifold learning technique, since they are generalizations.
Much like an object that is an apple is also a fruit (a generalization), something is usually wrong if a method does not yield a result similar to that of its generalization. On the other hand, if both methods fail, the data is probably inherently tricky to model.
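A minimal sketch of that sanity check, assuming X is the already-loaded feature matrix:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# If even a plain 2-D PCA projection shows no structure at all, the problem
# is likely the data itself rather than the t-SNE settings.
X_pca = PCA(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], s=5)
plt.show()
```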
Key Points
- PCA fails to model nonlinear relationships because it is linear.
- Nonlinear relationships often appear in datasets because external factors like lighting or tilt can move a data point of the same class nonlinearly through Euclidean space.
- Manifold learning attempts to generalize PCA to perform dimensionality reduction on all sorts of dataset structures, with the main idea that manifolds, or curved, continuous surfaces, should be modelled by preserving and prioritizing local over global distances.
- Isomap tries to preserve geodesic distance, or distance measured not in Euclidean space but along the curved surface of the manifold.
- Locally Linear Embeddings can be thought of as representing the manifold as several linear patches, on each of which PCA is performed.
- t-SNE takes more of an ‘extract’ approach as opposed to an ‘unrolling’ approach, but, like other manifold learning algorithms, it still prioritizes the preservation of local distances by using probabilities and t-distributions.
Additional Technical Reading
Isomap
Locally Linear Embedding
t-SNE
Thanks for reading!
Translated from: https://towardsdatascience.com/manifold-learning-t-sne-lle-isomap-made-easy-42cfd61f5183