Python Programming Language Learning: A Detailed Guide to the TSNE Class in sklearn.manifold (Overview, Usage, and Code Implementation)
Contents
Introduction to Manifold Learning
Introduction to TSNE: Dimensionality Reduction and Visualization
How to Use TSNE
TSNE Code Implementation
Introduction to Manifold Learning
Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.
High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way. The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost. To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant Analysis, and others. These algorithms define specific rubrics to choose an "interesting" linear projection of the data. These methods can be powerful, but often miss important non-linear structure in the data.
Manifold learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.
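To make the contrast concrete, the following minimal sketch (an illustrative addition, not from the original article) projects scikit-learn's S-curve data set, which lies on a curved two-dimensional surface embedded in 3-D, with both PCA and t-SNE. A linear projection can only rotate the curve, while the non-linear embedding can unroll it.

import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Toy data lying on a curved 2-D manifold in 3-D space
X, color = make_s_curve(n_samples=1000, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                      # linear projection
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)    # non-linear embedding

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=color, s=5)
axes[0].set_title("PCA (linear)")
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=color, s=5)
axes[1].set_title("t-SNE (non-linear)")
plt.show()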
Introduction to TSNE: Dimensionality Reduction and Visualization
t-distributed Stochastic Neighbor Embedding (t-SNE) is a tool for visualizing high-dimensional data. It converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. t-SNE has a non-convex cost function, which means that different initializations can produce different results. If the number of features is very high, it is strongly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to first reduce the number of dimensions to a reasonable amount (e.g. 50). This suppresses some noise and speeds up the computation of pairwise distances between samples.
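A hedged sketch of that recommendation follows; the feature matrix (500 samples with 784 random features) is an assumption made purely for illustration. It first compresses the data to 50 principal components with PCA and only then runs t-SNE on the result.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.rand(500, 784)                         # stand-in for a dense, high-dimensional data set

X_50 = PCA(n_components=50).fit_transform(X)   # suppress noise, speed up pairwise distances
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_50)
print(X_2d.shape)                              # (500, 2)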
t-SNE is currently one of the most effective methods for dimensionality reduction combined with visualization, but its drawbacks are equally clear: it consumes a lot of memory and has a long running time. Even so, when we want to classify high-dimensional data and are unsure whether the data set is well separable (i.e. small gaps within classes and large gaps between classes), we can project it into a 2- or 3-dimensional space with t-SNE and inspect it. If the projection is separable in the low-dimensional space, the data are separable; if it is not separable in the low-dimensional space, the data may genuinely be inseparable, or it may simply be impossible to project them into a low-dimensional space. The principle of t-SNE is to convert similarities between data points into probabilities: similarity in the original space is represented by Gaussian joint probabilities, while similarity in the embedded space is represented by the Student's t-distribution.
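A quick way to perform that check is to color a 2-D t-SNE embedding by class label. The sketch below uses scikit-learn's digits data set as an assumed example; well-separated colored clusters suggest the classes are separable.

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()                          # 1797 samples, 64 features, 10 classes
X_2d = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='tab10', s=5)
plt.colorbar(label='digit class')
plt.title('t-SNE projection of the digits data set')
plt.show()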
Reference: https://www.deeplearn.me/2137.html
How to Use TSNE
from sklearn.manifold import TSNE

visual_model = TSNE(metric='precomputed', perplexity=10)  # t-distributed stochastic neighbor embedding
visual = visual_model.fit_transform(dis)
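Here dis must already be a square (n_samples, n_samples) distance matrix, since metric='precomputed' tells TSNE to skip the distance computation. As a hedged illustration (the feature matrix X, its shape, and the cosine metric below are assumptions, not part of the original snippet), such a matrix could be produced with sklearn.metrics.pairwise_distances. Note that init='pca' cannot be combined with precomputed distances, so random initialization is used.

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.rand(200, 30)                            # assumed raw feature matrix

dis = pairwise_distances(X, metric='cosine')     # square (n_samples, n_samples) distance matrix
visual_model = TSNE(metric='precomputed', perplexity=10, init='random', random_state=0)
visual = visual_model.fit_transform(dis)         # 2-D embedding computed from the distances only
print(visual.shape)                              # (200, 2)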
TSNE Code Implementation
class TSNE, found at: sklearn.manifold._t_sne

class TSNE(BaseEstimator):
    """t-distributed Stochastic Neighbor Embedding.

    t-SNE [1] is a tool to visualize high-dimensional data. It converts
    similarities between data points to joint probabilities and tries
    to minimize the Kullback-Leibler divergence between the joint
    probabilities of the low-dimensional embedding and the
    high-dimensional data. t-SNE has a cost function that is not convex,
    i.e. with different initializations we can get different results.

    It is highly recommended to use another dimensionality reduction
    method (e.g. PCA for dense data or TruncatedSVD for sparse data)
    to reduce the number of dimensions to a reasonable amount (e.g. 50)
    if the number of features is very high. This will suppress some
    noise and speed up the computation of pairwise distances between
    samples. For more tips see Laurens van der Maaten's FAQ [2].

    Read more in the :ref:`User Guide <t_sne>`.

    Parameters
    ----------
    n_components : int, optional (default: 2)
        Dimension of the embedded space.

    perplexity : float, optional (default: 30)
        The perplexity is related to the number of nearest neighbors that
        is used in other manifold learning algorithms. Larger datasets
        usually require a larger perplexity. Consider selecting a value
        between 5 and 50. Different values can result in significanlty
        different results.

    early_exaggeration : float, optional (default: 12.0)
        Controls how tight natural clusters in the original space are in
        the embedded space and how much space will be between them. For
        larger values, the space between natural clusters will be larger
        in the embedded space. Again, the choice of this parameter is not
        very critical. If the cost function increases during initial
        optimization, the early exaggeration factor or the learning rate
        might be too high.

    learning_rate : float, optional (default: 200.0)
        The learning rate for t-SNE is usually in the range [10.0, 1000.0].
        If the learning rate is too high, the data may look like a 'ball'
        with any point approximately equidistant from its nearest
        neighbours. If the learning rate is too low, most points may look
        compressed in a dense cloud with few outliers. If the cost function
        gets stuck in a bad local minimum increasing the learning rate
        may help.

    n_iter : int, optional (default: 1000)
        Maximum number of iterations for the optimization. Should be at
        least 250.

    n_iter_without_progress : int, optional (default: 300)
        Maximum number of iterations without progress before we abort the
        optimization, used after 250 initial iterations with early
        exaggeration. Note that progress is only checked every 50
        iterations so this value is rounded to the next multiple of 50.

        .. versionadded:: 0.17
           parameter *n_iter_without_progress* to control stopping criteria.

    min_grad_norm : float, optional (default: 1e-7)
        If the gradient norm is below this threshold, the optimization will
        be stopped.

    metric : string or callable, optional
        The metric to use when calculating distance between instances in a
        feature array. If metric is a string, it must be one of the options
        allowed by scipy.spatial.distance.pdist for its metric parameter, or
        a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS.
        If metric is "precomputed", X is assumed to be a distance matrix.
        Alternatively, if metric is a callable function, it is called on
        each pair of instances (rows) and the resulting value recorded. The
        callable should take two arrays from X as input and return a value
        indicating the distance between them. The default is "euclidean"
        which is interpreted as squared euclidean distance.

    init : string or numpy array, optional (default: "random")
        Initialization of embedding. Possible options are 'random', 'pca',
        and a numpy array of shape (n_samples, n_components).
        PCA initialization cannot be used with precomputed distances and is
        usually more globally stable than random initialization.

    verbose : int, optional (default: 0)
        Verbosity level.

    random_state : int, RandomState instance, default=None
        Determines the random number generator. Pass an int for
        reproducible results across multiple function calls. Note that
        different initializations might result in different local minima of
        the cost function. See :term:`Glossary <random_state>`.

    method : string (default: 'barnes_hut')
        By default the gradient calculation algorithm uses Barnes-Hut
        approximation running in O(NlogN) time. method='exact'
        will run on the slower, but exact, algorithm in O(N^2) time. The
        exact algorithm should be used when nearest-neighbor errors need
        to be better than 3%. However, the exact method cannot scale to
        millions of examples.

        .. versionadded:: 0.17
           Approximate optimization *method* via the Barnes-Hut.

    angle : float (default: 0.5)
        Only used if method='barnes_hut'
        This is the trade-off between speed and accuracy for Barnes-Hut
        T-SNE. 'angle' is the angular size (referred to as theta in [3]) of
        a distant node as measured from a point. If this size is below
        'angle' then it is used as a summary node of all points contained
        within it. This method is not very sensitive to changes in this
        parameter in the range of 0.2 - 0.8. Angle less than 0.2 has
        quickly increasing computation time and angle greater 0.8 has
        quickly increasing error.

    n_jobs : int or None, optional (default=None)
        The number of parallel jobs to run for neighbors search. This
        parameter has no impact when ``metric="precomputed"`` or
        (``metric="euclidean"`` and ``method="exact"``).
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

        .. versionadded:: 0.22

    Attributes
    ----------
    embedding_ : array-like, shape (n_samples, n_components)
        Stores the embedding vectors.

    kl_divergence_ : float
        Kullback-Leibler divergence after optimization.

    n_iter_ : int
        Number of iterations run.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.manifold import TSNE
    >>> X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
    >>> X_embedded = TSNE(n_components=2).fit_transform(X)
    >>> X_embedded.shape
    (4, 2)

    References
    ----------
    [1] van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional
        Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605,
        2008.

    [2] van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding
        https://lvdmaaten.github.io/tsne/

    [3] L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based
        Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245,
        2014. https://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf
    """
    # Control the number of exploration iterations with early_exaggeration on
    _EXPLORATION_N_ITER = 250

    # Control the number of iterations between progress checks
    _N_ITER_CHECK = 50

    @_deprecate_positional_args
    def __init__(self, n_components=2, *, perplexity=30.0,
                 early_exaggeration=12.0, learning_rate=200.0, n_iter=1000,
                 n_iter_without_progress=300, min_grad_norm=1e-7,
                 metric="euclidean", init="random", verbose=0,
                 random_state=None, method='barnes_hut', angle=0.5,
                 n_jobs=None):
        self.n_components = n_components
        self.perplexity = perplexity
        self.early_exaggeration = early_exaggeration
        self.learning_rate = learning_rate
        self.n_iter = n_iter
        self.n_iter_without_progress = n_iter_without_progress
        self.min_grad_norm = min_grad_norm
        self.metric = metric
        self.init = init
        self.verbose = verbose
        self.random_state = random_state
        self.method = method
        self.angle = angle
        self.n_jobs = n_jobs

    def _fit(self, X, skip_num_points=0):
        """Private function to fit the model using X as training data."""
        if self.method not in ['barnes_hut', 'exact']:
            raise ValueError("'method' must be 'barnes_hut' or 'exact'")
        if self.angle < 0.0 or self.angle > 1.0:
            raise ValueError("'angle' must be between 0.0 - 1.0")
        if self.method == 'barnes_hut':
            X = self._validate_data(X, accept_sparse=['csr'],
                                    ensure_min_samples=2,
                                    dtype=[np.float32, np.float64])
        else:
            X = self._validate_data(X, accept_sparse=['csr', 'csc', 'coo'],
                                    dtype=[np.float32, np.float64])
        if self.metric == "precomputed":
            if isinstance(self.init, str) and self.init == 'pca':
                raise ValueError("The parameter init=\"pca\" cannot be "
                                 "used with metric=\"precomputed\".")
            if X.shape[0] != X.shape[1]:
                raise ValueError("X should be a square distance matrix")
            check_non_negative(X, "TSNE.fit(). With metric='precomputed', X "
                                  "should contain positive distances.")
            if self.method == "exact" and issparse(X):
                raise TypeError(
                    'TSNE with method="exact" does not accept sparse '
                    'precomputed distance matrix. Use method="barnes_hut" '
                    'or provide the dense distance matrix.')
        if self.method == 'barnes_hut' and self.n_components > 3:
            raise ValueError("'n_components' should be inferior to 4 for the "
                             "barnes_hut algorithm as it relies on "
                             "quad-tree or oct-tree.")
        random_state = check_random_state(self.random_state)

        if self.early_exaggeration < 1.0:
            raise ValueError("early_exaggeration must be at least 1, but is {}"
                             .format(self.early_exaggeration))

        if self.n_iter < 250:
            raise ValueError("n_iter should be at least 250")

        n_samples = X.shape[0]

        neighbors_nn = None
        if self.method == "exact":
            # Retrieve the distance matrix, either using the precomputed one
            # or computing it.
            if self.metric == "precomputed":
                distances = X
            else:
                if self.verbose:
                    print("[t-SNE] Computing pairwise distances...")

                if self.metric == "euclidean":
                    distances = pairwise_distances(X, metric=self.metric,
                                                   squared=True)
                else:
                    distances = pairwise_distances(X, metric=self.metric,
                                                   n_jobs=self.n_jobs)

                if np.any(distances < 0):
                    raise ValueError("All distances should be positive, the "
                                     "metric given is not correct")

            # compute the joint probability distribution for the input space
            P = _joint_probabilities(distances, self.perplexity, self.verbose)
            assert np.all(np.isfinite(P)), \
                "All probabilities should be finite"
            assert np.all(P >= 0), "All probabilities should be non-negative"
            assert np.all(P <= 1), ("All probabilities should be less "
                                    "or then equal to one")
        else:
            # Compute the number of nearest neighbors to find.
            # LvdM uses 3 * perplexity as the number of neighbors.
            # In the event that we have very small # of points
            # set the neighbors to n - 1.
            n_neighbors = min(n_samples - 1, int(3. * self.perplexity + 1))

            if self.verbose:
                print("[t-SNE] Computing {} nearest neighbors..."
                      .format(n_neighbors))

            # Find the nearest neighbors for every point
            knn = NearestNeighbors(algorithm='auto',
                                   n_jobs=self.n_jobs,
                                   n_neighbors=n_neighbors,
                                   metric=self.metric)
            t0 = time()
            knn.fit(X)
            duration = time() - t0
            if self.verbose:
                print("[t-SNE] Indexed {} samples in {:.3f}s..."
                      .format(n_samples, duration))

            t0 = time()
            distances_nn = knn.kneighbors_graph(mode='distance')
            duration = time() - t0
            if self.verbose:
                print("[t-SNE] Computed neighbors for {} samples "
                      "in {:.3f}s...".format(n_samples, duration))

            # Free the memory used by the ball_tree
            del knn

            if self.metric == "euclidean":
                # knn return the euclidean distance but we need it squared
                # to be consistent with the 'exact' method. Note that the
                # the method was derived using the euclidean method as in the
                # input space. Not sure of the implication of using a
                # different metric.
                distances_nn.data **= 2

            # compute the joint probability distribution for the input space
            P = _joint_probabilities_nn(distances_nn, self.perplexity,
                                        self.verbose)

        if isinstance(self.init, np.ndarray):
            X_embedded = self.init
        elif self.init == 'pca':
            pca = PCA(n_components=self.n_components,
                      svd_solver='randomized',
                      random_state=random_state)
            X_embedded = pca.fit_transform(X).astype(np.float32, copy=False)
        elif self.init == 'random':
            # The embedding is initialized with iid samples from Gaussians
            # with standard deviation 1e-4.
            X_embedded = 1e-4 * random_state.randn(
                n_samples, self.n_components).astype(np.float32)
        else:
            raise ValueError("'init' must be 'pca', 'random', or "
                             "a numpy array")

        # Degrees of freedom of the Student's t-distribution. The suggestion
        # degrees_of_freedom = n_components - 1 comes from
        # "Learning a Parametric Embedding by Preserving Local Structure"
        # Laurens van der Maaten, 2009.
        degrees_of_freedom = max(self.n_components - 1, 1)

        return self._tsne(P, degrees_of_freedom, n_samples,
                          X_embedded=X_embedded,
                          neighbors=neighbors_nn,
                          skip_num_points=skip_num_points)

    def _tsne(self, P, degrees_of_freedom, n_samples, X_embedded,
              neighbors=None, skip_num_points=0):
        """Runs t-SNE."""
        # t-SNE minimizes the Kullback-Leiber divergence of the Gaussians P
        # and the Student's t-distributions Q. The optimization algorithm
        # that we use is batch gradient descent with two stages:
        # * initial optimization with early exaggeration and momentum at 0.5
        # * final optimization with momentum at 0.8
        params = X_embedded.ravel()

        opt_args = {
            "it": 0,
            "n_iter_check": self._N_ITER_CHECK,
            "min_grad_norm": self.min_grad_norm,
            "learning_rate": self.learning_rate,
            "verbose": self.verbose,
            "kwargs": dict(skip_num_points=skip_num_points),
            "args": [P, degrees_of_freedom, n_samples, self.n_components],
            "n_iter_without_progress": self._EXPLORATION_N_ITER,
            "n_iter": self._EXPLORATION_N_ITER,
            "momentum": 0.5,
        }
        if self.method == 'barnes_hut':
            obj_func = _kl_divergence_bh
            opt_args['kwargs']['angle'] = self.angle
            # Repeat verbose argument for _kl_divergence_bh
            opt_args['kwargs']['verbose'] = self.verbose
            # Get the number of threads for gradient computation here to
            # avoid recomputing it at each iteration.
            opt_args['kwargs']['num_threads'] = _openmp_effective_n_threads()
        else:
            obj_func = _kl_divergence

        # Learning schedule (part 1): do 250 iteration with lower momentum
        # but higher learning rate controlled via the early exaggeration
        # parameter
        P *= self.early_exaggeration
        params, kl_divergence, it = _gradient_descent(obj_func, params,
                                                      **opt_args)
        if self.verbose:
            print("[t-SNE] KL divergence after %d iterations with early "
                  "exaggeration: %f" % (it + 1, kl_divergence))

        # Learning schedule (part 2): disable early exaggeration and finish
        # optimization with a higher momentum at 0.8
        P /= self.early_exaggeration
        remaining = self.n_iter - self._EXPLORATION_N_ITER
        if it < self._EXPLORATION_N_ITER or remaining > 0:
            opt_args['n_iter'] = self.n_iter
            opt_args['it'] = it + 1
            opt_args['momentum'] = 0.8
            opt_args['n_iter_without_progress'] = self.n_iter_without_progress
            params, kl_divergence, it = _gradient_descent(obj_func, params,
                                                          **opt_args)

        # Save the final number of iterations
        self.n_iter_ = it

        if self.verbose:
            print("[t-SNE] KL divergence after %d iterations: %f"
                  % (it + 1, kl_divergence))

        X_embedded = params.reshape(n_samples, self.n_components)
        self.kl_divergence_ = kl_divergence

        return X_embedded

    def fit_transform(self, X, y=None):
        """Fit X into an embedded space and return that transformed output.

        Parameters
        ----------
        X : array, shape (n_samples, n_features) or (n_samples, n_samples)
            If the metric is 'precomputed' X must be a square distance
            matrix. Otherwise it contains a sample per row. If the method
            is 'exact', X may be a sparse matrix of type 'csr', 'csc'
            or 'coo'. If the method is 'barnes_hut' and the metric is
            'precomputed', X may be a precomputed sparse graph.

        y : Ignored

        Returns
        -------
        X_new : array, shape (n_samples, n_components)
            Embedding of the training data in low-dimensional space.
        """
        embedding = self._fit(X)
        self.embedding_ = embedding
        return self.embedding_

    def fit(self, X, y=None):
        """Fit X into an embedded space.

        Parameters
        ----------
        X : array, shape (n_samples, n_features) or (n_samples, n_samples)
            If the metric is 'precomputed' X must be a square distance
            matrix. Otherwise it contains a sample per row. If the method
            is 'exact', X may be a sparse matrix of type 'csr', 'csc'
            or 'coo'. If the method is 'barnes_hut' and the metric is
            'precomputed', X may be a precomputed sparse graph.

        y : Ignored
        """
        self.fit_transform(X)
        return self
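To tie the listing together, here is a small end-to-end sketch (the digits data set and the 500-sample subset are assumptions made for illustration) that exercises the fitted attributes documented above, embedding_, kl_divergence_ and n_iter_, and contrasts the default Barnes-Hut approximation with method='exact'.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:500]                             # assumed sample data, subsampled for speed

tsne_bh = TSNE(method='barnes_hut', random_state=0)      # O(N log N) gradient, the default
emb = tsne_bh.fit_transform(X)
print(emb.shape)                                         # (500, 2), same as tsne_bh.embedding_.shape
print(tsne_bh.kl_divergence_)                            # final KL divergence after optimization
print(tsne_bh.n_iter_)                                   # number of gradient-descent iterations run

tsne_exact = TSNE(method='exact', random_state=0)        # O(N^2) gradient, no approximation error
emb_exact = tsne_exact.fit_transform(X)                  # noticeably slower on larger data sets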