Introduction to UMAP with Code Examples

Published: 2023/12/20

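Installation

    pip install umap-learn
    pip install umap-learn[plot]

UMAP includes a subpackage, umap.plot, for plotting the results of UMAP embeddings. It must be imported separately because it has extra requirements (matplotlib, datashader, and holoviews). It allows quick and simple plotting and tries to make sensible choices to avoid overplotting and other pitfalls.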

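Basic concepts:

Uniform Manifold Approximation and Projection (UMAP)
**Manifold:** A manifold is a space that locally has the properties of Euclidean space. Manifolds include curves and surfaces of any dimension, such as the surface of a sphere or a curved sheet. Every small neighborhood of a manifold is homeomorphic to Euclidean space, so local regions can be treated as Euclidean to simplify analysis.
**Riemannian manifold:** A differentiable manifold whose tangent space at every point is equipped with an inner product that varies smoothly from point to point.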
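
To make "locally Euclidean" concrete, the following is a small sketch that is not part of UMAP itself. On a unit sphere it compares the distance measured along the surface with the straight-line distance through space between the same two points. The function names are illustrative only.

```python
import math


def arc_distance(theta: float, radius: float = 1.0) -> float:
    """Distance along the sphere's surface between two points separated by angle theta (radians)."""
    return radius * theta


def chord_distance(theta: float, radius: float = 1.0) -> float:
    """Straight-line (Euclidean) distance through space between the same two points."""
    return 2.0 * radius * math.sin(theta / 2.0)


# The ratio approaches 1 as the two points get closer together.
ratios = {}
for theta in (0.01, 0.1, 1.0, math.pi):
    ratios[theta] = chord_distance(theta) / arc_distance(theta)
    print(f"angle = {theta:.2f} rad, chord/arc ratio = {ratios[theta]:.4f}")
```

For nearby points the two distances almost coincide, which is why a small patch of the sphere can be treated as flat. For points on opposite sides of the sphere they differ substantially.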

Differences from PCA and t-SNE:

https://pair-code.github.io/understanding-umap/

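The algorithm rests on three assumptions about the data:

The data is uniformly distributed on a Riemannian manifold;

The Riemannian metric is locally constant (or can be approximated as such);

The manifold is locally connected.

UMAP can be divided into two main steps:

Learning the manifold structure in the high-dimensional space;

Finding a low-dimensional representation of that manifold.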

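Step 1: Learning the manifold structure
1. Find the nearest neighbors, using the Nearest-Neighbor-Descent algorithm.
**Hyperparameter:** n_neighbors specifies how many nearest neighbors to use.
A small n_neighbors value gives a very local view that captures the fine detail of the structure. A larger value bases the estimate on a wider region, so it is more accurate across the manifold as a whole, at the cost of fine detail.

2. Build a graph by connecting each point to the nearest neighbors found in the previous step.
**Hyperparameter:** local_connectivity (default = 1) means that every point in the high-dimensional space is assumed to be connected to at least one other point.

One way to understand these two parameters is as a lower bound and an upper bound:
local_connectivity (default 1): each point is connected to at least one other point with 100% certainty (the lower bound on the number of connections).
n_neighbors (default 15): the chance that a point is directly connected to its 16th or any more distant neighbor is 0%, because those neighbors fall outside the local region UMAP uses when building the graph.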
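
The two bounds can be seen in a simplified sketch of Step 1. It is not the umap-learn implementation: it uses an exact brute-force neighbor search instead of Nearest-Neighbor-Descent, excludes the point itself from its neighbor list, and only handles integer local_connectivity. The weighting rule follows the UMAP paper's description as I understand it. Each distance is shifted by the distance to the nearest neighbor (rho), and a per-point scale (sigma) is found by binary search so that the weights sum to log2(k). The function names are illustrative.

```python
import math


def knn_distances(points, k):
    """Exact brute-force k-nearest-neighbor search; returns sorted (distance, index) pairs per point."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append(dists[:k])
    return neighbors


def membership_weights(sorted_dists, local_connectivity=1, n_iter=64):
    """Convert one point's sorted neighbor distances into edge weights in (0, 1]."""
    k = len(sorted_dists)  # assumes k >= 2
    rho = sorted_dists[local_connectivity - 1]  # neighbors up to this distance get weight 1
    target = math.log2(k)
    lo, hi = 0.0, 1000.0 * max(sorted_dists) + 1e-12
    sigma = hi
    for _ in range(n_iter):  # binary search: the weight sum grows as sigma grows
        sigma = (lo + hi) / 2.0
        total = sum(math.exp(-max(0.0, d - rho) / sigma) for d in sorted_dists)
        if total > target:
            hi = sigma
        else:
            lo = sigma
    return [math.exp(-max(0.0, d - rho) / sigma) for d in sorted_dists]


# Toy data: five points on a line, three neighbors per point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
k = 3
graph = {}
for i, nbrs in enumerate(knn_distances(points, k)):
    weights = membership_weights([d for d, _ in nbrs])
    graph[i] = {j: w for (_, j), w in zip(nbrs, weights)}

print(graph[0])
```

In this toy graph the nearest neighbor of point 0 receives weight 1 (the lower bound), the weights decay with distance, and point 4, which is not among the three nearest neighbors, has no edge at all (the upper bound).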

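Step 2: Finding a low-dimensional representation
Hyperparameter: min_dist (default = 0.1) defines the minimum distance between embedded points.
Cross-entropy is the cost function used to find the optimal edge weights in the low-dimensional representation. The optimal weights are those that minimize the cross-entropy, and the minimization can be carried out with stochastic gradient descent.

At that point UMAP's work is done: the result is an array holding the coordinates of each data point in the chosen low-dimensional space.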
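
The cross-entropy cost in Step 2 can be illustrated with a minimal sketch. It assumes the fuzzy-set form of cross-entropy from the UMAP paper, with one attractive term and one repulsive term per edge, and the low-dimensional similarity curve 1 / (1 + a * d^(2b)). Setting a = b = 1 is purely for illustration, and the function names are hypothetical.

```python
import math


def low_dim_weight(distance, a=1.0, b=1.0):
    """Similarity of two embedded points at a given distance: 1 / (1 + a * d^(2b))."""
    return 1.0 / (1.0 + a * distance ** (2.0 * b))


def fuzzy_cross_entropy(high_weights, low_weights, eps=1e-12):
    """Sum over edges of the attractive and repulsive cross-entropy terms."""
    total = 0.0
    for w_h, w_l in zip(high_weights, low_weights):
        w_l = min(max(w_l, eps), 1.0 - eps)  # keep the logarithms finite
        if w_h > 0.0:
            total += w_h * math.log(w_h / w_l)  # pulls strongly connected points together
        if w_h < 1.0:
            total += (1.0 - w_h) * math.log((1.0 - w_h) / (1.0 - w_l))  # pushes weak ones apart
    return total


# Edge weights learned in the high-dimensional space (made-up values).
high = [0.9, 0.5, 0.1]

# A layout whose embedded distances reproduce those weights, and one that reverses them.
good_distances = [1.0 / 3.0, 1.0, 3.0]
bad_distances = [3.0, 1.0, 1.0 / 3.0]

good_loss = fuzzy_cross_entropy(high, [low_dim_weight(d) for d in good_distances])
bad_loss = fuzzy_cross_entropy(high, [low_dim_weight(d) for d in bad_distances])
print(f"good layout: {good_loss:.6f}, bad layout: {bad_loss:.6f}")
```

The layout that places strongly connected points close together scores near zero, while the reversed layout is penalized. Stochastic gradient descent moves the embedded points in whichever direction lowers this value.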

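Example 1:

Use handwritten-digit (MNIST-style) data to separate the digits and show them in two dimensions. The code assumes scikit-learn's load_digits dataset, which matches the digits.data and digits.target names used in the plot:

    import umap
    from sklearn.datasets import load_digits

    X, y = load_digits(return_X_y=True)  # assumed data source
    reducer = umap.UMAP(random_state=42)
    X_trans = reducer.fit_transform(X)
    print(X_trans.shape)

Plotting

    import matplotlib.pyplot as plt
    import numpy as np
    import umap
    from sklearn.datasets import load_digits

    digits = load_digits()
    reducer = umap.UMAP(random_state=42)
    embedding = reducer.fit_transform(digits.data)
    print(embedding.shape)

    # Color each embedded point by its digit label
    plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
    plt.gca().set_aspect('equal', 'datalim')
    plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
    plt.title('UMAP projection of the Digits dataset')
    plt.show()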

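Parameter settings

n_components
Controls the number of dimensions after projection; the default is 2. When the data has many features, 2D may not be enough to preserve the underlying topological structure. Try values between 2 and 20 in steps of 5 and evaluate different baseline models to see how accuracy changes.
n_neighbors
Determines the number of neighboring points used in the local approximation of the manifold structure. Larger values preserve more global structure at the expense of detailed local structure. This parameter should usually be between 5 and 50, with 10 to 15 being a reasonable default.
min_dist
Controls how tightly the embedding is allowed to pack points together. Larger values make the embedded points more evenly distributed, while smaller values let the algorithm optimize more accurately for local structure. Sensible values range from 0.001 to 0.5, with 0.1 being a reasonable default.
metric
The formula used to compute distances between points; the default is euclidean. This determines how distance is measured in the input space. A wide variety of metrics are already implemented, and a user-defined function can be passed as long as it has been JIT-compiled with numba.

UMAP can consume a lot of memory, especially while fitting and while creating plots such as the connectivity plot. If memory is a problem, set low_memory to True.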

    reducer = umap.UMAP(
        n_neighbors=100,  # default 15. The size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation.
        n_components=3,  # default 2. The dimension of the space to embed into.
        metric='euclidean',  # default 'euclidean'. The metric to use to compute distances in high dimensional space.
        n_epochs=1000,  # default None. The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings.
        learning_rate=1.0,  # default 1.0. The initial learning rate for the embedding optimization.
        init='spectral',  # default 'spectral'. How to initialize the low dimensional embedding. Options are: {'spectral', 'random', a numpy array of initial embedding positions}.
        min_dist=0.1,  # default 0.1. The effective minimum distance between embedded points.
        spread=1.0,  # default 1.0. The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.
        low_memory=False,  # listed here as default False; recent umap-learn releases default to True. For some datasets the nearest neighbor computation can consume a lot of memory. If you find that UMAP is failing due to memory constraints consider setting this option to True.
        set_op_mix_ratio=1.0,  # default 1.0. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
        local_connectivity=1,  # default 1. The local connectivity required, i.e. the number of nearest neighbors that should be assumed to be connected at a local level.
        repulsion_strength=1.0,  # default 1.0. Weighting applied to negative samples in low dimensional embedding optimization.
        negative_sample_rate=5,  # default 5. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
        transform_queue_size=4.0,  # default 4.0. Larger values will result in slower performance but more accurate nearest neighbor evaluation.
        a=None,  # default None. More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
        b=None,  # default None. More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.
        random_state=42,  # default None. If int, random_state is the seed used by the random number generator.
        metric_kwds=None,  # default None. Arguments to pass on to the metric, such as the ``p`` value for Minkowski distance.
        angular_rp_forest=False,  # default False. Whether to use an angular random projection forest to initialise the approximate nearest neighbor search.
        target_n_neighbors=-1,  # default -1. The number of nearest neighbors to use to construct the target simplicial set. If set to -1 use the ``n_neighbors`` value.
        # target_metric='categorical',  # default 'categorical'. The metric used to measure distance for a target array when using supervised dimension reduction. By default this is 'categorical', which will measure distance in terms of whether categories match or are different.
        # target_metric_kwds=None,  # dict, default None. Keyword arguments to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.
        # target_weight=0.5,  # default 0.5. Weighting factor between data topology and target topology.
        transform_seed=42,  # default 42. Random seed used for the stochastic aspects of the transform operation.
        verbose=False,  # default False. Controls verbosity of logging.
        unique=False,  # default False. Controls if the rows of your data should be uniqued before being embedded.
    )
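
The comments on a and b say these values are derived from min_dist and spread when left as None. The sketch below shows one way that relationship can work, assuming the approach described for UMAP: build a target curve that stays at 1 up to min_dist and then decays exponentially at a rate set by spread, then pick a and b so that 1 / (1 + a * d^(2b)) matches it. The coarse grid search is a stand-in for a proper least-squares fit and is not the library's code. All function names are illustrative.

```python
import math


def target_curve(d, min_dist=0.1, spread=1.0):
    """Desired similarity of embedded points: flat at 1 up to min_dist, then exponential decay."""
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)


def smooth_curve(d, a, b):
    """Differentiable curve used in place of the target during optimization."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))


def fit_ab(min_dist=0.1, spread=1.0, n_points=60):
    """Coarse grid search for the (a, b) pair with the smallest squared error."""
    xs = [3.0 * spread * i / (n_points - 1) for i in range(n_points)]
    ys = [target_curve(x, min_dist, spread) for x in xs]
    best = None
    for ai in range(1, 101):  # a from 0.05 to 5.00
        a = ai * 0.05
        for bi in range(10, 76):  # b from 0.20 to 1.50
            b = bi * 0.02
            err = sum((smooth_curve(x, a, b) - y) ** 2 for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]


a_tight, b_tight = fit_ab(min_dist=0.1)
a_loose, b_loose = fit_ab(min_dist=0.5)
print(f"min_dist=0.1 -> a={a_tight:.2f}, b={b_tight:.2f}")
print(f"min_dist=0.5 -> a={a_loose:.2f}, b={b_loose:.2f}")
```

A larger min_dist widens the flat region of the target curve, so the fitted curve falls off later and embedded points are allowed to sit further apart before they start attracting each other less.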

Drawing a 3D plot with plotly

    import numpy as np
    import pandas as pd
    import plotly.express as px

    def chart_plotly(X, y):
        # --------------------------------------------------------------------------#
        # This section is not mandatory; its purpose is to sort the data by label
        # so we can maintain consistent colors for digits across multiple graphs

        # Concatenate X and y arrays
        arr_concat = np.concatenate((X, y.reshape(y.shape[0], 1)), axis=1)
        # Create a Pandas dataframe using the above array
        df = pd.DataFrame(arr_concat, columns=['x', 'y', 'z', 'label'])
        # Convert label data type from float to integer
        df['label'] = df['label'].astype(int)
        # Finally, sort the dataframe by label
        df.sort_values(by='label', axis=0, ascending=True, inplace=True)
        # --------------------------------------------------------------------------#

        # Create a 3D graph
        fig = px.scatter_3d(df, x='x', y='y', z='z', color=df['label'].astype(str), height=900, width=950)

        # Update chart looks
        fig.update_layout(
            title_text='UMAP',
            showlegend=True,
            legend=dict(orientation="h", yanchor="top", y=0, xanchor="center", x=0.5),
            scene_camera=dict(up=dict(x=0, y=0, z=1),
                              center=dict(x=0, y=0, z=-0.1),
                              eye=dict(x=1.5, y=-1.4, z=0.5)),
            margin=dict(l=0, r=0, b=0, t=0),
            scene=dict(
                xaxis=dict(backgroundcolor='white', color='black', gridcolor='#f0f0f0',
                           title_font=dict(size=10), tickfont=dict(size=10)),
                yaxis=dict(backgroundcolor='white', color='black', gridcolor='#f0f0f0',
                           title_font=dict(size=10), tickfont=dict(size=10)),
                zaxis=dict(backgroundcolor='lightgrey', color='black', gridcolor='#f0f0f0',
                           title_font=dict(size=10), tickfont=dict(size=10)),
            ),
        )

        # Update marker size
        fig.update_traces(marker=dict(size=3, line=dict(color='black', width=0.1)))
        fig.show()

    # Set n_components=3 in the reducer before running this
    X_trans = reducer.fit_transform(X)
    # Check the shape of the new data
    print('Shape of X_trans: ', X_trans.shape)
    chart_plotly(X_trans, y)
