UMAP: Introduction and Code Examples
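Installation

    pip install umap-learn
    pip install umap-learn[plot]

UMAP includes a subpackage, umap.plot, for plotting the results of UMAP embeddings. This subpackage must be imported separately because it has extra requirements (matplotlib, datashader and holoviews). It allows quick and simple plotting and tries to make sensible decisions to avoid overplotting and other pitfalls.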
Basic concepts:
Uniform Manifold Approximation and Projection (UMAP)
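**Manifold:** A manifold is a space that locally has the properties of Euclidean space. Examples include curves and surfaces of various dimensions, such as a sphere or a curved plane. Every small neighborhood of a manifold is homeomorphic to Euclidean space, so each local patch can be treated as Euclidean space, which makes the manifold easier to study.
**Riemannian manifold:** A differentiable manifold equipped with a Euclidean inner product on the tangent space at every point, chosen so that it varies smoothly from point to point.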
Differences from PCA and t-SNE:
https://pair-code.github.io/understanding-umap/
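The algorithm is based on three assumptions about the data, as stated in the UMAP documentation:
1. The data is uniformly distributed on a Riemannian manifold.
2. The Riemannian metric is locally constant (or can be approximated as such).
3. The manifold is locally connected.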
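UMAP can be divided into two main steps:
Step 1: Learning the manifold structure
1. Finding the nearest neighbors: the Nearest-Neighbor-Descent algorithm
**Hyperparameter:** n_neighbors specifies how many nearest neighbors to use.
A small n_neighbors value gives a very local view of the data that accurately captures fine detail of the structure. A larger n_neighbors value bases the estimates on larger regions, so they are more broadly accurate across the manifold as a whole, at the cost of fine detail.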
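The effect of n_neighbors can be illustrated with a small sketch. Note the assumption: UMAP itself uses the approximate Nearest-Neighbor-Descent algorithm, whereas the code below swaps in an exact brute-force search, which is only practical for tiny inputs. The function name `k_nearest_neighbors` and the toy points are hypothetical and are not part of umap-learn.

```python
import math

def k_nearest_neighbors(points, k):
    """For each point, return (indices, distances) of its k nearest
    neighbors, excluding the point itself, sorted by increasing distance.

    Brute-force stand-in for Nearest-Neighbor-Descent (illustration only).
    """
    neighbors = []
    for i, p in enumerate(points):
        # Exact Euclidean distance from point i to every other point.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        nearest = dists[:k]
        neighbors.append(([j for _, j in nearest], [d for d, _ in nearest]))
    return neighbors

# Toy data: two tight pairs of points that are far away from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

local_view = k_nearest_neighbors(points, k=1)  # very local: closest point only
wider_view = k_nearest_neighbors(points, k=3)  # wider: reaches the other pair

print(local_view[0])
print(wider_view[0])
```

With k=1 each point only sees its partner in the same pair, while k=3 also pulls in the distant pair, which is the trade-off between local detail and global structure described above.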
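2. Constructing a graph: the graph is built by connecting each point to the nearest neighbors identified in the previous step.
**Hyperparameter:** local_connectivity (default = 1) means that every point in the high-dimensional space is assumed to be connected to at least one other point.
One way to understand these two parameters is to treat them as a lower bound and an upper bound:
local_connectivity (default 1): each point is connected to at least one other point with 100% certainty (the lower bound on the number of connections)
n_neighbors (default 15): the chance that a point is directly connected to its 16th or any more distant neighbor is 0%, because those neighbors fall outside the local region UMAP uses when building the graph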
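The lower and upper bounds above can be made concrete with a simplified sketch of how edge weights are assigned for a single point. It assumes the weighting scheme described in the UMAP paper: the distance `rho` to the nearest neighbor is subtracted so that this neighbor gets weight 1.0, and a scale `sigma` is chosen so that the weights sum to log2(k). The function name `membership_weights` is hypothetical, and the library's own implementation differs in details (for example, it also symmetrizes the graph afterwards).

```python
import math

def membership_weights(knn_dists, local_connectivity=1, n_iter=64):
    """Turn one point's sorted k-nearest-neighbor distances into edge weights.

    rho is the distance to the `local_connectivity`-th neighbor, so every
    neighbor at or within rho gets weight exactly 1.0 (the lower bound).
    sigma is found by bisection so that the weights sum to log2(k).
    Points beyond the k-th neighbor are never passed in, so they get no edge
    at all (the upper bound).
    """
    k = len(knn_dists)
    rho = knn_dists[local_connectivity - 1]
    target = math.log2(k)

    def weights(sigma):
        return [math.exp(-max(0.0, d - rho) / sigma) for d in knn_dists]

    lo, hi = 1e-6, 1.0
    # Grow the upper end of the bracket until the weight sum reaches the target.
    while sum(weights(hi)) < target:
        hi *= 2.0
    # The weight sum increases with sigma, so plain bisection is enough.
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if sum(weights(mid)) < target:
            lo = mid
        else:
            hi = mid
    return weights(hi)

# Distances from one point to its 5 nearest neighbors (toy values).
dists = [0.5, 0.8, 1.0, 1.5, 2.0]
w = membership_weights(dists)
print([round(x, 3) for x in w])
```

The first weight is always 1.0 and the remaining weights decay with distance, so the nearest neighbor is connected with certainty while more distant neighbors are connected only weakly.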
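Step 2: Finding a low-dimensional representation
Hyperparameter: min_dist (default = 0.1) defines the effective minimum distance between points in the embedding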
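What "minimum distance" means in practice can be sketched as follows. The assumption here, taken from the UMAP paper and reference implementation, is that min_dist and spread define a target similarity curve, and that the smooth family 1 / (1 + a * d^(2b)) is fitted to it to obtain the a and b parameters. The function names and the values passed for `a` and `b` below are illustrative placeholders, not values produced by the library.

```python
import math

def target_similarity(d, min_dist=0.1, spread=1.0):
    """Target curve: full similarity up to min_dist, then exponential decay."""
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)

def low_dim_similarity(d, a, b):
    """Smooth family 1 / (1 + a * d^(2b)) used for weights in the embedding."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

# Compare a small and a large min_dist at a few embedding distances.
for d in (0.05, 0.3, 1.0, 2.0):
    tight = target_similarity(d, min_dist=0.1)
    loose = target_similarity(d, min_dist=0.5)
    print(f"d={d}: min_dist=0.1 -> {tight:.3f}, min_dist=0.5 -> {loose:.3f}")

# Placeholder a and b values, chosen only to show the shape of the family.
print(low_dim_similarity(0.0, a=1.5, b=0.9), low_dim_similarity(1.0, a=1.5, b=0.9))
```

Points closer than min_dist are already treated as fully similar, so the optimizer gains nothing by pulling them closer; a larger min_dist therefore spreads points out more evenly.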
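Cross-entropy: UMAP finds the optimal edge weights in the low-dimensional representation by minimizing a cross-entropy cost function between the high-dimensional graph and the low-dimensional graph. This minimization can be carried out with stochastic gradient descent.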
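For reference, the cross-entropy cost given in the UMAP paper can be written as below. The symbols are notation introduced here for illustration: $E$ is the set of candidate edges, $w_h(e)$ is the weight of edge $e$ in the high-dimensional graph, and $w_l(e)$ is its weight in the low-dimensional graph.

```latex
C = \sum_{e \in E} \left[
      w_h(e)\,\log\frac{w_h(e)}{w_l(e)}
      + \bigl(1 - w_h(e)\bigr)\,\log\frac{1 - w_h(e)}{1 - w_l(e)}
    \right]
```

The first term acts as an attractive force that pulls together points connected by a high-weight edge, and the second term acts as a repulsive force that pushes apart points whose edge weight is low.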
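At this point UMAP's work is done, and the result is an array containing the coordinates of every data point in the specified low-dimensional space.
Example 1:
Use the MNIST data to separate the digits and display them in two dimensions: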
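    import umap

    # X is assumed to be a feature matrix of shape (n_samples, n_features)
    # that has been loaded beforehand.
    reducer = umap.UMAP(random_state=42)
    X_trans = reducer.fit_transform(X)
    print(X_trans.shape)

Plotting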
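    import matplotlib.pyplot as plt
    import numpy as np
    import umap
    from sklearn.datasets import load_digits

    # Assumption: `digits` is scikit-learn's bundled digits dataset.
    digits = load_digits()

    reducer = umap.UMAP(random_state=42)
    embedding = reducer.fit_transform(digits.data)
    print(embedding.shape)

    # Color each embedded point by its digit label.
    plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
    plt.gca().set_aspect('equal', 'datalim')
    plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
    plt.title('UMAP projection of the Digits dataset')
    plt.show()

Parameter settings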
n_components
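Controls the number of dimensions after projection; the default is 2. When the data has many features, however, 2D may not be enough to fully preserve its underlying topology. Try values between 2 and 20 in steps of 5, and evaluate different baseline models on each embedding to see how accuracy changes.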
n_neighbors
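This determines the number of neighboring points used in local approximations of the manifold structure. Larger values preserve more global structure at the expense of detailed local structure. In general this parameter should be between 5 and 50, with 10 to 15 being a sensible default.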
min_dist
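This controls how tightly the embedding is allowed to pack points together. Larger values ensure that embedded points are distributed more evenly, while smaller values allow the algorithm to optimize more accurately for local structure. Sensible values are between 0.001 and 0.5, with 0.1 being a reasonable default.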
metric
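The formula used to compute the distance between points; the default is euclidean. This determines the metric used to measure distance in the input space. A wide variety of metrics are already implemented, and a user-defined function can be passed as long as it has been JIT-compiled with numba.
UMAP can consume a lot of memory, especially during fitting and when creating plots such as connectivity plots. If memory is a problem, low_memory can be set to True.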
    reducer = umap.UMAP(
        # default 15. The size of local neighborhood (in terms of number of
        # neighboring sample points) used for manifold approximation.
        n_neighbors=100,
        # default 2. The dimension of the space to embed into.
        n_components=3,
        # default 'euclidean'. The metric to use to compute distances in high
        # dimensional space.
        metric='euclidean',
        # default None. The number of training epochs to be used in optimizing
        # the low dimensional embedding. Larger values result in more accurate
        # embeddings.
        n_epochs=1000,
        # default 1.0. The initial learning rate for the embedding optimization.
        learning_rate=1.0,
        # default 'spectral'. How to initialize the low dimensional embedding.
        # Options are: {'spectral', 'random', A numpy array of initial
        # embedding positions}.
        init='spectral',
        # default 0.1. The effective minimum distance between embedded points.
        min_dist=0.1,
        # default 1.0. The effective scale of embedded points. In combination
        # with ``min_dist`` this determines how clustered/clumped the embedded
        # points are.
        spread=1.0,
        # default False. For some datasets the nearest neighbor computation can
        # consume a lot of memory. If you find that UMAP is failing due to
        # memory constraints consider setting this option to True.
        low_memory=False,
        # default 1.0. The value of this parameter should be between 0.0 and
        # 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a
        # pure fuzzy intersection.
        set_op_mix_ratio=1.0,
        # default 1. The local connectivity required, i.e. the number of
        # nearest neighbors that should be assumed to be connected at a local
        # level.
        local_connectivity=1,
        # default 1.0. Weighting applied to negative samples in low dimensional
        # embedding optimization.
        repulsion_strength=1.0,
        # default 5. Increasing this value will result in greater repulsive
        # force being applied, greater optimization cost, but slightly more
        # accuracy.
        negative_sample_rate=5,
        # default 4.0. Larger values will result in slower performance but more
        # accurate nearest neighbor evaluation.
        transform_queue_size=4.0,
        # default None. More specific parameters controlling the embedding. If
        # None these values are set automatically as determined by ``min_dist``
        # and ``spread``.
        a=None,
        b=None,
        # default None. If int, random_state is the seed used by the random
        # number generator.
        random_state=42,
        # default None. Arguments to pass on to the metric, such as the ``p``
        # value for Minkowski distance.
        metric_kwds=None,
        # default False. Whether to use an angular random projection forest to
        # initialise the approximate nearest neighbor search.
        angular_rp_forest=False,
        # default -1. The number of nearest neighbors to use to construct the
        # target simplicial set. If set to -1 use the ``n_neighbors`` value.
        target_n_neighbors=-1,
        # default 'categorical'. The metric used to measure distance for a
        # target array when using supervised dimension reduction. By default
        # this is 'categorical' which will measure distance in terms of whether
        # categories match or are different.
        #target_metric='categorical',
        # dict, default None. Keyword arguments to pass to the target metric
        # when performing supervised dimension reduction. If None then no
        # arguments are passed on.
        #target_metric_kwds=None,
        # default 0.5. Weighting factor between data topology and target
        # topology.
        #target_weight=0.5,
        # default 42. Random seed used for the stochastic aspects of the
        # transform operation.
        transform_seed=42,
        # default False. Controls verbosity of logging.
        verbose=False,
        # default False. Controls if the rows of your data should be uniqued
        # before being embedded.
        unique=False,
    )

Drawing a 3D plot with plotly
    import numpy as np
    import pandas as pd
    import plotly.express as px

    def chart_plotly(X, y):
        # --------------------------------------------------------------------------#
        # This section is not mandatory as its purpose is to sort the data by label
        # so, we can maintain consistent colors for digits across multiple graphs

        # Concatenate X and y arrays
        arr_concat = np.concatenate((X, y.reshape(y.shape[0], 1)), axis=1)
        # Create a Pandas dataframe using the above array
        df = pd.DataFrame(arr_concat, columns=['x', 'y', 'z', 'label'])
        # Convert label data type from float to integer
        df['label'] = df['label'].astype(int)
        # Finally, sort the dataframe by label
        df.sort_values(by='label', axis=0, ascending=True, inplace=True)
        # --------------------------------------------------------------------------#

        # Create a 3D graph
        fig = px.scatter_3d(df, x='x', y='y', z='z',
                            color=df['label'].astype(str), height=900, width=950)

        # Update chart looks
        fig.update_layout(
            title_text='UMAP',
            showlegend=True,
            legend=dict(orientation="h", yanchor="top", y=0, xanchor="center", x=0.5),
            scene_camera=dict(up=dict(x=0, y=0, z=1),
                              center=dict(x=0, y=0, z=-0.1),
                              eye=dict(x=1.5, y=-1.4, z=0.5)),
            margin=dict(l=0, r=0, b=0, t=0),
            scene=dict(
                xaxis=dict(backgroundcolor='white', color='black', gridcolor='#f0f0f0',
                           title_font=dict(size=10), tickfont=dict(size=10)),
                yaxis=dict(backgroundcolor='white', color='black', gridcolor='#f0f0f0',
                           title_font=dict(size=10), tickfont=dict(size=10)),
                zaxis=dict(backgroundcolor='lightgrey', color='black', gridcolor='#f0f0f0',
                           title_font=dict(size=10), tickfont=dict(size=10)),
            ),
        )

        # Update marker size
        fig.update_traces(marker=dict(size=3, line=dict(color='black', width=0.1)))
        fig.show()

    # The reducer must be created with n_components=3
    X_trans = reducer.fit_transform(X)
    # Check the shape of the new data
    print('Shape of X_trans: ', X_trans.shape)
    # Call the function under the name it was defined with (chart_plotly)
    chart_plotly(X_trans, y)