An Intuitive Explanation of Principal Component Analysis (PCA) in Six Minutes
Principal Component Analysis (PCA) is arguably a difficult-to-understand topic for beginners in machine learning. Here, I will try my best to intuitively explain what it is and how the algorithm does what it does. This post assumes you have very basic knowledge of linear algebra, such as matrix multiplication and vectors.
What is PCA?
PCA is a dimensionality-reduction technique used to turn large datasets with hundreds of thousands of features into smaller datasets with fewer features, while retaining as much information about the dataset as possible.
A perfect example would be:
Notice that in the original dataset there were five features, which could be reduced to two. These two features generalize the features on the left.
Visualizing the idea of PCA
To get a picture of what is happening, let's use our previous example. A 2-dimensional plane showing the relationship between the size of a house and its number of rooms can be compressed into a single size feature, as shown below:
If we project the houses onto the black line, we get something like this:
So we need to reduce the projection error (the magnitude of the blue lines) in order to retain the maximum amount of information.
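To make the projection error concrete, here is a minimal NumPy sketch; the points and the line direction below are made up for illustration and are not taken from the slide.

import numpy as np

# hypothetical house data (2D, already mean-centered), made up for illustration
points = np.array([[-1.5, -1.2], [-0.5, -0.6], [0.5, 0.4], [1.5, 1.4]])

# unit vector along the "black line" we project onto
direction = np.array([1.0, 1.0]) / np.sqrt(2)

# each point's projection onto the line, and its projection error (the "blue line")
projections = np.outer(points @ direction, direction)
errors = np.linalg.norm(points - projections, axis=1)

print(errors)        # magnitude of each blue line
print(errors.sum())  # the total projection error we want to keep small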
Prerequisites to understanding the PCA algorithm
I will explain some concepts intuitively in order for you to understand the algorithm better.
Mean
The mean of a dataset is its point of equilibrium. Imagine a rod on which balls are placed at some distance x from a wall:

Summing the distances of the balls from the wall and dividing by the number of balls gives the point of equilibrium, the point at which a pivot would balance the rod.
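As a quick sketch (the ball positions below are made-up numbers), the mean computed in NumPy is exactly that pivot point:

import numpy as np

# hypothetical distances of the balls from the wall
x = np.array([1.0, 2.0, 4.0, 5.0])

# the mean is the pivot point that balances the rod
pivot = x.sum() / len(x)      # same as np.mean(x)
print(pivot)                  # 3.0

# at the pivot, the deviations on both sides cancel out
print((x - pivot).sum())      # 0.0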
Variance
While the mean tells us about the equilibrium of a dataset, variance is a metric that tells us how the dataset spreads out around its mean. For a 1-dimensional dataset, variance can be illustrated as:
We take the square of each distance so we do not get any negative values!
For a 2-dimensional dataset, simply take the variance after projecting the data onto each axis (shown later).
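Here is a small sketch of that idea, assuming a made-up, already mean-centered 2D dataset; the variance along each axis is just the average squared distance of the projected points from the mean (population convention, dividing by n):

import numpy as np

# hypothetical 2D dataset, already mean-centered
data = np.array([[-2.0, -1.0], [-1.0, -0.5], [1.0, 0.5], [2.0, 1.0]])

# projecting onto the x-axis keeps column 0; onto the y-axis keeps column 1
x_var = np.mean(data[:, 0] ** 2)   # average squared distance from the mean
y_var = np.mean(data[:, 1] ** 2)
print(x_var, y_var)                # 2.5 0.625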
Covariance
In two and higher dimensions, two very different datasets can have the same variance, which can lead to misinterpretation of the data. For example, both datasets below have the same variance even though they are entirely different.
Notice that the variance is the same for both entirely different datasets. Also note how the x-variance and y-variance are calculated by projecting the data onto each axis.
Covariance captures the correlation within a dataset: it describes not only the spread but also the direction of the spread. This will help us distinguish between the two datasets above.
To get the covariance, simply add up the product of the coordinates of each point. For example, in the red dataset, (-2, -1) contributes 2 and so does (2, 1).
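A minimal sketch of this sum-of-products idea, assuming mean-centered points that only roughly resemble the red dataset in the slide (the exact coordinates are my guess):

import numpy as np

# mean-centered points, roughly like the red dataset
red = np.array([[-2.0, -1.0], [-1.0, -0.5], [0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])

# covariance as the (averaged) sum of the products of each point's coordinates:
# (-2) * (-1) = 2, and 2 * 1 = 2, and so on
cov_xy = np.mean(red[:, 0] * red[:, 1])
print(cov_xy)   # positive: x and y grow together

# flipping the y coordinates tilts the data the other way and the sign flips
blue = red * np.array([1.0, -1.0])
print(np.mean(blue[:, 0] * blue[:, 1]))   # negative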
The covariance matrix is a linear transformation that maps the given data into another shape; we will see that later. In 2D, this matrix is 2 x 2.
For the dataset in the example above, plugging in the values gives the rough estimate of the covariance matrix shown in the slide. To understand what this matrix is used for, we first need to understand linear transformations.
A linear transformation is a mapping that takes any point in the 2D plane to another point in the same plane using a transformation matrix.
Back to our example: the covariance matrix is our transformation matrix. A unit circle is transformed into an ellipse by the covariance matrix (using matrix multiplication), as shown below.
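A small sketch of this transformation, assuming a rough, made-up 2 x 2 covariance matrix (not the exact values from the slide):

import numpy as np

# a made-up covariance matrix standing in for the one in the slide
cov = np.array([[2.0, 0.8],
                [0.8, 0.6]])

# points on the unit circle...
theta = np.linspace(0.0, 2.0 * np.pi, 100)
circle = np.stack([np.cos(theta), np.sin(theta)])   # shape (2, 100)

# ...are mapped by the covariance matrix onto an ellipse
ellipse = cov @ circle
print(ellipse.shape)   # (2, 100)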
This transformation gives us two very special vectors — the eigenvectors.
These eigenvectors are special because they are not affected by the transformation: their direction remains the same, and they are only scaled in magnitude. Below are the two vectors, in red and teal.
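To see this property numerically, here is a sketch using the same assumed covariance matrix as above; np.linalg.eigh returns the eigenvalues in ascending order and the eigenvectors as columns:

import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 0.6]])

eigen_vals, eigen_vecs = np.linalg.eigh(cov)
print(eigen_vals)   # smaller eigenvalue first, larger one last
print(eigen_vecs)   # columns are the eigenvectors

# applying the matrix to an eigenvector only rescales it; the direction is unchanged
v = eigen_vecs[:, -1]        # eigenvector with the largest eigenvalue
print(cov @ v)               # same direction...
print(eigen_vals[-1] * v)    # ...just scaled by the eigenvalue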
Speaking abstractly, these are the vectors onto which we shall project our data, which will help us reduce its dimensions.
The question is: which eigenvector to choose?
Let me tell you, you are an intelligent person if you chose the red one, because I myself had absolutely no clue!
Red is the better choice because it retains the maximum variance of the original dataset, which means the maximum amount of information is retained.
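To check this numerically, here is a small sketch that samples synthetic points from the assumed covariance matrix and compares the variance retained along each eigenvector; the retained variance is approximately the corresponding eigenvalue, so the eigenvector with the largest eigenvalue keeps the most information.

import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 0.6]])
rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

eigen_vals, eigen_vecs = np.linalg.eigh(cov)   # ascending eigenvalues
for i in range(2):
    projected = data @ eigen_vecs[:, i]        # project onto one eigenvector
    print(eigen_vals[i], projected.var())      # retained variance is approx. the eigenvalue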
Imagine you had picked the other eigenvector. Projecting the data onto it would cause most of the points to overlap each other, and you would thus lose information. This is because the x-variance is greater than the y-variance.

The interesting part: the code!
We will apply the algorithm to word embeddings, which are very high-dimensional vector representations of, well, words. The TensorFlow visualization of 10,000 words in 200 dimensions looks something like this:
Source: http://projector.tensorflow.org/

We will use a dataset with embeddings in 300 dimensions instead.
How our dataset is arranged.

The process is as follows:

1. Mean-center (demean) the data;

"""
Input:
    X: of dimension (m, n) where each row corresponds to a word vector
    n_components: number of components you want to keep
Output:
    X_reduced: data transformed into n_components dims/columns
"""

# mean-center the data (axis=0 because we take the mean along the columns)
X_demeaned = X - np.mean(X, axis=0, keepdims=True)
2. Compute the covariance matrix;
# calculate the covariance matrix; rowvar=False because our features are in columns, not rows
covariance_matrix = np.cov(X_demeaned, rowvar=False)
3. Compute the eigenvectors and eigenvalues and sort them from highest to lowest eigenvalue. Remember, we want the eigenvectors with the highest eigenvalues so as to retain the maximum variance (information) of our word embeddings.
# calculate eigenvectors & eigenvalues of the covariance matrix;
# np.linalg.eigh returns both in ascending order of eigenvalue
eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)

# sort the eigenvalues (highest to lowest)
eigen_vals_sorted = np.flip(eigen_vals)

# sort the eigenvectors accordingly (highest to lowest eigenvalue)
eigen_vecs_sorted = np.flip(eigen_vecs, axis=1)
4. Select the first n eigenvectors, the ones corresponding to the n largest eigenvalues.
# select the first n_components eigenvectors (n_components is the desired dimension
# of the reduced data); each eigenvector is a column of the eigen_vecs matrix
eigen_vecs_subset = eigen_vecs_sorted[:, :n_components]
5. Multiply the demeaned data by this subset of eigenvectors to get the transformed data!
# transform the data by multiplying the demeaned dataset with the first n eigenvectors
X_reduced = np.dot(X_demeaned, eigen_vecs_subset)

return X_reduced
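Putting the five steps together, a minimal sketch of the full function could look like this; the name compute_pca matches the call below, but the exact signature in the original notebook may differ.

import numpy as np

def compute_pca(X, n_components=2):
    """Reduce X of shape (m, n) to shape (m, n_components) using PCA."""
    # 1. mean-center the data
    X_demeaned = X - np.mean(X, axis=0, keepdims=True)

    # 2. covariance matrix of the features (columns)
    covariance_matrix = np.cov(X_demeaned, rowvar=False)

    # 3. eigen-decomposition; eigh returns eigenvalues in ascending order
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)

    # 4. keep the eigenvectors with the largest eigenvalues
    eigen_vecs_sorted = np.flip(eigen_vecs, axis=1)
    eigen_vecs_subset = eigen_vecs_sorted[:, :n_components]

    # 5. project the demeaned data onto those eigenvectors
    return np.dot(X_demeaned, eigen_vecs_subset)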
Let's visualize!
Let’s try this algorithm on some data.
words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
         'village', 'country', 'continent', 'petroleum', 'joyful']  # 11 words

# given a list of words and the embeddings, get_vectors returns a matrix with all the embeddings
X = get_vectors(word_embeddings, words)
X.shape # returns (11, 300)
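get_vectors comes from the original notebook and is not shown here; a minimal stand-in, assuming word_embeddings is a dict mapping each word to its 300-dimensional NumPy vector, could be:

import numpy as np

def get_vectors(word_embeddings, words):
    """Stack the embedding of each word into a matrix of shape (len(words), 300)."""
    return np.array([word_embeddings[word] for word in words])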
Applying our algorithm to reduce from 300 dimensions to 2 dimensions!
# We have done the plotting for you. Just run this cell.
import matplotlib.pyplot as plt

result = compute_pca(X, 2)
plt.figure(figsize=(12,8))
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0] - 0.05, result[i, 1] + 0.1))
plt.show()
Takeaway: words that express emotions (sad, joyful, happy) are close to each other, and town/village/city are close to each other as well. This is because they are highly related!
Now we have reduced 300-dimensional embeddings to 2-dimensional embeddings, which is why we can visualize them!
I hope I explained everything in a friendly and easy manner :)
See you soon!
Note: all the slides shown in this article belong to the YouTube channel of Luis Serrano. His video explaining PCA is amazing, and here is a link to his channel: https://www.youtube.com/channel/UCgBncpylJ1kiVaPyP-PZauQ
Translated from: https://medium.com/swlh/intuitive-explanation-for-principle-component-analysis-pca-from-scratch-157a1b162f1f