Dimensionality Reduction - RDD-based API
- Singular value decomposition (SVD)
  - Performance
  - SVD Example
- Principal component analysis (PCA)
Dimensionality reduction is the process of reducing the number of variables under consideration. It can be used to extract latent features from raw and noisy features or compress data while maintaining the structure. spark.mllib provides support for dimensionality reduction on the RowMatrix class.
Singular value decomposition (SVD)
Singular value decomposition (SVD) factorizes a matrix into three matrices: U, Σ, and V such that
A = U Σ V^T,
where
- U is an orthonormal matrix, whose columns are called left singular vectors,
- Σ is a diagonal matrix with non-negative diagonals in descending order, whose diagonals are called singular values,
- V is an orthonormal matrix, whose columns are called right singular vectors.
For large matrices, usually we don't need the complete factorization but only the top singular values and their associated singular vectors. This can save storage, de-noise, and recover the low-rank structure of the matrix.
If we keep the top k singular values, then the dimensions of the resulting low-rank matrices will be:
- U: m × k,
- Σ: k × k,
- V: n × k.
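As a concrete illustration (the numbers are chosen for this example, not taken from the text): for a matrix with m = 1,000,000 rows and n = 1,000 columns, keeping the top k = 10 singular values stores
- U: 1,000,000 × 10 = 10^7 values,
- Σ: 10 diagonal values,
- V: 1,000 × 10 = 10^4 values,
compared with roughly m × n = 10^9 values for the thin U factor of the complete factorization, about a hundredfold saving.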
Performance
We assume n is smaller than m. The singular values and the right singular vectors are derived from the eigenvalues and the eigenvectors of the Gramian matrix A^T A. The matrix storing the left singular vectors, U, is computed via matrix multiplication as U = A (V S^-1), if requested by the user via the computeU parameter. The actual method to use is determined automatically based on the computational cost:
- If n is small (n < 100) or k is large compared with n (k > n / 2), we compute the Gramian matrix first and then compute its top eigenvalues and eigenvectors locally on the driver. This requires a single pass with O(n^2) storage on each executor and on the driver, and O(n^2 k) time on the driver. (A small sketch of this local strategy follows the list.)
- Otherwise, we compute (A^T A) v in a distributive way and send it to ARPACK to compute (A^T A)'s top eigenvalues and eigenvectors on the driver node. This requires O(k) passes, O(n) storage on each executor, and O(n k) storage on the driver.
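A minimal sketch of the first (local) strategy, assuming Breeze is on the classpath (it ships with Spark). topEigenOfGramian is a hypothetical helper name, not a Spark API; the real implementation lives inside RowMatrix.computeSVD:
import breeze.linalg.{eigSym, DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// Small-n strategy: one distributed pass builds the n x n Gramian A^T A,
// then the driver eigendecomposes it locally. The singular values of A are
// the square roots of the eigenvalues of A^T A.
def topEigenOfGramian(mat: RowMatrix, k: Int): (Array[Double], Array[Array[Double]]) = {
  val gram = mat.computeGramianMatrix()          // local n x n matrix, single pass
  val n = gram.numRows
  val bdm = new BDM[Double](n, n, gram.toArray)  // both layouts are column-major
  val es = eigSym(bdm)                           // eigenvalues in ascending order
  val topIdx = ((n - k) until n).reverse
  val values = topIdx.map(es.eigenvalues(_)).toArray
  val vectors = topIdx.map(i => es.eigenvectors(::, i).toArray).toArray
  (values, vectors)
}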
SVD Example
spark.mllib provides SVD functionality to row-oriented matrices, provided in the RowMatrix class.
Refer to the SingularValueDecomposition Scala docs for details on the API.
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.SingularValueDecomposition
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val rows = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(rows)
// Compute the top 5 singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(5, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
Find full example code at “examples/src/main/scala/org/apache/spark/examples/mllib/SVDExample.scala” in the Spark repo.
The same code applies to IndexedRowMatrix if U is defined as an IndexedRowMatrix.
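As a follow-up sketch (not part of the original example), the rank-k approximation A ≈ U Σ V^T can be rebuilt from the factors above. Matrices.diag, RowMatrix.multiply, and Matrix.transpose are existing spark.mllib calls; the reconstruction itself is our illustration:
import org.apache.spark.mllib.linalg.Matrices
// U (m x k) times diag(s) (k x k) times V^T (k x n) yields an m x n RowMatrix
// approximating the original mat.
val approx: RowMatrix = U.multiply(Matrices.diag(s)).multiply(V.transpose)
approx.rows.collect().foreach(println)  // rows of the rank-k approximation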
Principal component analysis (PCA)
Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate, in turn, has the largest variance possible. The columns of the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.
spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
The following code demonstrates how to compute principal components on a RowMatrix and use them to project the vectors into a low-dimensional space.
Refer to the RowMatrix Scala docs for details on the API.
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val rows = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(rows)
// Compute the top 4 principal components.
// Principal components are stored in a local dense matrix.
val pc: Matrix = mat.computePrincipalComponents(4)
// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/mllib/PCAOnRowMatrixExample.scala” in the Spark repo.
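A short optional sketch: RowMatrix also exposes computePrincipalComponentsAndExplainedVariance (available in Spark 2.0+, as far as we know), which reports how much variance each of the top components captures:
// Compute the top 4 principal components together with a vector giving the
// proportion of variance explained by each component.
val (components, explained) = mat.computePrincipalComponentsAndExplainedVariance(4)
println(s"Explained variance of the top 4 components: $explained")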
The following code demonstrates how to compute principal components on source vectors and use them to project the vectors into a low-dimensional space while keeping associated labels:
Refer to the PCA Scala docs for details on the API.
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val data: RDD[LabeledPoint] = sc.parallelize(Seq(
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 1)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 1, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0)),
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0))))
// Compute the top 5 principal components.
val pca = new PCA(5).fit(data.map(_.features))
// Project vectors to the linear space spanned by the top 5 principal
// components, keeping the label
val projected = data.map(p => p.copy(features = pca.transform(p.features)))
Find full example code at “examples/src/main/scala/org/apache/spark/examples/mllib/PCAOnSourceVectorExample.scala” in the Spark repo.
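As a hypothetical downstream step (not in the original example), the projected LabeledPoints can be fed directly into an mllib classifier such as LogisticRegressionWithLBFGS:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
// Train a binary classifier on the PCA-projected features, reusing the labels
// carried through by the projection step above.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(projected)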