pca 特征抽取
主成分分析 ( Principal Component Analysis , PCA )是一種掌握事物主要矛盾的統計分析方法,它可以從多元事物中解析出主要影響因素,揭示事物的本質,簡化復雜的問題。計算主成分的目的是將高維數據投影到較低維空間。給定 n 個變量的 m 個觀察值,形成一個 n ′ m 的數據矩陣, n 通常比較大。對于一個由多個變量描述的復雜事物,人們難以認識,那么是否可以抓住事物主要方面進行重點分析呢?如果事物的主要方面剛好體現在幾個主要變量上,我們只需要將這幾個變量分離出來,進行詳細分析。但是,在一般情況下,并不能直接找出這樣的關鍵變量。這時我們可以用原有變量的線性組合來表示事物的主要方面, PCA 就是這樣一種分析方法。
PCA 主要 用于數據降維,對于一系列例子的特征組成的多維向量,多維向量里的某些元素本身沒有區分性,比如某個元素在所有的例子中都為1,或者與1差距不大,那么這個元素本身就沒有區分性,用它做特征來區分,貢獻會非常小。所以我們的目的是找那些變化大的元素,即方差大的那些維,而去除掉那些變化不大的維,從而使特征留下的都是“精品”,而且計算量也變小了。 對于一個k維的特征來說,相當于它的每一維特征與其他維都是正交的(相當于在多維坐標系中,坐標軸都是垂直的),那么我們可以變化這些維的坐標系,從而使這個特征在某些維上方差大,而在某些維上方差很小。例如,一個45度傾斜的橢圓,在第一坐標系,如果按照x,y坐標來投影,這些點的x和y的屬性很難用于區分他們,因為他們在x,y軸上坐標變化的方差都差不多,我們無法根據這個點的某個x屬性來判斷這個點是哪個,而如果將坐標軸旋轉,以橢圓長軸為x軸,則橢圓在長軸上的分布比較長,方差大,而在短軸上的分布短,方差小,所以可以考慮只保留這些點的長軸屬性,來區分橢圓上的點,這樣,區分性比x,y軸的方法要好!
所以我們的做法就是求得一個k維特征的投影矩陣,這個投影矩陣可以將特征從高維降到低維。投影矩陣也可以叫做變換矩陣。新的低維特征必須每個維都正交,特征向量都是正交的。通過求樣本矩陣的協方差矩陣,然后求出協方差矩陣的特征向量,這些特征向量就可以構成這個投影矩陣了。特征向量的選擇取決于協方差矩陣的特征值的大小。
舉一個例子:
對于一個訓練集,100個對象模板,特征是10維,那么它可以建立一個100*10的矩陣,作為樣本。求這個樣本的協方差矩陣,得到一個10*10的協方差矩陣,然后求出這個協方差矩陣的特征值和特征向量,應該有10個特征值和特征向量,我們根據特征值的大小,取前四個特征值所對應的特征向量,構成一個10*4的矩陣,這個矩陣就是我們要求的特征矩陣,100*10的樣本矩陣乘以這個10*4的特征矩陣,就得到了一個100*4的新的降維之后的樣本矩陣,每個特征的維數下降了。
當給定一個測試的特征集之后,比如1*10維的特征,乘以上面得到的10*4的特征矩陣,便可以得到一個1*4的特征,用這個特征去分類。
所以做PCA實際上是求得這個投影矩陣,用高維的特征乘以這個投影矩陣,便可以將高維特征的維數下降到指定的維數。
PCA 的目標是尋找 r ( r<n )個新變量,使它們反映事物的主要特征,壓縮原有數據矩陣的規模。每個新變量是原有變量的線性組合,體現原有變量的綜合效果,具有一定的實際含義。這 r 個新變量稱為“主成分”,它們可以在很大程度上反映原來 n 個變量的影響,并且這些新變量是互不相關的,也是正交的。通過主成分分析,壓縮數據空間,將多元數據的特征在低維空間里直觀地表示出來。例如,將多個時間點、多個實驗條件下的基因表達譜數據( N 維)表示為 3 維空間中的一個點,即將數據的維數從 RN 降到 R3 。
在進行基因表達數據分析時,一個重要問題是確定每個實驗數據是否是獨立的,如果每次實驗數據之間不是獨立的,則會影響基因表達數據分析結果的準確性。對于利用基因芯片所檢測到的基因表達數據,如果用 PCA 方法進行分析,可以將各個基因作為變量,也可以將實驗條件作為變量。當將基因作為變量時,通過分析確定一組“主要基因元素”,它們能夠很好地說明基因的特征,解釋實驗現象;當將實驗條件作為變量時,通過分析確定一組“主要實驗因素”,它們能夠很好地刻畫實驗條件的特征,解釋基因的行為。下面著重考慮以實驗條件作為變量的 PCA 分析方法。假設將數據的維數從 R N 降到 R 3 ,具體的 PCA 分析步驟如下:
(1) 第一步計算矩陣 X 的樣本的協方差矩陣 S :
(2) 第二步計算協方差矩陣S的本征向量 e1,e2,…,eN的本征值 , i = 1,2,…,N 。本征值按大到小排序: ;
(3)第三步投影數據到本征矢張成的空間之中,這些本征矢相應的本征值為 。現在數據可以在三維空間中展示為云狀的點集。
對于 PCA ,確定新變量的個數 r 是一個兩難的問題。我們的目標是減小 r ,如果 r 小,則數據的維數低,便于分析,同時也降低了噪聲,但可能丟失一些有用的信息。究竟如何確定 r 呢?這需要進一步分析每個主元素對信息的貢獻。
令 代表第 i 個特征值,定義第 i 個主元素的貢獻率為:
(8-45)
前 r 個主成分的累計貢獻率為:
(8-46)
貢獻率表示所定義的主成分在整個數據分析中承擔的主要意義占多大的比重,當取前 r 個主成分來代替原來全部變量時,累計貢獻率的大小反應了這種取代的可靠性,累計貢獻率越大,可靠性越大;反之,則可靠性越小。一般要求累計貢獻率達到 70% 以上。
經過 PCA 分析,一個多變量的復雜問題被簡化為低維空間的簡單問題。可以利用這種簡化方法進行作圖,形象地表示和分析復雜問題。在分析基因表達數據時,可以針對基因作圖,也可以針對實驗條件作圖。前者稱為 Q 分析,后者稱為 R 分析。
?
?
?
?
?
?
1) 協方差矩陣
2) 特征值與特征向量
3) 貢獻率
?
?
http://www.cad.zju.edu.cn/home/chenlu/pca.htm
/** PrincipalComponents.java* Copyright (C) 2000 University of Waikato, Hamilton, New Zealand**/package weka.attributeSelection;/**<!-- globalinfo-start -->* Performs a principal components analysis and transformation of the data. Use in conjunction with a Ranker search. Dimensionality reduction is accomplished by choosing enough eigenvectors to account for some percentage of the variance in the original data---default 0.95 (95%). Attribute noise can be filtered by transforming to the PC space, eliminating some of the worst eigenvectors, and then transforming back to the original space.* <p/><!-- globalinfo-end -->*<!-- options-start -->* Valid options are: <p/>* * <pre> -D* Don't normalize input data.</pre>* * <pre> -R* Retain enough PC attributes to account * for this proportion of variance in the original data.* (default = 0.95)</pre>* * <pre> -O* Transform through the PC space and * back to the original space.</pre>* * <pre> -A* Maximum number of attributes to include in * transformed attribute names. (-1 = include all)</pre>* <!-- options-end -->** @author Mark Hall (mhall@cs.waikato.ac.nz)* @author Gabi Schmidberger (gabi@cs.waikato.ac.nz)* @version $Revision: 5987 $*/ public class PrincipalComponents extends UnsupervisedAttributeEvaluator implements AttributeTransformer, OptionHandler {/** for serialization */static final long serialVersionUID = 3310137541055815078L;/** The data to transform analyse/transform */private Instances m_trainInstances;/** Keep a copy for the class attribute (if set) */private Instances m_trainHeader;/** The header for the transformed data format */private Instances m_transformedFormat;/** The header for data transformed back to the original space */private Instances m_originalSpaceFormat;/** Data has a class set */private boolean m_hasClass;/** Class index */private int m_classIndex;/** Number of attributes */private int m_numAttribs;/** Number of instances */private int m_numInstances;/** Correlation matrix for the original data */private double [][] m_correlation;/** Will hold the unordered linear transformations of the (normalized)original data */private double [][] m_eigenvectors;/** Eigenvalues for the corresponding eigenvectors */private double [] m_eigenvalues = null;/** Sorted eigenvalues */private int [] m_sortedEigens;/** sum of the eigenvalues */private double m_sumOfEigenValues = 0.0;/** Filters for original data */private ReplaceMissingValues m_replaceMissingFilter;private Normalize m_normalizeFilter;private NominalToBinary m_nominalToBinFilter;private Remove m_attributeFilter;/** used to remove the class column if a class column is set */private Remove m_attribFilter;/** The number of attributes in the pc transformed data */private int m_outputNumAtts = -1;/** normalize the input data? */private boolean m_normalize = true;/** the amount of varaince to cover in the original data whenretaining the best n PC's */private double m_coverVariance = 0.95;/** transform the data through the pc space and back to the originalspace ? */private boolean m_transBackToOriginal = false;/** maximum number of attributes in the transformed attribute name */private int m_maxAttrsInName = 5;/** holds the transposed eigenvectors for converting back to theoriginal space */private double [][] m_eTranspose;/*** Returns a string describing this attribute transformer* @return a description of the evaluator suitable for* displaying in the explorer/experimenter gui*/public String globalInfo() {return "Performs a principal components analysis and transformation of "+"the data. Use in conjunction with a Ranker search. Dimensionality "+"reduction is accomplished by choosing enough eigenvectors to "+"account for some percentage of the variance in the original data---"+"default 0.95 (95%). Attribute noise can be filtered by transforming "+"to the PC space, eliminating some of the worst eigenvectors, and "+"then transforming back to the original space.";}/*** Returns an enumeration describing the available options. <p>** @return an enumeration of all the available options.**/public Enumeration listOptions () {Vector newVector = new Vector(3);newVector.addElement(new Option("/tDon't normalize input data." , "D", 0, "-D"));newVector.addElement(new Option("/tRetain enough PC attributes to account "+"/n/tfor this proportion of variance in "+"the original data./n"+ "/t(default = 0.95)","R",1,"-R"));newVector.addElement(new Option("/tTransform through the PC space and "+"/n/tback to the original space.", "O", 0, "-O"));newVector.addElement(new Option("/tMaximum number of attributes to include in "+ "/n/ttransformed attribute names. (-1 = include all)", "A", 1, "-A"));return newVector.elements();}/*** Parses a given list of options. <p/>*<!-- options-start -->* Valid options are: <p/>* * <pre> -D* Don't normalize input data.</pre>* * <pre> -R* Retain enough PC attributes to account * for this proportion of variance in the original data.* (default = 0.95)</pre>* * <pre> -O* Transform through the PC space and * back to the original space.</pre>* * <pre> -A* Maximum number of attributes to include in * transformed attribute names. (-1 = include all)</pre>* <!-- options-end -->** @param options the list of options as an array of strings* @throws Exception if an option is not supported*/public void setOptions (String[] options)throws Exception {resetOptions();String optionString;optionString = Utils.getOption('R', options);if (optionString.length() != 0) {Double temp;temp = Double.valueOf(optionString);setVarianceCovered(temp.doubleValue());}optionString = Utils.getOption('A', options);if (optionString.length() != 0) {setMaximumAttributeNames(Integer.parseInt(optionString));}setNormalize(!Utils.getFlag('D', options));setTransformBackToOriginal(Utils.getFlag('O', options));}/*** Reset to defaults*/private void resetOptions() {m_coverVariance = 0.95;m_normalize = true;m_sumOfEigenValues = 0.0;m_transBackToOriginal = false;}/*** Returns the tip text for this property* @return tip text for this property suitable for* displaying in the explorer/experimenter gui*/public String normalizeTipText() {return "Normalize input data.";}/*** Set whether input data will be normalized.* @param n true if input data is to be normalized*/public void setNormalize(boolean n) {m_normalize = n;}/*** Gets whether or not input data is to be normalized* @return true if input data is to be normalized*/public boolean getNormalize() {return m_normalize;}/*** Returns the tip text for this property* @return tip text for this property suitable for* displaying in the explorer/experimenter gui*/public String varianceCoveredTipText() {return "Retain enough PC attributes to account for this proportion of "+"variance.";}/*** Sets the amount of variance to account for when retaining* principal components* @param vc the proportion of total variance to account for*/public void setVarianceCovered(double vc) {m_coverVariance = vc;}/*** Gets the proportion of total variance to account for when* retaining principal components* @return the proportion of variance to account for*/public double getVarianceCovered() {return m_coverVariance;}/*** Returns the tip text for this property* @return tip text for this property suitable for* displaying in the explorer/experimenter gui*/public String maximumAttributeNamesTipText() {return "The maximum number of attributes to include in transformed attribute names.";}/*** Sets maximum number of attributes to include in* transformed attribute names.* @param m the maximum number of attributes*/public void setMaximumAttributeNames(int m) {m_maxAttrsInName = m;}/*** Gets maximum number of attributes to include in* transformed attribute names.* @return the maximum number of attributes*/public int getMaximumAttributeNames() {return m_maxAttrsInName;}/*** Returns the tip text for this property* @return tip text for this property suitable for* displaying in the explorer/experimenter gui*/public String transformBackToOriginalTipText() {return "Transform through the PC space and back to the original space. "+"If only the best n PCs are retained (by setting varianceCovered < 1) "+"then this option will give a dataset in the original space but with "+"less attribute noise.";}/*** Sets whether the data should be transformed back to the original* space* @param b true if the data should be transformed back to the* original space*/public void setTransformBackToOriginal(boolean b) {m_transBackToOriginal = b;}/*** Gets whether the data is to be transformed back to the original* space.* @return true if the data is to be transformed back to the original space*/public boolean getTransformBackToOriginal() {return m_transBackToOriginal;}/*** Gets the current settings of PrincipalComponents** @return an array of strings suitable for passing to setOptions()*/public String[] getOptions () {String[] options = new String[6];int current = 0;if (!getNormalize()) {options[current++] = "-D";}options[current++] = "-R";options[current++] = ""+getVarianceCovered();options[current++] = "-A";options[current++] = ""+getMaximumAttributeNames();if (getTransformBackToOriginal()) {options[current++] = "-O";}while (current < options.length) {options[current++] = "";}return options;}/*** Returns the capabilities of this evaluator.** @return the capabilities of this evaluator* @see Capabilities*/public Capabilities getCapabilities() {Capabilities result = super.getCapabilities();result.disableAll();// attributesresult.enable(Capability.NOMINAL_ATTRIBUTES);result.enable(Capability.NUMERIC_ATTRIBUTES);result.enable(Capability.DATE_ATTRIBUTES);result.enable(Capability.MISSING_VALUES);// classresult.enable(Capability.NOMINAL_CLASS);result.enable(Capability.NUMERIC_CLASS);result.enable(Capability.DATE_CLASS);result.enable(Capability.MISSING_CLASS_VALUES);result.enable(Capability.NO_CLASS);return result;}/*** Initializes principal components and performs the analysis* @param data the instances to analyse/transform* @throws Exception if analysis fails*/public void buildEvaluator(Instances data) throws Exception {// can evaluator handle data?getCapabilities().testWithFail(data);buildAttributeConstructor(data);}private void buildAttributeConstructor (Instances data) throws Exception {m_eigenvalues = null;m_outputNumAtts = -1;m_attributeFilter = null;m_nominalToBinFilter = null;m_sumOfEigenValues = 0.0;m_trainInstances = new Instances(data);// make a copy of the training data so that we can get the class// column to append to the transformed data (if necessary)m_trainHeader = new Instances(m_trainInstances, 0);//處理缺失值、歸一化、類別二值化、刪除單值屬性或者都為缺失值的屬性//mormalize nominaltobin//TODO.... deletedm_numInstances = m_trainInstances.numInstances();m_numAttribs = m_trainInstances.numAttributes();//計算相關矩陣,得到m_numAttribs * m_numAttribs的方陣fillCorrelation();double [] d = new double[m_numAttribs]; double [][] v = new double[m_numAttribs][m_numAttribs];Matrix corr = new Matrix(m_correlation);//得到特征向量和特征值corr.eigenvalueDecomposition(v, d);m_eigenvectors = (double [][])v.clone();m_eigenvalues = (double [])d.clone();// any eigenvalues less than 0 are not worth anything --- change to 0for (int i = 0; i < m_eigenvalues.length; i++) {if (m_eigenvalues[i] < 0) {m_eigenvalues[i] = 0.0;}}m_sortedEigens = Utils.sort(m_eigenvalues);m_sumOfEigenValues = Utils.sum(m_eigenvalues);}/*** Returns just the header for the transformed data (ie. an empty* set of instances. This is so that AttributeSelection can* determine the structure of the transformed data without actually* having to get all the transformed data through transformedData().* @return the header of the transformed data.* @throws Exception if the header of the transformed data can't* be determined.*/public Instances transformedHeader() throws Exception {}/*** Gets the transformed training data.* @return the transformed training data* @throws Exception if transformed data can't be returned*/public Instances transformedData(Instances data) throws Exception {}/*** Evaluates the merit of a transformed attribute. This is defined* to be 1 minus the cumulative variance explained. Merit can't* be meaningfully evaluated if the data is to be transformed back* to the original space.* @param att the attribute to be evaluated* @return the merit of a transformed attribute* @throws Exception if attribute can't be evaluated*/public double evaluateAttribute(int att) throws Exception {if (m_eigenvalues == null) {throw new Exception("Principal components hasn't been built yet!");}if (m_transBackToOriginal) {return 1.0; // can't evaluate back in the original space!}// return 1-cumulative variance explained for this transformed attdouble cumulative = 0.0;for (int i = m_numAttribs - 1; i >= m_numAttribs - att - 1; i--) {cumulative += m_eigenvalues[m_sortedEigens[i]];}return 1.0 - cumulative / m_sumOfEigenValues;}/*** Fill the correlation matrix*/private void fillCorrelation() {m_correlation = new double[m_numAttribs][m_numAttribs];double [] att1 = new double [m_numInstances];double [] att2 = new double [m_numInstances];double corr;for (int i = 0; i < m_numAttribs; i++) {for (int j = 0; j < m_numAttribs; j++) {if (i == j) {m_correlation[i][j] = 1.0;} else {for (int k = 0; k < m_numInstances; k++) {att1[k] = m_trainInstances.instance(k).value(i);att2[k] = m_trainInstances.instance(k).value(j);}corr = Utils.correlation(att1,att2,m_numInstances);m_correlation[i][j] = corr;m_correlation[j][i] = corr;}}}}/*** Main method for testing this class* @param argv should contain the command line arguments to the* evaluator/transformer (see AttributeSelection)*/public static void main(String [] argv) {runEvaluator(new PrincipalComponents(), argv);} }
?
總結