Hotelling-T² Based Variable Selection in Partial Least Squares (PLS)
Background
One of the most common challenges encountered in the modeling of spectroscopic data is selecting a subset of variables (i.e., wavelengths) out of the large number of variables associated with the response variable. It is common for spectroscopic data to have many more variables than observations. In such a situation, selecting a smaller number of variables is crucial, especially if we want to speed up computation and improve the model's stability and interpretability. Typically, variable selection methods are classified into two groups:
- Filter-based methods: the most relevant variables are selected in a preprocessing step, independently of the prediction model.
- Wrapper-based methods: the selection uses a supervised learning approach.
Hence, any PLS-based variable selection is a wrapper method. Wrapper methods need a selection criterion that relies solely on the characteristics of the data at hand.
Method
Let us consider a regression problem in which the relation between the response variable y (n × 1) and the predictor matrix X (n × p) is assumed to be explained by the linear model y = Xβ, where β (p × 1) is the vector of regression coefficients. Our dataset comprises n = 466 observations from various plant materials, and y corresponds to the concentration of calcium (Ca) for each plant. The matrix X holds our measured LIBS spectra and includes p = 7151 wavelength variables. Our objective is therefore to find subsets of the columns of X with satisfactory predictive power for the Ca content.
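The original LIBS dataset is not distributed with this article; in the code sketches that follow, a synthetic stand-in with the same shapes is used so the steps can be reproduced end to end (all names here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 466, 7151                             # observations x wavelength variables, as in the article
X = rng.normal(size=(n, p))                  # stand-in for the measured LIBS spectra
y = rng.normal(loc=1.0, scale=0.3, size=n)   # stand-in for the Ca concentrations
```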
ROBPCA modeling
Let's first perform robust principal component analysis (ROBPCA) to help visualize our data and detect any unusual structure or pattern. The scores obtained are illustrated by the scatterplot below, in which the ellipses represent the 95% and 99% confidence regions from Hotelling's T². Most observations lie within the 95% confidence region, albeit some seem to cluster in the top-right corner of the scores scatterplot.
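ROBPCA itself is not part of scikit-learn (implementations exist elsewhere, e.g., in R's rrcov package); as a rough illustration of the score plot and its T² ellipses, here is a sketch using classical PCA as a stand-in, with the usual F-based limit:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.decomposition import PCA

A = 2
pca = PCA(n_components=A)
scores = pca.fit_transform(X)           # (n x 2) score matrix
lam = pca.explained_variance_           # variance of each score column

def t2_limit(alpha, n, A):
    """F-based Hotelling's T^2 limit at confidence level 1 - alpha."""
    return A * (n - 1) / (n - A) * stats.f.ppf(1 - alpha, A, n - A)

theta = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=10)
for alpha in (0.05, 0.01):              # 95% and 99% ellipses
    lim = t2_limit(alpha, n, A)
    ax.plot(np.sqrt(lim * lam[0]) * np.cos(theta),
            np.sqrt(lim * lam[1]) * np.sin(theta))
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
plt.show()
```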
ROBPCA scores scatterplot.

However, when looking more closely, for instance using the outlier map, we can see that ultimately there are only three observations that seem to pose a problem: two are flagged as orthogonal outliers and only one as a bad leverage point. Some observations are flagged as good leverage points, whilst most are regular observations.
ROBPCA outlier map.

PLS modeling
It is worth mentioning that in our regression problem, ordinary least squares (OLS) fitting is not an option since n ≪ p. PLS resolves this by searching for a small set of so-called latent variables (LVs) that perform a simultaneous decomposition of X and y, with the constraint that these components explain as much as possible of the covariance between X and y. The figures below show the results obtained from the PLS model: an R² of 0.85, with an RMSE and MAE of 0.08 and 0.06, respectively, corresponding to a mean absolute percentage error (MAPE) of approximately 7%.
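A minimal sketch of the PLS fit and the reported metrics with scikit-learn; the article does not state the number of latent variables it used, so A below is a placeholder one would normally tune by cross-validation:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

A = 10                                  # placeholder; tune by cross-validation
pls = PLSRegression(n_components=A)
pls.fit(X, y)
y_hat = pls.predict(X).ravel()

print("R2  :", r2_score(y, y_hat))
print("RMSE:", mean_squared_error(y, y_hat) ** 0.5)
print("MAE :", mean_absolute_error(y, y_hat))
# MAPE assumes strictly positive y (Ca concentrations)
print("MAPE:", np.mean(np.abs((y - y_hat) / y)))
```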
Observed vs. predicted plot (full dataset).
Residual plot (full dataset).

Similarly to the ROBPCA outlier map, the PLS residual plot flags three observations that exhibit high standardized residuals. Another way to check for outliers is to calculate Q residuals and Hotelling's T² from the PLS model, then define a criterion for deciding whether an observation is an outlier. A high Q-residual value corresponds to an observation that is not well explained by the model, while a high Hotelling's T² value indicates an observation far from the center of the regular observations (i.e., score ≈ 0). The results are plotted below.
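A sketch of how Q residuals and Hotelling's T² can be derived from the fitted model's scores and loadings (assuming PLSRegression's default scale=True; the percentile-based Q limit is a simplification):

```python
import numpy as np
from scipy import stats

T = pls.transform(X)                        # X-scores (n x A)
P = pls.x_loadings_                         # X-loadings (p x A)

# Reproduce the centering/scaling PLSRegression applies internally (scale=True)
Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Q residuals: squared reconstruction error of each spectrum in X-space
E = Xc - T @ P.T
Q = np.sum(E ** 2, axis=1)

# Hotelling's T^2: distance of each score vector from the model centre
T2 = np.sum(T ** 2 / np.var(T, axis=0, ddof=1), axis=1)

# Simple 95% cut-offs: F-based limit for T^2, empirical percentile for Q
# (the Jackson-Mudholkar limit is the more rigorous choice for Q)
t2_lim = A * (n - 1) / (n - A) * stats.f.ppf(0.95, A, n - A)
q_lim = np.percentile(Q, 95)
```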
Q residuals vs. Hotelling's T² plot (full dataset).

Hotelling-T² based variable selection
Let's now perform variable selection from our PLS model, which is carried out by computing the T² statistic for each variable (for more details see Mehmood, 2016):

T²ᵢ = wᵢᵀ C⁻¹ wᵢ,  i = 1, …, p,

where W is the loading weight matrix, wᵢ is its i-th row, and C is the covariance matrix of those rows. Thus, a variable is selected based on the following criterion:

T²ᵢ > [A(p − 1) / (p − A)] · F(1 − α; A, p − A),

where A is the number of LVs from our PLS model, F(1 − α; A, p − A) is the 1 − α quantile of the F-distribution with A and p − A degrees of freedom, and 1 − α is the confidence level (with α equal to 0.05 or 0.01).
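A sketch of the criterion above using the loading weights exposed by scikit-learn (x_weights_); the exact constants should be checked against Mehmood et al. (2016):

```python
import numpy as np
from scipy import stats

W = pls.x_weights_                       # loading weight matrix (p x A)
A = W.shape[1]

# Hotelling's T^2 per variable, using the covariance of the weight rows
C = np.cov(W, rowvar=False)              # (A x A)
C_inv = np.linalg.inv(C)
T2_var = np.einsum("ij,jk,ik->i", W, C_inv, W)

# Variables whose T^2 exceeds the F-based cut-off are retained
alpha = 0.05
cutoff = A * (p - 1) / (p - A) * stats.f.ppf(1 - alpha, A, p - A)
selected = T2_var > cutoff               # boolean mask over the p wavelengths
print(selected.sum(), "variables selected out of", p)
```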
Thus, of the 7151 variables in our original dataset, only 217 were selected based on the aforementioned criterion. The observed vs. predicted plot is displayed below along with the model's R² and RMSE.
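Refitting on the retained wavelengths is then straightforward (continuing from the mask computed above):

```python
X_sel = X[:, selected]                   # keep only the retained wavelengths
pls_sel = PLSRegression(n_components=A).fit(X_sel, y)
y_hat_sel = pls_sel.predict(X_sel).ravel()

print("R2  :", r2_score(y, y_hat_sel))
print("RMSE:", mean_squared_error(y, y_hat_sel) ** 0.5)
```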
Observed vs. predicted plot (selected variables).

In the results below, the three observations that were flagged as outliers were removed from the dataset. The mean absolute percentage error is 6%.
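The article identifies its three outliers from the outlier map and residual plot; as one simple stand-in rule (an assumption, not necessarily the authors' exact procedure), observations exceeding both cut-offs computed earlier can be dropped before refitting:

```python
# Drop observations exceeding both the Q and the T^2 95% limits, then refit
keep = ~((Q > q_lim) & (T2 > t2_lim))
X_clean, y_clean = X_sel[keep], y[keep]

pls_clean = PLSRegression(n_components=A).fit(X_clean, y_clean)
y_hat_clean = pls_clean.predict(X_clean).ravel()
print("MAPE:", np.mean(np.abs((y_clean - y_hat_clean) / y_clean)))
```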
Observed vs. predicted plot (selected variables, outliers removed).
Residual plot (selected variables, outliers removed).

Summary
In this article, we successfully performed Hotelling-T² based variable selection using partial least squares, obtaining a 97% reduction in the number of selected variables compared to the model built on the full dataset.
Translated from: https://towardsdatascience.com/hotelling-t%C2%B2-based-variable-selection-in-partial-least-square-pls-165880272363