Ensembles: the only (almost) free lunch in machine learning
A notebook accompanying this post can be found here.
I am grateful to Tetyana Drobot and Igor Pozdeev for their comments and suggestions.
Summary
In this post, I cover the somewhat overlooked topic of ensemble optimization. I begin with a brief overview of some common ensemble techniques and outline their weaknesses. I then introduce a simple ensemble optimization algorithm and demonstrate how to apply it to build ensembles of neural networks with Python and PyTorch. Towards the end of the post, I discuss the effectiveness of ensemble methods in deep learning in the context of the current literature on the loss surface geometry of neural networks.
Key takeaways:
- Strong ensembles consist of models that are both accurate and diverse
- There are ensemble methods that admit realistic target functions which are not suitable as direct optimization objectives for ML models (think of using cross-entropy for training while being interested in some other metric like accuracy)
- Ensembling improves the performance of neural networks not only by dampening their inherent sensitivity to noise but also by combining qualitatively different and uncorrelated solutions
The post is organized as follows:
I. Introduction
II. Ensemble Optimization
III. Building Ensembles of Neural Networks
IV. Ensembles of Neural Networks: the Role of Loss Surface Geometry
V. Conclusion
I. Introduction
An ensemble is a collection of models designed to outperform each of its members by combining their predictions. Strong ensembles comprise models that are accurate, performing well on their own, yet diverse in the sense of making different mistakes. This resonates with me deeply, as I am a finance professional: ensembling is akin to building a robust portfolio of many individual assets, sacrificing higher expected returns on some of them in favor of an overall reduction in risk through diversification. “Diversification is the only free lunch in finance” is a quote attributed to Harry Markowitz, the father of Modern Portfolio Theory. Ensembling and diversification are conceptually related, and in some problems the two are mathematically equivalent, hence the title of this post.
Why almost, though? Because there is always a lingering problem of computational cost given how resource hungry the most powerful models (yes, neural networks) are. In addition to that, ensembling can hurt interpretability of more transparent machine learning algorithms like decision trees by blurring the decision boundaries of individual models — this point does not really apply to neural networks for which the issue of interpretability arises already on the individual model level.
There are several approaches to building ensembles:
- Bagging bootstraps the training set, estimates many copies of a model on the resulting samples, and then averages their predictions.
- Boosting sequentially reweights the training samples, forcing the model to attend to the training examples with higher loss values.
- Stacking uses a separate validation set to train a meta-model that combines the predictions of multiple models.
See, for example, this post by Gilbert Tanner or the one by Joseph Rocca, or this post by Juhi Ramzai for an extensive overview of these methods.
Of course, the methods above come with some common problems. First, given a set of trained models, how do we select the ones that are most likely to generalize well? In the case of stacking, the question becomes ‘how do we reduce the number of ensemble candidates to a manageable amount, so that the stacking model can handle them without a large validation set or a high risk of overfitting?’ Well, just pick the best-performing models and maybe apply weights inversely proportional to their loss, right? Wrong, although it is often a good starting point. Recall that a good ensemble consists of models that are both accurate and diverse: pooling several highly accurate models with strongly correlated predictions would typically result in all of them stepping on the same rake.
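One way to see why correlated models add little is to look at the variance of an equally weighted average. The following is a standard result, stated under the simplifying assumption (not made in the text above) that all n models have the same error variance σ² and the same pairwise error correlation ρ:

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right)
  = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2
  \;\xrightarrow[n \to \infty]{}\; \rho\,\sigma^2
```

With uncorrelated errors (ρ = 0) the variance shrinks as 1/n, whereas with ρ close to 1 it stays near σ² no matter how many models are pooled.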
The second problem is more subtle. Often the machine learning algorithms we train are little more than glorified feature extractors, i.e. the objective in a real-life application might differ significantly from the loss function used to train a model. For instance, the cross-entropy loss is a staple of classification tasks in deep learning because of its differentiability and stable numerical behavior during optimization. Depending on the domain, however, we might be interested in accuracy, the F1 score, or the false negative rate. As a concrete example, consider classifying extreme weather events like floods or hurricanes, where the cost of a Type II error (false negative) could be astronomically high, rendering even accuracy, let alone cross-entropy, useless as an evaluation metric. Similarly, in a regression setting the common loss function is the mean squared error. In finance, for example, it is common to train the same model for every asset in the sample to predict the return over the next period, while in reality there are hundreds of assets in multiple portfolios with optimization objectives similar to the ones encountered in reinforcement learning and optimal control: multiple time horizons along with state and path dependencies. In any case, you are neither judged by nor compensated for a low MSE (unless you are in academia).
In this post I thoroughly discuss the ensemble optimization algorithm of Caruana et al. (2004) which addresses the problems outlined above. The algorithm can be broadly described as model-free greedy stacking, i.e. at every optimization step the algorithm either adds a new model to the ensemble or changes the weights of the current constituents minimizing the total loss without any overarching trainable model guiding the selection process. Equipped with several features allowing it to alleviate the overfitting problem, the Caruana et al. (2004) approach also allows building ensembles optimizing custom metrics that may differ from those used to train individual models, thus addressing the second problem. I further demonstrate how to apply the algorithm: first, to a simple example with a closed form solution and next, to a realistic problem by building an optimal ensemble of neural networks for the MNIST dataset (a complete PyTorch implementation can be found here). Towards the end of the post, I explore the mechanisms underpinning the effectiveness of ensembles in deep learning and discuss the current literature on the role of the loss surface geometry in the generalization properties of neural networks.
The remainder of the post is structured as follows: Section II presents the ensemble optimization approach of Caruana et al. (2004) and illustrates it with a simple numerical example. In Section III I optimize an ensemble of neural networks for the MNIST dataset (PyTorch implementation). Section IV briefly discusses the literature on the optimization landscape in deep learning and its impact on ensembling. Section V concludes.
II. Ensemble Optimization: the Caruana et al. (2004) Algorithm
The approach of Caruana et al. (2004) is rather straightforward. Given a set of trained models and their predictions on a validation set, a variant of their ensemble construction algorithm is as follows:
1. Set init_size, the number of models in the initial ensemble, and max_iter, the maximum number of iterations
2. Initialize the ensemble with the init_size best-performing models by averaging their predictions and computing the total ensemble loss
3. Add to the ensemble the model from the pool (drawn with replacement) that minimizes the total ensemble loss
4. Repeat Step 3 until max_iter is reached
This version of the algorithm includes a couple of features designed to prevent overfitting the validation set. First, initializing the ensemble with several well-performing models forms a strong initial ensemble. Second, drawing models with replacement practically guarantees that the ensemble loss on the validation set does not increase as the iterations progress: if adding another model cannot further improve the ensemble loss, the algorithm adds copies of the incumbent models, essentially adjusting their weights in the final prediction. This weight adjustment property lets us think of the algorithm as model-free stacking. Another interesting feature of this approach is that the loss function used for ensemble construction need not be the one used to train the individual models: as mentioned earlier, we often train models with a particular loss function because of its mathematical or computational convenience, in the (reasonable) hope that the models will generalize well under a related performance metric that is hard to optimize directly. Indeed, in a malignant tumor classification task the value of cross-entropy on the test set should not be our primary concern, in contrast to, for instance, the false negative rate.
The following Python function implements the algorithm:
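A minimal sketch of such a function, written to match the steps above and the arguments discussed later in the post (`loss_function`, `init_size`, `replacement`, `max_iter`). The argument names `y_hats` and `y_true`, the order of the returned values, and the body itself are assumptions made for illustration, not the original listing:

```python
import numpy as np
import pandas as pd


def ensemble_selector(loss_function, y_hats, y_true,
                      init_size=1, replacement=True, max_iter=100):
    """Greedy ensemble selection in the spirit of Caruana et al. (2004).

    y_hats: dict mapping model name -> array of validation predictions.
    Returns (ensemble_loss, model_weights), one entry/row per step.
    """
    names = list(y_hats)
    # Rank the pool by individual validation loss (lower is better).
    single_loss = {m: loss_function(y_hats[m], y_true) for m in names}
    ranked = sorted(names, key=single_loss.get)

    # Steps 1-2: start from the init_size best models, equally weighted.
    members = ranked[:init_size]
    pool = list(names) if replacement else ranked[init_size:]
    y_sum = sum(y_hats[m] for m in members)

    def current_weights():
        # A model's weight is the share of ensemble slots it occupies.
        counts = pd.Series(members).value_counts()
        return (counts / len(members)).reindex(names, fill_value=0.0)

    ensemble_loss = [loss_function(y_sum / len(members), y_true)]
    model_weights = [current_weights()]

    # Steps 3-4: repeatedly add the candidate that gives the lowest loss.
    for _ in range(max_iter):
        if not pool:  # only possible when drawing without replacement
            break
        n = len(members) + 1
        cand_loss = {m: loss_function((y_sum + y_hats[m]) / n, y_true)
                     for m in pool}
        best = min(cand_loss, key=cand_loss.get)
        members.append(best)
        y_sum = y_sum + y_hats[best]
        if not replacement:
            pool.remove(best)
        ensemble_loss.append(cand_loss[best])
        model_weights.append(current_weights())

    return (pd.Series(ensemble_loss, name="loss"),
            pd.DataFrame(model_weights).reset_index(drop=True))
```

Row 0 of both outputs describes the initial ensemble, so a run with `max_iter=10` yields 11 rows. Adding a model that is already in the ensemble simply raises its count, which is how the sketch expresses the weight adjustment described above.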
Caruana et al. (2004) ensemble selection algorithm
Consider the following toy example: assume we have 10 models with zero-mean, normally distributed, uncorrelated predictions. Furthermore, assume that the variance of the predictions decreases linearly from 10 to 1, i.e. the first model has the highest variance and the last model has the lowest. Given a sample of data, the goal is to build an ensemble minimizing the mean squared error against the ground truth of 0. Note that in the context of the Caruana et al. (2004) algorithm ‘build an ensemble’ means assigning a weight between 0 and 1 to each model’s predictions such that the weighted prediction minimizes the MSE, subject to the constraint that all weights sum up to 1.
Finance aficionados will recognize a special case of the minimum variance optimization problem: think of the models’ predictions as returns on some assets, and of the optimization objective as minimizing portfolio variance. The problem has a closed form solution:
$$w = \frac{\Sigma^{-1}\mathbf{1}}{\mathbf{1}^\top \Sigma^{-1} \mathbf{1}}$$
where w is the vector of model weights, Σ is the variance-covariance matrix of predictions, and 1 is a vector of ones. In our case the predictions are uncorrelated and the off-diagonal elements of Σ are zero. The following code snippet solves this toy problem using both the ensemble_selector function and the analytical approach; it also constructs a simple ensemble by averaging the predictions:
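A minimal sketch of the analytical and naive parts of this comparison. The sample size of 10,000 draws, the fixed seed, and all variable names are choices made for illustration; the same `y_hats` dictionary could be passed to `ensemble_selector` together with an MSE loss:

```python
import numpy as np

rng = np.random.default_rng(42)      # fixed seed, an arbitrary choice
n_models, n_obs = 10, 10_000         # n_obs is an assumed sample size
variances = np.linspace(10, 1, n_models)

# Uncorrelated zero-mean predictions; the ground truth is 0 everywhere.
y_true = np.zeros(n_obs)
y_hats = {f"M{i}": rng.normal(0.0, np.sqrt(v), n_obs)
          for i, v in enumerate(variances)}

# Closed form minimum variance weights, using the true (diagonal) Sigma.
sigma = np.diag(variances)
ones = np.ones(n_models)
w = np.linalg.solve(sigma, ones)     # Sigma^{-1} 1
w /= ones @ w                        # divide by 1' Sigma^{-1} 1

preds = np.column_stack(list(y_hats.values()))


def mse(p):
    return float(np.mean((p - y_true) ** 2))


mse_best_single = min(mse(preds[:, i]) for i in range(n_models))
mse_simple = mse(preds.mean(axis=1))   # equal weights for all models
mse_optimal = mse(preds @ w)           # analytical weights

print(np.round(w, 3))
print(mse_best_single, mse_simple, mse_optimal)
```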
Figure 1 below compares the weights implied by the ensemble optimization (in blue) and the closed form solution (in orange). The results match pretty closely, especially given that the analytical solution uses the true variances rather than the sample estimates. Note that although the models with low prediction uncertainty receive higher weights, the weights of the high uncertainty models do not go to zero: the predictions are uncorrelated, and we can always reduce the variance of a weighted sum of random variables by adding an uncorrelated variable (with finite variance, of course).
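For the toy example the closed form can be worked out by hand. With a diagonal Σ, the solution reduces to inverse-variance weights; the numbers below assume the variances are exactly 10, 9, ..., 1, as described above:

```latex
w_i = \frac{1/\sigma_i^2}{\sum_{j=1}^{10} 1/\sigma_j^2},
\qquad
\sum_{j=1}^{10} \frac{1}{\sigma_j^2} = \sum_{k=1}^{10} \frac{1}{k} \approx 2.929,
\qquad
w_1 = \frac{1/10}{2.929} \approx 0.034,
\quad
w_{10} = \frac{1}{2.929} \approx 0.341
```

Every weight is strictly positive, and the resulting ensemble variance is 1/2.929 ≈ 0.341, compared with 0.55 for the simple average and 1 for the best single model.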
Figure 1: Estimated vs Theoretical Optimal Weights
The solid blue line in the next figure plots the ensemble loss for the first 25 iterations of the algorithm. The dashed black and red lines represent, respectively, the loss achieved by the best single model and by a simple ensemble that averages the predictions of all models. After approximately five iterations the optimized ensemble beats the naive one, achieving significantly lower MSE values thereafter.
Figure 2: Ensemble Loss vs Optimization Step
What if the number of models in the pool is very large?
If the model pool is very large, some of the models could overfit the validation set purely by chance. Caruana et al. (2004) suggest using bagging to address this issue. In this case, the algorithm is applied to bags of M models randomly drawn from the pool with replacement, and the final predictions are averaged over the individual bags. For example, if each model has a 25% probability of being drawn into a bag and there are 20 bags, the chance that any particular model ends up in none of the bags is only around 0.3%.
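The 0.3% figure follows from treating the bags as independent draws, which is the assumption behind this small check (the variable names are illustrative):

```python
p_draw = 0.25   # probability that a given model is drawn into one bag
n_bags = 20     # number of independently drawn bags

# A model is absent from every bag only if it is missed n_bags times in a row.
p_never_drawn = (1 - p_draw) ** n_bags
print(f"{p_never_drawn:.4%}")   # prints 0.3171%
```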
III. Building Ensembles of Neural Networks: an MNIST Example
Equipped with the techniques from the previous section, we now apply them to a realistic task: building and optimizing an ensemble of neural networks on the MNIST dataset. The results of this section can be completely replicated using the accompanying notebook, so I keep the code snippets here to a minimum and focus on ensembling rather than on model definitions and training.
We start with a simple MLP with 3 hidden layers of 100 units each and ReLU activations. Naturally, the input for the MNIST dataset is a 28x28 pixel image flattened into a 784-dimensional vector, and the output layer has 10 units, one per digit class. The architecture specified by the MNISTMLP class implemented in PyTorch therefore looks as follows:
MNISTMLP(
  (layers): Sequential(
    (0): Linear(in_features=784, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=100, bias=True)
    (3): ReLU()
    (4): Linear(in_features=100, out_features=100, bias=True)
    (5): ReLU()
    (6): Linear(in_features=100, out_features=10, bias=True)
  )
)
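A class definition that prints as the structure above might look as follows. This is a sketch inferred from the printed layers; the constructor arguments and the `forward` method, which flattens each image and returns unnormalized logits, are assumptions:

```python
import torch
from torch import nn


class MNISTMLP(nn.Module):
    """MLP with three hidden layers of 100 ReLU units for 28x28 inputs."""

    def __init__(self, in_features=784, hidden=100, n_classes=10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        # Flatten (batch, 1, 28, 28) images into (batch, 784) vectors.
        return self.layers(x.flatten(start_dim=1))


model = MNISTMLP()
logits = model(torch.zeros(4, 1, 28, 28))   # dummy batch of 4 images
probs = torch.softmax(logits, dim=1)        # class probabilities per image
n_params = sum(p.numel() for p in model.parameters())
```

Applying a softmax to the logits gives class probabilities of the kind stored per model for the ensemble construction.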
We then train 10 instances of the model with independent weight initializations (i.e. everything is identical except for the starting weights) for 3 epochs each, with a batch size of 32 and a learning rate of 0.001. We reserve 25% of the 60,000-image training set for validation, and the separate 10,000 images comprise the test set. The objective is to minimize the cross-entropy (equivalently, the negative log-likelihood). Note that only 3 epochs of training, together with the rather small capacity of each model, would likely result in underfitting the data, which lets us demonstrate the benefits of ensembling in a more dramatic fashion.
After the training is complete we restore the best checkpoints (by validation loss) of each of the 10 models. The left panel on the figure below shows the validation (in blue) and test (in orange) loss for each model named M0 through M9. Similarly, the right panel plots the validation and test accuracy.
Figure 3: Validation and Test Set Performance by Model
As expected, all models perform rather poorly, with the best one, M7, achieving only 96.8% accuracy on the test set.
To build an optimal ensemble let us first call the ensemble_selector function defined in the previous section and then go over individual arguments in the context of the current problem:
y_hats_val is a dictionary with the model names as keys and the predicted class probabilities for the validation set as values:
>>> y_hats_val["M0"].round(3)
array([[0.   , 0.   , 0.   , ..., 0.998, 0.   , 0.001],
       [0.   , 0.003, 0.995, ..., 0.   , 0.001, 0.   ],
       [0.   , 0.   , 0.   , ..., 0.004, 0.   , 0.975],
       ...,
       [0.999, 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.   , 0.007, 0.   ]])
>>> y_hats_val["M7"].round(3)
array([[0.   , 0.   , 0.   , ..., 1.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.003, 0.   , 0.981],
       ...,
       [0.997, 0.   , 0.002, ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.   , 0.002, 0.   ]])
y_true_one_hot_val is a numpy array of the corresponding true one-hot encoded labels:
>>> y_true_one_hot_val
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
The loss_function is a callable mapping arrays of predictions and labels to a scalar:
>>> cross_entropy(y_hats_val["M7"].round(3), y_true_one_hot_val)
0.010982255936197028
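A loss with this signature could be sketched as below. The clipping constant `eps` and the per-sample averaging are assumptions; the function that produced the value printed above may normalize differently, so this sketch is not expected to reproduce that exact number:

```python
import numpy as np


def cross_entropy(y_hat, y_true, eps=1e-12):
    """Mean negative log-likelihood of one-hot labels under y_hat.

    y_hat:  (n_samples, n_classes) predicted class probabilities.
    y_true: (n_samples, n_classes) one-hot encoded labels.
    """
    y_hat = np.clip(y_hat, eps, 1.0)   # avoid log(0) for rounded probabilities
    return float(-np.mean(np.sum(y_true * np.log(y_hat), axis=1)))


# Tiny illustration with three classes and made-up predictions.
labels = np.eye(3)
uniform = np.full((3, 3), 1 / 3)
loss_uniform = cross_entropy(uniform, labels)   # equals log(3)
loss_perfect = cross_entropy(labels, labels)    # equals 0
```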
Finally, init_size=1 means that we start with an ensemble of a single model; replacement=True means that the models are not removed from the model pool after being added to the ensemble, allowing the algorithm to add the same model several times, thus adjusting the weights of the ensemble constituents; max_iter=10 sets the number of steps the algorithm takes.
Let us now examine the outputs. model_weights is a pandas dataframe containing the ensemble weight of each model for each optimization step. Dropping all models that have zero weights at each optimization step yields:
>>> model_weights.loc[:, (model_weights != 0).any()]
          M1        M4        M5        M7        M9
0   0.000000  0.000000  0.000000  1.000000  0.000000
1   0.000000  0.000000  0.500000  0.500000  0.000000
2   0.000000  0.000000  0.333333  0.333333  0.333333
3   0.000000  0.250000  0.250000  0.250000  0.250000
4   0.200000  0.200000  0.200000  0.200000  0.200000
5   0.166667  0.166667  0.166667  0.333333  0.166667
6   0.142857  0.142857  0.285714  0.285714  0.142857
7   0.125000  0.250000  0.250000  0.250000  0.125000
8   0.111111  0.222222  0.222222  0.222222  0.222222
9   0.100000  0.200000  0.200000  0.300000  0.200000
10  0.181818  0.181818  0.181818  0.272727  0.181818
The following figure plots weights of the ensemble constituents as a function of optimization steps with a darker hue corresponding to a higher average weight a model receives during all optimization steps. The ensemble initializes with a single strongest model M7 at step 0, and then progressively adds more models assigning an equal weight to each: at step 1 there are two models M7 and M5 with a 50% weight each, at step 2 the ensemble includes models M7, M5 and M9 each having a weight of one third. After step 4 no new model can further improve ensemble predictions, and the algorithm starts to adjust the weights of its constituents.
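The weights in the table are simply the share of ensemble slots each model occupies. The sketch below illustrates this with a selection sequence read off the table row by row (the sequence is inferred from the weight changes, not taken from the algorithm's log, and the helper name is hypothetical):

```python
from collections import Counter

# Model added at each step 0..10, inferred from the weight table.
picks = ["M7", "M5", "M9", "M4", "M1", "M7", "M5", "M4", "M9", "M7", "M1"]


def weights_after(step):
    """Ensemble weights once steps 0..step have been applied."""
    counts = Counter(picks[: step + 1])
    return {m: c / (step + 1) for m, c in sorted(counts.items())}


w5 = weights_after(5)     # M7 picked twice out of six slots
w10 = weights_after(10)   # M7 picked three times out of eleven slots
print({m: round(v, 6) for m, v in w10.items()})
```

From step 5 onwards every pick repeats a model that is already in the ensemble, which is the weight adjustment mode described above.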
Figure 4: Ensemble Weights
The other output, ensemble_loss, contains the loss of the ensemble at each optimization step. Similar to Figure 2 from the previous section, the left panel of the figure below plots the ensemble loss on the validation set (solid blue line) as the optimization progresses. The dashed black and red lines represent, respectively, the validation loss achieved by the best single model and by a simple ensemble which assigns equal weights to all models. The ensemble loss decreases quite rapidly, surpassing the performance of its simple counterpart after a couple of iterations and stabilizing after the algorithm enters the weight adjustment mode, which is hardly surprising given that the model pool is rather small. The right panel reports the results for the test set: at each iteration I use the current ensemble weights to produce predictions and measure the loss on the test set. The ensemble generalizes well to the test sample, effectively repeating the pattern observed on the validation set.
Figure 5: Ensemble Loss, MNIST
The Caruana et al. (2004) algorithm is very flexible, and we can easily adapt ensemble_selector to, for instance, directly optimize the accuracy by changing the loss_function argument:
where accuracy is defined as follows:
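One possible definition is sketched below. Because the selection procedure minimizes its loss, the sketch returns the negative accuracy; this sign convention is an assumption, and the original definition may handle the direction of optimization differently:

```python
import numpy as np


def accuracy(y_hat, y_true):
    """Negative share of correct argmax predictions (lower is better).

    y_hat:  (n_samples, n_classes) predicted class probabilities.
    y_true: (n_samples, n_classes) one-hot encoded labels.
    """
    correct = np.argmax(y_hat, axis=1) == np.argmax(y_true, axis=1)
    return -float(np.mean(correct))


# Made-up example: three of four predictions hit the true class.
labels = np.eye(4)
preds = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1],
                  [0.1, 0.1, 0.5, 0.3],
                  [0.4, 0.3, 0.2, 0.1]])
score = accuracy(preds, labels)
```

Unlike cross-entropy, this metric only changes when an argmax flips, so its path over the optimization steps can be expected to be more step-like.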
The following figure repeats the analysis in the previous one but this time for the validation and test accuracy. The conclusions are similar, although the accuracy path of the ensemble is more volatile in both samples.
Figure 6: Ensemble Accuracy, MNIST
IV. More on Ensembling in Neural Networks: Importance of Loss Surface Geometry
Why do random initializations work?
The short answer — it is all about the loss surface. The current deep learning research emphasizes the importance of the optimization landscape. For instance, batch normalization (Ioffe and Szegedy (2015)) is traditionally thought to accelerate and regularize training by reducing internal covariate shift — the change in the distribution of network activations during training. However, Santurkar et al. (2018) provide a compelling argument that the success of the technique stems from another property: batch normalization makes the optimization landscape significantly smoother and thus stabilizes the gradients and speeds up training. In a similar vein, Keskar et al. (2016) argue that sharp minima on the loss surface have poor generalization properties in comparison with minima in flatter regions of the landscape.
During training a neural network can be viewed as a function mapping parameters to loss values given the training data. The figure below plots a (very) simplified illustration of the networks’ loss landscape: the space of solutions and loss are along the horizontal and vertical axes respectively. Each point on the x-axis represents all weights and biases of the network yielding the corresponding loss (the blue solid line). The red dots show local minima where we are likely to end up using gradient-based optimization (the two leftmost dots are the global minima).
In the context of ensembling, this means that we would like to explore many local minima. In the previous section we already saw that the combinations of different initializations of the same neural network architecture result in a superior generalization ability. In fact, in their recent paper Fort et al. (2019) demonstrate that random initializations end up in distant optima and therefore are capable of exploring completely different models with similar accuracy and relatively uncorrelated predictions thus forming strong ensemble components. This finding complements the standard intuition of neural networks being the ultimate low bias-high variance algorithms capable of fitting anything with almost surgical precision albeit plagued by their sensitivity to noise, and therefore benefiting from ensembling due to variance reduction.
But what to do if training several copies of the same model is infeasible?
Huang et al. (2017) propose building an ensemble during a single training run by using a cyclical learning rate with annealing and storing a model checkpoint, or snapshot, at the end of each cycle. Intuitively, increasing the learning rate could allow the model to escape any of the local minima in the figure above and land in a neighboring region with a different local minimum, eventually converging to it as the learning rate subsequently decreases. The next figure illustrates the snapshot ensemble technique. The left plot shows the path over the loss landscape that a model traverses during the standard training regime with a constant learning rate, with the final stopping point marked by a blue flag. The right plot depicts the path with a cyclical learning rate schedule and periodic snapshots marked with red flags.
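The cyclical schedule can be sketched as a cosine that restarts at the beginning of every cycle, following the shifted cosine form given in the snapshot ensembles paper. The function and argument names, as well as the numbers in the example, are illustrative assumptions:

```python
import math


def snapshot_lr(iteration, total_iters, n_cycles, lr_max):
    """Cyclic cosine annealing; `iteration` is 1-based.

    The rate restarts at lr_max every ceil(total_iters / n_cycles)
    iterations and decays towards zero within each cycle.
    """
    cycle_len = math.ceil(total_iters / n_cycles)
    pos = (iteration - 1) % cycle_len   # position inside the current cycle
    return 0.5 * lr_max * (math.cos(math.pi * pos / cycle_len) + 1.0)


def is_snapshot_step(iteration, total_iters, n_cycles):
    """True at the last iteration of each cycle, where a checkpoint is stored."""
    return iteration % math.ceil(total_iters / n_cycles) == 0


# Example: 100 iterations split into 5 cycles of 20 iterations each.
schedule = [snapshot_lr(t, 100, 5, 0.1) for t in range(1, 101)]
snapshots = [t for t in range(1, 101) if is_snapshot_step(t, 100, 5)]
```

The snapshots collected this way can then be treated as a pool of models for averaging or for the ensemble selection procedure from Section II.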
Snapshot Ensembles, Source: Huang et al. (2017)
Remember, however, that in the two previous figures the whole parameter space is compressed into a single point on the x-axis and the xy-plane respectively, meaning that a pair of neighboring points on the graph might be very far apart in the real parameter space. The ability of a gradient descent algorithm to traverse multiple minima without getting stuck therefore depends on whether the corresponding valleys on the loss surface are separated by regions of very high loss, such that no meaningful increase in the learning rate would result in a transition to a new valley.
Fortunately, Garipov et al. (2018) demonstrate that there exist low-loss paths connecting the local minima on the optimization landscape and propose the fast geometric ensembling (FGE) procedure exploiting these connections. Izmailov et al. (2018) propose a further refinement of FGE — stochastic weight averaging (SWA).
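In contrast to the prediction averaging used elsewhere in this post, SWA averages the network weights themselves. A minimal sketch of the running-average update, with parameters represented as flat NumPy vectors and random stand-in checkpoints (both are assumptions for illustration; a real implementation would iterate over the network's parameter tensors):

```python
import numpy as np


def update_swa(w_swa, w_new, n_averaged):
    """Fold one more checkpoint into an equal-weight running average."""
    return (w_swa * n_averaged + w_new) / (n_averaged + 1)


rng = np.random.default_rng(0)
checkpoints = [rng.normal(size=5) for _ in range(4)]   # stand-in weights

w_swa = checkpoints[0].copy()
for n, w in enumerate(checkpoints[1:], start=1):
    w_swa = update_swa(w_swa, w, n)
```

The result is a single set of weights, so inference costs the same as for one model, whereas an ensemble of snapshots needs one forward pass per member.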
Max Pechyonkin provides an excellent overview of snapshot ensembles, FGE, and SWA.
V. Conclusion
Let us recap the key takeaways of this post:
- Strong ensembles consist of models that are both accurate and diverse
- Some model-free ensemble methods like the Caruana et al. (2004) algorithm admit realistic target functions which are not suitable as optimization objectives for ML models
- Ensembling improves the performance of neural networks not only by dampening their inherent sensitivity to noise but also by combining qualitatively different and uncorrelated solutions
To sum up, ensemble learning techniques should arguably be among the most important tools in the arsenal of every machine learning practitioner. It is indeed quite fascinating how far the old adage of not putting all eggs in one basket goes.
Thank you for reading. Comments and feedback are eagerly anticipated. Also, connect with me on LinkedIn.
Translated from: https://towardsdatascience.com/ensembles-the-almost-free-lunch-in-machine-learning-91af7ebe5090