机器学习论文:《LightGBM: A Highly Efficient Gradient Boosting Decision Tree》
翻譯自《LightGBM: A Highly Efficient Gradient Boosting Decision Tree》
摘要
Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
-
梯度提升決策樹是著名的機器學(xué)習(xí)算法,且有一些有效的實現(xiàn): XGBoost、pGBRT
-
盡管以上實現(xiàn)做了工程優(yōu)化,在特征維度較高、數(shù)據(jù)集較大時,效率和可擴展性并不滿足要求,一個主要原因是,對于每個特征,都要掃描所有數(shù)據(jù)以計算不同分割點的信息增益
-
為應(yīng)對這個問題,我們提出兩項新技術(shù):基于梯度的單邊采樣(GOSS)和互斥特征綁定(EFB)
-
對于梯度單邊采樣GOSS,我們排除了相當(dāng)一部分小梯度的數(shù)據(jù)實例,只使用剩余數(shù)據(jù)來估計信息增益。我們證明了,因為較大梯度的數(shù)據(jù)在信息增益計算中起主要作用,GOSS使用小得多的數(shù)據(jù)集也能獲得較準(zhǔn)確的信息增益估計。
-
對于EFB,我們將互斥特征(幾乎不會同時非零)相互綁定來減少特征數(shù)量。我們證明了找到最優(yōu)的綁定方式是NP難題,但貪婪算法可以獲得不錯的近似(故可以有效減少特征數(shù),且對分割點的判斷準(zhǔn)確性不會造成太大影響)
-
我們將這個用GOSS和EGB實現(xiàn)的GBDT算法成為LightGBM。在多個公開數(shù)據(jù)集的實驗顯示,LightGBM將傳統(tǒng)GBDT的訓(xùn)練過程加速了20倍,且獲得相同的準(zhǔn)確性。
引言
Gradient boosting decision tree (GBDT) [1] is a widely-used machine learning algorithm, due to its efficiency, accuracy, and interpretability. GBDT achieves state-of-the-art performances in many machine learning tasks, such as multi-class classification [2], click prediction [3], and learning to rank [4]. In recent years, with the emergence of big data (in terms of both the number of features and the number of instances), GBDT is facing new challenges, especially in the tradeoff between accuracy and efficiency. Conventional implementations of GBDT need to, for every feature, scan all the data instances to estimate the information gain of all the possible split points. Therefore, their computational complexities will be proportional to both the number of features and the number of instances. This makes these implementations very time consuming when handling big data. To tackle this challenge, a straightforward idea is to reduce the number of data instances and the number of features. However, this turns out to be highly non-trivial. For example, it is unclear how to perform data sampling for GBDT. While there are some works that sample data according to their weights to speed up the training process of boosting [5, 6, 7], they cannot be directly applied to GBDT?since there is no sample weight in GBDT at all. In this paper, we propose two novel techniques towards this goal, as elaborated below. Gradient-based One-Side Sampling (GOSS). While there is no native weight for data instance in GBDT, we notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, those instances with larger gradients1 (i.e., under-trained instances) will contribute more to the information gain. Therefore, when down sampling the data instances, in order to retain the accuracy of information gain estimation, we should better keep those instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles), and only randomly drop those instances with small gradients. We prove that such a treatment can lead to a more accurate gain estimation than uniformly random sampling, with the same target sampling rate, especially when the value of information gain has a large range. Exclusive Feature Bundling (EFB). Usually in real applications, although there are a large number of features, the feature space is quite sparse, which provides us a possibility of designing a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include the one-hot features (e.g., one-hot word representation in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges for every two features if they are not mutually exclusive), and solving it by a greedy algorithm with a constant approximation ratio. We call the new GBDT algorithm with GOSS and EFB LightGBM2 . Our experiments on multiple public datasets show that LightGBM can accelerate the training process by up to over 20 times while achieving almost the same accuracy. The remaining of this paper is organized as follows. At first, we review GBDT algorithms and related work in Sec. 2. Then, we introduce the details of GOSS in Sec. 3 and EFB in Sec. 4. Our experiments for LightGBM on public datasets are presented in Sec. 5. Finally, we conclude the paper in Sec. 6.
- 梯度提升決策樹GBDT是廣泛運用的一種機器學(xué)習(xí)算法,因其高效、準(zhǔn)確、可解釋。
- 在很多機器學(xué)習(xí)任務(wù)中,GBDT達到最佳性能,例如多分類、點擊預(yù)測、排序?qū)W習(xí)
- 近年來,由于大數(shù)據(jù)(樣本數(shù)和特征數(shù))出現(xiàn),GBDT遇到了挑戰(zhàn),特別是準(zhǔn)確性和效率的權(quán)衡。
- GBDT 的傳統(tǒng)實現(xiàn)中,對每個特征,需要掃描整個數(shù)據(jù)集來估計分割點的增益,因此計算復(fù)雜度和樣本數(shù)還有特征數(shù)成正比使得這種實現(xiàn)在大數(shù)據(jù)集下耗時明顯
- 為了應(yīng)對這個問題,一個直接的想法是減少數(shù)據(jù)示例數(shù)和特征數(shù),然而這個并不簡單,例如,不知道如何對GBDT做數(shù)據(jù)采樣。盡管有關(guān)于根據(jù)權(quán)重對數(shù)據(jù)采樣從加快訓(xùn)練速度的研究,但不能直接應(yīng)用于GBDT,因為GBDT樣本沒有權(quán)重;
- 本文提出兩種新方法來解決,如下文所述。
- 基于梯度的單邊采樣GOSS;盡管GBDT的數(shù)據(jù)實例沒有權(quán)重,我們注意到不同梯度的數(shù)據(jù)實例在信息增益的計算中作用不一。根據(jù)信息增益的定義,有較大梯度的實例對信息增益的貢獻較大;因此對數(shù)據(jù)下采樣時,為了保留信息增益估計的準(zhǔn)確性,應(yīng)該保留梯度較大的數(shù)據(jù)實例(大于某個閾值或梯度前幾的),并隨機丟棄梯度較小的實例。我們證明了在相同采樣率下,這種處理比隨機均勻采樣對增益估計更準(zhǔn)確,特別是在信息增益范圍較大時。
- 互斥特征綁定EFB;在實際應(yīng)用中,盡管有大量特征,特征空間往往很稀疏,這為我們設(shè)計一個幾乎無損的減少有效特征數(shù)的方法提供了可能。特別的,在一個稀疏的特征空間,很多特征幾乎互斥(幾乎不同時在一個特征取非零值)。有獨熱編碼的實例(如文本挖掘的獨熱碼表示),這類特征可以安全的綁定。為此,我們設(shè)計了一個有效算法,將最優(yōu)綁定問題簡化為一個圖上色問題(將特征視作頂點,若兩個特征互斥則它們之間添加一條邊)。用固定近似比的貪婪算法來求解。
- 我們將這個新的GOSS和EGB的GBDT算法成為LightGBM2。在多個公共數(shù)據(jù)集的實驗顯示,LightGBM可以將訓(xùn)練過程加速20倍以上,且達到幾乎相同的準(zhǔn)確率。
- 本文剩余部分編排如下:首先回顧GBDT算法和相關(guān)工作。然后介紹GOSS的細(xì)節(jié),再來EGB,LightGBM在公開數(shù)據(jù)集的實驗結(jié)果,結(jié)論。
初步認(rèn)識
GBDT 與其復(fù)雜度分析
GBDT is an ensemble model of decision trees, which are trained in sequence [1]. In each iteration, GBDT learns the decision trees by fitting the negative gradients (also known as residual errors). The main cost in GBDT lies in learning the decision trees, and the most time-consuming part in learning a decision tree is to find the best split points. One of the most popular algorithms to find split points is the pre-sorted algorithm [8, 9], which enumerates all possible split points on the pre-sorted feature values. This algorithm is simple and can find the optimal split points, however, it is inefficient in both training speed and memory consumption. Another popular algorithm is the histogram-based algorithm [10, 11, 12], as shown in Alg. 13 . Instead of finding the split points on the sorted feature values, histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we will develop our work on its basis. As shown in Alg. 1, the histogram-based algorithm finds the best split points based on the feature histograms. It costs O(#data × #feature) for histogram building and O(#bin × #feature) for split point finding. Since #bin is usually much smaller than #data, histogram building will dominate the computational complexity. If we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT
-
GBDT 是決策樹的集成模型,按一定順序訓(xùn)練。每次迭代中,GBDT通過擬合負(fù)梯度(殘差)來學(xué)習(xí)決策樹。
-
GBDT 最主要的復(fù)雜度在學(xué)習(xí)決策樹上,而學(xué)習(xí)決策樹最耗時的部分在于找到最佳分割點。分割點尋找最佳算法之一是預(yù)先排序算法,將所有可能的分割點在預(yù)先排好序的特征值上標(biāo)出,這種算法簡單且可以找到最佳分割點,然而在訓(xùn)練速度和內(nèi)存消耗上都不高效。
-
另一個算法是基于直方圖的算法那,相比在排好序的特征值上搜尋,直方圖算法將連續(xù)特征值劃分為多個離散的桶,并在訓(xùn)練中通過這些桶構(gòu)造直方圖。因為此方法在內(nèi)存消耗和訓(xùn)練速度上相對高效,我們將基于此進展工作。直方圖算法那基于特征直方圖找到最佳分割點。直方圖構(gòu)建復(fù)雜度O(#data × #feature),分割點搜索復(fù)雜度O(#bin × #feature)。因為桶的數(shù)量通常遠(yuǎn)小于 #data,直方圖構(gòu)建占據(jù)大部分復(fù)雜度。若能減少 #data 或#feature,就可以加速GBDT的訓(xùn)練了。
相關(guān)工作
There have been quite a few implementations of GBDT in the literature, including XGBoost [13], pGBRT [14], scikit-learn [15], and gbm in R [16] 4 . Scikit-learn and gbm in R implements the presorted algorithm, and pGBRT implements the histogram-based algorithm. XGBoost supports both?the pre-sorted algorithm and histogram-based algorithm. As shown in [13], XGBoost outperforms the other tools. So, we use XGBoost as our baseline in the experiment section. To reduce the size of the training data, a common approach is to down sample the data instances. For example, in [5], data instances are filtered if their weights are smaller than a fixed threshold. SGB [20] uses a random subset to train the weak learners in every iteration. In [6], the sampling ratio are dynamically adjusted in the training progress. However, all these works except SGB [20] are based on AdaBoost [21], and cannot be directly applied to GBDT since there are no native weights for data instances in GBDT. Though SGB can be applied to GBDT, it usually hurts accuracy and thus it is not a desirable choice. Similarly, to reduce the number of features, it is natural to filter weak features [22, 23, 7, 24]. This is usually done by principle component analysis or projection pursuit. However, these approaches highly rely on the assumption that features contain significant redundancy, which might not always be true in practice (features are usually designed with their unique contributions and removing any of them may affect the training accuracy to some degree). The large-scale datasets used in real applications are usually quite sparse. GBDT with the pre-sorted algorithm can reduce the training cost by ignoring the features with zero values [13]. However, GBDT with the histogram-based algorithm does not have efficient sparse optimization solutions. The reason is that the histogram-based algorithm needs to retrieve feature bin values (refer to Alg. 1) for each data instance no matter the feature value is zero or not. It is highly preferred that GBDT with the histogram-based algorithm can effectively leverage such sparse property. To address the limitations of previous works, we propose two new novel techniques called Gradientbased One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). More details will be introduced in the next sections.
- 文章中有GBDT的多個實現(xiàn),包括XGBoost,pGBRT,scikit-learn,R語言的gbm。scikit-learn和gbm 實現(xiàn)的是預(yù)先排序算法,pGBRT實現(xiàn)了直方圖算法。
- XGBoost 同時支持預(yù)先排序和直方圖算法。在文獻[13]中,XGBoost 表現(xiàn)得比其他方法出色。所以我們的實驗選擇XGBoost 作為baseline。
- 要減少訓(xùn)練樣本數(shù),常用方法是下采樣。例如文獻【5】中當(dāng)數(shù)據(jù)權(quán)重小于一個指定閾值就被過濾。SGB【20】在每次迭代占用使用隨機的子集來訓(xùn)練弱學(xué)習(xí)器。【6】采樣率動態(tài)調(diào)整。然而所有的方法除了SGB都是基于adaboost的,不能直接應(yīng)用于GBDT,因為GBDT的數(shù)據(jù)實例沒有權(quán)重。即使SGB可以用于GBDT,它會影響準(zhǔn)確率,所以也不是理想的選擇。
- 類似的,要減少特征數(shù),自然想到對弱特征進行過濾【22,23,7,24】。通常用主成分分析和投影尋蹤。然而這些方法高度依賴一個假設(shè):特征有大量冗余,實際應(yīng)用中未必滿足(特征通常由各自的分布產(chǎn)生,去掉任何一個都可能某種程度上影響訓(xùn)練準(zhǔn)確率)。實際應(yīng)用中的大范圍數(shù)據(jù)集往往很稀疏,預(yù)先排序的GBDT可以通過忽略0值的特征來減少訓(xùn)練復(fù)雜度【13】。然而直方圖GBDT沒有有效的稀疏優(yōu)化方案。因為直方圖算法需要為每個數(shù)據(jù)實例提取特征桶的值,無論特征值是否為0.
- 可以處理好稀疏特征的直方圖GBDT 還是首選。為解決以往工作的局限性,我們提出兩點創(chuàng)新技術(shù):GOSS 梯度單邊采樣和互斥特征綁定EFB。更多細(xì)節(jié)在下一節(jié)
?
單邊梯度采樣
In this section, we propose a novel sampling method for GBDT that can achieve a good balance between reducing the number of data instances and keeping the accuracy for learned decision trees.
-
這一節(jié)我們對GBDT提出一個新的方法,在減少樣本數(shù)和保證學(xué)到的決策樹的準(zhǔn)確率上達到一個不錯的平衡。
算法描述
In AdaBoost, the sample weight serves as a good indicator for the importance of data instances. However, in GBDT, there are no native sample weights, and thus the sampling methods proposed for AdaBoost cannot be directly applied. Fortunately, we notice that the gradient for each data instance in GBDT provides us with useful information for data sampling. That is, if an instance is associated with a small gradient, the training error for this instance is small and it is already well-trained. A straightforward idea is to discard those data instances with small gradients. However, the data distribution will be changed by doing so, which will hurt the accuracy of the learned model. To avoid this problem, we propose a new method called Gradient-based One-Side Sampling (GOSS).
GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients. In order to compensate the influence to the data distribution, when computing the information gain, GOSS introduces a constant multiplier for the data instances with small gradients (see Alg. 2). Specifically, GOSS firstly sorts the data instances according to the absolute value of their gradients and selects the top a×100% instances. Then it randomly samples b×100% instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by a constant when calculating the information gain. By doing so, we put more focus on the under-trained instances without changing the original data distribution by much.
-
在AdaBoost 中,樣本權(quán)重代表了實例的重要性。然而在GBDT中,沒有原始的樣本權(quán)重,因此adaboost提出的采樣方法不能直接應(yīng)用。幸運的是,我們注意到GBDT中每個實例的梯度給我們提供了很有用的信息用于數(shù)據(jù)采樣:若一個實例具有小梯度,這個實例的訓(xùn)練誤差就比較小,說明已經(jīng)訓(xùn)練的較好了(是因為梯度小說明接近極值么??)。一個直觀的想法是丟棄這些小梯度的樣本。然而,這么做會改變數(shù)據(jù)的分布,影響所學(xué)模型的準(zhǔn)確度。為了避免這個問題,我們提出了新的方法叫做:單邊梯度采樣(GOSS)
-
GOSS 保留所有大梯度樣本,對小梯度樣本實例采樣隨機采樣。為了補償對數(shù)據(jù)分布的影響,在計算信息增益時,GOSS對小梯度的實例添加了一個常數(shù)乘子(是否和隨機采樣率對應(yīng),使得采樣后的樣本數(shù)疊加了乘子后,小梯度樣本對信息增益貢獻和原來相近??)GOSS首先依照樣本梯度的絕對值對其排序,選擇前a×100% 個樣本。然后對剩余樣本(小梯度)應(yīng)用b×100%的采樣率。然后GOSS在計算信息增益時,對采樣后的小梯度樣本添加一個倍數(shù) 放大。從而對訓(xùn)練好的實例賦予更多的關(guān)注,同時不過多改變原數(shù)據(jù)的分布.
理論分析
GBDT uses decision trees to learn a function from the input space X s to the gradient space G [1]. Suppose that we have a training set with n i.i.d. instances {x1, · · · , xn}, where each xi is a vector with dimension s in space??. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the output of the model are denoted as {g1, · · · , gn}. The decision tree model splits each node at the most informative feature (with the largest information gain). For GBDT, the information gain is usually measured by the variance after splitting, which is defined as below.
- GBDT 使用決策樹學(xué)習(xí)從輸入空間X到梯度空間G的函數(shù)。假設(shè)我們有n個樣本訓(xùn)練集{x1, · · · , xn},每個xi是?空間的s維向量。在梯度提升的每個迭代,損失函數(shù)關(guān)于模型輸出的負(fù)梯度標(biāo)記為: {g1, · · · , gn}。決策樹模型將每個節(jié)點在最多信息(最大信息增益)的特征上分割。對于GBDT,信息增益通常由分割后的方差衡量,定義如下:
定義3.1
Let O be the training dataset on a fixed node of the decision tree. The variance gain of splitting feature j at point d for this node is defined as
where , and .
For feature j, the decision tree algorithm selects and calculates the largest gain . 5 Then, the data are split according feature j ? at point dj ? into the left and right child nodes. In our proposed GOSS method, first, we rank the training instances according to their absolute values of their gradients in the descending order; second, we keep the top-a × 100% instances with the larger gradients and get an instance subset A; then, for the remaining set Ac consisting (1 ? a) × 100% instances with smaller gradients, we further randomly sample a subset B with size b × |Ac |; finally, we split the instances according to the estimated variance gain over the subset A ∪ B, i.e.,
where ,,,Br = {xi ∈ B : xij > d}, and the coefficient is used to normalize the sum of the gradients over B back to the size of Ac . Thus, in GOSS, we use the estimated over a smaller instance subset, instead of the accurate over all the instances to determine the split point, and the computation cost can be largely reduced. More importantly, the following theorem indicates that GOSS will not lose much training accuracy and will outperform random sampling. Due to space restrictions, we leave the proof of the theorem to the supplementary materials.
- 讓O代表一棵決策樹某節(jié)點的訓(xùn)練樣本,在特征 j 上值為d 處分割的方差增益為:
其中?(該樹節(jié)點樣本數(shù)),?(該節(jié)點特征j 小于等于d的樣本數(shù)),?(該節(jié)點特征j大于d的樣本數(shù)) - 對特征j,決策樹算法選擇??作為分割點并計算最大增益:,然后數(shù)據(jù)根據(jù)特征??和對應(yīng)的值??將該節(jié)點數(shù)據(jù)分為左右子節(jié)點。
- 在我們提出的GOSS算法中,首先我們將樣本按照梯度絕對值降序排列,保留a × 100%個大梯度的實例作為子集A,對于剩余的?(1 ? a) × 100%的小梯度子集Ac,我們隨機采樣出b × |Ac | 大小的子集B,最終我們根據(jù)在?A ∪ B上估計的方差增益?來分割樣本。
其中?,,,Br = {xi ∈ B :xij > d},系數(shù)?是用于將B的梯度和標(biāo)準(zhǔn)化為原來Ac的大小(不是除以b就行么??)。 - 因此在GOSS中,我們用在子集上估計的,代替在所有樣本計算的,來尋找分割點。大大減少了計算復(fù)雜度。更重要的是,接下來的理論顯示,BOSS不會損失太大的訓(xùn)練準(zhǔn)確率,性能比隨機采樣更好。因為篇幅限制,我們將理論證明放到補充資料中。
理論3.1
We denote the approximation error in GOSS as and ??,?. With probability at least 1 ? δ, we have
where , and .
According to the theorem, we have the following discussions: (1) The asymptotic approximation ratio of GOSS is . If the split is not too unbalanced (i.e., and ), the approximation error will be dominated by the second term of Ineq.(2) which decreases to 0 in with n → ∞. That means when number of data is large, the approximation is quite accurate. (2) Random sampling is a special case of GOSS with a = 0. In many cases, GOSS could outperform random sampling, under the condition , which is equivalent to with . Next, we analyze the generalization performance in GOSS. We consider the generalization error in GOSS , which is the gap between the variance gain calculated by the sampled training instances in GOSS and the true variance gain for the underlying distribution. We have . Thus, the generalization error with GOSS will be close to that calculated by using the full data instances if the GOSS approximation is accurate. On the other hand, sampling will increase the diversity of the base learners, which potentially help to improve the generalization performance [24].
-
我們將GOSS中的近似誤差記為?,,?,至少有1 ? δ 的概率:
,其中?, -
根據(jù)這個理論有以下討論:
-
GOSS 的漸進近似比為:,若分割很不平衡(且),近似誤差會被第二項(2DC……)占據(jù),這一項在n 趨于無窮時根據(jù)減小到0。這意味著當(dāng)數(shù)據(jù)量很大時,這個近似是較為準(zhǔn)確的。
-
對樣本隨機采樣是GOSS 在a = 0 的一個特殊情況。在多數(shù)情況下,GOSS在 ?下的表現(xiàn)的優(yōu)于隨機采樣,這個條件等價于,其中?(結(jié)合Ca,b的定義,?等價于)?
-
-
接下來我們分析GOSS的泛化性能。我們考慮GOSS的泛化誤差:,這是GOSS采樣過的數(shù)據(jù)的方差增益和原始數(shù)據(jù)分布的真實方差增益的誤差,有:,因此若GOSS近似是準(zhǔn)確的,泛化誤差與使用所有數(shù)據(jù)計算得到的相近。
-
另一方面,采樣會增加基礎(chǔ)學(xué)習(xí)器的多樣性,潛在改進了泛化性能。
(以上是GOSS方法合理性的說明,詳細(xì)證明參考相關(guān)資料)
?
互斥特征綁定
In this section, we propose a novel method to effectively reduce the number of features
- 本節(jié),我們提出一種新的減少樣本特征數(shù)的方法。
High-dimensional data are usually very sparse. The sparsity of the feature space provides us a possibility of designing a nearly lossless approach to reduce the number of features. Specifically, in a sparse feature space, many features are mutually exclusive, i.e., they never take nonzero values simultaneously. We can safely bundle exclusive features into a single feature (which we call an exclusive feature bundle). By a carefully designed feature scanning algorithm, we can build the same feature histograms from the feature bundles as those from individual features. In this way, the complexity of histogram building changes from O(#data × #feature) to O(#data × #bundle), while #bundle << #feature. Then we can significantly speed up the training of GBDT without hurting the accuracy. In the following, we will show how to achieve this in detail. There are two issues to be addressed. The first one is to determine which features should be bundled together. The second is how to construct the bundle.
- 高維數(shù)據(jù)往往很稀疏。特征空間的稀疏性為我們設(shè)計一種近似無損的減少特征數(shù)方法提供了可能。
- 特別的,在一個稀疏特征空間,很多特征互斥(不會同時取非0值),我們可以安全的將互斥特征綁定為單一特征(成為互斥特征綁定)。通過精心設(shè)計一個特征掃描算法,我們可以基于綁定的特征建立與原始獨立特征相同的特征直方圖。
- 這里直方圖構(gòu)建的構(gòu)建復(fù)雜度從O(#data × #feature) 變?yōu)? O(#data × #bundle),其中?#bundle << #feature。然后可以極大加速GBDT的訓(xùn)練,同時不影響準(zhǔn)確率。
- 接下來我們講述實現(xiàn)細(xì)節(jié),有量問題待解決,一個是決定哪些特征要綁定在一起,第二個是如何綁定
理論4.1
The problem of partitioning features into a smallest number of exclusive bundles is NP-hard.
Proof: We will reduce the graph coloring problem [25] to our problem. Since graph coloring problem is NP-hard, we can then deduce our conclusion.
Given any instance G = (V, E) of the graph coloring problem. We construct an instance of our problem as follows. Take each row of the incidence matrix of G as a feature, and get an instance of our problem with |V | features. It is easy to see that an exclusive bundle of features in our problem corresponds to a set of vertices with the same color, and vice versa.
For the first issue, we prove in Theorem 4.1 that it is NP-Hard to find the optimal bundling strategy, which indicates that it is impossible to find an exact solution within polynomial time. In order to find a good approximation algorithm, we first reduce the optimal bundling problem to the graph coloring problem by taking features as vertices and adding edges for every two features if they are not mutually exclusive, then we use a greedy algorithm which can produce reasonably good results (with a constant approximation ratio) for graph coloring to produce the bundles. Furthermore, we notice that there are usually quite a few features, although not 100% mutually exclusive, also rarely take nonzero values simultaneously. If our algorithm can allow a small fraction of conflicts, we can get an even smaller number of feature bundles and further improve the computational efficiency.? By simple calculation, random polluting a small fraction of feature values will affect the training accuracy by at most O([(1 ? γ)n] ?2/3 )(See Proposition 2.1 in the supplementary materials), where γ is the maximal conflict rate in each bundle. So, if we choose a relatively small γ, we will be able to achieve a good balance between accuracy and efficiency.
Based on the above discussions, we design an algorithm for exclusive feature bundling as shown in Alg. 3. First, we construct a graph with weighted edges, whose weights correspond to the total conflicts between features. Second, we sort the features by their degrees in the graph in the descending order. Finally, we check each feature in the ordered list, and either assign it to an existing bundle with a small conflict (controlled by γ), or create a new bundle. The time complexity of Alg. 3 is O(#feature2 ) and it is processed only once before training. This complexity is acceptable when the number of features is not very large, but may still suffer if there are millions of features. To further improve the efficiency, we propose a more efficient ordering strategy without building the graph: ordering by the count of nonzero values, which is similar to ordering by degrees since more nonzero values usually leads to higher probability of conflicts. Since we only alter the ordering strategies in Alg. 3, the details of the new algorithm are omitted to avoid duplication.
For the second issues, we need a good way of merging the features in the same bundle in order to reduce the corresponding training complexity. The key is to ensure that the values of the original features can be identified from the feature bundles. Since the histogram-based algorithm stores discrete bins instead of continuous values of the features, we can construct a feature bundle by letting exclusive features reside in different bins. This can be done by adding offsets to the original values of the features. For example, suppose we have two features in a feature bundle. Originally, feature A takes value from [0, 10) and feature B takes value [0, 20). We then add an offset of 10 to the values of feature B so that the refined feature takes values from [10, 30). After that, it is safe to merge features A and B, and use a feature bundle with range [0, 30] to replace the original features A and B. The detailed algorithm is shown in Alg. 4.
EFB algorithm can bundle many exclusive features to the much fewer dense features, which can effectively avoid unnecessary computation for zero feature values. Actually, we can also optimize the basic histogram-based algorithm towards ignoring the zero feature values by using a table for each feature to record the data with nonzero values. By scanning the data in this table, the cost of histogram building for a feature will change from O(#data) to O(#non_zero_data). However, this method needs additional memory and computation cost to maintain these per-feature tables in the whole tree growth process. We implement this optimization in LightGBM as a basic function. Note, this optimization does not conflict with EFB since we can still use it when the bundles are sparse
-
將特征分割為最少的互斥捆綁是個NP難題
-
證明:我們引入圖上色問題,因為圖上色是NP難題,我們可以推導(dǎo)出結(jié)論
-
考慮一個上色問題下的圖G = (V, E),我們按照以下步驟構(gòu)建我們需解決問題的實例。將G的關(guān)聯(lián)矩陣每一行視作特征,那么我們的問題就有|V|個特征。可以看出我們問題里特征的互斥綁定對應(yīng)于相同顏色的定點集。
-
第一件事(哪些特征綁定在一起),我們證明找到最優(yōu)綁定策略是NP難題,意味著在多項式時間內(nèi)找到準(zhǔn)確解是不可能的。為了找到一個良好的近似算法,我們先將最優(yōu)綁定問題簡化為圖上色問題,其中將特征視作圖的定點,若兩個特征間不互斥就在對應(yīng)定點上添加邊。然后我們使用貪婪算法(一般可以得到較為合理的不錯結(jié)果),使用一個近似比常數(shù),用于圖上色,以找到綁定關(guān)系(相連的定點不能上相同的色,最后相同色的特征頂點可以綁定在一起)。我們注意到很多特征即使不是100%互斥,也極少會在同一特征上取非零值。?若我們的算法允許一小部分沖突,我們可以得到更小的特征綁定集合,進一步提高計算效率。簡單計算下,隨機污染一小部分特征值會對訓(xùn)練準(zhǔn)確率造成最多?的影響(詳見補充資料), γ 是每個綁定中的最大沖突率。所以我們選擇一個而相對小的 γ,能夠在準(zhǔn)確率和效率中達到一個好的平衡。(將特征分成多份綁定,每一份中的特征應(yīng)互斥,允許的沖突率越大,對互斥的要求越低?)
-
基于以上討論,我們?yōu)榛コ馓卣鹘壎ㄔO(shè)計一套算法。首先我們構(gòu)建一個加權(quán)圖,權(quán)重表示特征之間的沖突(是指同時取非0值的次數(shù)??)。然后我們按照特征(頂點)的度數(shù)(出入度)降序排列,接著我們按順序考察每個特征,要么將其放入一個已有的綁定中(y控制沖突),或創(chuàng)建一個新的綁定。時間復(fù)雜度是 O(#feature2 ) ,只需要在訓(xùn)練前處理一次。當(dāng)特征數(shù)不是特別大時,這個復(fù)雜度可以接受,但有上百萬特征時依然是個問題。為了進一步提高效率,我們提出一個更有效的排序策略(不需構(gòu)建圖):按照非0值的數(shù)量排序,這與按度數(shù)排序近似,因為更多的非0值意味著更高的沖突概率(有沖突就是有關(guān)聯(lián)了,在圖中就會有邊,度數(shù)就多了)。因為我們值改變了排序的策略,算法其他細(xì)節(jié)這里省略。
-
第二個問題(如何綁定),我們需要一個好方法來將同一份綁定中的特征合并,以減少訓(xùn)練復(fù)雜度。關(guān)鍵在于保證原特征值可以在綁定中識別出。因為直方圖算法存儲了離散的桶而不是特征的連續(xù)值,我們可以構(gòu)建一個綁定,使得互斥特征落到不同的桶中。可以通過對特征的原始值添加一個偏移來完成。例如我們再一個綁定中有兩個特征,特征A取值范圍 [0,10),特征B取值范圍 [0, 20),我們對特征B添加一個偏移10,調(diào)整后的特征取值范圍是[10,30)。這樣就能安全的合并特征A和B,該特征綁定的取值范圍是[0, 30)。
-
EFB 算法可以將多個互斥特征綁定,轉(zhuǎn)化為更少的緊密聯(lián)系的特征,減少了0特征值的不必要計算。實際上我們可以優(yōu)化基本的直方圖算法以忽略0特征值,這要通過使用一個特征表只記錄每個特征非0的值。通過掃描這張表,特征的直方圖構(gòu)建復(fù)雜度從O(#data) 到O(#non_zero_data)。然而此方法需要額外存儲空間還有額外計算,在樹增長過程中存放著特征表。我們在LightGBM? 將此優(yōu)化 作為一個基本函數(shù)。注意這個優(yōu)化與EFB不沖突,因為在綁定很稀疏時依然會使用它。
?
實驗
In this section, we report the experimental results regarding our proposed LightGBM algorithm. We use five different datasets which are all publicly available. The details of these datasets are listed in Table 1. Among them, the Microsoft Learning to Rank (LETOR) [26] dataset contains 30K web search queries. The features used in this dataset are mostly dense numerical features. The Allstate Insurance Claim [27] and the Flight Delay [28] datasets both contain a lot of one-hot coding features. And the last two datasets are from KDD CUP 2010 and KDD CUP 2012. We directly use the features used by the winning solution from NTU [29, 30, 31], which contains both dense and sparse features, and these two datasets are very large. These datasets are large, include both sparse and dense features, and cover many real-world tasks. Thus, we can use them to test our algorithm thoroughly. Our experimental environment is a Linux server with two E5-2670 v3 CPUs (in total 24 cores) and 256GB memories. All experiments run with multi-threading and the number of threads is fixed to 16
-
這一節(jié)展示我們提出的LightGBM 的實驗結(jié)果。我們使用5哥不同的公開數(shù)據(jù)集,數(shù)據(jù)細(xì)節(jié)見表1.Microsoft Learning to Rank 數(shù)據(jù)集有30k個網(wǎng)絡(luò)查詢。這個數(shù)據(jù)集的特征是稠密的數(shù)值特征。Allstate Insurance Claim 和?Flight Delay數(shù)據(jù)集都包含了很多獨熱編碼特征。最后兩個數(shù)據(jù)集來自 KDD CUP 2010和 KDD CUP 2012.我們直接使用取勝解法的特征,包含了稠密和稀疏的特征。
-
這些數(shù)據(jù)集都很龐大,包含了稠密與稀疏特征,覆蓋了很多實際任務(wù)。因此我們用他們來徹底測試下算法。
-
我們實驗環(huán)境是 linux服務(wù)器,兩塊?E5-2670 v3 CPU,256G內(nèi)存。所有實驗用多線程跑,線程數(shù)為16.
整體比較
We present the overall comparisons in this subsection. XGBoost [13] and LightGBM without GOSS and EFB (called lgb_baselline) are used as baselines. For XGBoost, we used two versions, xgb_exa (pre-sorted algorithm) and xgb_his (histogram-based algorithm). For xgb_his, lgb_baseline, and LightGBM, we used the leaf-wise tree growth strategy [32]. For xgb_exa, since it only supports layer-wise growth strategy, we tuned the parameters for xgb_exa to let it grow similar trees like other
- 這里我們列出整體比較。XGBoost 和 不帶GOSS和EFB的LightGBM 作為baseline。
- XGBoost 使用兩個版本:xgb_exa (預(yù)先排序算法)和xgb_his (直方圖算法)(查詢分割點)
- 對 xgb_his, lgb_baseline, LightGBM 我們使用按葉子的增長策略。
- 對于xgb_exa,因為只支持按層的增長策略,我們調(diào)整參數(shù),使得生成的樹和其他的類似。
The training time and test accuracy are summarized in Table 2 and Table 3 respectively. From these results, we can see that LightGBM is the fastest while maintaining almost the same accuracy as baselines. The xgb_exa is based on the pre-sorted algorithm, which is quite slow comparing with histogram-base algorithms. By comparing with lgb_baseline, LightGBM speed up 21x, 6x, 1.6x, 14x and 13x respectively on the Allstate, Flight Delay, LETOR, KDD10 and KDD12 datasets. Since xgb_his is quite memory consuming, it cannot run successfully on KDD10 and KDD12 datasets due to out-of-memory. On the remaining datasets, LightGBM are all faster, up to 9x speed-up is achieved on the Allstate dataset. The speed-up is calculated based on training time per iteration since all algorithms converge after similar number of iterations. To demonstrate the overall training process, we also show the training curves based on wall clock time on Flight Delay and LETOR in the Fig. 1
- 訓(xùn)練時間和測試準(zhǔn)確度在表2和表3中呈現(xiàn)。
- 從這里結(jié)果看,LightGBM是最快的,且達到與baseline幾乎相同的準(zhǔn)確率。
- xgb_exa 基于預(yù)先排序算法那,相對直方圖算法太慢
- 與lgb_baseline相比,LightGBM 在Allstate, Flight Delay, LETOR, KDD10 ,?KDD12數(shù)據(jù)集將速度提高了不同倍數(shù):21x, 6x, 1.6x, 14x,13x?
- 因為xgb_his比較消耗內(nèi)存,無法成功跑KDD10 、?KDD12數(shù)據(jù)集(因為內(nèi)存溢出了)
- 在剩余數(shù)據(jù)集中,LightGBM 都是最快的,速度提高了9倍。因為每個算法在相近的迭代次數(shù)后收斂,這里的速度計算基于每次迭代的時間。
and Fig. 2, respectively. To save space, we put the remaining training curves of the other datasets in the supplementary material. On all datasets, LightGBM can achieve almost the same test accuracy as the baselines. This indicates that both GOSS and EFB will not hurt accuracy while bringing significant speed-up. It is consistent with our theoretical analysis in Sec. 3.2 and Sec. 4. LightGBM achieves quite different speed-up ratios on these datasets. The overall speed-up comes from the combination of GOSS and EFB, we will break down the contribution and discuss the effectiveness of GOSS and EFB separately in the next sections.
分析GOSS
First, we study the speed-up ability of GOSS. From the comparison of LightGBM and EFB_only (LightGBM without GOSS) in Table 2, we can see that GOSS can bring nearly 2x speed-up by its own with using 10% - 20% data. GOSS can learn trees by only using the sampled data. However, it retains some computations on the full dataset, such as conducting the predictions and computing the gradients. Thus, we can find that the overall speed-up is not linearly correlated with the percentage of sampled data. However, the speed-up brought by GOSS is still very significant and the technique is universally applicable to different datasets. Second, we evaluate the accuracy of GOSS by comparing with Stochastic Gradient Boosting (SGB) [20]. Without loss of generality, we use the LETOR dataset for the test. We tune the sampling ratio by choosing different a and b in GOSS, and use the same overall sampling ratio for SGB. We run these settings until convergence by using early stopping. The results are shown in Table 4. We can see the accuracy of GOSS is always better than SGB when using the same sampling ratio. These results are consistent with our discussions in Sec. 3.2. All the experiments demonstrate that GOSS is a more effective sampling method than stochastic sampling.
-
首先我們研究GOSS的加速特性。通過將LightGBM 與EFB_only 比較,我們發(fā)現(xiàn)使用10%-20%數(shù)據(jù),GOSS可以帶來2倍的速度提升。GOSS可以只用采樣的數(shù)據(jù)來學(xué)習(xí)樹模型。然而在完整數(shù)據(jù)集上,它保留了一些計算例如預(yù)測和梯度計算。因此我們發(fā)現(xiàn)整體的加速與采樣比例不是線性相關(guān)。盡管如此,GOSS帶來的加速仍然很可觀,對于不同數(shù)據(jù)集都可應(yīng)用。
-
其次,我們評估GOSS的準(zhǔn)確性,通過與隨機梯度下降比較。為不是一般性,我們使用LETOR數(shù)據(jù)集。我們選擇GOSS不同的a和b參數(shù),對于SGB使用相同的全局采樣比例。在此參數(shù)下訓(xùn)練至收斂。結(jié)果在表4.可以看出在使用相同采樣比率下,總是比SGB好。這個結(jié)果與3.2說的一致,所有結(jié)果顯示GOSS相比隨機采樣更有效。
分析EFB
We check the contribution of EFB to the speed-up by comparing lgb_baseline with EFB_only. The results are shown in Table 2. Here we do not allow the confliction in the bundle finding process (i.e., γ = 0).7 We find that EFB can help achieve significant speed-up on those large-scale datasets. Please note lgb_baseline has been optimized for the sparse features, and EFB can still speed up the training by a large factor. It is because EFB merges many sparse features (both the one-hot coding features and implicitly exclusive features) into much fewer features. The basic sparse feature optimization is included in the bundling process. However, the EFB does not have the additional cost on maintaining nonzero data table for each feature in the tree learning process. What is more, since many previously isolated features are bundled together, it can increase spatial locality and improve cache hit rate significantly. Therefore, the overall improvement on efficiency is dramatic. With above analysis, EFB is a very effective algorithm to leverage sparse property in the histogram-based algorithm, and it can bring a significant speed-up for GBDT training process.
- 我們核實EFB的加速,通過比較lgb_baseline 與EFB_only。結(jié)果在表2.
- 在綁定搜索過程中,我們不允許沖突。發(fā)現(xiàn)EFB可以在大規(guī)模數(shù)據(jù)集上達到明顯的加速。注意lgb_baseline針對稀疏特征已做優(yōu)化。而EFB仍然可以極大的對訓(xùn)練過程提速。這是因為EFB合并了很多稀疏特征(包括獨熱編碼特征和互斥特征)。綁定過程中引入了基本的稀疏特征優(yōu)化。EFB 不需要額外的開銷(每個特征的非0值數(shù)據(jù)表)。因為很多互斥特征綁定在一起,可以提高數(shù)據(jù)本地化來增加內(nèi)存命中率。因此,整體效率提升是很可觀的。
- 基于以上分析,EFB是直方圖算法中處理稀疏特征的有效算法,提速了GBDT的訓(xùn)練過程。
結(jié)論
In this paper, we have proposed a novel GBDT algorithm called LightGBM, which contains two novel techniques: Gradient-based One-Side Sampling and Exclusive Feature Bundling to deal with large number of data instances and large number of features respectively. We have performed both theoretical analysis and experimental studies on these two techniques. The experimental results are consistent with the theory and show that with the help of GOSS and EFB, LightGBM can significantly outperform XGBoost and SGB in terms of computational speed and memory consumption. For the future work, we will study the optimal selection of a and b in Gradient-based One-Side Sampling and continue improving the performance of Exclusive Feature Bundling to deal with large number of features no matter they are sparse or not.
- 文本提出了新的GBDT算法?LightGBM,包含兩種新技術(shù):單邊梯度采樣和互斥特征綁定 來解決大規(guī)模數(shù)據(jù)和特征。我們從理論分析和實驗研究對其進行驗證。實驗結(jié)果與理論相符,展現(xiàn)了在GOSS和EFB下,LightGBM 可以極大處處XGBoost 和SGB的性能,尤其在計算速度和內(nèi)存開銷上。
- 對于未來的工作,我們將研究GOSS中a和b的最優(yōu)選擇,繼續(xù)提高EFB在處理大規(guī)模特征下的性能,包括稀疏與否。
?
學(xué)習(xí)小結(jié):
本篇降到了lightgbm很多優(yōu)化,但是看完感覺有些思路還不太清晰,最好結(jié)合一些優(yōu)秀博客來學(xué)習(xí)
LightGBM算法總結(jié)?對直方圖的優(yōu)化(減少要遍歷的值)做了詳盡表述?
LightGBM Established: October 1, 2016?官網(wǎng)介紹
Features?官方文檔?
快的不要不要的lightGBM?對特征的綁定有講述
?
?
?
?
?
?
?
?
總結(jié)
以上是生活随笔為你收集整理的机器学习论文:《LightGBM: A Highly Efficient Gradient Boosting Decision Tree》的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 201812CCF-CCSP竞赛:第1题
- 下一篇: .htaccess文件