I Summarized the Methods of 70 Papers to Help You Thoroughly Understand Neural Network Pruning Algorithms
Whether in computer vision, natural language processing or image generation, deep neural networks currently deliver state-of-the-art performance. However, their cost in terms of computational power, memory or energy consumption can be prohibitive, making training completely unaffordable for most companies with limited hardware resources. Yet many domains benefit from neural networks, so a way is needed to reduce their cost while preserving their performance.
This is the whole point of neural network compression. The field comprises several families of methods, such as quantization [11], factorization [13] and distillation [32]. This article focuses on pruning.
Neural network pruning is a method that removes the superfluous parts of a network that performs well but costs a lot of resources. Although large neural networks have proven their ability to learn countless times, it turns out that not all of their parts are still useful once training is over. The idea is to eliminate these superfluous parts without harming the network's performance.
Unfortunately, the dozens (if not hundreds) of papers published every year reveal the hidden complexity of an idea that sounds straightforward. Indeed, a quick look at the literature shows countless ways to identify these useless parts, or to remove them, before, during or after training; most importantly, not every kind of pruning actually speeds up a neural network, and that is the crux of the matter.
The goal of this article is to untangle the various issues surrounding neural network pruning. We will review in turn three questions that seem to lie at the core of the whole field: "What kind of part should I prune?", "How do I tell which parts can be pruned?" and "How do I prune without damaging the network?". In short, we will detail pruning structures, pruning criteria and pruning methods.
1 - Introduction to Pruning
1.1 - Unstructured Pruning
When talking about the cost of a neural network, the parameter count is surely one of the most widely used metrics, along with FLOPs (the number of floating-point operations). It is indeed daunting to see networks displaying astronomical numbers of weights (GPT-3 counts 175 billion parameters). As it happens, pruning connections is one of the most widespread paradigms in the literature, enough to be considered the default framework when dealing with pruning. The seminal work of Han et al. [26] introduced this kind of pruning and served as a basis for many contributions [18, 21, 25].
Pruning parameters directly has many advantages. First, it is simple: replacing their value with zero within the parameter tensors is enough to prune a connection. Widely used deep-learning frameworks, such as PyTorch, give easy access to all the parameters of a network, making the implementation trivial. Still, the greatest advantage of pruning connections is that they are the smallest, most fundamental elements of a network; they are therefore numerous enough to be pruned in large quantities without impacting performance. Such a fine granularity allows pruning very subtle patterns, down to individual parameters within convolution kernels. Since pruning weights is not limited by any constraint whatsoever and is the finest way of pruning a network, this paradigm is called unstructured pruning.
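To make this concrete, here is a minimal sketch of unstructured magnitude pruning in PyTorch (an illustration written for this summary, not code from the cited papers): every weight tensor is masked by zeroing out a given fraction of its smallest-magnitude entries.

```python
import torch
import torch.nn as nn

def magnitude_prune_unstructured(model: nn.Module, rate: float) -> dict:
    """Zero out the `rate` fraction of smallest-magnitude weights, layer by layer.
    Returns the binary masks so the pruning can be re-applied after later updates."""
    masks = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.dim() < 2:  # skip biases and normalization parameters
                continue
            k = int(rate * param.numel())
            if k == 0:
                continue
            # Threshold = k-th smallest absolute value in this tensor
            threshold = param.abs().flatten().kthvalue(k).values
            mask = (param.abs() > threshold).float()
            param.mul_(mask)      # pruning = writing zeros into the parameter tensor
            masks[name] = mask
    return masks

# Example: prune 80% of the weights of a small network
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
masks = magnitude_prune_unstructured(model, rate=0.8)
```

Note that the zeros only stay in place if the mask is re-applied after every optimization step (or if the corresponding gradients are masked as well).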
However, this approach suffers from a major, fatal drawback: most frameworks and hardware cannot accelerate sparse-matrix computations, which means that no matter how many zeros you fill the parameter tensors with, the actual cost of the network does not change. What does change it is pruning in a way that directly alters the architecture of the network, which any framework can handle.
Difference between unstructured (left) and structured (right) pruning: structured pruning removes convolution filters and rows of kernels instead of merely pruning connections. This leads to fewer feature maps in the intermediate representations.
1.2 - Structured Pruning
This is why many works focus on pruning larger structures, such as whole neurons [36] or, their direct equivalent in more modern deep convolutional networks, convolution filters [40, 41, 66]. Since large networks tend to include many convolutional layers, each counting up to hundreds or thousands of filters, filter pruning allows for a granularity that is both exploitable and fine enough. Removing such structures not only yields sparse layers that can be directly instantiated as thinner ones, it also removes the feature maps that are the outputs of these filters.
Therefore, not only are such networks lighter to store thanks to their fewer parameters, they also require less computation and produce lighter intermediate representations, and thus need less memory at runtime. Sometimes, reducing the bandwidth is actually more beneficial than reducing the parameter count. Indeed, for tasks involving large images, such as semantic segmentation or object detection, intermediate representations can consume huge amounts of memory, far more than the network itself. For these reasons, filter pruning is now regarded as the default kind of structured pruning.
However, several aspects should be kept in mind when applying this kind of pruning. Consider how a convolutional layer is built: for Cin input channels and Cout output channels, a convolutional layer is made of Cout filters, each counting Cin kernels; each filter outputs one feature map, and within each filter, one kernel is dedicated to each input channel. Given this architecture, when pruning whole filters, one may observe that pruning a filter, besides removing the feature map it outputs, actually also leads to pruning the corresponding kernels in the subsequent layer. This means that, when pruning filters, one may actually end up pruning twice the number of parameters one thought would be removed in the first place.
Consider also that, when a whole layer happens to be pruned (which tends to happen through layer collapse [62], although it does not always break the network, depending on the architecture), the output of the previous layer is now completely unconnected and therefore pruned as well: pruning a whole layer may actually prune all of its preceding layers whose outputs are not somehow connected elsewhere (through residual connections [28] or whole parallel paths [61]). Therefore, when pruning filters, one should take care to count the exact number of parameters that actually get pruned. Indeed, depending on how the filters are distributed across the architecture, pruning the same number of filters may not lead to the same number of actually pruned parameters, making any result impossible to compare against.
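To make this double removal concrete, here is a minimal PyTorch sketch (layer sizes and kept indices are arbitrary assumptions for illustration, not code from any cited paper): keeping only a subset of the filters of a first convolution forces the next layer to be sliced along its input-channel dimension as well.

```python
import torch
import torch.nn as nn

# Two consecutive convolutions: conv1 has 16 filters, so conv2 has 16 kernels per filter
conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)

keep = [0, 2, 5, 7, 11]  # indices of the filters of conv1 we decide to keep

# New, thinner conv1: only the kept filters (and their biases) remain
new_conv1 = nn.Conv2d(3, len(keep), kernel_size=3, padding=1)
new_conv1.weight.data = conv1.weight.data[keep].clone()
new_conv1.bias.data = conv1.bias.data[keep].clone()

# conv2 loses the kernels that consumed the removed feature maps:
# its weight tensor is sliced along dim=1 (input channels)
new_conv2 = nn.Conv2d(len(keep), 32, kernel_size=3, padding=1)
new_conv2.weight.data = conv2.weight.data[:, keep].clone()
new_conv2.bias.data = conv2.bias.data.clone()

x = torch.randn(1, 3, 32, 32)
y = new_conv2(new_conv1(x))  # the thinner network runs without any sparse kernels
print(y.shape)               # torch.Size([1, 32, 32, 32])
```

Counting both slices is exactly the bookkeeping warned about above: removing 11 filters here deletes 11 x 3 x 3 x 3 weights in conv1 and 32 x 11 x 3 x 3 weights in conv2.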
在轉(zhuǎn)移話題之前,讓我們提一下,盡管數(shù)量很少,但有些工作專注于修剪卷積核(過濾器)、核內(nèi)結(jié)構(gòu) [2,24, 46] 甚至特定的參數(shù)結(jié)構(gòu)。但是,此類結(jié)構(gòu)需要特殊的實現(xiàn)才能實現(xiàn)任何類型的加速(如非結(jié)構(gòu)化剪枝)。然而,另一種可利用的結(jié)構(gòu)是通過修剪每個內(nèi)核中除一個參數(shù)之外的所有參數(shù)并將卷積轉(zhuǎn)換為“位移層”(shift layers),然后可以將其總結(jié)為位移操作和 1×1 卷積的組合 [24]。
結(jié)構(gòu)化剪枝的危險:改變層的輸入和輸出維度會導(dǎo)致一些差異。 如果在左邊,兩個層輸出相同數(shù)量的特征圖,然后可以很好地相加,右邊的剪枝產(chǎn)生不同維度的中間表示,如果不處理它們就無法相加。
2 - 剪枝標(biāo)準(zhǔn)
一旦決定了要修剪哪種結(jié)構(gòu),下一個可能會問的問題是:“現(xiàn)在,我如何確定要保留哪些結(jié)構(gòu)以及要修剪哪些結(jié)構(gòu)?”。 為了回答這個問題,需要一個適當(dāng)?shù)男藜魳?biāo)準(zhǔn),這將對參數(shù)、過濾器或其他的相對重要性進行排名。
2.1- 權(quán)重大小標(biāo)準(zhǔn)
一個非常直觀且非常有效的標(biāo)準(zhǔn)是修剪絕對值(或“幅度”)最小的權(quán)重。實際上,在權(quán)重衰減的約束下,那些對函數(shù)沒有顯著貢獻的函數(shù)在訓(xùn)練期間會縮小幅度。因此,多余的權(quán)重被定義為是那些絕對值較小的權(quán)重[8]。盡管它很簡單,但幅度標(biāo)準(zhǔn)仍然廣泛用于最新的方法 [21, 26, 58],使其成為該領(lǐng)域的主要內(nèi)容。
然而,雖然這個標(biāo)準(zhǔn)在非結(jié)構(gòu)化剪枝的情況下實現(xiàn)起來似乎微不足道,但人們可能想知道如何使其適應(yīng)結(jié)構(gòu)化剪枝。一種直接的方法是根據(jù)過濾器的范數(shù)(例如 L 1 或 L 2)對過濾器進行排序 [40, 70]。如果這種方法非常簡單,人們可能希望將多組參數(shù)封裝在一個度量中:例如,一個卷積過濾器、它的偏差和它的批量歸一化參數(shù),或者甚至是并行層中的相應(yīng)過濾器,其輸出隨后被融合。
One way to do so, without having to compute a combined norm over these parameters, is to insert a learnable multiplicative parameter for each feature map after each set of layers to prune. When this gate shrinks to zero, it effectively prunes the whole set of parameters responsible for the channel, and its magnitude then accounts for the importance of all of them. The method then consists of pruning the gates of smaller magnitude [36, 41].
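Both structured variants of the magnitude criterion can be sketched in a few lines of PyTorch (an illustration with arbitrary layer sizes, not the implementation of the cited methods): filters are ranked by the L1 norm of their weights, or channels are ranked by the magnitude of the batch-normalization scale that plays the role of a multiplicative gate.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(32)

# Criterion 1: L1 norm of each filter's weights (one score per output channel)
filter_scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))

# Criterion 2: magnitude of the learnable batch-norm scale, used as a channel gate
gate_scores = bn.weight.detach().abs()

# Keep e.g. the 75% highest-ranked channels according to either score
n_keep = int(0.75 * conv.out_channels)
keep_by_filters = torch.topk(filter_scores, n_keep).indices
keep_by_gates = torch.topk(gate_scores, n_keep).indices
```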
2.2 - Gradient Magnitude Pruning
The magnitude of the weights is not the only popular criterion (or family of criteria). Indeed, another major criterion that has persisted up to now is the magnitude of the gradient. Back in the 80s, some fundamental works [37, 53] theorized, through a Taylor expansion of the impact of removing a parameter on the loss, that some metrics derived from the back-propagated gradient may provide a good way of determining which parameters can be pruned without damaging the network.
More recent implementations of this criterion [4, 50] actually accumulate gradients over a mini-batch of training data and prune on the basis of the product between this gradient and the corresponding weight of each parameter. The criterion can also be applied to the gate parameters mentioned above [49].
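Here is a minimal sketch of such a first-order criterion (a simplified illustration with a toy model and random data, not any paper's exact recipe): gradients are accumulated over a few mini-batches, and each parameter is scored by |weight × accumulated gradient|.

```python
import torch
import torch.nn as nn

def taylor_scores(model, loss_fn, batches):
    """Score each parameter by |w * dL/dw|, with the gradient accumulated over `batches`."""
    model.zero_grad()
    for inputs, targets in batches:            # a small number of mini-batches is enough
        loss = loss_fn(model(inputs), targets)
        loss.backward()                        # gradients accumulate in param.grad
    return {
        name: (param.detach() * param.grad.detach()).abs()
        for name, param in model.named_parameters()
        if param.grad is not None
    }

# Usage sketch on a toy classifier
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 2))
batches = [(torch.randn(8, 20), torch.randint(0, 2, (8,))) for _ in range(4)]
scores = taylor_scores(model, nn.CrossEntropyLoss(), batches)
```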
2.3 - Global or Local Pruning
One last aspect to consider is whether the chosen criterion is applied globally, over all the parameters or filters of the network, or whether it is computed independently for each layer. While global pruning has repeatedly been shown to produce better results, it can lead to layer collapse [62]. A simple way to avoid this problem is to resort to layer-wise local pruning, i.e. pruning the same rate in each layer, whenever the method used cannot prevent layer collapse.
Difference between local (left) and global (right) pruning: local pruning applies the same rate to each layer, while global pruning applies it over the whole network at once.
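In code, the difference boils down to where the threshold is computed. A hedged sketch (taking per-layer tensors of importance scores as input) could look like this:

```python
import torch

def local_thresholds(scores_per_layer, rate):
    """One threshold per layer: every layer loses the same fraction of its parameters."""
    return {
        name: s.flatten().kthvalue(max(1, int(rate * s.numel()))).values
        for name, s in scores_per_layer.items()
    }

def global_threshold(scores_per_layer, rate):
    """A single threshold over the whole network: some layers may end up far sparser than others."""
    all_scores = torch.cat([s.flatten() for s in scores_per_layer.values()])
    return all_scores.kthvalue(max(1, int(rate * all_scores.numel()))).values

# Toy usage with random scores
scores = {"layer1": torch.rand(100), "layer2": torch.rand(1000)}
print(local_thresholds(scores, 0.5))
print(global_threshold(scores, 0.5))
```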
3 - Pruning Methods
Now that we have a pruning structure and a pruning criterion, the only thing left to decide is which method to use to actually prune the network. This is actually the most confusing topic in the literature, since every paper brings its own quirks and gimmicks, to the point that one may get lost between what is methodologically relevant and what is the peculiarity of a given paper.
This is why we will give a thematic overview of some of the most popular families of methods for pruning neural networks, in order to highlight how the use of sparsity during training has evolved.
3.1 - The Classic Framework: Train, Prune and Fine-Tune
The first basic framework to know is the train, prune and fine-tune method, which obviously involves 1) training the network, 2) pruning it by setting to 0 all the parameters targeted by the chosen pruning structure and criterion (these parameters cannot recover afterwards), and 3) training the network for a few extra epochs at the lowest learning rate, to give it a chance to recover from the performance loss induced by pruning. Usually, the last two steps can be iterated, increasing the pruning rate each time.
The method of Han et al. [26] applies exactly this approach, with 5 iterations between pruning and fine-tuning, for weight pruning. Iterating has been shown to improve performance, at the cost of extra computation and training time. This simple framework serves as the basis of many methods [26, 40, 41, 50, 66] and can be seen as the default against which all other works compare.
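Schematically, the iterated version of this framework can be summarized by the sketch below, where train_fn, prune_fn and the pruning rates are placeholders for whatever training loop, structure, criterion and schedule one chooses (none of these names come from the cited papers).

```python
# Iterative train, prune and fine-tune (schematic sketch, not any paper's exact recipe)

def train_prune_finetune(model, train_fn, prune_fn, rates, finetune_epochs=5):
    train_fn(model, epochs=100, lr=0.1)          # 1) full training
    for rate in rates:                           # e.g. [0.5, 0.7, 0.8, 0.9] for iterative pruning
        masks = prune_fn(model, rate)            # 2) prune: targeted parameters are set to 0
        train_fn(model, epochs=finetune_epochs,  # 3) fine-tune at the lowest learning rate
                 lr=0.001, masks=masks)          #    while keeping the pruned weights at zero
    return model
```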
3.2 - Extending the Classic Framework
Without straying too far from it, some methods bring significant modifications to the aforementioned classic framework of Han et al. [26]. Gale et al. [21] push the principle of iterating further by gradually removing more and more weights throughout training, which allows benefiting from the advantages of iterating while removing the whole fine-tuning process. He et al. [29] softly reduce the prunable filters to 0 at each epoch, without preventing them from learning and being updated afterwards, so as to let their weights regrow after pruning while enforcing sparsity during training.
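Gradual removal of this kind typically follows a sparsity schedule. A commonly used one (shown here as an assumption in the spirit of gradual magnitude pruning, not Gale et al.'s exact code) ramps the sparsity from 0 to its final value with a cubic curve over the course of training:

```python
def gradual_sparsity(step, begin_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic sparsity ramp commonly used for gradual magnitude pruning."""
    if step < begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** 3

# At regular intervals during training, re-prune the network to the scheduled sparsity
for step in range(0, 10001, 1000):
    print(step, round(gradual_sparsity(step, 0, 10000, 0.9), 3))
```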
Finally, the method of Renda et al. [58] involves fully retraining the network once it has been pruned. Unlike fine-tuning, which is performed at the lowest learning rate, this retraining follows the same learning-rate schedule as the original training, hence its name: "Learning-Rate Rewinding". This retraining has been shown to perform better than mere fine-tuning, at a significantly higher cost.
3.3 - Pruning at Initialization
In order to speed up training, avoid fine-tuning and prevent any alteration of the architecture during or after training, multiple works have focused on pruning before training. Following SNIP [39], many works have studied the use of the method of LeCun et al. [37] or of Mozer and Smolensky [53] to prune at initialization [12, 64], including thorough theoretical studies [27, 38, 62]. However, Optimal Brain Damage [37] relies on multiple approximations, among which an "extremal" approximation that "assumes that parameter deletion will be performed after training has converged" [37]; this fact is rarely mentioned, even in the methods that build on it. Some works have also expressed reservations about the ability of such methods to produce masks whose relevance is any better than that of random masks with a similar per-layer distribution [20].
Another family of methods that studies the relationship between pruning and initialization revolves around the "Lottery Ticket Hypothesis" [18]. This hypothesis states that "a randomly-initialized, dense neural network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations". In practice, this literature studies how well a pruning mask, defined using an already converged network, can be applied to the network right back at initialization. Multiple works have extended, stabilized or investigated this hypothesis [14, 19, 45, 51, 69]. However, once again, several works tend to question both the validity of the hypothesis and the methodology used to study it [21, 42], and some even tend to show that its benefits come from the principle of using a deterministic mask rather than from the fully trained "Winning Ticket" itself [58].
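In its simplest form, the lottery ticket experiment can be outlined as follows (a hedged sketch of the procedure in which train_fn is a placeholder training loop, not the authors' implementation): save the weights at initialization, train, build a magnitude mask from the trained weights, rewind the surviving weights to their initial values and retrain the masked network, re-applying the mask after every update.

```python
import copy
import torch
import torch.nn as nn

def lottery_ticket_round(model, train_fn, rate=0.8):
    """One round of the lottery-ticket experiment: train, mask, rewind, retrain."""
    init_state = copy.deepcopy(model.state_dict())    # 1) remember the initialization
    train_fn(model)                                   # 2) train the dense network
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() < 2:
                continue
            k = max(1, int(rate * p.numel()))
            thr = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > thr).float()     # 3) magnitude mask from trained weights
    model.load_state_dict(init_state)                 # 4) rewind to the initial weights
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])                   # apply the mask to the rewound network
    train_fn(model)                                   # 5) retrain the "winning ticket"
    return model, masks
```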
Comparison between the classic "train, prune and fine-tune" framework [26], the lottery ticket experiment [18] and Learning-Rate Rewinding [58].
3.4 - Sparse Training
The aforementioned methods are all linked by what appears to be a shared underlying theme: training under sparsity constraints. This principle is at the core of a family of methods called sparse training, which consists in enforcing a constant sparsity rate during training while its distribution varies and is progressively adjusted. Introduced by Mocanu et al. [47], it involves: 1) initializing the network with a random mask that prunes a certain proportion of the network, 2) training this pruned network for one epoch, 3) pruning a certain amount of the weights of lowest magnitude and 4) regrowing the same amount of random weights.
This way, the pruning mask, random at first, is progressively adjusted to target the weights of least magnitude, while sparsity is enforced throughout training. The sparsity level can be the same for each layer [47] or global [52]. Other methods extend sparse training by regrowing weights according to a certain criterion instead of picking them at random [15, 17].
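A single prune-and-regrow step, in the spirit of this procedure, can be sketched as follows (a simplified, per-tensor illustration written for this summary, not the implementation of the cited papers):

```python
import torch

def prune_and_regrow(weight: torch.Tensor, mask: torch.Tensor, fraction: float = 0.3):
    """Drop the smallest active weights and regrow as many random inactive ones (in place)."""
    with torch.no_grad():
        active = mask.bool()
        inactive_idx = (~active).view(-1).nonzero().flatten()
        n_swap = int(fraction * active.sum().item())
        if n_swap == 0 or inactive_idx.numel() < n_swap:
            return mask
        # 1) prune: among active weights, drop the n_swap of smallest magnitude
        scores = weight.abs().masked_fill(~active, float("inf"))
        drop = torch.topk(scores.view(-1), n_swap, largest=False).indices
        mask.view(-1)[drop] = 0.0
        weight.view(-1)[drop] = 0.0
        # 2) regrow: reactivate n_swap connections that were inactive before this step
        grow = inactive_idx[torch.randperm(inactive_idx.numel())[:n_swap]]
        mask.view(-1)[grow] = 1.0
    return mask

# Usage: start from a random mask keeping ~10% of a weight matrix, adjust it after an epoch
w = torch.randn(256, 512)
m = (torch.rand_like(w) < 0.1).float()
w.mul_(m)
m = prune_and_regrow(w, m, fraction=0.3)
```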
Sparse training periodically cuts and regrows different weights during training, which leads to an adjusted mask that should end up targeting only the relevant parameters.
3.5 - Mask Learning
Instead of relying on an arbitrary criterion to prune or regrow weights, multiple methods focus on learning the pruning mask during training. Two families of approaches seem to prevail in this domain: 1) mask learning through a separate network or layers and 2) mask learning through auxiliary parameters. Multiple strategies fit the first kind: training separate agents to prune as many filters of a layer as possible while maximizing accuracy [33], inserting attention-based layers [68] or using reinforcement learning [30]. The second kind aims at treating pruning as an optimization problem that tends to minimize both the L0 norm of the network and its supervised loss.
Since the L0 norm is not differentiable, the various methods mainly revolve around circumventing this problem by using penalized auxiliary parameters that are multiplied with their corresponding parameters during the forward pass [59, 23]. Many methods [44, 60, 67] rely on an approach similar to that of "Binary Connect" [11]: applying stochastic gates to parameters, whose values are each drawn from a Bernoulli distribution of their own parameter p, which can be learned through the "Straight-Through Estimator" [3] or other means [44].
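As an illustration of the auxiliary-parameter family, here is a hedged sketch (a deterministic simplification written for this summary, not one of the cited methods): each weight gets a learnable score, the forward pass uses a hard binary gate derived from that score, and the straight-through estimator lets gradients reach the scores as if the gate were the identity.

```python
import torch
import torch.nn as nn

class STEMask(torch.autograd.Function):
    """Hard binary gate in the forward pass, identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, scores):
        return (scores >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through: pretend the step function is the identity

class MaskedLinear(nn.Module):
    """Linear layer whose effective weight is weight * binary mask, with learnable mask scores."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.scores = nn.Parameter(torch.zeros_like(self.linear.weight))

    def forward(self, x):
        mask = STEMask.apply(self.scores)
        return nn.functional.linear(x, self.linear.weight * mask, self.linear.bias)

# A sparsity-inducing penalty on a relaxation of the mask can be added to the loss:
layer = MaskedLinear(20, 10)
out = layer(torch.randn(4, 20))
l0_proxy = torch.sigmoid(layer.scores).sum()     # differentiable proxy for the L0 norm
loss = out.pow(2).mean() + 1e-3 * l0_proxy       # dummy task loss + sparsity penalty
loss.backward()
```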
3.6 - Penalty-Based Methods
Instead of manually pruning connections or penalizing auxiliary parameters, many methods apply various penalties to the weights themselves, making them shrink progressively towards 0. The concept is actually quite old [57], since weight decay is already an essential ingredient of the weight-magnitude criterion. Beyond mere weight decay, multiple works back then already focused on crafting penalties designed specifically to enforce sparsity [55, 65]. Today, various methods apply different regularizations on top of weight decay to increase sparsity further (usually with the L1 norm [41]).
Among recent methods, several rely on LASSO [22, 31, 66] to prune weights or groups of them. Other methods craft penalties that target weak connections, so as to widen the gap between the parameters to keep and those to prune, and thereby reduce the impact of their removal [7, 16]. Some methods show that applying to a subset of weights a penalty that keeps growing throughout training prunes them progressively and allows for seamless removal [6, 9, 63]. The literature also counts a family of methods built around the principle of "Variational Dropout" [34], a method based on variational inference [5] applied to deep learning [35]. Used as a pruning method [48], it has spawned multiple works that apply its principle to structured pruning [43, 54].
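As a minimal illustration of this family (a sketch in the spirit of sparsity-inducing regularization such as network slimming [41], not the exact formulation of any cited paper), an L1 penalty on the batch-normalization scales can simply be added to the training loss:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

def sparsity_penalty(model, strength=1e-4):
    """L1 penalty on batch-norm scales: channels whose scale shrinks to ~0 become prunable."""
    penalty = sum(m.weight.abs().sum() for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return strength * penalty

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(x), y) + sparsity_penalty(model)
loss.backward()
```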
4 - Available Frameworks
While most of these methods have to be implemented from scratch (or reused from the source code provided with each paper), the following frameworks make it possible to apply basic methods or to ease the implementation of the ones above.
4.1 - PyTorch
PyTorch [56] provides a few basic pruning methods, such as global or local pruning, whether structured or unstructured. Structured pruning can be applied to any dimension of a weight tensor, which allows pruning filters, rows of kernels, or even some rows and columns inside kernels. These built-in basic methods also allow pruning randomly or according to various norms.
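For illustration, here is a short usage sketch of the built-in torch.nn.utils.prune module (the calls below exist in recent PyTorch versions; check the documentation of the version you use):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))

# Local unstructured pruning: zero out 50% of the smallest-magnitude weights of one layer
prune.l1_unstructured(model[0], name="weight", amount=0.5)

# Local structured pruning: remove 25% of the filters (dim=0) of a layer, ranked by L2 norm
prune.ln_structured(model[2], name="weight", amount=0.25, n=2, dim=0)

# Global unstructured pruning over several layers at once
prune.global_unstructured(
    [(model[0], "weight"), (model[2], "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Make the pruning permanent (folds the mask into the weight tensor)
for module in (model[0], model[2]):
    prune.remove(module, "weight")
```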
4.2 - TensorFlow
The Keras [10] library of TensorFlow [1] provides some basic tools to prune the weights of lowest magnitude. As in the work of Han et al. [25], the efficiency of this pruning is measured in terms of the redundancy introduced by all the inserted zeros, which allows the model to be compressed better (and combines well with quantization).
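In practice, these tools are shipped in the TensorFlow Model Optimization Toolkit; a hedged usage sketch (API names as found in recent versions of tensorflow_model_optimization, so exact signatures may vary) looks like this:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Wrap the model so that low-magnitude weights are gradually zeroed out during training
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=2000),
)
pruned_model.compile(optimizer="adam",
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                     metrics=["accuracy"])

# The UpdatePruningStep callback keeps the pruning schedule in sync with training steps
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(x_train, y_train, epochs=2, callbacks=callbacks)

# Remove the pruning wrappers before export; the zeros then compress well (e.g. with gzip)
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```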
4.3 - ShrinkBench
In their work, Blalock et al. [4] provide a custom library meant to help the community normalize how pruning algorithms are compared. Built on PyTorch, ShrinkBench aims at making the implementation of pruning methods easier while normalizing the conditions under which they are trained and tested. It provides several baselines, such as random pruning, global or layer-wise pruning, and weight-magnitude or gradient-magnitude pruning.
5 - A Quick Review of the Methods
Many different papers are cited throughout this article. Here is a simple table that roughly summarizes what they do and how they differ (the dates provided are those of first publication):
| Paper | Year | Pruned structure | Pruning criterion | Pruning method | Particularity | Code available |
|---|---|---|---|---|---|---|
| Classic methods | | | | | | |
| Han et al. | 2015 | weights | weight magnitude | train, prune and fine-tune | prototypical pruning method | none |
| Gale et al. | 2019 | weights | weight magnitude | gradual removal | - | none |
| Renda et al. | 2020 | weights | weight magnitude | train, prune and re-train ("LR-Rewinding") | - | yes |
| Li et al. | 2016 | filters | L1 norm of weights | train, prune and fine-tune | - | none |
| Molchanov et al. | 2016 | filters | gradient magnitude | train, prune and fine-tune | - | none |
| Liu et al. | 2017 | filters | magnitude of batch-norm parameters | train, prune and fine-tune | gates-based structured pruning | none |
| He et al. | 2018 | filters | L2 norm of weights | soft pruning | zeroes out filters without removal until the end | yes |
| Molchanov et al. | 2019 | filters | gradient magnitude | train, prune and fine-tune | inserts gates to prune filters | none |
| Pruning at initialization | | | | | | |
| Lee et al. | 2018 | weights | gradient magnitude | prune and train | "SNIP" | yes |
| Lee et al. | 2019 | weights | "dynamical isometry" | prune and train | dataless method | yes |
| Wang et al. | 2020 | weights | second-order derivative | prune and train | "GraSP": alike SNIP but with a criterion closer to that of LeCun et al. | yes |
| Tanaka et al. | 2020 | weights | "synaptic flow" | prune and train | "SynFlow": dataless method | yes |
| Frankle et al. | 2018 | weights | weight magnitude | train, rewind, prune and retrain | "lottery ticket" | none |
| Sparse training | | | | | | |
| Mocanu et al. | 2018 | weights | weight magnitude | sparse training | random regrowth of pruned weights | yes |
| Mostafa and Wang | 2019 | weights | weight magnitude | sparse training | alike Mocanu et al. but global instead of layer-wise | none |
| Dettmers and Zettlemoyer | 2019 | weights | weight magnitude | sparse training | regrowth and layer-wise pruning rate depending on momentum | yes |
| Evci et al. | 2019 | weights | weight magnitude | sparse training | regrowth based on gradient magnitude | yes |
| Mask learning | | | | | | |
| Huang et al. | 2018 | filters | N/A | train, prune and fine-tune | trains pruning agents that target filters to prune | none |
| He et al. | 2018 | filters | N/A | train, prune and fine-tune | uses reinforcement learning to target filters to prune | yes |
| Yamamoto and Maeno | 2018 | filters | N/A | train, prune and fine-tune | "PCAS": uses attention modules to target filters to prune | none |
| Guo et al. | 2016 | weights | weight magnitude | mask learning | updates a mask depending on two different thresholds on the magnitude of weights | yes |
| Srinivas et al. | 2016 | weights | N/A | mask learning | alike Binary Connect, applied to auxiliary parameters | none |
| Louizos et al. | 2017 | weights | N/A | mask learning | variant of Binary Connect, applied to auxiliary parameters, that avoids resorting to the Straight-Through Estimator | yes |
| Xiao et al. | 2019 | weights | N/A | mask learning | alike Binary Connect but alters the gradient propagated to the auxiliary parameters | none |
| Savarese et al. | 2019 | weights | N/A | mask learning | approximates L0 with a Heaviside function, itself approximated by a sigmoid of increasing temperature over auxiliary parameters | yes |
| Penalty-based methods | | | | | | |
| Wen et al. | 2016 | filters | N/A | Group-LASSO regularization | - | yes |
| He et al. | 2017 | filters | N/A | Group-LASSO regularization | also reconstructs the outputs of pruned layers by least squares | yes |
| Gao et al. | 2019 | filters | N/A | Group-LASSO regularization | prunes matching filters across layers and penalizes the variance of weights | none |
| Chang and Sha | 2018 | weights | weight magnitude | global penalty | modifies the weight decay to make it induce more sparsity | none |
| Molchanov et al. | 2017 | weights | N/A | "Variational Dropout" | application of variational inference to pruning | none |
| Neklyudov et al. | 2017 | filters | N/A | "Variational Dropout" | structured version of variational dropout | yes |
| Louizos et al. | 2017 | filters | N/A | "Variational Dropout" | another structured version of variational dropout | none |
| Ding et al. | 2018 | filters | weight magnitude | targeted penalty | penalizes or stimulates filters depending on the distance of their L2 norm to a given threshold | none |
| Choi et al. | 2018 | weights | weight magnitude | targeted penalty | at each step, penalizes the L2 norm of the weights of least magnitude, with an importance that is learned throughout training | none |
| Carreira-Perpiñán and Idelbayev | 2018 | weights | weight magnitude | targeted penalty | defines a mask depending on the weights of least magnitude and penalizes them toward zero | none |
| Tessier et al. | 2020 | any | any (weight magnitude) | targeted penalty | at each step, penalizes the L2 norm of prunable weights or filters, with an importance that grows exponentially throughout training | yes |
6 - Conclusion
In this quick overview of the literature, we have seen that 1) the pruning structure defines what kind of gain to expect from pruning, 2) pruning criteria are grounded in various theoretical or practical considerations and 3) pruning methods tend to introduce sparsity during training in order to reconcile performance and cost. We have also seen that, even though its seminal works date back to the late 80s, neural network pruning is a very dynamic field that still experiences fundamental discoveries and new foundational concepts today.
Even though contributions are made to the field every day, there still seems to be plenty of room for exploration and innovation. If each sub-family of methods can be seen as an attempt to answer one question ("How to regrow pruned weights?", "How to learn the pruning mask through optimization?", "How to remove weights by softer means?"...), the evolution of the literature seems to point in one direction: sparsity throughout training. This direction raises many questions of its own, such as: "Are pruning criteria valid on networks that have not converged yet?" or "How to separate, from the very start, the benefit of choosing which weights to prune from that of any kind of sparse training?"
References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):1–18, 2017.
[3] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[4] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033, 2020.
[5] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
[6] Miguel A. Carreira-Perpiñán and Yerlan Idelbayev. “Learning-compression” algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8532–8541, 2018.
[7] Jing Chang and Jin Sha. Prune deep neural networks with the modified L1/2 penalty. IEEE Access, 7:2273–2280, 2018.
[8] Yves Chauvin. A back-propagation algorithm with optimal use of hidden units. In NIPS, volume 1, pages 519–526, 1988.
[9] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Compression of deep convolutional neural networks under joint sparsity constraints. arXiv preprint arXiv:1805.08303, 2018.
[10] Francois Chollet et al. Keras, 2015.
[11] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
[12] Pau de Jorge, Amartya Sanyal, Harkirat S Behl, Philip HS Torr, Gregory Rogez, and Puneet K Dokania. Progressive skeletonization: Trimming more fat from a network at initialization. arXiv preprint arXiv:2006.09081, 2020.
[13] Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In 28th Annual Conference on Neural Information Processing Systems 2014, NIPS 2014, pages 1269–1277. Neural information processing systems foundation, 2014.
[14] Shrey Desai, Hongyuan Zhan, and Ahmed Aly. Evaluating lottery tickets under distributional shifts. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 153–162, 2019.
[15] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.
[16] Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. Global sparse momentum sgd for pruning very deep neural networks. arXiv preprint arXiv:1909.12778, 2019.
[17] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pages 2943–2952. PMLR, 2020.
[18] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[19] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.
[20] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? arXiv preprint arXiv:2009.08576, 2020.
[21] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
[22] Susan Gao, Xin Liu, Lung-Sheng Chien, William Zhang, and Jose M Alvarez. Vacl: Variance-aware cross-layer regularization for pruning deep residual networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
[23] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In NIPS, 2016.
[24] Ghouthi Boukli Hacene, Carlos Lassance, Vincent Gripon, Matthieu Courbariaux, and Yoshua Bengio. Attention based pruning for shift networks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4054–4061. IEEE, 2021.
[25] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[26] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
[27] Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. Robust pruning at initialization.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[29] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018.
[30] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
[31] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
[32] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.
[33] Qiangui Huang, Kevin Zhou, Suya You, and Ulrich Neumann. Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 709–718. IEEE, 2018.
[34] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. stat, 1050:8, 2015.
[35] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. stat, 1050:1, 2014.
[36] John K Kruschke and Javier R Movellan. Benefits of gain: Speeded learning and minimal hidden layers in back-propagation networks. IEEE Transactions on systems, Man, and Cybernetics, 21(1):273–280, 1991.
[37] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
[38] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip HS Torr. A signal propagation perspective for pruning neural networks at initialization. In International Conference on Learning Representations, 2019.
[39] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. International Conference on Learning Representations, ICLR, 2019.
[40] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[41] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.
[42] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2018.
[43] C Louizos, K Ullrich, and M Welling. Bayesian compression for deep learning. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA., 2017.
[44] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017.
[45] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.
[46] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
[47] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):1–12, 2018.
[48] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507. PMLR, 2017.
[49] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.
[50] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
[51] Ari S Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. stat, 1050:6, 2019.
[52] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pages 4646–4655. PMLR, 2019.
[53] Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in neural information processing systems, pages 107–115, 1989.
[54] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Structured bayesian pruning via log-normal multiplicative noise. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6778–6787, 2017.
[55] Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.
[56] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[57] Russell Reed. Pruning algorithms-a survey. IEEE transactions on Neural Networks, 4(5):740–747, 1993.
[58] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
[59] Pedro Savarese, Hugo Silva, and Michael Maire. Winning the lottery with continuous sparsification. Advances in Neural Information Processing Systems, 33, 2020.
[60] Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 138–145, 2017.
[61] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[62] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33, 2020.
[63] Hugo Tessier, Vincent Gripon, Mathieu Léonardon, Matthieu Arzel, Thomas Hannagan, and David Bertrand. Rethinking weight decay for efficient neural network pruning. 2021.
[64] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2019.
[65] Andreas S Weigend, David E Rumelhart, and Bernardo A Huberman. Generalization by weight-elimination with application to forecasting. In Advances in neural information processing systems, pages 875–882, 1991.
[66] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[67] Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. Autoprune: Automatic network pruning by regularizing auxiliary parameters. Advances in neural information processing systems, 32, 2019.
[68] Kohei Yamamoto and Kurato Maeno. Pcas: Pruning channels with attention statistics for deep network compression. arXiv preprint arXiv:1806.05382, 2018.
[69] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067, 2019.
[70] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. In NeurIPS, 2018.
Author of the original article: Hugo Tessier