Evaluating the Computational Cost of Network Models
Contents
Computational cost
Memory access

Computational cost
Performance metric:
● FLOPS: floating point operations per second
Computational-cost metric:
● MACCs (or MADDs): multiply-accumulate operations
The difference between FLOPS and FLOPs:
FLOPS (note: all upper case) is short for floating point operations per second, i.e. the number of floating-point operations executed per second. It measures computation speed and is a hardware performance metric.
FLOPs (note: lower-case s) is short for floating point operations (the s marks the plural), i.e. the number of floating-point operations. It measures the amount of computation and can be used to gauge the complexity of an algorithm or model.
Note: MACCs count multiply-accumulate operations, while FLOPs count the multiplications and additions separately and sum them.
Dot-product example:
● y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + ... + w[n-1]*x[n-1]
Each w[i]*x[i] term together with its accumulation counts as one MACC, so the expression is n MACCs.
It contains n floating-point multiplications and n - 1 floating-point additions, so it is 2n - 1 FLOPs.
One MACC is therefore roughly two FLOPs.
Note: strictly speaking, there are only n - 1 additions, one fewer than the number of multiplications. The MACC count is an approximation here, much like Big-O notation is an approximation of an algorithm's complexity.
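The counting rule above is easy to turn into code. Below is a minimal sketch (the helper name dot_product_cost is illustrative, not something from the original text):

```python
# Minimal sketch: operation counts for the dot product
#   y = w[0]*x[0] + w[1]*x[1] + ... + w[n-1]*x[n-1]

def dot_product_cost(n: int) -> dict:
    """Exact MACC and FLOP counts for an n-element dot product."""
    maccs = n             # one multiply-accumulate per element pair
    flops = 2 * n - 1     # n multiplications plus (n - 1) additions
    return {"MACCs": maccs, "FLOPs": flops}

print(dot_product_cost(1000))  # {'MACCs': 1000, 'FLOPs': 1999}
```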
FLOPs of an actual convolution:
For the details of this computation, see "Pruning Convolutional Neural Networks for Resource Efficient Inference" (ICLR 2017).
Assuming the convolution is implemented with a sliding window and the cost of the nonlinearity is ignored, the FLOPs of one convolution layer are:
FLOPs = 2 × Hout × Wout × (Cin × K × K + 1) × Cout
(Hout and Wout are the output height and width; the "+ 1" accounts for the bias.)
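As a quick check, the formula can be written as a small helper; the function name conv_flops and its argument order are assumptions for illustration, not taken from the paper:

```python
# Sketch of the convolution-FLOPs formula above (sliding-window convolution,
# nonlinearity ignored); the "+ 1" term is the per-output bias addition.

def conv_flops(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    return 2 * h_out * w_out * (c_in * k * k + 1) * c_out

# Example: the 3x3, 256 -> 512 convolution on a 28x28 output used later in this post.
print(conv_flops(28, 28, 256, 512, 3))  # 1,850,490,880, i.e. roughly 1.85 GFLOPs
```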
Per-layer FLOPs in a neural network:
● Fully Connected Layer: multiplying a vector of length I with an I × J matrix to get a vector of length J takes I × J MACCs, or (2I - 1) × J FLOPs.
● Activation Layer: we do not measure these in MACCs but in FLOPs, because they are not dot products.
● Convolution Layer: K × K × Cin × Hout × Wout × Cout MACCs
● Depthwise-Separable Layer: (K × K × Cin × Hout × Wout) + (Cin × Hout × Wout × Cout) MACCs
  = Cin × Hout × Wout × (K × K + Cout) MACCs
● The reduction factor versus a regular convolution is K × K × Cout / (K × K + Cout), as worked through in the sketch below.
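A sketch of these per-layer MACC formulas, plus the reduction factor for the depthwise-separable case (all function names here are illustrative):

```python
# Per-layer MACC counts, following the formulas listed above.

def fc_maccs(i: int, j: int) -> int:
    """Fully connected layer: I x J MACCs."""
    return i * j

def conv_maccs(k: int, c_in: int, h_out: int, w_out: int, c_out: int) -> int:
    """Regular convolution: K * K * Cin * Hout * Wout * Cout MACCs."""
    return k * k * c_in * h_out * w_out * c_out

def dw_separable_maccs(k: int, c_in: int, h_out: int, w_out: int, c_out: int) -> int:
    """Depthwise-separable convolution: Cin * Hout * Wout * (K*K + Cout) MACCs."""
    return c_in * h_out * w_out * (k * k + c_out)

# Reduction factor for K = 3, Cout = 512:
k, c_out = 3, 512
print(k * k * c_out / (k * k + c_out))          # ~8.84, i.e. roughly 9x fewer MACCs
print(conv_maccs(3, 256, 28, 28, 512))          # 924,844,032
print(dw_separable_maccs(3, 256, 28, 28, 512))  # 104,566,784
```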
Memory access
● Computational cost is only one aspect of running speed; another important aspect, memory bandwidth, can matter even more than the amount of computation.
● On a modern computer, a single memory access from main memory is much slower than a single computation, by a factor of about 100 or more!
● How many memory accesses does a network make? For each layer they include:
    1. reading the layer's input
    2. computing the result, which includes loading the weights
    3. writing the layer's output
Memory for weights
● Fully Connected: with an input of size I and an output of size J, the total is (I + 1) × J weights (the +1 accounts for the bias).
● Convolutional layers have fewer weights than fully-connected layers: K × K × Cin × Cout
● Because memory accesses are so slow, heavy memory traffic has a large effect on how fast the network runs, sometimes even more than the amount of computation.
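These weight counts can be sketched the same way (the fully-connected example below includes the bias row; the helper names are assumptions):

```python
# Sketch of the weight-count formulas above.

def fc_weights(i: int, j: int) -> int:
    """Fully connected layer: (I + 1) * J parameters; the +1 is the bias."""
    return (i + 1) * j

def conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Convolution layer: K * K * Cin * Cout parameters (biases ignored)."""
    return k * k * c_in * c_out

print(fc_weights(4096, 4096))     # 16,781,312: FC layers dominate the parameter count
print(conv_weights(3, 256, 512))  # 1,179,648
```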
Feature maps and intermediate results
● Convolution Layer (the weights here are negligible):
input = Hin × Win × Cin × K × K × Cout
output = Hout × Wout × Cout
weights = K × K × Cin × Cout + Cout
Example: Cin = 256, Cout = 512, H = W = 28, K = 3, S = 1
1. Normal convolution layer
     input = 28 × 28 × 256 × 3 × 3 × 512 = 924,844,032
     output = 28 × 28 × 512 = 401,408
     weights = 3 × 3 × 256 × 512 + 512 = 1,180,160
     total = 926,425,600
2. Depthwise layer + pointwise layer
    1) Depthwise layer
        input = 28 × 28 × 256 × 3 × 3 = 1,806,336
        output = 28 × 28 × 256 = 200,704
        weights = 3 × 3 × 256 + 256 = 2,560
        total = 2,009,600
    2) Pointwise layer
        input = 28 × 28 × 256 × 1 × 1 × 512 = 102,760,448
        output = 28 × 28 × 512 = 401,408
        weights = 1 × 1 × 256 × 512 + 512 = 131,584
        total = 103,293,440
        total of both layers = 105,303,040
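The example above can be reproduced with a short script; the formulas are exactly the ones listed, and the function names are illustrative:

```python
# Sketch reproducing the memory-access totals above
# (input reads + output writes + weight reads).

def conv_accesses(h_out, w_out, c_in, c_out, k):
    """Regular convolution layer."""
    inp = h_out * w_out * c_in * k * k * c_out  # input values read
    out = h_out * w_out * c_out                 # output values written
    wts = k * k * c_in * c_out + c_out          # weights plus biases read
    return inp + out + wts

def separable_accesses(h_out, w_out, c_in, c_out, k):
    """Depthwise layer followed by a 1x1 pointwise layer."""
    dw = (h_out * w_out * c_in * k * k      # depthwise input reads
          + h_out * w_out * c_in            # depthwise output writes
          + k * k * c_in + c_in)            # depthwise weights + biases
    pw = (h_out * w_out * c_in * c_out      # pointwise input reads
          + h_out * w_out * c_out           # pointwise output writes
          + c_in * c_out + c_out)           # pointwise weights + biases
    return dw + pw

print(conv_accesses(28, 28, 256, 512, 3))       # 926,425,600
print(separable_accesses(28, 28, 256, 512, 3))  # 105,303,040, about 8.8x fewer
```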
Case study
● Input dimension: 126x224
MobileNet V1 parameters (multiplier = 1.0): 1.6M
MobileNet V2 parameters (multiplier = 1.0): 0.5M
MobileNet V2 parameters (multiplier = 1.4): 1.0M
MobileNet V1 MACCs (multiplier = 1.0): 255M
MobileNet V2 MACCs (multiplier = 1.0): 111M
MobileNet V2 MACCs (multiplier = 1.4): 214M
MobileNet V1 memory accesses (multiplier = 1.0): 283M
MobileNet V2 memory accesses (multiplier = 1.0): 159M
MobileNet V2 memory accesses (multiplier = 1.4): 286M
MobileNet V2 (multiplier = 1.4) is slightly slower than MobileNet V1 (multiplier = 1.0)
This provides some proof for my hypothesis that the amount of memory accesses is the primary factor for determining the speed of the neural net.
Conclusion
“I hope this shows that all these things — number of computations, number of parameters, and number of memory accesses — are deeply related. A model that works well on mobile needs to carefully balance those factors.”