An Analysis of PyTorch's GPU Memory Mechanism

Published: 2023/12/29



=======================================================

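PyTorch uses a few key terms when talking about GPU memory.

The GPU memory that PyTorch holds for its cache plus its variables is called reserved_memory; the part actually occupied by variables is called memory_allocated. It follows that reserved_memory is always greater than or equal to memory_allocated. The total GPU memory a PyTorch process occupies is larger than reserved_memory, though: total GPU memory = reserved_memory + PyTorch context.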
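A minimal sketch of how the two counters can be read, assuming that reserved_memory and memory_allocated above correspond to `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`. The helper name `memory_snapshot` is made up for illustration, and it returns None on a machine without CUDA.

```python
import torch


def memory_snapshot(device=0):
    """Return PyTorch's allocator counters in bytes, or None without CUDA.

    `memory_snapshot` is a hypothetical helper, not part of PyTorch.
    """
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(device),  # bytes held by live tensors
        "reserved": torch.cuda.memory_reserved(device),    # live tensors plus cache
    }


snap = memory_snapshot()
if snap is not None:
    # The cache can only add to what live tensors occupy.
    assert snap["reserved"] >= snap["allocated"]
    print(snap)
```

Neither counter includes the PyTorch context, which is why the usage seen from outside the process is larger than both.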

The size of the PyTorch context differs across GPUs and drivers. For example, https://zhuanlan.zhihu.com/p/424512257 reports the context overhead on an RTX 3090: with CUDA 11.3, the overhead is 1639MB.

Run this code:

import torch
temp = torch.tensor([1.0]).cuda()

Looking at the GPU memory usage that NVIDIA reports for this process, and knowing that memory_reserved is 2MB, the context size comes to roughly 1639MB.
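Why does a single float end up as 2MB of reserved memory? The sketch below models the rounding, assuming the allocator constants used by PyTorch's CUDA caching allocator at the time of writing (512-byte blocks, 2 MiB segments for requests up to 1 MiB). These constants are assumptions taken from the allocator's implementation, not a documented guarantee, and the function name is hypothetical.

```python
MIN_BLOCK = 512          # assumed allocation granularity, in bytes
SMALL_LIMIT = 1 << 20    # assumed: requests up to 1 MiB go to the small pool
SMALL_SEGMENT = 2 << 20  # assumed: the small pool grows in 2 MiB segments


def rounded_block(nbytes):
    """Round a request up to a whole number of MIN_BLOCK-sized blocks."""
    blocks = max(1, -(-nbytes // MIN_BLOCK))  # ceiling division, at least one block
    return blocks * MIN_BLOCK


one_float = 4  # bytes in a single float32 element
block = rounded_block(one_float)
print(block)                       # 512
print(block <= SMALL_LIMIT)        # True, so it is served from a small-pool segment
print(SMALL_SEGMENT // (1 << 20))  # 2
```

Under these assumptions the one-element tensor counts as 512 bytes of allocated memory, while the segment that backs it accounts for the 2MB of reserved memory.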

See the figure in https://zhuanlan.zhihu.com/p/424512257.

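From this we can see that PyTorch does not use the operating system's GPU memory management directly; it implements its own memory manager on top of it. With this layered scheme, a request that can be served from the cache is handled inside PyTorch's own manager without going to the OS. Only when the cache does not have enough free space does PyTorch ask the OS for more GPU memory. This speeds up allocation and reduces fragmentation.

There is a drawback. When several people share one GPU, one process can hold on to cached memory without releasing it while another cannot get enough. PyTorch does provide some options for configuring allocation, but they are aimed mainly at reducing fragmentation.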
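The caching behavior described here can be observed with a short sketch like the one below. It assumes the default caching allocator is active; `caching_demo` is a hypothetical helper that returns None when no CUDA device is present.

```python
import torch


def caching_demo():
    """Show that freeing a tensor returns memory to PyTorch's cache, not the driver."""
    if not torch.cuda.is_available():
        return None
    x = torch.empty(1024, 1024, device="cuda")  # 4 MiB of float32
    held = torch.cuda.memory_reserved()
    del x  # the block goes back to the cache; reserved memory is expected to stay put
    after_del = (torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    torch.cuda.empty_cache()  # release unused cached segments
    after_empty = torch.cuda.memory_reserved()
    return held, after_del, after_empty


result = caching_demo()
if result is not None:
    held, (allocated, reserved), after_empty = result
    print("reserved while tensor alive:", held)
    print("after del: allocated =", allocated, "reserved =", reserved)
    print("after empty_cache: reserved =", after_empty)
```

On a shared GPU, calling `torch.cuda.empty_cache()` at idle points is one way to give cached but unused memory back to other processes, at the cost of slower allocations afterwards.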

The code from https://zhuanlan.zhihu.com/p/424512257, with a few modifications:

import torch

s = 0  # running estimate of memory_allocated, in bytes

# Model initialization
linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()  # + 4194304
s = s + 4194304
print(torch.cuda.memory_allocated(), s)
linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()  # + 4096
s += 4096
print(torch.cuda.memory_allocated(), s)

# Input definition
inputs = torch.tensor([[1.0] * 1024] * 1024).cuda()  # shape = (1024, 1024)  # + 4194304
s += 4194304
print(torch.cuda.memory_allocated(), s)

# Forward pass
s = s + 4194304 + 512
loss = sum(linear2(linear1(inputs)))  # shape = (1)  # memory + 4194304 + 512
print(torch.cuda.memory_allocated(), s)

# Backward pass
loss.backward()  # memory - 4194304 + 4194304 + 4096
s = s - 4194304 + 4194304 + 4096
print(torch.cuda.memory_allocated(), s)

# Once more
loss = sum(linear2(linear1(inputs)))  # shape = (1)  # memory + 4194304 (no extra 512, because the reference held by loss still exists)
s += 4194304
print(torch.cuda.memory_allocated(), s)
loss.backward()  # memory - 4194304
s -= 4194304
print(torch.cuda.memory_allocated(), s)
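The byte counts in the comments above follow from the tensor shapes. The sketch below reproduces the running estimate `s` for the first forward and backward pass. It assumes float32 tensors (4 bytes per element) and one reading of the comments: the forward pass keeps the output of linear1 plus a 512-byte block for the loss, and the backward pass frees that activation while allocating the two weight gradients. `tensor_bytes` is a hypothetical helper.

```python
FLOAT32_BYTES = 4  # assumed element size: the tensors are float32


def tensor_bytes(*shape):
    """Bytes occupied by a float32 tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * FLOAT32_BYTES


w1 = tensor_bytes(1024, 1024)      # linear1.weight
w2 = tensor_bytes(1, 1024)         # linear2.weight
x = tensor_bytes(1024, 1024)       # inputs
hidden = tensor_bytes(1024, 1024)  # output of linear1, kept for the backward pass
loss_block = 512                   # smallest block assumed for the scalar loss

s = w1
print(s)  # 4194304
s += w2
print(s)  # 4198400
s += x
print(s)  # 8392704
s += hidden + loss_block        # forward pass
print(s)  # 12587520
s = s - hidden + w1 + w2        # backward pass: activation freed, two gradients allocated
print(s)  # 12591616
```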

=================================================

Modified code:

import torch

s = 0  # running estimate of memory_allocated, in bytes

# Model initialization
linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()  # + 4194304
s = s + 4194304
print(torch.cuda.memory_allocated(), s)
linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()  # + 4096
s += 4096
print(torch.cuda.memory_allocated(), s)

# Input definition
inputs = torch.tensor([[1.0] * 1024] * 1024).cuda()  # shape = (1024, 1024)  # + 4194304
s += 4194304
print(torch.cuda.memory_allocated(), s)

# Forward pass
s = s + 4194304 + 512
loss = sum(linear2(linear1(inputs)))  # shape = (1)  # memory + 4194304 + 512
print(torch.cuda.memory_allocated(), s)

# Backward pass
loss.backward()  # memory - 4194304 + 4194304 + 4096
s = s - 4194304 + 4194304 + 4096
print(torch.cuda.memory_allocated(), s)

# Repeat many times
for _ in range(10000):
    loss = sum(linear2(linear1(inputs)))  # shape = (1)  # memory + 4194304 (no extra 512, because the reference held by loss still exists)
    loss.backward()  # memory - 4194304

# Peak statistics over the whole run
print(torch.cuda.max_memory_reserved() / 1024 / 1024, "MB")
print(torch.cuda.max_memory_allocated() / 1024 / 1024, "MB")
print(torch.cuda.max_memory_cached() / 1024 / 1024, "MB")  # deprecated alias of max_memory_reserved()
print(torch.cuda.memory_summary())


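So here is the question: how much GPU memory does it take to guarantee that this program runs to completion?

We already know the peak reserved_memory is 22MB, so the maximum GPU memory the program needs is reserved_memory + context_memory.

This run uses a 1060G card, so first measure context_memory on it.

Run this code: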

import torch
temp = torch.tensor([1.0]).cuda()

The GPU memory usage reported by NVIDIA is 681MB, so context_memory is 681MB - 2MB = 679MB.

Since max_reserved_memory = 22MB, the full run should need at most 679 + 22 = 701MB.
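The arithmetic of this estimate, written out as a small sketch. The two function names are hypothetical; the numbers are the ones measured above.

```python
def context_mb(smi_used_mb, reserved_mb):
    """Context size = what NVIDIA reports for the process minus PyTorch's reserved memory."""
    return smi_used_mb - reserved_mb


def required_mb(context, peak_reserved_mb):
    """Estimated peak usage = context plus the peak reserved memory."""
    return context + peak_reserved_mb


ctx = context_mb(smi_used_mb=681, reserved_mb=2)
print(ctx)                                     # 679
print(required_mb(ctx, peak_reserved_mb=22))   # 701
```

The estimate is only as good as the assumption that the context measured with a one-element tensor is the same context the full program ends up with.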

To verify, run the code again, this time keeping the process alive at the end:

import torch
import time

s = 0  # running estimate of memory_allocated, in bytes

# Model initialization
linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()  # + 4194304
s = s + 4194304
print(torch.cuda.memory_allocated(), s)
linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()  # + 4096
s += 4096
print(torch.cuda.memory_allocated(), s)

# Input definition
inputs = torch.tensor([[1.0] * 1024] * 1024).cuda()  # shape = (1024, 1024)  # + 4194304
s += 4194304
print(torch.cuda.memory_allocated(), s)

# Forward pass
s = s + 4194304 + 512
loss = sum(linear2(linear1(inputs)))  # shape = (1)  # memory + 4194304 + 512
print(torch.cuda.memory_allocated(), s)

# Backward pass
loss.backward()  # memory - 4194304 + 4194304 + 4096
s = s - 4194304 + 4194304 + 4096
print(torch.cuda.memory_allocated(), s)

# Repeat many times
for _ in range(10000):
    loss = sum(linear2(linear1(inputs)))  # shape = (1)  # memory + 4194304 (no extra 512, because the reference held by loss still exists)
    loss.backward()  # memory - 4194304

# Peak statistics over the whole run
print(torch.cuda.max_memory_reserved() / 1024 / 1024, "MB")
print(torch.cuda.max_memory_allocated() / 1024 / 1024, "MB")
print(torch.cuda.max_memory_cached() / 1024 / 1024, "MB")  # deprecated alias of max_memory_reserved()
print(torch.cuda.memory_summary())

# Keep the process alive so its GPU memory usage can be read from outside
time.sleep(60)


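The actual usage turns out to be 803MB, and 803 - 701 = 102MB. The estimate above cannot account for this gap. The only explanation left is that context_memory itself changes from program to program: different programs pull in different PyTorch functions, and that changes the size of the context. Working backward from this idea, context_memory here is 803 - 22 = 781MB.

To test this, modify the code: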

import torch
import time

s = 0  # running estimate of memory_allocated, in bytes

# Model initialization (hidden width doubled to 2048)
linear1 = torch.nn.Linear(1024, 1024 * 2, bias=False).cuda()  # + 8388608
s = s + 8388608
print(torch.cuda.memory_allocated(), s)
linear2 = torch.nn.Linear(1024 * 2, 1, bias=False).cuda()  # + 8192
s += 8192
print(torch.cuda.memory_allocated(), s)

# Input definition
inputs = torch.tensor([[1.0] * 1024] * 1024).cuda()  # shape = (1024, 1024)  # + 4194304
s += 4194304
print(torch.cuda.memory_allocated(), s)

# Forward pass
s = s + 8388608 + 512
loss = sum(linear2(linear1(inputs)))  # shape = (1)  # memory + 8388608 + 512
print(torch.cuda.memory_allocated(), s)

# Backward pass
loss.backward()  # memory - 8388608 + 8388608 + 8192
s = s - 8388608 + 8388608 + 8192
print(torch.cuda.memory_allocated(), s)

# Repeat
for _ in range(100):
    loss = sum(linear2(linear1(inputs)))  # shape = (1)  # memory + 8388608 (no extra 512, because the reference held by loss still exists)
    loss.backward()  # memory - 8388608

# Peak statistics over the whole run
print(torch.cuda.max_memory_reserved() / 1024 / 1024, "MB")
print(torch.cuda.max_memory_allocated() / 1024 / 1024, "MB")
print(torch.cuda.max_memory_cached() / 1024 / 1024, "MB")  # deprecated alias of max_memory_reserved()
print(torch.cuda.memory_summary())

# Keep the process alive so its GPU memory usage can be read from outside
time.sleep(60)


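The run reports a peak reserved memory of 42MB, so by this reasoning the code needs 781 + 42 = 823MB to run to completion.

The GPU memory usage reported by NVIDIA supports this hypothesis. In other words, the PyTorch functions used, the GPU model, the driver, the operating system, and the CUDA version all affect the size of context_memory.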

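Of these, the PyTorch functions are the hardest factor to pin down. You may well run all your code on the same platform, but you are unlikely to use the same PyTorch functions every time. So the minimum GPU memory a program needs can only really be determined by running one complete forward and backward pass of the network.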

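My own method for finding the minimum GPU memory a program needs is to set context_memory aside and look directly at the peak usage after one backward pass. Suspend the program after one backward pass, for example with a sleep, and check how much GPU memory NVIDIA reports in total. Because context_memory depends on the specific functions used, the safest approach is to use nvidia-smi to monitor the peak GPU memory consumption after one complete backward pass.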
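One way to take that reading from inside the script, sketched under the assumption that `nvidia-smi` is on the PATH and that its `--query-compute-apps=pid,used_memory` query reports per-process usage in MiB on your platform. Both helper names are hypothetical.

```python
import os
import shutil
import subprocess


def parse_used_memory(csv_text, pid):
    """Return the MiB used by `pid` from 'pid, used_memory' CSV rows, or None."""
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and parts[0] == str(pid):
            try:
                return int(parts[1])
            except ValueError:  # e.g. "[N/A]" where per-process usage is unavailable
                return None
    return None


def query_own_gpu_memory():
    """Ask nvidia-smi how much GPU memory the current process holds, or None."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    ).stdout
    return parse_used_memory(out, os.getpid())


# Parsing a made-up sample of the CSV output:
print(parse_used_memory("4321, 803\n9876, 120\n", 4321))  # 803
```

Calling `query_own_gpu_memory()` right after the first `loss.backward()` would replace the sleep-and-look step. The value includes the context, which is the point of measuring from outside PyTorch.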

