mmdetection/mmdetection3d multi-node multi-GPU training
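3D detection training takes a long time, so I wanted to run mmdet3d across multiple nodes. After the annotation files (pkl/json) finished loading, training hung, and I found an error message.
It included the warning "using best-guess GPU", which most likely means the rank is mapped to the wrong device.
The relevant code is: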
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def init_dist(launcher, backend='nccl', **kwargs):
    # Use the 'spawn' start method unless one has already been chosen.
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')
    # Dispatch to the launcher-specific initializer.
    if launcher == 'pytorch':
        _init_dist_pytorch(backend, **kwargs)
    elif launcher == 'mpi':
        _init_dist_mpi(backend, **kwargs)
    elif launcher == 'slurm':
        _init_dist_slurm(backend, **kwargs)
    else:
        raise ValueError(f'Invalid launcher type: {launcher}')


def _init_dist_pytorch(backend, **kwargs):
    rank = int(os.environ['RANK'])
    local_rank = int(os.environ["LOCAL_RANK"])
    num_gpus = torch.cuda.device_count()
    # Bind this process to its GPU on the local node before creating the process group.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend=backend, **kwargs)
Nothing looks wrong here. Following the hint in the warning, I changed the torch.cuda.set_device(local_rank) call, but training still did not work.
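As a way to rule out a bad rank-to-GPU mapping, the device index can be computed and validated before it is passed to torch.cuda.set_device. The following is a minimal sketch, not part of mmcv or mmdet3d: the helper name resolve_local_rank is hypothetical, and the fallback to RANK modulo the GPU count assumes every node has the same number of GPUs.

```python
def resolve_local_rank(env, num_gpus):
    """Return the GPU index this process should use on its node.

    `env` is a mapping such as os.environ; `num_gpus` is the number of
    GPUs visible on this node (e.g. torch.cuda.device_count()).
    Hypothetical helper for illustration only.
    """
    if num_gpus <= 0:
        raise ValueError("no visible GPUs on this node")
    if "LOCAL_RANK" in env:
        # Preferred: the launcher tells each process its index on the node.
        local_rank = int(env["LOCAL_RANK"])
    else:
        # Assumption: all nodes have the same number of GPUs, so the
        # global rank modulo the GPU count gives the local index.
        local_rank = int(env["RANK"]) % num_gpus
    if not 0 <= local_rank < num_gpus:
        raise ValueError(
            f"LOCAL_RANK={local_rank} is out of range for {num_gpus} GPU(s)"
        )
    return local_rank
```

On a training node this would be called as resolve_local_rank(os.environ, torch.cuda.device_count()), with the result passed to torch.cuda.set_device before dist.init_process_group. If it raises, the launcher environment is the problem. If it returns a sensible index and training still hangs, as it did here, the cause is probably elsewhere.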
Suspecting that the environment was not set up correctly, I added NCCL environment initialization:
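import os
import subprocess


def configure_nccl():
    # These values are the ones used in this setup; the GID index and
    # traffic class depend on the cluster's network configuration.
    os.environ["NCCL_LAUNCH_MODE"] = ""
    # "0" keeps the InfiniBand/RoCE transport enabled.
    os.environ["NCCL_IB_DISABLE"] = "0"
    # List the mlx5_* devices whose port-1 GID types contain "v".
    os.environ["NCCL_IB_HCA"] = subprocess.getoutput(
        "cd /sys/class/infiniband/ > /dev/null; for i in mlx5_*; "
        "do cat $i/ports/1/gid_attrs/types/* 2>/dev/null "
        "| grep v >/dev/null && echo $i ; done; > /dev/null"
    )
    # GID index used in RoCE mode.
    os.environ["NCCL_IB_GID_INDEX"] = "3"
    # InfiniBand traffic class.
    os.environ["NCCL_IB_TC"] = "106"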
With this in place, multi-node training works!
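The shell one-liner that fills NCCL_IB_HCA is hard to read, so here is a rough pure-Python equivalent as a sketch. It scans the same sysfs layout for mlx5_* devices whose port-1 GID type entries contain the letter "v". The function name find_roce_hcas and its root parameter are hypothetical, added so the scan can be pointed at a test directory. One deliberate difference: the names are joined with commas, which is the list format NCCL documents for NCCL_IB_HCA, whereas the shell version returns them separated by newlines.

```python
import glob
import os


def find_roce_hcas(root="/sys/class/infiniband"):
    """Return names of mlx5_* devices whose port-1 GID types contain "v".

    Hypothetical helper mirroring the shell loop: for each device, read
    every file under ports/1/gid_attrs/types/ and keep the device if any
    of them contains the letter "v".
    """
    names = []
    for dev_dir in sorted(glob.glob(os.path.join(root, "mlx5_*"))):
        pattern = os.path.join(dev_dir, "ports", "1", "gid_attrs", "types", "*")
        for path in glob.glob(pattern):
            try:
                with open(path) as f:
                    content = f.read()
            except OSError:
                # Unreadable entries are skipped, like 2>/dev/null in the shell loop.
                continue
            if "v" in content:
                names.append(os.path.basename(dev_dir))
                break  # one match is enough for this device
    return names


if __name__ == "__main__":
    # Comma-joined value that could be assigned to os.environ["NCCL_IB_HCA"].
    print(",".join(find_roce_hcas()))
```

An empty result means no matching device was found on the node, which is worth checking before assuming the NCCL settings took effect.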