Validating Distributed Multi-Node Autonomous Driving AI Training with NVIDIA DGX Systems on OpenShift
Developing deep neural networks (DNNs) for autonomous vehicles is a formidable undertaking. This post validates multi-node, multi-GPU, distributed training on DGX systems running in the DXC Robotic Drive environment.
The Robotic Drive containerized compute platform was used to drive the deep learning (DL) workloads on OpenShift 3.11. OpenShift 3.11 is deployed today in many large, GPU-accelerated autonomous driving (AD) development and test environments. The approach shown here applies equally to newer OpenShift releases and can be transferred to other OpenShift-based clusters.
DXC Robotic Drive is a data-driven development platform for autonomous driving that significantly de-risks and accelerates the development, testing, and validation of ADAS/AD functions in support of Level 2+ through Level 5 autonomy. It is the largest known exabyte-scale development solution, using industry-proven on-premises and cloud infrastructure, methods, tools, and accelerators to enable a highly automated AD development process.
The interoperability test environment was the Robotic Drive Innovation Lab, running OpenShift 3.11 and 4.3.
DL workloads at scale
Data parallelism is the most common design pattern for scaling out DL workloads. There are many references and best practices for accelerating vision and recurrent neural networks this way.
The DL model is instantiated multiple times and the data is streamed through those instances in parallel. The instances exchange gradients with each other so that they work collaboratively rather than independently.
This is a classic compute pattern from the Message Passing Interface (MPI) framework used in high-performance computing (HPC). Orchestrating these workloads with the help of the well-known MPI is therefore straightforward, and MPI also scales easily across multiple nodes.
Multi-GPU-capable DL frameworks such as PyTorch and TensorFlow are well suited from the very start of a project, ensuring that a workload can run anywhere from a single GPU workstation up to a large GPU cluster.
These frameworks also support data-parallel training natively with MPI, and workloads can be triggered with MPI tooling such as mpirun or mpiexec. Several implementations of the data-parallel pattern, such as Horovod, follow this approach.
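As a minimal sketch of this pattern, assuming an environment where MPI, Horovod, and a training script are already present (the host names and train.py below are illustrative placeholders, not part of the setup described later):

# Hypothetical launch of 16 data-parallel workers across two 8-GPU hosts.
mpirun -np 16 -H dgx01:8,dgx02:8 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python train.py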
Red Hat OpenShift Container Platform (OCP) is a platform-as-a-service built around containers, orchestrated by Kubernetes on top of the Docker or CRI-O container runtime. OpenShift focuses on security and includes fixes for defect, security, and performance issues in upstream Kubernetes. Like Kubernetes, it allows clusters to be deployed and managed at scale, backed by Red Hat support.
Kubernetes and OpenShift handle MPI workloads with ease. One integration exists in the form of the Kubeflow MPI Operator, which orchestrates the resources and launches the workload behind the scenes.
Figure 1 shows a DL workload using two DGX-1 systems. In this case there are 16 individual processes. In Horovod, each process is given a unique ID called a rank to distinguish it: rank 0 through rank 15. All processes work in parallel on different parts of the input data and exchange their gradients to work collaboratively.
Figure 1. DL workload using two DGX-1 systems.
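As a minimal sketch of how ranks appear in code, assuming the PyTorch flavor of Horovod shipped in the container image used later in this post:

# Each process initializes Horovod, learns its rank, and pins itself to one GPU.
import torch
import horovod.torch as hvd

hvd.init()                               # establishes ranks 0..N-1
torch.cuda.set_device(hvd.local_rank())  # one GPU per process on a node
print(f"rank {hvd.rank()} of {hvd.size()}, local GPU {hvd.local_rank()}")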
For efficient communication between the individual workers, the NVIDIA Collective Communications Library (NCCL) is used. NCCL implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs.
Low-latency POSIX storage for the training data is provided through Robotic Drive persistent volumes. Handling storage at scale is crucial, but it is not the focus here.
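As an illustration only (the claim name, storage class, and size below are placeholders, not the Robotic Drive configuration), training data could be requested through a persistent volume claim and mounted into the training pods:

# Hypothetical PVC for training data; adjust storageClassName to the storage backend in use.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: [ "ReadWriteMany" ]
  storageClassName: robotic-drive-storage   # placeholder
  resources:
    requests:
      storage: 1Ti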
Installation steps
The following sections show how the DGX systems were installed in the DXC Robotic Drive environment running OpenShift v3.11.
Overview of the test system
OpenShift v3.11 requires at least a temporary bootstrap machine, three master nodes, and at least two compute nodes.
Because DL can be a data-intensive workload, the cluster needs a suitable network solution. The Robotic Drive Innovation Lab provides HPE FlexFabric 5945 32QSFP28 switches, which are used in combination with the Mellanox adapters of the DGX systems.
All cluster-interconnect adapters of the DGX-1 systems are used in Ethernet mode and are bonded together using the LACP bonding mode.
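As a sketch of what such a bond could look like on RHEL 7 with NetworkManager (the interface names are assumptions, not the lab's actual configuration):

# Hypothetical 802.3ad (LACP) bond over two Mellanox ports.
nmcli con add type bond con-name bond0 ifname bond0 mode 802.3ad
nmcli con add type bond-slave ifname enp5s0 master bond0
nmcli con add type bond-slave ifname enp132s0 master bond0
nmcli con up bond0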
The following table summarizes the hardware and software configuration of the cluster.
Table 1. Overview of HW/SW configuration.
Preparing the DGX-1 systems
To install RHEL 7.7 on the DGX systems, use the installation instructions provided by NVIDIA. These steps also cover installing the DGX-specific software.
Connecting the DGX systems to the OpenShift cluster
Install the cluster by following the instructions in the Red Hat documentation for OpenShift 3.11.
After the cluster is up and running, use the official playbooks provided by Red Hat to scale the cluster out and include the two DGX systems. These playbooks add the necessary libraries, configure the nodes, and add them to the cluster itself.
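A sketch of what that scale-out could look like with openshift-ansible (the inventory path is a placeholder; the playbook location is the standard one for OpenShift 3.11 and should be verified against the Red Hat documentation):

# Add the DGX nodes to the [new_nodes] group of the Ansible inventory, then run:
ansible-playbook -i /path/to/inventory \
    /usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.yml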
Figure 2 shows the start screen of the OCP dashboard, which allows you to interact with the OpenShift cluster. These interactions include monitoring resources, creating pods, and retrieving logs.
Figure 2. Start screen of an OpenShift cluster.
Enabling GPUs in the cluster
OpenShift (and Kubernetes) support standard resources such as CPU and memory out of the box, as well as monitoring of available disk space. Other resources are handled using device plugins or operators. In this setup, the NVIDIA GPU device plugin for OpenShift 3.11 is used.
Newer versions of Kubernetes and OpenShift (v1.13+, v4.1+) introduced the Operator Framework. Using this framework, the NVIDIA GPU Operator automates the deployment of components that previously had to be deployed manually. These components include the NVIDIA drivers, the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labeling, and DCGM-based monitoring.
For more information, see the device plugin and the NVIDIA GPU Operator.
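One way to confirm that the GPUs are advertised to the scheduler after the device plugin is running (the node name below is taken from the pod listing later in this post) is to inspect the node's allocatable resources:

# The DGX node should report nvidia.com/gpu in its capacity/allocatable resources.
oc describe node dgx01.dev.XXX | grep -i nvidia.com/gpu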
Figure 3. Multi-GPU, single-node workload example.
In the following examples, we show you how to trigger an MPI-based DL workload using the Horovod framework. For this test, we used an MPI-enabled Docker container with the required frameworks, such as the NVIDIA GPU Cloud (NGC) containers. NGC is a GPU-optimized software hub that simplifies AI and HPC workflows.
Start the workload, then wrap the command in an OpenShift YAML file for orchestration.
To run a scalable ResNet-50 training with randomly generated data natively, run the following command:
docker run --gpus all -it horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6
The previous example makes use of all GPUs available on the system with the --gpus all flag. To allocate only a smaller number of GPUs, change it to --gpus 1. You can also specify individual GPUs by using their device IDs.
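For example, a sketch of pinning the container to two specific GPUs by device ID (assuming Docker 19.03+ with the NVIDIA container runtime):

docker run --gpus '"device=0,1"' -it horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6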
To start the workload in a container in interactive mode, a typical command looks like the following code example. In this example, eight GPUs are used:
horovodrun -np 8 -H localhost:8 python pytorch_synthetic_benchmark.py
A corresponding YAML file is used to start this workload through OpenShift. It can be deployed straight from the OpenShift login node, an OpenShift master node, or the GUI.
oc create -f horovod_example_8gpus.yaml
kubectl create -f horovod_example_8gpus.yaml
This is the content of the YAML file used:
apiVersion: v1
kind: Pod
metadata:
  name: horovod-new-test
  namespace: managed-machine-learning
spec:
  serviceAccount: tensorflow-sa
  restartPolicy: OnFailure
  containers:
  - name: horovod-test
    image: horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6
    command: [ "horovodrun", "-np", "8", "python", "pytorch_synthetic_benchmark.py" ]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
    - name: NVIDIA_REQUIRE_CUDA
      value: "cuda>=9.0"
    resources:
      limits:
        nvidia.com/gpu: "8"
      requests:
        nvidia.com/gpu: "8"
    securityContext:
      privileged: true
To influence the scalability of the workload, set the following parameters accordingly.
The command section specifies the command to be run, which is the same as in the Docker example. The number of processes spawned can be controlled with the -np 8 flag.
command: [ "horovodrun", "-np", "8", "python", "pytorch_synthetic_benchmark.py" ]
Allocate the proper number of GPUs in the resources section. The number should correspond to the number of processes set in the horovodrun command.
resources:
  limits:
    nvidia.com/gpu: "8"
  requests:
    nvidia.com/gpu: "8"
A pod is a group of one or more containers. Figure 4 shows the creation of pods using the web interface of OpenShift. As mentioned earlier, you can also create a new pod from the CLI using either oc or kubectl.
Figure 4. Created pods inside the OpenShift dashboard.
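After creation, the pod status can also be checked from the CLI; a quick check using the namespace from the pod spec shown earlier:

oc get pods -n managed-machine-learning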
There are several ways to retrieve the output or logs of pods running in an OpenShift cluster. One way is to access the logs through the CLI by using either kubectl logs or oc logs:
kubectl logs <pod-name> -n <namespace>
oc logs <pod-name> -n <namespace>
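To stream the output of a long-running training pod live, both CLIs also support following the log:

oc logs -f <pod-name> -n <namespace>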
Figure 5 shows the other possibility, using the OpenShift GUI.
Figure 5. Access to the pod output using the dashboard.
Multi-GPU, multi-node workload
There are several ways to take advantage of multiple systems for a single DL training job. The compute resources of several systems can be aggregated to speed up training. This is especially beneficial for workloads such as DNN training in autonomous vehicle development, where the turnaround time of experiments on large datasets is often a critical factor.
Figure 1 showed two DGX systems working together on a single DL job. The DL workload uses a total of 16 GPUs, eight on each DGX-1 system. Each worker processes its computation on one partition of the data.
All workers synchronize their gradients with their peers over NVIDIA NVLink, a communication protocol developed by NVIDIA that allows the transfer of data and control code between CPUs and GPUs or between GPUs, as well as over the network. This synchronization step is shown as dotted lines.
The same code base and containers as before are reused. The only difference from the previous orchestration approach is the use of the well-known MPI Operator, which simplifies distributing and deploying the workload across multiple DGX systems.
MPI Operator
There are several options for running multi-GPU, multi-node workloads on an OpenShift or Kubernetes cluster. One common framework is the MPI Operator, provided by the Kubeflow project. The MPI Operator handles the orchestration of DL workloads such as the one shown earlier.
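As a sketch of the installation (the manifest path is an assumption based on the Kubeflow mpi-operator repository at the time of the v1alpha2 API; check the project for the current deployment instructions):

oc create -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1alpha2/mpi-operator.yaml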
Installing the MPI Operator introduces a new job type in the cluster: MPIJob. The following job spec shows the corresponding YAML file for launching such an MPIJob workload through OpenShift or Kubernetes. It can be deployed straight from the OpenShift login node, a master node, or the GUI.
Unlike the YAML file in the previous section, this example uses an NVIDIA TensorFlow Docker image and runs a synthetic ResNet-50 benchmark on 16 GPUs across two DGX-1 systems. Using the TensorFlow Docker image provided through NGC benefits from NVIDIA's continuous performance improvements. The communication between the three created pods (one launcher pod and two worker pods) is handled by the MPI Operator.
As in the YAML definition of the Horovod example, the number of processes launched with mpirun -np 16 again corresponds to the number of requested GPUs. In this case, that is two worker pods with eight GPUs attached to each. The example can easily be scaled by changing the number of worker pods and the number of spawned processes, as sketched after the job spec below.
The job spec for the 16-GPU training job looks like the following:
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: 16gpu-tensorflow-benchmark-imagenet
spec:
  slotsPerWorker: 8
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: nvcr.io/nvidia/tensorflow:19.10-py3
            name: tensorflow-benchmarks
            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
            command:
            - mpirun
            - -np
            - "16"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - nvidia-examples/resnet50v1.5/main.py
            - --mode=training_benchmark
            - --batch_size=128
            - --num_iter=90
            - --iter_unit=epoch
            - --results_dir=/efs
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: nvcr.io/nvidia/tensorflow:19.10-py3
            name: tensorflow-benchmarks
            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
            resources:
              limits:
                nvidia.com/gpu: 8
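As a sketch of the scaling rule described above (illustrative values only, not a tested configuration), a 32-GPU variant of this spec would change just two places:

# Hypothetical 32-GPU variant: in the launcher command, change
#   - -np
#   - "16"
# to
#   - -np
#   - "32"
# and in the Worker section set:
#   replicas: 4    # 4 workers x slotsPerWorker (8) = 32 GPUs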
On Kubernetes, the YAML file looks like the one used in OpenShift. One important difference is that environment variables in OpenShift v3.11 must be set inside the spec of the pods.
env:
- name: NVIDIA_DRIVER_CAPABILITIES
  value: compute,utility
- name: NVIDIA_REQUIRE_CUDA
  value: cuda>=9.0
Aggregating cutoff resources
In large computing environments, there are often cutoffs or otherwise unused resources. The MPI Operator is of great use in such cases, as it can aggregate those cutoffs and avoid waste. This is a property that only the MPI Operator can deliver.
The following code example shows an exemplary two-GPU job that can aggregate resources from multiple systems. The job requests two GPUs for the workload. You can customize this example to fit almost every situation, for example three GPUs on three different nodes or four GPUs on two different nodes.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks-gpu-v1a2
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks-gpu
            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
            command:
            - mpirun
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - -mca
            - plm_base_verbose
            - "100"
            - -mca
            - btl_base_verbose
            - "30"
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_REQUIRE_CUDA
              value: cuda>=9.0
            resources:
              limits:
                nvidia.com/gpu: 1
Running this YAML file results in a log like the following, which can be collected using either the GUI or the CLI.
oc logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
tensorflow-benchmarks-gpu-v1a2-worker-1:26:156 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0
tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
tensorflow-benchmarks-gpu-v1a2-worker-1:26:156 [0] NCCL INFO comm 0x7f887032e120 rank 1 nranks 2 cudaDev 0 nvmlDev 7 - Init COMPLETE
tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO comm 0x7f003434e940 rank 0 nranks 2 cudaDev 0 nvmlDev 3 - Init COMPLETE
tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Launch mode Parallel
Done warm up
Step  Img/sec  total_loss
Done warm up
Step  Img/sec  total_loss
1   images/sec: 147.6 +/- 0.0 (jitter = 0.0)  8.308
1   images/sec: 147.8 +/- 0.0 (jitter = 0.0)  8.378
10  images/sec: 159.1 +/- 2.3 (jitter = 4.7)  8.526
Querying the running pods inside the cluster shows that cutoff resources from two nodes are being used.
$ oc get pods -o wide
NAME                                            READY   STATUS    RESTARTS   AGE     IP               NODE
tensorflow-benchmarks-gpu-v1a2-launcher-dqgsk   1/1     Running   0          3m45s   10.233.XXX.XXX   dgx01.dev.XXX
tensorflow-benchmarks-gpu-v1a2-worker-0         1/1     Running   0          3m45s   10.233.XXX.XXX   dgx02.dev.XXX
tensorflow-benchmarks-gpu-v1a2-worker-1         1/1     Running   0          3m45s   10.233.XXX.XXX   dgx01.dev.XXX
Docker network configuration
NCCL discovers the topology together with its peers. The following output was taken from one of the running pods. It shows that, besides the standard local adapter, only one additional connection is configured.
eth0      Link encap:Ethernet  HWaddr 0a:58:0a:81:07:b2
          inet addr:10.129.7.178  Bcast:10.129.7.255  Mask:255.255.254.0
          inet6 addr: fe80::419:d8ff:fe19:1bbd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:21275 errors:0 dropped:0 overruns:0 frame:0
          TX packets:33993 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11305584 (11.3 MB)  TX bytes:4496191 (4.4 MB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:28205726 errors:0 dropped:0 overruns:0 frame:0
          TX packets:28205726 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:12809324160 (12.8 GB)  TX bytes:12809324160 (12.8 GB)
Having a single NIC inside the container (here, eth0) is a starting point. A more sophisticated setup may use multiple Docker network adapters.
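One possible direction, assumed here purely for illustration and not part of the tested environment, is attaching a secondary network to the worker pods with Multus through a NetworkAttachmentDefinition and the corresponding pod annotation:

# Hypothetical pod annotation referencing an additional network named "highspeed-net"
# that a cluster administrator would have defined as a NetworkAttachmentDefinition.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: highspeed-net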
Get started with multi-GPU, multi-node training running on OpenShift
Adopt industry state-of-the-art DL workloads and deploy them at scale today. In this post, we showed you that data-parallel training using the MPI paradigm is highly flexible across different environments.
With this approach, DL engineers working with a scalable DGX system cluster get the following benefits, regardless of their orchestration software:
· Scale beyond the limitations of a single node and enable DL in a larger cluster.
· Prevent turning resource cutoffs into waste by aggregating leftover resources effectively.
The Robotic Drive containerized compute platform on OpenShift orchestrates DL workloads at scale, including multi-GPU, multi-node jobs on NVIDIA DGX systems. Above all, vision-based ML models provide a solid technical foundation for state-of-the-art, complex driving-behavior tasks such as motion planning. These tasks require extensive exploration and experimentation, which is addressed by the Robotic Drive Innovation Lab.