Using the Amazon Deep Learning AMI to Quickly Achieve CUDA, cuDNN, and Deep Learning Framework Version Compatibility

Published: 2024/1/8

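Introduction

When we start a deep learning project, we usually begin by choosing a suitable deep learning framework. Developing models with a framework saves a great deal of repetitive coding. The most popular frameworks today include TensorFlow, PyTorch, MXNet, and Caffe. GPU-accelerated training is essential during development. Taking TensorFlow as an example, a GPU-accelerated environment requires CUDA, cuDNN, and TensorFlow to be installed on the system.

CUDA, cuDNN, and TensorFlow are each available in many versions, which can leave developers unsure which combination is compatible when they build their environment. For example, a model that uses batch normalization or layer normalization may run correctly on the CPU, yet fail to run with GPU acceleration if the CUDA/cuDNN/TensorFlow versions in the environment are incompatible. Building a version-compatible environment by hand takes a series of complicated steps on the machine.

Using the Amazon Deep Learning AMI on Amazon EC2 simplifies this process. The Amazon Deep Learning AMI comes with GPU drivers and several recent versions of the acceleration libraries preinstalled, and it is configured by default for NVIDIA CUDA 10.0. With a few commands you can switch between versions to match the deep learning framework you use and deploy a customized deep learning environment. Below we use TensorFlow as an example to show how to quickly set up a GPU-accelerated environment in which the CUDA, cuDNN, and TensorFlow versions are compatible.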

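Using the Ubuntu Deep Learning AMI

Creating an Amazon EC2 instance

In the Amazon Web Services console, select the Amazon EC2 service and choose the Region in which to create the instance (this walkthrough uses China (Ningxia), cn-northwest-1). Click Launch Instance.

Search for ubuntu and select Amazon Deep Learning AMI (Ubuntu 18.04) Version 51.0.

Select the GPU instance type p3.2xlarge and click Review and Launch to create the Amazon EC2 instance.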

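Running Layer Normalization on the CPU

Once the Amazon EC2 instance has been created, log in to the machine and test the Layer Normalization feature in TensorFlow. First, install TensorFlow 2.3.0 with Python's pip tool.

pip3 install tensorflow-gpu==2.3.0

Below is a Layer Normalization example. For a given input tensor, it normalizes along the specified axis.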

import os
import tensorflow as tf
import numpy as np

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # Hide GPU from visible devices

data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
print('data:', data)
layer = tf.keras.layers.LayerNormalization(axis=1)  # Layer normalization
output = layer(data)
print('output:', output)

The code hides the GPU so that it runs on the CPU. It runs successfully and produces the expected output.

data: tf.Tensor(
[[ 0. 10.]
 [20. 30.]
 [40. 50.]
 [60. 70.]
 [80. 90.]], shape=(5, 2), dtype=float32)
output: tf.Tensor(
[[-0.99998  0.99998]
 [-0.99998  0.99998]
 [-0.99998  0.99998]
 [-0.99998  0.99998]
 [-0.99998  0.99998]], shape=(5, 2), dtype=float32)
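
To see where the ±0.99998 values come from, the normalization can be worked through by hand. The following is a minimal pure-Python sketch, not TensorFlow's implementation. It assumes the Keras layer's default settings: epsilon = 0.001 and a freshly initialized scale of 1 and offset of 0. The function name `layer_norm_rows` is our own.

```python
import math


def layer_norm_rows(rows, epsilon=1e-3):
    """Normalize each row to zero mean and unit variance.

    Assumes no learned scale/offset (gamma = 1, beta = 0) and the
    Keras default epsilon of 0.001; both are assumptions of this sketch.
    """
    normalized = []
    for row in rows:
        mean = sum(row) / len(row)
        # Population variance over the normalized axis.
        variance = sum((x - mean) ** 2 for x in row) / len(row)
        scale = math.sqrt(variance + epsilon)
        normalized.append([(x - mean) / scale for x in row])
    return normalized


data = [[0.0, 10.0], [20.0, 30.0], [40.0, 50.0], [60.0, 70.0], [80.0, 90.0]]
for row in layer_norm_rows(data):
    print([round(value, 5) for value in row])
```

Every row has mean-centered values of -5 and 5 and a variance of 25, so each element becomes ±5 / sqrt(25.001), which is why all five rows of the output are the same and slightly less than 1 in magnitude.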

Running Layer Normalization on the GPU

Next, we remove the line that hides the GPU from the code above and try to run the Layer Normalization code on the GPU.

import os
import tensorflow as tf
import numpy as np

data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
print('data:', data)
layer = tf.keras.layers.LayerNormalization(axis=1)  # Layer normalization
output = layer(data)
print('output:', output)

Running the code fails with "InternalError: cuDNN launch failure". The log contains the message "Loaded runtime CuDNN library: 7.5.1 but source was compiled with: 7.6.4.", which tells us that this TensorFlow build needs cuDNN 7.6.4 or a later compatible version.

2021-11-12 04:04:46.223483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-12 04:04:46.224430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14764 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
data: tf.Tensor(
[[ 0. 10.]
 [20. 30.]
 [40. 50.]
 [60. 70.]
 [80. 90.]], shape=(5, 2), dtype=float32)
2021-11-12 04:04:46.591870: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-11-12 04:04:51.113380: E tensorflow/stream_executor/cuda/cuda_dnn.cc:318] Loaded runtime CuDNN library: 7.5.1 but source was compiled with: 7.6.4.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-11-12 04:04:51.114387: W ./tensorflow/stream_executor/stream.h:2049] attempting to perform DNN operation using StreamExecutor without DNN support
Traceback (most recent call last):
  File "LN_test.py", line 8, in <module>
    output = layer(data)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/keras/layers/normalization.py", line 1251, in call
    data_format=data_format)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 1647, in fused_batch_norm
    name=name)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 4268, in fused_batch_norm_v3
    _ops.raise_from_not_ok_status(e, name)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([1,5,2,1]) [Op:FusedBatchNormV3]
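
The compatibility rule spelled out in the error message (major versions must match, and from cuDNN 7.0 onward the loaded minor version may be equal or higher) can be written down as a small check. This is only a sketch of the rule as the log states it, not TensorFlow's own code. The function name `cudnn_runtime_is_compatible` is hypothetical.

```python
def cudnn_runtime_is_compatible(loaded, compiled):
    """Apply the rule from the log message to two (major, minor, patch) tuples.

    `loaded` is the cuDNN library found at runtime; `compiled` is the
    version the framework binary was built against. The patch level is
    ignored because the message only mentions major and minor versions.
    """
    loaded_major, loaded_minor, _ = loaded
    compiled_major, compiled_minor, _ = compiled
    if loaded_major != compiled_major:
        return False
    if loaded_minor == compiled_minor:
        return True
    # A higher minor version is accepted only for cuDNN 7.0 or later.
    return compiled_major >= 7 and loaded_minor > compiled_minor


# AMI default cuDNN 7.5.1 against a build compiled with 7.6.4.
print(cudnn_runtime_is_compatible((7, 5, 1), (7, 6, 4)))
# cuDNN 7.6.5 against the same build.
print(cudnn_runtime_is_compatible((7, 6, 5), (7, 6, 4)))
```

The first call corresponds to the failing setup in the log above. The second corresponds to a runtime whose minor version matches the one TensorFlow 2.3.0 was compiled with.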

Check the CUDA version in the Amazon EC2 Deep Learning AMI:

nvcc --version

The default version is NVIDIA CUDA 10.0.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
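
If the version check needs to be scripted, the release number can be pulled out of this text with a regular expression. The sketch below works on a string, which could be captured from `nvcc --version` with the standard `subprocess` module. The helper name `parse_nvcc_release` is our own, and the pattern assumes the "release X.Y" wording shown above.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the CUDA release as (major, minor), or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample_output = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130"""

print(parse_nvcc_release(sample_output))
```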

Check the cuDNN version in the Amazon EC2 Deep Learning AMI:

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

The default version is cuDNN 7.5.1.

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 1
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"
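
The same three `#define` lines can be read programmatically. Here is a minimal sketch that extracts the version from header text. In practice the text would come from reading `/usr/local/cuda/include/cudnn.h`, while the example uses an inline sample. The function name `parse_cudnn_version` is hypothetical.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn.h-style text, or None."""
    version = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match only lines that define the macro as a plain integer.
        pattern = r"^\s*#define\s+" + name + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        version.append(int(match.group(1)))
    return tuple(version)


sample_header = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 1
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""

print(parse_cudnn_version(sample_header))
```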

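We can look up which TensorFlow versions are compatible with which CUDA/cuDNN versions on the official website:

https://www.tensorflow.org/install/source#tested_build_configurations . The table on that page lists the compatible versions of CUDA, cuDNN, and TensorFlow.

According to the table, the recommended combination is CUDA 10.1, cuDNN 7.6, and TensorFlow 2.3.0. Next, check which CUDA versions are preinstalled on the Amazon EC2 instance.

cd /usr/local/ && ls

The Amazon EC2 Deep Learning AMI comes with CUDA 10.0, 10.1, 10.2, 11.0, and 11.1 preinstalled.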

bin   cuda-10.0  cuda-10.2  cuda-11.1  etc    include  lib  sbin   src
cuda  cuda-10.1  cuda-11.0  dcgm       games  init     man  share
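
To confirm from a script that the CUDA version we need is among the preinstalled ones, the directory names can be filtered for the `cuda-X.Y` pattern. This is a sketch under the assumption that each toolkit lives in a directory named this way, as in the listing above. On the instance the names could come from `os.listdir("/usr/local")`, while the example uses the listing as a literal. The function name is our own.

```python
import re


def installed_cuda_versions(entries):
    """Return sorted (major, minor) tuples for names of the form cuda-X.Y."""
    versions = []
    for entry in entries:
        match = re.fullmatch(r"cuda-(\d+)\.(\d+)", entry)
        if match:
            versions.append((int(match.group(1)), int(match.group(2))))
    return sorted(versions)


listing = (
    "bin cuda-10.0 cuda-10.2 cuda-11.1 etc include lib sbin src "
    "cuda cuda-10.1 cuda-11.0 dcgm games init man share"
).split()

versions = installed_cuda_versions(listing)
print(versions)
# Is CUDA 10.1, the version recommended for TensorFlow 2.3.0, available?
print((10, 1) in versions)
```

The plain `cuda` entry is deliberately skipped: it is the symbolic link that selects the active version, not a version of its own.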

Use the following commands to switch CUDA to version 10.1, which is compatible with TensorFlow 2.3.0.

sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda
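
The two commands simply repoint the `/usr/local/cuda` symbolic link at a different versioned directory. For readers who prefer to script the switch, here is a rough Python equivalent. It is a sketch only: the function name `switch_cuda_symlink` is hypothetical, and running it against `/usr/local` would require root privileges, so the example exercises it in a temporary directory instead.

```python
import os
import tempfile


def switch_cuda_symlink(root, version):
    """Point <root>/cuda at <root>/cuda-<version>; return the resolved path."""
    target = os.path.join(root, "cuda-" + version)
    link = os.path.join(root, "cuda")
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    if os.path.islink(link):
        os.remove(link)  # same effect as `rm /usr/local/cuda`
    elif os.path.exists(link):
        raise RuntimeError(link + " exists and is not a symlink")
    os.symlink(target, link)  # same effect as `ln -s`
    return os.path.realpath(link)


# Demonstrate in a throwaway directory rather than /usr/local.
with tempfile.TemporaryDirectory() as root:
    for name in ("cuda-10.0", "cuda-10.1"):
        os.mkdir(os.path.join(root, name))
    os.symlink(os.path.join(root, "cuda-10.0"), os.path.join(root, "cuda"))
    active = switch_cuda_symlink(root, "10.1")
    print(os.path.basename(active))
```

Unlike the shell commands, the sketch refuses to delete `cuda` if it turns out to be a real directory rather than a link, which is a small safety choice of ours.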

After switching, check the CUDA version:

nvcc -V

The CUDA version has been switched to 10.1.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

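Check the cuDNN version:

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

The cuDNN version is now 7.6.5. Because cudnn.h is read through the /usr/local/cuda link, repointing that link switches cuDNN along with CUDA. We now have the compatible combination recommended by TensorFlow.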

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

Run the Layer Normalization code again. This time it runs successfully on the GPU and produces the expected output.

data: tf.Tensor(
[[ 0. 10.]
 [20. 30.]
 [40. 50.]
 [60. 70.]
 [80. 90.]], shape=(5, 2), dtype=float32)
2021-11-11 16:40:18.997086: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
output: tf.Tensor(
[[-0.99998    0.99998  ]
 [-0.99998    0.99998  ]
 [-0.99998    0.99998  ]
 [-0.99998    0.99998  ]
 [-0.9999809  0.999979 ]], shape=(5, 2), dtype=float32)

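Summary

The Amazon Deep Learning AMI provides machine learning practitioners and researchers with the infrastructure and environment for deep learning. You can quickly launch an Amazon EC2 instance with the mainstream deep learning frameworks preinstalled, and quickly switch CUDA versions to match your framework, which makes building a deep learning environment easy. That leaves more time for trying new algorithms and learning new techniques.

About the author

陳恒智 (Chen Hengzhi)

Data Scientist, Professional Services team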