當(dāng)前位置：首頁(yè) > 编程资源 > 编程问答 >内容正文

编程问答

NVIDIA-CUDA编程初探

發(fā)布時(shí)間：2023/12/20 编程问答 32 豆豆

生活随笔收集整理的這篇文章主要介紹了 NVIDIA-CUDA编程初探小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

CUDA的全稱是Compute Unified Device Architecture,是顯卡廠商N(yùn)VIDIA推出的運(yùn)算平臺(tái)，開(kāi)發(fā)者可以使用C語(yǔ)言來(lái)編寫(xiě)CUDA代碼，使用NVCC編譯器可以在支持CUDA的GPU處理器上以高速運(yùn)行。雖然AMD也做顯卡，但是CUDA是老黃自家提出的標(biāo)準(zhǔn)，沒(méi)帶AMD一起玩兒，所以，提到基于CUDA的高性能計(jì)算，使用的都是Nvidia的顯卡。

首先安裝CUDA環(huán)境，具體方式參考博客：

FairMOT Cuda環(huán)境搭建并進(jìn)行推理_tugouxp的專欄-CSDN博客環(huán)境準(zhǔn)備1.PC Host Ubuntu 18.04.6,Linux Kernel 5.4，內(nèi)核版本關(guān)系不大，記錄下來(lái)備查。2.安裝基礎(chǔ)工具,比如GCC，CMAKE，VIM，GIT等等，工具盡量完備, 如果做不到，遇到問(wèn)題臨時(shí)下載也可。3.安裝python3發(fā)行版，我用的是anaconda發(fā)行版，具體版本是 Anaconda3-2020.11-Linux-x86_64.sh下載地址在如下鏈接，選擇對(duì)應(yīng)的版本即可。https://repo.anaco...https://blog.csdn.net/tugouxp/article/details/121248457環(huán)境安裝后，可以用如下方法驗(yàn)證環(huán)境：

nvidia-smi命令枚舉了系統(tǒng)中的所有顯卡支持信息

nvcc工具是CUDA編譯器，用nvcc -V 驗(yàn)證編譯器是否可以工作:

cuda編程

編輯helloworld.cu文件，編碼內(nèi)容:

#include <cuda_runtime.h> #include <stdio.h>int main(void) {printf("hellow world!\n");return 0; }

之后執(zhí)行 nvcc helloworld.cu -o helloworld,并運(yùn)行

可以看到，運(yùn)行程序后打印除了helloworld.

但是，這個(gè)程序用到顯卡了嗎？很遺憾，沒(méi)有。如果非要用顯卡做點(diǎn)什么的化，可以改成這個(gè)樣子：

#include <cuda_runtime.h> #include <stdio.h>__global__ void kernel(void) {}int main(void) {kernel<<<1,1>>>();printf("hellow world!\n");return 0; }

我們定義了一個(gè)空函數(shù)送給GPU跑，函數(shù)是空函數(shù)，什么也沒(méi)做，白嫖一下GPU就退出，編譯并運(yùn)行：

生成的helloworld文件是ELF格式的目標(biāo)文件，與GCC產(chǎn)生的無(wú)異，可以通過(guò)objdump反編譯一把：

來(lái)看一下main函數(shù)的片段：

粗略一看，首先給人的印象是NVCC不是一個(gè)人在戰(zhàn)斗，畢竟我們的代碼才短短幾行，反編譯后卻有這么多條指令，而且貌似有些指令是沒(méi)有出現(xiàn)在源碼層面調(diào)用的。還能看出一點(diǎn)的就是源碼是按照C++編譯的，因?yàn)榭吹搅嗣黠@的名字改編。

那就是編譯器做的手腳咯,幸好我們有辦法確認(rèn)這一點(diǎn)，方式就是在nvcc編譯的時(shí)候加上--verbose選項(xiàng):

#$ _NVVM_BRANCH_=nvvm #$ _SPACE_= #$ _CUDART_=cudart #$ _HERE_=/usr/local/cuda-11.5/bin #$ _THERE_=/usr/local/cuda-11.5/bin #$ _TARGET_SIZE_= #$ _TARGET_DIR_= #$ _TARGET_DIR_=targets/x86_64-linux #$ TOP=/usr/local/cuda-11.5/bin/.. #$ NVVMIR_LIBRARY_DIR=/usr/local/cuda-11.5/bin/../nvvm/libdevice #$ LD_LIBRARY_PATH=/usr/local/cuda-11.5/bin/../lib::/usr/local/cuda-11.5/lib64 #$ PATH=/usr/local/cuda-11.5/bin/../nvvm/bin:/usr/local/cuda-11.5/bin:/home/caozilong/anaconda3/bin:/home/caozilong/anaconda3/condabin:/home/caozilong/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-11.5/bin #$ INCLUDES="-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" #$ LIBRARIES= "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib" #$ CUDAFE_FLAGS= #$ PTXAS_FLAGS= #$ gcc -D__CUDA_ARCH__=520 -D__CUDA_ARCH_LIST__=520 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=50 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "helloworld.cu" -o "/tmp/tmpxft_0000596f_00000000-9_helloworld.cpp1.ii" #$ cicc --c++14 --gnu_version=70500 --display_error_number --orig_src_file_name "helloworld.cu" --orig_src_path_name "/home/caozilong/cuda/helloworld.cu" --allow_managed -arch compute_52 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_0000596f_00000000-3_helloworld.fatbin.c" -tused --gen_module_id_file --module_id_file_name "/tmp/tmpxft_0000596f_00000000-4_helloworld.module_id" --gen_c_file_name "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.c" --stub_file_name "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.gpu" "/tmp/tmpxft_0000596f_00000000-9_helloworld.cpp1.ii" -o "/tmp/tmpxft_0000596f_00000000-6_helloworld.ptx" #$ ptxas -arch=sm_52 -m64 "/tmp/tmpxft_0000596f_00000000-6_helloworld.ptx" -o "/tmp/tmpxft_0000596f_00000000-10_helloworld.sm_52.cubin" #$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=52,file=/tmp/tmpxft_0000596f_00000000-10_helloworld.sm_52.cubin" "--image3=kind=ptx,sm=52,file=/tmp/tmpxft_0000596f_00000000-6_helloworld.ptx" --embedded-fatbin="/tmp/tmpxft_0000596f_00000000-3_helloworld.fatbin.c" #$ rm /tmp/tmpxft_0000596f_00000000-3_helloworld.fatbin #$ gcc -D__CUDA_ARCH_LIST__=520 -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=50 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "helloworld.cu" -o "/tmp/tmpxft_0000596f_00000000-5_helloworld.cpp4.ii" #$ cudafe++ --c++14 --gnu_version=70500 --display_error_number --orig_src_file_name "helloworld.cu" --orig_src_path_name "/home/caozilong/cuda/helloworld.cu" --allow_managed --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.cpp" --stub_file_name "tmpxft_0000596f_00000000-6_helloworld.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_0000596f_00000000-4_helloworld.module_id" "/tmp/tmpxft_0000596f_00000000-5_helloworld.cpp4.ii" #$ gcc -D__CUDA_ARCH__=520 -D__CUDA_ARCH_LIST__=520 -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS "-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" -m64 "/tmp/tmpxft_0000596f_00000000-6_helloworld.cudafe1.cpp" -o "/tmp/tmpxft_0000596f_00000000-11_helloworld.o" #$ nvlink -m64 --arch=sm_52 --register-link-binaries="/tmp/tmpxft_0000596f_00000000-7_helloworld_dlink.reg.c" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib" -cpu-arch=X86_64 "/tmp/tmpxft_0000596f_00000000-11_helloworld.o" -lcudadevrt -o "/tmp/tmpxft_0000596f_00000000-12_helloworld_dlink.sm_52.cubin" #$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -link "--image3=kind=elf,sm=52,file=/tmp/tmpxft_0000596f_00000000-12_helloworld_dlink.sm_52.cubin" --embedded-fatbin="/tmp/tmpxft_0000596f_00000000-8_helloworld_dlink.fatbin.c" #$ rm /tmp/tmpxft_0000596f_00000000-8_helloworld_dlink.fatbin #$ gcc -D__CUDA_ARCH_LIST__=520 -c -x c++ -DFATBINFILE="\"/tmp/tmpxft_0000596f_00000000-8_helloworld_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"/tmp/tmpxft_0000596f_00000000-7_helloworld_dlink.reg.c\"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__ "-I/usr/local/cuda-11.5/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=50 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -m64 "/usr/local/cuda-11.5/bin/crt/link.stub" -o "/tmp/tmpxft_0000596f_00000000-13_helloworld_dlink.o" #$ g++ -D__CUDA_ARCH_LIST__=520 -m64 -Wl,--start-group "/tmp/tmpxft_0000596f_00000000-13_helloworld_dlink.o" "/tmp/tmpxft_0000596f_00000000-11_helloworld.o" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda-11.5/bin/../targets/x86_64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -Wl,--end-group -o "helloworld"

現(xiàn)在總結(jié)一下CUDA編程的規(guī)則：

核函數(shù)，在GPU上執(zhí)行的函數(shù)通常成為核函數(shù)，如上面程序中的kernel函數(shù)。
核函數(shù)一般通過(guò)標(biāo)識(shí)符__global__修飾，通過(guò)<<<參數(shù)1，參數(shù)2>>>調(diào)用，用于說(shuō)明內(nèi)核函數(shù)中的線程數(shù)量，以及線程是如何組織的。
以線程格(Grid)的形式組織，每個(gè)線程格有若干個(gè)線程塊(block)組成，而每個(gè)線程塊又由若干個(gè)線程(thread)組成。
以Block為單位執(zhí)行
能在主機(jī)端代碼中調(diào)用
調(diào)用時(shí)必須聲明內(nèi)核函數(shù)的執(zhí)行參數(shù)
在編程時(shí)，必須先為kernel函數(shù)中用到的數(shù)組或者變量分配好足夠的空間，再調(diào)用kernel函數(shù)，否則在GPU計(jì)算時(shí)會(huì)發(fā)生錯(cuò)誤，例如越界或者報(bào)錯(cuò)，甚至導(dǎo)致藍(lán)屏和死機(jī)。

CUDA的變成模型如下圖所示：

上面例子中,kernel函數(shù)恰好叫kernel是一種巧合，實(shí)際上你可以改成任何有意義的名字，只要按照CUDA要求的方式調(diào)用即可

#include <cuda_runtime.h> #include <stdio.h>__global__ void dummy(void) {}int main(void) {dummy<<<1,1>>>();printf("hellow world!\n");return 0; }

對(duì)于上如上的例子，我們探究一下它的控制流是如何進(jìn)行的,首先我們看到反編譯文件中，首先main函數(shù)調(diào)用了_Z5dummyv

?不難看出這個(gè)函數(shù)名是經(jīng)過(guò)C++名字改編的，我們用c++filt工具將其還原：

可以看到它就是dummy,我們繼續(xù)追蹤

可以看到dummy調(diào)用了_Z23__device_stub__Z5dummyvv函數(shù),繼續(xù)追蹤，發(fā)現(xiàn)執(zhí)行流最終調(diào)用了_Z16cudaLaunchKernelIcE9cudaErrorPKT_4dim3S4_PPvmP11CUstream_st，而經(jīng)過(guò)反命名后，發(fā)現(xiàn)它是名為cudaLaunchKernel的cuda函數(shù)，這個(gè)函數(shù)并非代碼中顯示調(diào)用的，而是NVCC工具鏈生成的，所以，顧名思義，很可能就是這句調(diào)用發(fā)起了對(duì)GPU控制流的交接。

?global和device

__global__和__device__是函數(shù)修飾符，__global__表明被修飾的函數(shù)在設(shè)備上執(zhí)行，但是在主機(jī)上調(diào)用，__device__，表示被修飾的函數(shù)在設(shè)備上執(zhí)行，但是只能在其他__device__或者_(dá)_global__函數(shù)中調(diào)用，說(shuō)白了它只能在GPU中執(zhí)行，并且被GPU中執(zhí)行的函數(shù)調(diào)用。

新的例子

計(jì)算3+6等于幾的例子

#include <cuda_runtime.h> #include <stdio.h>__global__ void add(int a, int b, int *c) {*c = a + b; }int main(void) {int c;int *gpu_c;cudaMalloc((void **)&gpu_c, sizeof(int));add<<<1,1>>>(3,6,gpu_c);cudaMemcpy(&c, gpu_c, sizeof(int), cudaMemcpyDeviceToHost);cudaFree(gpu_c);printf("3 + 9 eques %d.\n", c);return 0; }

我們得到了正確的計(jì)算結(jié)果。

我們稍微改一下程序，將計(jì)算過(guò)程改為循環(huán)計(jì)算，然后用nvidia-smi工具監(jiān)視一下GPU的資源使用情況:

#include <cuda_runtime.h> #include <stdio.h>__global__ void add(int a, int b, int *c) {*c = a + b; }int main(void) {int c;int *gpu_c;while(1){cudaMalloc((void **)&gpu_c, sizeof(int));add<<<1,1>>>(3,6,gpu_c);cudaMemcpy(&c, gpu_c, sizeof(int), cudaMemcpyDeviceToHost);cudaFree(gpu_c);printf("3 + 9 eques %d.\n", c);}return 0; }

編譯并運(yùn)行：

運(yùn)行過(guò)程中，使用watch -n 1 nvidia-smi命令監(jiān)控GPU的資源變化情況,可以看到內(nèi)存占用和GPU負(fù)載在不斷的發(fā)生變化：

復(fù)雜一些的例子：

下面的例子對(duì)兩個(gè)列向量求算術(shù)平方和，循環(huán)進(jìn)行M次，分別用CPU和GPU計(jì)算，最后對(duì)統(tǒng)計(jì)到的計(jì)算速度進(jìn)行對(duì)比：

#include <cuda_runtime.h> #include <device_launch_parameters.h> #include <stdio.h> #include <time.h>#define N (1024 * 1024) #define M (10000) #define THREADS_PER_BLOCK (1024)void cpu_vector_add(double *a, double *b, double *c, int n , int m) {int index, j;for(index = 0; index < n; index ++){for(j = 0; j < m; j ++){c[index] = a[index] * a[index] + b[index] * b[index];}}return; }__global__ void gpu_vector_add(double *a, double *b, double *c) {int j;int index = blockIdx.x * blockDim.x + threadIdx.x;for(j = 0; j < M; j ++){c[index] = a[index] * a[index] + b[index] * b[index];} }int main(void) {clock_t start, end;double *a, *b, *c;int size = N * sizeof(double);a = (double *)malloc(size);b = (double *)malloc(size);c = (double *)malloc(size);if(!a || !b || !c){printf("%s line %d, fatal error,malloc buffer failure.\n", __func__, __LINE__);return -1;}int j;for(j = 0; j < N; j ++){a[j] = b[j] = j;c[j] = 0;}start = clock();cpu_vector_add(a, b, c, N, M);printf("[%d]=%f\n", 0, c[0]);printf("[%d]=%f\n", N-1, c[N-1]);end = clock();float time_cpu_cost = ((float)(end-start))/CLOCKS_PER_SEC;printf("CPU cost %f sectonds.\n", time_cpu_cost);start = clock();double *gpu_a, *gpu_b, *gpu_c;cudaMalloc((void**)&gpu_a, size);cudaMalloc((void**)&gpu_b, size);cudaMalloc((void**)&gpu_c, size);cudaMemcpy(gpu_a, a, size, cudaMemcpyHostToDevice);cudaMemcpy(gpu_b, b, size, cudaMemcpyHostToDevice);gpu_vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(gpu_a, gpu_b, gpu_c);cudaMemcpy(c, gpu_c, size, cudaMemcpyDeviceToHost);printf("[%d]=%f\n", 0, c[0]);printf("[%d]=%f\n", N-1, c[N-1]);end = clock();float time_gpu_cost = ((float)(end-start))/CLOCKS_PER_SEC;printf("GPU cost %f sectonds.\n", time_gpu_cost);float faster_than = time_cpu_cost/time_gpu_cost;printf("GPU cost faster than CPU %f times.\n", faster_than);//printf("a = %p, b = %p, c = %p, gpu_a = %p, gpu_b = %p, gpu_c = %p.\n", a, b, c, gpu_a, gpu_b, gpu_c);free(a);free(b);free(c);cudaFree(gpu_a);cudaFree(gpu_b);cudaFree(gpu_c);return 0; }

編譯運(yùn)行，查看計(jì)算結(jié)果：

nvcc helloworld.cu

經(jīng)過(guò)實(shí)際測(cè)試,同樣的計(jì)算量，我的紅米本使用的MX250入門(mén)級(jí)顯卡要比Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz處理器快大約30倍，更別提那些用于數(shù)學(xué)計(jì)算的專業(yè)N卡了，這個(gè)數(shù)據(jù)讓我對(duì)顯卡的恐怖計(jì)算能力有了進(jìn)一步的認(rèn)識(shí)。

問(wèn)題：

程序中存在兩類指針，malloc分配的主存指針和cudaMalloc分配的顯存指針，添加打印，打印出a,b,c,gpu_a,gpu_b,gpu_c的指針數(shù)值，根據(jù)打印來(lái)看，這些指針沒(méi)有明顯差別，難道cudaMalloc分配的并非是顯存上的地址？或者顯存和主存之間存在某種映射?不過(guò)可以確定的是，雖然指針的地址范圍相似，但是不可以在主機(jī)代碼中使用cudaMalloc()分配的指針進(jìn)行主機(jī)內(nèi)存讀寫(xiě)操作（即不能進(jìn)行解引用）。

#include <cuda_runtime.h> #include <device_launch_parameters.h> #include <stdio.h> #include <stdlib.h> #include <time.h>#define N (1024 * 1024) #define M (10000) #define THREADS_PER_BLOCK (1024)void cpu_vector_add(double *a, double *b, double *c, int n , int m) {int index, j;for(index = 0; index < n; index ++){for(j = 0; j < m; j ++){c[index] = a[index] * a[index] + b[index] * b[index];}}return; }__global__ void gpu_vector_add(double *a, double *b, double *c) {int j;int index = blockIdx.x * blockDim.x + threadIdx.x;for(j = 0; j < M; j ++){c[index] = a[index] * a[index] + b[index] * b[index];} }int main(void) {clock_t start, end;double *a, *b, *c;int size = N * sizeof(double);a = (double *)malloc(size);b = (double *)malloc(size);c = (double *)malloc(size);if(!a || !b || !c){printf("%s line %d, fatal error,malloc buffer failure.\n", __func__, __LINE__);return -1;}int j;for(j = 0; j < N; j ++){a[j] = b[j] = j;c[j] = 0;}start = clock();cpu_vector_add(a, b, c, N, M);printf("[%d]=%f\n", 0, c[0]);printf("[%d]=%f\n", N-1, c[N-1]);end = clock();float time_cpu_cost = ((float)(end-start))/CLOCKS_PER_SEC;printf("CPU cost %f sectonds.\n", time_cpu_cost);start = clock();double *gpu_a, *gpu_b, *gpu_c;cudaMalloc((void**)&gpu_a, size);cudaMalloc((void**)&gpu_b, size);cudaMalloc((void**)&gpu_c, size);cudaMemcpy(gpu_a, a, size, cudaMemcpyHostToDevice);cudaMemcpy(gpu_b, b, size, cudaMemcpyHostToDevice);gpu_vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>(gpu_a, gpu_b, gpu_c);cudaMemcpy(c, gpu_c, size, cudaMemcpyDeviceToHost);printf("[%d]=%f\n", 0, c[0]);printf("[%d]=%f\n", N-1, c[N-1]);end = clock();float time_gpu_cost = ((float)(end-start))/CLOCKS_PER_SEC;printf("GPU cost %f sectonds.\n", time_gpu_cost);float faster_than = time_cpu_cost/time_gpu_cost;printf("GPU cost faster than CPU %f times.\n", faster_than);printf("a = %p, b = %p, c = %p, gpu_a = %p, gpu_b = %p, gpu_c = %p.\n", a, b, c, gpu_a, gpu_b, gpu_c);while(1){printf("%s line %d, *a = %f.\n", __func__, __LINE__, *a);printf("%s line %d, *gpu_c = %f.\n", __func__, __LINE__, *gpu_c);}free(a);free(b);free(c);cudaFree(gpu_a);cudaFree(gpu_b);cudaFree(gpu_c);return 0; }

和OPENCL的關(guān)系：

架構(gòu)上，它們?cè)谕粚用?#xff0c;cuda和OpenCL都屬于一種并行計(jì)算開(kāi)發(fā)語(yǔ)言，這從CUDA使用的編譯器和OpenCL雖然使用GCC編譯HOST側(cè)代碼，但是端冊(cè)代碼卻需編譯一個(gè)文本CL程序文件，交給OpenCL API執(zhí)行在線編譯看出來(lái)，他們雖然都有吸取C的語(yǔ)法特點(diǎn)，但是異構(gòu)加速核心這一塊和CPU端的編譯器是不共用的，關(guān)于OPENCL開(kāi)發(fā)的例子，可以參考如下博客。

OpenCL編程之二_tugouxp的專欄-CSDN博客白嫖來(lái)的C端代碼:matrix.c：#include <stdio.h>#include <stdlib.h>#include <alloca.h>#include <CL/cl.h>#pragma warning( disable : 4996 )int main() {cl_int error;cl_platform_id platforms;cl_device_id devices;cl_context contehttps://blog.csdn.net/tugouxp/article/details/121844159

后來(lái)的測(cè)試：

半年后，重新安裝系統(tǒng)，使用DARKNET環(huán)境安裝CUDA和CUDANN之后:

?./setup.sh -InstallCUDA

再用老辦法安裝NVIDIA 驅(qū)動(dòng)?

此時(shí)CUDA運(yùn)行速率明顯加快，原因未知。

總結(jié)

NVCC是NVIDIA提供的用于編譯CUDA C程序的編譯器，它會(huì)自動(dòng)將.cu文件分為帶有CUDA C語(yǔ)句的部分和不帶CUDA C語(yǔ)句的部分，并將后者交給本地的C/C++編譯器，當(dāng)兩部分文件都編譯完畢，NVCC再將他們連接成可執(zhí)行文件，默認(rèn)的編譯命令不帶有任何參數(shù)，可以直接生成可執(zhí)行文件。

KERNEL有三種編譯方式，分別為：

1.靜態(tài)編譯，也叫離線編譯，在執(zhí)行前由C++、GCC或者VENDOR SPECIFIC的工具鏈進(jìn)行編譯。

2.動(dòng)態(tài)編譯，也叫運(yùn)行時(shí)編譯，在CPU運(yùn)行過(guò)程中編譯GPU KENREL代碼。

3.運(yùn)行時(shí)IR編譯，執(zhí)行前先i將KERNEL代碼編譯為IR，用戶驅(qū)動(dòng)集成COMPILER的BACKEND代碼，在運(yùn)行時(shí)再將IR編譯為GPU上運(yùn)行的指令，是1，2，兩種情況的融合。

OPENCL可以看作是CUDA的一個(gè)保守發(fā)行版，對(duì)用戶來(lái)說(shuō)，如果要使用NVIDIA產(chǎn)品的最新特性，則需要使用CUDA而不是OPENCL。

結(jié)束！?

總結(jié)

以上是生活随笔為你收集整理的NVIDIA-CUDA编程初探的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

nvidia
CUDA

上一篇：简单几何(极角排序) POJ 2007
下一篇：数据结构学习感悟

编程问答

NVIDIA-CUDA编程初探

cuda編程

?__global__和__device__

新的例子

復(fù)雜一些的例子：

問(wèn)題：

和OPENCL的關(guān)系：

后來(lái)的測(cè)試：

總結(jié)

結(jié)束！?

總結(jié)

?global和device