當(dāng)前位置：首頁(yè) > 人文社科 > 生活经验 >内容正文

生活经验

Cuda Stream流分析

發(fā)布時(shí)間：2023/11/28 生活经验 25 豆豆

生活随笔收集整理的這篇文章主要介紹了 Cuda Stream流分析小編覺得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

Cuda Stream流分析
Stream
一般來說，cuda c并行性表現(xiàn)在下面兩個(gè)層面上：
? Kernel level
? Grid level
Stream和event簡(jiǎn)介
Cuda stream是指一堆異步的cuda操作，他們按照host代碼調(diào)用的順序執(zhí)行在device上。
典型的cuda編程模式我們已經(jīng)熟知了：
? 將輸入數(shù)據(jù)從host轉(zhuǎn)移到device
? 在device上執(zhí)行kernel
? 將結(jié)果從device上轉(zhuǎn)移回host
Cuda Streams
所有的cuda操作（包括kernel執(zhí)行和數(shù)據(jù)傳輸）都顯式或隱式的運(yùn)行在stream中，stream也就兩種類型，分別是：
? 隱式聲明stream（NULL stream）
? 顯示聲明stream（non-NULL stream）
異步且基于stream的kernel執(zhí)行和數(shù)據(jù)傳輸能夠?qū)崿F(xiàn)以下幾種類型的并行：
? Host運(yùn)算操作和device運(yùn)算操作并行
? Host運(yùn)算操作和host到device的數(shù)據(jù)傳輸并行
? Host到device的數(shù)據(jù)傳輸和device運(yùn)算操作并行
? Device內(nèi)的運(yùn)算并行
下面代碼是常見的使用形式，默認(rèn)使用NULL stream:
cudaMemcpy(…, cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(…);
cudaMemcpy(…, cudaMemcpyDeviceToHost);
下面版本是異步版本的cudaMemcpy：
cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count,cudaMemcpyKind kind, cudaStream_t stream = 0);
上面代碼使用了默認(rèn)stream，如果要聲明一個(gè)新的stream則使用下面的API定義一個(gè)：
cudaError_t cudaStreamCreate(cudaStream_t* pStream);
Pinned memory的分配如下：
cudaError_t cudaMallocHost(void **ptr, size_t size);
cudaError_t cudaHostAlloc(void **pHost, size_t size, unsigned int flags);
在執(zhí)行kernel時(shí)要想設(shè)置stream的話，只要加一個(gè)stream參數(shù)就好：
kernel_name<<<grid, block, sharedMemSize, stream>>>(argument list);
// 非默認(rèn)的stream聲明
cudaStream_t stream;
// 初始化
cudaStreamCreate(&stream);
// 資源釋放
cudaError_t cudaStreamDestroy(cudaStream_t stream);
所有stram的執(zhí)行都是異步的，需要一些API在必要的時(shí)候做同步操作：
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
cudaError_t cudaStreamQuery(cudaStream_t stream);
看一下代碼片段：

for (int i = 0; i < nStreams; i++) {
int offset = i * bytesPerStream;
cudaMemcpyAsync(&d_a[offset], &a[offset], bytePerStream, streams[i]);
kernel<<grid, block, 0, streams[i]>>(&d_a[offset]);
cudaMemcpyAsync(&a[offset], &d_a[offset], bytesPerStream, streams[i]);
}

for (int i = 0; i < nStreams; i++) {
cudaStreamSynchronize(streams[i]);
}

使用了三個(gè)stream，數(shù)據(jù)傳輸和kernel運(yùn)算都被分配在了這幾個(gè)并發(fā)的stream中。

kernel數(shù)目是依賴于device本身的，Fermi支持16路并行，Kepler是32。并行數(shù)是受限于shared memory，寄存器等device資源。
Stream Scheduling

C和P以及R和X是可以并行的，因?yàn)樗麄冊(cè)诓煌膕tream中，但是ABC，PQR以及XYZ卻不行，比如，在B沒完成之前，C和P都在等待。
Hyper-Q
Hyper-Q的技術(shù)， Kepler上出現(xiàn)了32個(gè)工作隊(duì)列。實(shí)現(xiàn)了TPC上可以同時(shí)運(yùn)行compute和graphic的應(yīng)用。當(dāng)然，如果超過32個(gè)stream被創(chuàng)建了，依然會(huì)出現(xiàn)偽依賴的情況。

Stream Priorities
對(duì)于CC3.5及以上版本，stream可以有優(yōu)先級(jí)的屬性：
cudaError_t cudaStreamCreateWithPriority(cudaStream_t* pStream, unsigned int flags, int priority);
該函數(shù)創(chuàng)建一個(gè)stream，賦予priority的優(yōu)先級(jí)，高優(yōu)先級(jí)的grid可以搶占低優(yōu)先級(jí)執(zhí)行。
cudaError_t cudaDeviceGetStreamPriorityRange(int leastPriority, int greatestPriority);
leastPriority是下限，gretestPriority是上限。數(shù)值較小則擁有較高優(yōu)先級(jí)。如
Cuda Events
Event是stream用來標(biāo)記strean執(zhí)行過程的某個(gè)特定的點(diǎn)。其主要用途是：
? 同步stream執(zhí)行
? 操控device運(yùn)行步調(diào)
Creation and Destruction
// 聲明
cudaEvent_t event;
// 創(chuàng)建
cudaError_t cudaEventCreate(cudaEvent_t event);
// 銷毀
cudaError_t cudaEventDestroy(cudaEvent_t event);
streeam的釋放，在操作完成后自動(dòng)釋放資源。
Recording Events and Mesuring Elapsed Time
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0);
等待event會(huì)阻塞調(diào)用host線程，同步操作調(diào)用下面的函數(shù)：
cudaError_t cudaEventSynchronize(cudaEvent_t event);
類似于cudaStreamSynchronize，等待event而不是整個(gè)stream執(zhí)行完畢。使用API來測(cè)試event是否完成，該函數(shù)不會(huì)阻塞host：
cudaError_t cudaEventQuery(cudaEvent_t event);
該函數(shù)類似cudaStreamQuery。此外，還有專門的API可以度量?jī)蓚€(gè)event之間的時(shí)間間隔：
cudaError_t cudaEventElapsedTime(float ms, cudaEvent_t start, cudaEvent_t stop);
返回start和stop之間的時(shí)間間隔，單位是毫秒。Start和stop不必關(guān)聯(lián)到同一個(gè)stream上。
下面代碼簡(jiǎn)單展示了如何使用event來度量時(shí)間：

// create two events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// record start event on the default stream
cudaEventRecord(start);
// execute kernel
kernel<<<grid, block>>>(arguments);
// record stop event on the default stream
cudaEventRecord(stop);
// wait until the stop event completes
cudaEventSynchronize(stop);
// calculate the elapsed time between two events
float time;
cudaEventElapsedTime(&time, start, stop);
// clean up the two events
cudaEventDestroy(start);
cudaEventDestroy(stop);

Stream Synchronization
由于所有non-default stream的操作對(duì)于host來說都是非阻塞的，就需要相應(yīng)的同步操作。
從host的角度來看，cuda操作可以被分為兩類：
? Memory相關(guān)的操作
? Kernel launch
Kernel launch對(duì)于host來說都是異步的，許多memory操作則是同步的，比如cudaMemcpy，cuda runtime也會(huì)提供異步函數(shù)來執(zhí)行memory操作。
阻塞和非阻塞stream
使用cudaStreamCreate創(chuàng)建的是阻塞stream，也就是說，該stream中執(zhí)行的操作會(huì)被早先執(zhí)行的同步stream阻塞。
例如：
kernel_1<<<1, 1, 0, stream_1>>>();
kernel_2<<<1, 1>>>();
kernel_3<<<1, 1, 0, stream_2>>>();
可以通過下面的API配置生成非阻塞stream：
cudaError_t cudaStreamCreateWithFlags(cudaStream_t* pStream, unsigned int flags);
// flag為以下兩種，默認(rèn)為第一種，非阻塞便是第二種。
cudaStreamDefault: default stream creation flag (blocking)
cudaStreamNonBlocking: asynchronous stream creation flag (non-blocking)
Implicit Synchronization
Cuda有兩種類型的host和device之間同步：顯式和隱式。已經(jīng)了解到顯式同步API有：
? cudaDeviceSynchronize
? cudaStreamSynchronize
? cudaEventSynchronize
這三個(gè)函數(shù)由host顯式的調(diào)用，在device上執(zhí)行。
許多memory相關(guān)的操作都會(huì)影響當(dāng)前device的操作，比如：
? A page-locked host memory allocation
? A device memory allocation
? A device memset
? A memory copy between two addresses on the same device
? A modification to the L1/shared memory confi guration
Explicit Synchronization
從grid level來看顯式同步方式，有如下幾種：
? Synchronizing the device
? Synchronizing a stream
? Synchronizing an event in a stream
? Synchronizing across streams using an event
可以使用cudaDeviceSynchronize來同步該device上的所有操作。通過使用cudaStreamSynchronize可以使host等待特定stream中的操作全部完成或者使用非阻塞版本的cudaStreamQuery來測(cè)試是否完成。
Cuda event可以用來實(shí)現(xiàn)更細(xì)粒度的阻塞和同步，相關(guān)函數(shù)為cudaEventSynchronize和cudaEventSynchronize，用法類似stream相關(guān)的函數(shù)。此外，cudaStreamWaitEvent提供了一種靈活的方式來引入stream之間的依賴關(guān)系：
cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event);
該函數(shù)會(huì)指定該stream等待特定的event，該event可以關(guān)聯(lián)到相同或者不同的stream，對(duì)于不同stream的情況，如下圖所示：

Stream2會(huì)等待stream1中的event完成后繼續(xù)執(zhí)行。
Configurable Events
Event的配置可用下面函數(shù)：
cudaError_t cudaEventCreateWithFlags(cudaEvent_t* event, unsigned int flags);
cudaEventDefault
cudaEventBlockingSync
cudaEventDisableTiming
cudaEventInterprocess

總結(jié)

以上是生活随笔為你收集整理的Cuda Stream流分析的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

Stream
CUDA

生活经验

Cuda Stream流 分析

總結(jié)

Cuda Stream流分析