Numba examples for CUDA GPUs
Matrix multiplication
Here is a naive implementation of matrix multiplication using a CUDA kernel:
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
This implementation is straightforward and intuitive but performs poorly, because the same matrix elements will be loaded from device memory multiple times, which is slow (some devices may have transparent data caches, but they may not be large enough to hold the entire inputs at once).
It will be faster if a blocked algorithm is used to reduce accesses to device memory. CUDA provides fast shared memory for threads in a block to cooperatively compute on a task. The following implements a faster version of square matrix multiplication using shared memory:
from numba import cuda, float32

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Compute partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp
Because shared memory is a limited resource, the code preloads one small block at a time from the input arrays. It then calls syncthreads() to wait until all threads have finished preloading before performing the computation on the shared memory. After the computation it synchronizes again, to ensure all threads have finished with the data in shared memory before it is overwritten in the next loop iteration.