當(dāng)前位置：首頁 > 人文社科 > 生活经验 >内容正文

生活经验

GPU上如何优化卷积

發(fā)布時(shí)間：2023/11/28 生活经验 32 豆豆

生活随笔收集整理的這篇文章主要介紹了 GPU上如何优化卷积小編覺得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

GPU上如何優(yōu)化卷積

本文將演示如何在TVM中編寫高性能卷積實(shí)現(xiàn)。我們以平方大小的輸入張量和濾波器為例，假設(shè)卷積的輸入是大批量的。在本例中，使用不同的布局來存儲(chǔ)數(shù)據(jù)，以實(shí)現(xiàn)更好的數(shù)據(jù)局部性。緩沖區(qū)布局為HWCN，代表高度、寬度、通道、批次。

Preparation and Algorithm

對(duì)于256個(gè)通道和14 x 14維的輸入張量，使用固定大小。批量大小是256。卷積濾波器包含512個(gè)尺寸為3 x 3的濾波器。使用步幅大小1和填充大小1進(jìn)行卷積。下面的代碼定義了TVM中的卷積算法。

import numpy as np

import tvm

from tvm import te

The sizes of inputs and filters

batch = 256

in_channel = 256

out_channel = 512

in_size = 14

kernel = 3

pad = 1

stride = 1

Algorithm

A = te.placeholder((in_size,
in_size, in_channel, batch), name=“A”)

W = te.placeholder((kernel,
kernel, in_channel, out_channel), name=“W”)

out_size = (in_size - kernel + 2 * pad) // stride + 1

Pad input

Apad = te.compute(

(in_size + 2 * pad, in_size + 2 * pad, in_channel, batch),

lambda yy, xx, cc, nn: tvm.tir.if_then_else(

    tvm.tir.all(yy >= pad, yy - pad < in_size, xx >= pad, xx - pad < in_size),A[yy - pad, xx - pad, cc, nn],tvm.tir.const(0.0, "float32"),

name=“Apad”,

)

Create reduction variables

rc = te.reduce_axis((0, in_channel), name=“rc”)

ry = te.reduce_axis((0, kernel), name=“ry”)

rx = te.reduce_axis((0, kernel), name=“rx”)

Compute the convolution

B = te.compute(

(out_size, out_size, out_channel, batch),

lambda yy, xx, ff, nn: te.sum(

    Apad[yy * stride + ry, xx * stride + rx, rc, nn] * W[ry, rx, rc, ff],

axis=[ry, rx, rc]

name=“B”,

)

Memory Hierarchy

首先指定緩沖區(qū)的內(nèi)存層次結(jié)構(gòu)。下圖顯示了GPU內(nèi)存層次結(jié)構(gòu)。與CPU內(nèi)存層次結(jié)構(gòu)的一個(gè)重要區(qū)別是GPU提供了一個(gè)稱為共享內(nèi)存的緩存緩沖區(qū)，由程序員管理。因此，如何最大限度地利用共享內(nèi)存中的數(shù)據(jù)是實(shí)現(xiàn)GPU內(nèi)核高性能的關(guān)鍵。

在本例中，將Apad和W加載到緩沖區(qū)AA和WW中，存儲(chǔ)在共享內(nèi)存中。這些緩沖區(qū)將由同一線程塊內(nèi)的所有線程共享，以計(jì)算卷積。然后每個(gè)線程將自己的部分從共享緩沖區(qū)加載到本地寄存器AL和WL中。BL是輸出B的本地緩存，它也存儲(chǔ)在線程本地寄存器中。

Designate the memory hierarchy

s = te.create_schedule(B.op)

s[Apad].compute_inline() # compute Apad
inline

AA = s.cache_read(Apad, “shared”, [B])

WW = s.cache_read(W, “shared”, [B])

AL = s.cache_read(AA, “l(fā)ocal”, [B])

WL = s.cache_read(WW, “l(fā)ocal”, [B])

BL = s.cache_write(B, “l(fā)ocal”)

Blocking

下面的代碼將工作負(fù)載分成線程塊和單個(gè)線程。我們遵循矩陣乘法中的分塊方案。如下圖所示，給定一個(gè)像素坐標(biāo)（y，x），線程塊負(fù)責(zé)計(jì)算輸出通道和批處理的塊系數(shù)x塊系數(shù)（64x64）的區(qū)域。由于共享內(nèi)存空間的限制，我們每次只從Apad和B加載stepx塊系數(shù)（8x64）數(shù)據(jù)到共享內(nèi)存中的緩沖區(qū)。

tile consts

tile = 8

num_thread = 8

block_factor = tile * num_thread

step = 8

vthread = 2

Get the GPU thread indices

block_x = te.thread_axis(“blockIdx.x”)

block_y = te.thread_axis(“blockIdx.y”)

block_z = te.thread_axis(“blockIdx.z”)

thread_x = te.thread_axis((0, num_thread), “threadIdx.x”)

thread_y = te.thread_axis((0, num_thread), “threadIdx.y”)

thread_xz = te.thread_axis((0, vthread), “vthread”, name=“vx”)

thread_yz = te.thread_axis((0, vthread), “vthread”, name=“vy”)

Split the workloads

hi, wi, fi, ni = s[B].op.axis

bz = s[B].fuse(hi, wi)

by, fi = s[B].split(fi, factor=block_factor)

bx, ni = s[B].split(ni, factor=block_factor)

Bind the iteration variables to GPU

thread indices

s[B].bind(bz, block_z)

s[B].bind(by, block_y)

s[B].bind(bx, block_x)

Virtual Thread Split

進(jìn)一步將工作負(fù)載從一個(gè)線程塊分割到各個(gè)線程。為了避免沖突，將8個(gè)線程分成4個(gè)部分，然后使用8個(gè)線程分成4個(gè)部分。因此，如下圖所示，每個(gè)線程計(jì)算4個(gè)跨距網(wǎng)格，其中每個(gè)網(wǎng)格的大小為4 x 4。

tyz, fi = s[B].split(fi, nparts=vthread) # virtual thread split

txz, ni = s[B].split(ni, nparts=vthread) # virtual thread split

ty, fi = s[B].split(fi, nparts=num_thread)

tx, ni = s[B].split(ni, nparts=num_thread)

s[B].reorder(bz, by, bx, tyz, txz, ty, tx, fi, ni)

s[B].bind(tyz, thread_yz)

s[B].bind(txz, thread_xz)

s[B].bind(ty, thread_y)

s[B].bind(tx, thread_x)

Cooperative Fetching

如前所述，每個(gè)時(shí)間步都需要將步驟x塊因子數(shù)據(jù)從GPU全局內(nèi)存?zhèn)鬏數(shù)焦蚕韮?nèi)存。為了減少每個(gè)線程的內(nèi)存?zhèn)鬏?#xff0c;下面的代碼允許同一線程塊中的線程協(xié)同從全局內(nèi)存中獲取相關(guān)數(shù)據(jù)。

Schedule BL local write

s[BL].compute_at(s[B], tx)

yi, xi, fi, ni = s[BL].op.axis

ry, rx, rc = s[BL].op.reduce_axis

rco, rci = s[BL].split(rc, factor=step)

s[BL].reorder(rco, ry, rx, rci, fi, ni)

Attach computation to iteration

variables

s[AA].compute_at(s[BL], rx)

s[WW].compute_at(s[BL], rx)

s[AL].compute_at(s[BL], rci)

s[WL].compute_at(s[BL], rci)

Schedule for A’s shared memory

load

yi, xi, ci, ni = s[AA].op.axis

ty, ci = s[AA].split(ci, nparts=num_thread)

tx, ni = s[AA].split(ni, nparts=num_thread)

_, ni = s[AA].split(ni, factor=4)

s[AA].reorder(ty, tx, yi, xi, ci, ni)

s[AA].bind(ty, thread_y)

s[AA].bind(tx, thread_x)

s[AA].vectorize(ni) # vectorize memory load

Schedule for W’s shared memory

load

yi, xi, ci, fi = s[WW].op.axis

ty, ci = s[WW].split(ci, nparts=num_thread)

tx, fi = s[WW].split(fi, nparts=num_thread)

_, fi = s[WW].split(fi, factor=4)

s[WW].reorder(ty, tx, yi, xi, ci, fi)

s[WW].bind(ty, thread_y)

s[WW].bind(tx, thread_x)

s[WW].vectorize(fi) #
vectorize memory load

Generate CUDA Kernel

最后利用TVM生成并編譯了CUDA內(nèi)核，并對(duì)卷積延遲進(jìn)行了評(píng)估。

func = tvm.build(s, [A, W, B], “cuda”)

ctx = tvm.gpu(0)

a_np = np.random.uniform(size=(in_size,
in_size, in_channel, batch)).astype(A.dtype)

w_np = np.random.uniform(size=(kernel, kernel,
in_channel, out_channel)).astype(W.dtype)

a = tvm.nd.array(a_np, ctx)

w = tvm.nd.array(w_np, ctx)

b = tvm.nd.array(np.zeros((out_size, out_size,
out_channel, batch), dtype=B.dtype), ctx)

func(a, w, b)

evaluator = func.time_evaluator(func.entry_name, ctx, number=1)

print(“Convolution: %f ms” % (evaluator(a,
w, b).mean * 1e3))

Out:

Convolution: 53.197723 ms

https://tvm.apache.org/docs/tutorials/optimize/opt_conv_cuda.html

總結(jié)

以上是生活随笔為你收集整理的GPU上如何优化卷积的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

卷积
GPU

上一篇：如何在CPU上优化GEMM（下）
下一篇： TensorRT宏碁自建云(BYOC,