Schedule Primitives in TVM
TVM is a domain-specific language for efficient kernel construction.
This tutorial shows how to schedule a computation using the various primitives that TVM provides.
from __future__ import absolute_import, print_function
import tvm
from tvm import te
import numpy as np
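There are usually several ways to compute the same result, and different ways lead to different locality and performance. TVM therefore asks the user to specify how the computation should be executed; this specification is called a Schedule.
A Schedule is a set of transformations that rewrite the loops of the computation in a program.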
# declare some variables for use later
n = te.var("n")
m = te.var("m")
A schedule can be created from a list of ops. By default, the schedule computes the tensor serially, in row-major order.
# declare a matrix element-wise multiply
A = te.placeholder((m, n), name="A")
B = te.placeholder((m, n), name="B")
C = te.compute((m, n), lambda i, j: A[i, j] * B[i, j], name="C")
s = te.create_schedule([C.op])
# lower transforms the computation from its definition into a real
# callable function. With the argument simple_mode=True, it returns
# a readable C-like statement; we use it here to print the
# schedule result.
print(tvm.lower(s, [A, B, C], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
             C: Buffer(C_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m, n], [stride_4: int32, stride_5: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i: int32, 0, m) {
    for (j: int32, 0, n) {
      C_2[((i*stride_2) + (j*stride_3))] = ((float32*)A_2[((i*stride_4) + (j*stride_5))]*(float32*)B_2[((i*stride) + (j*stride_1))])
    }
  }
}
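To make the default schedule concrete, the loop nest printed above can be mirrored in plain Python. This is only an illustrative sketch, not TVM code: the function name multiply_row_major is hypothetical, and it assumes contiguous row-major storage (stride n along i, stride 1 along j), whereas the lowered function keeps the strides symbolic.

```python
def multiply_row_major(a, b, m, n):
    """Element-wise product of two m-by-n matrices stored as flat row-major lists."""
    c = [0.0] * (m * n)
    for i in range(m):        # outer loop over rows
        for j in range(n):    # inner loop over columns, contiguous in memory
            idx = i * n + j   # assumed layout: stride n for i, stride 1 for j
            c[idx] = a[idx] * b[idx]
    return c


if __name__ == "__main__":
    print(multiply_row_major([1, 2, 3, 4, 5, 6], [2, 2, 2, 2, 2, 2], 2, 3))
```

Every primitive discussed below changes the shape or order of a loop nest like this one without changing the values it computes.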
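A schedule is composed of multiple stages, and one Stage represents the schedule for one operation. TVM provides various methods to schedule each stage.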
split
split can split a specified axis into two axes by factor.
A = te.placeholder((m,), name="A")
B = te.compute((m,), lambda i: A[i] * 2, name="B")
s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=32)
print(tvm.lower(s, [A, B], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B} {
  for (i.outer: int32, 0, floordiv((m + 31), 32)) {
    for (i.inner: int32, 0, 32) {
      if @tir.likely((((i.outer*32) + i.inner) < m), dtype=bool) {
        B_2[(((i.outer*32) + i.inner)*stride)] = ((float32*)A_2[(((i.outer*32) + i.inner)*stride_1)]*2f32)
      }
    }
  }
}
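The index arithmetic in the output above can be checked with a small plain-Python sketch. It is not TVM code and the function name split_by_factor is hypothetical; it only reproduces the outer/inner loops and the bounds guard to show that splitting by a factor still visits every original index exactly once, in the same order.

```python
def split_by_factor(m, factor):
    """Return the indices visited when an axis of extent m is split by `factor`."""
    visited = []
    outer_extent = (m + factor - 1) // factor   # same as floordiv(m + factor - 1, factor)
    for outer in range(outer_extent):
        for inner in range(factor):             # inner extent is fixed to `factor`
            i = outer * factor + inner
            if i < m:                           # guard for the last, partial chunk
                visited.append(i)
    return visited


if __name__ == "__main__":
    print(split_by_factor(70, 32) == list(range(70)))
```

The guard matters only when m is not a multiple of the factor; otherwise the last chunk is full and the condition is always true.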
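You can also split an axis by nparts, which splits the axis in the opposite way to factor: nparts fixes the extent of the outer loop, whereas factor fixes the extent of the inner loop.
A = te.placeholder((m,), name="A")
B = te.compute((m,), lambda i: A[i], name="B")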
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], nparts=32)
print(tvm.lower(s, [A, B], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B} {
  for (i.outer: int32, 0, 32) {
    for (i.inner: int32, 0, floordiv((m + 31), 32)) {
      if @tir.likely(((i.inner + (i.outer*floordiv((m + 31), 32))) < m), dtype=bool) {
        B_2[((i.inner + (i.outer*floordiv((m + 31), 32)))*stride)] = (float32*)A_2[((i.inner + (i.outer*floordiv((m + 31), 32)))*stride_1)]
      }
    }
  }
}
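For comparison with splitting by factor, here is the same kind of plain-Python sketch for nparts. Again this is not TVM code and split_by_nparts is a hypothetical name; it mirrors the printed loop nest, where the outer extent is fixed and the inner extent is derived from m.

```python
def split_by_nparts(m, nparts):
    """Return the indices visited when an axis of extent m is split into `nparts` parts."""
    visited = []
    inner_extent = (m + nparts - 1) // nparts   # derived, as in floordiv(m + 31, 32)
    for outer in range(nparts):                 # outer extent is fixed to `nparts`
        for inner in range(inner_extent):
            i = inner + outer * inner_extent
            if i < m:                           # guard against running past the axis
                visited.append(i)
    return visited


if __name__ == "__main__":
    print(split_by_nparts(70, 32) == list(range(70)))
```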
tile
tile helps you execute the computation tile by tile over two axes.
A = te.placeholder((m, n), name="A")
B = te.compute((m, n), lambda i, j: A[i, j], name="B")
s = te.create_schedule(B.op)
xo, yo, xi, yi = s[B].tile(B.op.axis[0], B.op.axis[1], x_factor=10, y_factor=5)
print(tvm.lower(s, [A, B], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B} {
  for (i.outer: int32, 0, floordiv((m + 9), 10)) {
    for (j.outer: int32, 0, floordiv((n + 4), 5)) {
      for (i.inner: int32, 0, 10) {
        if @tir.likely((((i.outer*10) + i.inner) < m), dtype=bool) {
          for (j.inner: int32, 0, 5) {
            if @tir.likely((((j.outer*5) + j.inner) < n), dtype=bool) {
              B_2[((((i.outer*10) + i.inner)*stride) + (((j.outer*5) + j.inner)*stride_1))] = (float32*)A_2[((((i.outer*10) + i.inner)*stride_2) + (((j.outer*5) + j.inner)*stride_3))]
            }
          }
        }
      }
    }
  }
}
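The effect of tiling on traversal order can be seen in a plain-Python sketch of the loop nest above. This is an illustration rather than TVM code; tiled_order is a hypothetical helper that records the (i, j) pairs in the order the four loops reach them.

```python
def tiled_order(m, n, x_factor, y_factor):
    """Return (i, j) pairs in the order a tiled loop nest visits them."""
    order = []
    for i_outer in range((m + x_factor - 1) // x_factor):
        for j_outer in range((n + y_factor - 1) // y_factor):
            # Everything inside these two loops belongs to a single tile.
            for i_inner in range(x_factor):
                i = i_outer * x_factor + i_inner
                if i < m:
                    for j_inner in range(y_factor):
                        j = j_outer * y_factor + j_inner
                        if j < n:
                            order.append((i, j))
    return order


if __name__ == "__main__":
    # A 3x4 iteration space with 2x2 tiles: the first tile is finished
    # before any element of the next tile is touched.
    print(tiled_order(3, 4, 2, 2))
```

Each element is still visited exactly once; only the order changes, which is what affects locality.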
fuse
fuse can fuse two consecutive axes of one computation.
A = te.placeholder((m, n), name="A")
B = te.compute((m, n), lambda i, j: A[i, j], name="B")
s = te.create_schedule(B.op)
# tile to four axes first: (i.outer, j.outer, i.inner, j.inner)
xo, yo, xi, yi = s[B].tile(B.op.axis[0], B.op.axis[1], x_factor=10, y_factor=5)
# then fuse (i.inner, j.inner) into one axis: (i.inner.j.inner.fused)
fused = s[B].fuse(xi, yi)
print(tvm.lower(s, [A, B], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B} {
  for (i.outer: int32, 0, floordiv((m + 9), 10)) {
    for (j.outer: int32, 0, floordiv((n + 4), 5)) {
      for (i.inner.j.inner.fused: int32, 0, 50) {
        if @tir.likely((((i.outer*10) + floordiv(i.inner.j.inner.fused, 5)) < m), dtype=bool) {
          if @tir.likely((((j.outer*5) + floormod(i.inner.j.inner.fused, 5)) < n), dtype=bool) {
            B_2[((((i.outer*10) + floordiv(i.inner.j.inner.fused, 5))*stride) + (((j.outer*5) + floormod(i.inner.j.inner.fused, 5))*stride_1))] = (float32*)A_2[((((i.outer*10) + floordiv(i.inner.j.inner.fused, 5))*stride_2) + (((j.outer*5) + floormod(i.inner.j.inner.fused, 5))*stride_3))]
          }
        }
      }
    }
  }
}
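The floordiv and floormod expressions in the fused loop recover the two original inner indices from the single fused one. A short plain-Python sketch (not TVM code; fused_inner_order is a hypothetical name) shows that one loop of extent x_factor * y_factor enumerates the same pairs, in the same order, as the two nested loops it replaces.

```python
def fused_inner_order(x_factor, y_factor):
    """Return (i_inner, j_inner) pairs recovered from a single fused loop variable."""
    pairs = []
    for fused in range(x_factor * y_factor):
        i_inner = fused // y_factor   # floordiv(fused, y_factor)
        j_inner = fused % y_factor    # floormod(fused, y_factor)
        pairs.append((i_inner, j_inner))
    return pairs


if __name__ == "__main__":
    nested = [(i, j) for i in range(10) for j in range(5)]
    print(fused_inner_order(10, 5) == nested)
```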
reorder
reorder can reorder the axes in the specified order.
A = te.placeholder((m, n), name="A")
B = te.compute((m, n), lambda i, j: A[i, j], name="B")
s = te.create_schedule(B.op)
# tile to four axes first: (i.outer, j.outer, i.inner, j.inner)
xo, yo, xi, yi = s[B].tile(B.op.axis[0], B.op.axis[1], x_factor=10, y_factor=5)
# then reorder the axes: (i.inner, j.outer, i.outer, j.inner)
s[B].reorder(xi, yo, xo, yi)
print(tvm.lower(s, [A, B], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32, n: int32], [stride: int32, stride_1: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m, n], [stride_2: int32, stride_3: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B} {
  for (i.inner: int32, 0, 10) {
    for (j.outer: int32, 0, floordiv((n + 4), 5)) {
      for (i.outer: int32, 0, floordiv((m + 9), 10)) {
        if @tir.likely((((i.outer*10) + i.inner) < m), dtype=bool) {
          for (j.inner: int32, 0, 5) {
            if @tir.likely((((j.outer*5) + j.inner) < n), dtype=bool) {
              B_2[((((i.outer*10) + i.inner)*stride) + (((j.outer*5) + j.inner)*stride_1))] = (float32*)A_2[((((i.outer*10) + i.inner)*stride_2) + (((j.outer*5) + j.inner)*stride_3))]
            }
          }
        }
      }
    }
  }
}
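A plain-Python sketch of the reordered nest makes the change in traversal order visible. It is an illustration only, not TVM code, and reordered_order is a hypothetical helper; the loops are nested in the order (i.inner, j.outer, i.outer, j.inner) used in the example above.

```python
def reordered_order(m, n, x_factor, y_factor):
    """Return (i, j) pairs visited with loops nested as (i_inner, j_outer, i_outer, j_inner)."""
    order = []
    for i_inner in range(x_factor):
        for j_outer in range((n + y_factor - 1) // y_factor):
            for i_outer in range((m + x_factor - 1) // x_factor):
                i = i_outer * x_factor + i_inner
                if i < m:
                    for j_inner in range(y_factor):
                        j = j_outer * y_factor + j_inner
                        if j < n:
                            order.append((i, j))
    return order


if __name__ == "__main__":
    # With i_inner outermost, rows 0 and 2 are touched before row 1.
    print(reordered_order(3, 4, 2, 2))
```

The set of visited elements is unchanged, so the result is the same; only the memory access pattern differs.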
bind
bind can bind a specified axis to a thread axis; it is often used in GPU programming.
A = te.placeholder((n,), name="A")
B = te.compute(A.shape, lambda i: A[i] * 2, name="B")
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))
print(tvm.lower(s, [A, B], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [n: int32], [stride: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [n], [stride_1: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B} {
  attr [IterVar(blockIdx.x: int32, (nullptr), "ThreadIndex", "blockIdx.x")] "thread_extent" = floordiv((n + 63), 64);
  attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 64;
  if @tir.likely((((blockIdx.x*64) + threadIdx.x) < n), dtype=bool) {
    B_2[(((blockIdx.x*64) + threadIdx.x)*stride)] = ((float32*)A_2[(((blockIdx.x*64) + threadIdx.x)*stride_1)]*2f32)
  }
}
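After binding, the two loops disappear from the lowered code: each (block, thread) pair handles one element. The plain-Python sketch below simulates that on the CPU. It is not TVM or CUDA code; kernel and launch are hypothetical names, and the launcher deliberately walks the pairs in reverse to stress that the body must not depend on execution order.

```python
def kernel(block_idx, thread_idx, a, b, n, threads_per_block=64):
    """Body executed by one simulated thread: at most one element of b."""
    i = block_idx * threads_per_block + thread_idx
    if i < n:                      # threads past the end of the array do nothing
        b[i] = a[i] * 2


def launch(a, threads_per_block=64):
    """Run the kernel body once for every (block, thread) pair."""
    n = len(a)
    b = [0.0] * n
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    # Reverse order on purpose: a real GPU gives no ordering guarantee.
    for block_idx in reversed(range(num_blocks)):
        for thread_idx in reversed(range(threads_per_block)):
            kernel(block_idx, thread_idx, a, b, n, threads_per_block)
    return b


if __name__ == "__main__":
    data = [float(x) for x in range(100)]
    print(launch(data) == [2.0 * x for x in data])
```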
compute_at
For a schedule that consists of multiple operators, TVM computes each tensor at the root separately by default.
A = te.placeholder((m,), name="A")
B = te.compute((m,), lambda i: A[i] + 1, name="B")
C = te.compute((m,), lambda i: B[i] * 2, name="C")
s = te.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
             B: Buffer(B_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m], [stride_2: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i: int32, 0, m) {
    B_2[(i*stride_1)] = ((float32*)A_2[(i*stride_2)] + 1f32)
  }
  for (i_1: int32, 0, m) {
    C_2[(i_1*stride)] = ((float32*)B_2[(i_1*stride_1)]*2f32)
  }
}
compute_at can move the computation of B into the first axis of the computation of C.
A = te.placeholder((m,), name="A")
B = te.compute((m,), lambda i: A[i] + 1, name="B")
C = te.compute((m,), lambda i: B[i] * 2, name="C")
s = te.create_schedule(C.op)
s[B].compute_at(s[C], C.op.axis[0])
print(tvm.lower(s, [A, B, C], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
             C: Buffer(C_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m], [stride_2: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i: int32, 0, m) {
    B_2[(i*stride)] = ((float32*)A_2[(i*stride_2)] + 1f32)
    C_2[(i*stride_1)] = ((float32*)B_2[(i*stride)]*2f32)
  }
}
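The difference between computing B at the root and computing it at C's first axis can be sketched in plain Python. This is not TVM code, and both function names are hypothetical; the two functions mirror the two lowered loop structures shown in this section and return the same values.

```python
def compute_at_root(a):
    """Default schedule: B is produced completely before C starts."""
    m = len(a)
    b = [0.0] * m
    c = [0.0] * m
    for i in range(m):
        b[i] = a[i] + 1
    for i in range(m):
        c[i] = b[i] * 2
    return b, c


def compute_b_at_c(a):
    """compute_at-style schedule: each B[i] is produced right before C[i] uses it."""
    m = len(a)
    b = [0.0] * m
    c = [0.0] * m
    for i in range(m):
        b[i] = a[i] + 1
        c[i] = b[i] * 2   # B[i] was just written, so it is likely still in cache
    return b, c


if __name__ == "__main__":
    print(compute_at_root([1.0, 2.0, 3.0]) == compute_b_at_c([1.0, 2.0, 3.0]))
```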
compute_inline
compute_inline can mark one stage as inline; the body of its computation is then expanded and inserted at the point where the tensor is required.
A = te.placeholder((m,), name="A")
B = te.compute((m,), lambda i: A[i] + 1, name="B")
C = te.compute((m,), lambda i: B[i] * 2, name="C")
s = te.create_schedule(C.op)
s[B].compute_inline()
print(tvm.lower(s, [A, B, C], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
             B: Buffer(B_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m], [stride_2: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i: int32, 0, m) {
    C_2[(i*stride)] = (((float32*)A_2[(i*stride_2)] + 1f32)*2f32)
  }
}
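In plain Python, the inlined form corresponds to substituting the expression for B directly into C. The sketch below is illustrative only, with compute_inlined as a hypothetical name; note that, as in the lowered output above, nothing is ever stored for B.

```python
def compute_inlined(a):
    """Inline-style schedule: the expression for B is substituted into C."""
    m = len(a)
    c = [0.0] * m
    for i in range(m):
        c[i] = (a[i] + 1) * 2   # (A[i] + 1) is B[i], expanded in place
    return c


if __name__ == "__main__":
    print(compute_inlined([1.0, 2.0, 3.0]))
```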
compute_root
compute_root can move the computation of one stage back to the root.
A = te.placeholder((m,), name="A")
B = te.compute((m,), lambda i: A[i] + 1, name="B")
C = te.compute((m,), lambda i: B[i] * 2, name="C")
s = te.create_schedule(C.op)
s[B].compute_at(s[C], C.op.axis[0])
s[B].compute_root()
print(tvm.lower(s, [A, B, C], simple_mode=True))
Out:
primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {B: Buffer(B_2: Pointer(float32), float32, [m: int32], [stride: int32], type="auto"),
             C: Buffer(C_2: Pointer(float32), float32, [m], [stride_1: int32], type="auto"),
             A: Buffer(A_2: Pointer(float32), float32, [m], [stride_2: int32], type="auto")}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i: int32, 0, m) {
    B_2[(i*stride)] = ((float32*)A_2[(i*stride_2)] + 1f32)
  }
  for (i_1: int32, 0, m) {
    C_2[(i_1*stride_1)] = ((float32*)B_2[(i_1*stride)]*2f32)
  }
}
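Summary
This tutorial introduced the schedule primitives in TVM, which let users schedule a computation easily and flexibly.
To obtain a well-performing kernel implementation, the general workflow is often:
- Describe the computation via a series of operations.
- Try to schedule the computation with primitives.
- Compile and run to see the performance difference.
- Adjust the schedule according to the run-time results.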