Writing Tunable Templates and Using the Auto-tuner

Published: 2023/11/28

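This article introduces the auto-tuning module in TVM.
Auto-tuning has two steps. The first step is to define a search space. The second step is to run a search algorithm to explore this space. This article shows how to perform both steps in TVM, using matrix multiplication to illustrate the whole workflow.
As written, this tutorial will not run on Windows or recent versions of macOS. To get it to run there, wrap the body of the script in an if __name__ == "__main__": block.
Install dependencies
To use the autotvm package in TVM, some extra dependencies need to be installed. Installing xgboost can be skipped, since this tutorial does not need it. (If you use python2, change "3" to "2".)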
pip3 install --user psutil xgboost
To make TVM run faster during tuning, it is recommended to use cython as the FFI of TVM. In the root directory of TVM, execute (if you use python2, change "3" to "2"):
pip3 install --user cython
sudo make cython3
Now return to the Python code and import the packages.
import logging
import sys

import numpy as np
import tvm
from tvm import te, testing

# the module is called `autotvm`

from tvm import autotvm
Step 1: Define the search space
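In this section, we rewrite a deterministic TVM schedule code into a tunable schedule template. Defining the search space can be regarded as parameterizing the existing schedule code.
To begin with, here is how a blocked matrix multiplication is implemented in TVM.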

# Matmul V0: Constant tiling factor
def matmul_v0(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    yo, yi = s[C].split(y, 8)
    xo, xi = s[C].split(x, 8)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]
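To make the effect of the split and reorder calls concrete, the loop nest that this schedule describes can be written out by hand. The following is a plain-Python sketch, not TVM output: the function names matmul_naive and matmul_tiled are made up for illustration, and the sketch assumes N and M are multiples of the tile sizes.

```python
def matmul_naive(A, B, N, L, M):
    """Reference triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    C = [[0] * M for _ in range(N)]
    for i in range(N):
        for j in range(M):
            for k in range(L):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, N, L, M, tile_y=8, tile_x=8):
    """Same computation with loops ordered (yo, xo, k, yi, xi).

    Assumes N % tile_y == 0 and M % tile_x == 0 (an assumption of this sketch).
    """
    C = [[0] * M for _ in range(N)]
    for yo in range(N // tile_y):              # outer part of y
        for xo in range(M // tile_x):          # outer part of x
            for k in range(L):                 # reduction axis
                for yi in range(tile_y):       # inner part of y
                    for xi in range(tile_x):   # inner part of x
                        i = yo * tile_y + yi
                        j = xo * tile_x + xi
                        C[i][j] += A[i][k] * B[k][j]
    return C


if __name__ == "__main__":
    N, L, M = 16, 4, 16
    A = [[(i + k) % 5 for k in range(L)] for i in range(N)]
    B = [[(k * j) % 7 for j in range(M)] for k in range(L)]
    print(matmul_tiled(A, B, N, L, M) == matmul_naive(A, B, N, L, M))
```

Both functions compute the same result; only the order in which the (i, j, k) points are visited differs, which is what affects cache behaviour on real hardware.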
Parametrize the schedule
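The previous schedule code uses the constant "8" as the tiling factor. However, it might not be the best one, because the best tiling factor depends on the real hardware environment and the input shape.
If you want the schedule code to be portable across a wider range of input shapes and target hardware, it is better to define a set of candidate values and pick the best one according to measurement results on the target hardware.
In autotvm, you can define a tunable parameter, or "knob", for such a value.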

# Matmul V1: List candidate values
@autotvm.template("tutorial/matmul_v1")  # 1. use a decorator
def matmul_v1(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    # 2. get the config object
    cfg = autotvm.get_config()

    # 3. define search space
    cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
    cfg.define_knob("tile_x", [1, 2, 4, 8, 16])

    # 4. schedule according to config
    yo, yi = s[C].split(y, cfg["tile_y"].val)
    xo, xi = s[C].split(x, cfg["tile_x"].val)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]

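Here we make four modifications to the previous schedule code and get a tunable "template". The modifications are explained one by one.
1. Use a decorator to mark this function as a simple template.
2. Get a config object. You can regard this cfg as an argument of the function, although it is obtained in a different way. With this argument, the function is no longer a deterministic schedule code. Instead, different configurations can be passed to it to get different schedules, so the function is a "template".
   To keep the template function compact, it does two things in a single function: (1) define a search space and (2) schedule according to an entity in this space. To achieve this, cfg is either a ConfigSpace or a ConfigEntity object.
   When it is a ConfigSpace, it collects all tunable knobs in this function and builds the search space. When it is a ConfigEntity, it ignores all space definition API calls (that is, calls such as cfg.define_knob(...)). Instead, it stores deterministic values for all tunable knobs, and the function schedules according to these values.
   During auto-tuning, the template is first called with a ConfigSpace object to build the search space. It is then called with different ConfigEntity objects from the built space to get different schedules. Finally, the code generated by the different schedules is measured and the best one is picked.
3. Define two tunable knobs. The first one is tile_y with 5 possible values. The second one is tile_x with the same list of possible values. The two knobs are independent, so they span a search space of size 5x5 = 25.
4. Schedule according to the deterministic values in cfg.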
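The dual role of cfg can be illustrated without TVM. The sketch below is a toy model only: ToySpace and ToyEntity are hypothetical stand-ins invented here, not the real ConfigSpace and ConfigEntity classes, and the template returns a plain tuple of tile sizes instead of a schedule.

```python
import itertools


class ToySpace:
    """Collects knob definitions (hypothetical stand-in for a config space)."""

    def __init__(self):
        self.knobs = {}

    def define_knob(self, name, candidates):
        self.knobs[name] = list(candidates)

    def __getitem__(self, name):
        # While collecting, any candidate works as a placeholder value.
        return self.knobs[name][0]

    def entities(self):
        # One entity per point in the cross product of all knob candidates.
        names = list(self.knobs)
        for values in itertools.product(*(self.knobs[n] for n in names)):
            yield ToyEntity(dict(zip(names, values)))


class ToyEntity:
    """Holds one fixed value per knob and ignores define_knob calls."""

    def __init__(self, values):
        self.values = values

    def define_knob(self, name, candidates):
        pass  # the space is already built; nothing to record

    def __getitem__(self, name):
        return self.values[name]


def template(cfg):
    # The same function both declares the knobs and consumes their values.
    cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
    cfg.define_knob("tile_x", [1, 2, 4, 8, 16])
    return (cfg["tile_y"], cfg["tile_x"])


space = ToySpace()
template(space)                                       # first call: build the space
schedules = [template(e) for e in space.entities()]   # later calls: one per entity
print(len(schedules))
# 25
```

The single template function is called once to collect the knobs and then once per entity, which mirrors the two-phase use of cfg described in the list.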
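Use better space definition API
In the previous template, we manually list all possible values for a knob. This is the lowest-level API for defining the space. However, another set of APIs is provided to make space definition easier and smarter. It is recommended to use this higher-level API.
In the following example, ConfigSpace.define_split is used to define a split knob. It enumerates all possible ways to split an axis and constructs the space.
There are also ConfigSpace.define_reorder for reorder knobs and ConfigSpace.define_annotate for annotations such as unroll, vectorization, and thread binding. When the high-level API cannot meet your requirements, you can always fall back to the low-level API.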
@autotvm.template("tutorial/matmul")
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)

    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    # schedule
    y, x = s[C].op.axis
    k = s[C].op.reduce_axis[0]

    ##### define space begin #####
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    ##### define space end #####

    # schedule according to config
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)

    s[C].reorder(yo, xo, k, yi, xi)

    return s, [A, B, C]
Note
More explanation on cfg.define_split
In this template, cfg.define_split("tile_y", y, num_outputs=2) enumerates all possible combinations that split axis y into two axes whose lengths are factors of the length of y. For example, if the length of y is 32 and we split it into two axes using factors of 32, there are 6 possible values for the (length of outer axis, length of inner axis) pair: (32, 1), (16, 2), (8, 4), (4, 8), (2, 16) and (1, 32). These are the 6 possible values of tile_y.
During scheduling, cfg["tile_y"] is a SplitEntity object. It stores the lengths of the outer and inner axes in cfg["tile_y"].size (a tuple with two elements). In this template, we apply it with yo, yi = cfg["tile_y"].apply(s, C, y). This is equivalent to yo, yi = s[C].split(y, cfg["tile_y"].size[1]) or yo, yi = s[C].split(y, nparts=cfg["tile_y"].size[0]).
The advantage of the cfg.apply API is that it makes multi-level splits (when num_outputs >= 3) easier.
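The enumeration described in the note is easy to reproduce for the two-output case. This is a minimal sketch in plain Python, assuming the candidates are simply all factor pairs of the axis length; split_candidates is a name made up here, not a TVM function.

```python
def split_candidates(length):
    """Return all (outer, inner) pairs with outer * inner == length."""
    return [(length // inner, inner)
            for inner in range(1, length + 1)
            if length % inner == 0]


print(split_candidates(32))
# [(32, 1), (16, 2), (8, 4), (4, 8), (2, 16), (1, 32)]
print(len(split_candidates(512)))
# 10
```

An axis of length 512 therefore has 10 candidates, so two independent split knobs on 512-long axes span 10x10 = 100 configurations.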
Step 2: Search through the space
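In step 1, we built the search space by extending the old schedule code into a template. The next step is to pick a tuner and explore this space.
Auto-tuners in TVM
The job of a tuner can be described by the following pseudocode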
ct = 0
while ct < max_number_of_trials:
    propose a batch of configs
    measure this batch of configs on real hardware and get results
    ct += batch_size
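When proposing the next batch of configs, the tuner can take different strategies. autotvm provides four tuners with different strategies.
• RandomTuner: Enumerate the space in a random order
• GridSearchTuner: Enumerate the space in a grid search order
• GATuner: Use a genetic algorithm to search through the space
• XGBTuner: Use a model-based method. Train an XGBoost model to predict the speed of lowered IR and pick the next batch according to the prediction.
You can choose the tuner according to the size of your space, your time budget and other factors. For example, if the space is very small (less than 1000), a grid search tuner or a random tuner is good enough. If the space is at the level of 10^9 (this is the space size of a conv2d operator on a CUDA GPU), XGBTuner can explore more efficiently and find better configs.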
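The tuner pseudocode can be turned into a small runnable loop. The sketch below visits a 10x10 index space in random order and uses a synthetic cost function in place of real hardware measurement; fake_measure, random_tune and the cost formula are all invented for illustration and say nothing about actual performance.

```python
import random


def fake_measure(config):
    """Synthetic cost (made up): smallest at tile indices (6, 7)."""
    ty, tx = config
    return 1.0 + (ty - 6) ** 2 + (tx - 7) ** 2


def random_tune(space, max_trials, batch_size, seed=0):
    """Visit the space in random order, batch by batch, and keep the best."""
    rng = random.Random(seed)
    order = list(space)
    rng.shuffle(order)
    best_cost, best_config = float("inf"), None
    ct = 0
    while ct < max_trials and ct < len(order):
        batch = order[ct:ct + batch_size]      # propose a batch of configs
        for config in batch:                   # "measure" the batch
            cost = fake_measure(config)
            if cost < best_cost:
                best_cost, best_config = cost, config
        ct += len(batch)
    return best_config, best_cost


space = [(ty, tx) for ty in range(10) for tx in range(10)]
print(random_tune(space, max_trials=100, batch_size=8))
```

With max_trials equal to the space size, every config is visited, so the call returns the global minimum of the synthetic cost. With a smaller budget it returns the best config seen so far, which is the trade-off that smarter tuners try to improve on.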
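Begin tuning
Here we continue the matrix multiplication example. First, create a tuning task. We can also inspect the initialized search space. In this case, for a 512x512 square matrix multiplication, the space size is 10x10 = 100.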
N, L, M = 512, 512, 512
task = autotvm.task.create("tutorial/matmul", args=(N, L, M, "float32"), target="llvm")
print(task.config_space)
Out:
ConfigSpace (len=100, space_map=
0 tile_y: Split(policy=factors, product=512, num_outputs=2) len=10
1 tile_x: Split(policy=factors, product=512, num_outputs=2) len=10
)
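Then we need to define how to measure the generated code and pick a tuner. Since the space is small, a random tuner is just fine.
Only 10 trials are run in this tutorial for demonstration. In practice, you can do more trials according to your time budget. The tuning results are logged to a log file, which can be used later to get the best config.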

# logging config (for printing tuning log to the screen)
logging.getLogger("autotvm").setLevel(logging.DEBUG)
logging.getLogger("autotvm").addHandler(logging.StreamHandler(sys.stdout))

# There are two steps for measuring a config: build and run.
# By default, we use all CPU cores to compile programs, then measure them sequentially.
# We measure 5 times and take the average to reduce variance.
measure_option = autotvm.measure_option(builder="local", runner=autotvm.LocalRunner(number=5))

# Begin tuning with RandomTuner, log records to file `matmul.log`
# You can use alternatives like XGBTuner.
tuner = autotvm.tuner.RandomTuner(task)
tuner.tune(
    n_trial=10,
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("matmul.log")],
)
Out:
Get devices for measurement successfully!
No: 1 GFLOPS: 0.52/0.52 result: MeasureResult(costs=(0.5179643672,), error_no=0, all_cost=8.699557542800903, timestamp=1607225778.9184623) [('tile_y', [-1, 64]), ('tile_x', [-1, 1])],None,6
No: 2 GFLOPS: 2.05/2.05 result: MeasureResult(costs=(0.1307110214,), error_no=0, all_cost=2.452157735824585, timestamp=1607225781.4836178) [('tile_y', [-1, 512]), ('tile_x', [-1, 8])],None,39
No: 3 GFLOPS: 2.77/2.77 result: MeasureResult(costs=(0.0968108324,), error_no=0, all_cost=2.015434741973877, timestamp=1607225783.5040994) [('tile_y', [-1, 2]), ('tile_x', [-1, 8])],None,31
No: 4 GFLOPS: 7.71/7.71 result: MeasureResult(costs=(0.0348177938,), error_no=0, all_cost=0.9887301921844482, timestamp=1607225784.5313203) [('tile_y', [-1, 1]), ('tile_x', [-1, 32])],None,50
No: 5 GFLOPS: 13.46/13.46 result: MeasureResult(costs=(0.0199451586,), error_no=0, all_cost=0.7833263874053955, timestamp=1607225785.3334467) [('tile_y', [-1, 256]), ('tile_x', [-1, 64])],None,68
No: 6 GFLOPS: 11.91/13.46 result: MeasureResult(costs=(0.0225446656,), error_no=0, all_cost=0.7622959613800049, timestamp=1607225786.1802726) [('tile_y', [-1, 256]), ('tile_x', [-1, 512])],None,98
No: 7 GFLOPS: 0.92/13.46 result: MeasureResult(costs=(0.2913359364,), error_no=0, all_cost=5.074311971664429, timestamp=1607225791.3119547) [('tile_y', [-1, 128]), ('tile_x', [-1, 2])],None,17
No: 8 GFLOPS: 2.37/13.46 result: MeasureResult(costs=(0.1133100596,), error_no=0, all_cost=2.2167930603027344, timestamp=1607225793.595454) [('tile_y', [-1, 8]), ('tile_x', [-1, 4])],None,23
No: 9 GFLOPS: 11.52/13.46 result: MeasureResult(costs=(0.0233022846,), error_no=0, all_cost=0.7279143333435059, timestamp=1607225795.1428313) [('tile_y', [-1, 256]), ('tile_x', [-1, 32])],None,58
No: 10 GFLOPS: 14.67/14.67 result: MeasureResult(costs=(0.0182990712,), error_no=0, all_cost=0.7626948356628418, timestamp=1607225795.9127738) [('tile_y', [-1, 64]), ('tile_x', [-1, 128])],None,76
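The GFLOPS column in the log can be cross-checked against the measured cost. The sketch below assumes a matrix multiplication is counted as 2*N*L*M floating-point operations (one multiply and one add per inner-loop step); that count is an assumption made here, though it reproduces the figures printed in the log. matmul_gflops is a helper name invented for this example.

```python
def matmul_gflops(n, l, m, seconds):
    """Throughput in GFLOPS, assuming 2 * n * l * m operations in total."""
    return 2 * n * l * m / seconds / 1e9


# Mean costs copied from trials No. 1, 5 and 10 in the log.
for cost in (0.5179643672, 0.0199451586, 0.0182990712):
    print(round(matmul_gflops(512, 512, 512, cost), 2))
# 0.52
# 13.46
# 14.67
```

The second number in each GFLOPS pair of the log (for example 11.91/13.46) is the best value seen so far, which is why it never decreases from one trial to the next.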
Finally, we apply the history best from the cache file and check its correctness. We can call the function matmul directly under the autotvm.apply_history_best context. When we call this function, it queries the dispatch context with its arguments and gets the best config for those arguments.

# apply history best from log file
with autotvm.apply_history_best("matmul.log"):
    with tvm.target.Target("llvm"):
        s, arg_bufs = matmul(N, L, M, "float32")
        func = tvm.build(s, arg_bufs)

# check correctness
a_np = np.random.uniform(size=(N, L)).astype(np.float32)
b_np = np.random.uniform(size=(L, M)).astype(np.float32)
c_np = a_np.dot(b_np)

c_tvm = tvm.nd.empty(c_np.shape)
func(tvm.nd.array(a_np), tvm.nd.array(b_np), c_tvm)

tvm.testing.assert_allclose(c_np, c_tvm.asnumpy(), rtol=1e-2)
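Conceptually, the best config is the logged record with the lowest measured cost. As a plain-Python illustration (it does not read matmul.log and is not a description of how apply_history_best works internally), the ten trials from the tuning output can be ranked directly. The list below is transcribed from that output, keeping the mean cost and the second entry of each tile_y and tile_x split.

```python
# (mean cost in seconds, tile_y second entry, tile_x second entry)
records = [
    (0.5179643672, 64, 1),
    (0.1307110214, 512, 8),
    (0.0968108324, 2, 8),
    (0.0348177938, 1, 32),
    (0.0199451586, 256, 64),
    (0.0225446656, 256, 512),
    (0.2913359364, 128, 2),
    (0.1133100596, 8, 4),
    (0.0233022846, 256, 32),
    (0.0182990712, 64, 128),
]

# Tuples compare element by element, so min() ranks by cost first.
best_cost, best_tile_y, best_tile_x = min(records)
print(best_tile_y, best_tile_x)
# 64 128
```

This matches trial No. 10, the only trial after No. 5 whose GFLOPS value raised the running best.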
https://tvm.apache.org/docs/tutorials/autotvm/tune_simple_template.html
