An Agile Framework for Hardware Specialization and Software Mapping
Overview
As Moore's Law approaches its end, designing specialized hardware, together with the software that maps applications onto it, is a promising path forward. Hardware design determines peak performance, but software matters just as much: it determines the achieved performance. Hardware/software (HW/SW) co-design optimizes the hardware acceleration and the software mapping together to improve overall performance. Current flows, however, develop hardware and software in isolation. Because the programming abstractions are low-level and the design space is enormous, both the hardware and the software are difficult to design and optimize.
This tutorial introduces AHS, an agile framework for hardware specialization and software mapping for tensor applications. Given a tensor application described in a high-level language, AHS can automatically define the interface between hardware and software, jointly navigate the huge design space, and automatically generate the hardware implementation and the software mapping. AHS consists of several components, each backed by an open-source tool.
First, we introduce HASCO, a tool for hardware/software co-design. HASCO uses a matching approach over a loop-based IR to explore different HW/SW partitioning choices. Because the design goals and evaluation costs differ between the two sides, HASCO adopts different design space exploration (DSE) algorithms for hardware and for software.
Second, we introduce TENET, a tool for hardware dataflow representation and performance modeling. Using a relation-centric notation, TENET can fully cover the design space of hardware dataflows.
Third, we introduce TensorLib, the synthesis backend of TENET. TensorLib can automatically generate hardware dataflow implementations written in Chisel.
Fourth, we introduce FlexTensor, a tool for automatic software mapping and optimization. FlexTensor can automatically generate optimized software implementations for a variety of hardware platforms, including CPUs, GPUs, FPGAs, and ASICs.
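As a concrete illustration of the kind of input these tools consume, the sketch below writes a GEMM as plain loop nests in a high-level language. This is a hypothetical example for intuition only, not the actual input format of AHS, HASCO, or FlexTensor.

```python
# A tensor application expressed as nested loops: C[i][j] = sum_k A[i][k] * B[k][j].
# Tools like HASCO and FlexTensor start from descriptions at roughly this level
# and explore how to partition and map the loop nest onto specialized hardware.
def gemm(A, B, M, N, K):
    C = [[0 for _ in range(N)] for _ in range(M)]
    for i in range(M):          # rows of A
        for j in range(N):      # columns of B
            for k in range(K):  # reduction dimension
                C[i][j] += A[i][k] * B[k][j]
    return C

# gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2, 2, 2) -> [[19, 22], [43, 50]]
```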
Chisel is an open-source hardware construction language released by UC Berkeley. It supports advanced hardware design through highly parameterized generators and layered domain-specific hardware design languages.
Key features:
- Embedded in the Scala programming language
- Hierarchical, object-oriented, and functional construction
- Highly parameterizable via metaprogramming in Scala
- Supports layering of domain-specific design languages
- Generates low-level Verilog designs that feed into standard ASIC or FPGA tools
A circuit designed in Chisel can be compiled into Verilog HDL targeting FPGAs or ASICs, as well as into a cycle-accurate C++ simulator.
Chisel -> FPGA Verilog
Chisel -> ASIC Verilog
Chisel -> C++ Simulator
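Chisel itself is embedded in Scala. As a language-agnostic analogy of the generate-then-lower flow above (this is not actual Chisel code), the toy Python generator below emits a Verilog adder for any bit width, mimicking how a parameterized generator elaborates down to low-level Verilog.

```python
# Toy illustration of the "parameterized generator" idea behind Chisel:
# one generator function, many concrete Verilog modules. Real Chisel builds a
# typed circuit graph in Scala and lowers it to Verilog; this sketch only
# mimics the shape of that flow with string emission.
def gen_adder(width: int) -> str:
    return "\n".join([
        f"module adder_{width} (",
        f"  input  [{width - 1}:0] a,",
        f"  input  [{width - 1}:0] b,",
        f"  output [{width - 1}:0] sum",
        ");",
        "  assign sum = a + b;",
        "endmodule",
    ])

# gen_adder(8) and gen_adder(32) yield two different adder modules
# from the same generator, which is the essence of parameterization.
```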
Schedule
We first give an overview of the AHS project, followed by a series of technical presentations and open-source tool demos. The tutorial covers all components of AHS, including hardware/software co-design, hardware specialization, and software mapping.
Organizers
Yun (Eric) Liang is an Associate Professor in the School of EECS at Peking University. His research interests include computer architecture, electronic design automation, and compilers. He has published over 100 scientific papers in venues such as ISCA, MICRO, DAC, and FPGA, and his research has received two Best Paper Awards and six Best Paper Nominations. He serves on the technical program committees of MICRO, ISCA, ASPLOS, HPCA, DAC, FPGA, FCCM, and others, and as an Associate Editor of ACM TECS and TRETS.
Zizhang Luo is a final-year graduate student at Peking University. He is interested in architecture and software co-design for domain-specific chips.
Liqiang Lu is a fifth-year Ph.D. student at Peking University. He received his bachelor's degree from the same university in 2017. He is interested in spatial architectures and reconfigurable computing.
Liancheng Jia is a fourth-year Ph.D. student at Peking University. He received his bachelor's degree from the same university in 2018. He is interested in high-level synthesis and agile hardware design.
Size Zheng is a third-year Ph.D. student at Peking University. He received his bachelor's degree from the same university in 2019. He is interested in compiler design and optimization for domain-specific accelerators.
Install Steps
Install
Shell script
You can use this shell script to install everything.
sh -c "$(wget https://pku-ahs.github.io/tutorial/en/master/_downloads/9064601015f9cd5e747a641dbdacf3aa/install_ahs.sh -O -)"
source ~/.bashrc
The shell script is tested under Ubuntu 20.04 LTS. If you use another OS, or if you use Anaconda or Virtualenv for Python, you may need to modify the script yourself. For Windows users, it is best to use WSL.
Docker
You can pull our Docker image. Everything is prepared, configured, and installed for you.
docker pull ericlyun/ahsmicro:latest
docker run -it ericlyun/ahsmicro:latest /bin/bash
Requirement
Apt
- python3
- python3-pip
- git
- llvm-9
- cmake
- build-essential
- make
- autoconf
- automake
- scons
- libboost-all-dev
- libgmp10-dev
- libtool
- default-jdk
- csvtool
Pip
- numpy
- decorator
- attrs
- tornado
- psutil
- xgboost
- cloudpickle
- tensorflow
- tqdm
- IPython
- botorch
- jinja2
- pandas
- scipy
- scikit-learn
- plotly
Sbt
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add
sudo apt-get update
sudo apt-get install sbt
Git
git clone --recursive -b micro_tutorial https://github.com/pku-liang/HASCO.git
git clone --recursive -b micro_tutorial https://github.com/pku-liang/TENET.git
git clone https://github.com/KnowingNothing/FlexTensor-Micro.git
git clone -b demo https://github.com/pku-liang/TensorLib.git
Configure & Compile
Hasco
cd ./HASCO
bash ./install.sh
Settings
vim ~/.bashrc
append:
export TVM_HOME=<install_dir>/HASCO/src/tvm
export AX_HOME=<install_dir>/HASCO/src/Ax
export PYTHONPATH=$TVM_HOME/python:$AX_HOME:${PYTHONPATH}
source ~/.bashrc
TENET
cd ./TENET
bash ./init.sh
vim ~/.bashrc
append:
export LD_LIBRARY_PATH=<install_dir>/TENET/external/lib:$LD_LIBRARY_PATH
source ~/.bashrc
cd TENET
make cli
make hasco
Dockerfile
The Docker image is about 7 GB. If pulling it is difficult due to its size, you can build the image yourself from the following Dockerfile.
# syntax=docker/dockerfile:1
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
    && apt-get -y -q install git sudo vim python3 python3-pip llvm-9 cmake build-essential make autoconf automake scons libboost-all-dev libgmp10-dev libtool curl default-jdk csvtool \
    && pip3 install tensorflow decorator attrs tornado psutil xgboost cloudpickle tqdm IPython botorch jinja2 pandas scipy scikit-learn plotly \
    && echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list \
    && echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | sudo tee /etc/apt/sources.list.d/sbt_old.list \
    && curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add \
    && sudo apt-get update \
    && sudo apt-get -y -q install sbt \
    && mkdir AHS \
    && cd AHS \
    && git clone --recursive -b micro_tutorial https://github.com/pku-liang/HASCO.git \
    && git clone --recursive -b micro_tutorial https://github.com/pku-liang/TENET.git \
    && git clone -b demo https://github.com/pku-liang/TensorLib.git \
    && git clone https://github.com/KnowingNothing/FlexTensor-Micro.git \
    && cd HASCO \
    && bash ./install.sh \
    && cd ../TENET \
    && bash ./init.sh
Run
HASCO
Config
vim src/codesign/config.py
mastro_home = "<install_dir>/HASCO/src/maestro"
tenet_path = "<install_dir>/TENET/bin/HASCO_Interface"
tenet_params = {
    "avg_latency": 16,  # average latency for each computation
    "f_trans": 12,      # energy consumed per element transferred
    "f_work": 16,       # energy consumed per element in the workload
}
tensorlib_home = "<install_dir>/TensorLib"
tensorlib_main = "tensorlib.ParseJson"
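The comments on the tenet_params entries above suggest a simple per-element cost model. The sketch below is only a hypothetical illustration of how such parameters might combine into latency and energy estimates; TENET's actual model is derived from its relation-centric dataflow analysis, and the function and argument names here are made up.

```python
# Hypothetical per-element cost model (illustration only, not TENET's model):
# latency scales with the number of computations, energy with the number of
# elements moved plus the number of workload elements processed.
def estimate_cost(avg_latency, f_trans, f_work, n_ops, n_transferred, workload_size):
    latency = avg_latency * n_ops                          # cycles
    energy = f_trans * n_transferred + f_work * workload_size  # energy units
    return latency, energy

# With the example config values (16, 12, 16) and a tiny workload:
# estimate_cost(16, 12, 16, 10, 5, 4) -> (160, 124)
```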
Python API
python3 testbench/co_mobile_conv.py
python3 testbench/co_resnet_gemm.py
…
CLI
cd HASCO
./hasco.py -h
Run a GEMM intrinsic with MobileNetV2 benchmark
./hasco.py -i GEMM -b MobileNetv2 -f gemm_example.json -l 1000 -p 20 -a 0
Results:
- rst/MobileNetV2_CONV.csv — config of the best design for each constraint; view with `column -s, -t < rst/MobileNetV2_CONV.csv`
- rst/software/MobileNetV2_CONV_* — TVM IR for each design
- rst/hardware/CONV_*.json — TensorLib config for each design
- rst/hardware/CONV_*.v — TensorLib-generated Verilog
TENET
cd TENET
Help Text
./bin/tenet -h
Run a KC-systolic dataflow
./bin/tenet -p ./dataflow_example/pe_array.p -s ./dataflow_example/conv.s -m ./dataflow_example/KC_systolic_dataflow.m -o output.csv --all
Run an OxOy dataflow
./bin/tenet -p ./dataflow_example/pe_array.p -s ./dataflow_example/conv.s -m ./dataflow_example/OxOy_dataflow.m -o output.csv --all
Run all layers in MobileNet
./bin/tenet -e ./network_example/MobileNet/config -d ./network_example -o output.csv --all
Result: output.csv
TensorLib
cd TensorLib
Optional: download the dependencies from Maven first, so that the remaining instructions run faster.
sbt compile
Examples of Scala APIs
sbt “runMain tensorlib.Example_GenConv2D”
sbt “runMain tensorlib.Example_GenGEMM”
Examples of JSON interface
sbt “runMain tensorlib.ParseJson ./examples/conv2d.json ./output/conv2d.v”
sbt “runMain tensorlib.ParseJson ./examples/gemm.json ./output/gemm.v”
Testing the result
sbt “runMain tensorlib.Test_Runner_Gemm”
Result:
Scala interface: PEArray.v
ParseJson: written to the output path given as the second argument.
FlexTensor
cd FlexTensor-Micro
export PYTHONPATH=$PYTHONPATH:/path/to/FlexTensor-Micro
cd FlexTensor-Micro/flextensor/tutorial
First, CPU experiments
cd conv2d_llvm
run flextensor
python optimize_conv2d.py --shapes res --target llvm --parallel 8 --timeout 20 --log resnet_config.log
run test
python optimize_conv2d.py --test resnet_optimize_log.txt
run baseline
python conv2d_baseline.py --type tvm_generic --shapes res --number 100
run plot
python plot.py
Next, GPU experiments
cd ../conv2d_cuda
run flextensor
python optimize_conv2d.py --shapes res --target cuda --parallel 4 --timeout 20 --log resnet_config.log
run test
python optimize_conv2d.py --test resnet_optimize_log.txt
run baseline
python conv2d_baseline.py --type pytorch --shapes res --number 100
run plot
python plot.py
Finally, VNNI experiments
cd ../gemm_vnni
run flextensor (cascadelake)
python optimize_gemm.py --target "llvm -mcpu=cascadelake" --target_host "llvm -mcpu=cascadelake" --parallel 8 --timeout 20 --log gemm_config.log --dtype int32
run flextensor (skylake)
python optimize_gemm.py --target "llvm -mcpu=skylake-avx512" --target_host "llvm -mcpu=skylake-avx512" --parallel 8 --timeout 20 --log gemm_config.log
run test
python optimize_gemm.py --test gemm_optimize_log.txt
run baseline
python gemm_baseline.py --type numpy --number 100
run plot
python plot.py
Reference:
https://pku-ahs.github.io/tutorial/en/master/