Deploying the yolov7_pose Model with TensorRT and the C++ API
Although the title says deploying the yolov7_pose model, the tutorial below can be used to deploy any PyTorch model with TensorRT.
Repository: https://github.com/WongKinYiu/yolov7/tree/pose
OS: Ubuntu 18.04
Driver / CUDA version: 11.4
During inference, TensorRT-based applications can run up to 40x faster than on a CPU-only platform. With TensorRT you can optimize neural network models trained in all major frameworks, calibrate them for lower precision with high accuracy, and finally deploy them to hyperscale data centers, embedded platforms, or automotive product platforms.
TensorRT is built on CUDA, NVIDIA's parallel programming model, and lets you use the libraries, development tools, and techniques in CUDA-X to optimize inference for every deep learning framework, targeting AI, autonomous machines, high-performance computing, and graphics.
TensorRT provides INT8 and FP16 optimizations for production deployment of many deep learning inference applications, such as video streaming, speech recognition, recommendation, and natural language processing. Reduced-precision inference significantly lowers application latency, which is exactly what many real-time, autonomous, and embedded applications require.
The main deployment steps are: convert the PyTorch model to an ONNX model, then convert the ONNX model to a TensorRT engine.
The steps look simple, but there are quite a few pitfalls.
1. Install TensorRT
First check your CUDA version: run nvidia-smi in cmd on Windows, or in a terminal on Ubuntu. In general, choose the newest release you can download, to avoid problems with operators that have not been implemented yet; I lost a whole day to this.
Then download the matching release from the official site: click Download, and register an NVIDIA account and log in if you do not have one yet. Official site: https://developer.nvidia.com/tensorrt
Agree to the license, then choose the package that matches your CUDA version. For example, mine is CUDA 11.4; on Linux the Tar package is usually the right choice.
Next, extract the tar (or zip) package to wherever you want to install it. The software works straight out of the extracted directory, no further installation is needed; all we have to do is add its directories to the environment variables.
Ubuntu: open ~/.bashrc with vim and append the following two lines to the end of the file.
export LD_LIBRARY_PATH=/home/ubuntu/mySoftware/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH
export PATH=/home/ubuntu/mySoftware/TensorRT-8.6.1.6/bin:$PATH
Replace the TensorRT path with the directory you extracted to, then source the file in the current terminal:
source ~/.bashrc
Then run trtexec directly. If it runs without errors, TensorRT is installed correctly.
~/Downloads trtexec
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec -h
=== Model Options ===
--uff=<file> UFF model
--onnx=<file> ONNX model
--model=<file> Caffe model (default = no model, random weights used)
--deploy=<file> Caffe prototxt file
--output=<name>[,<name>]* Output names (it can be specified multiple times); at least one output is
......
2. Convert the PyTorch model to ONNX
For the yolo project: there is a models/export.py in the project directory.
Open the file and look at the arguments; you will see the following parameter settings.
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='./yolov5s.pt', help='weights path')
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='image size')  # height, width
    parser.add_argument('--batch-size', type=int, default=1, help='batch size')
    parser.add_argument('--grid', action='store_true', help='export Detect() layer grid')
    parser.add_argument('--device', default='cpu', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--dynamic', action='store_true', help='dynamic ONNX axes')  # ONNX-only
    parser.add_argument('--simplify', action='store_true', help='simplify ONNX model')  # ONNX-only
    parser.add_argument('--export-nms', action='store_true', help='export the nms part in ONNX model')  # ONNX-only, #opt.grid has to be set True for nms export to work
    opt = parser.parse_args()
    opt.img_size *= 2 if len(opt.img_size) == 1 else 1  # expand
    print(opt)
    set_logging()
    t = time.time()
Set the arguments to match your own model. Note that if you have changed the number of output classes or the number of keypoints, you must edit the network code by hand before exporting the NMS layer: change the non_max_suppression arguments around line 361 of models/common.py.
class NMS(nn.Module):
    # Non-Maximum Suppression (NMS) module
    iou = 0.45  # IoU threshold
    classes = None  # (optional list) filter by class

    def __init__(self, conf=0.25, kpt_label=False):
        super(NMS, self).__init__()
        self.conf = conf
        self.kpt_label = kpt_label

    def forward(self, x):
        return non_max_suppression(x[0], conf_thres=self.conf, iou_thres=self.iou, classes=self.classes, kpt_label=self.kpt_label, nc=2, nkpt=3)


class NMS_Export(nn.Module):
    # Non-Maximum Suppression (NMS) module used while exporting ONNX model
    iou = 0.45  # IoU threshold
    classes = None  # (optional list) filter by class

    def __init__(self, conf=0.001, kpt_label=False):
        super(NMS_Export, self).__init__()
        self.conf = conf
        self.kpt_label = kpt_label

    def forward(self, x):
        return non_max_suppression_export(x[0], conf_thres=self.conf, iou_thres=self.iou, classes=self.classes, kpt_label=self.kpt_label, nc=2)
Change nc and nkpt to your own settings; in my case the number of classes is 2 and the number of keypoints is 3. Then export the model:
python models/export.py --img-size 960 --weights /home/ubuntu/GITHUG/yolov7_pose/runs/train/exp2/weights/best.pt --grid --export-nms --simplify
If all goes well, we get a model in ONNX format. We can open https://netron.app/ and load the ONNX model to view its graph.
What we care about are the model's input and output, and their shapes.
From the graph we can see that my model's input is named images with shape 1 x 3 x 960 x 960, and the output is named detections; its shape is not obvious yet. If the shape is unclear, we can run the model with onnxruntime to check it:
import onnxruntime
import numpy as np
import cv2

# Path to your ONNX model file
onnx_model_path = '/home/ubuntu/GITHUG/yolov7_pose/runs/train/exp2/weights/best.onnx'
# Create an ONNX Runtime inference session
sess = onnxruntime.InferenceSession(onnx_model_path)
# Get the input name and shape
input_name = sess.get_inputs()[0].name
input_shape = sess.get_inputs()[0].shape
# Path to the test image
image_path = '/home/ubuntu/GITHUG/yolov7_pose/501_png.rf.9cc0a917ca7972be6c8088aa9d17d651.jpg'
# Read the image with OpenCV
image = cv2.imread(image_path)
# Resize the image to the model's input size
resized_image = cv2.resize(image, (input_shape[3], input_shape[2]))
# Convert to float and normalize to [0, 1]
input_data = resized_image.astype(np.float32) / 255.0
# Rearrange to the NCHW layout the ONNX model expects
input_data = np.transpose(input_data, [2, 0, 1])
input_data = np.expand_dims(input_data, axis=0)
# Run inference
outputs = sess.run(None, {input_name: input_data})
# Print every model output and its shape
for i, output_data in enumerate(outputs):
    print(f"Output {i + 1}: {output_data}")
    print(f"Output {output_data.shape}")
Output 1: [[8.01661621e+02 1.53809937e+02 9.72689453e+02 3.77949707e+02
4.21597920e-02 0.00000000e+00 5.15294671e-02 8.84101624e+02
2.51692810e+02 9.91612077e-01 9.03469177e+02 1.68072296e+02
6.35425091e-01 8.85691345e+02 1.72709320e+02 7.30822206e-01]
[7.85901917e+02 1.61294067e+02 9.64655701e+02 3.66809448e+02
4.08335961e-02 1.00000000e+00 6.32926583e-01 8.77714966e+02
2.57085205e+02 9.89280879e-01 8.91954224e+02 1.80863663e+02
2.32283741e-01 8.78342041e+02 1.87161697e+02 5.20734370e-01]
[7.05231201e+02 3.90309601e+02 7.51886230e+02 4.35935760e+02
1.86153594e-02 1.00000000e+00 6.94520175e-01 7.35046814e+02
4.11621490e+02 7.23196447e-01 7.14923584e+02 4.14582092e+02
4.62090850e-01 7.09832214e+02 4.12042603e+02 2.80124098e-01]
[4.01937828e+01 4.64705994e+02 1.51267151e+02 6.35167419e+02
1.55489137e-02 1.00000000e+00 9.99976933e-01 8.51227875e+01
5.72096252e+02 9.97074127e-01 8.59449158e+01 4.89000427e+02
9.83235717e-01 8.48072968e+01 5.18143494e+02 9.95639443e-01]
[4.67657043e+02 2.47014786e+02 6.09315125e+02 4.11179565e+02
1.50994565e-02 0.00000000e+00 1.29642010e-01 5.45577820e+02
3.71885773e+02 9.93896723e-01 5.56157104e+02 3.50972717e+02
9.97142434e-01 5.54454590e+02 3.20836670e+02 9.76849675e-01]
[3.69356445e+02 1.81159134e+01 4.91651611e+02 1.81579437e+02
1.44530777e-02 1.00000000e+00 9.98439074e-01 4.16761169e+02
1.16163483e+02 9.97292042e-01 4.29588745e+02 2.69206352e+01
9.79286790e-01 4.28487366e+02 8.01969910e+01 9.97563720e-01]
[7.12836548e+02 3.89805634e+02 7.66137817e+02 4.36001556e+02
1.32421134e-02 0.00000000e+00 2.13130921e-01 7.40284363e+02
4.09640594e+02 7.56286979e-01 7.18195129e+02 4.12563293e+02
1.05279446e-01 7.11785156e+02 4.14483521e+02 1.00254148e-01]
[7.01546204e+02 3.92902222e+02 7.31227966e+02 4.25415100e+02
1.30005283e-02 1.00000000e+00 9.94012475e-01 7.22401733e+02
4.12053406e+02 4.85429347e-01 7.12319214e+02 4.13364197e+02
7.06610680e-01 7.13084656e+02 4.11362488e+02 4.67233360e-01]
[6.80663696e+02 4.66796997e+02 7.09215454e+02 4.98112915e+02
1.06324852e-02 0.00000000e+00 6.49383068e-02 6.97597473e+02
4.87214142e+02 9.42029715e-01 6.90804749e+02 4.85028137e+02
9.82081532e-01 6.85866089e+02 4.70633820e+02 9.92424369e-01]]
Output (9, 16)
From the output above we can see that my model's output is 1 x 9 x 16 (printed as (9, 16)). Because the number of boxes left after the NMS layer is not fixed, the general shape is 1 x N x 16. Looking closely at the 16-dimensional rows, each detection is laid out as
[x1, y1, x2, y2, conf, prob1, prob2, kpt1_x, kpt1_y, kpt1_conf, kpt2_x, kpt2_y, kpt2_conf, kpt3_x, kpt3_y, kpt3_conf]
The first four values are the corners of the detection box, followed by the box confidence, the class probabilities, and then each keypoint together with its own confidence.
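As a quick sanity check of this layout, here is a minimal Python sketch that decodes each row of the onnxruntime output from the script above into named fields. The field order is my reading of the 16 values described here; adjust it if your export uses a different nc or nkpt.

# Minimal sketch: decode the (N, 16) detections produced by the onnxruntime run above.
# Assumes 2 classes and 3 keypoints per detection, matching my export settings.
CONF_THRESH = 0.2
detections = outputs[0].reshape(-1, 16)
for row in detections:
    x1, y1, x2, y2, conf = row[:5]
    if conf < CONF_THRESH:
        continue  # drop low-confidence boxes
    class_probs = row[5:7]               # prob1, prob2
    kpts = row[7:16].reshape(3, 3)       # (kpt_x, kpt_y, kpt_conf) for each keypoint
    print(f"box=({x1:.1f}, {y1:.1f}, {x2:.1f}, {y2:.1f}) conf={conf:.3f} "
          f"cls={int(np.argmax(class_probs))} kpts={kpts.tolist()}")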
3. Convert the ONNX model into a TensorRT .engine
Run the following command directly and wait for the conversion to finish.
trtexec --onnx=yolov7.onnx --fp16 --saveEngine=yolov7.engine
If it fails with an error, for example an unsupported operator, try updating TensorRT to the latest version.
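Before moving on to C++, you can optionally sanity-check the generated engine with the TensorRT Python bindings. This is a minimal sketch; it assumes the tensorrt Python package that ships with the TensorRT tarball is installed, and that yolov7.engine is the file produced by the command above.

import tensorrt as trt

# Minimal sketch: deserialize the engine and list its input/output bindings.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open('yolov7.engine', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
for i in range(engine.num_bindings):
    kind = 'input' if engine.binding_is_input(i) else 'output'
    print(kind, engine.get_binding_name(i), tuple(engine.get_binding_shape(i)))

If the bindings print as images and detections with the expected shapes, the engine is ready for the C++ side.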
4. C++ deployment
#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
#include <cassert>
#include <cstring>
#include <opencv2/opencv.hpp>
#include <NvInfer.h>
#include <cuda_runtime_api.h>

#define INPUT_W 960
#define INPUT_H 960
#define DEVICE 0  // GPU id
#define CONF_THRESH 0.2
using namespace nvinfer1;
class Logger : public ILogger {
    void log(Severity severity, const char *msg) noexcept override {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;

#define CHECK(status) \
    do\
    {\
        auto ret = (status);\
        if (ret != 0)\
        {\
            std::cerr << "Cuda failure: " << ret << std::endl;\
            abort();\
        }\
    } while (0)
// Convert an HWC BGR image to a CHW RGB float blob normalized to [0, 1].
float *blobFromImage(cv::Mat &img) {
    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);
    float *blob = new float[img.total() * 3];
    int channels = 3;
    int img_h = img.rows;
    int img_w = img.cols;
    for (int c = 0; c < channels; c++) {
        for (int h = 0; h < img_h; h++) {
            for (int w = 0; w < img_w; w++) {
                blob[c * img_w * img_h + h * img_w + w] =
                        (((float) img.at<cv::Vec3b>(h, w)[c]) / 255.0f);
            }
        }
    }
    return blob;
}

// Letterbox resize: scale the image into INPUT_W x INPUT_H while keeping the
// aspect ratio, padding the remainder with the value 114.
cv::Mat static_resize(cv::Mat &img) {
    float r = std::min(INPUT_W / (img.cols * 1.0), INPUT_H / (img.rows * 1.0));
    int unpad_w = r * img.cols;
    int unpad_h = r * img.rows;
    cv::Mat re(unpad_h, unpad_w, CV_8UC3);
    cv::resize(img, re, re.size());
    cv::Mat out(INPUT_H, INPUT_W, CV_8UC3, cv::Scalar(114, 114, 114));  // (rows, cols)
    re.copyTo(out(cv::Rect(0, 0, re.cols, re.rows)));
    return out;
}
const char *INPUT_BLOB_NAME = "images";
const char *OUTPUT_BLOB_NAME = "detections";
static Logger gLogger;
static constexpr int MAX_OUTPUT_BBOX_COUNT = 100;
static constexpr int CLASS_NUM = 2;
static constexpr int LOCATIONS = 4;
static constexpr int KEY_POINTS_NUM = 3;
struct Keypoint {
    float x;
    float y;
    float kpt_conf;
};

struct alignas(float) Detection {
    // x1 y1 x2 y2 (top-left and bottom-right corners)
    float bbox[LOCATIONS];
    float conf;  // bbox_conf * cls_conf
    float prob[CLASS_NUM];  // probabilities for each class
    // 3 keypoints
    Keypoint kpts[KEY_POINTS_NUM];
};
void
doInference(IExecutionContext &context, float *input, float *output, const int output_size, const int input_shape) {
    const ICudaEngine &engine = context.getEngine();
    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void *buffers[2];
    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    assert(engine.getBindingDataType(inputIndex) == nvinfer1::DataType::kFLOAT);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
    assert(engine.getBindingDataType(outputIndex) == nvinfer1::DataType::kFLOAT);
    // int mBatchSize = engine.getMaxBatchSize();
    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[inputIndex], input_shape * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], output_size * sizeof(float)));
    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, input_shape * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueueV2(buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], output_size * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);
    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}
static constexpr int DETECTION_SIZE = sizeof(Detection) / sizeof(float);
static void
postprocess_decode(float *feat_blob, float prob_threshold, std::vector<Detection> &objects_map) {
    for (int i = 0; i < MAX_OUTPUT_BBOX_COUNT; i++) {
        int base_index = i * DETECTION_SIZE;  // Calculate the base index for the current detection
        if (feat_blob[base_index + LOCATIONS] <= prob_threshold)
            continue;
        Detection det;
        // Copy the detection information from feat_blob to the Detection structure
        memcpy(&det, &feat_blob[base_index], DETECTION_SIZE * sizeof(float));
        objects_map.push_back(det);
    }
}
int main() {
    char *trtModelStream{nullptr};
    cudaSetDevice(DEVICE);
    size_t size{0};
    const char *engine_file_path = "/home/ubuntu/GITHUG/yolov7_pose/yolov7.engine";
    // Read the serialized engine file into memory
    std::ifstream file(engine_file_path, std::ios::binary);
    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        assert(trtModelStream);
        file.read(trtModelStream, size);
        file.close();
    }
    // Deserialize the engine and create an execution context
    IRuntime *runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size);
    assert(engine != nullptr);
    IExecutionContext *context = engine->createExecutionContext();
    assert(context != nullptr);
    delete[] trtModelStream;
    // auto out_dims = engine->getBindingDimensions(1);
    int input_size = 1 * 3 * 960 * 960;
    int output_size = MAX_OUTPUT_BBOX_COUNT * 16 * 1;
    static float *prob = new float[output_size];
    const char *input_image_path = "/home/ubuntu/GITHUG/yolov7_pose/501_png.rf.9cc0a917ca7972be6c8088aa9d17d651.jpg";
    cv::Mat img = cv::imread(input_image_path);
    cv::Mat pr_img = static_resize(img);
    float *blob;
    // cv::imshow("Image", pr_img);
    blob = blobFromImage(pr_img);
    cv::waitKey(200);
    // Close any open windows
    cv::destroyAllWindows();
    doInference(*context, blob, prob, output_size, input_size);
    std::vector<Detection> objects_map;
    // Debug print of the raw output values
    for (int i = 0; i < prob[0] && i < MAX_OUTPUT_BBOX_COUNT; i++) {
        std::cout << ": " << prob[i] << std::endl;
    }
    postprocess_decode(prob, CONF_THRESH, objects_map);
    float r_w = INPUT_W / (img.cols * 1.0);
    float r_h = INPUT_H / (img.rows * 1.0);
    cv::cvtColor(pr_img, pr_img, cv::COLOR_RGB2BGR);
    for (const auto &det: objects_map) {
        // Access other information in the Detection structure as needed
        // Example: print bbox coordinates
        std::cout << " Bbox: ";
        for (int i = 0; i < LOCATIONS; i++) {
            std::cout << det.bbox[i] << " ";
        }
        // The letterbox scale used during preprocessing
        float r = 0.0;
        if (img.rows <= img.cols) {
            r = r_w;
        } else {
            r = r_h;
        }
        // Map the box back to the original image coordinates and draw it
        cv::Point pt1(det.bbox[0] / r, det.bbox[1] / r);
        cv::Point pt2(det.bbox[2] / r, det.bbox[3] / r);
        cv::rectangle(img, pt1, pt2, cv::Scalar(0, 255, 0), 2);
        cv::Point point1(det.kpts[0].x / r, det.kpts[0].y / r);
        cv::Point point2(det.kpts[1].x / r, det.kpts[1].y / r);
        cv::Point point3(det.kpts[2].x / r, det.kpts[2].y / r);
        // Draw the keypoint segments; cv::Scalar is the color in (B, G, R)
        cv::line(img, point1, point2, cv::Scalar(0, 0, 255), 2);  // red
        cv::line(img, point2, point3, cv::Scalar(255, 0, 0), 2);  // blue
        cv::imshow("Rectangle", img);
        cv::waitKey(0);
        std::cout << std::endl;
    }
}
That is my demo and the final result.
The key part is the code that parses the model output; feel free to use it as a reference.
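For reference, a build command along these lines should work. This is a rough sketch: I assume the source above is saved as main.cpp, the TensorRT paths are the ones from my install earlier and must be changed to yours, and OpenCV is discoverable through pkg-config.
g++ -O2 -o yolov7_pose_trt main.cpp -I/home/ubuntu/mySoftware/TensorRT-8.6.1.6/include -I/usr/local/cuda/include -L/home/ubuntu/mySoftware/TensorRT-8.6.1.6/lib -L/usr/local/cuda/lib64 -lnvinfer -lcudart `pkg-config --cflags --libs opencv4`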
Summary
We converted the PyTorch yolov7_pose model to ONNX, built a TensorRT engine from it with trtexec, and ran inference with the TensorRT C++ API; the same pipeline works for other PyTorch models.