OpenCV Program Efficiency Optimization Methods (1)
Traversing Pixels with Pointers
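In OpenCV, images are stored in the Mat class, which offers several ways to access pixel values. The two most common are the at() method and the ptr() pointer method, and of the two ptr() is somewhat faster. The gain is most visible in debug builds and is negligible in release builds. The reason is that at() performs one more check than ptr() before returning, and these CV_DbgAssert checks are only active in debug builds: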
// at() implementation: five assertions run before the pixel reference is returned
template<typename _Tp> inline
_Tp& Mat::at(int i0, int i1)
{
    CV_DbgAssert(dims <= 2);
    CV_DbgAssert(data);
    CV_DbgAssert((unsigned)i0 < (unsigned)size.p[0]);
    CV_DbgAssert((unsigned)(i1 * DataType<_Tp>::channels) < (unsigned)(size.p[1] * channels()));
    CV_DbgAssert(CV_ELEM_SIZE1(traits::Depth<_Tp>::value) == elemSize1());
    return ((_Tp*)(data + step.p[0] * i0))[i1];
}

// ptr() implementation: four assertions run before the pointer is returned
template<typename _Tp> inline
_Tp* Mat::ptr(int i0, int i1)
{
    CV_DbgAssert(dims >= 2);
    CV_DbgAssert(data);
    CV_DbgAssert((unsigned)i0 < (unsigned)size.p[0]);
    CV_DbgAssert((unsigned)i1 < (unsigned)size.p[1]);
    return (_Tp*)(data + i0 * step.p[0] + i1 * step.p[1]);
}

Accessing pixels through ptr() is therefore somewhat faster in debug mode.
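To make the row-pointer idea concrete without depending on OpenCV, here is a minimal sketch. GrayImage, row() and InvertInPlace are hypothetical names introduced only for illustration (they are not OpenCV types or functions); the struct merely imitates a single-channel 8-bit image whose rows may be padded, in the spirit of Mat's step. The point it shows is that the row address is computed once per row, after which the inner loop is plain array indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a single-channel 8-bit image (not an OpenCV type).
// `step` is the number of bytes per row and exceeds `cols` when rows are padded.
struct GrayImage {
    int rows;
    int cols;
    std::size_t step;
    std::vector<std::uint8_t> data;

    GrayImage(int r, int c, std::size_t padding = 0)
        : rows(r), cols(c), step(static_cast<std::size_t>(c) + padding),
          data(static_cast<std::size_t>(r) * (static_cast<std::size_t>(c) + padding), 0) {
        assert(r >= 0 && c >= 0);
    }

    // Pointer to the first pixel of row i.
    std::uint8_t* row(int i) {
        assert(i >= 0 && i < rows);
        return data.data() + static_cast<std::size_t>(i) * step;
    }
};

// Invert every pixel: one address computation per row, then plain indexing.
inline void InvertInPlace(GrayImage& img) {
    for (int i = 0; i < img.rows; ++i) {
        std::uint8_t* p = img.row(i);
        for (int j = 0; j < img.cols; ++j) {
            p[j] = static_cast<std::uint8_t>(255 - p[j]);
        }
    }
}
```

Because the loop stops at cols rather than step, any padding bytes at the end of a row are left untouched.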
#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <opencv2/opencv.hpp>

// Gamma-style mapping used as the per-pixel workload in all tests.
// Note: this evaluates src * (src / 255)^gamma, so the effective exponent is 1 + gamma.
uchar GetGammaTrans(const uchar src)
{
    double gamma = 0.7;
    uchar result = src;
    result = std::min(int(result * std::pow(result / 255.0, gamma)), 255);
    return result;
}

// Access pixels by pointer
void AccessPixelsByPointer()
{
    const int rows = 2000;
    const int cols = 2000;
    cv::Mat mono = cv::Mat::zeros(rows, cols, CV_8UC1);

    // access by at() (the write uses at(); the read still goes through ptr())
    auto t0 = std::chrono::system_clock::now();
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            mono.at<uchar>(i, j) = GetGammaTrans(*mono.ptr<uchar>(i, j));
        }
    }
    auto t1 = std::chrono::system_clock::now();

    // access by pointer
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            *mono.ptr<uchar>(i, j) = GetGammaTrans(*mono.ptr<uchar>(i, j));
        }
    }
    auto t2 = std::chrono::system_clock::now();

    // access by row-pointer (only the write uses row_ptr; the read still calls ptr() per pixel)
    for (int i = 0; i < rows; i++)
    {
        uchar* row_ptr = mono.ptr<uchar>(i, 0);
        for (int j = 0; j < cols; j++)
        {
            row_ptr[j] = GetGammaTrans(*mono.ptr<uchar>(i, j));
        }
    }
    auto t3 = std::chrono::system_clock::now();

    std::cout << "Mono image test: \n";
    std::cout << "Access by at() cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms.\n";
    std::cout << "Access by ptr() cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms.\n";
    std::cout << "Access by row-ptr cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count() << " ms.\n";
}

Test results:
// debug
Mono image test:
Access by at() cost 346 ms.
Access by ptr() cost 317 ms.
Access by row-ptr cost 205 ms.

// release
Mono image test:
Access by at() cost 44 ms.
Access by ptr() cost 44 ms.
Access by row-ptr cost 55 ms.

Using the LUT Lookup-Table Function
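Repetitive per-pixel operations can be sped up with a lookup table, and OpenCV provides a utility function, cv::LUT, for this purpose. The idea is to compute every possible result once, store the results in a table, and afterwards look values up instead of recomputing them each time. The approach suits fixed gray-level mappings such as gamma transformation and contrast stretching.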
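The table idea itself can be sketched in plain C++ without OpenCV. BuildGammaTable and ApplyTable below are hypothetical helper names used only for illustration, and the mapping is the conventional gamma curve 255 * (v / 255)^gamma, which is an assumption made here and differs from the GetGammaTrans helper used in this article's benchmarks. Since an 8-bit pixel has only 256 possible values, std::pow runs 256 times in total no matter how large the image is.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Precompute out = round(255 * (in / 255)^gamma) for all 256 input values.
inline std::array<std::uint8_t, 256> BuildGammaTable(double gamma) {
    assert(gamma > 0.0);
    std::array<std::uint8_t, 256> table{};
    for (int v = 0; v < 256; ++v) {
        const double mapped = 255.0 * std::pow(v / 255.0, gamma);
        table[v] = static_cast<std::uint8_t>(std::lround(mapped));
    }
    return table;
}

// Map every pixel through the table: one array read per pixel, no math.
inline void ApplyTable(std::vector<std::uint8_t>& pixels,
                       const std::array<std::uint8_t, 256>& table) {
    for (std::uint8_t& p : pixels) {
        p = table[p];
    }
}
```

A caller would build the table once and reuse it for every image that needs the same mapping, which is the role the 1 x 256 Mat plays for cv::LUT.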
// Use Lut-table
void UseLutTable()
{
    const int rows = 2000;
    const int cols = 2000;
    cv::Mat mono = cv::Mat::zeros(rows, cols, CV_8UC1);

    auto t0 = std::chrono::system_clock::now();

    // Compute the value every time
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            *mono.ptr<uchar>(i, j) = GetGammaTrans(*mono.ptr<uchar>(i, j));
        }
    }
    auto t1 = std::chrono::system_clock::now();

    // Use Lut-table: fill a 1 x 256 table once, then map the whole image with cv::LUT
    // (the measured time includes building the table and allocating mono_src)
    cv::Mat table = cv::Mat(1, 256, CV_8UC1);
    for (int i = 0; i < 256; i++)
    {
        *table.ptr<uchar>(0, i) = GetGammaTrans(static_cast<uchar>(i));
    }
    cv::Mat mono_src = cv::Mat::zeros(rows, cols, CV_8UC1);
    cv::LUT(mono_src, table, mono);
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Lut-table test: \n";
    std::cout << "Compute the value everytime cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms.\n";
    std::cout << "Use Lut-table cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms.\n";
}

Test results:
// debug
Lut-table test:
Compute the value everytime cost 279 ms.
Use Lut-table cost 19 ms.

// release
Lut-table test:
Compute the value everytime cost 44 ms.
Use Lut-table cost 15 ms.

Using the parallel_for_ Parallel Framework
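For loops that traverse pixels, OpenCV also provides a parallel loop framework, cv::parallel_for_. It makes full use of the parallel processing power of multi-core processors and can greatly improve throughput. When using it, parallelize the outer loop rather than the inner one: parallelizing the inner loop launches the parallel machinery once per row, and that overhead can make the program slower instead of faster.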
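The outer-loop rule can be illustrated with standard C++ threads. The sketch below is a simplified, hypothetical analogue of what a parallel-for does (ParallelRows and BrightenParallel are names invented for illustration, and this is not how cv::parallel_for_ is implemented): the row range is split into a few contiguous bands, and each band runs the complete inner loop on its own thread, so the threading cost is paid once per band rather than once per row.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Split [0, rows) into contiguous bands and run body(begin, end) on each band
// in its own thread. Hypothetical helper, not an OpenCV function.
inline void ParallelRows(int rows, int num_threads,
                         const std::function<void(int, int)>& body) {
    assert(rows >= 0 && num_threads > 0);
    const int band = (rows + num_threads - 1) / num_threads;  // rows per band, rounded up
    std::vector<std::thread> workers;
    for (int begin = 0; begin < rows; begin += band) {
        const int end = std::min(begin + band, rows);
        workers.emplace_back(body, begin, end);
    }
    for (std::thread& w : workers) {
        w.join();
    }
}

// Add delta to every pixel of a row-major image, saturating at 255.
// Each thread owns whole rows, so no two threads write the same pixel.
inline void BrightenParallel(std::vector<std::uint8_t>& img, int rows, int cols,
                             int delta, int num_threads) {
    ParallelRows(rows, num_threads, [&](int begin, int end) {
        for (int i = begin; i < end; ++i) {
            std::uint8_t* row = img.data() + static_cast<std::size_t>(i) * cols;
            for (int j = 0; j < cols; ++j) {
                row[j] = static_cast<std::uint8_t>(std::min(row[j] + delta, 255));
            }
        }
    });
}
```

On some toolchains this needs to be linked with thread support (for example -pthread with GCC).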
void ParalleyFor()
{
    const int rows = 2000;
    const int cols = 2000;
    cv::Mat mono = cv::Mat::zeros(rows, cols, CV_8UC1);

    auto t0 = std::chrono::system_clock::now();

    // Compute the value every time (serial baseline)
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            *mono.ptr<uchar>(i, j) = GetGammaTrans(*mono.ptr<uchar>(i, j));
        }
    }
    auto t1 = std::chrono::system_clock::now();

    // Use parallel-for framework on the outer loop: rows are split across threads
    cv::parallel_for_(cv::Range(0, rows), [&](const cv::Range& range)
        {
            for (int i = range.start; i < range.end; i++)
            {
                for (int j = 0; j < cols; j++)
                {
                    *mono.ptr<uchar>(i, j) = GetGammaTrans(*mono.ptr<uchar>(i, j));
                }
            }
        });
    auto t2 = std::chrono::system_clock::now();

    // Use parallel-for framework on the inner loop: parallel_for_ is called once per row
    for (int i = 0; i < rows; i++)
    {
        cv::parallel_for_(cv::Range(0, cols), [&](const cv::Range& range)
            {
                for (int j = range.start; j < range.end; j++)
                {
                    *mono.ptr<uchar>(i, j) = GetGammaTrans(*mono.ptr<uchar>(i, j));
                }
            });
    }
    auto t3 = std::chrono::system_clock::now();

    std::cout << "Paralley-for test: \n";
    std::cout << "Compute the value everytime cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms.\n";
    std::cout << "Paralley-for outside loop cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms.\n";
    std::cout << "Paralley-for inside loop cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count() << " ms.\n";
}

Test results:
// debug
Paralley-for test:
Compute the value everytime cost 269 ms.
Paralley-for outside loop cost 51 ms.
Paralley-for inside loop cost 861 ms.

// release
Paralley-for test:
Compute the value everytime cost 46 ms.
Paralley-for outside loop cost 11 ms.
Paralley-for inside loop cost 433 ms.

Loop Unrolling
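Loop unrolling is an optimization that reduces the number of loop iterations by increasing the step of the loop counter. Fewer iterations mean fewer loop-condition branches for the CPU to predict, which can improve efficiency. The downsides are larger code and reduced readability.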
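One detail worth spelling out is what happens when the element count is not a multiple of the unroll factor. The following is a minimal sketch in plain C++ (SumUnrolled4 is a hypothetical name chosen for illustration): the main loop handles four elements per iteration, and a short tail loop picks up the zero to three elements that remain, so no read ever goes past the end of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum all bytes with the loop unrolled by a factor of four.
inline std::uint64_t SumUnrolled4(const std::vector<std::uint8_t>& v) {
    const std::size_t n = v.size();
    const std::size_t main_end = n - n % 4;  // largest multiple of 4 not exceeding n
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    // Main loop: one loop-condition check per four elements.
    for (; i < main_end; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    std::uint64_t sum = s0 + s1 + s2 + s3;
    // Tail loop: the remaining n % 4 elements.
    for (; i < n; ++i) {
        sum += v[i];
    }
    assert(i == n);
    return sum;
}
```

Optimizing compilers often unroll simple loops like this on their own, so a hand-unrolled version is worth keeping only if measurement shows a gain.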
// Loop unrolling
void LoopUnrolling()
{
    const int rows = 2000;
    const int cols = 2000;
    cv::Mat mono = cv::Mat::zeros(rows, cols, CV_8UC1);

    auto t0 = std::chrono::system_clock::now();

    // Origin loop
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            mono.ptr<uchar>(i, j)[0] = GetGammaTrans(*mono.ptr<uchar>(i, j));
        }
    }
    auto t1 = std::chrono::system_clock::now();

    // Loop unrolling: four pixels per iteration (assumes cols is a multiple of 4)
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j += 4)
        {
            mono.ptr<uchar>(i, j)[0] = GetGammaTrans(*mono.ptr<uchar>(i, j));
            mono.ptr<uchar>(i, j)[1] = GetGammaTrans(*mono.ptr<uchar>(i, j + 1));
            mono.ptr<uchar>(i, j)[2] = GetGammaTrans(*mono.ptr<uchar>(i, j + 2));
            mono.ptr<uchar>(i, j)[3] = GetGammaTrans(*mono.ptr<uchar>(i, j + 3));
        }
    }
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Loop unrolling test: \n";
    std::cout << "Origin loop cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms.\n";
    std::cout << "Unrolling loop cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms.\n";
}

Test results:
// debug
Loop unrolling test:
Origin loop cost 282 ms.
Unrolling loop cost 274 ms.

// release
Loop unrolling test:
Origin loop cost 46 ms.
Unrolling loop cost 46 ms.

Memory Block Tiling
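Tiling exploits the CPU cache. For operations that read and write large amounts of memory, accessing the data block by block can raise the cache hit rate and therefore improve performance; the tiling technique is built on this principle. In most situations, however, the programmer does not need to tile by hand: modern compilers are very capable and may apply loop transformations of their own at release optimization levels, and a plain row-by-row traversal is already cache-friendly. Manual tiling is therefore generally not recommended, and is worth trying only when the data-processing logic is very complex.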
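Tiling tends to matter when one side of a copy is accessed with a large stride, and matrix transposition is the usual textbook case. The sketch below is a plain C++ illustration under that assumption (TransposeTiled is a hypothetical name, and the default tile size of 32 is an arbitrary choice, not a tuned value). It also shows a detail that is easy to get wrong: the tile bounds are clamped, so the matrix dimensions do not have to be multiples of the tile size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Transpose a rows x cols row-major matrix tile by tile.
// The result is a cols x rows row-major matrix.
inline std::vector<std::uint8_t> TransposeTiled(const std::vector<std::uint8_t>& src,
                                                std::size_t rows, std::size_t cols,
                                                std::size_t tile = 32) {
    assert(tile > 0 && src.size() == rows * cols);
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t r0 = 0; r0 < rows; r0 += tile) {
        const std::size_t r1 = std::min(r0 + tile, rows);  // clamp at the bottom edge
        for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
            const std::size_t c1 = std::min(c0 + tile, cols);  // clamp at the right edge
            // Within one tile both the reads and the strided writes stay in a small region.
            for (std::size_t r = r0; r < r1; ++r) {
                for (std::size_t c = c0; c < c1; ++c) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
    return dst;
}
```

Whether this beats the untiled double loop depends on matrix size and cache sizes, so it should be measured rather than assumed.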
// Block tile
void BlockTile()
{
    const int rows = 2400;
    const int cols = 2400;
    cv::Mat mono = cv::Mat::zeros(rows, cols, CV_8UC1);

    auto t0 = std::chrono::system_clock::now();

    // Origin loop
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            *mono.ptr<uchar>(i, j) = 0;
        }
    }
    auto t1 = std::chrono::system_clock::now();

    // Block tile: visit the image in 32 x 8 blocks
    // (assumes rows and cols are multiples of the tile sizes)
    const int tile_row = 32;
    const int tile_col = 8;
    for (int r = 0; r < rows; r += tile_row)
    {
        for (int c = 0; c < cols; c += tile_col)
        {
            for (int i = r; i < r + tile_row; i++)
            {
                for (int j = c; j < c + tile_col; j++)
                {
                    *mono.ptr<uchar>(i, j) = 0;
                }
            }
        }
    }
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Block tile test: \n";
    std::cout << "Origin loop cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms.\n";
    std::cout << "Block tile cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms.\n";
}

Test results:
// debug
Block tile test:
Origin loop cost 114 ms.
Block tile cost 115 ms.

// release
Block tile test:
Origin loop cost 4 ms.
Block tile cost 4 ms.

SIMD Instruction-Set Optimization
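CPU instruction-set optimization is another effective way to speed up image processing. SIMD instructions let a single instruction operate on several data elements at once, which is a hardware-level form of parallelism. The intrinsics differ from one CPU architecture to another, so OpenCV provides a unified instruction-set optimization framework; code written against it can be accelerated on x86, ARM, and other platforms.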
#include <opencv2/core/hal/intrin.hpp>

// Universal Simd
void UniversalSimd()
{
    const int rows = 2400;
    const int cols = 2400;
    cv::Mat mono = cv::Mat::zeros(rows, cols, CV_8UC1);

    auto t0 = std::chrono::system_clock::now();

    // Origin loop
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            *mono.ptr<uchar>(i, j) = 0;
        }
    }
    auto t1 = std::chrono::system_clock::now();

    // simd: process one full vector register of pixels per iteration
    // (assumes cols is a multiple of the lane count; there is no tail handling)
    const int step = cv::v_uint8::nlanes;
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j += step)
        {
            cv::v_uint8 v_src = cv::v_load(mono.ptr<uchar>(i, j));  // load `step` pixels
            v_src = cv::v_setzero_u8();                             // the loaded value is overwritten with zeros
            cv::v_store(mono.ptr<uchar>(i, j), v_src);              // store `step` pixels
        }
    }
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Simd test: \n";
    std::cout << "Origin loop cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms.\n";
    std::cout << "Simd cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms.\n";
}

Test results:
// debug
Simd test:
Origin loop cost 114 ms.
Simd cost 21 ms.

// release
Simd test:
Origin loop cost 5 ms.
Simd cost 0 ms.

The G-API Optimization Framework
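G-API is a module introduced in OpenCV 4.x. It optimizes programs with a graph-based approach: each processing step is abstracted as a node and added to a processing pipeline (graph), and once the whole computation has been described as a directed graph, it is executed in one go.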
#include <opencv2/gapi.hpp>
#include <opencv2/gapi/core.hpp>
#include <opencv2/gapi/imgproc.hpp>

void GraphApi()
{
    const int rows = 4000;
    const int cols = 4000;
    // Note: apart from the circle, the pixel data is left uninitialized
    cv::Mat color = cv::Mat(rows, cols, CV_8UC3);
    cv::circle(color, { 1000, 1000 }, 500, { 255, 200, 100 }, 3);

    auto t0 = std::chrono::system_clock::now();

    // Conventional pipeline: every step runs immediately and produces an intermediate Mat
    cv::Mat img_resize;
    cv::resize(color, img_resize, cv::Size(), 0.5, 0.5);
    cv::Mat img_gray;
    cv::cvtColor(img_resize, img_gray, cv::COLOR_BGR2GRAY);
    cv::Mat img_blur;
    cv::blur(img_gray, img_blur, cv::Size(3, 3));
    cv::Mat edge1;
    cv::Canny(img_blur, edge1, 32, 128, 3);
    auto t1 = std::chrono::system_clock::now();

    // G-API pipeline: describe the graph first, then run it with apply()
    // Note: the blur kernel here is 5 x 5, whereas the conventional pipeline uses 3 x 3
    cv::GMat in;
    cv::GMat vga = cv::gapi::resize(in, cv::Size(), 0.5, 0.5);
    cv::GMat gray = cv::gapi::BGR2Gray(vga);
    cv::GMat blurred = cv::gapi::blur(gray, cv::Size(5, 5));
    cv::GMat out = cv::gapi::Canny(blurred, 32, 128, 3);
    cv::GComputation ac(in, out);
    cv::Mat edge2;
    ac.apply(color, edge2);
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Gapi test: \n";
    std::cout << "Origin proccess cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms.\n";
    std::cout << "Gapi proccess cost "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms.\n";
}

Test results:
// debug
Gapi test:
Origin proccess cost 180 ms.
Gapi proccess cost 203 ms.

// release
Gapi test:
Origin proccess cost 23 ms.
Gapi proccess cost 27 ms.

This article was published by 芒果浩明. Please credit the source when reposting.
Article link: https://mangoroom.cn/opencv/optimization-methods-in-opencv-1.html