當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Why is processing a sorted array faster than an unsorted array?

發(fā)布時間：2023/11/29 编程问答 32 豆豆

生活随笔收集整理的這篇文章主要介紹了 Why is processing a sorted array faster than an unsorted array? 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

這是我在逛 Stack Overflow 時遇見的一個高分問題：Why is processing a sorted array faster than an unsorted array?，我覺得這是一個非常好的用來講分支預測（Branch Prediction）的例子，分享給大家看看

一、問題引入

先看這個代碼：

#include <algorithm> #include <ctime> #include <iostream> #include <stdint.h>int main() {uint32_t arraySize = 20000;uint32_t data[arraySize];for (uint32_t i = 0; i < arraySize; ++ i) {data[i] = std::rand() % 256;}// !!! With this, the next loop runs fasterstd::sort(data, data + arraySize);clock_t start = clock();uint64_t sum = 0;for (uint32_t cnt = 0; cnt < 100000; ++ cnt) {for (uint32_t i = 0; i < arraySize; ++ i) {if (data[i] > 128) {sum += data[i];}}}double processTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;std::cout << "processTime: " << processTime << std::endl;std::cout << "sum: " << sum << std::endl;return 0; };

注意：這里特地沒有加隨機數種子是為了確保 data 數組中的偽隨機數始終不變，為接下來的對比分析做準備，盡可能減少實驗中的變量

我們編譯并運行這段代碼（gcc 版本 4.1.2，太高的話會被優(yōu)化掉）：

$ g++ a.cpp -o a -O3 $ ./a processTime: 1.78 sum: 191444000000

下面，把下面的這一行注釋掉，然后再編譯并運行：

std::sort(data, data + arraySize); $ g++ a.cpp -o b -O3 $ ./b processTime: 10.06 sum: 191444000000

注意到了嗎？去掉那一行排序的代碼后，整個計算時間被延長了十倍！

二、是 Cache Miss 導致的嗎？

答案顯然是否定的。cache miss 率并不會因為數組是否排序而改變，因為兩份代碼取數據的順序是一樣的，數據量大小是一樣的，數據布局也是一樣的，并且在同一臺機器上運行，并沒有任何差別，所以可以肯定的是：和 cache miss 無任何關系

為了驗證我們的分析，可以用 valgrind 提供的 cachegrind tool 查看 cache miss 率：

$ valgrind --tool=cachegrind ./a ==26548== Cachegrind, a cache and branch-prediction profiler ==26548== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al. ==26548== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info ==26548== Command: ./a ==26548== --26548-- warning: L3 cache found, using its data for the LL simulation. --26548-- warning: specified LL cache: line_size 64 assoc 20 total_size 15,728,640 --26548-- warning: simulated LL cache: line_size 64 assoc 30 total_size 15,728,640 processTime: 68.57 sum: 191444000000 ==26548== ==26548== I refs: 14,000,637,620 ==26548== I1 misses: 1,327 ==26548== LLi misses: 1,293 ==26548== I1 miss rate: 0.00% ==26548== LLi miss rate: 0.00% ==26548== ==26548== D refs: 2,001,434,596 (2,000,993,511 rd + 441,085 wr) ==26548== D1 misses: 125,115,133 ( 125,112,303 rd + 2,830 wr) ==26548== LLd misses: 7,085 ( 4,770 rd + 2,315 wr) ==26548== D1 miss rate: 6.3% ( 6.3% + 0.6% ) ==26548== LLd miss rate: 0.0% ( 0.0% + 0.5% ) ==26548== ==26548== LL refs: 125,116,460 ( 125,113,630 rd + 2,830 wr) ==26548== LL misses: 8,378 ( 6,063 rd + 2,315 wr) ==26548== LL miss rate: 0.0% ( 0.0% + 0.5% ) $ valgrind --tool=cachegrind ./b ==13898== Cachegrind, a cache and branch-prediction profiler ==13898== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al. ==13898== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info ==13898== Command: ./b ==13898== --13898-- warning: L3 cache found, using its data for the LL simulation. --13898-- warning: specified LL cache: line_size 64 assoc 20 total_size 15,728,640 --13898-- warning: simulated LL cache: line_size 64 assoc 30 total_size 15,728,640 processTime: 76.7 sum: 191444000000 ==13898== ==13898== I refs: 13,998,930,559 ==13898== I1 misses: 1,316 ==13898== LLi misses: 1,281 ==13898== I1 miss rate: 0.00% ==13898== LLi miss rate: 0.00% ==13898== ==13898== D refs: 2,000,938,800 (2,000,663,898 rd + 274,902 wr) ==13898== D1 misses: 125,010,958 ( 125,008,167 rd + 2,791 wr) ==13898== LLd misses: 7,083 ( 4,768 rd + 2,315 wr) ==13898== D1 miss rate: 6.2% ( 6.2% + 1.0% ) ==13898== LLd miss rate: 0.0% ( 0.0% + 0.8% ) ==13898== ==13898== LL refs: 125,012,274 ( 125,009,483 rd + 2,791 wr) ==13898== LL misses: 8,364 ( 6,049 rd + 2,315 wr) ==13898== LL miss rate: 0.0% ( 0.0% + 0.8% )

對比可以發(fā)現，他們倆的 cache miss rate 和 cache miss 數幾乎相同，因此確實和 cache miss 無關

三、Branch Prediction

使用到 valgrind 提供的 callgrind tool 可以查看分支預測失敗率：

$ valgrind --tool=callgrind --branch-sim=yes ./a ==29373== Callgrind, a call-graph generating cache profiler ==29373== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al. ==29373== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info ==29373== Command: ./a ==29373== ==29373== For interactive control, run 'callgrind_control -h'. processTime: 288.68 sum: 191444000000 ==29373== ==29373== Events : Ir Bc Bcm Bi Bim ==29373== Collected : 14000637633 4000864744 293254 23654 395 ==29373== ==29373== I refs: 14,000,637,633 ==29373== ==29373== Branches: 4,000,888,398 (4,000,864,744 cond + 23,654 ind) ==29373== Mispredicts: 293,649 ( 293,254 cond + 395 ind) ==29373== Mispred rate: 0.0% ( 0.0% + 1.7% )

可以看到，在計算 sum 之前對數組排序，分支預測失敗率非常低，幾乎相當于沒有失敗

$ valgrind --tool=callgrind --branch-sim=yes ./b ==23202== Callgrind, a call-graph generating cache profiler ==23202== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al. ==23202== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info ==23202== Command: ./b ==23202== ==23202== For interactive control, run 'callgrind_control -h'. processTime: 287.12 sum: 191444000000 ==23202== ==23202== Events : Ir Bc Bcm Bi Bim ==23202== Collected : 13998930783 4000477534 1003409950 23654 395 ==23202== ==23202== I refs: 13,998,930,783 ==23202== ==23202== Branches: 4,000,501,188 (4,000,477,534 cond + 23,654 ind) ==23202== Mispredicts: 1,003,410,345 (1,003,409,950 cond + 395 ind) ==23202== Mispred rate: 25.1% ( 25.1% + 1.7% )

而這個未排序的就不同了，分支預測失敗率達到了 25%。因此可以確定的是：兩份代碼在運行時 CPU 分支預測失敗率不同導致了運行時間的不同

四、分支預測

那么到底什么是分支預測，分支預測的策略是什么呢？這兩個問題我覺得 Mysticial 的回答解釋的非常好：

假設我們現在處于 1800 年代，那會長途通信或者無線通信還沒有出現。你是某個鐵路分叉口的操作員，當你正在打盹的時候，遠方傳來了火車轟隆隆的聲音。你知道又有一輛列車開過來了，但是你不知道它要走哪條路，因此列車不得不停下來，在得知它要去哪個方向后，你把開關撥向正確的位置，列車緩緩啟動駛向遠方。

但是列車很重，自身的慣性很大，停止和啟動都需要花很長很長的時間。有什么方法能讓列車更快的到達目的地嗎？有：你來猜測列車將駛向哪個方向。

如果你猜中了，列車繼續(xù)前進；如果沒有猜中：司機發(fā)現路不對后剎車、倒車、沖你發(fā)一頓火，最后你把開關撥到另一邊，然后司機啟動列車，走另一條路。

現在讓我們來看看那條 if 語句：

if (data[i] >= 128) {sum += data[i] }

現在假設你是 CPU，當遇到這個 if 語句時，接下來該做什么：把 data[i] 累加到 sum 上面還是什么都不做？

怎么辦？難道是暫停下來，等待 if 表達式算出結果，如果是 true 就執(zhí)行 sum += data[i]，否則什么也不做？

經過幾十年的發(fā)展，現代處理器異常復雜并擁有者超長的 pipeline，它需要花費很長的時間“暫停”和重新執(zhí)行命令，為了加快執(zhí)行速度，處理器需要猜測接下來要做什么，也就是說：你先忽略 if 表達式的結果，讓它一邊算去，你選擇其中一個分支繼續(xù)執(zhí)行下去。

如果你猜對了，程序繼續(xù)執(zhí)行；如果猜錯了，需要 flush pipeline、回滾到分支判斷那、選擇另一個分支執(zhí)行下去。

如果每次都猜中：程序執(zhí)行過程中永遠不會出現中途暫停的情況
如果大多數都猜錯了：你將消耗大量的時間在“暫停、回滾、重新執(zhí)行”上面

這就是分支預測。那么 CPU 在猜測接下來要執(zhí)行哪個分支時有什么策略嗎？當然是根據已有的經驗啦：根據歷史經驗尋找一個模式

如果過去 99% 的火車都走了左邊，你就猜測下次火車到來還是會走左邊；如果是左右交替著走，那么每次火車來的時候你把開關撥向另一邊就可以了；如果每三輛車走右邊后會有一輛車走左邊，那么你也對應的猜測并操作開關...

也就是說：從火車的行進方向歷史中找到一個固有的模式，然后按照這個模式猜測下次火車將走哪個方向。這種工作方式和處理器的分支預測器非常相似

大多數應用程序都有表現良好的分支選擇（讓 CPU 有跡可循）模式，因此現代分支預測器基本上都有著 90% 以上的命中率。但是當面臨有著無法識別的分支選擇模式時，分支預測器的命中率極度低下，毫無可用性可言，比如上面未排序的隨機數組 data

關于分支預測的更多解釋，感興趣的話大家可以看看維基百科的解釋：Branch predictor

轉載于:https://www.cnblogs.com/zhj5chengfeng/p/5662802.html

總結

以上是生活随笔為你收集整理的Why is processing a sorted array faster than an unsorted array?的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。