[C++]怎么样实现一个较快的Hash Table
我們服務(wù)器一直在用boost/sgl stl的hash table,但是從來沒有考慮過其中的效率問題,雖然hash_map/unordered_map跑的可能真的比map快一些,可能應(yīng)該不是你理解的那么快.其實(shí)他可以更快一些!!!
當(dāng)我自己嘗試著實(shí)現(xiàn)了一個(gè)hash table之后,我發(fā)現(xiàn)確實(shí)如此.這篇文章也是來說說,如何實(shí)現(xiàn)較快的一個(gè).
通常的hash table都是用開鏈法,開放地址法來解決沖突.開鏈法是總?cè)菀讓?shí)現(xiàn)的一個(gè),而且因?yàn)樾史€(wěn)定,被加入了C++11,取名unordered_map.不過效率實(shí)在不咋地.
開放地址法的hash table,我是從google-sparsehash里面注意到的,雖然數(shù)據(jù)結(jié)構(gòu),算法導(dǎo)論都會(huì)講到.網(wǎng)上說速度很快,我就去看了一下API,其比普通的unordered_map多了一組API:
1.? set_empty_key/set_deleted_key
在開鏈法中,所有的節(jié)點(diǎn)都是容器內(nèi)的內(nèi)容,可是開放地址法中不是的.所以需要額外的信息來維護(hù)節(jié)點(diǎn)的可用性信息.
當(dāng)時(shí)我看到這兩個(gè)API,大概就猜到內(nèi)存是怎么實(shí)現(xiàn)的,閑來無(wú)事就是試著寫了一個(gè)demo,在VC 2008下面跑的結(jié)果是,比unordered_map快一倍多;在Linux x64 gcc 4.4下面的結(jié)果是,比unordered_map快了將近1倍.
2. 高性能的hash table必須是開放地址法的
這么說,是有原因的.鏈表的特性就是容易刪除,容易插入.可是hash table不需要這些特性,hash table只需要快.可以鏈表這東西,偏偏做不到快速定位,雖然你知道有下一個(gè)節(jié)點(diǎn),但是你不知道下一個(gè)節(jié)點(diǎn)的準(zhǔn)確位置,經(jīng)常會(huì)造成緩存未命中,浪費(fèi)大量時(shí)間.
3. bucket的容量
bucket的容量也是影響hash table性能的一個(gè)因素.無(wú)數(shù)的數(shù)據(jù)結(jié)構(gòu)和算法書籍,都教導(dǎo)大家,通過質(zhì)數(shù)取余數(shù),可以獲得比較好的下標(biāo)分布.可是,無(wú)論是除法還是乘法,消耗都是相當(dāng)高的.十幾個(gè)或者幾十個(gè)時(shí)鐘周期,始終比不上一兩個(gè)時(shí)鐘周期快.所以,高性能的hash table必須要把bucket的容量設(shè)置成2^n.google-sparsehash里面初始容量是32.擴(kuò)容的話,都是直接左移;算下標(biāo)的話,都是(容量-1) & hash_value,簡(jiǎn)單的一個(gè)位運(yùn)算搞定.
4. 正確實(shí)現(xiàn)find_position
我自己實(shí)現(xiàn)的hash table,是線性探測(cè)法的.所以find position也是比較簡(jiǎn)單,就是通過hash value和掩碼,獲取到其實(shí)下標(biāo),然后一個(gè)一個(gè)test.需要把buckets當(dāng)作是環(huán)形的,否則buckets最末位的數(shù)據(jù)沖突就會(huì)不好搞.(我當(dāng)時(shí)沒有考慮這一點(diǎn),直接給他擴(kuò)容了.....)
5. 對(duì)象模型
不同的Key和Value模型,可以導(dǎo)致你對(duì)Hash Table的不同實(shí)現(xiàn).簡(jiǎn)單的說,在C里面,你可以不用考慮Key和Value的生命周期(:D),但是C++里面,你不得不考慮Key,Value的生命周期問題.你不能做一個(gè)假設(shè),key和value都是簡(jiǎn)單數(shù)據(jù)類型.一個(gè)int映射到一個(gè)對(duì)象,這種經(jīng)常會(huì)用到的.
所以,erase一個(gè)key的時(shí)候,需要把key設(shè)置成deleted,然后還要把value重置一遍.如果沒有重置,對(duì)象所引用的內(nèi)存有可能就會(huì)被泄露.
這引發(fā)了我另外一個(gè)想法,就是通過模板,來特化Value的reset行為.因?yàn)椴皇撬械腣alue都是需要被重置的,只有那些復(fù)雜對(duì)象,才需要.
6. 可以考慮緩沖hash value
如果key都是簡(jiǎn)單數(shù)據(jù),而非string或者復(fù)雜的數(shù)據(jù)類型,緩沖是沒有任何意義的,因?yàn)閔ash value可以被快速的計(jì)算出來;但是當(dāng)key是char*,或者一些復(fù)雜的數(shù)據(jù)類型,緩沖就會(huì)變的有意義.而且緩沖更有利于重排,容器擴(kuò)容的時(shí)候速度會(huì)更快一些.
7. 考慮使用C的內(nèi)存分配器
盡量不要使用C++的new/delete來分配內(nèi)存.new,delete會(huì)有對(duì)象的構(gòu)造,析構(gòu)過程,這可能不是你所希望的.針對(duì)key和value數(shù)據(jù)類型的不同,你可能會(huì)有自己的特有的構(gòu)造,析構(gòu)過程.而且,C的內(nèi)存分配器,同樣可以被一些第三方庫(kù)優(yōu)化,比如tcmalloc/jemalloc等.
8. 選一個(gè)好的Hash函數(shù)(這是最重要的)
9. 盡力防止拷貝
rehash非常耗時(shí),如果支持C++11,就使用move操作;如果不支持,就用swap,否則會(huì)復(fù)制很多次.
?代碼貼上:
1 //Copyright 2012, egmkang wang. 2 // All rights reserved. 3 // 4 // Redistribution and use in source and binary forms, with or without 5 // modification, are permitted provided that the following conditions are 6 // met: 7 // 8 // * Redistributions of source code must retain the above copyright 9 // notice, this list of conditions and the following disclaimer. 10 // * Redistributions in binary form must reproduce the above 11 // copyright notice, this list of conditions and the following disclaimer 12 // in the documentation and/or other materials provided with the 13 // distribution. 14 // * Neither the name of green_turtle nor the names of its 15 // contributors may be used to endorse or promote products derived from 16 // this software without specific prior written permission. 17 // 18 // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 19 // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 20 // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 21 // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 22 // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 23 // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 24 // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 25 // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 26 // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 27 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 // 30 // author: egmkang (egmkang@gmail.com) 31 32 #ifndef __MY_HASH_TABLE__ 33 #define __MY_HASH_TABLE__ 34 #include <utility> 35 #include <functional> 36 #include <cstddef> 37 #include <stdlib.h> 38 namespace green_turtle{ 39 40 //hash_table with linear probing 41 template<class Key, 42 class T, 43 class Hash = std::hash<Key>, 44 class KeyEqual = std::equal_to<Key> > 45 class hash_map 46 { 47 public: 48 typedef Key key_type; 49 typedef T mapped_type; 50 typedef std::pair<const Key,T> value_type; 51 typedef size_t size_type; 52 typedef Hash hash_fn; 53 typedef KeyEqual equal_fn; 54 typedef value_type* iterator; 55 56 hash_map(size_type capacity = 32,key_type empty = key_type(),key_type deleted = key_type()): 57 empty_key_(empty) 58 ,deleted_key_(deleted) 59 ,size_(0) 60 ,capacity_(capacity) 61 ,buckets_(nullptr) 62 ,hasher_() 63 ,equaler_() 64 { 65 init_buckets(); 66 } 67 ~hash_map() 68 { 69 delete_buckets(); 70 } 71 hash_map(hash_map&& m,size_type capacity = 32): 72 buckets_(nullptr) 73 { 74 empty_key_ = m.empty_key_; 75 deleted_key_ = m.deleted_key_; 76 size_ = 0; 77 capacity_ = m.capacity_; 78 //to impl the increase and decrease method 79 if(capacity_ != capacity && capacity >= 32) 80 capacity_ = capacity; 81 hasher_ = m.hasher_; 82 equaler_ = m.equaler_; 83 84 init_buckets(); 85 86 copy_from(m); 87 } 88 hash_map& operator = (const hash_map& m) 89 { 90 empty_key_ = m.empty_key_; 91 deleted_key_ = m.deleted_key_; 92 size_ = 0; 93 capacity_ = m.capacity_; 94 hasher_ = m.hasher_; 95 equaler_ = m.equaler_; 96 97 init_buckets(); 98 99 copy_from(m); 100 } 101 void swap(hash_map& m) 102 { 103 std::swap(empty_key_ , m.empty_key_); 104 std::swap(deleted_key_ , m.deleted_key_); 105 std::swap(size_ , m.size_); 106 std::swap(capacity_ , m.capacity_); 107 std::swap(hasher_ , m.hasher_); 108 std::swap(equaler_ , m.equaler_); 109 std::swap(buckets_ , m.buckets_); 110 } 111 112 iterator end() { return nullptr; } 113 iterator end() const { return nullptr; } 114 115 iterator find(const key_type& key) 116 { 117 if(is_key_empty(key) || is_key_deleted(key)) 118 return NULL; 119 iterator pair_ = find_position(key); 120 if(!pair_ || !equaler_(key,pair_->first)) 121 return NULL; 122 return pair_; 123 } 124 iterator find(const key_type& key) const 125 { 126 if(is_key_empty(key) || is_key_deleted(key)) 127 return NULL; 128 iterator pair_ = find_position(key); 129 if(!pair_ || !equaler_(key,pair_->first)) 130 return NULL; 131 return pair_; 132 } 133 134 std::pair<iterator, bool> insert(const value_type& v) 135 { 136 std::pair<iterator, bool> result(nullptr, false); 137 result.first = _insert(v); 138 result.second = result.first ? true : false; 139 return result; 140 } 141 142 template<class P> 143 std::pair<iterator, bool> insert(P&& p) 144 { 145 std::pair<iterator, bool> result(nullptr, false); 146 result.first = _insert(std::forward<P>(p)); 147 result.second = result.first ? true : false; 148 return result; 149 } 150 151 template<class... Args> 152 std::pair<iterator, bool> emplace(Args&&... args) 153 { 154 std::pair<iterator, bool> result(nullptr, false); 155 value_type _v(std::forward<Args>(args)...); 156 result.first = _insert(std::move(_v)); 157 result.second = result.first ? true : false; 158 return result; 159 } 160 161 mapped_type& operator[](const key_type& key) 162 { 163 value_type *pair_ = find(key); 164 if(!pair_) 165 { 166 pair_ = insert(std::make_pair(key,mapped_type())); 167 } 168 return pair_->second; 169 } 170 171 mapped_type& operator[](key_type&& key) 172 { 173 value_type *pair_ = find(key); 174 if(!pair_) 175 { 176 pair_ = insert(std::make_pair(std::move(key), std::move(mapped_type()))); 177 } 178 return pair_->second; 179 } 180 181 void erase(const key_type& key) 182 { 183 assert(empty_key_ != deleted_key_ && "you must set a deleted key value before delete it"); 184 value_type *pair = find(key); 185 if(pair && equaler_(key,pair->first)) 186 set_key_deleted(pair); 187 --size_; 188 decrease_capacity(); 189 } 190 void erase(const value_type* value) 191 { 192 if(value) erase(value->first); 193 } 194 void clear() 195 { 196 if(empty()) return; 197 for(size_t idx = 0; idx < capacity_; ++idx) 198 { 199 buckets_[idx]->first = empty_key_; 200 buckets_[idx]->second = mapped_type(); 201 } 202 size_ = 0; 203 } 204 //bool (const value_type&); 205 template<class Fn> 206 void for_each(Fn f) const 207 { 208 if(empty()) return; 209 for(size_t idx = 0; idx < capacity_; ++idx) 210 { 211 if(is_key_deleted(buckets_[idx].first) || 212 is_key_empty(buckets_[idx].first)) 213 continue; 214 if(!f(buckets_[idx])) 215 break; 216 } 217 } 218 219 inline void set_deleted_key(key_type k) 220 { 221 assert(empty_key_ != k); 222 if(deleted_key_ != empty_key_) 223 assert(deleted_key_ == k); 224 deleted_key_ = k; 225 } 226 inline bool empty() const { return size_ == 0; } 227 inline size_type size() const { return size_; } 228 inline size_type capacity() const { return capacity_; } 229 private: 230 //return key equal position 231 //or first deleted postion 232 //or empty postion 233 value_type* find_position(const key_type& key) const 234 { 235 size_type hash_pair_ = hasher_(key); 236 size_type mask_ = capacity_ - 1; 237 size_type begin_ = hash_pair_ & mask_; 238 size_type times_ = 0; 239 value_type *first_deleted_ = NULL; 240 while(true) 241 { 242 if(is_key_deleted(buckets_[begin_].first) && !first_deleted_) 243 first_deleted_ = &buckets_[begin_]; 244 else if(is_key_empty(buckets_[begin_].first)) 245 { 246 if(first_deleted_) return first_deleted_; 247 return &buckets_[begin_]; 248 } 249 else if(equaler_(key,buckets_[begin_].first)) 250 return &buckets_[begin_]; 251 252 begin_ = (begin_ + 1) & mask_; 253 assert(times_++ <= capacity_); 254 (void)times_; 255 } 256 return NULL; 257 } 258 void copy_from(hash_map&& m) 259 { 260 if(m.empty()) return; 261 for(size_t idx = 0; idx < m.capacity_; ++idx) 262 { 263 if(is_key_deleted(m.buckets_[idx].first) || 264 is_key_empty(m.buckets_[idx].first)) 265 continue; 266 _insert(std::move(m.buckets_[idx])); 267 } 268 } 269 void copy_from(const hash_map& m) 270 { 271 if(m.empty()) return; 272 for(size_t idx = 0; idx < m.capacity_; ++idx) 273 { 274 if(is_key_deleted(m.buckets_[idx].first) || 275 is_key_empty(m.buckets_[idx].first)) 276 continue; 277 _insert(m.buckets_[idx]); 278 } 279 } 280 void increase_capacity() 281 { 282 if(size_ > (capacity_ >> 1)) 283 { 284 hash_map _m(std::move(*this),capacity_ << 1); 285 swap(_m); 286 } 287 } 288 void decrease_capacity() 289 { 290 if(size_ < (capacity_ >> 2)) 291 { 292 hash_map _m(*this,capacity_ >> 2); 293 swap(_m); 294 } 295 } 296 void set_key_deleted(value_type& pair) 297 { 298 pair.first = deleted_key_; 299 pair.second = mapped_type(); 300 } 301 inline bool is_key_deleted(const key_type& key) const { return equaler_(key,deleted_key_); } 302 inline bool is_key_empty(const key_type& key) const { return equaler_(key,empty_key_); } 303 void init_buckets() 304 { 305 delete[] buckets_; 306 buckets_ = new value_type[capacity_](); 307 if(empty_key_ != key_type()) 308 { 309 for(unsigned idx = 0; idx < capacity_; ++idx) 310 { 311 const_cast<key_type&>(buckets_[idx].first) = empty_key_; 312 } 313 } 314 } 315 void delete_buckets() 316 { 317 delete[] buckets_; 318 } 319 value_type* _insert(const value_type& _v) 320 { 321 const key_type& key = _v.first; 322 if(is_key_deleted(key) || is_key_empty(key)) 323 return NULL; 324 increase_capacity(); 325 value_type *pair_ = find_position(key); 326 if(!pair_ || equaler_(key,pair_->first)) 327 return NULL; 328 329 auto& k1 = const_cast<key_type&>(pair_->first); 330 auto& v1 = const_cast<mapped_type&>(pair_->second); 331 k1 = key; 332 v1 = _v.second; 333 334 ++size_; 335 return pair_; 336 } 337 template<class P> 338 value_type* _insert(P&& p) 339 { 340 std::pair<key_type, mapped_type> _v(p.first, p.second); 341 const key_type& key = _v.first; 342 if(is_key_deleted(key) || is_key_empty(key)) 343 return NULL; 344 increase_capacity(); 345 value_type *pair_ = find_position(key); 346 if(!pair_ || equaler_(key,pair_->first)) 347 return NULL; 348 349 auto& k1 = const_cast<key_type&>(pair_->first); 350 auto& v1 = const_cast<mapped_type&>(pair_->second); 351 k1 = std::move(_v.first); 352 v1 = std::move(_v.second); 353 354 ++size_; 355 return pair_; 356 } 357 private: 358 key_type empty_key_; 359 key_type deleted_key_; 360 size_type size_; 361 size_type capacity_; 362 value_type *buckets_; 363 hash_fn hasher_; 364 equal_fn equaler_; 365 }; 366 367 }//end namespace green_turtle 368 #endif//__MY_HASH_TABLE__參考:
1. 算法導(dǎo)論
2. 計(jì)算機(jī)程序設(shè)計(jì)藝術(shù)
3. google-sparsehash dense_hash_map的實(shí)現(xiàn), http://code.google.com/p/google-sparsehash
PS:
如果有一個(gè)好的內(nèi)存分配器,STL的開鏈法hash table性能并不差太多,所以我砍掉了自己實(shí)現(xiàn)的hash table,代碼貼在上面.加入了C++11的move語(yǔ)義,可能會(huì)有一些bug,move實(shí)在是太繁瑣了.
與50位技術(shù)專家面對(duì)面20年技術(shù)見證,附贈(zèng)技術(shù)全景圖總結(jié)
以上是生活随笔為你收集整理的[C++]怎么样实现一个较快的Hash Table的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 各类滤波算子
- 下一篇: 阿里巴巴AI智能专场:整理分享