當(dāng)前位置：首頁(yè) > 编程资源 > 编程问答 >内容正文

编程问答

寻找最大的K个数，Top K问题的堆实现

發(fā)布時(shí)間：2024/9/21 编程问答 24 豆豆

生活随笔收集整理的這篇文章主要介紹了寻找最大的K个数，Top K问题的堆实现小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

//生成隨機(jī)的不重復(fù)的測(cè)試數(shù)據(jù) #include <iostream> #include <time.h> #include <assert.h> using namespace std; //產(chǎn)生[l,u]區(qū)間的隨機(jī)數(shù) int randint(int l, int u) { return l+(RAND_MAX*rand()+rand())%(u-l+1); } //1000W的int,大約4M的數(shù)據(jù)，如果放在mian內(nèi)，在我的機(jī)子上好像是棧溢出了，放在全局空間就沒(méi)問(wèn)題 const int size = 10000000; int num[size]; int main() { int i, j; FILE *fp = fopen("data.txt", "w"); assert(fp); for (i = 0; i < size; i++) num[i] = i+1; srand((unsigned)time(NULL)); for (i = 0; i < size; i++) { j = randint(i, size-1); int t = num[i]; num[i] = num[j]; num[j] = t; //swap(num[i], num[j]); } for (i = 0; i < size; i++) fprintf(fp, "%d ", num[i]); fclose(fp); return 0; }

1000萬(wàn)個(gè)數(shù)據(jù)太大，打開文件很慢，可能會(huì)看不到，可以測(cè)試1萬(wàn)個(gè)數(shù)據(jù)中找最大的10個(gè)。

搜索引擎熱門查詢統(tǒng)計(jì)

題目描述：
搜索引擎會(huì)通過(guò)日志文件把用戶每次檢索使用的所有檢索串都記錄下來(lái)，每個(gè)查詢串的長(zhǎng)度為1-255字節(jié)。
假設(shè)目前有一千萬(wàn)個(gè)記錄，這些查詢串的重復(fù)度比較高，雖然總數(shù)是1千萬(wàn)，但如果除去重復(fù)后，不超過(guò)3百萬(wàn)個(gè)。一個(gè)查詢串的重復(fù)度越高，說(shuō)明查詢它的用戶越多，也就是越熱門。請(qǐng)你統(tǒng)計(jì)最熱門的10個(gè)查詢串，要求使用的內(nèi)存不能超過(guò)1G。

解決方法：hash表+堆，去重復(fù)后不超過(guò)300萬(wàn)個(gè)，總大小不超過(guò)300萬(wàn)*255B=765MB，內(nèi)存使用不超過(guò)1G。

第一步：先對(duì)這批海量數(shù)據(jù)預(yù)處理，在O(N)的時(shí)間內(nèi)用Hash表完成去重復(fù)操作。
第二步：借助堆這個(gè)數(shù)據(jù)結(jié)構(gòu)，找出Top K，時(shí)間復(fù)雜度為NlogK。即借助堆結(jié)構(gòu)，可以在log量級(jí)的時(shí)間內(nèi)查找和調(diào)整/移動(dòng)。因此，維護(hù)一個(gè)K（該題目中是10）大小的小根堆（Kmin設(shè)為堆頂元素），然后遍歷300萬(wàn)的Query，分別和Kmin進(jìn)行對(duì)比比較（若X>Kmin，則更新并調(diào)整堆，否則，不更新），最終的時(shí)間復(fù)雜度是：O（N）+ N'*O（logK），（N為1000萬(wàn)，N’為300萬(wàn)）。

為了降低實(shí)現(xiàn)上的難度，假設(shè)這些記錄全部是一些英文單詞，即用戶在搜索框里敲入一個(gè)英文單詞，然后查詢搜索結(jié)果，最后，要你統(tǒng)計(jì)輸入單詞中頻率最大的前K個(gè)單詞。復(fù)雜問(wèn)題簡(jiǎn)單化了之后，編寫代碼實(shí)現(xiàn)也相對(duì)輕松多了，如下：

//copyright@yansha &&July //July、updated，2011.05.08 //題目描述： //搜索引擎會(huì)通過(guò)日志文件把用戶每次檢索使用的所有檢索串都記錄下來(lái)，每個(gè)查詢串的 //長(zhǎng)度為1-255字節(jié)。假設(shè)目前有一千萬(wàn)個(gè)記錄（這些查詢串的重復(fù)度比較高，雖然總數(shù)是1千萬(wàn)，但如果 //除去重復(fù)后，不超過(guò)3百萬(wàn)個(gè)。一個(gè)查詢串的重復(fù)度越高，說(shuō)明查詢它的用戶越多，也就是越熱門）， //請(qǐng)你統(tǒng)計(jì)最熱門的10個(gè)查詢串，要求使用的內(nèi)存不能超過(guò)1G。 #include <iostream> #include <string> #include <assert.h> using namespace std; #define HASHLEN 2807303 #define WORDLEN 30 // 結(jié)點(diǎn)指針 typedef struct node_no_space *ptr_no_space; typedef struct node_has_space *ptr_has_space; ptr_no_space head[HASHLEN]; struct node_no_space { char *word; int count; ptr_no_space next; }; struct node_has_space { char word[WORDLEN]; int count; ptr_has_space next; }; // 最簡(jiǎn)單hash函數(shù) int hash_function(const char *p) { int value = 0; while (*p != '\0') { value = value * 31 + *p++; if (value > HASHLEN) value = value % HASHLEN; } return value; } // 添加單詞到hash表 void append_word(const char *str) { int index = hash_function(str); ptr_no_space p = head[index]; while (p != NULL) { if (strcmp(str, p->word) == 0) { (p->count)++; return; } p = p->next; } // 新建一個(gè)結(jié)點(diǎn) ptr_no_space q = new node_no_space; q->count = 1; q->word = new char [strlen(str)+1]; strcpy(q->word, str); q->next = head[index]; head[index] = q; } // 將單詞處理結(jié)果寫入文件 void write_to_file() { FILE *fp = fopen("result.txt", "w"); assert(fp); int i = 0; while (i < HASHLEN) { for (ptr_no_space p = head[i]; p != NULL; p = p->next) fprintf(fp, "%s %d\n", p->word, p->count); i++; } fclose(fp); } // 從上往下篩選，保持小根堆 void sift_down(node_has_space heap[], int i, int len) { int min_index = -1; int left = 2 * i; int right = 2 * i + 1; if (left <= len && heap[left].count < heap[i].count) min_index = left; else min_index = i; if (right <= len && heap[right].count < heap[min_index].count) min_index = right; if (min_index != i) { // 交換結(jié)點(diǎn)元素 swap(heap[i].count, heap[min_index].count); char buffer[WORDLEN]; strcpy(buffer, heap[i].word); strcpy(heap[i].word, heap[min_index].word); strcpy(heap[min_index].word, buffer); sift_down(heap, min_index, len); } } // 建立小根堆 void build_min_heap(node_has_space heap[], int len) { if (heap == NULL) return; int index = len / 2; for (int i = index; i >= 1; i--) sift_down(heap, i, len); } // 去除字符串前后符號(hào) void handle_symbol(char *str, int n) { while (str[n] < '0' || (str[n] > '9' && str[n] < 'A') || (str[n] > 'Z' && str[n] < 'a') || str[n] > 'z') { str[n] = '\0'; n--; } while (str[0] < '0' || (str[0] > '9' && str[0] < 'A') || (str[0] > 'Z' && str[0] < 'a') || str[0] > 'z') { int i = 0; while (i < n) { str[i] = str[i+1]; i++; } str[i] = '\0'; n--; } } int main() { char str[WORDLEN]; for (int i = 0; i < HASHLEN; i++) head[i] = NULL; // 將字符串用hash函數(shù)轉(zhuǎn)換成一個(gè)整數(shù)并統(tǒng)計(jì)出現(xiàn)頻率 FILE *fp_passage = fopen("string.txt", "r"); assert(fp_passage); while (fscanf(fp_passage, "%s", str) != EOF) { int n = strlen(str) - 1; if (n > 0) handle_symbol(str, n); append_word(str); } fclose(fp_passage); // 將統(tǒng)計(jì)結(jié)果輸入文件 write_to_file(); int n = 10; ptr_has_space heap = new node_has_space [n+1]; int c; FILE *fp_word = fopen("result.txt", "r"); assert(fp_word); for (int j = 1; j <= n; j++) { fscanf(fp_word, "%s %d", &str, &c); heap[j].count = c; strcpy(heap[j].word, str); } // 建立小根堆 build_min_heap(heap, n); // 查找出現(xiàn)頻率最大的10個(gè)單詞 while (fscanf(fp_word, "%s %d", &str, &c) != EOF) { if (c > heap[1].count) { heap[1].count = c; strcpy(heap[1].word, str); sift_down(heap, 1, n); } } fclose(fp_word); // 輸出出現(xiàn)頻率最大的單詞 for (int k = 1; k <= n; k++) cout << heap[k].count << " " << heap[k].word << endl; return 0; }

參考：http://blog.csdn.net/v_JULY_v/archive/2011/05/08/6403777.aspx

作者：阿凡盧出處：http://www.cnblogs.com/luxiaoxun/ 本文版權(quán)歸作者和博客園共有，歡迎轉(zhuǎn)載，但未經(jīng)作者同意必須保留此段聲明，且在文章頁(yè)面明顯位置給出原文連接，否則保留追究法律責(zé)任的權(quán)利。

總結(jié)

以上是生活随笔為你收集整理的寻找最大的K个数，Top K问题的堆实现的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

个数
TOP