MongoDB group won't work for your stats? Try mapReduce instead
Problem recap
Today a colleague, Xiao Zhang, pinged me: he had spent a whole day building a pending-items statistics feature on top of MongoDB, and it kept throwing errors!
I spent about half an hour going over his requirements and code, and roughly figured out that the problem lay in how the group and aggregate functions were being called when computing statistics over the pending-items collection.
Grouping (group) code for the pending-items collection:
```java
GroupByResults<PendingEntity> groupByResults = mongoTemplate.group(
        new Criteria().andOperator(criteriaArray),
        mongoTemplate.getCollectionName(PendingEntity.class),
        groupBy, PendingEntity.class);
long resultCount = ((List) groupByResults.getRawResults().get("retval")).size();
```

Aggregation (aggregate) code for the pending-items collection:
```java
AggregationResults<PendingEntity> results = mongoTemplate.aggregate(
        aggregation, "studentScore", PendingEntity.class);
double totleScore = results.getUniqueMappedResult().getCollect();
```

Problem diagnosis
The exception message:
Buried in the exception, I noticed the errmsg field: "can't do command: group on sharded collection" — roughly, a sharded collection cannot be used with the group function.
Suspecting the sharded collection was the culprit, I checked a few tech blogs and the official MongoDB docs for the limitations of the group function, roughly:
- group cannot be used on a sharded collection
- group does not process more than 20,000 unique keys (when the group-by key is under a uniqueness constraint)
Clearly, the restriction that sharded collections cannot be grouped confirmed my initial guess.
I then asked a colleague on the ops team, who confirmed that when our MongoDB collections are created, their documents are sharded across different servers — presumably out of concern for MongoDB's stability.
Solution
Since sharded collections can't use group, how do we solve the grouped-statistics problem?
The answer: use mapReduce.
What does that remind you of?
It is very much like the Map-Reduce idea in Hadoop:
The core idea of MapReduce is divide and conquer: break a large, complex task into a number of small tasks, execute them in parallel, then merge the results. It suits scenarios with large volumes of complex tasks and large-scale data processing.
Map handles the "divide": it splits a complex task into several "simple tasks" to be processed in parallel. The prerequisite is that these small tasks can be computed in parallel, with almost no dependencies between them.
Reduce handles the "merge": it performs a global aggregation over the results of the map phase.
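As a minimal illustration of that divide-and-merge idea (plain Java, no Hadoop or MongoDB involved — the word-count task is just a stand-in), the map step emits a (word, 1) pair per token and the reduce step sums the counts per key:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCount {
    public static Map<String, Integer> mapReduce(List<String> lines) {
        return lines.stream()
                // map phase: split each line and emit a (word, 1) pair per token
                .flatMap(line -> Stream.of(line.split("\\s+")))
                .map(word -> new SimpleEntry<>(word, 1))
                // reduce phase: group the pairs by key and sum the values
                .collect(Collectors.groupingBy(Entry::getKey,
                        Collectors.summingInt(Entry::getValue)));
    }

    public static void main(String[] args) {
        // "a" appears twice, "b" twice, "c" once
        System.out.println(mapReduce(List.of("a b a", "b c")));
    }
}
```

Each line can be mapped independently (no dependencies between the small tasks), which is exactly what makes the map phase parallelizable.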
[Figure: Map-Reduce execution flow in Hadoop (image source: the internet)]
Consulting the official MongoDB docs, mapReduce is described as follows:
Map-reduce supports operations on sharded collections, both as an input and as an output. This section describes the behaviors of mapReduce specific to sharded collections.
However, starting in version 4.2, MongoDB deprecates the map-reduce option to create a new sharded collection as well as the use of the sharded option for map-reduce. To output to a sharded collection, create the sharded collection first. MongoDB 4.2 also deprecates the replacement of an existing sharded collection.
Sharded Collection as Input
When using sharded collection as the input for a map-reduce operation, mongos will automatically dispatch the map-reduce job to each shard in parallel. There is no special option required. mongos will wait for jobs on all shards to finish.
Sharded Collection as Output
If the out field for mapReduce has the sharded value, MongoDB shards the output collection using the _id field as the shard key.
In short, mapReduce supports input and output operations on sharded collections, with the behavior on sharded collections as described above.
The mapReduce syntax:

```
db.collection.mapReduce(
    <map>,
    <reduce>,
    {
        out: <collection>,
        query: <document>,
        sort: <document>,
        limit: <number>,
        finalize: <function>,
        scope: <document>,
        jsMode: <boolean>,
        verbose: <boolean>,
        bypassDocumentValidation: <boolean>
    }
)
```

Parameter descriptions:
- map: the map function (emits a sequence of key-value pairs that are fed to the reduce function)
- reduce: the aggregation function
- query: filters the target documents
- sort: sorts the target documents
- limit: caps the number of target documents
- out: the collection that stores the results (if omitted, a temporary collection is used and dropped when the client disconnects)
- finalize: a final-processing function (applies last-step adjustments to the reduce output before it is stored in the result collection)
- scope: imports external variables into map, reduce, and finalize
- jsMode: when false, data flows BSON → JS → map → BSON → JS → reduce → BSON, which can handle very large map-reduce jobs; when true, it flows BSON → JS → map → reduce → BSON
- verbose: includes detailed timing statistics in the result
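To make the roles of those parameters concrete, here is an in-memory sketch in plain Java (no MongoDB involved; the Doc record, the filter/sort criteria, and the finalize step are all illustrative assumptions): query filters, sort orders, limit caps, map emits key-value pairs, reduce merges values per key, and finalize post-processes each reduced value:

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class MapReduceSketch {
    // illustrative document shape: a category and a count
    record Doc(String category, int count) {}

    public static Map<String, Integer> run(List<Doc> docs) {
        // query: keep only documents with a positive count
        // sort: order by count, descending
        // limit: process at most 100 documents
        List<Doc> selected = docs.stream()
                .filter(d -> d.count() > 0)
                .sorted(Comparator.comparingInt(Doc::count).reversed())
                .limit(100)
                .toList();

        // map: emit a (category, count) pair per document
        // reduce: merge values that share a key by summing them
        Map<String, Integer> reduced = new LinkedHashMap<>();
        for (Doc d : selected) {
            reduced.merge(d.category(), d.count(), Integer::sum);
        }

        // finalize: post-process each reduced value (doubling is a stand-in)
        BiFunction<String, Integer, Integer> finalize = (key, value) -> value * 2;
        reduced.replaceAll(finalize::apply);
        return reduced;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc("a", 3), new Doc("b", 1),
                new Doc("a", 2), new Doc("c", 0));
        // "c" is filtered out; "a" reduces to 5, "b" to 1; finalize doubles both
        System.out.println(run(docs));
    }
}
```

The order of the stages mirrors the parameter list above; in real MongoDB the map and reduce bodies are JavaScript functions executed server-side.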
So I had Xiao Zhang swap group for mapReduce, and the problem was solved!
```java
// reduce function: sum the per-key counts emitted by the map function
// (the loop body was truncated in the original post; the accumulation below
// is a reconstruction consistent with the total.count initializer)
String reducef = "function(key, values) {"
        + "  var total = {count: 0};"
        + "  for (var i = 0; i < values.length; i++) { total.count += values[i].count; }"
        + "  return total;"
        + "}";
MapReduceResults<BasicDBObject> mrr = readMongoTemplate.mapReduce(query,
        readMongoTemplate.getCollectionName(ReadingEntity.class),
        map, reducef, BasicDBObject.class);
```

Problem summary
Sometimes the answer sits right in the most conspicuous part of the error message; it just takes a careful reader to notice it.
Also, most problems have a documented solution on the official site — if you can fight your way through the English...
References
https://docs.mongodb.com/manual/aggregation/
https://docs.mongodb.com/manual/core/map-reduce-sharded-collections/
https://www.cnblogs.com/chenpingzhao/p/7913247.html
https://blog.csdn.net/weixin_42582592/article/details/83080900
https://blog.csdn.net/iteye_19607/article/details/82644559