记一次 .NET 某WMS仓储打单系统 内存暴涨分析
一:背景
1. 講故事
七月中旬有一位朋友加wx求助,他的程序在生產(chǎn)上跑著跑著內存就飆起來了,貌似沒有回頭的趨勢,詢問如何解決,截圖如下:
和這位朋友聊下來,感覺像是自己在小縣城當了個小老板,規(guī)律的生活,有當?shù)刭Y源,各種小關系,有一股財務自由的味道,這也是我一直向往的生活方式 ????????????。
既然朋友找到我了,我得想辦法給他解決問題,既然是內存暴漲,我就賭一把在托管層面吧,嘿嘿,上windbg說話。
二:windbg 分析
1. 托管還是非托管
一直在追這個系列的朋友應該知道,我無數(shù)次的用 !address -summary 和 !eeheap -gc 這兩個命令來判斷當前內存屬于托管層還是非托管層。
0:000>?!address?-summary---?State?Summary?----------------?RgnCount?-----------?Total?Size?--------?%ofBusy?%ofTotal MEM_FREE????????????????????????????????393?????7dfe`f2105000?(?125.996?TB)???????????98.43% MEM_RESERVE????????????????????????????1691??????200`0f1e4000?(???2.000?TB)??99.81%????1.56% MEM_COMMIT?????????????????????????????6191????????0`fed07000?(???3.981?GB)???0.19%????0.00%0:000>?!eeheap?-gc Number?of?GC?Heaps:?1 generation?0?starts?at?0x000001D2E572BBC8 generation?1?starts?at?0x000001D2E54F70E0 generation?2?starts?at?0x000001D252051000 ephemeral?segment?allocation?context:?nonesegment?????????????begin?????????allocated??????????????size 000001D252050000??000001D252051000??000001D26204FFE0??0xfffefe0(268431328) Large?object?heap?starts?at?0x000001D262051000segment?????????????begin?????????allocated??????????????size 000001D262050000??000001D262051000??000001D2655F3F80??0x35a2f80(56242048) Total?Size:??????????????Size:?0xbf4dbf80?(3209543552)?bytes. ------------------------------ GC?Heap?Size:????Size:?0xbf4dbf80?(3209543552)?bytes.卦象上進程指標為 3.98G ,GC堆指標為 3209543552 = 3G ,很顯然,本次事故屬于 托管層面。
2. 尋找托管層上的大對象
我們都知道C#是托管語言,所以甭管有用沒用的對象都逃不出GC堆,言外之意就是看GC堆準沒錯,挑幾個大對象看看。
0:000>?!dumpheap?-stat Statistics:MT????Count????TotalSize?Class?Name 00007ff98a68f090???391475?????43869284?System.Int32[] 00007ff98b6adfa0??1902760?????45666240?System.Collections.ObjectModel.ReadOnlyCollection`1[[System.Linq.Expressions.Expression,?System.Linq.Expressions]] 00007ff98b6ac3c0??1951470?????46835280?System.Linq.Expressions.ConstantExpression 00007ff98bc452e0??1681178?????53797696?System.Linq.Expressions.TypedConstantExpression 00007ff98eacb6b8??1902708?????60886656?System.Dynamic.Utils.ListArgumentProvider 00007ff98f236518??1774982?????70999280?Microsoft.EntityFrameworkCore.Query.Expressions.ColumnExpression 00007ff98c650c58??1681142?????80694816?System.Linq.Expressions.MethodCallExpression3 00007ff98a82bc38??3414094?????81938256?System.RuntimeMethodHandle 00007ff98fd96fc0????17750?????83936016?System.Collections.Generic.Dictionary`2+Entry[[System.Reflection.MemberInfo,?System.Private.CoreLib],[System.Linq.Expressions.Expression,?System.Linq.Expressions]][] 00007ff98e5ed5d8????35493????101740504?System.Collections.Generic.Dictionary`2+Entry[[Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceCacheKey,?Microsoft.Extensions.DependencyInjection],[System.Object,?System.Private.CoreLib]][] 00007ff98bcff6a8??3639389????116460448?System.Linq.Expressions.PropertyExpression 00007ff98b85cf00??5028347????160907104?System.Reflection.Emit.GenericFieldInfo 00007ff98a671e18??2178117????168395994?System.String 00007ff98a5b6610???160565????171498416?System.Object[] 00007ff98eaa8ab0??4981589????199263560?System.Linq.Expressions.MemberAssignment 00007ff98a672360???398740????391928469?System.Byte[] 00007ff98a746d68???181886????486150592?System.Char[]?從托管堆上看,System.Linq.Expressions.MemberAssignment 對象高達 498w ,很明顯有問題,從類名看可能和 ExpressionTree 有關,那就抽幾個對象看看它的引用鏈上是否有過大的對象。
0:000>?!gcroot?000001d25399f690 HandleTable:000001D251B715A8?(pinned?handle)->?000001D262068CF0?System.Object[]->?000001D2531C3B78?Microsoft.EntityFrameworkCore.Internal.ServiceProviderCache->?000001D25399E3D0?Remotion.Linq.QueryModel->?000001D25399E3B8?Remotion.Linq.Clauses.SelectClause->?000001D25442C068?System.Linq.Expressions.MemberInitExpression->?000001D25442C050?System.Runtime.CompilerServices.TrueReadOnlyCollection`1[[System.Linq.Expressions.MemberBinding,?System.Linq.Expressions]]->?000001D2539A0290?System.Linq.Expressions.MemberBinding[]->?000001D25399F690?System.Linq.Expressions.MemberAssignment引用鏈特別長,這里我就截取一下,經(jīng)過一頓排查,我發(fā)現(xiàn)大對象居然是 Remotion.Linq.Clauses.SelectClause,objsize 這個對象直接爆掉了,真的很奇葩,如下代碼所示:
0:000>?!objsize?000001D25399E3B8 sizeof(000001D25399E3B8)?=?-1187378032?(0xb93a0c90)?bytes?(Remotion.Linq.Clauses.SelectClause)有點懵,這個對象居然是罪魁禍首,從引用鏈看它是 EF 下的一個構建表達式樹的小部件,可以肯定的是,朋友在用EF的時候出了什么問題,不過還得硬著頭皮繼續(xù)挖 SelectClause,經(jīng)過深挖,我發(fā)現(xiàn)這個類有大量這樣的 char[] 數(shù)組,導出來后大概是下面這樣。
Logistics.Text30),? |???????????Text31?=?string?TryReadValue(t1.Outer.Outer,?42,?WmsOutboundConfirmLogistics.Text31),? |???????????Text32?=?string?TryReadValue(t1.Outer.Outer,?43,?WmsOutboundConfirmLogistics.Text32),? |???????????Text33?=?string?TryReadValue(t1.Outer.Outer,?44,?WmsOutboundConfirmLogistics.Text33),? |???????????Text34?=?string?TryReadValue(t1.Outer.Outer,?45,?WmsOutboundConfirmLogistics.Text34),? |???????????Text35?=?string?TryReadValue(t1.Outer.Outer,?46,?WmsOutboundConfirmLogistics.Text35),? |???????????IsQueue?=?Nullable<bool>?TryReadValue(t1.Outer.Outer,?47,?WmsOutboundConfirmLogistics.IsQueue),? |???????????IsStop?=?Nullable<bool>?TryReadValue(t1.Outer.Outer,?48,?WmsOutboundConfirmLogistics.IsStop),? |???????????CheckCode?=?string?TryReadValue(t1.Outer.Outer,?49,?WmsOutboundConfirmLogistics.CheckCode),? |???????????ClientCode?=?string?TryReadValue(t1.Outer.Inner,?50,?WmsOutboundOrder.ClientCode),? |???????????WarehouseCode?=?string?TryReadValue(t1.Outer.Inner,?51,?WmsOutboundOrder.WarehouseCode),? |???????????ErpNumber?=?string?TryReadValue(t1.Outer.Inner,?52,?WmsOutboundOrder.ErpNumber),? |???????????OrderCategory?=?string?TryReadValue(t1.Outer.Inner,?53,?WmsOutboundOrder.OrderCategory),? |???????????OrderStatus?=?string?TryReadValue(t1.Outer.Inner,?54,?WmsOutboundOrder.OrderStatus),? |???????????OrderType?=?string?TryReadValue(t1.Outer.Inner,?55,?WmsOutboundOrder.OrderType),? |???????????SendCompany?=?string?TryReadValue(t1.Outer.Inner,?56,?WmsOutboundOrder.SendCompany),? |???????????SendName?=?string?TryReadValue(t1.Outer.Inner,?57,?WmsOutboundOrder.SendName),? |???????????SendTel?=?string?TryReadValue(t1.Outer.Inner,?58,?WmsOutboundOrder.SendTel),? |???????????SendMobile?=?string?TryReadValue(t1.Outer.Inner,?59,?WmsOutboundOrder.SendMobile),? |???????????SendProvince?=?string?TryReadValue(t1.Outer.Inner,?60,?WmsOutboundOrder.SendProvince),? |???????????SendCity?=?string?TryReadValue(t1.Outer.Inner,?61,?WmsOutboundOrder.SendCity),? |???????????SendArea?=?string?TryReadValue(t1.Outer.Inner,?62,?WmsOutboundOrder.SendArea),? |???????????... |???????????CategoryName?=?string?TryReadValue(t1.Outer.Inner,?88,?WmsOutboundOrder.CategoryName),? |???????????SourcePlatformCode?=?string?TryReadValue(t1.Outer.Inner,?89,?WmsOutboundOrder.SourcePlatformCode),? |???????????PayMode?=?(string)string?TryReadValue(t1.Outer.Outer,?90,?null),? |???????????List?=?List<WmsOutboundConfirmLogisticsLinesDTO>?WmsOutboundConfirmLogisticsBusiness.GetOrderLines(string?TryReadValue(t1.Outer.Outer,?5,?WmsOutboundConfirmLogistics.OrderNumber)),? |???????????ConfirmTime?=?DateTime?TryReadValue(t1.Inner,?91,?WmsOutboundOrderConfirmation.CreateTime),? |???????????ReturnUrl?=?(string)string?TryReadValue(t1.Outer.Outer,?92,?null)? |???????} |__?),? |__?contextType:?Core.DataRepository.BaseDbContext,? |__?logger:?DiagnosticsLogger<Query>,? |__?queryContext:?Unhandled?parameter:?queryContext)??????????????????????????????????????????????????????從內容看,應該是 select 語句的ExpressionTree表示,問了下朋友,說大概是報表業(yè)務,不過這些信息給他貌似也沒有多大幫助,說實話到這里我其實也不知道怎么繼續(xù)往下排查了,陷入了絕望。
3. 從絕望中尋找希望
我在想,既然EF構建了大量這樣的 ExpressionTree,肯定有問題,但也想不出是什么問題,隔了半天,我突然靈光一現(xiàn),EF既然構建了樹,有可能sql也出來了,對,我何不直接在 heap 上搜索 select 的sql語句。。。。
0:000>?!strings?/m:*select* Address????????????Gen????Length???Value 000001d2e4de64e0????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",... 000001d2e4e11e78????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",... 000001d2e4e3d1f0????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",... 000001d2e4e673c8????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",... 000001d2e4e91760????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",... 000001d2e4ebb2e8????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",... 000001d2e4ee54f8????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",... 000001d2e4f10758????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",... 000001d2e4f398d0????2???????1964???SELECT?a."Id",?a."CreateTime"?AS?"CreateTime0",?a."CreatorId",?a."CreatorRealName",?a."Deleted",?a."OrderNumber",?a."CarrierId",...--------------------------------------- 18128?matching?strings果然發(fā)現(xiàn)了大量重復的 select 語句,而且從最左邊的內存地址看都是非常接近的,也就說明他們是在某一個操作中同時生成的,然后我們導出幾個sql語句。
SELECT?a."Id",?.... FROM?"WmsOutboundConfirmLogistics"?AS?a INNER?JOIN?"WmsOutboundOrder"?AS?b?ON?a."OrderNumber"?=?b."OrderNumber" INNER?JOIN?"WmsOutboundOrderConfirmation"?AS?c?ON?a."OrderNumber"?=?c."OrderNumber" WHERE?(a."OrderNumber"?=?@__pagination_OrderNumber_0)?AND?(b."FreezeStatus"?=?FALSE) ORDER?BY?a."Id"";SELECT?a."Id",?.... FROM?"WmsOutboundConfirmLogistics"?AS?a INNER?JOIN?"WmsOutboundOrder"?AS?b?ON?a."OrderNumber"?=?b."OrderNumber" INNER?JOIN?"WmsOutboundOrderConfirmation"?AS?c?ON?a."OrderNumber"?=?c."OrderNumber" WHERE?(a."OrderNumber"?=?@__pagination_OrderNumber_0)?AND?(b."FreezeStatus"?=?FALSE) ORDER?BY?a."Id""拿到這 1.8w 重復的sql 給朋友看,朋友說這是查詢報表的sql。
4. 所有線索整合打通
那這里就存在著很大問題,既然是查詢報表,為什么會有 1.8w 相同的sql,唯一不同的就是 a."OrderNumber" = @__pagination_OrderNumber_0 中的訂單號,難道不應該是 a.OrderNumber in (xxxx) 或者是表關聯(lián)查詢嗎???整理一下就是下面這樣的猜想:
--?理想 select?*?from?a?where?a.id?in?(1,2,3)--?現(xiàn)實 select?*?from?a?where?a.id=1; select?*?from?a?where?a.id=2; select?*?from?a?where?a.id=3;加上每個sql內存地址相近,再結合爆表的 Remotion.Linq.Clauses.SelectClause 對象,整個流程大概就是:本該表關聯(lián)或者in操作,結果變成了無數(shù)個單條sql語句查詢,導致EF底層出現(xiàn)內存爆炸式增長。
三:總結
看了下朋友查詢ef的寫法,猜測大多都是人肉構建 ExpressionTree 去查詢數(shù)據(jù)庫,大寫的????????,比如下面的這張圖:
解決方案就是讓朋友檢查下表示式樹的寫法問題,或者直接灌寫好的sql得了,說實話這個dump還是費了九牛二虎之力,本以為很簡單,實操起來還是碰到了一點小困難,就當歷練成長吧!
總結
以上是生活随笔為你收集整理的记一次 .NET 某WMS仓储打单系统 内存暴涨分析的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: .Net Core with 微服务 -
- 下一篇: C# 10 完整特性介绍