17 Memory Compaction
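The buddy system manages memory in units of pages, and memory fragmentation is likewise page-based: it arises when free memory is scattered across a large number of discrete, non-contiguous pages. Fragmentation is harmful from the memory-management point of view. Some physical devices need a large contiguous range of physical memory, and if the kernel cannot provide it the allocation fails, which in the worst case can end in a kernel panic. Fragmentation is like a squad on a long march during military training: after a while the formation falls apart and has to be re-formed. This chapter therefore uses the term 内存规整 (literally "memory tidying"), while some literature uses 内存紧凑; both refer to the kernel's memory compaction feature, which exists to counter fragmentation.
The basic idea behind defragmentation in the kernel is to group pages by mobility. Migrating physical memory used by the kernel itself is difficult and complex, so the current kernel does not migrate pages that the kernel itself uses. Pages used by application processes, in contrast, are accessed through user page-table mappings. Those mappings can be changed to point at a different physical page without affecting the user process, which is why memory compaction is built on top of page migration.
Memory compaction implementation:
An important use case for memory compaction is the allocation of large blocks (order > 0). If the allocation fails at the low watermark (WMARK_LOW) and still cannot be satisfied after the kswapd kernel thread has been woken, the kernel calls __alloc_pages_direct_compact() to compact memory and tries the allocation again. The rest of this section follows the kernel path alloc_pages()->...->__alloc_pages_direct_compact() to see how memory compaction works.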
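To make the idea concrete, here is a minimal toy model of compaction by migration, written as a sketch rather than as kernel code. The type enum toy_page and the functions toy_compact() and toy_longest_free_run() are hypothetical names introduced only for this illustration. A "migrate scanner" walks up from the start of an array looking for movable pages, a "free scanner" walks down from the end looking for free slots, and unmovable pages are never touched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states for a toy zone; these are not kernel definitions. */
enum toy_page { TOY_FREE, TOY_MOVABLE, TOY_UNMOVABLE };

/* Length of the longest run of consecutive free pages. */
size_t toy_longest_free_run(const enum toy_page *zone, size_t nr_pages)
{
	size_t best = 0, run = 0, i;

	for (i = 0; i < nr_pages; i++) {
		run = (zone[i] == TOY_FREE) ? run + 1 : 0;
		if (run > best)
			best = run;
	}
	return best;
}

/*
 * Move movable pages from the low end of the zone into free slots at the
 * high end. Stops when the two scanners meet. Returns the number of pages
 * moved. Unmovable pages stay where they are.
 */
size_t toy_compact(enum toy_page *zone, size_t nr_pages)
{
	size_t lo = 0;		/* migrate scanner: next index to examine */
	size_t hi = nr_pages;	/* free scanner: one past the next index */
	size_t moved = 0;

	assert(zone != NULL);

	for (;;) {
		while (lo < hi && zone[lo] != TOY_MOVABLE)
			lo++;
		while (hi > lo && zone[hi - 1] != TOY_FREE)
			hi--;
		if (lo >= hi)
			break;	/* scanners met: nothing left to do */

		/* "Migrate" the page: the source slot becomes free. */
		zone[hi - 1] = TOY_MOVABLE;
		zone[lo] = TOY_FREE;
		moved++;
		lo++;
		hi--;
	}
	return moved;
}
```

In this model the free pages end up clustered at the low end of the array, so a request for several contiguous pages that failed before compaction can succeed afterwards, even though the total number of free pages has not changed.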
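The role of the watermark in this scenario can be sketched with a small model. The code below is only loosely modeled on the order-aware watermark check used by kernels of this era and on the three-way decision that compaction_suitable() makes later in this chapter. struct toy_zone and every toy_* name are assumptions made for the illustration, and details such as lowmem_reserve, alloc_flags, CMA pages, and the fragmentation index are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 11

/* Hypothetical, heavily simplified stand-in for struct zone. */
struct toy_zone {
	unsigned long nr_free[TOY_MAX_ORDER];	/* free blocks per order */
	unsigned long wmark_low;		/* low watermark, in pages */
};

enum toy_suit { TOY_SKIPPED, TOY_PARTIAL, TOY_CONTINUE };

/* Would an allocation of the given order leave the zone above mark? */
bool toy_watermark_ok(const struct toy_zone *z, unsigned int order,
		      unsigned long mark)
{
	long free_pages = 0;
	long min = (long)mark;
	unsigned int o;

	assert(order < TOY_MAX_ORDER);

	for (o = 0; o < TOY_MAX_ORDER; o++)
		free_pages += (long)(z->nr_free[o] << o);

	/* Account for the pages the request itself would take. */
	free_pages -= (1L << order) - 1;
	if (free_pages <= min)
		return false;

	for (o = 0; o < order; o++) {
		/* Blocks smaller than the request cannot satisfy it. */
		free_pages -= (long)(z->nr_free[o] << o);
		/* Demand fewer free pages at each higher order. */
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}

/* Three-way decision: allocate directly, give up, or compact. */
enum toy_suit toy_compaction_suitable(const struct toy_zone *z,
				      unsigned int order)
{
	/* The high-order request already fits: no compaction needed. */
	if (toy_watermark_ok(z, order, z->wmark_low))
		return TOY_PARTIAL;

	/* Too few order-0 pages to serve as migration targets. */
	if (!toy_watermark_ok(z, 0, z->wmark_low + (2UL << order)))
		return TOY_SKIPPED;

	return TOY_CONTINUE;
}
```

A zone holding 1000 free pages that are all order-0 passes the order-0 check but fails an order-2 check, which is exactly the situation in which compaction is worth trying.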
[mm/page_alloc.c]
[alloc_pages()->alloc_pages_node()->__alloc_pages()->__alloc_pages_nodemask()->__alloc_pages_slowpath()->__alloc_pages_direct_compact()]
/* Try memory compaction for high-order allocations before reclaim */
/*
 * The mode parameter is the migrate_mode. It is normally passed down from
 * __alloc_pages_slowpath(), where its value is MIGRATE_ASYNC.
 */
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
		int alloc_flags, const struct alloc_context *ac,
		enum migrate_mode mode, int *contended_compaction,
		bool *deferred_compaction)
{
	unsigned long compact_result;
	struct page *page;

	/*
	 * Compaction targets high-order allocations, so an order-0 request
	 * never triggers it.
	 */
	if (!order)
		return NULL;

	current->flags |= PF_MEMALLOC;
	/*
	 * try_to_compact_pages() must run with PF_MEMALLOC set on the current
	 * process. The flag is checked during page migration to avoid
	 * deadlocking on the page lock (PG_locked). This function is examined
	 * next.
	 */
	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
						mode, contended_compaction);
	current->flags &= ~PF_MEMALLOC;

	switch (compact_result) {
	case COMPACT_DEFERRED:
		*deferred_compaction = true;
		/* fall-through */
	case COMPACT_SKIPPED:
		return NULL;
	default:
		break;
	}

	/*
	 * At least in one zone compaction wasn't deferred or skipped, so let's
	 * count a compaction stall
	 */
	count_vm_event(COMPACTSTALL);

	/*
	 * Once compaction has run, call get_page_from_freelist() to retry the
	 * allocation. On success it returns the struct page of the first page.
	 */
	page = get_page_from_freelist(gfp_mask, order,
					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);

	if (page) {
		struct zone *zone = page_zone(page);

		zone->compact_blockskip_flush = false;
		compaction_defer_reset(zone, order, true);
		count_vm_event(COMPACTSUCCESS);
		return page;
	}

	/*
	 * It's bad if compaction run occurs and fails. The most likely reason
	 * is that pages exist, but not enough to satisfy watermarks.
	 */
	count_vm_event(COMPACTFAIL);

	cond_resched();

	return NULL;
}

try_to_compact_pages() implementation:
[__alloc_pages_direct_compact()->try_to_compact_pages()]
/**
 * try_to_compact_pages - Direct compact to satisfy a high-order allocation
 * @gfp_mask: The GFP mask of the current allocation
 * @order: The order of the current allocation
 * @alloc_flags: The allocation flags of the current allocation
 * @ac: The context of current allocation
 * @mode: The migration mode for async, sync light, or sync migration
 * @contended: Return value that determines if compaction was aborted due to
 *	       need_resched() or lock contention
 *
 * This is the main entry point for direct page compaction.
 */
unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
			int alloc_flags, const struct alloc_context *ac,
			enum migrate_mode mode, int *contended)
{
	int may_enter_fs = gfp_mask & __GFP_FS;
	int may_perform_io = gfp_mask & __GFP_IO;
	struct zoneref *z;
	struct zone *zone;
	int rc = COMPACT_DEFERRED;
	int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */

	*contended = COMPACT_CONTENDED_NONE;

	/* Check if the GFP flags allow compaction */
	if (!order || !may_enter_fs || !may_perform_io)
		return COMPACT_SKIPPED;

	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, mode);

	/* Compact each zone in the list */
	/*
	 * The for_each_zone_zonelist_nodemask() macro uses the allocation mask
	 * to decide which zones are scanned and visited.
	 */
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
								ac->nodemask) {
		int status;
		int zone_contended;

		if (compaction_deferred(zone, order))
			continue;

		/*
		 * compact_zone_order() compacts one specific zone. It is
		 * examined below.
		 */
		status = compact_zone_order(zone, order, gfp_mask, mode,
				&zone_contended, alloc_flags,
				ac->classzone_idx);
		rc = max(status, rc);
		/*
		 * It takes at least one zone that wasn't lock contended
		 * to clear all_zones_contended.
		 */
		all_zones_contended &= zone_contended;

		/* If a normal allocation would succeed, stop compacting */
		/*
		 * zone_watermark_ok() checks whether the zone is now above the
		 * WMARK_LOW watermark. If so, leave the loop.
		 */
		if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
					ac->classzone_idx, alloc_flags)) {
			/*
			 * We think the allocation will succeed in this zone,
			 * but it is not certain, hence the false. The caller
			 * will repeat this with true if allocation indeed
			 * succeeds in this zone.
			 */
			compaction_defer_reset(zone, order, false);
			/*
			 * It is possible that async compaction aborted due to
			 * need_resched() and the watermarks were ok thanks to
			 * somebody else freeing memory. The allocation can
			 * however still fail so we better signal the
			 * need_resched() contention anyway (this will not
			 * prevent the allocation attempt).
			 */
			if (zone_contended == COMPACT_CONTENDED_SCHED)
				*contended = COMPACT_CONTENDED_SCHED;

			goto break_loop;
		}

		if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
			/*
			 * We think that allocation won't succeed in this zone
			 * so we defer compaction there. If it ends up
			 * succeeding after all, it will be reset.
			 */
			defer_compaction(zone, order);
		}

		/*
		 * We might have stopped compacting due to need_resched() in
		 * async compaction, or due to a fatal signal detected. In that
		 * case do not try further zones and signal need_resched()
		 * contention.
		 */
		if ((zone_contended == COMPACT_CONTENDED_SCHED)
					|| fatal_signal_pending(current)) {
			*contended = COMPACT_CONTENDED_SCHED;
			goto break_loop;
		}

		continue;
break_loop:
		/*
		 * We might not have tried all the zones, so be conservative
		 * and assume they are not all lock contended.
		 */
		all_zones_contended = 0;
		break;
	}

	/*
	 * If at least one zone wasn't deferred or skipped, we report if all
	 * zones that were tried were lock contended.
	 */
	if (rc > COMPACT_SKIPPED && all_zones_contended)
		*contended = COMPACT_CONTENDED_LOCK;

	return rc;
}

Control then returns to __alloc_pages_direct_compact(). Next is the implementation of compact_zone_order():
[__alloc_pages_direct_compact()->try_to_compact_pages()->compact_zone_order()]
As in the kswapd code, a control structure is defined here to pass parameters around: struct compact_control cc. cc.migratepages is the list of pages that are about to be migrated, and cc.freepages is the list of free pages that serve as migration destinations.

static unsigned long compact_zone_order(struct zone *zone, int order,
		gfp_t gfp_mask, enum migrate_mode mode, int *contended,
		int alloc_flags, int classzone_idx)
{
	unsigned long ret;
	struct compact_control cc = {
		.nr_freepages = 0,
		.nr_migratepages = 0,
		.order = order,
		.gfp_mask = gfp_mask,
		.zone = zone,
		.mode = mode,
		.alloc_flags = alloc_flags,
		.classzone_idx = classzone_idx,
	};
	INIT_LIST_HEAD(&cc.freepages);
	INIT_LIST_HEAD(&cc.migratepages);

	/* compact_zone() is examined below. */
	ret = compact_zone(zone, &cc);

	VM_BUG_ON(!list_empty(&cc.freepages));
	VM_BUG_ON(!list_empty(&cc.migratepages));

	*contended = cc.contended;
	return ret;
}

Control then returns to try_to_compact_pages(). Next is the implementation of compact_zone():
[alloc_pages()->alloc_pages_node()->__alloc_pages()->__alloc_pages_nodemask()->
__alloc_pages_slowpath()->__alloc_pages_direct_compact()->try_to_compact_pages()->compact_zone_order()->compact_zone()]
static int compact_zone(struct zone *zone, struct compact_control *cc)
{
	int ret;
	unsigned long start_pfn = zone->zone_start_pfn;
	unsigned long end_pfn = zone_end_pfn(zone);
	const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
	const bool sync = cc->mode != MIGRATE_ASYNC;
	unsigned long last_migrated_pfn = 0;

	/*
	 * Decide from the current watermarks whether compaction is needed.
	 * compaction_suitable() is examined below.
	 */
	ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
							cc->classzone_idx);
	switch (ret) {
	case COMPACT_PARTIAL:
	case COMPACT_SKIPPED:
		/* Compaction is likely to fail */
		return ret;
	case COMPACT_CONTINUE:
		/* Fall through to compaction */
		;
	}

	/*
	 * Clear pageblock skip if there were failures recently and compaction
	 * is about to be retried after being deferred. kswapd does not do
	 * this reset as it'll reset the cached information when going to sleep.
	 */
	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
		__reset_isolation_suitable(zone);

	/*
	 * Setup to move all movable pages to the end of the zone. Used cached
	 * information on where the scanners should start but check that it
	 * is initialised by ensuring the values are within zone boundaries.
	 */
	/*
	 * Set cc->migrate_pfn and cc->free_pfn. Put simply, unless a valid
	 * cached position exists, cc->migrate_pfn is set to the first pfn of
	 * the zone (zone->zone_start_pfn), so the search for migratable pages
	 * starts at the first page of the zone. cc->free_pfn is set to the
	 * last pageblock of the zone, so the search for free pages that can
	 * serve as migration destinations starts at the end of the zone.
	 */
	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
	cc->free_pfn = zone->compact_cached_free_pfn;
	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {
		cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1);
		zone->compact_cached_free_pfn = cc->free_pfn;
	}
	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
		cc->migrate_pfn = start_pfn;
		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
	}

	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync);

	migrate_prep_local();

	/*
	 * The while loop scans from the start of the zone for pages that can
	 * be migrated and tries to move them into free pages at the end of the
	 * zone, until the zone is above the low watermark WMARK_LOW.
	 * compact_finished() decides whether compaction can stop. It is
	 * examined below.
	 */
	while ((ret = compact_finished(zone, cc, migratetype)) ==
						COMPACT_CONTINUE) {
		int err;
		unsigned long isolate_start_pfn = cc->migrate_pfn;

		/*
		 * isolate_migratepages() scans the zone for migratable pages
		 * and adds them to the cc->migratepages list. It is examined
		 * below.
		 */
		switch (isolate_migratepages(zone, cc)) {
		case ISOLATE_ABORT:
			ret = COMPACT_PARTIAL;
			putback_movable_pages(&cc->migratepages);
			cc->nr_migratepages = 0;
			goto out;
		case ISOLATE_NONE:
			/*
			 * We haven't isolated and migrated anything, but
			 * there might still be unflushed migrations from
			 * previous cc->order aligned block.
			 */
			goto check_drain;
		case ISOLATE_SUCCESS:
			;
		}

		/*
		 * migrate_pages() is the core migration function. It takes
		 * pages off the cc->migratepages list and tries to migrate
		 * them. It is examined below.
		 */
		err = migrate_pages(&cc->migratepages, compaction_alloc,
				compaction_free, (unsigned long)cc, cc->mode,
				MR_COMPACTION);

		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
							&cc->migratepages);

		/* All pages were either migrated or will be released */
		cc->nr_migratepages = 0;
		/*
		 * Handle migration failure: pages that were not migrated are
		 * put back on the appropriate LRU list.
		 */
		if (err) {
			putback_movable_pages(&cc->migratepages);
			/*
			 * migrate_pages() may return -ENOMEM when scanners meet
			 * and we want compact_finished() to detect it
			 */
			if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
				ret = COMPACT_PARTIAL;
				goto out;
			}
		}

		/*
		 * Record where we could have freed pages by migration and not
		 * yet flushed them to buddy allocator. We use the pfn that
		 * isolate_migratepages() started from in this loop iteration
		 * - this is the lowest page that could have been isolated and
		 * then freed by migration.
		 */
		if (!last_migrated_pfn)
			last_migrated_pfn = isolate_start_pfn;

check_drain:
		/*
		 * Has the migration scanner moved away from the previous
		 * cc->order aligned block where we migrated from? If yes,
		 * flush the pages that were freed, so that they can merge and
		 * compact_finished() can detect immediately if allocation
		 * would succeed.
		 */
		if (cc->order > 0 && last_migrated_pfn) {
			int cpu;
			unsigned long current_block_start =
				cc->migrate_pfn & ~((1UL << cc->order) - 1);

			if (last_migrated_pfn < current_block_start) {
				cpu = get_cpu();
				lru_add_drain_cpu(cpu);
				drain_local_pages(zone);
				put_cpu();
				/* No more flushing until we migrate again */
				last_migrated_pfn = 0;
			}
		}

	}

out:
	/*
	 * Release free pages and update where the free scanner should restart,
	 * so we don't leave any returned pages behind in the next attempt.
	 */
	if (cc->nr_freepages > 0) {
		unsigned long free_pfn = release_freepages(&cc->freepages);

		cc->nr_freepages = 0;
		VM_BUG_ON(free_pfn == 0);
		/* The cached pfn is always the first in a pageblock */
		free_pfn &= ~(pageblock_nr_pages-1);
		/*
		 * Only go back, not forward. The cached pfn might have been
		 * already reset to zone end in compact_finished()
		 */
		if (free_pfn > zone->compact_cached_free_pfn)
			zone->compact_cached_free_pfn = free_pfn;
	}

	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync, ret);

	return ret;
}

Control then returns to compact_zone_order(). Next is the implementation of compaction_suitable(), which decides from the current watermarks whether compaction is needed:
unsigned long compaction_suitable(struct zone *zone, int order,
					int alloc_flags, int classzone_idx)
{
	unsigned long ret;

	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
	trace_mm_compaction_suitable(zone, order, ret);
	if (ret == COMPACT_NOT_SUITABLE_ZONE)
		ret = COMPACT_SKIPPED;

	return ret;
}

/*
 * compaction_suitable: Is this suitable to run compaction on this zone now?
 * Returns
 *   COMPACT_SKIPPED  - If there are too few free pages for compaction
 *   COMPACT_PARTIAL  - If the allocation would succeed without compaction
 *   COMPACT_CONTINUE - If compaction should run now
 */
static unsigned long __compaction_suitable(struct zone *zone, int order,
					int alloc_flags, int classzone_idx)
{
	int fragindex;
	unsigned long watermark;

	/*
	 * order == -1 is expected when compacting via
	 * /proc/sys/vm/compact_memory
	 */
	if (order == -1)
		return COMPACT_CONTINUE;

	/* Using the low watermark WMARK_LOW as the baseline, make three checks. */
	watermark = low_wmark_pages(zone);
	/*
	 * If watermarks for high-order allocation are already met, there
	 * should be no need for compaction at all.
	 */
	/*
	 * (1) Check with the requested order whether the zone is above
	 * WMARK_LOW. If it is, return COMPACT_PARTIAL: no compaction needed.
	 */
	if (zone_watermark_ok(zone, order, watermark, classzone_idx,
								alloc_flags))
		return COMPACT_PARTIAL;

	/*
	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
	 * This is because during migration, copies of pages need to be
	 * allocated and for a short time, the footprint is higher
	 */
	/*
	 * (2) Check with order 0 whether the zone is above
	 * WMARK_LOW + (2 << order). If it is not, the zone has very few free
	 * pages and is not worth compacting, so return COMPACT_SKIPPED to
	 * skip this zone.
	 */
	watermark += (2UL << order);
	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
		return COMPACT_SKIPPED;

	/*
	 * (3) Otherwise return COMPACT_CONTINUE, meaning the zone can be
	 * compacted, unless the fragmentation index below indicates that the
	 * failure is caused by a lack of memory rather than by fragmentation.
	 */
	/*
	 * fragmentation index determines if allocation failures are due to
	 * low memory or external fragmentation
	 *
	 * index of -1000 would imply allocations might succeed depending on
	 * watermarks, but we already failed the high-order watermark check
	 * index towards 0 implies failure is due to lack of memory
	 * index towards 1000 implies failure is due to fragmentation
	 *
	 * Only compact if a failure would be due to fragmentation.
	 */
	fragindex = fragmentation_index(zone, order);
	if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
		return COMPACT_NOT_SUITABLE_ZONE;

	return COMPACT_CONTINUE;
}

Control then returns to compact_zone(). Next is the implementation of compact_finished():
static int compact_finished(struct zone *zone, struct compact_control *cc,
			    const int migratetype)
{
	int ret;

	ret = __compact_finished(zone, cc, migratetype);
	trace_mm_compaction_finished(zone, cc->order, ret);
	if (ret == COMPACT_NO_SUITABLE_PAGE)
		ret = COMPACT_CONTINUE;

	return ret;
}

static int __compact_finished(struct zone *zone, struct compact_control *cc,
			    const int migratetype)
{
	unsigned int order;
	unsigned long watermark;

	if (cc->contended || fatal_signal_pending(current))
		return COMPACT_PARTIAL;

	/* Compaction run completes if the migrate and free scanner meet */
	/*
	 * There are two termination conditions:
	 * (1) The cc->migrate_pfn and cc->free_pfn scanners meet. They start
	 *     at the head and the tail of the zone and move toward the middle.
	 * (2) Checked with the requested order, the zone is above the low
	 *     watermark WMARK_LOW. In that case the buddy system's free lists
	 *     must also be inspected, starting with the list of the requested
	 *     migrate type at that order, for example
	 *     zone->free_area[order].free_list[MIGRATE_MOVABLE]. The best
	 *     outcome is that the free_area list for exactly this order has a
	 *     free block; failing that, a list of a larger order has one, or
	 *     an order at or above pageblock_order has any free block at all.
	 */
	if (cc->free_pfn <= cc->migrate_pfn) {
		/* Let the next compaction start anew. */
		zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
		zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
		zone->compact_cached_free_pfn = zone_end_pfn(zone);

		/*
		 * Mark that the PG_migrate_skip information should be cleared
		 * by kswapd when it goes to sleep. kswapd does not set the
		 * flag itself as the decision to be clear should be directly
		 * based on an allocation request.
		 */
		if (!current_is_kswapd())
			zone->compact_blockskip_flush = true;

		return COMPACT_COMPLETE;
	}

	/*
	 * order == -1 is expected when compacting via
	 * /proc/sys/vm/compact_memory
	 */
	if (cc->order == -1)
		return COMPACT_CONTINUE;

	/* Compaction run is not finished if the watermark is not met */
	watermark = low_wmark_pages(zone);

	if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
							cc->alloc_flags))
		return COMPACT_CONTINUE;

	/* Direct compactor: Is a suitable page free? */
	for (order = cc->order; order < MAX_ORDER; order++) {
		struct free_area *area = &zone->free_area[order];

		/* Job done if page is free of the right migratetype */
		if (!list_empty(&area->free_list[migratetype]))
			return COMPACT_PARTIAL;

		/* Job done if allocation would set block type */
		if (order >= pageblock_order && area->nr_free)
			return COMPACT_PARTIAL;
	}

	return COMPACT_NO_SUITABLE_PAGE;
}

Control then returns to compact_zone(). Next is the implementation of isolate_migratepages():
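isolate_migratepages() scans for pages that are suitable for migration. It starts at the head of the zone and advances in steps of pageblock_nr_pages. The Linux kernel manages the migration attribute of pages at pageblock granularity. The migrate types include MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_PCPTYPES, and MIGRATE_CMA, among others, and the kernel provides two functions to manage them: get_pageblock_migratetype() and set_pageblock_migratetype(). During kernel initialization all pages are initially marked MIGRATE_MOVABLE; see memmap_init_zone() in mm/page_alloc.c. pageblock_nr_pages is typically 1024 pages, that is, 1UL << (MAX_ORDER - 1) with the default MAX_ORDER of 11; on configurations with huge page support, pageblock_order follows the huge page order instead.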
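The pageblock arithmetic used by the scanner can be sketched as follows. The TOY_* constants and toy_* helpers are hypothetical names that assume the MAX_ORDER = 11, no-huge-page case described above; toy_align() is meant to behave like the kernel's ALIGN() macro for power-of-two alignments.

```c
#include <assert.h>

#define TOY_MAX_ORDER		11UL
#define TOY_PAGEBLOCK_ORDER	(TOY_MAX_ORDER - 1)
#define TOY_PAGEBLOCK_NR_PAGES	(1UL << TOY_PAGEBLOCK_ORDER)

/* Round x up to a multiple of a, where a must be a power of two. */
unsigned long toy_align(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* First pfn of the pageblock that contains pfn. */
unsigned long toy_pageblock_start(unsigned long pfn)
{
	return pfn & ~(TOY_PAGEBLOCK_NR_PAGES - 1);
}

/*
 * One past the last pfn of the pageblock that contains pfn. The "+ 1"
 * makes a pfn that is already block-aligned map to the end of its own
 * block instead of to itself.
 */
unsigned long toy_pageblock_end(unsigned long pfn)
{
	return toy_align(pfn + 1, TOY_PAGEBLOCK_NR_PAGES);
}
```

This mirrors the end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) computation in isolate_migratepages(), which confines each scan to a single pageblock.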
[alloc_pages()->alloc_pages_node()->__alloc_pages()->__alloc_pages_nodemask()->__alloc_pages_slowpath()->
__alloc_pages_direct_compact()->try_to_compact_pages()->compact_zone_order()->compact_zone()->isolate_migratepages()]
/*
 * Isolate all pages that can be migrated from the first suitable block,
 * starting at the block pointed to by the migrate scanner pfn within
 * compact_control.
 */
static isolate_migrate_t isolate_migratepages(struct zone *zone,
					struct compact_control *cc)
{
	unsigned long low_pfn, end_pfn;
	struct page *page;
	/* Choose the isolation mode; usually this is ISOLATE_ASYNC_MIGRATE. */
	const isolate_mode_t isolate_mode =
		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);

	/*
	 * Start at where we last stopped, or beginning of the zone as
	 * initialized by compact_zone()
	 */
	low_pfn = cc->migrate_pfn;

	/* Only scan within a pageblock boundary */
	end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);

	/*
	 * Iterate over whole pageblocks until we find the first suitable.
	 * Do not cross the free scanner.
	 */
	/*
	 * Scan from cc->migrate_pfn at the head of the zone toward the tail,
	 * one pageblock (pageblock_nr_pages pages) at a time.
	 */
	for (; end_pfn <= cc->free_pfn;
			low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {

		/*
		 * This can potentially iterate a massively long zone with
		 * many pageblocks unsuitable, so periodically check if we
		 * need to schedule, or even abort async compaction.
		 */
		if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
						&& compact_should_abort(cc))
			break;

		page = pageblock_pfn_to_page(low_pfn, end_pfn, zone);
		if (!page)
			continue;

		/* If isolation recently failed, do not retry */
		if (!isolation_suitable(cc, page))
			continue;

		/*
		 * For async compaction, also only scan in MOVABLE blocks.
		 * Async compaction is optimistic to see if the minimum amount
		 * of work satisfies the allocation.
		 */
		/*
		 * Check whether the pageblock is of type MIGRATE_MOVABLE or
		 * MIGRATE_CMA, because pages of these two types can be
		 * migrated. cc->mode is the migration mode passed down from
		 * __alloc_pages_slowpath(); the migration_mode argument there
		 * is usually asynchronous, that is, MIGRATE_ASYNC.
		 */
		if (cc->mode == MIGRATE_ASYNC &&
		    !migrate_async_suitable(get_pageblock_migratetype(page)))
			continue;

		/* Perform the isolation */
		/*
		 * Scan the pageblock and isolate the pages in it that are
		 * suitable for migration. isolate_migratepages_block() is
		 * examined below.
		 */
		low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
								isolate_mode);

		if (!low_pfn || cc->contended) {
			acct_isolated(zone, cc);
			return ISOLATE_ABORT;
		}

		/*
		 * Either we isolated something and proceed with migration. Or
		 * we failed and compact_zone should decide if we should
		 * continue or not.
		 */
		break;
	}

	acct_isolated(zone, cc);
	/*
	 * Record where migration scanner will be restarted. If we end up in
	 * the same pageblock as the free scanner, make the scanners fully
	 * meet so that compact_finished() terminates compaction.
	 */
	cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;

	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

Control then returns to compact_zone(). Next is the implementation of isolate_migratepages_block():
[alloc_pages()->alloc_pages_node()->__alloc_pages()->__alloc_pages_nodemask()->__alloc_pages_slowpath()->
__alloc_pages_direct_compact()->try_to_compact_pages()->compact_zone_order()->compact_zone()->
isolate_migratepages()->isolate_migratepages_block()]
/**
 * isolate_migratepages_block() - isolate all migrate-able pages within
 *				  a single pageblock
 * @cc:		Compaction control structure.
 * @low_pfn:	The first PFN to isolate
 * @end_pfn:	The one-past-the-last PFN to isolate, within same pageblock
 * @isolate_mode: Isolation mode to be used.
 *
 * Isolate all pages that can be migrated from the range specified by
 * [low_pfn, end_pfn). The range is expected to be within same pageblock.
 * Returns zero if there is a fatal signal pending, otherwise PFN of the
 * first page that was not scanned (which may be both less, equal to or more
 * than end_pfn).
 *
 * The pages are isolated on cc->migratepages list (not required to be empty),
 * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
 * is neither read nor updated.
 */
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
			unsigned long end_pfn, isolate_mode_t isolate_mode)
{
	struct zone *zone = cc->zone;
	unsigned long nr_scanned = 0, nr_isolated = 0;
	struct list_head *migratelist = &cc->migratepages;
	struct lruvec *lruvec;
	unsigned long flags = 0;
	bool locked = false;
	struct page *page = NULL, *valid_page = NULL;
	unsigned long start_pfn = low_pfn;

	/*
	 * Ensure that there are not too many pages isolated from the LRU
	 * list by either parallel reclaimers or compaction. If there are,
	 * delay for some time until fewer pages are isolated
	 */
	/*
	 * If too_many_isolated() finds that many pages are currently isolated
	 * from the LRU lists, sleep for 100 ms (congestion_wait). In
	 * asynchronous mode (MIGRATE_ASYNC), return immediately instead.
	 */
	while (unlikely(too_many_isolated(zone))) {
		/* async migration should just abort */
		if (cc->mode == MIGRATE_ASYNC)
			return 0;

		congestion_wait(BLK_RW_ASYNC, HZ/10);

		if (fatal_signal_pending(current))
			return 0;
	}

	if (compact_should_abort(cc))
		return 0;

	/* Time to isolate some pages for migration */
	/* This loop scans the pageblock for pages that can be migrated. */
	for (; low_pfn < end_pfn; low_pfn++) {
		/*
		 * Periodically drop the lock (if held) regardless of its
		 * contention, to give chance to IRQs. Abort async compaction
		 * if contended.
		 */
		if (!(low_pfn % SWAP_CLUSTER_MAX)
		    && compact_unlock_should_abort(&zone->lru_lock, flags,
								&locked, cc))
			break;

		if (!pfn_valid_within(low_pfn))
			continue;
		nr_scanned++;

		page = pfn_to_page(low_pfn);

		if (!valid_page)
			valid_page = page;

		/*
		 * Skip if free. We read page order here without zone lock
		 * which is generally unsafe, but the race window is small and
		 * the worst thing that can happen is that we skip some
		 * potential isolation targets.
		 */
		/*
		 * A page that is still in the buddy system is not a migration
		 * candidate, so skip it. page_order_unsafe() reads the order
		 * of the free block so the loop can jump over all of it.
		 */
		if (PageBuddy(page)) {
			unsigned long freepage_order = page_order_unsafe(page);

			/*
			 * Without lock, we cannot be sure that what we got is
			 * a valid page order. Consider only values in the
			 * valid order range to prevent low_pfn overflow.
			 */
			if (freepage_order > 0 && freepage_order < MAX_ORDER)
				low_pfn += (1UL << freepage_order) - 1;
			continue;
		}

		/*
		 * Check may be lockless but that's ok as we recheck later.
		 * It's possible to migrate LRU pages and balloon pages
		 * Skip any other type of page
		 */
		/*
		 * Pages on an LRU list and balloon pages can be migrated.
		 * Every other kind of page is skipped.
		 */
		if (!PageLRU(page)) {
			if (unlikely(balloon_page_movable(page))) {
				if (balloon_page_isolate(page)) {
					/* Successfully isolated */
					goto isolate_success;
				}
			}
			continue;
		}

		/*
		 * PageLRU is set. lru_lock normally excludes isolation
		 * splitting and collapsing (collapsing has already happened
		 * if PageLRU is set) but the lock is not necessarily taken
		 * here and it is wasteful to take it just to check transhuge.
		 * Check TransHuge without lock and skip the whole pageblock if
		 * it's either a transhuge or hugetlbfs page, as calling
		 * compound_order() without preventing THP from splitting the
		 * page underneath us may return surprising results.
		 */
		if (PageTransHuge(page)) {
			if (!locked)
				low_pfn = ALIGN(low_pfn + 1,
						pageblock_nr_pages) - 1;
			else
				low_pfn += (1 << compound_order(page)) - 1;

			continue;
		}

		/*
		 * Migration will fail if an anonymous page is pinned in memory,
		 * so avoid taking lru_lock and isolating it unnecessarily in an
		 * admittedly racy check.
		 */
		/*
		 * PageBuddy pages and pages that are not on an LRU list have
		 * been ruled out, so what remains are reasonable candidates,
		 * but a few special cases still need filtering. If
		 * page_mapping() returns 0 the page is probably anonymous. For
		 * an anonymous page, page_count(page) normally equals
		 * page_mapcount(page), that is,
		 * page->_count == page->_mapcount + 1. If page_count() is
		 * larger, some kernel code holds an extra reference to the
		 * page (it is pinned), so it is not suitable for migration.
		 */
		if (!page_mapping(page) &&
		    page_count(page) > page_mapcount(page))
			continue;

		/* If we already hold the lock, we can skip some rechecking */
		/*
		 * Take zone->lru_lock and check again that the page is still
		 * on an LRU list.
		 */
		if (!locked) {
			locked = compact_trylock_irqsave(&zone->lru_lock,
								&flags, cc);
			if (!locked)
				break;

			/* Recheck PageLRU and PageTransHuge under lock */
			if (!PageLRU(page))
				continue;
			if (PageTransHuge(page)) {
				low_pfn += (1 << compound_order(page)) - 1;
				continue;
			}
		}

		lruvec = mem_cgroup_page_lruvec(page, zone);

		/* Try isolate the page */
		/*
		 * __isolate_lru_page() isolates the page in
		 * ISOLATE_ASYNC_MIGRATE mode. As analysed earlier, a page
		 * under writeback is rejected, and so is a dirty page whose
		 * mapping does not define mapping->a_ops->migratepage(). On
		 * success the function also increments the page->_count
		 * reference count and clears the PG_lru flag.
		 */
		if (__isolate_lru_page(page, isolate_mode) != 0)
			continue;

		VM_BUG_ON_PAGE(PageTransCompound(page), page);

		/* Successfully isolated */
		/* Remove the page from its LRU list. */
		del_page_from_lru_list(page, lruvec, page_lru(page));

		/*
		 * The page is a valid migration candidate: add it to the
		 * cc->migratepages list.
		 */
isolate_success:
		list_add(&page->lru, migratelist);
		cc->nr_migratepages++;
		nr_isolated++;

		/* Avoid isolating too much */
		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
			++low_pfn;
			break;
		}
	}

	/*
	 * To summarise, a page is suitable for migration by compaction if:
	 * (1) It is on an LRU list. Pages still in the buddy system are not
	 *     suitable.
	 * (2) It is not under writeback, that is, not marked PG_writeback
	 *     (for asynchronous isolation).
	 * (3) It is not marked PG_unevictable.
	 * (4) It is not a dirty page whose mapping lacks a
	 *     mapping->a_ops->migratepage() method (for asynchronous
	 *     isolation).
	 */
	/*
	 * The PageBuddy() check could have potentially brought us outside
	 * the range to be scanned.
	 */
	if (unlikely(low_pfn > end_pfn))
		low_pfn = end_pfn;

	if (locked)
		spin_unlock_irqrestore(&zone->lru_lock, flags);

	/*
	 * Update the pageblock-skip information and cached scanner pfn,
	 * if the whole pageblock was scanned without isolating any page.
	 */
	if (low_pfn == end_pfn)
		update_pageblock_skip(cc, valid_page, nr_isolated, true);

	trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
						nr_scanned, nr_isolated);

	count_compact_events(COMPACTMIGRATE_SCANNED, nr_scanned);
	if (nr_isolated)
		count_compact_events(COMPACTISOLATED, nr_isolated);

	return low_pfn;
}

Control then returns to compact_zone(). Next is migrate_pages(), the core migration function: it takes pages off the cc->migratepages list and tries to migrate them. compaction_alloc() searches for free pages starting from the end of the zone and adds them to the cc->freepages list.
migrate_pages() was already covered in the section on page migration. Here its get_new_page() function pointer points to compaction_alloc(), its put_new_page() function pointer points to compaction_free(), the migration mode is cc->mode (usually MIGRATE_ASYNC), and the reason is MR_COMPACTION.
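The callback arrangement can be illustrated with a small self-contained sketch. Everything named toy_* below is hypothetical and greatly simplified: the real migrate_pages() works on lists of struct page, handles locking and retries, and can return error codes. The sketch only shows how the caller supplies the allocator and the "give back" routine for destination pages, plus an opaque private pointer that plays the role of the compact_control.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct toy_page {
	int data;
	int in_use;
	int pinned;	/* a pinned page cannot be migrated */
};

/* Callback types, loosely analogous to get_new_page / put_new_page. */
typedef struct toy_page *(*toy_new_page_t)(void *private_data);
typedef void (*toy_put_page_t)(struct toy_page *page, void *private_data);

/* A tiny stack of free destination pages, like cc->freepages. */
struct toy_pool {
	struct toy_page *free[4];
	size_t nr_free;
};

struct toy_page *toy_pool_alloc(void *private_data)
{
	struct toy_pool *pool = private_data;

	return pool->nr_free ? pool->free[--pool->nr_free] : NULL;
}

void toy_pool_free(struct toy_page *page, void *private_data)
{
	struct toy_pool *pool = private_data;

	assert(pool->nr_free < 4);
	pool->free[pool->nr_free++] = page;
}

/*
 * Try to migrate every page in from[]. On success from[i] is updated to
 * point at the new page. Returns the number of pages NOT migrated.
 */
size_t toy_migrate_pages(struct toy_page **from, size_t nr,
			 toy_new_page_t get_new_page,
			 toy_put_page_t put_new_page, void *private_data)
{
	size_t i, failed = 0;

	for (i = 0; i < nr; i++) {
		struct toy_page *dst = get_new_page(private_data);

		if (!dst) {
			failed += nr - i;	/* no destinations left */
			break;
		}
		if (from[i]->pinned) {
			/* Migration failed: hand the unused page back. */
			put_new_page(dst, private_data);
			failed++;
			continue;
		}
		dst->data = from[i]->data;
		dst->in_use = 1;
		from[i]->in_use = 0;
		from[i] = dst;
	}
	return failed;
}
```

Passing the pool through a private pointer keeps the migration loop independent of where destination pages come from, which is the same reason compact_zone() passes cc cast to unsigned long.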
/*
 * This is a migrate-callback that "allocates" freepages by taking pages
 * from the isolated freelists in the block we are migrating to.
 */
/*
 * compaction_alloc() supplies the destination pages for migration. It looks
 * for free pages starting from the tail of the zone. The core of that search
 * is isolate_freepages(), which closely resembles the isolate_migratepages()
 * function seen earlier. compaction_alloc() finally returns one free page.
 */
static struct page *compaction_alloc(struct page *migratepage,
					unsigned long data,
					int **result)
{
	struct compact_control *cc = (struct compact_control *)data;
	struct page *freepage;

	/*
	 * Isolate free pages if necessary, and if we are not aborting due to
	 * contention.
	 */
	if (list_empty(&cc->freepages)) {
		if (!cc->contended)
			isolate_freepages(cc);

		if (list_empty(&cc->freepages))
			return NULL;
	}

	freepage = list_entry(cc->freepages.next, struct page, lru);
	list_del(&freepage->lru);
	cc->nr_freepages--;

	return freepage;
}