F2FS Source Code Analysis 5.2 [Data Recovery Flow]: Roll-back Recovery and the Role and Implementation of Checkpoint
The Role and Implementation of Checkpoint
Roll-back recovery restores the file system to the metadata state of the most recent Checkpoint, so F2FS has to write Checkpoint data to disk at specific moments.
When Checkpoints Are Triggered
A CP is an expensive operation, so choosing when to run it has a direct effect on performance. A CP is triggered by:
Foreground GC (FG_GC)
FASTBOOT
UMOUNT
DISCARD
RECOVERY
TRIM
Periodic execution
F2FS therefore defines several macros that record why a CP was triggered:
#define CP_UMOUNT    0x00000001
#define CP_FASTBOOT  0x00000002
#define CP_SYNC      0x00000004
#define CP_RECOVERY  0x00000008
#define CP_DISCARD   0x00000010
#define CP_TRIMMED   0x00000020

In most cases a CP is triggered with the CP_SYNC reason.
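Each reason occupies its own bit, so a reason value can be tested with a mask. The sketch below illustrates that idea only; the helper names `cp_reason_has` and `cp_is_shutdown_path` are hypothetical and are not kernel functions.

```c
#include <stdbool.h>

/* Reason flags, copied from the macro list above. */
#define CP_UMOUNT    0x00000001
#define CP_FASTBOOT  0x00000002
#define CP_SYNC      0x00000004
#define CP_RECOVERY  0x00000008
#define CP_DISCARD   0x00000010
#define CP_TRIMMED   0x00000020

/* Hypothetical helper: true if any bit of `mask` is set in `reason`. */
static bool cp_reason_has(int reason, int mask)
{
    return (reason & mask) != 0;
}

/* Hypothetical helper: is this checkpoint taken for umount or fastboot? */
static bool cp_is_shutdown_path(int reason)
{
    return cp_reason_has(reason, CP_UMOUNT | CP_FASTBOOT);
}
```

A caller would pass `cpc->reason` to such a helper to decide whether extra work (for example, writing node summaries) applies to the current checkpoint.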
The Core Checkpoint Flow
Entry function and core data structure

struct cp_control {
    int reason;          /* why the checkpoint was triggered; CP_SYNC in most cases */
    __u64 trim_start;    /* start block address of the range handled by the CP */
    __u64 trim_end;      /* end block address of the range handled by the CP */
    __u64 trim_minlen;   /* minimum length handled by the CP */
};

int f2fs_write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)

The following sections walk through the core flow of f2fs_write_checkpoint for the CP_SYNC reason.
int f2fs_write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)
{
    struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); // current CP structure, read from sbi
    ...
    err = block_operations(sbi); // stop all file system operations
    ...
    f2fs_flush_merged_writes(sbi); // flush every cached bio to disk
    ...
    ckpt_ver = cur_cp_version(ckpt); // get the current CP version
    ckpt->checkpoint_ver = cpu_to_le64(++ckpt_ver); // increment the CP version

    // update the NAT area of the metadata
    f2fs_flush_nat_entries(sbi, cpc); // flush all nat entries to disk

    // update the SIT area of the metadata
    f2fs_flush_sit_entries(sbi, cpc); // flush all sit entries to disk, handle dirty prefree segments

    // update the Checkpoint area and the Summary area of the metadata
    err = do_checkpoint(sbi, cpc); // core checkpoint step

    f2fs_clear_prefree_segments(sbi, cpc); // clear the dirty mark of the dirty prefree segments

    unblock_operations(sbi); // resume file system operations
    ...
    f2fs_update_time(sbi, CP_TIME); // update the CP timestamp
    ...
}

Analysis of the sub-functions involved in Checkpoint
Writeback of cached bios
Talking to the device is usually expensive, so file systems try to merge as many pages as possible into one bio before submitting it, which reduces the number of round trips. F2FS keeps struct f2fs_bio_info in sbi for this purpose. Its core idea is to cache one bio, add the pages that are about to be written back to it, and send the bio to disk only once it is as full as possible. It is declared in sbi as follows:
struct f2fs_sb_info {
    ...
    struct f2fs_bio_info *write_io[NR_PAGE_TYPE]; // indexed by page type (DATA/NODE/META); each entry points to one f2fs_bio_info per temperature (HOT/WARM/COLD)
    ...
}

During a checkpoint the cached pages must be written back so that the disk holds the latest stable state of the system. The function responsible is f2fs_flush_merged_writes. It calls f2fs_submit_merged_write for DATA, NODE and META in turn. Each call goes through __submit_merged_write_cond, which walks the HOT/WARM/COLD entries of sbi->write_io, and ends in __submit_merged_bio, which takes the bio out of sbi->write_io and submits it to the device.
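The merge-then-flush idea can be illustrated with a toy buffer before reading the kernel code. This is a minimal sketch, not the F2FS implementation: `toy_bio_info`, `toy_add_page`, `toy_submit_merged` and the capacity of 4 are all invented for the example, and "submitting" only increments a counter.

```c
#include <stddef.h>

#define TOY_BIO_MAX_PAGES 4 /* assumption: tiny capacity so the demo is easy to follow */

/* Toy stand-in for a cached bio: pages wait here until they are submitted. */
struct toy_bio_info {
    int pages[TOY_BIO_MAX_PAGES]; /* page numbers waiting to be written */
    size_t nr_pages;              /* how many pages are currently cached */
    size_t nr_submits;            /* how many times the "device" was hit */
};

/* Submit whatever is cached; does nothing when the buffer is empty. */
static void toy_submit_merged(struct toy_bio_info *io)
{
    if (io->nr_pages == 0)
        return;
    io->nr_submits++;
    io->nr_pages = 0;
}

/* Cache one page; submit first if the buffer is already full. */
static void toy_add_page(struct toy_bio_info *io, int page)
{
    if (io->nr_pages == TOY_BIO_MAX_PAGES)
        toy_submit_merged(io);
    io->pages[io->nr_pages++] = page;
}
```

Adding six pages to an empty buffer causes one submission and leaves two pages cached. A forced flush, which is what the checkpoint performs, then pushes out the remainder.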
void f2fs_flush_merged_writes(struct f2fs_sb_info *sbi)
{
    f2fs_submit_merged_write(sbi, DATA);
    f2fs_submit_merged_write(sbi, NODE);
    f2fs_submit_merged_write(sbi, META);
}

void f2fs_submit_merged_write(struct f2fs_sb_info *sbi, enum page_type type)
{
    __submit_merged_write_cond(sbi, NULL, 0, 0, type, true);
}

static void __submit_merged_write_cond(struct f2fs_sb_info *sbi,
                struct inode *inode, nid_t ino, pgoff_t idx,
                enum page_type type, bool force)
{
    enum temp_type temp;

    if (!force && !has_merged_page(sbi, inode, ino, idx, type))
        return;

    for (temp = HOT; temp < NR_TEMP_TYPE; temp++) { // write back each of HOT/WARM/COLD in turn
        __f2fs_submit_merged_write(sbi, type, temp);

        /* TODO: use HOT temp only for meta pages now. */
        if (type >= META)
            break;
    }
}

static void __f2fs_submit_merged_write(struct f2fs_sb_info *sbi,
                enum page_type type, enum temp_type temp)
{
    enum page_type btype = PAGE_TYPE_OF_BIO(type);
    struct f2fs_bio_info *io = sbi->write_io[btype] + temp; // temp selects the HOT/WARM/COLD entry of sbi->write_io

    down_write(&io->io_rwsem);

    /* change META to META_FLUSH in the checkpoint procedure */
    if (type >= META_FLUSH) {
        io->fio.type = META_FLUSH;
        io->fio.op = REQ_OP_WRITE;
        io->fio.op_flags = REQ_META | REQ_PRIO | REQ_SYNC;
        if (!test_opt(sbi, NOBARRIER))
            io->fio.op_flags |= REQ_PREFLUSH | REQ_FUA;
    }
    __submit_merged_bio(io);
    up_write(&io->io_rwsem);
}

static void __submit_merged_bio(struct f2fs_bio_info *io)
{
    struct f2fs_io_info *fio = &io->fio;

    if (!io->bio)
        return;

    bio_set_op_attrs(io->bio, fio->op, fio->op_flags);

    if (is_read_io(fio->op))
        trace_f2fs_prepare_read_bio(io->sbi->sb, fio->type, io->bio);
    else
        trace_f2fs_prepare_write_bio(io->sbi->sb, fio->type, io->bio);

    __submit_bio(io->sbi, io->bio, fio->type); // take the bio from f2fs_bio_info and submit it to the device
    io->bio = NULL;
}

Writeback of dirty data in the NAT area
f2fs_flush_nat_entries and f2fs_flush_sit_entries write the nat entries and sit entries cached in RAM back to the journal or to disk:
The f2fs_flush_nat_entries function
Modifying a node also modifies the corresponding nat_entry. That nat_entry is marked dirty and added to the radix tree nm_i->nat_set_root. The checkpoint writes the dirty nat_entry items back, which completes the metadata update.
The function first declares a list with LIST_HEAD(sets). A while loop then fetches dirty entries in units of nat_entry_set, SETVEC_SIZE sets at a time, into setvec[SETVEC_SIZE]. Each nat_entry_set in setvec is added to the sets list under certain conditions. Finally __flush_nat_entry_set runs on every nat_entry_set in the list and writes the dirty data back. __flush_nat_entry_set has two writeback methods: the first writes into the journal of curseg, the second locates the corresponding nat block and writes it back to disk.
void f2fs_flush_nat_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc)
{
    struct f2fs_nm_info *nm_i = NM_I(sbi);
    struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
    struct f2fs_journal *journal = curseg->journal;
    struct nat_entry_set *setvec[SETVEC_SIZE];
    struct nat_entry_set *set, *tmp;
    unsigned int found;
    nid_t set_idx = 0;
    LIST_HEAD(sets);

    if (!nm_i->dirty_nat_cnt)
        return;

    down_write(&nm_i->nat_tree_lock);

    /*
     * __gang_lookup_nat_set reads from the radix tree starting at set_idx
     * and fetches up to SETVEC_SIZE consecutive nat_entry_set items into setvec.
     * __adjust_nat_entry_set then adds them, under certain conditions,
     * to the LIST_HEAD(sets) list.
     */
    while ((found = __gang_lookup_nat_set(nm_i,
                    set_idx, SETVEC_SIZE, setvec))) {
        unsigned idx;

        set_idx = setvec[found - 1]->set + 1;
        for (idx = 0; idx < found; idx++)
            __adjust_nat_entry_set(setvec[idx], &sets,
                        MAX_NAT_JENTRIES(journal));
    }

    /*
     * flush dirty nats in nat entry set
     * Walk every nat_entry_set in the list and write it to curseg->journal
     * (or to its nat block).
     */
    list_for_each_entry_safe(set, tmp, &sets, set_list)
        __flush_nat_entry_set(sbi, set, cpc);

    up_write(&nm_i->nat_tree_lock);
    /* Allow dirty nats by node block allocation in write_begin */
}

__flush_nat_entry_set has two writeback methods: the first writes into the journal of curseg, the second writes back to the nat block.
The first method is normally used when curseg has enough journal space. It walks every nat_entry in the nat_entry_set, uses the nid to find the position of the matching nat_entry in curseg->journal, and copies the values of the nat_entry being visited into the journal entry through raw_nat_from_node_info.
The second method is used when curseg does not have enough journal space. It uses the nid to find the corresponding f2fs_nat_block in the NAT area, reads it in through get_next_nat_page, and then updates it through raw_nat_from_node_info.
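The choice between the two methods is made once per set, based on whether the whole set still fits in the journal. The sketch below models only that decision. It is a simplification under stated assumptions: `toy_journal`, `toy_has_space` and `toy_flush_set` are invented names, and the journal is reduced to a slot counter (the real lookup can also reuse a slot that already holds the same nid).

```c
#include <stdbool.h>

/* Toy journal: only tracks how many entry slots exist and how many are used. */
struct toy_journal {
    int capacity; /* total slots */
    int used;     /* slots already occupied */
};

enum toy_target { TO_JOURNAL, TO_NAT_BLOCK };

/* Does the journal still have room for `cnt` more entries? */
static bool toy_has_space(const struct toy_journal *j, int cnt)
{
    return j->used + cnt <= j->capacity;
}

/* Flush one set of `cnt` dirty entries and report where the set went. */
static enum toy_target toy_flush_set(struct toy_journal *j, int cnt)
{
    if (!toy_has_space(j, cnt))
        return TO_NAT_BLOCK; /* the whole set goes to its NAT block */
    j->used += cnt;          /* the whole set goes into the journal */
    return TO_JOURNAL;
}
```

Note that a set is never split: either all of its entries land in the journal or all of them go to the NAT block.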
static void __flush_nat_entry_set(struct f2fs_sb_info *sbi,
        struct nat_entry_set *set, struct cp_control *cpc)
{
    struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA);
    struct f2fs_journal *journal = curseg->journal;
    nid_t start_nid = set->set * NAT_ENTRY_PER_BLOCK; // the set number identifies the f2fs_nat_block
    bool to_journal = true;
    struct f2fs_nat_block *nat_blk;
    struct nat_entry *ne, *cur;
    struct page *page = NULL;

    /*
     * there are two steps to flush nat entries:
     * #1, flush nat entries to journal in current hot data summary block.
     * #2, flush nat entries to nat page.
     */
    if (enabled_nat_bits(sbi, cpc) ||
        !__has_cursum_space(journal, set->entry_cnt, NAT_JOURNAL)) // not enough journal space in curseg: write to disk instead
        to_journal = false;

    if (to_journal) {
        down_write(&curseg->journal_rwsem);
    } else {
        page = get_next_nat_page(sbi, start_nid); /* find the f2fs_nat_block that manages this nid */
        nat_blk = page_address(page);
        f2fs_bug_on(sbi, !nat_blk);
    }

    /*
     * flush dirty nats in nat entry set
     * Walk every nat_entry.
     *
     * nat_entry exists only in memory; what is stored on disk is
     * f2fs_nat_entry inside an f2fs_nat_block.
     */
    list_for_each_entry_safe(ne, cur, &set->entry_list, list) {
        struct f2fs_nat_entry *raw_ne;
        nid_t nid = nat_get_nid(ne);
        int offset;

        f2fs_bug_on(sbi, nat_get_blkaddr(ne) == NEW_ADDR);

        if (to_journal) {
            // look up the position of nid in the current journal
            offset = f2fs_lookup_journal_in_cursum(journal,
                            NAT_JOURNAL, nid, 1);
            f2fs_bug_on(sbi, offset < 0);
            raw_ne = &nat_in_journal(journal, offset); // the f2fs_nat_entry slot in the journal
            nid_in_journal(journal, offset) = cpu_to_le32(nid); // update the nid in the journal
        } else {
            raw_ne = &nat_blk->entries[nid - start_nid]; /* address of the entry for this nid; filled in below */
        }
        raw_nat_from_node_info(raw_ne, &ne->ni); // copy the node info into the journal or the nat block
        nat_reset_flag(ne); // clear the needs-CP flags
        __clear_nat_cache_dirty(NM_I(sbi), set, ne); // remove the processed entry from the dirty list
        if (nat_get_blkaddr(ne) == NULL_ADDR) { // the nid has been invalidated: release it
            add_free_nid(sbi, nid, false, true);
        } else {
            spin_lock(&NM_I(sbi)->nid_list_lock);
            update_free_nid_bitmap(sbi, nid, false, false); // update the bitmap of available nids
            spin_unlock(&NM_I(sbi)->nid_list_lock);
        }
    }

    if (to_journal) {
        up_write(&curseg->journal_rwsem);
    } else {
        __update_nat_bits(sbi, start_nid, page);
        f2fs_put_page(page, 1);
    }

    /* Allow dirty nats by node block allocation in write_begin */
    if (!set->entry_cnt) {
        radix_tree_delete(&NM_I(sbi)->nat_set_root, set->set);
        kmem_cache_free(nat_entry_set_slab, set);
    }
}

Writeback of dirty data in the SIT area
The f2fs_flush_sit_entries function
The main flow is similar to f2fs_flush_nat_entries: dirty seg_entry items are flushed to the journal or to a sit block.
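One detail differs from the NAT path: in f2fs_flush_sit_entries the to_journal flag lives outside the per-set loop, so once one set does not fit into the journal, every later set goes to a SIT block as well. The sketch below shows that "sticky" behaviour in isolation; `toy_plan_sit_flush` is a hypothetical helper and the journal is reduced to a free-slot counter.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide, for each set in order, whether it is written to the journal.
 * set_cnt[i]  : number of dirty entries in set i
 * free_slots  : free journal slots at the start (assumption: a plain counter)
 * to_journal_out[i] receives the decision for set i.
 * Returns the number of sets that went to the journal.
 */
static size_t toy_plan_sit_flush(const int *set_cnt, size_t nsets,
                                 int free_slots, bool *to_journal_out)
{
    bool to_journal = true; /* sticky: never flips back to true */
    size_t in_journal = 0;

    for (size_t i = 0; i < nsets; i++) {
        if (to_journal && set_cnt[i] > free_slots)
            to_journal = false;
        to_journal_out[i] = to_journal;
        if (to_journal) {
            free_slots -= set_cnt[i];
            in_journal++;
        }
    }
    return in_journal;
}
```

With sets of 2, 5 and 1 entries and 4 free slots, only the first set is journaled: the second does not fit, and the third follows it to the SIT block even though a single entry would still fit.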
void f2fs_flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc)
{
    struct sit_info *sit_i = SIT_I(sbi);
    unsigned long *bitmap = sit_i->dirty_sentries_bitmap;
    struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_COLD_DATA);
    struct f2fs_journal *journal = curseg->journal;
    struct sit_entry_set *ses, *tmp;
    struct list_head *head = &SM_I(sbi)->sit_entry_set;
    bool to_journal = true;
    struct seg_entry *se;

    down_write(&sit_i->sentry_lock);

    if (!sit_i->dirty_sentries)
        goto out;

    /*
     * add and account sit entries of dirty bitmap in sit entry
     * set temporarily
     *
     * Walk the segno of every dirty segment, find the matching
     * sit_entry_set and store it in sbi->sm_info->sit_entry_set.
     */
    add_sits_in_set(sbi);

    /*
     * if there are no enough space in journal to store dirty sit
     * entries, remove all entries from journal and add and account
     * them in sit entry set.
     */
    if (!__has_cursum_space(journal, sit_i->dirty_sentries, SIT_JOURNAL))
        remove_sits_in_journal(sbi);

    /*
     * there are two steps to flush sit entries:
     * #1, flush sit entries to journal in current cold data summary block.
     * #2, flush sit entries to sit page.
     * Walk the sit_entry_set of every segno in the list.
     */
    list_for_each_entry_safe(ses, tmp, head, set_list) {
        struct page *page = NULL;
        struct f2fs_sit_block *raw_sit = NULL;
        unsigned int start_segno = ses->start_segno;
        unsigned int end = min(start_segno + SIT_ENTRY_PER_BLOCK,
                        (unsigned long)MAIN_SEGS(sbi));
        unsigned int segno = start_segno; /* first segment number covered by this set */

        if (to_journal &&
            !__has_cursum_space(journal, ses->entry_cnt, SIT_JOURNAL))
            to_journal = false;

        if (to_journal) {
            down_write(&curseg->journal_rwsem);
        } else {
            page = get_next_sit_page(sbi, start_segno); /* access the disk and fetch the f2fs_sit_block */
            raw_sit = page_address(page); /* the f2fs_sit_block for this segno; data is written into it below */
        }

        /*
         * flush dirty sit entries in region of current sit set
         * Walk every dirty seg_entry in segno..end.
         */
        for_each_set_bit_from(segno, bitmap, end) {
            int offset, sit_offset;

            /*
             * Fetch the seg_entry from the SIT cache by segno. The cache is
             * built when F2FS is initialised, by reading in every seg_entry.
             */
            se = get_seg_entry(sbi, segno);

            if (to_journal) {
                offset = f2fs_lookup_journal_in_cursum(journal,
                            SIT_JOURNAL, segno, 1);
                f2fs_bug_on(sbi, offset < 0);
                segno_in_journal(journal, offset) =
                            cpu_to_le32(segno);
                seg_info_to_raw_sit(se,
                    &sit_in_journal(journal, offset)); // update the journal data
                check_block_count(sbi, segno,
                    &sit_in_journal(journal, offset));
            } else {
                sit_offset = SIT_ENTRY_OFFSET(sit_i, segno);
                seg_info_to_raw_sit(se,
                        &raw_sit->entries[sit_offset]); // update the f2fs_sit_block data
                check_block_count(sbi, segno,
                    &raw_sit->entries[sit_offset]);
            }

            __clear_bit(segno, bitmap); // remove from the dirty map
            sit_i->dirty_sentries--;
            ses->entry_cnt--;
        }

        if (to_journal)
            up_write(&curseg->journal_rwsem);
        else
            f2fs_put_page(page, 1);

        f2fs_bug_on(sbi, ses->entry_cnt);
        release_sit_entry_set(ses);
    }

    f2fs_bug_on(sbi, !list_empty(head));
    f2fs_bug_on(sbi, sit_i->dirty_sentries);
out:
    up_write(&sit_i->sentry_lock);

    /*
     * Use the CP to move the segment information held in
     * dirty_segmap[PRE] into free_segmap.
     * This is tied to f2fs_clear_prefree_segments, which runs after
     * do_checkpoint: the dirty prefree segments are handled here, and
     * f2fs_clear_prefree_segments then clears their dirty mark.
     */
    set_prefree_as_free_segments(sbi);
}

static inline struct seg_entry *get_seg_entry(struct f2fs_sb_info *sbi,
                        unsigned int segno)
{
    struct sit_info *sit_i = SIT_I(sbi);
    return &sit_i->sentries[segno];
}

static void set_prefree_as_free_segments(struct f2fs_sb_info *sbi)
{
    struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
    unsigned int segno;

    mutex_lock(&dirty_i->seglist_lock);
    /*
     * Walk dirty_seglist_info->dirty_segmap[PRE] and run
     * __set_test_and_free on each set bit.
     */
    for_each_set_bit(segno, dirty_i->dirty_segmap[PRE], MAIN_SEGS(sbi))
        __set_test_and_free(sbi, segno); /* update free_segmap for this segno */
    mutex_unlock(&dirty_i->seglist_lock);
}

static inline void __set_test_and_free(struct f2fs_sb_info *sbi,
        unsigned int segno)
{
    struct free_segmap_info *free_i = FREE_I(sbi);
    unsigned int secno = GET_SEC_FROM_SEG(sbi, segno);
    unsigned int start_segno = GET_SEG_FROM_SEC(sbi, secno);
    unsigned int next;

    spin_lock(&free_i->segmap_lock);
    /*
     * A set bit in free_i->free_segmap marks the segment as in use.
     * If the bit for this segno is 0 the segment is already free and
     * nothing is done.
     * If the bit is 1 it is cleared and free_segments is incremented.
     * When no segment of the section remains in use, the section is
     * freed as well.
     */
    if (test_and_clear_bit(segno, free_i->free_segmap)) {
        free_i->free_segments++;

        next = find_next_bit(free_i->free_segmap,
                start_segno + sbi->segs_per_sec, start_segno);
        if (next >= start_segno + sbi->segs_per_sec) {
            if (test_and_clear_bit(secno, free_i->free_secmap))
                free_i->free_sections++;
        }
    }
    spin_unlock(&free_i->segmap_lock);
}

Writeback of the Checkpoint area
The sections above covered the writeback and update of NAT and SIT. do_checkpoint updates the Checkpoint area. This involves two parts: updating the f2fs_checkpoint structure, and writing back the summary data of curseg. Before analysing the function, it helps to know how the Checkpoint area of the metadata is laid out on disk:
+---------------------------------------------------------------------------------------------------+
| f2fs_checkpoint | data summaries | hot node summaries | warm node summaries | cold node summaries |
+---------------------------------------------------------------------------------------------------+
                  .                .
                  .                .
                  . compacted summaries
                  +----------------+-------------------+----------------+
                  | nat journal    | sit journal       | data summaries |
                  +----------------+-------------------+----------------+
                  . normal summaries
                  +----------------+-------------------+----------------+
                  | data summaries                                      |
                  +----------------+-------------------+----------------+

f2fs_checkpoint and the hot/warm/cold node summaries each occupy one block. To reduce the write cost of a Checkpoint, F2FS makes the size of data summaries variable. There are two ways of writing it: as compacted summaries or as normal summaries. Compacted summaries save 1~2 page writes per Checkpoint.
The do_checkpoint function
The core flow of do_checkpoint, simplified, is shown below:
static int do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)
{
    // Part 1: update the f2fs_checkpoint structure from curseg
    ...
    for (i = 0; i < NR_CURSEG_NODE_TYPE; i++) {
        ckpt->cur_node_segno[i] =
            cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_NODE));
        ckpt->cur_node_blkoff[i] =
            cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_NODE));
        ckpt->alloc_type[i + CURSEG_HOT_NODE] =
                curseg_alloc_type(sbi, i + CURSEG_HOT_NODE);
    }

    for (i = 0; i < NR_CURSEG_DATA_TYPE; i++) {
        ckpt->cur_data_segno[i] =
            cpu_to_le32(curseg_segno(sbi, i + CURSEG_HOT_DATA));
        ckpt->cur_data_blkoff[i] =
            cpu_to_le16(curseg_blkoff(sbi, i + CURSEG_HOT_DATA));
        ckpt->alloc_type[i + CURSEG_HOT_DATA] =
                curseg_alloc_type(sbi, i + CURSEG_HOT_DATA);
    }

    // Part 2: write out the summary information from curseg
    ...
    data_sum_blocks = f2fs_npages_for_summary_flush(sbi, false);

    if (data_sum_blocks < NR_CURSEG_DATA_TYPE)
        __set_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG);
    else
        __clear_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG);

    f2fs_write_data_summaries(sbi, start_blk); // write the data summaries and the journals they contain to disk

    /*
     * Node summaries are written back only when the CP reason is
     * umount or fastboot. After a crash the UMOUNT flag is missing,
     * and so are all NODE SUMMARY blocks; F2FS then recovers from the
     * state of the previous checkpoint.
     */
    if (__remain_node_summaries(cpc->reason)) {
        f2fs_write_node_summaries(sbi, start_blk); // write the node summaries and the journals they contain to disk
        start_blk += NR_CURSEG_NODE_TYPE;
    }

    commit_checkpoint(sbi, ckpt, start_blk); // submit the modified checkpoint area to the device, updating the on-disk metadata
    ...
    return 0;
}

The first part modifies the f2fs_checkpoint structure in the metadata area. It records the current segno, blkoff and so on of each curseg in f2fs_checkpoint, so that curseg can be rebuilt from this information at the next mount.
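The save-and-rebuild idea in the first part can be shown with a stripped-down model. This is only an illustrative sketch: the `toy_` types and functions are invented, only the three DATA cursegs are modelled, and the little-endian conversions (cpu_to_le32 / cpu_to_le16) are left out.

```c
#include <stdint.h>

#define TOY_NR_DATA_TYPE 3 /* HOT, WARM, COLD */

/* In-memory write cursor of one active segment. */
struct toy_curseg {
    uint32_t segno;       /* segment currently being filled */
    uint16_t next_blkoff; /* next free block offset inside it */
};

/* The part of the checkpoint that remembers the cursors. */
struct toy_ckpt {
    uint32_t cur_data_segno[TOY_NR_DATA_TYPE];
    uint16_t cur_data_blkoff[TOY_NR_DATA_TYPE];
};

/* Checkpoint time: record every cursor. */
static void toy_save_cursegs(struct toy_ckpt *ckpt, const struct toy_curseg *cs)
{
    for (int i = 0; i < TOY_NR_DATA_TYPE; i++) {
        ckpt->cur_data_segno[i] = cs[i].segno;
        ckpt->cur_data_blkoff[i] = cs[i].next_blkoff;
    }
}

/* Mount time: rebuild the cursors from the checkpoint. */
static void toy_restore_cursegs(const struct toy_ckpt *ckpt, struct toy_curseg *cs)
{
    for (int i = 0; i < TOY_NR_DATA_TYPE; i++) {
        cs[i].segno = ckpt->cur_data_segno[i];
        cs[i].next_blkoff = ckpt->cur_data_blkoff[i];
    }
}
```

After a save followed by a restore, each cursor points at the same segment and block offset as before, which is what lets writing resume where it stopped.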
The rest of this section concentrates on the writeback of the summaries in the Checkpoint area. Before walking through the flow, the difference between compacted summaries and normal summaries needs to be explained.
compacted summaries and normal summaries
The curseg structure shows that curseg manages (NODE, DATA) x (HOT, WARM, COLD), six segments in total, and therefore also the f2fs_summary_block of each of these six segments.
In the general case, every checkpoint would have to write back six kinds of f2fs_summary_block, that is, six blocks, to disk.
To reduce this cost, F2FS introduces a compacted summary block for the DATA-type f2fs_summary_block. Normally DATA needs three f2fs_summary_block writes (HOT, WARM, COLD), but with the compacted summary block only 1~2 blocks need to be written in most cases.
The compacted summary block is designed to hold all the metadata of the current DATA cursegs in 1~2 blocks. Its core idea is to store the metadata of HOT, WARM and COLD DATA mixed together:
Mixed journal storage
The compacted summary block maintains one shared nat journal and one shared sit journal. Journal entries belonging to the HOT, WARM and COLD types are all stored mixed in these two journal structures.
When the COMPACTED condition holds, F2FS reads the two journals from disk into memory at mount time and keeps them in the curseg->journal of HOT and of COLD respectively.
At CP time, f2fs_flush_nat_entries and f2fs_flush_sit_entries write the entries into the curseg->journal of HOT (nat) or COLD (sit). If that curseg->journal does not have enough space, the information is written directly to the corresponding nat block or sit entry block instead.
The curseg->journal of HOT and of COLD are then packed into the compacted block and written back to the cp area.
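The packing step can be sketched as two copies into the start of one page. This is a simplified model rather than the kernel routine: `pack_journals` is a hypothetical name, and the journal size of 507 bytes is an assumption derived from a 4096-byte block, a 5-byte footer and 512 summary entries of 7 bytes each.

```c
#include <string.h>

#define PACK_PAGE_SIZE    4096
#define PACK_JOURNAL_SIZE 507 /* assumption: 4096 - 5 - 512 * 7 */

/*
 * Lay out the two journals at the front of a compacted block:
 * nat journal first, sit journal right behind it.
 * Returns the offset at which the summary entries would start.
 */
static size_t pack_journals(unsigned char *page,
                            const unsigned char *nat_journal,
                            const unsigned char *sit_journal)
{
    size_t written = 0;

    memset(page, 0, PACK_PAGE_SIZE);

    memcpy(page + written, nat_journal, PACK_JOURNAL_SIZE);
    written += PACK_JOURNAL_SIZE;

    memcpy(page + written, sit_journal, PACK_JOURNAL_SIZE);
    written += PACK_JOURNAL_SIZE;

    return written;
}
```

Under these assumptions the two journals take 1014 bytes, and the remainder of the first block is available for summary entries.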
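Mixed summary storage
In addition, the summaries of the three types HOT, WARM and COLD are stored mixed in one data summaries array. The differences are as follows:
From the description above, and from the number of summaries that each kind of summary block can hold, it follows that
if the three types HOT, WARM and COLD DATA together currently use only …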
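The block count follows from simple arithmetic on the block layout. The sketch below reproduces that arithmetic under explicit assumptions: a 4096-byte block, 7-byte summary entries, a 5-byte footer and 512 entries per normal summary block (so each journal takes 507 bytes). All `TOY_` names and `toy_npages_for_summaries` are hypothetical; the page-full condition mirrors the check in write_compacted_summaries.

```c
/* Assumed on-disk sizes for a 4 KiB block; not taken from the article. */
#define TOY_PAGE_SIZE        4096
#define TOY_SUMMARY_SIZE     7
#define TOY_SUM_FOOTER_SIZE  5
#define TOY_ENTRIES_IN_SUM   512
#define TOY_SUM_JOURNAL_SIZE \
    (TOY_PAGE_SIZE - TOY_SUM_FOOTER_SIZE - TOY_ENTRIES_IN_SUM * TOY_SUMMARY_SIZE)

/* Entries that fit in the first compacted block (behind both journals). */
static int toy_first_page_entries(void)
{
    return (TOY_PAGE_SIZE - 2 * TOY_SUM_JOURNAL_SIZE - TOY_SUM_FOOTER_SIZE)
            / TOY_SUMMARY_SIZE;
}

/* Entries that fit in a following block that holds summaries only. */
static int toy_next_page_entries(void)
{
    return (TOY_PAGE_SIZE - TOY_SUM_FOOTER_SIZE) / TOY_SUMMARY_SIZE;
}

/*
 * Number of blocks needed for `valid_sum_count` data summaries.
 * 1 or 2 means the compacted format pays off; 3 means it does not.
 */
static int toy_npages_for_summaries(int valid_sum_count)
{
    int first = toy_first_page_entries();

    if (valid_sum_count <= first)
        return 1;
    if (valid_sum_count - first <= toy_next_page_entries())
        return 2;
    return 3;
}
```

With these sizes the first block holds 439 entries and a second block holds 584, so up to 1023 summaries across HOT, WARM and COLD DATA fit the compacted format; beyond that, three blocks are needed and the normal format is used.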
Now for the code:

// From the number of summaries to write back, return the number of blocks needed; the result is 1, 2 or 3
data_sum_blocks = f2fs_npages_for_summary_flush(sbi, false);

// If data_sum_blocks is 1 or 2, only 1 or 2 blocks are written back, so set CP_COMPACT_SUM_FLAG
if (data_sum_blocks < NR_CURSEG_DATA_TYPE) // NR_CURSEG_DATA_TYPE = 3
    __set_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG);
else
    __clear_ckpt_flags(ckpt, CP_COMPACT_SUM_FLAG);

// Then write the summaries to disk
f2fs_write_data_summaries(sbi, start_blk); // write the data summaries and the journals they contain to disk

f2fs_write_data_summaries checks whether CP_COMPACT_SUM_FLAG is set and chooses the write method accordingly:
void f2fs_write_data_summaries(struct f2fs_sb_info *sbi, block_t start_blk)
{
    if (is_set_ckpt_flags(sbi, CP_COMPACT_SUM_FLAG))
        write_compacted_summaries(sbi, start_blk);
    else
        write_normal_summaries(sbi, start_blk, CURSEG_HOT_DATA);
}

write_compacted_summaries writes the data to disk following the compacted block layout described above:
static void write_compacted_summaries(struct f2fs_sb_info *sbi, block_t blkaddr)
{
    struct page *page;
    unsigned char *kaddr;
    struct f2fs_summary *summary;
    struct curseg_info *seg_i;
    int written_size = 0;
    int i, j;
    int datatypes = CURSEG_COLD_DATA;
#ifdef CONFIG_F2FS_COMPRESSION
    datatypes = CURSEG_BG_COMPR_DATA;
#endif

    page = f2fs_grab_meta_page(sbi, blkaddr++);
    kaddr = (unsigned char *)page_address(page);
    memset(kaddr, 0, PAGE_SIZE);

    /* Step 1: write nat cache */
    seg_i = CURSEG_I(sbi, CURSEG_HOT_DATA); // step 1: write the nat journal
    memcpy(kaddr, seg_i->journal, SUM_JOURNAL_SIZE);
    written_size += SUM_JOURNAL_SIZE;

    /* Step 2: write sit cache */
    seg_i = CURSEG_I(sbi, CURSEG_COLD_DATA);
    memcpy(kaddr + written_size, seg_i->journal, SUM_JOURNAL_SIZE); // step 2: write the sit journal
    written_size += SUM_JOURNAL_SIZE;

    /* Step 3: write summary entries */
    for (i = CURSEG_HOT_DATA; i <= datatypes; i++) { // start writing the summaries
        unsigned short blkoff;

        seg_i = CURSEG_I(sbi, i);
        if (sbi->ckpt->alloc_type[i] == SSR)
            blkoff = sbi->blocks_per_seg;
        else
            blkoff = curseg_blkoff(sbi, i);

        for (j = 0; j < blkoff; j++) {
            if (!page) { // the compacted block is full: start a block that holds summaries only
                page = f2fs_grab_meta_page(sbi, blkaddr++);
                kaddr = (unsigned char *)page_address(page);
                memset(kaddr, 0, PAGE_SIZE);
                written_size = 0;
            }
            summary = (struct f2fs_summary *)(kaddr + written_size);
            *summary = seg_i->sum_blk->entries[j];
            written_size += SUMMARY_SIZE;

            if (written_size + SUMMARY_SIZE <= PAGE_SIZE -
                            SUM_FOOTER_SIZE)
                continue;

            set_page_dirty(page); // the block has reached its capacity: mark it dirty so it gets written back
            f2fs_put_page(page, 1);
            page = NULL;
        }
    }
    if (page) {
        set_page_dirty(page);
        f2fs_put_page(page, 1);
    }
}

write_normal_summaries simply writes the summaries to the checkpoint area in HOT/WARM/COLD order:
static void write_normal_summaries(struct f2fs_sb_info *sbi,
                    block_t blkaddr, int type)
{
    int i, end;

    if (IS_DATASEG(type))
        end = type + NR_CURSEG_DATA_TYPE;
    else
        end = type + NR_CURSEG_NODE_TYPE;

    for (i = type; i < end; i++) // write HOT, WARM and COLD to disk in turn
        write_current_sum_page(sbi, i, blkaddr + (i - type));
}

static void write_current_sum_page(struct f2fs_sb_info *sbi,
                    int type, block_t blk_addr)
{
    struct curseg_info *curseg = CURSEG_I(sbi, type);
    struct page *page = f2fs_grab_meta_page(sbi, blk_addr);
    struct f2fs_summary_block *src = curseg->sum_blk;
    struct f2fs_summary_block *dst;

    dst = (struct f2fs_summary_block *)page_address(page);
    memset(dst, 0, PAGE_SIZE);

    mutex_lock(&curseg->curseg_mutex);

    down_read(&curseg->journal_rwsem);
    memcpy(&dst->journal, curseg->journal, SUM_JOURNAL_SIZE);
    up_read(&curseg->journal_rwsem);

    memcpy(dst->entries, src->entries, SUM_ENTRY_SIZE);
    memcpy(&dst->footer, &src->footer, SUM_FOOTER_SIZE);

    mutex_unlock(&curseg->curseg_mutex);

    set_page_dirty(page);
    f2fs_put_page(page, 1);
}