k8s GC internals explained
1. Kubernetes garbage collection policies
2. GC source code analysis
2.1 Initializing the garbageCollector object
2.1.1 The structs that make up garbageCollector
2.1.2 NewGarbageCollector
2.2 Starting garbageCollector
2.2.1 Starting the dependencyGraphBuilder
2.2.2 runAttemptToDeleteWorker
2.2.3 runAttemptToOrphanWorker
2.2.4 Summary
2.3 runProcessGraphChanges
2.4 The logic of processTransitions
2.5 runAttemptToOrphanWorker
2.6 attemptToDeleteWorker
2.7 What exactly is uidToNode
3. Summary
When we set an OwnerReference on an object, deleting that object's owner also deletes the object itself. This cascading cleanup is Kubernetes' garbage collection mechanism at work.
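For reference, this is roughly what the ownerReference carried by a dependent looks like when built with the apimachinery types. A minimal sketch: deployUID and the boolPtr helper are illustrative, not taken from the GC source.

package main

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
)

func boolPtr(b bool) *bool { return &b }

// Illustrative only: the ownerReference that a ReplicaSet rsA would carry to point at its Deployment deployA.
func exampleOwnerReference(deployUID types.UID) metav1.OwnerReference {
    return metav1.OwnerReference{
        APIVersion:         "apps/v1",
        Kind:               "Deployment",
        Name:               "deployA",
        UID:                deployUID,
        Controller:         boolPtr(true), // deployA is the managing controller
        BlockOwnerDeletion: boolPtr(true), // block foreground deletion of deployA until this object is gone
    }
}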
1. Kubernetes garbage collection policies
Kubernetes currently supports three garbage collection policies:
(1) Foreground cascading deletion: the deletion of the owner object does not complete until all of its dependents have been deleted. When the owner is deleted it enters a "deletion in progress" state, in which:
the object is still visible through the REST API (it can still be queried with kubectl or kuboard)
the object's deletionTimestamp field is set
the object's metadata.finalizers contains the value foregroundDeletion
(2) Background cascading deletion: much simpler — the owner object is deleted immediately, and the garbage collector deletes its dependents in the background. This is much faster than foreground deletion because it does not wait for the dependents to be removed.
(3) Orphan: deleting the owner only removes the owner itself from the cluster and leaves all of its dependents behind as "orphans".
Example: suppose there is a Deployment deployA, whose ReplicaSet is rsA and whose Pod is podA.
(1) Foreground deletion: podA is deleted first, then rsA, then deployA. If the deletion of podA gets stuck, rsA will be stuck as well.
(2) Background deletion: deployA is deleted first, then rsA, then podA. Whether or not rsA and podA are deleted successfully does not affect deployA.
(3) Orphan deletion: only deployA is deleted. rsA and podA are untouched, but rsA's owner is no longer deployA.
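For illustration, a hedged client-go sketch of deleting deployA with an explicit propagation policy; the clientset cs and the namespace are assumptions made for this example, and recent kubectl versions expose the same choice through their --cascade flag.

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deleteDeployment deletes a Deployment and lets the garbage collector handle its
// ReplicaSets and Pods according to the chosen propagation policy.
func deleteDeployment(ctx context.Context, cs kubernetes.Interface, namespace, name string, policy metav1.DeletionPropagation) error {
    return cs.AppsV1().Deployments(namespace).Delete(ctx, name, metav1.DeleteOptions{
        PropagationPolicy: &policy,
    })
}

// Usage (illustrative):
//   deleteDeployment(ctx, cs, "default", "deployA", metav1.DeletePropagationForeground) // podA, then rsA, then deployA
//   deleteDeployment(ctx, cs, "default", "deployA", metav1.DeletePropagationBackground) // deployA first, the rest in the background
//   deleteDeployment(ctx, cs, "default", "deployA", metav1.DeletePropagationOrphan)     // deployA only, rsA and podA are orphaned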
2. GC source code analysis
Like deployController and rsController, GarbageCollectorController is one of the controllers running inside kube-controller-manager (kcm).
GarbageCollectorController is started by startGarbageCollectorController, whose main logic is as follows.
Each step from step 3 onward is expanded in detail below; step 3 corresponds to section 2.1.
(1) Initialize the clients used to discover the resources in the cluster (not covered further here).
(2) Compute deletableResources and ignoredResources.
deletableResources: all resources that support the "delete", "list" and "watch" verbs
ignoredResources: specified by the GarbageCollectorController config when kcm starts
(3) Initialize the garbageCollector object.
(4) Start the garbageCollector.
(5) Sync the garbageCollector periodically.
(6) Enable the debug handler.
func startGarbageCollectorController(ctx ControllerContext) (http.Handler, bool, error) {
    // 1. Initialize the clients
    if !ctx.ComponentConfig.GarbageCollectorController.EnableGarbageCollector {
        return nil, false, nil
    }

    gcClientset := ctx.ClientBuilder.ClientOrDie("generic-garbage-collector")
    discoveryClient := cacheddiscovery.NewMemCacheClient(gcClientset.Discovery())

    config := ctx.ClientBuilder.ConfigOrDie("generic-garbage-collector")
    metadataClient, err := metadata.NewForConfig(config)
    if err != nil {
        return nil, true, err
    }

    // 2. Compute deletableResources and ignoredResources
    // Get an initial set of deletable resources to prime the garbage collector.
    deletableResources := garbagecollector.GetDeletableResources(discoveryClient)
    ignoredResources := make(map[schema.GroupResource]struct{})
    for _, r := range ctx.ComponentConfig.GarbageCollectorController.GCIgnoredResources {
        ignoredResources[schema.GroupResource{Group: r.Group, Resource: r.Resource}] = struct{}{}
    }

    // 3. NewGarbageCollector
    garbageCollector, err := garbagecollector.NewGarbageCollector(
        metadataClient,
        ctx.RESTMapper,
        deletableResources,
        ignoredResources,
        ctx.ObjectOrMetadataInformerFactory,
        ctx.InformersStarted,
    )
    if err != nil {
        return nil, true, fmt.Errorf("failed to start the generic garbage collector: %v", err)
    }

    // 4. Start the garbage collector.
    workers := int(ctx.ComponentConfig.GarbageCollectorController.ConcurrentGCSyncs)
    go garbageCollector.Run(workers, ctx.Stop)

    // 5. Periodically refresh the RESTMapper with new discovery information and sync
    // the garbage collector.
    go garbageCollector.Sync(gcClientset.Discovery(), 30*time.Second, ctx.Stop)

    // 6. Enable the debug handler
    return garbagecollector.NewDebugHandler(garbageCollector), true, nil
}

2.1 Initializing the garbageCollector object
2.1.1 The structs that make up garbageCollector
garbageCollector needs a few additional structures:
attemptToDelete, attemptToOrphan: rate-limited work queues
uidToNode: a graph that caches the known dependency relationships; it is a map whose key is a uid and whose value is a node.
type GarbageCollector struct {
    restMapper     resettableRESTMapper
    metadataClient metadata.Interface
    attemptToDelete workqueue.RateLimitingInterface
    attemptToOrphan workqueue.RateLimitingInterface
    dependencyGraphBuilder *GraphBuilder
    absentOwnerCache *UIDCache
    workerLock sync.RWMutex
}

// GraphBuilder: based on the events supplied by the informers, GraphBuilder updates
// uidToNode, a graph that caches the dependencies as we know, and enqueues
// items to the attemptToDelete and attemptToOrphan.
type GraphBuilder struct {
    restMapper meta.RESTMapper

    // one monitor per resource type
    monitors    monitors
    monitorLock sync.RWMutex
    informersStarted <-chan struct{}

    stopCh <-chan struct{}

    running bool

    metadataClient metadata.Interface
    graphChanges workqueue.RateLimitingInterface

    uidToNode *concurrentUIDToNode
    attemptToDelete workqueue.RateLimitingInterface
    attemptToOrphan workqueue.RateLimitingInterface

    absentOwnerCache *UIDCache
    sharedInformers  controller.InformerFactory
    ignoredResources map[schema.GroupResource]struct{}
}

type concurrentUIDToNode struct {
    uidToNodeLock sync.RWMutex
    uidToNode     map[types.UID]*node
}

type node struct {
    identity objectReference
    dependentsLock sync.RWMutex
    dependents map[*node]struct{} // all dependents of this node

    deletingDependents     bool
    deletingDependentsLock sync.RWMutex
    beingDeleted     bool
    beingDeletedLock sync.RWMutex

    virtual     bool
    virtualLock sync.RWMutex
    owners []metav1.OwnerReference // all owners of this node
}

For example:
Suppose the cluster contains three objects: deployA, rsA and podA.
The monitors watch these three resource types and, depending on what happens, push items onto the attemptToDelete and attemptToOrphan queues.
GraphBuilder builds a graph. In this case the graph contains:
Node1 (key = deployA.uid): its owners are empty, dependents = {rsA}.
Node2 (key = rsA.uid): owners = {deployA}, dependents = {podA}.
Node3 (key = podA.uid): owners = {rsA}, dependents are empty.
In addition, every node carries key fields such as beingDeleted and deletingDependents, so the GC can conveniently implement the different deletion policies on top of this graph.
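To make the example more tangible, here is a small, self-contained sketch of the graph that uidToNode conceptually holds for deployA, rsA and podA. It uses simplified local types rather than the unexported ones from the garbagecollector package, so it is purely illustrative.

package main

import "fmt"

// simpleNode is a stand-in for the GC's node type, just to visualize the example.
type simpleNode struct {
    name       string
    owners     []string
    dependents []*simpleNode
}

func main() {
    podA := &simpleNode{name: "podA", owners: []string{"rsA"}}
    rsA := &simpleNode{name: "rsA", owners: []string{"deployA"}, dependents: []*simpleNode{podA}}
    deployA := &simpleNode{name: "deployA", dependents: []*simpleNode{rsA}}

    // The real map is keyed by UID; readable placeholder keys are used here.
    uidToNode := map[string]*simpleNode{
        "deployA-uid": deployA,
        "rsA-uid":     rsA,
        "podA-uid":    podA,
    }
    for uid, n := range uidToNode {
        fmt.Printf("%s: owners=%v, dependents=%d\n", uid, n.owners, len(n.dependents))
    }
}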
2.1.2 NewGarbageCollector
NewGarbageCollector does just two things:
(1) initialize the GarbageCollector struct;
(2) call controllerFor to register the handlers for object changes. Whatever it observes — add, update or delete — is wrapped into an event and pushed onto the graphChanges queue.
func NewGarbageCollector(
    metadataClient metadata.Interface,
    mapper resettableRESTMapper,
    deletableResources map[schema.GroupVersionResource]struct{},
    ignoredResources map[schema.GroupResource]struct{},
    sharedInformers controller.InformerFactory,
    informersStarted <-chan struct{},
) (*GarbageCollector, error) {
    attemptToDelete := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_attempt_to_delete")
    attemptToOrphan := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_attempt_to_orphan")
    absentOwnerCache := NewUIDCache(500)
    gc := &GarbageCollector{
        metadataClient:   metadataClient,
        restMapper:       mapper,
        attemptToDelete:  attemptToDelete,
        attemptToOrphan:  attemptToOrphan,
        absentOwnerCache: absentOwnerCache,
    }
    gb := &GraphBuilder{
        metadataClient:   metadataClient,
        informersStarted: informersStarted,
        restMapper:       mapper,
        graphChanges:     workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_graph_changes"),
        uidToNode: &concurrentUIDToNode{
            uidToNode: make(map[types.UID]*node),
        },
        attemptToDelete:  attemptToDelete,
        attemptToOrphan:  attemptToOrphan,
        absentOwnerCache: absentOwnerCache,
        sharedInformers:  sharedInformers,
        ignoredResources: ignoredResources,
    }
    if err := gb.syncMonitors(deletableResources); err != nil {
        utilruntime.HandleError(fmt.Errorf("failed to sync all monitors: %v", err))
    }
    gc.dependencyGraphBuilder = gb

    return gc, nil
}

syncMonitors reconciles which resources need to be watched and calls controllerFor to register the event handlers.
func (gb *GraphBuilder) syncMonitors(resources map[schema.GroupVersionResource]struct{}) error {
    gb.monitorLock.Lock()
    defer gb.monitorLock.Unlock()

    toRemove := gb.monitors
    if toRemove == nil {
        toRemove = monitors{}
    }
    current := monitors{}
    errs := []error{}
    kept := 0
    added := 0
    for resource := range resources {
        if _, ok := gb.ignoredResources[resource.GroupResource()]; ok {
            continue
        }
        if m, ok := toRemove[resource]; ok {
            current[resource] = m
            delete(toRemove, resource)
            kept++
            continue
        }
        kind, err := gb.restMapper.KindFor(resource)
        if err != nil {
            errs = append(errs, fmt.Errorf("couldn't look up resource %q: %v", resource, err))
            continue
        }
        c, s, err := gb.controllerFor(resource, kind)
        if err != nil {
            errs = append(errs, fmt.Errorf("couldn't start monitor for resource %q: %v", resource, err))
            continue
        }
        current[resource] = &monitor{store: s, controller: c}
        added++
    }
    gb.monitors = current

    for _, monitor := range toRemove {
        if monitor.stopCh != nil {
            close(monitor.stopCh)
        }
    }

    klog.V(4).Infof("synced monitors; added %d, kept %d, removed %d", added, kept, len(toRemove))
    // NewAggregate returns nil if errs is 0-length
    return utilerrors.NewAggregate(errs)
}

controllerFor wraps every add, update and delete it observes into an event and adds it to the graphChanges queue.
func (gb *GraphBuilder) controllerFor(resource schema.GroupVersionResource, kind schema.GroupVersionKind) (cache.Controller, cache.Store, error) {
    handlers := cache.ResourceEventHandlerFuncs{
        // add the event to the dependencyGraphBuilder's graphChanges.
        AddFunc: func(obj interface{}) {
            event := &event{
                eventType: addEvent,
                obj:       obj,
                gvk:       kind,
            }
            gb.graphChanges.Add(event)
        },
        UpdateFunc: func(oldObj, newObj interface{}) {
            // TODO: check if there are differences in the ownerRefs,
            // finalizers, and DeletionTimestamp; if not, ignore the update.
            event := &event{
                eventType: updateEvent,
                obj:       newObj,
                oldObj:    oldObj,
                gvk:       kind,
            }
            gb.graphChanges.Add(event)
        },
        DeleteFunc: func(obj interface{}) {
            // delta fifo may wrap the object in a cache.DeletedFinalStateUnknown, unwrap it
            if deletedFinalStateUnknown, ok := obj.(cache.DeletedFinalStateUnknown); ok {
                obj = deletedFinalStateUnknown.Obj
            }
            event := &event{
                eventType: deleteEvent,
                obj:       obj,
                gvk:       kind,
            }
            gb.graphChanges.Add(event)
        },
    }
    shared, err := gb.sharedInformers.ForResource(resource)
    if err != nil {
        klog.V(4).Infof("unable to use a shared informer for resource %q, kind %q: %v", resource.String(), kind.String(), err)
        return nil, nil, err
    }
    klog.V(4).Infof("using a shared informer for resource %q, kind %q", resource.String(), kind.String())
    // need to clone because it's from a shared cache
    shared.Informer().AddEventHandlerWithResyncPeriod(handlers, ResourceResyncTime)
    return shared.Informer().GetController(), shared.Informer().GetStore(), nil
}

2.2 Starting garbageCollector
func (gc *GarbageCollector) Run(workers int, stopCh <-chan struct{}) {
    defer utilruntime.HandleCrash()
    defer gc.attemptToDelete.ShutDown()
    defer gc.attemptToOrphan.ShutDown()
    defer gc.dependencyGraphBuilder.graphChanges.ShutDown()

    klog.Infof("Starting garbage collector controller")
    defer klog.Infof("Shutting down garbage collector controller")

    // 1. Start the dependencyGraphBuilder
    go gc.dependencyGraphBuilder.Run(stopCh)

    if !cache.WaitForNamedCacheSync("garbage collector", stopCh, gc.dependencyGraphBuilder.IsSynced) {
        return
    }

    klog.Infof("Garbage collector: all resource monitors have synced. Proceeding to collect garbage")

    // 2. Start the runAttemptToDeleteWorker and runAttemptToOrphanWorker gc workers
    for i := 0; i < workers; i++ {
        go wait.Until(gc.runAttemptToDeleteWorker, 1*time.Second, stopCh)
        go wait.Until(gc.runAttemptToOrphanWorker, 1*time.Second, stopCh)
    }

    <-stopCh
}

2.2.1 Starting the dependencyGraphBuilder
// Run sets the stop channel and starts monitor execution until stopCh is
// closed. Any running monitors will be stopped before Run returns.
func (gb *GraphBuilder) Run(stopCh <-chan struct{}) {
    klog.Infof("GraphBuilder running")
    defer klog.Infof("GraphBuilder stopping")

    // Set up the stop channel.
    gb.monitorLock.Lock()
    gb.stopCh = stopCh
    gb.running = true
    gb.monitorLock.Unlock()

    // Start monitors and begin change processing until the stop channel is closed.
    // 1. Start the monitors for each resource
    gb.startMonitors()
    // 2. runProcessGraphChanges starts processing the events
    wait.Until(gb.runProcessGraphChanges, 1*time.Second, stopCh)

    // Cleanup once the monitors have to stop.
    // Stop any running monitors.
    gb.monitorLock.Lock()
    defer gb.monitorLock.Unlock()
    monitors := gb.monitors
    stopped := 0
    for _, monitor := range monitors {
        if monitor.stopCh != nil {
            stopped++
            close(monitor.stopCh)
        }
    }

    // reset monitors so that the graph builder can be safely re-run/synced.
    gb.monitors = nil
    klog.Infof("stopped %d of %d monitors", stopped, len(monitors))
}

// Start the monitors for each resource
func (gb *GraphBuilder) startMonitors() {
    gb.monitorLock.Lock()
    defer gb.monitorLock.Unlock()

    if !gb.running {
        return
    }

    // we're waiting until after the informer start that happens once all the controllers are initialized. This ensures
    // that they don't get unexpected events on their work queues.
    <-gb.informersStarted

    monitors := gb.monitors
    started := 0
    for _, monitor := range monitors {
        if monitor.stopCh == nil {
            monitor.stopCh = make(chan struct{})
            gb.sharedInformers.Start(gb.stopCh)
            go monitor.Run()
            started++
        }
    }
    klog.V(4).Infof("started %d new monitors, %d currently running", started, len(monitors))
}

2.2.2 runAttemptToDeleteWorker
runAttemptToDeleteWorker simply pops an item off the attemptToDelete queue and processes it.
func (gc *GarbageCollector) runAttemptToDeleteWorker() {
    for gc.attemptToDeleteWorker() {
    }
}

func (gc *GarbageCollector) attemptToDeleteWorker() bool {
    item, quit := gc.attemptToDelete.Get()
    ...
    err := gc.attemptToDeleteItem(n)
    ...
    return true
}

2.2.3 runAttemptToOrphanWorker
runAttemptToOrphanWorker simply pops an item off the attemptToOrphan queue and processes it.
func (gc *GarbageCollector) runAttemptToOrphanWorker() {
    for gc.attemptToOrphanWorker() {
    }
}

func (gc *GarbageCollector) attemptToOrphanWorker() bool {
    item, quit := gc.attemptToOrphan.Get()
    defer gc.attemptToOrphan.Done(item)
    owner, ok := item.(*node)
    if !ok {
        utilruntime.HandleError(fmt.Errorf("expect *node, got %#v", item))
        return true
    }
    // we don't need to lock each element, because they never get updated
    owner.dependentsLock.RLock()
    dependents := make([]*node, 0, len(owner.dependents))
    for dependent := range owner.dependents {
        dependents = append(dependents, dependent)
    }
    owner.dependentsLock.RUnlock()

    err := gc.orphanDependents(owner.identity, dependents)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("orphanDependents for %s failed with %v", owner.identity, err))
        gc.attemptToOrphan.AddRateLimited(item)
        return true
    }
    // update the owner, remove "orphaningFinalizer" from its finalizers list
    err = gc.removeFinalizer(owner, metav1.FinalizerOrphanDependents)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("removeOrphanFinalizer for %s failed with %v", owner.identity, err))
        gc.attemptToOrphan.AddRateLimited(item)
    }
    return true
}

2.2.4 Summary
(1) NewGarbageCollector initializes the GraphBuilder and the attemptToDelete / attemptToOrphan queues, and defines the handlers that run when the watched objects change.
(2) GarbageCollector.Run does three things. First, every watched resource goes through the same handling: add, update and delete are all wrapped into an event and pushed onto the graphChanges queue. Second, it starts runProcessGraphChanges to process the items in graphChanges. Third, it starts the attemptToDeleteWorker and attemptToOrphanWorker goroutines that perform the actual GC work.
(3) So the overall flow so far is:
NewGarbageCollector watches every resource that supports the list, watch and delete verbs;
every add, update and delete of those objects is pushed onto the graphChanges queue;
runProcessGraphChanges consumes graphChanges and does two things: it maintains the graph, and it pushes objects that may need to be deleted onto the attemptToOrphan or attemptToDelete queue;
attemptToOrphanWorker and attemptToDeleteWorker then carry out the concrete GC work.
At this point the initialization and the rough flow of the GC are clear. Next we look in detail at runProcessGraphChanges and at the attemptToOrphanWorker / attemptToDeleteWorker logic.
2.3 runProcessGraphChanges
runProcessGraphChanges does two things:
(1) keep the uidToNode graph correct and complete at all times
(2) push objects that may need to be deleted onto the attemptToOrphan and attemptToDelete queues
The detailed logic is:
(1) Pop an event from graphChanges, then check whether the object already exists in the graph. If it does, mark the node as observed, meaning the node is not a virtual node.
(2) Handle one of three cases, as follows:
func (gb *GraphBuilder) runProcessGraphChanges() {
    for gb.processGraphChanges() {
    }
}

// Dequeueing an event from graphChanges, updating graph, populating dirty_queue.
func (gb *GraphBuilder) processGraphChanges() bool {
    item, quit := gb.graphChanges.Get()
    if quit {
        return false
    }
    defer gb.graphChanges.Done(item)
    event, ok := item.(*event)
    if !ok {
        utilruntime.HandleError(fmt.Errorf("expect a *event, got %v", item))
        return true
    }
    obj := event.obj
    accessor, err := meta.Accessor(obj)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("cannot access obj: %v", err))
        return true
    }
    klog.V(5).Infof("GraphBuilder process object: %s/%s, namespace %s, name %s, uid %s, event type %v", event.gvk.GroupVersion().String(), event.gvk.Kind, accessor.GetNamespace(), accessor.GetName(), string(accessor.GetUID()), event.eventType)
    // Check if the node already exists
    // 1. Check whether the object is already in the graph
    existingNode, found := gb.uidToNode.Read(accessor.GetUID())
    // 1.1 If it exists, mark it as observed, i.e. this node is not a virtual node.
    if found {
        // this marks the node as having been observed via an informer event
        // 1. this depends on graphChanges only containing add/update events from the actual informer
        // 2. this allows things tracking virtual nodes' existence to stop polling and rely on informer events
        existingNode.markObserved()
    }
    // 2. Handle the three cases.
    switch {
    case (event.eventType == addEvent || event.eventType == updateEvent) && !found:
        newNode := &node{
            identity: objectReference{
                OwnerReference: metav1.OwnerReference{
                    APIVersion: event.gvk.GroupVersion().String(),
                    Kind:       event.gvk.Kind,
                    UID:        accessor.GetUID(),
                    Name:       accessor.GetName(),
                },
                Namespace: accessor.GetNamespace(),
            },
            dependents:         make(map[*node]struct{}),
            owners:             accessor.GetOwnerReferences(),
            deletingDependents: beingDeleted(accessor) && hasDeleteDependentsFinalizer(accessor),
            beingDeleted:       beingDeleted(accessor),
        }
        gb.insertNode(newNode)
        // the underlying delta_fifo may combine a creation and a deletion into
        // one event, so we need to further process the event.
        gb.processTransitions(event.oldObj, accessor, newNode)
    case (event.eventType == addEvent || event.eventType == updateEvent) && found:
        // handle changes in ownerReferences
        added, removed, changed := referencesDiffs(existingNode.owners, accessor.GetOwnerReferences())
        if len(added) != 0 || len(removed) != 0 || len(changed) != 0 {
            // check if the changed dependency graph unblock owners that are
            // waiting for the deletion of their dependents.
            gb.addUnblockedOwnersToDeleteQueue(removed, changed)
            // update the node itself
            existingNode.owners = accessor.GetOwnerReferences()
            // Add the node to its new owners' dependent lists.
            gb.addDependentToOwners(existingNode, added)
            // remove the node from the dependent list of node that are no longer in
            // the node's owners list.
            gb.removeDependentFromOwners(existingNode, removed)
        }

        if beingDeleted(accessor) {
            existingNode.markBeingDeleted()
        }
        gb.processTransitions(event.oldObj, accessor, existingNode)
    case event.eventType == deleteEvent:
        if !found {
            klog.V(5).Infof("%v doesn't exist in the graph, this shouldn't happen", accessor.GetUID())
            return true
        }
        // removeNode updates the graph
        gb.removeNode(existingNode)
        existingNode.dependentsLock.RLock()
        defer existingNode.dependentsLock.RUnlock()
        if len(existingNode.dependents) > 0 {
            gb.absentOwnerCache.Add(accessor.GetUID())
        }
        for dep := range existingNode.dependents {
            gb.attemptToDelete.Add(dep)
        }
        for _, owner := range existingNode.owners {
            ownerNode, found := gb.uidToNode.Read(owner.UID)
            if !found || !ownerNode.isDeletingDependents() {
                continue
            }
            // this is to let attempToDeleteItem check if all the owner's
            // dependents are deleted, if so, the owner will be deleted.
            gb.attemptToDelete.Add(ownerNode)
        }
    }
    return true
}

Case 1: the node is not yet in the graph and the event is an add or update. It is handled as follows:
(1) Initialize a node and insert it into the map.
case (event.eventType == addEvent || event.eventType == updateEvent) && !found:
    newNode := &node{
        // the identity of the object: APIVersion, Kind, UID, Name
        identity: objectReference{
            OwnerReference: metav1.OwnerReference{
                APIVersion: event.gvk.GroupVersion().String(),
                Kind:       event.gvk.Kind,
                UID:        accessor.GetUID(),
                Name:       accessor.GetName(),
            },
            Namespace: accessor.GetNamespace(),
        },
        dependents: make(map[*node]struct{}), // empty at this point
        owners:     accessor.GetOwnerReferences(),
        // whether the object is deleting its dependents
        deletingDependents: beingDeleted(accessor) && hasDeleteDependentsFinalizer(accessor),
        // whether the object is being deleted
        beingDeleted: beingDeleted(accessor),
    }
    gb.insertNode(newNode)
    // the underlying delta_fifo may combine a creation and a deletion into
    // one event, so we need to further process the event.
    gb.processTransitions(event.oldObj, accessor, newNode)

(2) insertNode adds the node to the map and also adds it to the dependents of each of its owner nodes.
Suppose the current node is rsA: this step inserts rsA into the map and adds rsA as a dependent of deployA (both helpers are sketched right after this list).
(3) Call processTransitions for further handling. processTransitions is a shared helper whose job is to push the object onto the attemptToOrphan or attemptToDelete queue; it is covered in detail below.
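insertNode and addDependentToOwners are not reproduced in this article. A rough sketch of what they do, paraphrased from the upstream source (details vary between Kubernetes versions):

// insertNode writes the node into uidToNode and hangs it off each of its owners.
func (gb *GraphBuilder) insertNode(n *node) {
    gb.uidToNode.Write(n)
    gb.addDependentToOwners(n, n.owners)
}

// addDependentToOwners adds n to the dependents of every owner node. If an owner has not
// been observed yet, a "virtual" node is created so the edge can still be recorded, and the
// virtual node is queued to attemptToDelete so its real existence gets verified against the
// API server later.
func (gb *GraphBuilder) addDependentToOwners(n *node, owners []metav1.OwnerReference) {
    for _, owner := range owners {
        ownerNode, ok := gb.uidToNode.Read(owner.UID)
        if !ok {
            ownerNode = &node{
                identity: objectReference{
                    OwnerReference: owner,
                    Namespace:      n.identity.Namespace,
                },
                dependents: make(map[*node]struct{}),
                virtual:    true, // not yet observed via an informer event
            }
            gb.uidToNode.Write(ownerNode)
        }
        ownerNode.addDependent(n)
        if !ok {
            // ask attemptToDelete to verify whether the virtual owner really exists
            gb.attemptToDelete.Add(ownerNode)
        }
    }
}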
Case 2: the node already exists in the graph and the event is an add or update. It is handled as follows:
(1) Process the diff of the ownerReferences:
First, compare the owners recorded on the node with the object's latest ownerReferences to work out what changed. There are three kinds of change: added (owners newly added to the object's OwnerReferences), removed (owners that were removed), and changed (owners whose reference was modified).
The three kinds of change are handled as follows:
a. Call addUnblockedOwnersToDeleteQueue to requeue owners whose deletion may have been blocked (see the analysis in the code comments below).
b. existingNode.owners = accessor.GetOwnerReferences(), so the node uses the latest owners.
c. For each added owner, add this node to that owner's dependents.
d. For each removed owner, remove this node from that former owner's dependents.
(2) If the object has a deletionTimestamp, mark the node as being deleted.
(3) Call processTransitions for further handling. As before, it pushes the object onto the attemptToOrphan or attemptToDelete queue and is covered in detail below.
case (event.eventType == addEvent || event.eventType == updateEvent) && found:
    // handle changes in ownerReferences
    added, removed, changed := referencesDiffs(existingNode.owners, accessor.GetOwnerReferences())
    if len(added) != 0 || len(removed) != 0 || len(changed) != 0 {
        // check if the changed dependency graph unblock owners that are
        // waiting for the deletion of their dependents.
        // a. requeue owners that may have been blocked; see the comment on addUnblockedOwnersToDeleteQueue below
        gb.addUnblockedOwnersToDeleteQueue(removed, changed)
        // update the node itself
        // b. make the node use the latest owners
        existingNode.owners = accessor.GetOwnerReferences()
        // Add the node to its new owners' dependent lists.
        // c. for added owners, add this node to their dependents
        gb.addDependentToOwners(existingNode, added)
        // remove the node from the dependent list of node that are no longer in
        // the node's owners list.
        // d. for removed owners, remove this node from their dependents
        gb.removeDependentFromOwners(existingNode, removed)
    }
    if beingDeleted(accessor) {
        existingNode.markBeingDeleted()
    }
    gb.processTransitions(event.oldObj, accessor, existingNode)

// TODO: profile this function to see if a naive N^2 algorithm performs better
// when the number of references is small.
func referencesDiffs(old []metav1.OwnerReference, new []metav1.OwnerReference) (added []metav1.OwnerReference, removed []metav1.OwnerReference, changed []ownerRefPair) {
    oldUIDToRef := make(map[string]metav1.OwnerReference)
    for _, value := range old {
        oldUIDToRef[string(value.UID)] = value
    }
    oldUIDSet := sets.StringKeySet(oldUIDToRef)
    for _, value := range new {
        newUID := string(value.UID)
        if oldUIDSet.Has(newUID) {
            if !reflect.DeepEqual(oldUIDToRef[newUID], value) {
                changed = append(changed, ownerRefPair{oldRef: oldUIDToRef[newUID], newRef: value})
            }
            oldUIDSet.Delete(newUID)
        } else {
            added = append(added, value)
        }
    }
    for oldUID := range oldUIDSet {
        removed = append(removed, oldUIDToRef[oldUID])
    }

    return added, removed, changed
}

// When deployA is deleted in foreground mode, deployA is blocked waiting for rsA to be deleted.
// If rsA's OwnerReference then changes, e.g. the owner deployA is removed from it, deployA must be
// told that it no longer has to wait and can be deleted right away.
// addUnblockedOwnersToDeleteQueue does exactly that: it detects the change to rsA's OwnerReference
// and puts the waiting deployA back onto the delete queue.
// if an blocking ownerReference points to an object gets removed, or gets set to
// "BlockOwnerDeletion=false", add the object to the attemptToDelete queue.
func (gb *GraphBuilder) addUnblockedOwnersToDeleteQueue(removed []metav1.OwnerReference, changed []ownerRefPair) {
    for _, ref := range removed {
        if ref.BlockOwnerDeletion != nil && *ref.BlockOwnerDeletion {
            node, found := gb.uidToNode.Read(ref.UID)
            if !found {
                klog.V(5).Infof("cannot find %s in uidToNode", ref.UID)
                continue
            }
            gb.attemptToDelete.Add(node)
        }
    }
    for _, c := range changed {
        wasBlocked := c.oldRef.BlockOwnerDeletion != nil && *c.oldRef.BlockOwnerDeletion
        isUnblocked := c.newRef.BlockOwnerDeletion == nil || (c.newRef.BlockOwnerDeletion != nil && !*c.newRef.BlockOwnerDeletion)
        if wasBlocked && isUnblocked {
            node, found := gb.uidToNode.Read(c.newRef.UID)
            if !found {
                klog.V(5).Infof("cannot find %s in uidToNode", c.newRef.UID)
                continue
            }
            gb.attemptToDelete.Add(node)
        }
    }
}

Case 3: the object has been deleted. It is handled as follows:
(1) Remove the node from the graph. If the node has dependents, add it to absentOwnerCache. This is very useful: once deployA has been deleted, rsA can tell from absentOwnerCache that deployA really did exist and really has been deleted.
(2) Add all of its dependents to the attemptToDelete queue.
(3) For each of the node's owners that is currently deleting its dependents: that owner may well have been waiting for this node, so now that the node is gone, the owner is put back onto the delete queue.
case event.eventType == deleteEvent:
    if !found {
        klog.V(5).Infof("%v doesn't exist in the graph, this shouldn't happen", accessor.GetUID())
        return true
    }
    // removeNode updates the graph
    gb.removeNode(existingNode)
    existingNode.dependentsLock.RLock()
    defer existingNode.dependentsLock.RUnlock()
    if len(existingNode.dependents) > 0 {
        gb.absentOwnerCache.Add(accessor.GetUID())
    }
    for dep := range existingNode.dependents {
        gb.attemptToDelete.Add(dep)
    }
    for _, owner := range existingNode.owners {
        ownerNode, found := gb.uidToNode.Read(owner.UID)
        if !found || !ownerNode.isDeletingDependents() {
            continue
        }
        // this is to let attempToDeleteItem check if all the owner's
        // dependents are deleted, if so, the owner will be deleted.
        gb.attemptToDelete.Add(ownerNode)
    }

2.4 The logic of processTransitions
From the analysis above, runProcessGraphChanges does two things:
(1) keep the graph correct and complete at all times
(2) push objects that may need to be deleted onto the attemptToOrphan and attemptToDelete queues
processTransitions implements the second part: it pushes objects that may need to be deleted onto the attemptToOrphan or attemptToDelete queue.
The decision logic is simple:
(1) if the object is being deleted and carries the orphan finalizer, push it onto the attemptToOrphan queue;
(2) if the object is being deleted and carries the foregroundDeletion finalizer, push it and its dependents onto attemptToDelete.
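The two predicates used by processTransitions (shown next), startsWaitingForDependentsOrphaned and startsWaitingForDependentsDeleted, are not included in this article. Roughly, paraphrasing the upstream helpers, they detect the moment an object transitions into being deleted while carrying the orphan or foregroundDeletion finalizer:

func startsWaitingForDependentsOrphaned(oldObj interface{}, newAccessor metav1.Object) bool {
    return deletionStartsWithFinalizer(oldObj, newAccessor, metav1.FinalizerOrphanDependents)
}

func startsWaitingForDependentsDeleted(oldObj interface{}, newAccessor metav1.Object) bool {
    return deletionStartsWithFinalizer(oldObj, newAccessor, metav1.FinalizerDeleteDependents)
}

func deletionStartsWithFinalizer(oldObj interface{}, newAccessor metav1.Object, matchingFinalizer string) bool {
    // the new state must be "being deleted" and must carry the finalizer we care about
    if !beingDeleted(newAccessor) || !hasFinalizer(newAccessor, matchingFinalizer) {
        return false
    }
    // it is a transition only if the old state did not already satisfy both conditions
    if oldObj == nil {
        return true
    }
    oldAccessor, err := meta.Accessor(oldObj)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("cannot access oldObj: %v", err))
        return false
    }
    return !beingDeleted(oldAccessor) || !hasFinalizer(oldAccessor, matchingFinalizer)
}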
func (gb *GraphBuilder) processTransitions(oldObj interface{}, newAccessor metav1.Object, n *node) {
    if startsWaitingForDependentsOrphaned(oldObj, newAccessor) {
        klog.V(5).Infof("add %s to the attemptToOrphan", n.identity)
        gb.attemptToOrphan.Add(n)
        return
    }
    if startsWaitingForDependentsDeleted(oldObj, newAccessor) {
        klog.V(2).Infof("add %s to the attemptToDelete, because it's waiting for its dependents to be deleted", n.identity)
        // if the n is added as a "virtual" node, its deletingDependents field is not properly set, so always set it here.
        n.markDeletingDependents()
        for dep := range n.dependents {
            gb.attemptToDelete.Add(dep)
        }
        gb.attemptToDelete.Add(n)
    }
}

2.5 runAttemptToOrphanWorker
The logic of runAttemptToOrphanWorker is:
(1) collect all of the node's dependents
(2) call orphanDependents, which removes this owner from the OwnerReferences of those dependents
(3) remove the orphan finalizer so the object itself can be deleted
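orphanDependents itself is not reproduced here; in essence it patches every dependent so that this owner disappears from the dependent's ownerReferences. A hedged sketch of the patch body it sends — deleteOwnerRefPatch is an illustrative helper modeled on the upstream deleteOwnerRefStrategicMergePatch, so treat the exact wire format as an assumption:

package main

import (
    "fmt"
    "strings"

    "k8s.io/apimachinery/pkg/types"
)

// deleteOwnerRefPatch (illustrative) builds a strategic-merge patch that removes the given
// owner UIDs from a dependent's ownerReferences; the dependent's own uid is included so the
// patch does not apply if the object has been replaced in the meantime.
func deleteOwnerRefPatch(dependentUID types.UID, ownerUIDs ...types.UID) []byte {
    var pieces []string
    for _, ownerUID := range ownerUIDs {
        pieces = append(pieces, fmt.Sprintf(`{"$patch":"delete","uid":"%s"}`, ownerUID))
    }
    patch := fmt.Sprintf(`{"metadata":{"ownerReferences":[%s],"uid":"%s"}}`,
        strings.Join(pieces, ","), dependentUID)
    return []byte(patch)
}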
func (gc *GarbageCollector) runAttemptToOrphanWorker() {
    for gc.attemptToOrphanWorker() {
    }
}

// attemptToOrphanWorker dequeues a node from the attemptToOrphan, then finds its
// dependents based on the graph maintained by the GC, then removes it from the
// OwnerReferences of its dependents, and finally updates the owner to remove
// the "Orphan" finalizer. The node is added back into the attemptToOrphan if any of
// these steps fail.
func (gc *GarbageCollector) attemptToOrphanWorker() bool {
    item, quit := gc.attemptToOrphan.Get()
    gc.workerLock.RLock()
    defer gc.workerLock.RUnlock()
    if quit {
        return false
    }
    defer gc.attemptToOrphan.Done(item)
    owner, ok := item.(*node)
    if !ok {
        utilruntime.HandleError(fmt.Errorf("expect *node, got %#v", item))
        return true
    }
    // we don't need to lock each element, because they never get updated
    owner.dependentsLock.RLock()
    dependents := make([]*node, 0, len(owner.dependents))
    // 1. collect all dependents of this node
    for dependent := range owner.dependents {
        dependents = append(dependents, dependent)
    }
    owner.dependentsLock.RUnlock()
    // 2. call orphanDependents to strip this owner from the dependents' OwnerReferences.
    // For example, when deleting deployA, rsA's OwnerReference to deployA is removed, so rsA is no longer controlled by deployA.
    err := gc.orphanDependents(owner.identity, dependents)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("orphanDependents for %s failed with %v", owner.identity, err))
        gc.attemptToOrphan.AddRateLimited(item)
        return true
    }
    // update the owner, remove "orphaningFinalizer" from its finalizers list
    // 3. remove the orphan finalizer so that deployA can be deleted
    err = gc.removeFinalizer(owner, metav1.FinalizerOrphanDependents)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("removeOrphanFinalizer for %s failed with %v", owner.identity, err))
        gc.attemptToOrphan.AddRateLimited(item)
    }
    return true
}

2.6 attemptToDeleteWorker
attemptToDeleteWorker mainly calls attemptToDeleteItem, whose logic is:
(1) If the object isBeingDeleted and is not deleting its dependents, return immediately.
(2) If the object is currently deleting its dependents, add those dependents to the attemptToDelete queue.
(3) Call classifyReferences to split the object's ownerReferences into solid, dangling and waitingForDependentsDeletion; each of the three is a slice of OwnerReferences:
solid: the owner exists and is not in the deletingDependents state
dangling: the owner no longer exists
waitingForDependentsDeletion: the owner exists and is in the deletingDependents state
(4) Handle the object according to solid, dangling and waitingForDependentsDeletion:
Case 1: at least one owner exists and is not deleting its dependents. If dangling and waitingForDependentsDeletion are both empty, nothing needs to be done; otherwise the ownerReferences pointing at the dangling and waitingForDependentsDeletion owners are removed from the object.
Case 2: reaching here means len(solid) == 0. If some owner is waiting for this node to be deleted and the node itself still has dependents, the node's ownerReferences are first made non-blocking when a potential cycle is detected, and the node is then deleted with the Foreground policy. An example: when deployA is deleted in foreground mode and rsA is the node being processed, rsA sees that deployA is waiting for it, but rsA itself still has the dependent podA, so rsA is immediately deleted in foreground mode as well. From deployA's point of view this yields exactly: podA is deleted first, then rsA, then deployA.
Case 3: in every other case, delete the node with the propagation policy implied by its finalizers. An example: when deployA is deleted in background mode and rsA is the node being processed, deployA is already gone and rsA carries no finalizer (only the Orphan and Foreground policies add finalizers), so rsA is deleted with the default Background policy.
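classifyReferences is not reproduced in the excerpt below. A rough sketch of the classification, paraphrased from the upstream source; isDangling (also not shown) consults the absentOwnerCache and, if necessary, the API server to decide whether the owner still exists:

// classifyReferences groups the item's ownerReferences into the three buckets described above.
func (gc *GarbageCollector) classifyReferences(item *node, latestReferences []metav1.OwnerReference) (
    solid, dangling, waitingForDependentsDeletion []metav1.OwnerReference, err error) {
    for _, reference := range latestReferences {
        isDangling, owner, err := gc.isDangling(reference, item)
        if err != nil {
            return nil, nil, nil, err
        }
        if isDangling {
            // the owner no longer exists
            dangling = append(dangling, reference)
            continue
        }
        ownerAccessor, err := meta.Accessor(owner)
        if err != nil {
            return nil, nil, nil, err
        }
        if ownerAccessor.GetDeletionTimestamp() != nil && hasDeleteDependentsFinalizer(ownerAccessor) {
            // the owner exists but is waiting for its dependents to be deleted (foreground deletion)
            waitingForDependentsDeletion = append(waitingForDependentsDeletion, reference)
        } else {
            // the owner exists and is not deleting its dependents
            solid = append(solid, reference)
        }
    }
    return solid, dangling, waitingForDependentsDeletion, nil
}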
func (gc *GarbageCollector) attemptToDeleteWorker() bool {
    item, quit := gc.attemptToDelete.Get()

    err := gc.attemptToDeleteItem(n)

    return true
}

func (gc *GarbageCollector) attemptToDeleteItem(item *node) error {
    klog.V(2).Infof("processing item %s", item.identity)
    // "being deleted" is an one-way trip to the final deletion. We'll just wait for the final deletion, and then process the object's dependents.
    // 1. If the object isBeingDeleted and is not deleting dependents, return immediately
    if item.isBeingDeleted() && !item.isDeletingDependents() {
        klog.V(5).Infof("processing item %s returned at once, because its DeletionTimestamp is non-nil", item.identity)
        return nil
    }
    // TODO: It's only necessary to talk to the API server if this is a
    // "virtual" node. The local graph could lag behind the real status, but in
    // practice, the difference is small.
    latest, err := gc.getObject(item.identity)
    switch {
    case errors.IsNotFound(err):
        // the GraphBuilder can add "virtual" node for an owner that doesn't
        // exist yet, so we need to enqueue a virtual Delete event to remove
        // the virtual node from GraphBuilder.uidToNode.
        klog.V(5).Infof("item %v not found, generating a virtual delete event", item.identity)
        gc.dependencyGraphBuilder.enqueueVirtualDeleteEvent(item.identity)
        // since we're manually inserting a delete event to remove this node,
        // we don't need to keep tracking it as a virtual node and requeueing in attemptToDelete
        item.markObserved()
        return nil
    case err != nil:
        return err
    }

    if latest.GetUID() != item.identity.UID {
        klog.V(5).Infof("UID doesn't match, item %v not found, generating a virtual delete event", item.identity)
        gc.dependencyGraphBuilder.enqueueVirtualDeleteEvent(item.identity)
        // since we're manually inserting a delete event to remove this node,
        // we don't need to keep tracking it as a virtual node and requeueing in attemptToDelete
        item.markObserved()
        return nil
    }

    // TODO: attemptToOrphanWorker() routine is similar. Consider merging
    // attemptToOrphanWorker() into attemptToDeleteItem() as well.
    // 2. If the object is deleting its dependents, add the dependents to the attemptToDelete queue
    if item.isDeletingDependents() {
        return gc.processDeletingDependentsItem(item)
    }
    // compute if we should delete the item
    ownerReferences := latest.GetOwnerReferences()
    if len(ownerReferences) == 0 {
        klog.V(2).Infof("object %s's doesn't have an owner, continue on next item", item.identity)
        return nil
    }
    // 3. Classify the ownerReferences into solid, dangling and waitingForDependentsDeletion.
    solid, dangling, waitingForDependentsDeletion, err := gc.classifyReferences(item, ownerReferences)
    if err != nil {
        return err
    }
    klog.V(5).Infof("classify references of %s.\nsolid: %#v\ndangling: %#v\nwaitingForDependentsDeletion: %#v\n", item.identity, solid, dangling, waitingForDependentsDeletion)

    // 4. Handle the object according to solid, dangling and waitingForDependentsDeletion
    switch {
    // Case 1: at least one owner exists and is not deleting its dependents. If dangling and
    // waitingForDependentsDeletion are both empty there is nothing to do; otherwise the
    // references to those owners are removed from the object.
    case len(solid) != 0:
        klog.V(2).Infof("object %#v has at least one existing owner: %#v, will not garbage collect", item.identity, solid)
        if len(dangling) == 0 && len(waitingForDependentsDeletion) == 0 {
            return nil
        }
        klog.V(2).Infof("remove dangling references %#v and waiting references %#v for object %s", dangling, waitingForDependentsDeletion, item.identity)
        // waitingForDependentsDeletion needs to be deleted from the
        // ownerReferences, otherwise the referenced objects will be stuck with
        // the FinalizerDeletingDependents and never get deleted.
        ownerUIDs := append(ownerRefsToUIDs(dangling), ownerRefsToUIDs(waitingForDependentsDeletion)...)
        patch := deleteOwnerRefStrategicMergePatch(item.identity.UID, ownerUIDs...)
        _, err = gc.patch(item, patch, func(n *node) ([]byte, error) {
            return gc.deleteOwnerRefJSONMergePatch(n, ownerUIDs...)
        })
        return err
    // Case 2: here len(solid) == 0. If some owner is waiting for this node to be deleted and the
    // node still has dependents, its ownerReferences are made non-blocking when a cycle is
    // suspected, and then the node is deleted with the Foreground policy.
    case len(waitingForDependentsDeletion) != 0 && item.dependentsLength() != 0:
        deps := item.getDependents()
        for _, dep := range deps {
            if dep.isDeletingDependents() {
                // this circle detection has false positives, we need to
                // apply a more rigorous detection if this turns out to be a
                // problem.
                // there are multiple workers run attemptToDeleteItem in
                // parallel, the circle detection can fail in a race condition.
                klog.V(2).Infof("processing object %s, some of its owners and its dependent [%s] have FinalizerDeletingDependents, to prevent potential cycle, its ownerReferences are going to be modified to be non-blocking, then the object is going to be deleted with Foreground", item.identity, dep.identity)
                patch, err := item.unblockOwnerReferencesStrategicMergePatch()
                if err != nil {
                    return err
                }
                if _, err := gc.patch(item, patch, gc.unblockOwnerReferencesJSONMergePatch); err != nil {
                    return err
                }
                break
            }
        }
        klog.V(2).Infof("at least one owner of object %s has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground", item.identity)
        // the deletion event will be observed by the graphBuilder, so the item
        // will be processed again in processDeletingDependentsItem. If it
        // doesn't have dependents, the function will remove the
        // FinalizerDeletingDependents from the item, resulting in the final
        // deletion of the item.
        policy := metav1.DeletePropagationForeground
        return gc.deleteObject(item.identity, &policy)
    // Case 3: otherwise, delete the node with the propagation policy implied by its finalizers
    default:
        // item doesn't have any solid owner, so it needs to be garbage
        // collected. Also, none of item's owners is waiting for the deletion of
        // the dependents, so set propagationPolicy based on existing finalizers.
        var policy metav1.DeletionPropagation
        switch {
        case hasOrphanFinalizer(latest):
            // if an existing orphan finalizer is already on the object, honor it.
            policy = metav1.DeletePropagationOrphan
        case hasDeleteDependentsFinalizer(latest):
            // if an existing foreground finalizer is already on the object, honor it.
            policy = metav1.DeletePropagationForeground
        default:
            // otherwise, default to background.
            policy = metav1.DeletePropagationBackground
        }
        klog.V(2).Infof("delete object %s with propagation policy %s", item.identity, policy)
        return gc.deleteObject(item.identity, &policy)
    }
}

2.7 What exactly is uidToNode
The debug handler is enabled at the end of startGarbageCollectorController:

return garbagecollector.NewDebugHandler(garbageCollector), true, nil

With it we can inspect the data held in uidToNode. There is far too much of it to show in full, so here we only look at the uidToNode data for the kube-hpa Deployment in the kube-system namespace.
kcm serves this endpoint on port 10252:

# 639d5269-d73d-4964-a7de-d6f386c9c7e4 is the uid of the kube-hpa Deployment.
# curl http://127.0.0.1:10252/debug/controllers/garbagecollector/graph?uid=639d5269-d73d-4964-a7de-d6f386c9c7e4
strict digraph full {
  // Node definitions.
  0 [
    label="\"uid=e66e45c0-5695-4c93-82f1-067b20aa035f\nnamespace=kube-system\nReplicaSet.v1.apps/kube-hpa-84c884f994\n\""
    group="apps"
    version="v1"
    kind="ReplicaSet"
    namespace="kube-system"
    name="kube-hpa-84c884f994"
    uid="e66e45c0-5695-4c93-82f1-067b20aa035f"
    missing="false"
    beingDeleted="false"
    deletingDependents="false"
    virtual="false"
  ];
  1 [
    label="\"uid=9833c399-b139-4432-98f7-cec13158f804\nnamespace=kube-system\nPod.v1/kube-hpa-84c884f994-7gwpz\n\""
    group=""
    version="v1"
    kind="Pod"
    namespace="kube-system"
    name="kube-hpa-84c884f994-7gwpz"
    uid="9833c399-b139-4432-98f7-cec13158f804"
    missing="false"
    beingDeleted="false"
    deletingDependents="false"
    virtual="false"
  ];
  2 [
    label="\"uid=639d5269-d73d-4964-a7de-d6f386c9c7e4\nnamespace=kube-system\nDeployment.v1.apps/kube-hpa\n\""
    group="apps"
    version="v1"
    kind="Deployment"
    namespace="kube-system"
    name="kube-hpa"
    uid="639d5269-d73d-4964-a7de-d6f386c9c7e4"
    missing="false"
    beingDeleted="false"
    deletingDependents="false"
    virtual="false"
  ];

  // Edge definitions.
  0 -> 2;
  1 -> 0;
}

As you can see, the graph encodes the dependencies between the objects, while beingDeleted and deletingDependents describe each node's current state.
The graph can also be rendered as an image:

curl http://127.0.0.1:10252/debug/controllers/garbagecollector/graph?uid=639d5269-d73d-4964-a7de-d6f386c9c7e4 > tmp.dot
dot -Tsvg -o graph.svg tmp.dot

graph.svg then shows the rendered dependency graph (the image itself is not reproduced here).
3. Summary
The GC logic is quite convoluted and hard to follow at first, but after reading it a few times its elegance shows. To recap the whole flow once more:
(1) When kcm starts, the GC controller starts with it and performs the following initialization:
it periodically fetches all deletable resources, stores them via the RESTMapper, and starts watching those resources;
for these resources it registers add, update and delete handlers: any change is wrapped into an event and pushed onto the graphChanges queue.
(2) runProcessGraphChanges consumes the graphChanges queue and does two things:
first, it maintains the uidToNode graph according to the changes; every object corresponds to a node in uidToNode, and every node carries owners and dependents fields;
second, based on fields such as beingDeleted and deletingDependents, it decides whether a node may need to be deleted, and if so pushes it onto the attemptToDelete or attemptToOrphan queue.
(3) attemptToDeleteWorker and attemptToOrphanWorker drain the attemptToDelete and attemptToOrphan queues and carry out the deletion appropriate to each case.