
Firefighting notes: a deep dive into troubleshooting a K8s master-node failure


Cluster and environment information:

  • k8s v1.18.4

  • 3 Master nodes, each with 8 cores, 16 GB RAM, and a 50 Gi SSD

  • 19 Minion (worker) nodes with heterogeneous specs

  • control-plane components (kube-apiserver, etcd, kube-controller-manager, kube-scheduler) deployed as static Pods

  • a VIP in front of the 3 kube-apiservers to load-balance the traffic

  • the Tencent Cloud SSDs deliver roughly 130 MB/s

Failure description

On the afternoon of September 10, 2021, "strange" things started to happen: kubectl would occasionally hang and fail to CRUD or Watch standard resources (Pod, Node, and so on). By then it was already clear that some of the kube-apiservers were not working properly, so we logged into each of the three kube-apiserver hosts and found the following abnormal situations:

Information from the scene

kube-apiserver Pod status on the control plane:

    $ kubectl get pods -n kube-system kube-apiserver-x.x.x.x -o yaml
    ...
    containerStatuses:
    - containerID: docker://xxxxx....
      lastState:
        terminated:
          containerID: docker://yyyy
          exitCode: 137
          finishedAt: "2021-09-10T09:29:02Z"
          reason: OOMKilled
          startedAt: "2020-12-09T07:02:23Z"
      name: kube-apiserver
      ready: true
      restartCount: 1
      started: true
      state:
        running:
          startedAt: "2021-09-10T09:29:08Z"
    ...

So on September 10 the kube-apiserver had been OOM-killed.
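
A check like this can also be scripted. Below is a minimal client-go sketch (my own illustration, not part of the original investigation; it assumes a kubeconfig at the default path) that lists kube-system Pods and flags containers whose last termination reason was OOMKilled:

    // oomcheck.go - list kube-system containers that were OOM-killed.
    // Minimal sketch; assumes a valid kubeconfig at ~/.kube/config.
    package main

    import (
    	"context"
    	"fmt"
    	"path/filepath"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/tools/clientcmd"
    	"k8s.io/client-go/util/homedir"
    )

    func main() {
    	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
    	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    	if err != nil {
    		panic(err)
    	}
    	clientset := kubernetes.NewForConfigOrDie(cfg)

    	pods, err := clientset.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{})
    	if err != nil {
    		panic(err)
    	}
    	for _, pod := range pods.Items {
    		for _, cs := range pod.Status.ContainerStatuses {
    			// lastState.terminated is exactly the field shown in the YAML above.
    			term := cs.LastTerminationState.Terminated
    			if term != nil && term.Reason == "OOMKilled" {
    				fmt.Printf("%s/%s restarted %d times, last OOM-killed at %s\n",
    					pod.Name, cs.Name, cs.RestartCount, term.FinishedAt)
    			}
    		}
    	}
    }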

Surrounding monitoring

Monitoring from the IaaS layer (black-box metrics for the hosts running the control plane):

[Figures: IaaS-level host monitoring charts]

Key observations:

  • Memory, CPU, and disk reads were positively correlated, and from September 10 onward every metric dropped noticeably back to normal

Prometheus monitoring for kube-apiserver:

[Figure: kube-apiserver Prometheus dashboards]

Key observations:

  • kube-apiserver was having I/O problems; for part of the period Prometheus could not even scrape its metrics endpoint
  • kube-apiserver memory usage grew monotonically, and the ADD rate of its internal workqueues was very high

Live debug information

[Figures: live resource-usage snapshots taken on the Master nodes]

Key observations:

  • Two of the Master nodes had roughly 80%~90% of their memory in use

  • Most of that memory was eaten by the kube-apiserver process

  • One Master node stood out: besides memory, its CPU was also saturated, with a large share of CPU time spent in kernel mode and a very high iowait (wa)

  • Almost every process on that machine had gone "mad", all doing heavy disk reads; the shell session was barely usable

  • On the only Master node with comparatively low memory usage (about 8 Gi at the time), kube-apiserver had previously been OOM-killed

Questions and hypotheses

Why did kube-apiserver consume so much memory?

  1. Some client was doing full List operations on core resources

  2. etcd could not serve requests, so kube-apiserver could not provide leader election for the other control-plane components; kube-controller-manager and kube-scheduler kept restarting and re-running ListAndWatch, eventually dragging the whole machine down

  3. kube-apiserver had a bug and was leaking memory

Why did the etcd cluster stop working properly?

  1. Network jitter inside the etcd cluster
  2. Degraded disk performance that could no longer sustain normal etcd operation
  3. The etcd hosts were starved of compute resources (CPU, RAM), so etcd got very few time slices and the network-descriptor deadlines set in its code expired

Why would stateless services like kube-controller-manager and kube-scheduler read the disk heavily?

  1. kube-controller-manager and kube-scheduler were reading their local configuration files
  2. Under extreme memory pressure the OS evicts memory pages, including the code pages of processes with large text segments, to keep the scheduled processes running; when an evicted process is scheduled again, those pages have to be read back from disk, which drives I/O up (one way to check this is sketched below)
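
To put that second hypothesis to the test, one quick check is to compare cumulative disk reads per process from /proc/<pid>/io while the machine is thrashing; pages read back from disk should show up there, although the exact accounting depends on the kernel. A small sketch (illustration only; the 1 GiB threshold is arbitrary, and reading other users' /proc/<pid>/io needs root):

    // procio.go - print processes whose cumulative disk reads exceed 1 GiB,
    // based on the read_bytes field in /proc/<pid>/io. Sketch only.
    package main

    import (
    	"fmt"
    	"os"
    	"strconv"
    	"strings"
    )

    // readBytes returns the read_bytes counter for a PID, if readable.
    func readBytes(pid string) (int64, bool) {
    	data, err := os.ReadFile("/proc/" + pid + "/io")
    	if err != nil {
    		return 0, false
    	}
    	for _, line := range strings.Split(string(data), "\n") {
    		if strings.HasPrefix(line, "read_bytes:") {
    			n, _ := strconv.ParseInt(strings.TrimSpace(strings.TrimPrefix(line, "read_bytes:")), 10, 64)
    			return n, true
    		}
    	}
    	return 0, false
    }

    func main() {
    	entries, err := os.ReadDir("/proc")
    	if err != nil {
    		panic(err)
    	}
    	for _, e := range entries {
    		// Numeric directory names under /proc are PIDs.
    		if _, err := strconv.Atoi(e.Name()); err != nil {
    			continue
    		}
    		if n, ok := readBytes(e.Name()); ok && n > 1<<30 {
    			comm, _ := os.ReadFile("/proc/" + e.Name() + "/comm")
    			fmt.Printf("pid %-8s %-20s read_bytes=%d\n", e.Name(), strings.TrimSpace(string(comm)), n)
    		}
    	}
    }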

Some logs

kube-apiserver logs:

    
    I0907 07:04:17.611412    1 trace.go:116] Trace[1140445702]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-b8d912ae5f2cd6d9cfaecc515568c455afdc729fd4c721e4a6ed24ae09d9bcb6,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:16.635225967 +0800 CST m=+23472113.230522917) (total time: 976.1773ms):
    
    Trace[1140445702]: [976.164659ms] [976.159045ms] About to write a response
    
    I0907 07:04:17.611478    1 trace.go:116] Trace[630463685]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-30c942388d7a835b685e5d5a44a15943bfc87b228e3b9ed2fa01ed071fbc1ada,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:16.627643544 +0800 CST m=+23472113.222940496) (total time: 983.823847ms):
    
    Trace[630463685]: [983.812225ms] [983.803546ms] About to write a response
    
    I0907 07:04:17.611617    1 trace.go:116] Trace[1538026391]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-3360a8de64a82e17aa96f373f6dc185489bbc653824b3099c81a6fce754a3f6f,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:16.565151702 +0800 CST m=+23472113.160448667) (total time: 1.046416117s):
    
    Trace[1538026391]: [1.046397533s] [1.046390931s] About to write a response
    
    
    
    # .... N more similar log lines ....
    I0907 07:04:24.171900    1 trace.go:116] Trace[1679666311]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-b5ee4d68917fbf1262cc0a8047d9c37133a4a766f53f3566c40f868edd8a0bf0,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:17.116692376 +0800 CST m=+23472113.711989328) (total time: 5.467094616s):
    Trace[1679666311]: [4.69843064s] [4.698422403s] About to write a response
    Trace[1679666311]: [5.467093266s] [768.662626ms] Transformed response object
    I0907 07:04:24.142154    1 trace.go:116] Trace[2029984150]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-29407a58f021e9911d3ed9b5745dddfd40e61d64e1ecc62cf71ab9e037f83107,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:17.102181932 +0800 CST m=+23472113.697478883) (total time: 5.50528966s):
    Trace[2029984150]: [4.704869664s] [4.704865103s] About to write a response
    Trace[2029984150]: [5.505288975s] [800.419311ms] Transformed response object
    E0907 07:04:37.327057    1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled]
    I0907 07:04:40.391602    1 trace.go:116] Trace[2032502471]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/10.0.0.8,user-agent:kubelet/v1.18.4 (linux/amd64) kubernetes/f8797eb,client:10.0.0.8 (started: 2021-09-07 07:04:17.392952999 +0800 CST m=+23472113.988249966) (total time: 20.164329656s):
    Trace[2032502471]: [286.281637ms] [286.281637ms] About to convert to expected version
    Trace[2032502471]: [481.650367ms] [195.36873ms] Conversion done
    Trace[2032502471]: [674.825685ms] [193.175318ms] About to store object in database
    Trace[2032502471]: [20.164329656s] [19.489503971s] END
    I0907 07:04:41.502454    1 trace.go:116] Trace[951110924]: "Get" url:/apis/storage.k8s.io/v1/volumeattachments/csi-0938f8852ff15a3937d619216756f9bfa5b0d91c0eb95a881b3caccbf8e87fb2,user-agent:kube-controller-manager/v1.18.4 (linux/amd64) kubernetes/f8797eb/system:serviceaccount:kube-system:attachdetach-controller,client:10.0.0.42 (started: 2021-09-07 07:04:17.29109778 +0800 CST m=+23472113.886395069) (total time: 20.41128212s):
    Trace[951110924]: [12.839487235s] [12.839478549s] About to write a response
    Trace[951110924]: [20.411280569s] [7.571793334s] Transformed response object
    W0907 07:10:39.496915    1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to { <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing context deadline exceeded". Reconnecting...
    I0907 07:10:39.514215    1 clientconn.go:882] blockingPicker: the picked transport is not ready, loop back to repick
    E0907 07:10:39.514473    1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
    E0907 07:10:39.514538    1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
    I0907 07:10:39.514662    1 clientconn.go:882] blockingPicker: the picked transport is not ready, loop back to repick
    E0907 07:10:39.514679    1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
    E0907 07:10:39.514858    1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
    W0907 07:10:39.514925    1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to { <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing context deadline exceeded". Reconnecting...

You can see that operations against etcd got slower and slower, and in the end the connection to etcd was lost altogether.

etcd logs (the logs from September 7 had already been lost):

    {"level":"warn","ts":"2021-09-10T17:14:50.559+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.42:49824","server-name":"","error":"readtcp10.0.0.8:2380->10.0.0.42:49824:i/otimeout"}{"level":"warn","ts":"2021-09-10T17:14:58.993+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.6:54656","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:15:03.961+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.6:54642","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:15:32.253+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.6:54658","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:15:49.299+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.42:49690","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:15:55.930+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.42:49736","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:16:29.446+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.6:54664","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:16:24.026+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.42:49696","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:16:36.625+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.42:49714","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:17:54.241+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.42:49878","server-name":"","error":"readtcp10.0.0.8:2380->10.0.0.42:49878:i/otimeout"}{"level":"warn","ts":"2021-09-10T17:18:31.712+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.42:49728","server-name":"","error":"EOF"}{"level":"warn","ts":"2021-09-10T17:18:31.785+0800","caller":"embed/config_logging.go:279","msg":"rejectedconnection","remote-addr":"10.0.0.42:49800","server-name":"","error":"EOF"}
You can see that this node's etcd could no longer even talk to the other members of the cluster, and was unable to serve requests.
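
For completeness, the health and latency of each etcd member can also be probed directly with the etcd clientv3 API. This is only a sketch: the endpoints below are guesses based on the addresses in the logs, and the real cluster requires client TLS certificates, which are omitted here. In practice `etcdctl endpoint status` answers the same question.

    // etcdprobe.go - measure how long a per-endpoint Status RPC takes.
    // Sketch only: endpoints and TLS setup must match the real cluster.
    package main

    import (
    	"context"
    	"fmt"
    	"time"

    	"go.etcd.io/etcd/clientv3"
    )

    func main() {
    	cli, err := clientv3.New(clientv3.Config{
    		Endpoints:   []string{"https://10.0.0.8:2379", "https://10.0.0.6:2379", "https://10.0.0.42:2379"}, // assumed
    		DialTimeout: 5 * time.Second,
    	})
    	if err != nil {
    		panic(err)
    	}
    	defer cli.Close()

    	for _, ep := range cli.Endpoints() {
    		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    		start := time.Now()
    		_, err := cli.Status(ctx, ep) // per-endpoint status RPC
    		cancel()
    		fmt.Printf("endpoint=%s latency=%v err=%v\n", ep, time.Since(start), err)
    	}
    }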

Digging deeper

Inspecting the kube-apiserver heap profile

[Figure: kube-apiserver heap profile (pprof)]

The most recent heap profile of kube-apiserver showed that a huge amount of memory was held by registry(*Store).DeleteCollection. DeleteCollection first performs a List and then deletes each item concurrently, so eating a lot of memory for a moment is not unreasonable. Maybe I was just unlucky and captured the profile right in the middle of a burst of DeleteCollection calls; rest for ten minutes and grab another one.

What is going on? Why does DeleteCollection still hold this much memory? At this point I already started to suspect that kube-apiserver had a goroutine leak.

The heap profile clearly could not provide enough information to pin the problem down any further, so the next step was to grab a goroutine profile of kube-apiserver (one way to fetch it is sketched below).
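
Grabbing these profiles is straightforward as long as kube-apiserver runs with profiling enabled (the default) and the caller is allowed to access the /debug/pprof/* non-resource URLs. A rough sketch of the fetch, with the apiserver address and bearer token as placeholders:

    // pprofdump.go - download the goroutine profile from a kube-apiserver.
    // Sketch: apiserver and token are placeholders; kubectl get --raw
    // "/debug/pprof/goroutine?debug=1" achieves the same thing.
    package main

    import (
    	"crypto/tls"
    	"io"
    	"net/http"
    	"os"
    )

    const (
    	apiserver = "https://10.0.0.8:6443"    // placeholder
    	token     = "REPLACE_WITH_BEARER_TOKEN" // placeholder
    )

    func main() {
    	client := &http.Client{Transport: &http.Transport{
    		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // debugging only
    	}}

    	req, err := http.NewRequest("GET", apiserver+"/debug/pprof/goroutine?debug=1", nil)
    	if err != nil {
    		panic(err)
    	}
    	req.Header.Set("Authorization", "Bearer "+token)

    	resp, err := client.Do(req)
    	if err != nil {
    		panic(err)
    	}
    	defer resp.Body.Close()

    	out, err := os.Create("goroutine.txt")
    	if err != nil {
    		panic(err)
    	}
    	defer out.Close()
    	if _, err := io.Copy(out, resp.Body); err != nil {
    		panic(err)
    	}
    }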

Inspecting the kube-apiserver goroutine profile

About 18,000 goroutines:

    goroutine 18970952966 [chan send, 429 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc00f082ba0, 0xc04e059b90, 0x9, 0x9, 0xc235e37570)
    --
    goroutine 18971918521 [chan send, 394 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc01f5a7c20, 0xc025cf8750, 0x9, 0x9, 0xc128042620)
    --
    goroutine 18975541850 [chan send, 261 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc08bb1c780, 0xc095222240, 0x9, 0x9, 0xc128667340)
    --
    goroutine 18972733441 [chan send, 363 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc015c2c9c0, 0xc0db776240, 0x9, 0x9, 0xc124b0f570)
    --
    goroutine 18973170460 [chan send, 347 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc07a804c60, 0xc04ee83200, 0x9, 0x9, 0xc23e555c00)
    --
    goroutine 18974040729 [chan send, 316 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc028d40780, 0xc243ea0cf0, 0x9, 0x9, 0xc14a6be770)
    --
    goroutine 18974695712 [chan send, 292 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc04573b5c0, 0xc04f0af290, 0x9, 0x9, 0xc1982732d0)
    --
    goroutine 18974885774 [chan send, 285 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc26065b9e0, 0xc06f025710, 0x9, 0x9, 0xc229945f80)
    --
    goroutine 18971511296 [chan send, 408 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc021e74480, 0xc0313a54d0, 0x9, 0x9, 0xc1825177a0)
    --
    goroutine 18975459144 [chan send, 263 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc26cb9d1a0, 0xc17672b440, 0x9, 0x9, 0xc156fcd810)
    --
    goroutine 18973661056 [chan send, 331 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc2ba3f1680, 0xc19df40870, 0x9, 0x9, 0xc24b888a10)
    --
    goroutine 18973633598 [chan send, 331 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc24a0ef5c0, 0xc05472a900, 0x9, 0x9, 0xc01ec14af0)
    --
    goroutine 18974804879 [chan send, 288 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc0863f9ce0, 0xc04fcee090, 0x9, 0x9, 0xc2257a1d50)
    --
    goroutine 18974203255 [chan send, 310 minutes]:
    k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/registry/generic/registry.(*Store).DeleteCollection.func1(0xc2727ed8c0, 0xc021d4ae10, 0x9, 0x9, 0xc173af3570)
    # ... many more, omitted ...

Sure enough, a large number of goroutines were all blocked on "chan send", and their ages were densely clustered. DeleteCollection is normally called by the namespace_deleter in kube-controller-manager. If an API call fails, the controller backs off and then retries; the peculiarity here is that the controller's calls reached kube-apiserver just fine, while kube-apiserver itself could not interact normally with etcd.

Inspecting the kube-controller-manager logs

    E1027 15:15:01.016712       1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: request timed out
    I1027 15:15:01.950682       1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
    F1027 15:15:01.950760       1 controllermanager.go:279] leaderelection lost
    I1027 15:15:01.962210       1 certificate_controller.go:131] Shutting down certificate controller "csrsigning"

Although there is no log line about the DeleteCollection calls themselves, the leader-election lock (a ConfigMap) that kube-controller-manager relies on could no longer be refreshed either, and the error message states very clearly that kube-apiserver's requests to etcd were timing out.

The kube-apiserver DeleteCollection implementation

      
    func (e *Store) DeleteCollection(ctx context.Context, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions, listOptions *metainternalversion.ListOptions) (runtime.Object, error) {
        ...
        listObj, err := e.List(ctx, listOptions)
        if err != nil {
            return nil, err
        }
        items, err := meta.ExtractList(listObj)
        ...
        wg := sync.WaitGroup{}
        toProcess := make(chan int, 2*workersNumber)
        errs := make(chan error, workersNumber+1)

        go func() {
            defer utilruntime.HandleCrash(func(panicReason interface{}) {
                errs <- fmt.Errorf("DeleteCollection distributor panicked: %v", panicReason)
            })
            for i := 0; i < len(items); i++ {
                toProcess <- i
            }
            close(toProcess)
        }()

        wg.Add(workersNumber)
        for i := 0; i < workersNumber; i++ {
            go func() {
                // panics don't cross goroutine boundaries
                defer utilruntime.HandleCrash(func(panicReason interface{}) {
                    errs <- fmt.Errorf("DeleteCollection goroutine panicked: %v", panicReason)
                })
                defer wg.Done()

                for index := range toProcess {
                    ...
                    if _, _, err := e.Delete(ctx, accessor.GetName(), deleteValidation, options.DeepCopy()); err != nil && !apierrors.IsNotFound(err) {
                        klog.V(4).InfoS("Delete object in DeleteCollection failed", "object", klog.KObj(accessor), "err", err)
                        errs <- err
                        return
                    }
                }
            }()
        }
        wg.Wait()
        select {
        case err := <-errs:
            return nil, err
        default:
            return listObj, nil
        }
    }
If e.Delete returns an error (which in our scenario means the etcd failure), the worker goroutines exit normally, but the goroutine that distributes work can never finish: with no workers left to drain the toProcess channel, it stays blocked on the channel send. As long as that goroutine is alive, the items obtained from etcd through the List call cannot be garbage-collected; they pile up in memory and eventually cause the OOM.
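
The failure mode is easy to reproduce outside of Kubernetes. The following self-contained sketch mimics the same pattern (it is not the real kube-apiserver code; workersNumber, the fake items slice, and the simulated etcd error are all made up): once every worker has bailed out on the first error, nothing drains toProcess anymore, the distributor goroutine blocks on the channel send forever, and the items slice it captured stays reachable.

    // leak.go - minimal reproduction of the DeleteCollection distributor leak.
    package main

    import (
    	"errors"
    	"fmt"
    	"runtime"
    	"sync"
    	"time"
    )

    func deleteCollection(items []string) error {
    	const workersNumber = 2
    	toProcess := make(chan int, 2*workersNumber)
    	errs := make(chan error, workersNumber+1)

    	// Distributor: pushes every index, then closes the channel.
    	go func() {
    		for i := 0; i < len(items); i++ {
    			toProcess <- i // blocks forever once all workers are gone
    		}
    		close(toProcess)
    	}()

    	var wg sync.WaitGroup
    	wg.Add(workersNumber)
    	for i := 0; i < workersNumber; i++ {
    		go func() {
    			defer wg.Done()
    			for range toProcess {
    				// Simulate e.Delete failing because etcd is down.
    				errs <- errors.New("etcdserver: request timed out")
    				return // worker gives up and stops draining toProcess
    			}
    		}()
    	}
    	wg.Wait()

    	select {
    	case err := <-errs:
    		return err
    	default:
    		return nil
    	}
    }

    func main() {
    	items := make([]string, 100000) // stand-in for the listed objects
    	fmt.Println("err:", deleteCollection(items))
    	time.Sleep(time.Second)
    	// The leaked distributor is still alive and keeps items reachable.
    	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
    }

Running it prints the simulated etcd error and then reports the leaked distributor goroutine still alive, which is exactly the shape of the goroutine dump above.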

Summary

1. Before troubleshooting, you need a clear notion of the baseline. What counts as "normal"?

From previous experience: with roughly 100 Nodes, 1400 Pods, 50 ConfigMaps, and 300 Events, kube-apiserver consumes about 2 Gi of memory and around 10% of a single CPU core.

2. The troubleshooting in this incident went roughly like this:

  • Sensing that something was wrong from a few abnormal symptoms

  • Roughly identifying the faulty component and pulling its status through the management tooling

  • Finding the time window of the anomaly in the monitoring systems and extracting useful signals such as CPU, RAM, and disk consumption

  • Making hypotheses about what triggered the failure

  • Digging into the logs and profiles of the relevant components to confirm or reject those hypotheses

3. How to prevent a chain collapse of the control plane

  • Set explicit CPU and memory limits for kube-apiserver, so that in a co-located deployment a kube-apiserver that balloons in memory cannot starve etcd and break it

  • Deploy the etcd cluster separately from the other control-plane components

Original article: %E6%8A%93%E8%99%AB%E6%97%A5%E8%AE%B0%20-%20kube-apiserver.md

