Version:
Kubernetes v1.21+ with Calico versions earlier than v3.23.3
Issue:
Calico CNI pods can become unhealthy after 90 days because of an expired token. Because this issue breaks the Kubernetes cluster's networking, the symptoms within Domino vary widely: for example, executions may not start, or already running workspaces may not sync. If you are seeing a varied array of failures in your Domino deployment, check your Calico pods.
❯ kubectl get all -A | grep calico
kube-system pod/calico-node-7tzts 0/1 Running 0 5h8m
kube-system pod/calico-node-94dgj 0/1 Running 0 89d
kube-system pod/calico-node-9n7kx 0/1 Running 0 24h
kube-system pod/calico-node-9rtgg 0/1 Running 0 4d7h
kube-system pod/calico-node-c7d8t 0/1 Running 0 6d13h
kube-system pod/calico-node-cp8sw 0/1 Running 0 89d
kube-system pod/calico-node-dg4mv 0/1 Running 0 4h27m
kube-system pod/calico-node-hq7kp 0/1 Running 0 5h40m
kube-system pod/calico-node-kkr2t 0/1 Running 0 7h11m
kube-system pod/calico-node-knd5m 0/1 Running 0 89d
kube-system pod/calico-node-lhljj 0/1 Running 0 89d
kube-system pod/calico-node-mwtlr 0/1 Running 0 21h
kube-system pod/calico-node-pvdv5 0/1 Running 0 80m
kube-system pod/calico-node-v6h2k 0/1 Running 0 89d
kube-system pod/calico-node-vhqt9 0/1 Running 0 6h56m
kube-system pod/calico-node-wnftp 0/1 Running 0 7h11m
kube-system pod/calico-node-zgcgn 0/1 Running 0 3h26m
kube-system pod/calico-typha-64db44bbb5-btgml 0/1 Running 0 89d
kube-system pod/calico-typha-horizontal-autoscaler-7fb8b5894b-kvcdq 1/1 Running 0 89d
If your Calico pods are running but not ready (0/1), as above, you may have hit this issue. Note the 89-day age of some of these pods. The exact failure state may depend on your Calico version.
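The check above can be narrowed to only the Calico pods that are not fully ready. The following is a minimal sketch rather than part of the original procedure: `calico_not_ready` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: print Calico pods whose READY column shows fewer
# ready containers than expected (for example 0/1).
# Reads pod listing lines on stdin and prints: NAMESPACE NAME READY AGE
calico_not_ready() {
  awk '$2 ~ /calico/ {
    split($3, r, "/")                 # READY column, e.g. "0/1"
    if (r[1] + 0 < r[2] + 0) print $1, $2, $3, $NF
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | calico_not_ready
```

An empty result means every Calico pod in the listing reports all of its containers ready.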
Root Cause:
As of Kubernetes 1.21, the default service account token is a bound service account token. See the AWS documentation for additional detail on this change:
https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html
Versions of Calico earlier than v3.23.3 do not support this change. The token expires every 90 days, causing a cluster-wide breakdown in container networking when it does.
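To decide whether a cluster falls in the affected range, the running Calico version can be compared against v3.23.3. The sketch below is illustrative only: `calico_is_affected` is a hypothetical helper name, it relies on `sort -V` (available in GNU coreutils), and the usage comment assumes the image tag of the first container in the `calico-node` DaemonSet reflects the Calico version, which may not hold for every install.

```shell
# Hypothetical helper: succeeds (exit 0) when the given Calico version
# is earlier than 3.23.3, the first version named above as unaffected.
calico_is_affected() {
  v=${1#v}            # accept "v3.21.4" or "3.21.4"
  fixed=3.23.3
  [ "$v" != "$fixed" ] &&
    [ "$(printf '%s\n%s\n' "$v" "$fixed" | sort -V | head -n1)" = "$v" ]
}

# Usage (requires cluster access; the image tag layout is an assumption):
#   image=$(kubectl -n kube-system get ds calico-node \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   calico_is_affected "${image##*:}" && echo "affected: restart every 90 days or upgrade"
```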
Resolution:
If you are running an older version of Calico on a Kubernetes 1.21+ cluster, you will need to restart Calico and related pods, such as Typha, every 90 days until you upgrade Calico to v3.23.3 or later.
kubectl -n kube-system delete pod/calico-typha-nnnnxxxxx
kubectl -n kube-system rollout restart ds calico-node
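The restart can also be wrapped so that it waits for the new pods to become ready. This is a sketch under stated assumptions, not the documented procedure: `restart_calico` is a hypothetical helper name, it assumes Typha runs as a Deployment named `calico-typha` (inferred from the pod name in the listing above) and restarts it with `rollout restart` instead of deleting the pod, and the timeouts are arbitrary.

```shell
# Hypothetical helper: restart Typha and calico-node in the same order as
# the commands above, waiting for each rollout to finish.
# The workload names and the timeouts are assumptions; adjust for your cluster.
restart_calico() {
  ns=${1:-kube-system}
  kubectl -n "$ns" rollout restart deployment calico-typha &&
  kubectl -n "$ns" rollout status deployment calico-typha --timeout=5m &&
  kubectl -n "$ns" rollout restart ds calico-node &&
  kubectl -n "$ns" rollout status ds calico-node --timeout=10m
}

# Usage (requires cluster access):
#   restart_calico              # defaults to the kube-system namespace
#   restart_calico my-namespace
```

The function stops at the first failing step, so a non-zero exit status indicates that a rollout did not complete within its timeout or that a command was rejected.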
Notes/Information:
https://github.com/projectcalico/calico/tree/master/typha#readme
https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html