Submitted originally by: katie.shakman
In the execution's Support Bundle, look for autoscaling failures like the following in execution.log or events.json:
execution.log:
2023-06-26T15:06:39.204Z: pod didn't trigger scale-up: 2 node(s) had volume node affinity conflict, 26 node(s) didn't match Pod's node affinity/selector
Please check if the hardware tier 'p3-Large-GPU' has a valid nodePool
events.json:
"resourceKind" : "Pod",
"reason" : "NotTriggerScaleUp",
"description" : "pod didn't trigger scale-up: 26 node(s) didn't match Pod's node affinity/selector, 2 node(s) had volume node affinity conflict",
Ensure that your cluster's yaml file has autoscaling enabled.
autoscaler:
  enabled: true
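To confirm the autoscaler is actually deployed and running in the cluster, a quick check (the namespace placeholder is your platform namespace):
kubectl get deployments -n <platform namespace> | grep -i autoscaler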
Describe the pod that is not being scheduled and inspect its "Events" section at the bottom of the output.
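A minimal sketch of the command (the pod name and namespace are placeholders):
kubectl describe pod <pod name> -n <namespace>
Example pod events that involve the Autoscaler: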
The Autoscaler is working and is intentionally not triggering a scale-up. This can happen when the pods request more memory/CPU than the nodes can provide.
Normal NotTriggerScaleUp 2m24s (x18 over 27m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 45 Insufficient memory, 3 max node group size reached, 45 Insufficient cpu
The Autoscaler is not being triggered when it should be (assuming your Domino deployment is not on-prem). This can result from a mismatch between the ASGs configured in AWS and what is set in Kubernetes.
Warning FailedScheduling <unknown> default-scheduler 0/12 nodes are available: 1 Insufficient memory, 1 node(s) didn't match node selector, 10 Insufficient cpu.
Auto Scaling groups (ASGs) need to be configured with one Availability Zone (AZ) per group. Additionally, your ASG AZs must cover your storage AZs. For example, if storage volumes may be in us-east-1a, us-east-1b, and us-east-1c, then ASGs must be configured for instances in those three zones as well. If they are not, the volume node affinity conflict may appear because the node (for example, an EC2 instance) and the volume are in different AZs.
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had volume node affinity conflict
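To confirm which AZ each node and volume is in, check the zone labels on the nodes and the node affinity on the persistent volume (a sketch; older clusters may carry the failure-domain.beta.kubernetes.io/zone label instead, and the PV name is a placeholder):
kubectl get nodes -L topology.kubernetes.io/zone
kubectl describe pv <pv name> | grep -A 5 "Node Affinity"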
Make sure the autoscaler deployment's ASG settings match the ASG settings in AWS, and edit the deployment to resolve any differences:
kubectl get configmap cluster-autoscaler-status -n <install namespace> -o yaml
kubectl edit deployment -n <platform namespace> <cluster autoscaler deployment name>
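If the autoscaler is configured with static node groups, the ASG bounds usually appear as --nodes=<min>:<max>:<ASG name> arguments on the container, so you can review them before editing (names are placeholders):
kubectl get deployment -n <platform namespace> <cluster autoscaler deployment name> -o yaml | grep -- "--nodes"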
Check that each ASG has only one Availability Zone. This should be verified both in the AWS console and in the output of the configmap command shown above.
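If you have the AWS CLI available, one way to list each ASG alongside its Availability Zones (a sketch, assuming your default profile and region point at the cluster's account):
aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[*].[AutoScalingGroupName,AvailabilityZones]" --output table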
If the Autoscaler is intentionally not triggering a scale-up, try reducing the memory and CPU that the pods request from nodes. Use this guide to help reduce the resources that new pods request: https://admin.dominodatalab.com/en/4.4.1/compute/hardware-tiers.html?highlight=hardware%20tier
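To see exactly what a pending pod is requesting before adjusting the Hardware Tier (pod name and namespace are placeholders):
kubectl get pod <pod name> -n <namespace> -o jsonpath="{.spec.containers[*].resources.requests}"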
*If the autoscaler deployment in Kubernetes is edited, add the same changes to the autoscaler section of the cluster yaml file (deploy manifest) so that they persist through an upgrade or maintenance.
Other less common messages and their meanings:
2 node(s) didn't match pod affinity rules, 2 node(s) didn't match pod affinity/anti-affinity rules.
This often means the pod's labels or affinity rules aren't satisfied by anything specific to those two nodes. One example is the data-importer pod needing to run alongside the 'git' pod so it can use the same volume, but neither of the two available nodes meets this condition. Describe the pod to check:
affinity:
  nodeAffinity:
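One way to pull just the affinity section out of the pod spec (pod name and namespace are placeholders):
kubectl get pod <pod name> -n <namespace> -o jsonpath="{.spec.affinity}"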
11 node(s) didn't match Pod's node affinity
or 20 node(s) didn't match Pod's node affinity/selector
or 4 node(s) didn't match node selector
These can occur when nodes exist but don't have a label matching the Node-Selector set on your pod via the Hardware Tier. For example, if your Hardware Tier sets the label
"Node-Selectors: dominodatalab.com/node-pool=xlarge-memory" but 11 nodes lack this label because they're labeled "default" or some other node pool instead, then you'd see this message.
Again, check the pod's affinity as above, but also mind the Node-Selector and other labels.
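To see which node pool each node actually belongs to, list the nodes with the node-pool label shown as a column (the label key is the one from the example above):
kubectl get nodes -L dominodatalab.com/node-pool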
3 max node group size reached
This occurs when the Auto Scaling group in AWS has reached its designated maximum node count (maxSize), so the autoscaler cannot spin up any new nodes.
You can quickly review your min, max, and currently running nodes by describing the cluster-autoscaler status configmap:
kubectl describe cm -n <platform namespace, or possibly kube-system> cluster-autoscaler-status
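On the AWS side, you can review (and, if appropriate, raise) an ASG's size limits with the AWS CLI (the ASG name and new maximum are placeholders; remember to mirror any change in the autoscaler deployment and the cluster yaml file as noted above):
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <asg name> --query "AutoScalingGroups[0].[MinSize,MaxSize,DesiredCapacity]"
aws autoscaling update-auto-scaling-group --auto-scaling-group-name <asg name> --max-size <new max>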