Version:
Domino 4.x & 5.x
Issue:
With default settings, the prometheus-adapter pod may experience an Out Of Memory (OOM) condition and fail. The prometheus-adapter collects metrics on running DMM jobs and horizontally scales the DMM Spark worker instances to accommodate the load. If you do not use DMM, failure of this pod will not impact your Domino deployment.
If you find that the pod has failed, check the pod's describe output to determine the reason for the failure.
# kubectl get po -n domino-platform | grep prometheus-adapter
prometheus-adapter-7d4db6cf6d-x4xrq 0/1 CrashLoopBackOff 1 6h49m
Then describe the pod:
# kubectl describe po -n domino-platform prometheus-adapter-7d4db6cf6d-x4xrq
and check the container's Last State in the output. A pod that exceeded its memory limit will show the termination reason OOMKilled.
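If you prefer not to scan the full describe output, the termination reason can usually be read directly with a JSONPath query. This is a convenience sketch, assuming the pod has a single container (index 0); substitute your own pod name:
# kubectl get po -n domino-platform prometheus-adapter-7d4db6cf6d-x4xrq -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
If the pod was killed for exceeding its memory limit, this prints OOMKilled.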
Root Cause:
Initial deployments of this pod used a memory limit of 256Mi. In practice we have found that the pod can burst to 3Gi of memory or higher.
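You can check the limit currently configured on your deployment with a query such as the following (again assuming a single container at index 0):
# kubectl get deploy -n domino-platform prometheus-adapter -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
If this prints 256Mi, the deployment is still running with the original default.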
Resolution:
If you find that the pod has been OOMKilled, or even if you simply find the memory limit still set to 256Mi as described above, you should increase the memory limit to 3Gi.
To increase the memory limit on the pod, edit the prometheus-adapter deployment:
# kubectl edit deploy -n domino-platform prometheus-adapter
Find the resources section of the container spec:
resources:
  limits:
    cpu: 100m
    memory: 256Mi
and change the memory limit to 3Gi:
resources:
  limits:
    cpu: 100m
    memory: 3Gi
Once you save this change, the prometheus-adapter pod will restart with the larger limit. Note that 3Gi is a value we have found to work for many deployments, but if you continue to see OOMs on the pod you may need to go higher.
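As an alternative to editing the deployment interactively, the same change can be applied with kubectl set resources. This is a sketch that assumes the deployment has a single container; the cpu value simply restates the existing 100m limit shown above:
# kubectl set resources deployment prometheus-adapter -n domino-platform --limits=cpu=100m,memory=3Gi
You can then watch the pod roll over with the new limit:
# kubectl rollout status deploy/prometheus-adapter -n domino-platform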
Notes/Information:
Engineering is working to better understand the memory usage in this pod and optimize it. See internal ticket DOM-38099.