Version/Environment (if relevant):
Versions 5.2.1 and above using Hephaestus (Image Builder v3)
Issue:
Compute Environment build logs and Model build logs may go missing some time after a build. The logs appear initially, but after roughly one to a few hours they no longer appear in the UI. The symptom can be intermittent or can affect seemingly every build. Workspace and run logs are not affected.
Root Cause:
To confirm that this KB applies to your deployment, first verify that the instance metadata option HttpPutResponseHopLimit on your nodes has a value of 1 instead of 2:
aws ec2 describe-instances --region <your region> | grep Hop
"HttpPutResponseHopLimit": 1,
"HttpPutResponseHopLimit": 1,
.....
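The grep output above does not show which node each value belongs to. As a minimal sketch, assuming the AWS CLI is configured with permission to call describe-instances, the helper below prints each instance ID next to its hop limit. The name list_hop_limits is hypothetical and is not part of Domino or the AWS CLI.

```shell
# Sketch only: list_hop_limits is a hypothetical helper name.
# Prints one line per instance: "<instance-id><TAB><hop-limit>".
list_hop_limits() {
  local region="$1"   # assumption: you pass your own region, e.g. us-west-2
  aws ec2 describe-instances \
    --region "$region" \
    --query 'Reservations[].Instances[].[InstanceId,MetadataOptions.HttpPutResponseHopLimit]' \
    --output text
}

# Usage (the region is an example value):
#   list_hop_limits us-west-2
```

Any instance listed with a value of 1 is a candidate for the change described under Resolution.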
When a node's instance metadata option HttpPutResponseHopLimit is 1 instead of 2, the AWS SDK client running in Hephaestus-manager's Vector times out while trying to retrieve instance metadata. With a hop limit of 1, the metadata service's response cannot cross the extra network hop between the node and a container running on it. Because of this timeout, Vector does not push the logs to S3 for permanent storage. The timeout is partly described at https://github.com/aws/aws-sdk-go/issues/2972
Resolution:
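To resolve this, modify the instance metadata options for all live nodes so that HttpPutResponseHopLimit is 2 instead of 1. Make the same change proactively in the Auto Scaling group configuration so that newly launched nodes also get a value of 2. After the change, the MetadataOptions for each node should look like this: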
"MetadataOptions": {
    "State": "applied",
    "HttpTokens": "required",
    "HttpPutResponseHopLimit": 2,
    "HttpEndpoint": "enabled",
    "HttpProtocolIpv6": "disabled",
    "InstanceMetadataTags": "disabled"
},
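One possible way to apply the change is sketched below; it is not an official Domino procedure. The function names raise_hop_limit and new_launch_template_version are hypothetical. The sketch assumes the AWS CLI has permission to describe and modify instances, and that your Auto Scaling group launches nodes from an EC2 launch template. If it uses a launch configuration or a managed node group instead, the second function does not apply.

```shell
# Sketch only: both function names are hypothetical helpers.

# Raise the hop limit to 2 on every running instance that currently reports 1.
raise_hop_limit() {
  local region="$1"   # assumption: your own region
  aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,MetadataOptions.HttpPutResponseHopLimit]' \
    --output text |
  awk '$2 == 1 { print $1 }' |
  while read -r instance_id; do
    [ -n "$instance_id" ] || continue
    aws ec2 modify-instance-metadata-options \
      --region "$region" \
      --instance-id "$instance_id" \
      --http-put-response-hop-limit 2 \
      --http-endpoint enabled
  done
}

# Create a new launch template version, based on an existing version,
# that sets the hop limit to 2 for nodes launched in the future.
new_launch_template_version() {
  local region="$1"          # assumption: your own region
  local template_id="$2"     # assumption: ID of the template your ASG uses
  local source_version="$3"  # assumption: version number to copy settings from
  aws ec2 create-launch-template-version \
    --region "$region" \
    --launch-template-id "$template_id" \
    --source-version "$source_version" \
    --launch-template-data '{"MetadataOptions":{"HttpPutResponseHopLimit":2}}'
}

# Usage (all argument values are examples):
#   raise_hop_limit us-west-2
#   new_launch_template_version us-west-2 <launch-template-id> <version>
```

Creating a new launch template version does not by itself change which version the Auto Scaling group uses, so confirm that the group points at the new version before relying on it for new nodes.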
Notes/Information:
If your HttpPutResponseHopLimit is already 2 and build logs still go missing, contact Domino Technical Support for help.