Version/Environment (if relevant):
Versions 5.2.1 and above using Hephaestus (Image Builder v3)
Issue:
Compute Environment build logs and Model build logs may go missing after a period of time. The logs appear initially, but after one to a few hours they no longer appear in the UI. The symptom may affect only some builds or seemingly all of them. Workspace and run logs are not affected.
To confirm, check the vector container logs of the hephaestus-manager pod for credential errors like the ones shown below:
kubectl logs <hephaestus-manager-nnnnnn> -c vector
2023-01-19T15:38:54.983401304Z 2023-01-19T15:38:54.983271Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3 component_name=s3}:request{request_id=29}: vector::sinks::util::retries: Non-retriable error; dropping the request. error=failed to construct request: No credentials in the property bag
2023-01-19T15:38:54.983450118Z 2023-01-19T15:38:54.983321Z ERROR sink{component_kind="sink" component_id=s3 component_type=aws_s3 component_name=s3}:request{request_id=29}: vector_core::stream::driver: Service call failed. error=ConstructionFailure(MissingCredentials) request_id=29
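On a busy pod these errors can be buried in other output. As a minimal sketch, the helper below counts log lines that contain either of the two error strings from the sample above. The function name count_s3_credential_errors is hypothetical, and the match patterns are assumed from the sample lines only, so other Vector versions may word the error differently.

```shell
# Hypothetical helper: count Vector S3-sink credential failures in log text
# read from stdin. Patterns are taken from the sample error lines above.
count_s3_credential_errors() {
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the helper usable under "set -e".
  grep -c -E 'MissingCredentials|No credentials in the property bag' || true
}

# Usage (the pod name is a placeholder, as in the command above):
#   kubectl logs <hephaestus-manager-nnnnnn> -c vector | count_s3_credential_errors
```

A count above zero suggests the symptom described here; a count of zero does not rule it out if the pod restarted and older log lines were lost.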
Root Cause:
To confirm that this article applies to your deployment, first verify that HttpPutResponseHopLimit in the instance metadata options of your nodes is set to 1 rather than 2:
aws ec2 describe-instances --region <your region> | grep Hop
"HttpPutResponseHopLimit": 1,
"HttpPutResponseHopLimit": 1,
.....
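The grep output above shows the hop limit values but not which instances they belong to. As a rough sketch, the function below uses the AWS CLI --query option to pair each running instance ID with its hop limit and prints only the IDs whose value is 1. The function name list_hop_limit_1_instances is hypothetical, and the sketch assumes every running instance in the region is of interest; it does not filter to the nodes of a particular cluster.

```shell
# Hypothetical helper: print the IDs of running instances whose
# HttpPutResponseHopLimit is 1. Argument: AWS region.
# Assumes the AWS CLI is installed and allowed to call ec2:DescribeInstances.
list_hop_limit_1_instances() {
  aws ec2 describe-instances --region "$1" \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[].Instances[].[InstanceId,MetadataOptions.HttpPutResponseHopLimit]' \
    --output text |
    # Text output has one "<instance-id> <hop-limit>" row per instance.
    awk '$2 == 1 { print $1 }'
}

# Usage:
#   list_hop_limit_1_instances <your region>
```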
An HttpPutResponseHopLimit of 1 instead of 2 causes the AWS SDK client running in the hephaestus-manager Vector container to time out while trying to retrieve instance metadata. With a hop limit of 1, the metadata service's PUT response cannot cross the extra network hop between the container and the node. Because of the timeout, Vector does not obtain credentials and does not push the logs to S3 for permanent storage. The timeout is partly described at https://github.com/aws/aws-sdk-go/issues/2972
Resolution:
To resolve this issue, set HttpPutResponseHopLimit to 2 instead of 1 in the instance metadata options of all running nodes. Make the same change in the Auto Scaling group launch template so that newly launched nodes also get the new value. After the change, the metadata options of a node should look similar to this:
"MetadataOptions": {
"State": "applied",
"HttpTokens": "required",
"HttpPutResponseHopLimit": 2,
"HttpEndpoint": "enabled",
"HttpProtocolIpv6": "disabled",
"InstanceMetadataTags": "disabled"
},
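For running nodes, the change can be made with the AWS CLI command modify-instance-metadata-options. The sketch below loops over a list of instance IDs and sets the hop limit to 2 on each. The function name set_hop_limit_2 is hypothetical, and the instance IDs are assumed to be supplied by you. It does not touch the launch template, which still needs to be updated separately.

```shell
# Hypothetical helper: set HttpPutResponseHopLimit to 2 on running instances.
# Arguments: AWS region, then one or more instance IDs.
# Assumes the AWS CLI is installed and allowed to call
# ec2:ModifyInstanceMetadataOptions.
set_hop_limit_2() {
  region="$1"
  shift
  for id in "$@"; do
    aws ec2 modify-instance-metadata-options --region "$region" \
      --instance-id "$id" \
      --http-put-response-hop-limit 2 \
      --http-endpoint enabled
  done
}

# Usage (instance IDs are placeholders):
#   set_hop_limit_2 <your region> <instance-id-1> <instance-id-2>
```

Each call should return the instance's MetadataOptions. The State field may read "pending" for a short time before it changes to "applied".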
Notes/Information:
If your HttpPutResponseHopLimit is already 2, contact Domino Technical Support for help.
Reference for changing the hop limit on a running node: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html#modify-PUT-response-hop-limit