Version/Environment:
This applies to versions 5.0.x-5.1.x of Domino.
Issue:
You see this error when new Model API builds fail:
Error during image build and push: timed out waiting for replicator to prepare resources
This error appears in the model’s build logs. To view and download the build logs from the UI, go to Model APIs > Versions > select the version > Logs.
To view the build logs via the CLI, pass the build pod's name (not the namespace) to kubectl logs:
kubectl logs -n domino-compute <your domino-build pod>
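If you do not know the build pod's name, you can list pods in the compute namespace and filter for build pods. This is a sketch that assumes the default domino-compute namespace and the domino-build- pod-name prefix shown in the example error below:
kubectl get po -n domino-compute | grep domino-build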
Example of full error:
Error during image build and push: timed out waiting for replicator to prepare resources "ABcdNVkHRP6v2o_TFtsgYz"
abc-orgname-cluster-compute-123456a7-8bcd
domino-build-12301267267b846769d2366f-8abd9
forge-build
Closing message producer
Killing preparer plugins
[DEBUG] plugins.forge-replicator-plugin:
[ERR] plugin: plugin server: accept unix /tmp/plugin385635260: use of closed network connection
[DEBUG] plugins: plugin process exited: path=/usr/local/share/forge/plugins/forge-replicator-plugin pid=17
[DEBUG] plugins: plugin exited
panic: failed to prepare /mnt/build/extracted: timed out waiting for replicator to prepare resources
Root Cause:
The replicator pod supports the building of Model API images. It clones Domino projects (Git repo + blobs) and linked Git projects (Git repo only) into a directory for a given builder job to copy into a Docker image.
This error occurs when the replicator pod hangs and stops preparing resources. As a result, users cannot create new models. Additional symptoms include very high CPU usage in the pod and a broken event-log directory to which nothing has been written since the errors started.
This issue does not affect the publishing of existing models.
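To confirm the broken event-log symptom before repairing it, you can check the modification times inside the replicator pod. This is a hedged check that assumes the storage path used in the Resolution below:
kubectl exec -n domino-compute <your replicator pod> -- ls -lrt /domino/shared/replicatorStorage/event-log
If the newest timestamps predate the first build failures, the directory has stopped receiving writes.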
Resolution:
This issue is resolved by backing up and recreating three directories in the replicator pod, then restarting the pod.
Part 1: In the replicator pod, rename the prepared, cached, and event-log directories in /domino/shared/replicatorStorage/ to back them up, then create fresh replacements.
1.1. Grep for the replicator pod.
kubectl get po -n domino-compute | grep replic
1.2. Exec into the replicator pod to get a root shell.
kubectl exec -it -n domino-compute <your replicator pod> -- bash
1.3. View the CPU usage and load. The %CPU of the replicator process is usually very high, at over 100%. This is likely due to the replicator repeatedly trying to rebuild the files in the broken event-log directory.
top
1.4. Change to the replicatorStorage directory and list its contents. It should contain these three directories: prepared, cached, and event-log.
cd /domino/shared/replicatorStorage/
ls -lrt
1.5. Rename each directory to back it up.
mv prepared prepared.old
mv cached cached.old
mv event-log event-log.old
1.6. Create the new directories.
mkdir {cached,prepared,event-log}
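Before exiting, you can verify the result of steps 1.5 and 1.6; assuming the renames and mkdir above succeeded, a listing should now show six entries:
ls -lrt
The output should include the three backups (prepared.old, cached.old, event-log.old) alongside the three new, empty directories (prepared, cached, event-log).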
1.7. Exit the root shell.
Part 2: Restart the replicator pod
2.1. Delete the replicator pod. Kubernetes will automatically recreate it, which restarts the replicator.
kubectl delete po -n domino-compute <your replicator pod>
2.2. Check that the new pod has been created and that it’s running.
kubectl get po -n domino-compute | grep replic
2.3. Describe the new replicator pod and confirm that it started successfully: the pod should be in the Running state, and the Events section should show no errors.
kubectl describe po -n domino-compute <your replicator pod>
2.4. Exec into the new replicator pod to get a root shell.
kubectl exec -it -n domino-compute <your replicator pod> -- bash
2.5. Change to the replicatorStorage directory and list its contents. Confirm that the new prepared, cached, and event-log directories were created.
cd /domino/shared/replicatorStorage/
ls -lrt
2.6. View the CPU resource usage and load. Confirm that the %CPU is now much lower.
top
2.7. Exit the root shell.
2.8. In the UI, test publishing a new Model API: Publish > Models > New Model.
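You can also follow the new build from the CLI. This assumes build pod names keep the domino-build- prefix seen in the example error above:
kubectl get po -n domino-compute | grep domino-build
kubectl logs -f -n domino-compute <your domino-build pod>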
If you have questions or run into any issues, please contact Domino Technical Support.
Notes/Information:
Since the replicator pod is being phased out in newer versions of Domino, no further work is planned for this issue on versions 5.0.x-5.1.x of Domino.
Since versions higher than 5.2 do not use the replicator pod, this issue will not occur in those versions.