Version/Environment (if relevant):
This applies to Domino versions 5.7 and earlier.
Issue:
The backup system involves a CronJob object which spawns pods named something like domino-workbench-backup-nnnnn. These pods can fail with an Error status, and the pod logs contain output stating "file changed as we read it".
Log example snippet:
RuntimeError: Got return code '1' for command: tar -zcvvf /opt/scratch/migration-sessions/20230710-000245.tar.gz 20230710-000245
Output:
drwxrwxrwx root/root 0 2023-07-10 00:02 20230710-000245/
-rwxrwxrwx root/root 1068 2023-07-10 00:05 20230710-000245/config.yaml
-rwxrwxrwx root/root 33609587 2023-07-10 00:05 20230710-000245/git.tar.gz
tar: 20230710-000245/git.tar.gz: file changed as we read it
-rwxrwxrwx root/root 218220 2023-07-10 00:05 20230710-000245/keycloak-postgres_archive_local-backup.gz
-rwxrwxrwx root/root 375276084 2023-07-10 00:02 20230710-000245/mongo_archive_local-backup.gz
-rwxrwxrwx root/root 731 2023-07-10 00:05 20230710-000245/vault-k8s-secrets_local-backup.yaml
-rwxrwxrwx root/root 1403893 2023-07-10 00:05 20230710-000245/vault-postgres_archive_local-backup.gz
Root Cause:
The pod runs a script that gathers several files using a series of tar commands. For reasons that are not fully understood, an earlier tar process completes, but a subsequent tar command that reads one of the resulting .gz files fails because the file system still perceives the .gz as being in use by the prior tar process.
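For illustration, the pattern looks roughly like the sketch below (a simplified sketch, not the actual backup script; the paths are taken from the log above, and <repo data> is a placeholder for whatever the inner tar packs):
cd /opt/scratch/migration-sessions
tar -zcf 20230710-000245/git.tar.gz <repo data>       # earlier tar writes git.tar.gz
tar -zcvvf 20230710-000245.tar.gz 20230710-000245     # fails with "file changed as we read it", exit code 1
sleep 30                                              # the workaround applied below: pause so the .gz is no longer seen as in use
tar -zcvvf 20230710-000245.tar.gz 20230710-000245     # succeeds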
Resolution:
Adding a pause between the tar commands resolves the issue.
Edit the ConfigMap that holds the backup job script to add a time.sleep():
kubectl -n domino-platform edit cm domino-data-importer-backup-job-script
and change the "data:" section as shown in the BEFORE/AFTER below:
BEFORE:
data:
  backup-job.sh: |
    timeout 1m bash -c "until curl -sSf http://nucleus-frontend/health; do echo Waiting for network... && sleep 2; done"
    importer -c /app/config-4x-example.yaml -b --backup-archive --backup-upload --backup-delete
kind: ConfigMap
metadata:
  [...]
AFTER:
apiVersion: v1
data:
  backup-job.sh: |
    timeout 1m bash -c "until curl -sSf http://nucleus-frontend/health; do echo Waiting for network... && sleep 2; done"
    sed -i '/def bundle_backup/a \ import time\n time.sleep(30)' /app/importer/importer.py
    importer -c /app/config-4x-example.yaml -b --backup-archive --backup-upload --backup-delete
kind: ConfigMap
metadata:
  [...]
To test the validity of your edit, scale up (or scale down and then back up) the domino-data-importer StatefulSet:
kubectl -n domino-platform scale sts domino-data-importer --replicas=1
then open a shell in the pod:
kubectl -n domino-platform exec -it domino-data-importer-0 -- /bin/bash
then, with an editor, make sure /app/importer/importer.py looks like:
def bundle_backup(self):
    import time
    time.sleep(30)
    new_cfg_path = path_join(self.session_path, "config.yaml")
    [...]
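As a quicker check (a one-liner sketch; it assumes grep is available in the importer image), you can confirm the injected lines without opening an editor:
kubectl -n domino-platform exec domino-data-importer-0 -- grep -A 2 'def bundle_backup' /app/importer/importer.py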
To make this change persist across a re-install, edit your agent.yaml to include:
release_overrides:
  domino-data-importer:
    chart_values:
      backupJobScript: |
        timeout 1m bash -c "until curl -sSf http://nucleus-frontend/health; do echo Waiting for network... && sleep 2; done"
        sed -i '/def bundle_backup/a \ import time\n time.sleep(30)' /app/importer/importer.py
        importer -c /app/config-4x-example.yaml -b --backup-archive --backup-upload --backup-delete
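Once the change is in place, one way to verify it end to end is to trigger a one-off run of the backup CronJob and follow the resulting pod's logs; note that the CronJob name below is an assumption based on the pod-name prefix (domino-workbench-backup) and may differ in your deployment:
kubectl -n domino-platform create job --from=cronjob/domino-workbench-backup backup-manual-test   # CronJob name assumed from the pod-name prefix
kubectl -n domino-platform logs -f job/backup-manual-test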
Notes/Information:
Reference: https://docs.dominodatalab.com/en/latest/admin_guide/72ff13/backup-and-restore/
References/Internal Records: