Testing for this article was done against a Domino 5.2.1 deployment with the same version of the CLI client. That said, the notes here should probably still hold in Domino 4.6, and possibly in older and newer versions, as this code has not changed in some time.
The CLI command to upload data to a dataset from your local system is fairly robust and can be used to upload 100GB+ filesets.
domino upload-dataset [options] <dataset> <directory> [token]
There are, however, practical limits to the size of dataset you will be able to upload. This article is an effort to do some performance testing with these large datasets and give some recommendations on the two tunable parameters for the command, DOMINO_UPLOAD_THREADS and DOMINO_UPLOAD_CHUNK_BYTES.
Let's start with the usage output for the command:
% domino help upload-dataset
Usage: domino upload-dataset [options] <dataset> <directory> [token]
Upload a directory to a Dataset. The contents of the directory will be a new
Dataset Read Write Snapshot if --fileUploadSetting is not defined. When
uploading begins, a token will be provided for the upload session. In the event
that the upload session is interrupted, the token can be used to resume the
upload.
<dataset>: Dataset in the format: <project-owner>/<project-name>/<dataset-name>
<directory>: Local directory file path
[token]: Optional token for resuming previous upload session
Environment Variables:
- Set DOMINO_UPLOAD_THREADS to adjust the number of concurrent uploads (default 8)
- Set DOMINO_UPLOAD_CHUNK_BYTES to adjust the chunk size (default 3145728)
General Use Case
The general use case is to upload the contents of a folder. As an example, let's use the information below to do an upload:
- Local folder on my laptop: /Users/user/data
- My user ID: user_id
- Project to upload to: project_name
- Dataset to upload to: upload_dataset
So to do the upload, use the following syntax:
% domino upload-dataset user_id/project_name/upload_dataset /Users/user/data
Resuming your upload
If your upload fails at some point, you will get an error similar to the following:
| Error! |
Snapshot upload failed. You can retry this upload with the following command:
domino upload-dataset user_id/project_name/upload_dataset /Users/user/data/ aed75b88-cb1c-4ce3-a06b-05b675c296cf
This error includes a token for you to resume the upload. Save this token; if it scrolls off the screen, you will need to search the domino.log file for the token used for your upload.
To resume your upload, use the following syntax:
% domino upload-dataset user_id/project_name/upload_dataset /Users/user/data aed75b88-cb1c-4ce3-a06b-05b675c296cf
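If the token has scrolled off screen, it can usually be recovered from the log by matching the UUID shape of the tokens shown above. This is a sketch that assumes the log is a file named domino.log in your working directory; the printf line just creates a stand-in log entry for illustration:

```shell
# Stand-in for a real domino.log entry containing the resume token:
printf 'Snapshot upload failed ... aed75b88-cb1c-4ce3-a06b-05b675c296cf\n' > domino.log

# Extract the last UUID-shaped string from the log:
TOKEN=$(grep -Eo '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' domino.log | tail -n 1)
echo "$TOKEN"
```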
Managing Path Collision
Use the --fileUploadSetting option to handle path collisions as follows:
- overwrite: If a file already exists in the Dataset, the new file overwrites the existing file.
- rename: If a file already exists in the Dataset, the new file is uploaded and renamed with _1 appended to the filename.
- ignore: If a file already exists in the Dataset, the new file is ignored.
To use this option, use the following syntax:
% domino upload-dataset --fileUploadSetting ignore user_id/project_name/upload_dataset /Users/user/data
Note that the documentation shows improper usage of this option and is being corrected (internal docs ticket: DOCS-1883).
There are two environment variables that can be used to tune the upload speed to a degree:
- DOMINO_UPLOAD_THREADS to adjust the number of concurrent uploads (default 8)
- DOMINO_UPLOAD_CHUNK_BYTES to adjust the chunk size (default 3145728 bytes, i.e. 3 MiB)
These are environment variables, so set them wherever you run the CLI upload command, using whatever method your OS provides for setting environment variables.
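In a bash or zsh shell that looks like the following. This is a minimal sketch; the 16-thread value is just an example, not a recommendation for every setup:

```shell
# Set the two tuning variables for the current shell session.
export DOMINO_UPLOAD_THREADS=16          # default is 8
export DOMINO_UPLOAD_CHUNK_BYTES=3145728 # default; 3145728 bytes = 3 MiB

# Confirm what the CLI will inherit:
echo "$DOMINO_UPLOAD_THREADS $DOMINO_UPLOAD_CHUNK_BYTES"

# Then run the upload as usual, e.g.:
# domino upload-dataset user_id/project_name/upload_dataset /Users/user/data
```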
The goal of the performance testing here was to give users some idea of how long an upload might take, since uploads are slower than many expect given the effort made to keep them robust, and that can give rise to concerns that something is broken.
After quite a bit of testing I found no silver-bullet settings to reduce the time that multi-gigabyte uploads take. I'll give the recommendations first, then describe my setup and results for those interested in the gory details.
- If you are an average user who uploads the occasional dataset, just use the default thread and chunk settings. You might save some upload time by finding the best settings for your configuration, but you'll spend hours of testing to find those efficiencies.
- If you upload on a frequent basis, you can save 5-15% on your upload times by tuning the chunk and thread parameters.
- You may also save considerably more time on better hardware and internal corporate networks, so upload from the office network to the office network if you can.
- Uploading from home will likely introduce throttling from your ISP. I never succeeded in uploading any dataset larger than 12GB on my home network. Be aware you may need to limit the size of your uploads: upload your dataset in smaller parts, then recombine them into a single dataset within your Domino project.
- Bigger chunk sizes via DOMINO_UPLOAD_CHUNK_BYTES were always slower for me. Test in your setup, but this was the least effective way to find an efficiency; the default of 3145728 bytes was best.
- More threads wasn't always better either. Test if you want to find the optimum, but my best savings was 4% with 16 threads versus the default of 8. So if 100GB takes 1000 minutes to upload, a 4% gain saves you about 40 minutes.
- Just an FYI: 1GB took about 10 minutes to upload across an average home wifi connection. In my experience this was fairly linear, so for 10GB expect 100 minutes or so.
- File size/count had some impact: fewer, larger files uploaded faster. Probably not surprising, but there was nearly a 14% improvement between 1GB as 1000 x 1MB files and 1GB as 100 x 10MB files.
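The roughly linear ~10 minutes per GB above can be turned into a quick back-of-the-envelope estimate. This sketch hard-codes the rate from my home-wifi tests; your rate will differ, so measure a small upload first and substitute your own number:

```shell
# Rough upload-time estimate, assuming the ~10 min/GB linear rate observed
# on my home wifi (an assumption from this article's tests, not a constant).
SIZE_GB=10
MIN_PER_GB=10
EST_MIN=$((SIZE_GB * MIN_PER_GB))
echo "~${EST_MIN} minutes to upload ${SIZE_GB}GB"
```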
Performance Testing Setup
- Data resided on an M1 Mac running Monterey 12.6.1
- Data uploaded to Domino v5.2.1 on AWS
- CLI client v5.2.1
- Home wifi with average 25 Mbps upload speed
- 1GB 1000 files @ 1MB
- 1GB 200 files @ 5.1MB
- 1GB 100 files @ 10MB
- 5GB 100 files @ 10MB
- 10GB 1000 files @ 10MB
- DOMINO_UPLOAD_THREADS: Tested this setting at 8/16/24/32 threads
- DOMINO_UPLOAD_CHUNK_BYTES: Tested this setting at 3145728/6291456/12582912/25165824/50331648 bytes
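Filesets like the ones above can be reproduced with dd and random data. This is a scaled-down sketch (5 x 1MB files; raise NUM_FILES to 1000 for the full 1GB/1000-file set, and adjust bs/count for the larger file sizes):

```shell
# Generate NUM_FILES random 1MB files in ./testdata (scaled-down example;
# the directory name is just a placeholder for this sketch).
NUM_FILES=5
mkdir -p ./testdata
i=1
while [ "$i" -le "$NUM_FILES" ]; do
  dd if=/dev/urandom of="./testdata/file_$i.bin" bs=1048576 count=1 2>/dev/null
  i=$((i + 1))
done
FILE_COUNT=$(ls ./testdata | wc -l)
echo "$FILE_COUNT files created"
```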
I tested 1GB of 1MB/5MB/10MB files with every combination of the thread and chunk settings above. The best thread setting for my Mac and network was 16, where 8 is the default. The best chunk size was always the default of 3145728 bytes.
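A sweep like this can be scripted as nested loops over the two variables. The sketch below only prints each command that would be timed (reusing the example project/dataset names and placeholder data directory from earlier), rather than actually running the hours-long uploads:

```shell
# Print one timed upload command per thread/chunk combination tested above.
COUNT=0
for THREADS in 8 16 24 32; do
  for CHUNK in 3145728 6291456 12582912 25165824 50331648; do
    echo "DOMINO_UPLOAD_THREADS=$THREADS DOMINO_UPLOAD_CHUNK_BYTES=$CHUNK time domino upload-dataset user_id/project_name/upload_dataset ./testdata"
    COUNT=$((COUNT + 1))
  done
done
echo "$COUNT combinations"
```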
The shortest time for a 1GB upload: 8 min 43 sec (DOMINO_UPLOAD_CHUNK_BYTES: 3145728 bytes)
Average upload time for 1GB: 10 min 33 sec
Average upload time for 5GB: 45 min 52 sec
Average upload time for 10GB: 1 hr 27 min 51 sec
Internal feature requests exist to improve this command:
DOM-42388 Better functionality for resuming dataset-upload from CLI
DOM-42424 Add progress meter to the CLI upload-dataset command