Moving datasets is half the job. Do it once, do it right. Here’s a clean, repeatable way to get large data into and out of your GPU computing service with integrity checks and resumable transfers.
What this covers
- Install and configure rclone on a CUDA‑ready template
- Copy to/from S3‑compatible storage and SSH/SFTP servers
- Make and verify SHA‑256 manifests
- Resume safely after disconnects
- Pick chunk sizes, parallelism, and compression that matter
Opinion: use rclone for cloud/object storage; use rsync only for LAN/SSH copies when both ends are POSIX and you need hard links/permissions.
1) Install rclone (once per template)
Inside your running container:
curl -fsSL https://rclone.org/install.sh | sudo bash
rclone version
Keep rclone in your custom template so you don’t repeat this.
2) Configure a remote (S3 or SSH)
Start the interactive config:
rclone config
Add a remote:
- S3 (AWS, MinIO, Wasabi, etc.): choose s3, then set provider, region, and access keys.
- SFTP/SSH: choose sftp, then set host, port, and key path.
Don’t bake secrets into images. Store access keys in the rclone config, or set env vars at runtime.
Env‑only (no interactive config) — S3 example
# note: the remote name inside the variable must be UPPERCASE; reference it as myremote: in commands
export RCLONE_CONFIG_MYREMOTE_TYPE=s3
export RCLONE_CONFIG_MYREMOTE_PROVIDER=AWS
export RCLONE_CONFIG_MYREMOTE_ACCESS_KEY_ID=XXXX
export RCLONE_CONFIG_MYREMOTE_SECRET_ACCESS_KEY=YYYY
# optional: custom endpoint
# export RCLONE_CONFIG_MYREMOTE_ENDPOINT=https://s3.my-org.example
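The same env-only pattern works for an SFTP remote; the host, user, and key path below are placeholders:
# env-only SFTP remote; reference it as sftpremote: in commands, as in the SSH example further down
export RCLONE_CONFIG_SFTPREMOTE_TYPE=sftp
export RCLONE_CONFIG_SFTPREMOTE_HOST=storage.example.org
export RCLONE_CONFIG_SFTPREMOTE_USER=ubuntu
export RCLONE_CONFIG_SFTPREMOTE_KEY_FILE=~/.ssh/id_ed25519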
3) Copy data in (and resume if it breaks)
To instance (S3 → NVMe)
# pull a dataset down to local /data
mkdir -p /data
rclone copy myremote:datasets/projectA /data \
--progress --transfers 16 --checkers 8 --fast-list \
--s3-chunk-size 64M --s3-upload-concurrency 6
From instance (NVMe → S3)
rclone copy /data/results myremote:results/projectA \
--progress --transfers 16 --checkers 8 --fast-list \
--s3-chunk-size 64M --s3-upload-concurrency 6
- Resumable: if a transfer is interrupted, re-run the same command; rclone skips files that already match on the destination and re-copies only what is missing or partial.
- Tuning: start with the settings above; raise --transfers gently until bandwidth or IOPS saturate. Large objects like .tar.zst prefer a larger --s3-chunk-size (128M+); see the example below.
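For instance, pushing one big archive might look like the following; the filename and values are only illustrative starting points:
# single large object: chunk size and upload concurrency matter more than --transfers
rclone copy /data/run123.tar.zst myremote:archives/ \
  --progress --s3-chunk-size 128M --s3-upload-concurrency 8
With a single file, --transfers has little effect; --s3-upload-concurrency controls how many multipart chunks upload in parallel.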
SSH/SFTP example
rclone copy /data/results sftpremote:/srv/results/projectA \
--progress --transfers 8 --checkers 4
4) Integrity: SHA‑256 manifests you can trust
Make a manifest on the source, copy data and manifest, then verify on the destination.
Create manifest at source
cd /data/results
# exclude the manifest itself so it doesn't end up hashing its own (still empty) output file
rclone hashsum sha256 . --exclude SHA256SUMS.txt > SHA256SUMS.txt
Copy data + manifest
rclone copy /data/results myremote:results/projectA --progress
rclone copy /data/results/SHA256SUMS.txt myremote:results/projectA
Verify at destination (downloaded)
# Option A: verify after download back on another machine
rclone copy myremote:results/projectA ./projectA
cd projectA && sha256sum -c SHA256SUMS.txt
Verify in place (remote hash listing)
# If your remote exposes SHA-256/MD5, list remote hashes and compare
rclone hashsum sha256 myremote:results/projectA --exclude SHA256SUMS.txt > REMOTE_SHA256.txt
# diff REMOTE_SHA256.txt with your local manifest (paths must match)
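One quick way to compare the two listings, assuming both were generated relative to the same root so the paths line up, is to sort and diff them:
# run where both listings are available
sort -k2 SHA256SUMS.txt > local.sorted
sort -k2 REMOTE_SHA256.txt > remote.sorted
diff local.sorted remote.sorted && echo "manifests match"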
If the object store doesn’t expose strong hashes per part (common with S3 multipart), trust the manifest workflow: recompute locally after download and compare.
5) Sync vs copy, and delete safety
copy only adds/updates files on the destination. sync makes the destination match the source, including deletes. Use it with care:
rclone sync /data/results myremote:results/projectA --progress --delete-before
Add --dry-run first to preview what would be deleted.
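For example, a preview pass on the same paths reports what sync would copy and delete without touching the destination:
# nothing is transferred or deleted while --dry-run is set
rclone sync /data/results myremote:results/projectA --progress --delete-before --dry-run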
6) Fewer files = faster transfers (bundle smartly)
Millions of tiny files stall on metadata. Bundle logically, then compress.
# bundle and compress (multi-core)
cd /data/run123
tar -I 'zstd -T0 -19' -cf run123.tar.zst .
# upload the single archive + a tiny MANIFEST file listing contents
rclone copy run123.tar.zst myremote:runs/ --progress
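One way to produce that content listing before uploading; MANIFEST.txt is just a naming convention here:
# list the archive's contents into a small manifest and upload it alongside
tar -I zstd -tf run123.tar.zst > MANIFEST.txt
rclone copy MANIFEST.txt myremote:runs/ --progress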
Prefer zstd for speed; use pigz for gzip compatibility. Keep bundles below a few tens of GB if you need easy partial re-runs.
7) Move data between buckets or projects
You can copy remote→remote without staging data on the instance's NVMe (same-provider copies can be server-side; cross-provider copies still stream through the instance's network):
rclone copy awsA:bucketA/prefix gsB:bucketB/prefix --progress --transfers 32 --checkers 16
Works across providers as long as both remotes are configured.
8) Bandwidth and reliability knobs
- --bwlimit 100M caps bandwidth if you share a link.
- --retries 8 --low-level-retries 20 helps on flaky paths.
- --timeout 2m --contimeout 10s tunes slow endpoints.
- --checksum makes rclone compare hashes instead of size/modtime when the remote supports them (a combined example follows).
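Put together, a conservative long-haul copy might look like this; the values are starting points, not universal recommendations:
rclone copy /data/results myremote:results/projectA \
  --progress --transfers 8 --checkers 8 --checksum \
  --retries 8 --low-level-retries 20 \
  --timeout 2m --contimeout 10s --bwlimit 100M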
Log the exact command in your run card.
9) rsync when both ends are POSIX
For SSH on LAN or a well‑peered WAN, rsync
is great:
# trailing slash on the source copies its contents, matching the rclone layout above
rsync -avhP --delete --partial --partial-dir=.rsync-partial \
  /data/results/ user@host:/srv/results/projectA
--partial keeps partially transferred files so an interrupted run can resume where it left off. Still write a SHA-256 manifest and verify.
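If the manifest was copied along with the data, a quick check on the receiving host might look like this; user, host, and path are the same placeholders as above:
ssh user@host 'cd /srv/results/projectA && sha256sum -c --quiet SHA256SUMS.txt'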
10) Security basics
- Keep access keys in rclone config or env vars, not in images.
- Mount secrets at runtime; don’t commit them (one pattern is shown below).
- Prefer VPN/SSH to open buckets. If public, restrict by IP and expire presigned URLs quickly.
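One concrete pattern, assuming your platform can mount files into the container at runtime (the /run/secrets path is just an example): point rclone at an injected config file instead of baking keys into the image.
# read credentials from a config mounted read-only at runtime
rclone --config /run/secrets/rclone.conf lsd myremote: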
Methods snippet (copy‑paste)
transfers:
  tool: "rclone 1.xx"
  source:
    type: "local | s3 | sftp | gcs | azure | minio"
    url: "<path or remote:bucket/prefix>"
  destination:
    type: "local | s3 | sftp | gcs | azure | minio"
    url: "<path or remote:bucket/prefix>"
  command: |
    rclone copy <src> <dst> --transfers 16 --checkers 8 --s3-chunk-size 64M --progress
  manifest:
    algo: "SHA-256"
    file: "SHA256SUMS.txt"
    verified: "yes | no"
  notes: "row group size, compression, any retries/timeouts"
Try Compute today
Start a GPU instance with a CUDA-ready template (e.g., Ubuntu 24.04 LTS / CUDA 12.6) or your own GROMACS image. Enjoy flexible per-second billing with custom templates and the ability to start, stop, and resume your sessions at any time. Unsure about FP64 requirements? Contact support to help you select the ideal hardware profile for your computational needs.