
GROMACS on cloud GPUs (RTX 4090): quickstart and a self‑benchmark kit

You want a clean run, not a rabbit hole. This guide gives you a copy‑paste path to run a medium MD system on a single RTX 4090, plus a small kit to generate your own numbers you can trust.

What we’ll run

  • A solvated protein system around 120k atoms (you can swap in your own).
  • A CUDA‑enabled GROMACS image. GPU runs use mixed precision by design: double‑precision builds do not support GPU acceleration, and mixed precision is the default and accurate enough for almost all GPU‑accelerated MD.

Inputs

  • system.tpr (or generate from your own .mdp, conf.gro, topol.top).

Prepare your template for NVIDIA GPUs (UI steps)

On most computing services, once the template boots you’ll land inside a container with GPU access and gmx on the PATH; the image ships with CUDA and GROMACS preconfigured, so there is nothing to install.

Using pre‑made templates (Ubuntu 24.04 LTS / PyTorch 2.5)

These templates ship CUDA 12.6 user‑space libraries (and JupyterLab). That’s enough to run CUDA apps; the host driver is usually supplied by the computing provider. For GROMACS you still want a CUDA‑enabled build:

  • Fastest: use the GROMACS image approach above and save it as your own template.
  • If you pick Ubuntu 24.04 LTS (CUDA 12.6): install Apptainer and run the official GROMACS image with --nv:

sudo apt-get update && sudo apt-get install -y apptainer
apptainer exec --nv docker://gromacs/gromacs:2024.1 gmx --version
apptainer exec --nv -B $PWD:/work docker://gromacs/gromacs:2024.1 \
 bash -lc "cd /work && gmx mdrun -deffnm md -nb gpu -pme gpu -update gpu -pin on"

  • If you pick PyTorch 2.5 (CUDA 12.6): same Apptainer method works. PyTorch libs don’t conflict with GROMACS in a separate container.

Detailed installation instructions can be found in the official GROMACS and Apptainer documentation.

Tip: The CUDA version shown on the card is the toolkit/runtime in the image, not the driver inside your container. Check nvidia-smi to confirm GPU visibility.
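
A quick check with standard nvidia-smi query options confirms visibility and records the driver version for your Methods block:

# Confirm the GPU is visible and note the driver for later reporting
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv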


Run inside the container

You’re already in the GROMACS container. Everything goes through the gmx driver program (for example, gmx mdrun), with flags to match the run to your hardware. Create a project folder and run:

mkdir -p /work && cd /work
# If you have raw inputs, preprocess to a TPR (example):
# gmx grompp -f md.mdp -c conf.gro -p topol.top -o system.tpr

# Run with explicit GPU offload flags
gmx mdrun -deffnm md \
 -nb gpu -pme gpu -update gpu -pin on

The mdrun program reads the TPR file, which contains the molecular topology and simulation parameters, and generates output files for logs, trajectories, and energies.
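
With -deffnm md, a finished run typically leaves files like the following (the exact set depends on your .mdp output settings):

ls -1 md.*
# md.log - run log, ending with the Performance: line (ns/day)
# md.xtc - compressed trajectory (or md.trr for full-precision output)
# md.edr - energies
# md.cpt - checkpoint for restarts
# md.gro - final coordinates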

Check usage

nvidia-smi  # utilization and memory
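
To watch the run live, the same tool can loop (standard nvidia-smi query flags; -l 5 repeats every 5 seconds):

nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv -l 5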

Option A · Docker

Running on your own machine? Use Docker:

docker pull gromacs/gromacs:2024.1
docker run --gpus all -it --rm -v $PWD:/work -w /work gromacs/gromacs:2024.1 bash
# then run the same gmx mdrun command inside the container

Option B · Apptainer/Singularity

If your policy requires it:

# Example: convert Docker image
apptainer build gromacs.sif docker://gromacs/gromacs:2024.1
apptainer exec --nv gromacs.sif gmx --version
apptainer exec --nv -B $PWD:/work gromacs.sif bash -lc \
 "cd /work && gmx mdrun -deffnm md -nb gpu -pme gpu -update gpu -pin on"

Run‑your‑own benchmark kit

Protocol

  • System: ~120k atoms, PME, 2 fs, LINCS, 0.9 nm cutoffs.
  • Warmup: 5k steps. Measure: 50k steps.
  • Record: GPU, driver, CUDA, container digest, GROMACS version, CPU model/threads, exact flags, .mdp.

These benchmarks allow you to evaluate and compare the performance of different GPU and CPU configurations in GROMACS simulations.
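
Here is a minimal sketch of that protocol, assuming inputs named as in the earlier grompp example. The .mdp fragment mirrors the protocol settings, and mdrun's -resetstep flag restarts the timing counters after warmup so the reported ns/day covers only the measured window:

# bench.mdp - run settings matching the protocol (merge with your own output options)
cat > bench.mdp <<'EOF'
integrator           = md
dt                   = 0.002      ; 2 fs
nsteps               = 55000      ; 5k warmup + 50k measured
constraints          = h-bonds
constraint-algorithm = lincs
coulombtype          = PME
rcoulomb             = 0.9        ; nm
rvdw                 = 0.9        ; nm
EOF

gmx grompp -f bench.mdp -c conf.gro -p topol.top -o bench.tpr

# -resetstep 5000 excludes the warmup from the performance counters
gmx mdrun -deffnm bench -resetstep 5000 \
 -nb gpu -pme gpu -update gpu -pin on

grep "Performance:" bench.log   # columns: ns/day, hour/ns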

Table (fill with your numbers)

| Config | GPU         | CPU threads | Flags                                | ns/day | Notes |
|--------|-------------|-------------|--------------------------------------|--------|-------|
| A      | RTX 4090    | 4           | -nb gpu -pme gpu -update gpu -pin on |        |       |
| B      | RTX 4090    | 6           | same as A                            |        |       |
| C      | RTX 4090 ×2 | 6           | -nb gpu -pme gpu (PME split if used) |        |       |
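
If you keep one log per config (directory names below are illustrative), the ns/day column can be pulled straight from the logs:

# Every completed run ends its log with a Performance: line (ns/day, hour/ns)
grep -H "Performance:" configA/md.log configB/md.log configC/md.log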

Read your results

  • If PME dominates, try a few more CPU threads, or split off a dedicated PME rank on suitable builds (see the sketch after this list).
  • Watch VRAM: 24 GB fits many single‑system MD jobs of this size.
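
A minimal multi‑rank sketch for the PME‑split case, assuming a thread‑MPI build: with more than one rank, GPU PME needs exactly one dedicated PME rank (-npme 1):

# Two thread-MPI ranks: one PP rank plus one dedicated PME rank
gmx mdrun -deffnm md -ntmpi 2 -npme 1 -nb gpu -pme gpu -pin on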

Cost‑per‑result, not just speed

Scientists care about cost per simulated nanosecond more than peak throughput.

cost_per_ns = (hourly_price × wall_hours) / (ns_per_day × wall_hours / 24)
            = 24 × hourly_price / ns_per_day

Lower is better. Note that wall_hours cancels: at a fixed hourly price, cost per nanosecond depends only on ns/day. If cost per result is poor, change the instance or revisit flags.
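
As a quick sanity check (the price and throughput below are placeholders, not quotes):

# At $0.50/hr and 120 ns/day: 24 × 0.50 / 120 = $0.10 per ns
awk -v price=0.50 -v nsday=120 'BEGIN { printf "USD per ns: %.2f\n", 24 * price / nsday }'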

Validate your results and output files

  • Baseline check. Run a short CPU, double‑precision test and compare it against the GPU mixed‑precision run: look at energy drift, RMSD, temperature/pressure stability, or your task‑specific metric.
  • Determinism. Set a fixed seed where relevant and re‑run a short window. Small stochastic variance is fine; big swings mean configuration issues.
  • Offload sanity. Confirm the log prints GPU kernels for nonbonded/PME and that no major step has fallen back to CPU (see the snippet after this list).
  • Step size and constraints. Use a stable time step (e.g., 2 fs with LINCS). Check for LINCS warnings or constraint failures.
  • Methods block. Fill the Methods snippet (hardware, CUDA/driver, container digest, solver version, flags) and keep it with the results.
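
A minimal log check, assuming -deffnm md; the exact wording differs between GROMACS versions, so treat these patterns as a starting point rather than an exhaustive test:

# Which tasks landed on the GPU?
grep -i "on the GPU" md.log
# Scan notes and warnings for anything that fell back to the CPU
grep -iE "note|warning" md.log | head -n 20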

Good defaults and small tweaks

  • Start with 2–6 CPU threads per GPU for most GROMACS runs, then profile (see the sketch after this list).
  • Keep PME on the GPU for single‑GPU runs.
  • Use local NVMe for scratch; write logs less often to avoid I/O stalls.
  • Pin CUDA/driver and image digest. Don’t mix.
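
A minimal sketch of explicit thread control for a single‑GPU run (-ntmpi and -ntomp are standard mdrun options; adjust -ntomp to what you profiled):

# One thread-MPI rank with six OpenMP threads, pinned to cores
gmx mdrun -deffnm md -ntmpi 1 -ntomp 6 \
 -nb gpu -pme gpu -update gpu -pin on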

Troubleshooting

No GPU offload
Use a CUDA image and check the log. Confirm nvidia-container-toolkit is active.

Run slows over time
Check thermals and clocks with nvidia-smi. Keep -update gpu and -pin on.

Out of memory
Shrink the system, reduce neighbor list size, trim outputs, or move to a larger‑VRAM profile.

License servers for downstream tools
If you move to commercial solvers next, connect over VPN/SSH and fix FlexNet ports.

Methods snippet (fill and paste)

hardware:
 gpu: "RTX 4090 (24 GB)"
 driver: "<driver>"
 cuda: "12.x"
 cpu: "<model / threads used>"
software:
 container: "docker://gromacs/gromacs:2024.1@sha256:<digest>"
 solver: "gromacs 2024.1 (CUDA build)"
inputs:
 tpr: "system.tpr"
run:
 cmd: "gmx mdrun -deffnm md -nb gpu -pme gpu -update gpu -pin on"
outputs:
 performance: "<ns/day>"
 wallclock: "<HH:MM:SS>"


Try Compute today

Start a GPU instance with a CUDA-ready template (e.g., Ubuntu 24.04 LTS / CUDA 12.6) or your own GROMACS image. Enjoy flexible per-second billing with custom templates and the ability to start, stop, and resume your sessions at any time. Unsure about FP64 requirements? Contact support to help you select the ideal hardware profile for your computational needs.
