Agent‑based models thrive on parallelism. FLAME GPU executes agent functions as CUDA kernels, so one workstation‑class GPU can simulate millions of agents in real time—if you structure the model cleanly. This guide shows a practical, GPU‑friendly path from NetLogo/Mesa to FLAME GPU.
What we’ll cover
- Choosing a CUDA‑ready template and getting FLAME GPU built
- Minimal project layout that’s easy to maintain
- A tiny agent function skeleton and messaging pattern
- Validation against your CPU model
- A simple throughput benchmark (agents/s and cost per million agent‑steps)
- Profiling and common bottlenecks
Precision note: most ABMs are fine in FP32. If your model is sensitive, see the FP64 checklist.
1) Pick your image on your GPU service
Your job runs inside a container. Two options that work well:
A) CUDA devel image + build from source (portable)
- Template: Ubuntu 24.04 LTS (CUDA 12.6)
- Use a `devel` base image (the `runtime` variant lacks nvcc, so you can't compile), and add build tools and Python if you’ll use the Python API.
# Dockerfile (sketch)
FROM nvidia/cuda:12.6.0-devel-ubuntu24.04
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential cmake git python3 python3-pip \
&& rm -rf /var/lib/apt/lists/*
ENV NVIDIA_VISIBLE_DEVICES=all \
NVIDIA_DRIVER_CAPABILITIES=compute,utility
B) Use a maintained FLAME GPU image (fastest)
- Point your template to your lab’s private image that already includes FLAME GPU and dependencies.
Either way, confirm GPU visibility inside the container:
nvidia-smi
2) Project layout
/work
├── CMakeLists.txt
├── src/
│ ├── agents.cu # agent functions
│ ├── model.cu # model description & layers
│ └── main.cu # entry point
├── python/ # optional Python driver
└── data/ # inputs, seeds, checkpoints
Initialize CMake and build out‑of‑tree:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
3) Agent function + messaging (skeleton)
FLAME GPU runs agent functions over agent arrays on the GPU. Use messages for local interactions.
// agents.cu
#include <flamegpu/flamegpu.h>
// Layer 1: each agent publishes its position as a spatial message
FLAMEGPU_AGENT_FUNCTION(output_location, flamegpu::MessageNone, flamegpu::MessageSpatial2D) {
    FLAMEGPU->message_out.setLocation(
        FLAMEGPU->getVariable<float>("x"),
        FLAMEGPU->getVariable<float>("y"));
    return flamegpu::ALIVE;
}
// Layer 2: each agent reads messages in its neighborhood and updates velocity
FLAMEGPU_AGENT_FUNCTION(step, flamegpu::MessageSpatial2D, flamegpu::MessageNone) {
    const float x = FLAMEGPU->getVariable<float>("x");
    const float y = FLAMEGPU->getVariable<float>("y");
    float vx = FLAMEGPU->getVariable<float>("vx");
    float vy = FLAMEGPU->getVariable<float>("vy");
    // simple cohesion: steer toward the centroid of neighbors in radius
    float cx = 0.f, cy = 0.f; int n = 0;
    for (const auto &m : FLAMEGPU->message_in(x, y)) {  // spatial iteration
        cx += m.getVariable<float>("x");
        cy += m.getVariable<float>("y");
        ++n;
    }
    if (n) { cx /= n; cy /= n; vx += 0.05f * (cx - x); vy += 0.05f * (cy - y); }
    // write updated velocity and position
    FLAMEGPU->setVariable<float>("vx", vx);
    FLAMEGPU->setVariable<float>("vy", vy);
    FLAMEGPU->setVariable<float>("x", x + vx);
    FLAMEGPU->setVariable<float>("y", y + vy);
    return flamegpu::ALIVE;
}
Model description (pattern)
// model.cu
#include <flamegpu/flamegpu.h>
using namespace flamegpu;

void describeModel(ModelDescription &model) {
    AgentDescription agent = model.newAgent("A");
    agent.newVariable<float>("x");  agent.newVariable<float>("y");
    agent.newVariable<float>("vx"); agent.newVariable<float>("vy");

    MessageSpatial2D::Description msg = model.newMessage<MessageSpatial2D>("location");
    msg.setMin(0, 0);  msg.setMax(100, 100);  // domain bounds
    msg.setRadius(1.0f);                      // interaction radius
    // x and y are built-in variables of spatial messages; no newVariable needed

    AgentFunctionDescription out_fn = agent.newFunction("output_location", output_location);
    out_fn.setMessageOutput("location");
    AgentFunctionDescription step_fn = agent.newFunction("step", step);
    step_fn.setMessageInput("location");

    // Messages written in the first layer are read in the second within a step
    model.newLayer().addAgentFunction(out_fn);
    model.newLayer().addAgentFunction(step_fn);
}
This mirrors common NetLogo patterns (turtles + vision radius) but in a GPU‑friendly, structure‑of‑arrays layout.
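The layout in section 2 also lists src/main.cu, which ties these pieces together. A minimal driver might look like the sketch below. Treat it as an outline under assumptions: the agent/message/layer description elided at the comment must match model.cu, and the population size, domain bounds, and seed are illustrative.

```cpp
// main.cu (sketch) — host entry point: describe, populate, simulate.
#include <flamegpu/flamegpu.h>
#include <random>

int main(int argc, const char **argv) {
    flamegpu::ModelDescription model("abm");
    // ... agent, message, and layer description (see model.cu) ...

    // Initial population: scatter agents uniformly over the domain.
    flamegpu::AgentVector pop(model.Agent("A"), 1000000);
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> u(0.f, 100.f);
    for (flamegpu::AgentVector::Agent a : pop) {
        a.setVariable<float>("x", u(rng));
        a.setVariable<float>("y", u(rng));
        a.setVariable<float>("vx", 0.f);
        a.setVariable<float>("vy", 0.f);
    }

    flamegpu::CUDASimulation sim(model, argc, argv);  // parses FLAME GPU's standard CLI flags
    sim.setPopulationData(pop);
    sim.simulate();
    return 0;
}
```

FLAME GPU's built-in CLI covers step count and RNG seed; model-specific flags like the --agents and --checkpoint-interval options used in the next section would be your own additions on top.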
4) Run, seed, and checkpoint
./build/abm --agents 5000000 --steps 1000 --seed 42 \
--output data/checkpoints --checkpoint-interval 100
- Keep deterministic seeds for validation.
- Write fewer, larger checkpoints to reduce I/O overhead.
5) Validate vs your CPU model
- Pick a small world and identical rules.
- Run CPU reference (NetLogo/Mesa) and GPU for a short horizon.
- Compare aggregate metrics: counts, mean positions, cluster sizes, or domain‑specific stats.
- Expect trajectories to differ in detail (CPU and GPU RNG streams are not identical, even with the same seed); differences in aggregates should sit within run‑to‑run stochastic variance.
If results diverge, check message radius, boundary conditions, and update order.
6) Throughput + cost benchmark
Use numbers that matter for planning.
metrics:
agents: <N>
steps: <T>
wall_seconds: <…>
agent_steps_per_second: N*T / wall_seconds
cost_per_million_agent_steps: (price_per_hour * wall_seconds / 3600) / (N * T) * 1e6
Log GPU model/VRAM, driver, CUDA, FLAME GPU version, and the exact command line.
7) Profiling & bottlenecks
- nvidia-smi: watch utilization and VRAM.
- nsys / ncu: identify kernels with low occupancy or uncoalesced access.
- Messaging: spatial messages scale better than brute‑force all‑pairs; keep radii realistic.
- Host↔Device copies: avoid per‑step transfers; batch outputs.
- Branching: split agents by state into separate functions/layers when branches are hot.
8) Troubleshooting
GPU idle
Too few agents or heavy host‑side work. Raise N, reduce per‑step I/O, or move setup into device code.
Out of memory
Slim agent variables, chunk outputs, or choose a larger‑VRAM profile.
Non‑deterministic results
Fix seeds and avoid unordered host‑side reductions. Document RNG.
Build errors
Match CUDA to your base image and CMake toolchain. Clean and rebuild.
Methods snippet (copy‑paste)
hardware:
gpu: "<model> (<VRAM> GB)"
driver: "<NVIDIA driver>"
cuda: "<CUDA version>"
software:
flamegpu: "<version>"
image: "Ubuntu 24.04 LTS (CUDA 12.6)"
model:
domain: "[0,100]x[0,100]"
agents: <N>
rules: "cohesion-only demo"
run:
cmd: "./build/abm --agents <N> --steps <T> --seed 42"
checkpoints: "every 100 steps"
outputs:
wall_seconds: "<…>"
agent_steps_per_second: "<…>"
cost_per_million_agent_steps: "<…>"
Try Compute today
Start a GPU instance with a CUDA-ready template (e.g., Ubuntu 24.04 LTS / CUDA 12.6) or your own FLAME GPU image. Enjoy flexible per-second billing with custom templates and the ability to start, stop, and resume your sessions at any time. Unsure about FP64 requirements? Contact support to help you select the ideal hardware profile for your computational needs.