Read & process GeoTIFFs on GPU (cuCIM + rasterio)

You can run heavy raster math on a single GPU if you load data in tiles and keep the georeferencing straight. This guide shows a clean pattern for any CUDA-capable instance: rasterio for metadata and I/O, cuCIM/CuPy for the math, and pyproj or rasterio.warp when you need reprojection.

Plain truth: not every GeoTIFF reads natively on GPU. The reliable path is CPU read → GPU compute → CPU write, done in windows that match the file’s internal tiling. cuCIM can read many tiled TIFFs directly to GPU; when it can’t, you still win by copying windows to CuPy before the math.
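In skeleton form, every pattern in this guide is the same loop (a pure-Python sketch; `read`, `compute`, and `write` stand in for the rasterio and CuPy calls shown later in this guide):

```python
def process_windows(windows, read, compute, write):
    """Generic tile loop: CPU read -> GPU compute -> CPU write.

    read/compute/write are placeholders for rasterio and CuPy calls.
    """
    for window in windows:
        tile = read(window)        # CPU: decode one tile
        result = compute(tile)     # GPU: per-pixel math
        write(window, result)      # CPU: write georeferenced output

# Toy usage with plain ints standing in for tiles:
out = {}
process_windows([0, 1, 2],
                read=lambda w: w + 10,
                compute=lambda t: t * 2,
                write=lambda w, r: out.__setitem__(w, r))
# out == {0: 20, 1: 22, 2: 24}
```

Because each iteration only holds one tile's worth of data, peak memory is bounded by the tile size, not the file size.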

1) Pick an image

Two good options:

A) RAPIDS image (fast start)
docker.io/rapidsai/rapidsai:<24.xx>-cuda12-runtime-ubuntu22.04-py3.10

B) CUDA base + Python libs
Start from a CUDA runtime (e.g., Ubuntu 24.04 / CUDA 12.x), then install:

# micromamba (recommended)
micromamba create -y -n geo -c rapidsai -c conda-forge -c nvidia \
 python=3.10 rasterio pyproj cudf cupy cucim
micromamba activate geo
python -c "import rasterio, pyproj, cupy, cucim; print('ok')"

Your template should set:

  • NVIDIA_VISIBLE_DEVICES=all
  • NVIDIA_DRIVER_CAPABILITIES=compute,utility

Sanity check inside the container:

nvidia-smi

2) Use Cloud‑Optimized GeoTIFFs (COGs) when you can

COGs store internal tiles and overviews that make windowed reading predictable. Even for local files, COG layout keeps I/O fast and plays well with GPU tiling. If you can, convert once to a COG with 512×512 or 1024×1024 tiles and LZW/Deflate/ZSTD.
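If you have GDAL on hand, the conversion is a one-liner (a sketch: the COG driver needs GDAL ≥ 3.1, and the paths and creation options here are placeholders to tune for your data):

```shell
# One-time conversion to a Cloud-Optimized GeoTIFF with 512x512 tiles.
gdal_translate -of COG \
  -co BLOCKSIZE=512 \
  -co COMPRESS=DEFLATE \
  -co OVERVIEWS=AUTO \
  scene.tif scene_cog.tif
```

Run it once at ingest; every windowed read afterwards gets aligned, predictable I/O.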

3) Pattern A — CPU read → GPU compute → CPU write (robust default)

This works on any GeoTIFF and keeps georeferencing intact.

Example: NDVI from RED/NIR bands

import rasterio
from rasterio.windows import Window
import cupy as cp
import numpy as np

src_path = "scene.tif"      # multiband GeoTIFF (e.g., band 3=RED, band 4=NIR)
dst_path = "ndvi.tif"
RED_BAND, NIR_BAND = 3, 4

with rasterio.open(src_path) as src:
   profile = src.profile.copy()
   profile.update(count=1, dtype="float32", nodata=np.nan)

   tile_h, tile_w = src.block_shapes[0]  # internal tile shape: (rows, cols)
   with rasterio.open(dst_path, "w", **profile) as dst:
       for ji, window in src.block_windows(1):  # iterate tiles
           red = src.read(RED_BAND, window=window).astype("float32")
           nir = src.read(NIR_BAND, window=window).astype("float32")

           # move to GPU
           red_d = cp.asarray(red)
           nir_d = cp.asarray(nir)

           # NDVI = (NIR - RED) / (NIR + RED)
           denom = nir_d + red_d
           ndvi_d = cp.where(denom != 0, (nir_d - red_d) / denom, cp.nan)

           # back to CPU and write
           dst.write(cp.asnumpy(ndvi_d), 1, window=window)

Why this works

  • Rasterio handles CRS/transform/nodata and writes a valid GeoTIFF.
  • You only move a tile at a time to GPU, so VRAM stays bounded.
  • Works with compressed inputs; decompression happens during src.read().
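The VRAM bound is easy to estimate: peak usage per tile is roughly width × height × bytes per pixel × the number of arrays alive at once (a back-of-envelope helper; the default of 4 live arrays matches the NDVI loop above: red, nir, denom, ndvi):

```python
def tile_vram_bytes(width, height, live_arrays=4, bytes_per_px=4):
    """Rough peak device memory for one tile of float32 math."""
    return width * height * live_arrays * bytes_per_px

# A 1024x1024 tile with 4 float32 arrays in flight:
mb = tile_vram_bytes(1024, 1024) / 2**20
# -> 16.0 MiB, comfortably bounded even on small GPUs
```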

4) Pattern B — GPU‑native read with cuCIM (when supported)

cuCIM can read many tiled TIFFs straight to GPU without an intermediate CPU array. You still use rasterio to fetch metadata and to write outputs.

import rasterio
import cupy as cp
from cucim import CuImage

src_path = "scene.tif"
red_band, nir_band = 3, 4

with rasterio.open(src_path) as src:
   H, W = src.height, src.width
   tile_h, tile_w = src.block_shapes[0]  # internal tile shape: (rows, cols)

   cu = CuImage(src_path)

   profile = src.profile.copy()
   profile.update(count=1, dtype="float32")
   with rasterio.open("ndvi.tif", "w", **profile) as dst:
       for _, window in src.block_windows(1):
           col_off, row_off = window.col_off, window.row_off
           w, h = window.width, window.height

            # Read a subregion; some builds/files can land it on GPU directly.
            # Note: some files/drivers may require fallbacks; handle exceptions.
            region = cu.read_region(
                location=(col_off, row_off), size=(w, h), level=0
            )
            # bands-last array; cp.asarray copies host->device if needed
            arr = cp.asarray(region)
            nir_d = arr[..., nir_band - 1].astype(cp.float32)
            red_d = arr[..., red_band - 1].astype(cp.float32)

            denom = nir_d + red_d
            ndvi_d = cp.where(denom != 0, (nir_d - red_d) / denom, cp.nan)
           dst.write(cp.asnumpy(ndvi_d), 1, window=window)

Caveats

  • Arrays come back bands‑last (height, width, bands); adjust indexing accordingly.
  • Some GeoTIFF features aren’t fully supported by cuCIM; catch exceptions and fall back to Pattern A.
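The fallback itself is worth factoring into a helper (a pure-Python sketch; `gpu_read` and `cpu_read` are hypothetical stand-ins for the cuCIM and rasterio readers):

```python
def read_tile(window, gpu_read, cpu_read, warn=print):
    """Try the GPU-native reader first; fall back to the CPU path on failure."""
    try:
        return gpu_read(window)
    except Exception as exc:
        warn(f"GPU read failed for {window}: {exc}; using CPU fallback")
        return cpu_read(window)

# Toy usage: the GPU reader rejects window 1.
def flaky_gpu(w):
    if w == 1:
        raise RuntimeError("unsupported tile")
    return ("gpu", w)

tiles = [read_tile(w, flaky_gpu, lambda w: ("cpu", w), warn=lambda m: None)
         for w in range(3)]
# tiles == [("gpu", 0), ("cpu", 1), ("gpu", 2)]
```

Logging which files hit the fallback gives you a worklist for later COG conversion.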

5) Reprojection (keep it simple)

Let rasterio do the warp on CPU to produce the target grid, then run the heavy per‑pixel math on GPU.

import cupy as cp
import rasterio
from rasterio.warp import calculate_default_transform
from rasterio.vrt import WarpedVRT
from rasterio.enums import Resampling

dst_crs = "EPSG:32633"  # example UTM zone

with rasterio.open("scene.tif") as src:
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds
    )

    # Create a virtual warped view (rasterio handles geodesy & resampling)
    with WarpedVRT(src, crs=dst_crs, transform=transform, width=width,
                   height=height, resampling=Resampling.bilinear) as vrt:
        profile = vrt.profile.copy()
        profile.update(dtype="float32", count=1)
        with rasterio.open("band3_utm.tif", "w", **profile) as dst:
            for _, window in vrt.block_windows(1):
                arr = vrt.read(3, window=window).astype("float32")
                arr_d = cp.asarray(arr)  # GPU math on warped tiles goes here
                # ... compute ...
                dst.write(cp.asnumpy(arr_d), 1, window=window)

Why this approach

  • You keep geodesy (CRS transforms, resampling kernels) in a battle‑tested library.
  • You still get GPU speedups for the heavy per‑pixel math after warp.

6) Performance & VRAM tips

  • Match the file’s block size (dataset.block_shapes) for windows; don’t invent odd sizes.
  • Prefer float32 for calculations unless you’ve proven FP64 is required.
  • Limit concurrent tiles; process 1–4 at a time depending on VRAM.
  • Avoid per‑tile Python overhead by grouping windows and using simple loops.
  • Write fewer, larger files (or COGs with internal tiling) to cut metadata overhead.
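"Limit concurrent tiles" can be as simple as batching the window iterator before dispatching work (a stdlib sketch using itertools.islice; `batched` is a small helper, not a rasterio API):

```python
from itertools import islice

def batched(iterable, n):
    """Yield lists of up to n items; bounds how many tiles are in flight."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# Process at most 4 tiles per batch:
windows = list(range(10))
batches = list(batched(windows, 4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Size the batch so that batch_count × per-tile VRAM stays well under the GPU's memory.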

7) Self‑benchmark & cost math

  • inputs: file size, bands, dtype, tiles, CRS
  • hardware: GPU model/VRAM, CUDA, driver; CPU model
  • code: versions (rasterio, cucim, cupy)
  • metrics: tiles/sec, MB/sec, wall time, peak VRAM

Cost per 100 GB processed

cost_per_100gb = price_per_hour × wall_hours × (100 / (gb_processed))
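The same formula as code, for sanity-checking a benchmark run (the numbers are made up for illustration, not a price quote):

```python
def cost_per_100gb(price_per_hour, wall_hours, gb_processed):
    """Normalize a run's cost to 100 GB of input processed."""
    return price_per_hour * wall_hours * (100 / gb_processed)

# Example: $1.50/h instance, 2 h wall time, 250 GB processed
c = cost_per_100gb(1.50, 2.0, 250)
# -> roughly $1.20 per 100 GB
```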

8) Troubleshooting

GPU OOM
Process fewer/larger tiles; cast to float32; free device arrays between batches (del arr_d; cp.get_default_memory_pool().free_all_blocks()).

Weird results
Mismatched bands or nodata handling. Check band order; propagate nodata to the output.

Slow reads
Not a COG, tiny tiles, or heavy compression. Convert once to a well‑tiled COG.

cuCIM errors
Fall back to Pattern A and log the file’s creation info for later conversion.

Methods snippet (copy‑paste)

hardware:
 gpu: "<model> (<VRAM> GB)"
 driver: "<NVIDIA driver>"
 cuda: "<CUDA version>"
software:
 image: "rapidsai/rapidsai:<24.xx>-cuda12-runtime-ubuntu22.04-py3.10"
 libs:
   - rasterio: "<ver>"
   - cucim: "<ver>"
   - cupy: "<ver>"
   - pyproj: "<ver>"
inputs:
 file: "scene.tif (bands=<...>, dtype=<...>)"
 grid: "CRS=<EPSG:...>, transform=<...>"
run:
 pattern: "CPU read → GPU compute → CPU write (tiles)"
 tile: "<w>×<h> from block_shapes"
outputs:
 wall_seconds: "<…>"
 tiles_per_sec: "<…>"
 mb_per_sec: "<…>"
notes: "COG? compression? resampling? nodata propagation"
