Read & process GeoTIFFs on GPU (cuCIM + rasterio)

You can run heavy raster math on a single GPU if you load data in tiles and keep the georeferencing straight. This guide shows a clean pattern for any CUDA-capable instance: rasterio for metadata and I/O, cuCIM/CuPy for the math, and pyproj or rasterio.warp when you need reprojection.

Plain truth: not every GeoTIFF reads natively on GPU. The reliable path is CPU read → GPU compute → CPU write, done in windows that match the file’s internal tiling. cuCIM can read many tiled TIFFs directly to GPU; when it can’t, you still win by copying windows to CuPy before the math.
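In skeleton form, every pattern in this guide is the same loop (a pure-Python sketch; `read`, `compute`, and `write` stand in for the rasterio and CuPy calls shown later in this guide):

```python
def process_windows(windows, read, compute, write):
    """Generic tile loop: CPU read -> GPU compute -> CPU write.

    read/compute/write are placeholders for rasterio and CuPy calls.
    """
    for window in windows:
        tile = read(window)        # CPU: decode one tile
        result = compute(tile)     # GPU: per-pixel math
        write(window, result)      # CPU: write georeferenced output

# Toy usage with plain ints standing in for tiles:
out = {}
process_windows([0, 1, 2],
                read=lambda w: w + 10,
                compute=lambda t: t * 2,
                write=lambda w, r: out.__setitem__(w, r))
# out == {0: 20, 1: 22, 2: 24}
```

Because each iteration only holds one tile's worth of data, peak memory is bounded by the tile size, not the file size.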

1) Pick an image

Two good options:

A) RAPIDS image (fast start)
docker.io/rapidsai/rapidsai:<24.xx>-cuda12-runtime-ubuntu22.04-py3.10

B) CUDA base + Python libs
Start from a CUDA runtime (e.g., Ubuntu 24.04 / CUDA 12.x), then install:

# micromamba (recommended)
micromamba create -y -n geo -c rapidsai -c conda-forge -c nvidia \
 python=3.10 rasterio pyproj cudf cupy cucim
micromamba activate geo
python -c "import rasterio, pyproj, cupy, cucim; print('ok')"

Your template should set:

  • NVIDIA_VISIBLE_DEVICES=all
  • NVIDIA_DRIVER_CAPABILITIES=compute,utility

Sanity check inside the container:

nvidia-smi

2) Use Cloud‑Optimized GeoTIFFs (COGs) when you can

COGs store internal tiles and overviews that make windowed reading predictable. Even for local files, COG layout keeps I/O fast and plays well with GPU tiling. If you can, convert once to a COG with 512×512 or 1024×1024 tiles and LZW/Deflate/ZSTD.
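If you have GDAL on hand, the conversion is a one-liner (a sketch: the COG driver needs GDAL ≥ 3.1, and the paths and creation options here are placeholders to tune for your data):

```shell
# One-time conversion to a Cloud-Optimized GeoTIFF with 512x512 tiles.
gdal_translate -of COG \
  -co BLOCKSIZE=512 \
  -co COMPRESS=DEFLATE \
  -co OVERVIEWS=AUTO \
  scene.tif scene_cog.tif
```

Run it once at ingest; every windowed read afterwards gets aligned, predictable I/O.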

3) Pattern A — CPU read → GPU compute → CPU write (robust default)

This works on any GeoTIFF and keeps georeferencing intact.

Example: NDVI from RED/NIR bands

import rasterio
from rasterio.windows import Window
import cupy as cp
import numpy as np

src_path = "scene.tif"      # multiband GeoTIFF (e.g., band 3=RED, band 4=NIR)
dst_path = "ndvi.tif"
RED_BAND, NIR_BAND = 3, 4

with rasterio.open(src_path) as src:
   profile = src.profile.copy()
   profile.update(count=1, dtype="float32", nodata=np.nan)

   tile_h, tile_w = src.block_shapes[0]  # internal tile shape: (rows, cols)
   with rasterio.open(dst_path, "w", **profile) as dst:
       for ji, window in src.block_windows(1):  # iterate tiles
           red = src.read(RED_BAND, window=window).astype("float32")
           nir = src.read(NIR_BAND, window=window).astype("float32")

           # move to GPU
           red_d = cp.asarray(red)
           nir_d = cp.asarray(nir)

           # NDVI = (NIR - RED) / (NIR + RED)
           denom = nir_d + red_d
           ndvi_d = cp.where(denom != 0, (nir_d - red_d) / denom, cp.nan)

           # back to CPU and write
           dst.write(cp.asnumpy(ndvi_d), 1, window=window)

Why this works

  • Rasterio handles CRS/transform/nodata and writes a valid GeoTIFF.
  • You only move a tile at a time to GPU, so VRAM stays bounded.
  • Works with compressed inputs; decompression happens during src.read().
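The VRAM bound is easy to estimate: peak usage per tile is roughly width × height × bytes per pixel × the number of arrays alive at once (a back-of-envelope helper; the default of 4 live arrays matches the NDVI loop above: red, nir, denom, ndvi):

```python
def tile_vram_bytes(width, height, live_arrays=4, bytes_per_px=4):
    """Rough peak device memory for one tile of float32 math."""
    return width * height * live_arrays * bytes_per_px

# A 1024x1024 tile with 4 float32 arrays in flight:
mb = tile_vram_bytes(1024, 1024) / 2**20
# -> 16.0 MiB, comfortably bounded even on small GPUs
```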

4) Pattern B — GPU‑native read with cuCIM (when supported)

cuCIM can read many tiled TIFFs straight to GPU without an intermediate CPU array. You still use rasterio to fetch metadata and to write outputs.

import rasterio
import cupy as cp
from cucim import CuImage

src_path = "scene.tif"
red_band, nir_band = 3, 4

with rasterio.open(src_path) as src:
   H, W = src.height, src.width
   tile_h, tile_w = src.block_shapes[0]  # internal tile shape: (rows, cols)

   cu = CuImage(src_path)

   profile = src.profile.copy()
   profile.update(count=1, dtype="float32")
   with rasterio.open("ndvi.tif", "w", **profile) as dst:
       for _, window in src.block_windows(1):
           col_off, row_off = window.col_off, window.row_off
           w, h = window.width, window.height

            # Read a subregion; some builds/files can land it on GPU directly.
            # Note: some files/drivers may require fallbacks; handle exceptions.
            region = cu.read_region(
                location=(col_off, row_off), size=(w, h), level=0
            )
            # bands-last array; cp.asarray copies host->device if needed
            arr = cp.asarray(region)
            nir_d = arr[..., nir_band - 1].astype(cp.float32)
            red_d = arr[..., red_band - 1].astype(cp.float32)

            denom = nir_d + red_d
            ndvi_d = cp.where(denom != 0, (nir_d - red_d) / denom, cp.nan)
           dst.write(cp.asnumpy(ndvi_d), 1, window=window)

Caveats

  • Arrays come back bands‑last (height, width, bands); adjust indexing accordingly.
  • Some GeoTIFF features aren’t fully supported by cuCIM; catch exceptions and fall back to Pattern A.
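The fallback itself is worth factoring into a helper (a pure-Python sketch; `gpu_read` and `cpu_read` are hypothetical stand-ins for the cuCIM and rasterio readers):

```python
def read_tile(window, gpu_read, cpu_read, warn=print):
    """Try the GPU-native reader first; fall back to the CPU path on failure."""
    try:
        return gpu_read(window)
    except Exception as exc:
        warn(f"GPU read failed for {window}: {exc}; using CPU fallback")
        return cpu_read(window)

# Toy usage: the GPU reader rejects window 1.
def flaky_gpu(w):
    if w == 1:
        raise RuntimeError("unsupported tile")
    return ("gpu", w)

tiles = [read_tile(w, flaky_gpu, lambda w: ("cpu", w), warn=lambda m: None)
         for w in range(3)]
# tiles == [("gpu", 0), ("cpu", 1), ("gpu", 2)]
```

Logging which files hit the fallback gives you a worklist for later COG conversion.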

5) Reprojection (keep it simple)

Let rasterio do the warp on CPU to produce the target grid, then run the heavy per‑pixel math on GPU.

import cupy as cp
import rasterio
from rasterio.warp import calculate_default_transform
from rasterio.vrt import WarpedVRT
from rasterio.enums import Resampling

dst_crs = "EPSG:32633"  # example UTM zone

with rasterio.open("scene.tif") as src:
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds
    )

    # Create a virtual warped view (rasterio handles geodesy & resampling)
    with WarpedVRT(src, crs=dst_crs, transform=transform, width=width,
                   height=height, resampling=Resampling.bilinear) as vrt:
        profile = vrt.profile.copy()
        profile.update(dtype="float32", count=1)
        with rasterio.open("band3_utm.tif", "w", **profile) as dst:
            for _, window in vrt.block_windows(1):
                arr = vrt.read(3, window=window).astype("float32")
                arr_d = cp.asarray(arr)  # GPU math on warped tiles goes here
                # ... compute ...
                dst.write(cp.asnumpy(arr_d), 1, window=window)

Why this approach

  • You keep geodesy (CRS transforms, resampling kernels) in a battle‑tested library.
  • You still get GPU speedups for the heavy per‑pixel math after warp.

6) Performance & VRAM tips

  • Match the file’s block size (dataset.block_shapes) for windows; don’t invent odd sizes.
  • Prefer float32 for calculations unless you’ve proven FP64 is required.
  • Limit concurrent tiles; process 1–4 at a time depending on VRAM.
  • Avoid per‑tile Python overhead by grouping windows and using simple loops.
  • Write fewer, larger files (or COGs with internal tiling) to cut metadata overhead.
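"Limit concurrent tiles" can be as simple as batching the window iterator before dispatching work (a stdlib sketch using itertools.islice; `batched` is a small helper, not a rasterio API):

```python
from itertools import islice

def batched(iterable, n):
    """Yield lists of up to n items; bounds how many tiles are in flight."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# Process at most 4 tiles per batch:
windows = list(range(10))
batches = list(batched(windows, 4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Size the batch so that batch_count × per-tile VRAM stays well under the GPU's memory.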

7) Self‑benchmark & cost math

  • inputs: file size, bands, dtype, tiles, CRS
  • hardware: GPU model/VRAM, CUDA, driver; CPU model
  • code: versions (rasterio, cucim, cupy)
  • metrics: tiles/sec, MB/sec, wall time, peak VRAM

Cost per 100 GB processed

cost_per_100gb = price_per_hour × wall_hours × (100 / (gb_processed))
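The same formula as code, for sanity-checking a benchmark run (the numbers are made up for illustration, not a price quote):

```python
def cost_per_100gb(price_per_hour, wall_hours, gb_processed):
    """Normalize a run's cost to 100 GB of input processed."""
    return price_per_hour * wall_hours * (100 / gb_processed)

# Example: $1.50/h instance, 2 h wall time, 250 GB processed
c = cost_per_100gb(1.50, 2.0, 250)
# -> roughly $1.20 per 100 GB
```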

8) Troubleshooting

GPU OOM
Process fewer/larger tiles; cast to float32; free device arrays between batches (del arr_d; cp.get_default_memory_pool().free_all_blocks()).

Weird results
Mismatched bands or nodata handling. Check band order; propagate nodata to the output.

Slow reads
Not a COG, tiny tiles, or heavy compression. Convert once to a well‑tiled COG.

cuCIM errors
Fall back to Pattern A and log the file’s creation info for later conversion.

Methods snippet (copy‑paste)

hardware:
 gpu: "<model> (<VRAM> GB)"
 driver: "<NVIDIA driver>"
 cuda: "<CUDA version>"
software:
 image: "rapidsai/rapidsai:<24.xx>-cuda12-runtime-ubuntu22.04-py3.10"
 libs:
   - rasterio: "<ver>"
   - cucim: "<ver>"
   - cupy: "<ver>"
   - pyproj: "<ver>"
inputs:
 file: "scene.tif (bands=<...>, dtype=<...>)"
 grid: "CRS=<EPSG:...>, transform=<...>"
run:
 pattern: "CPU read → GPU compute → CPU write (tiles)"
 tile: "<w>×<h> from block_shapes"
outputs:
 wall_seconds: "<…>"
 tiles_per_sec: "<…>"
 mb_per_sec: "<…>"
notes: "COG? compression? resampling? nodata propagation"
