Processing Pipeline

The pipeline transforms raw COSMO-REA6 GRIB archives into a single analysis-ready NetCDF-4 file in four steps.

DWD OpenData (.grb.bz2)
    │
    ▼  Step 1 — Download (HTTPS, parallel per attribute/month)
download/
    │
    ▼  Step 2 — Decompress (lbzip2 / pbzip2 / Python bz2)
decompress/
    │
    ▼  Step 3 — Transform (xarray + cfgrib, dask threaded)
xarray.Dataset (in memory, chunked)
    │
    ▼  Step 4 — Export (NetCDF-4 with zlib, per-variable compute)
output/COSMO_REA6_2018_Jan.nc

Step 1 — Download

download.py fetches compressed GRIB files from the DWD OpenData archive.

  • One file per attribute per month (e.g. SWDIRS_RAD.2D.201801.grb.bz2).

  • Idempotent: compares local file size with remote Content-Length before downloading; skips complete files.

  • Supports HTTPS (default) with resume via Range header, and FTP fallback.

  • Files are written atomically (to a temp file, then renamed).

Step 2 — Decompress

decompress.py extracts raw GRIB from .grb.bz2 archives.

  • Auto-detects the best available tool: lbzip2 > pbzip2 > Python bz2.

  • Parallel decompression across files using a thread pool.

  • Atomic writes: a crash never leaves a half-written GRIB file.

  • lbzip2 is preferred because it scales better across multiple cores.

Step 3 — Transform

transform.py reads decompressed GRIB files and produces an xarray Dataset with analysis-ready variables.

Raw attributes (from COSMO-REA6):

Field

Description

Raw unit

SWDIFDS_RAD

Downward diffuse shortwave radiation at surface

W/m²

SWDIRS_RAD

Downward direct shortwave radiation at surface

W/m²

T_2M

Temperature at 2 m above ground

K

U_10M

U-component of wind at 10 m

m/s

V_10M

V-component of wind at 10 m

m/s

Derived fields (computed during transform):

Field

Formula

Unit

GHI

SWDIFDS_RAD + SWDIRS_RAD

W/m²

DHI

SWDIFDS_RAD

W/m²

T

T_2M 273.15

°C

WS_10M

\(\sqrt{U\_10M^2 + V\_10M^2}\)

m/s

  • Chunked processing with dask (time=168, ~1 week per chunk).

  • Uses the threaded scheduler — all threads share the same memory space, avoiding the overhead and OOM risks of dask.distributed.

Step 4 — Export

export.py writes the Dataset to a compressed NetCDF-4 file.

  • zlib compression (default level 1 — fastest; levels 2–9 give negligible size reduction at much higher CPU cost on the 824×848 COSMO grid).

  • Variables are computed one at a time to cap peak memory at ~4 GiB instead of materialising all fields simultaneously.

  • float32 encoding halves file size without meaningful precision loss.

Output naming convention:

Months processed

Filename

All 12 (full year)

COSMO_REA6_2018.nc

Single month

COSMO_REA6_2018_Jan.nc

Multiple months

COSMO_REA6_2018_Jan-Mar.nc

Step 5 — Cleanup (optional)

When --cleanup is passed, the pipeline removes the download/ and decompress/ directories after a successful export. The download and decompression steps are fast enough that re-running them is inexpensive.

Memory and Performance

The pipeline is tuned for a 1/8 node allocation on Snellius (16 cores, 28 GiB RAM):

  • Dask threaded scheduler (no distributed workers).

  • Chunk size time=168 (~67 MB per chunk).

  • Sequential per-variable export (peak ~4 GiB).

  • Benchmark: ~7.5 minutes for 1 month of all 5 attributes.