Processing Pipeline

The pipeline transforms raw COSMO-REA6 GRIB archives into a single analysis-ready NetCDF-4 file in four steps.

DWD OpenData (.grb.bz2)
    │
    ▼  Step 1 — Download (HTTPS, parallel per attribute/month)
download/
    │
    ▼  Step 2 — Decompress (lbzip2 / pbzip2 / Python bz2)
decompress/
    │
    ▼  Step 3 — Transform (xarray + cfgrib, dask threaded)
xarray.Dataset (in memory, chunked)
    │
    ▼  Step 4 — Export (NetCDF-4 with zlib, per-variable compute)
output/COSMO_REA6_2018_Jan.nc

Step 1 — Download

download.py fetches compressed GRIB files from the DWD OpenData archive.

One file per attribute per month (e.g. SWDIRS_RAD.2D.201801.grb.bz2).
Idempotent: compares local file size with remote Content-Length before downloading; skips complete files.
Supports HTTPS (default) with resume via Range header, and FTP fallback.
Files are written atomically (to a temp file, then renamed).

Step 2 — Decompress

decompress.py extracts raw GRIB from .grb.bz2 archives.

Auto-detects the best available tool: lbzip2 > pbzip2 > Python bz2.
Parallel decompression across files using a thread pool.
Atomic writes: a crash never leaves a half-written GRIB file.
lbzip2 is preferred because it scales better across multiple cores.

Step 3 — Transform

transform.py reads decompressed GRIB files and produces an xarray Dataset with analysis-ready variables.

Raw attributes (from COSMO-REA6):

Field	Description	Raw unit
`SWDIFDS_RAD`	Downward diffuse shortwave radiation at surface	W/m²
`SWDIRS_RAD`	Downward direct shortwave radiation at surface	W/m²
`T_2M`	Temperature at 2 m above ground	K
`U_10M`	U-component of wind at 10 m	m/s
`V_10M`	V-component of wind at 10 m	m/s

Derived fields (computed during transform):

Field	Formula	Unit
`GHI`	`SWDIFDS_RAD + SWDIRS_RAD`	W/m²
`DHI`	`SWDIFDS_RAD`	W/m²
`T`	`T_2M − 273.15`	°C
`WS_10M`	\(\sqrt{U\_10M^2 + V\_10M^2}\)	m/s

Chunked processing with dask (time=168, ~1 week per chunk).
Uses the threaded scheduler — all threads share the same memory space, avoiding the overhead and OOM risks of dask.distributed.

Step 4 — Export

export.py writes the Dataset to a compressed NetCDF-4 file.

zlib compression (default level 1 — fastest; levels 2–9 give negligible size reduction at much higher CPU cost on the 824×848 COSMO grid).
Variables are computed one at a time to cap peak memory at ~4 GiB instead of materialising all fields simultaneously.
float32 encoding halves file size without meaningful precision loss.

Output naming convention:

Months processed	Filename
All 12 (full year)	`COSMO_REA6_2018.nc`
Single month	`COSMO_REA6_2018_Jan.nc`
Multiple months	`COSMO_REA6_2018_Jan-Mar.nc`

Step 5 — Cleanup (optional)

When --cleanup is passed, the pipeline removes the download/ and decompress/ directories after a successful export. The download and decompression steps are fast enough that re-running them is inexpensive.

Memory and Performance

The pipeline is tuned for a 1/8 node allocation on Snellius (16 cores, 28 GiB RAM):

Dask threaded scheduler (no distributed workers).
Chunk size time=168 (~67 MB per chunk).
Sequential per-variable export (peak ~4 GiB).
Benchmark: ~7.5 minutes for 1 month of all 5 attributes.