Processing Pipeline
The pipeline transforms raw COSMO-REA6 GRIB archives into a single analysis-ready NetCDF-4 file in four steps, followed by an optional cleanup step.
DWD OpenData (.grb.bz2)
│
▼ Step 1 — Download (HTTPS, parallel per attribute/month)
download/
│
▼ Step 2 — Decompress (lbzip2 / pbzip2 / Python bz2)
decompress/
│
▼ Step 3 — Transform (xarray + cfgrib, dask threaded)
xarray.Dataset (in memory, chunked)
│
▼ Step 4 — Export (NetCDF-4 with zlib, per-variable compute)
output/COSMO_REA6_2018_Jan.nc
Step 1 — Download
download.py fetches compressed GRIB files from the DWD OpenData archive.
One file per attribute per month (e.g. SWDIRS_RAD.2D.201801.grb.bz2).
Idempotent: compares the local file size with the remote Content-Length before downloading; skips complete files.
Supports HTTPS (default) with resume via the Range header, and FTP fallback.
Files are written atomically (to a temp file, then renamed).
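The size check, Range-based resume, and atomic rename described above can be sketched as follows. This is a minimal illustration using only the standard library, not the actual contents of download.py: the function names and the `.part` suffix are hypothetical, and FTP fallback and parallelism are left out.

```python
import os
import shutil
import urllib.request


def remote_size(url: str) -> int:
    """Return the Content-Length reported by a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])


def is_complete(path: str, expected_size: int) -> bool:
    """A local file counts as complete when its size matches the remote size."""
    return os.path.exists(path) and os.path.getsize(path) == expected_size


def fetch(url: str, dest: str) -> bool:
    """Download url to dest; return False if dest was already complete."""
    expected = remote_size(url)
    if is_complete(dest, expected):
        return False  # idempotent: nothing to do

    part = dest + ".part"  # hypothetical temp-file name
    offset = os.path.getsize(part) if os.path.exists(part) else 0
    if offset > expected:
        offset = 0  # stale partial file, start over

    if not (os.path.exists(part) and offset == expected):
        headers = {"Range": f"bytes={offset}-"} if offset else {}
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            # 206 means the server honoured the Range header; otherwise restart.
            mode = "ab" if resp.status == 206 else "wb"
            with open(part, mode) as out:
                shutil.copyfileobj(resp, out)

    os.replace(part, dest)  # atomic rename on the same filesystem
    return True
```

Because the final name only appears after `os.replace`, an interrupted run leaves at most a `.part` file behind, which the next run picks up and resumes.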
Step 2 — Decompress
decompress.py extracts raw GRIB from .grb.bz2 archives.
Auto-detects the best available tool: lbzip2 > pbzip2 > Python bz2.
Parallel decompression across files using a thread pool.
Atomic writes: a crash never leaves a half-written GRIB file.
lbzip2 is preferred because it scales better across multiple cores.
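A rough sketch of the tool detection, thread pool, and atomic write described above, assuming the external tools accept the usual `-d` (decompress) and `-c` (write to stdout) flags. The function names, the `.tmp` suffix, and the default worker count are illustrative and not taken from decompress.py.

```python
import bz2
import os
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pick_tool():
    """Return the first external tool found on PATH, or None for Python bz2."""
    for tool in ("lbzip2", "pbzip2"):
        if shutil.which(tool):
            return tool
    return None


def decompress_one(src, out_dir, tool=None):
    """Decompress src (*.bz2) into out_dir; the final name appears atomically."""
    name = os.path.basename(src)
    if name.endswith(".bz2"):
        name = name[: -len(".bz2")]
    dest = os.path.join(out_dir, name)
    tmp = dest + ".tmp"  # hypothetical temp-file name
    with open(tmp, "wb") as out:
        if tool:
            # External tool streams the decompressed bytes to stdout.
            subprocess.run([tool, "-d", "-c", src], stdout=out, check=True)
        else:
            # Pure-Python fallback, streamed to keep memory use flat.
            with bz2.open(src, "rb") as inp:
                shutil.copyfileobj(inp, out)
    os.replace(tmp, dest)  # only complete files ever carry the final name
    return dest


def decompress_all(sources, out_dir, workers=4):
    """Decompress several archives concurrently with a thread pool."""
    os.makedirs(out_dir, exist_ok=True)
    tool = pick_tool()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: decompress_one(s, out_dir, tool), sources))
```

Threads are sufficient here because the heavy work happens either in a subprocess or in the C implementation behind the `bz2` module.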
Step 3 — Transform
transform.py reads decompressed GRIB files and produces an xarray
Dataset with analysis-ready variables.
Raw attributes (from COSMO-REA6):
| Field | Description | Raw unit |
|---|---|---|
|  | Downward diffuse shortwave radiation at surface | W/m² |
| SWDIRS_RAD | Downward direct shortwave radiation at surface | W/m² |
|  | Temperature at 2 m above ground | K |
| U_10M | U-component of wind at 10 m | m/s |
| V_10M | V-component of wind at 10 m | m/s |
Derived fields (computed during transform):
| Field | Formula | Unit |
|---|---|---|
|  |  | W/m² |
|  |  | W/m² |
|  |  | °C |
|  | \(\sqrt{\mathrm{U\_10M}^2 + \mathrm{V\_10M}^2}\) | m/s |
Chunked processing with dask (time=168, ~1 week per chunk).
Uses the threaded scheduler: all threads share the same memory space, avoiding the overhead and OOM risks of dask.distributed.
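As an illustration of the derived fields, the wind-speed formula from the table and a Kelvin-to-Celsius conversion can be written as plain NumPy functions. This is a sketch rather than the code in transform.py: the function names are hypothetical, and the 273.15 offset is the standard conversion, assumed here to be what the °C field uses. NumPy ufuncs such as `np.hypot` also accept xarray objects, so the same functions should apply to chunked DataArrays.

```python
import numpy as np


def wind_speed(u, v):
    """Horizontal wind speed sqrt(u**2 + v**2) from its components, in m/s."""
    return np.hypot(u, v)


def kelvin_to_celsius(t_kelvin):
    """Convert temperature from K to °C (assumed standard offset of 273.15)."""
    return t_kelvin - 273.15


# Example with two grid points: components of (3, 4) m/s and (0, 2) m/s.
u = np.array([3.0, 0.0])
v = np.array([4.0, 2.0])
speed = wind_speed(u, v)
```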
Step 4 — Export
export.py writes the Dataset to a compressed NetCDF-4 file.
zlib compression (default level 1 — fastest; levels 2–9 give negligible size reduction at much higher CPU cost on the 824×848 COSMO grid).
Variables are computed one at a time to cap peak memory at ~4 GiB instead of materialising all fields simultaneously.
float32 encoding halves file size without meaningful precision loss.
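One way to combine these three points is to write each variable in its own `to_netcdf` call, appending to the same file, so that only one field is computed at a time. The sketch below assumes `ds` is an xarray.Dataset backed by dask arrays; the function names are hypothetical and export.py may be organised differently.

```python
def build_encoding(var_names, complevel=1):
    """Per-variable NetCDF encoding: zlib compression and float32 storage."""
    return {
        name: {"zlib": True, "complevel": complevel, "dtype": "float32"}
        for name in var_names
    }


def export_per_variable(ds, path, complevel=1):
    """Write ds (an xarray.Dataset) to path one data variable at a time."""
    for i, name in enumerate(ds.data_vars):
        # The first variable creates the file, later ones are appended to it.
        mode = "w" if i == 0 else "a"
        ds[[name]].to_netcdf(
            path,
            mode=mode,
            format="NETCDF4",
            encoding=build_encoding([name], complevel),
        )
```

Selecting `ds[[name]]` keeps the coordinates with each variable, so every write is a valid dataset on its own and the file grows by one field per iteration.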
Output naming convention:
| Months processed | Filename |
|---|---|
| All 12 (full year) |  |
| Single month | e.g. COSMO_REA6_2018_Jan.nc |
| Multiple months |  |
Step 5 — Cleanup (optional)
When --cleanup is passed, the pipeline removes the download/ and
decompress/ directories after a successful export. The download and
decompression steps are fast enough that re-running them is inexpensive.
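A minimal sketch of how such a flag could be wired up, assuming the two intermediate directories live under a common working directory. Only the `--cleanup` flag and the directory names come from the text above; the function names and the `work_dir` parameter are hypothetical.

```python
import argparse
import shutil
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; only the --cleanup flag is modelled here."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cleanup",
        action="store_true",
        help="remove download/ and decompress/ after a successful export",
    )
    return parser.parse_args(argv)


def cleanup(work_dir, subdirs=("download", "decompress")):
    """Delete the intermediate directories and return the paths removed."""
    removed = []
    for name in subdirs:
        path = Path(work_dir) / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

The caller would invoke `cleanup` only after the export step has returned without error, so a failed run keeps its intermediate files for inspection or a retry.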
Memory and Performance
The pipeline is tuned for a 1/8 node allocation on Snellius (16 cores, 28 GiB RAM):
Dask threaded scheduler (no distributed workers).
Chunk size time=168 (~67 MB per chunk).
Sequential per-variable export (peak ~4 GiB).
Benchmark: ~7.5 minutes for 1 month of all 5 attributes.