Datasets

This page provides an overview of how datasets are defined, structured, and handled in the processing chain. The goal is to ensure consistent, analysis-ready data products that can be accessed efficiently and extended over time.

What is a Dataset?

A dataset is the main unit of storage and analysis. A dataset:

  • represents a rectangular collection of variables with shared coordinates

  • is stored in Zarr format

  • corresponds to one deployment of an instrument or product

Datasets are designed so that they can be accessed directly for analysis via our intake catalog.

Analysis-Ready Datasets

Our goal is to produce analysis-ready datasets. These datasets should be usable for scientific analysis without additional preprocessing.

An analysis-ready dataset should:

  • represent the whole observation period

  • follow CF conventions where possible

  • include appropriate metadata and units

Dataset Organization

Datasets follow a hierarchical naming scheme that reflects their observational context.

The hierarchy is:

platform.campaign.instrument

or, if no campaign is relevant:

platform.instrument

The instrument name should include configuration (_c1) and version (_v1) information.

Examples:

BCO.surfacemet_wxt_v1
BCO.lidar_CORAL_LR_t_c1_v1
METEOR.EUREC4A.lidar_LICHT_LR_t_v1

Naming rules:

  • . separates hierarchical levels

  • _ is used within names
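The naming rules above can be illustrated with a small parser. This is only a sketch of how a name might be split into its hierarchy levels; `DatasetName` and `parse_dataset_name` are hypothetical helpers introduced here for illustration and are not part of the processing chain or the intake catalog.

```python
from typing import NamedTuple, Optional


class DatasetName(NamedTuple):
    """Hierarchy levels of a dataset name (hypothetical helper type)."""

    platform: str
    campaign: Optional[str]  # None when no campaign is relevant
    instrument: str


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name on '.' into platform, campaign, and instrument."""
    parts = name.split(".")
    if any(part == "" for part in parts):
        raise ValueError(f"empty hierarchy level in dataset name: {name!r}")
    if len(parts) == 2:
        platform, instrument = parts
        return DatasetName(platform, None, instrument)
    if len(parts) == 3:
        platform, campaign, instrument = parts
        return DatasetName(platform, campaign, instrument)
    raise ValueError(f"expected 2 or 3 hierarchy levels, got {len(parts)}: {name!r}")


print(parse_dataset_name("BCO.surfacemet_wxt_v1"))
print(parse_dataset_name("METEOR.EUREC4A.lidar_LICHT_LR_t_v1"))
```

Because `_` is only used within a level, splitting on `.` alone is enough to recover the hierarchy.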

Datasets can be accessed through the intake catalog, for example:

import intake

# Open the public catalog and lazily load one dataset as a dask-backed xarray.Dataset
cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/catalog.yaml")
cat.BCO.surfacemet_wxt_v1.to_dask()
<xarray.Dataset> Size: 8GB
Dimensions:      (time: 43702710, bnd: 2)
Coordinates:
    alt          float64 8B ...
    lat          float64 8B ...
    lon          float64 8B ...
  * time         (time) datetime64[ns] 350MB 2010-12-16T16:24:00 ... 2026-04-...
Dimensions without coordinates: bnd
Data variables: (12/22)
    DIR          (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    DL           (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    DR           (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    H            (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    HDS          (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    HI           (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    ...           ...
    TI           (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    VEL          (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    VH           (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    VR           (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    VS           (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
    time_bounds  (time, bnd) datetime64[ns] 699MB dask.array<chunksize=(262144, 2), meta=np.ndarray>
Attributes:
    Conventions:           CF-1.12
    _logical_cutoff_date:  2026-04-02T00:00:00Z
    bcoproc_version:       0.0.0.post1246.dev0+39062d5
    featureType:           timeSeries
    institution:           Max Planck Institute for Meteorology, Hamburg
    license:               CC0-1.0
    location:              The Barbados Cloud Observatory (BCO), Deebles Poin...
    platform:              BCO
    source:                Vaisala WXT-520
    summary:               This dataset contains basic meteorological measure...
    title:                 WXT-2 ground station data from BCO (Level 1)
    tool_versions:         {"Python": "3.11.2 (main, Apr 28 2025, 14:11:48) [...

Coordinate Conventions

Datasets follow CF conventions for coordinate naming where possible.

Primary coordinates include:

Coordinate    Meaning
time          UTC timestamp
alt           altitude above the geoid (meters)
range         line-of-sight distance from instrument (meters)
lat           latitude (degrees_north)
lon           longitude (degrees_east)

Primary coordinates must:

  • be strictly monotonic

  • contain no missing values

When sensor and data coordinates differ, sensor coordinates are provided using the prefix sensor_, e.g., sensor_alt.
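The two requirements on primary coordinates (strictly monotonic, no missing values) can be checked with a few lines of NumPy. The function below is a minimal sketch, not the validation used by the processing chain; `check_primary_coordinate` is a hypothetical name, and treating both strictly increasing and strictly decreasing sequences as valid is an assumption.

```python
import numpy as np


def check_primary_coordinate(values) -> list[str]:
    """Return a list of problems found in a 1-D coordinate (empty if none)."""
    arr = np.asarray(values)
    problems = []

    # Missing values are NaT for datetime coordinates and NaN for numeric ones.
    if np.issubdtype(arr.dtype, np.datetime64):
        missing = np.isnat(arr)
    else:
        missing = np.isnan(arr.astype(float))
    if missing.any():
        problems.append("contains missing values")

    # Check monotonicity on the non-missing values only, so that each
    # requirement is reported independently of the other.
    valid = arr[~missing]
    increasing = bool(np.all(valid[1:] > valid[:-1]))
    decreasing = bool(np.all(valid[1:] < valid[:-1]))
    if not (increasing or decreasing):
        problems.append("not strictly monotonic")

    return problems


time = np.array(
    ["2024-01-01T00:00", "2024-01-01T00:01", "2024-01-01T00:02"],
    dtype="datetime64[ns]",
)
print(check_primary_coordinate(time))
print(check_primary_coordinate([0.0, 10.0, 10.0]))
```

With an xarray dataset, the same function could be applied to `ds["time"].values` or any other primary coordinate.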

Incrementally Growing Datasets

Datasets are stored as Zarr archives and are extended continuously as new data becomes available. Rather than rewriting existing datasets, new data is appended as additional chunks. This enables efficient cloud-based storage and scalable analysis. The processing chain is orchestrated by an Airflow server.