4. ESDC Generation

This section explains how an ESDC is generated and how it can be extended with new variables.

4.1. Command-Line Tool

To generate new data cubes or to update existing ones, use the dedicated command-line tool cube-gen.

After installing cablab-core as described in section Installation, try:

$ cube-gen --help

CAB-LAB command-line interface, version 0.2.1rc0+1
usage: cube-gen [-h] [-l] [-G] [-c CONFIG] [TARGET] [SOURCE [SOURCE ...]]

Generates a new CAB-LAB data cube or updates an existing one.

positional arguments:
  TARGET                data cube root directory
  SOURCE                <provider name>:dir=<directory>, use -l to list source
                        provider names

optional arguments:
  -h, --help            show this help message and exit
  -l, --list            list all available source providers
  -G, --dont-clear-cache
                        do not clear data cache before updating the cube
                        (faster)
  -c CONFIG, --cube-conf CONFIG
                        data cube configuration file

The list option lists all currently installed source data providers:

$ cube-gen --list

ozone -> cablab.providers.ozone.OzoneProvider
net_ecosystem_exchange -> cablab.providers.mpi_bgc.MPIBGCProvider
air_temperature -> cablab.providers.air_temperature.AirTemperatureProvider
interception_loss -> cablab.providers.gleam.GleamProvider
transpiration -> cablab.providers.gleam.GleamProvider
open_water_evaporation -> cablab.providers.gleam.GleamProvider
...

Source data providers are the pluggable software components used by cube-gen to read data from a source directory and transform it into a common data cube structure. The list above shows the mapping from short names to be used by the cube-gen command-line to the actual Python code, e.g. for ozone, the OzoneProvider class of the cablab/providers/ozone.py module is used.

The common cube structure is established by a cube configuration file provided via the -c (--cube-conf) option. Here is the configuration file that is used to produce the low-resolution ESDC. It produces a global cube with a spatial resolution of 0.25 degrees: the source data are first aggregated or interpolated to match 8-day periods and then resampled to match a spatial grid of 1440 x 720 cells:

model_version = '0.2.4'
spatial_res = 0.25
temporal_res = 8
grid_width = 1440
grid_height = 720
start_time = datetime.datetime(2001, 1, 1, 0, 0)
end_time = datetime.datetime(2012, 1, 1, 0, 0)
ref_time = datetime.datetime(2001, 1, 1, 0, 0)
calendar = 'gregorian'
file_format = 'NETCDF4_CLASSIC'
compression = False
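For a global cube, the grid dimensions follow directly from the spatial resolution: 360 / 0.25 = 1440 columns and 180 / 0.25 = 720 rows. The following is a minimal sketch of that consistency check. The helper global_grid_size is hypothetical and not part of cablab.

```python
def global_grid_size(spatial_res):
    """Return (grid_width, grid_height) of a global lat/lon grid.

    Hypothetical helper for illustration only; not part of cablab.
    """
    grid_width = round(360.0 / spatial_res)   # cells along longitude
    grid_height = round(180.0 / spatial_res)  # cells along latitude
    return grid_width, grid_height


# Values taken from the configuration file above.
spatial_res = 0.25
grid_width, grid_height = global_grid_size(spatial_res)
print(grid_width, grid_height)
```

If grid_width or grid_height in a configuration file differ from these values, the cube presumably covers only part of the globe or the resolution was mistyped.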

To create or update a cube, call the cube-gen tool with the configuration and the cube data provider(s). The cube data providers can have parameters of their own. All current providers have the dir parameter indicating the source data directory, but this is not a rule. Providers which read from multivariate sources also have a var parameter to indicate which of the available variables should be used.

$ cube-gen mycube -c mycube.config ozone:dir=/path/to/ozone/netcdfs

This creates the cube mycube in the current directory using the mycube.config configuration and adds a single variable, ozone, from the source NetCDF files in /path/to/ozone/netcdfs.
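To make the SOURCE argument format concrete, here is a small sketch that splits a string of the documented form <provider name>:dir=<directory> into the provider name and its parameter. The function parse_source_spec is hypothetical and handles only a single key=value parameter; it is not the parser used by cube-gen.

```python
def parse_source_spec(spec):
    """Split '<provider name>:<key>=<value>' into (name, {key: value}).

    Hypothetical helper for illustration; cube-gen's own parsing may differ.
    """
    name, _, param = spec.partition(':')   # split at the first colon
    key, _, value = param.partition('=')   # split at the first equals sign
    if not name or not key or not value:
        raise ValueError('expected <provider name>:<key>=<value>, got %r' % spec)
    return name, {key: value}


print(parse_source_spec('ozone:dir=/path/to/ozone/netcdfs'))
```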

Note that the GitHub repository cube-config is used to keep the configurations of individual ESDC versions.

4.2. Writing a new Provider

In order to add new source data for which there is no source data provider yet, you can write your own.

Make sure cablab-core is installed as described in section Installation above.

If your source data is NetCDF, writing a new provider is easy. Just copy one of the existing providers, e.g. cablab/providers/ozone.py, and start adapting the code to your needs.

For source data other than NetCDF, you will have to write a provider from scratch by implementing the cablab.CubeSourceProvider interface or by extending the cablab.BaseCubeSourceProvider which is usually easier. Make sure you adhere to the contract described in the documentation of the respective class.

To run your provider you will have to register it in the setup.py file. Assuming your provider is called sst and your provider class is SeaSurfaceTemperatureProvider located in myproviders.py, then the entry_points section of the setup.py file should reflect this as follows:

entry_points={
    'cablab.source_providers': [
        'burnt_area = cablab.providers.burnt_area:BurntAreaProvider',
        'c_emissions = cablab.providers.c_emissions:CEmissionsProvider',
        'ozone = cablab.providers.ozone:OzoneProvider',
        ...
        'sst = myproviders:SeaSurfaceTemperatureProvider',
    ],
},

To run it:

$ cube-gen mycube -c mycube.config sst:dir=/path/to/sst/netcdfs

4.3. Sharing a Provider

If you plan to distribute and share your provider, you should create your own Python module separate from cablab-core, with a dedicated setup.py that lists only your providers in the entry_points section. Other users may then install your module on top of a cablab-core installation to make use of your plugin.
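As a sketch of what such a dedicated setup.py could contain, the keyword arguments are shown here as a plain dictionary. The distribution name myproviders and the version number are hypothetical; the entry point group name and the sst entry are taken from the section above.

```python
# Keyword arguments for a stand-alone setup.py (illustrative values only).
SETUP_KWARGS = {
    'name': 'myproviders',           # hypothetical distribution name
    'version': '0.1.0',              # hypothetical version
    'py_modules': ['myproviders'],   # the module containing the provider class
    'entry_points': {
        # Only your own providers are listed here, not those of cablab-core.
        'cablab.source_providers': [
            'sst = myproviders:SeaSurfaceTemperatureProvider',
        ],
    },
}

# In the actual setup.py these arguments would be passed to setuptools:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
print(SETUP_KWARGS['entry_points'])
```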

4.4. Python Cube API Reference

Data Cube read-only access:

from cablab import Cube
from datetime import datetime
cube = Cube.open('./cablab-cube-v05')
data = cube.data.get(['LAI', 'Precip'], [datetime(2001, 6, 1), datetime(2012, 1, 1)], 53.2, 12.8)

Data Cube creation/update:

from cablab import Cube, CubeConfig
from datetime import datetime
cube = Cube.create('./my-cablab-cube', CubeConfig(spatial_res=0.05))
cube.update(MyVar1SourceProvider(cube.config, './my-cube-sources/var1'))
cube.update(MyVar2SourceProvider(cube.config, './my-cube-sources/var2'))
class cablab.BaseCubeSourceProvider(cube_config: cablab.cube_config.CubeConfig, name: str)

A partial implementation of the CubeSourceProvider interface that computes its output image data using weighted averages. The weights are computed according to the overlap of source time ranges and a requested target time range.

Parameters:
  • cube_config – Specifies the fixed layout and conventions used for the cube.
  • name – The provider’s registration name.
compute_source_time_ranges() → list

Return a sorted list of all time ranges of every source file. Items in this list must be 4-element tuples of the form (time_start: datetime, time_stop: datetime, file: str, time_index: int). The method is called from the prepare() method in order to pre-compute all available time ranges. This method must be implemented by derived classes.
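The shape of the expected return value can be illustrated with a short sketch. It assumes, purely for illustration, that every source file holds exactly one time step (so time_index is always 0) and covers a fixed number of days. The function build_source_time_ranges and the file names are hypothetical.

```python
from datetime import datetime, timedelta


def build_source_time_ranges(file_starts, days_per_file):
    """Build a sorted list of (time_start, time_stop, file, time_index) tuples.

    file_starts maps a file name to the start time of its single time step.
    Hypothetical helper; a real provider would read the times from its files.
    """
    ranges = []
    for file, time_start in file_starts.items():
        time_stop = time_start + timedelta(days=days_per_file)
        ranges.append((time_start, time_stop, file, 0))
    # Sort by start time so that list indices follow the time axis.
    return sorted(ranges, key=lambda r: r[0])


ranges = build_source_time_ranges(
    {'b.nc': datetime(2001, 1, 9), 'a.nc': datetime(2001, 1, 1)},
    days_per_file=8,
)
for time_range in ranges:
    print(time_range)
```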

compute_variable_images(period_start: datetime.datetime, period_end: datetime.datetime)

For each source time range that has an overlap with the given target time range compute a weight according to the overlapping range. Pass these weights as source index to weight mapping to compute_variable_images_from_sources(index_to_weight) and return the result.

Returns:A dictionary variable name –> image. Each image must be a numpy array-like object of shape (grid_height, grid_width) as given by the CubeConfig. Return None if no such variables exist for the given target time range.
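The overlap-based weighting can be sketched as follows. This example assumes that a weight is the fraction of the target period covered by a source time range; the exact formula used by cablab is not stated here and may differ. The function overlap_weights is hypothetical.

```python
from datetime import datetime


def overlap_weights(source_time_ranges, period_start, period_end):
    """Map source indices to weights for the period [period_start, period_end).

    Assumption: weight = overlap duration / target period duration.
    """
    period_seconds = (period_end - period_start).total_seconds()
    index_to_weight = {}
    for index, (t1, t2, _file, _time_index) in enumerate(source_time_ranges):
        # Length of the intersection of [t1, t2) with the target period.
        overlap = (min(t2, period_end) - max(t1, period_start)).total_seconds()
        if overlap > 0:
            index_to_weight[index] = overlap / period_seconds
    return index_to_weight


# Three monthly sources and one 8-day target period spanning a month boundary.
sources = [
    (datetime(2001, 1, 1), datetime(2001, 2, 1), 'jan.nc', 0),
    (datetime(2001, 2, 1), datetime(2001, 3, 1), 'feb.nc', 0),
    (datetime(2001, 3, 1), datetime(2001, 4, 1), 'mar.nc', 0),
]
weights = overlap_weights(sources, datetime(2001, 1, 25), datetime(2001, 2, 2))
print(weights)  # {0: 0.875, 1: 0.125}
```

Here seven of the eight target days fall into January and one into February; the March source does not overlap and receives no entry.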
compute_variable_images_from_sources(index_to_weight: Dict[int, float])

Compute the target images for all variables from the sources with the given time indices to weights mapping.

The time indices in index_to_weight are guaranteed to point into the time ranges list returned by compute_source_time_ranges().

The weight values in index_to_weight are float values computed from the overlap of source time ranges with a requested target time range.

Parameters:index_to_weight – A dictionary mapping time indexes –> weight values.
Returns:A dictionary variable name –> image. Each image must be a numpy array-like object of shape (grid_height, grid_width) as specified by the cube’s layout configuration CubeConfig. Return None if no such variables exist for the given target time range.
log(message: str)

Log a message.

Parameters:message – The message to be logged.
prepare()

Calls compute_source_time_ranges and assigns the return value to the field source_time_ranges.

spatial_coverage

Return the spatial grid coverage given in the Cube’s configuration (default).

Returns:A tuple of integers (x, y, width, height) in the cube’s image coordinates.
temporal_coverage

Return the temporal coverage derived from the value returned by compute_source_time_ranges().

class cablab.BaseStaticCubeSourceProvider(cube_config: cablab.cube_config.CubeConfig, name: str)

A CubeSourceProvider that
  • uses a NetCDF source dataset read from a given dir_path;
  • performs only spatial resampling.

Parameters:
  • cube_config – Specifies the fixed layout and conventions used for the cube.
  • name – The provider’s registration name.
close()

Does nothing. Override to implement any required close operation.

close_dataset(dataset: object)

Close dataset.

Parameters:dataset – The dataset returned by open_dataset().

get_dataset_file_path(dataset: object) → str

Get the file path for dataset.

Parameters:dataset – The dataset returned by open_dataset().
Returns:A file path.

get_dataset_image(dataset: object, name: str)

Get a 2D image for dataset for the given variable name.

Parameters:
  • dataset – The dataset returned by open_dataset().
  • name – The variable name.
Returns:A 2D image.

open_dataset() → object

Open the single dataset and return its representation.

Returns:A dataset object.

prepare()

Clear the flag that indicates that the static sources have been processed.

spatial_coverage

Return the spatial grid coverage given in the Cube’s configuration (default).

Returns:A tuple of integers (x, y, width, height) in the cube’s image coordinates.
temporal_coverage

Return the temporal coverage derived from the value returned by compute_source_time_ranges().

transform_source_image(source_image)

Does nothing but return the source image. Override to implement transformations if needed.

Parameters:source_image – A 2D image.
Returns:source_image

class cablab.NetCDFCubeSourceProvider(cube_config: cablab.cube_config.CubeConfig, name: str, dir_path: str, resampling_order: str)

A BaseCubeSourceProvider that
  • uses NetCDF source datasets read from a given dir_path;
  • performs temporal aggregation first and then spatial resampling.

Parameters:
  • cube_config – Specifies the fixed layout and conventions used for the cube.
  • name – The provider’s registration name.
  • dir_path – Source directory to read the files from. If relative path, it will be resolved against the cube_sources_root path of the global CAB-LAB configuration (cablab.util.Config.instance()).
  • resampling_order – The order in which resampling is performed. One of ‘time_first’, ‘space_first’.
close_unused_open_files(index_to_weight)

Close all datasets that won’t be used anymore w.r.t. the given index_to_weight dictionary passed to the compute_variable_images_from_sources() method.

Parameters:index_to_weight – A dictionary mapping time indexes –> weight values.
Returns:The set of time indexes into currently active files w.r.t. the given index_to_weight parameter.
transform_source_image(source_image)

Returns the source image. Override to implement transformations if needed.

Parameters:source_image – A 2D image.
Returns:source_image

class cablab.NetCDFStaticCubeSourceProvider(cube_config: cablab.cube_config.CubeConfig, name: str, dir_path: str)

A CubeSourceProvider that
  • uses a NetCDF source dataset read from a given dir_path;
  • performs only spatial resampling.

Parameters:
  • cube_config – Specifies the fixed layout and conventions used for the cube.
  • name – The provider’s registration name.
  • dir_path – Source directory to read the single file from. If relative path, it will be resolved against the cube_sources_root path of the global CAB-LAB configuration (cablab.util.Config.instance()).
class cablab.TestCubeSourceProvider(cube_config: cablab.cube_config.CubeConfig, name: str='test', var: str='test')

CubeSourceProvider implementation used for testing cube generation without any source files.

The following usage generates a cube with two variables test_1 and test_2:
cube-gen -c ./myconf.py ./mycube test:var=test_1 test:var=test_2
Parameters:
  • cube_config – Specifies the fixed layout and conventions used for the cube.
  • name – The provider’s registration name. Defaults to "test".
  • var – Name of a (float32) variable which will be filled with random numbers.
class cablab.Cube(base_dir, config)

Represents a data cube. Use the static open() or create() methods to obtain data cube objects.

base_dir

The cube’s base directory.

close()

Closes the data cube.

closed

Checks if the cube has been closed.

config

The cube’s configuration. See CubeConfig class.

static create(base_dir, config=CubeConfig(spatial_res=0.250000, grid_x0=0, grid_y0=0, grid_width=1440, grid_height=720, temporal_res=8, ref_time=datetime.datetime(2001, 1, 1, 0, 0)))

Create a new data cube. Use the Cube.update(provider) method to add data to the cube via a source data provider.

Parameters:
  • base_dir – The data cube’s base directory. Must not exist.
  • config – The data cube’s static information.
Returns:

A cube instance.

data

The cube’s data which is an instance of the CubeDataAccess class.

info() → str

Return a human-readable information string about this data cube (markdown formatted).

static open(base_dir)

Open an existing data cube. Use the Cube.update(provider) method to add data to the cube via a source data provider.

Parameters:base_dir – The data cube’s base directory, which must exist.
Returns:A cube instance.
update(provider: cablab.cube_provider.CubeSourceProvider)

Updates the data cube with source data from the given source provider.

Parameters:provider – An instance of the abstract CubeSourceProvider class.
class cablab.CubeConfig(spatial_res=0.25, grid_x0=0, grid_y0=0, grid_width=1440, grid_height=720, temporal_res=8, calendar='gregorian', ref_time=datetime.datetime(2001, 1, 1, 0, 0), start_time=datetime.datetime(2001, 1, 1, 0, 0), end_time=datetime.datetime(2012, 1, 1, 0, 0), variables=None, file_format='NETCDF4_CLASSIC', compression=False, chunk_sizes=None, static_data=False, model_version='1.0.0')

A data cube’s static configuration information.

Parameters:
  • spatial_res – The spatial image resolution in degrees.
  • grid_x0 – The fixed grid X offset (longitude direction).
  • grid_y0 – The fixed grid Y offset (latitude direction).
  • grid_width – The fixed grid width in pixels (longitude direction).
  • grid_height – The fixed grid height in pixels (latitude direction).
  • temporal_res – The temporal resolution in days.
  • ref_time – A datetime value which defines the units in which time values are given, namely ‘days since ref_time’.
  • start_time – The inclusive start time of the first image of any variable in the cube given as datetime value. None means unlimited.
  • end_time – The exclusive end time of the last image of any variable in the cube given as datetime value. None means unlimited.
  • variables – A list of variable names to be included in the cube.
  • file_format – The file format used. Must be one of ‘NETCDF4’, ‘NETCDF4_CLASSIC’, ‘NETCDF3_CLASSIC’ or ‘NETCDF3_64BIT’.
  • compression – Whether the data should be compressed.
date2num(date) → float

Return the number of days for the given date as a number in the time units given by the time_units property.

Parameters:date – The date as a datetime.datetime value
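The conversion to ‘days since ref_time’ can be illustrated with plain datetime arithmetic. This sketch ignores the configured calendar and is not the cablab implementation; the function days_since_ref is hypothetical.

```python
from datetime import datetime

# Reference time taken from the default cube configuration.
REF_TIME = datetime(2001, 1, 1, 0, 0)


def days_since_ref(date, ref_time=REF_TIME):
    """Return the (fractional) number of days between ref_time and date."""
    return (date - ref_time).total_seconds() / 86400.0


print(days_since_ref(datetime(2001, 1, 9)))      # 8.0
print(days_since_ref(datetime(2001, 1, 1, 12)))  # 0.5
```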
easting

The longitude position of the upper-left-most corner of the upper-left-most grid cell given by (grid_x0, grid_y0).

geo_bounds

The geographical boundary given as ((LL-lon, LL-lat), (UR-lon, UR-lat)).
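How the boundary relates to the grid parameters can be sketched as follows. The sketch assumes that grid cell (0, 0) has its upper-left corner at longitude -180 and latitude +90 and that grid_y0 counts downward from the north; these conventions are assumptions, and the function geo_bounds is hypothetical.

```python
def geo_bounds(spatial_res, grid_x0, grid_y0, grid_width, grid_height):
    """Return ((LL-lon, LL-lat), (UR-lon, UR-lat)) for the given grid layout.

    Assumption: cell (0, 0) starts at lon=-180, lat=+90 (upper-left corner).
    """
    lon_left = -180.0 + grid_x0 * spatial_res   # longitude of upper-left corner
    lat_top = 90.0 - grid_y0 * spatial_res      # latitude of upper-left corner
    lower_left = (lon_left, lat_top - grid_height * spatial_res)
    upper_right = (lon_left + grid_width * spatial_res, lat_top)
    return lower_left, upper_right


# Default configuration: a global 0.25 degree grid.
print(geo_bounds(0.25, 0, 0, 1440, 720))  # ((-180.0, -90.0), (180.0, 90.0))
```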

static load(path) → object

Load a CubeConfig from a text file.

Parameters:path – The file’s path name.
Returns:A new CubeConfig instance
northing

The latitude position of the upper-left-most corner of the upper-left-most grid cell given by (grid_x0, grid_y0).

num_periods_per_year

Return the integer number of target periods per year.
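As an illustration, assume that target periods restart on 1 January of each year and that a partial last period counts as a full one; neither assumption is confirmed by this reference. Under these assumptions a temporal resolution of 8 days gives 46 periods per year. Both functions below are hypothetical.

```python
import math
from datetime import datetime, timedelta


def num_periods_per_year(temporal_res):
    """Number of periods of temporal_res days needed to cover 365 days."""
    return math.ceil(365 / temporal_res)


def period_starts(year, temporal_res):
    """Start times of all periods in the given year (assumed yearly restart)."""
    first = datetime(year, 1, 1)
    count = num_periods_per_year(temporal_res)
    return [first + timedelta(days=i * temporal_res) for i in range(count)]


starts = period_starts(2001, 8)
print(num_periods_per_year(8))  # 46
print(starts[-1])               # 2001-12-27 00:00:00
```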

store(path)

Store a CubeConfig in a text file.

Parameters:path – The file’s path name.
time_units

Return the time units used by the data cube as a string using the format ‘days since ref_time’.

class cablab.CubeSourceProvider(cube_config: cablab.cube_config.CubeConfig, name: str)

An abstract interface for objects representing data source providers for the data cube. Cube source providers are passed to the Cube.update() method.

Parameters:
  • cube_config – Specifies the fixed layout and conventions used for the cube.
  • name – The provider’s registration name.
close()

Called by the cube’s update() method after all images have been retrieved and the provider is no longer used.

compute_variable_images(period_start: datetime.datetime, period_end: datetime.datetime) → Dict[str, numpy.ndarray]

Return a mapping from variable names to variable images for all provided variables. Each image is a numpy array with the shape (height, width) derived from the spatial_coverage property.

The images must be computed (by aggregation or interpolation or copy) from the source data in the given time period period_start <= source_data_time < period_end and taking into account other data cube configuration settings.

The method is called by a Cube instance’s update() method for all possible time periods in the time range given by the temporal_coverage property. The times given are adjusted w.r.t. the cube’s reference time and temporal resolution.

Parameters:
  • period_start – The period start time as a datetime.datetime instance
  • period_end – The period end time as a datetime.datetime instance
Returns:

A dictionary variable name –> image. Each image must be a numpy array-like object of shape (grid_height, grid_width) as given by the CubeConfig. Return None if no such variables exist for the given target time range.

cube_config

The data cube’s configuration.

name

The provider’s registration name.

prepare()

Called by a Cube instance’s update() method before any other provider methods are called. Provider instances should prepare themselves w.r.t. the given cube configuration cube_config.

spatial_coverage

Return the spatial coverage as a rectangle represented by a tuple of integers (x, y, width, height) in the cube’s image coordinates.

Returns:A tuple of integers (x, y, width, height) in the cube’s image coordinates.
temporal_coverage

Return the start and end time of the available source data.

Returns:A tuple of datetime.datetime instances (start_time, end_time).
variable_descriptors

Return a dictionary which maps target(!) variable names to a dictionary of target attribute values. The following attributes have a special meaning and shall or should be provided:

  • data_type: A numpy data type. Mandatory attribute.
  • fill_value: The value used for indicating missing grid cells. Mandatory attribute.
  • source_name: the name of the variable in the source (files).
    Optional, defaults to the target variable’s name.
  • scale_factor: See CF conventions. Optional, defaults to one (1.0).
  • add_offset: See CF conventions. Optional, defaults to zero (0.0).
  • units: See CF conventions. Optional.
  • standard_name: See CF conventions. Optional.
  • long_name: See CF conventions. Optional.
Returns:A dictionary of variable names to attribute dictionaries.