The ML_Utilities package: Functions for data loading and normalisation

Module supporting ML experiments - mostly functions for making reanalysis data available to ML models. The idea is to abstract data provision away into this module, so experiments can concentrate on model building.

The module has two basic functions: prepare_data() takes reanalysis data and reformats it into input files suitable for TensorFlow, and dataset() turns those files into a tf.data.Dataset which can be passed to a model for training or testing.

The first step might be:

import datetime
import ML_Utilities

count = 1
for year in [1969, 1979, 1989, 1999, 2009]:

    start_day = datetime.datetime(year,  1,  1,  0)
    end_day   = datetime.datetime(year, 12, 31, 23)

    current_day = start_day
    while current_day <= end_day:
        purpose = 'training'
        if count % 10 == 0:
            purpose = 'test'  # keep 1/10 of the data for testing
        ML_Utilities.prepare_data(current_day,
                                  purpose=purpose,
                                  source='20CR2c',
                                  variable='prmsl')
        # Keep samples > 5 days apart to minimise correlations
        current_day = current_day + datetime.timedelta(hours=126)
        count += 1

That will produce TensorFlow files containing five years of 20CRv2c prmsl fields. Note that you will need to download the reanalysis data first - see the IRData package.
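
If the data are not already on disc, fetching them might look something like this sketch (it assumes IRData's twcr module and its fetch() call - check the IRData documentation for the exact API):

import datetime
import IRData.twcr as twcr

for year in [1969, 1979, 1989, 1999, 2009]:
    # Fetch a year's worth of 20CRv2c prmsl data (assumed IRData API)
    twcr.fetch('prmsl', datetime.datetime(year, 1, 1), version='2c')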

Then to use those data in an ML model, it’s:

import ML_Utilities

(training_data,n_data)=ML_Utilities.dataset(purpose='training',
                                            source='20CR2c',
                                            variable='prmsl')
(test_data,n_test)=ML_Utilities.dataset(purpose='test',
                                        source='20CR2c',
                                        variable='prmsl')
# model=(specify here)

model.fit(x=training_data,
          validation_data=test_data)

You may need to use the repeat and map functions on the Datasets to get a long enough dataset for many epochs of training, and to tweak the dataset to the requirements of your model (perhaps to set the array shape).
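For example, a sketch along these lines (the reshape target is an assumption - 20CRv2c fields are on a 91x180 grid, but check what your model expects):

import tensorflow as tf
import ML_Utilities

(training_data, n_data) = ML_Utilities.dataset(purpose='training',
                                               source='20CR2c',
                                               variable='prmsl')
# Repeat the data so there are enough samples for many epochs
training_data = training_data.repeat(100)
# Reshape each field to add a channel dimension (assumed 91x180 grid)
training_data = training_data.map(
    lambda field: tf.reshape(field, [91, 180, 1]))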

ML_Utilities.Dataset

alias of tensorflow.python.data.ops.dataset_ops.DatasetV1

ML_Utilities.dataset(purpose='training', source='20CR2c', variable='prmsl', length=None, shuffle=True, buffer_size=10, reshuffle_each_iteration=False)[source]

Provide a tf.data.Dataset of analysis data, for tf.keras model training or tests.

Data must be available in directory $SCRATCH/Machine-Learning-experiments, previously generated by prepare_data().

Parameters:
  • purpose (str) – ‘training’ (default) or ‘test’.
  • source (str) – Where to get the data from - any string, but needs to be supported by prepare_data().
  • variable (str) – Variable to fetch (e.g. ‘prmsl’).
  • length (int) – Required length - will be repeated enough times to deliver at least this many data points. If None (default) uses the amount of data on disc as the length (not repeated).
  • shuffle (bool) – If True (default), shuffle the data order. If False, present the data in the order of the files on disc.
  • buffer_size (int) – Passed to tf.data.Dataset.shuffle().
  • reshuffle_each_iteration (bool) – Passed to tf.data.Dataset.shuffle().
Returns:

Dataset suitable for passing to tf.keras models, and the length of that dataset.

Return type:

tuple of (tf.data.Dataset, int)

Raises:

ValueError – Necessary data not on disc; run prepare_data() to make it.
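
As a usage sketch, to guarantee a minimum number of samples (the figure of 25,000 is arbitrary, for illustration only):

import ML_Utilities

# Repeat the on-disc data until at least 25,000 samples are available,
# presented in on-disc file order (no shuffling)
(training_data, n_data) = ML_Utilities.dataset(purpose='training',
                                               source='20CR2c',
                                               variable='prmsl',
                                               length=25000,
                                               shuffle=False)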


ML_Utilities.get_normalise_function(source='20CR2c', variable='prmsl')[source]

Provide a normalisation function to scale a 20CR data field to mean=0 sd=1 (approximately).

Parameters:
  • source (str) – Where to get the data from - at the moment, only ‘20CR2c’ is supported.
  • variable (str) – Variable to use (e.g. ‘prmsl’).
Returns:

Function which, when called with a numpy array as its argument, returns a normalised version of that array.

Raises:

ValueError – Unsupported source or variable.
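
A minimal usage sketch (the field values are synthetic, just to show the call):

import numpy as np
import ML_Utilities

normalise = ML_Utilities.get_normalise_function(source='20CR2c',
                                                variable='prmsl')
field = np.full((91, 180), 101325.0)  # synthetic pressure field (Pa)
n_field = normalise(field)            # approximately mean=0, sd=1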

ML_Utilities.get_unnormalise_function(source='20CR2c', variable='prmsl')[source]

Provide an unnormalisation function to scale a 20CR data field back to its original units from the normalised representation with mean=0 sd=1 (approximately).

Parameters:
  • source (str) – Where to get the data from - at the moment, only ‘20CR2c’ is supported.
  • variable (str) – Variable to use (e.g. ‘prmsl’).
Returns:

Function which, when called with a (normalised) numpy array as its argument, returns an unnormalised version of that array.

Raises:

ValueError – Unsupported source or variable.
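
Used together with get_normalise_function(), this gives a round trip back to the original units (a sketch continuing the example above):

import ML_Utilities

unnormalise = ML_Utilities.get_unnormalise_function(source='20CR2c',
                                                    variable='prmsl')
recovered = unnormalise(n_field)  # back to the original units (Pa)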

ML_Utilities.prepare_data(date, purpose='training', source='20CR2c', variable='prmsl', member=1, normalise=None, opfile=None)[source]

Make tf.load-able files, suitably normalised, for training ML models.

Data will be stored in directory $SCRATCH/Machine-Learning-experiments.

Parameters:
  • date (datetime.datetime) – Datetime to get data for.
  • purpose (str) – ‘training’ (default) or ‘test’.
  • source (str) – Where to get the data from - at the moment, only ‘20CR2c’ is supported.
  • variable (str) – Variable to use (e.g. ‘prmsl’).
  • normalise (func) – Function to normalise the data (to mean=0, sd=1). Must take an iris.cube.Cube as argument and return a normalised cube as result. If None (default), a standard normalisation function is used (see get_normalise_function()).
Returns:

Nothing, but creates, as side effect, a tf.load-able file with the normalised data for the given source, variable, and date.

Raises:

ValueError – Unsupported source, or can’t load the original data, or normalisation failed.
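
To use a custom normalisation in place of the standard one, a sketch might look like this (the scaling constants are illustrative only; the function signature follows the normalise parameter description above):

import datetime
import ML_Utilities

def my_normalise(cube):
    # Crude fixed-constant scaling to roughly mean=0, sd=1
    # (constants are illustrative, not fitted to the data)
    cube.data = (cube.data - 101325.0) / 3000.0
    return cube

ML_Utilities.prepare_data(datetime.datetime(1969, 1, 1, 0),
                          purpose='training',
                          source='20CR2c',
                          variable='prmsl',
                          normalise=my_normalise)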