The ML_Utilities package: Functions for data loading and normalisation
Module supporting ML experiments: mostly functions for making reanalysis data available to ML models. The idea is to abstract the data provision away into this module, so experiments can concentrate on model building.

The module has two basic functions: the prepare_data() function takes reanalysis data and reformats it into input files suitable for TensorFlow, and the dataset() function turns those files into a tf.data.Dataset which can be passed to a model for training or testing.
The first step might be:
import datetime
import ML_Utilities

count = 1
for year in [1969, 1979, 1989, 1999, 2009]:
    start_day = datetime.datetime(year, 1, 1, 0)
    end_day = datetime.datetime(year, 12, 31, 23)
    current_day = start_day
    while current_day <= end_day:
        purpose = 'training'
        if count % 10 == 0:
            purpose = 'test'  # keep 1/10 of the data for testing
        ML_Utilities.prepare_data(current_day,
                                  purpose=purpose,
                                  source='20CR2c',
                                  variable='prmsl')
        # Keep samples > 5 days apart to minimise correlations
        current_day = current_day + datetime.timedelta(hours=126)
        count += 1
That will produce TensorFlow files containing five years of 20CRv2c prmsl fields (one field every 126 hours). Note that you will need to download the reanalysis data first; see the IRData package.
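Downloading is done with IRData. A minimal sketch, assuming IRData's twcr module and its fetch() interface (check the IRData documentation for the exact call):

import datetime
import IRData.twcr as twcr

# Fetch 20CRv2c prmsl data for each year used above (assumed interface)
for year in [1969, 1979, 1989, 1999, 2009]:
    twcr.fetch('prmsl', datetime.datetime(year, 1, 1), version='2c')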
Then to use those data in an ML model, it’s:
import ML_Utilities

(training_data, n_data) = ML_Utilities.dataset(purpose='training',
                                               source='20CR2c',
                                               variable='prmsl')
(test_data, n_test) = ML_Utilities.dataset(purpose='test',
                                           source='20CR2c',
                                           variable='prmsl')

# model = (specify here)

model.fit(x=training_data,
          validation_data=test_data)
You may need to use the repeat and map functions on the Datasets to get a long enough dataset for many epochs of training, and to tweak the dataset to the requirements of your model (perhaps to set the array shape).
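For example, a minimal sketch (the 91x180 field shape for 2-degree 20CRv2c data, and the assumption that each dataset element is a single 2d field, are assumptions, not part of the documented API):

import tensorflow as tf
import ML_Utilities

(training_data, n_data) = ML_Utilities.dataset(purpose='training',
                                               source='20CR2c',
                                               variable='prmsl')

n_epochs = 10
# Repeat the data so there is enough for n_epochs of training
training_data = training_data.repeat(n_epochs)
# Reshape each field to add the channel dimension that convolutional
# layers expect (assumed field shape)
training_data = training_data.map(
    lambda field: tf.reshape(field, [91, 180, 1]))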
- ML_Utilities.Dataset
  alias of tensorflow.python.data.ops.dataset_ops.DatasetV1
- ML_Utilities.dataset(purpose='training', source='20CR2c', variable='prmsl', length=None, shuffle=True, buffer_size=10, reshuffle_each_iteration=False)
  Provide a tf.data.Dataset of analysis data, for tf.keras model training or tests.
  Data must be available in directory $SCRATCH/Machine-Learning-experiments, previously generated by prepare_data().
  Parameters:
    - purpose (str) – 'training' (default) or 'test'.
    - source (str) – Where to get the data from: any string, but needs to be supported by prepare_data().
    - variable (str) – Variable to fetch (e.g. 'prmsl').
    - length (int) – Required length: the data will be repeated enough times to deliver at least this many data points. If None (default), uses the amount of data on disc as the length (not repeated).
    - shuffle (bool) – If True (default), shuffle the data order. If False, present the data in the order of the files on disc.
    - buffer_size (int) – Passed to tf.data.Dataset.shuffle().
    - reshuffle_each_iteration (bool) – Passed to tf.data.Dataset.shuffle().
  Returns: Dataset suitable for passing to tf.keras models, and int: the length of that dataset.
  Return type: tuple of (tf.data.Dataset, int)
  Raises: ValueError – Necessary data not on disc; run prepare_data() to make it.
- ML_Utilities.get_normalise_function(source='20CR2c', variable='prmsl')
  Provide a normalisation function to scale a 20CR data field to mean=0, sd=1 (approximately).
  Raises: ValueError – Unsupported source or variable.
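A minimal sketch of use, assuming the returned function takes an iris.cube.Cube and returns a normalised cube (as prepare_data()'s normalise parameter requires), and assuming IRData's twcr.load() interface:

import datetime
import IRData.twcr as twcr
import ML_Utilities

field = twcr.load('prmsl', datetime.datetime(1969, 3, 12, 6),
                  version='2c')  # assumed IRData interface
normalise = ML_Utilities.get_normalise_function(source='20CR2c',
                                                variable='prmsl')
nfield = normalise(field)  # mean ~0, sd ~1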
- ML_Utilities.get_unnormalise_function(source='20CR2c', variable='prmsl')
  Provide an unnormalisation function to scale a 20CR data field back to its original units from the normalised representation with mean=0, sd=1 (approximately).
  Raises: ValueError – Unsupported source or variable.
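Continuing the sketch above, and assuming the unnormalise function inverts the normalise function:

import ML_Utilities

unnormalise = ML_Utilities.get_unnormalise_function(source='20CR2c',
                                                    variable='prmsl')
restored = unnormalise(nfield)  # back to original units (Pa for prmsl)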
- ML_Utilities.glob(pathname, *, recursive=False)
  Return a list of paths matching a pathname pattern.
  The pattern may contain simple shell-style wildcards a la fnmatch. However, unlike fnmatch, filenames starting with a dot are special cases that are not matched by '*' and '?' patterns.
  If recursive is true, the pattern '**' will match any files and zero or more directories and subdirectories.
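A minimal sketch: listing previously prepared files. This behaves like the standard-library glob; the directory layout under $SCRATCH is an assumption:

import os
import ML_Utilities

pattern = os.path.join(os.environ['SCRATCH'],
                       'Machine-Learning-experiments', '**', '*')
files = ML_Utilities.glob(pattern, recursive=True)  # everything below that directory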
- ML_Utilities.prepare_data(date, purpose='training', source='20CR2c', variable='prmsl', member=1, normalise=None, opfile=None)
  Make tf.load-able files, suitably normalised for training ML models.
  Data will be stored in directory $SCRATCH/Machine-Learning-experiments.
  Parameters:
    - date (datetime.datetime) – datetime to get data for.
    - purpose (str) – 'training' (default) or 'test'.
    - source (str) – Where to get the data from; at the moment, only '20CR2c' is supported.
    - variable (str) – Variable to use (e.g. 'prmsl').
    - normalise (func) – Function to normalise the data (to mean=0, sd=1). Must take an iris.cube.Cube as argument and return a normalised cube as result. If None (default), use a standard normalisation function (see get_normalise_function()).
  Returns: Nothing, but creates, as a side effect, a tf.load-able file with the normalised data for the given source, variable, and date.
  Raises: ValueError – Unsupported source, or can't load the original data, or normalisation failed.
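For example, a minimal sketch passing a custom normalisation function, using only the documented signature (the scale factors below are illustrative, not the package's standard values):

import datetime
import ML_Utilities

def my_normalise(cube):
    # Scale prmsl (Pa) to approximately mean=0, sd=1 (illustrative values)
    cube.data = (cube.data - 101325.0) / 3000.0
    return cube

ML_Utilities.prepare_data(datetime.datetime(1969, 3, 12, 6),
                          purpose='training',
                          source='20CR2c',
                          variable='prmsl',
                          normalise=my_normalise)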