Process benchmark data into tensors for model building
The data to be modelled are taken from Auto-transcription benchmark 2: Fake data and consist of 10,000 PNG images and 10,000 Python pickle files, each pickle file containing an array of 360 digits (0-9). To model these with TensorFlow we need to convert both the images and the pickled arrays into tensors, and it is much more efficient to do this in a preprocessing step than in-line during model training.
So we need three scripts: one to convert a PNG image file into a tensor file, one to convert a pickled array into a tensor file, and one to run each of these 10,000 times, once for each case in the benchmark dataset. The conversion steps are sketched below.
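Here is a minimal sketch of the two conversions and a driver loop. The file names, directory layout and dtypes are assumptions for illustration, not the benchmark's actual conventions.

```python
# Sketch of the preprocessing step: PNGs and pickled arrays -> serialised tensor files.
# Paths and naming scheme below are hypothetical.
import pickle
import tensorflow as tf

def png_to_tensor_file(png_path, out_path):
    """Decode a PNG into a tensor and write it out serialised."""
    img = tf.io.decode_png(tf.io.read_file(png_path), channels=1)  # uint8, H x W x 1
    img = tf.image.convert_image_dtype(img, tf.float32)            # rescale to [0, 1]
    tf.io.write_file(out_path, tf.io.serialize_tensor(img))

def pickle_to_tensor_file(pkl_path, out_path):
    """Load a pickled array of 360 digits (0-9) and write it out as a serialised tensor."""
    with open(pkl_path, "rb") as f:
        digits = pickle.load(f)
    labels = tf.convert_to_tensor(digits, dtype=tf.int32)
    tf.io.write_file(out_path, tf.io.serialize_tensor(labels))

# Driver: run both conversions once for each of the 10,000 benchmark cases.
for i in range(10_000):
    png_to_tensor_file(f"benchmark/{i:05d}.png", f"tensors/{i:05d}_image.tensor")
    pickle_to_tensor_file(f"benchmark/{i:05d}.pkl", f"tensors/{i:05d}_labels.tensor")
```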
The resulting serialised tensor files will take about 100 GB of disc space (tensors are a space-inefficient way to store images).
Then, to present all those tensors to an ML model for training, we need to package them as a TensorFlow Dataset:
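One possible way to do this, assuming the file layout sketched above, is to pair up the serialised image and label files and parse them back into tensors with `tf.io.parse_tensor` (which needs the dtype used when the tensors were written):

```python
# Sketch of packaging the serialised tensor files as a tf.data.Dataset.
# Glob patterns and batch size are assumptions for illustration.
import tensorflow as tf

image_files = tf.data.Dataset.list_files("tensors/*_image.tensor", shuffle=False)
label_files = tf.data.Dataset.list_files("tensors/*_labels.tensor", shuffle=False)

def load_pair(img_path, lbl_path):
    """Read one serialised image tensor and its matching label tensor."""
    image = tf.io.parse_tensor(tf.io.read_file(img_path), out_type=tf.float32)
    labels = tf.io.parse_tensor(tf.io.read_file(lbl_path), out_type=tf.int32)
    return image, labels

dataset = (tf.data.Dataset.zip((image_files, label_files))
           .map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
```

With `shuffle=False`, `list_files` returns the file names in a deterministic order, so the image and label streams stay aligned before any shuffling is applied for training.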