Machine Learning for Data Assimilation¶
Reanalysis is awesome, but it’s very slow and very expensive. We can make it dramatically easier and cheaper with Machine Learning.

Introduction¶
We would like to know the weather everywhere in the world, for every hour in the last 100 years at least. But for most times, and most places, we have no observations of the weather. So we need to use the observations we do have as efficiently as possible - we need to make each observation inform our estimates of the weather in places remote from the observation. A powerful technique for this is Data Assimilation (DA), which starts from a model of the weather and uses observations to constrain the state of the model. Using DA with General Circulation Models (GCMs) has been enormously successful, providing precise and accurate estimates of global weather, operational weather forecasts, comprehensive modern reanalyses such as ERA5, and long sparse-observation reanalyses such as the Twentieth Century Reanalysis (20CR). But GCMs are complex to use and expensive to run. Reanalysis projects require specialist expertise and enormous quantities of supercomputer time, so this technology, despite its power, is not very widely used. We already know that we can use Machine Learning (ML) to make fast approximations to a GCM - can we extend this to do DA as well?
Here I show that you can use a Variational AutoEncoder (VAE) to build a fast deep generative model linking physically-plausible weather fields to a complete, continuous, low-dimensional latent space. Data Assimilation can then be done by searching the latent space for the state that maximises the fit between the linked field and the observations. The DA process takes about 1 minute on a standard laptop.
Finding an atmospheric state that matches observations¶
Suppose we have a representation of the atmospheric state $x$, and a vector of observations $y$. We want to find $p(x|y)$: the probability distribution of the atmospheric state given the observations.

From Bayes' theorem:

$$p(x|y) \propto p(y|x)\,p(x)$$

where $p(y|x)$ is the likelihood of the observations given a state, and $p(x)$ is the prior probability of that state.

Start by assuming that the observation errors are independent and Gaussian, with standard deviation $\sigma_o$:

$$p(y|x) \propto \exp\left(-\frac{|y - H(x)|^2}{2\sigma_o^2}\right)$$

To calculate $p(y|x)$ we then need only an observation operator $H$, mapping a model state to the values that would be observed in that state.

We also need a prior estimate $p(x)$: the probability of any given atmospheric state, before we look at the observations. For this we need a model of the atmosphere.
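To make this concrete, here is a minimal sketch (Python/NumPy; the function name, arguments, and shapes are illustrative, not taken from the original code) of the quantity we want to maximise:

```python
# Unnormalised log-posterior under the Gaussian assumptions above:
# log p(x|y) = log p(y|x) + log p(x) + constant
import numpy as np

def log_posterior(x, y, H, sigma_o, log_prior):
    """x: candidate atmospheric state (a gridded field)
    y: vector of observed values
    H: observation operator, mapping x into observation space
    sigma_o: assumed observation-error standard deviation
    log_prior: function returning log p(x) (the hard part - see below)
    """
    residual = y - H(x)                                   # misfit to observations
    log_likelihood = -np.sum(residual ** 2) / (2 * sigma_o ** 2)
    return log_likelihood + log_prior(x)
```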
To be useful for this job, the model must have three properties:

- It will produce precise and accurate estimates of the atmospheric state $x$. In practice this means that its error, compared with the true state, is small relative to alternative models.
- It will be fast, so it's cheap to generate samples of $x$.
- It will make it cheap and easy to estimate $p(x)$ - the prior probability of the atmospheric state.
Modern GCMs score very highly on the first of these points - they make very precise and accurate estimates of the atmospheric state vector - but nothing about them is cheap, or easy. On the other hand, simple statistical models can make fast samples of $x$, with an easy-to-estimate prior $p(x)$, but their states are not precise or accurate enough to be useful.
Modern Machine Learning (ML) offers us the chance to square this circle: simple, fast, cheap statistical models that approach the precision and accuracy of a GCM.
A concrete example: 2-metre air temperature¶
As a concrete example, we will use not the whole weather state vector, but just the 2-metre air temperature anomaly as our state $x$. For observations $y$, we will use point measurements of the same quantity:

$$y = \{T_1, T_2, \ldots, T_n\}$$

Or

$$y = H(x)$$

Where each $T_i$ is the temperature anomaly observed at one station location.

This example has the virtue that the observation operator $H$ is trivial: the value observed at each station is just the value of the temperature anomaly field at the station's location.
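For this example the observation operator is simple enough to sketch in a couple of lines (Python; the grid layout and index arrays are assumptions for illustration):

```python
# H for point observations: each observed value is just the field value
# at the grid cell containing the station.
import numpy as np

def H(field, station_rows, station_cols):
    """field: 2D array of T2m anomalies on a lat/lon grid.
    station_rows, station_cols: nearest grid indices, one pair per station."""
    return field[station_rows, station_cols]

# Example: three stations on a 180x360 (1-degree) grid
field = np.zeros((180, 360))
obs = H(field, np.array([10, 90, 150]), np.array([30, 180, 300]))
```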
Generative model, and latent space¶
Traditionally, we would use a GCM to generate plausible temperature fields ($x$), but a GCM fails two of our three requirements: it is neither fast nor cheap. We need a different sort of model.

So let's say that the model state vector $x$ is a function of some other, much smaller, vector $z$:

$$x = G(z)$$

This function is called a generator, and the inputs $z$ form its latent space: every point in the latent space maps to an atmospheric state.

This function is specified to meet the three requirements above (good quality $x$, speed, and a cheap, easy prior): if we construct $G$ so that the latent vectors $z$ have a known, simple distribution (a multivariate unit normal, say), then generating samples of $x$ and estimating $p(x)$ are both trivial, and if $G$ is a neural network it will also be fast.
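With such a generator, sampling states and evaluating the prior are both one-liners. A sketch (PyTorch; the latent dimension is an assumption, and the untrained linear layer is a stand-in for a real trained generator):

```python
import torch

latent_dim = 100                            # assumed latent-space size
G = torch.nn.Linear(latent_dim, 180 * 360)  # stand-in for a trained generator

z = torch.randn(latent_dim)                 # sample from the unit-normal prior
x = G(z)                                    # a (would-be) plausible field
log_p_z = -0.5 * torch.sum(z ** 2)          # log p(z), up to a constant
```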
Learning a generator with a Variational AutoEncoder¶
We can create a generator with exactly these properties, using a Variational AutoEncoder (VAE). An autoencoder is a pair of neural nets: one of them (the encoder) compresses an input field into a low-dimensional latent space, and the other (the generator) expands the small latent-space representation back into the input field. They are trained as a pair, optimising to make generator(encoder(input)) as close to the original input as possible. A variational autoencoder adds two complications to the basic autoencoder:

- The encoder outputs a distribution in latent space, not a point (the distribution is parametrized by its mean and standard deviation), and the input to the generator is a random sample from that distribution.
- The training loss includes a term constraining the distribution of latent-space representations to be close to a multivariate unit normal.
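A minimal sketch of such a VAE (PyTorch; fully-connected layers for brevity where the real model is convolutional, and all sizes are assumptions):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Encoder maps a field to a latent distribution (mean and log-variance);
    the generator maps a sampled latent vector back to a field."""
    def __init__(self, field_size=180 * 360, latent_dim=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(field_size, 512), nn.ReLU(),
            nn.Linear(512, 2 * latent_dim))           # -> (mean, log-variance)
        self.generator = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, field_size))

    def forward(self, x):
        mean, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mean + torch.exp(0.5 * log_var) * torch.randn_like(mean)  # noisy sample
        return self.generator(z), mean, log_var

def vae_loss(x, x_hat, mean, log_var):
    recon = torch.sum((x - x_hat) ** 2)               # reconstruction error
    # KL divergence from the latent distribution to a unit normal
    kl = -0.5 * torch.sum(1 + log_var - mean ** 2 - log_var.exp())
    return recon + kl
```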

The generator produced by training meets the requirements specified above:

- It makes good quality $x$ as output ($x = G(z)$), because we train it on examples of good quality $x$ - here, temperature anomaly fields from ERA5.
- It has a cheap and easy prior $p(x)$, because $x = G(z)$ and we train the distribution of $z$ to be a multivariate unit normal.
- It is continuous and complete, because of the variational construction (noise in the encoder output, and the constraint on the latent-space distribution).
- It is fast, because the generator is implemented as a convolutional neural net (CNN); it runs in much less than one second.
So as long as the VAE can be trained successfully on good quality data (examples of the sort of $x$ we want to produce), we get a generator with all the properties we need.

VAE validation: top left - original field, top right - generator output, bottom left - difference, bottom right - scatter plot of original against output. (Note that a substantially better result could be produced with more model-building effort and a larger latent space, but this is good enough for present purposes).¶
The VAE is serving as a generator factory: learning a generator function $G$, with the properties we need, from a set of training fields.

Assimilating through the latent space¶
But we don’t want

We want to find the value of
and
We have

This optimisation search provides our function
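A sketch of the search (PyTorch, reusing the hypothetical G and H from the earlier sketches; the step count and learning rate are illustrative):

```python
import torch

def assimilate(G, H_op, y, latent_dim=100, sigma_o=1.0, steps=500, lr=0.1):
    """Gradient descent on z to minimise J(z); returns the assimilated field."""
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x = G(z)                                           # candidate field
        misfit = torch.sum((y - H_op(x)) ** 2) / (2 * sigma_o ** 2)
        prior = 0.5 * torch.sum(z ** 2)                    # -log p(z) + constant
        J = misfit + prior
        J.backward()                                       # dJ/dz through G
        opt.step()
    return G(z).detach(), z.detach()
```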
To check that it works, we can make some pseudo-observations from a known field, and see how well we can recover the original field from just the observations:

Assimilation validation: bottom - original field (ERA5 T2m anomaly), top - assimilation results. Black dots mark observations assimilated, grey hatching marks regions where the result is very uncertain.¶
This process works as expected. We can reconstruct the weather field precisely in regions where we have observations, and with uncertainty in regions where observations are unavailable.
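For reference, the pseudo-observation test can be sketched with the same hypothetical pieces (a random field standing in for the ERA5 example, and the stand-in G and assimilate from the sketches above):

```python
import torch

truth = torch.randn(180 * 360)                     # stand-in for an ERA5 T2m field
obs_idx = torch.randint(0, truth.numel(), (50,))   # 50 random "station" locations
H_op = lambda x: x[obs_idx]                        # point-value observation operator
y = H_op(truth)                                    # pseudo-observations, error-free

x_hat, z_hat = assimilate(G, H_op, y)              # recover the field from obs alone
rmse = torch.sqrt(torch.mean((x_hat - truth) ** 2))
```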
It’s not just 2-metre air temperature - the same approach can be used over a wide range of applications.
Examples of use¶
How far can we go with this method? In principle the same method could be used for very large and complex sets of weather and observation fields (including satellite radiances). We can be confident that suitable generator functions are possible because they would be neural network approximations to GCMs that already exist. But it’s not yet clear how difficult it would be to train such functions.
Conclusions¶
Machine Learning makes data assimilation easy, cheap, and fast.
Data assimilation means finding a weather field $x$ that matches a set of observations $y$, and this is difficult because the field $x$ is large and complex, while the observations are sparse. Machine learning provides us with a function $G$ linking physically-plausible weather fields to a complete, continuous, low-dimensional latent space, and searching that latent space for the field that best matches the observations is straightforward. Producing such a function $G$ is itself cheap and easy: we can train a Variational AutoEncoder on existing examples of the fields we want.
Small print¶
This document is crown copyright (2022). It is published under the terms of the Open Government Licence. Source code included is published under the terms of the BSD licence.