DIYA: A module for do-it-yourself data assimilation

Suppose you have an ensemble reanalysis, like 20CR. So you have 56 different estimates of the global mean-sea-level pressure field at a particular date (say, 1903-10-27:18). There is quite a lot of spread in this ensemble, the pressures are uncertain.

But we’ve also found a previously-unused observation, a measurement of MSLP at that time a a weather station (say in Fort William). We can use that new observation to improve the reanalysis estimate, with the Ensemble Kalman Filter or something similar.

This module does this. It takes reanalysis data and observations, and assimilates the observations into the reanalysis fields, making improved versions.

import DIYA
new_field=DIYA.constrain_cube(old_field,old_field,obs)

DIYA.buddy_check(obs, field, obs_error=0.1, random_state=None, model=None, lat_range=(-90, 90), lon_range=(-180, 360))[source]

Checks the influence of assimilating each ob on its neighbours.

Parameters:obs – Observations. Dataframe must have columns ‘latitude’, ‘longitude’, ‘value’, and ‘dtm’ - the last a datetime.datetime.

For each observation, assimilate into the field provided, then compare all the other observations to the field both before and after assimilation. For each ob, return the ratio of this mean difference with::without assimilation (<1 is good, observation improves fit to buddies, >1 is bad, degrades fit).

Returns:Mean difference ratio from assimilating each observation.
Return type:(pandas.Series)

DIYA.constrain_cube(target, cube_function, obs, obs_error=0.1, random_state=None, model=None, lat_range=(-90, 90), lon_range=(-180, 360))[source]

Constrain the target at each gridpoint.

Generates the constraints at each observation location, and then runs constrain_point() for each grid point in the target cube (inside the lat:lon ranges).

Parameters:
  • target (iris.cube.Cube) – Ensemble to be constrained, must have dimensions ‘member’,’latitude’, and ‘longitude’”.
  • cube_function (function taking a numpy.datetime64 as argument, and returning an iris.cube.Cube) – Resulting cube must have dimensions (‘member’,’latitude’,’longitude’)”.
  • obs (pandas.DataFrame) – Observations. Dataframe must have columns ‘latitude’, ‘longitude’, and ‘value’, and value should have same units as constraints.
  • obs_error (float) – Uncertainty in each observation value (1 s.d.) . Units as obs.value.
  • random_state (int | numpy.random.RandomState, optional) – Set to produce exactly reproducible results.
  • model (sklearn.linear_model, optional) – Model to use to relate target to constraints. Defaults to sklearn.linear_model.ElasticNet with default parameters.
  • lat_range (list, optional) – Only do constraint in this range of latitudes.
  • lon_range (list, optional) – Only do constraint in this range of longitudes.
Returns:

target, after applying constraints.

Return type:

iris.cube.cube


DIYA.constrain_point(target, constraint, model, obs)[source]

Constrain the target at a point

Fit a model to the target given the constraint (t=m(c)+r). Then make a predicted ensemble using the observations in the fitted model, and the residuals to the fit (p=m(o)+r). If nE=number of ensemble members, and nP=number of observation points:

Parameters:
  • target (numpy.array) – 1d array size nE - the target ensemble at the target point.
  • constraint (numpy.array) – 2d array size [nE,nP] - the constraint ensemble at each of the observation points.
  • model (sklearn.linear_model) – Model to fit.
  • obs (numpy.array) – 1d array size nP - the observations.
Returns:

1d array size nE - the constrained target ensemble.

Return type:

numpy.array


DIYA.haversine(origin, destination)[source]

Calculate distance between two observations.

Parameters:
  • origin (pandas.DataFrame) – 1 row from an observations dataframe. Must have columns ‘latitude’ and ‘longitude’.
  • destination (pandas.DataFrame) – 1 row from an observations dataframe. Must have columns ‘latitude’ and ‘longitude’.
Returns:

Distance between origin and destination in km.

Return type:

float


DIYA.nearby_observations(observations, target, distance)[source]

Find the subset of observations within a given distance of a target.

Parameters:
  • observations (pandas.DataFrame) – Observations dataframe. Must have columns ‘latitude’ and ‘longitude’.
  • target (pandas.DataFrame) – 1 row from an observations dataframe. Must have columns ‘latitude’ and ‘longitude’.
  • distance (float) – Maximum distance (km).
Returns:

Observations dataframe. Same as ‘observations’ input except that it only contains rows less than ‘distance’ from ‘target’.

Return type:

(pandas.DataFrame)


DIYA.qc_at_station_and_time(obs, station, dte)[source]

Get, from these observations, the quality of value at the selected station and time.

This is analagous to DWR.at_station_and_time() except it gets the QC status of the observation instead of its value. If the value is interpolated, it will fail QC if either of the values it’s interpolated from fails.

Parameters:
  • obs (pandas.DataFrame) – Batch of observations for the period around the desired time. Should have extra columns ‘plausible’ and ‘first_guess’, giving results of those QC checks.
  • station (str) – Name of station, as used in obs.name.
  • dte (datetime.datetime) – Time of required observed value.
Returns:

True if this selected data has passed QC checks.

Return type:

bool

Raises:

StandardError – obs does not contain at least two values for selected station, one before and one after specified time. So interpolation not possible.


DIYA.qc_compare_reanalysis(obs, variable='prmsl', version='2c')[source]

Get 20CR ensemble values at the time and place of each observation.

Parameters:
  • obs (pandas.DataFrame) – Observations. Dataframe must have columns ‘latitude’, ‘longitude’, and ‘dtm’ - the last a datetime.datetime.
  • variable (str) – Which 20CR variable to compare to. Defaults to ‘prmsl’
  • version (str) – Which 20CR version to load data from. Defaults to ‘2c’
Returns
pandas.Series: Reanalyis ensemble associated with each observation.

DIYA.qc_first_guess(obs, nsd=3, osd=2, comparison=None, variable='prmsl', version='2c')[source]

Checks the obs value against the 20CR ensemble.

Parameters:
  • obs (pandas.DataFrame) – Observations. Dataframe must have columns ‘latitude’, ‘longitude’, ‘value’, and ‘dtm’ - the last a datetime.datetime.
  • nsd (float) – Number of standard deviations for rejection threshold. Defaults to 3.
  • osd (float) – Observation standard deviation. Defaults to 2.
  • comparison (pandas.Series) – Reanalysis ensemble values at the time and place of each observation. Defaults to None - calculate them.
  • variable (str) – Which 20CR variable to compare to. Defaults to ‘prmsl’
  • version (str) – Which 20CR version to load data from. Defaults to ‘2c’

For each observation, loads the 20CR ensemble at the time of observation, extracts the mean and standard deviation at the time and place of observation, and compares ob-ensemble mean with the expected difference given the ensemble spread and observation error sqrt(ensemble sd**2 + osd**2). If the observed difference is greater than nsd times the expected difference, mark ob with false, otherwise with true.

Returns:True where ob-ensemble mean within expected difference from 20CR spread, False otherwise.
Return type:(pandas.Series)

DIYA.qc_plausible_range(obs, min=880.0, max=1060.0)[source]

Checks the obs value for plausibility.

Parameters:
  • obs (pandas.DataFrame) – Observations. Dataframe must have column ‘value’, and value should have same units as max and min.
  • min (float) – Minimum plausible value, defaults to 880.
  • max (float) – Maximum plausible value, defaults to 1060.
Returns:

True where obs[‘value’ between min and max, False otherwise.

Return type:

(pandas.Series)