**Process for filling total column ozone fields**

There are three main steps to creating filled fields:

1. Creating a machine-learning (ML) based statistically modelled field.

2. Creating a conservatively filled field.

3. Merging the original field, the conservatively filled field, and the ML-modelled field.

Each of these steps is described in greater detail below.

This algorithm is provided freely for academic and non-commercial use. We ask that any person or agency using this algorithm for satellite data-filling please acknowledge Bodeker Scientific as its source.

### A machine-learning based modelled field

For a target date, the total column ozone (TCO) fields 'around' that target date are regressed against fields of potential vorticity at 550 K (PV550K) and tropopause height (TH). The PV550K and TH fields are obtained at 6-hourly temporal resolution from the **NCEP CFSv2 reanalyses**. Essentially, the TCO field is modelled as:

where, to account for the correlation of TCO against PV550K and TH depending on latitude and/or longitude, the fit coefficients in the regression model are expanded in spherical harmonics i.e.

where:

For the purposes of fitting Equation 3 to the TCO fields, the normalisation constants can be ignored as they are taken up into the fit coefficients. When fitting Equation 1, the expansions for the fit coefficients are as follows:
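The equations themselves do not appear above; a form consistent with the surrounding description (the symbol names γ, α, β and the expansion limits used here are assumptions, not taken from the original) would be:

```latex
% Equation 1 (assumed form): TCO regressed on PV550K and TH, with an
% offset term gamma and spatially varying fit coefficients
\mathrm{TCO}(\theta,\phi) = \gamma(\theta,\phi)
  + \alpha(\theta,\phi)\,\mathrm{PV550K}(\theta,\phi)
  + \beta(\theta,\phi)\,\mathrm{TH}(\theta,\phi)

% Equations 2/3 (assumed form): each fit coefficient expanded in
% spherical harmonics, with meridional and zonal expansion limits
% l' and N as referred to in the text
\gamma(\theta,\phi) = \sum_{l=0}^{l'} \sum_{m=-N}^{N}
  c_{l,m}\, Y_l^{m}(\theta,\phi)
```

The α and β coefficients would be expanded in the same way, each with its own set of coefficients, which is why the normalisation constants of the spherical harmonics can be absorbed into the fitted values.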

The fitting happens 'around' the target date as fields on neighbouring dates and years are used to establish the dependence of TCO on PV550K and TH. The figure below shows an example of the fields that are used to train the regression model if the target date is 22 July 1995 (as an example).

The search ellipse is iteratively expanded by a factor of 1.5 on each iteration until there are *M* (currently set at 20) valid TCO fields that can be used for the training. Not all data from these *M* fields are passed to the regression model; currently only every *L*th value from the list of available values is sampled, with *L* chosen so that the total number of values passed to the regression model is less than or equal to 20,000. The latitude and longitude of each ozone value are also passed to the regression model so that the associated PV550K and TH values can be extracted. The times associated with the TCO fields are assumed to be local noon times (since most of the satellites making these measurements were sun-synchronous with an equator crossing time close to solar noon); the actual UTC time therefore varies across the TCO field. The PV550K and TH values are linearly interpolated to those exact UTC times.
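The subsampling step above can be sketched as follows (a minimal illustration, assuming the values are held in a flat list; the function name and the 20,000 cap as a default parameter are choices made here, not from the original code):

```python
import math

def subsample(values, max_points=20_000):
    """Keep every L-th value so that at most max_points values remain.

    L is chosen as the smallest integer stride that brings the total
    number of values passed to the regression model under the cap.
    """
    stride = max(1, math.ceil(len(values) / max_points))
    return values[::stride]
```

For example, 100,000 candidate values would be thinned with a stride of 5, yielding exactly 20,000 training points.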

Various 'versions' of the regression model are tested, i.e. different basis functions are excluded/included; the offset (γ) basis function is always included. In addition to switching different basis functions on/off, different values for *N* and *l′* are tested (perturbing these by ±1 around their starting values) while ensuring that the maximum allowed zonal and meridional expansions (see table above) are not exceeded. This results in many different possible constructs of the regression model. If any model, when evaluated over all latitudes and longitudes, produces a TCO value more than 10% above the maximum measured TCO value passed to the regression model, or more than 10% below the minimum value passed to the regression model, it is immediately discarded. This eliminates statistical models of the ozone field that significantly over-fit. For all models that pass this initial test, a Bayesian Information Criterion (BIC) score is calculated as:

where *N* is the number of data points passed to the regression model, *R*^{2} is a modified sum of the squares of the residuals, and *NC* is the total number of coefficients in the fit. In calculating *R*^{2}, where model values fall below the minimum measurement passed to the regression model, or above the maximum measurement, the residual is inflated exponentially to impose an additional cost on the model for generating values outside the range of the data. This acts as a strong disincentive against the model straying significantly outside the range of the measurements.
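The BIC expression itself is not reproduced above; the standard form consistent with the quantities just defined (assumed here, with *NC* written as a subscript) is:

```latex
% Standard BIC in terms of the residual sum of squares R^2, the number
% of data points N, and the number of fit coefficients N_C; lower is
% better, and the second term penalises model complexity
\mathrm{BIC} = N \ln\!\left(\frac{R^{2}}{N}\right) + N_C \ln N
```

Minimising this quantity trades goodness of fit against the number of spherical-harmonic coefficients.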

Once the construct of the regression model with the minimum BIC has been found, this model is used to generate the statistically modelled TCO field. The database of different regression models is also used to calculate the structural uncertainties that result from different possible choices of spherical harmonic expansions. The uncertainties that result from uncertainties on the regression model fit coefficients are also calculated. These two sources of uncertainty are then added in quadrature. The structural uncertainty statistics are calculated using only those fields that have the same sequence of basis functions switched on and off compared to the best fit. Examples of the unfilled TCO fields, the ML-modelled TCO fields and the uncertainty on the modelled TCO fields for days 304 and 305 of 1978 are shown below.
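The quadrature combination of the two uncertainty sources amounts to the following (a one-line sketch; the function and argument names are illustrative):

```python
def combined_uncertainty(structural, fit):
    """Add the structural uncertainty (spread across spherical-harmonic
    constructs) and the fit-coefficient uncertainty in quadrature."""
    return (structural ** 2 + fit ** 2) ** 0.5
```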

**Original TCO field (day 304 of 1978)**

**ML modelled TCO field (day 304 of 1978)**

**Uncertainty on ML modelled TCO field (day 304 of 1978)**

**Original TCO field (day 305 of 1978)**

**ML modelled TCO field (day 305 of 1978)**

**Uncertainty on ML modelled TCO field (day 305 of 1978)**

### A conservatively filled field

The ML-filled fields described above are rather 'coarse' estimates of the true TCO field, but they are guaranteed to be filled and available. A more conservatively filled field is also required. The first step in creating a conservatively filled field is a 'nearest neighbour' interpolation that fills in as many missing values as possible. This is done by looking for cells with null values that are neighboured, either to the north and south, or to the east and west, by non-null values. If such a non-null pair is found, that pair of TCO values, together with their uncertainties, is used to estimate the interstitial value, which is taken as the mean of the two neighbouring values. The uncertainty on the interpolated value is calculated by taking the RMS of the uncertainties on the neighbouring values.
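One pass of this opposite-pair fill can be sketched as below (an illustration only: the grids are taken to be plain lat × lon lists with `None` marking missing cells, and the function name is a choice made here):

```python
def fill_nearest_neighbour(grid, sigma):
    """Fill missing cells flanked north/south or east/west by values.

    The filled value is the mean of the flanking pair; its uncertainty
    is the RMS of the pair's uncertainties. Only original (measured)
    neighbours are used within a single pass.
    """
    nlat, nlon = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    sout = [row[:] for row in sigma]
    for i in range(nlat):
        for j in range(nlon):
            if grid[i][j] is not None:
                continue
            # north/south pair, then east/west pair
            for (ia, ja), (ib, jb) in (((i - 1, j), (i + 1, j)),
                                       ((i, j - 1), (i, j + 1))):
                if not (0 <= ia < nlat and 0 <= ib < nlat
                        and 0 <= ja < nlon and 0 <= jb < nlon):
                    continue
                va, vb = grid[ia][ja], grid[ib][jb]
                if va is not None and vb is not None:
                    out[i][j] = 0.5 * (va + vb)
                    sout[i][j] = ((sigma[ia][ja] ** 2
                                   + sigma[ib][jb] ** 2) / 2) ** 0.5
                    break
    return out, sout
```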

After doing the nearest neighbour interpolation, cases are sought where, for a cell containing a null value on day N, there are non-null values in the same cell on day N−1 and day N+1. The value is then interpolated between the previous and following day, using the same approach for estimating the interstitial value as was used for the nearest neighbour interpolation. The following two steps are then repeated until no additional values are inserted into the grid:

Do a nearest neighbour interpolation.

Do a longitudinal interpolation.

The nearest neighbour interpolation is repeated because other regions may become amenable to that approach after the interpolation between the previous and following day. The longitudinal interpolation works by finding two non-null values at the same latitude that are separated by two or more null values; an additional constraint is imposed that the non-null values cannot be separated by more than 30° of longitude. Linear interpolation between the two non-null values, including an estimate of the uncertainties, is used to determine the interstitial values and their uncertainties. Plots of the original TCO field, the conservatively filled field, and the uncertainties on the conservatively filled field for day 3 of 1980 are shown below.
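The longitudinal interpolation can be sketched as follows for a single latitude row (an illustration under stated assumptions: a regular grid with spacing `dlon` degrees, `None` for missing cells, and linear error propagation for the interpolated uncertainties; none of these names come from the original code):

```python
def fill_longitudinal(row, sigma_row, dlon=1.0, max_gap_deg=30.0):
    """Linearly interpolate across runs of two or more missing cells
    in one latitude row, leaving gaps wider than max_gap_deg unfilled.
    """
    n = len(row)
    out, sout = row[:], sigma_row[:]
    j = 0
    while j < n:
        if row[j] is None:
            start = j
            while j < n and row[j] is None:
                j += 1
            end = j  # index of first non-null cell after the gap
            gap = end - start
            if (gap >= 2 and start > 0 and end < n
                    and (gap + 1) * dlon <= max_gap_deg):
                left, right = row[start - 1], row[end]
                sl, sr = sigma_row[start - 1], sigma_row[end]
                for k in range(start, end):
                    f = (k - start + 1) / (gap + 1)
                    out[k] = (1 - f) * left + f * right
                    sout[k] = (((1 - f) * sl) ** 2
                               + (f * sr) ** 2) ** 0.5
        else:
            j += 1
    return out, sout
```

Single missing cells (gap of one) are deliberately skipped here, since those are handled by the nearest neighbour step.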

**Original field**

**Conservatively filled field**

**Uncertainties on conservatively filled field**

### Merging the original, conservatively filled and ML-modelled fields

We now have the original TCO fields, the statistically modelled fields, and the conservatively filled fields. These are merged such that the original values are preserved where they are available; where they are not, we relax to the conservatively filled values; and where those, too, are unavailable, we relax to the ML-modelled values.

Preparing the ML-modelled fields: Because the ML-modelled fields are modelled on PV550K and TH (which can themselves contain anomalous values), they can occasionally display physically unrealistic spatial or temporal structures. To obtain a smoother ML-filled field for any given day, a (1,4,6,4,1) weighting of the five ML-modelled fields centred on the day of interest is therefore calculated.
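The five-day binomial smoothing can be sketched as (a minimal illustration; the fields are taken as equally shaped 2-D lists of floats, and the function name is a choice made here):

```python
def smooth_ml_fields(fields):
    """Apply the (1,4,6,4,1) weighting to five daily ML-modelled
    fields centred on the target day; weights sum to 16."""
    w = (1, 4, 6, 4, 1)
    nlat, nlon = len(fields[0]), len(fields[0][0])
    return [[sum(w[d] * fields[d][i][j] for d in range(5)) / 16
             for j in range(nlon)]
            for i in range(nlat)]
```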

Merging scenarios: For any given day there are several possibilities:

None of the three fields are available: In this case no final filled field is generated.

Only the ML-modelled field is available: In this case some tests are done on that ML-modelled field to make sure that no large spatial or temporal gradients in that field remain (if there are, the field is smoothed until these tests are passed). This validated ML-modelled field, and its associated uncertainties, become the final filled field for the target day.

Only the conservatively filled field and the ML-modelled fields are available: In this case the conservatively filled field (field A) and the validated ML-modelled field (field B) are blended to create the final filled field. A description of how two fields are blended is provided below.

All three fields are available: In this case the conservatively filled field and the validated ML-modelled field (see above) are blended to create field A. The original TCO field and field A are then blended to create the final filled TCO field and its uncertainty. An example of the original field for TCO on day 304, the final field obtained from this blending process, and the uncertainty on the final filled field are shown below (the ML modelled field for that day can be seen above).
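The scenario dispatch above can be sketched as follows (an illustration only: each field is `None` when unavailable, `blend` stands for the two-field blending routine described further below, and the sketch assumes, as the text suggests, that the ML-modelled field is available whenever the conservatively filled field is):

```python
def merge_day(original, conservative, ml_validated, blend):
    """Produce the final filled field for one day from whichever of
    the three input fields are available."""
    if original is None and conservative is None and ml_validated is None:
        return None  # no final filled field is generated
    if original is None and conservative is None:
        return ml_validated  # validated ML field stands alone
    if original is None:
        return blend(conservative, ml_validated)  # field A, field B
    # all three: blend conservative+ML into field A, then blend the
    # original field with field A
    return blend(original, blend(conservative, ml_validated))
```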

**Original TCO field**

**Conservatively filled field**

**Final filled field**

**Final filled field uncertainties**

The final filled fields replicate the measured TCO values where they are available; where data are missing, they blend into the conservatively filled values where those are available; and where conservatively filled values are also missing, they blend into the ML-modelled field.

### Blending field A and field B

Two fields, A (the primary field) and B (the secondary field), are blended as follows:

If there is a null value in some cell in field A and a non-null value in the same cell in field B, then a proxy value for field A is found and combined with the value from field B as follows:

where *B̄* is the blended value, *W* is a weight calculated as detailed below, *A*_{proxy} is a proxy value for the missing value in the A field determined as detailed below, and *B*_{value} is the non-null value from the B field.

To derive an *A*_{proxy} value, 60° sectors around the missing grid node are scanned for non-null values from the A field that lie within 20 grid cells in the east-west and north-south directions. A weighted mean of the values from those 6 search sectors is then calculated, where the weighting is calculated as:

where *D* is the distance to the nearest non-null value in that sector, measured in metres. The weight is set to zero when *D* is greater than 1000 km. *A*_{proxy} is then set to the weighted mean of the non-null values across all 6 sectors.

In calculating the blended value using Equation 5, the weight (*W*) is calculated using the distance to the nearest non-null value across all 6 search sectors in Equation 6. If no *A*_{proxy} value can be found, then *W* in Equation 5 is set to zero. Standard error propagation rules are used to determine the uncertainty on the blended value. This process results in using values from field A where they are available, blending into values from field B where they are not.
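The blending of a single cell can be sketched as below. This assumes Equation 5 has the form *B̄* = *W*·*A*_{proxy} + (1 − *W*)·*B*_{value}, which is consistent with *W* = 0 reducing to the B value; the linear taper used for the weight is an assumption standing in for Equation 6, which is not reproduced above, and only the 1000 km cutoff comes from the text:

```python
def blend_value(a_proxy, b_value, distance_m, cutoff_m=1_000_000.0):
    """Blend one missing field-A cell with the field-B value.

    distance_m is the distance to the nearest non-null field-A value
    across all 6 search sectors. Beyond the cutoff (1000 km), or when
    no A-proxy exists, the weight is zero and the B value is used.
    """
    if a_proxy is None or distance_m > cutoff_m:
        return b_value  # W = 0: fall back entirely on field B
    w = 1.0 - distance_m / cutoff_m  # assumed taper, not Equation 6
    return w * a_proxy + (1.0 - w) * b_value
```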