
How should we interpret an ensemble of models? Part I: Weather models

by Judith Curry

Over the last two weeks, there have been some interesting exchanges in the blogosphere on the topic of interpreting an ensemble of models.

rgbatduke kicked off the exchange with a comment over at WUWT, which was elevated to a main post entitled  The “ensemble” of models is completely meaningless, statistically, which elicited an additional comment from rgbatduke.  Matt Briggs responded with a post entitled An Ensemble of Models is Completely Meaningful, Statistically.

Who is correct, rgbatduke or Matt Briggs?  Well, each made valid points, and each made statements that don’t seem quite right to me.  Rather than digging into the statements by rgbatduke and Matt Briggs, I decided to do a series of two posts on ensemble interpretation.  Part I is on weather models, including seasonal forecast models.

ECMWF ensemble forecast system 

An overview of ensemble weather forecast models is given by Wikipedia.  See also this excellent presentation by Malaquias Pena.

The European Centre for Medium-Range Weather Forecasts (ECMWF) arguably produces the world’s best weather forecasting system. The ECMWF ensemble weather forecast system includes the following products: a 51-member medium-range ensemble (ENS), a monthly-range extension of the ensemble, and a seasonal forecast ensemble.

The ECMWF ensemble members are generated using a singular vector approach   that perturbs both model parameters and initial conditions.
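To make the singular vector idea concrete, here is a toy sketch for the initial-condition part, using a fixed random linear propagator as a stand-in for the tangent-linear model that the real system uses; all names and numbers are illustrative, not ECMWF’s implementation.

```python
# Toy sketch of singular-vector initial-condition perturbations
# (illustrative only; the real system derives singular vectors from
# the tangent-linear model over an optimization interval).
import numpy as np

rng = np.random.default_rng(0)
n = 5                                  # toy state dimension
M = rng.standard_normal((n, n))        # stand-in linear propagator: x_t = M @ x_0

# Right singular vectors of M give the initial perturbation directions
# with the largest growth over the interval (growth factor = singular value).
_, s, Vt = np.linalg.svd(M)

x0 = rng.standard_normal(n)            # unperturbed analysis (toy)
members = [x0]                         # control member
amp, k = 0.01, 2                       # perturbation amplitude, no. of vectors
for i in range(k):
    members.append(x0 + amp * Vt[i])   # paired +/- perturbations along the
    members.append(x0 - amp * Vt[i])   # i-th fastest-growing direction

ensemble = np.array(members)           # 2k + 1 initial states to integrate
print(ensemble.shape, s[:k])
```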

A few weeks ago, I attended the annual users meeting at ECMWF [link].  Background on the ECMWF weather forecast system, including some verification statistics, is provided in the following presentations:

For my research and for the operational forecasts provided by my company, Climate Forecast Applications Network (CFAN), we use ECMWF products.  I gave the keynote address at the recent ECMWF workshop; my presentation can be found here: Applications of ECMWF forecast products for the energy sector.

Ensemble interpretation

With specific regard to ensemble interpretation, my presentation focuses on the following techniques:

1.  Statistical postprocessing using reforecasts and recent model performance relative to observations, via Bayesian bias correction, quantile-to-quantile distribution calibration, and model output statistics (see the sketch after this list).

2.  Provision of probabilistic forecasts of surface weather, and application of extreme value theory to probabilistic forecasts of extreme weather events.

3.  Expansion of ensemble size through use of lagged forecasts and Monte Carlo resampling techniques.

4.  Ensemble clustering techniques.
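As a concrete illustration of item 1, here is a minimal sketch of quantile-to-quantile (quantile mapping) calibration, assuming a reforecast archive paired with observations; the function and data are hypothetical, not CFAN’s operational code.

```python
# Minimal quantile-to-quantile calibration sketch (hypothetical example).
import numpy as np

def quantile_map(raw, reforecasts, observations):
    """Map a raw forecast value through the reforecast CDF onto the
    observed climatological distribution."""
    refc = np.sort(reforecasts)
    # Non-exceedance probability of the raw value in the reforecast archive...
    p = np.searchsorted(refc, raw) / len(refc)
    # ...inverted through the observed distribution.
    return np.quantile(observations, np.clip(p, 0.0, 1.0))

rng = np.random.default_rng(1)
obs = rng.normal(15.0, 3.0, 1000)             # observed temperatures (deg C)
refc = obs + 2.0 + rng.normal(0, 1.0, 1000)   # reforecasts with a +2 warm bias
print(quantile_map(20.0, refc, obs))          # pulled back to roughly 18 deg C
```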

The techniques used by my team rank among the most sophisticated currently being used in an operational environment, although there are some more sophisticated techniques being used in research mode, e.g. ensemble dressing.

Averaging the ensemble members to produce an ensemble mean is often done, effectively providing a deterministic forecast, but this forgoes a primary rationale for the ensemble approach: characterizing forecast uncertainty.

If you make a deterministic forecast, then verification is simply done against observations using mean absolute error, correlation statistics, etc.
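A minimal sketch of such deterministic verification, using an ensemble mean and synthetic data:

```python
# Deterministic verification of an ensemble-mean forecast (synthetic data).
import numpy as np

rng = np.random.default_rng(2)
obs = rng.normal(10.0, 4.0, 200)             # verifying observations
ens = obs + rng.normal(0.5, 2.0, (51, 200))  # a 51-member ensemble with bias
det = ens.mean(axis=0)                       # ensemble mean = deterministic fcst

mae = np.mean(np.abs(det - obs))             # mean absolute error
corr = np.corrcoef(det, obs)[0, 1]           # correlation with observations
print(f"MAE = {mae:.2f}  correlation = {corr:.2f}")
```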

For an ensemble forecast, the following represent some commonly used verification statistics (from the Malaquias Pena presentation linked above):

Comparison of a distribution of forecasts to a distribution of observations:

– Reliability: How well the a priori predicted probability forecast of an event coincides with the a posteriori observed frequency of the event

– Resolution: How much the observed frequencies of the event, conditioned on the different forecasts, differ from the climatological mean frequency; i.e., does the system successfully distinguish situations with different outcomes?

– Sharpness: How much do the forecasts differ from the climatological mean probabilities of the event?

– Skill: How much better are the forecasts compared to a reference prediction system (chance, climatology, persistence,…)?

Performance measures for probabilistic forecasts quantify these attributes; a widely used example is the Brier score.
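Below is a minimal sketch (synthetic data, illustrative only) that computes the Brier score together with its standard Murphy decomposition into the reliability and resolution terms defined above, plus the resulting skill score relative to climatology.

```python
# Brier score and its Murphy decomposition: BS = REL - RES + UNC
# (synthetic data; here the toy forecast is perfectly reliable).
import numpy as np

rng = np.random.default_rng(3)
N = 2000
p = rng.uniform(0, 1, N)                      # forecast event probabilities
o = (rng.uniform(0, 1, N) < p).astype(float)  # observed outcomes (0/1)

bs = np.mean((p - o) ** 2)                    # Brier score (lower is better)

bins = np.clip((p * 10).astype(int), 0, 9)    # 10 probability bins
obar = o.mean()                               # climatological frequency
rel = 0.0
res = 0.0
for k in range(10):
    idx = bins == k
    if idx.any():
        nk = idx.sum()
        rel += nk * (p[idx].mean() - o[idx].mean()) ** 2   # reliability term
        res += nk * (o[idx].mean() - obar) ** 2            # resolution term
rel /= N
res /= N
unc = obar * (1 - obar)                       # climatological uncertainty
bss = 1 - bs / unc                            # Brier skill score vs climatology
print(f"BS={bs:.3f}  REL={rel:.3f}  RES={res:.3f}  UNC={unc:.3f}  BSS={bss:.2f}")
```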

Multi-model ensembles

An ensemble size of 51 members works pretty well for many weather situations, although we noticed last winter that even this size was definitely too small owing to highly variable and unpredictable conditions.  For longer time scales (e.g. seasonal forecasts), an ensemble size of 40 is generally regarded as too small.
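When the ensemble is too small, one inexpensive remedy (technique 3 above) is a lagged ensemble that pools members from successive initialization times; a minimal sketch with illustrative numbers:

```python
# Lagged ensemble sketch: pool members from successive initialization
# times, all valid at the same target time (illustrative numbers).
import numpy as np

rng = np.random.default_rng(4)
# forecasts[i]: the 51 members initialized i x 12 hours earlier
forecasts = [rng.normal(10.0 + 0.1 * i, 2.0, 51) for i in range(3)]

lagged = np.concatenate(forecasts)   # 153 members instead of 51
print(lagged.size, round(lagged.mean(), 2), round(lagged.std(), 2))
# In practice, older starts are often down-weighted (e.g. weights that
# decay with the lag) when forming probabilities.
```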

EUROSIP is a multimodel ensemble for seasonal forecasts including ECMWF, UK Met Office, and MeteoFrance; recently the U.S. model was added.  From the linked presentation by David Stockdale:

What would an ‘ideal’ multi-model system look like? Assume a fairly large number of models (10 or more):

  1. Assume models have roughly equal levels of forecast error 
  2. Assume that model forecast errors are uncorrelated 
  3. Assume that each model has its own mean bias removed 
  4. A priori, for each forecast, we consider each of the models’ forecasts equally likely [in a Bayesian sense – in reality, all the model pdfs will be wrong] 
  5. A posteriori, this is no longer the case: forecasts near the centre of the multi-model distribution have higher likelihood 
  6. Different from a single model ensemble with perturbed ic’s. 
  7. Multi-model ensemble distribution is NOT a pdf

Non-ideal case 

Model forecast errors are not independent. Dependence will reduce the degrees of freedom, hence the effective n, and will increase uncertainty.

In some cases, reduction in n could be drastic 

Initial condition error can be important. The foregoing analysis applies to the ‘model error’ contribution to error variance 

Initial condition error and irreducible error growth terms follow usual ensemble behaviour, and must be accounted for separately 

 What weight should be given to outliers?
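To quantify the ‘drastic’ reduction mentioned above: for n models with equal error variance and a common pairwise error correlation rho, the variance of the multi-model mean error is s^2 (1 + (n-1) rho) / n, so the effective number of independent models is n / (1 + (n-1) rho). A quick illustration:

```python
# Effective number of independent models when errors are correlated
# (standard result for equally skilful models with equicorrelated errors).
def n_eff(n, rho):
    return n / (1 + (n - 1) * rho)

for rho in (0.0, 0.3, 0.6, 0.9):
    print(f"rho = {rho:.1f}: 10 models act like {n_eff(10, rho):.1f}")
# rho = 0.6 already shrinks 10 models to ~1.6 effective ones.
```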

Method for p.d.f. estimation (1)

Assume underlying normality 

Calculate robust skill-weighted ensemble mean. Do not try a multivariate fit (very small number of data points)

Weights estimated ~1/(error variance). Would be optimal for independent errors – i.e., is conservative.

Then use 50% uniform weighting, 50% skill dependent

Comments: Rank weighting also tried, but didn’t help.

QC term tried, using likelihood to downplay impact of outliers, but again didn’t help. Outliers are usually wrong, but not always.

Models usually agree reasonably well, and tweaks to weights have very little impact anyway.
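A hedged sketch of the weighting scheme just described (weights proportional to 1/(error variance), blended 50/50 with uniform weights); the numbers are invented for illustration, not the EUROSIP values:

```python
# Skill-weighted multi-model mean, blended 50/50 with uniform weights
# (invented numbers).
import numpy as np

model_fcst = np.array([0.8, 1.1, 0.5, 1.4])   # each model's bias-removed fcst
err_var = np.array([0.4, 0.6, 1.2, 0.9])      # each model's past error variance

w_skill = 1.0 / err_var
w_skill /= w_skill.sum()                      # skill-based weights, sum to 1
w_unif = np.full(model_fcst.size, 1.0 / model_fcst.size)
w = 0.5 * w_unif + 0.5 * w_skill              # 50% uniform, 50% skill-dependent

print(w.round(3), (w * model_fcst).sum().round(3))
```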

Method for p.d.f. estimation (2) 

Re-centre lower-weighted models, to give the correct multi-model ensemble mean. This is done so as to minimize disturbance to the multi-model spread.

Compare past ensemble and error variances.

-Use above method (cross-validated) to generate past ensembles 

-Unbiased estimates of multi-model ensemble variance and observed error variance 

-Scale forecast ensemble variance 

-50% of variance is from the scaled climatological value, 50% from the scaled forecast value 

Comments: For the multi-model system, use of the predicted spread gives better results. For a single model, this seems not to be the case.
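A minimal sketch of this spread calibration under strong simplifying assumptions (a single scalar variable, invented numbers; the actual procedure is cross-validated over past forecasts):

```python
# Scale the forecast ensemble variance using past (cross-validated)
# ensemble vs. error variances, then blend 50/50 with climatology
# (invented numbers for illustration).
past_ens_var = 1.2                      # mean hindcast ensemble variance
past_err_var = 2.0                      # observed error variance, same period
scale = past_err_var / past_ens_var     # inflation factor for the spread

fcst_var = 1.5                          # today's raw ensemble variance
clim_var = 2.4                          # scaled climatological variance

calib_var = 0.5 * clim_var + 0.5 * scale * fcst_var
print(scale, calib_var)
```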

An additional example for weather models is TIGGE (the THORPEX Interactive Grand Global Ensemble).  A good overview of TIGGE is given in this presentation by Hamill and Hagedorn:

One goal of TIGGE is to investigate whether multi-model predictions are an improvement over single-model forecasts
•The goal of using reforecasts to calibrate single-model forecasts is to provide improved predictions
•Questions:
-What are the relative benefits (costs) of both approaches?
-What is the mechanism behind the improvements?
-Which is the “better” approach?


To cut to the chase, the best model (ECMWF) performs as well as the multi-model ensemble mean, and ECMWF calibrated by the reforecasts outperforms the multi-model ensemble.

Here is how my team has approached the issue.  We use multiple models in our hurricane track forecasts and in our seasonal forecasts.  However, we do not combine the simulations from the model ensembles into a grand ensemble; rather, we consider each ensemble separately, and the forecaster weights the ensembles based on recent model performance or uses the additional models to characterize forecast uncertainty.

JC conclusion:  The weather modelling and forecast communities have developed sophisticated techniques  for the interpretation of ensemble simulations.  The extent to which we can usefully apply these techniques to climate models will be discussed in Part II, along with alternative strategies for ensemble interpretation.

Moderation note:  This is a technical thread, please keep your comments relevant.

 
