by Judith Curry
Over the last two weeks, there have been some interesting exchanges in the blogosphere on the topic of interpreting an ensemble of models.
rgbatduke kicked off the exchange with a comment over at WUWT, which was elevated to a main post entitled The “ensemble” of models is completely meaningless, statistically, which elicited an additional comment from rgbatduke. Matt Briggs responded with a post entitled An Ensemble of Models is Completely Meaningful, Statistically.
Who is correct, rgbatduke or Matt Briggs? Well, each made valid points, and each made statements that don’t seem quite right to me. Rather than digging into the statements by rgbatduke and Matt Briggs, I decided to do a series of two posts on ensemble interpretation. Part I is on weather models, including seasonal forecast models.
ECMWF ensemble forecast system
An overview of ensemble weather forecast models is given by Wikipedia. See also this excellent presentation by Malaquias Pena.
The European Centre for Medium-Range Weather Forecasts (ECMWF) arguably produces the world’s best weather forecasting system. The ECMWF ensemble weather forecast system includes the following products:
- High resolution atmospheric model: 1-10 days at 0.125° x 0.125° horizontal resolution, available at 3-hour intervals to 144 hours, and at 6-hour intervals beyond 144 hours. Output variables include 10 m and 100 m wind velocities and maximum 10 m wind gusts. Base time for forecasts: 00 and 12 UTC daily.
- Atmospheric Ensemble Prediction System: 51 ensemble members, 1-15 days at 0.25° x 0.25° resolution to 10 days and 0.5° x 0.5° resolution beyond 10 days. Available at 6-hour intervals.
- Monthly forecasting system: 51 ensemble members, 1-32 days at 0.5° x 0.5° resolution. Output variables include wind velocities at 10 m, 1000 hPa and 925 hPa, available at 6-hour intervals. Base time for forecasts: 00 UTC, twice per week. Calibration forecasts (reforecasts) are provided once per week.
- Seasonal forecasting system: 41 ensemble members, 1-7 months at 1.5° x 1.5° resolution. Output variables include 10 m wind velocities, available at 6-hour intervals, once per month. Historical (hindcast) simulations are provided back to 1980.
The ECMWF ensemble members are generated using a singular vector approach that perturbs both model parameters and initial conditions.
A few weeks ago, I attended the annual users meeting at ECMWF [link]. Background on the ECMWF weather forecast system, including some verification statistics, is provided in the following presentations:
- ECMWF forecast products supporting general weather forecasting and decision making
- Future developments of the ensemble forecast system
- Recent upgrades of the operational forecast system
For my research and the operational forecasts provided by my company Climate Forecast Applications Network (CFAN), we use ECMWF products. I gave the keynote address at the recent ECMWF workshop; my presentation can be found at: Applications of ECMWF forecast products for the energy sector.
Ensemble interpretation
Specifically with regard to ensemble interpretation, my presentation focuses on the following techniques:
1. Statistical postprocessing using reforecasts and recent model performance, relative to observations, using Bayesian bias correction, quantile-to-quantile distribution calibration, and model output statistics.
2. Provision of probabilistic forecasts of surface weather, and applications of extreme value theory to probabilistic forecasts of extreme weather events
3. Expansion of ensemble size through use of lagged forecasts and Monte Carlo resampling techniques.
4. Ensemble clustering techniques
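To make the quantile-to-quantile calibration in item 1 more concrete, here is a minimal Python sketch of empirical quantile mapping. It is an illustration under simple assumptions, not the CFAN or ECMWF implementation: the function name `quantile_map` and the sample numbers are hypothetical, and an operational scheme would condition on lead time, season, and location.

```python
from bisect import bisect_right


def quantile_map(value, past_forecasts, past_observations):
    """Map a raw forecast value to the observed value at the same empirical quantile.

    past_forecasts: reforecast values for past cases (hypothetical sample).
    past_observations: observations for a comparable set of past cases.
    """
    fc = sorted(past_forecasts)
    ob = sorted(past_observations)
    # Number of past forecasts that do not exceed the new value.
    rank = bisect_right(fc, value)
    # Convert that rank to the matching position in the observed sample
    # (integer ceiling), then clamp to a valid index.
    idx = (rank * len(ob) + len(fc) - 1) // len(fc) - 1
    idx = min(max(idx, 0), len(ob) - 1)
    return ob[idx]


# Hypothetical case: the model systematically forecasts half the observed wind speed.
reforecasts = [1.0, 2.0, 3.0, 4.0, 5.0]
observations = [2.0, 4.0, 6.0, 8.0, 10.0]
print(quantile_map(3.0, reforecasts, observations))
```

Applying the same mapping to every ensemble member calibrates the whole forecast distribution rather than only its mean.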
The techniques used by my team rank among the most sophisticated currently being used in an operational environment, although there are some more sophisticated techniques being used in research mode, e.g. ensemble dressing.
Averaging the ensemble members to produce a mean is often done, effectively providing a deterministic forecast, but this does not take advantage of a primary rationale for the ensemble approach, namely characterizing uncertainty.
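A small Python sketch may help show what the ensemble mean discards. The member values and the 20 m/s threshold are invented for illustration, and the function names are hypothetical.

```python
def ensemble_mean(members):
    """Collapse the ensemble to a single deterministic value."""
    return sum(members) / len(members)


def exceedance_probability(members, threshold):
    """Fraction of members above a threshold, read as a raw event probability."""
    return sum(1 for m in members if m > threshold) / len(members)


# Hypothetical 10 m wind speeds (m/s) from a five-member ensemble.
winds = [8.0, 9.0, 10.0, 11.0, 22.0]
print(ensemble_mean(winds))                  # the mean hides the outlier member
print(exceedance_probability(winds, 20.0))   # share of members above 20 m/s
```

Here the mean suggests moderate winds, while one member in five indicates a damaging-wind scenario that a user sensitive to extremes would want to know about.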
If you make a deterministic forecast, then verification is simply done against observations using mean absolute error, correlation statistics, etc.
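For reference, the two deterministic measures named above can be computed as follows. This is a plain-Python sketch with hypothetical function names and made-up numbers.

```python
from math import sqrt


def mean_absolute_error(forecasts, observations):
    """Average magnitude of forecast minus observation."""
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)


def pearson_correlation(forecasts, observations):
    """Linear correlation between forecasts and observations."""
    n = len(forecasts)
    mf = sum(forecasts) / n
    mo = sum(observations) / n
    cov = sum((f - mf) * (o - mo) for f, o in zip(forecasts, observations))
    var_f = sum((f - mf) ** 2 for f in forecasts)
    var_o = sum((o - mo) ** 2 for o in observations)
    return cov / sqrt(var_f * var_o)


# Hypothetical forecasts that are perfectly correlated with observations but biased low.
fc = [1.0, 2.0, 3.0]
ob = [2.0, 4.0, 6.0]
print(mean_absolute_error(fc, ob), pearson_correlation(fc, ob))
```

The example is a reminder that a high correlation does not imply a small error: the two statistics answer different questions.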
For an ensemble forecast, the following represent some commonly used verification statistics (from the Malaquias Pena presentation linked above):
Comparison of a distribution of forecasts to a distribution of observations:
– Reliability: How well the a priori predicted probability forecast of an event coincides with the a posteriori observed frequency of the event
– Resolution: How much do the forecasts differ from the climatological mean probabilities of the event, and does the system get it right?
– Sharpness: How much do the forecasts differ from the climatological mean probabilities of the event?
– Skill: How much better are the forecasts compared to a reference prediction system (chance, climatology, persistence,…)?
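Reliability, the first attribute above, can be checked by binning the forecast probabilities and comparing each bin's mean forecast probability with the observed event frequency. The sketch below is illustrative only: the function name `reliability_table`, the equal-width bins, and the sample data are assumptions.

```python
def reliability_table(probabilities, outcomes, n_bins=5):
    """Return (mean forecast probability, observed frequency, count) for each non-empty bin.

    probabilities: forecast probabilities in [0, 1].
    outcomes: 1 if the event occurred, 0 otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probabilities, outcomes):
        # Equal-width bins; a probability of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    table = []
    for cases in bins:
        if cases:
            mean_p = sum(p for p, _ in cases) / len(cases)
            freq = sum(o for _, o in cases) / len(cases)
            table.append((mean_p, freq, len(cases)))
    return table


# Hypothetical forecasts: low probabilities on non-event days, high on event days.
for row in reliability_table([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1]):
    print(row)
```

For a reliable system the first two columns agree within sampling noise; plotting one against the other gives the reliability diagram.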
Performance measures of probabilistic forecast:
- Brier Skill Score (BSS)
- Reliability Diagrams
- Relative Operating Characteristics (ROC)
- Ranked Probability Score (RPS)
- Continuous Ranked Probability Score (CRPS)
- Continuous Ranked Probability Skill Score (CRPSS)
- Rank histogram (Talagrand diagram)
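Three of these measures are compact enough to sketch in plain Python. The function names and sample values are hypothetical. The Brier skill score here uses the sample base rate as the climatological reference, which is one common choice among several. The CRPS uses the standard ensemble estimator (mean absolute error of the members minus half the mean absolute difference between members). The rank histogram ignores ties.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def brier_skill_score(probabilities, outcomes):
    """Skill relative to always forecasting the sample base rate (assumed reference).

    Undefined (division by zero) if the event always or never occurs in the sample.
    """
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probabilities, outcomes) / reference


def crps_ensemble(members, observation):
    """CRPS of one ensemble forecast against one observation."""
    n = len(members)
    error_term = sum(abs(m - observation) for m in members) / n
    spread_term = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return error_term - spread_term


def rank_histogram(ensembles, observations):
    """Count how often the observation falls in each of the n + 1 rank slots."""
    n = len(ensembles[0])
    counts = [0] * (n + 1)
    for members, obs in zip(ensembles, observations):
        rank = sum(1 for m in members if m < obs)  # ties are not treated specially
        counts[rank] += 1
    return counts


print(brier_skill_score([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))
print(crps_ensemble([1.0, 2.0, 3.0], 2.0))
print(rank_histogram([[1.0, 2.0, 3.0]] * 3, [0.5, 2.5, 3.5]))
```

A flat rank histogram is consistent with a well-calibrated ensemble spread, a U shape suggests too little spread, and a dome suggests too much.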
Multi-model ensembles
An ensemble size of 51 members works pretty well for many weather situations, although we noticed that last winter the ensemble size was definitely too small owing to highly variable and unpredictable conditions. For longer time scales (e.g. seasonal forecasts), an ensemble size of 40 is generally regarded to be too small.
EUROSIP is a multimodel ensemble for seasonal forecasts including ECMWF, UK Met Office, and MeteoFrance; recently the U.S. model was added. From the linked presentation by David Stockdale:
What would an ‘ideal’ multi-model system look like?
- Assume fairly large number of models (10 or more)
- Assume models have roughly equal levels of forecast error
- Assume that model forecast errors are uncorrelated
- Assume that each model has its own mean bias removed
- A priori, for each forecast, we consider each of the models’ forecasts equally likely [in a Bayesian sense – in reality, all the model pdfs will be wrong]
- A posteriori, this is no longer the case: forecasts near the centre of the multi-model distribution have higher likelihood
- Different from a single model ensemble with perturbed ic’s.
- Multi-model ensemble distribution is NOT a pdf
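The ideal-case assumptions (equal error levels, uncorrelated errors, biases removed) imply that the error of the multi-model mean shrinks roughly as one over the square root of the number of models. The Monte Carlo sketch below illustrates this with Gaussian errors. The Gaussian form, the function name, and the parameter values are assumptions for illustration and are not taken from the Stockdale presentation.

```python
import random
from math import sqrt


def rmse_of_multimodel_mean(n_models, error_sd=1.0, n_cases=20000, seed=0):
    """Simulated RMSE of the mean of n_models unbiased, independent model errors."""
    rng = random.Random(seed)
    total_sq = 0.0
    for _ in range(n_cases):
        # Each model's error is an independent zero-mean Gaussian draw.
        errors = [rng.gauss(0.0, error_sd) for _ in range(n_models)]
        mean_error = sum(errors) / n_models
        total_sq += mean_error ** 2
    return sqrt(total_sq / n_cases)


for n in (1, 4, 10):
    # Compare the simulated value with the theoretical error_sd / sqrt(n).
    print(n, rmse_of_multimodel_mean(n), 1.0 / sqrt(n))
```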
Non-ideal case
Model forecast errors are not independent. Dependence will reduce degrees of freedom, hence the effective n; will increase uncertainty
In some cases, reduction in n could be drastic
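One way to see how drastic the reduction in n can be is the textbook result for n variables with equal variance and a common pairwise correlation ρ: the variance of their mean equals that of n_eff = n / (1 + (n − 1)ρ) independent variables. The sketch below applies that formula. The equal-correlation assumption is a simplification introduced here, not something stated in the presentation, and the function name is hypothetical.

```python
def effective_sample_size(n_models, pairwise_correlation):
    """Effective number of independent models under a common pairwise error correlation."""
    return n_models / (1.0 + (n_models - 1) * pairwise_correlation)


for rho in (0.0, 0.2, 0.5, 0.8):
    print(rho, effective_sample_size(10, rho))
```

With ten models and an error correlation of 0.5, the multi-model mean carries the information of fewer than two independent models.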
Initial condition error can be important. The foregoing analysis applies to the ‘model error’ contribution to error variance
Initial condition error and irreducible error growth terms follow usual ensemble behaviour, and must be accounted for separately
What weight should be given to outliers?
Method for p.d.f. estimation (1)
Assume underlying normality
Calculate robust skill-weighted ensemble mean. Do not try a multivariate fit (very small number of data points)
Weights estimated ~1/(error variance). Would be optimal for independent errors – i.e., is conservative.
Then use 50% uniform weighting, 50% skill dependent
Comments: Rank weighting also tried, but didn’t help.
QC term tried, using likelihood to downplay impact of outliers, but again didn’t help. Outliers are usually wrong, but not always.
Models usually agree reasonably well, and tweaks to weights have very little impact anyway.
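The weighting recipe in method (1) can be sketched as follows: inverse-error-variance weights, normalized, then blended half-and-half with uniform weights. This is one reading of the slide and not the EUROSIP code. The function names and numbers are hypothetical, and the "robust" element of the estimate is not implemented.

```python
def blended_weights(error_variances, skill_fraction=0.5):
    """Blend uniform weights with weights proportional to 1 / error variance."""
    n = len(error_variances)
    inverse = [1.0 / v for v in error_variances]
    total = sum(inverse)
    skill_weights = [w / total for w in inverse]  # normalized to sum to 1
    uniform = 1.0 / n
    return [(1.0 - skill_fraction) * uniform + skill_fraction * s for s in skill_weights]


def weighted_mean(forecasts, weights):
    """Weighted multi-model mean; weights are assumed to sum to 1."""
    return sum(f * w for f, w in zip(forecasts, weights))


# Hypothetical past error variances for three models; the third is the least skilful.
weights = blended_weights([1.0, 1.0, 2.0])
print(weights)
print(weighted_mean([0.5, 0.7, 1.2], weights))
```

The 50% uniform share pulls the weights toward equality, which fits the comment that tweaks to the weights have very little impact.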
Method for p.d.f. estimation (2)
Re-centre lower-weighted models, to give correct multi-model ensemble mean. Done so as to minimize disturbance to multi-model spread.
Compare past ensemble and error variances.
-Use above method (cross-validated) to generate past ensembles
-Unbiased estimates of multi-model ensemble variance and observed error variance
-Scale forecast ensemble variance
-50% of variance is from the scaled climatological value, 50% from the scaled forecast value
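The variance steps above are terse, so the following Python sketch shows only one possible interpretation. It assumes the scaling factor is the ratio of mean past squared error to mean past ensemble variance, and that the "climatological value" is the mean past ensemble variance. Both assumptions, the function name, and the numbers are mine and are not confirmed by the presentation.

```python
def blended_forecast_variance(forecast_variance, past_ensemble_variances,
                              past_squared_errors, forecast_fraction=0.5):
    """Blend the scaled forecast spread with the scaled climatological spread."""
    mean_ens_var = sum(past_ensemble_variances) / len(past_ensemble_variances)
    mean_err_var = sum(past_squared_errors) / len(past_squared_errors)
    # Scale factor that makes past spread match past error on average (assumed form).
    scale = mean_err_var / mean_ens_var
    scaled_forecast = scale * forecast_variance
    scaled_climatology = scale * mean_ens_var
    return forecast_fraction * scaled_forecast + (1.0 - forecast_fraction) * scaled_climatology


# Hypothetical case: past ensembles were under-dispersive by a factor of two in variance.
print(blended_forecast_variance(1.0, [1.0, 2.0, 3.0], [4.0, 2.0, 6.0]))
```

In practice the past ensembles and errors would come from cross-validated hindcasts, as the slide notes.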
Comments: For multi-model, use of predicted spread gives better results. For single model, seems not to be so.
An additional example for weather models is TIGGE (THORPEX Interactive Grand Global Ensemble). A good overview of TIGGE is given in this presentation by Hamill and Hagedorn.
To cut to the chase, the best model (ECMWF) performs as well as the multi-model ensemble mean, and ECMWF calibrated by the reforecasts outperforms the multi-model ensemble.
Here is how my team has approached the issue. We use multiple models in our hurricane track forecasts and in our seasonal forecasts. However, we do not combine the simulations from the model ensembles into a grand ensemble; rather we consider each ensemble separately and the forecaster weights the ensemble based on recent model performance or uses the additional models in characterization of forecast uncertainty.
JC conclusion: The weather modelling and forecast communities have developed sophisticated techniques for the interpretation of ensemble simulations. The extent to which we can usefully apply these techniques to climate models will be discussed in Part II, along with alternative strategies for ensemble interpretation.
Moderation note: This is a technical thread, please keep your comments relevant.
