Forecast convergence score: a forecaster's approach to analysing hydro-meteorological forecast systems.

In this paper the properties of a hydro-meteorological forecasting system for forecasting river flows have been analysed using a probabilistic forecast convergence score (FCS). The focus on fixed event forecasts provides a forecaster’s approach to system behaviour and adds an important perspective to the suite of forecast verification tools commonly used in this field. A low FCS indicates a more consistent forecast. It can be demonstrated that the annual maximum of the FCS has decreased over the last 10 years. With lead time, the FCS of the ensemble forecast decreases whereas that of the control and high-resolution forecasts increases. The FCS is influenced by the lead time, threshold and catchment size and location. It indicates that one should use seasonality-based decision rules to issue flood warnings.


Introduction
Analysing the performance of a hydro-meteorological forecast system is one important component in establishing trust in the forecast results. If the forecasting system is designed to issue early flood warnings for medium to severe events, as is the European Flood Alert System (EFAS; Thielen et al., 2009a), then this presents a particular challenge due to the low frequency of extreme events and the non-stationarity of river flows (Cloke and Pappenberger, 2009). The performance of EFAS has recently been analysed over a 10 year period and the skill of the EFAS forecasts has been shown to steadily increase. This study concentrated mainly on "rolling event forecasts", where the properties of a series of forecasts with a fixed lead time are analysed (Holden et al., 1985). Although this gives important insights into the performance of a hydro-meteorological forecasting system, it is somewhat counter-intuitive as the process of issuing a forecast focuses on a particular event in the future. In contrast, a "fixed-event forecast" analysis considers the performance with respect to a given event and thus compares forecasts with different lead times, here using a probabilistic forecast convergence score (FCS) (Nordhaus, 1987; Clements, 1997; Clements and Taylor, 2001). Such an analysis can be used to understand the "Jumpiness", "Turning points", "Continuity", "Swings" or "Inconsistency" of a forecast time series (Zsoter et al., 2009; Mills and Pepper, 1999; Lashley et al., 2008), that is, of a sequence of forecasts whose behaviour changes. Understanding such a change in forecast behaviour is an intrinsic part of any decision making process. Strongly changing consecutive forecasts may make it more difficult to derive a decision. This is compounded by decision makers' awareness that the number of false alarms must be minimised, as in the case of flood forecasting (see Demeritt et al., 2007).

Correspondence to: F. Pappenberger (florian.pappenberger@ecmwf.int)
Published by Copernicus Publications on behalf of the European Geosciences Union.

In the case of the EFAS this temporal consistency, or persistency, of forecasts is built into the decision making process: a flood alert is issued only when at least three consecutive 12-hourly flood forecasts predict that a critical threshold will be exceeded for the same river stretch. In addition, fixed event forecasts are in fact the building block of any optimized lagged forecasting system, as forecasts with different lead times are combined to optimize a particular performance. However, in these applications the focus is on predicting a correct outcome with respect to observations of river discharge, whereas the FCS compares forecasts without the focus on observations. The use of FCS enables the illustration of an important forecast attribute but does not serve as a forecast verification tool. It should be used in conjunction with an applicable suite of performance measures (Kay, 2004). However, it is as important as measuring forecast quality and can add value for forecast customers (Lashley et al., 2008).
The objective of this paper is to analyse the system properties of the EFAS focusing on fixed events. It will concentrate on three main questions: (1) Did the FCS change over a 10 year period? (2) What is the impact of forecast lead time on the FCS? and (3) What is the impact of different thresholds on the FCS? This is the first application of the FCS concept to a hydro-meteorological forecasting chain and probabilistic forecasts.
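The persistence rule described in this section (an alert only after at least three consecutive 12-hourly forecasts exceed the critical threshold) can be sketched as follows. This is an illustration only, not the operational EFAS code: the function name `alert_issued`, the parameter `required_run` and the example sequences are assumptions made for this sketch.

```python
from typing import Sequence


def alert_issued(exceeds: Sequence[bool], required_run: int = 3) -> bool:
    """Return True once `required_run` consecutive forecasts exceed the threshold.

    `exceeds[i]` is True when the i-th 12-hourly forecast predicts that the
    critical threshold will be exceeded for the river stretch of interest.
    (Hypothetical helper; the value 3 follows the rule quoted in the text.)
    """
    run = 0
    for flag in exceeds:
        run = run + 1 if flag else 0  # any non-exceedance resets the count
        if run >= required_run:
            return True
    return False


# Hypothetical sequences of consecutive 12-hourly forecasts for one river stretch.
jumpy = [True, False, True, True, False, True]
persistent = [False, True, True, True]
print(alert_issued(jumpy))       # False: never three exceedances in a row
print(alert_issued(persistent))  # True
```

The jumpy sequence contains four exceedances but never three in a row, which is the kind of behaviour the FCS is intended to quantify.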

Setting of this study
In this paper we analyse forecasts from the EFAS, which aims at increasing preparedness for floods in trans-national European river basins by providing local water authorities with medium-range and probabilistic flood forecasting information 3 to 10 days in advance (Thielen et al., 2009a, b), complementary to Member State forecasting systems. For this study, EFAS river discharge forecasts have been re-forecast every week for a period of 10 years using the weather forecast available at the time as input. Here we use the control (the central unperturbed analysis), ensemble (Ensemble Prediction System (EPS), 50 forecasts with perturbed initial conditions) and high-resolution weather forecasts of the European Centre for Medium-Range Weather Forecasts (ECMWF). An EPS accounts for the sensitivity of the non-linear set of equations of the numerical weather prediction (NWP) models to errors in the initial conditions as well as errors introduced through imperfections in the model. All simulations are evaluated for a total of 1025 river gauging stations distributed across Europe. The selected stations are sufficiently separated in space to avoid cross-correlation of station time series. Further details of the 10 year re-forecasts and the European set-up are available in Pappenberger et al. (2010).

Probabilistic forecast convergence score
The properties of fixed event forecasts have been analysed in economics, particularly in fields such as inflation and growth forecasting, using several different measures ranging from regression, root mean squared error and bias-based approaches (Nordhaus, 1987; Clements, 1997; Clements and Taylor, 2001; Mills and Pepper, 1999; Bakhshi et al., 2005) to pseudo-maximum likelihood estimators (Clements and Taylor, 2001). In weather forecasting a latitude-weighted root mean squared error (Zsoter et al., 2009) and the Ruth-Glahn forecast convergence score (Ruth et al., 2009) have been used. So far no application in hydro-meteorological forecasting or for probabilistic forecasts exists.
A drawback of the previous studies is that none has calculated probabilistic measures of FCS, although this is straightforward. The most important consideration in selecting which performance measure to use in the FCS calculation is that the score is fit for purpose (Cloke and Pappenberger, 2008). This paper shows only a single type of measure to introduce the concept. However, it should be made clear that no single measure can completely describe this attribute. In this application, the FCS_BS is based on the Brier Score and measures the mean squared probability difference between two forecasts from different lead times. Any other probabilistic score could be used, such as the (Continuous) Ranked Probability Score (FCS_CRPS), Ignorance Score (FCS_IS) or ROC Area (FCS_ROC-Area), which gives the FCS maximum flexibility. We use 7 different river discharge thresholds (4 EFAS thresholds representing return periods of 1, 2, 5 and 20 years, and Q90, Q50 and Q10 as explained below). Low (high) values of FCS_BS indicate consistency (inconsistency) between the compared forecasts. This measure indicates a system attribute rather than a system performance, thus there is no optimal or sub-optimal behaviour. It can be compared to the natural variability, for example the score of observations separated by a distance d, which is in fact a measure of autocorrelation. The score can be extended to measure the number of significant swings or turning points by defining an FCS_BS level that represents the minimum change necessary to count as a swing (extending the concept of Ruth et al., 2009 to probabilistic scores). It is also possible to integrate over several lead times. However, this is beyond the scope of this paper. EFAS uses four thresholds to issue flood warnings, namely severe, high, medium and low.
These are generated from the model climatology of a 17-year run (1990-2006) with observed data on a daily time step. In this study, we have also analysed all percentiles from the 5th to the 95th, and the results section concentrates on Q90, Q50 and Q10 to represent the typical flow statistics of a hydrological time series. The selected quantiles are of course not directly flood related and are more relevant for water management; however, they allow some conclusions on the general behaviour of a forecasting system.
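As a minimal sketch of the Brier-Score-based measure described in this section, assume that each forecast is reduced to a probability of exceeding the discharge threshold (the fraction of ensemble members above it, or 0 or 1 for a deterministic run) and that the score is the mean, over N events, of the squared difference between the probabilities issued at two lead times. The function names, the swing level and the example numbers below are assumptions made for illustration, not the EFAS implementation.

```python
from typing import Sequence


def exceedance_probability(members: Sequence[float], threshold: float) -> float:
    """Fraction of ensemble members above the discharge threshold.

    Assumption: a deterministic forecast is treated as a one-member
    ensemble, so its probability is either 0.0 or 1.0.
    """
    return sum(q > threshold for q in members) / len(members)


def fcs_bs(p_earlier: Sequence[float], p_later: Sequence[float]) -> float:
    """Mean squared difference between paired exceedance probabilities.

    p_earlier[k] and p_later[k] are forecasts of the same event k issued at
    two different lead times (for example 6 and 5 days ahead).
    """
    if len(p_earlier) != len(p_later) or not p_earlier:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((a - b) ** 2 for a, b in zip(p_earlier, p_later)) / len(p_earlier)


def count_swings(p_earlier: Sequence[float], p_later: Sequence[float],
                 level: float) -> int:
    """Number of events whose squared probability change reaches `level`.

    `level` is a hypothetical user-chosen minimum change that counts as a swing.
    """
    return sum((a - b) ** 2 >= level for a, b in zip(p_earlier, p_later))


# Hypothetical four-member ensemble and a threshold of 100 (arbitrary units).
print(exceedance_probability([80.0, 120.0, 130.0, 90.0], 100.0))  # 0.5

# Hypothetical exceedance probabilities for four events at two lead times.
p_day6 = [0.5, 0.0, 1.0, 0.25]
p_day5 = [0.75, 0.0, 0.0, 0.25]
print(fcs_bs(p_day6, p_day5))             # 0.265625
print(count_swings(p_day6, p_day5, 0.5))  # 1
```

With this form, a single pair of deterministic forecasts can only score 0 or 1, whereas an ensemble pair can take intermediate values.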

Did the FCS change over a decade?
In Figure 1 the FCS_BS is shown as an average over Europe comparing the lead times of 6 and 5 days (similar results can be observed with other lead days). In an early warning system such as EFAS these are the most important lead times as they are outside the reach of deterministic predictability (for the medium-size catchments) and not yet in the range of high uncertainty. The figure shows a seasonal cycle with higher inconsistencies in the forecast (high FCS_BS) during the rainy period and lower values during drier episodes. The figure also indicates natural variability derived from the observations for comparison. Although one does not expect the forecast to fully follow this natural variability, it will still influence the forecast. Such a fluctuation would have to be reflected in any decision rules for flood warnings. There is no significant trend in the annual mean or the minimum. However, the annual maximum of the FCS_BS decreases over 13 years, indicating an increased consistency (minimum trend and maximum trend are indicated by dotted lines in Fig. 1). One might expect that the analysis will be influenced by major hydrological events. In 2000, 2002, 2005 and 2006 more than the average number of floods occurred in Europe. In addition there has been one large drought (2003). However, individually none of these factors seems to have had a major impact on the results. Thus the increase in consistency is most probably the accumulated effect of changes in the NWP system and the effect of ever improving data assimilation over the years.

Adv. Geosci., 29, 27-32, 2011 www.adv-geosci.net/29/27/2011/

Impact of lead time on the FCS_BS
Figure 2 shows a clear impact of lead time on the FCS_BS for Q50 (all other thresholds show the same behaviour). For the EPS the score decreases with lead time as the ensembles approach the climatological distribution and thus become increasingly similar in their threshold exceedance values.
The high-resolution and control forecasts show the opposite behaviour, with an FCS_BS that increases with lead time. The error of these forecasts increases with lead time and thus there is a higher probability of the forecast jumping. In other words, a larger forecast discrepancy can be expected between days 9 and 10 than between days 2 and 3. The control forecast has a lower FCS_BS than the high-resolution forecast because its coarser resolution makes it smoother (see also results in Zsoter et al., 2009). Even at short lead times the EPS has a lower FCS_BS, which should make it more suitable for flood forecasting decision-making. It should be noted that EPSs also have a higher skill than deterministic forecasts (see Pappenberger et al., 2010). It can also be demonstrated that the FCS_BS values for the deterministic runs are correlated, meaning that if there is a high inconsistency in one forecast pair then there is also a high probability of a high inconsistency value for one or more other forecast pairs. For the high-resolution forecast, in 53% of the cases in which an FCS_BS value of 1 occurs for at least one lead time (for Q50), a value of 1 is also observed for more than one lead time. The same value is 51% for the control forecast and 11% for the EPS (the latter based on an FCS_BS value of at least 0.7). This reinforces the argument on differences between control, high-resolution and ensemble runs (see above).
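The conditional percentages quoted above can be read as follows: among the events for which at least one pair of lead times reaches a given score level, count the share for which two or more pairs reach it. The sketch below illustrates that reading under the assumption that one score per lead-time pair is available for every event; the function name and the example values are hypothetical, and the paper's exact counting procedure may differ.

```python
from typing import Sequence


def repeated_jump_fraction(per_event_scores: Sequence[Sequence[float]],
                           level: float) -> float:
    """Share of events with two or more lead-time pairs at or above `level`,
    among the events that have at least one such pair.

    per_event_scores[k] holds one score per pair of consecutive lead times
    for event k. Returns NaN when no event reaches the level at all.
    """
    hits = [sum(s >= level for s in scores) for scores in per_event_scores]
    at_least_once = [h for h in hits if h >= 1]
    if not at_least_once:
        return float("nan")
    return sum(h >= 2 for h in at_least_once) / len(at_least_once)


# Hypothetical deterministic scores (0 or 1) for three events, four pairs each.
events = [
    [0.0, 1.0, 0.0, 1.0],  # two jumps
    [0.0, 0.0, 0.0, 0.0],  # no jump, excluded by the condition
    [1.0, 0.0, 0.0, 0.0],  # a single jump
]
print(repeated_jump_fraction(events, level=1.0))  # 0.5
```

For ensemble forecasts a lower level (0.7 in the text) plays the role that the value 1 plays for deterministic runs.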

Impact of different thresholds on the FCS_BS
So far all results have been based on using the Q50 threshold for illustration purposes. Although there are some minor differences between the thresholds, all other thresholds exhibit a broadly similar behaviour and would not lead to different conclusions (Table 1). The EFAS alert levels clearly show low values, indicating a high consistency in comparison to Q10, Q50 and Q90. However, this is misleading as the alert levels are rarely exceeded and the comparison is therefore dominated by a substantial number of correct rejections. Otherwise Q50 shows the highest FCS_BS values, with Q10 and Q90 showing lower values.
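Why rarely exceeded thresholds produce low scores can be made visible with a small hypothetical example, assuming the score is the mean squared difference between exceedance probabilities at two lead times. The numbers are invented: in 95 of 100 events both forecasts agree on non-exceedance, and in the remaining 5 a deterministic forecast flips completely. Restricting the mean to the events in which at least one forecast signals an exceedance is done here only to expose the dilution; it is not a measure proposed in the paper.

```python
from typing import Sequence


def fcs_bs(p_a: Sequence[float], p_b: Sequence[float]) -> float:
    """Mean squared difference between paired exceedance probabilities."""
    return sum((a - b) ** 2 for a, b in zip(p_a, p_b)) / len(p_a)


# Hypothetical pairs (probability at the longer lead, probability at the shorter lead).
pairs = [(0.0, 0.0)] * 95 + [(1.0, 0.0)] * 5

# Score over all events: the 95 agreeing non-exceedances dominate the mean.
all_score = fcs_bs([a for a, _ in pairs], [b for _, b in pairs])

# Score over events where at least one forecast signals an exceedance.
active = [(a, b) for a, b in pairs if a > 0.0 or b > 0.0]
active_score = fcs_bs([a for a, _ in active], [b for _, b in active])

print(all_score)     # 0.05
print(active_score)  # 1.0
```

The overall score of 0.05 suggests high consistency even though every forecast that signalled an exceedance was reversed at the next lead time.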

Impact of catchment size and catchment location
The impact of catchment size and catchment location has also been studied (see Table 2). The smaller the catchment the higher the FCS_BS, as smaller catchments usually have a quicker response time. The differences are more prominent in the high-resolution and control runs than in the EPS. This indicates that a persistence criterion used in a flood warning will work well for large catchments but may struggle for smaller catchments given the present EPS horizontal resolution. Location of the catchment is paramount as it is correlated with the stability of synoptic patterns (not shown).

The role of inconsistency in a forecast chain
The consistency of forecasts becomes especially important if this attribute is used and incorporated into a decision making process. Consistent forecasts may in some cases improve forecasting ability. For example, Bartholmes et al. (2009) demonstrated a reduced false alarm rate in combination with limited impact on correct forecast rate through the combination of fixed event forecasts. Consistency of results from one forecast to another has become an important element in decision making for EFAS forecasters. Persson and Grazzini (2007) argue that many meteorological forecasters are adept at handling inconsistent forecasts. Such inconsistency prevents the forecaster from relying on the latest NWP forecast. In addition, they argue that a consistent forecast may lull forecasters into a false sense of reliability, which makes it even more difficult to deal with sudden unexpected forecasts. The magnitude of the inconsistency is of particular importance as a gradually changing forecast may contribute to a higher sense of reliability than an abruptly changing one (Lashley et al., 2008). Inconsistency can be an asset as it can point to certain types of events: for typically convective situations, small scale phenomena and flash floods, the forecasts are less consistent than for largely synoptic scale driven floods, e.g. 5B weather types. In addition, it alerts forecasters to possible forecast problems and highlights alternative developments (see full details in Persson and Grazzini, 2007). If forecasts are inconsistent it may be best practice to rely more heavily on the most recent, or a synthesis of the two, but over-interpretation and non-issue of warnings remain pitfalls with inconsistent forecasts.
In flood forecasting there is a requirement for a complex decision making framework, as forecasters have a necessary aversion to false alarms and an unwillingness to change flood warning levels (Demeritt et al., 2007), as well as decision rules on when to issue a forecast.
It is interesting to note that human forecasts tend to be more consistent than a pure numerical forecast (Lashley et al., 2008). In addition, it is vital to understand to whom one communicates these forecasts and information on inconsistency. It may well be that trained experts are better able to deal with inconsistency whereas it may cause a loss of confidence in untrained audiences (Lashley et al., 2008). These issues need further exploration in future research.

Consistency and forecast skill
Although consistency should not be used as a proxy for forecast accuracy (Hamill, 2003), the inconsistency of an ensemble of successive forecasts is taken in many cases to be an indication of forecast uncertainty (Hamill, 2003; Hoffman and Kalnay, 1983; Dalcher et al., 1988; Palmer and Tibaldi, 1988). Nevertheless, there is a clear relationship between forecast consistency and forecast error. Persson and Grazzini (2007) demonstrate that the correlation between forecast jumpiness and forecast errors (typically 30% according to investigations by Hoffman and Kalnay, 1983; Dalcher et al., 1988; Palmer and Tibaldi, 1988; Roebber, 1990 and others) is a statistical artefact. They further demonstrate that this correlation increases with decreasing forecast skill, with a peak at 50% for completely skill-less forecasts (see Appendix B in Persson and Grazzini, 2007).
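One simple setting in which a correlation of 50% arises for skill-less forecasts can be checked numerically. Assume, purely for illustration, that the observation and two consecutive forecasts of the same event are independent draws from the same climatological distribution. The jump into the latest forecast (f_new - f_old) and the error of the latest forecast (f_new - obs) then share only the term f_new, so their covariance equals one variance while each has twice that variance, giving a correlation of 1/2. This is a sketch under these stated assumptions and does not reproduce the derivation of Persson and Grazzini (2007); all names are hypothetical.

```python
import math
import random
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def skill_less_correlation(n_events: int = 50_000, seed: int = 0) -> float:
    """Correlation between jump and error when forecasts carry no skill.

    Assumption: observation and both forecasts are independent standard
    normal draws, i.e. the forecasts are random samples from climatology.
    """
    rng = random.Random(seed)
    jumps, errors = [], []
    for _ in range(n_events):
        obs = rng.gauss(0.0, 1.0)
        f_old = rng.gauss(0.0, 1.0)
        f_new = rng.gauss(0.0, 1.0)
        jumps.append(f_new - f_old)   # change between consecutive forecasts
        errors.append(f_new - obs)    # error of the latest forecast
    return pearson(jumps, errors)


print(skill_less_correlation())  # close to 0.5, up to sampling noise
```

The correlation appears although the forecasts contain no information at all, which is consistent with describing it as an artefact rather than as evidence that jumpiness predicts error.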
Probabilistic forecasts require that a correct forecast can also occur on the margins of the probability distribution. If one interprets this crudely as the initial conditions of a consecutive forecast then a fixed event forecast can have "turning points". Robust forecast verification must therefore be used alongside any analysis of consistency in order to understand any forecast system. Without this understanding, when forecasting fixed events such as floods, objective decision making may be hindered through the erroneous interpretation of consecutive forecasts. The focus on fixed event forecasts provides a forecaster's approach to system behaviour and adds an important perspective to the commonly used suite of forecast verification tools.

Conclusions
In this paper the system properties of a hydro-meteorological forecasting system (the European Flood Alert System) have been analysed in terms of fixed event forecasts. Fixed event forecast analysis uses a forecast convergence score (FCS) with respect to a given event and thus compares forecasts with different lead times. A high FCS indicates a more inconsistent forecast and a low FCS indicates a consistent forecast. The analysis has been based on a 10-year hindcast. It has been found that:
- The annual mean and minimum of the FCS do not change over the last ten years. The annual maximum decreases.
- The FCS has a seasonal pattern, which should be included in any decision making framework.
- The FCS for the EPS decreases with lead time, whereas it increases for the control and high-resolution forecasts.
- The FCS is sensitive to the threshold magnitude and flow regime.
- There is a clear impact of catchment size and location on forecast consistency, with a lower consistency in smaller catchments and at locations with more unstable synoptic weather patterns.
It is important to stress that an inconsistent fixed event forecast can be a completely natural occurrence and is not necessarily a negative feature of a forecasting system. The variability of the FCS indicates that EFAS decision rules on when to issue flood forecasts have to include a seasonal dependency. Future studies should investigate other formulations for evaluating consistency as well as measures in a combined accuracy-consistency assessment. More combinations of forecast lead times should be studied as well.