Multi-model data fusion as a tool for PUB: example in a Swedish mesoscale catchment

: Post-processing the output of different rainfall-runoff models allows one to pool strengths of each model to produce more reliable predictions. As a new approach in the frame of the ”Prediction in Ungauged Basins” initiative, this study investigates the geographical transferability of different parameter sets and data-fusion methods which were applied to 5 different rainfall-runoff models for a low-land catchment in Central Sweden. After usual calibration, we adopted a proxy-basin validation approach between two similar but non-nested sub-catchments in order to simulate ungauged conditions. Many model combinations outperformed the best single model predictions with improvements of efficiencies from 0.70 for the best single model predictions to 0.77 for the best ensemble predictions. However no ”best” data-fusion method could be determined as similar performances were obtained with different merging schemes. In general, poorer model performance, i.e. lower efficiency, was less likely to occur for ensembles which included more individual models. Abstract. Post-processing the output of different rainfall-runoff models allows one to pool strengths of each model to produce more reliable predictions. As a new approach in the frame of the “Prediction in Ungauged Basins” initiative, this study investigates the geographical transferability of different parameter sets and data-fusion methods which were applied to 5 different rainfall-runoff models for a low-land catchment in Central Sweden. After usual calibration, we adopted a proxy-basin validation approach between two similar but non-nested sub-catchments in order to simulate ungauged conditions. Many model combinations outperformed the best single model predictions with improvements of efﬁciencies from 0.70 for the best single model predictions to 0.77 for the best ensemble predictions. However no “best” data-fusion method could be determined as similar performances were obtained with different merging schemes. In general, poorer model performance, i.e. lower efﬁciency, was less likely to occur for ensembles which included more individual models.


Introduction and scope
Numerous rainfall-runoff models have been developed to describe the water balance and predict runoff at different spatial and temporal scales. However, due to the complexity of natural systems, a lot of the predictive uncertainty is generally linked to the incomplete representation of the different processes involved in modelling flow generation, the so called structural uncertainty . Other sources of uncertainty are the initial conditions of the system, measured Correspondence to: J.-F. Exbrayat (jean-francois.exbrayat@umwelt.uni-giessen.de) input and parameter values, regrouped under the general term of stochastic uncertainty.
Depending on the processes they simulate and at which scale the assumptions are made, different models might therefore have different strengths and weaknesses in predicting certain parts of the hydrograph. This has been highlighted in several inter-comparison projects (e.g. Smith et al., 2004;Breuer et al., 2009) The ensemble modelling approach has been proposed to take advantage of these heterogeneities in order to provide more reliable predictions . Single-models ensembles (SMEs) are obtained from several realisations of the same model structure while exploring the parameter uncertainty. SMEs are for example typical output of the widely used Monte-Carlo based GLUE approach (Beven and Binley, 1992;Beven and Freer, 2001). Multi-model ensembles (MMEs) pool different results obtained from different model structures.
All ensembles can be evaluated in a probabilistic way based on the frequency of prediction of some selected particular events (Renner et al., 2009;Georgakakos et al., 2004). In some other studies the single predictions were combined using different statistical post-processing methods in order to produce "best" forecasts (Shamseldin et al., 1997;Georgakakos et al., 2004;Viney et al., 2009).
In the frame of applying the concept of ensemble modelling to improve the reliability of Predictions in Ungauged Basins (PUB; Sivapalan, 2003) this study evaluates the geographical transferability of parameter sets and combination schemes applied to 5 different rainfall-runoff models: LAS-CAM (Sivapalan et al., 1996), LASCAM-S (Exbrayat et al., 2010), a self written model based on the snow and soil moisture routines of HBV (Lindström et al., 1997) coupled to the published flow generation equations of INCA (Whitehead et al., 1998)  and INCA Modified in Python, described in Exbrayat et al., 2010), SWAT (Arnold et al., 1998) and HBV-N-D (Lindgren et al., 2007). Different data-fusion methods based on some of those used by Viney et al. (2009) were applied in order to produce large sets of new deterministic MMEs. This paper is organised as follows. Section 2 presents the catchment and the available data for model application, the models themselves and the different combination methods. In Sect. 3 we present the results for the single models and the newly compiled ensembles in the proxy-basin validation approach. Results are discussed in Sect. 4 and a short summary with conclusions and possible further research directions are presented in Sect. 5.

The river fyris catchment
The study area is located in central Sweden (Fig. 1). The Fyris River catchment has an area of 2000 km 2 and flows into Lake Ekoln which drains into the Baltic Sea. It is a lowland catchment with an elevation ranging between 15 and 115 m a.s.l. Land-use is dominated by mainly coniferous forests (59%) and crop lands (33%). Minor other land cover types are wetlands (4%), urban areas (2%) and lakes (2%) (Lindgren et al., 2007).
Daily records of precipitation (8 gauges) and temperature (3 stations) available from the Swedish Meteorological and Hydrological Institute (SMHI) were used for the chosen 5 years study period (2000 to 2004). Two time series of daily runoff were available over the same period for two non-nested sub-catchments of the Fyris River: Vattholma (281 km 2 ; light brown in Fig. 1) and Sävja (699 km 2 ; light green in Fig. 1). These catchments were already studied in another PUB oriented study (Seibert and Beven, 2009). Mean annual runoffs were 219 and 189 mm at Vattholma and Sävja, respectively. As a response to snow melt, high flows usually occur from late autumn to early spring with some thaw-refreezing events leading to high temporal variability during the flood.

Multi-model members
The five selected models all provide daily runoff predictions. Table 1 gives an overview of different model characteristics and requirements such as the smallest spatial units and input data. There is a good structural variability among the cohort and they may be sorted into an approximate increasing degree of complexity: LASCAM, LASCAM-S, CHIMP, SWAT and HBV-N-D. All these models feature conceptual descriptions of the natural mechanisms involved in flow generation.
LASCAM, LASCAM-S, CHIMP and SWAT were setup in a semi-distributed way. The same sub-catchment delineation, derived from an SRTM digital elevation model, was adopted for each of these 4 models. This spatial disaggregation was obtained with the ArcSWAT extension for ArcGIS (Olivera et al., 2006) which was used for the whole setup of the SWAT model. The sub-catchment delineation divided the Vattholma and Sävja basins into 9 and 28 sub-entities respectively, corresponding to mean sub-catchment areas of 31 and 25 km 2 , respectively. While the same parameter sets were applied to each sub-catchment in LASCAM and LASCAM-S, they were independent for each land-use class in CHIMP. SWAT required a disaggregation of each basin into different Hydrological Response Units (HRUs), based on unique combinations of land-use class and soil type. Land-use classes and HRUs in the two latter models were not spatially identified within their sub-catchments and their respective contributions to the flow generation were weighted as a function of their relative areas.
The HBV-N-D model is a fully distributed adaptation of the concepts of the semi-distributed HBV model (Lindström et al., 1997) based on the D8 single flow-direction algorithm (O'Callaghan and Mark, 1984). The HBV-N-D model application used in this study is based on the same setup used by Lindgren et al. (2007) and features 250 m × 250 m grid cells which are associated with a specific land-use type. The running-time of this model was the limiting factor of this study and explains the choice of a relatively short 5 years evaluation period as well as discrepancies in the calibration procedure (Sect. 2.3). Daily potential evapotranspiration was computed with the temperature-based Hargreaves method (Hargreaves and Samani, 1985) and aggregated to the required time period for each model (Table 1). Snowmelt and snowpack processes were simulated for each calculation unit based on the empirical degree-day approach in all models except LAS-CAM which does not include any snow routine. We therefore developed the LASCAM-S model by implementing a similar method based on the equations published by Lindström et al. (1997) applied at the sub-catchment scale adopted in LASCAM.

Ensemble construction and assessment
A summary of the methodological approach used in this study is presented in Fig. 2. This flow chart gives an overview of the different steps we followed in order to create our new model-fusion based forecasts. Each model was calibrated once against each daily discharge record. The calibration was realised in a single-objective way using the Shuffled Complex Evolution optimisation algorithm (Duan et al., 1992) for all the models except for the time-consuming HBV-N-D for which the Parameter Estimator PEST (PEST; Doherty, 2004) was chosen. The optimisation criterion OF was defined as the average of the Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970) calculated for predicted discharge values directly (Eq. 1) and the efficiency obtained with logarithmic values (Eq. 2).
In Equations (1) and (2), O i and S i are observed and simulated discharges at time step i whileŌ is the mean observed runoff over the N considered time steps. NSE is more sensitive to higher values while lnNSE is also sensitive to lower ones (Krause et al., 2005). The criterion OF therefore gives a better account of the global quality of the prediction. It ranges between −∞ and 1, corresponding to the poorest fit and a perfect match between observations and predictions, respectively.
Then, by comparing predictions obtained with the calibrated parameter sets with the observed data of the corresponding calibration station, we determined the different weights to be applied to our predictions according to the following methods ( Table 2). Most of these methods were already tested in a previous split-sample application case ). The weighted-mean approach (WM) used the value of the OF criterion as a weighting coefficient, giving more weight to the better performing members of the fusion procedure. Weight values were in that case independent of the number of members to be merged together. On the other hand, un-constrained and constrained multiple linear regression coefficients (UR and CR schemes) were obtained by using the observations as dependent variables and each possible combinations of 2 to 5 model realisations as independent variables. We therefore obtained different sets of coefficients depending on the combination itself.
The final step of the ensemble generation was to use the calibrated parameter sets to create new single predictions at the alternate subcatchment in the frame of the proxy-basin 46 J.-F. Exbrayat et al.: Multi-model data fusion as a tool for PUB approach. Simple daily mean (ME) or daily median (MD) were used as data-fusion methods to combine the new model realisations along the aforementioned weights and regression coefficients fitted at the other station. All the methods were applied to every possible combination of 2 to 5 models leading to the creation of 130 new deterministic predictions for each station. Eventually negative regression coefficients could lead to the occurrence of unrealistic negative predictions leading the corresponding MMEs to be disqualified. The quality of the ensemble predictions was finally evaluated by computing the aforementioned OF criterion and the percent bias (PB%, which should be close to zero), calculated by Notations correspond to those used in Eqs. (1) and (2). We finally compared the multi-model ensembles with their members and the directly calibrated single runs based on these goodness-of-fit descriptors.

Results
Calibration and validation results of the single models at the two discharge stations have been summarised in Table 3. The label Calib was used when the corresponding station was the calibration one, while the label Proxy was used when the station was used as validation one (i.e. with geographically transferred parameter sets inherited from the other subcatchment calibration). Calib runs always significantly outperformed the Proxy ones. The LASCAM model gave the worst results in both calibration and proxy-basin application. The upgraded LASCAM-S yielded the best calibration results at Vattholma and second best at Sävja but its predictive quality was really lowered in both proxy-basin approaches. The three other models (i.e. CHIMP, SWAT and HBV-N-D) also showed good calibration results for each sub-catchment.
In the proxy-basin context they always showed significantly better performance than LASCAM and LASCAM-S.
The prediction quality of the different model-combination schemes which were applied was illustrated with scatter plots of the criteria values for each station in Figs. 3a and b for Vattholma and Sävja, respectively. A number of different model combinations outperformed the best single model (Proxy runs in Figs. 3a and b) and while regression and median MMEs were the best at Vattholma, mean and weightedmean methods also performed well at Sävja. However, no MMEs outperformed the four best Calib models in either subcatchment.
In Figs. 4a and b the distribution of the criteria values obtained was represented with boxplots as a function of the number of ensemble members and merging scheme for Vattholma and Sävja, respectively. As already shown in Figs. 3a and b, the highest OF values obtained with MMEs  Table 2 for abbreviations).  Table 2 for abbreviations).
were 0.77 for Vattholma and 0.68 for Sävja. These corresponded to two different model fusion methods: constrained regression with two members at Vattholma and weightedmean of three models at Sävja. The best MMEs achieved improvements of +0.07 and 0.06 in comparison to the corresponding best Proxy run (Table 3). However, as illustrated in Figs. 3a and b, different model fusion methods provided predictions almost as good, especially at Sävja, with either more or less members (Figs. 4a and b). Comparatively, PB% Fig. 4a. Criteria values distribution depending on applied scheme (red: ME; black: WM; green: MD; blue: UR; magenta: CR; see Table 2 for abbreviations) and number of ensemble members at Vattholma. Missing crosses indicate unrealistic negative prediction obtained with UR or CR schemes. OF = objective function according to Eqs. (1) and (2); PB% = percent bias according to Eq. (3). values close to 0 were achieved several times in both cases (Figs. 3a and b) and more frequently with simple mean and weighted-average methods. The five-member MMEs were never the best predictors but the corresponding median always had low biases and even outperformed the other fivemember MMEs for OF at Vattholma. However, in this latter case, the two regression-based new predictions were disqualified since negative flow values were predicted.

Discussion
As expected the implementation of the snow module into LASCAM significantly improved the prediction quality of the calibrated models at each station for both OF and PB%. In the two calibration/proxy-basin cases, no single model could be pointed out as the global best performer Fig. 4b. Criteria values distribution depending on applied scheme (red: ME; black: WM; green: MD; blue: UR; magenta: CR; see Table 2 for abbreviations) and number of ensemble members at Sävja. Missing crosses indicate unrealistic negative prediction obtained with UR or CR schemes. OF = objective function according to Eqs. (1) and (2); PB% = percent bias according to Eq. (3).
for each considered station and criterion. Very different predictions (according to the metrics) have been obtained even though the two studied catchments were very similar. The more distributed CHIMP, SWAT and HBV still obtained better results with transferred parameter values than LASCAM and LASCAM-S. The overall heterogeneity of model predictions was considered as a good starting point for the ensemble generation following data-fusion methods (Shamseldin et al., 1997).
Different models combinations gave a large range of predictions and good improvements were realised by some of them. As plotted in Figs. 4a and b the best MMEs considering the criterion OF were different between the two stations: constrained regression schemes were usually more efficient at Vattholma (magenta dots in Fig. 4a) while highest improvements were realised with median and weighted-mean schemes at Sävja (green and black dots in Fig. 4b). This was achieved in both cases regardless of the number of merged members. This statement is consistent with Abrahart and See (2002) who showed that the most efficient data-fusion methods depended on the particular application case. However, as Georgakakos et al. (2004) demonstrated, the simple mean of five model predictions consistently outperformed the best single model prediction in several catchments, it was only true at Sävja in our study.
A general trend was that using more members in the selected compilation methods constrained the distribution of the OF and PB% values (Figs. 4a and b). This was similar to the results obtained by Viney et al. (2009) for simple mean and weighted combinations in both calibration and validation periods of a split-sample approach. More precisely, lower values of OF were less likely to occur with more members except with the "median" scheme. On the other hand this latter method applied to the five model predictions gave the closest value to 0 for PB% among the five-member MMEs while keeping OF values close to the best achieved by any other combination method. This could be explained because CHIMP was the most frequent model to participate in this MME while also providing the least biased of the Proxy predictions.
The simplest averaging schemes (i.e. mean and weightedmean) showed similar results in terms of criteria values distribution (MD and WM in Figs. 4a and b). There was a slight shift of boxplots, and therefore the distribution, towards zero bias and higher efficiencies for "weighted-mean" schemes, thus illustrating the effect of the weighting process. This occurred even if the poorly fitted LASCAM-S Proxy runs were given heavy weights in response to good calibration results (Table 3). Viney et al. (2009) also showed that applying multiplelinear regression coefficients gave the best results in terms of NSE for calibration periods. But these ensembles were outperformed by some other combinations in a split-sample calibration context. The proxy-basin validation scheme adopted in our study showed that the MMEs based on constrained regressions were the best performers at Vattholma but not at Sävja. Still, they obtained OF values close to the best ones (Figs. 3b and 4b) in this latter case. There was no real advantage in using these more complicated regression merging schemes as it resulted in unrealistic negative runoff predictions 14 and 2 times for Vattholma and Sävja, respectively. Some multi-model predictions gave PB% values close to zero (Figs. 3a and b) while the best Proxy run for this criterion had a bias of 8.3% (CHIMP at Sävja, Table 3). This could be attributed to an inter-model balance as LASCAM and LASCAM-S usually had opposed biases in comparison to the other Proxy model realisations (Table 3). It therefore moved the merged predictions towards bias values closer to 0. Moreover, even with the poor efficiencies illustrated by low OF values for LASCAM and LASCAM-S (Table 3), these MMEs could also achieve good results for this latter criterion at Vattholma. At Sävja they even obtained the best OF (Fig. 4b) while Table 3 showed that in that case, the LAS-CAM and LASCAM-S Proxy runs were very biased with negative OF.
These results represented another illustration of the advantage of combining strengths of different predictions which is the surrounding philosophy in model combination (Shamseldin et al., 1997). Such MMEs were able to provide at least good estimates of the global water balance over the study period even though some of them were partly based on the worst single models (which indeed provided the final prediction with interesting information).

Conclusions
Several interesting results could be either deduced or confirmed in regards of previously published studies but this time in the frame of a PUB application. First, no optimal combination schemes could be identified, even though the two catchments investigated were similar, which did not increase the transferability of the different methods. Still, the study showed that the more members that were merged together, the lower the risk of getting bad predictions. In the case of a PUB, this offers a minimum guarantee that the newly compiled predictions would be closer to reality even while including very bad single predictors. For example, using the simple daily median value of the five single model predictions provided good results with a low bias and could be identified as a good all-round compromise.
However, even if good results were obtained with some of the data fusion methods, none of them could outperform the calibration process as a result of a poor transferability of single parameter values. A probable limitation of this study was therefore to consider only one realisation per model in a deterministic way. According to the equifinality theory (Beven and Freer, 2001), different parameter combinations are able to give evenly good predictions. Due to small and unquantifiable heterogeneities between catchments, a common optimal parameter set is not likely to exist even while considering two basins in the same hydro-climatic context. Therefore, an idea for next PUB predictions would be to study the transferability of optimised (i.e. constrained) parameter ranges of the predictions after analyses of numerous realisations of the same model (i.e. SMEs) and to introduce probabilistic rather than deterministic predictions.