Large-scale assessment of Prophet for multi-step ahead forecasting of monthly streamflow

We assess the performance of the recently introduced Prophet model in multi-step ahead forecasting of monthly streamflow by using a large dataset. Our aim is to compare the results derived through two different approaches. The first approach uses only past information about the time series to be forecasted (standard approach), while the second approach uses exogenous predictor variables alongside the endogenous ones. The additional information used in the fitting and forecasting processes includes monthly precipitation and/or temperature time series, and their forecasts, respectively. Specifically, the exogenous (observed or forecasted) information considered at each time step exclusively concerns the time of interest. In total, four algorithms are based on the Prophet model. Their forecasts are also compared with those obtained using two classical algorithms and two benchmarks. The comparison is performed in terms of four metrics. The findings suggest that the compared approaches are equally useful.


Introduction
There are two different approaches to statistical time series forecasting regarding the exploited information for obtaining the forecasts. The first approach, known as the standard one, exclusively uses endogenous predictor variables, while the second approach also uses exogenous predictor variables (Hong and Fan, 2016; Hyndman and Athanasopoulos, 2018; see also the Supplement for some basic forecasting terminology used throughout the paper). Moreover, the number of the primary forecasting models is limited (Hong and Fan, 2016), while recent research in geoscience by Tyralis and Papacharalampous (2017), and Papacharalampous et al. (2018a, b, c) suggests that the forecast quality could hardly be improved in the long run by moving from one forecasting algorithm to another. On the contrary, Hong and Fan (2016) emphasize that the use of appropriate exogenous predictor variables could considerably improve the forecasts. The exogenous predictor variables to be utilized for solving a specific forecasting problem could be identified through large-scale comparisons (since the results may vary significantly depending on the case study; Papacharalampous et al., 2017b) that precede the application of interest. Such comparisons are known to facilitate benchmarking and model assessment, and require large datasets.
Monthly streamflow or river discharge forecasting is of practical importance. There are several studies approaching this specific problem without utilizing exogenous predictor variables (e.g. Ballini et al., 2001; Koutsoyiannis et al., 2008; Papacharalampous et al., 2017a), while examples of case studies adopting the alternative approach can be found in Callegari et al. (2015) and Yang et al. (2017). The results of such studies are usually presented in terms of point forecasts (hereafter forecasts) rather than in a probabilistic way, as in Tyralis and Koutsoyiannis (2014). An extensive study on the use of climate index data for forecasting monthly streamflow at 88 locations in Brazil is available in Silveira et al. (2017). Another relevant and large-scale study by De Gregorio et al. (2018) uses data originating from 300 alpine basins. Finally, Sun et al. (2014) explore the usefulness of two sets of exogenous predictor variables for one-step ahead forecasting of monthly streamflow in 438 USA catchments using the MOPEX dataset (Schaake et al., 2006). The algorithms implemented in Sun et al. (2014) are the Gaussian process, AutoRegressive Moving Average with eXogenous predictor variables (ARMAX) and MultiLayer Perceptron (MLP).
Herein we expand this latter study by investigating the utility of three different sets of exogenous predictor variables in multi-step ahead forecasting of monthly streamflow. We use a more recent dataset, i.e. the CAMELS dataset (Addor et al., 2017a, b; Newman et al., 2014, 2015), which is also larger than the MOPEX one. We implement Prophet, a forecasting model introduced by Taylor and Letham (2018) that provides the possibility of incorporating exogenous predictor variables. This model was first used in its standard mode for forecasting geophysical time series, specifically monthly precipitation and temperature time series, in Papacharalampous et al. (2018c). We compare the results provided by four variations of the Prophet model with those of two classical algorithms and two benchmarks.

Data and methods
Here we present the data and methods, while the reader is also referred to the Supplement, the code availability section and the data availability section for additional related information. We use the CAMELS dataset, which includes daily streamflow, precipitation and temperature data for 671 USA catchments. We exclude from the analysis all catchments whose data contain missing values and, finally, we form the mean monthly time series of streamflow, precipitation and temperature for the remaining 513 catchments. In Fig. 1 we present the retained catchments. The retained monthly data span from January 1980 to December 2013 (408 monthly values). A brief exploration of the formed time series of monthly streamflow is displayed in Fig. S1 (see Supplement). The seasonality pattern is obvious in the sample autocorrelation function (ACF) of the original time series and reduced in the sample ACF of the deseasonalized time series, while the estimates of the Hurst parameter (H) of the Hurst-Kolmogorov process (for its definition see Supplement; see also Tyralis et al., 2018), when the latter is fitted to the deseasonalized time series as described in Tyralis and Koutsoyiannis (2011), have a median value of 0.75 and, therefore, indicate significant long-range dependence. We note that the parameter H is commonly used in the literature for measuring this dependence under the established assumption that the latter is present in the various geophysical processes. Moreover, in Fig. S2 (see Supplement) we present the Pearson's correlations between the monthly streamflow and precipitation variables, and the monthly streamflow and temperature variables. The former range between −0.37 and 0.92 with a median of 0.58, and the latter range between −0.76 and 0.75 with a median of −0.21. These correlation values are non-negligible.
We fit a variety of algorithms to the monthly values of the years 1980 to 2012 (fitting period) and forecast the monthly values of year 2013 (forecast period). We implement five forecasting algorithms that exclusively use endogenous predictor variables, namely the Naïve 1, Naïve 2, ARFIMA, SES and Prophet 1 algorithms. The first two algorithms are based on the monthly values of the last year and the average monthly values respectively, and they serve as benchmarks within our methodological framework (see also Hyndman and Athanasopoulos, 2018, chap. 2.3). ARFIMA is an automatic AutoRegressive Fractionally Integrated Moving Average algorithm available in the forecast R package (Hyndman and Khandakar, 2008; Hyndman et al., 2018). SES (Simple Exponential Smoothing) and Prophet 1 are also automatic algorithms. The former is implemented through the forecast R package and the latter through the prophet R package (Taylor and Letham, 2017).
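The two benchmarks can be illustrated with a minimal Python sketch. It is not the code used in the study (which relies on R); the function names `naive_1` and `naive_2` are hypothetical, and the sketch assumes that the fitting series starts in January and contains only complete years.

```python
def naive_1(history, horizon=12):
    """Naive 1 sketch: reuse the monthly values of the last observed year."""
    if len(history) < horizon:
        raise ValueError("history must cover at least one full year")
    return list(history[-horizon:])


def naive_2(history, period=12, horizon=12):
    """Naive 2 sketch: forecast each month by the average of that calendar month."""
    # Assumption: history[0] is a January value, so index m::period picks month m.
    monthly_means = []
    for month in range(period):
        values = history[month::period]
        monthly_means.append(sum(values) / len(values))
    start = len(history) % period  # calendar position of the first forecast month
    return [monthly_means[(start + h) % period] for h in range(horizon)]


# Two toy "years" of monthly values (hypothetical numbers, not CAMELS data).
toy_history = list(range(1, 13)) + list(range(3, 15))
last_year_forecast = naive_1(toy_history)
monthly_mean_forecast = naive_2(toy_history)
```

In this toy case the first benchmark returns the second year unchanged, and the second returns the month-by-month average of the two years.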
Since the ARFIMA and SES algorithms are suitable for forecasting normal non-seasonal data, we apply these algorithms to the normalized (through Box-Cox transformation) deseasonalized time series. The deseasonalization precedes the normalization and is performed by applying a multiplicative model of time series decomposition (see Hyndman and Athanasopoulos, 2018, chap. 6.3) to the original monthly values of the fitting period and by subsequently dividing the latter values by the estimated seasonal component, while seasonality is recovered in the produced forecasts. The same procedure is adopted for the Prophet 1 algorithm, in spite of the fact that the utilized Prophet model offers the possibility of handling the seasonality internally. This choice is made because the external seasonality handling is shown to lead to slightly better forecasts in Papacharalampous et al. (2018c), as well as for consistency with the application of ARFIMA and SES. The non-normality is handled in the Prophet 1 algorithm using the default settings. For a brief description of the ARFIMA, SES and Prophet models see Supplement (see also Papacharalampous et al., 2018c, and the references therein).
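The order of the preprocessing steps (deseasonalize, then normalize, then forecast, then back-transform and restore seasonality) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of the study: the seasonal indices are taken here as the ratio of each calendar-month mean to the overall mean, whereas classical multiplicative decomposition first removes a moving-average trend, and the Box-Cox parameter `lam` is passed by hand rather than estimated. All function names are hypothetical.

```python
import math


def seasonal_indices(series, period=12):
    """Simplified multiplicative seasonal indices: month mean / overall mean."""
    overall_mean = sum(series) / len(series)
    indices = []
    for month in range(period):
        values = series[month::period]
        indices.append((sum(values) / len(values)) / overall_mean)
    return indices


def deseasonalize(series, indices):
    """Divide each value by the seasonal index of its calendar position."""
    period = len(indices)
    return [y / indices[i % period] for i, y in enumerate(series)]


def reseasonalize(forecast, indices, start):
    """Multiply forecasts by the seasonal indices; `start` is the time index of the first forecast."""
    period = len(indices)
    return [f * indices[(start + h) % period] for h, f in enumerate(forecast)]


def box_cox(y, lam):
    """Box-Cox transformation of a positive value y with parameter lam."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def inv_box_cox(z, lam):
    """Inverse Box-Cox transformation."""
    return math.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)


# Toy series with period 4 (hypothetical values) to show the round trip.
toy_series = [10.0, 20.0, 30.0, 40.0] * 3
toy_indices = seasonal_indices(toy_series, period=4)
toy_flat = deseasonalize(toy_series, toy_indices)
toy_normalized = [box_cox(v, 0.5) for v in toy_flat]
```

A forecast produced on the transformed scale would then be passed through `inv_box_cox` and `reseasonalize`, in that order, to return to the original scale.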
In addition to the above-described algorithms, we implement the Prophet 2, Prophet 3 and Prophet 4 ones, which utilize exogenous predictor variables alongside the endogenous ones. Specifically, in Prophet 2 S_t, i.e. the mean monthly streamflow at time t, is also considered to depend on P_t, i.e. the mean monthly precipitation at time t, as measured for the fitting period and forecasted for the forecast period (seasonality included). We use the forecasts of P_t at the forecast period because the test set should not contain information which was unknown at the time that the forecast was performed. The respective exogenous predictor variables for Prophet 3 and Prophet 4 are T_t, and P_t and T_t respectively, where T_t is the mean monthly temperature at time t. T_t is used as measured for the fitting period and forecasted for the forecast period (seasonality included). The precipitation and temperature forecasts are produced by the Prophet 1 algorithm, while seasonality and non-normality are handled as in Prophet 1. The same applies to the streamflow information utilized by Prophet 2, Prophet 3 and Prophet 4. We note that all the algorithms implemented herein are designed to fit to the data very fast. The large-scale forecasting experiment of this study takes about an hour to run on a regular home PC.
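The data flow described above, in which the exogenous variable is observed in the fitting period but only forecasted in the forecast period, can be sketched in Python. The sketch deliberately replaces Prophet with two simple stand-ins, both of which are assumptions made for illustration: an ordinary least squares line relating streamflow to precipitation, and a calendar-month mean forecaster in place of the Prophet 1 precipitation forecast. All function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def monthly_mean_forecast(history, period=12, horizon=12):
    """Stand-in forecaster for the exogenous variable (not Prophet 1)."""
    means = [sum(history[m::period]) / len(history[m::period]) for m in range(period)]
    start = len(history) % period
    return [means[(start + h) % period] for h in range(horizon)]


def forecast_with_exogenous(streamflow_fit, precipitation_fit, period=12, horizon=12):
    """Forecast streamflow using only information available at the forecast origin."""
    # Fitting period: observed precipitation explains observed streamflow.
    intercept, slope = fit_line(precipitation_fit, streamflow_fit)
    # Forecast period: precipitation is itself forecasted, never observed,
    # so no information from the test set enters the streamflow forecast.
    precipitation_hat = monthly_mean_forecast(precipitation_fit, period, horizon)
    return [intercept + slope * p for p in precipitation_hat]


# Toy data with period 4 (hypothetical values): streamflow = 1 + 2 * precipitation.
toy_precipitation = [1.0, 2.0, 3.0, 4.0] * 3
toy_streamflow = [1.0 + 2.0 * p for p in toy_precipitation]
toy_forecast = forecast_with_exogenous(toy_streamflow, toy_precipitation, period=4, horizon=4)
```

The point of the design is that `forecast_with_exogenous` never receives precipitation values from the forecast period, which mirrors the requirement that the test set contains no information unknown at the time of the forecast.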
We assess the forecast quality using the RMSE (Root Mean Square Error), NSE (Nash-Sutcliffe Efficiency), d (index of agreement) and KGE (Kling-Gupta Efficiency) metrics. For their definitions see Supplement (see also Papacharalampous et al., 2018a, Supplement, and the references therein). These metrics can take values between 0 (optimal) and +∞, −∞ and 1 (optimal), 0 and 1 (optimal), and −∞ and 1 (optimal) respectively. We present the metric values in an aggregated form, while we also use them to rank the forecasting algorithms.
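For orientation, the four metrics can be sketched in Python using their commonly cited forms. The definitions actually used in the study are those of the Supplement and may differ in detail; in particular, the sketch assumes the original KGE formulation based on the correlation, the ratio of standard deviations and the ratio of means, and Willmott's original index of agreement. The function names are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def rmse(obs, sim):
    """Root mean square error; 0 is optimal."""
    return math.sqrt(_mean([(s - o) ** 2 for o, s in zip(obs, sim)]))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency; 1 is optimal, 0 matches the observed mean."""
    mean_obs = _mean(obs)
    num = sum((s - o) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def index_of_agreement(obs, sim):
    """Index of agreement d (assumed: Willmott's original form); 1 is optimal."""
    mean_obs = _mean(obs)
    num = sum((s - o) ** 2 for o, s in zip(obs, sim))
    den = sum((abs(s - mean_obs) + abs(o - mean_obs)) ** 2 for o, s in zip(obs, sim))
    return 1.0 - num / den


def kge(obs, sim):
    """Kling-Gupta efficiency (assumed: original three-component form); 1 is optimal."""
    mean_obs, mean_sim = _mean(obs), _mean(sim)
    sd_obs = math.sqrt(_mean([(o - mean_obs) ** 2 for o in obs]))
    sd_sim = math.sqrt(_mean([(s - mean_sim) ** 2 for s in sim]))
    cov = _mean([(o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)])
    r = cov / (sd_obs * sd_sim)   # linear correlation
    alpha = sd_sim / sd_obs       # variability ratio
    beta = mean_sim / mean_obs    # bias ratio
    return 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


# Toy observed values and two toy forecasts (hypothetical numbers).
toy_obs = [1.0, 2.0, 3.0, 4.0]
perfect = list(toy_obs)
mean_only = [2.5] * 4
```

Note that `kge` is undefined for a constant forecast, because the standard deviation of the forecast appears in the denominator of the correlation.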

Results and discussion
Section 3 is devoted to the exploration of the results and the discussion of the main findings. In Fig. 2 we present the side-by-side boxplots of the metric values (far outliers excluded). We observe that the Prophet 1, Prophet 2, Prophet 3 and Prophet 4 algorithms produce comparable results with each other. However, the Prophet 1 algorithm produces slightly better forecasts in terms of RMSE and NSE. The same applies to Prophet 2 for the d and KGE metrics. Moreover, in Fig. 3 we present a comparison of the computed RMSE values (far outliers included) for the {Prophet 1, Prophet 2}, {Prophet 1, Prophet 3} and {Prophet 2, Prophet 4} pairs of algorithms. The closeness in the performance of these four algorithms is also perceivable by the examination of this figure, while a few larger differences favouring the Prophet 1 algorithm are observed as far outliers. We further notice that the use of precipitation information seems to affect the forecasting performance more than the use of temperature information. The use of both types of information, on the other hand, mostly results in the largest outlier RMSE values. Importantly, the fact that the use of these specific exogenous predictor variables did not (significantly) improve the performance of the algorithms in any of the 513 cases examined herein should be viewed as a lesson learned from this study.
Adv. Geosci., 45, 147–153, 2018, www.adv-geosci.net/45/147/2018/
In fact, the selection of appropriate exogenous variables is identified in the forecasting literature as a problem that is both a target and a challenge (see, for example, Hong and Fan, 2016), while several approaches not relying on exogenous information are mostly of the same usefulness, especially in geosciences, for which small differences in the forecasting performance of the algorithms do not have any practical effect on decision-making (see also Papacharalampous et al., 2018a). This conclusion can be drawn based on the large-scale results of Tyralis and Papacharalampous (2017) and Papacharalampous et al. (2017a, 2018a, b, c). Here as well, the differences in the results obtained using the various forecasting algorithms are mostly small, while Naïve 1 and SES are on average the worst performing. On the contrary, Naïve 2 performs well, almost as well as the best performing algorithms, i.e. Prophet 1 and ARFIMA. This good performance of Naïve 2 is particularly interesting, and it provides a good reason for always implementing naïve algorithms alongside more advanced techniques, as also emphasized by forecasting experts (Hyndman and Athanasopoulos, 2018).
Finally, in Fig. 4 we comparatively present the rankings of the implemented algorithms within the conducted experiment according to the RMSE metric. We observe that each of the algorithms may perform better or worse compared to the rest depending on the examined case study. This figure is particularly interesting, especially when viewed in comparison to several studies presenting new techniques and reporting on their superior performance to others based on case studies, while it also confirms in an illustrative way the findings of Papacharalampous et al. (2017b, 2018a, c) related to the "no free lunch theorem". According to the no free lunch theorem, there is no model that will always perform better than other models (Wolpert, 1996). We complement Fig. 4 by also providing Figs. S3 and S4 (see Supplement). These figures present the number of times that each algorithm is ranked from best (1st) to worst (8th) and the average rankings of the algorithms respectively. The best average ranking is computed for Prophet 1 and is equal to 3.87, followed by Prophet 3 and Naïve 2 with average rankings equal to 3.94 and 3.95 respectively. SES is the worst performing according to this criterion with an average ranking equal to 5.58. The remaining methods are in between with average rankings 4.18 (ARFIMA), 4.34 (Prophet 2) and 4.89 (Prophet 4).
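The computation of per-catchment rankings and their averages can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the RMSE values are toy numbers rather than results of the study, and ties are broken alphabetically by algorithm name, which is an assumption not stated in the paper.

```python
def rank_algorithms(rmse_by_algorithm):
    """Rank algorithms for one catchment; rank 1 has the smallest RMSE."""
    # Assumption: ties are broken alphabetically by algorithm name.
    ordered = sorted(rmse_by_algorithm, key=lambda name: (rmse_by_algorithm[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def average_rankings(rmse_per_catchment):
    """Average the per-catchment ranks of each algorithm over all catchments."""
    totals = {}
    for scores in rmse_per_catchment:
        for name, rank in rank_algorithms(scores).items():
            totals[name] = totals.get(name, 0) + rank
    n_catchments = len(rmse_per_catchment)
    return {name: total / n_catchments for name, total in totals.items()}


# Toy RMSE values for two hypothetical catchments and three hypothetical algorithms.
toy_scores = [
    {"A": 1.0, "B": 2.0, "C": 3.0},
    {"A": 2.5, "B": 0.5, "C": 3.0},
]
toy_average_ranks = average_rankings(toy_scores)
```

In the toy case, algorithms A and B each win in one catchment and therefore share the same average rank, which illustrates how an algorithm can be best in individual cases without dominating on average.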

Conclusions
We implement the recently introduced Prophet model to compare the results obtained via two different approaches to multi-step ahead forecasting of monthly streamflow. The first approach uses endogenous predictor variables only, while the second one also uses observed and forecasted information (as available at the time of the forecast) about monthly precipitation and/or temperature. In the latter approach, the value(s) of the exogenous predictor variables considered at each time step exclusively concern the time of interest. The implementation is made for 513 USA catchments using the CAMELS dataset. The results indicate that the compared approaches produce equivalent results. Future work could focus on the selection of appropriate exogenous predictor variables as proposed by Hong and Fan (2016).
Special issue statement. This article is part of the special issue "European Geosciences Union General Assembly 2018, EGU Division Energy, Resources & Environment (ERE)". It is a result of the EGU General Assembly 2018, Vienna, Austria, 8-13 April 2018.

Figure 1. Locations of the 513 catchments examined in the forecasting experiment.

Figure 2. Metric values for the 513 catchments presented in an aggregated form. The far outliers (if any) have been removed.

Figure 3. Comparison of the RMSE values for the 513 catchments as computed for three pairs of algorithms using the Prophet model. The RMSE values are presented in an aggregated form.

Figure 4. Rankings of the algorithms according to the RMSE metric. The algorithms are ranked from best (1st) to worst (8th).