Regionalization of mesoscale physically based nitrogen modeling outputs to the macroscale by the use of regression trees

In this paper, a method is presented to estimate excess nitrogen on large scales while considering processes at the scale of single fields. The approach was implemented by using the physically based model J2000-S to simulate the nitrogen balance as well as the hydrological dynamics within meso-scale test catchments. The model input data, the parameterization, the results and a detailed system understanding were used to generate regression tree models with GUIDE (Loh, 2002). For each landscape type in the federal state of Thuringia a regression tree was calibrated and validated using the model data and the simulated excess nitrogen from the test catchments. Hydrological parameters such as precipitation and evapotranspiration were also used as predictors of excess nitrogen in the regression tree models. Hence they had to be calculated and regionalized for the state of Thuringia as well; for this purpose the model J2000g was used to simulate the water balance on the macro scale. With the regression trees the excess nitrogen was regionalized for each landscape type of Thuringia. The approach allows the potential nitrogen input into the streams of the drainage area to be calculated. The results show that the applied methodology was able to transfer the detailed model results of the meso-scale catchments to the entire state of Thuringia with low computing time and without losing the detailed knowledge from the nitrogen transport modeling. This was validated with modeling results from Fink (2004) in a catchment lying in the regionalization area. The regionalized and the modeled excess nitrogen agree to 94 %. The study was conducted within the framework of a project in collaboration with the Thuringian Environmental Ministry, whose overall aim was to assess the effect of agroenvironmental measures regarding load reduction in the water bodies of Thuringia to fulfill the requirements of the European Water Framework Directive (Bäse et al., 2007; Fink, 2006; Fink et al., 2007).


Introduction
In many parts of Germany and Europe nutrient inputs into rivers and groundwater are a major reason for missing the targets of the European Water Framework Directive (WFD). Since the 1950s the excess nitrogen on German arable areas induced by N-fertilization has increased from < 30 kg N ha⁻¹ a⁻¹ to at least 85 kg N ha⁻¹ a⁻¹ nowadays (Nieder et al., 2007). The excess nitrogen enhances the risk of nitrogen leaching to surface water and groundwater. This can result in changes in the species spectrum of aquatic flora and fauna and in algae blooms culminating in eutrophication and hypoxia, which can in turn contribute to an increase of dead zones in coastal regions (Diaz and Rosenberg, 2008). The overall aim of this study, which took place in collaboration with the Thuringian Environmental Ministry, was to assess the effect of agroenvironmental measures regarding load reduction in the water bodies of Thuringia (Bäse et al., 2007). While the problem of nutrient surplus occurs in entire river systems, the measures against nutrient leaching take place mainly in individual agricultural fields. On the one hand, the evaluation of nutrient reducing measures should take place with the help of detailed models, which are able to represent these measures. On the other hand, the relevant river systems are often too large to apply such detailed models. Hence, for the evaluation of the federal state of Thuringia, a methodology based on the detailed modeling of representative meso-scale river basins and a statistical regionalization for the entire state was applied.
To reach the goal of assessing the nitrogen reduction potential of different environmental measures, an integrative dynamic river basin modeling of water and nutrient transport was carried out. The nitrogen balance, which is closely linked to the hydrological dynamics, was represented with the fully distributed physically based model J2000-S. Because the model describes the nitrogen balance in a detailed and distributed manner, it demands thorough input data and high computing resources. Therefore, three meso-scale catchments, upper Gera (approx. 850 km²), Lossa (approx. 230 km²) and Helme (approx. 190 km²), were selected as modeling units.
Published by Copernicus Publications on behalf of the European Geosciences Union.
However, the aim of the aforementioned study was to know the current status of excess nitrogen within the whole state of Thuringia in order to identify potential areas of risk and hence places where measures for nitrogen reduction should be undertaken. Therefore regression trees were selected as regionalization method, since they are able to take into consideration high resolution data and a thorough system understanding from the modeling with J2000-S while requiring low computational resources. This method was chosen under the assumption that there is a causal relationship between excess nitrogen and environmental conditions, such as land use and soil characteristics. Regression trees are piecewise constant or linear estimations of a regression function, which are generated by partitioning the sample. The partitions are illustrated as decision trees. Breiman et al. (1984) published the fundamental monograph on Classification and Regression Trees (CART).
Regression trees are applied in geosciences especially for process identification and the resulting spatial estimations. Schillinger (2002) used CART for spatial estimations of nitrate levels in soils and, in another case study, for the estimation of proportions of grain-size fractions. Laaha and Blöschl (2006) tested different statistical grouping methods to predict low flow discharges in Austria, with regression trees showing the second best performance.
The water balance was calculated with the much simpler monthly based model J2000g for the entire state of Thuringia in a distributive manner (Krause and Hanisch, 2007). The results of the J2000g model were used together with other physical geographical variables like slope, land use and field capacity of soils in a statistical regression tree technique. For each landscape type of Thuringia an individual regression tree was trained with the help of the results of J2000-S and extrapolated to the respective landscape type.

Study area
The state of Thuringia is located in the central part of Germany. It has an area of 16 172 km². According to the Thuringian Environmental Ministry (Hiekel et al., 2004) the state can be divided into seven major landscape types: Low Mountain Range, Sandstone, Shell Limestone, Basalt, Inner-Thuringian Agricultural Hill-land, Alluvial Land and Zechstein (Fig. 1).
The model was implemented in the three aforementioned test catchments (rivers Helme, Gera and Lossa shown in Fig. 1) while the regionalization methodology was applied to the entire state of Thuringia. Table 1 gives an overview of the different environmental conditions of the three test basins. The Helme and Gera watersheds have higher relief energy than the Lossa catchment, which influences climate parameters, e.g. temperature and precipitation. Therefore the Lossa basin has a higher annual average temperature and lower annual precipitation than the other two catchments.
Fig. 1 (caption fragment): ... (Hiekel et al., 2004). Coordinate System: Gauss-Krüger (GK 4, 31468).
The land use varies from dominant arable land in the Helme and Lossa regions to dominant forest in the Gera catchment.

Software
The J2000-S model is a combination of the fully distributed physically based J2000 model for hydrological simulations of meso- to macro-scale river basins (Krause, 2001) extended with the nutrient transport routines of the semi-distributive Soil and Water Assessment Tool (SWAT; Arnold et al., 1998). The implementation was done within the Java-based modeling framework JAMS (Jena Adaptable Modeling System; Kralisch and Krause, 2006).
The water pathways are calculated on the spatial basis of Hydrological Response Units (HRUs). According to Flügel (1996) HRUs are units with homogeneous land use as well as topological, pedological and geological characteristics controlling the water dynamics. The SWAT model also uses the HRU concept, but considers only soil, land use and slope (Winchell et al., 2007), whereas J2000 overlays slope, aspect, elevation, soil, land use and geology to delineate the HRUs. Moreover, the topological HRU concept of J2000 (Staudenrausch, 2001) provides a fully distributive approach, while the semi-distributive character of SWAT is not able to represent single agricultural fields, because only portions of different combinations of the mentioned layers are modeled.
The hydrological model J2000g was used to interpolate and model the associated water balance parameters for the entire state of Thuringia. J2000g is a simplified development of J2000, designed to model the hydrology of macro-scale catchments. The model is able to calculate the water balance in a physically based and spatially distributed way, but without considering lateral routing processes (Krause and Hanisch, 2007).
The regionalization was performed using GUIDE (Generalized Unbiased Interaction Detection and Estimation), which is freely available software for producing classification and regression trees (Loh, 2002). The total sample from which the tree is built forms the root or root node (t0) of the tree. The individual parts of the sample are represented by the nodes (t1, ..., tn). The partitioning of the nodes is binary: if the split condition is fulfilled (1) the left branch is followed, and if it is not fulfilled (0) the right branch is followed. The partitioning stops at the end nodes (or leaves) when a stop criterion is reached, preferably with homogeneous values in the partition of each end node. Hence, in the first step an overlarge and highly complex tree is created, which overfits the training data sample. This tree can be pruned back to eliminate nodes which do not improve the prediction (Schillinger, 2002). The average of the predicted excess nitrogen values is calculated from the model and shown at the leaves.
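The tree structure described above can be sketched in a few lines of Python. This is only an illustration of a binary, piecewise-constant regression tree and not the GUIDE implementation; the class Node, the function predict, the predictor names and all thresholds and leaf values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary, piecewise-constant regression tree."""
    variable: Optional[str] = None      # predictor tested here (None for a leaf)
    threshold: Optional[float] = None   # split condition: value <= threshold
    left: Optional["Node"] = None       # followed when the condition is true (1)
    right: Optional["Node"] = None      # followed when the condition is false (0)
    value: Optional[float] = None       # leaf prediction: mean of the training values


def predict(node: Node, sample: dict) -> float:
    """Walk from the root to a leaf and return the leaf mean."""
    while node.value is None:
        if sample[node.variable] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Toy tree with made-up thresholds and leaf means (kg N per ha and year).
toy_tree = Node(
    variable="precipitation_mm", threshold=600.0,
    left=Node(value=12.0),
    right=Node(
        variable="slope_deg", threshold=5.0,
        left=Node(value=30.0),
        right=Node(value=8.0),
    ),
)

dry_hru = predict(toy_tree, {"precipitation_mm": 550.0, "slope_deg": 2.0})
wet_flat_hru = predict(toy_tree, {"precipitation_mm": 720.0, "slope_deg": 3.0})
wet_steep_hru = predict(toy_tree, {"precipitation_mm": 720.0, "slope_deg": 9.0})
```

Each HRU is passed as a dictionary of predictor values and ends up in exactly one leaf, so the prediction is constant within each partition of the predictor space.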
GUIDE can handle both categorical and numerical predictor variables. It uses Pearson's chi-square test of residuals and bootstrap calibration (see Efron and Tibshirani, 1994) to detect co-variances between the independent variables. Based on the chi-squared test, GUIDE shows a high sensitivity to local and pairwise interactions of independent variables using curvature and interaction tests. GUIDE provides different options to fit a regression tree model as well as diverse pruning options, e.g. the k-fold cross validation method of CART (Breiman et al., 1984), where k is the number of subsets into which the training sample is divided. In this case a linear least squares regression with constant complexity at each node was built. The final tree was pruned using k-fold cross validation.
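As a rough illustration of the k-fold idea, the following Python sketch divides a sample into k subsets and uses each subset once as test data while the remaining subsets serve as training data. It does not reproduce the pruning procedure of GUIDE or CART; the function name k_fold_indices, the random shuffling and the seed are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (training_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten samples divided into five folds of two samples each.
splits = list(k_fold_indices(n_samples=10, k=5))
```

In a pruning context, the prediction error of each candidate subtree would be estimated on the held-out fold and averaged over the k folds.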

Modeling
The modeling with J2000-S was performed for the three meso-scale test catchments. Figure 2 shows one example of the modeled nitrogen output in the Gera catchment. The observed and simulated nitrogen loads at the station Erfurt-Möbisburg indicate a high conformity both of the small loads during summer low flow periods and of the higher loads during winter and spring time. The coefficient of determination for this catchment (R² = 0.60) indicates a high correlation (Bäse et al., 2007). The average measured nitrogen concentration was 22.2 mg l⁻¹ and the simulated one 24.1 mg l⁻¹ for the model run (Fig. 2) (Fink, 2007). For each model entity (Hydrological Response Unit, HRU) an average value of excess nitrogen over a long time period was determined. Excess nitrogen is calculated in this case for each HRU by taking the nitrogen input (atmospheric deposition, fertilizer) and subtracting the nitrogen output (plant uptake, lateral runoff to other areas, denitrification). Due to the topological routing of the model, the nitrogen load of lateral flows (interflow and surface runoff) is routed through HRUs until it reaches receiving waters. In this case the laterally routed nitrogen which flows into the HRU was subtracted from the output, and only the resulting net nitrogen from every HRU is considered as excess nitrogen. The value of excess nitrogen per HRU can become negative through this procedure if a HRU within the routing path has a lower nitrogen load in lateral flows than the preceding HRU. Such areas can be regarded as nitrogen sinks. They represent areas that receive a higher lateral nitrogen input than they produce themselves, or areas where nitrogen is taken up by plants or released through denitrification processes. Grassland has the potential to be a nitrogen sink (Herrmann and Neftel, 2002; Kolbe, 2002). During the regionalization the HRUs with negative values were considered as well, but they can be understood as zero values. The remaining amount of nitrogen is the excess nitrogen per HRU. Figure 3 shows these nitrogen outputs in kg N ha⁻¹ a⁻¹ for the Lossa catchment.
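The net calculation along a routing path can be illustrated with a short Python sketch. It follows one possible reading of the description above, in which the net excess nitrogen of a HRU is the lateral nitrogen load leaving it minus the lateral load entering it from the preceding HRU. The function names and the numbers are hypothetical and do not come from J2000-S.

```python
def net_excess_along_path(lateral_loads):
    """Net nitrogen per HRU along one routing path.

    lateral_loads: lateral nitrogen load leaving each HRU, ordered from the
    uppermost HRU down to the HRU next to the receiving water.
    """
    nets = []
    inflow = 0.0  # the uppermost HRU receives no lateral nitrogen
    for outflow in lateral_loads:
        nets.append(outflow - inflow)  # subtract the laterally routed inflow
        inflow = outflow
    return nets


def as_zero_if_sink(net_excess):
    """Interpret negative values (nitrogen sinks) as zero values."""
    return max(net_excess, 0.0)


# Made-up loads for three HRUs on one routing path.
nets = net_excess_along_path([10.0, 25.0, 18.0])
mapped = [as_zero_if_sink(v) for v in nets]
```

In this toy path the third HRU passes on less nitrogen than it receives, so its net value is negative and it acts as a sink.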

Data preparation for regionalization
The input data and the model results were analyzed focusing on the independent variables showing significant influence on the dependent variable, the excess nitrogen. The HRUs of the test catchments were used as the total sample. The frequency distribution of excess nitrogen per HRU is illustrated in Fig. 4. The highest number of HRUs has a nitrogen output between 0 and 10 kg N ha⁻¹ a⁻¹. The second peak lies in the range of 25 to 35 kg N ha⁻¹ a⁻¹.
A statistical analysis was done considering the influence of each model input variable on the prediction behavior of excess nitrogen. An example of the excess nitrogen dependent on land use classes is shown as box plots in Fig. 5. Even though land use plays an important role in nitrogen input due to the use of fertilizers, soil conductivity and other soil physical properties also have a significant impact. Therefore the following variables were selected based on their statistical significance with respect to excess nitrogen: precipitation, evapotranspiration as well as soil, geology and land use classes.
In the further data preparation one of the selected independent numerical variables (slope) was classified and the other independent categorical variables (land use, soil and hydrogeology) were reclassified based on their statistical significance for excess nitrogen. The new classes and the classification criteria are shown in Table 2. The land use classes, which were used for the nitrogen modeling with J2000-S, were reclassified from nine to four groups by combining classes with similar nitrogen output behavior, for instance different forest classes with similar excess nitrogen values. The same was done for the soil and hydrogeology classes. The former 39 soil classes were grouped into five classes depending on the field capacity (mm) in the upper layer of the soil (1 m). The eight hydrogeology classes were reclassified into aquiferous (class 1) and non-aquiferous (classes 2 and 3) rock based on their hydraulic conductivity (K).
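The classification of a numerical variable such as slope can be sketched in Python as a lookup against class limits. The limits of 1, 2, 5, 10 and 15 degrees and the handling of values that fall exactly on a limit are assumptions for this illustration; the authoritative criteria are those of Table 2.

```python
from bisect import bisect_right

# Assumed upper limits of the slope classes in degrees.
SLOPE_LIMITS_DEG = [1.0, 2.0, 5.0, 10.0, 15.0]


def slope_class(slope_deg: float) -> int:
    """Return the slope class (1..6) for a slope given in degrees.

    A value equal to a limit is assigned to the higher class (assumption).
    """
    return bisect_right(SLOPE_LIMITS_DEG, slope_deg) + 1


classes = [slope_class(s) for s in (0.5, 1.5, 4.0, 7.5, 12.0, 20.0)]
```

The categorical variables could be regrouped in the same spirit with a simple dictionary that maps each original class to its new group.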
Moreover, independent variables for the whole state of Thuringia were generated. The HRU delineation of Thuringia (Krause and Hanisch, 2007) provides the spatial basis for the regionalization. To obtain the spatial information on evapotranspiration and precipitation for each of the 211 000 HRUs of Thuringia, the J2000g model was used to interpolate point data of precipitation and to simulate evapotranspiration after Penman-Monteith (Monteith, 1975).
The regionalization was done for every landscape type assuming that environmental conditions are more closely related in regions with similar landscapes. Hence for every landscape type a regression tree was generated. Figure 6 shows an example of a regression tree for the landscape type Arable Hill Land.
Slope classes in degrees (table fragment): < 1, 1-2, 2-5, 5-10, 10-15, > 15.
Fig. 6. Regression tree to predict the average excess nitrogen in kg ha⁻¹ a⁻¹ for the landscape type Arable Hill Land.
The training data sets are located in the Gera catchment and in the Lossa catchment (Fig. 1). The training data set includes 9993 values. A split sample was used, with 50 % of the values for the calibration and the other 50 % for the validation. The leaf nodes are labeled with the predicted excess nitrogen in kg ha⁻¹ a⁻¹ and an ID representing the node number. The high node numbers arise because an overlarge tree is built first, which is afterwards pruned back to avoid overfitting.
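A 50 % split sample of the kind described above could be drawn as in the following Python sketch. How the split was actually drawn is not stated in the text; the random shuffling, the seed and the function name split_sample are assumptions.

```python
import random


def split_sample(records, calibration_share=0.5, seed=42):
    """Randomly split records into a calibration and a validation set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * calibration_share)
    return shuffled[:cut], shuffled[cut:]


# Integer IDs stand in for the 9993 training values mentioned in the text.
calibration, validation = split_sample(range(9993))
```

Because 9993 is odd, the two halves differ in size by one value.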
For the landscape type Arable Hill Land the first division is made based on land use. The land use class arable land (class 4) is separated from the others. The classes 1, 2 and 3 have only one more distinction, depending on the elevation: the prediction is either no excess nitrogen for elevations below 323.2 m or a very low excess nitrogen value (3 kg N ha⁻¹ a⁻¹) otherwise. The main branch on the right side represents only predictions for the land use class 4 (arable land). Within the landscape type Arable Hill Land this land use class is dominant (72 % of the area).
The next division was done based on the soil classes. The moderately fertile soil classes 3 and 4 are separated from the others. Following this central branch of the tree, the next distinction is governed by the evapotranspiration, which is an indicator for the local water balance. Further differentiations were done based on the variable elevation, the separation of the soil classes 3 and 4 and the additional water balance parameter precipitation. The noticeable value of −88.85 kg N ha⁻¹ a⁻¹ at the leaf with the ID 49 is caused by high amounts of precipitation combined with low elevations. These are indicators for wet conditions in the valleys, which might result in high denitrification rates.
Regarding the right branch of the tree, the first division is again caused by the soil class, which underlines the importance of the soil properties for nutrient leaching (Scheffer and Schachtschabel, 1998). The soil class 1, which includes the poorest soil in the classification, causes the highest nutrient output (ID 15 with 45.41 kg ha⁻¹ a⁻¹). Further differentiations were made based on evapotranspiration, soil class again, elevation, geology class and slope.

Results
For every landscape type of Thuringia a regression tree model was prepared with GUIDE. From the sample points of every landscape type a calibration and a validation data set were generated. The correlation coefficient between the modeled and the regionalized excess nitrogen was calculated and is shown for the calibration and the validation in Table 3. It underlines the quality of the calibration and validation of the regression tree models for all landscape types. Only landscape type 7 (Zechstein) shows a significantly higher correlation coefficient for the calibration than for the validation, which means that the calibrated tree is overfitted. The reason might be that this landscape type has a relatively small training area, located in the Helme catchment. The result of the regionalization is illustrated in Fig. 7. The regionalized excess nitrogen represents the potential leaching risk and therefore the contribution of each HRU to the nitrogen influx into waters. The borders of the three test catchments are also shown in this figure. The map of excess nitrogen indicates a significantly higher nitrogen output in the regions of the Arable Hill Land (Fig. 1), in the eastern part of the Thüringer Schiefergebirge (Thuringian slate mountains) located in the southeastern part of Thuringia, in the Helme catchment and in the western parts of the state. In contrast, the nitrogen output is lower in the landscape type Low Mountain Range, for instance in the Thüringer Wald (Thuringian forest), in which forest is the dominant land use. The results can also be expressed in numbers.
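The comparison between modeled and regionalized values can be illustrated with a plain Python implementation of Pearson's correlation coefficient R. The function name pearson_r and the sample values are made up for this sketch and are not taken from Table 3.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient R between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical excess nitrogen per HRU (kg N per ha and year).
modeled = [0.0, 10.0, 20.0, 30.0]      # stands in for J2000-S results
regionalized = [2.0, 9.0, 22.0, 27.0]  # stands in for regression tree predictions
r = pearson_r(modeled, regionalized)
```

Computing R separately for the calibration and the validation set shows overfitting as a clearly lower value for the validation set.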
Table 4 shows the excess nitrogen of the landscape types of Thuringia in t a⁻¹ as well as the respective area in km².

Validation
To test the regionalization accuracy, the excess nitrogen from another nutrient transport modeling study (Fink, 2004) in the catchment of the river Weida in south Thuringia was used for comparison. That modeling was performed with WASMOD (Reiche, 1994) for the same time series. An average value of 183 t N a⁻¹ for the whole basin was simulated with WASMOD. The catchment has an area of 162 km², of which 30 km² are located in the state of Saxony. The regionalization was only done for Thuringia, which comprises 132 km² of this catchment. Hence the regionalized excess nitrogen for the Thuringian part was projected to the area of 162 km² of the whole Weida catchment. The regression tree model predicted 194 t N a⁻¹ of excess nitrogen for the catchment. The difference between the modeled and the regionalized excess nitrogen of the Weida catchment is 11 t N a⁻¹, which corresponds to an agreement of about 94 %.
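The arithmetic behind this comparison can be retraced with a short Python sketch. Two assumptions are made here because the text does not spell them out: the projection to the whole catchment is taken to be proportional to area, and the 94 % figure is read as one minus the relative deviation from the WASMOD value. The function names are hypothetical.

```python
def project_to_area(total_t_per_a, source_area_km2, target_area_km2):
    """Scale a nitrogen total proportionally to a larger area (assumption)."""
    return total_t_per_a * target_area_km2 / source_area_km2


def relative_agreement(reference, estimate):
    """One minus the relative deviation of the estimate from the reference."""
    return 1.0 - abs(estimate - reference) / reference


wasmod_t_per_a = 183.0        # modeled excess nitrogen (Fink, 2004)
regionalized_t_per_a = 194.0  # regression tree result for the whole catchment

difference = regionalized_t_per_a - wasmod_t_per_a
agreement = relative_agreement(wasmod_t_per_a, regionalized_t_per_a)
```

Under this reading the difference of 11 t N a⁻¹ is about 6 % of the WASMOD value, which matches the reported agreement of about 94 %.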

Summary and conclusions
The results show that the statistical regression tree method applied in this study is able to predict the excess nitrogen based on modeled training data. The goodness of the regionalization depends on the quality of the modeling and the model's ability to achieve a thorough system understanding. Due to the HRU-based modeling, local characteristics, e.g. significantly high or low excess nitrogen outputs, can be identified and hence analyzed. The assumption was that the potential excess nitrogen depends on the specific local environmental factors, which can be described by a mathematical relation. The results show that a regression tree method like GUIDE is able to predict excess nitrogen for other areas with similar environmental conditions. Therefore, it can be used as an addition to meso-scale physically based nitrogen transport modeling. Both the results and the system understanding gained from the modeling can be regionalized in order to answer larger scale questions while considering single fields.
Edited by: R. Ludwig, K. Schulz, and M. Disse
Reviewed by: two anonymous referees

Fig. 2. Simulated and observed nitrogen loads in kg d⁻¹ of the river Gera at the gauging station Erfurt-Möbisburg.

Fig. 3. Modeled excess nitrogen output in kg N ha⁻¹ a⁻¹ for each HRU of the Lossa catchment. Each HRU is framed by a black line. The bold black lines are buffer strips, which are also represented as (small) HRUs.

Fig. 5. Box plots of the excess nitrogen per HRU for the different land use classes. The percentages indicate the proportion of HRUs in each land use class. The upper border of the box (blue) depicts the upper or third quartile (Q.75) of the distribution, while the lower border describes the lower or first quartile (Q.25). The length of the whiskers (black) represents 1.5 times the interquartile distance and describes the dispersion of the distribution. The central tendency of the distribution is expressed by the median (red). Outliers are not shown.

Table 2. List of categories of independent variables and classification criteria.

Table 3. Correlation coefficients (R) of the calibration and validation data set for the Thuringian landscape types.
Landscape types: Low Mountain Range, Sandstone, Shell Limestone, Arable Hill Land, Alluvial Land, Zechstein.

Table 4. Excess nitrogen in t a⁻¹ and area in square kilometers of the landscape types of Thuringia.