Forecast Encompassing Tests for the Expected Shortfall

In this paper, we introduce new forecast encompassing tests for the risk measure Expected Shortfall (ES). Forecasting and forecast evaluation techniques for the ES are rapidly gaining attention through the recently introduced Basel III Accords, which stipulate the use of the ES as the primary market risk measure in international banking regulation. Encompassing tests generally rely on the existence of strictly consistent loss functions for the functionals under consideration, which do not exist for the ES. However, our encompassing tests are based on recently introduced loss functions and an associated regression framework which consider the ES jointly with the corresponding Value at Risk (VaR). This setup facilitates several testing specifications which allow for both joint tests for the ES and VaR and stand-alone tests for the ES. We present asymptotic theory for our encompassing tests and verify their finite-sample properties through various simulation setups. In an empirical application, we utilize the encompassing tests in order to demonstrate the superiority of forecast combination methods for ES forecasts of the IBM stock.


Introduction
Evaluation and comparison of statistical forecasts is an essential task which heavily relies on the existence of suitable loss and identification functions for the forecasts under consideration. Classically, point forecasts are issued for a central tendency of the variable under consideration and are usually thought of as forecasts for the mean. An appropriate loss function for the evaluation of mean forecasts is the well-known squared loss function. In recent years, the scope has been extended to forecasts of non-central functionals of the predictive distribution. In the financial world, for example, attention has shifted towards forecasting functionals which capture the risk involved in financial products. Examples are the variance (volatility), quantiles (Value at Risk, VaR), expectiles and the Expected Shortfall (ES). For certain important functionals such as the ES and the variance (in the presence of a non-zero mean), there do not exist suitable loss functions for the evaluation of the forecasts (Gneiting, 2011). We then say that these functionals are non-elicitable.
For helpful comments, we thank Sander Barendse, Sebastian Bayer, Alastair Hall, Andrew Patton and the seminar participants at Universität Konstanz, Duke University, the 2019 QFFE conference in Marseille, and the 2019 IAAE conference in Nicosia. Financial support by the Graduate School of Decision Sciences (GSDS) and the travel fund of the International Association of Applied Econometrics (IAAE) is gratefully acknowledged.
In our empirical application, the combined VaR forecasts significantly outperform the stand-alone models in almost all cases. Furthermore, the third variant of our test, which can be applied when only ES forecasts are available (without their accompanying VaR forecasts), exhibits very similar results. Thus, we can draw the following conclusions. First, forecast combination methods for the ES significantly increase forecast accuracy. Second, our results imply that the gains from forecast combination are even more pronounced for the ES than for the VaR.
The rest of the paper is organized as follows. In Section 2, we introduce encompassing tests for the ES and derive the asymptotic distribution of the associated test statistics. Section 3 presents an extensive simulation study analyzing the size and power properties of our tests. In Section 4, we apply the testing procedure to daily returns of the IBM stock and thereby illustrate the power of forecast combination techniques. Section 5 provides concluding remarks. The proofs are deferred to Appendix A.

Theory
Consider a stochastic process Z = (Z_t)_{t=1,...,T} with Z_t : Ω → R^{k+1}, k ∈ N, which is defined on some common complete probability space (Ω, F, P), where F = {F_t, t = 1, ..., T} and F_t = σ{Z_s, s ≤ t}. We partition the stochastic process as Z_t = (Y_t, X_t), where Y_t : Ω → R is an absolutely continuous random variable of interest and X_t : Ω → R^k is a vector of explanatory variables. We denote the conditional distribution of Y_{t+1} given the information set F_t by F_t. Accordingly, E_t, Var_t and f_t denote the expectation, variance and density corresponding to F_t. We further denote the class of conditional distributions of Y_{t+1} given the information set F_t by P.
Our goal is to construct encompassing tests for certain statistical functionals (also known as characteristics or properties) of the predictive distribution F_t. We formally define an s-dimensional functional, s ∈ N, as a mapping from the space of predictive distributions to a real-valued s-dimensional vector. Many forecasts are issued as (vector-valued) point forecasts even though the forecaster has indeed a full predictive distribution F_t in mind (Gneiting, 2011). Typical functionals which attract a lot of attention in practice are the mean, the variance, quantiles, expectiles and the ES. In many cases, the person or institution evaluating the forecasts is different from the forecaster and consequently only has access to the issued point forecasts, while the full predictive distribution stays concealed. One prominent example are risk forecasts, where the Basel Committee requires financial institutions to report forecasts for the ES of their forecasted return distributions (Basel Committee, 2016, 2017). In this context, two important properties of statistical functionals are elicitability and identifiability. Elicitability means that there exists a strictly consistent loss function, i.e. a loss function ρ_Γ(Y, x), depending on the random variable Y and the issued forecast x, whose expectation E[ρ_Γ(Y, ·)] is uniquely minimized by the true functional Γ. Using such a loss function, one can assess the quality of issued forecasts by comparing their average losses induced by the realizations of the predicted variable of interest. Identifiability means that there exists an identification function ϕ_Γ(Y, x) whose expectation E[ϕ_Γ(Y, x)] equals zero if and only if x equals the true functional Γ. Identification functions can usually be used to test for forecast rationality, i.e. for testing whether the average of ϕ_Γ(Y, ·) equals zero (Diebold and Lopez, 1996; Elliott et al., 2005; Patton and Timmermann, 2007).
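These two properties can be illustrated with a small simulation (a hypothetical sketch, not part of the paper's setup). For the α-quantile, the pinball loss ρ(Y, x) = (1{Y ≤ x} − α)(x − Y) is strictly consistent and the identification function ϕ(Y, x) = 1{Y ≤ x} − α has mean zero exactly at the true quantile:

```python
import random

random.seed(42)
alpha = 0.05
n = 20_000
y = [random.gauss(0.0, 1.0) for _ in range(n)]

def avg_pinball(x):
    # average strictly consistent quantile ("pinball") loss at forecast x
    return sum(((yi <= x) - alpha) * (x - yi) for yi in y) / n

def avg_identification(x):
    # quantile identification function: its mean is zero iff x is the true quantile
    return sum((yi <= x) - alpha for yi in y) / n

grid = [-3.0 + 0.05 * i for i in range(81)]  # candidate forecasts on [-3, 1]
x_star = min(grid, key=avg_pinball)          # loss-minimizing candidate
print(round(x_star, 2), round(avg_identification(x_star), 3))
# the minimizer lies near the true 5% quantile of N(0,1) (about -1.645),
# and the identification function is close to zero there
```

The same logic applies to any elicitable and identifiable functional; only the loss and identification functions change.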
As a direct consequence, almost all of the literature on tests for forecast comparison and forecast rationality revolves around the associated loss and identification functions (Gneiting, 2011).
Unfortunately, there exist statistical functionals for which no such strictly consistent loss functions exist. Important examples are the variance (with non-zero mean), the ES, the mode, the minimum and the maximum (Gneiting, 2011; Heinrich, 2014; Fissler and Ziegel, 2016). Some of these can be made elicitable at a higher order by considering a pair of functionals, e.g. the variance together with the mean, and the ES at level α jointly with the associated α-quantile (Lambert et al., 2008; Gneiting, 2011; Fissler and Ziegel, 2016). In this work, we make use of joint elicitability, which provides two crucial ingredients of forecast encompassing tests. First, we can jointly evaluate these forecasts using a strictly consistent loss function and thus define forecast superiority. Second, the associated loss and identification functions allow for M- and GMM-estimation of the corresponding semiparametric regression equations, which is a crucial ingredient of forecast encompassing tests.
In order to conduct the forecast evaluation in an out-of-sample fashion, we divide the sample of size T into an in-sample part of size m and an out-of-sample part of size n such that T = m + n. The in-sample period is used to generate the forecasts, e.g. by estimating model parameters in parametric approaches. The out-of-sample period is used for the evaluation of the forecasts. Following Giacomini and Komunjer (2005), we impose few restrictions on how the forecasts are generated: we allow for parametric, semiparametric or nonparametric techniques, and for nested as well as non-nested forecasting procedures.
Let γ̂_{t,m} denote the model parameters at time t (or alternatively the semi- or non-parametric estimator used in the construction of the forecasts), which are estimated using the previous m data points. We assume that the one-step-ahead forecasts f̂_t = f(γ̂_{t,m}, Z_t, Z_{t−1}, ...), issued with the knowledge available at time t, are generated by a fixed function f(·) over time. This construction allows for both fixed forecasting schemes, where the model parameters γ̂_{t,m} are estimated only once, and rolling-window forecasting schemes, where the parameters γ̂_{t,m} are re-estimated in each step.
In the following section, we formally introduce the concept of forecast encompassing in the classical case of one-dimensional, real-valued and 1-elicitable functionals. However, as e.g. the variance and the ES are not 1-elicitable, but 2-elicitable together with the mean and the quantile respectively, the main focus of this paper is to generalize encompassing tests to higher-order elicitable functionals, which is discussed in Section 2.2.

Encompassing Test for 1-Elicitable Functionals
Following Hendry and Richard (1982), Mizon and Richard (1986), Diebold (1989) and Giacomini and Komunjer (2005), we formally introduce the general concept of forecast encompassing for one-dimensional, real-valued and elicitable functionals. We assume that two competing forecasters predict the variable of interest Y_{t+1} and issue point forecasts f̂_{1,t} and f̂_{2,t} for a given functional Γ(F_t). Furthermore, let ρ_Γ(Y_{t+1}, f) be a strictly consistent loss function for Γ, which means that

    E_t[ρ_Γ(Y_{t+1}, Γ(F_t))] ≤ E_t[ρ_Γ(Y_{t+1}, x)],    (2.2)

with equality if and only if x = Γ(F_t), for all x in the domain of the forecasts. Then, we say that the forecast f̂_{1,t} conditionally encompasses f̂_{2,t} when

    E_t[ρ_Γ(Y_{t+1}, f̂_{1,t})] ≤ E_t[ρ_Γ(Y_{t+1}, θ_{1,t} f̂_{1,t} + θ_{2,t} f̂_{2,t})]    (2.3)

for all (θ_{1,t}, θ_{2,t}) ∈ Θ and for all t = m, ..., T − 1. Equation (2.3) implies that, in terms of the loss induced by ρ_Γ, forecast f̂_{1,t} is at least as good as any (linear) combination of f̂_{1,t} and f̂_{2,t}. This implies that forecast f̂_{2,t} does not add any information on Y_{t+1} which is not already incorporated in f̂_{1,t}. We define (θ*_{1,t}, θ*_{2,t}) as the parameters which minimize the loss under the conditional expectation,

    (θ*_{1,t}, θ*_{2,t}) = argmin_{(θ_1, θ_2) ∈ Θ} E_t[ρ_Γ(Y_{t+1}, θ_1 f̂_{1,t} + θ_2 f̂_{2,t})],

where Θ ⊆ R² is some compact, non-empty and convex parameter space. Then, it obviously holds that

    E_t[ρ_Γ(Y_{t+1}, θ*_{1,t} f̂_{1,t} + θ*_{2,t} f̂_{2,t})] ≤ E_t[ρ_Γ(Y_{t+1}, θ_{1,t} f̂_{1,t} + θ_{2,t} f̂_{2,t})]

for all (θ_{1,t}, θ_{2,t}) ∈ Θ and for all t = m, ..., T − 1. In particular, this implies that f̂_{1,t} encompasses f̂_{2,t} if and only if f̂_{1,t} is itself an optimal combination with respect to the loss function ρ_Γ, which is equivalent to (θ*_{1,t}, θ*_{2,t}) = (1, 0). The preceding definition shows that one main ingredient of forecast encompassing is the specification of the underlying loss function. One could state Definition 2.1 using any loss function of the appropriate dimensions. However, the inequality (2.3) only becomes meaningful if the loss function ρ_Γ is chosen appropriately for the functional Γ. It is possible to consider economically motivated loss functions, such as financial losses, utilities, etc. In this paper, we only consider loss functions which are meaningful in the strong statistical sense of being strictly consistent for the underlying functional as defined in (2.2).
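For the familiar special case of the mean with the squared loss, the optimal combination weights can be estimated by a simple OLS regression of Y_{t+1} on the two forecasts. The following self-contained sketch (simulated data, for illustration only) recovers weights close to (1, 0) when the first forecaster's information contains the second's:

```python
import random

random.seed(1)
n = 50_000
f1, f2, y = [], [], []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    f1.append(x1 + x2)      # forecaster 1 uses both predictors
    f2.append(x1)           # forecaster 2 only sees x1
    y.append(x1 + x2 + random.gauss(0, 1))

# OLS of y on (f1, f2): minimizes the squared (strictly consistent) loss
s11 = sum(a * a for a in f1); s12 = sum(a * b for a, b in zip(f1, f2))
s22 = sum(b * b for b in f2)
s1y = sum(a * c for a, c in zip(f1, y)); s2y = sum(b * c for b, c in zip(f2, y))
det = s11 * s22 - s12 * s12
theta1 = (s22 * s1y - s12 * s2y) / det
theta2 = (s11 * s2y - s12 * s1y) / det
print(round(theta1, 2), round(theta2, 2))  # near (1, 0): f1 encompasses f2
```

A Wald test of the restriction (θ_1, θ_2) = (1, 0) then operationalizes the encompassing hypothesis for the mean; the paper develops the analogous machinery for the pair (VaR, ES).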
In fact, strict consistency of a loss function only implies that a correctly specified forecast exhibits the smallest loss in expectation. In reality, however, competing forecasts are often not correctly specified and we seek to determine which of the competing forecasts has the higher predictive power. The definition of strict consistency in (2.2) does not directly imply that the better of two misspecified forecasts receives the smaller loss. Holzmann and Eulert (2014) show that for competing forecasts which are based on increasing information sets and which are correctly specified given their underlying (but usually incomplete) information sets (auto-calibrated), applying any strictly consistent loss function results in a correct ranking of the forecasts. In our case, we indeed have nested information sets, as it obviously holds that σ(f̂_{1,t}, f̂_{2,t}) ⊇ σ(f̂_{1,t}). Thus, by further assuming that the issued forecasts are auto-calibrated given the forecaster's information set, we can conclude that the ranking implied by (2.3) is indeed the correct one for any strictly consistent loss function.

Encompassing Test for Higher Order Elicitable Functionals
In this work, we consider statistical functionals such as the ES which are not elicitable, i.e. for which there does not exist a strictly consistent loss function. Consequently, defining forecast encompassing as in Definition 2.1 fails due to the lack of an underlying and statistically meaningful loss function. However, the functionals we consider are elicitable jointly with some other functional; e.g., the α-ES is jointly elicitable with the α-quantile. A further prominent example is the variance, which is only jointly elicitable with the mean. Thus, Definition 2.1 can be generalized by using a candidate from this class of strictly consistent joint loss functions.

Definition 2.2 (Forecast Encompassing for Higher-Order Elicitable Functionals).
We say that the vector of forecasts f̂_{1,t} = (f̂^(1)_{1,t}, ..., f̂^(s)_{1,t}) encompasses f̂_{2,t} = (f̂^(1)_{2,t}, ..., f̂^(s)_{2,t}) at time t for the vector-valued, elicitable functional Γ : P → R^s if and only if the optimal combination parameters

    (θ*_{1,t}, θ*_{2,t}) = argmin_{(θ_1, θ_2) ∈ Θ} E_t[ρ_Γ(Y_{t+1}, θ_1 ⊙ f̂_{1,t} + θ_2 ⊙ f̂_{2,t})]

satisfy (θ*^(1)_{1,t}, ..., θ*^(s)_{1,t}) = (1, ..., 1) and (θ*^(1)_{2,t}, ..., θ*^(s)_{2,t}) = (0, ..., 0) with respect to the loss function ρ_Γ, where ⊙ denotes the component-wise product. This definition allows for setting up joint encompassing tests for any vector-valued and elicitable functional of interest. Besides the perhaps most interesting case of non-elicitable but higher-order elicitable functionals such as the variance and the ES, this definition also enables us to consider forecast encompassing for combinations of 1-elicitable functionals. Such an analysis might be interesting for the case of two quantiles at different probability levels, such as the 1% and 5% VaR, which are often considered (jointly) in financial applications. In the following section, we turn to the case of encompassing tests for the ES.

Joint Forecast Encompassing for the ES and the VaR
In this section, we turn to the functional of interest, the ES at level α ∈ (0, 1), which is defined as

    ES_{t,α}(Y_{t+1}) = E_t[Y_{t+1} | Y_{t+1} ≤ Q_{t,α}(Y_{t+1})],    (2.10)

where Q_{t,α}(Y_{t+1}) denotes the conditional α-quantile of Y_{t+1} given F_t, which is unique when the distribution F_t is absolutely continuous. As discussed above, the main ingredient of forecast encompassing testing is the specification of the underlying loss function, which has to be associated with the functional(s) we consider forecasts of. The crucial property the loss function has to fulfill is strict consistency. Weber (2006) and Gneiting (2011) show that for the ES, there does not exist such a strictly consistent loss function. However, Fissler and Ziegel (2016) show that the ES is elicitable jointly with the quantile, and they characterize the full class of strictly consistent loss functions for the pair consisting of the quantile and the ES,

    ρ_QES(Y, q, e) = (1{Y ≤ q} − α)(g(q) − g(Y))
                   + φ′(e)(e − q + (1/α) 1{Y ≤ q}(q − Y)) − φ(e) + a(Y),

where g and φ are smooth functions such that g′ is non-negative and φ′ and φ′′ are strictly positive. This class of loss functions has the following interpretation. The first line is a generalized piecewise linear loss, which corresponds to the full class of strictly consistent loss functions for the quantile. The second line resembles the Bregman class of loss functions for the mean, where inside the big bracket, instead of Y, there is a quantile-truncated version of Y. This form of the loss functions is not unexpected given that the ES is in fact a quantile-truncated expectation. Using this class of loss functions, we can now define the concept of joint forecast encompassing for the pair consisting of the quantile and the ES.
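A concrete member of this class is the "FZ0" parameterization used by Patton et al. (2019), which is valid for strictly negative ES forecasts. The following sketch (simulated N(0,1) data, illustrative only) confirms numerically that the correctly specified (VaR, ES) pair attains a smaller average loss than a misspecified competitor:

```python
import math
import random

random.seed(7)
alpha = 0.05
y = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def fz0_loss(y_t, q, e):
    # "FZ0" member of the Fissler-Ziegel class (requires e < 0), jointly
    # strictly consistent for the alpha-quantile and the alpha-ES
    hit = 1.0 if y_t <= q else 0.0
    return -hit * (q - y_t) / (alpha * e) + q / e + math.log(-e) - 1.0

def avg_loss(q, e):
    return sum(fz0_loss(y_t, q, e) for y_t in y) / len(y)

z = -1.6449                                                    # true 5% VaR of N(0,1)
xi = -math.exp(-z * z / 2) / (math.sqrt(2 * math.pi) * alpha)  # true 5% ES, about -2.06
good = avg_loss(z, xi)              # correctly specified (VaR, ES) pair
bad = avg_loss(1.2 * z, 1.2 * xi)   # misspecified, overly conservative competitor
print(good < bad)  # strict consistency: the correct pair attains the smaller loss
```

Any other admissible choice of g and φ would deliver the same ranking in expectation; FZ0 is merely a numerically convenient representative.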
In order to facilitate conditional encompassing, we rely on formulating our testing problem through GMM moment conditions. When an M-estimator is used for estimation of the combination weights, it is only feasible to condition on the explanatory variables used in the semiparametric model, in this case the issued forecasts. However, as shown in the following, formulating this as a GMM problem opens up the possibility to test encompassing conditional on any information set G t ⊆ F t .
In general, there exist infinitely many different identification functions (see Gneiting (2011) and Ziegel (2016)), which can be used for the formulation of conditional moment conditions in the GMM setting considered here. As Dimitriadis and Bayer (2019) find that the GMM estimator is numerically unstable for many choices of these identification functions (obtained as the derivatives of a loss function), we rely on the following conditional moment conditions, which allow for an easy and numerically stable two-step estimation procedure.
    ψ^2step_QES(θ) = ( W_{q,t} (1{Y_{t+1} ≤ q̂_t′β} − α),
                       W_{e,t} (ê_t′η − q̂_t′β + (1/α) 1{Y_{t+1} ≤ q̂_t′β} (q̂_t′β − Y_{t+1})) ),    (2.14)

where q̂_t′β = β_1 q̂_{1,t} + β_2 q̂_{2,t} and ê_t′η = η_1 ê_{1,t} + η_2 ê_{2,t} denote the combined quantile and ES forecasts with parameters θ = (β′, η′)′.
† A very similar two-step estimator is proposed by Barendse (2017) for the simplified case of W_{q,t} = q̂_t and W_{e,t} = ê_t almost surely. Then, similar to the classical OLS estimator, we can find a closed-form solution for the second moment condition in (2.14).
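The closed-form second step can be illustrated as follows (a stylized sketch with hypothetical forecasts, not the paper's estimator). Since v_t = q_t + (1/α) 1{Y_{t+1} ≤ q_t}(Y_{t+1} − q_t) has conditional mean equal to the ES, the ES combination weights can be obtained by an OLS regression of v_t on the competing ES forecasts, taking the quantile forecasts from the first step as given:

```python
import math
import random

random.seed(3)
alpha = 0.05
z = -1.6449                                                    # 5% quantile of N(0,1)
xi = -math.exp(-z * z / 2) / (math.sqrt(2 * math.pi) * alpha)  # 5% ES of N(0,1)

n = 100_000
e1, e2, v = [], [], []
for _ in range(n):
    sigma = math.exp(random.gauss(0.0, 0.5))   # stylized conditional volatility draw
    y_t = sigma * random.gauss(0.0, 1.0)
    q = z * sigma                              # first-step quantile forecast, taken as given
    # quantile-truncated proxy: its conditional mean equals the true conditional ES
    v.append(q + (1.0 if y_t <= q else 0.0) * (y_t - q) / alpha)
    e1.append(xi * sigma)                      # correctly specified ES forecast
    e2.append(xi * sigma ** 2)                 # misspecified competitor

# second step: closed-form OLS of v on (e1, e2), as in the footnote's special case
s11 = sum(a * a for a in e1); s12 = sum(a * b for a, b in zip(e1, e2))
s22 = sum(b * b for b in e2)
s1v = sum(a * c for a, c in zip(e1, v)); s2v = sum(b * c for b, c in zip(e2, v))
det = s11 * s22 - s12 * s12
eta1 = (s22 * s1v - s12 * s2v) / det
eta2 = (s11 * s2v - s12 * s1v) / det
print(round(eta1, 2), round(eta2, 2))  # near (1, 0): e1 encompasses e2
```

The 1/α factor makes v_t heavy-tailed, which is why large out-of-sample windows or bootstrap inference are valuable in practice.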
The following proposition shows that (2.14) constitutes conditional moment conditions in the sense that their conditional expectation equals zero if and only if the parameters θ equal θ*_t.

Proposition 2.4. Given that q̂_t ≠ 0 and ê_t ≠ 0 almost surely for all t, we get that

    E[ψ^2step_QES(θ) | F_t] = 0 almost surely for all t = m, ..., T − 1 if and only if θ = θ*_t.    (2.15)

The proof is given in Appendix A.
The conditional encompassing condition in (2.15) is equivalent to the statement that E[W_t ψ^2step_QES(θ*_t)] = 0 for all F_t-measurable instrument choices W_t. Testing whether a condition holds for all F_t-measurable functions W_t is in general infeasible. In order to facilitate this, we choose a vector of instruments W_t = (W_{q,t}, W_{e,t}) : Ω → R^l, l ∈ N, which captures all the relevant information contained in F_t, and we define the unconditional moment conditions

    Ψ^2step_QES,n(θ) = (1/n) Σ_{t=m}^{T−1} ψ^2step_QES(θ).    (2.17)

For the estimation of the forecast combination weights, we choose the identification function ψ^2step_QES, as this allows for a two-step estimation procedure: in the first step, we estimate the parameters of the semiparametric quantile model, and in the second step, we use these pre-estimates in order to estimate the parameters of the ES model. This is desirable as Dimitriadis and Bayer (2019) find that the general one-step GMM estimator is numerically unstable and requires vastly higher computation times. The two-step procedure is possible due to the special form of ψ^2step_QES, whose first component is independent of the ES parameters and only depends on the quantile-specific parameters. This estimator falls into the general class of two-step GMM estimators considered in Chapter 6 of Newey and McFadden (1994). In the following, we split the moment conditions as

    Ψ^2step_QES,n(θ) = (Ψ_Q,n(β)′, Ψ_ES,n(β, η)′)′.    (2.21)

The first-step quantile estimator is defined by

    β̂_n ∈ argmin_β Ψ_Q,n(β)′ Ψ_Q,n(β).

Then, conditional on the pre-estimate β̂_n, we define the estimator η̂_n as the minimizer of the inner product

    η̂_n ∈ argmin_η Ψ_ES,n(β̂_n, η)′ Ψ_ES,n(β̂_n, η).    (2.23)

The following proposition shows consistency of this two-step estimator and states the regularity conditions we need to impose for this.

Proposition 2.5. Assume that for every t = m, ..., T − 1, it holds that
(a) the process (W_t, Z_t) is α-mixing with α of size −r/(r − 1) for some r > 1,
(b) the distribution of Y_{t+1} given F_t, denoted by F_t, is absolutely continuous with continuous and strictly positive density f_t,
(c) q̂_t and ê_t are nonzero almost surely and the components of q̂_t and of ê_t are not perfectly correlated,
(e) θ* ∈ int(Θ), where the parameter space Θ is compact.
Then, θ̂_n = (β̂_n′, η̂_n′)′ → θ* in probability as n → ∞.
The proof is given in Appendix A. The conditions of the proposition are similar to the standard conditions in the literature on asymptotic theory for semiparametric quantile and ES models, see e.g. Komunjer (2005), Giacomini and Komunjer (2005), Engle and Manganelli (2004) and Patton et al. (2019). The mixing condition (a) is a standard condition on the allowed dependence and heterogeneity of the underlying stochastic process such that a law of large numbers holds for the process Ψ^2step_QES(θ). Furthermore, existence of the moments in (d) is required for the law of large numbers. Condition (b) assures that the conditional quantile is unique and that there is positive mass at the conditional quantile in the form of a strictly positive density. This assumption is standard in the literature on conditional quantile estimation (Koenker and Bassett, 1978; Komunjer, 2005; Engle and Manganelli, 2004). Condition (c) rules out perfectly collinear explanatory variables. Finally, assumption (e) is standard in the literature on inference for time series parameter estimation. We now turn to asymptotic normality of the estimator, which is given in the following proposition.
Proposition 2.6. In addition to the conditions of Proposition 2.5, assume that
(f) the process (W_t, Z_t) is α-mixing with α of size −r/(r − 2) for some r > 2,
(h) the matrix Σ_n, defined in (2.26), is positive definite with a determinant bounded away from zero for all n sufficiently large,
(i) the density f_t is bounded from above almost surely on the whole support of F_t,
(j) the matrices E[W_{q,t} q̂_t′] and E[W_{e,t} ê_t′] have full row rank for all t = 1, ..., T.
Then, it holds that

    √n Ω_n^{−1/2} (θ̂_n − θ*) →_d N(0, I),    where Ω_n = (Λ_n′Λ_n)^{−1} Λ_n′ Σ_n Λ_n (Λ_n′Λ_n)^{−1}.

The proof is given in Appendix A. The conditions imposed for Proposition 2.6 again resemble conditions usually employed for asymptotic normality of quantile and ES models, see e.g. Komunjer (2005), Giacomini and Komunjer (2005), Engle and Manganelli (2004) and Patton et al. (2019). The strengthened mixing condition in (f) is required such that a central limit theorem holds for Ψ^2step_QES(θ). For this, we furthermore need condition (h) and the moment condition (g), which extends the moment condition (d) required for consistency. Equivalently, condition (i) strengthens condition (b): we require a sufficiently smooth conditional distribution such that we can smooth the indicator function in the moment conditions. In order to guarantee that the matrix Λ_n′Λ_n has full rank and can consequently be inverted, we impose condition (j). Condition (k) limits the number of exact hits of Y_{t+1} in the conditional quantile model. For linear models and in the scenario of iid data, this condition is readily verified as in Ruppert and Carroll (1980). Eventually, condition (l) requires that the moment conditions are stochastically equicontinuous, where primitive conditions for this can be found e.g. in Engle and Manganelli (2004), Komunjer (2005), Giacomini and Komunjer (2005) and Patton et al. (2019).
We estimate the asymptotic covariance by the estimators Σ̂_n and Λ̂_n and

    Ω̂_n = (Λ̂_n′Λ̂_n)^{−1} Λ̂_n′ Σ̂_n Λ̂_n (Λ̂_n′Λ̂_n)^{−1}.

The most problematic term to estimate is the conditional density, which, following Engle and Manganelli (2004) and Patton et al. (2019), is estimated consistently by the estimator

    f̂_t = (1 / (2 c_n)) 1{|Y_{t+1} − q̂_t′β̂_n| < c_n}.

We obtain consistency of the covariance estimators by the following proposition.
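This kernel-type density estimator can be sketched in a few lines. In the sketch below (an illustration, not the paper's implementation), simulated N(0,1) data and the true 5% quantile stand in for Y_{t+1} and q̂_t′β̂_n, and a bandwidth obeying the usual rate conditions recovers the density value at the quantile:

```python
import random

random.seed(11)
n = 100_000
y = [random.gauss(0.0, 1.0) for _ in range(n)]
q = -1.6449               # true 5% quantile, standing in for the fitted quantile
c_n = n ** (-1 / 3)       # bandwidth with c_n = o(1) and n * c_n -> infinity

# density estimate: share of observations in a 2*c_n window around the quantile
f_hat = sum(1.0 for y_t in y if abs(y_t - q) < c_n) / (2 * c_n * n)
print(round(f_hat, 2))  # close to the true density value phi(-1.6449), about 0.103
```

The bandwidth choice trades off bias (too large c_n) against variance (too small c_n), which is exactly why the rate conditions of the following proposition are needed.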
Proposition 2.7. Let c_n be a deterministic and positive sequence that satisfies c_n = o(1) and c_n^{−1} = o(n^{1/2}). Then, Σ̂_n − Σ_n →_p 0, Λ̂_n − Λ_n →_p 0 and consequently Ω̂_n − Ω_n →_p 0.

The proof is given in Appendix A. Using the three previously established propositions, we can now derive the asymptotic distribution of the test statistic.

Theorem 2.8 (Joint Encompassing Test). Under H_0: (β*_{1,t}, β*_{2,t}, η*_{1,t}, η*_{2,t}) = (1, 0, 1, 0) for all t = m, ..., T − 1, it holds that

    T_n = n (θ̂_n − θ_0)′ Ω̂_n^{−1} (θ̂_n − θ_0) →_d χ²_4,    where θ_0 = (1, 0, 1, 0)′.
The proof is given in Appendix A. Thus, we establish a joint encompassing test for the VaR and the ES and derive the asymptotic distribution of the test statistic.

Forecast Encompassing Tests for the ES Stand-Alone
The primary objective of this paper is to set up encompassing tests for higher-order elicitable functionals such as the ES. In the previous section, we developed a joint test for the VaR and the ES, which is reasonable given their joint elicitability. However, in this section we also develop encompassing tests for the ES stand-alone. This is one feature where encompassing tests are superior to the literature on testing for superior predictive ability in the sense of Diebold and Mariano (1995), Giacomini and White (2006) and West (2006). As these tests are based directly on the average loss differential, they can only test for joint predictive ability of the VaR and the ES.
In contrast, encompassing tests are based on the regression coefficients of the semiparametric models, which are estimated jointly but still are separate coefficients. Thus, the joint encompassing test of Definition 2.3 depends on the associated loss functions only indirectly. This fundamental difference allows for testing only the hypothesis that the ES-specific regression parameters equal one and zero. In this approach, we do not test the quantile-specific parameters and consequently, under the null hypothesis, we do not impose that the underlying quantile forecasts also have to encompass their competitors. We call this test the Auxiliary ES Encompassing Test, as it still depends on the auxiliary VaR forecasts which are used for the estimation. The test is formally defined in the following.

Definition 2.9 (Auxiliary ES Forecast Encompassing). Let (q̂_{1,t}, ê_{1,t}) and (q̂_{2,t}, ê_{2,t}) denote competing forecasts for the pair consisting of the conditional quantile and ES of F_t. We say that ê_{1,t} encompasses ê_{2,t} at time t if and only if (η*_{1,t}, η*_{2,t}) = (1, 0).

As we utilize the same estimation technique as for the joint encompassing test, the asymptotic distribution of the test statistic can be derived using the asymptotic theory of Proposition 2.5 and Proposition 2.6, which we formulate in the following.

Theorem 2.10 (Auxiliary ES Encompassing Test).
Under H_0: (η*_{1,t}, η*_{2,t}) = (1, 0) for all t = m, ..., T − 1, it holds that

    T^aux_n = n (η̂_n − η_0)′ Ω̂_ES,n^{−1} (η̂_n − η_0) →_d χ²_2,

where η_0 = (1, 0)′ and Ω̂_ES,n denotes the ES-specific submatrix of Ω̂_n. The proof is straightforward and follows the proof of Theorem 2.8. Even though the emphasis of this test is on the ES, we still need quantile forecasts for the implementation of the parameter estimation. This is problematic for two reasons. First, the quantile forecasts are still utilized in the estimation procedure and thus have an indirect effect on the estimates of the ES-specific parameters. Second, the test is only applicable in the setup where the person who is interested in applying the tests actually has the pair of competing forecasts at hand. In the current implementation of the regulatory framework of the Basel Committee (Basel Committee, 2016, 2017), the banks are only obligated to report their ES forecasts at the probability level 2.5%. Thus, the accompanying VaR forecast, on which the ES forecast is based internally, is in general not available to the regulator who has to decide on an adequate risk management of the financial institution at hand. This motivates a third design of the encompassing test, which can be set up without having access to the accompanying VaR forecasts. The underlying idea is that in pure scale models Y_{t+1} = σ_{t+1} u_{t+1}, u_{t+1} ∼ F(0, 1), which are still the most frequently used class of models for risk management, the VaR and the ES are linear transformations of each other,

    q̂_t = (z_α / ξ_α) ê_t,

where z_α and ξ_α are the α-quantile and α-ES of the distribution F(0, 1). Thus, we estimate the following joint model,

    Q_{t,α}(Y_{t+1}) = β_1 ê_{1,t} + β_2 ê_{2,t}  and  ES_{t,α}(Y_{t+1}) = η_1 ê_{1,t} + η_2 ê_{2,t}.    (2.35)

Given that the underlying data stems from a location-scale model and under H_0: the forecasts ê_{1,t} encompass the forecasts ê_{2,t}, we get that (β*_{1,t}, β*_{2,t}, η*_{1,t}, η*_{2,t}) = (z_α/ξ_α, 0, 1, 0). As we are generally agnostic about encompassing of the VaR forecasts and as the ratio z_α/ξ_α is generally unknown, we only test the ES-specific parameters and set up the following definition of the strict ES encompassing test.
Theorem 2.12 (Strict ES Forecast Encompassing). Assume that the conditions of Propositions 2.5, 2.6 and 2.7 hold. Then, under H_0 and given that Y_t follows a pure scale model, it holds that

    T^strict_n = n (η̂_n − η_0)′ Ω̂_ES,n^{−1} (η̂_n − η_0) →_d χ²_2.    (2.37)

The proof is straightforward and follows the proof of Theorem 2.8. Derivation of the asymptotic distribution in the case where the underlying forecasts do not stem from an exact location-scale model is still an open problem. However, parameter estimates obtained from fitting non-location-scale models to financial data imply that the deviations from the location-scale structure are typically not large, and consequently, this test can be seen as a good approximation. Furthermore, the simulation study in Section 3 shows that this test is still well-sized and has good power properties. Thus, we propose to use this test in the scenario where the user does not have the accompanying VaR forecasts at hand.
One important application of these ES encompassing tests is the selection of the best-performing forecast, i.e. selecting at time t a best forecasting method for time t + 1. This is particularly relevant as the ES has recently been introduced into the Basel regulations without proper forecast selection procedures being at hand. Consequently, following Giacomini and Komunjer (2005), we propose the following decision rule for ES forecasts. Perform the two encompassing tests of H_{0,1}: ê_{1,t} encompasses ê_{2,t} and H_{0,2}: ê_{2,t} encompasses ê_{1,t} on data up to time t. Then, there are four possible scenarios. (1) If neither H_{0,1} nor H_{0,2} is rejected, the test is not helpful for forecast selection. (2) If H_{0,2} is rejected while H_{0,1} is not rejected, we conclude that forecast ê_{2,t} does not add any information to forecast ê_{1,t} and consequently, we decide to use the forecasting method of ê_{1,t}. (3) If H_{0,1} is rejected while H_{0,2} is not rejected, the same logic applies and we use the forecasting method of ê_{2,t}. (4) Eventually, if both H_{0,1} and H_{0,2} are rejected, the test delivers statistical evidence that both forecasts contain information and that a forecast combination performs significantly better. Consequently, one would choose a combined forecast ê_{c,t} = η̂_{n,1} ê_{1,t} + η̂_{n,2} ê_{2,t}, where the combination weights are estimated by the GMM estimator proposed in this paper.
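The four-scenario decision rule translates directly into code. The sketch below (a hypothetical helper function operating on the two test p-values, for illustration only) maps the outcomes of the two encompassing tests into a selection decision:

```python
def select_forecast(p1, p2, level=0.05):
    """Map the two encompassing-test p-values to a selection decision.

    p1: p-value of H_{0,1}: forecast 1 encompasses forecast 2
    p2: p-value of H_{0,2}: forecast 2 encompasses forecast 1
    """
    rej1, rej2 = p1 < level, p2 < level
    if not rej1 and not rej2:
        return "inconclusive"    # (1) the tests are not helpful for selection
    if not rej1 and rej2:
        return "forecast 1"      # (2) forecast 2 adds no information
    if rej1 and not rej2:
        return "forecast 2"      # (3) forecast 1 adds no information
    return "combination"         # (4) combine with the estimated GMM weights


print(select_forecast(0.40, 0.01))  # -> forecast 1
print(select_forecast(0.01, 0.01))  # -> combination
```

Note that the two hypotheses are tested separately, so the familiar multiple-testing caveats apply when the rule is iterated over time.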

Simulation Study
In this section, we evaluate the empirical performance of our three proposed ES encompassing tests in two different scenarios. The first, presented in Section 3.1, analyzes the test properties for two competing forecasts, both stemming from location-scale GARCH-type models. In Section 3.2, we present a simulation setup based on the CARE models of Taylor (2019), which lie outside the class of location-scale models. This setup also enables us to evaluate the performance of the third, strict ES encompassing test under misspecification. We report and discuss the results of the simulation study in Section 3.3.

GARCH setting
In this section, we consider two model specifications from the location-scale GARCH family with zero mean. The first is a standard GARCH(1,1) model, calibrated to IBM data, which leads to the following model specification,

    Y_{t+1} = σ_{1,t+1} u_{t+1}, where u_{t+1} ∼ N(0, 1) and σ²_{1,t+1} = 0.042 + 0.053 Y²_t + 0.925 σ²_{1,t}.    (3.1)

We obtain forecasts for the VaR and the ES by the formulas

    q̂_{1,t+1} = z_α σ_{1,t+1}  and  ê_{1,t+1} = ξ_α σ_{1,t+1},    (3.2)

where z_α and ξ_α are the α-quantile and α-ES of the standard normal distribution. The second model is a GJR-GARCH(1,1) model with Gaussian returns, which we calibrate to the same data and which leads to the model specification Y_{t+1} = σ_{2,t+1} u_{t+1}, where u_{t+1} ∼ N(0, 1) and σ²_{2,t+1} follows the GJR-GARCH recursion. This model specification allows for a leverage term by including an indicator function in the squared-return part of the GARCH equation. This accounts for the stylized fact that negative returns lead to higher future volatilities than positive returns of the same magnitude. Forecasts for the VaR and the ES of the GJR-GARCH model are again given by

    q̂_{2,t+1} = z_α σ_{2,t+1}  and  ê_{2,t+1} = ξ_α σ_{2,t+1}.    (3.4)

Using these two relatively realistic volatility models for daily return data, we simulate data from the following process (3.5), where u_t iid∼ N(0, 1) and where ρ ∈ [0, 1] is chosen on a linear grid between zero and one. For ρ = 0, the simulation setup in (3.5) implies that the simulated data follow the GARCH model specified in (3.1) and consequently, the true VaR and ES forecasts are given by (3.2). The GJR-GARCH forecasts in (3.4) are then misspecified and cannot add any valuable information to the forecast combination. In other words, the GARCH forecasts are already correctly specified and generate the lowest possible loss (in expectation, or alternatively in the limit by applying a weak law of large numbers). Consequently, the setup ρ = 0 corresponds to the null hypothesis.
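The GARCH(1,1) data-generating process (3.1) and the forecast formulas (3.2) can be sketched as follows. The probability level α = 2.5% is an assumption made here for illustration (matching the Basel ES level discussed earlier); under correct specification, the VaR forecasts produce violations with frequency close to α:

```python
import math
import random

random.seed(5)
alpha = 0.025
z_a = -1.95996                                                       # alpha-quantile of N(0,1)
xi_a = -math.exp(-z_a * z_a / 2) / (math.sqrt(2 * math.pi) * alpha)  # alpha-ES, about -2.34

omega, a1, b1 = 0.042, 0.053, 0.925   # GARCH(1,1) parameters from eq. (3.1)
T = 5_000
sigma2 = omega / (1.0 - a1 - b1)      # start at the unconditional variance
y, var_fc, es_fc = [], [], []
for _ in range(T):
    sigma = math.sqrt(sigma2)
    var_fc.append(z_a * sigma)        # one-step VaR forecast as in (3.2)
    es_fc.append(xi_a * sigma)        # one-step ES forecast as in (3.2)
    y_t = sigma * random.gauss(0.0, 1.0)
    y.append(y_t)
    sigma2 = omega + a1 * y_t ** 2 + b1 * sigma2   # GARCH recursion

# under correct specification, VaR violations occur with frequency close to alpha
hits = sum(1 for y_t, q in zip(y, var_fc) if y_t <= q) / T
print(round(hits, 3))
```

The GJR-GARCH variant only changes the variance recursion; the forecast formulas (3.4) are identical in structure.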

CARE setting
We also use a different simulation setup in order to generate data which does not follow a location-scale process. This is particularly important as the underlying theory of the strict ES test assumes data stemming from location-scale models. Thus, we conduct this simulation experiment in order to demonstrate the robustness of this test against typical deviations from location-scale models. For this, we adapt the Conditional Autoregressive Expected Shortfall (CARE) models introduced by Taylor (2019), which generalize the classical CAViaR quantile models of Engle and Manganelli (2004). The asymmetric slope (AS) CARE model, calibrated to the same daily IBM data as used in Section 3.1, is given in (3.10). Using this model, we can directly generate quantile and ES forecasts. The second model variant we consider is the symmetric absolute value (SAV) CARE model of Taylor (2019). In this simulation setup, we simulate data according to the additive model in (3.14). The underlying reason for this additive simulation design is that for ρ = 0, the ES forecasts ê_{1,t} of the AS-CARE model equal the true conditional ES, whereas this does not hold for the respective quantile forecasts q̂_{1,t}; the same holds inversely for ρ = 1. Consequently, the ES forecasts should be able to encompass any other misspecified forecasting scheme without the associated quantile forecasts. This establishes a simulation setup which justifies the second and third variants of our encompassing tests, which test encompassing of the ES forecasts stand-alone.
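To fix ideas, a CARE-type recursion can be sketched as follows. This is only an illustration of the SAV functional form with made-up parameters of ours; it is not the calibrated model used in the simulations.

```python
import numpy as np

def sav_care_forecasts(y, q0, e0, b=(-0.01, 0.9, -0.3), c=(-0.01, 0.9, -0.4)):
    """One-step-ahead VaR/ES paths from a SAV-type CARE recursion
    (illustrative parameters only):
        q_{t+1} = b0 + b1 * q_t + b2 * |y_t|
        e_{t+1} = c0 + c1 * e_t + c2 * |y_t|
    Both risk measures react to the absolute value of the last return."""
    n = len(y)
    q = np.empty(n + 1)
    e = np.empty(n + 1)
    q[0], e[0] = q0, e0
    for t in range(n):
        q[t + 1] = b[0] + b[1] * q[t] + b[2] * abs(y[t])
        e[t + 1] = c[0] + c[1] * e[t] + c[2] * abs(y[t])
    return q, e
```

With a more negative slope on |y_t| in the ES equation and a starting value e0 < q0, the ES path stays below the quantile path, as it should for a left-tail risk measure.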

Simulation Results
Table 1 reports the empirical sizes of the three different ES encompassing tests introduced in Section 2 at a 5% nominal significance level, based on 2000 Monte Carlo replications. The respective null hypothesis being tested is that the model given in the line labeled H_0 encompasses the respective competitor. The upper panel depicts results for the GARCH simulation design of Section 3.1, whereas the lower panel depicts results for the CARE simulation design of Section 3.2. We show results for test versions based on estimation of the asymptotic covariance for the sample sizes n = 1000 and n = 2500, and for bootstrap test versions with n = 1000. We find that our tests are well-sized, especially for the increased sample size n = 2500 and for the bootstrap estimator. The second and third tests, which only test the ES forecasts, exhibit overall better size properties than the joint encompassing test. This observation can be explained by the fact that the joint test is based on an estimate of a four-by-four covariance matrix which includes the density quantile function as a nuisance quantity in the covariance estimation (Koenker and Bassett, 1978; Giacomini and Komunjer, 2005; Dimitriadis and Bayer, 2019). As a consequence, this test benefits strongly from applying a bootstrap procedure, which circumvents this source of inaccuracy. In contrast, the ES versions of the test do not rely on covariance submatrices containing this quantity and consequently perform better. Furthermore, we find that the strict ES encompassing test, which is subject to misspecification in the quantile equation for the CARE simulation setup, exhibits good size properties for both simulation setups. As we simulate from models calibrated to real data, this shows that the test is robust against misspecifications usually encountered in financial data and delivers reliable results.
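The empirical sizes reported here can be computed with a small helper of ours; it assumes that each Monte Carlo replication yields a p-value for the test under H_0.

```python
import numpy as np

def empirical_size(pvalues, level=0.05):
    """Empirical size: the share of Monte Carlo replications (simulated under
    H0) in which the test rejects at the given nominal significance level.
    For a well-sized test this fluctuates around `level`."""
    return float(np.mean(np.asarray(pvalues) < level))
```

Under H_0 and a correctly sized test, the p-values are approximately uniform on [0, 1], so with 2000 replications the empirical size should lie close to the nominal 5% level.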
Figure 1 displays power curves (empirical rejection rates) for both simulation setups, both covariance estimators and the two different sample sizes for the tests at a nominal significance level of 5% and for 2000 Monte Carlo replications. Solid lines show rejection rates for the test with the null hypothesis that the GARCH (AS-CARE) model encompasses the GJR-GARCH (SAV-CARE) model. The dashed lines depict the rejection rates of the opposing null hypothesis of encompassing. We analyze the power of our tests using the DGPs specified in (3.5) and (3.14) for different values of ρ on a regularly spaced grid between zero and one. For ρ = 0 and ρ = 1, the plots reproduce the respective empirical sizes from the size analysis above. Increasing (decreasing) values of ρ feature continuously increasing degrees of misspecification for the respective tests.
We find that overall, all tests exhibit strong power against the alternatives, even for moderate values of ρ, and as expected, the test power increases when the sample size grows from n = 1000 to n = 2500. As in the size analysis, we notice that the joint VaR and ES test benefits the most from a bootstrapped covariance, as this test rejects too often in the version based on asymptotic covariances but not in the bootstrapped versions. However, the power of all three tests does not change dramatically when applying the bootstrap, which implies that the testing procedures benefit from the bootstrap especially in terms of being well-sized. Furthermore, the tests seem to perform equally well for both opposing hypotheses, which implies that size and power are not influenced by whether the misspecification results from over- or underspecification of the true model. Finally, these results show that all three ES encompassing tests exhibit very similar power patterns, which implies that it is relatively unimportant which test is applied in practice. These results especially emphasize the applicability of the third, strict ES test in situations where only the ES is available, even under more general specifications than location-scale models.
Figure 1: The first column of plots shows results for the GARCH simulation setup described in Section 3.1, whereas the second column shows plots for the CARE simulation setup described in Section 3.2. The tests in the first row of plots are based on the estimator of the asymptotic covariance for n = 1000, whereas the second row shows results for n = 2500. The third row shows plots for the bootstrap-based test for n = 1000.
In each of the plots, the case ρ = 0 corresponds to the null hypothesis that the GARCH (AS-CARE) model encompasses its competitor and the case ρ = 1 corresponds to the null hypothesis that the GJR-GARCH (SAV-CARE) model encompasses its competitor. The cases ρ ∈ (0, 1) represent misspecified cases with continuously increasing degree of misspecification.

Empirical Application
In this section, we illustrate the application of our encompassing tests for the ES (jointly with the VaR). The Basel Committee on Banking Supervision recently decided to shift from the VaR to the ES as the standard market risk measure for the international banking regulation (Basel Committee, 2016, 2017). Thus, tests of superior forecast performance for the ES are of utmost importance. We use daily open-close returns of the IBM stock from January 1, 2001 until October 1, 2018, which amounts to a total of T = 4417 daily observations. We use the first m = 1000 observations to estimate the models which generate the forecasts, while the remaining n = 3417 observations are used for out-of-sample forecast evaluation. In the following, we use a fixed forecasting scheme, i.e. (for the parametric models) the model parameters are estimated once on the first m = 1000 in-sample observations. These parameter estimates are then used to generate forecasts of the VaR and ES for the remaining days in a rolling-window fashion. We focus on one-day-ahead forecasts, whereas multi-day-ahead forecasting can easily be carried out in the same fashion. As the Basel Committee specifies that the ES has to be reported at the probability level α = 2.5%, we use this probability level for the VaR and ES throughout the empirical application.
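The fixed forecasting scheme described above can be sketched generically. This is our own helper; `fit` and `forecast` stand for any model's estimation and one-step-ahead prediction routines.

```python
import numpy as np

def fixed_scheme_forecasts(y, m, fit, forecast):
    """Fixed forecasting scheme: estimate the parameters once on the first m
    observations, then produce one-step-ahead forecasts for t = m, ..., T-1,
    each using the fixed parameter estimate together with the data observed
    up to the respective forecast origin."""
    y = np.asarray(y)
    theta = fit(y[:m])  # parameters are estimated only once
    return np.array([forecast(theta, y[:t]) for t in range(m, len(y))])
```

With T observations this yields n = T - m out-of-sample forecasts, matching the m = 1000 / n = 3417 split used in the application.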
In the analysis, we consider the following competing forecasting models. First, we use the GJR-GARCH(1,1)-n model of Glosten et al. (1993) with Gaussian returns, which is estimated by maximum likelihood. This model is based on the assumption of a location-scale framework, i.e. Y_{t+1} = σ̂_t u_{t+1}, where u_{t+1} ∼ N(0, 1) and σ̂²_t follows the GJR-GARCH recursion. We use this model to generate volatility forecasts σ̂_t and then obtain forecasts for the VaR and ES through the formulas q̂_t = z_α σ̂_t and ê_t = ξ_α σ̂_t, where z_α and ξ_α are the α-quantile and α-ES of the standard normal distribution. The second model is the RiskMetrics model, which is also in the location-scale family but models the conditional volatility as a fixed function (without estimated parameters) and without leverage effects, σ̂²_{t+1} = 0.94 σ̂²_t + 0.06 Y²_t. The forecasts for the VaR and ES are obtained by applying the same location-scale formulas as for the GJR-GARCH model. Third, we employ the Historical Simulation model, which generates VaR forecasts by computing the empirical α-quantile of the past 250 trading days. Equivalently, ES forecasts are obtained by computing the empirical α-ES of the past 250 trading days. The fourth and fifth models are GAS models (Creal et al., 2013; Harvey, 2013) for the VaR and ES proposed by Patton et al. (2019), which are estimated by minimizing the strictly consistent loss function for the VaR and ES given in (2.11). In the two-factor GAS model, the forcing variables of the system are the identification functions for the quantile and the ES given in (4.4). Patton et al. (2019) also introduce a one-factor GAS model for the VaR and ES. In this model, the VaR and ES are driven by a single forcing variable, which is similar to the classical GARCH framework. However, in order to focus on the VaR and ES instead of the volatility, this model is again estimated by minimizing the strictly consistent loss function for the pair of VaR and ES given in (2.11).
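For instance, the Historical Simulation forecasts can be sketched as follows. This is our own implementation of the rule described above; the empirical α-ES is computed as the average of the window returns at or below the empirical α-quantile.

```python
import numpy as np

def historical_simulation(returns, window=250, alpha=0.025):
    """Historical Simulation: the VaR forecast is the empirical alpha-quantile
    of the past `window` returns; the ES forecast is the mean of the returns
    in that window which lie at or below this quantile."""
    past = np.asarray(returns)[-window:]
    var = np.quantile(past, alpha)
    tail = past[past <= var]
    es = tail.mean()
    return var, es
```

Both forecasts are recomputed each day on a rolling window of the most recent 250 trading days.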
The parameter values of the GJR-GARCH and the two GAS models are estimated, whereas the RiskMetrics and Historical Simulation models do not require parameter estimation and can directly be applied to generate the forecasts. Table 2 shows the correlations of the respective forecasts. We can see that no pair of forecasts is perfectly correlated, which is crucial for the applicability of the encompassing tests as stated in Theorem 2.5. We run pairwise encompassing tests comparing all five described forecasting methods. That is, for each pair, we test the hypothesis that forecast A encompasses forecast B, and vice versa that forecast B encompasses forecast A. For this, we choose the vectors of instruments W_{q,t} = (1, q̂_{1,t}, q̂_{2,t}) and W_{e,t} = (1, ê_{1,t}, ê_{2,t}). This implies that we test forecast encompassing conditional on the information set the respective forecasters incorporate in their issued forecasts. It is important to note that this might be a subset of the full information set the respective forecaster has at hand. Besides our three ES encompassing tests, we also apply the VaR encompassing test of Giacomini and Komunjer (2005). The p-value in the i-th row and j-th column of the respective matrices corresponds to testing H_0: the forecasts from model i encompass the forecasts from model j. The symbol * denotes model pairs where both encompassing tests (testing whether model i encompasses model j and vice versa) are significant at the 10% level.
A rejection of both null hypotheses implies that neither forecast encompasses its competitor. Thus, both forecasts add some information and a forecast combination is superior to the stand-alone forecasting models. We find that out of the ten pairwise comparisons, the VaR encompassing test rejects both null hypotheses for two pairs. In comparison, the joint VaR and ES encompassing test rejects for six pairs, and both the auxiliary ES and the strict ES encompassing tests reject for five pairs. These results imply the following conclusions. First, all three ES encompassing tests reject fairly often, and thus our results support the theoretical advantages of forecast combination for the ES. Second, the three ES-specific encompassing tests jointly reject both hypotheses in more cases than the VaR test, and thus forecast combination is even more promising for the ES (and the pair of VaR and ES) than it already is for the VaR. Third, the two tests which only focus on testing regression parameters for the ES perform very similarly to the joint VaR and ES encompassing test. Thus, one can apply tests for forecast encompassing of the ES in cases where VaR forecasts are not at hand, as is currently the case under the regulatory framework of the Basel Committee on Banking Supervision (Basel Committee, 2016, 2017).
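The decision rule applied here can be sketched with a small helper of ours operating on a matrix of pairwise p-values such as those reported above.

```python
import numpy as np

def mutual_rejections(pvals, level=0.10):
    """Given a matrix of p-values where entry (i, j) tests H0 'model i
    encompasses model j', flag the pairs for which BOTH directions reject at
    the given level -- the case in which a forecast combination is supported."""
    pvals = np.asarray(pvals)
    reject = pvals < level
    return reject & reject.T  # both (i, j) and (j, i) must reject
```

The returned boolean matrix is symmetric, and its upper triangle directly yields the pairs marked with * in the results.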

Conclusion
With the implementation of the Basel III Accords, risk managers and regulators will shift attention to forecasts of the risk measure Expected Shortfall (ES). This underlines the importance of tools for forecast evaluation and comparison for the ES. In this paper, we introduce new encompassing tests for the ES, which are based on a recently developed joint loss function and an associated joint regression framework for the ES (jointly with the Value at Risk). We propose three variants of the ES encompassing test. The first tests joint encompassing of the VaR and the ES, whereas the second and third consider encompassing of the ES stand-alone. We show through simulation studies that all proposed tests are reasonably sized in typical financial applications and exhibit good power against general alternatives.
Tests for forecast encompassing establish a theoretical foundation for forecast combination of two competing forecasts when both opposing hypotheses of forecast encompassing are rejected. This situation corresponds to the case where neither forecast encompasses its competitor. Generally, applying forecast combinations can be highly beneficial through the diversification gains stemming from combining different model specifications and underlying information sets. This benefit can be particularly pronounced for extreme risk measures such as the ES, as the stand-alone models are very sensitive to the very few observations in the tails of the return distribution. Thus, forecast combination can be seen as a robustification of the forecasts.
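When both encompassing hypotheses are rejected, even a simple convex combination illustrates the idea. This is an illustrative scheme of ours; in practice the combination weights would be estimated, e.g. from the encompassing regression.

```python
import numpy as np

def combine_forecasts(e1, e2, weight):
    """Convex combination of two ES forecast series, with `weight` in [0, 1]
    attached to the second forecaster. A fixed weight is used purely for
    illustration; estimated weights would replace it in practice."""
    e1, e2 = np.asarray(e1, dtype=float), np.asarray(e2, dtype=float)
    return (1.0 - weight) * e1 + weight * e2
```

The combined series always lies between the two stand-alone forecasts, which is one way the combination dampens model-specific extremes.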
We apply the new encompassing tests for the ES to the problem of evaluating ES forecasts for financial returns, which is of paramount importance at present due to the recent introduction of the ES into the regulatory framework of the Basel Committee on Banking Supervision (Basel Committee, 2016, 2017). We use daily returns of the IBM stock and show that combined forecasts for the ES outperform the stand-alone models in five or six of the ten considered cases. Furthermore, our results imply that the gains from forecast combination for the ES are even more pronounced than they are for the VaR (Giacomini and Komunjer, 2005). The third variant of our test, which can also be applied when only ES forecasts (without their accompanying VaR forecasts) are available, exhibits very similar results. This test variant is particularly relevant as, under the currently implemented rules, the regulatory authorities only obtain ES forecasts and thus can only apply this test.

Appendix A Proofs
Proof of Proposition 2.4. Recall the definition of θ*_t, where we assume that this minimum is unique. We first show (A.2). For this, we notice the stated identities for the quantile and, equivalently, for the ES, from which (A.2) follows. As θ*_t uniquely minimizes E_t[ρ^{QES}_{g,φ}(Y_{t+1}, q̂_t β, ê_t η)] almost surely, which is a continuously differentiable function, we get that ∇_θ E_t[ρ^{QES}_{g,φ}(Y_{t+1}, q̂_t β, ê_t η)]|_{θ=θ*_t} = 0 almost surely. As the F_t-measurable vectors q̂_t and ê_t are nonzero almost surely, the corresponding first-order condition also holds almost surely. Furthermore, the associated Hessian has full rank almost surely as φ and φ′ are strictly positive and g is non-negative. We further have to show that E_t[ϕ^{2step}_{QES}(Y_{t+1}, q̂_t, ê_t, θ)] = 0 almost surely (for some θ ∈ Θ) implies that θ = θ*_t. For this, we first consider the first component of ϕ^{2step}_{QES}, which implies that F_t(q̂_t β) = α almost surely. As F_t is absolutely continuous, this implies that q̂_t β = q̂_t β*_t almost surely and consequently β = β*_t. Now, we consider the second component of ϕ^{2step}_{QES}, where we use β = β*_t from above. This yields that ê_t η = (1/α) E_t[Y_{t+1} 1_{{Y_{t+1} ≤ q̂_t β*_t}}] almost surely, where the third equality of the corresponding display follows from F_t(q̂_t β*_t) = α almost surely. An equivalent calculation yields that ê_t η*_t = (1/α) E_t[Y_{t+1} 1_{{Y_{t+1} ≤ q̂_t β*_t}}] almost surely. Thus, we can conclude that ê_t η = ê_t η*_t almost surely, which implies that η = η*_t.
Proof of Proposition 2.5. The consistency proof for the two-step estimator (β̂_n, η̂_n) follows the same idea as Theorem 2.6 of Newey and McFadden (1994). The only difference is that we have to show uniform convergence in probability for both moment conditions Ψ^{2step}_{QES,1} and Ψ^{2step}_{QES,2}. The reasoning for the second step is equivalent to that in the proof of Theorem 2.6 of Newey and McFadden (1994): one has to apply the same inequality as in that proof to the second moment condition in order to be able to apply Theorem 2.1 of Newey and McFadden (1994) again.
The consistency proof for the first-step estimator β̂_n follows Theorem 2.1 of Newey and McFadden (1994). Condition (i) directly follows from Proposition 2.4 and condition (ii) is simply assumed. Condition (iii) follows as E[Ψ^{2step}_{QES,1}(β)] is continuous in β since F_t is assumed to be continuous. Next, we show the uniform convergence condition (iv) by applying the uniform weak law of large numbers given in Theorem A.2.5 in White (1994). For that, we have to show that (A) the map β ↦ ψ^{2step}_{QES,1}(Y_{t+1}, W_t, q̂_t, β) is Lipschitz-L1 almost surely on Θ_β (see Definition A.2.3 in White (1994)), and (B) for all β_o ∈ Θ_β, there exists δ_o > 0 such that for all δ with 0 < δ ≤ δ_o, the sequences ψ̄_{1,t}(β_o, δ) and ψ_{1,t}(β_o, δ) obey a weak law of large numbers.
The Lipschitz-L1 continuity condition (A) of ψ^{2step}_{QES,1}(Y_{t+1}, W_t, q̂_t, β) on Θ_β is obvious as the function is a constant times W_t apart from the point where Y_{t+1} = q̂_t β, which forms a null set with respect to the distribution F_t. We next show condition (B). The sequences ψ̄_{1,t}(β_o, δ) and ψ_{1,t}(β_o, δ) are strong mixing of size −r/(r − 1) for some r > 1 by condition (a) and by applying Theorem 3.49 in White (2001), p. 50, as the functions ψ^{2step}_{QES,1} and the supremum and infimum functions are measurable. Furthermore, we get that E|ψ̄_{1,t}(β_o, δ)|^{r+δ} is bounded uniformly over 1 ≤ t ≤ T, T ≥ 1, for some δ > 0, and the same bound holds for |ψ_{1,t}(β_o, δ)| by condition (d).
Thus, we can apply the weak law of large numbers for strong mixing sequences in Corollary 3.48 in White (2001), p. 49, in order to conclude that for all β_o ∈ Θ_β with ||β_o − β|| ≤ δ, the corresponding centered averages converge to zero in probability, which shows condition (B). Consequently, the uniform convergence condition (iv) holds by applying the uniform weak law of large numbers given in Theorem A.2.5 in White (1994).
The proof of the uniform convergence condition for the second identification function proceeds along the same lines by again applying Theorem A.2.5 in White (1994). First, the map (β, η) ↦ ψ^{2step}_{QES,2}(Y_{t+1}, W_t, q̂_t, β, η) is Lipschitz-L1 almost surely on Θ. The map ψ^{2step}_{QES,2} consists of a piece-wise linear function multiplied by W_t and thus, it is Lipschitz-L1 as the sequences (1/n) Σ_{t=m+1}^T E||W_t q̂_t|| and (1/n) Σ_{t=m+1}^T E||W_t ê_t|| are bounded for all θ_o ∈ Θ.
Proof of Proposition 2.6. For this proof, we apply Theorem 1 of Bartalotti (2013), who shows that the asymptotic normality result for the two-step GMM estimator given in Theorem 6.1 of Newey and McFadden (1994) also holds in the case of nonsmooth objective functions. In this case, instead of checking the standard regularity conditions of Theorem 3.4 of Newey and McFadden (1994), one has to verify the conditions of Theorem 7.2, which is the counterpart for nonsmooth objective functions. The weighting matrix is the identity matrix in our case and can consequently be ignored in the proof. Condition (i) of Theorem 7.2 follows directly as in the proof of Proposition 2.5 and condition (iii) follows directly from assumption (e). For condition (ii), we have to show continuity of the two components E[Ψ^{2step}_{QES,1}(β)] and E[Ψ^{2step}_{QES,2}(β, η)]. The first component is continuously differentiable in β as F_t is continuously differentiable. For the second component, the term E_t[Y_{t+1} 1_{{Y_{t+1} ≤ q̂_t β}}] is continuously differentiable as F_t is continuously differentiable. Consequently, we obtain the derivative in (A.27) and ∇_η E[Ψ^{2step}_{QES,2}(β*, η*)] = E[W_t ê_t]. From condition (j), we get that the matrix Λ_n has full column rank and thus, Λ_n′ Λ_n is positive definite.
We conclude that the required convergence holds as n → ∞, as in the proof of Lemma 2 in the online supplement of Patton et al. (2019), which is a primitive condition for the estimator solving the moment conditions Ψ^{2step}_{QES} up to o_P(n^{-1}) (see the discussion on p. 2187 in Newey and McFadden, 1994). Eventually, the equicontinuity condition (v) follows directly from assumption (l).
Thus, we have verified the conditions of Theorem 7.2 of Newey and McFadden (1994) and can consequently apply Theorem 1 of Bartalotti (2013), which yields the stated asymptotic distribution.
Proof of Proposition 2.7. The proof of Σ̂_n − Σ_n →_P 0 is straightforward given condition (m) and is omitted here. The proof of Λ̂_n − Λ_n →_P 0 follows the idea of Engle and Manganelli (2004). For this, we define an intermediate quantity Λ̃_n such that Λ̂_n − Λ̃_n is o_P(1), as in the proof of Theorem 3 in Engle and Manganelli (2004). We then turn to

Λ̃_n − Λ_n = (1 / (2 n c_n)) Σ_{t=m+1}^T [ W_{q,t} q̂_t 1_{{|Y_{t+1} − q̂_t β*| < c_n}} − E( W_{q,t} q̂_t 1_{{|Y_{t+1} − q̂_t β̂_n| < c_n}} ) ]   (A.41)
= ...   (A.42)

which is again o_P(1) as in Engle and Manganelli (2004). Thus, the result of the proposition follows.