General Framework for Verification


Reference: (Starred references are particularly important for the course)

*Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.

While verification problems vary widely, it is desirable to have a general framework for developing useful verification systems. Murphy and Winkler (1987) set out such a framework and, whenever possible, we will evaluate approaches based on how they fit into it.

Note that while there is really only one way for forecasts to be good (perfect forecasting), there are many ways to be bad. Often, we can identify the least useful forecast for each class of verification problem and use that to help in evaluation of real forecasts.

The general framework is based on the fact that the joint probability distribution of forecasts and observations (or events), denoted p(f,x), contains all of the non-time-dependent information required to evaluate all aspects of forecasts. [If we are looking at a set of forecasts i=1,2,3,...n, p(f,x) says nothing about the order of the forecasts. That is, if situation i = 1 contains information for situation i = 2, the joint distribution does not directly point this out. It may be possible to extend the framework to include that situational dependence, but it is not part of the basic package.] Most verification work has focused on summary measures, rather than on the basic indicators of performance contained in the joint distribution. Examining those basic indicators is essential if the goal of the verification process is to improve forecasting.

Three basic classes of problems

  1. 2 x 2 problem: Categorical forecasts of dichotomous events (yes/no forecasts and event either occurs or doesn't occur).
  2. n x 2 problem: Typically, probabilistic or class forecasts of dichotomous events.
  3. n x n problem: Continuous forecasts of continuous events. Can be displayed as a scatter diagram or as a contingency table (finite categories or intervals).
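
As a rough illustration of how p(f,x) might be tabulated in practice, the sketch below estimates the joint distribution as relative frequencies of (forecast, observation) pairs. The function name joint_distribution and the ten-case sample are hypothetical and chosen only for illustration; the sample is an n x 2 problem (probability forecasts of a yes/no event).

```python
from collections import Counter


def joint_distribution(forecasts, observations):
    """Estimate p(f, x) as relative frequencies of (forecast, observation) pairs."""
    if len(forecasts) != len(observations):
        raise ValueError("forecasts and observations must have the same length")
    n = len(forecasts)
    counts = Counter(zip(forecasts, observations))
    return {pair: count / n for pair, count in counts.items()}


# Hypothetical sample: three distinct probability forecasts, event coded 0/1.
forecasts = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
observations = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]

p_fx = joint_distribution(forecasts, observations)
for (f, x), p in sorted(p_fx.items()):
    print(f"p(f={f}, x={x}) = {p:.2f}")
```

Because only counts of pairs are kept, shuffling the cases leaves the result unchanged, which is the sense in which p(f,x) carries no information about the order of the forecasts.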

Two factorizations

Along with the full joint distribution, there are two factorizations of p(f,x) that are particularly useful. Each involves a marginal probability and a related conditional probability. The names come from terminology used in the psychological, decision-making literature.

Calibration-refinement factorization

p(f,x) = p(x|f)p(f)

p(f) tells how often different forecasts are made (how refined or sharp the forecasts are)

p(x|f) tells how often different observations occur when a particular forecast, f, is given (measuring the reliability or calibration of the forecasts)

This factorization looks forward, in a sense: what happens when a particular forecast is issued.

For categorical forecasts, we want p(x=1|f=1) to be large and p(x=1|f=0) and p(x=0|f=1) to be small. For probabilistic forecasts, if p(x=1|f) = f for all f, the forecasts are perfectly reliable (calibrated). Reliability diagrams can be used to display how close this is to being true for a set of forecasts. If there are more than two values of x, we want E(x|f) = f.

The least useful forecasts are those for which p(x|f) = p(x), so that the conditional distribution of the observations is the same sample climatological distribution no matter which forecast is issued.
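
The calibration-refinement factorization can be sketched in a few lines. The example below is only an illustration: the function name calibration_refinement and the joint distribution are hypothetical, with the numbers chosen so that the forecasts come out perfectly reliable, p(x=1|f) = f.

```python
def calibration_refinement(p_fx):
    """Split p(f, x) into the marginal p(f) and the conditionals p(x | f)."""
    p_f = {}
    for (f, x), p in p_fx.items():
        p_f[f] = p_f.get(f, 0.0) + p
    # Keys of the conditional are (x, f), read as "x given f".
    p_x_given_f = {(x, f): p / p_f[f] for (f, x), p in p_fx.items()}
    return p_f, p_x_given_f


# Hypothetical joint distribution, keyed by (forecast, observation).
p_fx = {
    (0.1, 0): 0.36, (0.1, 1): 0.04,
    (0.5, 0): 0.10, (0.5, 1): 0.10,
    (0.9, 0): 0.04, (0.9, 1): 0.36,
}

p_f, p_x_given_f = calibration_refinement(p_fx)
for f in sorted(p_f):
    print(f"f={f}: p(f)={p_f[f]:.2f}, p(x=1|f)={p_x_given_f[(1, f)]:.2f}")
```

The pairs (f, p(x=1|f)) are the points that a reliability diagram would plot, and p(f) shows how often each forecast value is used. For a no-skill set of forecasts, p(x=1|f) would instead equal the sample climatological frequency for every f.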

Likelihood-base rate factorization

p(f,x) = p(f|x)p(x)

p(x) is the climatological distribution of events, which is sometimes called the base rate

p(f|x) (the likelihoods) tell how often different forecasts are associated with particular observations

p(x) tells how we would predict x in the absence of a forecast. The likelihoods tell how helpful the forecasts are above and beyond the sample climatology. In effect, they let users revise the base-rate information.

The LBR factorization focuses on how discriminatory forecasts are and is oriented towards potential value of the forecasts.

The two factorizations are related to each other via

p(x|f)p(f) = p(f|x)p(x)

which rearranges to Bayes' theorem

p(x|f) = [p(f|x)p(x)]/p(f)
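
A small numerical check may make the link between the two factorizations concrete. The sketch below computes the base rate p(x) and the likelihoods p(f|x) from a hypothetical joint distribution, then recovers p(x|f) through Bayes' theorem. The function names likelihood_base_rate and posterior and the numbers are assumptions made for illustration.

```python
def likelihood_base_rate(p_fx):
    """Split p(f, x) into the base rate p(x) and the likelihoods p(f | x)."""
    p_x = {}
    for (f, x), p in p_fx.items():
        p_x[x] = p_x.get(x, 0.0) + p
    # Keys of the conditional are (f, x), read as "f given x".
    p_f_given_x = {(f, x): p / p_x[x] for (f, x), p in p_fx.items()}
    return p_x, p_f_given_x


def posterior(p_fx, f, x):
    """p(x | f) obtained from Bayes' theorem: p(f | x) p(x) / p(f)."""
    p_x, p_f_given_x = likelihood_base_rate(p_fx)
    p_f = sum(p for (g, _), p in p_fx.items() if g == f)
    return p_f_given_x[(f, x)] * p_x[x] / p_f


# Hypothetical joint distribution, keyed by (forecast, observation).
p_fx = {
    (0.1, 0): 0.36, (0.1, 1): 0.04,
    (0.5, 0): 0.10, (0.5, 1): 0.10,
    (0.9, 0): 0.04, (0.9, 1): 0.36,
}

p_x, p_f_given_x = likelihood_base_rate(p_fx)
bayes = posterior(p_fx, 0.9, 1)        # via the likelihood-base rate side
direct = p_fx[(0.9, 1)] / (p_fx[(0.9, 0)] + p_fx[(0.9, 1)])  # p(f,x) / p(f)
print(p_x[1], p_f_given_x[(0.9, 1)], bayes, direct)
```

Here the base rate p(x=1) is the starting estimate, and the likelihoods for f = 0.9 revise it to the same p(x=1|f=0.9) that the calibration-refinement factorization gives directly.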

Relationship of MSE to p(f,x)

Three decompositions of the mean squared error (MSE) show how that measure relates to the joint distribution.

The MSE, in terms of p(f,x), is given by

MSE = Sum_f Sum_x (f - x)^2 p(f,x)

where the sums are summation over all possible f and x, respectively.
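
As a quick check that this expression agrees with the familiar sample MSE, the sketch below computes both from a hypothetical set of (forecast, observation) pairs. The function name mse_from_joint and the data are illustrative assumptions.

```python
from collections import Counter


def mse_from_joint(p_fx):
    """MSE as the sum over all (f, x) of (f - x)^2 weighted by p(f, x)."""
    return sum((f - x) ** 2 * p for (f, x), p in p_fx.items())


# Hypothetical sample of (forecast, observation) pairs.
pairs = [(0.1, 0), (0.1, 0), (0.1, 1), (0.5, 0),
         (0.5, 1), (0.9, 1), (0.9, 1), (0.9, 0)]
n = len(pairs)

# Joint distribution as relative frequencies of the pairs.
p_fx = {pair: count / n for pair, count in Counter(pairs).items()}

mse_joint = mse_from_joint(p_fx)
mse_direct = sum((f - x) ** 2 for f, x in pairs) / n
print(mse_joint, mse_direct)
```

The two values agree up to floating-point rounding, since the joint-distribution form simply groups identical pairs before averaging.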

The decompositions of interest are given by

  1. MSE = Var(f) + Var(x) - 2 Cov(f,x) + [E(f) - E(x)]^2
  2. MSE = Var(x) + E_f[f - E(x|f)]^2 - E_f[E(x|f) - E(x)]^2
  3. MSE = Var(f) + E_x[x - E(f|x)]^2 - E_x[E(f|x) - E(f)]^2

A subscript on E indicates the marginal distribution with respect to which the expectation is taken; in each such term, it is the squared bracketed quantity that is averaged. The second term on the RHS of 2 is related to the reliability of the forecasts (small values are better) and the third term is related to the resolution (large values are better).

The second term on the RHS of 3 is a weighted average of the squared errors of the conditional average forecasts, that is, of the differences between each observation and the average forecast issued when that observation occurs (small values are good). The third term is related to the difference between the average forecast associated with each observation and the overall average forecast, which measures how much the average forecasts associated with different events differ from one another.
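
The three decompositions can be verified numerically. The sketch below is an illustration under stated assumptions: the function name mse_decompositions, the dictionary keys it returns, and the joint distribution are all hypothetical, and the distribution is deliberately not perfectly calibrated so that the reliability term is nonzero.

```python
def mse_decompositions(p_fx):
    """Return the MSE and the terms of its three decompositions from p(f, x)."""

    def expect(dist, g):
        # Expectation of g(key) under a discrete distribution {key: prob}.
        return sum(g(k) * w for k, w in dist.items())

    # Marginal distributions p(f) and p(x).
    p_f, p_x = {}, {}
    for (f, x), p in p_fx.items():
        p_f[f] = p_f.get(f, 0.0) + p
        p_x[x] = p_x.get(x, 0.0) + p

    mean_f = expect(p_f, lambda f: f)
    mean_x = expect(p_x, lambda x: x)
    var_f = expect(p_f, lambda f: (f - mean_f) ** 2)
    var_x = expect(p_x, lambda x: (x - mean_x) ** 2)
    cov = expect(p_fx, lambda fx: (fx[0] - mean_f) * (fx[1] - mean_x))

    # Conditional means E(x | f) and E(f | x).
    e_x_f = {f: sum(x * p for (g, x), p in p_fx.items() if g == f) / p_f[f]
             for f in p_f}
    e_f_x = {x: sum(f * p for (f, y), p in p_fx.items() if y == x) / p_x[x]
             for x in p_x}

    reliability = expect(p_f, lambda f: (f - e_x_f[f]) ** 2)
    resolution = expect(p_f, lambda f: (e_x_f[f] - mean_x) ** 2)
    lbr_term2 = expect(p_x, lambda x: (x - e_f_x[x]) ** 2)
    lbr_term3 = expect(p_x, lambda x: (e_f_x[x] - mean_f) ** 2)

    return {
        "mse": expect(p_fx, lambda fx: (fx[0] - fx[1]) ** 2),
        "decomp1": var_f + var_x - 2 * cov + (mean_f - mean_x) ** 2,
        "decomp2": var_x + reliability - resolution,
        "decomp3": var_f + lbr_term2 - lbr_term3,
        "reliability": reliability,
        "resolution": resolution,
    }


# Hypothetical joint distribution, keyed by (forecast, observation).
p_fx = {
    (0.1, 0): 0.30, (0.1, 1): 0.10,
    (0.5, 0): 0.10, (0.5, 1): 0.10,
    (0.9, 0): 0.10, (0.9, 1): 0.30,
}

terms = mse_decompositions(p_fx)
for name, value in terms.items():
    print(f"{name}: {value:.4f}")
```

All three decompositions reproduce the same MSE for this table, while the separate reliability and resolution terms show which aspect of forecast quality contributes to it.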