National Severe Storms Laboratory

Norman, Oklahoma

How does one make *probabilistic* forecasts? Well, it might be just as
valid to ask how one makes *categorical* forecasts. Let's begin with
the difference between the two. In meteorological forecasting, a categorical
forecast is one that has only two probabilities: zero and unity (or 0 and 100
percent). Thus, even what we call a categorical forecast can be thought of in
terms of two different probabilities; such a forecast can be called
**dichotomous**. On the other hand, the conventional interpretation of a
probabilistic forecast is one with more than two probability categories; such a
forecast can be called **polychotomous**, to distinguish it from dichotomous
forecasts. Forecasting dichotomously implies a constant certainty: 100 percent.
The forecaster is implying that he or she is 100 percent certain that an event
will (or will not) occur in the forecast area during the forecast period, that
the afternoon high temperature will be exactly 82F, that the wind will be
constantly and exactly from the northeast at 8 mph, and so on. Is that how you
really feel when forecasting? Think about it.

Let's assume for the sake of argument that you are forecasting some quantity,
Q, at a point. This could be temperature, rainfall, etc. There are different
options to take other than the standard approach of guessing what the Q-value
will be. Probabilistic forecasts can take on a variety of structures. As
shown in Fig. 0, it might be possible to forecast Q as a probability
distribution. [Subject to the constraint that the area under the distribution
always sums to unity (or 100 percent), which has *not* been done for the
schematic figure.] The distribution can be narrow when one is relatively
confident in a particular Q-value, or wide when one's certainty is relatively
low. It can be skewed such that values on one side of the central peak are
more likely than those on the other side, or it can even be bimodal [as with a
strong quasistationary front in the vicinity when forecasting temperature]. It
might be possible to make probabilistic forecasts of going past certain
important threshold values of Q. Probabilistic forecasts don't all have to
look like PoPs! When forecasting for an area, it is quite likely that forecast
probabilities might vary from place to place, even within a single metropolitan
area. That information could well be very useful to forecast customers, could
it not?
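One way to picture forecasting Q as a probability distribution is as a set of discrete bins whose probabilities sum to unity. The following sketch (in Python, with invented bin values and weights purely for illustration) shows both the unity constraint and a threshold-exceedance probability of the kind mentioned above:

```python
# Hypothetical probabilistic forecast of an afternoon high temperature,
# expressed as a discrete distribution over 2-degree bins. All numbers
# are invented for illustration.
bins = [76, 78, 80, 82, 84, 86]      # bin centers, degrees F
weights = [1, 3, 6, 8, 4, 2]         # forecaster's relative confidence in each bin

total = sum(weights)
forecast = {t: w / total for t, w in zip(bins, weights)}

# The constraint from the text: the "area" under the distribution is unity.
assert abs(sum(forecast.values()) - 1.0) < 1e-9

# Probability of exceeding an important threshold, say 81F:
p_above_81 = sum(p for t, p in forecast.items() if t > 81)
```

A narrow set of weights expresses high confidence in a particular Q-value; a broad or bimodal set expresses lower confidence, exactly as described for the schematic figure.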

If the *forecast* is either dichotomous or polychotomous, what about the
*events* that we are trying to forecast? In one sense, many forecast
events are dichotomous: it either rained or it did not, there was hail or
there was not, a snowfall did or did not accumulate to 4 inches, it froze or it
didn't, and so forth. On the other hand, the outcome of an event might be
polychotomous: the observed high temperature almost any place on the planet is
going to fall somewhere in a range from -100F to +120F (in increments of one
degree F), measurable rainfall amounts can be anything above 0.01 inches (in
increments of 0.01 inches), wind directions can be from any compass direction
(usually in something like 5 degree increments from 0 to 355 degrees), and so
on.

If we make up a table of forecast and observed events, such a table is called a
**contingency table**. For the case of dichotomous forecasts and
dichotomous events, it is a simple 2 x 2 table:

                         Observed (x)
    Forecast (f)    Yes (1)          No (0)           Sum
    Yes (1)         n11              n12              n1. = n11 + n12
    No (0)          n21              n22              n2. = n21 + n22
    Sum             n.1 = n11 + n21  n.2 = n12 + n22  n.. = N

The occurrence of an event is given a value of unity, while the non-occurrence is given a value of zero; the dichotomous forecasts likewise take on values of unity and zero.

If we have polychotomous forecasts (as with PoPs having, say, m categories of probability) and the event is dichotomous (it rained a measurable amount or it didn't), then the table is m x 2. If the event is also polychotomous (with, say, k categories), the table is m x k. The sums along the margins contain information about the distribution of forecasts and observations among their categories. It should be relatively easy to see how the table generalizes to polychotomous forecasts and/or events. This table contains a lot of information about how well the forecasts are doing (i.e., the verification of the forecasts). A look at verification will be deferred until later.
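The marginal sums are easy to see in a small sketch. The counts below are invented for illustration; the point is only how the row sums, column sums, and grand total N arise from the table of counts nij:

```python
# Sketch of the 2 x 2 contingency table in the text's notation:
# n[i][j] counts forecast category i (row) against observed category j (column).
# All counts are invented for illustration.
n = [[28, 72],    # forecast yes: n11 (observed yes), n12 (observed no)
     [23, 2680]]  # forecast no:  n21 (observed yes), n22 (observed no)

row_sums = [sum(row) for row in n]          # n1., n2. : distribution of forecasts
col_sums = [sum(col) for col in zip(*n)]    # n.1, n.2 : distribution of observations
N = sum(row_sums)                           # n.. = N  : total number of cases
```

The row sums tell you how often each forecast category was used; the column sums tell you how often the event was observed or not, which is exactly the marginal information described above.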

Think about how *you* do a forecast. The internal conversation you carry
on with yourself as you look at weather maps virtually always involves
probabilistic concepts. It is quite natural to have uncertainty about what's
going to happen.[1] And uncertainty
compounds itself. You find yourself saying things like "If that front moves
here by such-and-such a time, and if the moisture of a certain value comes to
be near that front, then an event of a certain character is more likely than if
those conditions don't occur." This brings up the notion of
**conditional** probability. A conditional probability is defined as the
probability of one event, given that some other event has occurred. We might
think of the probability of measurable rain (the standard PoP), given that the
surface dewpoint reaches 55F, or whatever.

Denote probability with a "p" so that the probability of an event x is simply p(x). If we are considering a conditional probability of x, conditioned on event y, then denote that as p(x|y).
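With that notation in hand, a conditional probability can be computed directly from joint counts. The counts below are invented, purely to illustrate the definition p(x|y) = p(x and y) / p(y):

```python
# Illustrative computation of a conditional probability from joint counts.
# Hypothetical data: days with measurable rain (x) and dewpoint >= 55F (y).
n_y = 400          # days on which y occurred
n_x_and_y = 120    # days on which both x and y occurred

p_x_given_y = n_x_and_y / n_y   # p(x|y): fraction of y-days that were also x-days
```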

There are many different kinds of probability. The textbook example is derived
from some inherent property of the system producing the event; an example is
tossing a coin. Neglecting the quite unlikely outcome of the coin landing on
its edge, this clearly is a dichotomous event: the coin lands either head up
or tail up. Assuming an unbiased coin, the probability of either a head or a
tail is obviously 50 percent. Each time we toss the coin, the probability of
either outcome is always 50 percent, no matter how many times the coin is
tossed. If we have had a string of 10 heads, the probability of another head
is still 50 percent with the next toss. Now the frequency of any given
*sequence* of outcomes can vary, depending on the particular sequence,
but if we are only concerned with a particular toss, the probability stays at
50 percent. This underscores the fact that there are well-defined laws for
manipulating probability that allow one to work out such things as the
probability of a particular sequence of coin toss outcomes. These laws of
probability can be found in virtually any textbook on the subject. Outcomes
can be polychotomous, of course; in the case of tossing a fair die, the
probability of any particular face of the die being on top is clearly
1/6, or about 16.7 percent. And so on. This **classical** concept of
probability arises inherently from the system being considered. It should be
just as obvious that this does not apply to meteorological forecasting
probabilities. We are not dealing with geometric idealizations when we look at
real weather systems and processes.

Another form of probability is associated with the notion of the
**frequency** of occurrence of events. We can return to the coin tossing
example to illustrate this. If a real coin is tossed, we can collect data
about such things as the frequency with which heads and tails occur, or the
frequency of particular sequences of heads and tails. We believe that if we
throw a fair coin enough times, the observed frequency should tend to 50
percent heads or tails, at least in the limit as the sample size becomes large.
Further, we would expect a sequence having a string of 10 heads to be much less
likely than some combination of heads and tails. Is this the sort of concept
we employ in weather forecasting probabilities? We don't believe so, in
general. Although we certainly make use of analogs in forecasting, each
weather system is basically different to a greater or lesser extent from every
other weather system. Is the weather along each cold front the same as the
weather along every other cold front? Not likely! Therefore, if a weather
system looks similar to another one we've experienced in the past, we might
think that the weather would evolve similarly, but only to a point. It would
be extremely unlikely that exactly the same weather would unfold, down to the
tiniest detail. In fact, this uncertainty was instrumental in the development
of the ideas of "chaos" by Ed Lorenz. No matter how similar two weather
systems appear to be, eventually their evolutions diverge, due to small
differences in their initial states, to the point where subsequent events are
as dissimilar as if they had begun with completely different initial
conditions. These ideas are at the very core of notions of "predictability," a
topic outside the scope of this primer.
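The frequency notion of probability described above can be illustrated with a small simulation (Python here, purely as a sketch): the observed frequency of heads tends toward 50 percent as the sample grows, while any particular long run, such as 10 heads in a row, remains rare.

```python
# Simulating the frequency concept of probability with a fair coin.
import random

random.seed(42)                 # fixed seed so the sketch is repeatable
n_tosses = 100_000
tosses = [random.random() < 0.5 for _ in range(n_tosses)]   # True = heads

# Observed frequency of heads approaches 50 percent for large samples.
freq_heads = sum(tosses) / n_tosses

# Fraction of non-overlapping groups of 10 tosses that are all heads;
# the classical value is (1/2)**10, slightly less than 0.1 percent.
groups = [tosses[i:i + 10] for i in range(0, n_tosses, 10)]
ten_heads = sum(all(g) for g in groups) / len(groups)
```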

This brings us to yet another type of probability, called **subjective**
probability. It can be defined in a variety of ways, but the sort of
definition that makes most sense in the context of weather forecasting is that
the subjective probability of a particular weather event is associated with the
forecaster's uncertainty that the event will occur. If one's assessment of the
meteorological situation is very strongly suggestive of a particular outcome,
then one's probability forecast for that event is correspondingly high. This
subjective probability is just as legitimate as a probability derived from some
other process, like the geometric- or frequency-derived probabilities just
described. Obviously, two different forecasters might arrive at quite
different subjective probabilities. Some might worry about whether their
subjectively derived probabilities are right or wrong.

An important property of probability forecasts is that a single such forecast has no clear sense of "right" and "wrong." That is, if it rains on a 10 percent PoP forecast, is that forecast right or wrong? Intuitively, one suspects that having it rain on a 90 percent PoP is in some sense "more right" than having it rain on a 10 percent forecast. However, this is only one part of assessing the performance of the forecasts. In fact, the use of probabilities precludes so simple an assessment of performance as the notion of "right vs. wrong" implies. This is a price one pays for the added flexibility and information content of probability forecasts. Thus, the fact that on any given forecast day two forecasters arrive at different subjective probabilities from the same data doesn't mean that one is right and the other wrong! It simply means that one is more certain of the event than the other. All this does is quantify the differences between the forecasters.

A meaningful evaluation of the performance of probability forecasts (i.e.,
*verification*) is predicated on having an *ensemble* of such
forecasts. The property of having high PoPs out on days that rain and having
low PoPs out on days that don't rain is but one aspect of a complete assessment
of the forecasts. Another aspect of importance is known as *reliability*:
reliable forecasts are those where the observed frequencies of events match the
forecast probabilities. A perfectly reliable forecaster would find it rains 10
percent of the time when a 10 percent PoP forecast is issued; it would rain 20
percent of the time when a 20 percent PoP forecast is issued, etc. Such a set
of forecasts means that it is quite acceptable to have it rain 10 times out of
100 forecasts of 10 percent PoPs! We'll return to this verification stuff
again.
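A reliability check of the sort just described is easy to sketch, given an ensemble of (forecast PoP, observed outcome) pairs. The data below are invented so that the forecasts come out perfectly reliable; real forecasts, of course, will not:

```python
# Sketch of a reliability computation over an ensemble of PoP forecasts.
# Each forecast is a PoP value; each observation is 1 (rain) or 0 (no rain).
# Data are invented for illustration.
from collections import defaultdict

forecasts = [0.1] * 100 + [0.2] * 50 + [0.9] * 20
observed  = [1] * 10 + [0] * 90 + [1] * 10 + [0] * 40 + [1] * 18 + [0] * 2

counts = defaultdict(lambda: [0, 0])   # PoP -> [events observed, forecasts issued]
for f, x in zip(forecasts, observed):
    counts[f][0] += x
    counts[f][1] += 1

# Observed frequency of the event for each forecast probability:
reliability = {f: events / total for f, (events, total) in counts.items()}
```

Perfect reliability means `reliability[f]` equals `f` for every category: it rains on 10 of every 100 forecasts of 10 percent, and that is exactly as it should be.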

Bayes' Theorem is an important tool in using conditional probability, and is stated as follows.

Bayes' Theorem: If x1, x2, ... , xm are m mutually exclusive events, of which some one must occur in a given trial, such that

    p(x1) + p(x2) + ... + p(xm) = 1,

and E is some event for which p(E) is non-zero, then

    p(xi|E) = p(E|xi) p(xi) / [p(E|x1) p(x1) + p(E|x2) p(x2) + ... + p(E|xm) p(xm)].

The denominator is simply p(E). Thus, this could have been written

    p(xi|E) p(E) = p(E|xi) p(xi),

which provides a sort of symmetry principle for conditional probabilities; the conditional probability of the event xi given event E times the unconditional probability of E is equal to the conditional probability of E given xi times the unconditional probability of xi.

If the conditioning event is a dichotomous event denoted by y, with its non-occurrence denoted by ~y, then

    p(x) = p(x|y) p(y) + p(x|~y) p(~y),

and we note that p(y) + p(~y) = 1.0. If y happens to be *polychotomous* such that there are m possible
values of y (and the sum of the probabilities of all of these is unity[2]), this formula can be extended to say
that

    p(x) = p(x|y1) p(y1) + p(x|y2) p(y2) + ... + p(x|ym) p(ym),

which we have used already in Bayes' Theorem.
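Bayes' Theorem transcribes almost directly into code. The probabilities below are invented for illustration; the point is the mechanics: the denominator is p(E), and the "symmetry principle" p(xi|E) p(E) = p(E|xi) p(xi) holds term by term:

```python
# Direct transcription of Bayes' Theorem for m mutually exclusive events
# x1 ... xm, one of which must occur. All probabilities are invented.
p_x = [0.5, 0.3, 0.2]            # p(xi); these must sum to unity
p_E_given_x = [0.1, 0.4, 0.8]    # p(E|xi)

# The denominator of Bayes' Theorem is simply p(E):
p_E = sum(pe * px for pe, px in zip(p_E_given_x, p_x))

# Bayes' Theorem for each xi:
p_x_given_E = [pe * px / p_E for pe, px in zip(p_E_given_x, p_x)]

# Symmetry principle: p(xi|E) p(E) = p(E|xi) p(xi) for every i.
assert all(abs(pxe * p_E - pe * px) < 1e-12
           for pxe, pe, px in zip(p_x_given_E, p_E_given_x, p_x))
```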

For the time being, let's assume that we are dealing with dichotomous events, so we can use the simple form above. Let's consider how this works for the event of having a tornado, conditioned on the occurrence of a thunderstorm. In both cases, the events are dichotomous; a tornado either occurs or it doesn't, and a thunderstorm either occurs or it doesn't. For all practical purposes, one must have a thunderstorm in order to have a tornado, which means that p(x|~y) = 0. In turn, this means that if we are given the unconditional probability of a thunderstorm and the conditional probability of a tornado given that there is a thunderstorm, we can find the unconditional probability of a tornado by simply forming the product of those two probabilities.
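As a quick worked example (with invented numbers), the product works out as follows:

```python
# Worked example of the tornado case: since p(tornado | no thunderstorm) = 0,
# the unconditional tornado probability is just a product. Values are invented.
p_storm = 0.40                  # unconditional probability of a thunderstorm
p_tor_given_storm = 0.05        # conditional probability of a tornado, given a storm

p_tornado = p_tor_given_storm * p_storm   # unconditional probability of a tornado
```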

We use this concept unconsciously all the time in arriving at our subjective
probability estimates. The events we forecast are conditioned on a whole
series of events occurring, none of which are absolute certainties the vast
majority of the time. Hence, we must arrive at our confidence in the forecast
in some way by applying Bayes' Theorem, perhaps unconsciously. Knowing Bayes'
Theorem consciously might well be of value in arriving at quantitative
probability estimates in a careful fashion. The probability of a *severe*
thunderstorm involves first having a thunderstorm. Given that there is a
thunderstorm, we can estimate how confident we are that it would be severe.
But the probability of a thunderstorm is itself conditioned by other factors[3] and those factors in turn are
conditioned by still other factors. Somehow our minds are capable of
integrating all these factors into a subjective estimate. Provided we do not
violate any known laws of probability (e.g., using a probability outside the
range from zero to unity), these mostly intuitive estimates are perfectly
legitimate.

Of course, we would like to be "right" in our probability estimates, but we have seen already that this is a misleading concept in evaluating how well our estimates are performing. We really need to accumulate an ensemble of forecasts before we can say much of value about our subjective probability estimates. There are some important aspects of probability forecasting to keep in mind as we go about deriving our subjective estimates of our confidence. From a certain point of view,[4] verification of our forecasts involves having information about what happened when we issued our forecasts ... in other words, we need to have filled in the contingency table. This may prove to be more challenging than it appears on the surface. There may be some uncertainty about how accurate our verification information is; for such things as severe thunderstorms and tornadoes, there are many reasons to believe that our current verification database is seriously flawed in many ways.

To the maximum extent possible, it is essential to use as verification data
those observations that are directly related to the forecast. Put another way,
we can only verify forecasts if we can observe the forecast events. This can
be a troublesome issue, and we will deal with it further in our verification
discussion. For example, PoP verification requires rainfall measurements;
specifically, one needs to know only whether or not at least 0.01 inches of
precipitation was measured. But it is not quite so simple as that; one also
must be aware of how the *forecast* is defined. When a PoP forecast is
issued, does it only apply to the 8 inch diameter opening at the official rain
gauge? What does PoP really mean in the forecast? And what is the period of
the forecast? It should be clear that the probability of a given event goes up as
the area-time product defining the forecast is increased. The probability of
having a tornado somewhere in the United States during the course of an entire
year is virtually indistinguishable from 100 percent. However, the probability
of having a tornado in a given square mile within Hale County, Texas between the
hours of 5:00 p.m. CDT and 6:00 p.m. CDT on the 28th of May in any given year
is quite small, certainly less than one percent. Therefore, one must consider
the size of the area and the length of the forecast period when arriving at the
estimated probability.

Moreover, we have mentioned Hale County, Texas because it has a relatively high
tornado probability during late afternoons at the end of May. If we were to
consider the likelihood of a tornado within a given square mile in Dupage
County, Illinois between the hours of 10:00 a.m. CST and 11:00 a.m. CST during
late January in any given year, that probability would be quite a bit lower
than the Hale County example, perhaps by two orders of magnitude. In deciding
on a subjective probability, having a knowledge of the **climatological
frequency** is an important base from which to build an estimate. Is the
particular meteorological situation on a given day such that the confidence in
having an event is greater than or less than that of climatology? It is quite
possible to imagine meteorological situations where the likelihood of a tornado
within a given square mile in Dupage County, Illinois between the hours of
10:00 a.m. CST and 11:00 a.m. CST during late January is actually *higher*
than that of having a tornado in a given square mile within Hale County, Texas
between the hours of 5:00 p.m. CDT and 6:00 p.m. CDT on the 28th of May. To
some extent, the weather does not know anything about maps, clocks, and
calendars. Thus, while knowledge of climatological frequency is an important
part of establishing confidence levels in the forecast, the climatology is only
a starting point and should not be taken as providing some absolute bound on
the subjective estimate.

It is useful to understand that a forecast probability equal to the
climatological frequency is saying that you have no information allowing you to
make a forecast that differs from any randomly selected situation. A
climatological value is a "know-nothing" forecast! There may be times, of
course, when you simply *cannot* distinguish anything about the situation
that would allow you to choose between climatology and a higher or lower value.
In such an event, it is quite acceptable to use the appropriate climatological
value (which might well vary according to the location, the day of the year,
and the time of day). But you should recognize what you are doing and saying
about your ability to distinguish factors that would increase or lower your
subjective probability relative to climatology.

Another important factor is the projection time. All other things being equal, forecasts involving longer projections have probabilities closer to climatological values as a natural consequence of limited predictability. It is tougher to forecast 48 h in advance than it is to forecast 24 h in advance. As one projects forecasts far enough into the future, it would be wise to have the subjective probabilities converge on climatology at your subjective predictability limit. What is the probability for a tornado within a given square mile in Hale County on a specific date late next May between 5:00 p.m. and 6:00 p.m. CDT? Almost certainly, the best forecast you could make would be climatology.

In this discussion, it is important to remember that the notion of time and
space specificity is quite dependent on these factors. We expect to be better
at probability estimation for large areas rather than small areas, for long
times rather than short times, and for short projections rather than long
projections, in general. Unless we have a great deal of confidence in our
assessment of the meteorology, we do not want to have excessively high or low
probabilities, relative to climatology. Using high probabilities over a wide
area carries with it a particular implication: events will be widespread and
relatively numerous within that area. If we try to be too space-specific with
those high values, however, we might miss the actual location of the events;
high probabilities might be warranted but if we cannot be confident in our
ability to pinpoint **where** those high probabilities will be realized,
then it is better to spread lower probabilities over a wide area.

Another important notion of probability is that it is defined over some finite area-time volume, even if the area is in some practical sense simply a point measurement (recall the 8-inch rain gauge!). However, it is possible to imagine a point probability forecast as an abstraction. What is the relationship between point and area probability estimates? If one establishes as a condition that one or more events actually occur somewhere within the area, then the average point probability can be thought of as the areal coverage of the point events within that area. If one has showers over 20 percent of the forecast area, that is equivalent to an average point probability of 20 percent for all the points in the domain.

Suppose we have a meteorological event, e, for which we are forecasting.
During the forecast time period, T, we have m such events, ei, i=1,2, ... ,m.
If the forecast area is denoted A, then we consider the probability of one or
more events in A, pA, to be the **area** probability; i.e., that one or
more events will occur somewhere within A. As an abstraction, A is made up of
an infinite number of points, with coordinates (x,y). The jth point is given
by (xj,yj). If the probability of having one or more events occur at each
point is finite, it is clear that pA cannot be the simple sum of the point
probabilities, since that sum would be infinite (or might exceed unity)!

Consider Fig. 1. Assume that each "point" in the area is actually represented by one of a finite number of small sub-areas, Ak, k=1,2, ... ,n. This small sub-area is the "grain size" with which we choose to resolve the total area A, which is the simple sum of the n sub-areas. The area coverage of events during the forecast period, C, is simply that fraction of the area which actually experiences one or more events during the forecast period. Mathematically, if n' is the number of sub-areas in which an event is observed during the period, then

    C = (1/A) * (A1 + A2 + ... + An') ,

where the summation runs only over the n' sub-areas affected, that is, those sub-areas whose intersection (denoted by the symbol "∩") with the region experiencing events is non-empty; for equal sub-areas this reduces to C = n'/n. At any instant, each of the ongoing events only covers a fraction of the total area affected by events during the time T.
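As a sketch (with an invented pattern of event reports, and equal sub-areas so that the coverage reduces to n'/n):

```python
# Sketch of the observed area coverage C: the fraction of n equal sub-areas
# in which one or more events occurred during the period. The pattern of
# affected sub-areas is invented for illustration.
n = 100
hit = [False] * n
for k in (12, 13, 14, 37, 38, 62, 63, 64, 65, 88):   # sub-areas with events
    hit[k] = True

n_prime = sum(hit)      # n': number of affected sub-areas
C = n_prime / n         # observed area coverage
```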

The forecast area coverage, C*f*, is that fraction of the area we are
forecasting to be affected. First of all, this does *not* mean we have to
predict *which* of the subareas is going to be hit with one or more
events. It simply represents our estimate of what the fractional coverage will
be. Second, this is clearly a conditional forecast, being conditioned by
whether at least one event actually occurs in A during T. If no event
materializes, this forecast coverage has no meaning at all.

The average probability over the area A is given by

    pbar = (1/n) * (p1 + p2 + ... + pn) ,

where the pi are the probabilities of one or more events during time T within the ith sub-area, Ai. It is assumed that the probability is uniform within a sub-area. If for some reason the sub-areas vary in size, then each probability value must be area-weighted and the sum divided by the total area. It should be obvious that the areas associated with the sub-areas ("pseudo-points") need to be small enough that a single probability value can be applied to each. If these pseudo-point probabilities are defined on a relatively dense, regular array (e.g., the "MDR" grid), then these details tend to take care of themselves.

It is simple to show that

    pbar = pA * Cf ,

where it is important to note that the coverage is the **forecast** area
coverage. Since the expected coverage is always less than or equal to unity,
this means that the *average* pseudo-point probability is always less than
or equal to the area probability. But observe that from an *a posteriori*
point of view,

    pbar = C ,

the **observed** area coverage. That is, the average point probability within
the area A can be interpreted in terms of areal coverage. This is not of much
use to a forecaster, however, since it requires knowledge of the area coverage
before the event occurs (if an event is actually going to occur at all)!
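The relation just described, that the average point probability equals the area probability times the forecast coverage, can be sketched with invented numbers:

```python
# Illustration of: average point probability = area probability x forecast coverage.
# All values are invented for illustration.
p_area = 0.7     # pA: probability of one or more events somewhere in A during T
cov_f = 0.2      # Cf: forecast fractional coverage, conditioned on an event occurring

p_point_avg = p_area * cov_f   # average pseudo-point probability over A

# The average point probability can never exceed the area probability,
# since the coverage is at most unity.
assert p_point_avg <= p_area
```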

There are at least three different sorts of probability forecasts you might be called upon to make: 1) point probabilities, 2) area probabilities, and 3) probability contours. The first two are simply probability numbers. PoP forecasts, certainly the most familiar probability forecasts, are generally associated with average point probabilities (which implies a relationship to area probability and area coverage, as mentioned above). The verification of them usually involves the rainfall at a specific rain gauge, and incorporates the concepts developed above.

Although it is not generally known, the SELS outlook basically is an average
point probability as well, related officially to the forecast area coverage of
severe weather events. If one has "low, moderate, and high" risk categories,
these are defined officially in terms of the forecast density of severe weather
events within the area, or a forecast area coverage (C*f*). This involves
both an average point probability and the area probability, as we have shown
above.

Many forecasters see probability contours associated with the TDL thunderstorm and severe thunderstorm guidance products. These have been produced using screening regression techniques on various predictor parameters and applied to events defined on the MDR grid. The predictor parameters may include such factors as climatology and observations as well as model forecast parameters.

There are other TDL guidance forecasts, including point PoPs for specific stations, contoured PoPs, and others. Whereas most forecasters are at least passingly familiar with PoPs (in spite of many misconceptions), it appears that most have little or no experience with probability contours. Thus, we want to provide at least a few tips and pointers that can help avoid some of the more egregious problems. Most of these are based on the material already presented and so are very basic. There is no way to make forecasting easy but we hope this removes some of the fear associated with unfamiliarity.

Presumably, as you begin to consider the task, you somehow formulate an intuitive sense of the likelihood of some event during your forecast period. Suppose your first thoughts on the subject look something like Fig. 2.

Figure 2. Schematic showing initial forecast probability contours.

However, you then consider that you are forecasting pretty high probabilities of the event over a pretty large area. Is it realistic to think that at least 80 percent of the pseudo-points inside your 80 percent contour are going to experience one or more events during the forecast period?[5] Perhaps not. O.K., so then you decide that you know enough to pinpoint the area pretty well. Then your forecast might look more like Fig. 3.

Now you're getting really worried. The climatological frequency of this event is about 5 percent over the region you've indicated. You believe that the meteorological situation warrants a considerable increase over the climatological frequency, but are you convinced the chances are as high as 18 times the climatological frequency? Observe that 18 x 5 = 90, which would be the peak point probability you originally estimated inside your 80 percent contour. This might well seem pretty high to you. Perhaps you've decided the highest chances for an event at a point within the domain are about 7 times climatology. And you may be having second thoughts about how well you can pinpoint the area. Perhaps it would be a better forecast to cut down on the probability numbers and increase the area to reflect your geographical uncertainties. The third stage in your assessment might look more like Fig. 4.

If it turns out that you are forecasting for an event for which TDL produces a contoured probability guidance chart, you're in luck ... provided that your definition of both the forecast and the event coincide with that of TDL's chart. In that wonderful situation, the TDL chart provides you with an objective, quasi-independent assessment of the probabilities that you can use either as a point of departure or as a check on your assessment (depending on whether you look at it before or after your own look at the situation leading to your initial guess at the contours). For many forecast products, you will not be so lucky; either the event definition or the forecast definition will not be the same as that used by TDL to create their chart. However, you can still use that TDL guidance if it is in some way related to your forecast, perhaps as an assessment of the probability of some event which is similar to your forecast event, or perhaps as some related event which might be used to condition your forecast of your event.

9. Conditional probability contours

Now that you are producing probability contours, you need to consider how to use and interpret conditional probability contours. Note that some of the TDL severe thunderstorm products involve conditional probabilities. There is not necessarily some particular order in which to consider them, but suppose you have produced something like Fig. 5.

In this figure, relatively high contours of p(x|y) extend into the northwestern U.S. where the values of p(y) are relatively low. This means that the conditioning event is relatively unlikely, but if it does occur, the chances for event x are relatively high. This conveys useful information, as in situations where x=severe thunderstorm and y=thunderstorm. The meteorological factors that are associated with the conditioning event, y, may be quite different from those that affect the primary event, x, given the conditioning event. The opposite situation is also possible, where p(y) is high and p(x|y) is low. If one desires, it is possible to do the multiplications and contour the associated unconditional probabilities, p(x). This might or might not be a useful exercise, depending on the forecast.
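The multiplication mentioned at the end of the paragraph is done pointwise on the grids. The tiny 2 x 2 grids below are invented, arranged (as in the Fig. 5 discussion) so that p(x|y) is high where p(y) is low:

```python
# Converting conditional probability contours to unconditional ones by forming
# p(x) = p(x|y) p(y) at each grid point. Grid values are invented.
p_y = [[0.2, 0.3],
       [0.6, 0.7]]           # p(y): probability of the conditioning event
p_x_given_y = [[0.8, 0.6],
               [0.2, 0.1]]   # p(x|y): high where p(y) is low, as in Fig. 5

p_x = [[c * u for c, u in zip(row_c, row_u)]
       for row_c, row_u in zip(p_x_given_y, p_y)]
# p_x could then be contoured like any other probability field.
```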

This topic can be responsible for a lot of heartburn. We are going to consider the verification of probabilistic forecasts and not consider verification of dichotomous forecasts (the latter of which we believe to be a less than satisfactory approach for meteorologists to take).

Assuming, then, that we have decided to make probabilistic forecasts, one of
the first issues we are going to have to settle is the probability
**categories**. How many categories do we want to employ, and what rationale
should go into deciding how to define those categories? There are several
things to consider:

*1. What is the climatological frequency of the event in question? Do we want
roughly the same number of categories above and below the climatological
frequency?*

*
2. What are the maximum and minimum practical probability for the event?
Obviously, if one knew precisely when and where things are going to occur, it
would make sense to forecast only zero and unity for probabilities. This
dichotomous ideal is virtually impossible to attain, which is why we are using
probability in the first place, so what is practical in terms of how certain we
can ever be?*

*
3. Do we want the frequency of forecasts to be approximately constant for all
categories?*

*
4. Given that the number of categories determines our forecast "resolution,"
what resolution do we think we are able to attain? And what resolution is
practical? Can we generate our maps of probability fast enough to meet our
deadlines?*

*5. Do our categories convey properly our uncertainty to our users? This can
be a serious problem for rare events, such as tornadoes. The climatological
frequency may be so low that a realistic probability sounds like a pretty
remote chance to an unsophisticated user even when the chances are many times
greater than climatology. Is there a way to express the probabilities to avoid
this sort of confusion?*

There may well be other issues besides these. Let us assume that we somehow have arrived at a satisfactory set of probability categories, say f1, f2, ..., fk. Further, let us assume that we have managed to match our forecasts to the observations such that we have no conflict between the definition of the forecast and the definition of an event. For the sake of simplicity, we are going to consider only the occurrence and non-occurrence of our observed event; i.e., the observations are dichotomous. Thus, we have the k x 2 contingency table:

| Forecast (f) | Observed x = Yes (1) | Observed x = No (0) | Sum     |
|--------------|----------------------|----------------------|---------|
| f1           | n11                  | n12                  | n1.     |
| f2           | n21                  | n22                  | n2.     |
| ...          | ...                  | ...                  | ...     |
| fk           | nk1                  | nk2                  | nk.     |
| Sum          | n.1                  | n.2                  | n.. = N |
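Accumulating such a table from a set of forecast-observation pairs is mechanical. A minimal sketch (the category values and sample data here are invented for illustration):

```python
# Build the k x 2 contingency table: rows are forecast categories
# f1..fk, columns are observed "yes" and "no". Data are hypothetical.
cats = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9]               # the k probability categories
forecasts = [0.1, 0.1, 0.3, 0.9, 0.9, 0.5, 0.1, 0.3]
observed  = [0,   1,   0,   1,   1,   0,   0,   1]  # 1 = event occurred

k = len(cats)
n = [[0, 0] for _ in range(k)]                      # n[i] = [ni1 (yes), ni2 (no)]
for f, x in zip(forecasts, observed):
    i = cats.index(f)
    n[i][0 if x == 1 else 1] += 1

row_sums = [sum(row) for row in n]                     # ni.
col_sums = [sum(row[j] for row in n) for j in (0, 1)]  # n.1, n.2
N = sum(row_sums)                                      # n.. = N
print(n, row_sums, col_sums, N)
```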

This table contains a lot of information! In fact, Murphy argues that it
contains **all** of the non-time-dependent information[6] we know about our verification. It is
common for an assessment of the forecasts to be expressed in terms of a limited
set of measures, or verification scores. This limited set of numbers typically
does not begin to convey the total content of the contingency table.
Therefore, Allan Murphy (and others, including us) has promoted a
distributions-oriented verification that doesn't reduce the content of the
table to a small set of measures. Murphy has described the complexity and
dimensionality of the verification problem, and it is important to note that a
single measure is at best a one-dimensional consideration, whereas the real
problem may be extensively multi-dimensional.

This is not the forum for a full explanation of Murphy's proposals for verification. The interested reader should consult the bibliography for pertinent details. What we want to emphasize here is that any verification that reduces the problem to one measure (or a limited set of measures) is not a particularly useful verification system. To draw on a sports analogy, suppose you own a baseball team and for whatever reason, you are considering trading away one player, and again for some reason you must choose between only two players, each of whom has been in the league for 7 years. Player R has a 0.337 lifetime batting average and scores 100 runs per year because he is frequently on base, but averages only 5 home runs per year and 65 runs batted in. Player K has a 0.235 lifetime batting average and scores 65 runs per year, but averages 40 home runs per year and has 100 runs batted in because he hits with power when he hits. Which one is more valuable to the team? Baseball buffs (many of whom are amateur statisticians) like to create various measures of "player value" but we believe that this is a perilous exercise. Each player contributes differently to the team, and it is not easy to determine overall value (even ignoring imponderables like team spirit, etc.) using just a single measure. In the same way, looking at forecasts with a single measure can easily lead to misconceptions about how the forecasts are doing. By one measure, they may be doing well, whereas by some other measure, they're doing poorly.

As noted, our standard forecasting viewpoint is that as forecasters we often
want to know what actually happened, given the forecast. This viewpoint can be
expressed in terms of p(x|f), where now the values of p(x|f) are derived from
the entries in the contingency table as frequencies. [Note that these
probabilities are distinct from our *probability categories* which are the
forecasts.] Thus, for example, p(x=yes (1) | f=fi) is simply ni1/ni., the count
in row i of the "yes" column divided by the row sum. The table can then be
transformed to

| Forecast (f) | Observed x = Yes (1) | Observed x = No (0) | Sum | p(f) |
|--------------|----------------------|----------------------|-----|------|
| f1           | n11/n1.              | n12/n1.              | 1   | f1   |
| f2           | n21/n2.              | n22/n2.              | 1   | f2   |
| ...          | ...                  | ...                  | ... | ...  |
| fk           | nk1/nk.              | nk2/nk.              | 1   | fk   |

where

fi = ni./N,  for i = 1, ..., k.

These marginal sums correspond to the frequency of forecasts in each forecast
category; in the sense discussed above (in Section 2), these can be thought of
as *probabilities* of the forecast, fi = p(fi).
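Under these definitions, the row-normalization and the forecast marginals can be sketched directly (the counts below are hypothetical):

```python
# p(x|f): divide each row of the count table by its row sum ni. ;
# the forecast marginals are fi = ni./N. Counts are hypothetical.
n = [[10, 90], [30, 70], [60, 40]]       # rows: categories; cols: [yes, no]
N = sum(sum(row) for row in n)

p_x_given_f = [[c / sum(row) for c in row] for row in n]
p_f = [sum(row) / N for row in n]        # fi = ni./N

print(p_x_given_f)   # each row sums to unity
print(p_f)           # sums to unity across categories
```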

However, there is another viewpoint of interest; namely, p(f|x), the probability of the forecast, given the events. This view is that of an intelligent user, who could benefit by knowing what you are likely to forecast when an event occurs versus what you are likely to forecast when the event does not occur. This can be interpreted as a "calibration" of the forecasts by the user, but it is a viewpoint of interest to the forecaster, as well. The table can be transformed in this case to

| Forecast (f) | Observed x = Yes (1) | Observed x = No (0) |
|--------------|----------------------|----------------------|
| f1           | n11/n.1              | n12/n.2              |
| f2           | n21/n.1              | n22/n.2              |
| ...          | ...                  | ...                  |
| fk           | nk1/n.1              | nk2/n.2              |
| Sum          | 1                    | 1                    |
| p(x)         | φ1                   | φ2                   |

where

φj = n.j/N,  for j = 1, 2.

Note that x=x1 implies "yes" or a value of unity, and x=x2 implies "no" or a
value of zero. These latter marginal sums correspond to the frequency of
events and non-events, respectively; as we have just seen from the p(x|f)
viewpoint, these can be thought of as *probabilities* of the observed
events, φj = p(xj).
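The complementary column-normalization can be sketched the same way (same hypothetical counts as before):

```python
# p(f|x): divide each column of the count table by its column sum n.j ;
# the event marginals are phi_j = n.j/N. Counts are hypothetical.
n = [[10, 90], [30, 70], [60, 40]]
col_sums = [sum(row[j] for row in n) for j in (0, 1)]   # n.1, n.2
N = sum(col_sums)

p_f_given_x = [[row[j] / col_sums[j] for j in (0, 1)] for row in n]
phi = [c / N for c in col_sums]          # phi_j = n.j/N

print(p_f_given_x)   # each *column* sums to unity
print([round(v, 3) for v in phi])
```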

Many things can be done with the contingency tables, especially if we are willing to look at these two different viewpoints (which correspond to what Murphy calls "factorizations"). The bibliography is the place to look for the gory details; however, forecasters who worry about their subjective probabilities can derive a lot of information from the two different factorizations of the contingency table's information. If they consider the marginal distributions of their forecasts relative to the observations, they can see if their forecasts need "calibration." It is quite likely that forecasters would make various types of mistakes in assessing subjective probabilities, and the information in these tables is the best source for an individual forecaster to assess how to improve his or her subjective probability estimates. Knowledge of the joint distribution of forecasts and events is the best mechanism to adjust one's subjective probabilities.

No matter how effective the forecasts might be, anything short of perfection leaves room for improvement. A reasonably complete verification offers forecasters the chance to go back and reconsider specific forecast failures. And successes may need reconsideration as well. Basically, the primary value of verification exercises lies in the opportunities for improvement in forecasting. Providing forecasters with feedback about their performance is important but the story definitely should not end there. If there are meteorological insights that could have been used to make better forecasts, these are most likely to be found by a careful re-examination of forecast "busts" and, perhaps to a lesser extent, forecast successes. If this important meteorological evaluation does not eventually result from the primarily statistical exercise of verification, then the statistical exercise's value is substantially reduced. Time and resources must go into verification, but then the goal should be to do the hard work of "loop-closing" by delving into meteorological reasons for success and failure on individual days.

We have said that you expect it to rain roughly 10 percent of the time you
forecast a 10 percent chance of rain. And, conversely, you expect it not to
rain roughly 10 percent of the time when you forecast a 90 percent chance.
However, the greater the departure of the forecasts from the observations, the
more concerned you should be; **perfect** forecasts are indeed categorical.
Uncertainty is at the heart of using probabilities, but this doesn't mean that
individual forecast errors are not of any concern. After all, when it rains on
a 10 percent chance, that is a forecast-observation difference of 0.1-1.0 =
-0.9; and when it fails to rain on a 90 percent forecast, that is a
forecast-observation difference of 0.9-0.0 = +0.9. That means a substantial
contribution to the RMSE, no matter how you slice it. Thus, it would not be in
your best interest to, say, intentionally put out a 10 percent forecast when
you thought the chances were 90 percent, simply to increase the number of rain
events in your 10 percent category because the frequency of rain in your 10
percent bin was currently less than 10 percent! Hopefully, such large errors
are rare, and it might well be feasible to go back and find out if there was
any information in the meteorology that could have reduced the large errors
associated with these individual forecasts.
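The squared-error arithmetic behind that concern is easy to check; the forecast-observation pairs below are invented for illustration:

```python
import math

# Each forecast-observation pair contributes (f - x)^2 to the mean
# squared error; a bust like rain on a 10 percent forecast contributes
# (0.1 - 1.0)^2 = 0.81, dwarfing routine contributions. Pairs are hypothetical.
pairs = [(0.1, 1.0), (0.9, 0.0), (0.1, 0.0), (0.9, 1.0)]
contributions = [(f - x) ** 2 for f, x in pairs]
rmse = math.sqrt(sum(contributions) / len(pairs))

print([round(c, 2) for c in contributions])  # [0.81, 0.81, 0.01, 0.01]
print(round(rmse, 2))                        # 0.64
```

The two busts dominate: they contribute 81 times as much to the sum as the two routine small errors.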

Naturally, this brings up the subject of "hedging." Some might interpret a
probabilistic forecast as a hedge, and that is not an unreasonable position,
from at least some viewpoints. However, what *we* are concerned with
regarding "hedging" in verification is a tendency to depart from a forecaster's
best judgement in a misguided effort to improve verification scores. The
example just given is just such a foolish attempt; although doing so would
improve the "reliability" score (perhaps), it also would increase the RMSE, and
other measures, to the overall detriment of the results. In what has been
referred to as a "strictly proper" verification system, a forecaster obtains
his or her best verification scores when making a forecast equal to his or her
best estimate. Many forecasters believe that any verification system can be
"played" to achieve optimal results ... if a forecaster does this, then the
only real loser is the forecaster, because then the benefits to the forecaster
associated with the verification exercise are lost. It is indeed possible to
hedge forecasts in this way, even with a strictly proper scoring system, but
when the scoring is strictly proper it is easily shown that the forecaster does
more poorly overall this way than by going with his or her best
judgement.
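A minimal sketch of that last point, using the Brier (squared-error) score as the strictly proper example: if a forecaster's true belief is p, the expected score of issuing forecast f is p(f-1)^2 + (1-p)f^2, which is smallest when f = p.

```python
# Expected Brier score of issuing forecast f when the forecaster's honest
# belief is p. Strictly proper: the expectation is minimized at f = p,
# so hedging (issuing f != p) can only raise the expected penalty.
def expected_brier(f, p):
    return p * (f - 1.0) ** 2 + (1.0 - p) * f ** 2

p = 0.9                                    # honest belief
honest = expected_brier(0.9, p)            # forecast the belief
hedged = expected_brier(0.1, p)            # the foolish hedge from the text
print(round(honest, 3), round(hedged, 3))  # 0.09 0.73
```

The numbers 0.9 and 0.1 here deliberately mirror the earlier example of misreporting a 90 percent belief as a 10 percent forecast.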

Of course, this presumes that the forecaster has "calibrated" the forecasts by obtaining regular feedback from verification results. It is imperative that this feedback be as rapid as possible, given the constraint that a useful evaluation of probabilistic forecasts requires a reasonably large ensemble of forecasts. Hence, setting up a verification system should include a mechanism to display the results to the forecasters as soon as they are available. It would make sense for individuals to be able to see their own tables, charts, and numbers, and to compare their results to those of the group, but there is no obvious benefit to making every individual's data available to the group.