Assessing Probabilistic Forecasts Graphically

References

Mason, I., 1982:

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press. (Chapter 7)

Wilson, L. J., 2000:


Let's start with the following set of forecasts:

  F     N   Precip     %
  0   175      3      1.7
  5     1      0      0.0
 10    48      4      8.3
 20    37      5     13.5
 30    20      4     20.0
 40    10      4     40.0
 50    10      5     50.0
 60    13      5     38.5
 70    13      7     53.8
 80     3      3    100.0
 90     2      1     50.0
100     6      3     50.0

F is the forecast value in per cent, N is the number of forecasts issued with that value, Precip is the number of those forecasts on which precipitation occurred, and % is the observed frequency (in per cent) of precipitation for that forecast value.
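As a check on the table, the % column is just 100 × Precip / N. A minimal sketch in plain Python (the lists simply transcribe the table above; the variable names are mine):

```python
# Transcription of the forecast table above.
N      = [175, 1, 48, 37, 20, 10, 10, 13, 13, 3, 2, 6]  # forecasts issued per value
precip = [3, 0, 4, 5, 4, 4, 5, 5, 7, 3, 1, 3]           # times precipitation occurred

# Observed frequency (%) of precipitation for each forecast value.
freq = [round(100 * p / n, 1) for p, n in zip(precip, N)]
print(freq)  # [1.7, 0.0, 8.3, 13.5, 20.0, 40.0, 50.0, 38.5, 53.8, 100.0, 50.0, 50.0]
```

The column sums (338 forecasts, 44 precipitation events) are used below for the sample climatology.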

There are two useful ways to display this. The first is the attributes diagram (discussed in Wilks), which focuses on the calibration-refinement decomposition of p(f,x). The second is the Relative (or Receiver) Operating Characteristic (ROC) curve, which focuses on the likelihood-base rate decomposition.


Attributes Diagrams

The first step in creating an attributes diagram is to plot the observed frequency for each forecast value. Formally, this is a plot of p(x|f) vs. f. (This step by itself produces what is referred to as a "reliability diagram.") We then add several reference lines to the figure to help in understanding.

The first is a diagonal line running from the point at the bottom left (0,0) to the upper right (1,1, or 100%,100%). This is the line of perfect reliability: points of the reliability curve that fall on it have a reliability term of zero (perfect reliability), such as the 40% and 50% points in this case. Second, we add horizontal and vertical lines at the sample climatology. In this case, the sample climatological frequency is 44/338 (the sum of the Precip column divided by the sum of the N column), or 13%. The horizontal line is referred to as the no-resolution line: points of the reliability curve that fall on it have zero resolution, such as the 20% forecast. Halfway between the no-resolution and perfect-reliability lines we draw the "no-skill" line, so called because points on it contribute nothing to the Brier skill score. Points closer to the no-resolution line than to the perfect-reliability line (20%, 30%, 90%, and 100%, in this case) contribute negatively to the Brier skill score.

Finally, we often add a line, dashed in this case, to show the relative frequency of use of each forecast category. For example, the 0% forecast is used a little over 50% of the time.
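The geometry of the diagram can be checked numerically. The sketch below (plain Python, no plotting; the variable names are mine, not from any standard package) computes the sample climatology and flags the points that fall on the no-resolution side of the no-skill line, i.e. those that contribute negatively to the Brier skill score:

```python
# Forecast values and counts from the table above.
F      = [0, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
N      = [175, 1, 48, 37, 20, 10, 10, 13, 13, 3, 2, 6]
precip = [3, 0, 4, 5, 4, 4, 5, 5, 7, 3, 1, 3]

clim = sum(precip) / sum(N)  # sample climatology: 44/338, about 0.13

# A point contributes positively to the Brier skill score when its observed
# frequency lies closer to the perfect-reliability line (obs == forecast)
# than to the no-resolution line (obs == climatology); otherwise negatively.
negative = [f for f, n, p in zip(F, N, precip)
            if abs(p / n - f / 100) >= abs(p / n - clim)]
print(negative)  # [20, 30, 90, 100]
```

This recovers exactly the four negatively contributing points named in the text.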


Relative (Receiver) Operating Characteristics

Mason (1982) is the standard meteorological reference for ROC curves, which come from the signal detection community.

In order to make a ROC curve, we start with the forecast table above, but instead of looking at the number of forecasts and the number of times it rained for each probability, we look at the number of non-events and events for each probability. Then, we sum each of those over all values greater than or equal to each forecast probability value. This is equivalent to turning the probabilistic forecast into a dichotomous forecast at a series of decision thresholds equal to our forecast values. We then calculate the fraction of the overall non-events (events) that occurred at probability values greater than or equal to our decision threshold. For example, in the table below, all 294 of the total non-events occur at or above a threshold of 0% (always forecast yes), but only 122 of the 294 (0.41) occur at or above a threshold of 5%. This fraction is p(f=1|x=0), assuming that f=1 for all forecast probabilities greater than or equal to the threshold; p(f=1|x=0) is called the false alarm rate or probability of false detection. The hit rate is defined in the same way, but for p(f=1|x=1). The values of the false alarm rate and hit rate for the table at the top are given below.
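The cumulative bookkeeping just described can be sketched in a few lines of Python (this reproduces the Sum, FAR, and hit-rate columns of the table that follows; it is a hand-rolled illustration, not taken from any verification library):

```python
# Per-threshold counts from the first table: non-events = N - Precip.
F   = [0, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
no  = [172, 1, 44, 32, 16, 6, 5, 8, 6, 0, 1, 3]  # N - Precip
yes = [3, 0, 4, 5, 4, 4, 5, 5, 7, 3, 1, 3]       # Precip

total_no, total_yes = sum(no), sum(yes)          # 294 and 44

roc = []
for i, f in enumerate(F):
    sum_no  = sum(no[i:])          # non-events with forecast >= f
    sum_yes = sum(yes[i:])         # events with forecast >= f
    far = sum_no / total_no        # p(f=1 | x=0), false alarm rate
    hit = sum_yes / total_yes      # p(f=1 | x=1), hit rate
    roc.append((f, sum_no, sum_yes, round(far, 2), round(hit, 2)))

print(roc[1])  # (5, 122, 41, 0.41, 0.93)
```

The 5% row matches the worked example in the text: 122 of 294 non-events, giving a false alarm rate of 0.41.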

   F   No Precip   Yes Precip   Sum(No)   Sum(Yes)   FAR    Hit Rate
   0      172           3         294        44      1.00     1.00
   5        1           0         122        41       .41      .93
  10       44           4         121        41       .41      .93
  20       32           5          77        37       .26      .84
  30       16           4          45        32       .15      .73
  40        6           4          29        28       .10      .64
  50        5           5          23        24       .08      .55
  60        8           5          18        19       .06      .43
  70        6           7          10        14       .03      .32
  80        0           3           4         7       .01      .16
  90        1           1           4         4       .01      .09
 100        3           3           3         3       .01      .07
>100        0           0           0         0       .00      .00

To make a ROC curve, simply plot hit rate vs. FAR. Note that two points always appear on a ROC curve: (1,1), associated with always forecasting yes, and (0,0), associated with always forecasting no. We usually add a diagonal line running from lower left to upper right. On a ROC curve, this is the line of no skill, since along it p(f=1|x=1) = p(f=1|x=0); in effect, the forecasts look the same no matter what the observation is. A simple measure of skill on a ROC chart is the area under the curve: its maximum value is 1, and no skill corresponds to 0.5. Simply connecting the dots for the points in the table above gives an area of 0.86. Wilson (2000) discusses fitting a curve to the points on a ROC diagram and then calculating the area under the fitted curve. This is equivalent to assuming the forecast distributions are Gaussian, which is frequently approximately true. In any event, the "dot-to-dot" method typically underestimates the "true" area.
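The dot-to-dot area quoted above is just the trapezoidal rule applied to the (FAR, hit rate) pairs, ordered by increasing FAR from (0,0) to (1,1). A quick check in plain Python (the duplicate point at (.41, .93), from the 5% and 10% thresholds, is dropped since it adds no area):

```python
# (FAR, hit rate) points from the ROC table, ordered by increasing FAR,
# from the always-no corner (0, 0) to the always-yes corner (1, 1).
far = [0.00, 0.01, 0.01, 0.01, 0.03, 0.06, 0.08, 0.10, 0.15, 0.26, 0.41, 1.00]
hit = [0.00, 0.07, 0.09, 0.16, 0.32, 0.43, 0.55, 0.64, 0.73, 0.84, 0.93, 1.00]

# Trapezoidal ("dot-to-dot") area under the ROC curve.
area = sum((far[i + 1] - far[i]) * (hit[i] + hit[i + 1]) / 2
           for i in range(len(far) - 1))
print(round(area, 2))  # 0.86
```

This reproduces the 0.86 figure, comfortably above the no-skill value of 0.5.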