15th AMS Conference on Weather Analysis and Forecasting
Norfolk, Virginia, 19-23 August 1996, in press



Arthur Witt, Harold E. Brooks, and Michael D. Eilts

NOAA/ERL/National Severe Storms Laboratory

Norman, Oklahoma

1. INTRODUCTION
The purpose of a weather forecast should be to help people make better weather-dependent decisions. In the Oklahoma City area, public weather forecasts can be obtained from numerous sources. To optimize weather-dependent decisions, users obviously would want information that would help them get the most value out of the forecasts. Unfortunately, information on the quality of public weather forecasts is difficult, if not impossible, to obtain. (In general, there is a complex relationship between the quality and value of forecasts (Murphy 1993), but analysis of quality is a reasonable place to start.) To help fill (ever so slightly) this vast data void, we set out to record and verify public weather forecasts for the Oklahoma City area over a 14-month period.

Television forecasts were tape-recorded and the newspaper forecasts taken from the daily paper. We wrote letters to each of the media sources asking questions about their procedures and forecast descriptors, but received only one reply. Therefore, we have had to interpret some aspects of the forecasts ourselves, particularly the meaning of Probability of Precipitation (PoP) in the media forecasts. When no answer was received, we assumed the source used the same definition as the National Weather Service (NWS). Based on the characteristics of the forecast PoPs, we do not believe this decision has a significant impact on the interpretation of the forecasts. Nevertheless, the use of undefined terms represents a dilemma for forecast users.

2. DATA
Forecast data were collected from five different sources: evening forecasts from the three network TV stations, a daily newspaper with early morning delivery, and the Oklahoma City NWS office. Each of these sources issues a local forecast for Oklahoma City extending at least five days ahead. The NWS forecast used was the local forecast issued around 4 pm. The TV station forecasts used were those presented during the late afternoon/evening newscasts. Maximum and minimum temperature forecasts (up to five days ahead) were evaluated for all five sources. Precipitation forecasts were evaluated only for the three sources that produced numerical PoPs. The data collection period ran from 1/4/94 to 3/6/95. Verification data came from the OKC SAO. Maximum temperature forecasts were verified for the period from 12 UTC to 03 UTC, and minimum temperature forecasts for the period from 00 UTC to 15 UTC. Precipitation forecasts were verified for 24-hour periods (beginning and ending at 06 UTC), except for the first three days of each NWS forecast, which were verified for 12-hour periods. In the following sections, when measures are presented from more than one source (for comparison purposes), only those days when forecasts were recorded from all the sources were used.
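The wrap-around verification windows described above (e.g., 12 UTC through 03 UTC the next day for maxima) can be sketched as follows; the hourly temperature curve and the function name are illustrative stand-ins, not the actual OKC observations:

```python
# Sketch of extracting verifying extremes from hourly observations, assuming
# integer UTC hours and windows that may wrap past 00 UTC. Toy data only.

def window_hours(start, end):
    """List of UTC hours from start through end, wrapping past midnight."""
    return [h % 24 for h in range(start, end + (24 if end < start else 0) + 1)]

# Toy hourly temperatures (degF): monotonic warming, purely for illustration.
hourly = {h: 40 + h for h in range(24)}

tmax = max(hourly[h] for h in window_hours(12, 3))  # 12 UTC - 03 UTC window
tmin = min(hourly[h] for h in window_hours(0, 15))  # 00 UTC - 15 UTC window
print(tmax, tmin)
```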

3. RESULTS
3.1 Measures-oriented results

We have examined forecast quality using measures-oriented performance statistics for the entire period of record (Table 1); see Murphy and Winkler (1987) and Brooks and Doswell (1996) for a discussion of measures-oriented verification, and Wilks (1995) for definitions of the measures. One result, consistent across all forecast sources, is (as expected) that accuracy decreases as forecast lead time increases. Another is that there are significant relative differences in accuracy between the sources. For minimum temperatures (Table 1a), Forecast Source (FS) #2 has the lowest mean absolute error (MAE) for all periods, while FS#4 has the highest. For maximum temperatures (Table 1b), there are significant differences for the first time period (Day 1), with smaller differences thereafter. Once again, FS#2 had the lowest MAE for all time periods except Day 4. All forecast sources had a consistent warm bias in their temperature forecasts, especially FS#4.
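The two temperature measures used above can be stated compactly; the forecast/observation pairs below are made-up illustrations, not values from Table 1:

```python
# Mean absolute error (MAE) and mean bias for paired temperature forecasts
# and observations; a positive mean bias indicates a warm bias. Toy data only.

def mae(forecasts, observations):
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)

def mean_bias(forecasts, observations):
    return sum(f - o for f, o in zip(forecasts, observations)) / len(forecasts)

fcst = [72, 68, 75, 80]   # hypothetical Day 1 maximum temperature forecasts (degF)
obs  = [70, 69, 71, 78]   # verifying observations

print(mae(fcst, obs))        # 2.25
print(mean_bias(fcst, obs))  # 1.75, i.e., a warm bias
```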

Looking at the precipitation forecasts (Table 1c), skill scores (SS) approached zero (or less) by Day 4, meaning that, by this measure, the accuracy of the forecasts is equal to or worse than that of climatological forecasts. FS#2 had the highest SS for all time periods. Dry biases dominate the precipitation forecasts (except for the early part of the NWS forecasts) and are especially strong at the longer lead times. (Note that the NWS does not give numerical PoP values at longer lead times; we have converted the terms "slight chance," "chance," and "likely" to numerical values of 20%, 30%, and 70%, respectively.) This increasingly strong dry bias produces the observed lowering of the SS as forecast lead time increases. The bias can also be seen in the "climatological forecast PoP" of FS#1 and FS#2, that is, the mean forecast PoP for each day: for FS#1, it decreases from 16.8% on Day 1 to 8.6% on Day 5; for FS#2, from 17.6% to 11.9%. The observed sample climatology was 20.7%.
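The skill score above can be sketched as the percentage improvement of the forecast Brier score over a constant climatological PoP. The daily values below are invented; only the term-to-PoP mapping comes from the text:

```python
# Brier score and skill score (percent improvement over sample climatology).
# The term-to-PoP conversions mirror those applied to the NWS extended periods.

TERM_POP = {"slight chance": 0.20, "chance": 0.30, "likely": 0.70}

def brier(pops, outcomes):
    """Brier score for PoPs in [0, 1] against binary outcomes (1 = precip)."""
    return sum((p - o) ** 2 for p, o in zip(pops, outcomes)) / len(pops)

def skill_score(pops, outcomes):
    climo = sum(outcomes) / len(outcomes)               # sample climatology
    bs_ref = brier([climo] * len(outcomes), outcomes)   # climatological Brier score
    return 100.0 * (1.0 - brier(pops, outcomes) / bs_ref)

pops     = [0.0, TERM_POP["slight chance"], TERM_POP["likely"], 0.1, 0.0]
outcomes = [0, 0, 1, 0, 1]                              # 1 if precipitation fell

print(round(skill_score(pops, outcomes), 1))  # 5.0
```

A forecast no better than always saying the sample climatology scores zero; a consistently too-dry PoP drives the score negative, as in the Day 4-5 results.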

We computed seasonal accuracy statistics as well. For the NWS, the MAE for maximum temperature forecasts was highest during the winter (Table 2). The seasonal difference is large enough that a Day 5 forecast during the summer has a lower MAE than a Day 1 forecast during the winter. The minimum temperature forecasts have less extreme seasonality (not shown).

3.2 Distributions-oriented results

Distributions-oriented approaches provide a much richer picture of the characteristics of a forecast system (Murphy and Winkler 1987, Brooks and Doswell 1996). In general, one wishes to describe the joint probability distribution of forecasts and observations. For the number of forecast sources and variables under consideration here (5 sources with 10 temperature forecasts each gives 50 arrays, even before considering precipitation, intercomparisons of sources, or combinations of forecast variables), it is prohibitive even to show the distributions. We plan to make the distributions available to interested parties and to display them in our presentation. Here, we will focus on a few highlights from a distributions-oriented approach and from methods of stratifying forecasts (Murphy 1995).

Many things can be learned from looking at the joint distributions. As an example, the Day 1 low-temperature forecasts from the NWS (Table 3) show some curious features. (Table 3 presents the information in a very compressed format to highlight the point.) Forecasts and observations have been grouped into 5°F bins, centered on values divisible by 5; the centering was chosen so that one bin represents temperatures at or just below freezing. In general, if a particular forecast is made, the modal observation falls in the same 5°F bin. The one exception occurs with forecasts just above freezing (33-37°F); in that case, the modal observation is in the bin at or below freezing. The warm bias apparent overall in Table 1 is thus particularly pronounced around freezing. In some situations (e.g., when precipitation is expected), this error has the potential to cause problems for public safety.
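The binning described above can be sketched as follows; the forecast/observation pairs are invented, and the helper name is ours:

```python
# Bin integer temperatures into 5-degF bins centered on multiples of 5
# (so 28-32 F is the at-or-just-below-freezing bin) and tally the joint
# distribution of (forecast bin, observed bin). Illustrative pairs only.
from collections import Counter

def bin_center(temp_f):
    """Center of the 5-degF bin containing temp_f (assumes integer degF)."""
    return 5 * round(temp_f / 5)

pairs = [(34, 31), (36, 30), (31, 29), (24, 26)]   # (forecast, observed) degF
joint = Counter((bin_center(f), bin_center(o)) for f, o in pairs)

# Forecasts just above freezing (35 bin) verifying at or below freezing (30 bin):
print(joint[(35, 30)])  # 2
```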

Contingency tables for precipitation can be used to construct reliability diagrams (Wilks 1995), indicating how well the observed frequency of an event matches the forecast probability. To obtain larger sample sizes, we have summed the forecasts over all five days for FS#1 and FS#2, the two sources that produce numerical PoPs every day (Fig. 1). The general dry bias is readily apparent, as is the tendency to overuse the 0% PoP. The observed frequency for both sources with a PoP of zero is on the order of 10% and, in fact, precipitation was observed more frequently when FS#2 forecast a 0% PoP than a 10% PoP! FS#2 typically produced a "wetter" forecast (see the numbers below the diagram). More than 25% of FS#2's forecasts call for a chance of precipitation greater than the sample climatology (20.7%), while only 16% of FS#1's forecasts are that "wet." This difference extends to the highest probabilities: FS#2 used 90% and 100% PoPs 8 times, including a 100% PoP on Day 3, whereas FS#1 never used a PoP higher than 60% after Day 1.
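A reliability curve of this kind reduces to counting, for each forecast PoP value, how often precipitation actually occurred; the data below are illustrative, not the Fig. 1 sample:

```python
# Observed relative frequency of precipitation for each distinct forecast PoP.
from collections import defaultdict

def reliability(pops, outcomes):
    tally = defaultdict(lambda: [0, 0])            # pop -> [events, cases]
    for p, o in zip(pops, outcomes):
        tally[p][0] += o
        tally[p][1] += 1
    return {p: events / cases for p, (events, cases) in tally.items()}

pops     = [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.1, 0.5, 0.5]   # invented PoPs
outcomes = [0,   0,   0,   0,   1,   0,   0,   1,   0]

curve = reliability(pops, outcomes)
print(curve[0.0])  # 0.2: precipitation on 20% of the "0% PoP" days (a dry bias)
```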

The variety of forecasts available to consumers in the area can, potentially, lead to confusion. Even without considering the Day 6 and Day 7 forecasts available from some of the television stations, there are 25 forecasts of a given day's high and low temperature. One way of using those forecasts is to consider the quality of the mean of the 25 forecasts when the sources agree and when they disagree. We have divided the forecasts into cases in which the variance of the forecasts is less than or greater than 10°F. The MAE increases with variance (Table 4). It is interesting to note that the variance in the maximum-temperature forecasts is quite a bit larger than in the minimum-temperature forecasts: 182 forecasts met the low-variance criterion for the minimum temperature, while only 149 did so for the maximum. Other things being equal, one might expect more variance in the maximum-temperature forecasts, since they have a slightly longer lead time. However, given the other results, it seems the maximum temperatures are simply harder to forecast.
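The stratification above amounts to grouping days by the spread among the sources' forecasts before scoring the consensus mean; the numbers below are invented, and only the 10°F threshold comes from the text:

```python
# Score the mean of all sources' forecasts, stratified by forecast variance.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Each entry: (forecasts from the different sources for one day, verifying obs).
days = [
    ([70, 71, 70, 72, 71], 70),   # sources agree: low variance
    ([60, 70, 75, 62, 68], 71),   # sources disagree: high variance
]

for fcsts, obs in days:
    mean_fcst = sum(fcsts) / len(fcsts)
    group = "low" if variance(fcsts) < 10 else "high"
    print(group, round(abs(mean_fcst - obs), 1))
# First day falls in the low-variance group with the smaller consensus error.
```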

One important point to remember is that all of the television forecasters have the opportunity to use the NWS forecast as input to their own. (We do not know for certain when the newspaper forecast is created; NWS forecast information may or may not be available to them.) An interesting question, then, is the quality of the forecasts when the private-sector forecasts disagree strongly with the NWS. Since many radio stations in the Oklahoma City area use the NWS forecasts in their broadcasts, the NWS forecast reaches the public directly even without NOAA Weather Radio. To look at this question, we have stratified the forecasts by their disagreement with the NWS forecasts and counted the number of times the private forecast was closer to the observations (Table 5). FS#2 is the only source whose number of disagreements increases monotonically with lead time; it also improves on the NWS forecast at each period. The rest of the sources show little consistent behavior. We offer no speculation about the motivations behind these patterns or the lack thereof. At the moment, they are a curiosity, although an understanding of them would seem critical to any private-sector forecaster interested in improving forecast quality.
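The Table 5 stratification can be sketched as follows; the forecast values are invented, and the function name is ours:

```python
# Count cases where a private source differs from the NWS by more than 5 degF,
# and how often the private forecast verifies closer to the observation.

def disagreement_wins(private, nws, obs, threshold=5):
    disagree = wins = 0
    for p, n, o in zip(private, nws, obs):
        if abs(p - n) > threshold:           # a "strong" disagreement
            disagree += 1
            if abs(p - o) < abs(n - o):      # private forecast closer to obs
                wins += 1
    return wins, disagree

private = [80, 72, 60, 55]   # hypothetical private-source forecasts (degF)
nws     = [70, 71, 68, 62]   # hypothetical NWS forecasts
obs     = [78, 70, 66, 57]   # verifying observations

print(disagreement_wins(private, nws, obs))  # (2, 3): closer in 2 of 3 disagreements
```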

4. DISCUSSION
A wide variety of forecasts from different sources is available to the public via the NWS and the media. The significant disagreements between those forecasts inevitably lead to the question "who's best?" Based on our analysis, we believe, as discussed by Murphy (1993), that this question is simplistic: the rich amount of information from even a cursory verification process implies that there is no absolutely correct answer. Every one of the sources has its strengths and weaknesses, even before considering issues such as hazardous-weather preparedness. Examination of the forecasts available to the public provides fertile ground for verification specialists but, more importantly, should allow the forecasters to evaluate their own strengths and weaknesses and to improve their products, if the quality of these forecasts is a primary concern. From what we have seen, many of the weaknesses could be remedied very easily, perhaps just by having the long-range PoP forecasts tend toward climatology rather than toward zero. Given that we have been able to see this, it seems that the sources under examination either 1) have not verified their own forecasts, or 2) perceive that what the public wants is something other than the highest-quality forecast. Neither option is satisfying from the public's perspective.

ACKNOWLEDGMENTS
J. T. Johnson helped record forecasts from the television stations. We did this project in order to increase our understanding of the communication of weather information to the public.

REFERENCES
Brooks, H. E., and C. A. Doswell III, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11, in press.

Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281-293.

_____, 1995: A coherent method of stratification within a general framework for forecast verification. Mon. Wea. Rev., 123, 1582-1588.

_____, and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

Table 1

Table 1: Overall measures-oriented performance results. DA is Days Ahead, NF is Number of Forecasts, MAE is Mean Absolute Error, T is Total, SS is Skill Score (improvement in percent over the climatological Brier score). a) Minimum temperature. b) Maximum temperature. c) Precipitation.

Table 2

Table 2: MAE for NWS maximum temperature forecasts by Days Ahead (DA) and season. T is total.

Table 3

Table 3: Day 1 NWS low-temperature forecasts (Fcst.) and observations (Obs). Bins are 5°F wide, centered on the temperature at the beginning of each row/column; e.g., there were 11 cases with observations of 28-32°F and forecasts of 33-37°F. Note that the first and last rows/columns (20°F, 45°F) include all temperatures below and above those values, respectively.

Table 4

                 Min T            Max T
              Var.   MAE       Var.   MAE
Low Var.
High Var.     19.5   4.6       24.6   6.4

Table 4: Variance and MAE of mean temperature forecasts for cases with the variance of the forecasts <10°F (Low Var.) and >10°F (High Var.).

Table 5

a) Minimum temperatures
         DA 1      DA 2      DA 3      DA 4      DA 5
FS#1    50 (4)    32 (25)   43 (46)   53 (45)   63 (60)
FS#2    67 (3)    60 (10)   72 (39)   73 (49)   55 (60)
FS#3    63 (5)    38 (13)   57 (37)   71 (28)   56 (55)
FS#4    38 (13)   39 (23)   52 (66)   48 (61)   46 (79)

b) Maximum temperatures
         DA 1      DA 2      DA 3      DA 4      DA 5
FS#1    20 (10)   37 (30)   52 (33)   47 (47)   59 (81)
FS#2    67 (12)   50 (20)   53 (57)   52 (73)   63 (83)
FS#3    40 (10)   31 (13)   57 (31)   63 (38)   61 (51)
FS#4    42 (24)   48 (31)   52 (62)   55 (75)   55 (101)

Table 5: Percentage of the time a given forecast source disagreed with the NWS by more than 5°F and was more accurate, by Days Ahead (DA). Total number of disagreements in parentheses. Bold numbers indicate the source was better than the NWS in 60% or more of the disagreement cases (10-case minimum); italics indicate the source did better than the NWS 40% or less of the time. a) Minimum temperatures. b) Maximum temperatures.