Data Collection and Processing

(Updated 29 August 2003. Comments to Harold Brooks.)

Collecting and Processing the Data

Severe Weather Reports

We have used two datasets to create the figures shown here. The first contains reports of all kinds of severe weather (defined in the United States as tornadoes, thunderstorm winds of at least 58 mph, or hail at least 3/4 inch in diameter) collected by National Weather Service (NWS) meteorologists from all over the United States and archived at the NWS Storm Prediction Center. Although data of this kind have been collected since 1950, we have focused on the portion of the record since 1980. We've limited our consideration to this period because the number of reports has grown greatly over time (by almost two orders of magnitude since 1950), and much of that growth is due to increased efforts to collect the data. We have to compromise between a record that shows as little effect of the increase in reports as possible and a record long enough to be meaningful.

The second dataset was produced by Tom Grazulis of The Tornado Project. It contains significant tornadoes (rated F2 or greater in damage) in the United States since 1680. Just as we have limited our attention to the period beginning in 1980 in the NWS dataset, we have limited our attention to 1921-1995 in the Grazulis dataset.

Data Processing

We know that the reports aren't a perfect record; events are missed and erroneous reports are collected. We have tried to focus on the aspects of the reports that we believe are the most reliable: the location and the date. For the NWS data, the location is given by latitude/longitude coordinates. For the Grazulis dataset, it is given by county. We have taken the location and mapped each report onto a grid. (Technically, it is a Lambert conic conformal grid, true at 30 and 60 N.) The grid is approximately 80 km on a side, so that the area associated with each grid point is roughly equivalent to a circle 25 miles in radius. For simplicity, and for consistency in some of the later processing, we have taken only the touchdown location of tornadoes. Also, we have considered only whether or not a particular kind of severe weather event occurred on a day, not how many occurred on that day. Thus, the probability maps and graphs that are displayed should be interpreted as representing the probability of one or more of the severe weather events occurring within 25 miles of the location. The total threat maps show the average number of days per year with one or more of the severe weather events occurring within 25 miles of the location.
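The gridding step can be illustrated with a minimal sketch. It assumes a spherical Earth and the standard Lambert conformal conic formulas with standard parallels at 30 and 60 N. The grid origin (ORIGIN_LAT, ORIGIN_LON), the Earth radius, and the function names are hypothetical choices made for illustration; they are not taken from the actual processing.

```python
import math

EARTH_RADIUS_KM = 6371.0               # spherical Earth (assumption)
GRID_SPACING_KM = 80.0                 # grid spacing given in the text
STD_PARALLELS = (30.0, 60.0)           # "true at 30 and 60 N"
ORIGIN_LAT, ORIGIN_LON = 25.0, -100.0  # hypothetical grid origin


def lambert_xy(lat, lon):
    """Project lat/lon in degrees to Lambert conformal conic x, y in km."""
    phi1, phi2 = (math.radians(p) for p in STD_PARALLELS)
    phi0, lam0 = math.radians(ORIGIN_LAT), math.radians(ORIGIN_LON)
    phi, lam = math.radians(lat), math.radians(lon)

    def t(p):
        return math.tan(math.pi / 4 + p / 2)

    # Cone constant and scaling factor for two standard parallels.
    n = math.log(math.cos(phi1) / math.cos(phi2)) / math.log(t(phi2) / t(phi1))
    f = math.cos(phi1) * t(phi1) ** n / n
    rho = EARTH_RADIUS_KM * f / t(phi) ** n
    rho0 = EARTH_RADIUS_KM * f / t(phi0) ** n
    x = rho * math.sin(n * (lam - lam0))
    y = rho0 - rho * math.cos(n * (lam - lam0))
    return x, y


def grid_box(lat, lon):
    """Return the (column, row) index of the 80 km box containing a point."""
    x, y = lambert_xy(lat, lon)
    return math.floor(x / GRID_SPACING_KM), math.floor(y / GRID_SPACING_KM)


def event_days(reports):
    """Collapse (date, lat, lon) reports to unique (date, box) pairs.

    Several reports in one box on one day count as a single event day,
    matching the "one or more events" interpretation in the text.
    """
    return {(date, grid_box(lat, lon)) for date, lat, lon in reports}
```

Because event_days returns a set, any number of reports in the same box on the same day contributes exactly one entry.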

We've used a statistical technique known as nonparametric density estimation to produce the probabilities. In a nutshell, we smooth the reports in time and space. (The technical details are described below.) That means that we think that a tornado occurring on 3 May tells us something about the likelihood that one could occur on 2 May or 4 May, but not very much about how likely one is on 3 March. Similarly, a tornado at Fort Worth, Texas, tells us something about the probability of a tornado at Dallas, Texas, but not very much about the probability at Chicago, Illinois. We've included information from different time periods of the record to show how variable the reports are.

Technical Details

The maps are based on the reports from a particular time period. The procedure is as follows:

1. Reports for each day are put onto a grid of 80 km x 80 km boxes. Thanks for this go to Mike Kay of the SPC, without whom this would have gone nowhere.
2. If one or more reports occur in a grid box, that box is assigned the value "1" for the day. If no reports occur, it is assigned "0".
3. The raw frequency for each day at each grid location is found for the period (number of "1" values divided by number of years) to get a raw annual cycle.
4. The raw annual cycle at each point is smoothed in time, using a Gaussian filter with a standard deviation of 15 days.
5. The smoothed time series are then smoothed in space with a 2-D Gaussian filter (SD = 120 km in each direction).
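
The procedure above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code used to make the maps: the function names are hypothetical, the annual cycle is assumed to wrap around the end of the year, the region outside the grid is treated as having no reports, and each Gaussian is truncated at 4 standard deviations. None of those choices is specified in the description.

```python
import numpy as np

GRID_KM = 80.0      # grid spacing from the text
SIGMA_DAYS = 15.0   # temporal standard deviation from the text
SIGMA_KM = 120.0    # spatial standard deviation from the text


def gaussian_kernel(sigma, truncate=4.0):
    """Normalized 1-D Gaussian weights out to `truncate` SDs (assumption)."""
    half = int(truncate * sigma + 0.5)
    offsets = np.arange(-half, half + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()


def smooth_axis(data, sigma, axis, periodic):
    """Convolve `data` with a Gaussian along one axis."""
    kernel = gaussian_kernel(sigma)
    half = len(kernel) // 2
    pad = [(0, 0)] * data.ndim
    pad[axis] = (half, half)
    # Wrap around the year in time; pad with zeros at the grid edges in space.
    padded = np.pad(data, pad, mode="wrap" if periodic else "constant")
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)


def daily_probability(occurrence):
    """occurrence: 0/1 array shaped (years, days, rows, columns)."""
    raw = occurrence.mean(axis=0)                    # raw annual cycle
    in_time = smooth_axis(raw, SIGMA_DAYS, axis=0, periodic=True)
    sigma_boxes = SIGMA_KM / GRID_KM                 # 1.5 grid boxes
    # A 2-D Gaussian is separable, so smooth rows and columns in turn.
    in_rows = smooth_axis(in_time, sigma_boxes, axis=1, periodic=False)
    return smooth_axis(in_rows, sigma_boxes, axis=2, periodic=False)
```

Under these assumptions, summing the result over the day axis gives the expected number of event days per year at each grid point, which is the quantity the total threat maps describe.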

The smoothing is intentionally heavy, so that only the strong signals remain. It leads to a slight underestimate of the probabilities which, fortuitously, is about the same as the amount by which the area of a 25-mile-radius circle is smaller than that of an 80 km square. As a result, the threat calculated on the 80 km grid is close to the threat within a 25 mile radius, and the difference is smaller than the other sources of error.
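
The area comparison can be checked with a little arithmetic. The sketch below uses only the two sizes given in the text and the standard mile-to-kilometer conversion; the variable names are illustrative.

```python
import math

KM_PER_MILE = 1.609344

# Area of a circle with a 25 mile radius, in square kilometers.
circle_area_km2 = math.pi * (25 * KM_PER_MILE) ** 2
# Area of one 80 km x 80 km grid box.
square_area_km2 = 80.0 ** 2

# Fraction of the grid box area covered by the circle (a little under 0.8).
area_ratio = circle_area_km2 / square_area_km2
```

The circle covers roughly four fifths of the area of a grid box, so it is about one fifth smaller.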
