Topic 3: Statistical Sidebar

Reference (starred references are particularly important):

*Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

Expected Value

Expected value is a concept that will come up in both the forecast evaluation and decision analysis parts of the course. For discrete variables, we have

$$E[X] = \sum_{x} x \, p(X = x)$$

where p(X = x) is the probability that X = x.

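For continuous variables, the parallel expression is

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$

where f(x) is the probability density function of X.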
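As a minimal sketch of the discrete case, the sum of each value times its probability can be computed directly. The function name `expected_value` and the fair-die example are illustrative assumptions, not part of the course material.

```python
from fractions import Fraction


def expected_value(pmf):
    """Return the sum of x * p(X = x) for a discrete distribution.

    `pmf` is a dict mapping each value x to its probability p(X = x).
    """
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pmf.items())


# Hypothetical example: a fair six-sided die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die))  # 7/2
```

Exact fractions are used here only so that the check on the probabilities is not upset by floating-point rounding.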

Basic Concepts

We are interested in describing distributions of data. In particular, there are three concepts that are especially important:

  1. Location
  2. Spread
  3. Symmetry

 

In addition, we'd like those descriptors to have certain 'nice' properties. Two of those properties deal with sensitivity to specific aspects of the data: robustness and resistance. A robust method is one that may not be optimal for a particular problem, but performs reasonably well most of the time. Robust methods are generally insensitive to assumptions about the overall nature of the data. Resistant measures, on the other hand, are ones that are not heavily influenced by outliers in a data set.

As in many other parts of life, you can't always get what you want and, sometimes, we end up using methods that are neither robust nor resistant. That doesn't mean they're necessarily bad, but they can lead to misleading conclusions if you aren't aware of the problems.

Two Tracks for Describing Distributions

There are two parallel tracks that we will use to describe data distributions, and they can be lumped into two broad categories: parametric methods and non-parametric methods. They have different strengths and weaknesses and it is certainly not a case of one size fits all. Parametric techniques make implicit assumptions about the underlying statistical distributions. To the extent that those assumptions are valid, exact expressions about the distribution can be obtained. Non-parametric techniques make no such assumptions and, as such, can describe any distribution very well. However, evaluation of the distribution may be more difficult and you can't extrapolate beyond the data. It is often advisable to use both approaches when first looking at a data set.

Let's look at how each of our three basic concepts is described along these two tracks.

Parametric

Location

The standard parametric descriptor of location is the mean. It is given by the first equation in these notes and is usually designated by the symbol μ. For a set of data, it is the sum of all values divided by the number of values. It is neither robust nor resistant. The lack of resistance is easy to see with the data set {1, 2, 3, 4, 5, 6, 7, 8, 9}. Here, μ = 5. If the last value is changed to 99, μ = 15.
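The effect of the single outlier on the mean can be checked with a few lines of Python. This is only a sketch; the variable names `first` and `second` are labels chosen here for the two data sets.

```python
def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [1, 2, 3, 4, 5, 6, 7, 8, 99]  # last value replaced by an outlier

print(mean(first))   # 5.0
print(mean(second))  # 15.0
```

Changing one value out of nine triples the mean, which is what "not resistant" means in practice.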

Spread

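We can derive an expected value for the square of differences between values of a population and the mean. This is called the variance, given by

$$
\begin{aligned}
\mathrm{Var}[X] &= E\left[(X - \mu)^{2}\right] \\
                &= E\left[X^{2}\right] - \mu^{2}
\end{aligned}
$$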

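More commonly, we use an unbiased estimate of the variance, the sample variance, given by

$$s^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^{2}$$

where n is the number of values in the sample and \bar{x} is their mean.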

The square root of the sample variance is the sample standard deviation, another common descriptor of spread. It has less desirable theoretical properties than the variance, but it has the same dimensions as the variable in question. Like the mean, the variance and standard deviation are neither robust nor resistant. For the first data set above, the sample variance is 7.5. For the second one, it is 997.5.

At large sample sizes (large n), the sample variance is very close to the value of Var[X] given above. The two differ by a factor of n/(n-1), which approaches 1 as n grows.
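The numbers quoted above, and the n/(n-1) factor, can be reproduced with a short sketch. The function names are chosen here for illustration; `population_variance` uses the 1/n form and `sample_variance` the 1/(n-1) form.

```python
def population_variance(values):
    """Mean squared difference from the mean (divides by n)."""
    n = len(values)
    center = sum(values) / n
    return sum((x - center) ** 2 for x in values) / n


def sample_variance(values):
    """Unbiased estimate of the variance (divides by n - 1)."""
    n = len(values)
    center = sum(values) / n
    return sum((x - center) ** 2 for x in values) / (n - 1)


first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [1, 2, 3, 4, 5, 6, 7, 8, 99]

print(sample_variance(first))   # 7.5
print(sample_variance(second))  # 997.5

# The two forms differ by n / (n - 1), here 9 / 8.
ratio = sample_variance(first) / population_variance(first)
print(ratio)
```

A single outlier inflates the sample variance by more than a factor of 100, an even stronger reaction than the mean showed.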

Symmetry

Higher-order parametric descriptors can be obtained by substituting a larger integer for the exponent n = 2 in the Var[X] equation. (Here n is the order of the moment, not the sample size. Note that the second line of that equation will not hold for n > 2.) These descriptors are known as the n-th moments of the distribution. For instance, the variance is the second moment. A measure of the symmetry of the distribution, skewness, is derived from the third moment (n = 3), and kurtosis from the fourth. I have no idea if there are names for any of the higher moments, but we are unlikely to use anything above n = 3.
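A rough sketch of the idea of raising the exponent follows. The helper names are hypothetical, and the skewness shown uses one common convention (third central moment divided by the second central moment to the power 3/2, both in the 1/n form); the notes themselves do not specify a normalization.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order, using the 1/n form."""
    n = len(values)
    center = sum(values) / n
    return sum((x - center) ** order for x in values) / n


def skewness(values):
    """Third central moment scaled by the 3/2 power of the second.

    Assumption: this standardized form is one common convention.
    """
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [1, 2, 3, 4, 5, 6, 7, 8, 99]

print(skewness(first))   # 0.0 for this symmetric data set
print(skewness(second))  # strongly positive: a long tail toward large values
```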

Non-parametric

Location

The standard non-parametric descriptor of location is the median, which is the value of the data at the center of the distribution, so that the same number of points are above and below it. In the case of both of our simple data sets, the median is 5. The median is a robust and resistant estimator of location.
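A quick check with Python's standard `statistics` module shows the resistance of the median on the same two data sets (the variable names are labels chosen here):

```python
from statistics import median

first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [1, 2, 3, 4, 5, 6, 7, 8, 99]

print(median(first))   # 5
print(median(second))  # 5
```

Unlike the mean, the median does not move when the largest value is replaced by 99.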

Spread

There are two standard non-parametric estimators of spread. The first is the median absolute deviation (MAD). In parallel with the variance, we find the absolute differences between the values and the median of the distribution. For the first data set, we'd have {4, 3, 2, 1, 0, 1, 2, 3, 4}. Then, we find the median of those values, which is 2. For the second data set, the deviations are {4, 3, 2, 1, 0, 1, 2, 3, 94}. The median of those values is again 2, so the outlier leaves the MAD unchanged.
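The two steps of the MAD calculation (absolute deviations from the median, then the median of those) can be sketched as follows. The function name `mad` is chosen here for illustration.

```python
from statistics import median


def mad(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    deviations = [abs(x - center) for x in values]
    return median(deviations)


first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [1, 2, 3, 4, 5, 6, 7, 8, 99]

print(mad(first))   # 2
print(mad(second))  # 2
```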

The second estimator is the interquartile range (IQR). It compares the values of two quantiles. A quantile, q_i, is the value in the data set below which a fraction i of the points in the distribution lie. For example, q_0.5 is the point below which half of the data lie. (Note that this is the median.) The IQR is the difference between q_0.75 and q_0.25. It is easier to calculate than MAD, but has the disadvantage of looking at a smaller fraction of the data. Again, these methods are robust and resistant.
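A sketch of the IQR using the standard library's `statistics.quantiles` (Python 3.8 or later) is given below. Several conventions exist for interpolating quantiles in small samples; the "inclusive" method is used here as an assumption, and other conventions can give slightly different quartiles.

```python
from statistics import quantiles


def iqr(values):
    """Difference between the upper and lower quartiles.

    Assumption: quartiles are computed with the 'inclusive' method.
    """
    q25, _q50, q75 = quantiles(values, n=4, method="inclusive")
    return q75 - q25


first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [1, 2, 3, 4, 5, 6, 7, 8, 99]

print(iqr(first))   # 4.0
print(iqr(second))  # 4.0
```

The outlier lies above the upper quartile, so it has no influence on the result.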

Symmetry

The standard non-parametric estimator of symmetry is the Yule-Kendall index, given by

$$\gamma_{YK} = \frac{q_{0.25} - 2\,q_{0.5} + q_{0.75}}{\mathrm{IQR}}$$
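A minimal sketch of the Yule-Kendall index follows, assuming the usual definition (q_0.25 - 2 q_0.5 + q_0.75) / IQR as given in Wilks (1995) and the "inclusive" quartile convention of `statistics.quantiles`. The third data set, `skewed`, is a made-up example added only to show a non-zero result.

```python
from statistics import quantiles


def yule_kendall(values):
    """Compare the distances from the median to each quartile, scaled by the IQR."""
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return (q25 - 2 * q50 + q75) / (q75 - q25)


first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [1, 2, 3, 4, 5, 6, 7, 8, 99]
skewed = [1, 1, 2, 2, 3, 5, 8, 13, 21]  # hypothetical right-skewed sample

print(yule_kendall(first))   # 0.0
print(yule_kendall(second))  # 0.0
print(yule_kendall(skewed))  # positive: upper quartile is farther from the median
```

Because the index depends only on the quartiles, the single outlier in the second data set does not register at all, in contrast to the moment-based skewness.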

Again, we are unlikely to need descriptors of this order very often, but it is useful to see the parallels between the two approaches.