**Reference (starred references are particularly important):**

\*Wilks, D. S., 1995: *Statistical Methods in the
Atmospheric Sciences*. Academic Press, 467 pp.

Expected value is a concept that will come up in both the forecast evaluation and decision analysis parts of the course. For discrete variables, we have

$$E[X] = \sum_{x} x \, p(X = x),$$

where p(X = x) is the probability that X = x.
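As a minimal sketch of the discrete case, the expected value is a probability-weighted sum over all outcomes. The function name `expected_value` and the fair-die example are illustrative assumptions, not part of the course material.

```python
from fractions import Fraction


def expected_value(pmf):
    """Return E[X], the sum of x * p(X = x) over all outcomes x.

    `pmf` maps each outcome x to its probability p(X = x).
    """
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pmf.items())


# Hypothetical example: a fair six-sided die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die))  # 7/2
```

Exact fractions are used here only so that the probabilities sum to exactly 1; floating-point probabilities would need a tolerance in that check.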

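For continuous variables, the parallel expression is

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx,$$

where f(x) is the probability density function of X.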

We are interested in describing distributions of data. In particular, there are three concepts that are especially important:

- Location
- Spread
- Symmetry

In addition, we'd like those descriptors to have certain 'nice' properties.
Among those properties are two that deal with the sensitivity to specific
aspects of the data, **robustness** and **resistance**. A robust method
is one that may not be optimal for a particular problem, but performs reasonably
well most of the time. Robust methods generally are insensitive
to assumptions about the overall nature of the data. Resistant measures, on the
other hand, are ones that are not heavily influenced by outliers in a data set.

As in many other parts of life, you can't always get what you want and, sometimes, we end up using methods that are neither robust nor resistant. That doesn't mean they're necessarily bad, but they can lead to misleading conclusions if you aren't aware of the problems.

There are two parallel tracks that we will use to describe data
distributions, which can be lumped into two broad categories: **parametric**
methods and **non-parametric** methods. They have different strengths and
weaknesses and it is certainly not a case of one size fits all. Parametric
techniques make implicit assumptions about the underlying statistical
distributions. To the extent that those assumptions are valid, exact
expressions about the distribution can be obtained. Non-parametric techniques
make no such assumptions and, as such, can describe any distribution very well.
However, evaluation of the distribution may be more difficult and you can't
extrapolate beyond the data. It is often advisable to use both approaches when first looking at a data set.

Let's look at how each of our three basic concepts is described along these two tracks.

*Location*

The standard parametric descriptor of location is the mean. It is given by the first equation in these notes and is usually designated by the symbol μ. It is the sum of all values in the distribution divided by the number of values. It is neither robust nor resistant. The latter is easy to see by considering the data set {1, 2, 3, 4, 5, 6, 7, 8, 9}. Here, μ = 5. If the last value in the data set were changed to 99, we would have μ = 15.
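The lack of resistance can be checked directly. This is a small sketch in Python; the variable names `original` and `with_outlier` are labels chosen here for the two data sets in the text.

```python
def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]  # last value changed to 99

print(mean(original))      # 5.0
print(mean(with_outlier))  # 15.0
```

Changing a single value out of nine triples the mean, which is what "not resistant" means in practice.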

*Spread*

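We can derive an expected value for the square of differences between values of a population and the mean. This is called the variance, given by

$$\begin{aligned} \mathrm{Var}[X] &= E\left[(X - \mu)^2\right] \\ &= E[X^2] - \mu^2, \end{aligned}$$

where μ = E[X] is the mean.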

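More commonly, we use an unbiased estimate of the variance, the sample variance, given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

where n is the number of values and $\bar{x}$ is their mean.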

The square root of the sample variance is the sample standard deviation, another common descriptor of spread. It has less desirable theoretical properties than the variance, but it has the same dimensions as the variable in question. Again, variance and standard deviation are neither robust nor resistant. For the first data set above, the sample variance is 7.5. For the second one, it is 997.5.

At large sample sizes (large n), the sample variance is very close to the
value of Var[*X*] given above, differing by a
factor of n/(n-1).
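The figures quoted above, and the n/(n-1) factor, can be reproduced with a short sketch. The function names are illustrative choices; `population_variance` divides by n and `sample_variance` divides by n - 1.

```python
import math


def population_variance(values):
    """Mean squared difference from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Unbiased estimate of the variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]

print(sample_variance(original))      # 7.5
print(sample_variance(with_outlier))  # 997.5

# The sample standard deviation is the square root of the sample variance.
print(math.sqrt(sample_variance(original)))

# The two variance estimates differ by the factor n / (n - 1), here 9 / 8.
print(sample_variance(original) / population_variance(original))
```

With only nine values the factor is 1.125, so the distinction matters; for n in the thousands it is negligible.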

*Symmetry*

Higher-order parametric descriptors can be obtained by substituting a larger
integer for the *n* = 2 in the Var[*X*] equation. (Note that the second line will not
hold for *n* > 2.) These descriptors are known as the *n*-th moments of the distribution.
For instance, the variance is the second moment. A measure of the symmetry of
the distribution, skewness, is the third moment (*n* = 3) and kurtosis is the fourth.
I have no idea if there are names for any
of the higher moments, but we are unlikely to use anything above *n* = 3.
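A sketch of the moment calculation follows. The order of the moment is called `k` here to avoid confusion with the sample size. Dividing the third moment by the 3/2 power of the second is one common way to make skewness dimensionless; it is an assumption of this sketch, and normalization conventions differ between references.

```python
def central_moment(values, k):
    """Average of (x - mean)**k over the data: the k-th moment about the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def skewness(values):
    """Third moment about the mean, scaled by the second moment to the 3/2 power."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]

print(skewness(original))      # 0.0, since this data set is symmetric
print(skewness(with_outlier))  # positive: the long tail is on the high side
```

Because the deviations are cubed, a single large outlier dominates the result even more strongly than it does for the variance.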

*Location*

The standard non-parametric descriptor of location is the median, which is the value of the data at the center of the distribution, so that the same number of points are above and below it. In the case of both of our simple data sets, the median is 5. The median is a robust and resistant estimator of location.

*Spread*

There are two standard non-parametric estimators of spread. The first is the median absolute deviation (MAD). In parallel with the variance, we find the absolute differences between the values and the median of the distribution. In our case for the first data set, we'd have {4,3,2,1,0,1,2,3,4}. Then, we find the median of those values, which is 2. For the second data set, the deviations are {4,3,2,1,0,1,2,3,94}. The median of those values is 2.
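The median and MAD for both data sets can be verified with Python's standard `statistics` module. The helper name `median_absolute_deviation` is an illustrative choice.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute differences between each value and the median."""
    center = median(values)
    return median(abs(x - center) for x in values)


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]

print(median(original), median_absolute_deviation(original))          # 5 2
print(median(with_outlier), median_absolute_deviation(with_outlier))  # 5 2
```

Neither the median nor the MAD changes when 9 is replaced by 99, in contrast to the mean and variance.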

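The second estimator is the interquartile range (IQR). It compares the values
of two *quantiles*. A quantile, $q_i$, is the value in the data set for which the fraction,
*i*, of the data lies at or below it. The IQR is the difference between the upper and lower quartiles,

$$\mathrm{IQR} = q_{0.75} - q_{0.25}.$$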

*Symmetry*

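The standard non-parametric estimator of symmetry is the Yule-Kendall index, given by

$$\gamma_{YK} = \frac{q_{0.25} - 2\,q_{0.5} + q_{0.75}}{\mathrm{IQR}},$$

where $q_{0.25}$, $q_{0.5}$, and $q_{0.75}$ are the lower quartile, the median, and the upper quartile.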

Again, we are unlikely to look even this high very often, but it is useful to see the parallels between the two approaches.
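To make the non-parametric spread and symmetry measures concrete, here is a hedged sketch that computes the IQR as the upper quartile minus the lower quartile, and the Yule-Kendall index as (q0.25 - 2 q0.5 + q0.75) / IQR. Quartile conventions differ between texts and software; the `"inclusive"` interpolation method of `statistics.quantiles` is an assumption of this sketch, and the `skewed` data set is a hypothetical example added for illustration.

```python
from statistics import quantiles


def quartiles(values):
    """Return (q25, q50, q75) using the 'inclusive' interpolation convention."""
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return q25, q50, q75


def iqr(values):
    """Interquartile range: upper quartile minus lower quartile."""
    q25, _, q75 = quartiles(values)
    return q75 - q25


def yule_kendall(values):
    """Yule-Kendall index: (q25 - 2*q50 + q75) / IQR."""
    q25, q50, q75 = quartiles(values)
    return (q25 - 2 * q50 + q75) / (q75 - q25)


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 99]
skewed = [1, 2, 3, 4, 5, 7, 10, 15, 30]  # hypothetical right-skewed sample

print(iqr(original), yule_kendall(original))          # 4.0 0.0
print(iqr(with_outlier), yule_kendall(with_outlier))  # 4.0 0.0
print(yule_kendall(skewed))                           # positive (3/7)
```

The single outlier leaves both measures unchanged, while the skewed sample gives a positive index because the upper quartile lies farther from the median than the lower quartile does.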