**Reference (starred references are particularly important):**

\*Wilks, D. S., 1995: *Statistical Methods in the Atmospheric
Sciences*. Academic Press, 467 pp.

Expected value is a concept that will come up in both the forecast evaluation and decision analysis parts of the course. For discrete variables, we have

$$E[X] = \sum_{x} x \, p(X = x)$$

where p(X = x) is the probability that X = x.

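For continuous variables, the parallel expression is

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$

where f(x) is the probability density function of X.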
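As a rough numerical illustration, the expected value can be computed as a probability-weighted sum in the discrete case and approximated by a sum over narrow bins in the continuous case. The fair die and the uniform density on [0, 1] are illustrative choices, not examples taken from these notes.

```python
# Discrete case: a fair six-sided die (illustrative choice).
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1 / 6] * 6

# E[X] is the sum of each value times its probability.
expected_discrete = sum(x * p for x, p in zip(outcomes, probabilities))


# Continuous case: approximate the integral of x * f(x) with a midpoint sum.
def density(x):
    """Uniform density on [0, 1], used here only as a stand-in."""
    return 1.0


n_bins = 1000
width = 1.0 / n_bins
midpoints = [(k + 0.5) * width for k in range(n_bins)]
expected_continuous = sum(x * density(x) * width for x in midpoints)

print(expected_discrete)    # close to 3.5
print(expected_continuous)  # close to 0.5
```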

We are interested in describing distributions of data. In
particular, there are three concepts that are especially
important:

- Location
- Spread
- Symmetry

In addition, we'd like those descriptors to have certain 'nice'
properties. Among those properties are two that deal with the
sensitivity to specific aspects of the data, **robustness** and
**resistance**. A robust method is one that may not be optimal for
a particular problem, but performs reasonably well most of the time.
Robust methods generally are insensitive to assumptions about the
overall nature of the data. Resistant measures, on the other hand,
are ones that are not heavily influenced by outliers in a data
set.

As in many other parts of life, you can't always get what you want
and, sometimes, we end up using methods that are neither robust nor
resistant. That doesn't mean they're necessarily bad, but they can
lead to misleading conclusions if you aren't aware of the
problems.

There are two parallel tracks that we will use to describe data
distributions, which can be lumped into two broad
categories: **parametric** methods and **non-parametric**
methods. They have different strengths and weaknesses and it is
certainly not a case of one size fits all. Parametric techniques make
implicit assumptions about the underlying statistical distributions.
To the extent that those assumptions are valid, exact expressions
about the distribution can be obtained. Non-parametric techniques
make no such assumptions and, as such, can describe any distribution
very well. However, evaluation of the distribution may be more
difficult and you can't extrapolate beyond the data. It is often
advisable to use both approaches when first looking at a data
set.

Let's look at how each of our three basic concepts is described
along these two tracks, beginning with the parametric descriptors
and then turning to their non-parametric counterparts.

*Location*

The standard parametric descriptor of location is the mean. It is given by the first equation in these notes and is usually designated by the symbol $\mu$. For a set of data, it is the sum of all values divided by the number of values. It is neither robust nor resistant. The latter is easy to see by considering the data set {1, 2, 3, 4, 5, 6, 7, 8, 9}. Here, $\mu$ = 5. If the last value were changed to 99, we would have $\mu$ = 15: a single outlier triples the mean.
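The lack of resistance is easy to check numerically. The short sketch below uses the two data sets from the text; the variable names are only for illustration.

```python
original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = original[:-1] + [99]  # replace the last value with 99


def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


mean_original = mean(original)
mean_outlier = mean(with_outlier)

# One changed value moves the mean from 5 to 15.
print(mean_original, mean_outlier)
```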

*Spread*

We can derive an expected value for the square of differences between values of a population and the mean. This is called the variance, given by

$$\begin{aligned}
\mathrm{Var}[X] &= E[(X - \mu)^2] \\
&= E[X^2] - \mu^2
\end{aligned}$$

More commonly, we use an unbiased estimate of the variance, the sample variance, given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2$$

The square root of the sample variance is the sample standard deviation, another common descriptor of spread. It has less desirable theoretical properties than the variance, but it has the same dimensions as the variable in question. Again, variance and standard deviation are neither robust nor resistant. For the first data set above, the sample variance is 7.5. For the second one, it is 997.5.

At large sample sizes (large *n*), the sample variance is very close to the value of Var[*X*] given above, differing by a factor of *n*/(*n*-1).
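The two variance estimates, and the factor relating them, can be compared on the example data. This is a minimal sketch in which the mean is estimated from the data themselves; the function names are illustrative.

```python
def population_variance(values):
    """Mean squared difference from the mean (divides by n)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Unbiased estimate of the variance (divides by n - 1)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / (n - 1)


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = original[:-1] + [99]

s2_original = sample_variance(original)      # 7.5
s2_outlier = sample_variance(with_outlier)   # 997.5

# The ratio of the two estimates is n / (n - 1), here 9 / 8.
ratio = s2_original / population_variance(original)

print(s2_original, s2_outlier, ratio)
```

With only nine values the factor is 1.125, but it approaches 1 as *n* grows.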

*Symmetry*

Higher-order parametric descriptors can be obtained by
substituting a larger integer for the *n* = 2 in the Var[*X*]
equation. (Note that the second line will not hold for *n* > 2.)
These descriptors are known as the *n*-th moments
of the distribution about the mean. For instance, the variance is the
second moment. A measure of the symmetry of the distribution, skewness,
is based on the third moment (*n* = 3) and kurtosis on the fourth. I
have no idea if there are names for any of the higher moments, but we
are unlikely to use anything above *n* = 3.
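A short sketch of the moment calculation follows. Dividing the third moment by the second moment raised to the power 3/2 is one common way to make skewness dimensionless; that scaling is an assumption here, since the notes do not specify one.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over the data."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** order for x in values) / n


def skewness(values):
    # Third moment scaled by the second moment to the power 3/2
    # (an assumed convention that removes the units).
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = original[:-1] + [99]

skew_original = skewness(original)      # zero for the symmetric set
skew_outlier = skewness(with_outlier)   # strongly positive

print(skew_original, skew_outlier)
```

Like the mean and variance, this measure reacts strongly to the single outlier.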

*Location*

The standard non-parametric descriptor of location is the median, which is the value of the data at the center of the distribution, so that the same number of points are above and below it. In the case of both of our simple data sets, the median is 5. The median is a robust and resistant estimator of location.

*Spread*

There are two standard non-parametric estimators of spread. The first is the median absolute deviation (MAD). In parallel with the variance, we find the absolute differences between the values and the median of the distribution. In our case for the first data set, we'd have {4,3,2,1,0,1,2,3,4}. Then, we find the median of those values, which is 2. For the second data set, the deviations are {4,3,2,1,0,1,2,3,94}. The median of those values is 2.
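The median and MAD calculations can be sketched with the standard library's `statistics.median`. The helper name `median_absolute_deviation` is illustrative.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute differences from the median."""
    center = median(values)
    return median(abs(x - center) for x in values)


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = original[:-1] + [99]

median_original = median(original)
median_outlier = median(with_outlier)
mad_original = median_absolute_deviation(original)
mad_outlier = median_absolute_deviation(with_outlier)

# Both data sets give a median of 5 and a MAD of 2.
print(median_original, median_outlier, mad_original, mad_outlier)
```

The outlier that moved the mean from 5 to 15 leaves both of these descriptors unchanged, which is what resistance means in practice.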

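The second estimator is the interquartile range (IQR). It compares
the values of two *quantiles*. A quantile, $q_i$,
is the value in the data set for which the fraction, *i*, of the
data lies below it. The IQR is the difference between the upper and
lower quartiles, $q_{0.75} - q_{0.25}$.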

*Symmetry*

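The standard estimator of symmetry is the Yule-Kendall index, given by

$$\gamma_{YK} = \frac{(q_{0.75} - q_{0.5}) - (q_{0.5} - q_{0.25})}{\mathrm{IQR}} = \frac{q_{0.25} - 2 q_{0.5} + q_{0.75}}{\mathrm{IQR}}$$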

Again, we are unlikely to look even this high very often, but it is useful to see the parallels between the two approaches.
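To make the quantile-based descriptors concrete, the sketch below takes the IQR as the difference between the upper and lower quartiles and the Yule-Kendall index as (q0.25 - 2 q0.5 + q0.75)/IQR, the form given in Wilks. Quartile conventions differ between packages; the `inclusive` method of `statistics.quantiles` (Python 3.8+) is an assumed choice, and the `skewed` sample is made up for illustration.

```python
from statistics import quantiles


def quartiles(values):
    """Return (q25, q50, q75) using the 'inclusive' convention."""
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return q25, q50, q75


def interquartile_range(values):
    q25, _, q75 = quartiles(values)
    return q75 - q25


def yule_kendall(values):
    q25, q50, q75 = quartiles(values)
    return (q25 - 2 * q50 + q75) / (q75 - q25)


original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = original[:-1] + [99]
skewed = [1, 1, 2, 2, 3, 5, 8, 13, 21]  # hypothetical right-skewed sample

iqr_original = interquartile_range(original)
iqr_outlier = interquartile_range(with_outlier)
yk_original = yule_kendall(original)
yk_outlier = yule_kendall(with_outlier)
yk_skewed = yule_kendall(skewed)

print(iqr_original, iqr_outlier)          # identical: the IQR is resistant
print(yk_original, yk_outlier, yk_skewed)  # zero, zero, positive
```

Under this convention the outlier changes neither the IQR nor the index, while the right-skewed sample gives a positive value.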