The 2x2 Problem

References:

Doswell, C. A. III, R. Davies-Jones, and D. L. Keller, 1990: On summary measures of skill in rare event forecasting based on contingency tables. Wea. Forecasting, 5, 576-586.

Marzban, C., 1998: Scalar measures of performance in rare-event situations. Wea. Forecasting, 13, 753-763.

Murphy, A. H., 1996: The Finley affair: a signal event in forecast verification. Wea. Forecasting, 11, 3-20.

Roebber, P. J., 2009: Visualizing multiple measures of forecast quality. Wea. Forecasting, 24, 601-608.


Framework

The 2x2 verification problem is the most studied and most widely known problem in the discipline. Partly as a result, notation is not consistent from paper to paper, and scores based on the table have been rediscovered repeatedly over the years. As a general framework, the problem can be written in several equivalent ways. In terms of the joint distribution, it is:

                             Events
                   Event 1       Event 2       Sums
Forecasts
    Forecast 1     p(1,1)        p(1,2)        p(f=1)
    Forecast 2     p(2,1)        p(2,2)        p(f=2)
    Sums           p(x=1)        p(x=2)        1
where Forecast 1 and Forecast 2 are forecasts of the two event classes (frequently yes/no). The table can equally be written in terms of the raw counts of forecasts and observations (n(1,1), etc.). Following Murphy (1996), we will do that, and for simplicity we will write n(1,1) as n11 and the marginal sums in a dot notation: n(f=1) as n1., n(x=1) as n.1, and the total number of forecasts as n.. .
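
To make the bookkeeping concrete, here is a minimal Python sketch of the count form of the table, using illustrative (made-up) counts for a rare event; the variable names n11, n1_, n_1, n__ simply mirror the n11, n1., n.1, n.. notation above.

# Illustrative counts for the 2x2 table, with x=1 the event of interest:
# n11 = hits, n12 = false alarms, n21 = misses, n22 = correct negatives.
n11, n12 = 28, 72        # Forecast 1 ("yes") row
n21, n22 = 23, 2680      # Forecast 2 ("no") row

# Marginal and total sums in the dot notation.
n1_ = n11 + n12          # n1. : number of "yes" forecasts
n2_ = n21 + n22          # n2. : number of "no" forecasts
n_1 = n11 + n21          # n.1 : number of observed events
n_2 = n12 + n22          # n.2 : number of observed non-events
n__ = n1_ + n2_          # n.. : total number of forecasts

# The joint-distribution form p(i,j) is recovered by dividing the counts by n..
p11, p12, p21, p22 = (n / n__ for n in (n11, n12, n21, n22))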


Basic scores

A variety of scores can be derived from this table; the references above describe many of them. We'll concentrate on just a small number, assuming that x=1 is the primary event being forecast. Brackets give the worst and best possible values of each score, and a short code sketch after these definitions computes each one.

 

The Hit Rate or Fraction Correct is given by [0,1]

H = (n11 + n22)/n..

The probability of detection (POD) is given by [0,1]

POD = n11/n.1

The false alarm ratio (FAR) is given by [1,0]

FAR = n12/n1.

Note that a related quantity, usually called the false alarm rate (or probability of false detection, POFD), is given by n12/n.2. Confusion arises because the name false alarm rate is sometimes also used for the score defined above as FAR. The false alarm rate is usually used in the context of relative operating characteristic (ROC) curves.

Bias can be defined as [0 or inf., 1]

Bias = n1./n.1

Values of the bias below 1 indicate the event is forecast less often than it is observed (underforecasting); values above 1 indicate overforecasting.

A score introduced by Gilbert in 1884 and later called the Threat Score or the Critical Success Index is given by [0,1]

CSI = n11/(n.. - n22) = n11/(n11 + n12 + n21)
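
Continuing the same sketch (same illustrative counts and variable names as above), the basic scores are direct translations of the formulas; POFD is included to illustrate the false alarm rate / false alarm ratio distinction.

# Basic scores computed from the counts defined in the sketch above.
H    = (n11 + n22) / n__     # hit rate / fraction correct
POD  = n11 / n_1             # probability of detection
FAR  = n12 / n1_             # false alarm ratio
POFD = n12 / n_2             # false alarm rate (probability of false detection)
bias = n1_ / n_1             # frequency bias
CSI  = n11 / (n__ - n22)     # threat score / critical success index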


Skill scores

 

Scores that attempt to account for forecasts that are correct purely by random chance:

Equitable Threat Score [-1/3,1]

ETS = (n11 - C)/(n.. - n22 - C)

where C = (n1. n.1)/n.. is the number of hits expected from random forecasts.

Heidke skill score (originally from Doolittle) [-1,1]

HSS = 2(n11 n22 - n21 n12)/[n.1 n2. + n1. n.2]

This is a measure of correct forecasts with the randomly correct forecasts subtracted out. The reference forecast is random forecasting, subject to the constraint that the marginal distribution of the random forecasts is the same as the marginal distribution of the actual forecasts.

Hanssen-Kuipers skill score (originally from Peirce; also known as the true skill statistic) [-1,1]

KSS = (n11 n22 - n21 n12)/[n.1 n.2]

Similar to the Heidke score, except that the random reference forecasts are constrained to be unbiased; that is, their marginal distribution is the same as the marginal distribution of the observations.
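
Continuing the sketch once more, the three skill scores follow directly from the same counts; C is the number of hits expected by chance, as defined above.

# Skill scores computed from the counts in the sketch above.
C   = n1_ * n_1 / n__                                         # hits expected by chance
ETS = (n11 - C) / (n__ - n22 - C)                             # equitable threat score
HSS = 2 * (n11 * n22 - n21 * n12) / (n_1 * n2_ + n1_ * n_2)   # Heidke skill score
KSS = (n11 * n22 - n21 * n12) / (n_1 * n_2)                   # Hanssen-Kuipers / Peirce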

The latter two scores can also be written as

HSS = (H - Hran)/(1 - Hran)

KSS = (H - Hran)/(1 - Hu,ran)

where H is the Hit Rate defined above, Hran = (n1. n.1 + n2. n.2)/(n..)^2, and Hu,ran = [(n.1)^2 + (n.2)^2]/(n..)^2.
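
As a quick consistency check on the algebra, the hit-rate forms can be compared against the HSS and KSS values computed directly in the sketch above (same illustrative counts and variable names).

# Alternative forms of HSS and KSS in terms of the hit rate H defined earlier.
H_ran  = (n1_ * n_1 + n2_ * n_2) / n__**2   # expected hit rate of random forecasts
                                            # with the actual forecast marginals
H_uran = (n_1**2 + n_2**2) / n__**2         # expected hit rate of unbiased random forecasts
assert abs(HSS - (H - H_ran) / (1 - H_ran)) < 1e-12
assert abs(KSS - (H - H_ran) / (1 - H_uran)) < 1e-12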