Testing homogeneity of time series: a rank problem
The example of the NH sea surface and air temperatures
Raymond Sneyers
Royal Meteorological Institute of Belgium
Summary
Homogeneity of observations, extended here to the case of different meteorological
variables involved in the same air mass, is shown to be ensured by the randomness of
rank differences. Moreover, since climate evolution results mainly from changes in the
frequency of weather types, climate changes occur abruptly, and the appropriate
methodology for detecting them is recalled. To illustrate the methodology, the example
of the joint distribution of the NH sea surface and land air temperatures is considered.
Since the humidification of air masses over the sea and their instability depend on
both the sea surface and air temperatures, the time-series analysis has been applied
to their difference.
The results offer an explanation for the growing importance of floods, avalanches
and storms in recent years.
1. The homogeneity of simultaneous observations.
Let x and y be observations of two variables involved in the same air mass. When
measuring one variable at the same point and moment of an air mass, homogeneity
involves the equality
x = y or x - y = 0.
(1)
For different variables x and y measured simultaneously at fixed points in this
air mass, in case of a given situation with probabilities given by the distribution
functions F(x) and G(y), homogeneity leads to the relation
F(x) = G(y).
(2)
Having (Gumbel 1958)
E[F(x)] = X/(n+1) and E[G(y)] = Y/(n+1),
(3)
where n is the size of the set of observations and where X for x and Y for y are
the ranks of the observations when arranged in increasing order in their own series,
with (2), for homogeneous observations, we have then
X = Y or X - Y = 0.
(4)
For measurements made at the same point in the air mass, equations (1)
and (4) are equivalent, but for different variables, equality (4) expresses a definite
situation and thus a definite probability value.
Finally, since measurements involve errors of observation, instead of (1) and (4) we have
x - y = d and X - Y = D,
(5)
and the condition of homogeneity may be replaced by the assumption of randomness for the
differences d and D.
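As an illustration of relations (3) to (5), the following minimal sketch computes the ranks X and Y of two short series and their rank differences D. The helper name `ranks` and the sample values are hypothetical and are not taken from the analysed data; ties are assumed to be absent.

```python
def ranks(values):
    """Rank of each value (1 = smallest) within its own series; assumes no ties."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result

# Hypothetical simultaneous observations of two different variables.
x = [0.12, -0.30, 0.05, 0.41, -0.08]
y = [1.4, 0.2, 0.9, 2.9, 1.1]

X, Y = ranks(x), ranks(y)
D = [a - b for a, b in zip(X, Y)]            # rank differences D of relation (5)
n = len(x)
positions = [r / (n + 1) for r in X]         # X/(n+1), the expectation in relation (3)
print(X, Y, D)
```

Under perfect homogeneity all elements of D would be 0; with observation errors, the question becomes whether D behaves as a random series.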
2. Statistical characterization of time-series
2.1 The rank randomness for a time series. Statistical properties
For a given series of measurements x_i, with i = 1, 2, ..., n, randomness is
defined by the two conditions of identity and independence for the distribution of
its elements (Sneyers 1975). The alternative assumptions to be set against this null
hypothesis are therefore instability of the distribution and serial correlation between
consecutive elements.
Under the assumption of randomness, the joint distribution F_n of the n elements
x_i of the series satisfies the relation (Sneyers and Alvarez 2000)
F_n(x_1, x_2, ..., x_n) = F(x_1) F(x_2) ... F(x_n),
(6)
so that all the permutations have the same probability. In particular, having for the
set of couples (x_i, x_j)
F_{i,j}(x_i, x_j) = F(x_i) F(x_j),
we have also for the ranks
Prob (X_i < X_j) = Prob (X_i > X_j).
(7)
It follows that, from the n! permutations of the original series of ranks, the
following relations may be derived (see Annex):
var X_i = (n^2 - 1)/12,
E(X_i - X_j) = 0,
E[cov (X_i, X_j)] = 0 and
var (X_i - X_j) = n(n+1)/6.
(8)
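The variance and mean relations in (8) can be checked by brute force for small n. The sketch below enumerates all n! permutations with exact fractions; the function name `rank_moments` is hypothetical, and the check covers only var X_i, E(X_i - X_j) and var (X_i - X_j) for two fixed, distinct positions.

```python
from fractions import Fraction
from itertools import permutations

def rank_moments(n, i=0, j=1):
    """Exact var X_i, E(X_i - X_j), var (X_i - X_j) over all n! permutations."""
    perms = list(permutations(range(1, n + 1)))
    count = len(perms)
    mean_x = Fraction(sum(p[i] for p in perms), count)
    var_x = Fraction(sum(p[i] ** 2 for p in perms), count) - mean_x ** 2
    mean_d = Fraction(sum(p[i] - p[j] for p in perms), count)
    var_d = Fraction(sum((p[i] - p[j]) ** 2 for p in perms), count) - mean_d ** 2
    return var_x, mean_d, var_d

for n in range(2, 7):
    var_x, mean_d, var_d = rank_moments(n)
    # Compare the enumerated values with the closed forms of relations (8).
    print(n, var_x == Fraction(n * n - 1, 12), mean_d == 0,
          var_d == Fraction(n * (n + 1), 6))
```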
For the correlation coefficient r(X_i, X_j) between n couples of ranks X_i and X_j,
with i, j = 1, 2, ..., n and X_i ≠ X_j, the values of its mean and variance are
(Kendall and Stuart 1967)
E(r) = 0 and var r = 1/(n - 1).
(9)
2.2 Testing randomness against its alternatives
Relations (8) refer to a random population of C_n^2 = n(n-1)/2 possible pairs.
When the randomness of a series of n elements is questionable, these relations allow
its random or non-random character to be verified efficiently by testing, for this
series, the assumption of randomness against appropriate alternatives.
In particular, trend, the alternative to stability, is to be expected in a given
permutation if we have for the mean E
E(X_i - X_j) > 0 or < 0 for j < i,
according to whether the trend is increasing or decreasing. Similarly, persistence or
alternation, the alternatives to independence, is to be considered in a permutation
according to whether the mean E of the serial covariance, or the serial correlation
statistic r,
E[cov (X_i, X_{i+1})] or r = r(X_i, X_{i+1}),
with i = 1, 2, ..., n and the index (n+1) taken as 1, is > 0 or < 0.
It follows that the appropriate test of randomness against trend is the one
based on the Mann statistic t (Mann 1945), defined as
t = number of inequalities X_i > X_j with j < i,
and, against serial correlation, the one based on the serial correlation statistic r.
Their respective means and variances are
E(t) = n(n - 1)/4,  var t = n(n - 1)(2n + 5)/72,
E(r) = 0,  var r = 1/(n - 1),
(10)
relations which may be derived from the relations (8) (see Annex).
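The two test statistics and their standardized values under relations (10) can be sketched as follows. The function names are hypothetical, ties are assumed to be absent, and the serial correlation is computed circularly on the ranks, with the element following the last one taken to be the first.

```python
import math

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result

def mann_trend(values):
    """Return (t, u(t)): t counts pairs with a later value exceeding an earlier one."""
    n = len(values)
    t = sum(1 for i in range(n) for j in range(i) if values[i] > values[j])
    mean = n * (n - 1) / 4
    var = n * (n - 1) * (2 * n + 5) / 72
    return t, (t - mean) / math.sqrt(var)

def serial_correlation(values):
    """Return (r, u(r)) for the circular lag-1 rank correlation, with var r = 1/(n-1)."""
    X = ranks(values)
    n = len(X)
    mean = (n + 1) / 2
    var = (n * n - 1) / 12
    cov = sum((X[i] - mean) * (X[(i + 1) % n] - mean) for i in range(n)) / n
    r = cov / var
    return r, r * math.sqrt(n - 1)

series = [1, 10, 2, 9, 3, 8, 4, 7, 5, 6]   # hypothetical alternating series
print(mann_trend(series), serial_correlation(series))
```

A strongly positive u(t) points to an increasing trend, a positive u(r) to persistence and a negative u(r) to alternation.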
As instability may be masked by internal trends that compensate one another, such
inhomogeneities are determined by means of a progressive trend analysis, for which
advantage may be taken of the recurrence
t_{i+1} = t_i + n_{i+1},
(11)
where t_i is the trend statistic for the series X_j, with j = 1, 2, ..., i, and
n_{i+1} is the number of inequalities X_j < X_{i+1} with j < i+1.
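Recurrence (11) lends itself to a single pass over the series. The following sketch, with the hypothetical name `progressive_trend`, returns t_i and its standardized value u(t_i) for every i, using the mean and variance of relations (10) with n replaced by i; for i = 1 the variance is zero and u is set to 0 by convention, which is an assumption of this example.

```python
import math

def progressive_trend(values):
    """Return [(t_i, u(t_i))] for i = 1..n using t_(i+1) = t_i + n_(i+1)."""
    out = []
    t = 0
    for i, current in enumerate(values, start=1):
        # n_i: number of earlier elements smaller than the current one
        t += sum(1 for earlier in values[: i - 1] if earlier < current)
        var = i * (i - 1) * (2 * i + 5) / 72
        u = (t - i * (i - 1) / 4) / math.sqrt(var) if var > 0 else 0.0
        out.append((t, u))
    return out

print(progressive_trend([3, 1, 4, 2, 5]))   # hypothetical rank series
```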
For the determination of the probabilities corresponding to the values of the test
statistics, it should be noted that the normal approximation to their distribution
function is acceptable for n > 10. However, since these distributions are discrete,
calculations show, as expected, that the correction for continuity (Sneyers 1975)
already gives good results when n > 4.
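The effect of the correction can be examined by comparing the exact distribution of t with its normal approximation. In the sketch below, the exact counts are built from the fact that each n_i takes the values 0, 1, ..., i-1 equally often; the correction used is the usual half-unit shift, which is an assumption here and may differ in detail from the form given in Sneyers (1975). All function names are hypothetical.

```python
import math

def exact_counts(n):
    """counts[k] = number of permutations of size n with trend statistic t = k."""
    counts = [1]
    for i in range(2, n + 1):
        new = [0] * (len(counts) + i - 1)
        for k, c in enumerate(counts):
            for step in range(i):          # n_i takes the values 0..i-1 equally often
                new[k + step] += c
        counts = new
    return counts

def exact_cdf(n, k):
    """Exact Prob(t <= k) for a random series of size n."""
    return sum(exact_counts(n)[: k + 1]) / math.factorial(n)

def normal_cdf(n, k, continuity=True):
    """Normal approximation to Prob(t <= k), optionally with a half-unit shift."""
    mean = n * (n - 1) / 4
    sd = math.sqrt(n * (n - 1) * (2 * n + 5) / 72)
    z = (k + (0.5 if continuity else 0.0) - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(exact_cdf(5, 2), normal_cdf(5, 2), normal_cdf(5, 2, continuity=False))
```

For n = 5 and t <= 2, the corrected approximation is noticeably closer to the exact probability than the uncorrected one.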
Finally, for very small internal sequences the trend statistic loses its
efficiency, and for short groupings of high or low values, estimating probabilities
through combinatorial analysis remains the last way of reaching full efficiency
in the time-series analysis.
3. The case of the change-point search. Methodology
If, at the local scale, the variation of meteorological variables depends on the
neighbouring orography, at a larger scale the general circulations of the atmosphere
and of the ocean are the main factors acting on climate evolution. Moreover,
change-points, detected for the first time in seasonal averages of air temperature
series (Sneyers 1958), may be explained by situations of indecision resulting from
the non-linear character of the differential equations governing both ocean and
atmosphere circulations. In addition, having been found to be the single non-random
component in time series of annual and seasonal averages, they seem to play the same
role in series of averages at all time-scale lengths (Sneyers 1999).
For an exhaustive detection of the existing change-points and a complete statistical
characterization of the sequences separated by these change-points, the procedure
involves three steps:
(a) testing trend and serial correlation;
(b) change-point search by a progressive trend analysis, selection of sequences
with homogeneous means and testing randomness of the selected groups;
(c) derivation of the final climate evolution during the considered period after
having tested the distribution homogeneity of the final random samples, using
parametric or distribution-free procedures according to whether normality is accepted
or not.
3.1 The change-point search
Though the trend test gives the same result for the original series and for the
series transformed into ranks, the rank approach is to be preferred because of its
direct relation with the underlying probability distribution, whatever this
distribution is, and because it makes groupings of large or small values easier to
detect.
The standardized values u(t_i) of the trend statistic t_i for the series x_j, with
j = 1 to i, are computed successively forwards, and the values u(t'_i) of the
statistic t'_i for the series x_j, with j = i to n, backwards. Noting that
t_1 = t'_n = 0 and t_n = t'_1, the first detection may be made by separating, from
the beginning or from the end of the series, the sequence for which u(t_j), for
j = 1 to i, or u(t'_j), for j = (i+1) to n, remains very near to 0 before a
systematic increase or decrease.
After the separation of these first stable sequences, the same operation is
performed on the remaining part of the series until only a stable sequence remains.
The change-point detection is then finalized by ensuring that, for contiguous
sequences, the analysis of the joined series leads to the change-point i for which
the standardized test statistics u(t_i) and u(t'_{i+1}) are
simultaneously closest to 0, which means that the test statistic v(t) (Sneyers 1995)
v(t) = u^2(t_i) + u^2(t'_{i+1})
(12)
is nearest to 0.
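A minimal sketch of this final step follows: for two joined contiguous sequences, every candidate position i is tried and the one minimising v(t) of relation (12) is retained. The function names and the sample series are hypothetical; sub-series of size 1 are given u = 0 by convention, so in practice the end positions deserve a separate look.

```python
import math

def standardized_trend(segment):
    """u(t) for one segment, using the mean and variance of the Mann statistic."""
    m = len(segment)
    if m < 2:
        return 0.0
    t = sum(1 for i in range(m) for j in range(i) if segment[i] > segment[j])
    var = m * (m - 1) * (2 * m + 5) / 72
    return (t - m * (m - 1) / 4) / math.sqrt(var)

def best_split(values):
    """Return (i, v): i is the size of the first sequence minimising v of relation (12)."""
    best = None
    for i in range(1, len(values)):
        forward = standardized_trend(values[:i])     # u(t_i), series 1..i
        backward = standardized_trend(values[i:])    # u(t'_(i+1)), series i+1..n
        v = forward ** 2 + backward ** 2
        if best is None or v < best[1]:
            best = (i, v)
    return best

joined = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]    # hypothetical joined sequences
print(best_split(joined))
```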
To ensure that the change-point detection is exhaustive, small groups of high or
low values should be kept as homogeneous sequences, even for sizes as small as 2,
since combinatorial analysis shows that such small groups may already be found
significantly inhomogeneous with neighbouring sequences. In the case of high
instability, the selection procedure may be expected to be simplified by beginning
with this last selection operation.
3.2 Selection of random groups of homogeneous sequences.
For this purpose, the groups are re-arranged in increasing order of their rank mean,
and the groups homogeneous in the mean are selected by a new progressive trend
analysis of the re-arranged series. Randomness is then tested by testing the
independence of the elements of each group and the stability of their dispersion
(absolute deviation from the mean).
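The re-arrangement step can be sketched as follows, assuming the change-point search has already produced the sizes of the consecutive sequences. The function names and the sample ranks are hypothetical.

```python
def order_groups_by_rank_mean(rank_series, sizes):
    """Split rank_series into consecutive groups of the given sizes, sorted by mean rank."""
    groups, start = [], 0
    for size in sizes:
        groups.append(rank_series[start:start + size])
        start += size
    return sorted(groups, key=lambda g: sum(g) / len(g))

def absolute_deviations(group):
    """Absolute deviations from the group mean, used to examine dispersion stability."""
    mean = sum(group) / len(group)
    return [abs(v - mean) for v in group]

ordered = order_groups_by_rank_mean([5, 6, 1, 2, 3, 8, 7, 4], [2, 3, 3])
rearranged = [v for g in ordered for v in g]   # series submitted to the new trend analysis
print(ordered, rearranged)
```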
3.3 Final time-series characterization
Returning to the original data, sample tests are used to complete the selection
of homogeneous sets of groups, using parametric tests or distribution-free ones
according to whether a specified distribution has been found acceptable or not.
In this case, homogeneity of variances and of means are tested successively. These
results then give a precise picture of the climate variation of the meteorological
variable concerned.
To obtain results that are as accurate as possible, one point has to be emphasised
here. While a small number of tied ranks has only a negligible influence on the
reliability of the results, ties may be avoided altogether when computing averages
at the seasonal or annual scale by carrying the calculation to a sufficient number
of significant digits. This is generally achieved when rounding errors become
negligible compared to the standard deviation of the analysed series.
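The influence of rounding on ties is easily illustrated. In the sketch below, the helper `count_ties` and the averages are hypothetical; the same values show ties at two decimals and none at three.

```python
def count_ties(values, digits):
    """Number of values that duplicate another one after rounding to `digits` decimals."""
    rounded = [round(v, digits) for v in values]
    return len(rounded) - len(set(rounded))

averages = [0.1234, 0.1249, 0.3312, 0.3349, 0.52]   # hypothetical annual averages
print(count_ties(averages, 2), count_ties(averages, 3))
```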
Example: The joint distribution of the annual averages of the NH sea surface
and land air temperatures
Evaporation, which occurs at the water surface, depends essentially on the difference
between the temperature of the water surface and that of the surrounding air.
It follows that this phenomenon may be of vital importance for the humidification
of the air masses circulating over the oceans and is expected to play a major role
in weather and climate evolution. The availability of the P.D. Jones (1994) series
of the NH sea surface and land air temperature averages offers the best opportunity
to verify this meteorological feature.
Extending from 1856 to 1995, the analysed temperature series are average departures
from the normal values for the period 1961-1990, which limits the possible source of
variation in the averages to climate evolution.
(a) Testing trend and serial correlation for the time-series
Limiting the search to annual averages, the first step has been testing
randomness and estimating distribution and correlation parameters (Table 1).
Table 1:
Annual averages x of the NH sea surface temperature and y of the land air temperature.
Corresponding ranks X, Y. Tests of randomness, estimation of distribution
parameters and of correlation coefficients.
Standardized trend statistic u(t) for the complete series; extremes ux(t) and ux(t')
derived from the progressive trend analysis; standardized serial correlation coefficient
u(r); for the complete series, mean m, standard deviation s and correlation coefficients
r(x,y) and r(X,Y) for original and rank values.
The first observation suggested by the data in Table 1 is the high significance of all
the test statistic values for the series of original values, whereas they are less
significant or not significant for the differences, a remark which is especially true
for the values of the trend statistic u(t). For the extremes ux(t) and ux(t') and the
serial correlation test statistic u(r), however, a strong difference appears between
the corresponding values for x and y. For the distribution statistics, the standard
deviations s are inversely proportional to the serial correlation statistic u(r),
while the correlation statistic values r(x,y) and r(X,Y) are practically identical.
The common reason for these results is the similarity of the internal evolution of
the two time-series and, conversely, the important difference between the standard
deviations s.
Actually, if we put u = x/s_x and v = y/s_y, the inverse proportionality gives
approximately the equality
which extends to the consecutive elements, a property which is independent of the
differences in standard deviation and confirms the degree of similarity of the
chronological evolution of the values of x and y in the time-series.
(b) The change-point search in the series of land air and sea surface temperature
differences
Turning to the temperature differences, this search leads to the determination of 9
sets of groups homogeneous in the mean, with sizes of 1 to 8 elements (Table 2).
Table 2:
Differences (y - x) and (Y - X). Sizes of sequences homogeneous in the mean
separated by change-points.
Rank numbers ng and n'g of the mean for the first and the final random groups; final
mean mg and standard deviation sg for the final groups.
For the nine sequences homogeneous in the mean, the serial correlation statistic
values make independence acceptable, though this time the predominance of negative
values suggests a prevailing alternation. Moreover, the normal distribution is found
to be acceptable except in two cases, owing to the presence of ties, which reduce the
effective size of the sequence. When homogeneity is tested with parametric tests,
identity of variance is accepted, while the number of significantly different means
is reduced to six (Table 2).
Table 3:
Differences (y - x). Series of groups n'g with alternating means ending at indicated year
(c) Final characterization of the chronological evolution of the differences between
sea surface and land air temperatures
Replacing the elements of the original series by the rank number of the corresponding
homogeneous group, a new selection has been made to bring out the alternation through
which the weather evolution may be characterized.
Note that among the six rank numbers of the homogeneous groups, the low ranks belong
to the association of a cold sea surface with warm land air and, conversely, the high
ranks to that of a warm sea surface with cold land air. It thus appears (Table 3)
that up to 1976, alternation occurred generally between extreme or low differences,
while after that year, and increasingly so, the alternation is restricted to the
highest ones.
In addition, since this occurs in sequences with the highest values of both sea
surface and land air temperatures, and given the stability of the variance of the
temperature differences and the inequality of the standard deviations of the two
temperature series, an increase is to be expected in water evaporation and thus in
the water vapour content of the air masses during recent years.
In such situations, an increase in the instability of the air masses is an immediate
consequence. This means drizzle in relatively high pressure situations, while in
low pressure ones an increased probability of serious damage due to rainfall,
snowfall or gusts has to be considered. Moreover, whether in winter or in summer,
increased air humidity remains an unfavourable factor for human health.
All the damage described has actually occurred in recent years, if not everywhere
each time, then at least in a worsening manner.
In conclusion, its relation with an evidently persistent large-scale meteorological
situation makes it imperative to take protective measures to keep such damage
to a minimum.
From the methodological point of view, it should be underlined that the straightforward
answer given to the problem considered is due to the rigorous way in which the random
component has been determined, the distribution-free property of randomness being
verified simply by means of distribution-free tests (Sneyers 1999).
References
Gumbel, E.J., 1958. Statistics of Extremes. Columbia University Press, N. Y., 375p.
Jones, P.D., 1994. Hemispheric surface air temperature variations: A reanalysis and
an update to 1993. Journal of Climate, 7, 1794-1802.
Kendall, M.G. and A. Stuart, 1967. The Advanced Theory of Statistics, Vol. 2,
Griffin, London, 690p.
Mann, H.B., 1945. Nonparametric tests against trend. Econometrica, 13, 245-259.
Sneyers, R., 1958. Connexions Thermiques entre Saisons Consécutives à Bruxelles-Uccle.
Institut R. Météor. de Belgique, Pub. B, No 23, 24p.
Sneyers, R., 1975. Sur l'Analyse Statistique des Séries d'Observations, O.M.M.,
Note Technique No 143, Genève, Suisse. Spanish version, 1975, English version,
1990, 200p.
Sneyers, R., 1995. Climate Instability Determination. Discussion of Method and
Examples. 6th International Meeting on Statistical Climatology, Galway, Ireland.
Proceedings, 547-550 (corrected version).
Sneyers, R., 1999. The search for randomness in time series. Efficiency of the
methodology derived from its mathematical definition. Lecture given at the First
Congress on Climatology at the University of Barcelona, Spain, on December 4
(to be published).
Sneyers, R. and L. Alvarez, 2000. The change-point instability of climatological
time series as alternative to randomness. The example of annual temperature
averages 1908-1995 at Casablanca (Cuba). Bulletin of the Cuban Meteorological
Society, 6(1): electronic publication (revised).
(http://www.met.inf.cu/sometcub/boletin/v06_n01/english/paper_61.htm).
ANNEX
1. Statistical properties of random rank series
If the elements x_i of a series of size n are replaced by their ranks
X_i, this new series is a permutation of the series of whole numbers
1, 2, ..., n, and testing randomness comes down to applying the
mathematical properties of these numbers.
Having for i = 1, 2,.. , n the summations
and means E
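The summations and means referred to here are presumably the standard results for the whole numbers 1 to n. As an assumption consistent with relations (8) and with the derivation of var t_n below, they would read:

```latex
\sum_{i=1}^{n} i = \frac{n(n+1)}{2}, \qquad
\sum_{i=1}^{n} i^{2} = \frac{n(n+1)(2n+1)}{6},

E(X_i) = \frac{n+1}{2}, \qquad
E(X_i^{2}) = \frac{(n+1)(2n+1)}{6},

\operatorname{var} X_i = E(X_i^{2}) - [E(X_i)]^{2}
  = \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^{2}}{4}
  = \frac{n^{2}-1}{12}.
```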
In the test statistic t_n = \sum_{i=1}^{n} n_i,
n_i is, for X_i, the number of inequalities X_j < X_i with j < i.
For each n_i, the equally probable values being 0, 1, ..., (i - 1), we have
var n_i = (i^2 - 1)/12.
Moreover, the values n_i being independent, we have
var t_n = \sum_{i=1}^{n} var n_i = {[n(n + 1)(2n + 1)/6] - n}/12
= n[(n + 1)(2n + 1) - 6]/72
= n(2n^2 + 3n - 5)/72
= n(n - 1)(2n + 5)/72.
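The closed forms for the mean and variance of the trend statistic can be confirmed by exhaustive enumeration for small n. The sketch below uses exact fractions; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def trend_statistic(perm):
    """Number of pairs (j < i) for which the later rank exceeds the earlier one."""
    return sum(1 for i in range(len(perm)) for j in range(i) if perm[i] > perm[j])

def exact_moments(n):
    """Exact mean and variance of t over all n! equally probable permutations."""
    values = [trend_statistic(p) for p in permutations(range(1, n + 1))]
    mean = Fraction(sum(values), len(values))
    var = Fraction(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, var

for n in range(2, 7):
    mean, var = exact_moments(n)
    # Compare with n(n-1)/4 and n(n-1)(2n+5)/72.
    print(n, mean == Fraction(n * (n - 1), 4),
          var == Fraction(n * (n - 1) * (2 * n + 5), 72))
```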
2.2 The serial correlation test
For the series X_i, with X_i, X_j where i, j = 1, 2, ..., n and
X_i ≠ X_j, we have
cov (X_i, X_j) = [\sum_{1}^{n} (2X_i^2 - (X_i - X_j)^2)]/2n
= var X_i - [\sum_{1}^{n} (X_i - X_j)^2]/2n.
Moreover, cov (X_i, X_j) = 0, implying var X_i = [var (X_i - X_j)]/2,
var cov (X_i, X_j) = \sum_{1}^{n} [var (X_i - X_j)^2]/2n.
With var (X_i - X_j)^2 =
var [X_i(X_i - X_j) - X_j(X_i - X_j)]
we have finally
var cov (X_i, X_j) = [var X_i . var (X_i - X_j)]/2n.
It follows that for the serial correlation
r = r(X_i, X_j) = [cov (X_i, X_j)]/var X_i,
we have
E(r) = 0 and var r = [var cov (X_i, X_j)]/[var X_i]^2.
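As a check, substituting the expression obtained for var cov (Xi, Xj) and the values of var Xi and var (Xi - Xj) from relations (8) into the last formula recovers the variance stated in relation (9):

```latex
\operatorname{var} r
  = \frac{\operatorname{var} X_i \,\operatorname{var}(X_i - X_j)}{2n\,[\operatorname{var} X_i]^{2}}
  = \frac{\operatorname{var}(X_i - X_j)}{2n\,\operatorname{var} X_i}
  = \frac{n(n+1)/6}{2n\,(n^{2}-1)/12}
  = \frac{n+1}{n^{2}-1}
  = \frac{1}{n-1}.
```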