Multiple Analysis of Series for Homogenization (MASH). Seasonal Application of MASH (SAM), Automatic Using of Meta Data

Tamás Szentimrey

Hungarian Meteorological Service
H1525, P.O. Box 38, Budapest, Hungary
szentimrey@met.hu 
The MASH method was developed in the Hungarian Meteorological Service (see References). It is a relative homogeneity test procedure that
does not assume the reference series are homogeneous. Possible break points and shifts can be detected and adjusted through mutual
comparisons of series within the same climatic area. The candidate series is chosen from the available time series and the remaining series
are considered as reference series. The role of each series changes step by step in the course of the procedure. Depending on the climatic element,
additive or multiplicative models are applied. The second case can be transformed into the first one by taking logarithms.
Several difference series are constructed from the candidate and weighted reference series. The optimal weighting is determined by minimizing the variance of the
difference series, in order to increase the efficiency of the statistical tests. Provided that the candidate series is the only series common to all the difference series,
break points detected in all the difference series can be attributed to the candidate series.
A new procedure for detecting multiple break points has been developed which takes the problems of significance and efficiency into account. Significance and
efficiency are formulated according to the conventional statistics related to type one and type two errors, respectively. This test yields not only estimated
break points and shift values, but the corresponding confidence intervals as well. The series can be adjusted by using the point and interval estimates.
Since a MASH program system has been developed for the PC, the application of this method is now relatively easy, in particular with GAME of MASH
(see program MASHGAME.BAT), which is a playful version of the MASH procedure for homogenization. This version can be developed towards
automatization (see program MASHGAUT.BAT).
The new developments are connected with two special problems of the homogenization of climatic time series.
The first is the relation of monthly, seasonal and annual series. The problem arises from the fact that the signal-to-noise ratio is probably lower for
monthly series than for the derived seasonal or annual ones. Consequently inhomogeneities can be detected more easily in the derived series, although we
intend to adjust the monthly series (see the SAM system).
The second problem is connected with the use of meta data in the course of the homogenization procedure. The developed version of the MASH system makes
it possible to use the meta data information (in particular the probable dates of break points) automatically.

(MOTTO) 
PROBLEM of HOMOGENIZATION 
Basis: DATA 
Tools:
MATHEMATICS: abstract formulation
META DATA: historical, climatological
SOFTWARE: automatization

SOLUTION = MATHEMATICS + META DATA + SOFTWARE 
(i) without SOFTWARE:
MATHEMATICS + META DATA = THEORY WITHOUT BENEFIT 
(ii) without META DATA:
MATHEMATICS + SOFTWARE = GAMBLING 
(iii) without MATHEMATICS:
META DATA + SOFTWARE = 'STONE AGE' + 'BILL GATES' 
BASIC PRINCIPLES OF 'MASH' PROCEDURE 
 Relative homogeneity test procedure.
 Step by step procedure: the role of series (candidate or reference series) changes step by step in the course of the procedure.
 Additive or multiplicative (cumulative) model can be used depending on the climatic element.
 Monthly, seasonal or annual time series can be homogenized.
 When monthly series are available for all 12 months, the monthly, seasonal and annual series can be homogenized together (SAM procedure: Seasonal Application of MASH).
 The daily inhomogeneities can be derived from the monthly ones.
 META DATA (probable dates of break points) can be used automatically. 
Programmed Statistical Procedure (Software: MASHv2.01) 
EXAMPLE. Let us assume that there is a difficult stochastic problem.
When relatively little statistical information is available:
 an intelligent person may be able to solve the problem, but it is time-consuming;
 the solution of the problem cannot be programmed.
When the amount of statistical information increases:
 one is unable to discuss and evaluate all the information,
 but then the solution of the problem can be programmed. (CHESS!!)
AIM, REQUIREMENT
 Development of mathematical methodology in order to increase the amount of statistical information.
 Development of algorithms for optimal use of both the statistical and the 'meta data' information.
THE MAIN CLIMATOLOGICAL AND STATISTICAL PROBLEMS 
Modelling of the stochastic relationship between data series:
additive model, cumulative (multiplicative) model depending on climate elements,
distribution of series elements

Modelling of "inhomogeneity":
break points, shifts, outliers etc.
Comparison of the examined series (Relative Test):
methods for multiple comparison of the candidate series with more reference series,
selection for 'good' reference series systems, weighting of reference series,
estimation of weighting factors.

Missing values: methods for closing gaps in the series. 
Break points detection:
mathematical formalization according to the statistical conventions:
 type one error (significance)
 type two error (efficiency),
point estimation and interval estimation (confidence interval),
procedure for multiple break points and outliers detection.

Correction (adjusting) of candidate series:
separation of the detected break points and outliers for the candidate series,
point estimation, interval estimation (confidence interval) for the shifts. 
Relation of monthly series, seasonal series, annual series:
SAM (Seasonal Application of MASH). 

Meta Data: automatic use of station history.
Automatization: interactive, automatic procedures for homogenization.
MATHEMATICAL BASIS OF 'MASH' PROCEDURE
(draft version) 
1. STATISTICAL MODELLING 
1.1 Additive Model (for example temperature) 
Examined series 
X_{i}(t) = C_{i}(t) + IH_{i}(t) + \varepsilon_{i}(t)
(i = 1,2,...,N; t = 1,2,...,n)
C: climate change; IH: inhomogeneity; \varepsilon: noise
1.2 Multiplicative Model (for example monthly or seasonal precipitation) 
Examined series 
X_{i}^{*}(t) = C_{i}^{*}(t) \cdot IH_{i}^{*}(t) \cdot \varepsilon_{i}^{*}(t)
(i = 1,2,...,N; t = 1,2,...,n)
C^{*}: climate change; IH^{*}: inhomogeneity; \varepsilon^{*}: noise
Logarithmization for Additive Model 
X_{i}(t) = C_{i}(t) + IH_{i}(t) + \varepsilon_{i}(t)
(i = 1,2,...,N; t = 1,2,...,n)
where
X_{i}(t) = ln X_{i}^{*}(t), C_{i}(t) = ln C_{i}^{*}(t),
IH_{i}(t) = ln IH_{i}^{*}(t), \varepsilon_{i}(t) = ln \varepsilon_{i}^{*}(t)
Problem
The logarithm cannot be used directly if the X_{i}^{*}(t) values are near or equal to 0.
This problem can be solved by a Transformation Procedure which slightly increases the small values.
Consequently the Multiplicative Model can be transformed into the Additive one.
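The logarithmic step can be illustrated with a minimal Python sketch. The Transformation Procedure itself is not specified in this text; the additive constant `floor` used below to lift near-zero values is an assumption made only for illustration, and the function name is hypothetical.

```python
import math

def to_additive(values, floor=0.1):
    """Map multiplicative-model data (e.g. precipitation sums) to the additive scale.

    `floor` is a hypothetical small constant added before taking logarithms so
    that zero or near-zero values stay finite; it is NOT the MASH procedure.
    """
    return [math.log(v + floor) for v in values]

# A product of climate signal, inhomogeneity factor and noise factor ...
c_star, ih_star, eps_star = 20.0, 1.1, 0.95
x_star = c_star * ih_star * eps_star

# ... becomes a sum of the corresponding logarithms (the additive model).
additive_sum = math.log(c_star) + math.log(ih_star) + math.log(eps_star)

precip = [0.0, 12.5, 40.0, 3.2]   # illustrative monthly totals, including a zero
logged = to_additive(precip)
```

A multiplicative inhomogeneity factor such as 1.1 therefore appears as an additive shift ln 1.1 in the transformed series, which is what the additive detection procedure looks for.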
2. MULTIPLE COMPARISON OF THE EXAMINED SERIES 
Candidate series and its inhomogeneity:
X_{c}(t), IH_{c}(t)
c \in {1,2,...,N}
Set of indexes of reference series: R_{c} \subset {1,2,...,N}
( i
R_{c}; , if C_{i}(t)
C_{c}(t)) 
Optimal Difference Series belonging to the subset R_{c}^{(m)} \subseteq R_{c} (m = 1,...,2^{|R_{c}|} - 1)
(|·|: numerosity)

Result:

Example:

Optimal Difference Series System:

(i) Z_{c}^{(m)}(t) : Optimal Difference Series belonging to subset R _{c}^{(m)}
( for efficiency)
(for identification of inhomogeneity of candidate series)
( for efficiency) 
(iv) If (i), (ii), (iii) are fulfilled then let M^{*} be minimal too! (for efficiency)
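
The construction of difference series from weighted reference series can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the weights are taken to sum to one (so that a common climate signal cancels), the covariances are estimated from the sample, and the function names `optimal_weights` and `difference_series_system` are hypothetical. It is not the MASH implementation.

```python
from itertools import combinations

import numpy as np

def optimal_weights(candidate, references):
    """Weights minimizing Var(candidate - references @ w) subject to sum(w) == 1."""
    cand = np.asarray(candidate, dtype=float)        # shape (n,)
    refs = np.asarray(references, dtype=float)       # shape (n, k)
    k = refs.shape[1]
    full = np.cov(np.column_stack([cand, refs]), rowvar=False)
    cov_rr = full[1:, 1:]                            # covariance among references
    cov_cr = full[0, 1:]                             # candidate-reference covariances
    ones = np.ones(k)
    a = np.linalg.solve(cov_rr, cov_cr)
    b = np.linalg.solve(cov_rr, ones)
    lam = (1.0 - ones @ a) / (ones @ b)              # Lagrange multiplier for sum(w) == 1
    return a + lam * b

def difference_series_system(candidate, references):
    """One difference series per non-empty subset of reference columns."""
    cand = np.asarray(candidate, dtype=float)
    refs = np.asarray(references, dtype=float)
    system = {}
    for size in range(1, refs.shape[1] + 1):
        for subset in combinations(range(refs.shape[1]), size):
            cols = refs[:, list(subset)]
            system[subset] = cand - cols @ optimal_weights(cand, cols)
    return system

# Synthetic example: three reference series sharing the candidate's climate signal.
rng = np.random.default_rng(0)
climate = rng.normal(size=100)
refs = climate[:, None] + rng.normal(scale=[0.3, 0.5, 0.8], size=(100, 3))
cand = climate + rng.normal(scale=0.4, size=100)

w = optimal_weights(cand, refs)
system = difference_series_system(cand, refs)        # 2**3 - 1 difference series
```

With three reference series the system contains 2^3 - 1 = 7 difference series, one per non-empty subset, matching the index range of m given above.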

3. EXAMINATION OF DIFFERENCE SERIES 
3.1 Break Points Detection

BASIC POSTULATES FOR THE DECISION METHODS ( FORMALIZATION )
The detected break points: 
(i) Type one error (significance)
There exists such a
: 

homogeneous
We must aim to specify the probability of a type one error, i.e. the significance level!
(ii) Type two error (efficiency)
There exists a real break point that we could not detect. This should be avoided as far as possible!
3.2 Significant Procedure for Break Points Detection 
Inhomogeneity measure for all the intervals

Test Statistic of difference series
The inhomogeneity of difference series can be characterized by the
Test Statistic: TS = INH([k,l]) 
The critical value (by Monte Carlo method):
P( TS > critical value | Z(t) is homogeneous ) = sig. level (= 0.1, 0.05, 0.01)
The Test Statistic is compared with the critical value; in case of homogeneity it should be smaller, at the given significance level.
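
How a critical value can be obtained by the Monte Carlo method is sketched below in Python. The inhomogeneity measure INH is not reproduced in this text, so `shift_statistic` is a hypothetical stand-in (the largest standardized difference of means over all split points), and homogeneous series are assumed to be Gaussian white noise. Both choices are assumptions for illustration, not the MASH test.

```python
import numpy as np

def shift_statistic(z):
    """Hypothetical stand-in for TS: largest standardized mean difference over all splits."""
    z = np.asarray(z, dtype=float)
    n = z.size
    sd = z.std(ddof=1)
    best = 0.0
    for k in range(2, n - 1):                        # both segments keep at least 2 values
        diff = z[:k].mean() - z[k:].mean()
        scale = sd * np.sqrt(1.0 / k + 1.0 / (n - k))
        best = max(best, abs(diff) / scale)
    return best

def critical_value(n, sig_level, trials=2000, seed=0):
    """Estimate c such that P(TS > c | series homogeneous) is about sig_level."""
    rng = np.random.default_rng(seed)
    sims = [shift_statistic(rng.normal(size=n)) for _ in range(trials)]
    return float(np.quantile(sims, 1.0 - sig_level))

# Critical value for a series of length 50 at the 0.05 level (few trials, for speed).
c05 = critical_value(n=50, sig_level=0.05, trials=500)
```

A difference series whose statistic exceeds the estimated critical value would be declared inhomogeneous at the chosen significance level; a smaller significance level gives a larger critical value.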
PROPERTIES OF THE DETECTING PROCEDURE
(FOR THE PURPOSE OF SIGNIFICANCE AND EFFICIENCY)
If the detected break points: , then
i.e. on the given significance level:
 the intervals are not homogeneous, consequently the detected
break points are not superfluous, 
 the intervals can be accepted to be homogeneous. 
Confidence Intervals
Confidence intervals can also be given for the break points at the
confidence level (1 - sig. level):
I _{l}
l=1,..., 
3.3 Estimation of Shifts
Point estimation; Confidence intervals for the shifts 
4. EVALUATION OF HOMOGENEITY OF CANDIDATE SERIES X_{c}(t)
Based on the Test Statistics (TS) belonging to the Optimal Difference Series:
Z_{c}^{(m)}(t) (m = 1,...,2^{|R_{c}|} - 1)
5. CORRECTION OF CANDIDATE SERIES X_{c}(t)
Based on the examination of the Optimal Difference Series System:
BASIC PRINCIPLE OF BREAK POINT DETECTION FOR CANDIDATE SERIES
Let us assume, that
: detected Break Points,
I^{(m)}
: Confidence Intervals
belonging to the Optimal Difference Series Z_{c}^{(m)}(t)
, AND 

DECISION
The 'most probable'
is a Break Point of the Candidate Series X_{c}(t). 
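
The decision above can be illustrated with a small Python sketch. It assumes, as a simplification, that a break point is attributed to the candidate series when some confidence interval from every optimal difference series covers a common date; the function name and the year-based intervals are hypothetical and do not reproduce the formal MASH condition.

```python
def common_break_intervals(intervals_per_series):
    """Return maximal runs of dates covered by a confidence interval in every series.

    `intervals_per_series` holds one list of inclusive (first, last) date pairs
    per difference series.
    """
    if not intervals_per_series:
        return []
    covered = None
    for intervals in intervals_per_series:
        points = set()
        for lo, hi in intervals:
            points.update(range(lo, hi + 1))
        covered = points if covered is None else covered & points
    runs, start, prev = [], None, None
    for t in sorted(covered):
        if start is None:
            start = prev = t
        elif t == prev + 1:
            prev = t
        else:
            runs.append((start, prev))
            start = prev = t
    if start is not None:
        runs.append((start, prev))
    return runs

# Confidence intervals (in years) detected in three difference series.
detected = [
    [(1950, 1954), (1980, 1983)],
    [(1952, 1957)],
    [(1951, 1953), (1990, 1992)],
]
candidate_breaks = common_break_intervals(detected)   # [(1952, 1953)]
```

Only the years 1952 to 1953 are covered in all three series, so under this assumption the break would be attributed to the candidate series there, while the intervals around 1980 and 1990 would be attributed to reference series.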
6. USING OF META DATA (Meta Data: probable dates of break points) 
BASIC PRINCIPLE OF BREAK POINT DETECTION BY USING OF META DATA
Candidate series and its Meta Data:

Optimal Difference Series System:

Let us assume, that
:
detected Break Points,
I^{(m)}
: Confidence Intervals
belonging to the Optimal Difference Series
Z_{c}^{(m)}(t)
, AND

BASIC DECISION RULE

The 'most probable' D^{(c)}
Q is a Break Point of the Candidate Series X_{c}(t).
(Break Point: Meta Data)
(ii) If but
No Decision. 
(iii) If 
The 'most probable'
is a Break Point of the Candidate Series
X_{c}(t).
(Break Point: not Meta Data, but "undoubted")
7. EVALUATION OF META DATA
(Meta Data: probable dates of break points) 
THE QUALITY OF META DATA CAN BE VERIFIED BY STATISTICAL TESTS!!!
For example: the problem of Missing Meta Data??
In Practice: the statistical Test Results are often verified with the Meta Data.
BUT: the question may be turned round! 
Examined series and their Meta Data
X_{i}(t), _{i} = {
}
( i = 1,2,....,N) 
Candidate series and its Meta Data:
X_{c}(t), _{c}
c \in {1,2,...,N}
Optimal Difference Series belonging to the subset
R_{c}^{(m)} \subseteq R_{c}:

Transformation of Difference Series Z_{ci}(t)

\bar{Z}_{ci}(a,b): average of Z_{ci}(t)
over the interval (a,b).
Transformed Optimal Difference Series belonging to the subset
R_{c}^{(m)} \subseteq R_{c}:

(m = 1,...,2^{|R_{c}|} - 1)
are homogeneous if the inhomogeneities can be explained by the Meta Data! 
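
A minimal Python sketch of this idea follows. The transformation formula is not reproduced in this text; the sketch assumes that each difference series is centred by its interval averages between consecutive Meta Data dates, and the convention that a date is the 0-based index of the last value before a documented break is also an assumption.

```python
import numpy as np

def remove_segment_means(z, meta_dates):
    """Subtract from z its average over each interval delimited by the Meta Data dates.

    `meta_dates`: 0-based indices of the last value before each documented
    break (an assumed convention, for illustration only).
    """
    z = np.asarray(z, dtype=float)
    out = z.copy()
    edges = [0] + sorted(d + 1 for d in meta_dates) + [z.size]
    for a, b in zip(edges[:-1], edges[1:]):
        if b > a:
            out[a:b] -= z[a:b].mean()
    return out

# A noise-free difference series with one shift after index 2.
z = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]
explained = remove_segment_means(z, meta_dates=[2])      # shift matches the Meta Data
unexplained = remove_segment_means(z, meta_dates=[1])    # documented date is wrong
```

When the documented date matches the shift, the transformed series is flat and a homogeneity test would accept it; with a wrong date a residual step remains, which is how the test statistics can be used to judge the quality of the Meta Data.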
EVALUATION OF META DATA : Based on the Test Statistics (TS) belonging to the
Transformed Optimal Difference Series
_{c}^{(m)}(t) . 
8. SEASONAL APPLICATION OF MASH (SAM) 
Monthly difference series: Z^{(k)}(t)
(k = 1,2,....,K) 
Expectations and Variances:
E(Z^{(k)}(t) ) = IH^{(k)}(t), V (Z^{(k)}) 
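Seasonal mean difference series:
\bar{Z}(t) = \frac{1}{K} \sum_{k=1}^{K} Z^{(k)}(t)
Expectation and Variance:
E(\bar{Z}(t)) = \frac{1}{K} \sum_{k=1}^{K} IH^{(k)}(t), V(\bar{Z})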
The test results after the Homogenization of monthly series:
H_{0}: IH^{(k)}(t) \equiv 0 (k = 1,2,...,K) can be accepted.
BUT! (sometimes) the corresponding H_{0} for the seasonal mean series
cannot be accepted!
The reason of the problem
The efficiency of the test depends on the signal-to-noise ratio, and according to the test results

as a consequence of the general inequality:
V(\bar{Z}) < V(Z^{(k)}) (k = 1,2,...,K)
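
The gain in signal-to-noise ratio from seasonal averaging can be illustrated with a short simulation in Python. It assumes K monthly difference series that share the same shift and have independent noise of equal variance; independence and all numerical values are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, shift = 3, 100, 0.3

# K monthly difference series with a common shift at mid-series and
# independent unit-variance noise (an assumption of this sketch).
signal = np.where(np.arange(n) >= n // 2, shift, 0.0)
monthly = signal + rng.normal(size=(K, n))
seasonal = monthly.mean(axis=0)          # seasonal mean difference series

def noise_variance(z):
    """Sample variance of z after removing the known signal."""
    return float(np.var(z - signal, ddof=1))

monthly_var = [noise_variance(row) for row in monthly]
seasonal_var = noise_variance(seasonal)
```

Under these assumptions the shift is the same in the seasonal mean as in each monthly series, while the noise variance of the mean is about 1/K of the monthly one, so the same break is easier to detect in the seasonal series.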
Deviance series and ratios
( k = 1,2,...,K) 
Lemma 1
If R(\bar{Z}(t)) > R(Z^{(k)}(t)) (k = 1,2,...,K), then

where
\bar{V}(Z - \bar{Z}): arithmetic mean of the variances V(Z^{(k)} - \bar{Z}) (k = 1,2,...,K),
\bar{V}_{H}(Z - \bar{Z}): harmonic mean of the variances V(Z^{(k)} - \bar{Z}) (k = 1,2,...,K)
Consequently if R((t)) >
R(Z^{(k)}(t)) 0 ( k = 1,2,...,K) , then
the ratios R(Z^{(k)}(t) - \bar{Z}(t)) (k = 1,2,...,K)
are probably near 0.
Test of Hypothesis
H_{0}:
R(Z^{(k)}(t)  (t))
0
( ) ( k = 1,2,...,K) 
The test of hypothesis is based on the examination of the deviance series
Z^{(k)}(t) - \bar{Z}(t) (k = 1,2,...,K)
If H_{0} can be accepted, then
as a consequence of the following lemma. 
Lemma 2
where 
\bar{V}(Z): arithmetic mean of the variances V(Z^{(k)}) (k = 1,2,...,K),
\bar{V}_{H}(Z): harmonic mean of the variances V(Z^{(k)}) (k = 1,2,...,K)
Consequently the ratios
R(Z^{(k)}(t) - \bar{Z}(t)) (k = 1,2,...,K)
are probably near 0, i.e. the monthly inhomogeneities IH^{(k)}(t) (k = 1,2,...,K)
can be estimated with the estimate of the seasonal inhomogeneity.
THE STRUCTURE OF PROGRAM SYSTEM (MASHv2.01) 
Main Directory MASH2001:
 README.DOC
 Subdirectory SAM:
 Subdirectory SAMPAR (parametrization program)
 Main Program Files of SAM
 Subdirectory SAMEND (finishing program)
 Subdirectory SAMMANU ("manual" programs)
 Subdirectory SAMSUB (do not use it; it contains "subroutines")
 Subdirectory MASH:
 Subdirectory MASHPAR (parametrization program)
 Main Program Files of MASH
 Subdirectory MASHEND (finishing program)
 Subdirectory MASHMANU ("manual" program)
 Subdirectory MASHSUB (do not use it; it contains "subroutines")

General Comments 
Monthly, seasonal or annual time series can be homogenized with the aid of the program system. The time series belonging to different stations
are compared in the course of the procedure.
Maximal number of the stations: 100
Maximal length of the time series: 200
When monthly series are available for all 12 months, the monthly, seasonal and annual series can be homogenized together by the main program files of the
subdirectory SAM (Seasonal Application of MASH).
When only annual series, or monthly series belonging to a given month, or seasonal series belonging to a given season are available, the series can be
homogenized by the main program files of subdirectory MASH.
Depending on the climatic element, additive (e.g. temperature) or multiplicative (e.g. precipitation) models are applied. The second case can be transformed
into the first one by taking logarithms. The problem of values near zero can be solved by a Transformation Procedure which slightly increases the small values.
THE MASH SYSTEM (MASH IN PRACTICE) 
 Subdirectory MASH:
 Subdirectory MASHPAR (parametrization program)
 Main Program Files of MASH
 Subdirectory MASHEND (finishing program)
 Subdirectory MASHMANU ("manual" program)
 Subdirectory MASHSUB (do not use it; it contains "subroutines")

I. Parametrization in Subdirectory MASHPAR (MASHPAR.BAT)

Data File, Significance level (0.1, 0.05, 0.01), Table of Reference System, Table of META DATA

II. The Main Program Steps in Subdirectory MASH

1. Automatic filling of missing values ( MASHMISS.BAT )
It is obligatory in case of missing values! It can be repeated! 
2. The further steps can be used optionally
MASHLIER.BAT:
For automatic correction of outliers.
MASHHELP.BAT:
For evaluation of homogeneity of the examined series; for selection of candidate series.
METAHELP.BAT: For evaluation of META DATA.
MASHGAME.BAT:
An intensive examination for correction of one of the examined series in a playful way.
MASHGAUT.BAT:
An automatic version of MASHGAME.BAT for examination of all the series.
The examination is less intensive than the examination performed by MASHGAME.BAT.
MASHCOR.BAT: Possibility for manual correction of examined series.
MASHDRAW.BAT: Graphic series.
(Steps 1-2 can be repeated optionally!)

III. Finishing in Subdirectory MASHEND (MASHEND.BAT) 
THE SAM SYSTEM (SAM IN PRACTICE) 
 Subdirectory SAM:
 Subdirectory SAMPAR (parametrization program)
 Main Program Files of SAM
 Subdirectory SAMEND (finishing program)
 Subdirectory SAMMANU ("manual" program)
 Subdirectory SAMSUB (do not use it; it contains "subroutines")

I. Parametrization in Subdirectory SAMPAR (MASHPAR.BAT)

Data File, Significance level (0.1, 0.05, 0.01), Table of Reference System, Table of META DATA 

II. The Main Program Steps in Subdirectory SAM

1. Taking the chosen monthly or seasonal series In ( SAMIN.BAT )
2. Automatic filling of missing values ( MASHMISS.BAT )
It is obligatory in case of missing values! It can be repeated!
3. The further steps can be used optionally
MASHLIER.BAT: For automatic correction of outliers.
MASHHELP.BAT:
For evaluation of homogeneity of the examined series; for selection of candidate series.
METAHELP.BAT: For evaluation of META DATA.
MASHGAME.BAT:
An intensive examination for correction of one of the examined series in a playful way.
MASHGAUT.BAT:
An automatic version of MASHGAME.BAT for examination of all the series.
The examination is less intensive than the examination performed by MASHGAME.BAT.
MASHCOR.BAT: Possibility for manual correction of examined series.
MASHDRAW.BAT: Graphic series.

4. The further steps can be used in case of Seasonal Series
SAMTESTC.BAT: Test for comparison of the inhomogeneities between the seasonal series and the appropriate monthly series.
SAMTESTS.BAT: Test Procedure for selecting stations having different inhomogeneities between the seasonal series and the appropriate monthly series.

5. Taking the chosen monthly or seasonal series Out ( SAMOUT.BAT )
(Steps 1-5 can be repeated optionally!)

III. Finishing in Subdirectory SAMEND (SAMEND.BAT) 
References 
Szentimrey, T., 1994: "Statistical problems connected with the homogenization of climatic time series", Proceedings of the European Workshop on Climate Variations, Kirkkonummi, Finland, Publications of the Academy of Finland, 3/94, pp. 330-339.
Szentimrey, T., 1995: "Statistical methods for detection of inhomogeneities", Proceedings of the Regional Workshop on Climate Variability and Climate Change Vulnerability and Adaptation, Prague, pp. 293-298.
Szentimrey, T., 1995: "General problems of the estimation of inhomogeneities, optimal weighting of the reference stations", Proceedings of the 6th International Meeting on Statistical Climatology, Galway, Ireland, pp. 629-631.
Szentimrey, T., 1996: "Some statistical problems of homogenization: break points detection, weighting of reference series", Proceedings of the 13th Conference on Probability and Statistics in the Atmospheric Sciences, San Francisco, California, pp. 365-368.
Szentimrey, T., 1997: "Statistical procedure for joint homogenization of climatic time series", Proceedings of the Seminar for Homogenization of Surface Climatological Data, Budapest, Hungary, pp. 47-62.
Peterson, T.C., Easterling, D.R., Karl, T.R., Groisman, P., Nicholls, N., Plummer, N., Torok, S., Auer, I., Boehm, R., Gullett, D., Vincent, L., Heino, R., Tuomenvirta, H., Mestre, O., Szentimrey, T., Salinger, J., Forland, E.J., Hanssen-Bauer, I., Alexanderson, H., Jones, P. and Parker, D., 1998: "Homogeneity adjustments of in situ atmospheric climate data: a review", International Journal of Climatology, 18: 1493-1517.
Szentimrey, T., 1998: "MASHv1.03", Guide for Software Package, Hungarian Meteorological Service, Budapest, Hungary, p. 25.
Auer, I., Böhm, R., 1998: "Endbericht des Projects ALOCLIM, Teil III", Zentralanstalt für Meteorologie und Geodynamik, Wien.
Szentimrey, T., 1999: "Multiple Analysis of Series for Homogenization (MASH)", Proceedings of the Second Seminar for Homogenization of Surface Climatological Data, Budapest, Hungary; WMO, WCDMP-No. 41, pp. 27-46.
Szentimrey, T., 2000: "MASHv2.0", Guide for Software Package, Hungarian Meteorological Service, Budapest, Hungary, p. 38.
Szentimrey, T., 2002: "MASHv2.01", Guide for Software Package, Hungarian Meteorological Service, Budapest, Hungary, p. 42. 