American Journal of Mathematics and Statistics

p-ISSN: 2162-948X    e-ISSN: 2162-8475

2014;  4(2): 51-57

doi:10.5923/j.ajms.20140402.01

Analysis of Compositional Time Series from Repeated Surveys

Etebong P. Clement

Department of Mathematics and Statistics, University of Uyo, P.M.B. 1017, Uyo, Nigeria

Correspondence to: Etebong P. Clement, Department of Mathematics and Statistics, University of Uyo, P.M.B. 1017, Uyo, Nigeria.

Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.

Abstract

A compositional time series is a multivariate time series in which each of the series has values bounded between zero and one and the sum of the series equals one at each time point. Data with such characteristics are observed in repeated surveys when a survey variable has a multinomial response but interest lies in the proportion of units classified in each of its categories. The main approach to analyzing compositional time series data has been based on the application of an initial transformation to break the unit-sum constraint. The Box-Cox transformation was originally envisioned as a panacea for simultaneously correcting non-normality, non-linearity and heteroscedasticity. In practice, however, this transformation seldom fulfills all of these basic assumptions at once, as was originally suggested. This paper reviews work relating to these transformations, with some modifications and illustrative examples applicable to the analysis of compositional time series data.

Keywords: Box-Cox transformation, Compositional time series, Multinomial response, Repeated surveys

Cite this paper: Etebong P. Clement, Analysis of Compositional Time Series from Repeated Surveys, American Journal of Mathematics and Statistics, Vol. 4 No. 2, 2014, pp. 51-57. doi: 10.5923/j.ajms.20140402.01.

1. Introduction

Repeated surveys produce time series $\{y_t\}$ comprising estimates of the unknown target series $\{\theta_t\}$. If a survey is repeated at times $t = 1, 2, \ldots, T$, then the multinomial responses at each time $t$ lead to compositions. A composition is a vector of non-negative components summing to a constant, usually unity. Symbolically, a vector $x = (x_1, \ldots, x_D)'$ such that $x_i \geq 0$ for $i = 1, \ldots, D$ and $\sum_{i=1}^{D} x_i = 1$ is a composition. A time series of compositions is referred to as a compositional time series (CTS).
A compositional time series (CTS) is defined as a multivariate time series in which each of the series has values bounded between zero and one and the sum of the series equals one at each time point. Data with such characteristics are observed in repeated surveys when a survey variable has a multinomial response but interest lies in the proportion of units classified in each of its categories. The survey estimates are therefore proportions of a whole, subject to a unity-sum constraint.
A repeated survey is a sample survey which is performed more than once with essentially the same questionnaire or schedule but not necessarily with the same sample units. Many repeated surveys are based on a rotating panel design in which $K$ panels of sampling units are investigated at each survey round (time point) and panels are replaced in a systematic manner, according to the rotation pattern of the survey design. In these surveys, elementary design-unbiased estimates $y_t^{(k)}$, $k = 1, \ldots, K$, for the population parameters $\theta_t$ can be obtained from each rotation group. A rotation group is a set of sampling units that joins and leaves the sample at the same time [1].
A repeated survey enables estimation of changes for the population as well as cross-sectional estimates. Monitoring and detecting important changes will usually be a key reason for sampling in time. Common frequencies for repeated surveys are monthly, quarterly and annual. However, more frequent sampling may be adopted, as in the opinion polls leading up to an election and in the monitoring of television or radio ratings [2].
Examples of repeated surveys include the monthly labour force survey in Australia. Quarterly surveys include the labour force surveys in the U.K. and Ireland and many business surveys. Among annual surveys, the Annual Survey of Manufactures of the U.S. Census Bureau enumerates a fixed panel of economic establishments for five survey years; establishments are selected with probabilities proportionate to size using Poisson sampling. The June Enumerative Survey of the National Agricultural Statistics Service is a yearly survey of agricultural activities. The Farm Costs and Returns Survey, also of the National Agricultural Statistics Service, enumerates a stratified simple random sample of farms each year.
In a repeated survey there is not necessarily any overlap of the sample for different occasions. A rotating panel survey also uses a sample that is followed over time, but the focus is on estimates at aggregate levels. When the emphasis is on estimates for the population, an independent sample may be used on each occasion, which is often the case when the interval between the surveys is quite large. Another option is to use the same sample on each occasion, with additions so that the sample estimates refer to the current population. For monthly or quarterly surveys the sample is often designed with considerable overlap between successive surveys. The sample overlap reduces the sampling variance of estimates of change and reduces costs. Many important surveys are conducted repeatedly to give estimates of the level or mean for several time periods.
Repeated surveys can provide estimates $y_t$ for each time period $t$. A major value of repeated surveys is in their ability to provide estimates of change. The simplest analysis of change is the estimate of one-period change, $y_t - y_{t-1}$. In a monthly survey this corresponds to one-month change. For a survey conducted annually this corresponds to annual change. In general, therefore, the change between estimates $s$ time periods apart can be estimated as the difference $y_t - y_{t-s}$.
The focus is often on $s = 1$, but for a survey repeated on a monthly basis changes for $s > 1$ are also commonly examined [2]. Having sample overlap at times $t$ and $t-s$ will usually lead to a positive correlation between the estimates $y_t$ and $y_{t-s}$. Since
$$\mathrm{Var}(y_t - y_{t-s}) = \mathrm{Var}(y_t) + \mathrm{Var}(y_{t-s}) - 2\,\mathrm{Corr}(y_t, y_{t-s})\sqrt{\mathrm{Var}(y_t)\,\mathrm{Var}(y_{t-s})},$$
having sample overlap reduces the variance of $y_t - y_{t-s}$ compared with having no sample overlap. [3] considered the components of change in a repeated survey. [4-6] give a general review of issues in the design and analysis of repeated surveys. [7] cover many of the important issues associated with panel surveys. [8-10] review estimation issues for repeated surveys.
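The effect of sample overlap on the variance of an estimate of change can be illustrated numerically. The following is a minimal sketch only: the function name `variance_of_change` and the variance and correlation values are hypothetical and are not taken from any survey.

```python
import math


def variance_of_change(var_t, var_s, corr):
    """Variance of the difference between two estimates.

    var_t, var_s : sampling variances of the two estimates
    corr         : correlation between them (0 when the samples are independent)
    """
    return var_t + var_s - 2.0 * corr * math.sqrt(var_t * var_s)


# Hypothetical values: equal variances, with and without sample overlap.
no_overlap = variance_of_change(4.0, 4.0, 0.0)
with_overlap = variance_of_change(4.0, 4.0, 0.6)
print(no_overlap, with_overlap)
```

Any positive correlation induced by overlap lowers the variance of the estimated change relative to independent samples with the same variances.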
The focus of this paper is on compositional data from repeated surveys. Data of this kind frequently arise in disciplines as disparate as biology, demography, ecology, economics, geology and politics. Examples are: the percentage of different species of fish recorded in a lake at different instants in time, the composition of monthly immigration to a city according to the country of origin, the daily market share at the end of trading, the breakdown of household monthly consumption by type of item in budget surveys and the results of opinion polls conducted at different times during an election campaign [11]. In this paper we give a detailed review of developments in the field of the statistical analysis of compositional time series (CTS).
Historically, the main approach to analyzing CTS data has been based on the application of an initial transform to break the unit sum constraint, followed by the use of standard time series techniques. The inverse transformation is then used on the derived results to obtain results pertinent to the original sample space. That is, the inverse transformation is applied to obtain the equivalent inferential results for the original compositional time series (CTS).
This approach was first discussed by [12] in the context of analyzing CTS from repeated sample surveys. In [12-14], the authors first proved that such an approach is invariant to the choice of the component used as the common divisor in the additive log ratio (alr) transformation. Secondly, assuming normality for the distribution of the transformed series, they obtained forecasts for the original CTS by calculating the mean of the corresponding additive logistic distribution numerically.
In this paper two methods of analyzing CTS are discussed: direct modeling in the simplex, and transformation of the simplex. An attempt is made at reviewing the works relating to the transformation of the simplex, with some modifications.

2. Compositional Time Series

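Let
$$\theta_t = (\theta_{1t}, \ldots, \theta_{Dt})' \qquad (1)$$
be a vector of population quantities of interest at time $t$, and assume that observations are taken at equally spaced time intervals $t = 1, 2, \ldots, T$.
Let
$$y_t = (y_{1t}, \ldots, y_{Dt})' \qquad (2)$$
represent a survey-based estimate of $\theta_t$ based on data collected at time $t$.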
Repeated surveys produce time series $\{y_t\}$ comprising estimates of the unknown target series $\{\theta_t\}$. According to [1], focusing on the unknown population vector $\theta_t$, it is natural to imagine that knowledge of $\theta_{t-1}, \theta_{t-2}, \ldots$ conveys useful information about $\theta_t$, but without implying that $\theta_t$ is perfectly predictable from its past.
One way of representing this situation is by considering $\theta_t$ to be a random variable which evolves stochastically in time following a certain time series model, as was first proposed for univariate survey analysis by [15-17].
The survey estimates of (1) and (2) can then be expressed as
$$y_t = \theta_t + e_t \qquad (3)$$
where $\{y_t\}$, $\{\theta_t\}$ and $\{e_t\}$ are random processes and $e_t$ are the sampling errors such that $E(e_t \mid \theta_t) = 0$ and $\mathrm{Var}(e_t \mid \theta_t) = \Sigma_{e_t}$.
Many variables investigated by statistical agencies have a multinomial response, and interest lies in the estimation of the proportion of units classified in each of the categories. If this is the case, the vector of proportions sums to one and forms what is known as a composition.
A composition is a vector of non-negative components summing to a constant, usually 1, or put symbolically, a vector $x = (x_1, \ldots, x_D)'$ such that $x_i \geq 0$ and $\sum_{i=1}^{D} x_i = 1$.
A time series of compositions is referred to as a Compositional Time Series (CTS). A Compositional Time Series is a sequence of vectors $\{x_t\}$, each belonging to the simplex $S^d$, where $d = D - 1$.
If a survey is repeated at times $t = 1, 2, \ldots, T$, then the multinomial responses at each time $t$, say $x_t = (x_{1t}, \ldots, x_{Dt})'$, constitute compositions, and the sequence $\{x_t : t = 1, \ldots, T\}$ forms a multivariate time series.
The transformation of the series $\{x_t\}$ produces a multivariate time series defined on $R^d$ at each time point, which can be analysed using standard methods. In particular, [13] examined the use of ARMA models on the transformed series defined by $v_t = \mathrm{alr}(x_t)$, with components $v_{it} = \log(x_{it}/x_{Dt})$, $i = 1, \ldots, d$.
In the multivariate case, the ideas of [18], who give a very simple procedure for choosing, estimating and testing such models, are followed.
However, it is always necessary to consider whether the choice of reference variable in any way influences the analysis. Consequently, [12] proves the following results.
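(i) Let
$$v_t^{(j)} = Z_j v_t,$$
where $v_t = \mathrm{alr}(x_t)$ is the alr-transformed series with $x_{Dt}$ as reference variable, $v_t^{(j)}$ is the alr-transformed series with $x_{jt}$ as reference variable (its $i$-th element is $\log(x_{it}/x_{jt})$ for $i \neq j$ and its $j$-th element is $\log(x_{Dt}/x_{jt})$), and $Z_j$ is given by the $d \times d$ identity matrix with its $j$-th column replaced by a column of $-1$s, so that $|Z_j| = -1 \neq 0$. Then if $v_t$ follows a multivariate ARMA$(p, q)$ process of dimension $d$, $v_t^{(j)}$ is also multivariate ARMA$(p, q)$. The roots of the determinantal equations of both the AR and MA components from the two models are identical, so that the stationarity and invertibility conditions remain consistent.
(ii) Consider the compositional time series $\{x_t\}$ where $v_t^{(j)}$ follows an ARMA$(p, q)$ process. Then each ARMA model represents the same model for $x_t$, except that the elements of $v_t^{(j)}$ and the associated parameters have been permuted. That is, the model for $x_t$ is totally invariant to the choice of reference variable.
The consequence of results (i) and (ii) is that any component of $x_t$ may be selected as the reference variable without affecting the final results. In what follows, we assume that the reference variable is $x_{Dt}$. The application of compositional data to modelling and forecasting is straightforward when the argument of [19] is followed.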
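Let the series $\{x_t\}$ be transformed to $v_t = \mathrm{alr}(x_t)$.
If $v_t$ is then modeled by a vector ARMA$(p, q)$ process, forecasts for $v_t$ can be obtained. Let the $l$-step ahead forecast of $v_{t+l}$ be denoted by $\hat{v}_t(l)$ and its covariance matrix by $\Sigma_l$. A “naïve” forecast for $x_{t+l}$ is then
$$\hat{x}_t(l) = \mathrm{alr}^{-1}(\hat{v}_t(l)).$$
Assume normality for the distribution of $v_{t+l}$, so that $v_{t+l} \sim N_d(\hat{v}_t(l), \Sigma_l)$ and hence $x_{t+l}$ follows the corresponding additive logistic normal distribution $L^d(\hat{v}_t(l), \Sigma_l)$. The optimum forecast of $x_{t+l}$ may be obtained numerically by calculating the mean of $L^d(\hat{v}_t(l), \Sigma_l)$, or may be approximated. A confidence region for $x_{t+l}$ may also be obtained following standard multivariate theory, though the confidence region will not be centered at $\hat{x}_t(l)$.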
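The numerical calculation of the mean of an additive logistic normal distribution can be sketched by Monte Carlo simulation. This is an illustrative sketch only, not the numerical method used in [12-14]: the forecast mean `mu`, the covariance matrix `sigma`, the number of draws and the function names are all hypothetical, and the last part of the composition is assumed to be the reference variable.

```python
import numpy as np


def alr_inv(v):
    """Map vectors in R^d back to the simplex (the last part is the reference)."""
    v = np.asarray(v, dtype=float)
    # Append a zero log-ratio for the reference part, exponentiate and close.
    w = np.exp(np.concatenate([v, np.zeros(v.shape[:-1] + (1,))], axis=-1))
    return w / w.sum(axis=-1, keepdims=True)


def logistic_normal_mean(mu, sigma, n_draws=100_000, seed=0):
    """Monte Carlo estimate of E[alr_inv(v)] for v ~ N(mu, sigma)."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mu, sigma, size=n_draws)
    return alr_inv(draws).mean(axis=0)


mu = np.array([0.5, -0.2])            # hypothetical forecast of the alr series
sigma = np.array([[0.30, 0.05],
                  [0.05, 0.20]])      # hypothetical forecast covariance matrix

naive = alr_inv(mu)                   # back-transformed point forecast
mc_mean = logistic_normal_mean(mu, sigma)
print(naive, mc_mean)
```

Both vectors are valid compositions, but they generally differ because the inverse alr transformation is nonlinear, which is why the naïve forecast is not the mean of the forecast distribution.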
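A $100(1-\alpha)\%$ confidence region for $x_{t+l}$ according to [13] can be formed from
$$(v - \hat{v}_t(l))'\,\Sigma_l^{-1}\,(v - \hat{v}_t(l)) \leq \chi^2_d(\alpha),$$
where $\hat{v}_t(l)$ and $\Sigma_l$ are the $l$-step ahead forecast of the alr-transformed series and its covariance matrix, and $\chi^2_d(\alpha)$ is the upper $100\alpha\%$ point of a $\chi^2$ distribution with $d$ degrees of freedom, by mapping points $v$ from $R^d$ onto the simplex $S^d$.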
Also forecasts for either the ratios or generally the log-ratios may be obtained.
where
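The cycle described in this section (transform, model, forecast, back-transform) can be sketched as follows. The sketch makes several assumptions that are not in the original: the data are simulated, the last part is the reference variable, and a first-order vector autoregression fitted by ordinary least squares stands in for the general vector ARMA$(p, q)$ model. All function names are hypothetical.

```python
import numpy as np


def alr(x):
    """Additive log-ratio transform with the last part as reference."""
    x = np.asarray(x, dtype=float)
    return np.log(x[..., :-1] / x[..., -1:])


def alr_inv(v):
    """Inverse alr: map vectors in R^d back to the simplex."""
    v = np.asarray(v, dtype=float)
    w = np.exp(np.concatenate([v, np.zeros(v.shape[:-1] + (1,))], axis=-1))
    return w / w.sum(axis=-1, keepdims=True)


def fit_var1(v):
    """Least-squares fit of v_t = c + A v_{t-1} + e_t; returns (c, A)."""
    regressors = np.hstack([np.ones((len(v) - 1, 1)), v[:-1]])
    coef, *_ = np.linalg.lstsq(regressors, v[1:], rcond=None)
    return coef[0], coef[1:].T


def forecast_var1(c, A, last, steps):
    """Iterate the fitted recursion to get 1..steps ahead forecasts."""
    out, cur = [], last
    for _ in range(steps):
        cur = c + A @ cur
        out.append(cur)
    return np.array(out)


# Simulated 3-part compositional series (hypothetical data).
rng = np.random.default_rng(1)
T, d = 120, 2
v_sim = np.zeros((T, d))
for t in range(1, T):
    v_sim[t] = 0.1 + 0.6 * v_sim[t - 1] + rng.normal(scale=0.1, size=d)
x = alr_inv(v_sim)                     # each row is a composition

v = alr(x)                             # step 1: break the unit-sum constraint
c, A = fit_var1(v)                     # step 2: standard time series model
v_hat = forecast_var1(c, A, v[-1], 4)  # step 3: forecast in R^d
x_hat = alr_inv(v_hat)                 # step 4: naive forecasts on the simplex
print(x_hat)
```

The back-transformed forecasts are positive and sum to one by construction, whatever model is used in the transformed space.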

3. Analyzing Compositional Time Series

Two methods of analyzing compositional time series will be explored, namely the direct method and the transformation method. Under the transformation method, we shall examine two techniques: the Box-Cox transformation and the log ratio transformation. The log ratio transformation is considered in three forms: (i) the additive log ratio (alr) transformation, (ii) the centered log ratio (clr) transformation and (iii) the isometric log ratio (ilr) transformation.

3.1. Direct Modeling in the Simplex

Around the same time as the publication of [12], [20-21] introduced a different approach to analyzing CTS, which had also been inspired by some of the earlier ideas of Aitchison. There and in [22], the authors developed state space models which could be used to model CTS data directly in the simplex. The distribution of the CTS conditioned on the unobserved state was assumed to be Dirichlet. The state distribution was assumed to be Dirichlet conjugate. This was a new generalization of the Dirichlet distribution proposed by them in order to allow for dependence between the components.
A vector of continuous proportions consists of the proportions of some total accounted for by its constituent components (compositions). We consider situations where time series data are available and where interest focuses on the proportions rather than the actual amounts. In the state space model for time series of compositions, the observations are assumed, conditionally on the unobserved state, to follow the Dirichlet distribution, often considered to be the most natural distribution on the simplex. The state follows the Dirichlet conjugate distribution.
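Let $y = (y_1, \ldots, y_{d+1})'$ be a vector of continuous proportions, namely a vector with positive components such that $1'y = 1$,
where $1$ is a $(d+1)$-vector of 1s.
Then $y$ follows the Dirichlet distribution if it has the density
$$p(y \mid \alpha) = \frac{1}{\mathcal{D}(\alpha)} \prod_{j=1}^{d+1} y_j^{\alpha_j - 1} \qquad (4)$$
In density (4), $\alpha = (\alpha_1, \ldots, \alpha_{d+1})'$ where $\alpha_j > 0$ for $j = 1, \ldots, d+1$, and
$$\mathcal{D}(\alpha) = \frac{\prod_{j=1}^{d+1} \Gamma(\alpha_j)}{\Gamma\!\left(\sum_{j=1}^{d+1} \alpha_j\right)}$$
is the Dirichlet function, a $(d+1)$-dimensional analogue of the beta function. We denote this situation by $y \sim \mathcal{D}(\alpha)$.
The sample space is the $d$-dimensional simplex $S^d = \{y \in R^{d+1} : y_j > 0,\ 1'y = 1\}$.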
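The Dirichlet density on the simplex can be evaluated directly from the gamma function. The sketch below is illustrative only: the function name `dirichlet_logpdf` and the example values are hypothetical, and the log scale is used simply to avoid overflow for large parameters.

```python
import math


def dirichlet_logpdf(y, alpha):
    """Log of the Dirichlet density at the composition y with parameters alpha."""
    if len(y) != len(alpha):
        raise ValueError("y and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_j must be positive")
    if any(p <= 0 for p in y) or abs(sum(y) - 1.0) > 1e-9:
        raise ValueError("y must have positive components summing to one")
    # Log of the Dirichlet function: sum of log-gammas minus log-gamma of the sum.
    log_norm = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(p) for a, p in zip(alpha, y)) - log_norm


# With all parameters equal to one the density is flat on the simplex.
flat = math.exp(dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
# With two parts the Dirichlet reduces to the beta distribution.
beta22 = math.exp(dirichlet_logpdf([0.5, 0.5], [2.0, 2.0]))
print(flat, beta22)
```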
Expressing (4) in exponential family form, we have:
Let
and
Z is the vector of symmetric log ratios (clr) and Z = clr (y)
Also let
where so that . Then density (4) becomes:
(5)
The sample space is and the parameters space is . The purpose of this reparameterization according to [22] is to separate the effects of location and spread as far as possible.

3.2. Transformation Method

The sample space of a composition is referred to as the simplex $S^d$. It has been known since the days of [23] that normal statistical methods are not applicable to elements of the simplex (the compositions).
The major way of resolving these problems, following the ideas of Aitchison, has been through transformation.
3.2.1. Box-Cox Transformation
[24] introduced the use of the well-known Box-Cox transformation as an alternative to the additive log ratio (alr) transformation. The Box-Cox transformation has the advantage of including the alr transformation as a special case. However, the only known application of this approach is that presented in [25]. These authors modeled the Box-Cox transformed data using dynamic linear models incorporating a rich class of distributions for the errors based on scale mixtures of multivariate normal distributions. This general class of distributions includes as special cases the multivariate normal, Student-t, logistic and stable distributions, among others.
[25] used the same complex procedures as those proposed in [26] to carry out model selection and inference. They illustrated their approach using two CTS: the mortality data from Los Angeles (analyzed previously by [26]) and a CTS on vehicle production which had been previously analyzed by [21].
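[27] introduced a family of power transformations such that the transformed values are a monotonic function of the observations over some admissible range, indexed by $\lambda$:
$$y^{(\lambda)} = \begin{cases} y^{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \qquad (6)$$
for $y > 0$. However, this family has been modified by [28] to take account of the discontinuity at $\lambda = 0$, such that
$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \qquad (7)$$
and that, for unknown $\lambda$,
$$y^{(\lambda)} = X\beta + \varepsilon,$$
where $y^{(\lambda)}$ is the vector of transformed observations, $X$ is a matrix of known constants, $\beta$ is a vector of unknown parameters associated with the transformed values and $\varepsilon$ is a vector of random errors. The transformation in equation (7) is valid only for $y > 0$ and, therefore, modifications have had to be made for negative observations. [28] proposed the shifted power transformation with the form
$$y^{(\lambda)} = \begin{cases} \dfrac{(y + \lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \lambda_1 \neq 0 \\ \log(y + \lambda_2), & \lambda_1 = 0 \end{cases} \qquad (8)$$
where $\lambda_1$ is the transformation parameter and $\lambda_2$ is chosen such that $y > -\lambda_2$.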
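The Box-Cox family, its inverse and the shifted version can be written compactly. The following is a minimal sketch under the usual textbook definitions; the function names are hypothetical, and the transformation parameters are supplied by the user rather than estimated.

```python
import numpy as np


def box_cox(y, lam):
    """Box-Cox transform of positive data for a given lambda."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("the Box-Cox transformation requires y > 0")
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam


def box_cox_inv(z, lam):
    """Inverse of box_cox for the same lambda."""
    z = np.asarray(z, dtype=float)
    if lam == 0:
        return np.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


def shifted_box_cox(y, lam1, lam2):
    """Shifted power transform: Box-Cox applied to y + lam2 (needs y > -lam2)."""
    return box_cox(np.asarray(y, dtype=float) + lam2, lam1)


y = np.array([0.5, 1.0, 2.0, 4.0])      # hypothetical positive observations
z = box_cox(y, 0.5)
y_back = box_cox_inv(z, 0.5)
shifted = shifted_box_cox([-1.0, 0.0, 1.0], 0.0, 2.0)
print(z, y_back, shifted)
```

Subtracting one and dividing by $\lambda$ is what makes the family continuous at $\lambda = 0$: for small $\lambda$ the transformed values approach $\log y$.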
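[29] introduced the so-called modulus transformation, which is considered to normalize distributions already possessing some measure of approximate symmetry and carries the form
$$y^{(\lambda)} = \begin{cases} \mathrm{sign}(y)\,\dfrac{(|y| + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \mathrm{sign}(y)\,\log(|y| + 1), & \lambda = 0 \end{cases} \qquad (9)$$
[30] suggested another alternative which can be used with negative observations and which is claimed to be effective at turning skew unimodal distributions into nearly symmetric normal-like distributions. It is of the form:
$$y^{(\lambda)} = \begin{cases} \dfrac{\exp(\lambda y) - 1}{\lambda}, & \lambda \neq 0 \\ y, & \lambda = 0 \end{cases} \qquad (10)$$
[31] suggested another modification so that distributions of $y$ with unbounded support, such as the normal distribution, can be included. For $\lambda > 0$, the extension is:
$$y^{(\lambda)} = \frac{|y|^{\lambda}\,\mathrm{sign}(y) - 1}{\lambda} \qquad (11)$$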
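The three alternatives that accept negative observations can be compared side by side. The sketch below follows the standard published forms of the modulus, exponential and signed power transformations; the function names and the test values are hypothetical.

```python
import numpy as np


def modulus(y, lam):
    """Modulus transformation: a signed power transform of |y| + 1."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.sign(y) * np.log1p(np.abs(y))
    return np.sign(y) * ((np.abs(y) + 1.0) ** lam - 1.0) / lam


def manly(y, lam):
    """Exponential transformation; reduces to the identity at lam = 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return y.copy()
    return np.expm1(lam * y) / lam


def bickel_doksum(y, lam):
    """Signed power transformation, defined for lam > 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    y = np.asarray(y, dtype=float)
    return (np.sign(y) * np.abs(y) ** lam - 1.0) / lam


y = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # hypothetical observations
print(modulus(y, 0.5))
print(manly(y, 0.5))
print(bickel_doksum(y, 0.5))
```

For positive observations the signed power transformation coincides with the Box-Cox form, so it can be read as an extension of that family to the whole real line.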
It is important to note that the range of $y^{(\lambda)}$ in equations (6)-(9) is restricted according to whether $\lambda$ is positive or negative. This implies that the transformed values do not cover the entire range $(-\infty, \infty)$ and, hence, their distributions are of bounded support. Consequently, only approximate normality is to be expected.
It is also remarked that since the transformation of [28] was introduced, other modifications for special applications and circumstances have been made, but for most researchers the original Box-Cox transformation of equation (7) suffices and is preferable due to its computational simplicity.
3.2.2. Log Ratio Transformation
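Let $\mathcal{A}_D$ denote the family of all real $D \times D$ matrices $A$ such that $A 1_D = 0_D$ and $A' 1_D = 0_D$, where $1_D$ and $0_D$ are the $D$-vectors of ones and zeros.
Let $A \in \mathcal{A}_D$ and $x \in S^d$. We define the product $A \boxdot x$ as:
$$A \boxdot x = \mathcal{C}(\exp(A \log x)),$$
where $\exp$ and $\log$ act componentwise and $\mathcal{C}$ denotes closure, that is, division of a positive vector by the sum of its components.
The function $x \mapsto A \boxdot x$ is an endomorphism of the vector space $S^d$. Moreover, any endomorphism of $S^d$ can be written in this form. The matrix associated with the identity endomorphism is the well-known centering matrix of order $D$, $H_D = I_D - D^{-1} 1_D 1_D'$.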
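(i) Additive Log Ratio Transformation (alr)
The alr transformation of index $D$, denoted by alr(x), is the one-to-one transformation from $S^d$ to $R^d$ defined as:
$$\mathrm{alr}(x) = \left(\log\frac{x_1}{x_D}, \ldots, \log\frac{x_d}{x_D}\right)'$$
where $x = (x_1, \ldots, x_D)' \in S^d$ and $d = D - 1$.
The inverse, denoted $\mathrm{alr}^{-1}$ or (gal), is defined as:
$$\mathrm{alr}^{-1}(v) = \frac{(e^{v_1}, \ldots, e^{v_d}, 1)'}{1 + \sum_{k=1}^{d} e^{v_k}}, \qquad v \in R^d,$$
where gal means generalized additive logistic transformation.
The additive log ratio transformation is asymmetric in the parts of the composition.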
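The alr transformation and its inverse amount to a few lines of array code. This is a minimal sketch with the last part taken as the reference; the function names and the example composition are hypothetical.

```python
import numpy as np


def alr(x):
    """Additive log-ratio transform with the last part as reference."""
    x = np.asarray(x, dtype=float)
    return np.log(x[..., :-1] / x[..., -1:])


def alr_inv(v):
    """Inverse alr (additive logistic) transform back to the simplex."""
    v = np.asarray(v, dtype=float)
    # The reference part has log-ratio zero against itself.
    w = np.exp(np.concatenate([v, np.zeros(v.shape[:-1] + (1,))], axis=-1))
    return w / w.sum(axis=-1, keepdims=True)


x = np.array([0.2, 0.3, 0.5])   # hypothetical 3-part composition
v = alr(x)                      # two unconstrained log-ratios
x_back = alr_inv(v)             # recovers the original composition
print(v, x_back)
```

Choosing a different part as the divisor gives different coordinates, which is the asymmetry noted above.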
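(ii) Centered Log Ratio Transformation (clr)
The centered (or symmetric) log ratio transformation, denoted by clr, is the function from the compositional space $S^d$ to $R^D$ defined by:
$$\mathrm{clr}(x) = \left(\log\frac{x_1}{g(x)}, \ldots, \log\frac{x_D}{g(x)}\right)'$$
where $x \in S^d$ and $g(x) = \left(\prod_{i=1}^{D} x_i\right)^{1/D}$ is the geometric mean of x.
The inverse, denoted by $\mathrm{clr}^{-1}$, is defined by
$$\mathrm{clr}^{-1}(z) = \frac{(e^{z_1}, \ldots, e^{z_D})'}{\sum_{k=1}^{D} e^{z_k}}.$$
This transformation is symmetric in the parts of the composition. The transformation maps $S^d$ onto the subspace $\mathcal{V} = \{z \in R^D : 1_D' z = 0\}$ of $R^D$, which can be seen to be a hyperplane through the origin of $R^D$, orthogonal to $1_D$ (the vector of units). This subspace has dimension $d = D - 1$. Let $\{v_1, \ldots, v_d\}$ be any orthonormal basis of $\mathcal{V}$ and let $V$ be the $D \times d$ matrix $[v_1 : \cdots : v_d]$.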
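The clr transformation can be sketched in the same way as the alr. The function names and the example composition below are hypothetical; the only fact used is that dividing by the geometric mean is the same as centering the logarithms.

```python
import numpy as np


def clr(x):
    """Centered log-ratio transform: logs minus their mean."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)


def clr_inv(z):
    """Inverse clr: exponentiate and close to unit sum."""
    w = np.exp(np.asarray(z, dtype=float))
    return w / w.sum(axis=-1, keepdims=True)


x = np.array([0.2, 0.3, 0.5])   # hypothetical 3-part composition
z = clr(x)
x_back = clr_inv(z)
print(z, z.sum(), x_back)
```

The clr coordinates sum to zero, so they lie on a hyperplane of $R^D$ and their covariance matrix is singular, which is one motivation for the isometric coordinates.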
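(iii) Isometric Log Ratio Transformation (ilr)
The isometric log ratio transformation, denoted by ilr, is defined by
$$\mathrm{ilr}(x) = V'\,\mathrm{clr}(x)$$
for a given matrix $V$ of $D$ rows and $(D-1)$ columns such that $V'V = I_{D-1}$ (the identity matrix of order $D-1$) and $VV' = I_D + a\,1_{D \times D}$, where $1_{D \times D}$ is a matrix full of ones; together these two conditions force $a = -1/D$.
Alternatively, $\mathrm{ilr}(x) = (z_1, \ldots, z_d)'$, where $d = D-1$,
where
$$z_i = \sqrt{\frac{i}{i+1}}\,\log\frac{\left(\prod_{j=1}^{i} x_j\right)^{1/i}}{x_{i+1}}, \qquad i = 1, \ldots, d.$$
The inverse, denoted by $\mathrm{ilr}^{-1}$, is defined as:
$$\mathrm{ilr}^{-1}(z) = \mathcal{C}(\exp(Vz)),$$
where $z \in R^d$ and $\mathcal{C}$ denotes closure, that is, division of a positive vector by the sum of its components.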
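One concrete choice of orthonormal basis makes the ilr transformation easy to compute and to check. The sketch below assumes the widely used basis in which the $i$-th coordinate contrasts the geometric mean of the first $i$ parts with part $i+1$; other orthonormal bases are equally valid, and the function names are hypothetical.

```python
import numpy as np


def ilr_basis(D):
    """D x (D-1) matrix whose columns are orthonormal and orthogonal to 1."""
    V = np.zeros((D, D - 1))
    for i in range(1, D):
        V[:i, i - 1] = 1.0 / i       # equal weights on the first i parts
        V[i, i - 1] = -1.0           # contrast with part i + 1
        V[:, i - 1] *= np.sqrt(i / (i + 1.0))
    return V


def clr(x):
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)


def ilr(x):
    """Isometric log-ratio coordinates with respect to ilr_basis."""
    x = np.asarray(x, dtype=float)
    return clr(x) @ ilr_basis(x.shape[-1])


def ilr_inv(z):
    """Inverse ilr: back to clr coordinates, exponentiate and close."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z @ ilr_basis(z.shape[-1] + 1).T)
    return w / w.sum(axis=-1, keepdims=True)


x = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical 4-part composition
z = ilr(x)
x_back = ilr_inv(z)
print(z, x_back)
```

Because the columns of the basis are orthonormal, the ilr coordinates have the same Euclidean norm as the clr coordinates, which is the sense in which the transformation is isometric.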
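Let us evaluate the log ratio transformations when $D = 3$ and $D = 4$. For $D = 3$, that is, $x = (x_1, x_2, x_3)'$, the resulting vectors are:
$$\mathrm{alr}(x) = \left(\log\frac{x_1}{x_3},\ \log\frac{x_2}{x_3}\right)'$$
$$\mathrm{clr}(x) = \left(\log\frac{x_1}{g(x)},\ \log\frac{x_2}{g(x)},\ \log\frac{x_3}{g(x)}\right)', \quad \text{where } g(x) = (x_1 x_2 x_3)^{1/3}$$
$$\mathrm{ilr}(x) = \left(\frac{1}{\sqrt{2}}\log\frac{x_1}{x_2},\ \frac{1}{\sqrt{6}}\log\frac{x_1 x_2}{x_3^2}\right)'$$
where the ilr coordinates are taken with respect to the orthonormal basis in which the $i$-th coordinate contrasts the geometric mean of the first $i$ parts with part $i+1$.
Again if $D = 4$, that is, $x = (x_1, x_2, x_3, x_4)'$, then the resulting vectors of the different transformations are the following:
$$\mathrm{alr}(x) = \left(\log\frac{x_1}{x_4},\ \log\frac{x_2}{x_4},\ \log\frac{x_3}{x_4}\right)'$$
$$\mathrm{clr}(x) = \left(\log\frac{x_1}{g(x)},\ \log\frac{x_2}{g(x)},\ \log\frac{x_3}{g(x)},\ \log\frac{x_4}{g(x)}\right)'$$
$$\mathrm{ilr}(x) = \left(\frac{1}{\sqrt{2}}\log\frac{x_1}{x_2},\ \frac{1}{\sqrt{6}}\log\frac{x_1 x_2}{x_3^2},\ \frac{1}{\sqrt{12}}\log\frac{x_1 x_2 x_3}{x_4^3}\right)'$$
where g(x) is the geometric mean as defined earlier.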
It is very important to emphasize that all these transformations and their inverses are one-to-one linear transformations between the compositional vector space and the real vector space with its natural structure. The vectors alr(x), clr(x) and ilr(x) associated with the same composition x are related by the following linear relationships, expressed in matrix form.
1.
2.
3. and where is the matrix , with .
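Linear relationships of this kind between the three coordinate systems can be checked numerically. The sketch below verifies three standard identities under stated assumptions: the matrix `F = [I_d : -1_d]`, the particular orthonormal basis `V` and the example composition are illustrative choices, not necessarily those of the numbered relations above.

```python
import numpy as np


def clr(x):
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()


def alr(x):
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])


def ilr_basis(D):
    """Orthonormal basis of the clr hyperplane (one of many valid choices)."""
    V = np.zeros((D, D - 1))
    for i in range(1, D):
        V[:i, i - 1] = 1.0 / i
        V[i, i - 1] = -1.0
        V[:, i - 1] *= np.sqrt(i / (i + 1.0))
    return V


x = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical 4-part composition
D = x.size
d = D - 1

F = np.hstack([np.eye(d), -np.ones((d, 1))])   # assumed d x D matrix [I_d : -1_d]
V = ilr_basis(D)
ilr_x = V.T @ clr(x)

print(np.allclose(alr(x), F @ clr(x)))       # alr from clr
print(np.allclose(clr(x), V @ ilr_x))        # clr from ilr
print(np.allclose(alr(x), F @ V @ ilr_x))    # alr from ilr
```

All three checks rely only on matrix multiplication, which is what makes results obtained in one coordinate system transferable to the others.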

4. Conclusions

The Box-Cox transformation has been widely used since it was first proposed. It has inspired a large amount of research on its applicability as well as on the drawbacks arising from its usage. One thing is clear, however: the transformation seldom fulfills the basic assumptions of linearity, normality and homoscedasticity simultaneously, as originally suggested by [28]. A review of alternative approaches is presented, with modifications and illustrations useful for the analysis of compositional time series data.

References

[1]  Silva, D.B.N. and Smith, T.M.F. 2001, Modeling compositional time series from repeated surveys. Survey Methodology, 27, 205-215.
[2]  Steel, D.G. and McLaren, C. 2008, Design and analysis of repeated surveys. Centre for Statistical and Survey Methodology, University of Wollongong, Working Paper Series, 11-08, 13p. http://ro.uow.edu.au/cssmwp/10.
[3]  Holt, D. and Skinner, C.J. 1989, Components of change in repeated surveys. International Statistical Review, 57, 1-18.
[4]  Duncan, G.J. and Kalton, G. 1987, Issues of design and analysis of surveys across time. International Statistical Review, 55, 97-117.
[5]  Kalton, G. and Citro, C.F. 1993, Panel surveys: adding the fourth dimension. Survey Methodology, 19, 205-215.
[6]  Steel, D.G. 2004, Sampling in time. Encyclopaedia of Social Measurement. Academic Press, 823-832.
[7]  Kasprzyk, D., Duncan, G., Kalton, G. and Singh, M.P. 1989, Panel surveys. John Wiley and Sons, New York.
[8]  Smith, T.M.F. 1978, Principles and problems in the analysis of repeated surveys. Survey Sampling and Measurement, Ed. N.K. Namboodiri, Academic Press, New York.
[9]  Binder, D.A. and Hidiroglou, M.A. 1988, Sampling in time. In Handbook of Statistics, Vol. 6 (Eds. P.R. Krishnaiah and C.R. Rao). Elsevier Science, 187-211.
[10]  Fuller, W.A. 1990, Analysis of repeated surveys. Survey Methodology, 16, 167-180.
[11]  Aguilar Zuil, L., Barcelo-Vidal, C. and Larrosa, J.M. 2007, Compositional time series analysis: a review. In Proceedings of the 56th Session of the ISI (ISI 2007), Lisbon, August 22-29.
[12]  Brunsdon, T.M. 1987, The time series analysis of compositional data. Ph.D. Thesis, University of Southampton, U.K.
[13]  Smith, T.M.F. and Brunsdon, T.M. 1989, The time series analysis of compositional data. Proceedings of the American Statistical Association, 26-32.
[14]  Brunsdon, T.M. and Smith, T.M.F. 1998, The time series analysis of compositional data. Journal of Official Statistics, 14(3), 237-252.
[15]  Blight, B.J.N. and Scott, A.J. 1973, A stochastic model for repeated surveys. Journal of the Royal Statistical Society B: Methodological, 35, 61-68.
[16]  Scott, A.J. and Smith, T.M.F. 1974, Analysis of repeated surveys using time series methods. Journal of the American Statistical Association, 69, 674-678.
[17]  Scott, A.J., Smith, T.M.F. and Jones, R.G. 1977, The application of time series methods to the analysis of repeated surveys. International Statistical Review, 45, 13-28.
[18]  Tiao, G.C. and Box, G.E.P. 1981, Modelling multiple time series with applications. Journal of the American Statistical Association, 76, 802-816.
[19]  Wallis, K.F. 1987, Time series analysis of bounded economic variables. Journal of Time Series Analysis, 8, 115-123.
[20]  Quintana, J.M. and West, M. 1988, The time series analysis of compositional data. Journal of Bayesian Statistics, 3, 747-756.
[21]  Grunwald, G.K. 1987, Time series models for continuous proportions. Ph.D. Thesis, University of Washington.
[22]  Grunwald, G.K., Raftery, A.E. and Guttorp, P. 1993, Time series models for continuous proportions. Journal of the Royal Statistical Society B, 55, 103-116.
[23]  Pearson, K. 1897, Mathematical contributions to the theory of evolution: On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proceedings of the Royal Society of London, LX, 489-498.
[24]  Aitchison, J. 1986, The statistical analysis of compositional data. Chapman and Hall, London.
[25]  Bhaumik, A., Dey, D.K. and Ravishanker, N. 2003, A dynamic linear model approach for compositional time series analysis. Technical Report, University of Connecticut.
[26]  Ravishanker, N., Dey, D.K. and Iyengar, M. 2001, Compositional time series analysis of mortality proportions. Communications in Statistics - Theory and Methods, 30(11), 2281-2291.
[27]  Tukey, J.W. 1957, On the comparative anatomy of transformations. Annals of Mathematical Statistics, 28, 602-632.
[28]  Box, G.E.P. and Cox, D.R. 1964, An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211-252.
[29]  John, J.A. and Draper, N.R. 1980, An alternative family of transformations. Applied Statistics, 29, 190-197.
[30]  Manly, B.F. 1976, Exponential data transformations. The Statistician, 25, 37-42.
[31]  Bickel, P.J. and Doksum, K.A. 1981, An analysis of transformations revisited. Journal of the American Statistical Association, 76, 296-311.