International Journal of Statistics and Applications
p-ISSN: 2168-5193 e-ISSN: 2168-5215
2022; 12(3): 77-82
doi:10.5923/j.statistics.20221203.03
Received: Aug. 2, 2022; Accepted: Aug. 17, 2022; Published: Aug. 30, 2022

Wilson da C. Vieira, José A. Ferreira Neto, Mariane P. B. Roque, Bianca D. da Rocha
Department of Agricultural Economics, Federal University of Viçosa, Brazil
Correspondence to: Wilson da C. Vieira, Department of Agricultural Economics, Federal University of Viçosa, Brazil.
Copyright © 2022 The Author(s). Published by Scientific & Academic Publishing.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

This article presents a simple and effective procedure for the construction of socioeconomic status indices using principal component analysis. The methodological approach consists of obtaining principal components of the correlation matrix from a sample of random variables. For the calculation of the index, a weighted average of selected principal components is used. The proposed method is sufficiently general and can be applied to obtain other types of composite indices. To illustrate the versatility of the method, we provide in this article the calculation of a social vulnerability index for the municipalities of an area of the São Francisco river basin, Brazil, based on data from the demographic census.
Keywords: Socioeconomic status index, Principal component analysis, Methodology
Cite this paper: Wilson da C. Vieira, José A. Ferreira Neto, Mariane P. B. Roque, Bianca D. da Rocha, Using Principal Component Analysis to Build Socioeconomic Status Indices, International Journal of Statistics and Applications, Vol. 12 No. 3, 2022, pp. 77-82. doi: 10.5923/j.statistics.20221203.03.
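$x = (x_1, x_2, \ldots, x_n)'$ represents a set of n random variables with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_n)'$ and variance-covariance matrix $\Sigma = [\sigma_{ij}]$. Let $z = (z_1, z_2, \ldots, z_n)'$ be the random vector of the corresponding standardized variables, that is,
$$z_i = \frac{x_i - \mu_i}{\sqrt{\sigma_{ii}}}, \quad i = 1, \ldots, n,$$
where $\sigma_{ii}$ represents the variance of variable $x_i$. Note that the covariance between variables $z_i$ and $z_j$ is related to the covariance between variables $x_i$ and $x_j$ as follows
$$\mathrm{Cov}(z_i, z_j) = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}},$$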
that is, the variance-covariance matrix of z corresponds to the correlation matrix of x. In this article, the correlation matrix of x will be denoted by C.

Although the principal components can be obtained from either the variance-covariance matrix of x or the correlation matrix of x, the two sets of components are not necessarily the same. This implies that the interpretation of the results must take into account the choice of the matrix that will be used to extract the principal components. [10] recommend using the correlation matrix to extract principal components when the scales of the variables vary widely or when the variables have very different variances. In this article, the analysis will be carried out with the correlation matrix, since the variables generally used to obtain socioeconomic status indices are diverse and have very different variances.

In this sense, the principal components
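$y_1, y_2, \ldots, y_n$ are associated with the random vector z, such that
$$y_j = a_j'z = a_{j1}z_1 + a_{j2}z_2 + \cdots + a_{jn}z_n, \quad j = 1, \ldots, n,$$
where $a_{j1}, a_{j2}, \ldots, a_{jn}$ are constants that satisfy certain conditions. It can be shown that the mean of $y_j$ is equal to zero, $E(y_j) = 0$, and its variance is given by
$$\mathrm{Var}(y_j) = a_j'Ca_j,$$
where $a_j = (a_{j1}, a_{j2}, \ldots, a_{jn})'$. The principal components are obtained sequentially: first, $y_1 = a_1'z$ is selected to capture as much of the variation in the original data as possible amongst all linear combinations of z such that $a_1'a_1 = 1$. Then $y_2 = a_2'z$ is selected to account for a maximum proportion of the remaining variance subject to not being correlated with the first principal component, $a_2'a_1 = 0$, and $a_2'a_2 = 1$. Subsequent principal components are obtained in a similar manner. Formally, the jth principal component is the linear combination $y_j = a_j'z$ that has the greatest variance subject to the following conditions
$$a_j'a_j = 1 \quad \text{and} \quad a_j'a_k = 0 \quad (k < j).$$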
As it is an optimization problem with equality constraints, the Lagrange method can be used to obtain the solution (see, for example, [7]). The results of applying this method show that the vector of coefficients that defines the jth principal component,
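$a_j$, is the eigenvector of the matrix C associated with its jth largest eigenvalue. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ be the n eigenvalues of C. It can be shown that
$$\mathrm{Var}(y_j) = \lambda_j,$$
that is, the variance of the jth principal component is equal to the eigenvalue $\lambda_j$. It can also be shown that
$$\sum_{j=1}^{n} \mathrm{Var}(y_j) = \sum_{j=1}^{n} \lambda_j = \mathrm{tr}(C) = n.$$
Thus, the proportion of the total variance of the standardized variables explained by the jth principal component is given by
$$\frac{\lambda_j}{n},$$
and the percentage of the total variance explained by the m first principal components, $m \leq n$, is given by
$$\frac{\sum_{j=1}^{m} \lambda_j}{n} \times 100.$$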
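The eigenvalue results stated here can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it uses NumPy, the function name `pca_from_correlation` is hypothetical, and the data are randomly generated rather than taken from the article. It standardizes the variables, extracts the eigenvalues and eigenvectors of the correlation matrix, and computes the proportion of variance explained by each component.

```python
import numpy as np

def pca_from_correlation(X):
    """PCA of the correlation matrix of X (rows = observations, columns = variables)."""
    X = np.asarray(X, dtype=float)
    # Standardize each variable: z = (x - mean) / standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The covariance matrix of z is the correlation matrix C of x.
    C = np.corrcoef(X, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]   # reorder from largest to smallest
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]   # column j holds the coefficients a_j
    scores = Z @ eigenvectors               # scores[l, j] = a_j' z_l
    return eigenvalues, eigenvectors, scores

# Hypothetical data: 100 observations of 4 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

eigenvalues, eigenvectors, scores = pca_from_correlation(X)
explained = eigenvalues / eigenvalues.sum()    # proportion explained by each component
cumulative_pct = 100 * np.cumsum(explained)    # percentage explained by the first m components
```

In this sketch the sample variance of each column of `scores` equals the corresponding eigenvalue, and the eigenvalues sum to the number of variables, which mirrors the two properties used in the text.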
According to [11], “applied principal component analysis consists most often of a mere computation of eigenvectors and eigenvalues of a sample covariance matrix or correlation matrix” (p. 606). That’s largely what we are going to do in this article. To start, we summarize the main results of the principal component analysis related to eigenvalues and eigenvectors that will be useful for the construction of socioeconomic status indices in the following properties:
As mentioned in the introduction, some authors consider only the first principal component, $y_1$, as a socioeconomic status index and others consider the first two principal components, $y_1$ and $y_2$, but as two distinct indices. In this article we propose the construction of a socioeconomic status index as a weighted average of the first m, $m \leq n$, principal components. The idea behind this proposal is that the first few principal components will represent a substantial proportion of the variation in the original variables and can therefore be used to provide a convenient lower-dimensional summary of these variables.

In this sense, we can construct a socioeconomic status index (SSI) as a linear combination of all the principal components as follows
$$SSI = b'y = b_1y_1 + b_2y_2 + \cdots + b_ny_n,$$
where the weight vector, $b = (b_1, b_2, \ldots, b_n)'$, with $\sum_{j=1}^{n} b_j = 1$, is given by

[Equation (1)]
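If the first m principal components satisfy the criterion of [8], then the socioeconomic status index is given by
$$SSI = b_1^{*}y_1 + b_2^{*}y_2 + \cdots + b_m^{*}y_m,$$
where $b^{*} = (b_1^{*}, b_2^{*}, \ldots, b_m^{*})'$ represents the vector of corrected weights, such that $\sum_{j=1}^{m} b_j^{*} = 1$. Note that if we were to use the original weights $b_1, b_2, \ldots, b_m$ we would have
$$SSI = b_1y_1 + b_2y_2 + \cdots + b_my_m,$$
a result whose sum of weights is not equal to 1. To correct the weights, we use the following expression:
$$b_j^{*} = \frac{b_j}{\sum_{k=1}^{m} b_k}, \quad j = 1, \ldots, m.$$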
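A small numerical sketch may help to fix the idea of correcting the weights. It assumes that the correction divides each retained weight by the sum of the retained weights, so that the corrected weights sum to 1. The function name `correct_weights` and the numbers are hypothetical and are not taken from the article.

```python
def correct_weights(weights, m):
    """Rescale the first m weights so that they sum to 1."""
    selected = list(weights[:m])
    total = sum(selected)
    if total == 0:
        raise ValueError("the selected weights sum to zero and cannot be rescaled")
    return [w / total for w in selected]

# Hypothetical original weights for n = 5 components (they sum to 1).
original = [0.40, 0.25, 0.15, 0.12, 0.08]

# Keep the first three components; their original weights sum to about 0.80.
corrected = correct_weights(original, m=3)
```

With these illustrative values, the first corrected weight is close to 0.40 / 0.80 = 0.5, and the three corrected weights add up to 1 (up to floating-point rounding).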
This correction of the weights is fundamental to obtain a socioeconomic status index as a weighted average of the first principal components. To illustrate this correction of weights, suppose that the first three principal components were selected to construct a socioeconomic status index and that the original weights are $b_1$, $b_2$ and $b_3$. Applying the correction formula, knowing that $b_1 + b_2 + b_3 \neq 1$, we have
$$b_j^{*} = \frac{b_j}{b_1 + b_2 + b_3}, \quad j = 1, 2, 3.$$
Note that, after correction, we have $b_1^{*} + b_2^{*} + b_3^{*} = 1$.

It is important to keep in mind that this procedure for constructing a socioeconomic status index must take into account all variables and all observations of each variable. Suppose you want to build a socioeconomic status index from a sample of 10 variables
$(x_1, x_2, \ldots, x_{10})$ and each variable contains 100 observations $(l = 1, 2, \ldots, 100)$. In this case, the principal components of each observation are calculated, that is,
$$y_{jl} = a_{j1}z_{1l} + a_{j2}z_{2l} + \cdots + a_{j,10}z_{10,l}, \quad j = 1, \ldots, 10, \quad l = 1, \ldots, 100,$$
where $y_{jl}$ represents the jth principal component of the lth observation. After calculating the 10 principal components associated with each of the 100 observations, the weights $b_1, b_2, \ldots, b_{10}$ corresponding to the sample of variables can be obtained (see the expression for determining the vector of weights b above). Suppose further that three principal components were selected to build the socioeconomic status index. In this case, this index is constructed for each observation of the sample of variables as follows:
$$SSI_l = b_1^{*}y_{1l} + b_2^{*}y_{2l} + b_3^{*}y_{3l}, \quad l = 1, \ldots, 100,$$
where $b_j^{*}$ represents the corrected weight, and $y_{jl}$ denotes the principal component j associated with observation l of the sample of variables. After carrying out all the calculations, one obtains, as a result, an interval composed of 100 socioeconomic status indices (one index for each observation of the sample of variables) that can be divided equally or using some other criterion to form the socioeconomic status classes (levels). The number of classes is defined according to the purpose of the study or to meet public policy interests. Commonly used arbitrary cut-off points classify the lowest 40% of individuals as 'poor', the highest 20% as 'rich' and the remainder as the 'average' group (see, for example, [12]).

To avoid a possible negative component in the weight vector, b, another possibility to define weights is to use the expression (2) given in the following proposition.

Proposition. If the
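$n \times n$ correlation matrix $C$ is positive definite and the eigenvectors associated with C are such that $a_j'a_j = 1$, $j = 1, \ldots, n$, and $a_j'a_k = 0$, $j \neq k$, then the weight vector b given by

[Equation (2)]

satisfies $b_j > 0$, $j = 1, \ldots, n$, and $\sum_{j=1}^{n} b_j = 1$.

Proof. From the expression $Ca_j = \lambda_j a_j$, we have $a_j'Ca_j = \lambda_j a_j'a_j = \lambda_j$, $j = 1, \ldots, n$, and the eigenvalues are given by $\lambda_j = a_j'Ca_j > 0$. Substituting the expression for $\lambda_j$ in (2), we have $b_j > 0$. Since $j$ is arbitrary, the proof is complete.

If we use expression (2) to define the weight vector b, no correction is needed if $m = n$, where $m$ is the number of principal components used to build the composite index. In this case,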
$\sum_{j=1}^{n} b_j = 1$.

Standard statistical software (such as STATA or SPSS) can be used to perform the necessary calculations and build composite indices. In the illustration of the use of the proposed method in the next section, the statistical analysis was performed with the R software ([13]) and ArcGis® ([14]) was used to spatialize the results (vulnerability classes) of the study area.

Figure 1. São Francisco River basin and study area

Figure 2. Social vulnerability for the municipalities of an area of the São Francisco River basin
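As a rough illustration of how an index value is obtained for each observation and then grouped into classes, the following Python sketch computes a weighted average of selected principal-component scores and applies the 40% / 20% cut-off rule mentioned in the methodology. The function names, the scores and the weights are hypothetical, and the handling of ties and of rounding at the cut-off points is an assumption of this sketch, not something specified in the article.

```python
def composite_index(scores, corrected_weights):
    """Weighted average of the selected component scores, one value per observation."""
    return [sum(w * y for w, y in zip(corrected_weights, row)) for row in scores]

def classify(index_values, low_share=0.40, high_share=0.20):
    """Label the lowest share 'poor', the highest share 'rich', the rest 'average'."""
    n = len(index_values)
    # Positions of the observations, ordered from the lowest to the highest index value.
    ranked = sorted(range(n), key=lambda l: index_values[l])
    n_low = round(low_share * n)
    n_high = round(high_share * n)
    labels = ["average"] * n
    for position in ranked[:n_low]:
        labels[position] = "poor"
    for position in ranked[n - n_high:]:
        labels[position] = "rich"
    return labels

# Hypothetical scores on three selected components for five observations.
scores = [
    [1.2, -0.3, 0.1],
    [-0.8, 0.5, -0.2],
    [0.1, 0.0, 0.4],
    [-1.5, -0.6, 0.0],
    [2.0, 0.9, -0.5],
]
corrected_weights = [0.5, 0.3, 0.2]  # hypothetical, already summing to 1

index_values = composite_index(scores, corrected_weights)
labels = classify(index_values)
```

With these made-up inputs, the two observations with the lowest index values are labelled 'poor', the one with the highest value is labelled 'rich', and the remaining two fall in the 'average' group.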