American Journal of Mathematics and Statistics
p-ISSN: 2162-948X e-ISSN: 2162-8475
2023; 13(1): 44-59
doi:10.5923/j.ajms.20231301.02
Received: Oct. 6, 2022; Accepted: Oct. 30, 2022; Published: Apr. 15, 2023

1Department of Medicine, McMaster University, Hamilton, ON L8N 3Z5 Canada
2Department of Mathematics and Statistics, University of Windsor, Windsor, ON N9B 3P4 Canada
Correspondence to: Rajibul Mian, Department of Medicine, McMaster University, Hamilton, ON L8N 3Z5 Canada.
Copyright © 2023 The Author(s). Published by Scientific & Academic Publishing.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

Regression analysis of count response data in which information on some covariates is missing may arise in practical applications. Further complications, such as over-dispersion and zero-inflation in the count responses, may also arise. In this paper we develop an estimation procedure for the parameters of a zero-inflated over/under-dispersed count response model in the presence of missing covariates. A zero-inflated negative binomial model with missing covariate information is used. Obtaining maximum likelihood estimates by direct use of the log-likelihood involves multiple numerical integration. To avoid this we develop a weighted expectation maximization algorithm. A simulation study is conducted to investigate the properties of the estimates in terms of bias, variance, mean squared error (MSE) and coverage probability (CP). Further simulations are conducted to study the robustness of the procedure for count data following other over-dispersed models, such as the log-normal mixture of the Poisson distribution. An example and a discussion are given.
Keywords: Count Data, EM Algorithm, Missing Covariate Information, Over dispersion, Regression model, Zero inflation
Cite this paper: Rajibul Mian, Sudhir Paul, Handling Missing Values in Covariate for Modeling Count Data with over Dispersion and Zero Inflation, American Journal of Mathematics and Statistics, Vol. 13 No. 1, 2023, pp. 44-59. doi: 10.5923/j.ajms.20231301.02.

showing that the data are also zero-inflated under a Poisson model.

Regression analysis of count data may be further complicated by missing values in the response variable, in the explanatory variables (covariates), or in both. Extensive work has been done on regression analysis of continuous response data with some missing covariates under the normality assumption. See, for example, Rubin (1977), Little and Rubin (1987, 2002, 2014), Lipsitz and Ibrahim (1996), Ibrahim, Chen and Lipsitz (1999), Ibrahim, Chen, Lipsitz and Herring (2005), Sinha and Maiti (2007), Maiti and Pradhan (2009).

Some work on missing values has also been done on logistic regression analysis of binary data. See, for example, Ibrahim (1990), Lipsitz and Ibrahim (1996), Ibrahim and Lipsitz (1996), Ibrahim, Chen and Lipsitz (1999), Ibrahim, Chen and Lipsitz (2001), Sinha and Maiti (2007), Maiti and Pradhan (2009).

Rubin (1977) and Little and Rubin (1987, 2002, 2014) discuss various missingness mechanisms. If the missingness depends on neither the observed nor the unobserved data, then the missing data are called missing completely at random (MCAR). If the missing data mechanism depends only on observed data, then the data are missing at random (MAR). MAR is also known as ignorable missingness; that is, in this case the missing data mechanism can be ignored. If the missing data mechanism depends on both observed and unobserved data, that is, failure to observe a value depends on the value that would have been observed, then the data are said to be missing not at random (MNAR), in which case the missingness is nonignorable. For a more detailed discussion of missing data mechanisms see Ibrahim et al. (2005).

The purpose of this paper is to develop an estimation procedure for the parameters of a zero-inflated negative binomial (ZINB) model for count data when information on some covariates is missing for some individuals.

The problem of missing responses in the ZINB model was dealt with earlier by the same authors in Mian and Paul (2016), which motivated this research.

Obtaining maximum likelihood estimates by direct use of the log-likelihood involves multiple numerical integration. To avoid this we develop a weighted expectation maximization algorithm following Ibrahim (1990). A simulation study is conducted to investigate the properties of the estimates in terms of bias, variance, mean squared error (MSE) and coverage probability (CP). Further simulations are conducted to study the robustness of the procedure for count data following other over-dispersed models, such as the log-normal mixture of the Poisson distribution. The method is illustrated using the dental epidemiology data of Bohning et al. (1999) discussed above.

The procedure for the estimation of the parameters is developed in Section 2. Results of a simulation study are reported in Section 3. The illustrative example is given in Section 4 and a discussion leading to some conclusions is given in Section 5.

![]() | (1) |
and
where
is the zero-inflation parameter. We denote this distribution by
distribution.Suppose that data for the
of n subjects are
which are realizations from
where
represents the response variable and
represents a
vector of covariates with the regression parameter 
such that
Here
is the intercept parameter in which case
for all
![]() | (2) |
the log likelihood, apart from a constant, can be written as ![]() | (3) |
can be estimated by directly maximizing the log-likelihood function (3) or by simultaneously solving the estimating equations given in Appendix A.
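As an illustration of the direct-maximization route for complete data, the following Python sketch codes a ZINB log-likelihood and maximizes it numerically. It is not the authors' implementation. The parameterization (negative binomial mean m with a log link, dispersion c with variance m + c m^2, zero-inflation omega), the function names, and the restriction to c > 0 and 0 < omega < 1 through log and logit transforms are all assumptions made for this sketch.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import nbinom


def zinb_logpmf(y, m, c, omega):
    """Log pmf of a zero-inflated NB (assumed form: mean m, dispersion c)."""
    size = 1.0 / c                   # NB "number of successes" parameter
    prob = 1.0 / (1.0 + c * m)       # gives NB mean m and variance m + c*m^2
    nb_log = nbinom.logpmf(y, size, prob)
    # A zero can be a structural zero (prob. omega) or an NB zero.
    zero_log = np.logaddexp(np.log(omega), np.log1p(-omega) + nb_log)
    return np.where(y == 0, zero_log, np.log1p(-omega) + nb_log)


def neg_loglik(theta, y, X):
    """Negative log-likelihood; theta = (beta, log c, logit omega)."""
    p = X.shape[1]
    beta, c, omega = theta[:p], np.exp(theta[p]), expit(theta[p + 1])
    # Clipping the linear predictor guards against overflow in the line search.
    m = np.exp(np.clip(X @ beta, -30.0, 30.0))
    return -np.sum(zinb_logpmf(y, m, c, omega))


def fit_zinb(y, X):
    """Return (beta_hat, c_hat, omega_hat) from a quasi-Newton search."""
    p = X.shape[1]
    start = np.zeros(p + 2)
    start[0] = np.log(y.mean() + 0.5)  # crude starting value for the intercept
    res = minimize(neg_loglik, start, args=(y, X), method="BFGS")
    return res.x[:p], np.exp(res.x[p]), expit(res.x[p + 1])
```

Here `X` is assumed to carry a leading column of ones for the intercept. Working on the log and logit scales keeps the search unconstrained, at the cost of excluding under-dispersion (c < 0), which the model in the text allows.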
(the
covariate value of the
individual) ![]() | (4) |
given in equation (1), the log-likelihood of
is ![]() | (5) |
and
the observed and the missing values for the
covariate are involved in
Under MAR, the conditional probability that a covariate value is missing depends only on the observed values
Parameters of the missingness mechanism are completely separate and distinct from the parameters of the model (1). In likelihood-based estimation under MAR, the missingness mechanism can be dropped from the likelihood, and covariate data that are missing at random are therefore often called ignorably missing; however, the subjects with these missing covariates cannot be deleted before the analysis (see Little and Rubin, 1987, 2002, 2014 and Ibrahim, Chen, Lipsitz and Herring, 2005 for a detailed discussion).

Following Ibrahim (1990) and Lipsitz and Ibrahim (1996), we treat the covariates
as random variables with a finite set of parameters
These covariates
have distribution that can be expressed by one dimensional conditional distribution as
where
more specifically ![]() | (6) |
have some missing observations and these missing observations are missing at random. For this covariate
the probability that an observation is missing (conditional on the response y and the other completely observed covariates) does not depend on the missing covariate itself or on any other covariate with missing values, but may depend on the response and on the completely observed covariates. This flexibility of MAR helps during estimation. Only when this probability depends on the response as well as on the completely observed covariates does
need to be incorporated into the main likelihood
. Note that in practical regression problems the covariates are usually only weakly dependent on each other; otherwise, multicollinearity problems can be handled with other statistical tools.

To incorporate this covariate distribution with the complete data log-likelihood of
following Ibrahim (1990), we specify the joint distribution of
by using the conditional distribution of
and the marginal distribution of
that is
. Following Ibrahim (1990) we consider,
are independent and
are independently
and identically distributed for all
observations. Considering this, our likelihood becomes ![]() | (7) |
and the covariates model
are separate and distinct. This facilitates separate maximization of the two parts of the likelihood. Moreover, the covariates
can be discrete, continuous, or a mixture of both. Not all the covariates in
need have missing observations; in that case, the distributions of the completely observed covariates can be ignored (a detailed discussion is available in Lipsitz and Ibrahim (1996) and Ibrahim et al. (2005)).

In this scenario, our main goal is to estimate the parameters of the count data model
by maximizing the following log-likelihood (Little and Rubin, 1987, 2002, 2014 p.89) with respect to the parameters
![]() | (8) |
![]() | (9) |
is not, in general, straightforward. However, the EM algorithm (Dempster, Laird and Rubin, 1977) is a very useful tool for obtaining maximum likelihood estimates with missing observations.

The EM algorithm uses two iterative steps known as the expectation-step (E-step) and the maximization-step (M-step). Following Little and Rubin (1987, 2002, 2014), the E-step provides the conditional expectation of the log-likelihood
given the observed data
and current estimate of the parameters
Suppose we have a covariate with missing observations, where
of the
observations of the covariate are observed and
observations are missing. Let
be an arbitrary iteration number during maximization of the log-likelihood. Then the E-step of the EM algorithm for the
observation of the missing covariate for
iteration can be written as ![]() | (10) |
become ![]() | (11) |
iteration is ![]() | (12) |
become ![]() | (13) |
is the conditional distribution of the missing covariate given the observed data and the current
estimate of
However, in many situations,
may not always be available. Following Ibrahim, Chen, Lipsitz and Herring (2005) and Sahu and Roberts (1999), we can write
where
is the complete data distribution given in (1),
is the distribution of the covariates that have missing values; both have simple forms. For the
of the B missing observations of the covariate we take a sample
from
using the Gibbs sampler (see Casella and George, 1992 for details). Then, following Ibrahim, Chen and Lipsitz (1999) and Ibrahim, Chen, Lipsitz and Herring (2005),
can be written as ![]() | (14) |
is maximized. Here maximizing
is analogous to maximizing the complete data log-likelihood, with each incomplete covariate replaced by
weighted observations. More details of the EM algorithm by the method of weights can be found in Ibrahim (1990); Lipsitz and Ibrahim (1996a,b); Ibrahim, Chen and Lipsitz (1999, 2001); Ibrahim, Chen, Lipsitz and Herring (2005); Sinha and Maiti (2007); and Maiti and Pradhan (2009).

The variance-covariance matrix of the parameter estimates is calculated by inverting the observed information matrix at convergence (Efron and Hinkley, 1978), which is
![]() | (15) |
above are given in the Appendix.
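For a covariate with finite support, the E-step weights in the method of weights (Ibrahim, 1990) are the posterior probabilities of each candidate covariate value given the observed response and the current parameter estimates. The Python sketch below illustrates this for a single binary covariate. It is a simplified illustration, not the paper's code: the NB parameterization (mean m, dispersion c), the Bernoulli covariate model with parameter `pi`, and all function names are assumptions. The continuous case, which the text handles by sampling with the Gibbs sampler, is not covered.

```python
import numpy as np
from scipy.stats import nbinom


def zinb_pmf(y, m, c, omega):
    """ZINB pmf under the assumed (mean m, dispersion c) parameterization."""
    nb = nbinom.pmf(y, 1.0 / c, 1.0 / (1.0 + c * m))
    return np.where(y == 0, omega + (1 - omega) * nb, (1 - omega) * nb)


def e_step_weights(y_mis, beta, c, omega, pi):
    """Weights over x in {0, 1} for subjects whose binary covariate is missing.

    beta = (intercept, slope); pi = current estimate of P(x = 1).
    Returns an array of shape (len(y_mis), 2) whose rows sum to 1.
    """
    y_mis = np.asarray(y_mis)
    support = np.array([0.0, 1.0])
    m = np.exp(beta[0] + beta[1] * support)    # mean under each candidate x
    lik = zinb_pmf(y_mis[:, None], m[None, :], c, omega)
    prior = np.array([1.0 - pi, pi])
    joint = lik * prior                        # p(y | x) * p(x)
    return joint / joint.sum(axis=1, keepdims=True)


def update_pi(x_obs, weights):
    """M-step for the Bernoulli covariate model: weighted proportion of ones."""
    return (np.sum(x_obs) + weights[:, 1].sum()) / (len(x_obs) + weights.shape[0])
```

In the M-step for the count model, each subject with a missing covariate would then contribute two pseudo-observations, (y, x = 0) and (y, x = 1), weighted by the corresponding row of the weight matrix.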
be a random vector of missingness indicators for the
covariate. ![]() | (16) |
![]() | (17) |
are the indexing parameters for the conditional distribution of
The highest value of
can be
Logistic regression is a popular choice for the one-dimensional distribution of
![]() | (18) |
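To make the role of a logistic missingness model concrete, the following Python sketch generates missingness indicators under the three mechanisms described earlier (MCAR, MAR, MNAR). The coefficient vector `phi`, the choice of linear predictors, and the function name are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np
from scipy.special import expit


def simulate_missingness(x, y, phi, rng, mechanism="MAR"):
    """Blank out covariate values under a logistic missingness model.

    phi = (phi0, phi_y, phi_x) are hypothetical logistic coefficients.
    Returns (x_with_nan, r), where r = 1 marks a missing value.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    phi0, phi_y, phi_x = phi
    if mechanism == "MCAR":
        eta = np.full(x.shape, phi0)           # depends on nothing
    elif mechanism == "MAR":
        eta = phi0 + phi_y * y                 # depends on observed response only
    elif mechanism == "MNAR":
        eta = phi0 + phi_y * y + phi_x * x     # depends on the value being hidden
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    r = (rng.random(x.shape) < expit(eta)).astype(int)
    return np.where(r == 1, np.nan, x), r
```

For example, `phi = (np.log(0.1 / 0.9), 0.0, 0.0)` with `mechanism="MCAR"` removes roughly 10% of the covariate values.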
Note that the choice of variables for the model of
is important. Often many variables in this model are not significant and, more importantly, the parameters in the model for
are not of primary interest for estimation. A detailed discussion can be found in Ibrahim, Lipsitz and Chen (1999) and Ibrahim, Chen and Lipsitz (2001).

Following Ibrahim, Lipsitz and Chen (1999), after incorporating the model for the missingness mechanism
the data log-likelihood becomes
![]() | (19) |
where 
and
Note that
is the intercept parameter, hence
The explanatory variable
was generated from
when the covariate is considered continuous, and from
in the case of a discrete covariate. We consider 5%, 10% and 25% missing observations in the explanatory variable. For the empirical coverage probability we take the nominal level
Simulation results for the continuous covariate are given in Table 1, whereas results for the discrete covariate are in Table 2.
![]() | Table 1. Properties (estimate, bias, variance, MSE, coverage probability (CP)) of the estimates of the parameters, data simulated from NB based on 5000 simulation runs (continuous covariate) |
![]() | Table 2. Properties (estimate, bias, variance, MSE, coverage probability (CP)) of the estimates of the parameters, data simulated from NB based on 5000 simulation runs (discrete covariate) |
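The summary measures reported in such tables can be computed from the replicate estimates as in the following Python sketch. The function name, the use of Wald-type intervals, and the default 95% level are assumptions made for illustration; the source does not state how the intervals were formed.

```python
import numpy as np
from scipy.stats import norm


def summarize(estimates, std_errors, truth, level=0.95):
    """Bias, variance, MSE and empirical coverage over simulation runs."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    z = norm.ppf(0.5 + level / 2.0)            # e.g. 1.96 for level = 0.95
    covered = np.abs(est - truth) <= z * se    # Wald interval contains the truth
    return {
        "estimate": est.mean(),
        "bias": est.mean() - truth,
        "variance": est.var(ddof=1),
        "mse": np.mean((est - truth) ** 2),
        "cp": covered.mean(),
    }
```

Each call summarizes one parameter; `estimates` and `std_errors` hold one entry per simulation run.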
distribution. To see whether similar properties of the estimates hold when over-dispersed data are generated from a distribution other than the
distribution, we consider one that has been used earlier by others (Lawless, 1987; Paul and Banerjee, 1998): the log-normal
mixture of the Poisson distribution with 
and
, where
and
are the parameters of the
. When there are covariates we take
. For more details on generating data from the log-normal mixture of the Poisson distribution see Lawless (1987).

The parameter values used to simulate data from the zero-inflated log-normal mixture of the Poisson distribution were the same as those used to generate data from the zero-inflated negative binomial distribution. We also used the same percentages of missing data as in the previous case.

Results of the simulation study for data from the zero-inflated log-normal mixture of the Poisson distribution are given in Table 3 and Table 4. The conclusions are very similar to those drawn from Table 1 and Table 2. This suggests that the results may remain similar irrespective of the mechanism by which the over-dispersed count data are generated.
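A data generator for this robustness check might look like the following Python sketch. It rests on an assumed parameterization, since the exact one used in the paper is not recoverable here: the log-normal mixing variable is scaled to have mean 1 and variance c, so that the count has mean m and variance m + c m^2 before zero-inflation, matching the first two moments of the negative binomial. The function name is hypothetical.

```python
import numpy as np


def rzi_lognormal_poisson(m, c, omega, rng):
    """Draw zero-inflated counts from a log-normal mixture of Poissons.

    m: array of means, c: dispersion (> 0), omega: zero-inflation probability.
    """
    m = np.asarray(m, dtype=float)
    sigma2 = np.log1p(c)                       # makes Var(nu) = exp(sigma2) - 1 = c
    nu = rng.lognormal(mean=-0.5 * sigma2, sigma=np.sqrt(sigma2), size=m.shape)
    y = rng.poisson(m * nu)                    # E[nu] = 1, so E[y] = m
    structural_zero = rng.random(m.shape) < omega
    return np.where(structural_zero, 0, y)
```

With covariates, `m` would be computed from the linear predictor before calling the function, for example `m = np.exp(X @ beta)` under a log link.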



, where
represents the intercept parameter and
represents the regression parameter for gender,
and
represent the regression parameters for the ethnic groups 1 and 2, and
and
represent the regression parameters for school 1, school 2, school 3, school 4, and school 5 respectively.

The estimates of the mean parameter
where 
the over dispersion parameter
and the zero inflation parameter
based on the zero-inflated negative binomial model, under different percentages of missingness, and their corresponding standard errors are presented in Table 5. Note that the estimates of the parameters
and
and the corresponding standard errors change with the amount of missingness in the covariate (this is expected, as it depends on which observations remain in the final data set). In general, the standard errors of the estimates are larger than those under complete data. However, the estimates of
do not vary much irrespective of the percentage missing and the missing data mechanism. The same comment applies to
although for
is a bit higher under MNAR.
![]() | Table 5. Estimates and standard errors of the parameters for DMFT data with covariates |
in the model (3)
![]() | (20) |

![]() | (21) |
![]() | (22) |
by using a Gibbs sequence. For example, the Gibbs sequence for
is
For large
According to Sahu and Roberts (1999)
can be considered as a block and can be obtained from
In this scenario, the samples for each missing response are considered as a block. For example, if there are 5 missing responses, then there are 5 blocks. Sahu and Roberts (1999) also mention that in most practical cases the missing observations are independent of the parameters and can be considered as a single block. In this case, the 5 missing observations can be treated as a single block. In our model, missing responses are independent of the parameters and hence we follow Sahu and Roberts (1999) for Gibbs sampling. We stop the sequence and take the required sample when the absolute change in the parameters between two consecutive steps becomes negligible. Extensive explanations of the Gibbs sampler are available in Casella and George (1992) and Sahu and Roberts (1999).
![]() | (23) |
is analogous to maximizing the complete data log-likelihood,
in (3), with each incomplete response replaced by
weighted observations. The elements of the observed information matrix are as given below.
![]() | (24) |
![]() | (25) |
![]() | (26) |
![]() | (27) |
![]() | (28) |
![]() | (29) |
![]() | (30) |
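Analytic elements of an observed information matrix can be cross-checked numerically. The Python sketch below approximates the Hessian of a negative log-likelihood by central finite differences and inverts it to obtain a variance-covariance matrix. It is a generic check under stated assumptions, not the authors' procedure: `nll` stands for any user-supplied negative log-likelihood function of the parameter vector, and the function names and step size `eps` are hypothetical choices.

```python
import numpy as np


def numerical_hessian(f, theta, eps=1e-4):
    """Central-difference approximation to the Hessian of f at theta."""
    theta = np.asarray(theta, dtype=float)
    k = theta.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(i, k):
            ei = np.zeros(k)
            ej = np.zeros(k)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = H[j, i] = (
                f(theta + ei + ej) - f(theta + ei - ej)
                - f(theta - ei + ej) + f(theta - ei - ej)
            ) / (4.0 * eps ** 2)
    return H


def vcov_from_nll(nll, theta_hat, eps=1e-4):
    """Invert the observed information (Hessian of the negative log-likelihood)."""
    return np.linalg.inv(numerical_hessian(nll, theta_hat, eps))
```

Standard errors would then be the square roots of the diagonal of the returned matrix, evaluated at the converged estimates.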