American Journal of Operational Research
2012; 2(2): 1-10
doi: 10.5923/j.ajor.20120202.01
Kamran Eftakhari 1, John Fontanesi 1, Gregory Feld 2, Daniel Bouland 3, Ajit B. Raisinghani 2, Kirk Knowlton 2
1School of Medicine, University of California, San Diego, Center for Management Science in Health, San Diego, CA 92093, USA
2School of Medicine, University of California, San Diego, Division of Cardiology, San Diego, CA 92103, USA
3School of Medicine, University of California, San Diego, Division of Hospital Medicine, San Diego, CA 92103, USA
Correspondence to: John Fontanesi, School of Medicine, University of California, San Diego, Center for Management Science in Health, San Diego, CA 92093, USA.
Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.
A new model for patient offset times (i.e., patient deviation from scheduled appointment time) is developed. In previous studies, offset times were mostly assumed to be sampled from a normal distribution. Alexopoulos et al.[1] offered Johnson SU as the most suitable fit. A thorough analysis of patient offset times, obtained from workflow observations in a broad sampling of ambulatory care sites, revealed that these assumptions are often not valid. Although Johnson SU is still largely acceptable, it is not the most stable fitted distribution for the observed data. Our study suggests that three distributions (Generalized Logistic, Johnson SU and Log-Logistic) are better suited to modeling patient offset times, with Hosking's[2] Generalized Logistic (GL) distribution the most stable in its estimated parameters. We also consider the uncertainty associated with computing the parameters of a Generalized Logistic distribution fitted to observed data. This model is central to devising efficient scheduling strategies that reduce patient waiting time and improve patient throughput and satisfaction.
Keywords: Stochastic Arrival Offsets
to model
often depends on experience from studies of similar experiments or on analysis of data.
Before attempting to fit a probability distribution to a set of observed data, it is worth first considering the properties of the variable in question. The properties of the distribution or distributions chosen to be fitted to the data should match those of the variable being modeled. As an example, the range of the variable should match that of the fitted distribution. Any interpretation of data requires subjective inputs, usually in the form of assumptions about the variable. The key assumption here is that the observed data are randomly sampled from the probability distribution we are attempting to identify. It is assumed that the observed data are both as reliable and as representative as possible; anomalies in the data were checked and unreliable data points discarded. We also paid attention to possible biases that could be produced by the method of data collection.
We look at techniques to interpret observed data for a variable in order to derive a distribution that realistically models its true variability and our uncertainty about that true variability. In this study, we first find the estimated parameters of the statistical distributions that best fit patient offset times. Second, we study the over-the-samples stability of these estimated parameters and choose, among the best-fit distributions, the one that exhibits the largest degree of stability in its estimated parameters. That is to say, if the entire body of data is a set $S$ of $n$ elements, then a sub-sample $s$ of size $m$ is a proper subset of the set $S$ (that is, $s \subset S$). Moreover, for the purpose of random sampling, the elements of the sub-sample $s$ are randomly chosen from the elements of the set $S$. If $m$ is sufficiently smaller than $n$, then from $S$ one can draw many sub-samples, $s_1, s_2, \ldots$. Any suitable statistical distribution can be fitted to the data in these samples to obtain its estimated parameters. Obviously, there will be sampling variations in the estimated parameters. If the sampling variations are within reasonable limits, the estimated parameters are stable over the sub-samples.
We have considered numerous distributions, such as Beta, Burr (4P), Cauchy, Chi-Squared (2P), Dagum (4P), Erlang (3P), Error, Error Function, Frechet (3P), Gamma (3P), Generalized Extreme Value, Generalized Gamma (4P), Generalized Logistic, Gumbel Min, Gumbel Max, Generalized Pareto, Hypersecant, Inv. Gaussian, Johnson-SU, Kumaraswamy, Laplace, Levy (2P), Log-Logistic (3P), Logistic, Normal, Pearson-5 (3P), Pearson-6 (4P), Pert, Rayleigh (2P), Weibull, and Wakeby. It is worth noting that all these distributions, except the Normal distribution, are either asymmetric or non-mesokurtic or both. We expect the best-fit distributions to be both skewed and non-mesokurtic. The goodness of fit of the distributions is measured by three statistics pertaining to the Kolmogorov-Smirnov (KS), Anderson-Darling (AD) and Chi-squared (CS) tests. In addition, three information criteria, SIC (Schwarz information criterion)[20], AICC (Akaike information criterion)[21-22], and HQIC (Hannan-Quinn information criterion)[23], are also used.
The remainder of this article is organized as follows. We briefly review some theoretical concepts in Section 2 to keep this article self-contained. Section 3 introduces the genesis and main features of the Generalized Logistic model and notes that the model is leptokurtic, with skewness and kurtosis governed by only one parameter. This section also provides some results concerning the Maximum Likelihood (ML) and Method of Moments (MOM) estimates of the Generalized Logistic parameters and its quantiles.
In Section 4, a simulation study is carried out to appraise the over-the-samples performance of the candidate distribution functions that best fit the patient offset times data. Section 5 explains fitting second-order distributions to observed data points and reports the results of applying the Generalized Logistic distribution to the available sampled data, with error estimation using the Generalized Bootstrap method.
where
.This can be approximated as:
, where
is the sample variance of the data.
If the data are i.i.d., the distribution of the Von Neumann ratio is very close to a normal distribution. One can reject the hypothesis of independence at level $\alpha$ when
where
is the
of the standard normal distribution. The value $\alpha$ is the user-specified type I error (a type I error is rejecting the null hypothesis when it is in fact valid). The p-value of this test is approximately
where
is the CDF of the standard normal distribution. The p-value of a test is the probability that a test statistic larger than the current one would be obtained if the hypothesized distribution were correct.
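The independence check described above can be sketched in Python. The sketch assumes the classical form of the von Neumann ratio (sum of squared successive differences divided by the sum of squared deviations from the mean) and its normal-theory mean 2 and variance $4(n-2)/(n^2-1)$; the exact statistic used in this study is not recoverable from the text, and the function name is illustrative only.

```python
import math

def von_neumann_test(x):
    """Return (ratio, z, two-sided p-value) for the classical von Neumann ratio.

    Assumption: the ratio is sum((x[i+1]-x[i])^2) / sum((x[i]-mean)^2), which
    under i.i.d. normal data has mean 2 and variance 4(n-2)/(n^2-1).
    """
    n = len(x)
    mean = sum(x) / n
    ss_diff = sum((x[i + 1] - x[i]) ** 2 for i in range(n - 1))
    ss_dev = sum((xi - mean) ** 2 for xi in x)
    ratio = ss_diff / ss_dev
    # Standardize the ratio and use the normal approximation
    z = (ratio - 2.0) / math.sqrt(4.0 * (n - 2) / (n * n - 1))
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # standard normal CDF at |z|
    return ratio, z, 2.0 * (1.0 - phi)

# A strongly trending (hence dependent) sequence gives a ratio far below 2
trend_ratio, trend_z, trend_p = von_neumann_test([float(i) for i in range(50)])
```

A small p-value leads to rejecting the hypothesis of independence at the chosen level $\alpha$.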
are the observed values of n i.i.d. random variables
, each
having a density function
identical to
.An estimator of
is some function of the random variables and thus may be written as
a notation that emphasizes that this estimator is itself a random variable.
Various criteria have been proposed for an estimator to satisfy; among these are unbiasedness, consistency, and low variance of
. It would also be desirable if
has, either exactly or approximately, a normal distribution since well-known properties of normal distribution can then be used.
in a statistical distribution model and is the most widely used. The theory of ML estimates has deep consequences for many fields in statistics (see[26]). Statistical properties of the MLE are also useful, as will be discussed later. The Maximum Likelihood Method considers $n$ independent observations $x_1, \ldots, x_n$ and studies the likelihood function $L(\theta)$, defined as the joint probability density of the observed dataset. The maximum likelihood estimators (MLEs) of a parametric distribution are the values of the parameters that maximize $L(\theta)$. Consider a probability distribution type defined by a parameter vector $\theta$. The likelihood function $L(\theta)$ of a set of $n$ data points $x_1, \ldots, x_n$ that could be generated from the distribution with probability density function $f(x; \theta)$ is:
$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
The MLE $\hat{\theta}$ is then the value of $\theta$ that maximizes $L(\theta)$:
$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$
or equivalently
$$\hat{\theta} = \arg\max_{\theta} \ln L(\theta)$$
In a majority of cases, whenever the density function is well behaved, $\hat{\theta}$ solves the likelihood equations
$$\frac{\partial \ln L(\theta)}{\partial \theta_j} = 0$$
For some distribution types, the MLE calculation is a relatively simple algebraic problem; for others the differential equation is extremely complicated and is solved numerically. It is known that MLE has some asymptotic properties among them:
.
One very important point is that the MLE depends strongly on the parametric family chosen. Numerous studies examining the “robustness” of the MLE have identified how “wrong” a model can be when the incorrect distribution family is used. The best justification for the MLE lies in its asymptotic properties: it turns out to be asymptotically optimal.
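As a minimal illustration of numerical maximum likelihood, the sketch below maximizes the log-likelihood of an exponential density with a golden-section search and compares the result with the closed-form MLE. The exponential model, the toy data, and the helper names are assumptions chosen only because the closed form makes the result easy to check; they are not part of this study.

```python
import math

def exp_log_likelihood(rate, data):
    """ln L(rate) = n ln(rate) - rate * sum(x) for an exponential density."""
    return len(data) * math.log(rate) - rate * sum(data)

def maximize(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - g * (b - a)
        d = a + g * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0

data = [0.5, 1.2, 0.3, 2.0, 0.9, 1.6]  # hypothetical observations
rate_numeric = maximize(lambda r: exp_log_likelihood(r, data), 1e-6, 50.0)
rate_closed = len(data) / sum(data)  # closed-form MLE: 1 / sample mean
```

For distributions without a closed-form solution, such as the Generalized Logistic, only the numerical route is available, typically with a multivariate optimizer in place of the one-dimensional search used here.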
whose probabilistic structure is determined a priori by the statistical model chosen. The moments of the probability distribution are often the best way to handle the unknown parameters $\theta$. This relationship is exemplified by the raw moments below:
$$\mu_k(\theta) = E[X^k] = \int x^k f(x; \theta)\, dx$$
Given a random sample $x_1, \ldots, x_n$, the $k$-th sample moment is
$$m_k = \frac{1}{n} \sum_{i=1}^{n} x_i^k$$
The moment estimators of the population parameters are obtained by matching the sample moments to the corresponding population moments and solving the resulting equations simultaneously.
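The matching of sample and population moments can be illustrated with a short Python sketch. It assumes a two-parameter gamma distribution (mean = shape × scale, variance = shape × scale²), chosen only because the moment equations solve in closed form; the function names are hypothetical.

```python
def sample_moment(data, k):
    """k-th raw sample moment: (1/n) * sum(x_i ** k)."""
    return sum(x ** k for x in data) / len(data)

def gamma_mom(data):
    """Method-of-moments estimates (shape, scale) for a two-parameter gamma."""
    m1 = sample_moment(data, 1)
    m2 = sample_moment(data, 2)
    var = m2 - m1 ** 2      # second central moment from the raw moments
    shape = m1 ** 2 / var   # solves mean = shape * scale, var = shape * scale^2
    scale = var / m1
    return shape, scale

shape_hat, scale_hat = gamma_mom([1.0, 2.0, 3.0, 4.0, 5.0])
```

For this toy sample the first two raw moments are 3 and 11, so the matched variance is 2.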
be random variables, not necessarily identically distributed, and set
The least-squares method is estimating values
, say
such that the sum of square of error minimizes the loss function:
over
. That is:
The least-squares method is not necessarily asymptotically efficient and can be quite sensitive to heavy tails (i.e., outliers/error contamination)
the estimation error. For consistent estimators,
tends to zero as n increases without bounds. We can study the distribution of
, which, for example, can be used to find intervals that, with high confidence, we can claim
is in these intervals.
The ML estimators possess many good properties. For example, it can be shown (see[25] or, for a review,[26]) that the ML method yields a consistent estimator if
satisfies certain regularity conditions, and
are independent observations. Consistent estimators are defined as estimators for which the error
tends to zero as the number of observations n goes to infinity[27].
It is shown in[25] that if
satisfies certain regularity conditions, ML estimators are asymptotically normal. Asymptotic normality means that, for large n,
where
, and
Estimation error analysis can be performed numerically by the parametric bootstrap method[28]. Bootstrap methods are most commonly used for complicated statistical problems, e.g. when the parameter
is a large vector, or when an analytical approach is not possible.
For bootstrap methods, a computer program for Monte Carlo simulation is necessary. If the parameter
, equivalently, the distribution
is known, such a program can simulate independent samples
, where N is some large integer. All these samples have the same random properties as our initial sample x, and from each sample the estimates
are calculated
The error distribution
can be approximated by means of the empirical distribution of
, with increasing accuracy as N goes to infinity.
Let
be the empirical distribution describing the variability of the sequence
. (Note that the empirical distribution depends both on the number n of observations in our original data set and the number N of bootstrap simulations). Usually N is much larger than n since it is only limited by the computer time we wish to spend for the simulations. Finally, one can prove that, under suitable conditions, with
,
Using the last result, if n is large we have an approximation of the error distribution
.
The bootstrap quantiles, defined by
, are close to the quantiles
.
Thus an interval that, with (approximately)
confidence, covers the unknown parameter
is given by
and Kolmogorov-Smirnov (K-S) statistics. The Anderson-Darling (A-D) statistic is a modification of the K-S statistic. The lower the value of these statistics, the more closely the distribution fits the data. GOF statistics do not provide a true measure of the probability that the data actually come from the fitted distribution. Instead, they provide the probability that random data generated from the fitted distribution would have produced a GOF statistic as low as that calculated for the observed data. Analysis of the $\chi^2$, K-S, and A-D statistics can provide confidence intervals proportional to the probability that the fitted distribution could have produced the observed data.
Critical values are determined by the required confidence level $\alpha$; they are the values of the goodness-of-fit statistics that have a probability of being exceeded equal to the specified confidence level. Critical values of the K-S and A-D statistics have been found by Monte Carlo simulation[29]. The K-S and A-D statistics are designed to test whether a distribution of known parameters could have produced the observed data. If the parameters of the fitted distribution have been estimated from the data, the tests will produce conservative results. One way to circumvent this problem is to use a portion of the data for estimation and the remaining data for the GOF test.
Chi-Square ($\chi^2$) statistics
The Chi-Square ($\chi^2$) statistic measures how well the expected frequency of the fitted distribution having a CDF $F(x)$ compares with the frequency of the observed data points $x_1, \ldots, x_n$. To conduct the most effective version of the test, we first divide the hypothesized distribution’s support into $k$ “equiprobable” non-overlapping intervals; we identify values $a_0 < a_1 < \cdots < a_k$ such that $a_j = F^{-1}(j/k)$, for $j = 0, 1, \ldots, k$, where $F^{-1}$ is the inverse CDF. The respective intervals are $(a_{j-1}, a_j]$. We then compare the number of observations that fall in each interval, $O_j$, to the corresponding expected number, $E_j$.
The chi-square statistic is calculated as
$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j},$$
where $E_j = n/k$. Critical values for the $\chi^2$ statistic are found from the $\chi^2$ distribution. The shape and range of the $\chi^2$ distribution are defined by the degrees of freedom $d$, where $d = k - s - 1$ and $s$ is the number of parameters that are estimated. We reject the null hypothesis that $F$ is the appropriate distribution if $\chi^2 > \chi^2_{d, 1-\alpha}$, where $\chi^2_{d, 1-\alpha}$ is the $(1-\alpha)$ quantile of the chi-square distribution with $d$ degrees of freedom.
Since the $\chi^2$ statistic sums the squares of all of the errors $(O_j - E_j)$, it can be disproportionately sensitive to any large errors. It is also very dependent on the number of intervals. For better results, $n$ usually needs to be sufficiently large and $k$ sufficiently small that $n/k \geq 5$.
It is recommended that the number of intervals be chosen using Scott’s[30] formula
.
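The equiprobable-interval version of the chi-square statistic described above can be sketched as follows. The two-parameter logistic distribution is used here purely as a stand-in fitted distribution because its CDF and inverse CDF have closed forms; all function names are illustrative assumptions.

```python
import math

def logistic_cdf(x, loc, scale):
    """CDF of the two-parameter logistic distribution."""
    return 1.0 / (1.0 + math.exp(-(x - loc) / scale))

def logistic_ppf(p, loc, scale):
    """Inverse CDF of the two-parameter logistic distribution."""
    return loc + scale * math.log(p / (1.0 - p))

def chi_square_equiprobable(data, cdf, k):
    """Chi-square statistic over k equiprobable intervals of the fitted CDF."""
    n = len(data)
    observed = [0] * k
    for x in data:
        # x falls in interval j exactly when j/k <= F(x) < (j+1)/k
        j = min(int(cdf(x) * k), k - 1)
        observed[j] += 1
    expected = n / k  # every interval has probability 1/k
    return sum((o - expected) ** 2 / expected for o in observed)

# Data placed exactly at evenly spaced quantiles fill every interval equally
quantile_data = [logistic_ppf((i + 0.5) / 100, 0.0, 1.0) for i in range(100)]
stat_good = chi_square_equiprobable(quantile_data, lambda x: logistic_cdf(x, 0.0, 1.0), 10)
# Shifting the data away from the fitted distribution inflates the statistic
stat_shifted = chi_square_equiprobable(
    [x + 1.5 for x in quantile_data], lambda x: logistic_cdf(x, 0.0, 1.0), 10
)
```

The resulting statistic would then be compared with the appropriate quantile of the chi-square distribution for the chosen number of intervals and estimated parameters.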
arising from a continuous distribution having a CDF $F(x)$. Now let $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$ denote the order statistics based on the sample $x_1, \ldots, x_n$.
The K-S statistic $D_n$ is defined as
$$D_n = \max_x \left| F_n(x) - F(x) \right|,$$
where $D_n$ is known as the K-S distance, $n$ is the number of observed data points, $F_n(x_{(i)}) = i/n$ for $i = 1, \ldots, n$, where $i$ is the cumulative rank of the data point, and $F(x)$ is the distribution function of the fitted distribution. It is well known from the Glivenko-Cantelli lemma[31] that, as the sample size $n$ becomes large, the empirical CDF $F_n(x)$ converges uniformly to $F(x)$ for all $x$.
The K-S test quantifies the maximum deviation of the empirical CDF both above and below the uniform line. The upper ($D_n^+$) and lower ($D_n^-$) deviations of the empirical CDF are calculated as follows:
$$D_n^+ = \max_{1 \leq i \leq n} \left[ \frac{i}{n} - F(x_{(i)}) \right], \qquad D_n^- = \max_{1 \leq i \leq n} \left[ F(x_{(i)}) - \frac{i-1}{n} \right]$$
The K-S test rejects the hypothesized distribution when the test statistic $D_n$ is larger than a tabulated quantile based on the sample size and the type I error $\alpha$.
The K-S statistic is generally more useful than the $\chi^2$ statistic in that the data are assessed at all data points, which avoids the problem of determining the number of intervals into which the data must be split. However, its value is determined only by the one largest discrepancy and takes no account of lack of fit across the remainder of the distribution. The vertical distance between the observed distribution $F_n(x)$ and the fitted distribution $F(x)$ at any point has a distribution with a mean of zero and a standard deviation $\sigma_{K\text{-}S}$ given by binomial theory:
$$\sigma_{K\text{-}S} = \sqrt{\frac{F(x)\,[1 - F(x)]}{n}}$$
This indicates that the position of $D_n$ along the x axis is more likely to occur where $\sigma_{K\text{-}S}$ is greatest, which generally is away from the low-probability tails. This insensitivity of the K-S statistic to lack of fit at the extremes of the distribution is corrected for in the Anderson-Darling statistic.
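A minimal sketch of the K-S computation is given below. It assumes the usual order-statistic form of the upper and lower deviations, $\max_i [i/n - F(x_{(i)})]$ and $\max_i [F(x_{(i)}) - (i-1)/n]$; the function name and the uniform test case are illustrative.

```python
def ks_statistic(data, cdf):
    """Return (D_plus, D_minus, D) for a sample against a fitted CDF."""
    xs = sorted(data)  # order statistics
    n = len(xs)
    # Largest amount by which the empirical CDF rises above the fitted CDF
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    # Largest amount by which the empirical CDF falls below the fitted CDF
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus, d_minus, max(d_plus, d_minus)

def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)

# Points at the interval midpoints (i - 0.5)/n are as close to uniform as possible
midpoints = [(i + 0.5) / 20 for i in range(20)]
d_plus, d_minus, d_n = ks_statistic(midpoints, uniform_cdf)
```

In this test case both one-sided deviations equal $0.5/n$, the smallest K-S distance attainable with $n$ points.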
$$A_n^2 = \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 \Psi(x)\, f(x)\, dx$$
where $\Psi(x) = \dfrac{n}{F(x)\,[1 - F(x)]}$, $n$ is the number of observed data points, $F(x)$ is the CDF of the fitted distribution, $f(x)$ is the density function of the fitted distribution, and $F_n(x_{(i)}) = i/n$ for $i = 1, \ldots, n$, with $i$ the cumulative rank of the observed data point.
The Anderson-Darling statistic is an improved version of the Kolmogorov-Smirnov statistic. $\Psi(x)$ compensates for the variance of the vertical deviation distance between the sample distribution and the fitted distribution ($\sigma^2_{K\text{-}S}$). $f(x)$ weights the distance by the probability that a value is generated at that $x$ value. The vertical distances are integrated over all values of $x$ to make maximum use of the observed data (the K-S statistic only looks at the maximum deviation distance).
The A-D statistic $A_n^2$ is therefore generally a more useful measure of goodness of fit than the K-S statistic, especially where it is important to place equal emphasis on fitting a distribution at the tails as well as at the main body. Nonetheless, it still has the same problem as the K-S statistic, i.e., the parameters of the fitted distribution should, in theory, not be estimated from the data.
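The A-D statistic can be computed from the order statistics without numerical integration. The sketch below assumes the standard computational form $A_n^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(x_{(i)}) + \ln\left(1 - F(x_{(n+1-i)})\right)\right]$, which is equivalent to the weighted integral of squared vertical distances; the function name and test data are illustrative.

```python
import math

def anderson_darling(data, cdf):
    """A-D statistic of a sample against a fully specified CDF.

    Assumes 0 < cdf(x) < 1 for every observation.
    """
    xs = sorted(data)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        # xs[i - 1] is x_(i); xs[n - i] is x_(n + 1 - i)
        total += (2 * i - 1) * (math.log(cdf(xs[i - 1])) + math.log(1.0 - cdf(xs[n - i])))
    return -n - total / n

# Evenly spread points on (0, 1), tested against the right and the wrong CDF
points = [(i + 0.5) / 50 for i in range(50)]
a2_good = anderson_darling(points, lambda x: x)       # uniform CDF, correct
a2_bad = anderson_darling(points, lambda x: x * x)    # misspecified CDF
```

Because the weight grows near the tails, the misspecified CDF is penalized more heavily here than it would be by the K-S distance alone.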
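Let $n$ be the number of observations, $k$ the number of parameters to be estimated, and $L_{\max}$ be the maximized value of the likelihood function.
SIC (Schwarz information criterion, also known as the Bayesian information criterion, BIC)[20]
$$SIC = k \ln n - 2 \ln L_{\max}$$
AICC (Akaike information criterion)[21-22]
$$AICC = \frac{2n}{n - k - 1}\, k - 2 \ln L_{\max}$$
HQIC (Hannan-Quinn information criterion)[23]
$$HQIC = 2k \ln(\ln n) - 2 \ln L_{\max}$$
The aim is to find the model with the lowest value of the selected information criterion. The $-2 \ln L_{\max}$ term appearing in each formula is an estimate of the deviance of the model fit. The coefficients of $k$ in the first part of each formula show the degree to which the number of model parameters is being penalized. For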
the SIC[20] is the strictest in penalizing loss of degree of freedom by having more parameters in the fitted model. For
AICC[21-22] is the least strict of the three, and HQIC[23] is in between.
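The three criteria can be compared with a few lines of Python. The formulas below are the commonly used definitions (SIC $= k \ln n - 2 \ln L_{\max}$, small-sample corrected AIC $= 2kn/(n-k-1) - 2 \ln L_{\max}$, HQIC $= 2k \ln \ln n - 2 \ln L_{\max}$) and are assumed to match those intended here; the log-likelihood value in the example is hypothetical.

```python
import math

def sic(log_l_max, k, n):
    """Schwarz (Bayesian) information criterion."""
    return k * math.log(n) - 2.0 * log_l_max

def aicc(log_l_max, k, n):
    """Akaike information criterion with small-sample correction."""
    return 2.0 * k * n / (n - k - 1) - 2.0 * log_l_max

def hqic(log_l_max, k, n):
    """Hannan-Quinn information criterion."""
    return 2.0 * k * math.log(math.log(n)) - 2.0 * log_l_max

# Hypothetical three-parameter fit to n = 738 observations
log_l_max, k, n = -3000.0, 3, 738
criteria = {
    "SIC": sic(log_l_max, k, n),
    "AICC": aicc(log_l_max, k, n),
    "HQIC": hqic(log_l_max, k, n),
}
```

For a sample of this size the penalty terms are ordered SIC > HQIC > AICC, in line with the ranking of strictness described above.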
where
is the location parameter,
is the scale parameter, and
is the shape parameter. The range of possible values for the GL distribution is given by
Note that as a special case, if
then the GL distribution reduces to the two-parameter logistic distribution. Additional generalizations of the logistic distribution are discussed in[33].
The mean, variance, and Fisher’s coefficient of skewness are[33]
where
is the gamma function, and
exists only if
Since
is location and scale invariant the skewness of the distribution depends only on parameter
. A random variable X with generalized logistic distribution has a variance depending on the parameters
and
.The quantile estimator
of the GL distribution can be obtained by substituting
and solving for x
where
are the parameter estimators, and T is the return period[33].
Method of moments (MOM)
The skewness coefficient
of the GL distribution is only a function of the shape parameter
. Then
can be approximated as follows[34]
A more precise estimate of the shape parameter can be obtained using a numerical approximation. The
that minimizes[33]
is an approximation for the shape parameter. Once the shape parameter is known,
and
can be obtained as follows:
Method of maximum likelihood (ML)
Consider a sample of size n of independent positive random variables
. Let
the log-likelihood function of the GL distribution is given by[33]
where
n is the sample size, and $\ln$ represents the natural logarithm. The MLEs
are obtained from the maximization of
as the solution of the following likelihood equations or score functions:
where
The system does not admit any explicit solution; therefore the ML estimates
can be obtained only by means of numerical procedures.
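For readers who wish to experiment with the GL distribution, the sketch below implements its CDF and quantile function in Hosking's parameterization, $F(x) = 1/(1 + e^{-y})$ with $y = -k^{-1}\ln[1 - k(x - \xi)/\alpha]$ and $x(F) = \xi + \frac{\alpha}{k}\left[1 - \left(\frac{1-F}{F}\right)^{k}\right]$. The symbols $\xi$ (location), $\alpha$ (scale), and $k$ (shape) and the parameter values used are assumptions, since the original notation is not recoverable from the text.

```python
import math

def gl_cdf(x, xi, alpha, k):
    """CDF of the Generalized Logistic distribution (Hosking's form, assumed)."""
    if k == 0.0:
        y = (x - xi) / alpha  # reduces to the two-parameter logistic
    else:
        arg = 1.0 - k * (x - xi) / alpha
        if arg <= 0.0:
            # x lies beyond the bound xi + alpha/k (upper if k > 0, lower if k < 0)
            return 1.0 if k > 0.0 else 0.0
        y = -math.log(arg) / k
    return 1.0 / (1.0 + math.exp(-y))

def gl_quantile(p, xi, alpha, k):
    """Quantile function; with p = 1 - 1/T it gives the T-period quantile."""
    if k == 0.0:
        return xi + alpha * math.log(p / (1.0 - p))
    return xi + alpha / k * (1.0 - ((1.0 - p) / p) ** k)

# Hypothetical parameter values, for illustration only
xi, alpha, k = -3.0, 8.0, -0.2
x_90 = gl_quantile(0.90, xi, alpha, k)
p_back = gl_cdf(x_90, xi, alpha, k)
```

In this form the location parameter equals the median, so `gl_quantile(0.5, ...)` returns $\xi$ for any shape value.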
Where
are, respectively, the continuous location parameter, scale parameter, and shape parameter.
ii) Log-Logistic (3P) Distribution
where
are, respectively, the continuous location parameter, scale parameter, and shape parameter.
Johnson SU Distribution
and
are, respectively, the continuous location, scale (
), and shape (
) parameters.
The illustrative fits of the Generalized Logistic, Johnson SU, and Log-Logistic distributions to sample data are presented in Figure 1.
Figure 1. Generalized Logistic, Johnson SU, and Log-Logistic distributions fit to sample data
subsets of data each of length
have been drawn randomly from a main sample data set of 738 points. Estimated parameters
,
are tabulated in Tables 1.1 and 1.2. Measures of central tendency and dispersion of the estimated parameters are presented in Tables 2.1 and 2.2. The two measures of central tendency (median and mean) for all the parameters indicate that their distributions are almost symmetrical, and the standard deviations are much smaller than the means[35].
It becomes readily apparent that the estimated parameters of the Generalized Logistic distribution exhibit better over-the-samples stability than the other two distributions.
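The sub-sampling procedure can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the data are synthetic draws from a two-parameter logistic distribution (not the study data), the fit uses simple moment estimates rather than the fitting procedure used in the study, and the number and size of sub-samples (20 and 200) are placeholders.

```python
import math
import random
import statistics

def draw_logistic(rng, loc, scale):
    """Inverse-CDF draw from a two-parameter logistic distribution."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return loc + scale * math.log(u / (1.0 - u))

def fit_logistic_mom(sample):
    """Moment estimates: loc = mean, scale = sd * sqrt(3) / pi."""
    loc = statistics.fmean(sample)
    scale = statistics.stdev(sample) * math.sqrt(3.0) / math.pi
    return loc, scale

def subsample_stability(data, n_sub, size, fit, seed=0):
    """Fit each of n_sub random sub-samples and summarize every parameter."""
    rng = random.Random(seed)
    # rng.sample draws without replacement, so each sub-sample is a proper subset
    estimates = [fit(rng.sample(data, size)) for _ in range(n_sub)]
    summary = []
    for values in zip(*estimates):  # one tuple of estimates per parameter
        summary.append({
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "sd": statistics.stdev(values),
        })
    return summary

rng = random.Random(1)
offsets = [draw_logistic(rng, -5.0, 8.0) for _ in range(738)]  # synthetic stand-in data
loc_stats, scale_stats = subsample_stability(offsets, n_sub=20, size=200, fit=fit_logistic_mom)
```

A mean close to the median and a standard deviation that is small relative to the mean are the signs of over-the-samples stability discussed above.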
some function of the distribution
, and
is unknown. However, we have a random sample
, from
, and we want to estimate
.
The Generalized Bootstrap (GB) approaches the problem as follows[38]. Suppose that one would typically estimate by
. Then, instead, proceed as follows: First, estimate
. Second, independently generate N random samples of n from
, and estimate
for each sample. Third, use the sample
to estimate
. For example, one may calculate[37]
which give the sample mean and sample variance, respectively, of the GB estimators.
Then, assuming approximate normality, we have
and an approximate
confidence interval for
based on standard method is
A widely used alternative to the standard method is the "percentile method," which uses the upper and lower
percentiles of the GB sample estimators as the confidence interval. Specifically, the percentile method proceeds as follows: Place the N estimates
in increasing numerical order, obtaining
The percentile method’s (approximate)
confidence interval for
is
Sun and Muller-Schwarze[36] compared the performance of the bootstrap method (BM) and the generalized bootstrap and concluded that GB is more consistent in parameter estimation than BM. Asymptotic properties of GB have been shown by[28].
In this part of the analysis, we independently generate
random samples of length
from a GL distribution
, and estimate
for each sample. Estimated values of parameters
,
are shown in Table 3.1. The calculated values of the mean, the standard deviation, and their lower and upper limits for different confidence intervals are presented in Tables 3.2 and 3.3.
Figure 2. Uncertainty in parameters of the Generalized Logistic distribution fitted to patient offset times
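The Generalized Bootstrap steps (fit, simulate N samples of size n from the fitted distribution, re-estimate, then form standard and percentile intervals) can be sketched in Python. To keep the sketch short, it assumes a one-parameter exponential model with the sample mean as estimator instead of the three-parameter GL fit used in this study; the data values and function names are hypothetical.

```python
import random
import statistics

def generalized_bootstrap(data, fit, simulate, n_boot=1000, level=0.95, seed=0):
    """Return (estimate, standard-method CI, percentile-method CI)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = fit(data)  # step 1: estimate from the observed data
    # step 2: simulate n_boot samples from the fitted model and re-estimate each
    boots = sorted(fit(simulate(rng, theta_hat, n)) for _ in range(n_boot))
    # step 3a: standard method, assuming approximate normality
    mean = statistics.fmean(boots)
    sd = statistics.stdev(boots)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    standard = (mean - z * sd, mean + z * sd)
    # step 3b: percentile method on the ordered estimates
    lo_idx = round((1.0 - level) / 2.0 * n_boot)
    hi_idx = round((1.0 + level) / 2.0 * n_boot) - 1
    return theta_hat, standard, (boots[lo_idx], boots[hi_idx])

def simulate_exponential(rng, theta, n):
    """Draw n values from an exponential distribution with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]

data = [3.1, 0.4, 7.9, 2.2, 5.0, 1.3, 9.6, 4.4, 0.8, 6.7]  # hypothetical sample
theta_hat, standard_ci, percentile_ci = generalized_bootstrap(
    data, statistics.fmean, simulate_exponential
)
```

With a skewed sampling distribution, as in this small-sample example, the percentile interval is asymmetric around the estimate while the standard interval is symmetric around the bootstrap mean by construction.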