International Journal of Statistics and Applications
2012; 2(4): 24-32
doi: 10.5923/j.statistics.20120204.01
Sueli A. Mingoti, Renata A. Matos
Department of Statistics, Federal University of Minas Gerais, Belo Horizonte, 31270-901, Brazil
Correspondence to: Sueli A. Mingoti, Department of Statistics, Federal University of Minas Gerais, Belo Horizonte, 31270-901, Brazil.
Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.
In this paper the clustering algorithms average linkage, ROCK, k-modes, fuzzy k-modes and k-populations were compared by means of Monte Carlo simulation. Data were simulated from Beta and Uniform distributions considering factors such as cluster overlapping and the numbers of groups, variables and categories. A total of 64 population cluster structures were simulated, covering smaller and higher degrees of overlapping and different numbers of clusters, variables and categories. The results showed that overlapping was the factor with the greatest impact on the algorithms' accuracy, which decreases as the number of clusters increases. In general, ROCK presented the best performance in both overlapping and non-overlapping cases, followed by k-modes and fuzzy k-modes. The k-populations algorithm showed better accuracy only in cases with a small degree of overlapping, with performance similar to that of average linkage. The superiority of the k-populations algorithm over k-modes and fuzzy k-modes reported in previous studies, which were based only on benchmark data, was not confirmed in this simulation study.
Keywords: Clustering, Categorical Data, Monte Carlo Simulation
2.1. The k-Modes Algorithm

Let $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ be a sample observation described by $m$ categorical variables, the domain of each variable $A_j$ being denoted by $D_j = \{a_{j1}, a_{j2}, \ldots, a_{jt_j}\}$, and let $x_{ij}$ be the observed value of $A_j$ for $X_i$. The objective function of k-modes, which has to be minimized, is given by

$$F(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{il}\, d(X_i, Q_l) \qquad (1)$$

where $Q_l = (q_{l1}, q_{l2}, \ldots, q_{lm})$ is the vector mode of the cluster $l$, $w_{il} \in \{0, 1\}$ indicates whether $X_i$ belongs to cluster $l$, $\sum_{l=1}^{k} w_{il} = 1$; $m$ is the number of categorical variables and $d(X_i, Q_l)$ is the dissimilarity measure defined as

$$d(X_i, Q_l) = \sum_{j=1}^{m} \delta(x_{ij}, q_{lj}), \qquad \delta(x, q) = \begin{cases} 0, & x = q \\ 1, & x \neq q \end{cases} \qquad (2)$$
2.2. The Fuzzy k-Modes Algorithm

The fuzzy k-modes algorithm extends k-modes by replacing the hard weights with membership degrees $w_{il} \in [0, 1]$, with $\sum_{l=1}^{k} w_{il} = 1$, and by introducing the fuzzy parameter denoted by $\alpha > 1$. The membership degrees are estimated during the execution of the algorithm; the fuzzy parameter is fixed in advance and controls the degree of overlapping expected among clusters. In the literature it is very common to set $\alpha = 2$. The degree of chaos in the final partition increases as $\alpha$ goes to infinity. The objective function of fuzzy k-modes, which has to be minimized to produce the best partition, is given by

$$F(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{il}^{\alpha}\, d(X_i, Q_l) \qquad (3)$$

where $d(X_i, Q_l)$ and $Q_l$ are defined as in section 2.1. According to Huang and Ng[16], the function (3) is minimized if and only if the membership degrees satisfy

$$w_{il} = \begin{cases} 1, & \text{if } X_i = Q_l \\ 0, & \text{if } X_i = Q_h,\ h \neq l \\ \left[ \displaystyle\sum_{h=1}^{k} \left( \frac{d(X_i, Q_l)}{d(X_i, Q_h)} \right)^{1/(\alpha - 1)} \right]^{-1}, & \text{if } X_i \neq Q_h,\ 1 \leq h \leq k \end{cases} \qquad (4)$$

and the mode $q_{lj}$ of each variable $A_j$ in cluster $l$ is given as in (5). As in k-modes, the vector modes may not be unique.

$$q_{lj} = a_{jr} \in D_j, \quad \text{where } \sum_{i:\, x_{ij} = a_{jr}} w_{il}^{\alpha} \;\geq\; \sum_{i:\, x_{ij} = a_{jt}} w_{il}^{\alpha}, \quad 1 \leq t \leq t_j \qquad (5)$$

The basic steps of the algorithm are: (i) an initial membership matrix $W = [w_{il}]$ is selected and normalized; it can, for example, be randomly chosen from the uniform distribution defined on the [0,1] interval; (ii) k initial modes are selected by using equation (5); (iii) all objects are compared to the modes using a dissimilarity measure and allocated to the nearest group; (iv) the weights $w_{il}$ are recalculated according to equation (4) and the modes are updated; (v) steps (iii) and (iv) are repeated until there is no change in the cluster modes. Fuzzy k-modes has proved to be an efficient clustering algorithm, with the advantage of pointing out the data points which share some similarity with different clusters of the partition and which could therefore be misclassified by k-modes-type algorithms.
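As a minimal sketch of the membership update in (4), assuming the dissimilarities between all objects and the current modes have already been collected in an $n \times k$ matrix:

```python
import numpy as np

def fuzzy_memberships(D, alpha=2.0):
    """Membership degrees of equation (4).

    D : (n, k) matrix with d(X_i, Q_l) for every object i and mode l.
    Returns W with w_il in [0, 1]; each row sums to 1.
    """
    n, k = D.shape
    W = np.zeros((n, k))
    for i in range(n):
        zero = np.where(D[i] == 0)[0]
        if len(zero) > 0:
            # X_i coincides with a mode: full membership there, zero elsewhere.
            W[i, zero[0]] = 1.0
        else:
            # w_il = [ sum_h (d_il / d_ih)^(1/(alpha-1)) ]^(-1)
            ratios = (D[i][:, None] / D[i][None, :]) ** (1.0 / (alpha - 1.0))
            W[i] = 1.0 / ratios.sum(axis=1)
    return W
```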
2.3. The k-Populations Algorithm

In the k-populations algorithm each cluster $l$ is represented by a fuzzy centroid $\tilde{V}_l = (\tilde{v}_{l1}, \tilde{v}_{l2}, \ldots, \tilde{v}_{lm})$, where each component is a weighted set of categories, $\tilde{v}_{lj} = \{(a_{jt}, \omega_{ljt}),\ t = 1, 2, \ldots, t_j\}$. Therefore, the composition of the fuzzy centroids is determined by the categories and by the contribution degree of these categories to the specific cluster, $\omega_{ljt}$, which is defined as

$$\omega_{ljt} = \frac{1}{c_l} \sum_{i:\, x_{ij} = a_{jt}} w_{il} \qquad (6)$$

where $a_{jt}$ denotes the category $t$ of variable $A_j$. Briefly speaking, for each category $t$ of the variable $A_j$, $\omega_{ljt}$ describes the category distribution of the attribute for data belonging to the $l$th cluster. The normalizing factor $c_l$ is the length of the membership vector $W_l = (w_{1l}, w_{2l}, \ldots, w_{nl})$, where $w_{il}$ is the degree of membership as defined in section 2.2. According to Kim et al.[9], the introduction of the contribution degrees minimizes the imprecision in the representation of cluster centroids and improves the efficiency of the fuzzy clustering algorithm in finding the optimal partition rather than a local optimum.

As an illustration, suppose there is only one categorical variable $A_1$ with two possible categories $\{a_{11}, a_{12}\}$ and that a given cluster $l$ contains 3 observations, say $X_1$ and $X_2$ with observed category $a_{11}$ and $X_3$ with category $a_{12}$. Then $c_l = w_{1l} + w_{2l} + w_{3l}$ and, since only $X_1$ and $X_2$ present the category $t = 1$ of $A_1$, $\omega_{l11} = (w_{1l} + w_{2l})/c_l$. Similarly, it is found that $\omega_{l12} = w_{3l}/c_l$, and the fuzzy centroid is given by $\tilde{V}_l = \{(a_{11}, \omega_{l11}), (a_{12}, \omega_{l12})\}$.

In k-populations the dissimilarity measure used to compare any observation $X_i$ to any fuzzy centroid $\tilde{V}_l$ is given by

$$d(X_i, \tilde{V}_l) = \sum_{j=1}^{m} \sum_{t:\, a_{jt} \neq x_{ij}} \omega_{ljt} \qquad (7)$$

that is, each variable contributes the total weight of the categories not observed in $X_i$, the contribution degrees being normalized by the factor $c_l$, which corresponds to the length of the vector $W_l$. For $k$ and $\alpha$ fixed, the basic steps of the k-populations algorithm are given as follows: (i) choose k initial seeds (fuzzy centroids), with the degrees of contribution $\omega_{ljt}$ chosen at random; (ii) obtain the proximity matrix between the observations and the fuzzy centroids according to equation (7); (iii) estimate the degrees of membership $w_{il}$ according to equation (4) and update the fuzzy centroids according to equation (6); (iv) repeat steps (ii)-(iii) until there is no reallocation of objects into clusters.
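A short sketch of (6) and (7) under the reading above, in which the normalizing factor $c_l$ is taken as the total membership mass of the cluster; the membership values in the usage example are illustrative only:

```python
import numpy as np

def contribution_degrees(X_j, w_l, categories_j):
    """Equation (6): contribution degree of each category of variable A_j to cluster l.

    X_j          : (n,) observed categories of variable A_j.
    w_l          : (n,) membership degrees w_il of the n objects for cluster l.
    categories_j : the t_j possible categories of A_j.
    """
    c_l = w_l.sum()  # normalizing factor c_l (taken here as the sum of memberships)
    return np.array([w_l[X_j == a].sum() / c_l for a in categories_j])

def kpop_dissimilarity(x, centroid):
    """Equation (7): for each variable, accumulate the contribution degrees of
    every category different from the one observed in x."""
    return sum(
        weight
        for j, (cats, omegas) in enumerate(centroid)
        for a, weight in zip(cats, omegas)
        if a != x[j]
    )

# Illustration with one variable, two categories and made-up memberships:
X_1 = np.array(["a1", "a1", "a2"])
w_l = np.array([0.8, 0.6, 0.5])
omegas = contribution_degrees(X_1, w_l, ["a1", "a2"])
print(omegas)                                            # [0.7368..., 0.2631...]
print(kpop_dissimilarity(["a2"], [(["a1", "a2"], omegas)]))  # 0.7368...
```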
2.4. The Average Linkage Algorithm

Let $C_1$ and $C_2$ be two clusters with sizes $n_1$ and $n_2$, respectively. In average linkage the dissimilarity measure between these two clusters is defined as

$$d(C_1, C_2) = \frac{1}{n_1 n_2} \sum_{X_i \in C_1} \sum_{X_s \in C_2} d(X_i, X_s) \qquad (8)$$

where $d(X_i, X_s)$ is any dissimilarity measure used to compare the elements $X_i$ and $X_s$. Average linkage does not depend on the notion of initial seeds and therefore its final partition is not affected by the ordering of the objects in the data set. However, computationally speaking, it is less efficient than non-hierarchical algorithms for large data sets.
2.5. The ROCK Algorithm

In ROCK two observations $X_i$ and $X_s$ are considered neighbors if $sim(X_i, X_s) \geq \theta$, where $sim(\cdot, \cdot)$ is a similarity measure and $\theta$ is a pre-specified threshold parameter which controls the amount of similarity required for two observations to be considered neighbors. The quantity $link(X_i, X_s)$ is the number of neighbors the two observations have in common; the higher its value, the more probable it is that $X_i$ and $X_s$ belong to the same group. The main goal is to maximize the sum of links between observations which belong to the same group and to minimize the sum of links between observations which belong to distinct groups. In practice the implementation of ROCK is as follows: after an initial computation of the number of links between the data objects, the algorithm starts with each cluster being a single object and keeps merging clusters until the specified number of clusters is achieved or no links remain between the clusters. In each step of the algorithm the two clusters which maximize the goodness measure given in (9) are merged, where $C_1$ and $C_2$ are the clusters being compared, with sizes $n_1$ and $n_2$, and $link(C_1, C_2)$ is given by (10). The quantity in the denominator of (9) is approximately the expected number of cross links between the pair of clusters and $f(\theta)$ is a pre-specified function. According to Guha et al.[17], empirical work has shown that the function $f(\theta) = (1 - \theta)/(1 + \theta)$ works well in practical situations. Under this function, when $\theta = 1$ the expected number of links in a cluster $C_l$ of size $n_l$ is approximately equal to $n_l$, i.e., each sample point is a neighbor only of itself; when $\theta = 0$ the expected number of links in cluster $C_l$ is approximately equal to $n_l^3$, which corresponds to the situation where all the observations in $C_l$ are neighbors of each other.

$$g(C_1, C_2) = \frac{link(C_1, C_2)}{(n_1 + n_2)^{1 + 2f(\theta)} - n_1^{1 + 2f(\theta)} - n_2^{1 + 2f(\theta)}} \qquad (9)$$

$$link(C_1, C_2) = \sum_{X_i \in C_1,\ X_s \in C_2} link(X_i, X_s) \qquad (10)$$
The observations themselves were compared by means of the simple matching dissimilarity

$$d(X_i, X_s) = \frac{1}{m} \sum_{j=1}^{m} \delta(x_{ij}, x_{sj}) \qquad (11)$$

with $\delta(\cdot, \cdot)$ as in (2), the corresponding similarity being $sim(X_i, X_s) = 1 - d(X_i, X_s)$.
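The link counts in (10) can be obtained from the Boolean neighbor matrix, since squaring it counts common neighbors. A sketch, with the similarity taken as one minus (11); the function names are our own:

```python
import numpy as np

def links_matrix(X, theta):
    """link(X_i, X_s) for all pairs: the number of common neighbors, where two
    observations are neighbors when their similarity 1 - d of (11) is >= theta."""
    n = X.shape[0]
    sim = np.array([[np.mean(X[i] == X[s]) for s in range(n)] for i in range(n)])
    A = (sim >= theta).astype(int)  # neighbor adjacency (every point neighbors itself)
    return A @ A                    # (A @ A)[i, s] = number of common neighbors

def goodness(cross_links, n1, n2, theta):
    """Goodness measure (9) with f(theta) = (1 - theta)/(1 + theta)."""
    e = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta)
    return cross_links / ((n1 + n2) ** e - n1 ** e - n2 ** e)
```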
3. The Simulation Study

For each variable $A_j$, $j = 1, 2, \ldots, m$, a random number was selected from the Uniform distribution defined on the [0,1] interval and from the Beta distribution. The vector of the generated numbers from each distribution was normalized to describe the probability of observing each category of $A_j$. Data for all categories of $A_j$ were then generated randomly according to the corresponding normalized probability vector. The Uniform model represents the situation where all categories of $A_j$ have approximately the same chance of being observed. The Beta distribution was chosen to describe situations where some categories of $A_j$ would have a higher chance of appearing in the sample than others, which is common in practical applications. As an illustration, Figure 1 presents the results of 4 samples of size 50 of a variable $A_j$ with 3 categories, generated by the described procedure. As can be seen, the Uniform distribution tends to generate classes with similar frequencies, unlike the Beta distribution, which tends to produce some classes with larger frequencies than others.

Due to the fact that the values of the variables not used to build the overlapping were generated randomly, there is a chance of some overlapping among the clusters generated in cases 1-3. In case 8 (k=2, m=2, t=15) there was no control of the degree of overlapping among clusters, since for each cluster the observations were randomly selected from the Uniform and Beta distributions. Finally, case 9 represents the situation where there was no control in the simulation of the number of clusters, variables and categories, nor of the degree of overlapping. For this case, the number of clusters was randomly selected from the set {2,3,…,8}, the number of variables from the set {2,3,…,15} and the number of categories from the set {2,3,…,10}. The observations for each cluster were generated from the Beta and the Uniform distributions, as previously described.

From each cluster structure of Table 1, a total of 1000 runs were simulated. The elements of each run were clustered into k groups by using all five clustering algorithms presented in section 2. The resulting partitions were then compared with the true simulated population structure, and the performance of each algorithm was evaluated by the average percentage of correct classification taken over the 1000 runs (called the recovery rate).
Figure 1. An illustration of samples from Beta and Uniform distributions. Simulated data – sample size 50
In the implementation of the fuzzy algorithms the fuzzy parameter $\alpha$ was set to 2 and, in the final partition, each object was assigned to the cluster for which its estimated membership degree $w_{il}$ was the largest. The dissimilarity measure given in (11) was used in all algorithms.
For k = 2 and degrees 1 and 2 of overlapping, the average recovery rate (ARR) values were high for data from both the Beta and the Uniform distributions.
When k = 2 and the degree of overlapping increased to 3 and 4, the ARR values dropped, ranging from 51.07 to 76.46% (Beta data) and from 50 to 64.1% (Uniform data). When k increased to 3 the ARR values decreased, ranging from 55.47 to 100% for degrees 1 and 2 of overlapping and from 34.65 to 75.2% for degrees 3 and 4, taking into account data from both distributions. The largest impact took place when k = 5, since the majority of the ARR values belonged to the interval [20%, 65%], except for average linkage and k-populations, which presented ARR values between 70 and 80% for non-overlapping data. Considering all the situations, average linkage and k-populations were the most affected by overlapping, as can be seen from the efficiency loss values (Effl), defined as the difference between the ARR values for degrees 1 and 4 of overlapping (see Table 2). On the contrary, ROCK was the least affected and the only method yielding ARR values larger than 50% in all situations for Beta data except for degree 5 of overlapping, although it was also the best in this situation. For Uniform data, k-modes was the least affected by overlapping.