International Journal of Statistics and Applications
p-ISSN: 2168-5193 e-ISSN: 2168-5215
2021; 11(3): 61-69
doi:10.5923/j.statistics.20211103.02
Received: Aug. 11, 2021; Accepted: Aug. 30, 2021; Published: Sep. 26, 2021

Identifying and Classifying Traveler Archetypes from Google Travel Reviews

Mst Sharmin Akter Sumy1, 2, Md Yasin Ali Parh1, 2, Md Sazzad Hossain2
1Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY
2Department of Statistics, Islamic University, Kushtia, Bangladesh
Copyright © 2021 The Author(s). Published by Scientific & Academic Publishing.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

We investigated how grouping consumers with similar interests is important for revenue optimization, and an application to a real dataset is carried out to illustrate this importance. Principal component analysis, hierarchical clustering, and k-means clustering were used in this article to identify traveler archetypes from Google travel reviews, and k-nearest neighbors was used to classify the identified classes in the dataset. The results confirmed that these prediction algorithms achieve high accuracy, but the clustering methodologies require further improvement. The classes identified should be checked by a domain expert for reasonableness before practical application. Because the data are unlabeled, it was not possible to test the model on new data. The model could be deployed on a small subset of customers and data could be collected on the performance of business metrics.
Keywords: Google review, Revenue optimization, Clustering methodologies
Cite this paper: Mst Sharmin Akter Sumy, Md Yasin Ali Parh, Md Sazzad Hossain, Identifying and Classifying Traveler Archetypes from Google Travel Reviews, International Journal of Statistics and Applications, Vol. 11 No. 3, 2021, pp. 61-69. doi: 10.5923/j.statistics.20211103.02.
Figure 1. Beach Reviews
Figure 2. Median review for swimming pools
Figure 3. Scatter plot Juice vs. Gyms
Each of the $n$ observations lives in $p$-dimensional space, but not all of these dimensions are equally interesting. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the $p$ features. The first principal component of a set of features $X_1, X_2, \ldots, X_p$ is the normalized linear combination of the features

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p \quad (1)$$

that has the largest variance. We refer to the elements $\phi_{11}, \ldots, \phi_{p1}$ as the loadings of the first principal component; together, they make up the first principal component loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^{T}$. We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance. Given an $n \times p$ data set $\mathbf{X}$, how do we compute the first principal component? Since we are only interested in variance, we assume that each of the variables in $\mathbf{X}$ has been centered to have mean zero (that is, the column means of $\mathbf{X}$ are zero). We then look for the linear combination of the sample feature values of the form

$$z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \cdots + \phi_{p1}x_{ip} \quad (2)$$

that has the largest sample variance, subject to the constraint that $\sum_{j=1}^{p}\phi_{j1}^{2} = 1$.
After the first principal component $Z_1$ of the features has been determined, we can find the second principal component $Z_2$. The second principal component is the linear combination of $X_1, \ldots, X_p$ that has maximal variance out of all linear combinations that are uncorrelated with $Z_1$. The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ take the form

$$z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \cdots + \phi_{p2}x_{ip} \quad (3)$$

where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \ldots, \phi_{p2}$. It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$ [12]. However, the reviews are on a fixed scale of 0 to 5, and the exploratory analysis showed the variability was mostly homogeneous across features, indicating standardization was not necessary in this case. PCA was not used to construct traveler archetypes, but its insights were used in the clustering methods.
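For illustration, a minimal Python sketch of this computation is given below. The simulated ratings matrix, its dimensions, and the variable names are placeholders rather than the study data; the ratings are centered but not standardized, mirroring the choice described above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical matrix of Google review ratings on a 0-5 scale:
# rows are travelers, columns are venue categories (e.g. beaches, pools, gyms).
rng = np.random.default_rng(0)
ratings = rng.uniform(0, 5, size=(100, 10))  # stand-in for the real dataset

# Ratings share a common 0-5 scale, so we center but do not standardize.
centered = ratings - ratings.mean(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(centered)   # z_{i1}, z_{i2} for each traveler
loadings = pca.components_             # rows are the loading vectors phi_1, phi_2
print(pca.explained_variance_ratio_)   # share of variance captured per component
```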
Each of the $n$ observations is treated as its own cluster. The two clusters that are most similar to each other are then fused, so that there are now $n-1$ clusters. Next the two clusters that are most similar to each other are fused again, so that there are now $n-2$ clusters. The algorithm proceeds in this fashion until all of the observations belong to a single cluster, and the dendrogram is complete. Single linkage creates sprawling, long-trailing clusters by considering the distance between a new point and the closest point in a cluster. Average linkage measures the average distance between a new point and all the points in a cluster and creates clusters that are more sprawling than complete linkage, but less long-trailing than single linkage.
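A short sketch of how these linkage choices could be compared is given below, using SciPy on a simulated stand-in for the review matrix; the cut into four clusters is arbitrary and only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
ratings = rng.uniform(0, 5, size=(100, 10))  # placeholder for the review matrix

# Build the hierarchical tree under each linkage rule discussed above,
# then cut each tree into four clusters and compare the cluster sizes.
for method in ("single", "average", "complete"):
    Z = linkage(ratings, method=method, metric="euclidean")
    labels = fcluster(Z, t=4, criterion="maxclust")
    print(method, np.bincount(labels)[1:])  # cluster sizes per linkage
```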
Observations in the $i$th class are assumed to be drawn from a multivariate Gaussian distribution which has a class-specific mean and common variance. Class means and the common variance must be estimated from the data, and once obtained they are used to create linear decision boundaries in the data. LDA then simply classifies an observation according to the region in which it is located [16]. When there are more than two classes, it is no longer possible to use a single linear discriminant score to separate the classes. The simplest procedure is to calculate a linear discriminant for each class, this discriminant being just the logarithm of the estimated probability density function for the appropriate class, with constant terms dropped. Sample values are substituted for population values where these are unknown. Where the prior class proportions are unknown, they are estimated by the relative frequencies in the training set. Similarly, the sample means and pooled covariance matrix are substituted for the population means and covariance matrix. Suppose the prior probability of class $A_i$ is $\pi_i$, and that $f(x \mid A_i)$ is the probability density of $x$ in class $A_i$, given by the normal density

$$f(x \mid A_i) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2}\exp\!\left\{-\tfrac{1}{2}(x-\mu_i)^{T}\Sigma^{-1}(x-\mu_i)\right\} \quad (4)$$
The probability of observing class $A_i$ and attribute $x$ is $\pi_i f(x \mid A_i)$, and the logarithm of the probability of observing class $A_i$ and attribute $x$ is

$$\log \pi_i - \tfrac{1}{2}(x-\mu_i)^{T}\Sigma^{-1}(x-\mu_i) + \text{const} \quad (5)$$

Dropping the term that is common to all classes, the coefficients $\beta_i$ of the linear discriminant are given by the coefficients of $x$,

$$\beta_i = \Sigma^{-1}\mu_i, \qquad \alpha_i = \log \pi_i - \tfrac{1}{2}\mu_i^{T}\Sigma^{-1}\mu_i \quad (6)$$

and the posterior class probabilities $P(A_i \mid x)$ are given by

$$P(A_i \mid x) = \frac{\exp(\alpha_i + \beta_i^{T}x)}{\sum_j \exp(\alpha_j + \beta_j^{T}x)} \quad (7)$$
The above formulae are stated in terms of the unknown population parameters $\pi_i$, $\mu_i$ and $\Sigma$. To obtain the corresponding “plug-in” formulae, substitute the corresponding sample estimators: $S$ for $\Sigma$; $\bar{x}_i$ for $\mu_i$; and $p_i$ for $\pi_i$, where $p_i$ is the sample proportion of class $A_i$ examples. [17]
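As a minimal illustration of the plug-in rule, the sketch below uses scikit-learn's LinearDiscriminantAnalysis, which estimates the class means, pooled covariance, and class proportions from the data; the simulated ratings and archetype labels are placeholders, not the study's data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(300, 10))  # placeholder review ratings
y = rng.integers(0, 3, size=300)       # placeholder archetype labels

# The estimator plugs in the pooled covariance S, the class sample means,
# and the class proportions p_i, in the spirit of (6)-(7).
lda = LinearDiscriminantAnalysis(priors=None)  # priors estimated from the data
lda.fit(X, y)
posterior = lda.predict_proba(X[:5])           # P(A_i | x) for the first rows
print(posterior.round(3))
```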
If the class covariance matrices $\Sigma_i$ are not assumed equal, then taking logarithms as in (5) we obtain the quadratic discriminant

$$d_i(x) = \log \pi_i - \tfrac{1}{2}\log|\Sigma_i| - \tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) \quad (8)$$

Here it is understood that the suffix $i$ refers to the sample of values from class $A_i$. In classification, the quadratic discriminant is calculated for each class and the class with the largest discriminant is chosen. To find the a posteriori class probability explicitly, the exponential is taken of the discriminant and the resulting quantities are normalized to sum to unity. Thus, the posterior class probabilities $P(A_i \mid x)$ are given by

$$P(A_i \mid x) = \frac{\exp\{d_i(x)\}}{\sum_j \exp\{d_j(x)\}} \quad (9)$$
Explicit posterior probabilities are needed when decisions must weigh class probabilities and associated expected costs explicitly. The most frequent problem with quadratic discriminants is caused when some attribute has zero variance in one class, for then the covariance matrix cannot be inverted. One way of avoiding this problem is to add a small positive constant term to the diagonal terms in the covariance matrix (this corresponds to adding random noise to the attributes). Another way, adopted in our own implementation, is to use some combination of the class covariance and the pooled covariance. Once again, the above formulae are stated in terms of the unknown population parameters $\pi_i$, $\mu_i$ and $\Sigma_i$. To obtain the corresponding “plug-in” formulae, substitute the corresponding sample estimators: $S_i$ for $\Sigma_i$; $\bar{x}_i$ for $\mu_i$; and $p_i$ for $\pi_i$, where $p_i$ is the sample proportion of class $A_i$ examples. [18] [19]
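The regularization idea above can be sketched with scikit-learn's QuadraticDiscriminantAnalysis, whose reg_param shrinks each class covariance toward a scaled identity matrix, serving the same purpose as adding a small constant to the diagonal; the data below are simulated placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(3)
X = rng.uniform(0, 5, size=(300, 10))  # placeholder review ratings
y = rng.integers(0, 3, size=300)       # placeholder archetype labels

# reg_param blends each class covariance with a scaled identity matrix,
# keeping the covariance invertible even if an attribute has near-zero
# variance within a class.
qda = QuadraticDiscriminantAnalysis(reg_param=0.01)
qda.fit(X, y)
print(qda.predict_proba(X[:5]).round(3))  # posterior class probabilities as in (9)
```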
This approach involves randomly dividing the observations into $k$ “folds” of approximately equal size. The first fold is treated as a validation set, and the model is trained on the remaining $k-1$ folds of data. This trained model is then used to predict the target in the held-out fold, and an accuracy metric, $\mathrm{Err}_1$, is computed. This procedure is repeated $k$ times, where a new validation set is used during each iteration. This process results in $k$ estimates of the test error: $\mathrm{Err}_1, \mathrm{Err}_2, \ldots, \mathrm{Err}_k$. The $k$-fold CV estimate is computed by averaging these values,

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{Err}_i \quad (10)$$
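A minimal sketch of k-fold cross-validation for a k-NN classifier is shown below; the data are simulated placeholders and the choice of 10 folds and 5 neighbors is illustrative only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.uniform(0, 5, size=(300, 10))  # placeholder review ratings
y = rng.integers(0, 3, size=300)       # placeholder archetype labels

# Each fold serves once as the validation set; the k per-fold scores are
# averaged, as in equation (10).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
print(scores.mean(), scores.std())     # CV estimate and its spread
```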
The kappa statistic is defined as

$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \quad (11)$$

where $\Pr(a)$ refers to the proportion of actual agreement and $\Pr(e)$ refers to the probability of making a correct classification purely by chance. Kappa values range from 0 to a maximum of 1, with a value of 1 indicating perfect agreement, a value of 0 indicating no agreement, and values between 0 and 1 indicating varying degrees of agreement. Depending on how a model is to be used, the interpretation of the kappa statistic might vary. Values above 0.8 were considered acceptable for this project. Traditional metrics such as precision, recall, and specificity can still be calculated with multiple classes, but the objective of this analysis was overall accuracy, not a specific error rate. [15] [21]
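For completeness, a small sketch of computing the kappa statistic in (11) with scikit-learn follows; the labels are toy placeholders, not the project's predictions.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy true vs. predicted archetype labels (placeholders only).
y_true = [0, 0, 1, 1, 2, 2, 2, 1, 0, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1, 0, 2]

print(accuracy_score(y_true, y_pred))     # Pr(a), the raw agreement
print(cohen_kappa_score(y_true, y_pred))  # kappa: agreement corrected for chance
```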
Table 1. Average Archetype Values
Figure 4. Biplot of First and Second Principal Components
Figure 5. PCA and Clusters
Figure 6. K-NN Performance for Different Values of K