American Journal of Bioinformatics Research
p-ISSN: 2167-6992 e-ISSN: 2167-6976
2019; 9(1): 22-44
doi:10.5923/j.bioinformatics.20190901.03

Noel Dougba Dago 1, Inza Jesus Fofana 1, Nafan Diarrassouba 1, Mohamed Lamine Barro 1, Jean-Luc Aboya Moroh 1, Olefongo Dagnogo 2, Loukou N’Goran Etienne 1, Martial Didier Saraka Yao 1, Souleymane Silué 1, Giovanni Malerba 3
1Unité de Formation et de Recherche Sciences Biologiques, Département de Biochimie-Génétique, Université Peleforo Gon Coulibaly, Korhogo, Côte d’Ivoire
2Unité de Formation et de Recherche Biosciences, Université Felix Houphouët-Boigny, BP V34 Abidjan 01, Côte d’Ivoire
3Department of Neurological, Biomedical and Movement Sciences University of Verona, Strada Le Grazie, Verona, Italy
Correspondence to: Noel Dougba Dago, Unité de Formation et de Recherche Sciences Biologiques, Département de Biochimie-Génétique, Université Peleforo Gon Coulibaly, Korhogo, Côte d’Ivoire.
Copyright © 2019 The Author(s). Published by Scientific & Academic Publishing.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

Data harvesting, data pre-treatment, and data statistical analysis and interpretation are strongly interdependent steps in biological and agronomic experimental surveys. To make the integration of these procedures straightforward, rigorous experimental and statistical schemes are required, paying attention to the typologies of the processed data. Numerous researchers continue to generate and analyse quantitative and qualitative phenotypic data in their agronomic experiments. Considering the impressive heterogeneity and size of such data, we propose here a semi-automated analysis procedure based on a computational statistical approach in the R programming environment. Its purpose is to provide a simple (no programming skills are requested of users), efficient (a few minutes are needed to get output files and/or figures) and flexible (users can add their own scripts and/or bypass some functions) tool that makes the handling of heterogeneous metric data in biostatistical surveys straightforward. The pipeline starts by loading a raw data matrix, followed by a data standardization procedure (if any). Next, data are processed for multivariate descriptive and analytical statistical analysis, comprising data quality control through correlation matrix heatmap and p-value clustering analysis graphics, and data normality assessment by the Shapiro-Wilk normality test. Data are then handled by principal component analysis (PCA), including a PCA n-factor survey discriminating the number of factor components needed to explain data variability. Finally, data are submitted to linear and/or multiple linear regression (MLR) analysis with the purpose of linking the managed data variables mathematically. The pipeline exhibits high performance in terms of time saving when processing large amounts of heterogeneous quantitative data, providing a complete descriptive and analytical statistical framework. In conclusion, we provide a quick and useful semi-automatic computational biostatistical pipeline in a simple programming language, exempting researchers from needing skills in advanced programming and statistical techniques, although it is not exhaustive in terms of features.
Keywords: Computational statistical pipeline, Biostatistics, Agronomic metric data, R software
Cite this paper: Noel Dougba Dago, Inza Jesus Fofana, Nafan Diarrassouba, Mohamed Lamine Barro, Jean-Luc Aboya Moroh, Olefongo Dagnogo, Loukou N’Goran Etienne, Martial Didier Saraka Yao, Souleymane Silué, Giovanni Malerba, A Quick Computational Statistical Pipeline Developed in R Programing Environment for Agronomic Metric Data Analysis, American Journal of Bioinformatics Research, Vol. 9 No. 1, 2019, pp. 22-44. doi: 10.5923/j.bioinformatics.20190901.03.
Assessment of processed data clustering and distribution by multivariate boxplot and hierarchical clustering analysis

The boxplot graph allows the dispersion of the processed data to be assessed by identifying outlier data and/or samples (data quality control). The boxplot() function builds boxplots in base R. The boxplot is one of the most common types of graphic, giving a summary of one or several numeric variables. The line that divides the box into two parts represents the median of the processed data, while the upper and lower edges of the box show the upper and lower quartiles, respectively. The extreme lines show the highest and lowest values, excluding outliers.

Hierarchical cluster analysis is an algorithm that groups similar samples into groups called clusters. Hierarchical clustering was performed on raw and normalized genetic feature data. Once data are provided, the pipeline automatically computes a distance matrix in the background. Usually, the distance between two clusters is computed as the length of the straight line drawn from one cluster to the other; this is commonly referred to as the Euclidean distance. Here, the hierarchical survey is based on the Euclidean distance, as it is usually the appropriate measure of distance in the physical world.

Correlation tests

In the present pipeline, several correlation coefficients are invoked depending on the typology of the processed data. Correlation coefficients are used in statistics to measure how strong a relationship is between two or more variables, and there are several types of correlation coefficient. Pearson's correlation is a coefficient commonly used in linear regression. The value of a correlation is expressed numerically by a coefficient, most often Pearson's or Spearman's, while the significance of the coefficient is expressed by a p-value. The coefficient of correlation shows the extent to which changes in the value of one variable are correlated with changes in the value of the other. Spearman's coefficient of correlation, or rank correlation, is calculated when one of the data sets is on an ordinal scale, when the data distribution deviates significantly from the normal distribution, or when some values diverge considerably from most of those measured (outliers) [5, 19, 20]. The Spearman rank correlation is a non-parametric test used to measure the degree of association between two variables; it carries no assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal. The Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables [21]. It is also noteworthy that the presently developed pipeline provides the processed correlation heatmap as well as a phylogenetic tree graphic. A minimal sketch of these quality-control steps is given below.
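The following minimal sketch in base R illustrates the boxplot, hierarchical clustering and correlation-test steps just described; the bio.data.matrix object and its column names follow the pipeline script reproduced further below, while the plotting options are assumptions rather than the pipeline's exact settings.

# Minimal sketch, assuming bio.data.matrix is the numeric matrix of
# normalized metric variables built in the pipeline script below.

# Multivariate boxplot: median, quartiles and outliers for each variable.
boxplot(bio.data.matrix, las = 2, main = "Normalized data dispersion")

# Hierarchical clustering of samples on the Euclidean distance matrix.
d <- dist(bio.data.matrix, method = "euclidean")
hc <- hclust(d, method = "average")
plot(hc, main = "Hierarchical clustering (Euclidean distance)")

# Correlation tests for one variable couple (column names as in the script).
cor.test(bio.data.matrix[, "Diameter"], bio.data.matrix[, "High"],
         method = "pearson")   # parametric, linear relationship
cor.test(bio.data.matrix[, "Diameter"], bio.data.matrix[, "High"],
         method = "spearman")  # non-parametric, rank-based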
Parallel Principal Component Analysis (PCA)

A first essential step in factor analysis is to determine the appropriate number of factors with parallel analysis. The parallel PCA survey is carried out with the purpose of evaluating the components or factors to retain in a principal component analysis (PCA) or common factor analysis (FA). Evidence has been presented that parallel analysis is one of the most accurate factor retention methods, while also being one of the most underutilized in management and organizational research. Specifying too few factors results in the loss of important information by ignoring a factor or combining it with another [22]. This can result in measured variables that actually load on factors not included in the model, in variables falsely loading on the factors that are included, and in distorted loadings for measured variables that do load on included factors. Furthermore, these errors can obscure the true factor structure and result in complex solutions that are difficult to interpret [23, 24]. Several studies have shown that parallel analysis is an effective method for determining the number of factors. Despite being the most critical, top-priority issue of factor analysis, determining the number of factors is considered one of its most challenging stages; this is particularly true for researchers inexperienced in factor analysis, although it is occasionally difficult even for experienced researchers, depending on the characteristics of the instrument (or scale), the research group and thus the collected data [25-29]. Essentially, the program works by creating a random dataset with the same numbers of observations and variables as the original data. A correlation matrix is computed from the randomly generated dataset, and the eigenvalues of that correlation matrix are then computed. When the eigenvalues from the random data are larger than the eigenvalues from the PCA or factor analysis, you know that the components or factors are mostly random noise.
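A minimal sketch of such a parallel analysis, assuming the psych package and the bio.data.matrix object built in the pipeline script below (the number of random iterations is an assumption):

library(psych)

# Parallel analysis: observed eigenvalues are compared with eigenvalues of
# n.iter simulated random datasets of the same dimension as the input data.
pa <- fa.parallel(bio.data.matrix, fa = "pc", n.iter = 100,
                  main = "Parallel analysis scree plot")

# Suggested number of principal components to retain.
pa$ncomp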
# New data matrix including maximum normalized data.
# Matrix parameter and variable name assignment. Here the pipeline gives
# users the opportunity to adapt, adjust and/or change the analyzed
# variable names.
rownames(bio_data_matrix) <- user_data$Var_Treat # the Var_Treat column of the loaded "user_data" matrix is used as the row names of the new matrix (raw data)
rownames(bio.data.matrix) <- user_data$Var_Treat # the Var_Treat column of the loaded "user_data" matrix is used as the row names of the new matrix (normalized data)
colnames(bio_data_matrix) <- c("Diameter", "High", "Leave_Nub", "Leave_Leng") # new matrix column names (raw data)
colnames(bio.data.matrix) <- c("Diameter", "High", "Leave_Nub", "Leave_Leng") # new matrix column names (normalized data)
# Next, the matrices managed above are written and saved with a txt extension
# in the Results folder of the working directory. Note that the Results folder
# must be created before running the following script.
write.table(bio_data_matrix, file = "Results/non_norm_table.txt", sep = ",", col.names = NA, qmethod = "double")
write.table(bio.data.matrix, file = "Results/norm_table.txt", sep = ",", col.names = NA, qmethod = "double")
# Whole descriptive statistical analysis, providing the descriptive statistics
# table for the processed variables (Table 1). stat.desc() requires the
# pastecs package.
library(pastecs)
stat.desc(bio.data.matrix)
write.table(stat.desc(bio.data.matrix), file = "Results/stat.desc.txt", sep = "\t")

Figure 1. Multivariate statistical analysis boxplot graphic comparing raw and normalized heterogeneous data distributions for each considered parameter
Figure 2. Bean-plot graphic comparing (1) raw and (2) standardized data dispersion and/or distribution by merging the processed variables' data
Figure 3. Assessment of the normality of the processed data (raw and standardized/normalized) by density plot and quantile normalisation methods
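The checks of Figure 3 can be reproduced with a minimal base-R sketch, assuming the bio.data.matrix object and column names from the pipeline script above; the Shapiro-Wilk test mentioned earlier complements these graphics.

# Density plot and normal quantile (Q-Q) plot for one processed variable.
x <- bio.data.matrix[, "Diameter"]
plot(density(x), main = "Density plot: normalized Diameter")
qqnorm(x)
qqline(x)

# Shapiro-Wilk normality test, as evoked in the pipeline description.
shapiro.test(x)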
# pvclust p-value clustering survey (requires the pvclust package)
library(pvclust)
# Normalized and/or standardized data
result_ND <- pvclust(bio.data.matrix, method.dist = "cor", method.hclust = "average")
# Raw and/or unstandardized data
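A minimal sketch of how the resulting object can be displayed, assuming that clusters with approximately unbiased (AU) p-values of at least 95% are to be highlighted:

# Dendrogram annotated with AU (approximately unbiased) and BP (bootstrap)
# p-values for each cluster.
plot(result_ND, main = "p-value clustering (normalized data)")

# Draw rectangles around clusters with AU p-value >= 0.95.
pvrect(result_ND, alpha = 0.95)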

# Spearman Correlation with variable couples table.
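The corresponding pipeline script is not reproduced here; one possible sketch of such a variable-couples table, computing Spearman's rho and its p-value with cor.test() for every pair of columns (the output file name is hypothetical), is:

# Pairwise Spearman correlations for all variable couples of the matrix.
vars <- colnames(bio.data.matrix)
couples <- t(combn(vars, 2))  # one row per variable couple
rho <- apply(couples, 1, function(v)
  cor(bio.data.matrix[, v[1]], bio.data.matrix[, v[2]], method = "spearman"))
p_val <- apply(couples, 1, function(v)
  cor.test(bio.data.matrix[, v[1]], bio.data.matrix[, v[2]],
           method = "spearman")$p.value)
couples_table <- data.frame(couples, rho = round(rho, 3),
                            p.value = signif(p_val, 3))
colnames(couples_table)[1:2] <- c("Var1", "Var2")
couples_table
write.table(couples_table, file = "Results/spearman_couples.txt",
            sep = "\t", row.names = FALSE)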
# Heatmap correlation
# Correlation (Pearson and/or Spearman) heatmap graphic

The flexibility of our pipeline allows users to set and/or use the correlation method that best fits their data. Here, we represent both the Spearman and Pearson correlation heatmaps for standardized data (ND). However, our pipeline also provides correlation heatmap graphics for unstandardized data (RD); a minimal sketch of this step is given below.
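One way to draw such a heatmap in base R, assuming the bio.data.matrix object from the pipeline script above (colour scheme and titles are assumptions):

# Spearman correlation matrix of the normalized data (swap in
# method = "pearson" for the parametric alternative).
cor_nd <- cor(bio.data.matrix, method = "spearman")

# Symmetric heatmap of the correlation matrix in base R.
heatmap(cor_nd, symm = TRUE, scale = "none", col = heat.colors(256),
        main = "Spearman correlation heatmap (ND)")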
Figure 5. Pearson and/or Spearman correlation heatmap survey processing standardized (ND) paired metric variable parameters
Figure 6. Variance analysis via principal component analysis (PCA) of the processed metric parameters by (A) bar and (B) line scree-plot graphics
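Scree plots such as those of Figure 6 can be obtained with a minimal base-R sketch; the object name pca_nd is hypothetical and bio.data.matrix follows the pipeline script above.

# PCA on the normalized matrix (variables already on comparable scales).
pca_nd <- prcomp(bio.data.matrix)

# (A) Bar and (B) line scree plots of the variance explained per component.
screeplot(pca_nd, type = "barplot", main = "Scree plot (bars)")
screeplot(pca_nd, type = "lines", main = "Scree plot (lines)")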
Figure 7. Shepard diagram evaluating inter-individual distances in the multivariate space created by the PCA survey vs. observed inter-individual distances for standardized data

Figure 8. Individual and variable correlation circle and/or interaction graphics assessing individual data variability
Figure 9. Retained adjusted eigenvalues vs. unadjusted eigenvalues, as well as estimated bias representation
The principal() function in the psych package can be used to extract and rotate principal components. The analyzed data can be a raw data matrix (i.e. bio_data_matrix, see additional file) or a covariance matrix; pairwise deletion of missing data is used. The rotate option can be "none", "varimax" (see script above), "quartimax", "promax", "oblimin", "simplimax" or "cluster". Here, we performed an example by testing the hypothesis for n = 1, because the PCA n-factor survey results discriminated n = 1 as the optimal number of factors explaining the metric data variability (Figure 7B). The test of the hypothesis that one component is sufficient exhibited the following results: root mean square of the residuals (RMSR) = 0.06, with an empirical chi-square of 4.34 and p < 0.11. The results of the present PCA factor analysis, computing the cumulative proportion of variance explained with regard to the n = 1 component, are reported in Table 6. The principal component analysis also provided the communality (h2) and specific (u2) variances. Taken as a whole, the proportion of variance computed by PC1 = 0.9.
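A minimal sketch of this step with psych::principal, under the n = 1 hypothesis named above and with the varimax rotation used in the script:

library(psych)

# Extract a single principal component (n = 1) from the raw data matrix.
pc1 <- principal(bio_data_matrix, nfactors = 1, rotate = "varimax")

# Printing the object reports loadings, communalities (h2), uniquenesses (u2)
# and fit indices such as the root mean square of the residuals (RMSR).
pc1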
Figure 10. Assessment of the normality of the multiple linear regression response parameter (y) by density plot and quantile normalisation methods
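A minimal sketch of the multiple linear regression step named in the abstract, under the assumption that Diameter serves as the response (y) and the remaining pipeline variables as predictors (this choice is for illustration only):

# Multiple linear regression linking the processed metric variables.
mlr_data <- as.data.frame(bio.data.matrix)
mlr_fit <- lm(Diameter ~ High + Leave_Nub + Leave_Leng, data = mlr_data)

# Coefficients, R-squared and significance of the fitted model.
summary(mlr_fit)

# Normality check of the response parameter (y), matching Figure 10.
plot(density(mlr_data$Diameter), main = "Density of response (y)")
qqnorm(mlr_data$Diameter)
qqline(mlr_data$Diameter)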