International Journal of Agriculture and Forestry

p-ISSN: 2165-882X    e-ISSN: 2165-8846

2014;  4(4): 310-316

doi:10.5923/j.ijaf.20140404.08

Data Mining Algorithms for Prediction of Soil Organic Matter and Clay Based on Vis-NIR Spectroscopy

Sandro Teixeira1, Alaine M. Guimarães1, Carlos A. Proença2, José Carlos F. da Rocha3, Eduardo Fávero Caires4

1Computer Applied to Agriculture Lab, State University of Ponta Grossa, Ponta Grossa, 84030-900, Brazil

2Information Technology Department, ABC Foundation - Research and Agricultural Development, Castro, 84166-981, Brazil

3Intelligent Systems Lab, State University of Ponta Grossa, Ponta Grossa, 84030-900, Brazil

4Soil Fertility Lab, State University of Ponta Grossa, Ponta Grossa, 84030-900, Brazil

Correspondence to: Alaine M. Guimarães, Computer Applied to Agriculture Lab, State University of Ponta Grossa, Ponta Grossa, 84030-900, Brazil.

Copyright © 2014 Scientific & Academic Publishing. All Rights Reserved.

Abstract

Organic matter (OM) and clay contents in the soil are important for the sustainability of agricultural systems. The laboratory methods used for OM and clay analyses are laborious, time consuming and require reagents that pollute the environment. Reflectance in the visible and near-infrared (Vis-NIR) regions can be a viable alternative for soil analysis, identifying attribute contents in a cleaner and quicker way. However, there is still neither a general model specifying the wavelengths to be used for each variable being analyzed nor a well-defined methodology to be applied. The aim of this study was to apply all the classification algorithms available in the Weka software in order to find the best correlations between spectral data in the Vis and NIR spectra, separately, and OM and clay content in the soil. As a result, clay prediction had a strong correlation with both the Vis and NIR spectra. OM prediction presented a determination coefficient greater than 0.7 but with an error that cannot be overlooked. The Lazy KStar algorithm proved the most adequate for mining the data, presenting the highest determination coefficients and the lowest errors. The best results for both OM and clay were obtained when correlated with the Vis spectrum. This suggests that it is possible to predict OM and clay using only the Vis spectrum.

Keywords: Soil properties, Vis-NIR, Spectral reflectance, Classifier algorithms

Cite this paper: Sandro Teixeira, Alaine M. Guimarães, Carlos A. Proença, José Carlos F. da Rocha, Eduardo Fávero Caires, Data Mining Algorithms for Prediction of Soil Organic Matter and Clay Based on Vis-NIR Spectroscopy, International Journal of Agriculture and Forestry, Vol. 4 No. 4, 2014, pp. 310-316. doi: 10.5923/j.ijaf.20140404.08.

1. Introduction

Organic matter (OM) and clay contents in the soil are important for the sustainability of agricultural systems. Practices that favor the conservation of OM improve soil properties and help reduce the erosion risk. In so-called sustainable management systems, beneficial microorganisms are incorporated into the topsoil, along with crop stubble and other organic waste, reducing the use of pesticides and fertilizers and leading to an increase in soil stability and conservation [16]. Another important soil constituent is clay; it carries negative electrical charges responsible for the cation exchange capacity (CEC), which is one of the requirements for recommending rates of fertilizer and lime. Clay is also closely related to the water retention of soil.
The laboratory methods used to analyze soil properties are laborious, time consuming and require the proper disposal of reagents, which otherwise pollute the environment [25]. For the determination of OM in soil, the most widely used procedure is the Walkley-Black method [27], which uses dichromate (Cr₂O₇²⁻) and causes environmental contamination by chromium.
A major challenge now is the development of analysis methods that require a minimum pre-treatment of samples, that are fast in achieving satisfactory answers and, mainly, that are non-destructive. An alternative that has shown good results is reflectance in visible (Vis) and near-infrared (NIR) spectroscopy (Vis-NIR). This methodology is based on the measurement of the absorption intensity of electromagnetic radiation in the visible and near-infrared regions.
Analysis using Vis-NIR is an amalgamation of spectroscopy, statistics and computing. The principle behind the technique is to illuminate a sample with light of a specific and known wavelength of the electromagnetic spectrum. Light absorption is measured from the difference between the amount of light emitted and the amount reflected by the sample, giving a ratio that can be used to predict the physical-chemical composition of the sample [1]. Each soil has a characteristic spectral signature that depends on its physical-chemical attributes, and this is the basis of the studies in this area. Information about the soil can be found at a given wavelength or in a specific region of the electromagnetic spectrum [17].
Obukhov and Orlov [15] showed the differences between groups of soils using spectral reflectance. However, their findings were not expanded upon until Shepherd and Walsh [19] presented the use of diffuse reflectance in the visible and near-infrared regions for the rapid assessment of certain fertility parameters in surface soil in East Africa. Fidencio et al. [7] used reflectance in the near infrared to predict organic carbon in Brazilian soil and also carried out a study similar to that of Obukhov and Orlov [15]. Stenberg et al. [20] and Viscarra Rossel et al. [26] consider that OM and clay content, together with total nitrogen content, are promising factors for soil assessment. These authors state that there is a global movement towards the development of more methodologies for soil analysis because there is a great demand for large quantities of high-quality data to be used in agriculture. They also report that in 2009 they proposed a project for the creation of a Global Soil Spectral Library. This project is intended to develop a global collaboration network in soil spectroscopy, encouraging wider research and promoting its adoption within soil science.
According to Chang et al. [3] the level of clay content is the attribute that performs best in predictions using data in the Vis-NIR wavelength region of the spectrum. Additionally, Moron and Cozzolino [14] evaluated the content of sand, silt and clay in soils in Uruguay; R2 results obtained in the calibration were higher than 0.8 for the fractions of sand, silt and clay.
Statistical regression functions have been applied to analyze spectral data, increasing the understanding of soil spectral behavior and yielding calibration models for the prediction of soil properties [21]. Several statistical regression methods have been used to analyze soil variables using Vis-NIR, such as multiple linear regression (MLR), polynomial regression (PR), principal component regression (PCR) and stepwise multiple linear regression (SMLR) [3; 4; 10; 23].
Given the features of the data obtained from Vis-NIR equipment, spectral data mining using machine learning techniques can be considered an important methodology for finding correlations between the different wavelengths and the soil property being analyzed. Data mining (DM), which corresponds to one of the steps of Knowledge Discovery in Databases (KDD), is a technology that combines traditional statistical data analysis methods with sophisticated algorithms to process large and complex datasets [6]. A comparison between Bayes and lazy classifiers showed that lazy classifiers were more efficient than Bayes classifiers [24]. Artificial neural networks (ANN) have also been applied with good results [2; 19].
Although several papers assert the viability of using spectroscopy for soil organic matter analysis with different techniques, there is still neither a general model specifying the wavelengths to be used for each variable being analyzed nor a well-defined methodology to be applied.
Although several computational methods have been tested, other data mining algorithms, such as Bagging, M5 Rules and Lazy KStar [29], could be useful in establishing an efficient methodology. The Weka software [8] is a well-designed computational environment that offers a set of efficient algorithms that present good results in several domains and could be tested and evaluated for this research theme.
Another important point concerns the most adequate spectral region for analyzing different variables. There is no consensus on whether a NIR or a Vis-NIR spectrometer is really needed to analyze OM and clay. Since the more spectral regions a spectrometer covers the more expensive it is, identifying a specific spectral range could help reduce equipment acquisition costs.
The aim of this study was to apply all the classification algorithms available in the Weka software in order to find the best correlations between spectral data in the Vis and NIR spectra, separately, and OM and clay content in the soil.

2. Material and Methods

2.1. Site Description and Soil

The study was undertaken at Pirai do Sul city, Parana State, Brazil (24° 22' S, 50° 04' W). The field-scale dataset covers an area of 110 hectares and the predominant soils are loamy or clayey Oxisols. The climate in the region, according to the Köppen classification, is Cfb, with mild summers and frequent frosts in winter. The altitude ranges from 600 to 1300 m and annual precipitation is about 1,480 mm.
The field sampling campaign was based on a georeferenced grid (Fig. 1) generated with one sample per hectare; soil samples were collected at a depth of 0-0.20 m. To form each sample, 8 sub-samples were taken around the grid point and then standardized, totaling 111 soil samples [17].
Figure 1. Georeferenced grid of the field sampling

2.2. Soil Analysis

The samples collected were analyzed by ABC Foundation Lab for determining the following parameters: (i) chemical analysis: pH (1:2.5 soil: 0.01 M CaCl2 suspension), total acidity (H + Al), exchangeable Al, Ca, Mg, and K, and organic matter (OM); (ii) texture analysis (g kg-1): clay, silt and sand; (iii) Vis-NIR analysis.
Soil samples were oven-dried for 12 h at 40°C and sieved (2 mm) prior to spectral scanning. The samples were analyzed for clay content by the pipette method [5], and for OM using the colorimetric method [18]. Soil pH, total acidity (H + Al), exchangeable Al, Ca, Mg, and K were analyzed according to the methods described by Raij et al. [18].
The spectral reflectance in the visible and near-infrared regions was obtained using a FOSS model XDS near-infrared spectrometer. ISIscan software, version 3.2, was used to acquire the spectra and WinISI II, version 1.5, for the curves.
The spectra of the soil samples were recorded over the 400-2500 nm wavelength range at 2 nm increments. All the diffuse reflectance spectra were automatically converted to log (1/R), where "R" represents reflectance; Fig. 2 exemplifies this with four randomly selected samples of different clay contents.
Figure 2. Graphical representation of the spectral curve of four samples with different clay contents
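The log (1/R) transformation mentioned above can be illustrated with a minimal sketch. It assumes that reflectance is stored as a fraction in (0, 1] and that a base-10 logarithm is meant, which is the usual convention for apparent absorbance; the text does not state the base. The function name and the sample values are hypothetical and are not data from this study.

```python
import math

def reflectance_to_log_inverse(reflectance):
    """Convert diffuse reflectance R (fractions in (0, 1]) to log10(1/R)."""
    if any(r <= 0 or r > 1 for r in reflectance):
        raise ValueError("reflectance must lie in (0, 1]")
    return [math.log10(1.0 / r) for r in reflectance]

# Hypothetical reflectance readings for one sample (illustrative only).
example_reflectance = [1.0, 0.5, 0.1]
example_absorbance = reflectance_to_log_inverse(example_reflectance)
```

A perfectly reflecting sample (R = 1) maps to 0, and lower reflectance maps to larger values, so peaks in the transformed curve correspond to stronger absorption.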

2.3. Data Analysis

All the data analyses were carried out in Weka Software [8].
The original database consisted of 1064 variables, corresponding to the different wavelengths read by the Vis-NIR spectrometer. In order to analyze the Vis and NIR spectra separately, each combined with clay and OM, the database was divided into four datasets, as presented in Table 1. In data mining notation, OM and clay are called goal attributes and the wavelength variables are called predictor attributes.
Table 1. Split of the database used in the study
     
The first dataset contains OM plus 150 variables, one for each wavelength of the visible spectrum (OM-Vis). Dataset 2 contains OM plus 898 variables related to the wavelengths of the near-infrared spectrum (OM-NIR). Likewise, datasets 3 (Clay-Vis) and 4 (Clay-NIR) contain, in addition to the clay content, the variables representing the wavelengths of the Vis and NIR spectra, respectively. The large number of variables results in high-dimensional datasets, which makes the statistical and computational search for correlations difficult.
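The four-way split described above can be sketched as follows. This is only an illustration under stated assumptions: the 700 nm Vis/NIR cut-off, the record layout and all function names are hypothetical, and the nominal 400-2500 nm grid at 2 nm steps used here does not reproduce the variable counts of the original database (1064 in total, 898 for NIR), which depend on instrument details not given in the text. With these assumptions the Vis part does contain 150 bands.

```python
def wavelength_grid(start_nm=400, stop_nm=2500, step_nm=2):
    """Return a nominal scan grid in nm, inclusive of both ends."""
    return list(range(start_nm, stop_nm + 1, step_nm))

def split_vis_nir(spectrum, vis_upper_nm=700):
    """Split a {wavelength_nm: log(1/R)} mapping into Vis and NIR parts.

    vis_upper_nm is an assumed cut-off, not a value taken from the study.
    """
    vis = {w: v for w, v in spectrum.items() if w < vis_upper_nm}
    nir = {w: v for w, v in spectrum.items() if w >= vis_upper_nm}
    return vis, nir

def build_datasets(samples, vis_upper_nm=700):
    """Build the four datasets as lists of (predictors, goal) pairs."""
    datasets = {"OM-Vis": [], "OM-NIR": [], "Clay-Vis": [], "Clay-NIR": []}
    for sample in samples:
        vis, nir = split_vis_nir(sample["spectrum"], vis_upper_nm)
        datasets["OM-Vis"].append((vis, sample["om"]))
        datasets["OM-NIR"].append((nir, sample["om"]))
        datasets["Clay-Vis"].append((vis, sample["clay"]))
        datasets["Clay-NIR"].append((nir, sample["clay"]))
    return datasets

# One hypothetical soil sample with a flat spectrum (illustrative values only).
grid = wavelength_grid()
toy_samples = [{"spectrum": {w: 0.5 for w in grid}, "om": 30.0, "clay": 400.0}]
toy_datasets = build_datasets(toy_samples)
```

Each dataset pairs one goal attribute (OM or clay) with the predictor attributes of a single spectral region, which is the structure the four datasets share.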
To reduce the dimensionality of the datasets, a filter algorithm, named Attribute Selection Algorithm, was applied to the four datasets. This algorithm uses the CfsSubsetEval (Correlation-based Feature Subset Selection) attribute evaluator [8]. As a result, the wavelengths that presented the best correlation with OM and clay content in the Vis and NIR spectra were selected.
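The idea behind correlation-based feature subset selection can be sketched with the standard CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-goal correlation and r_ff the mean feature-feature correlation. The sketch below is not the Weka implementation: it assumes absolute Pearson correlation as the correlation measure and a simple greedy forward search, both of which are simplifications, and the wavelength names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cfs_merit(subset, features, goal):
    """CFS merit: rewards correlation with the goal, penalizes redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(features[f], goal)) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(features, goal):
    """Greedy forward search: add the best feature while the merit improves."""
    selected, best = [], 0.0
    remaining = list(features)
    while remaining:
        score, name = max((cfs_merit(selected + [f], features, goal), f) for f in remaining)
        if score <= best:
            break
        selected.append(name)
        remaining.remove(name)
        best = score
    return selected, best

# Hypothetical goal values and three hypothetical wavelength variables.
toy_goal = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_features = {
    "w480": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with the goal
    "w482": [2.0, 4.0, 6.0, 8.0, 10.5],   # nearly redundant with w480
    "w900": [3.0, 1.0, 4.0, 1.0, 5.0],    # weakly related
}
toy_selected, toy_merit = forward_select(toy_features, toy_goal)
```

In this toy case the search keeps only the single most informative variable, because adding a redundant or weakly related one lowers the merit. This mirrors how a large set of neighbouring wavelengths can collapse to a handful of selected variables.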
All 39 classifier algorithms [22] available in the Weka software were used to mine the datasets, with the default values of the algorithm parameters.
The data mining results, after applying all the classification algorithms, were evaluated by analyzing the R2 and the root relative squared error (RRSE) [3]. The RRSE expresses the error relative to what it would have been if a simple predictor had been used. Only the three best algorithm results for each dataset were considered, based on the highest R2 and the t test (P = 0.05).
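The error measures used in the results (MAE, RMSE, RAE and RRSE) can be sketched as below. This is a minimal illustration, assuming that the "simple predictor" is the mean of the observed values being evaluated; the function name and the numbers are hypothetical and are not results from this study.

```python
import math

def regression_errors(actual, predicted):
    """Return MAE, RMSE, RAE (%) and RRSE (%) for paired observations."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of the simple predictor that always answers with the mean.
    abs_base = sum(abs(a - mean_actual) for a in actual)
    sq_base = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": 100.0 * abs_err / abs_base,
        "RRSE": 100.0 * math.sqrt(sq_err / sq_base),
    }

# Hypothetical measured and predicted contents in g kg-1 (illustrative only).
toy_actual = [10.0, 20.0, 30.0, 40.0]
toy_predicted = [12.0, 18.0, 33.0, 37.0]
toy_errors = regression_errors(toy_actual, toy_predicted)
```

An RRSE of 100% means the model is no better than always predicting the mean, so values well below 100% indicate that the model captures a real relationship between the wavelengths and the goal attribute.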

3. Results and Discussion

3.1. Soil Analysis

The soil samples used had OM and clay contents in the ranges 15-43 g kg-1 and 123-534 g kg-1, respectively. The pH (1:2.5 soil: 0.01 M CaCl2 suspension) varied from 4.1 to 5.8. The effective cation exchange capacity (ECEC) ranged from 15.7 to 79.5 mmolc dm-3, and the cation exchange capacity at pH 7.0 (CEC) ranged from 46.4 to 107.4 mmolc dm-3.

3.2. Wavelengths Filtered

After applying the selection filter algorithm, the number of variables of interest (wavelengths) was substantially reduced to no more than 7, as presented in Table 2.
Table 2. Result of applying the filter to the datasets, considering the best correlation between wavelength and OM and clay
     
Only one wavelength (480 nm) was selected in both the OM and clay datasets for the Vis spectrum. When the NIR spectrum was considered, several wavelengths were selected for both the OM-NIR and the Clay-NIR datasets.

3.3. Organic Matter

Based on the three best statistical results obtained for each dataset mined, the most suitable classification algorithms for the domain studied were determined. Table 3 shows the results for OM: the Bagging, Decision Table and Lazy KStar algorithms presented the best correlations between OM and wavelengths in the Vis spectrum (OM-Vis dataset), and the Random Subspace, Bagging and Decision Table algorithms returned the best correlations of OM with the NIR spectrum (OM-NIR dataset).
Table 3. Results returned by the algorithms correlating OM with the Vis and NIR spectra. The Lazy KStar algorithm presented the best results for the Vis spectrum
     
The Lazy KStar algorithm presented the highest R2 (0.770) when correlating OM content with the Vis spectrum. In addition to achieving the highest coefficient, this algorithm obtained the smallest errors with the OM-Vis dataset: MAE (2.36), RMSE (3.25), RAE (43.16%) and RRSE (47.73%).
For the OM-NIR dataset, no algorithm gave better results than those obtained with the OM-Vis dataset.
Figure 3 shows the correlations between measured OM (g kg-1) and predicted OM (g kg-1) using the Vis and NIR spectra, obtained for each of the three algorithms with the best results.
Figure 3. Relationship between measured OM and predicted OM using Vis spectrum and the algorithms: (a) Lazy KStar; (b) Bagging; and (c) Decision Table. Relationship between measured OM and predicted OM using NIR spectrum and the algorithms: (d) Decision Table; (e) Bagging; and (f) Random Subspace. **P < 0.01

3.4. Clay

Predicting clay using the Vis spectrum (wavelength 480) gave an R2 of 0.899. Acceptable errors (RRSE of 32.54% in the worst case) were obtained with this dataset. Although good results were obtained for the correlation between clay and the NIR spectrum, they were not better than those obtained for clay with the Vis spectrum. The best R2 for the Clay-NIR dataset was 0.835. In any case, estimating clay with both the Clay-NIR and Clay-Vis datasets gave better results than estimating OM.
Table 4 shows the algorithms that returned the best results when correlating clay with wavelengths in the Vis spectrum (Clay-Vis dataset): Bagging, M5 Rules and Lazy KStar. The M5 Rules, Multilayer Perceptron (MLP) and Lazy KStar algorithms were the best at correlating clay with the NIR spectrum (Clay-NIR dataset).
Table 4. Results returned by the algorithms correlating clay with the Vis and NIR spectra. The Lazy KStar algorithm presented the best results for both the Vis and NIR spectra
     
The Lazy KStar algorithm presented the best correlation result for both the Clay-Vis and Clay-NIR datasets. The good results obtained with the Lazy KStar algorithm are in agreement with those found by Vijayarani and Muthulakshmi [24]. The OM-NIR dataset was the only one for which this algorithm was not used, because it was not selected by the Feature Selection algorithm. Further research is necessary before a firm conclusion can be reached about why the Lazy KStar algorithm was not selected in this case.
The best determination coefficient for clay (0.899) was higher than that found by Meyer [12], who obtained an R2 of 0.769 for clay analysis. However, the results for OM (R2 0.769) were not better than those obtained by He and Song [9], who used multivariate calibration and PLS models and obtained an R2 of 0.921.
Figure 4 shows the correlations between measured clay (g kg-1) and predicted clay (g kg-1) using the Vis and NIR spectra, when applying the three algorithms with the best results.
Figure 4. Relationship between measured clay and predicted clay using Vis spectrum and the algorithms: (a) Lazy KStar; (b) Bagging; and (c) M5Rules. Relationship between measured clay and predicted clay using NIR spectrum and the algorithms: (d) Lazy KStar; (e) MLP; and (f) M5Rules. **P < 0.01

4. Conclusions

Clay content prediction presented a strong correlation with the selected wavelengths. OM estimation had a determination coefficient greater than 0.7 but with an error that cannot be overlooked. This indicates that applying data mining techniques to the studied database was highly feasible for predicting clay content, while further investigation is required to improve the estimation of OM. The Lazy KStar algorithm proved the most adequate for mining the data, presenting the highest determination coefficients and the lowest errors when combined with the Attribute Selection filter algorithm.
The best results for both OM and clay were obtained when correlated with the Vis spectrum. This suggests that it is possible to predict OM and clay when using only the Vis spectrum.
The study showed that the use of spectroscopy is feasible, with the advantages of being a quick and clean technique. To confirm the efficiency of the technique applied in this work, it would be important to conduct the same study in different regions, with a variety of soil types and climatic conditions.

ACKNOWLEDGEMENTS

We are grateful to ABC Foundation - Research and Agricultural Development for providing the experimental area, the conventional analyses and the Vis-NIR spectrometer, which made it possible to conduct this research.

References

[1]  Borges Neto, W. (2005) Parâmetros de qualidade de lubrificantes e óleo de oliva através de espectroscopia vibracional, calibração multivariada e seleção de variáveis. 2005. 130 f. Tese (Doutorado em Química) - UNICAMP - Universidade Estadual de Campinas. Campinas, São Paulo.
[2]  Brown, D.J.; Shepherd, K.D.; Walsh, M.G.; Dewayne Mays, M.; Reinsch, T.G. (2006) Global soil characterization with VNIR diffuse reflectance spectroscopy. Geoderma, 132, 273–290.
[3]  Chang, C.W.; Laird, D.A.; Mausbach, M.J.; Maurice, J.; Hurburgh, J.R. (2001) Near-Infrared reflectance spectroscopy – principal components regression analyses of soil properties. Soil Science Society of America Journal 65, 480–490.
[4]  Daniel, K.W.; Tripathi, N.K.; Honda, K.; Apisit, E. (2004) Analysis of VNIR (400–1100 nm) spectral signatures for estimation of soil organic matter in tropical soils of Thailand. International Journal of Remote Sensing 25, 643–652.
[5]  Embrapa. Centro Nacional de Pesquisa de Solos. (1997) Manual de métodos de análise de solo, 2.ed, Rio de Janeiro.
[6]  Fayyad, U.M.; Piatetski-Shapiro, G.; Smyth, P; Uthurusamy, R. (1996) Advances in Knowledge Discovery and Data Mining. Menlo Park: AAAI Press, p. 11-34.
[7]  Fidêncio, P. H.; Poppi, R. J.; De Andrade, J. C. (2002) Determination of organic matter in soils using radial basis function networks and near infrared spectroscopy. Analytica Chimica Acta, Amsterdam, v. 453, p. 125-134.
[8]  Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Ian H. Witten. (2009) The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
[9]  He,Y.; Song, H. (2006) Prediction of Soil Content Using Near-Infrared Spectroscopy, SPIE Newsroom- International Society for Optical Engineering, [S.l.], v.2, p. 8-10.
[10]  Islam, K.; Singh, B.; McBratney, A. (2003) Simultaneous Estimation of Several Soil Properties by Ultra-violet, Visible, and Near-infrared Reflectance Spectroscopy. Australian Journal of Soil Research 41, 1193–1202.
[11]  McBratney, A.B.; Minasny, B.; Viscarra Rossel, R. (2006) Spectral soil analysis and inference systems: a powerful combination for solving the soil data crisis, Geoderma, Amsterdam, v.136. p.272-278.
[12]  Meyer, J. H. (1999) Use of NIR in the South African sugar industry with reference to soil fertility management. South African Sugar Association Experiment Station, p. 1-13.
[13]  Moreira, F. M. S.; Siqueira, J. O. (2006) “Microbiologia e bioquímica do solo”. 2. ed. Lavras: Editora UFLA. p.729.
[14]  Moron, A.; Cozzolino, D. (2003) The potential of near-infrared reflectance spectroscopy to analyze soil chemical and physical characteristics. Journal of Agricultural Engineering, St. Joseph, v.140, p. 65-71.
[15]  Obukhov, A. I.; Orlov, D. S. (1964) Spectral reflectivity of the Major Soil Groups and possibility of using diffuse reflection in soil investigations. Soviet Soil Science, Washington, DC, v. 2, p. 174-184.
[16]  Poppi R. J; Sena M. (1999) Avaliação do uso de métodos quimiométricos em análise de solos. Departamento de Química Analítica - Instituto de Química - UNICAMP - CP 6154 - 13083-970 - Campinas – SP.
[17]  Proença, C. A. (2012) Redes Neurais Artificiais para predição dos teores de matéria orgânica e argila do solo na região dos Campos Gerais utilizando Espectroscopia de Reflectância Difusa. Dissertação de Mestrado, Ponta Grossa-PR, UEPG.
[18]  Raij B. V.; Andrade J.C; Cantarella H.; Quaggio J.A. (2001) Análise Química do Solo para Avaliação da Fertilidade de Solos Tropicais, pág. 177 a 180, Instituto Agronômico, Campinas, SP.
[19]  Shepherd, K. D.; Walsh, M. G. (2002) Development of reflectance spectral libraries for characterization of soil properties. Soil Science Society of America Journal, Madison, WI, v. 66, p. 988-998.
[20]  Stenberg, B.O.; Viscarra Rossel, R.A.; Mouazen, A.M.; Wetterlind, J. (2010) Visible and Near Infrared Spectroscopy in Soil Science. In: SPARKS, D.L. (Ed.). Advances in Agronomy, Burlington: Academic Press, v. 107, p. 163-215.
[21]  Terra, F. S. (2011) Espectroscopia de reflectância do visível ao infravermelho médio aplicada aos estudos qualitativos e quantitativos de solos. Tese (Doutorado) Escola Superior de Agricultura “Luiz de Queiroz”, Piracicaba.
[22]  Thornton, Chris; Hutter, Frank; Hoos, Holger H.; Leyton-Brown, Kevin. (2013) Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2013). Pages 847-855.
[23]  Udelhoven, T.; Emmerling, C.; Jarmer, T. (2003) Quantitative Analysis of Soil chemical Properties with Diffuse Reflectance Spectrometry and Partial-least Square Regression: a feasibility study. Plant Soil 251, 319–329.
[24]  Vijayarani, S., Muthulakshmi, M. (2013) Comparative Analysis of Bayes and Lazy Classification Algorithms. International Journal of Advanced Research in Computer and Communication Engineering. Vol. 2, Issue 8.
[25]  Viscarra Rossel, R. A.; McBratney, A. B. (1998) Laboratory Evaluation of a Proximal Sensing Technique for Simultaneous Measurement of Soil Clay and Water Content. Geoderma, 85, 19–39.
[26]  Viscarra Rossel, R.A.; Walvoort, D.J.J.; McBratney, A.B.; Janik, L.J.; Skjemstad, J.O. (2006) Visible, Near infrared, Mid infrared or Combined Diffuse Reflectance Spectroscopy for Simultaneous Assessment of Various Soil Properties. Geoderma, 131, 59–75.
[27]  Walinga, I. (1992) Spectrophotometric determination of organic carbon in soil. Communications in Soil Science and Plant Analysis, New York, v.23, p.1935-1944.
[28]  Walinga, I. (1992) Spectrophotometric determination of organic carbon in soil. Communications in Soil Science and Plant Analysis, New York, v.23, p.1935-1944.