A protocol for automated timber species identification using metabolome profiling
Abstract
Using chemical fingerprints for timber species identification is a relatively new but promising technique. However, little is known about how the parameter settings used to pre-process spectral data affect timber species classification accuracy. Therefore, this study presents an extensive and automated analysis method using the random forest machine learning algorithm on a set of highly valuable timber species from the Meliaceae family. Metabolome profiles were collected using direct analysis in real-time (DART™) ionisation coupled with time-of-flight mass spectrometry (TOFMS) analysis of heartwood specimens for 175 individuals (representing 10 species). In order to analyse variability in classification accuracy, 110 sets of data pre-processing parameter combinations consisting of mass tolerance for binning and relative abundance cut-off thresholds were tested. Furthermore, for each set of parameters (designated “binning/threshold setting”), a random search for one hyperparameter of interest was performed, i.e. the number of variables (in this case ions) drawn randomly for each random forest analysis. The best classification accuracy (82.2%) was achieved with 47 variables and a binning and threshold combination of 40 mDa and 4%, respectively. Entandrophragma angolense is mostly confused with Entandrophragma candollei and Khaya anthotheca, and several Swietenia species are confused with each other due to the high similarity of their chemical fingerprints. Entandrophragma cylindricum, Entandrophragma utile, Khaya ivorensis, Lovoa trichilioides and Swietenia macrophylla are easy to discriminate and show fewer misclassifications. The choice of parameter settings, whether in the data pre-processing (binning and threshold) or in the classification algorithm (hyperparameters), results in variability in classification accuracy.
Therefore, a preliminary parameter screening is proposed before constructing the final model when using the random forest algorithm for classification. Overall, DART-TOFMS in combination with random forest is a powerful tool for species identification.
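To make the two pre-processing parameters concrete, the following is a minimal Python sketch of how a binning width (in mDa) and a relative abundance cut-off (in %) could turn raw peak lists into a sample-by-ion matrix. It is an illustration under stated assumptions, not the procedure used in the study: the function names (`bin_spectrum`, `apply_threshold`, `preprocess`) are hypothetical, binning is simplified to a fixed-width m/z grid, and relative abundance is taken relative to the most intense bin of each spectrum.

```python
from collections import defaultdict


def bin_spectrum(peaks, bin_width_mda):
    """Sum the intensities of peaks that fall into the same m/z bin.

    Assumption: a fixed-width grid stands in for tolerance-based binning.
    `peaks` is a list of (m/z in Da, intensity) pairs.
    """
    width = bin_width_mda / 1000.0  # convert mDa to Da
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / width)] += intensity  # key is the bin index
    return dict(binned)


def apply_threshold(binned, cutoff_percent):
    """Rescale to percent of the most intense bin and drop weak bins."""
    if not binned:
        return {}
    base = max(binned.values())
    relative = {b: 100.0 * i / base for b, i in binned.items()}
    return {b: r for b, r in relative.items() if r >= cutoff_percent}


def preprocess(spectra, bin_width_mda, cutoff_percent):
    """Build a sample-by-ion matrix for one binning/threshold setting."""
    width = bin_width_mda / 1000.0
    processed = [
        apply_threshold(bin_spectrum(s, bin_width_mda), cutoff_percent)
        for s in spectra
    ]
    bins = sorted(set().union(*processed))
    # Label each column with the (rounded) m/z at the centre of its bin.
    ions = [round(b * width, 4) for b in bins]
    # Ions absent from a spectrum after thresholding are filled with 0.
    matrix = [[p.get(b, 0.0) for b in bins] for p in processed]
    return ions, matrix


# Two toy spectra (invented values, for illustration only).
spectra = [
    [(100.010, 50.0), (100.015, 50.0), (200.000, 2.0)],
    [(100.012, 10.0), (300.000, 40.0)],
]
ions, matrix = preprocess(spectra, bin_width_mda=40, cutoff_percent=4)
print(ions)
print(matrix)
```

Each row of the resulting matrix corresponds to one specimen and each column to one ion. In a screening of the kind proposed here, such a matrix would be rebuilt for every binning/threshold setting and passed to a random forest implementation, with the number of variables drawn at random tuned separately for each setting.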
For more information: https://link.springer.com/article/10.1007%2Fs00226-019-01111-1