A comparison of PLS-based and other dimension reduction methods for tumour classification using microarray data
Biotechnological development in the area of genomics has led to an explosion in the number of large datasets containing thousands of simultaneously measured gene or protein expressions from biological samples. Accordingly, there has been much activity in the development or re-deployment of multivariate and data-mining techniques that may be used to analyze such data. A particular area of interest is tumour classification based on microarray data, and many studies have compared various methods for this purpose. However, many of these studies compare methods on a single benchmark dataset or, worse still, compare methods from very different families, where issues such as implicit data standardization (e.g. the choice of dissimilarity metric) might be responsible for differences in performance rather than the fundamental nature of the method.
This study compares the classification accuracy of three different families of multivariate methods: 1. PLS-based methods; 2. Canonical ordination methods; and 3. Classification using scores obtained from indirect [unsupervised] projection methods. The success of the methods is gauged on their ability to correctly classify out-of-sample biological samples arising from three widely used benchmark datasets into tumour categories. Classification rates were also compared for varying numbers of components, both with the full complement of genes, and with only those identified as differentially expressed.
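To make the evaluation protocol concrete, the following is a minimal sketch of estimating out-of-sample misclassification rates for a varying number of components. It is not the study's implementation: the classifier is plain PLS1 regression on a 0/1 class indicator (a simple stand-in for the PLS-based family), the data are simulated rather than taken from the benchmark datasets, and the single random hold-out split, the 0.5 decision threshold, and all function names are assumptions made for illustration.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit PLS1 (single response) by NIPALS-style deflation.

    Returns the column means of X, the mean of y, and the regression
    coefficients expressed in the original (centred) gene space.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                  # weight vector: covariance with response
        w = w / np.linalg.norm(w)
        t = Xr @ w                     # component scores
        tt = t @ t
        p = Xr.T @ t / tt              # X loadings
        c = yr @ t / tt                # y loading
        Xr = Xr - np.outer(t, p)       # deflate X
        yr = yr - c * t                # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef


def pls1_classify(model, X_new, threshold=0.5):
    """Predict class 1 when the fitted response exceeds the threshold."""
    x_mean, y_mean, coef = model
    return ((X_new - x_mean) @ coef + y_mean > threshold).astype(int)


def holdout_error_by_components(X, y, train_idx, test_idx, max_components):
    """Misclassification rate on held-out samples for 1..max_components."""
    rates = []
    for k in range(1, max_components + 1):
        model = pls1_fit(X[train_idx], y[train_idx].astype(float), k)
        pred = pls1_classify(model, X[test_idx])
        rates.append(float(np.mean(pred != y[test_idx])))
    return rates


# Simulated "microarray": 60 samples, 200 genes, 10 of them shifted in class 1.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 200
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_genes))
X[:, :10] += 2.0 * y[:, None]

# One random hold-out split (assumed protocol): 40 training, 20 test samples.
order = rng.permutation(n_samples)
train_idx, test_idx = order[:40], order[40:]

rates = holdout_error_by_components(X, y, train_idx, test_idx, max_components=4)
for k, rate in enumerate(rates, start=1):
    print(f"{k} component(s): hold-out misclassification rate = {rate:.2f}")
```

Restricting the columns of `X` to a pre-selected gene subset before calling `holdout_error_by_components` would mimic the comparison between the full complement of genes and the differentially expressed genes only; any such selection should be made on the training samples alone so that the hold-out estimate stays out-of-sample.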
All three PLS-based methods yielded misclassification rates lower than, or at worst equal to, those of the other methods for all three benchmark datasets. Generalized and Ridge-penalized PLS classification outperformed all other methods when fewer gene components were used in the classification process. Performance of the Canonical ordination methods was highly variable across the three datasets: these methods performed well on two of the datasets and poorly on the remaining one. The indirect dimension reduction methods did not perform well, with higher misclassification rates relative to at least one of the other families of techniques; however, these methods did improve with higher numbers of components and when only differentially expressed genes were considered.
This document has been peer reviewed.