Machine learning in DNA microarray analysis

Published 1/22/2017 11:15:01 AM  |  Last update 4/6/2019 05:00:08 PM
Tags: data analysis, machine learning

Many protocols identify differentially expressed genes with a twofold-difference cutoff. This cutoff is arbitrary, however: it may be too high or too low depending on the variability of the data, which the criterion itself does not take into account. A data point above or below the cutoff line could be there by chance or by error.
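As a rough illustration of why a fixed cutoff can mislead, the sketch below compares the twofold rule with a variability-aware statistic (Welch's t) on two made-up genes. The gene names, intensity values, and the choice of Welch's t are assumptions for illustration only; a real analysis would normally work on normalized, log-transformed data and derive p-values with multiple-testing correction.

```python
import math
import statistics

def log2_fold_change(treated, control):
    """Log2 ratio of mean expression (treated over control)."""
    return math.log2(statistics.mean(treated) / statistics.mean(control))

def welch_t(treated, control):
    """Welch's t statistic: mean difference scaled by its standard error."""
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return (statistics.mean(treated) - statistics.mean(control)) / se

# Hypothetical intensities: two genes, four replicates per condition.
genes = {
    "gene_noisy":  {"control": [100, 400, 150, 350],
                    "treated": [900, 200, 700, 300]},
    "gene_stable": {"control": [200, 205, 195, 200],
                    "treated": [300, 310, 290, 300]},
}

results = {}
for name, data in genes.items():
    lfc = log2_fold_change(data["treated"], data["control"])
    t = welch_t(data["treated"], data["control"])
    results[name] = (lfc, t)
    passes = abs(lfc) >= 1.0  # |log2 FC| >= 1 is the twofold cutoff
    print(f"{name}: log2FC={lfc:.2f}, twofold={passes}, Welch t={t:.1f}")
```

With these invented numbers, gene_noisy passes the twofold cutoff but has a weak t statistic (about 1.5), while gene_stable fails the cutoff despite a very consistent change (t of about 22).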

Ensuring that a gene is truly differentially expressed requires multiple replicate experiments and statistical testing. However, not all microarray experiments are array-replicated, so we need to analyze data generated by different protocols. Statistical analysis, including meta-analysis, uses microarrays to study genes in isolation, whereas the real power of microarrays lies in their ability to reveal relationships between genes and to identify genes or samples that behave in a similar manner.

Meta-analysis and integrative data analysis exploit only the behavior of individual genes at the genotype level. Discovering gene behavior at the phenotype level, by connecting genotype to phenotype, requires additional analyses. Cluster analysis, an unsupervised learning method, is a popular approach that searches for patterns of gene expression associated with an experimental condition. Each experimental condition usually corresponds to a specific biological event, such as a drug treatment or a disease state, thereby allowing the discovery of drug targets and of relationships between drugs and diseases.
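To make the idea of cluster analysis concrete, here is a minimal sketch that groups genes by the Euclidean distance between their expression profiles using plain K-Means, one common clustering algorithm. The gene names and log-ratio values are invented, and the deterministic initialization (the first k profiles) is a simplification chosen for reproducibility, not a recommended practice.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(profiles, k, max_iter=100):
    """Plain Lloyd's algorithm; centroids start at the first k profiles."""
    centroids = [list(p) for p in profiles[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

# Hypothetical log-ratios for six genes across four conditions.
expression = {
    "gene_a": [2.1, 1.9, -0.1, 0.0],
    "gene_b": [-1.8, -2.0, 0.2, 0.1],
    "gene_c": [1.8, 2.2, 0.1, -0.2],
    "gene_d": [-2.1, -1.7, 0.0, 0.3],
    "gene_e": [2.0, 2.0, 0.0, 0.1],
    "gene_f": [-2.0, -1.9, -0.1, 0.0],
}

names = list(expression)
cluster_labels, centroids = k_means([expression[n] for n in names], k=2)
for name, label in zip(names, cluster_labels):
    print(name, "-> cluster", label)
```

In this toy dataset, the genes induced in the first two conditions end up in one cluster and the repressed genes in the other, without any class labels being supplied.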

Machine learning and data mining approaches have been developed for these further analyses. They can be divided into unsupervised and supervised methods.

Unsupervised methods aggregate samples, genes, or both into clusters based on the distance between measured gene expression values. The goal of clustering is to group objects with similar properties, so that the distance measure is small within clusters and large between clusters. Several clustering methods from classical pattern recognition, such as hierarchical clustering, K-Means clustering, fuzzy C-Means clustering, and self-organizing maps, have been applied to microarray data analysis. Using unsupervised methods, we can search the resulting clusters for candidate genes or treatments whose expression patterns could be associated with a given biological condition or gene expression signature. The advantage of this approach is that it is unbiased and allows significant structure to be identified in a complex dataset without any prior information about the objects.

In contrast, supervised methods integrate knowledge of sample class information into the analysis, with the goal of identifying expression patterns (i.e., gene expression signatures) that can be used to classify unknown samples according to their biological characteristics. A training dataset, consisting of gene expression values and sample class labels, is used to select a subset of expressed genes that have the most discriminative power between the classes. This subset is then used to build a predictive model, also called a classifier (e.g., k-nearest neighbors, neural networks, or support vector machines), which takes the expression values of the pre-selected genes for an unknown sample as input and outputs the predicted class label of the sample.
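The supervised workflow described above (select discriminative genes from a labeled training set, then classify an unknown sample) can be sketched as follows. This is a minimal illustration under stated assumptions: the expression values and class names are invented, the gene-ranking score is a simple signal-to-noise-style ratio picked for brevity, and k-nearest neighbors stands in for whichever classifier one would actually use.

```python
import math
import statistics
from collections import Counter

def select_genes(samples, labels, n_genes):
    """Rank genes by |mean difference| / (sum of class std devs); keep the top n."""
    class_a, class_b = sorted(set(labels))  # assumes exactly two classes
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == class_a]
        b = [s[g] for s, y in zip(samples, labels) if y == class_b]
        spread = (statistics.stdev(a) + statistics.stdev(b)) or 1e-9
        score = abs(statistics.mean(a) - statistics.mean(b)) / spread
        scores.append((score, g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]

def knn_predict(train, train_labels, query, genes, k=3):
    """Majority vote among the k nearest training samples on the selected genes."""
    def dist(u, v):
        return math.sqrt(sum((u[g] - v[g]) ** 2 for g in genes))
    nearest = sorted(zip(train, train_labels),
                     key=lambda pair: dist(pair[0], query))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Hypothetical training set: rows are samples, columns are four genes.
train = [
    [5.1, 2.0, 1.0, 3.3],
    [4.9, 3.1, 1.2, 2.9],
    [5.3, 2.5, 0.9, 3.1],
    [1.0, 2.2, 4.8, 3.0],
    [1.2, 3.0, 5.1, 3.2],
    [0.9, 2.6, 5.0, 2.8],
]
train_labels = ["tumor"] * 3 + ["normal"] * 3

selected = select_genes(train, train_labels, n_genes=2)
unknown = [4.8, 2.7, 1.1, 3.0]  # made-up unlabeled sample
prediction = knn_predict(train, train_labels, unknown, selected)
print("selected gene indices:", sorted(selected))
print("predicted class:", prediction)
```

Restricting the distance computation to the selected genes keeps uninformative genes from diluting the classifier's vote, which is the point of the gene-selection step.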

© 2024 blog.tinyray.com  by tinyray