Development of Computational Methods for Multi-omics Data Analysis
Date
2022-07
Researcher
Madhumita
Supervisor
Paul, Sushmita
Publisher
Indian Institute of Technology Jodhpur
Abstract
Holistic understanding of human health and disease demands interpretation of molecular complexity
and variations at multiple levels such as the genome, epigenome, transcriptome, proteome, metabolome,
and others. High-throughput technologies have revolutionized medical research. The field of biology
has become increasingly dependent on data generated at the above-mentioned levels, which together are
called "multi-omics" data. An abundance of multi-omics data has transformed the disciplines of biology
and medicine and generated opportunities for integrated system-level approaches. Multi-omics studies are
based on data-driven scientific interrogations that help analyze a wide range of multi-dimensional datasets
at various levels that can be scaled to unravel cellular complexity.
With the advancement in high-throughput technologies, cancer research has also observed
a paradigm shift toward multi-omics approaches, expanding initiatives in comprehensive research
collaboration and the development of computational pipelines and algorithms. An in-depth understanding of
cancer's pathological phenotype can only be achieved by adequately integrating the myriad of related
biological information. Cancer genomics programs like The Cancer Genome Atlas have generated a
considerable amount of multi-omics data for multiple cancer types that can be utilized for the analysis.
Therefore, the thesis aims to develop computational algorithms that utilize more than one type of omics data at
a time to understand the etiology of cancer. The thesis's two principal objectives are (1) To identify
miRNA-mRNA regulatory modules in cancer and (2) To identify cancer subtypes. The computational
objectives associated with these biological objectives are (1) Simultaneous clustering of miRNAs and
mRNAs (clustering of features) whose expression is measured across the same set of cancer patients and (2)
Clustering of cancer patients (clustering of samples) by integrating multiple levels of biological information
coming from different omic platforms.
Multi-omics integration and clustering have the potential to uncover subsequent systems-level
knowledge but raise biological and computational challenges. The major challenges associated with
multi-omics integration and clustering are (a) Selection of informative, appropriate, and meaningful omics
(aligned to the biological objective) for the integration task. (b) Capturing the regulatory interactions
between the multi-omics layers that can reflect the holistic nature of multi-level data. (c) Careful handling
of data heterogeneity across omics that have different distributions. (d) The problem of high dimension and
low sample size, a situation where the variables significantly outnumber the samples, leading to model
overfitting. (e) Avoiding the transmission of redundant and noisy information from the individual omics
during multi-omics integration. The computational approaches proposed in this thesis address the above
issues and integrate multi-omics data for miRNA-mRNA regulatory module identification and cancer
subtyping. The approaches designed in this study are based on four paradigms, specifically, Simultaneous
clustering, Feature weighting-assisted information fusion, Subspace-based multi-kernel information fusion,
and Latent space-based information fusion.
The simultaneous clustering algorithm groups co-expressed miRNAs and mRNAs together into a
module. An optimization function is designed to simultaneously maximize the relevance between a miRNA
and an mRNA and the functional similarity between a module's mRNAs. The feature weighting-assisted
information fusion approach integrates transcriptomics data to identify cancer subtypes. The method assigns
a weight to every biomarker prior to data integration and sample clustering. The weights account for the
inherent variance present in each transcriptomics dataset and are further utilized to calculate sample similarities.
The impact of the weights on the sample similarity network is observed during sample stratification, when
homogeneous groups of cancer patients with distinct molecular characteristics are identified. The
subspace-based multi-kernel integration approach utilizes two graph-based representations for each omics
dataset to capture the inherent data heterogeneity. It captures the best possible synergism between the multiple
representations using heuristics. Later, relevance-based integration of the synergistic graphs is performed
to combine the multi-omics information and cluster the samples. The recursive multi-kernel integration
is performed to combine only the relevant and de-noised subspaces. Here, relevant subspace refers to
that subspace of a matrix that purely encodes the cluster information, which in the case of the synergy
matrix is its eigenspace corresponding to the best eigenvalues. The latent space-based information fusion
approach performs early integration of multi-omics data for sample clustering. Dimensionality reduction
and simultaneous data integration are performed by training neural networks in an unsupervised setting. This
helps to create an information bottleneck that captures the non-linear relationships in the data while denoising
it. The compressed integrated data representation holds the global cluster structure that
is explored for sample clustering.
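As a rough illustration of the idea that the eigenspace of a sample-similarity matrix can encode cluster structure, the sketch below applies generic spectral clustering to two synthetic "omics" matrices. It is not the thesis's algorithm. The Gaussian similarity, the median bandwidth heuristic, the unweighted averaging of the two normalized similarity matrices (a stand-in for relevance-based integration), the reading of "best eigenvalues" as the largest ones, and all function and variable names are assumptions made for this example.

```python
import numpy as np


def gaussian_similarity(X):
    """Sample-by-sample Gaussian similarity for one omics matrix (rows = samples)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, 0.0)
    # Assumption: bandwidth chosen from the median pairwise squared distance.
    gamma = 1.0 / np.median(d2[d2 > 0])
    return np.exp(-gamma * d2)


def normalize(K):
    """Symmetric normalization D^{-1/2} K D^{-1/2} of a similarity matrix."""
    d = K.sum(axis=1)
    return K / np.sqrt(np.outer(d, d))


def leading_eigenspace(S, k):
    """Eigenvectors of the k largest eigenvalues, with rows scaled to unit length."""
    _, vecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
    U = vecs[:, -k:]
    return U / np.linalg.norm(U, axis=1, keepdims=True)


def kmeans(U, k, n_iter=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Synthetic data (hypothetical): 40 samples in two groups, measured on two platforms.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 20)
omic_a = rng.normal(size=(40, 50))
omic_a[truth == 1, :20] += 4.0
omic_b = rng.normal(size=(40, 30))
omic_b[truth == 1, :15] += 3.0

# Fuse the two views by simple averaging, then cluster in the leading eigenspace.
fused = (normalize(gaussian_similarity(omic_a))
         + normalize(gaussian_similarity(omic_b))) / 2.0
labels = kmeans(leading_eigenspace(fused, k=2), k=2)
```

The cluster labels returned by `kmeans` are arbitrary identifiers, so they should be compared with known groups only up to a relabeling.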
Citation
Madhumita. (2023). Development of Computational Methods for Multi-omics Data Analysis (Doctor's thesis). Indian Institute of Technology Jodhpur, Jodhpur.