Development of Computational Methods for Multi-omics Data Analysis

Paul, Sushmita
Indian Institute of Technology Jodhpur
Holistic understanding of human health and disease demands interpretation of molecular complexity and variations at multiple levels such as the genome, epigenome, transcriptome, proteome, metabolome, and others. High-throughput technologies have revolutionized medical research. The field of biology has become increasingly dependent on data generated at the above-mentioned levels, which together are called "multi-omics" data. An abundance of multi-omics data has transformed the disciplines of biology and medicine and generated opportunities for integrated system-level approaches. Multi-omics studies are based on data-driven scientific interrogations that help analyze a wide range of multi-dimensional datasets at various levels that can be scaled to unravel cellular complexity. With the advancement in high-throughput technologies, cancer research has also observed a paradigm shift toward multi-omics approaches, expanding initiatives in comprehensive research collaboration and the development of computational pipelines and algorithms. An in-depth understanding of cancer's pathological phenotype can only be achieved by adequately integrating the myriad of related biological information. Cancer genomics programs like The Cancer Genome Atlas have generated a considerable amount of multi-omics data for multiple cancer types that can be utilized for analysis. Therefore, the thesis aims to develop computational algorithms that utilize more than one type of omic data at a time to understand the etiology of cancer. The thesis's two principal objectives are (1) to identify miRNA-mRNA regulatory modules in cancer and (2) to identify cancer subtypes.
The computational objectives associated with these biological objectives are (1) simultaneous clustering of miRNAs and mRNAs (clustering of features) whose expression is measured across the same set of cancer patients and (2) clustering of cancer patients (clustering of samples) by integrating multiple levels of biological information coming from different omic platforms. Multi-omics integration and clustering have the potential to uncover systems-level knowledge but raise biological and computational challenges. The major challenges associated with multi-omics integration and clustering are (a) selecting informative, appropriate, and meaningful omics (aligned to the biological objective) for the integration task; (b) capturing the regulatory interactions between the multi-omics layers that can reflect the holistic nature of multi-level data; (c) carefully handling data heterogeneity across omics that have different distributions; (d) the problem of high dimension and low sample size, a situation where the variables significantly outnumber the samples, leading to model overfitting; and (e) avoiding the transmission of redundant and noisy information from the individual omics during multi-omics integration. The computational approaches proposed in this thesis address the above issues and integrate multi-omics data for miRNA-mRNA regulatory module identification and cancer subtyping. The approaches designed in this study are based on four paradigms: simultaneous clustering, feature weighting-assisted information fusion, subspace-based multi-kernel information fusion, and latent space-based information fusion. The simultaneous clustering algorithm groups co-expressed miRNAs and mRNAs together into a module. An optimization function is designed to simultaneously maximize the relevance between a miRNA and an mRNA and the functional similarity between a module's mRNAs.
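To make the idea of relevance-driven module formation concrete, the following is a minimal sketch and not the thesis's optimization function. It assumes that relevance is measured by the absolute Pearson correlation between a miRNA and an mRNA across the same patients, and it omits the functional-similarity term entirely. The function names, the threshold `min_relevance`, and the toy expression values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def assign_modules(mirna, mrna, min_relevance=0.8):
    """Attach each mRNA to the miRNA it is most relevant to.

    Relevance is ASSUMED to be |Pearson correlation|; mRNAs whose best
    relevance falls below `min_relevance` are left out of every module.
    """
    modules = {name: [] for name in mirna}
    for gene, gene_expr in mrna.items():
        best_name, best_rel = None, 0.0
        for name, mirna_expr in mirna.items():
            rel = abs(pearson(mirna_expr, gene_expr))
            if rel > best_rel:
                best_name, best_rel = name, rel
        if best_name is not None and best_rel >= min_relevance:
            modules[best_name].append(gene)
    return modules


# Toy expression profiles measured across the same four patients (hypothetical).
mirna = {"miR-A": [1.0, 2.0, 3.0, 4.0], "miR-B": [4.0, 1.0, 3.0, 2.0]}
mrna = {
    "gene1": [8.0, 6.0, 4.0, 2.0],  # strongly anti-correlated with miR-A
    "gene2": [4.1, 1.2, 3.0, 2.1],  # closely tracks miR-B
    "gene3": [1.0, 1.0, 1.0, 1.0],  # constant, carries no relevance signal
}
modules = assign_modules(mirna, mrna)
print(modules)
```

A full method along the lines described above would replace the greedy assignment with a joint objective that also rewards functional similarity among the mRNAs placed in the same module.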
The feature weighting-assisted information fusion approach integrates transcriptomics data to identify cancer subtypes. The method assigns a weight to every biomarker prior to data integration and sample clustering. The weights account for the inherent variance present in each transcriptomics dataset and are further utilized to calculate sample similarities. The impact of the weights on the sample similarity network is observed during sample stratification, when homogeneous groups of cancer patients with distinct molecular characteristics are identified. The subspace-based multi-kernel integration approach utilizes two graph-based representations for each omic dataset to capture the inherent data heterogeneity. It captures the best possible synergism between multiple representations using heuristics. Later, relevance-based integration of the synergistic graphs is performed to combine the multi-omics information for sample clustering. The recursive multi-kernel integration is performed to combine only the relevant and de-noised subspace. Here, relevant subspace refers to that subspace of the matrix that purely encodes the cluster information, which in the case of the synergy matrix is its eigenspace corresponding to the best eigenvalues. The latent space-based information fusion approach performs early integration of multi-omics data for sample clustering. Dimensionality reduction and simultaneous data integration are performed by training neural networks in an unsupervised setting. This helps to create an information bottleneck by capturing the non-linear relationships in the data and denoising them simultaneously. The compressed integrated data representation holds the global cluster structure that is explored for sample clustering.
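The feature weighting idea described above can be illustrated with a small sketch. The abstract does not state how the weights or the similarities are computed, so everything here is an assumption: weights are taken proportional to each feature's variance, sample similarity is a Gaussian of the weighted squared distance, and the per-omic similarity matrices are fused by simple averaging. All function names and the toy data are hypothetical.

```python
from math import exp


def variance_weights(data):
    """Per-feature weights proportional to variance, summing to 1 (assumed scheme)."""
    n_samples, n_features = len(data), len(data[0])
    variances = []
    for j in range(n_features):
        column = [row[j] for row in data]
        mean = sum(column) / n_samples
        variances.append(sum((v - mean) ** 2 for v in column) / n_samples)
    total = sum(variances)
    if total == 0:
        return [1.0 / n_features] * n_features
    return [v / total for v in variances]


def weighted_similarity(data, weights):
    """Sample-by-sample similarity: exp(-weighted squared Euclidean distance)."""
    n = len(data)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            dist = sum(w * (a - b) ** 2 for w, a, b in zip(weights, data[i], data[k]))
            sim[i][k] = exp(-dist)
    return sim


def fuse(similarities):
    """Average several sample-similarity matrices (assumed fusion rule)."""
    n, m = len(similarities[0]), len(similarities)
    return [[sum(s[i][k] for s in similarities) / m for k in range(n)] for i in range(n)]


# Two toy transcriptomic views of the same four patients (rows = patients).
mrna_expr = [[0.0, 1.0], [0.1, 1.0], [2.0, 1.1], [2.1, 1.0]]
mirna_expr = [[5.0, 0.2], [5.1, 0.1], [3.0, 0.2], [3.1, 0.1]]

views = [mrna_expr, mirna_expr]
weights = [variance_weights(view) for view in views]
fused = fuse([weighted_similarity(v, w) for v, w in zip(views, weights)])

for row in fused:
    print([round(value, 3) for value in row])
```

In this toy example patients 0 and 1 end up far more similar to each other than to patients 2 and 3, so any standard clustering method applied to the fused matrix would be expected to separate the two groups.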
Madhumita. (2023). Development of Computational Methods for Multi-omics Data Analysis (Doctor's thesis). Indian Institute of Technology Jodhpur, Jodhpur.