Strategies and Structures for Biological Datasets Integration and Knowledge Discovery

Teeple, Erin

Etd

Correlations between variables in biological data reflect underlying processes. Data science problems in this domain include how to reduce dimensionality or integrate data without losing information, and how to use such data for discovery and prediction tasks. Part 1 of this dissertation details the generation of a dataset for machine learning from US air quality and cause-specific mortality records. A novel epidemiological analysis derived from canonical correlation analysis (CCA) is then presented for quantifying exposure-outcome associations; by modeling covariation, it achieves a stronger and significant quantification of the association between air quality and health outcomes than the more commonly used multiple linear regression. Conceptual understanding of covariation modeling then guides the alignment of single-nucleus RNA-seq datasets using CCA-based features to extract new insights into relationships between regional cell states and disease.
Part 2 considers the problem of extracting information from a knowledge graph for the task of predicting drug indication status for target-disease pairs, using aggregated association evidence scores from the Open Targets platform as input features. The first work in part 2 presents a new approach that leverages local network topology and achieves improved prediction performance over previously published work. The second work is another novel approach to the same classification task, which further improves performance by integrating external biological data resources through feature engineering informed by collaborative filtering and network embedding concepts.
Part 3 further explores the problem of transforming raw biological data with correlated features into data structures for knowledge discovery, illustrating how such preparations can generate custom data structures suited to feature generation as shown in part 2.
Innovations in this work include the development of new integration strategies for biological data (parts 1, 2, and 3), the development of multiple novel indication status prediction methods for use in the Open Targets platform (part 2), and the generation of new data-derived networks for knowledge discovery (part 3).

Creator