Novel Statistical Methods for Aggregating Correlated and Missing Data with Applications to Chronic Disease Research

Public Deposited

Information-aggregation methods are crucial for identifying risk factors for chronic diseases, analyzing treatment effects, and handling missing-data problems. However, significant gaps persist in the literature, including the lack of signal-adaptive methods for summary statistics, the limited study of the correlation-robustness properties of hypothesis-testing methods, and the shortage of methods for handling high missing rates. This dissertation addresses these gaps and advances the relevant statistical methodology.

In the first part, we propose a new signal-adaptive analysis pipeline based on the omnibus thresholding Fisher’s method (oTFisher) to address unknown signal patterns. The oTFisher remains robustly powerful across various patterns of genetic effects. Its adaptive thresholding can also be used to screen for the single-nucleotide polymorphisms (SNPs) that contribute most to the overall significance of a given SNP set. Efficient computational algorithms are developed to control the type I error rate while accounting for the linkage disequilibrium (LD) among SNPs. Extensive simulations show that the oTFisher has robustly high power and achieves higher balanced accuracy in screening SNPs than the traditional Bonferroni and FDR procedures. We apply the oTFisher to study the association of genes and haplotype blocks with bone density-related traits using GWAS summary data from the Genetic Factors for Osteoporosis Consortium. The oTFisher identifies more novel and literature-reported genetic factors than existing p-value combination methods.

In the second part, we provide a theoretical analysis of the correlation-robustness of hypothesis-testing methods for correlated data, focusing on two classical tests: the minimum p-value (minP) test and the Simes test. We study the tail probabilities of the minP and Simes statistics under the Gaussian mean model with an arbitrary correlation matrix, and we show that both tests are asymptotically robust to any non-perfect correlation. These findings have significant practical implications for calculating extreme tail probabilities, as required for stringent type I error control in large-scale data analysis: approximating such probabilities by their values under independence can substantially speed up the computation for large datasets.

In the third part, we study missing-data problems with high missing rates across time points in pulmonary arterial hypertension studies. The COVID-19 pandemic introduced new challenges, such as high missing rates and unverifiable missingness assumptions, that affect the measurement of drug effects. In a simulation study, we systematically compare multiple imputation methods for handling high missing rates with the help of remotely collected data (e.g., actigraphy data). Four missingness scenarios are considered: missing at random, missingness due to adverse events, missingness due to lack of efficacy, and a mixture of these. We demonstrate that traditional parametric methods in the Bayesian framework yield high relative bias at a 40% missing rate, whereas incorporating remotely collected data related to the primary outcome and imputing according to the best guess of the reason for missingness leads to smaller relative biases.
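To make the first part's thresholding idea concrete, here is a minimal Monte Carlo sketch in Python. The hard-thresholding form of the statistic, the threshold grid `taus`, and the simulation-based calibration against correlated null z-scores are illustrative assumptions standing in for the dissertation's efficient analytic algorithms that account for LD; they are not the exact procedure.

```python
# Minimal sketch of an omnibus thresholding Fisher-type test (oTFisher-style).
# Calibration here is by Monte Carlo under an assumed LD (correlation) matrix,
# not the dissertation's analytic type I error control.
import numpy as np
from scipy import stats
from scipy.stats import rankdata

def tfisher_stat(pvals, tau):
    """Hard-thresholding Fisher statistic: sum of -2*log(p) over p-values <= tau."""
    kept = pvals[pvals <= tau]
    return -2.0 * np.log(kept).sum() if kept.size else 0.0

def otfisher_pvalue(pvals, corr, taus=(0.01, 0.05, 0.5, 1.0), n_mc=5000, seed=0):
    """Omnibus p-value over a threshold grid, calibrated by simulating
    null z-scores with the given (positive definite) correlation matrix."""
    rng = np.random.default_rng(seed)
    pvals = np.asarray(pvals)
    z_null = rng.standard_normal((n_mc, len(pvals))) @ np.linalg.cholesky(corr).T
    p_null = 2.0 * stats.norm.sf(np.abs(z_null))
    w_null = np.array([[tfisher_stat(row, t) for t in taus] for row in p_null])
    w_obs = np.array([tfisher_stat(pvals, t) for t in taus])
    # Empirical per-threshold p-values (larger W is more significant).
    p_obs = (w_null >= w_obs).mean(axis=0)
    p_null_per_tau = 1.0 - (rankdata(w_null, axis=0) - 1.0) / n_mc
    # Omnibus step: compare the observed minimum p-value to the null minima.
    return (p_null_per_tau.min(axis=1) <= p_obs.min()).mean()

# Toy usage: 10 SNPs with exchangeable LD (rho = 0.3) and two true signals;
# all numbers are made up for illustration.
n = 10
corr = 0.3 * np.ones((n, n)) + 0.7 * np.eye(n)
z = stats.multivariate_normal.rvs(mean=np.r_[2.5, 2.2, np.zeros(n - 2)],
                                  cov=corr, random_state=1)
pvals = 2.0 * stats.norm.sf(np.abs(z))
print("oTFisher omnibus p-value:", otfisher_pvalue(pvals, corr))
```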
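The second part's robustness claim can be probed numerically. The sketch below, under assumed settings (n = 50 two-sided Gaussian p-values with exchangeable correlation rho = 0.3), compares Monte Carlo tail probabilities of the minP and Simes statistics with their closed-form values under independence: P(minP <= t) = 1 - (1 - t)^n, and P(Simes <= t) = t exactly for independent uniform p-values. Per the asymptotic result, the ratio of the correlated to the independent probability approaches one as t shrinks, so agreement should improve toward the far tail; at the moderate thresholds feasible in a quick run, some residual discrepancy is expected.

```python
# Monte Carlo check of the independence approximation for minP and Simes
# tail probabilities under exchangeable correlation (assumed settings).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, rho, n_mc = 50, 0.3, 100_000

# Exchangeable correlation via a one-factor construction:
# z_i = sqrt(rho) * common + sqrt(1 - rho) * idiosyncratic_i.
common = rng.standard_normal((n_mc, 1))
z = np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.standard_normal((n_mc, n))
p = 2.0 * stats.norm.sf(np.abs(z))  # two-sided null p-values

minp = p.min(axis=1)
simes = (np.sort(p, axis=1) * n / np.arange(1, n + 1)).min(axis=1)

for t in (5e-2, 5e-3, 5e-4):  # deeper tails need a larger n_mc
    print(f"t={t:.0e}  minP: mc={(minp <= t).mean():.5f} "
          f"indep={1 - (1 - t) ** n:.5f} | "
          f"Simes: mc={(simes <= t).mean():.5f} indep={t:.5f}")
```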
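For the third part, the schematic simulation below illustrates the qualitative finding that an auxiliary, remotely collected variable related to the primary outcome can substantially reduce imputation bias under heavy missingness. The data-generating model, the roughly 40% missing-at-random mechanism, and the simple regression-based (improper) multiple imputation are all illustrative assumptions, not the dissertation's actual analysis or its Bayesian implementation.

```python
# Schematic comparison of complete-case analysis vs. multiple imputation
# with and without an auxiliary variable, under ~40% MAR missingness.
import numpy as np

rng = np.random.default_rng(2)
n, n_rep, true_mean = 500, 200, 1.0

def mi_estimate(y, mask, Z, m=20):
    """Regression-based multiple imputation of y (missing where mask is True)
    given covariates Z; returns the pooled estimate of E[y]. Residual-noise
    draws only (improper MI), adequate for a point-estimate illustration."""
    beta, *_ = np.linalg.lstsq(Z[~mask], y[~mask], rcond=None)
    resid = y[~mask] - Z[~mask] @ beta
    sigma = resid.std(ddof=Z.shape[1])
    ests = []
    for _ in range(m):
        y_imp = y.copy()
        y_imp[mask] = Z[mask] @ beta + rng.normal(0.0, sigma, mask.sum())
        ests.append(y_imp.mean())
    return np.mean(ests)

bias = {"complete case": [], "MI without auxiliary": [], "MI with auxiliary": []}
for _ in range(n_rep):
    x = rng.standard_normal(n)   # baseline covariate
    a = rng.standard_normal(n)   # remotely collected proxy (actigraphy-like)
    y = true_mean + 0.5 * x + 0.8 * a + rng.normal(0.0, 1.0, n)
    # ~40% missingness in y, driven by the observed x and a (MAR).
    p_miss = 1.0 / (1.0 + np.exp(-(0.6 * x + 0.8 * a - 0.3)))
    mask = rng.random(n) < p_miss
    Zx = np.column_stack([np.ones(n), x])
    Zxa = np.column_stack([np.ones(n), x, a])
    bias["complete case"].append(y[~mask].mean() - true_mean)
    bias["MI without auxiliary"].append(mi_estimate(y, mask, Zx) - true_mean)
    bias["MI with auxiliary"].append(mi_estimate(y, mask, Zxa) - true_mean)

for name, b in bias.items():
    print(f"relative bias, {name}: {np.mean(b) / true_mean:+.3f}")
```

In this setup the auxiliary variable drives both the outcome and the missingness, so the complete-case and no-auxiliary estimates remain biased while adding the auxiliary to the imputation model removes most of the bias, mirroring the abstract's conclusion.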

Identifier
  • etd-112665
Year
  • 2023
Date created
  • 2023-08-10
Source
  • etd-112665
Last modified
  • 2023-08-23

Permanent link to this page: https://digital.wpi.edu/show/m900nx976