Graduated Students

Seunghee Baek, PhD

Year Graduated: 2010
Advisor:
Scarlett L. Bellamy, ScD
Dissertation Title:
A Copula-Based Method For Analyzing Bivariate Binary Longitudinal Data
Abstract:

The work presented as part of this dissertation is primarily motivated by a randomized trial for HIV serodiscordant couples. Specifically, the Multisite HIV/STD Prevention Trial for African American Couples is a behavioral modification trial for African American, heterosexual, HIV discordant couples. In this trial, investigators developed and evaluated a couple-based behavioral intervention for reducing risky shared sexual behaviors and collected retrospective outcomes from both partners at baseline and at 3 follow-ups to evaluate the intervention efficacy. As the outcomes refer to the couples' shared sexual behavior, couples' responses are expected to be correlated, and modeling approaches should account for multiple sources of correlation: within-individual over time as well as within-couple both at the same measurement time and at different times. This dissertation details the novel application of copulas to modeling dyadic, longitudinal binary data to estimate reliability and efficacy. Copulas have long been analytic tools for modeling multivariate outcomes in other settings. In particular, we selected a mixture of max-infinitely divisible (max-id) copulas because this family has a number of attractive analytic features.
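
For readers unfamiliar with copulas, Sklar's theorem gives the general bivariate representation that underlies this class of models; the display below is a generic sketch, not the specific max-id mixture family adopted in the dissertation.

\[
H(y_1, y_2) \;=\; C\!\left\{F_1(y_1),\, F_2(y_2)\right\},
\]

where \(F_1\) and \(F_2\) are the marginal distribution functions of the two partners' outcomes and \(C\) is a copula on the unit square that encodes their dependence.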


The dissertation is arranged as follows: Chapter 2 presents a copula-based approach in estimating the reliability of couple self-reported (baseline) outcomes, adjusting for key couple-level baseline covariates; Chapter 3 presents an extension of the max-id copula to model longitudinal (two measurement occasions), binary couples data; Chapter 4 further extends the copula-based model to accommodate more than two repeated measures in a different application examining two clinical depression measures. In this application, we are interested in estimating whether there are differential treatment effects on two different measures of depression, longitudinally.


The copula-based modeling approach presented in this dissertation provides a useful tool for investigating complex dependence structures among multivariate outcomes as well as examining covariate effects on the marginal distribution for each outcome. The application of existing statistical methodology to longitudinal, dyad-based trials is an important translational advancement. The methods presented here are easily applied to other studies that involve multivariate outcomes measured repeatedly.


Laurel Bastone, PhD

Year Graduated: 2007
Advisor:
Mary E. Putt, PhD, ScD
Dissertation Title:
Statistical Methods To Further The Understanding Of Complex And Continuous Genetic Traits
Abstract:

My research addresses two important issues in the analysis of complex and continuous genetic traits. The first topic is motivated by an association study of cardiovascular disease. We compare the mathematical basis of two methods for detecting higher-order genotype-phenotype associations, multifactor-dimensionality reduction (MDR) and patterning and recursive partitioning (PRP), in order to better understand the utility of these methods in data analysis. We show that MDR is a special case of recursive partitioning in which patterns are used as predictors, tree growth is restricted to a single split, and misclassification error is used as the measure of node impurity. Our finding indicates that the extensive theory developed for recursive partitioning should be considered when applying both MDR and PRP.


The second issue is motivated by a linkage study of gene expression phenotypes in extended pedigrees. The study design is novel since families are not selected on the basis of phenotype. As a result, there is potential for the presence of two types of families: segregating and non-segregating. Segregating families have at least one parent heterozygous at the quantitative trait locus. In contrast, non-segregating families have two homozygous parents and are uninformative for linkage. We use latent class methods to extend Haseman-Elston regression to account for heterogeneity and estimate which families exhibit evidence of segregation and linkage. We derive expressions for the parameters in the latent class model in terms of an assumed genetic model using both the squared difference and family mean corrected product as outcome variables. We use maximum likelihood methods to fit the model, assuming independence of sibling pairs within a family, conditional on class. We use Generalized Estimating Equations (GEE) to fit a model that accounts for correlations among sibling pairs, via the Expectation-Solution algorithm. We show the equivalence of the estimating equations based on the maximum likelihood and GEE approaches, assuming conditional independence. A permutation procedure is used to test for linkage in the segregating class. We illustrate the methods on the motivating set of data and perform a simulation study.
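
For context, classical Haseman-Elston regression, which the latent-class model extends, regresses the squared sib-pair trait difference on the proportion of alleles shared identical by descent (IBD) at the test locus; a minimal sketch, with notation assumed here rather than taken from the dissertation, is

\[
E\!\left[(y_{1j} - y_{2j})^2 \mid \pi_j\right] \;=\; \alpha + \beta\,\pi_j, \qquad \beta < 0 \ \text{under linkage},
\]

where \(\pi_j\) is the IBD sharing proportion for sibling pair \(j\).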


Shaun Bender, PhD

Year Graduated: 2016
Advisor:
Justine Shults, PhD
Dissertation Title:
Ignorability Conditions for Incomplete Data and the First-Order Markov Conditional Linear Expectation Approach for Analysis of Longitudinal Discrete Data with Overdispersion
Abstract:

Medical researchers strive to collect complete information, but most studies will have some degree of missing data. We first study the situations in which we can relax well-accepted conditions under which inferences that ignore missing data are valid. We partition a set of data into outcome, conditioning, and latent variables, all of which potentially affect the probability of a missing response. We describe sufficient conditions under which a complete-case estimate of the conditional cumulative distribution function of the outcome given the conditioning variable is unbiased. We use simulations on a renal transplant data set to illustrate the implications of these results. After describing when missing data can be ignored, we provide a likelihood-based statistical approach that accounts for missing data in longitudinal studies, by fitting correlation structures that are plausible for measurements that may be unbalanced and unequally spaced in time. Our approach can be viewed as an extension of generalized linear models for longitudinal data, in contrast to the generalized estimating equation (GEE) approach, which is semi-parametric. Key assumptions of our method include first-order antedependence within subjects; independence between subjects; exponential family distributions for the first outcome on each subject and for the subsequent conditional distributions; and linearity of the expectations of the conditional distributions. Our approach is appropriate for data with over-dispersion, which occurs when the variance is inflated relative to the assumed distribution. We first consider a clinical trial to compare two treatments for seizures in patients. We implement the Poisson and Negative Binomial distributions for analysis of the seizure counts and perform a likelihood ratio test to choose between the two distributions. Next, we consider a study that evaluates the likelihood that a transplant center is flagged for poor performance. The outcome variable is a binomial-type outcome that indicates the number of times the center was flagged in the previous time period. For both studies, we perform simulations to assess the properties of our estimators and to compare our approach with GEE. We demonstrate that our method outperforms GEE, especially as the degree of over-dispersion increases. We also provide software in R so that the interested reader can implement our method in his or her own analysis.
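
As a rough illustration of the first-order Markov and linear-conditional-expectation assumptions, one plausible form for the conditional mean of a measurement given the immediately preceding measurement on the same subject, assuming a correlation that decays exponentially with the time lag, is

\[
E\!\left[Y_{ij} \mid Y_{i,j-1} = y\right] \;=\; \mu_{ij} + \rho^{\,t_{ij} - t_{i,j-1}}\,\frac{\sigma_{ij}}{\sigma_{i,j-1}}\,\big(y - \mu_{i,j-1}\big),
\]

where the \(\mu\)'s and \(\sigma\)'s are marginal means and standard deviations and the \(t\)'s are measurement times; this is a sketch of the general idea, not necessarily the exact parameterization used in the dissertation.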


Erica Billig, PhD

Year Graduated: 2017
Advisor:
Jason A. Roy, PhD
Michael Z. Levy, PhD
Dissertation Title:
Detecting and Controlling Insect Vectors in Urban Environments: Novel Bayesian Methods for Complex Spatial Data
Abstract:

Efforts to control the spread of vector-borne diseases are often focused on the vector itself. Here, we develop novel methods to strategically guide the search for vectors over urban landscapes. The methodology is motivated by Triatoma infestans, the vector of Chagas disease, a re-emerging vector in Arequipa, Peru. We first propose a novel stochastic epidemic model that incorporates both the counts of disease vectors at each observed house and the complex spatial dispersal dynamics. The goal of our analysis is to predict and identify houses that are infested with T. infestans for entomological inspection and insecticide treatment. A Bayesian method is used to augment the observed data, estimate the insect population growth and dispersal parameters, and determine posterior infestation probabilities of households. We investigate the properties of the model through simulation studies and implement the strategy in a region of Arequipa by inspecting houses with the highest posterior probabilities of infestation and report the results from the field study. After piloting this model in the field and assessing the strengths and weaknesses, we propose a much faster method that extends a Gaussian Field model to incorporate the urban landscape. Gaussian Field logistic models can be used to create risk maps of vector presence across large urban environments. However, these models do not typically account for the possibility that city streets function as permeable barriers for insect vectors. We extend Gaussian field models to account for this urban landscape. We demonstrate our method on simulated datasets and then apply it to data on T. infestans. We estimate that streets increase the effect of distance on the probability of vector presence at least 1.5 fold compared to the undivided environment within a city block. Lastly, we propose a Bayesian generalized multivariate conditional autoregressive approach to jointly model the distribution of vectors, T. infestans, with the proportion of vectors that carry the parasite of Chagas disease, Trypanosoma cruzi. We demonstrate the properties of the model using simulation studies, and apply the method to data from Arequipa, Peru.
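
To give a sense of how city streets can enter such a model, one hedged sketch (notation assumed here, not the authors' parameterization) is a logistic Gaussian-field model whose spatial covariance decays with a street-adjusted distance:

\[
\operatorname{logit}\Pr\{\text{vector present at } s\} = x(s)^{\top}\beta + w(s), \qquad
\operatorname{Cov}\{w(s), w(s')\} = \sigma^{2}\exp\{-\phi\, d^{*}(s, s')\},
\]

where \(d^{*}(s, s')\) inflates the within-block distance whenever \(s\) and \(s'\) lie on opposite sides of a street, in the spirit of the reported estimate that streets increase the effect of distance at least 1.5-fold.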


Bing Cai, PhD

Year Graduated: 2010
Advisor:
Thomas R. Ten Have, PhD
Dylan S. Small, PhD (Wharton)
Dissertation Title:
Causal Inference With Two-Stage Logistic Regression - Accuracy, Precision, And Application
Abstract:

Two-stage predictor substitution (2SPS) and two-stage residual inclusion (2SRI) are two approaches to instrumental variable (IV) analysis. While 2SPS and 2SRI with linear models are well-studied methods of causal inference, the properties of 2SPS and 2SRI for logistic binary outcomes have not been thoroughly studied. We study the bias and variance properties of 2SPS and 2SRI for a logistic outcome model so that we can apply these IV approaches to causal inference for binary outcomes. We also propose and implement an extension of the generalized structural mean model originally developed for randomized trials. We first present closed-form expressions of asymptotic bias for the causal odds ratio from both the 2SPS and 2SRI approaches. Our closed-form bias results show that 2SPS logistic regression generates asymptotically biased estimates of this causal odds ratio even when there is no unmeasured confounding and that this bias increases with increasing unmeasured confounding. The 2SRI logistic regression is asymptotically unbiased when there is no unmeasured confounding, but when there is unmeasured confounding, there is bias and it increases with increasing unmeasured confounding. In the second part, we propose sandwich variance estimators for the logistic regressions in both the 2SPS and 2SRI approaches; the variance estimators are adjusted for the fact that the estimates from the first-stage regression are included as covariates in the second-stage regression. The simulation results show that the adjusted estimates are consistent with the observed variance, while the naive estimates without the adjustments are biased. This study also shows that the 2SRI method has a larger variance than the 2SPS method. Lastly, we compare the 2SPS and 2SRI logistic regressions with the generalized structural mean model (GSMM). Our simulation results show that the GSMM is an unbiased estimator of the complier-average causal effect (CACE) and has the least variance among the three approaches. We apply these three methods to an analysis of the GPRD database on the antidiabetic effect of bezafibrate.
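
For readers unfamiliar with the two estimators, a minimal simulated sketch in Python contrasts 2SPS and 2SRI; it uses statsmodels, and the variable names and data-generating values are illustrative assumptions, not taken from the GPRD analysis.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5000
    z = rng.binomial(1, 0.5, n)                               # instrument (e.g., randomization)
    u = rng.normal(size=n)                                    # unmeasured confounder
    d = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * z + u))))     # treatment actually received
    y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * d + u))))     # binary outcome

    # Stage 1: regress received treatment on the instrument
    stage1 = sm.OLS(d, sm.add_constant(z)).fit()
    d_hat = stage1.fittedvalues          # predicted treatment
    resid = d - d_hat                    # stage-1 residual

    # 2SPS: substitute the predicted treatment into the logistic outcome model
    fit_2sps = sm.Logit(y, sm.add_constant(d_hat)).fit(disp=0)

    # 2SRI: keep the observed treatment and add the stage-1 residual as a covariate
    fit_2sri = sm.Logit(y, sm.add_constant(np.column_stack([d, resid]))).fit(disp=0)

    print("2SPS log-odds ratio:", fit_2sps.params[1])
    print("2SRI log-odds ratio:", fit_2sri.params[1])

The only difference between the two fits is whether the predicted treatment replaces the observed treatment, or the first-stage residual accompanies it, in the second-stage logistic model.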


Xinglei Chai, PhD

Year Graduated: 2017
Advisor:
Jinbo Chen, PhD
Dissertation Title:
Semiparametric Approaches to Developing Models for Predicting Binary Outcomes Through Data and Information Integration
Abstract:

We developed statistical methods for evaluating the added value of biomarkers for predicting binary outcomes when biomarker data has limited availability. In the first project, we considered a cost-effective study design called the “two-phase study”, where data on the outcome and established risk predictors were collected for all study subjects in Phase I while biomarkers were measured only for a judiciously selected subset in Phase II. Using a logistic regression model to describe the relationship between the binary outcome and risk predictors, we developed three approaches to estimating the risk distribution and summary measures of predictive accuracy. We showed that all three estimators were consistent and asymptotically normally distributed, and compared the efficiency and robustness of the three methods through extensive simulation studies and application to an ongoing biomarker study of Gestational Diabetes. We also developed a novel sampling strategy for selecting Phase II subjects towards improved efficiency for estimating measures of predictive accuracy. In the second project, we developed a statistical method for alleviating the challenge posed by the lack of independent data for validating biomarkers for prediction, focusing on model calibration. When a well-calibrated model with only standard predictors exists, we proposed to calibrate the new model to the existing model at the stage of model development. With data collected under a case-control study design, we developed a novel constrained maximum likelihood approach to fitting logistic regression models that brought this idea to fruition. We developed large sample theory for this method, and performed extensive simulation studies to assess the impact of constraints on the odds ratio parameter estimates. We applied our method to analyze a case-control study of breast cancer nested within the Breast Cancer Detection and Demonstration Project to evaluate the added value of mammographic density for predicting the 5-year risk of breast cancer. In the third project, we extended the statistical method developed in the second project to accommodate the cross-sectional study design. Through simulation studies and the analysis of Gestational Diabetes data, we demonstrated that our method ensured that the model was well calibrated.


Lu Chen, PhD

Year Graduated: 2015
Advisor:
Jinbo Chen, PhD
Dissertation Title:
Cancer Absolute Risk Projection with Incomplete Predictor Variables
Abstract:

A popular approach to projecting cancer absolute risk is to integrate a relative hazard function of predictors with hazard rates obtained from different sources, where the relative hazard function is often approximated by an odds ratio function. To assess added values of candidate risk predictors, it is very common that data for standard risk predictors is fully available from a frequency-matched case-control study, but that of candidate predictors is available only for a subset of cases and controls. In the first project, we developed statistical measures for quantifying predictive accuracy of cancer absolute risk prediction models, accommodating incomplete predictor variables. We particularly focused on a measure that is useful for evaluating efficiency of model-based cancer screening, the proportion of cases that can be captured by screening only people with high projected risk. In the second project, using a logistic regression model to describe the relationship between cancer status and risk predictors, we developed a novel semiparametric maximum likelihood approach that accommodates incomplete predictor data under rare disease approximation for the estimation of odds ratio parameters and the distribution of candidate predictors. Through theoretical and simulation studies, we showed that our estimator is consistent with an asymptotically normal distribution and has improved statistical efficiency. In the third project, we applied the statistical methods developed in the first two to evaluate the added values of percent mammographic density and breast cancer risk SNPs in breast cancer absolute risk projection. Our results showed that the two sets of predictors had similar added values and can lead to more efficient model-based screening for breast cancer. In the fourth project, we applied the semiparametric maximum likelihood method to a family-supplemented study design that we proposed to address survival bias in case-control genetic association studies.
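
For orientation, a generic form of the absolute-risk integral described in the opening sentence, standard in this literature and written here with assumed notation, is

\[
P(a, a+\tau; x) \;=\; \int_{a}^{a+\tau} \lambda_0(t)\, rr(x)\,
\exp\!\left\{-\int_{a}^{t}\big[\lambda_0(u)\, rr(x) + \lambda_c(u)\big]\, du\right\} dt,
\]

where \(\lambda_0\) is the baseline cancer hazard, \(rr(x)\) the relative hazard for predictor profile \(x\) (often approximated by an odds ratio function), and \(\lambda_c\) the hazard of competing mortality.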


Jing Cheng, PhD

Year Graduated: 2006
Advisor:
Dylan S. Small, PhD (Wharton)
Dissertation Title:
Causal Inference For Randomized Trials With Noncompliance
Abstract:

Noncompliance is a common problem in randomized trials. When there is noncompliance, there is often interest in estimating the causal effect of actually receiving the treatment compared to receiving the control. Standard intent-to-treat analyses that compare randomization arms without regard to compliance do not estimate the causal effect of receiving treatment. My dissertation studies methods for making more accurate inferences about the causal effects of receiving treatment in randomized clinical trials when there is noncompliance. The first part of my dissertation focuses on causal methods for three-arm trials. We develop a method for constructing sharp bounds and confidence intervals for the average effects of receiving treatment within principal strata in three-arm trials with noncompliance. The first chapter in the second part of my dissertation develops a new efficient nonparametric estimator for the Complier Average Causal Effect (CACE) in two-arm trials with noncompliance. By using the empirical likelihood approach to construct a profile random sieve likelihood and taking into account the mixture structure in outcome distributions, our estimator is robust to parametric distribution assumptions and more efficient than the standard instrumental variables estimator. The second chapter in the second part of my dissertation develops a bootstrap version of a likelihood ratio test for the CACE in two-arm trials with noncompliance which have a multinomial outcome. We apply our method in the first part to the data from a randomized trial of treatments for alcohol dependence, and methods in the second part to the data from a randomized trial of an encouragement intervention to improve adherence to prescribed depression treatments among depressed elderly patients in primary care practices.
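
For orientation, under the standard IV assumptions (randomization of Z, exclusion restriction, monotonicity), the usual instrumental-variables estimand for the Complier Average Causal Effect is the ratio of intent-to-treat effects shown below; this is the generic form, not the profile sieve-likelihood or bounding methods developed in the dissertation.

\[
\text{CACE} \;=\; \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]},
\]

where \(Z\) denotes the randomization arm and \(D\) the treatment actually received.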


Shaokun (Shannon) Chuai, PhD

Year Graduated: 2011
Advisor:
Hongzhe Li, PhD
Dissertation Title:
Statistical Methods For Analysis of Structured Genomic Data
Abstract:

Partially motivated by the analysis of high dimensional genomic data, high dimensional statistics, especially high dimensional regression analysis, have been an active research area in recent decades. Besides the high dimensionality of genomic data, another important feature is that the genomic data often have certain structure, such as time course measurements and group or graphical structures. How to incorporate such structure information into the analysis raises interesting statistical challenges. This dissertation develops statistical methods for two problems motivated by genomic data analysis. The first problem is related to variable selection for high dimensional varying coefficient models, where we develop a regularization method for variable selection and estimation. We use a basis function expansion to model the time-dependent regression coefficient functions and a combination of smoothness and group-level penalties to achieve both smooth function estimation and coefficient function selection. We apply the methods to the analysis of microarray time course gene expression data in order to identify the transcription factors that regulate expression changes over time. Our results show that the varying coefficient model provides better power in identifying the relevant transcription factors than simple time-wise analysis. The second problem considers variable selection for graph-structured group variables, where we assume that the variables are grouped and also have a graphical structure. Examples include genes in a collection of pathways and single nucleotide polymorphisms in genes. We introduce a new penalty that is a combination of the group Lasso and a graph-constrained smoothness penalty in order to perform group-level variable selection and to impose some smoothness of the regression coefficients with respect to the graph structures. Simulation results show that the new method gives better variable selection and prediction when such group and graphical structure information exists. We apply this method to an analysis of glioblastoma gene expression data and identify several KEGG pathways that are potentially related to survival time in glioblastoma.
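
A minimal sketch of the kind of combined penalty described above, with notation assumed here rather than taken from the dissertation, is

\[
\min_{\beta}\;\tfrac{1}{2}\,\|y - X\beta\|_2^2 \;+\; \lambda_1 \sum_{g}\sqrt{p_g}\,\|\beta_g\|_2 \;+\; \lambda_2\,\beta^{\top} L\,\beta,
\]

where the groups \(g\) correspond to pathways or genes, \(p_g\) is the group size, and \(L\) is the Laplacian of the graph linking the variables; the first penalty performs group-level selection and the second encourages smoothness of coefficients over the graph.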

Shoshana Daniel, PhD

Year Graduated: 2008
Advisor:
Marshall Joffe, MD, MPH, PhD
Paul R. Rosenbaum, PhD (Wharton)
Dissertation Title:
Matching In Observational Studies: Time Dependency And Balance
Abstract:

In observational studies, matching can be used to remove bias between treated and control subjects. Novel matching schemes and tools for analyzing matched data are developed in this thesis. Modified sign and signed rank test statistics are created to allow for the possibility that the effect of a treatment is not seen until some time has passed, a late effect. These statistics are invertible and can be used to construct confidence intervals for censored, matched pairs. In simulation studies, these modified tests are shown to have very high power to detect a late effect in contrast to conventional techniques. These modified tests are used to ascertain whether the administration of chemotherapy by gynecological or medical oncologists creates the apparent late effect in their survival curves.


While the incidence rate of endometrial cancer is lower for black than for white women, black women have much higher mortality rates. In order to distinguish whether biological factors or quality of care is responsible for this discrepancy in survival, a tapered matched comparison is developed. In this procedure, a group of individuals is compared to two or more nonoverlapping groups of controls created through the optimal assignment algorithm. In the endometrial data, the black women are matched to white women at diagnosis, and then each black woman is rematched to different white women based on both diagnosis and treatment characteristics.


While the tapered matching scheme implemented on the endometrial cancer data proved effective, it was not able to match well on all comorbidities of interest. Thus, a balanced matching algorithm is developed. This technique exchanges a member of the pool of controls that is unassigned with one that is assigned in order to maintain homogeneous strata and provide better overall balance of covariates of interest. A simulation study shows that for a small penalty in the quality of the matches, balance across pairs is greatly enhanced. This flexible algorithm, when implemented on the endometrial data, brought into balance four key comorbidity covariates that were previously unbalanced, at a minimal sacrifice in the quality of the matches.


Matthew Davis, PhD

Year Graduated: 2014
Advisor:
Warren B. Bilker, PhD
J. Richard Landis, PhD
Dissertation Title:
Estimation and Inference of the Three-Level Intraclass Correlation Coefficient
Abstract:

Since the early 1900s, the intraclass correlation coefficient (ICC) has been used to quantify the level of agreement among different assessments on the same object. By comparing the level of variability that exists within subjects to the overall error, a measure of the agreement among the different assessments can be calculated. Historically, this has been performed using subject as the only random effect. However, there are many cases where other nested effects, such as site, should be controlled for when calculating the ICC to determine the chance-corrected agreement adjusted for other nested factors. We will present a unified framework to estimate both the two-level and three-level ICC for both binomial and multinomial outcomes. In addition, the corresponding standard errors and confidence intervals for both ICC measurements will be displayed. Finally, an example of the effect that controlling for site can have on ICC measures will be presented for subjects within genotyping plates comparing genetically determined race to patient-reported race.
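
For reference, in one common variance-components formulation (written here for a continuous outcome, as a sketch; the dissertation's versions are for binomial and multinomial outcomes), the two-level ICC and one way to write the subject-level ICC after introducing a nested site effect are

\[
\rho_{\text{2-level}} = \frac{\sigma^2_{\text{subject}}}{\sigma^2_{\text{subject}} + \sigma^2_{\text{error}}}, \qquad
\rho_{\text{3-level}} = \frac{\sigma^2_{\text{subject}}}{\sigma^2_{\text{site}} + \sigma^2_{\text{subject}} + \sigma^2_{\text{error}}},
\]

where the variance components correspond to site, subject within site, and residual error.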


In addition, when determining agreement on a multinomial response, the question of homogeneity of agreement of individual responses within the multinomial response is raised. One such scenario is the GO project at the University of Pennsylvania where subjects ages 8–21 were asked to rate a series of actors’ faces as happy, sad, angry, fearful or neutral. Methods exist to quantify overall agreement among the five responses, but only if the ICCs for each item-wise response are homogeneous. We will present a method to determine homogeneity of ICCs of the item-wise responses across a multinomial outcome and provide simulation results that demonstrate strong control of the type I error rate. This method will subsequently be extended to verify the assumptions of homogeneity of ICCs in the multinomial nested-level model to determine if the overall nested-level ICC is sufficient to describe the nested-level agreement.


Peter R. Dawson, PhD

Year Graduated: 2012
Advisor:
Phyllis A. Gimotty, PhD
Dissertation Title:
Beyond Traditional Biomarkers: Methods for identifying and evaluating non-traditional biomarkers
Abstract:


The identification of new biomarkers that can accurately classify patients by diagnosis or predict patient outcomes is of the utmost importance in biomedical science. Currently, both clinicians and biostatisticians focus on traditional biomarkers, i.e., ones that have a monotonic relationship with risk of disease or other clinical outcome. Standard methods based on the area under the curve (AUC) or logistic regression are adequate for the discovery of such traditional markers, but fail to identify potentially interesting biomarkers that violate the assumption of monotonicity. Methods developed to discover non-traditional (non-monotone) biomarkers are important since they can identify useful diagnostic, prognostic or predictive factors as well as enhance our understanding of the biology underlying a disease. In this dissertation, novel methods are developed and compared for identifying non-traditional biomarkers, including a chi-squared test based on the area between empirical cumulative distribution functions. As non-traditional biomarkers are not accurately evaluated using diagnostic tests based on the traditional single optimal cutpoint, a dual cutpoint strategy is proposed for creating a binary diagnostic test and Youden's Index is extended to incorporate the second cutpoint. Asymptotic theory is then developed for the extension to Youden's Index to allow for its use in a hypothesis testing framework. Simulation results are presented that evaluate the operating characteristics of both proposed tests (chi-squared and extension) as well as compare the new tests to standard methods for identifying both traditional and non-traditional biomarkers. These methods are applied to biomarkers from gene expression microarray data to identify and evaluate potential non-traditional biomarkers that can discriminate between tumor and non-tumor tissue.
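
As context, the classical Youden index for a single cutpoint c is J(c) = Se(c) + Sp(c) − 1; a natural dual-cutpoint analogue, written here under the assumption that values falling outside the interval (c1, c2) are called test-positive (the dissertation's exact construction may differ), is

\[
J(c_1, c_2) \;=\; \Pr(X < c_1 \text{ or } X > c_2 \mid D = 1) \;+\; \Pr(c_1 \le X \le c_2 \mid D = 0) \;-\; 1, \qquad c_1 < c_2,
\]

maximized over the pair of cutpoints.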

Mark J Donovan, PhD

Year Graduated: 2006
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Predicting Event Times In Clinical Trials: Effects Of Masking And Blocking
Abstract:


In event-based clinical trials, power is primarily determined by the number of events. Therefore, the timing of the interim or final analysis is often determined by the accrual of events during the course of the study. Existing Bayesian methods may be used to predict the date of the landmark event, based on current enrollment, event, and loss-to-follow-up data, provided the treatment arms are known.



This work extends these methods to the setting where the treatment arms are masked. We present three models for predicting future event times when treatment arm is masked: (1) an exponential model for enrollment combined with a mixture-exponential model for events and losses to follow-up; (2) an extension of model 1 to a constrained latent variable (LV) model to address blocked randomization if block sizes are known; and (3) extensions of models 1 and 2 to address Weibull event times, as well as generalizations of the enrollment model for non-uniform enrollment and enrollment rates by site. We demonstrate the use of these models using data from three studies, as well as simulated data.



We show the following: (1) the mixture model with diffuse priors can have better coverage probabilities for the prediction interval than single-population models if a treatment effect is present; (2) the LV model with blocking produced similar median estimates of the landmark event time, with narrower prediction limits than those for the unconstrained LV model; and (3) the Weibull model for events has similar or better coverage properties than the exponential model when a sufficient number of events is available. Also, a model that reflects enrollment patterns can substantially improve coverage properties for predicting the landmark event.


Angelo F. Elmi, PhD

Year Graduated: 2009
Advisor:
Sarah J. Ratcliffe, PhD
Wensheng Guo, PhD
Dissertation Title:
Curve Registration In Functional Data Analysis With Informatively Censored Event-Times
Abstract:

Curve Registration is a technique for aligning a set of curves whose time scale is observed subject to random error. In this dissertation, a general approach to Curve Registration for longitudinal and functional data, in the possible presence of informative dropout and time-varying treatments, is developed. A new method is developed for fitting the Semiparametric Nonlinear Mixed Effects Model (SNMM) in which a B-spline basis expansion is used to estimate the common shape function. Existing approaches to estimation and inference in this framework, which use a smoothing spline to estimate the common shape function, do not estimate the model parameters from a unified likelihood-based optimization criterion and instead use a backfitting approach that iterates between two mixed effects models. Such an iterative algorithm is not guaranteed to converge, and because the variability in each of the two models is not properly accounted for, statistical inferences based on this approach may not be valid. Instead, a B-spline basis expansion is used in place of the smoothing spline, which unifies estimation of all parameters within the same likelihood. Convergence is guaranteed, and likelihood-based statistical inferences and model selection will be valid. Computationally, the algorithm is simplified in comparison to the smoothing spline approach because the dimension of integration needed to compute the log-likelihood is typically small. Therefore, a more accurate numerical integration scheme based on Adaptive Gaussian Quadrature is implemented. The SNMM is extended to the shared parameter framework to enable joint modeling of the longitudinal trajectories and informatively censored event-times. Time-varying treatments are also accommodated through another extension to the branching curve problem. The methods developed in this dissertation are applied to a Women’s Health study involving women attempting a vaginal birth after cesarean (VBAC). The results of fitting the SNMM and its extensions are used to characterize the average progression of labor, to determine whether cases of uterine rupture tended to have longer delivery times, on average, than healthy controls, and to model the effect of oxytocin, a labor-inducing agent, on the average labor progression.

Victoria Gamerman, PhD

Year Graduated: 2016
Advisor:
Phyllis Gimotty, PhD
Justine Shults, PhD
Dissertation Title:
Statistical Methods for Time-Conditional Survival Probability
Abstract:

We develop a new statistical framework for analyzing time-conditional survival probabilities beyond point estimates and their corresponding 95% confidence intervals. We define time-conditional survival probability as the probability of surviving at least an additional x years given survival beyond a years. This probability accounts for time elapsed from diagnosis and can be estimated by the ratio of a- and (a+x)-year estimated survival probabilities. We obtain the asymptotic distribution of estimates of log time-conditional survival and estimate their covariance matrix. We propose weighted least squares test statistics to address relevant clinical questions for evaluating the relationship between time-conditional survival probabilities and additional time survived. Simulation studies assess the power and the empirical probability of making a type I error. We obtain estimates from parametric methods to allow for the inclusion of continuous variables in a model for time-conditional survival probability. We create a statistical framework around time-conditional survival probability incorporating information from continuous prognostic factors. Melanoma survival data are used to illustrate the proposed nonparametric methodology by evaluating survival in patients with a reported occurrence of malignancy from the Surveillance, Epidemiology, and End Results (SEER) program who underwent clinical staging versus pathological staging. The Logistic-Weibull cure model is used to illustrate the parametric methods, which are applied to melanoma data from the SEER program. While the work presented here focuses on medical applications, it can be applied to time-to-event data in other disciplines. We also developed an approach for maximum likelihood analysis of longitudinal discrete data with over-dispersion of the marginal distributions relative to the Poisson distribution under an induced AR(1) correlation structure.
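
The estimand at the center of this framework has a simple closed form: as noted in the abstract, the probability of surviving at least an additional x years given survival beyond a years is the ratio of survival probabilities,

\[
S(x \mid a) \;=\; \Pr(T \ge a + x \mid T \ge a) \;=\; \frac{S(a+x)}{S(a)},
\]

and it is the logarithm of this quantity around which the asymptotic distribution and weighted least squares tests are built.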


Long-Long Gao, PhD

Year Graduated: 2007
Advisor:
Marshall Joffe, MD, MPH, PhD
Dissertation Title:
Explanatory Analyses In Randomized Clinical Trials
Abstract:


Randomized clinical trials are commonly conducted in pharmaceutical companies and medical research institutes to evaluate a certain intervention effect. Standard intention-to-treat (ITT) analysis is routinely performed to test if there is a difference in outcome between the randomized groups. Besides ITT analysis, explanatory analyses are often performed. The purpose of explanatory analyses could be: (1) to examine effects of treatment received on the outcome, (2) to further address the reason for a difference between randomized groups, (3) to address other related research questions rather than the original study objective. This research was motivated by a dose-controlled randomized PK/PD clinical trial and a randomized clinical trial of Modification of Diet in Renal Disease (MDRD).



In this research, we focus on challenges that we encounter in practice when applying the Instrumental Variables (IV) method. More specifically, we focus on investigating (1) the exclusion restriction assumption, (2) censored outcome measurements in G-estimation, and (3) surrogacy concepts in explanatory analyses. The dissertation is organized in the following four parts. First, we review the compliance issue and introduce causal concepts and related notation. Second, we propose to apply the IV method to characterizing the relationship between pharmacokinetics (PK) and pharmacodynamics (PD) to deal with the unmeasured confounding problem, and discuss the exclusion restriction assumption. Third, we focus on how to deal with the censored-data issue when applying the G-estimation method. We propose to use an artificial censoring method, and extend it to more general settings. Fourth, we discuss the surrogacy concepts in explanatory analyses. We propose to use direct and indirect effects to define the estimands. Many explanatory analyses fall into the surrogacy framework although they are not traditional surrogate biomarker problems. We bridge the surrogate endpoint concept to the definition of instrumental variables and the concept of direct and indirect effects. We propose to utilize baseline covariate information to characterize both direct and indirect effects simultaneously, and relate our approach to the meta-analytic and principal stratification approaches. We apply the direct and indirect effect concepts to our explanatory analyses of the MDRD trial and the dose-controlled PK/PD trial.


Sandra Griffith, PhD

Year Graduated: 2012
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Statistical methods for heaped data with application to smoking cessation research
Abstract:


Measures of daily cigarette consumption, like many self-reported numerical data, exhibit a form of measurement error termed heaping. This occurs when quantities are reported with varying levels of precision, often in the form of round numbers. As heaping can introduce substantial bias to estimates, conclusions drawn from data subject to heaping are suspect. Because more precise measurements are seldom available, methods to estimate the true underlying distribution from heaped data depend on unverifiable assumptions about the heaping mechanism. A doubly-coded dataset with both a conventional retrospective recall measurement (timeline follow back) and an instantaneous measurement not subject to heaping (ecological momentary assessment) motivates this dissertation and allows us to model the heaping mechanism.



We take three approaches to this problem. First, we develop a nonparametric method that involves the estimation of heaping probabilities directly, where possible, and calculating others by smoothing, interpolation and subtraction. Next, we use the motivating data as a calibration data set, allowing us to create a predictive model for imputation. We apply this model to multiply impute precise cigarette counts for data from a randomized, placebo-controlled trial of bupropion where only heaped cigarette counts are available. Finally, we build on findings from the first two approaches to develop a more flexible model, which forgoes the restrictive rounding framework of previous models. Rather than assuming subjects will round off when providing self-reported counts, we posit that numbers possess an intrinsic gravity that tends to attract subjects to characteristically round numerals. We outline procedures for parameterizing and estimating such a model and apply it to the motivating data. Our findings suggest that the self-reporting process is more complex than the mechanism assumed in conventional rounding-based models. While we apply these models exclusively to smoking cessation data, they have wide applicability to many types of self-reported count data.


Matthew W. Guerra, PhD

Year Graduated: 2011
Advisor:
Justine Shults, PhD
Dissertation Title:
Methods for Longitudinal Binary Data with Time-Dependent Covariates
Abstract:

We consider longitudinal studies with binary outcomes that are measured repeatedly on subjects over time. Our analysis goal is to fit a logistic model that relates the expected value of the outcomes to explanatory variables that are measured on each subject. However, additional care must be taken to adjust for the association between the repeated measurements on each subject. In this dissertation, we propose the new first-order Markov maximum likelihood (MARK1ML) method for covariates that may be fixed or time-varying. We also implement and make comparisons with two other approaches: generalized estimating equations, which may be more robust to misspecification of the true correlation structure, and alternating logistic regression, which models association via odds ratios, which are subject to less restrictive constraints than correlations. We prove that the proposed estimation procedure will yield consistent and asymptotically normal estimates of the regression and correlation parameters, provided the correlation between consecutive measurements on a subject is correctly specified. Simulations (conducted with a new and simple approach that we present in this dissertation) demonstrate that our approach can yield improved efficiency in estimation of the regression parameter; for equally spaced and complete data, the gains in efficiency were greatest for the parameter associated with a time-by-group interaction term and for higher values of the correlation. For unequally spaced data and with drop-out according to a missing at random mechanism, MARK1ML with correctly specified consecutive correlations yielded substantial improvements in both bias and efficiency. We present analyses from longitudinal studies in psychiatry and cancer prevention to demonstrate application of the methods we consider. We also offer an R function for easy implementation of our approach.

Mengye Guo, PhD

Year Graduated: 2009
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Bayesian Methods For Identifying Genetic Predictors In Smoking Cessation Studies
Abstract:


In clinical trials and observational studies, investigators may seek to identify genetic markers that are associated with treatment outcome, either directly or through interactions with treatments. Identification of such markers can help in elucidating the biological mechanisms and in identifying optimal treatments for individual subjects. A common approach is to screen all the markers using single-marker models and designate for further analysis those that are significantly associated with the outcome. When the number of potential markers is large, failure to adjust for multiplicity may lead to excess false positives. A range of recently developed frequentist approaches adjust properly for multiplicity but may be inefficient in the sense of failing to use available prior information. Moreover, methods that select the best models from all the markers simultaneously, rather than one marker at a time, would be preferable as sources of basic biological and clinical information.



In this dissertation, we propose the use of Bayesian procedures to detect important markers. Our first project uses the Bayes factor (BF) to screen all the single-marker models and identify a subset of markers of potential interest. Our second project develops an approach to maintain the efficiency of BF tests while controlling the overall type I error rate, using a method we call the Multiplicity-Calibrated Bayesian Hypothesis Testing (MCBHT) procedure. In our third project, we develop a comprehensive approach to select the best models from all the markers simultaneously, using a Bayesian variable selection method that explicitly incorporates prior information and enforces parsimony. We use simulations to demonstrate the power and value of our methods. We apply the methods to the analysis of two smoking cessation datasets, thereby identifying a small set of plausibly predictive genetic markers.

Steffanie M. Halberstadt, PhD

Year Graduated: 2011
Advisor:
Mary Sammel, ScD
Dissertation Title:
Wrestling with Issues in Scale Development Using Joint Latent Variable Methods
Abstract:


In this dissertation I explore the use of joint latent variable methods for the development of summary scales used in clinical studies. The primary aims of scale development are to combine relevant pieces of information necessary to describe an underlying hypothetical construct (item selection) and to demonstrate that the scale measures the construct appropriately by comparing the scale to other measures of the same construct (validation). Joint latent variable methods are a natural choice for scale development because they model the relationship between multiple scale items and latent constructs simultaneously and provide a measure of the association between the latent construct and a “gold standard” validation measure. Combining the two stages of item selection and validation into a single model, joint latent variable models eliminate the bias that is inherent in modeling these processes separately. Motivating the methodology in this dissertation is an example from the Physical Activity and Lymphedema (PAL) clinical trial. What constitutes a “gold standard” clinical diagnostic measure is a controversial issue in lymphedema research. Many objective diagnostic measures are expensive, time-consuming, and fail to provide a comprehensive measure of all important attributes of lymphedema. Patient-reported symptoms have the potential to be a useful indicator of lymphedema because they provide an inexpensive and quick assessment of acute changes in swelling, skin tone, and function. Our objective was to simultaneously identify important patient-reported lymphedema symptoms and to compare these symptoms to a current clinical diagnostic measure. However, a unique feature of the PAL sample was the presence of significant symptom non-response. To account for this, I propose a multivariate zero-inflated proportional odds (MZIPO) model, a joint latent variable model that combines continuous and categorical latent variables to perform item selection and validation in the presence of zero-inflation in scale item distributions. This new model classifies subjects into a susceptible class, which includes those subjects who are prone to experiencing lymphedema symptoms, and an unsusceptible class, which includes subjects who are truly invulnerable to suffering from symptoms. For the susceptible class, the model provides estimates of correlation between individual symptoms and a latent measure of lymphedema severity as well as an association between the clinical diagnostic measure and lymphedema severity. In addition to determining the value of patient-reported symptoms in comparison to “gold standard” diagnostic measures, the MZIPO model is advocated for its ability to identify a significant unobserved subgroup of patients who may require careful monitoring for lymphedema exacerbations or flare-ups.


Elizabeth Handorf, PhD

Year Graduated: 2012
Advisor:
Daniel Heitjan, PhD
Nandita Mitra, PhD
Dissertation Title:
Statistical Methods for Cost-Effectiveness Analysis Using Observational Data
Abstract:


Observational studies are a useful resource for evaluating the cost and cost-effectiveness of medical treatments, but the results are subject to bias from measured and unmeasured confounding. Furthermore, skewed outcomes, censoring, and correlation between costs and effects complicate the estimation of the treatment effect. It is therefore important to use an appropriate model, and to assess the sensitivity of the results to the effects of unmeasured confounders.



We describe several methods for estimating the Net Monetary Benefit (NMB): linear regression, generalized linear models, parametric and semi-parametric survival methods, and non-parametric estimates with propensity score stratification. Using simulations, we compare the performance of the models for analysis of skewed and censored cost and survival data. We find that correctly specified non-linear parametric models provide the best estimates, and that linear regression is insufficient for censored data.
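
For readers unfamiliar with the estimand, the incremental net monetary benefit values the effectiveness difference at a willingness-to-pay threshold and nets out the cost difference; a generic form (with notation assumed here) is

\[
\text{NMB}(\lambda) \;=\; \lambda\,\Delta E \;-\; \Delta C,
\]

where \(\Delta E\) and \(\Delta C\) are the treatment effects on effectiveness (for example, survival time) and cost, and a treatment is deemed cost-effective at threshold \(\lambda\) when \(\text{NMB}(\lambda) > 0\).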



Further, we propose sensitivity analysis procedures for the treatment effect on cost and NMB. Using a Gamma GLM for cost and a Weibull model for survival, we derive closed-form relationships between the regression parameters based on observed data, and those which account for an unmeasured confounder. Our general formulas allow for any unmeasured confounder which can be characterized using a moment-generating function, and also allow for separate unmeasured confounders to influence cost and survival.



We apply our methods to SEER-Medicare data to compare treatments for bladder cancer and prostate cancer.


Jing He, PhD

Year Graduated: 2011
Advisor:
Hongzhe Li, PhD
Mingyao Li, PhD
Dissertation Title:
Statistical Methods In Mapping Complex Diseases
Abstract:

Genome-wide association studies have become a standard tool for disease gene discovery over the past few years. These studies have successfully identified genetic variants attributed to complex diseases, such as cardiovascular disease, diabetes and cancer. Various statistical methods have been developed with the goal of improving power to find disease causing variants. The major focus of this dissertation is to develop statistical methods related to gene mapping studies with its application in real datasets to identify genetic markers associated with complex human diseases.



In my first project, I developed a method to detect gene-gene interactions by incorporating linkage disequilibrium (LD) information provided by external datasets such as the International HapMap or the 1000 Genomes Projects. The next two projects in my dissertation are related to the analysis of secondary phenotypes in case-control genetic association studies. In these studies, a set of correlated secondary phenotypes that may share common genetic factors with disease status are often collected. However, due to unequal sampling probabilities between cases and controls, the standard regression approach for examination of these secondary phenotypes can yield inflated type I error rates when the test SNPs are associated with the disease. To solve this issue, I propose a Gaussian copula approach to jointly model the disease status and the secondary phenotype. In my second project, I consider only one marker in the model and perform a test to assess whether the marker is associated with the secondary phenotype in the Gaussian copula framework. In my third project, I extend the copula-based approach to include a large number of candidate SNPs in the model. I propose a variable selection approach to select markers that are associated with the secondary phenotype by applying a lasso penalty to the log-likelihood function.


Sidan He, MS

Year Graduated: 2016
Advisor:
Phyllis Gimotty, PhD
Dissertation Title:
Categorical predictors based on optimal cut points in regression modeling: Evaluation of Lymph node count and survival of melanoma patients.

Jiwei Hi, PhD

Year Graduated: 2014
Advisor:
Alisa J. Stephens, PhD
Marshall Joffe, MD, MPH, PhD
Dissertation Title:
Causal Modeling under Complex Dependency in Clustered and Longitudinal Observations
Abstract:

In assessing the efficacy of a time-varying treatment, Marginal Structural Models (MSMs) and Structural Nested Mean Models (SNMMs) are useful in dealing with confounding by variables affected by earlier treatments. MSMs model the joint effect of treatments on the marginal mean of the potential outcome, whereas SNMMs model the joint effect of treatments on the mean of the potential outcome conditional on the treatment and covariate history. These models often assume independent subjects with noninformative times of observation.


The first two chapters extend the two classes of models to clustered observations with time-varying treatments in the presence of time-varying confounding. We formulate models with both cluster- and unit-level treatments and derive semiparametric estimators of parameters in such models. For unit-level treatments, we consider both the presence and absence of interference, namely the effect of treatment on outcomes in other units of the same cluster. For MSMs, we show that the use of unit-specific inverse probability weights and certain working correlation structures can improve the efficiency of estimators under specified conditions. The properties of the estimators are evaluated through simulations and compared with the conventional GEE regression method for clustered outcomes. To illustrate our methods, we use data from the treatment arm of a glaucoma clinical trial to compare the effectiveness of two commonly used ocular hypertension medications.


The third chapter extends SNMMs to situations with intermittent missing observations. In observational longitudinal studies, subjects often miss prescheduled visits intermittently. Previous literature has mainly focused on dealing with monotone censoring due to early dropout. Here we focus on intermittent missingness that can depend on the subjects' covariate and treatment history. We show that under certain assumptions the standard SNMMs can be used for situations where non-outcome covariates are missing intermittently. In situations where outcomes are also missing intermittently, we use a method that does not require artificially censoring the data, but requires a strict missing at random assumption. The estimators are shown to be consistent and achieve reasonable efficiency. We illustrate the method by estimating the effect of non-steroidal anti-inflammatory drugs (NSAIDs) on genitourinary pain using data from a study of chronic pelvic pain.


Peter Heping Hu, PhD

Year Graduated: 2004
Advisor:
Jesse Berlin, PhD
Dissertation Title:
Identification Of Differentially Expressed Genes And Prediction Of Clinical Outcome By Analyzing Gene Expression Profiles
Abstract:

Small sample size and high dimensionality of microarray data impose challenges on detecting differentially expressed genes and on summarizing genetic information. By estimating fold change for gene differentiation using a U-statistic, the asymptotic p-values are observed to have much lower power to detect gene differentiation than the exact p-values do in an example using lung cancer microarray data, though both are powerful using simulated data. Those genes selected using the exact p-values are not uniformly associated with higher posterior probabilities of differentiation, estimated using empirical Bayes analysis, than those genes that are not selected. These results suggest that the normality assumption of the U-statistic may be violated. In addition, genes are also selected based on the positive FDR (pFDR) computed using the density of the p-values estimated under a k-component mixing distribution framework via the EM algorithm. This framework enables one to estimate m0, the number of underlying true null hypotheses, which can be used in the FDR procedure to improve power. Assuming k = 3, the pFDR procedure is shown to be more powerful than the FDR procedure only when the pre-chosen FDR is moderate (i.e., ≥ 0.18). The FDR procedure using the estimator for m0 is more powerful than both the pFDR and the conventional FDR procedures at all levels of pre-chosen FDR. Finally, a generalized approach for summarizing gene expression profiles is developed using principal component analysis, and the summarized genetic information, called the genetic score, is used for predicting clinical outcome. The distribution of the genetic score suggests strongly heterogeneous genetic risks for survival. The dichotomized genetic score, with threshold 1, can predict death events without misclassification. By narrowing down the number of genes that are differentially expressed between tumor stages I and III with more and more stringent pre-chosen FDR bounds, a minimal subset of 34 genes is identified that can predict survival with the largest five principal components. Comparisons of the performance of the genetic scores and some clinical factors associated with survival suggest that the genetic score contains unique information in predicting survival.
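
As a rough sketch of how an estimate of m0 sharpens FDR-type procedures (written here in the spirit of Storey's positive FDR, not the k-component EM estimator described above):

\[
\widehat{\text{pFDR}}(t) \;=\; \frac{\hat\pi_0\, t}{\widehat{\Pr}(P \le t)}, \qquad \hat m_0 = \hat\pi_0\, m,
\]

where \(t\) is the p-value threshold, \(m\) the number of genes tested, and \(\hat\pi_0\) the estimated proportion of true null hypotheses.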


Yu Hu, PhD

Year Graduated: 2018
Advisor:
Mingyao Li, PhD
Dissertation Title:
Statistical Methods for Alternative Splicing using RNA Sequencing
Abstract:

The emergence of RNA-seq technology has made it possible to estimate isoform-specific gene expression and detect differential alternative splicing between conditions, thus providing an effective way to discover disease susceptibility genes. Analysis of alternative splicing, however, is challenging because various biases present in RNA-seq data complicate the analysis and, if not appropriately corrected, will affect gene expression estimation and downstream modeling. Motivated by these issues, my dissertation focused on statistical problems related to the analysis of alternative splicing in RNA-seq data. In Part I of my dissertation, I developed PennSeq, a method that aims to account for non-uniform read distribution in isoform expression estimation. PennSeq models non-uniformity using the empirical read distribution in RNA-seq data. It is the first time that non-uniformity is modeled at the isoform level. Compared to existing approaches, PennSeq allows bias correction at a much finer scale and achieved higher estimation accuracy. In Part II of my dissertation, I developed PennDiff, a method that aims to detect differential alternative splicing by RNA-seq. This approach avoids multiple testing for exons originating from the same isoform(s) and is able to detect differential alternative splicing at both the exon and gene levels, with more flexibility and higher sensitivity than existing methods. In Part III of my dissertation, I focused on problems arising from single-cell RNA-seq (scRNA-seq), a newly developed technology that allows the measurement of cellular heterogeneity of gene expression in single cells. Compared to bulk tissue RNA-seq, analysis of scRNA-seq data is more challenging due to high technical variability across cells and extremely low sequencing depth. To overcome these challenges, I developed SCATS, a method that aims to detect differential alternative splicing with scRNA-seq data. SCATS employs an empirical Bayes approach to model technical noise by use of external RNA spike-ins and groups informative reads sharing the same isoform(s) to detect splicing change. SCATS showed superior performance in both simulation and real data analyses. In summary, methods developed in my dissertation provide biomedical researchers with a set of powerful tools for transcriptomic data analysis and will aid novel scientific discovery.


Hojun Hwang, MS

Year Graduated: 2014
Advisor:
Justine Shults, PhD
Peter Reese, MD, MSCE
Dissertation Title:
Analysis of pediatric wait-times for deceased-donor kidneys

Thomas Jemielita, PhD

Year Graduated: 2017
Advisor:
Mary E. Putt, PhD, ScD
Devan Mehrotra, PhD
Dissertation Title:
“Efficient Baseline Utilization in Crossover Clinical Trials through Linear Combinations of Baselines: Parametric, Nonparametric, and Model Selection Approaches”
Abstract:

In a crossover clinical trial, including period-specific baselines as covariates in a regression model is known to increase the precision of the estimated treatment effect. The potential efficiency gain depends, in part, on the true model, the distribution and covariance matrix of the vector of baselines and outcomes, and the model chosen for analysis. We examine improvements in power that can be achieved by incorporating an optimal linear combination of baselines (LCB). For a known distribution, the optimal LCB minimizes the conditional variance corresponding to a treatment effect. The use of a single metric to capture the information in the baseline measurements is appealing for crossover designs. Because of their efficiency, crossover designs tend to have small sample sizes, and thus the number of covariates in a model can significantly impact the degrees of freedom in the analysis. We start by examining optimal LCB models under a normality assumption for uniform and incomplete block designs. For uniform designs, such as the AB/BA design, estimation is entirely through within-subject contrasts (and thus ordinary least squares [OLS]) and the optimal LCB minimizes the conditional variance corresponding to the treatment effect. However, since the optimal LCB is a function of the unknown covariance matrix, we propose an adaptive method that uses the LCB covariate corresponding to the most plausible covariance structure guided by the data. For incomplete block designs, data are commonly analyzed using a mixed effects model. Treatment effect estimates from this analysis are complex functions of both within-subject and between-subject treatment contrasts. To improve efficiency, we propose incorporating period-specific optimal LCBs which minimize the conditional variance of the period-specific outcomes. A simpler fixed effects analysis of covariance involving only within-subject contrasts is also described for small sample situations, in which hypothesis tests based on the mixed effects analyses exhibit inflated type I error rates even when using a Kenward-Roger approach to adjust the degrees of freedom. Lastly, we extend this work to the more general setting where the optimal LCB depends on the distribution of the response vector. In practice, the distribution is unknown and the optimal LCB is estimated under some loss function. To handle both normal and non-normal response data, OLS and a rank-based nonparametric regression model (R-estimation) are considered. A data-driven approach is then proposed which adaptively chooses the best fitting model among a set of models which work well under a range of conditions. Relative to commonly used methods, such as change from baseline analyses without use of covariates, our methods using functions of baselines as period-specific or period-invariant covariates consistently demonstrate improved power across a number of crossover designs, covariance structures, and response distributions.
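The following sketch illustrates the general LCB idea for the uniform AB/BA design on hypothetical simulated data: the treatment effect is estimated from the within-subject outcome contrast, and ordinary least squares supplies the weights of a linear combination of the two period-specific baselines. The covariance matrix and effect size are invented for illustration, and the dissertation's adaptive, covariance-structure-guided selection of the LCB is not reproduced here.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 40                      # subjects per sequence (AB and BA)
    delta = 0.5                 # true treatment effect (hypothetical)

    # Simulate correlated (baseline1, outcome1, baseline2, outcome2) per subject.
    cov = np.array([[1.0, 0.6, 0.5, 0.4],
                    [0.6, 1.0, 0.4, 0.5],
                    [0.5, 0.4, 1.0, 0.6],
                    [0.4, 0.5, 0.6, 1.0]])

    def simulate(seq):
        x = rng.multivariate_normal(np.zeros(4), cov, size=n)
        b1, y1, b2, y2 = x.T
        # seq = +1 for sequence AB (treatment in period 1), -1 for BA.
        y1 = y1 + (delta / 2) * seq
        y2 = y2 - (delta / 2) * seq
        return b1, y1, b2, y2

    parts = []
    for seq in (+1, -1):
        b1, y1, b2, y2 = simulate(seq)
        d_y = y1 - y2                          # within-subject outcome contrast
        parts.append((d_y, b1, b2, np.full(n, seq)))
    d_y = np.concatenate([p[0] for p in parts])
    b1 = np.concatenate([p[1] for p in parts])
    b2 = np.concatenate([p[2] for p in parts])
    trt = np.concatenate([p[3] for p in parts])

    # OLS of the outcome contrast on treatment and both baselines; the fitted
    # coefficients on (b1, b2) define an estimated linear combination of
    # baselines that reduces the residual (conditional) variance.
    X = np.column_stack([np.ones_like(d_y), trt, b1, b2])
    beta, *_ = np.linalg.lstsq(X, d_y, rcond=None)
    print("estimated treatment effect:", beta[1])   # approx. delta
    print("estimated LCB weights:", beta[2:])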


Cheng Jia, PhD

Year Graduated: 2018
Advisor:
Mingyao Li, PhD
Dissertation Title:
“Statistical Methods for Whole Transcriptome Sequencing: From Bulk Tissue to Single Cells”
Abstract:

RNA-Sequencing (RNA-Seq) has enabled detailed unbiased profiling of whole transcriptomes with incredible throughput. Recent technological breakthroughs have pushed back the frontiers of RNA expression measurement to the single-cell level (scRNA-Seq). With both bulk and single-cell RNA-Seq analyses, modeling of the noise structure embedded in the data is crucial for drawing correct inference. In this dissertation, I developed a series of statistical methods to account for the technical variations specific to RNA-Seq experiments in the context of isoform- or gene-level differential expression analyses. In the first part of my dissertation, I developed MetaDiff (https://github.com/jiach/MetaDiff), a random-effects meta-regression model that allows the incorporation of uncertainty in isoform expression estimation in isoform differential expression analysis. This framework was further extended to detect splicing quantitative trait loci with RNA-Seq data. In the second part of my dissertation, I developed TASC (Toolkit for Analysis of Single-Cell data; https://github.com/scrna-seq/TASC), a hierarchical mixture model, to explicitly adjust for cell-to-cell technical differences in scRNA-Seq analysis using an empirical Bayes approach. This framework can be adapted to perform differential gene expression analysis. In the third part of my dissertation, I developed TASC-B, a method extended from TASC to model transcriptional bursting-induced zero-inflation. This model can identify and test for the difference in the level of transcriptional bursting. Compared to existing methods, these new tools that I developed have been shown to better control the false discovery rate in situations where technical noise cannot be ignored. They also display superior power in both our simulation studies and real-world applications.


Edward H. Kennedy, PhD

Year Graduated: 2016
Advisor:
Dylan Small, PhD
Marshall Joffe, MD, MPH, PhD
Dissertation Title:
Doubly robust causal inference with complex parameters
Abstract:

Semiparametric doubly robust methods for causal inference help protect against bias due to model misspecification, while also reducing sensitivity to the curse of dimensionality (e.g., when high-dimensional covariate adjustment is necessary). However, doubly robust methods have not yet been developed in numerous important settings. In particular, standard semiparametric theory mostly considers independent and identically distributed samples and smooth parameters that can be estimated at classical root-n rates. In this dissertation we extend this theory and develop novel methodology for three settings outside these bounds: (1) matched cohort studies, (2) nonparametric dose-response estimation, and (3) complex high-dimensional effects with continuous instrumental variables. In Chapter 1 we show that, for matched cohort studies, efficient and doubly robust estimators of effects on the treated are computationally equivalent to standard estimators that ignore the non-standard sampling. We also show that matched cohort studies are often more efficient than random sampling for estimating effects on the treated, and derive the optimal number of matches for given matching variables. We apply our methods in a study of the effect of hysterectomy on the risk of cardiovascular disease. In Chapter 2 we develop a novel approach for causal dose-response curve estimation that is doubly robust without requiring any parametric assumptions, and which naturally incorporates general off-the-shelf machine learning. We derive asymptotic properties for a kernel-based version of our approach and propose a data-driven method for bandwidth selection. The methods are used to study the effect of hospital nurse staffing on excess readmissions penalties. In Chapter 3 we develop novel estimators of the local instrumental variable curve, which represents the treatment effect among compliers who would take treatment when the instrument passes some threshold. Our methods do not require parametric assumptions, allow for flexible data-adaptive estimation of effect modification, and are doubly robust. We derive asymptotic properties under weak conditions, and use the methods to study infant mortality effects of neonatal intensive care units with high versus low technical capacity, using travel time as an instrument.
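For readers unfamiliar with double robustness, the sketch below shows a generic augmented inverse-probability-weighted (AIPW) estimator of an average treatment effect on simulated data; the estimate remains consistent if either the propensity model or the outcome model is correctly specified. This is a textbook-style illustration only, not the matched-cohort, dose-response, or instrumental-variable estimators developed in the dissertation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression

    def aipw_ate(X, A, Y):
        """Augmented IPW (doubly robust) estimate of E[Y(1) - Y(0)]:
        consistent if either the propensity model or the outcome model
        is correctly specified."""
        ps = LogisticRegression(max_iter=1000).fit(X, A).predict_proba(X)[:, 1]
        mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
        mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)
        psi = (mu1 - mu0
               + A * (Y - mu1) / ps
               - (1 - A) * (Y - mu0) / (1 - ps))
        return psi.mean()

    # Toy data with confounding: X affects both treatment and outcome.
    rng = np.random.default_rng(2)
    n = 5000
    X = rng.normal(size=(n, 2))
    p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
    A = rng.binomial(1, p)
    Y = 1.0 * A + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
    print("AIPW ATE estimate:", aipw_ate(X, A, Y))   # approx. 1.0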


Carin Kim, PhD

Year Graduated: 2008
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Statistical Design Of Tumor Xenograft Studies
Abstract:


The repeated measures tumor growth study is a common tool of translational cancer science. Standard procedures for designing such studies are not available. In this dissertation, we address three aspects of the design of tumor xenograft studies: sample size for comparative studies with log-linear growth, optimal placement of times of observation for spline growth models, and selection of the order of the spline model. All of our methods assume either compound symmetry (CS) or autoregressive (AR) error models.



For the sample size formula, we consider a typical two-arm study where the number of samples in each arm and the number of equally spaced follow-ups are the same. We assume a common intercept with a linear slope upon log transformation. The formulas involve only the difference in slopes (growth rates), the coefficient of variation (CV) of the raw-scale data, the number of follow-up times and the correlation parameter. We compare the sample size requirements when the standard deviation (SD) is known, and when it has to be estimated. The comparison suggests a simple modification to the known-SD formula that is valid and conservative in general. Furthermore, results suggest that misspecification of the error model makes little difference in the computed sample size.



We then consider the case of potentially nonlinear log growth curves, for which we suggest the use of spline regression models with correlated errors. Under an assumed spline regression model, we find a set of optimum points. With CS errors, the correlation has no effect on the values of the optimum points, whereas when we assume AR errors, the correlation has a modest effect on the optimum points. Regularly-spaced designs are not optimal but are fairly efficient. Finally, mis-specifying the linear-quadratic as a linear-cubic function results in substantial loss of efficiency. We illustrate the method with some designs for a two-arm tumor growth study.



Finally, we devise a method for selecting the best-fitting model from among a range of possible spline regression models and we demonstrate the application of our method to three sets of real tumor xenograft studies: an immunotherapy study where we assume that tumor growth is perturbed at the treatment administration time (i.e. the knot is placed at the time of treatment administration) and two bioluminescence imaging studies where the knot is to be estimated. Simulation results show that the method detects the correct functional form reasonably well.


Clara Kim, PhD

Year Graduated: 2005
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Bayesian Cost-Effectiveness Analysis
Abstract:


Estimation of the extra cost that is required to improve the efficacy of a treatment is an important problem of contemporary medical research. As a result, many clinical trials now collect cost information as well as effectiveness data. We propose parametric Bayesian approaches to compare the cost-effectiveness of a new treatment to an existing control when the outcomes are measured longitudinally and the data are subject to censoring. We first use a pattern-mixture model to describe the joint distribution of the cost and survival. The model assumes that cost, conditional on survival, follows a multivariate normal distribution. We simulate the posterior distribution via data augmentation and apply the method to data from a randomized clinical trial of a treatment for a cardiovascular disease. We use two sets of priors, noninformative priors and subjective priors. We compare findings from our approach with an analysis using Willan and Lin's frequentist nonparametric method.



Quality-of-life (QoL) is also important in evaluating the effect of a treatment on patients. Recently, the number of trials that compare patient QoL as a secondary endpoint has increased. To get a more complete comparison of the treatment and control, we expand the above method, so that the cost-effectiveness analysis adjusts for QoL measures. The pattern-mixture model assumes that cost and QoL, conditional on survival, follow a multivariate normal distribution. Simulation of the posterior distribution and prior density selections are similar to the unadjusted cost-effectiveness method. We again compare the results from our pattern-mixture models with Willan et al.'s frequentist nonparametric method.



Finally, we propose a frailty model that allows us to jointly model the survival, cost and QoL data with a small number of parameters and describe the within-subject correlation in an intuitively appealing way. The model assumes that the survival, logit-transformed QoL and cost at each time point, given the frailty, follow exponential, normal, and gamma distributions, respectively. We use subjective priors and simulate the posterior distribution via importance sampling. We apply this method to the cardiovascular clinical trial data.


Hanjoo Kim, PhD

Year Graduated: 2009
Advisor:
Justine Shults, PhD
Dissertation Title:
Methods for Analysis of Multiple Longitudinal Outcomes and for Testing Multiple Outcomes
Abstract:


Clinical trials often involve multiple binary or continuous outcomes that are repeatedly measured, in order to assess the safety and efficacy of drug therapies. However, there are limitations to available statistical methods for multi-outcome longitudinal data. First, there is debate regarding the most appropriate approach for binary data, even when the binary outcomes are considered separately. Next, if the multiple outcomes are considered simultaneously, an approach based on generalized estimating equations (GEE) is available that models the association between multiple outcomes over time via a Kronecker product correlation structure; however, this method is only applicable for balanced data, with an equal number of measurements per outcome within subjects. Finally, multiple tests will usually be conducted in analysis of clinical trials that involve multiple longitudinal outcomes; however, this will result in inflation of the overall type I error rate, i.e., the probability of falsely rejecting any true null hypothesis.



This thesis proposes approaches that address the limitations described above and is based on the following two considerations: (i) Before any adjustment for multiple testing, appropriate statistical methods should be implemented for analysis of any binary outcomes and to properly address the multi-dimensional structure of the data; and (ii) A powerful and efficient multiple testing procedure should be applied to account for multiplicity, once the initial analysis has been conducted. We propose methods for longitudinal binary data and unbalanced multivariate longitudinal data in the framework of GEE. Further, we develop a theoretical framework for constructing a class of efficient and powerful multiple testing procedures for the control of overall type I error rate at the prespecified level of significance. The thesis consists of three parts: Part I describes our methods for single binary and multiple continuous or categorical longitudinal outcomes; Part II discusses the implementation of single and multiple longitudinal outcomes in our user-written SAS software; and lastly, Part III describes a unified approach for constructing multiple testing procedures in the closure framework.
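The Kronecker product structure mentioned above can be written down directly: if one matrix describes the working correlation among the outcomes and another describes the correlation over visits within an outcome, the subject-level working correlation is their Kronecker product. The values in the short sketch below are hypothetical, and the GEE fitting step itself is omitted.

    import numpy as np

    # Sketch of a Kronecker-product working correlation for multiple
    # longitudinal outcomes: correlation among outcomes (R_out) crossed with
    # correlation over time within an outcome (R_time, here AR(1)).
    rho_out, alpha = 0.4, 0.7
    n_times = 3

    R_out = np.array([[1.0, rho_out],
                      [rho_out, 1.0]])                 # 2 outcomes
    lags = np.abs(np.subtract.outer(np.arange(n_times), np.arange(n_times)))
    R_time = alpha ** lags                             # AR(1) over 3 visits

    R_subject = np.kron(R_out, R_time)   # 6 x 6 matrix for (outcome, time)
    print(np.round(R_subject, 3))
    # Entry [i*n_times + s, j*n_times + t] equals R_out[i, j] * alpha**|s-t|,
    # the correlation between outcome i at visit s and outcome j at visit t.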

Julie Kobie, PhD

Year Graduated: 2016
Advisor:
Hongzhe Li, PhD
Dissertation Title:
“Sparse Simultaneous Signal Detection with Applications in Complex Disease GWAS”
Abstract:

Studying complex diseases, such as autoimmune diseases, can lead to the detection of pleiotropic loci with otherwise small effects. Through the detection of pleiotropic loci, the genetic architecture of these complex diseases can be better defined, allowing for subsequent improvements in their treatment and prevention efforts. Here, I investigate the genetic relatedness of complex diseases through the detection and quantification of simultaneous disease-associated genetic variants. I propose two max-type statistics, with and without an added level of dependency on the directions of the genetic effects, that globally test whether a pair of complex diseases shares at least one disease-associated genetic variant. The proposed global tests are based on the simultaneity of complex disease-associated genetic variants, allowing for the determination of exact p-values from a permutation distribution assuming independence. While an independence assumption is often imposed on genetic variants, I propose a perturbation procedure for evaluating the statistical significance of one of the proposed global tests, preserving the inherent dependency structure among genetic variants. I extend that global test beyond the detection of genetic relatedness at identical genetic variants, to the detection of genetic relatedness within dependency-defined windows across the genome. With the proposed methods, I identify pairs of pediatric autoimmune diseases that exhibit evidence of genetic sharing, such as Crohn’s disease and ulcerative colitis.
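A much-simplified version of a max-type simultaneous-signal statistic, with a permutation p-value computed under an independence-across-variants assumption, is sketched below for intuition. The statistic, cutoffs, and toy z-scores are illustrative inventions and do not reproduce the dissertation's tests or its perturbation procedure.

    import numpy as np

    def max_min_stat(z1, z2):
        """Max-type statistic for simultaneous signals: large only if some
        variant has a strong effect in *both* studies."""
        return np.max(np.minimum(np.abs(z1), np.abs(z2)))

    def permutation_pvalue(z1, z2, n_perm=2000, seed=0):
        """Permutation p-value obtained by shuffling the variant labels of
        one study, which breaks any pairing between the two sets of z-scores
        (valid only under an independence-across-variants assumption)."""
        rng = np.random.default_rng(seed)
        obs = max_min_stat(z1, z2)
        null = np.array([max_min_stat(z1, rng.permutation(z2))
                         for _ in range(n_perm)])
        return (1 + np.sum(null >= obs)) / (n_perm + 1)

    # Toy example: 1000 variants, one shared signal at index 0.
    rng = np.random.default_rng(3)
    z1 = rng.normal(size=1000)
    z2 = rng.normal(size=1000)
    z1[0], z2[0] = 5.0, 4.5
    print("permutation p-value:", permutation_pvalue(z1, z2))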


Patricia Kooker, MS

Year Graduated: 2014
Advisor:
Kathleen Propert, ScD
Jessica Fishman, PhD
Dissertation Title:
Effects of communication in health care: Sharing information with cancer patients

Robert Krafty, PhD

Year Graduated: 2007
Advisor:
Wensheng Guo, PhD
Dissertation Title:
Penalized Analysis Of Correlated Functional Data
Abstract:


The aim of this dissertation is to create a unified and practical approach to the analysis of correlated functional data where realizations of a stochastic process over a space of smooth functions are observed at discrete observation points with noise. A philosophy for the penalized estimation of the second moment of the stochastic process is introduced by using the common smoothness of the underlying subject trajectories to induce a measure of regularity over the space of covariance functions. By showing that the amount of smoothing for covariance estimation differs from the amount of smoothing required for trajectory estimation, this dissertation offers some of the first in-depth discussions about practical smoothing parameter selection for second moment estimation. Further, a Kullback-Leibler criterion based on a metric over the space of covariance functions is offered for the selection of smoothing parameters.



This framework is used to develop explicit techniques for performing functional principal components analysis (FPCA), efficient functional regression, and functional classification analysis. The time-course expression of human fibroblasts in response to growth serum is analyzed through a FPCA that takes the spectral decomposition of the covariance estimator while using the Kullback-Leibler criterion to jointly select the smoothing parameter and number of principal components. A procedure for fitting the varying-coefficient model when the within-subject correlation is unknown is developed by iterating between weighted penalized least squares estimation of functional coefficients conditional on the covariance and the covariance estimation procedure conditional on functional coefficients. This approach to functional linear regression is used to investigate the clinical role that VEGF plays in the interaction of chemotherapy and antiangiogenic therapy. Finally, a quadratic rule for the classification of populations of locally stationary time series that takes into account within-population spectral variability is developed from smooth estimates of the mean and covariance of the log-spectra and used for the prediction of epileptic seizures from intracranial EEG data.


Anagha Kumar, MS

Year Graduated: 2013
Advisor:
Andrea Troxel, ScD
Kathryn Schmitz, PhD, MPH
Dissertation Title:
LABAT (Lymphedema Assessment of the Breast, Arm and Torso) trial.

Benjamin Leiby, PhD

Year Graduated: 2006
Advisor:
Mary D. Sammel, ScD
Dissertation Title:
Bayesian Multivariate Growth Curve Latent Class Models
Abstract:


In many clinical studies, the disease of interest is multi-faceted and multiple outcomes are needed to adequately capture information about the characteristics of the disease or its severity. For example, in the study of Interstitial Cystitis (IC), three primary symptoms (pain, urgency to void, and voiding frequency) as well as various composite symptom scales are typically evaluated. In analysis of such diseases, it is often difficult to determine what constitutes improvement due to the multivariate nature of the outcome. Furthermore, when the disease of interest has an unknown etiology and/or is primarily a symptom-defined syndrome, there is potential for the disease population to have distinct subgroups. This heterogeneity of the population under study makes the search for effective treatments challenging as treatments may not improve the symptoms of all types of patients. Identification of population subgroups is of interest as it may assist clinicians in providing appropriate treatment or in developing accurate prognoses.



We propose multivariate growth curve latent class models that group subjects based on multiple symptoms measured repeatedly over time. These groups or latent classes are defined by distinctive longitudinal profiles of a latent variable which is used to summarize the multivariate outcomes at each point in time. The mean growth curve for the latent variable in each class defines the features of the class. We explore multiple continuous symptoms at each time point to identify a class of "responders", who have a decline in the latent symptom summary outcome over time. We then extend this model to any combination of continuous, binary, ordinal or count outcomes. The models are developed within a Bayesian hierarchical framework. Simulation studies are used to validate the estimation procedures.



We apply our models to data from a randomized clinical trial evaluating the efficacy of Bacillus Calmette-Guerin in treating symptoms of IC where we are able to identify a class of subjects where treatment is effective. We also apply our continuous variable model to a cohort of Chronic Prostatitis/Chronic Pelvic Pain Syndrome patients where we identify a subset whose symptoms improved over a 2-year period.


Caiyan Li, PhD

Year Graduated: 2009
Advisor:
Hongzhe Li, PhD
Dissertation Title:
Statistical Methods For Analysis Of Graph-Constrained Genomic Data
Abstract:

Graphs and networks are common ways of depicting information. In biology, many different biological processes are represented by graphs, such as regulatory networks, metabolic pathways and protein-protein interaction networks. This kind of prior information accumulated over many years of biomedical research is a useful supplement to the standard numerical genomic data such as microarray gene expression data. How to incorporate information encoded by known biological pathways into the analysis of numerical data raises interesting statistical challenges. This dissertation develops several statistical methods for analysis of genomic data by incorporating the prior biological network information. We consider the high-dimensional regression problem when the covariates are measured on undirected graphs and develop methods for identifying genes and sub-networks that are related to the phenotypes. Specifically, we present the problem formulation and an efficient computational algorithm for our procedure, the GRAph-Constrained Estimator (GRACE), and develop theoretical properties of GRACE, including non-asymptotic error bounds and sign consistency for both fixed and diverging numbers of parameters. We also introduce an empirical Bayes method to take into account the biological network structure information using a discrete Markov Random Field model prior for identifying genes and subnetworks whose transcription activities are perturbed by or activated in response to experimental conditions. We apply both GRACE and the empirical Bayes method to a microarray gene expression study of human brain aging to identify genes or subnetworks that are related to or perturbed by human brain aging. Extensions of the proposed methods to censored survival data are also presented.
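In the spirit of graph-constrained estimation, the sketch below combines a lasso penalty with a quadratic graph-Laplacian penalty that encourages coefficients to vary smoothly over a toy gene network; the quadratic term is absorbed by augmenting the design matrix, a standard elastic-net-style device. This is a simplified illustration under invented data and tuning values, not the GRACE procedure or its theory.

    import numpy as np
    from sklearn.linear_model import Lasso

    def graph_constrained_fit(X, y, L, lam1=0.1, lam2=1.0):
        """Sketch of network-constrained regularization: an l1 penalty for
        sparsity plus a quadratic penalty beta' L beta (L = graph Laplacian)
        that smooths coefficients over the network.  The quadratic penalty
        is absorbed by augmenting the design matrix with sqrt(lam2)*L^{1/2}."""
        n, p = X.shape
        w, V = np.linalg.eigh(L)                       # symmetric square root
        L_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
        X_aug = np.vstack([X, np.sqrt(lam2) * L_half])
        y_aug = np.concatenate([y, np.zeros(p)])
        model = Lasso(alpha=lam1, fit_intercept=False, max_iter=10000)
        model.fit(X_aug, y_aug)
        return model.coef_

    # Toy network: a chain of 10 genes; the first 3 (connected) drive y.
    p, n = 10, 200
    A = np.zeros((p, p))
    for j in range(p - 1):
        A[j, j + 1] = A[j + 1, j] = 1              # chain adjacency
    L = np.diag(A.sum(axis=1)) - A                 # unnormalized Laplacian
    rng = np.random.default_rng(4)
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.0, 1.0, 1.0] + [0.0] * 7)
    y = X @ beta_true + rng.normal(size=n)
    print(np.round(graph_constrained_fit(X, y, L), 2))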


Jiaqi Li, PhD

Year Graduated: 2016
Advisor:
Nandita Mitra, PhD
Dissertation Title:
Doubly Robust and Machine Learning Approaches for Economic Evaluation Using Observational Data
Abstract:

Policy makers are often interested in the economic evaluation of health care interventions in their decision making. However, proper cost-effectiveness (CE) analysis is complicated by the need to account for unique features of cost data, including informative censoring and distributional heterogeneity. In addition, medical costs are often collected from observational claims data, which are susceptible to confounding.


We propose a doubly robust (DR) method based on propensity scores for estimating CE. This approach accounts for informative censoring and allows for the incorporation of cost history via inverse probability weighting and partitioning. We then investigate an ensemble machine learning approach to choose among popular cost models to estimate outcome parameters in the DR approach and to choose among various parametric and non-parametric propensity score models. We analytically demonstrate that this approach is unbiased. Our simulation studies confirm that the proposed DR approach performs well even under misspecification of either the PS model or the outcome model. We apply this approach to a cost-effectiveness analysis of two competing lung cancer surveillance procedures, CT versus chest X-ray, using SEER-Medicare data. Lastly, we explore Big Data tools and other machine learning algorithms that can be used for cost prediction.


Yimei Li, PhD

Year Graduated: 2010
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Statistical Modeling Of Data From Smoking Cessation Clinical Trials
Abstract:


In smoking cessation clinical trials, subjects commonly experience a series of lapse and recovery episodes of varying lengths. Any quit episode may become permanent, in the sense that the subject stops smoking for good, and any lapse may also become permanent, in the sense that the subject abandons the quit attempt entirely. Individual quit patterns may reflect the effects of treatment and measured and unmeasured covariates.



To describe this complex data structure, we propose a multivariate time-to-event model that (i) incorporates alternating recurrent events of two types, each with the possibility of "cure", (ii) allows for the modifying effects of treatment and covariates, and (iii) reflects within-subject correlation via frailties. Specifically, we introduce a novel cure-mixture frailty model in which the cure probability follows a binary regression and the time to event given not cured is determined by a proportional hazard model. We then extend it to data with recurring events of two alternating types, where we assume that each type of event has a gamma frailty, and we link the frailties by means of a Clayton copula. In my first project, I fit this model to data from a smoking cessation drug trial. In my second project, I developed a Bayesian method to predict individual long-term smoking behavior from observed short-term quit/relapse patterns. In my third project, I investigated the theoretical properties of the survival distribution, evidently not previously described, that arises from our cure-mixture frailty model.


Kaijun Liao, PhD

Year Graduated: 2012
Advisor:
Andrea B. Troxel, ScD
Dissertation Title:
Statistical methods for non-ignorable missing data with applications to quality-of-life data
Abstract:


In chronic disease studies, researchers increasingly use survey studies and design medical studies to better understand the relationships among patients, physicians, their health care system utilization, and their decision-making processes in disease prevention and management. Longitudinal data are widely used to capture disease progression or trends occurring over time, with each subject observed as time progresses. A common problem is that repeated measurements are not fully observed due to missing responses or loss to follow-up. Moreover, in such medical studies, the sample sizes are limited due to restrictions on disease type, study area and medical information availability. Small sample sizes with large proportions of missing information are problematic for researchers trying to understand the experience of the total population. Data modeled without considering this missing information may yield biased results.



A first-order Markov dependence structure is a natural way to model the tendency of changes. First, we developed a Markov transition model in a full-likelihood-based algorithm to provide robust estimation accounting for non-ignorable missingness, and applied it to data from the Penn Center of Excellence in Cancer Communication Research. Next, we extended the method to a pseudo-likelihood-based approach by considering only pairs of adjacent observations to significantly ease the computational complexities of the full-likelihood-based method. Finally, we built a two-stage pseudo hidden Markov model to analyze the association between quality-of-life measurements and cancer treatments from a randomized phase III trial in brain cancer patients. By incorporating both selection models and shared parameter models with a hidden Markov model, this approach provides targeted identification of treatment effects. We outline procedures for parameterizing and estimating such models and apply them to the motivating data. Our model provides a simple framework for reducing the multi-dimensional integration in traditional non-ignorable missingness methods into one-dimensional integration in the observed likelihood. In addition, the proposed models avoid the problem of specifying the correlation structure of repeated outcomes, instead emphasizing estimation of the Markov chain parameters.
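To fix ideas, the sketch below fits a first-order Markov transition model to complete simulated data by regressing the current state on the previous state; the fitted model recovers the transition probability matrix. The non-ignorable-missingness likelihood machinery that is the actual contribution above is not shown, and the two-state chain and its probabilities are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(9)
    P_true = np.array([[0.8, 0.2],
                       [0.3, 0.7]])            # hypothetical 2-state chain
    n, T = 400, 5
    chains = np.zeros((n, T), dtype=int)       # all subjects start in state 0
    for t in range(1, T):
        p1 = P_true[chains[:, t - 1], 1]       # P(state 1 | previous state)
        chains[:, t] = rng.binomial(1, p1)

    # First-order Markov regression: current state on previous state.
    prev = chains[:, :-1].reshape(-1, 1)
    curr = chains[:, 1:].ravel()
    m = LogisticRegression().fit(prev, curr)
    est = m.predict_proba([[0], [1]])          # rows: previous state 0, 1
    print(np.round(est, 2))                    # approximates P_true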

Chengcheng Liu, PhD

Year Graduated: 2011
Advisor:
Wensheng Guo, PhD
Sarah J. Ratcliffe, PhD
Dissertation Title:
Joint Modeling Of Non-Gaussian Longitudinal Outcomes And Time To Event Data
Abstract:

When informative dropouts exist in longitudinal studies, ignoring the informative dropout will result in biased results. Joint modeling of the outcome and dropout time can take into account some information from informative dropouts and correct some biases. In this dissertation, we introduce a random pattern mixture model to jointly model longitudinal non-Gaussian outcomes and the dropout time. The random pattern effects are defined as the latent effects linking the dropout process and the longitudinal outcome. Conditional on the random pattern effects, the longitudinal non-Gaussian outcome and the dropout time are assumed independent. The EM algorithm is used for estimation. We also apply the random pattern concept to joint modeling of the non-Gaussian outcome and the survival time to analyze the effects of treatment on both longitudinal and survival responses simultaneously. In the first part of the dissertation, the random pattern mixture model is applied to a dataset from the Prevention of Suicide in Primary Care Elderly Collaborative Trial (PROSPECT) to estimate the intervention effect on the binary depression outcome, compared with an independent generalized linear mixed model and a shared parameter model. We model the longitudinal binary outcome using a generalized linear mixed model with a logit link function and random subject and pattern effects, joined with the dropout model at the pattern level. In the second part of the dissertation, we apply the random pattern mixture model to a sample from an end-stage renal disease (ESRD) study to estimate the baseline iron effect on the ordinal anemia outcome, with a proportional odds model for the longitudinal ordinal anemia outcome. In the third part of the dissertation, joint modeling of the longitudinal outcome and the survival time is applied to a sample from the ESRD study. The iron effects on both anemia and survival responses are estimated simultaneously. The binary anemia outcome is fitted with a generalized linear mixed model with subject- and site-level random effects. The survival time is fitted with a Cox proportional hazards model with a random site-level effect. Simulation studies were used to explore the robustness and sensitivity of the various models.


Tao Liu, PhD

Year Graduated: 2006
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Measuring Sensitivity To Nonignorable Censoring In Nonparametric And Semiparametric Survival Modeling
Abstract:


The analysis of survival data with potentially nonignorable censoring is a problem of increasing interest. The challenge mainly comes from two aspects: making inference while accounting for nonignorability is not always trivial, especially in nonparametric and semiparametric survival modeling, and generally such attempts are complicated by the competing-risks structure of censored survival data and its associated nonidentifiability issue.



A natural approach to the problem is through a sensitivity analysis, which evaluates how the inference changes as the unknown censoring mechanism departs from the ignorable model. In this work, we study local sensitivity of the Kaplan-Meier (KM) survival estimate and the Cox regression coefficient to potentially nonignorable censoring. We start by postulating a censoring mechanism using the Heitjan-Rubin coarse-data selection model. Then we extend the sensitivity analysis technique, the index of local sensitivity to nonignorability (ISNI) (Troxel et al., 2004; Zhang and Heitjan, 2006), to the nonparametric case treating the KM estimate as a nonparametric MLE, and to the semiparametric case treating the Cox partial likelihood as a profile likelihood.



Because they avoid estimation of the nonignorable model, the ISNI measures for the KM estimate and the Cox model are easy to use in practice. Moreover, simulation studies show that ISNI is a valid measure of local sensitivity. We demonstrate the procedure by applying ISNI-KM to the Stanford Heart Transplant data (Crowley and Hu, 1977), ISNI-Cox (discrete) to the leukemia data of Embury et al. (1977), and ISNI-Cox (continuous) to the advanced lung cancer data of Loprinzi et al. (1994).


Ziyue Liu, PhD

Year Graduated: 2010
Advisor:
Wensheng Guo, PhD
Dissertation Title:
Modeling Longitudinal Data By State Space Method
Abstract:

Longitudinal studies are common because they provide more information about the evolution of the underlying system. With the advance of technology, high-frequency data can be collected from the studied subjects, and the key challenge is how to balance flexibility and interpretability in extracting information from the individual profiles. In the literature, linear mixed effects models have clear interpretations but only limited flexibility. Functional mixed effects models, on the other hand, are extremely flexible but the results are hard to interpret. To simultaneously obtain flexibility and interpretability, this dissertation utilizes the state space method as the basic unit of data analysis. We first develop a data-driven spline smoothing method by extending the classical smoothing spline to allow the roughness to adapt to the underlying signal. We also propose an equivalent state space model to ease the computational demand. We then develop a new class of mixed effects models where the state space method is used to specify both the population effects and individual effects. The resultant models can handle a wide range of individual profiles and have clear interpretations. We further extend the mixed effects state space models to study various types of relationships across multivariate outcomes. The proposed methods are motivated by and applied to two data sets: (1) adrenocorticotropic hormone and cortisol from a study of chronic fatigue syndrome and fibromyalgia syndrome; and (2) electroencephalogram data from epilepsy patients.

Lola Luo, PhD

Year Graduated: 2013
Advisor:
Jason Roy, PhD
Dylan Small, PhD
Dissertation Title:
Some extensions of hidden Markov models for estimating disease state transition rates using observational healthcare databases
Abstract:


Chronic diseases are often described by stages of severity. Clinical decisions are influenced by the stage, whether a patient is progressing, and the rate of progression. For chronic kidney disease (CKD), relatively little is known about the transition rates between stages. Since CKD is a slowly progressing disease, large samples with long follow-up times are needed to estimate these rare transitions. Data such as electronic health records (EHR) are ideal for this purpose because they have both sufficient follow-up time and sample size to reliably and accurately estimate the transition rates. However, there are challenges in using EHR. For example, the large amount of information itself may pose difficulty in estimation; the data tend to have higher levels of measurement error because collection methods are less regulated and consistent than in data collected as part of a research study; acute events (e.g., acute kidney injury) can also affect a laboratory measurement in a way that does not accurately characterize the chronic disease stages; the observation process itself might be informative; in addition, many researchers believe that the transition rates between CKD stages vary among individuals. Standard statistical methods for estimation of transition rates, such as hidden Markov models (HMMs), do not address these challenges specifically and efficiently and hence may not be feasible when applied to EHR data.



In this dissertation, new methods are developed to estimate these transition rates using HMMs while addressing the issues mentioned above. We propose a "discretization" method that transforms daily transition probabilities into intervals of 30, 90, or 180 days in order to make estimating transition rates feasible when large observational data sets are used. An extended HMM that uses two error matrices to relate the observed outcomes to the true disease states is developed. This new model addresses the contamination of the data due to measurement errors or acute events more efficiently than a traditional HMM. Lastly, we develop a modified mover-stayer HMM that views the distribution of the transition rates of a CKD patient as a mixture of two different distributions. The accuracy and robustness of these methods are tested via simulation studies. We also applied our models to a large EHR data set from Geisinger Clinic, where the transition probabilities and the relationship between true CKD states and the observed outcomes are estimated.
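The discretization step can be illustrated in a few lines: if a hypothetical one-day transition probability matrix over disease states is available, the 30-day (or 90-day) transition matrix is simply its matrix power, and the HMM can then be fit on the coarser time grid. The numbers below are invented for illustration.

    import numpy as np

    # Hypothetical one-day transition probability matrix over three states
    # (e.g., CKD stage 3 -> {stage 3, stage 4, stage 5}); rows sum to 1.
    P_day = np.array([[0.995, 0.004, 0.001],
                      [0.000, 0.997, 0.003],
                      [0.000, 0.000, 1.000]])
    P_30 = np.linalg.matrix_power(P_day, 30)   # 30-day transition matrix
    P_90 = np.linalg.matrix_power(P_day, 90)   # 90-day transition matrix
    print(np.round(P_30, 3))
    print(np.round(P_90, 3))
    # Fitting the HMM on the coarser 30- or 90-day grid keeps the likelihood
    # tractable for long EHR follow-up while preserving the Markov dynamics.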


Robin Mogg, PhD

Year Graduated: 2009
Advisor:
Marshall Joffe, MD, MPH, PhD
Dissertation Title:
Assessing Causal Vaccine Effects In A Subset Selected Post-Randomization
Abstract:

In some clinical trials, the primary outcome of interest may only be measured in a subset of subjects where the subset is identified by a post-randomization event. For example, in prophylactic vaccination studies a primary objective may be to assess the effect of a vaccine on an outcome that is measured only in subjects who become infected with disease. Formulating causal effects in subgroups selected by post-randomization events can be challenging due to the potential of selection bias compounded by other limitations of clinical trials where certain subjects do not have outcomes measured. Principal stratification is a common approach that can be used to tackle selection bias in this context; however, causal treatment effects using principal stratification cannot be identified from the observed data with standard assumptions made in randomized trials. Currently published methods using principal stratification to identify the average causal effect of treatment in subsets selected after randomization do adjust for such unmeasured selection bias, but they are limited by various assumptions, including that the treatment or vaccine is not harmful (i.e., monotonicity) and that missing outcome data are missing completely at random (MCAR). In this dissertation, we describe a non-parametric approach to assess an average causal effect (ACE) of treatment in a subset selected post-randomization that resolves some limitations of current causal approaches. We first derive bounds of the ACE without assuming monotonicity and develop testing procedures for these bounds. We further propose estimation and testing procedures that utilize logistic regression models to reflect intermediate degrees of selective effects and describe applying these models to assess the ACE through a sensitivity analysis. Simulation is used to demonstrate the value of our methods. Finally, we develop a robust multiple imputation based approach to estimate and test the ACE using principal stratification in the presence of missing outcome data when MCAR is untenable and an ignorable missing data mechanism is plausible. We compare our approach with other recently published methods to handle ignorable missing data in this context via simulation. Throughout, we use two HIV vaccination trials to motivate our work and apply the new methods.


John Pluto, MS

Year Graduated: 2015
Advisor:
Russell T. Shinohara, PhD
Dissertation Title:
Development of dementia biomarkers from hippocampal and medial temporal lobe subregions via penalized regression

Li Qin, PhD

Year Graduated: 2004
Advisor:
Wensheng Guo, PhD
Dissertation Title:
Functional Models Using Smoothing Splines: A State Space Approach
Abstract:

With the development of modern technology, tremendous amounts of data can be collected in biomedical experiments. These data can arise as curves or groups of time series; therefore, it is natural to use a curve or a time series as the basic unit in the data analysis. In this dissertation, we propose two general classes of functional models for estimation and inference for such data. We first develop a new class of functional models for curve data that can incorporate prior information about the curves. We then propose time-frequency functional linear models for time series data, in which the basic analysis unit is a time-varying spectrum. The biggest obstacle in fitting functional models is the heavy computational demand due to the curse of dimensionality. We develop O(N) computationally efficient estimation procedures for the proposed functional models by constructing equivalent state space models. The proposed methods are motivated by and applied to two data sets: (1) longitudinally collected cortisol data from a fibromyalgia study and (2) electroencephalogram (EEG) time series data from epilepsy patients.


Lior Rennert, PhD

Year Graduated: 2018
Advisor:
Sharon X. Xie, PhD
Dissertation Title:
“Statistical Methods for Truncated Survival Data”
Abstract:

Truncation is a well-known phenomenon that may be present in observational studies of time-to-event data. For example, autopsy-confirmed survival studies of neurodegenerative diseases are subject to selection bias due to the simultaneous presence of left and right truncation, also known as double truncation. While many methods exist to adjust for either left or right truncation, there are very few methods that adjust for double truncation. When time-to-event data are doubly truncated, the regression coefficient estimators from the standard Cox regression model will be biased. In this dissertation, we develop two novel methods to adjust for double truncation when fitting the Cox regression model. The first method uses a weighted estimating equation approach. This method assumes the survival and truncation times are independent. The second method relaxes this independence assumption to an assumption of conditional independence between the survival and truncation times. As opposed to methods that ignore truncation, we show that both proposed methods result in consistent and asymptotically normal regression coefficient estimators and have little bias in small samples. We use these proposed methods to assess the effect of cognitive reserve on survival in individuals with autopsy-confirmed Alzheimer’s disease. We also conduct an extensive simulation study to compare survival distribution function estimators in the presence of double truncation and conduct a case study to compare the survival times of individuals with autopsy-confirmed Alzheimer’s disease and frontotemporal lobar degeneration. Furthermore, we introduce an R package for the above methods to adjust for double truncation when fitting the Cox model and estimating the survival distribution function.


Pixu Shi, PhD

Year Graduated: 2016
Advisor:
Hongzhe Li, PhD
Dissertation Title:
Statistical Methods for Human Microbiome Studies
Abstract:

In human microbiome studies, sequencing reads data are often summarized as counts of bacterial taxa at various taxonomic levels. In this thesis, we investigate the relation between these counts and other variables. We first consider regression analysis with bacterial counts normalized into compositional data as covariates. In order to satisfy the subcompositional coherence of the results, linear models with a set of linear constraints on the regression coefficients are introduced. A penalized estimation procedure for estimating the regression coefficients and for selecting variables under the linear constraints is developed. A method is also proposed to obtain de-biased estimates of the regression coefficients that are asymptotically unbiased and have a joint asymptotic multivariate normal distribution. This provides valid confidence intervals of the regression coefficients and can be used to obtain the p-values.

Simulation shows the validity of the confidence intervals and smaller variances of the de-biased estimates when the linear constraints are imposed. The proposed methods are applied to a gut microbiome data set and identify four bacterial genera that are associated with the body mass index after adjusting for the total fat and caloric intakes.
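A bare-bones version of the linear log-contrast model with a zero-sum constraint is sketched below on simulated compositions; the constraint is enforced by reparameterization, and neither the penalization nor the de-biasing step described above is included. All data values are invented for illustration.

    import numpy as np

    def log_contrast_fit(comp, y):
        # Linear log-contrast model: y = b0 + sum_j beta_j * log(x_j) + eps,
        # subject to sum_j beta_j = 0.  The zero-sum constraint is enforced
        # by writing beta_p = -(beta_1 + ... + beta_{p-1}) and fitting OLS on
        # the log-ratio covariates log(x_j) - log(x_p).
        Z = np.log(comp)
        n, p = Z.shape
        Zc = Z[:, :-1] - Z[:, [-1]]                # log-ratios vs. component p
        X = np.column_stack([np.ones(n), Zc])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        b_free = coef[1:]
        return np.append(b_free, -b_free.sum())    # constrained beta vector

    # Toy compositions (rows sum to 1) whose log-contrast drives the outcome.
    rng = np.random.default_rng(5)
    n, p = 300, 5
    raw = rng.gamma(2.0, size=(n, p))
    comp = raw / raw.sum(axis=1, keepdims=True)
    beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # sums to zero
    y = np.log(comp) @ beta_true + rng.normal(scale=0.5, size=n)
    print(np.round(log_contrast_fit(comp, y), 2))      # approx. beta_true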


We then consider the problem of testing the difference between two repeated measurements of microbiome data from the same subjects. Multiple measurements of the microbiome from the same subject are often obtained to assess the difference in microbial composition across body sites or time points. Existing models for such count data are limited in modeling the covariance structure of the counts and in handling paired multinomial data. A new probability distribution is proposed for paired multinomial count data, which allows a flexible covariance structure of the counts and can be used to model repeatedly measured multivariate counts. Based on this new distribution, a test statistic is developed for testing the difference in compositions of paired multivariate count data. The proposed test can be applied to count data observed on a taxonomic tree in order to test differences in microbiome compositions and to identify subtrees with different subcompositions. Simulation shows that the proposed test has correct type I error rates and increased power compared to some commonly used methods.


Andrew Smith, MS

Year Graduated: 2016
Advisor:
Wei-Ting Hwang, PhD
Dissertation Title:
Estimating the predictive value of continuous markers for censored survival data using a likelihood ratio approach.

Wenguang Sun, PhD

Year Graduated: 2008
Advisor:
Tony Cai, PhD (Wharton)
Dissertation Title:
A New Framework For Large-Scale Multiple Testing: Compound Decision Theory And Data-Driven Procedures
Abstract:

This dissertation studies the large-scale multiple testing problem from a compound decision theoretical view, and proposes a new class of powerful data-driven procedures that substantially outperform the traditional p-value based approaches. There are several important implications from my dissertation research: first, the individual p-value fails to serve as the fundamental building block for large-scale multiple testing; second, the validity of an FDR procedure should not be overemphasized at the expense of the important efficiency issue; and third, the traditional "do nothing" approach suggested for dependent multiple testing is inefficient, and the structural information among the hypotheses can be exploited to construct more powerful tests. Chapter 1 reviews important concepts and the conventional framework for multiple testing and discusses several widely used testing procedures. The compound decision theory is formally introduced in Chapter 2. A major goal of Chapter 3 is to show that the p-value testing framework is generally inefficient in large-scale multiple testing and that the precision of the tests can be greatly increased by pooling information from different samples. We develop a compound decision framework for multiple testing and derive a z-value based oracle procedure that minimizes the false non-discovery rate subject to a constraint on the FDR. We then propose an adaptive procedure that asymptotically attains the performance of the oracle procedure. Chapter 4 considers the simultaneous testing of grouped hypotheses. Conventional strategies include pooled analysis and separate analysis. We derive an asymptotically optimal approach and show that both pooled and separate analyses can be uniformly improved. Our new approach provides important insights on how to optimally combine testing results obtained from multiple sources. Chapter 5 considers multiple testing under dependency. We show that the conventional "do nothing" approach can suffer from substantial efficiency loss in situations where the correlation structure is highly informative. We propose a data-driven procedure that is asymptotically valid and enjoys certain optimality properties. The new procedure is especially accurate in identifying structured weak signals, where traditional procedures tend to suffer from extremely low power.
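To convey the flavor of a z-value based procedure, the sketch below computes a two-group local false discovery rate for each z-value (with mixture parameters taken as known, a strong simplification) and rejects the largest set of hypotheses whose running average local fdr stays below the nominal FDR level. The adaptive procedure described above would instead estimate these quantities from the data.

    import numpy as np
    from scipy.stats import norm

    def zvalue_procedure(z, pi0, mu1, sd1, alpha=0.10):
        """Local fdr based testing rule: rank hypotheses by the posterior
        probability of the null and reject the largest set whose running
        average local fdr stays below alpha.  Mixture parameters are
        assumed known here for simplicity."""
        f0 = norm.pdf(z)                              # null density N(0, 1)
        f1 = norm.pdf(z, loc=mu1, scale=sd1)          # non-null density
        lfdr = pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)
        order = np.argsort(lfdr)
        running_mean = np.cumsum(lfdr[order]) / np.arange(1, z.size + 1)
        k = np.nonzero(running_mean <= alpha)[0]
        reject = np.zeros(z.size, dtype=bool)
        if k.size:
            reject[order[:k.max() + 1]] = True
        return reject

    # Toy example: 1800 null z-values and 200 signals centered at 3.
    rng = np.random.default_rng(6)
    z = np.concatenate([rng.normal(size=1800),
                        rng.normal(loc=3.0, size=200)])
    rej = zvalue_procedure(z, pi0=0.9, mu1=3.0, sd1=1.0, alpha=0.10)
    print("rejections:", rej.sum())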


Leah H. Suttner, PhD

Year Graduated: 2019
Advisor:
Sharon X. Xie, PhD
Dissertation Title:
Censored and Missing Data in Survival and Longitudinal Analysis.

Emin Tahirovic, PhD

Year Graduated: 2016
Advisor:
Andrea B. Troxel, ScD
Dissertation Title:
Sensitivity analysis for non-ignorable dropout of marginal treatment effect in longitudinal trials for G-computation based estimators
Abstract:

Estimators that adjust for non-ignorable dropout in a longitudinal clinical trial can be roughly classified in two ways: those that model the probability of dropout explicitly and those that adjust for dropout by specifying a model for the outcome. We discuss the intricacies related to the potential overlap of identification and specification in the latter class. We specify the assumption under which dropout is ignorable with respect to two estimators from this class, linear increments (LI) and extended SWEEP. Further, we present a sensitivity analysis approach with respect to this baseline assumption in a longitudinal trial that allows more intuitive and informed input from domain experts about a simple sensitivity parameter. We show that the unconditional treatment-specific mean is identified under the assumption of future independence given the present. We apply our sensitivity analysis approach to a dataset from a multi-center cluster-randomized controlled trial comparing two alternative economic interventions for reducing LDL cholesterol among patients with high cardiovascular risk. In order to position our approach among existing sensitivity analysis tools, we illustrate how it can be viewed as an extension of Daniels and Hogan’s (2007) pattern-mixture approach to longer sequences of observations.


Kay See Tan, PhD

Year Graduated: 2014
Advisor:
Andrea B. Troxel, ScD
Benjamin French, PhD
Dissertation Title:
Regression Modeling of Longitudinal Outcomes with Outcome-Dependent Observation Times
Abstract:

Conventional longitudinal data analysis methods typically assume that outcomes are independent of the data-collection schedule. However, the independence assumption may be violated when an event triggers outcome assessment in between prescheduled follow-up visits. For example, patients initiating warfarin therapy who experience poor anticoagulation control may have extra physician visits to monitor the impact of necessary dose changes. Observation times may therefore be associated with outcome values, which may introduce bias when estimating the effect of covariates on outcomes using standard longitudinal regression methods. We consider a joint model approach with two components: a semi-parametric regression model for longitudinal outcomes and a recurrent event model for observation times. The semi-parametric model includes a parametric specification for covariate effects, but allows the effect of time to be unspecified. We formulate a framework of outcome-observation dependence mechanisms to describe conditional independence between the outcome and observation-time processes given observed covariates or shared latent variables.


We generalize existing methods for continuous outcomes by accommodating any combination of mechanisms through the use of observation-level weights and/or patient-level latent variables. We develop new methods for binary outcomes, while retaining the flexibility of a semi-parametric approach. We extend these methods to account for discontinuous risk intervals in which patients enter and leave the at-risk set multiple times during the study. Our methods are based on counting process approaches, rather than relying on possibly intractable likelihood-based or pseudo-likelihood-based approaches, and provide marginal, population-level inference. In simulations, we evaluate the statistical properties of our proposed methods. Comparisons are made to 'naïve' approaches that do not account for outcome-dependent observation times. We illustrate the utility of our proposed methods using data from a randomized trial of interventions designed to improve adherence to warfarin therapy and a randomized trial of malaria vaccines among children in Mali.


Arwin Thomasson, PhD

Year Graduated: 2012
Advisor:
Sarah Ratcliffe, PhD
Dissertation Title:
A joint longitudinal-survival model with possible cure: an analysis of patient outcomes on the liver transplant waiting list
Abstract:

Data from transplant patients have many unique characteristics that can cause problems with statistical modeling. The patient's underlying disease/health trajectory is known to affect both longitudinal biomarker values and the probability of both death and transplant. In liver transplant patients, biomarker values show a sharp exponential increase in the days preceding death or transplant. Patients who receive transplants show an immediate drop in biomarker values post-transplant, followed by an exponential decrease. Patients' survival probabilities also change post-transplant, with dependencies on pre-transplant biomarker values. To properly incorporate these clinical features, we developed a joint longitudinal-survival model that incorporates an exponential growth-decay longitudinal model and a cure survival model. This allows us to evaluate patient biomarker trajectories and survival times both pre- and post-transplant. The models are linked by patient-level shared random effects that appear in the biomarker trajectories and the frailties of the survival functions. Estimates are obtained via the EM algorithm, with random effects integrated out of the complete-data likelihood function using adaptive quadrature techniques. Simulations show our model performs reasonably well under a variety of conditions. We demonstrate our methods using liver transplant data from the United Network for Organ Sharing (UNOS). We use total serum bilirubin as our longitudinal outcome, with age at wait listing and gender as linear covariates. Gender is used as a covariate in the survival model both pre- and post-transplant.

Simon Vandekar, PhD

Year Graduated: 2018
Advisor:
Russell T. Shinohara, PhD
Dissertation Title:
“Association tests for neuroimaging studies of development and disease”
Abstract:

There are growing bodies of literature identifying neuroanatomical markers of psychiatric disorders in adolescence and dementia in adulthood. Because prevention is an increasing target of intervention, establishing early biomarkers is critical to identify individuals who are at risk and may benefit from early therapy. Neuroimaging provides diverse measurements of neurological features that may serve as biomarkers of disease. In this dissertation, we discuss two new tools for performing association tests in neuroimaging studies that we use to explore development and disease at two periods of risk in the lifespan. We propose a robust procedure to perform “mass-univariate testing” and develop a novel framework for association testing that serves as an alternative to the classical mass-univariate approach. Our methods are solutions to assumption violations of classical procedures that have led to inflated type I error rates in neuroimaging studies.


Saran Vardhanabhuti, PhD

Year Graduated: 2011
Advisor:
Hongzhe Li, PhD
Dissertation Title:
Statistical Methods for Multi-Sample Analysis of RNA-Seq and DNA Copy Number Data
Abstract:

In this dissertation, I developed statistical and computational methods motivated by problems in genomics studies. In particular, the theme was to explore how to improve statistical inference in studies involving multiple samples. The first part of my dissertation is motivated by the analysis of RNA-Seq data, which are used to study gene and isoform expression on the Next Generation Sequencing (NGS) platform. For this part of the dissertation, we present a Bayesian hierarchical model for multi-sample RNA-Seq data analysis in order to simultaneously estimate isoform-specific expression and to identify differentially expressed isoforms. Our model has the advantage of borrowing information across all samples in estimating expression levels, which can improve the estimates drastically, particularly for low-abundance isoforms. Furthermore, our model can easily incorporate sample-specific covariates, which facilitates isoform-specific differential expression analysis. Simulation studies demonstrated that this Bayesian multi-sample approach can lead to more precise estimates of isoform-specific expression and higher power to detect differential expression by borrowing information across all samples compared to single-sample analysis, especially for isoforms of low abundance. We further illustrated our methods using the RNA-Seq data of 10 Yoruban and 10 Caucasian individuals. For the second part of my dissertation, we studied copy number changes in germline and tumor DNA samples. Our approach focuses on the change-point detection problem across multiple samples using an adapted multi-sample wavelet transformation. We present two approaches to assess the significance of change points: first, a flexible analytic threshold with the ability to control the type I error rate at a pre-specified level to identify significant shared change points across multiple samples; and second, a permutation-based method to identify recurrent change points (driver mutations as opposed to random passenger mutations). Simulation and data analysis show that information pooled across samples can help boost detection power compared to single-sample analysis, particularly in regions with a low proportion of carriers. Examples from germline and tumor DNA copy number data were used to illustrate our approach.

Fei Wan, PhD

Year Graduated: 2016
Advisor:
Nandita Mitra, PhD
Dissertation Title:
Instrumental variable and propensity score methods for bias adjustment in non-linear models
Abstract:

Unmeasured confounding is a common concern when clinical and health services researchers attempt to estimate a treatment effect using observational data or randomized studies with imperfect compliance. To address this concern, instrumental variable (IV) methods, such as two-stage predictor substitution (2SPS) and two-stage residual inclusion (2SRI), have been widely adopted. In many clinical studies of binary and survival outcomes, 2SRI has been accepted as the method of choice over 2SPS, but a compelling theoretical rationale has not been put forward.
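
For readers unfamiliar with the two estimators, the following minimal sketch shows the basic two-stage mechanics of 2SPS and 2SRI for a binary outcome. The simulated data, the data-generating values, the linear first stage, and the use of statsmodels are illustrative assumptions, not the dissertation's analysis.

    # Hedged sketch: contrasting 2SPS and 2SRI on simulated data with a binary
    # instrument, an unmeasured confounder, and a binary outcome.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 20000
    z = rng.binomial(1, 0.5, n)                                  # instrument (e.g., randomization)
    u = rng.normal(size=n)                                       # unmeasured confounder
    d = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 2 * z + u))))     # treatment received
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 0.5 * d + u))))   # outcome

    # Stage 1: regress treatment on the instrument (linear first stage for simplicity).
    stage1 = sm.OLS(d, sm.add_constant(z)).fit()
    d_hat = stage1.fittedvalues
    resid = d - d_hat

    # 2SPS: replace the observed treatment with its stage-1 prediction.
    fit_2sps = sm.Logit(y, sm.add_constant(d_hat)).fit(disp=0)

    # 2SRI: keep the observed treatment and add the stage-1 residual as a covariate.
    fit_2sri = sm.Logit(y, sm.add_constant(np.column_stack([d, resid]))).fit(disp=0)

    print("2SPS log-odds ratio:", fit_2sps.params[1])
    print("2SRI log-odds ratio:", fit_2sri.params[1])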


First, we directly compare the bias in the causal hazard ratio estimated by these two IV methods. Under the potential outcomes and principal stratification framework, we derive closed-form solutions for the asymptotic bias in estimating the causal hazard ratio among compliers for both the 2SPS and 2SRI methods, assuming survival time follows a Weibull distribution with random censoring. When there is no unmeasured confounding and no always-takers, our analytic results show that 2SRI is generally asymptotically unbiased but 2SPS is not. However, when there is substantial unmeasured confounding, 2SPS performs better than 2SRI with respect to bias under certain scenarios. We use extensive simulation studies to confirm the analytic results from our closed-form solutions. We apply these two methods to prostate cancer treatment data from SEER-Medicare and compare the 2SRI and 2SPS estimates to results from two published randomized trials.


Next, we propose a novel two-stage structural modeling framework for understanding the bias in estimating the conditional treatment effect with 2SPS and 2SRI when the outcome is binary, count, or time-to-event. Under this framework, we demonstrate that the bias in the 2SPS and 2SRI estimators can be reframed to mirror the problem of omitted variables in non-linear models. We demonstrate that 2SRI estimates are generally unbiased for logit and Cox models only when the influence of the unmeasured covariates on the treatment is proportional to their effect on the outcome. We also propose a novel dissimilarity metric to quantify the difference in these effects and demonstrate that with increasing dissimilarity, the bias of 2SRI increases in magnitude. We investigate these methods using simulation studies and data from an observational study of perinatal care for premature infants.


Last, we extend Heller and Venkatraman's covariate-adjusted conditional log-rank test by using the propensity score method. We introduce the propensity score to balance the distribution of covariates among treatment groups and to reduce the dimensionality of the covariates entering the conditional log-rank test. We perform simulations to assess the performance of this new method relative to the covariate-adjusted Cox model and score test.


Hong Wan, PhD

Year Graduated: 2013
Advisor:
Susan Ellenberg, PhD
Dissertation Title:
Issues in Group Sequential/Adaptive Designs
Abstract:


In recent years, there has been great interest in the use of adaptive features in clinical trials (i.e., changes in design or analyses guided by examination of the accumulated data at an interim point in the trial) that may make the studies more efficient (e.g., shorter duration, fewer patients). Many statistical methods have been developed to maintain the validity of study results when adaptive designs are used (e.g., control of the type I error rate). Group sequential designs, which allow early stopping for efficacy in light of compelling evidence of benefit or early stopping for futility when the likelihood of success is low at interim analyses, have been widely used for many years. In this dissertation, we study several statistical issues in group sequential/adaptive designs. Sample size re-estimation has drawn a great deal of interest because it permits revision of the target treatment difference based on the unblinded interim analysis results from an ongoing trial. A possible risk of unblinded sample size re-estimation is that the treatment effect observed at the interim analysis might be back-calculated from the modified sample size, which might jeopardize the integrity of the trial. In the first project, we propose a pre-specified stepwise two-stage sample size adaptation to lessen the information on the treatment effect that would be revealed. We minimize the expected sample size among a class of these designs and compare efficiency with the fully optimized two-stage design, the optimal two-stage group sequential design, and designs based on promising conditional power. In the second project, we define the complete ordering of a group sequential sample space and show that a Wang-Tsiatis boundary family or an exponential spending function family can completely order the sample space. We also propose a simple method to transform a spending function to a completely ordered sample space when using the sequential p-value ordering. This method is also extended to beta-spending functions for p-values to reject the alternative hypothesis. In the third project, we propose a simple approach for controlling the family-wise error rate in a group sequential design with multiple testing. We apply sequential p-values at the interim analysis from a group sequential design to the sequentially rejective graphical procedure, which is based on the closure principle. We also use simulations to study the operating characteristics of multiple testing in group sequential designs. We show that, in terms of expected sample size, using a group sequential design in multiple hypothesis testing is more efficient than fixed sample size designs in many scenarios.
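
As a small illustration of the boundary families discussed in the second project, the sketch below computes the shape of Wang-Tsiatis group sequential boundaries for an assumed number of analyses. The constants used are rough illustrative values only; in practice they must be calibrated numerically (by recursive integration over the joint distribution of the sequential test statistics) so that the overall type I error is controlled.

    # Hedged sketch: shape of Wang-Tsiatis boundaries c_k = C * (k/K)**(delta - 0.5).
    import numpy as np

    def wang_tsiatis_bounds(K, C, delta):
        """Critical values at analyses k = 1..K for a Wang-Tsiatis boundary."""
        k = np.arange(1, K + 1)
        return C * (k / K) ** (delta - 0.5)

    K = 5
    # delta = 0.5 gives Pocock-type (flat) bounds; delta = 0 gives
    # O'Brien-Fleming-type bounds that are stringent early and relax later.
    for delta, C in [(0.5, 2.41), (0.0, 2.04)]:   # C values are illustrative guesses
        print(f"delta = {delta}:", np.round(wang_tsiatis_bounds(K, C, delta), 3))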


Chia-Hao Wang, PhD

Year Graduated: 2010
Advisor:
Thomas R. Ten Have, PhD
Dissertation Title:
Causal Effect Estimation Under Linear And Log-Linear Structural Nested Mean Models In The Presence Of Unmeasured Confounding
Abstract:

In randomized clinical trials where the effects of post-randomization factors are of interest, the standard regression analyses are biased due to unmeasured confounding. Causal methods such as the instrumental variables (IV; Angrist et al., 1996) and G-estimation procedures under structural nested mean models (SNMMs; Robins, 1994, 1997) allow one to make valid inference even if unmeasured confounding is present. Two commonly used IV approaches, namely the two-stage predictor substitution (2SPS) and two-stage residual inclusion (2SRI), are typically applied in analysis assuming the exclusion restriction to adjust for confounding. However, the exclusion restriction may be violated in clinical applications especially when the mechanism of treatment is assessed under mediation analyses. Accordingly, we focus on estimating the direct effect of the randomized treatment adjusting for a post-randomization mediator (mediation analysis). In the first chapter, we extend the two IV approaches to estimate the direct effect, and evaluate the corresponding theoretical properties under the linear SNMM. Under certain assumptions, we have shown that the 2SPS and 2SRI approaches are equivalent to the linear SNMM. In the second chapter, we further extend and investigate the validity of these IV methods for estimation under a log-linear SNMM. The results show that the IV estimators are biased under the log-linear SNMM in the presence of unmeasured confounding. Therefore, in the third chapter we consider the G-estimation approach as an alternative solution to remove bias under the log-linear SNMM. The G-estimation method was previously developed under either the exclusion restriction assumption or the sequential ignorability assumption. We present a general framework where these two assumptions are relaxed. In contrast to the IV log-linear regression methods, we have shown that the proposed G-estimators are unbiased in the presence of unmeasured confounding. Finally, we illustrate all methods in a lung cancer randomized trial for mediation analysis where the sequential ignorability assumption is violated. The results are discussed and compared to those from the standard regression approach.


Hao Wang, PhD

Year Graduated: 2009
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Statistical Methods For Heaped Data
Abstract:


Heaping is a common type of measurement error that emerges when data are collected with various degrees of coarseness. We say that a dataset is "heaped" when it contains a mixture of exact and rounded-off values. Parameter estimates derived from heaped data can be misleading if the errors imparted by heaping are ignored. This dissertation describes methods to account for the measurement errors inherent in heaped data. Based on the concept that each observed outcome represents a true value potentially distorted by measurement error, which may include heaping, we formulate two latent variables: one representing the underlying true value, the other characterizing the misreporting behavior responsible for measurement errors. We propose probability models for the two latent processes and describe Bayesian methods to estimate the model parameters. The sensitivity of the inferences to the specification of the underlying distributions is assessed using Bayes factors. We apply graphical posterior predictive checks to evaluate the adequacy of the fitted model using multiple imputations of the latent true variable.



The development and evaluation of our modeling approach is divided into three projects. In the first project, we propose a strategy for modeling univariate heaped data and describe its application to data acquired from a smoking cessation trial to assess the efficacy of the antidepressant drug bupropion. In the second project, we extend the method to model longitudinal heaped data by incorporating separate random effects into the models for the underlying latent processes. We summarize the results obtained when the two latent processes were used to characterize an interview technique, known as time-line follow back, that is widely used in smoking cessation trials to assess daily cigarette consumption. In the third project, we propose a method to correct for digit preference in blood pressure (BP) measurements. Through an application of the method to systolic BP measurements from REGARDS, a large national population-based cohort study to evaluate risk factors for stroke, we assess the effect of digit preference on the misclassification of hypertension.
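
To make the two-latent-variable formulation concrete, the following sketch simulates heaped cigarette counts from a latent true count and a latent rounding behavior. The Poisson rate, the rounding probabilities, and the rounding multiples are illustrative assumptions, not estimates from the trial data.

    # Hedged sketch: heaping as a latent true value plus a latent reporting behavior.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000
    true_cpd = rng.poisson(lam=18, size=n)          # latent true cigarettes per day

    # Latent reporting behavior: report exactly, or round to the nearest 5 or 10.
    behavior = rng.choice(["exact", "round5", "round10"], size=n, p=[0.4, 0.4, 0.2])
    reported = np.where(behavior == "exact", true_cpd,
               np.where(behavior == "round5", 5 * np.round(true_cpd / 5),
                        10 * np.round(true_cpd / 10))).astype(int)

    # Heaping shows up as spikes at multiples of 5 and 10 in the reported values.
    values, counts = np.unique(reported, return_counts=True)
    print(dict(zip(values.tolist(), counts.tolist())))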


Le Wang, PhD

Year Graduated: 2018
Advisor:
Jinbo Chen, PhD
Dissertation Title:
Statistical Methods for Outcome-Dependent Sampling Designs
Abstract:

My dissertation work focuses on the development of novel outcome-dependent sampling designs and statistical methods of analysis. In a biomedical cohort study for assessing association between a binary outcome variable and a set of covariates, it is common that some covariates can only be measured on a subgroup of study subjects. An important design question is which subjects to select into the subgroup to increase statistical efficiency. Existing designs can achieve improved efficiency for estimating odds ratio parameters for the completely observed covariates. Our goal is to improve efficiency for the incomplete covariates, which is of great importance in studies where the covariates of interest cannot be fully collected. In the first two projects, we proposed a novel sampling design for a common scenario in which an external model relating the outcome to the complete covariates is available. Our design oversampled cases and controls whose observed outcomes were assigned low probabilities by the external model, while matching cases and controls on the complete covariates. We developed a pseudo-likelihood method for estimating odds ratio parameters. Through simulation studies and a real cohort study, we showed that our design led to reduced asymptotic variances of the odds ratio parameter estimates for both incomplete and complete covariates. In the third project, we developed a family-supplemented inverse-probability-weighted empirical likelihood approach to correcting for a type of outcome-dependent selection bias in case-control genetic association studies, where genotype data were incomplete for reasons related to the genotype itself. Genetic association analysis would be biased if such non-ignorable missingness were naively ignored. Our method exploited genetic data from family members to help infer missing genotype data. It jointly estimated odds ratio parameters for genetic association and missingness, where a logistic regression model was used to relate missingness to genotype and other covariates. In the estimating equation for the genetic association parameters, we weighted the empirical likelihood score function based on subjects who had genotype data by the inverse probabilities that their genotype data were available. We studied the large- and finite-sample performance of our method and applied it to a family-based case-control study of breast cancer.
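
A minimal sketch of the selection idea behind the first two projects is given below. The external risk model, its coefficients, and the sample sizes are hypothetical, and the covariate-matching step of the full design is omitted for brevity.

    # Hedged sketch: select cases and controls whose own observed outcome the
    # external model (built on the complete covariates) predicts poorly.
    import numpy as np

    rng = np.random.default_rng(2)
    n, m_per_group = 5000, 200
    x = rng.normal(size=n)                               # complete covariate
    risk = 1 / (1 + np.exp(-(-2 + 1.2 * x)))             # external model prediction
    y = rng.binomial(1, risk)                            # observed binary outcome

    # Probability the external model assigns to each subject's own outcome.
    p_own = np.where(y == 1, risk, 1 - risk)

    cases = np.where(y == 1)[0]
    controls = np.where(y == 0)[0]

    # Take the cases and controls with the lowest predicted probability of their
    # own outcome (the "surprising" subjects); the full design additionally
    # matches cases and controls on the complete covariates.
    sel_cases = cases[np.argsort(p_own[cases])][:m_per_group]
    sel_controls = controls[np.argsort(p_own[controls])][:m_per_group]
    print(len(sel_cases), "cases and", len(sel_controls), "controls selected")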


Lu Wang, PhD

Year Graduated: 2019
Advisor:
Jinbo Chen, PhD
Dissertation Title:
Statistical Methods for Analyzing Electronic Health Record Data

Zhi Wei, PhD

Year Graduated: 2008
Advisor:
Hongzhe Li, PhD
Dissertation Title:
Statistical Methods For Network-Based Analysis Of Genomic Data
Abstract:

After many years of biomedical research, biologists have accumulated much knowledge about genes' collaborative activity. This knowledge is summarized in the form of biological pathways. Knowledge about biological pathways has turned out to be useful in genome research, and many computational methods have been proposed to utilize this information in the analysis of high-dimensional data. However, many of these methods use pathway information in post hoc ways, and pathways are hardly used in the modeling step. This dissertation studies statistical methods for systematically modeling the gene dependency encoded in biological pathways. The first part of this dissertation models pathway group structure. Specifically, Chapter 2 develops a pathway-based gradient descent boosting procedure for nonparametric pathway-based regression (NPR) analysis of genomic data. Such NPR models treat genes in the same pathway as a group and consider multiple pathways simultaneously, while allowing complex interactions among genes within a pathway. Our simulation studies and real-world applications indicate that the NPR models can indeed identify relevant genes and pathways. The second part of this dissertation models pathway graph structure and develops several Markov random field (MRF) models for the dependency of gene expression patterns in biological pathways. Specifically, Chapter 3 proposes a hidden MRF (hMRF) model for analysis of non-temporal genomic data. In microarray time course (MTC) data, genes exhibit not only pathway graph dependency but also temporal dependency, so Chapter 4 extends the hMRF model into a hidden spatial-temporal MRF model that simultaneously considers the graph and temporal dependencies for analysis of MTC data. Alternatively, for short MTC data with a few time points, Chapter 5 treats observed gene expression data as multivariate vectors and assumes genes share the same expression patterns over time. A Bayesian framework with the hMRF model as the prior is employed. Different multivariate empirical Bayesian models are developed to serve as the emission probabilities for longitudinal and cross-sectional designs. Simulation studies and real-world applications show that, by utilizing pathway graph structure information, these MRF-based models are quite effective in identifying genes and modified subnetworks, with higher sensitivity than common procedures and comparable false discovery rates.


Matthew White, PhD

Year Graduated: 2012
Advisor:
Sharon X. Xie, PhD
Dissertation Title:
Statistical Methods for Evaluating Diagnostic Biomarkers in the Presence of Measurement Error
Abstract:


In recent years, biomarkers have grown in importance in many clinical and epidemiological settings. Many biomarkers are obtained with measurement error due to imperfect lab conditions or temporal variability within subjects, and it is therefore critical to develop analytical methods to quantify and adjust for measurement error in the evaluation of diagnostic markers.



We first develop a parametric bias-correction approach to adjust estimates of sensitivity, specificity, and other diagnostic measures for measurement error by using an internal reliability sample. We derive asymptotic expressions for the bias in naive estimators. We prove that the bias-corrected estimators are consistent and asymptotically normally distributed and derive the asymptotic variance of the estimators using the delta method. We evaluate our method through extensive simulations and illustrate our method using a biomarker study in Alzheimer's disease (AD).



Next, we develop optimal design strategies for studying the effectiveness of an error-prone biomarker in differentiating diseased from non-diseased individuals and focus on the area under the receiver operating characteristic curve (AUC) as the primary measure of effectiveness. Using an internal reliability sample within the diseased and non-diseased groups, we develop optimal study design strategies that 1) minimize the variance of the estimated AUC subject to constraints on the total number of observations or total cost of the study or 2) achieve a pre-specified power. We develop optimal allocations of the number of subjects in each group, the size of the reliability sample in each group, and the number of replicate observations per subject in the reliability sample in each group under a variety of commonly seen study conditions.



Finally, we propose a parametric approach to compare two or more correlated AUCs when the biomarkers are subject to correlated measurement errors. We show that the proposed estimator is consistent and asymptotically normally distributed and derive its asymptotic variance using the delta method. We compare the performance of our method to naive methods that ignore the correlation in measurement errors through simulations and show that ignoring this correlation can lead to biased estimates of the AUC difference. We return to the AD biomarker study to demonstrate our method.
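
The flavor of a reliability-sample correction can be illustrated with a simple binormal AUC example under a classical additive normal error model. This simplified correction and the simulated values are assumptions for illustration, not the dissertation's exact estimators.

    # Hedged sketch: correct a binormal AUC for additive measurement error, with
    # the error variance estimated from duplicate measurements in an internal
    # reliability sample.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n_d, n_nd, n_rel = 300, 300, 100
    sigma_e = 0.8                                    # measurement error SD (unknown in practice)

    true_d = rng.normal(2.0, 1.0, n_d)               # diseased group, error-free biomarker
    true_nd = rng.normal(0.0, 1.0, n_nd)             # non-diseased group
    obs_d = true_d + rng.normal(0, sigma_e, n_d)     # single error-prone measurement
    obs_nd = true_nd + rng.normal(0, sigma_e, n_nd)

    # Internal reliability sample: duplicate measurements on a subset of subjects.
    rel_true = rng.normal(0.0, 1.0, n_rel)
    rep1 = rel_true + rng.normal(0, sigma_e, n_rel)
    rep2 = rel_true + rng.normal(0, sigma_e, n_rel)
    sigma_e2_hat = np.var(rep1 - rep2, ddof=1) / 2   # Var(e1 - e2) = 2 * error variance

    def binormal_auc(mu_d, mu_nd, var_d, var_nd):
        return norm.cdf((mu_d - mu_nd) / np.sqrt(var_d + var_nd))

    naive = binormal_auc(obs_d.mean(), obs_nd.mean(), obs_d.var(ddof=1), obs_nd.var(ddof=1))
    corrected = binormal_auc(obs_d.mean(), obs_nd.mean(),
                             obs_d.var(ddof=1) - sigma_e2_hat,
                             obs_nd.var(ddof=1) - sigma_e2_hat)
    print("naive AUC:", round(naive, 3), " corrected AUC:", round(corrected, 3))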

Michael Wierzbicki, PhD

Year Graduated: 2013
Advisor:
Wensheng Guo, PhD
Dissertation Title:
Sparse Semiparametric Nonlinear Models with Applications to Chromatographic Fingerprinting
Abstract:


Medicinal herbs are composed of a multitude of compounds, and the identification of their active composition is an important area of research. High Performance Liquid Chromatography, a popular technique to detect compounds in herbs, outputs a chromatogram, a curve characterized by spikes corresponding to detected compounds. As the particular set of compounds is unique to each herb and spike locations can be used to identify compounds, chromatograms provide a visual representation, or fingerprint, of herbs. Constructing statistical models for the estimation and comparison of chromatographic fingerprints is difficult due to the sparse, spiky nature of chromatograms. Moreover, across different experimental conditions, the locations of spikes can be shifted, preventing the establishment of a standardized fingerprint, direct comparison of curves, and efficient compound identification.



Here we describe a sparse semiparametric nonlinear mixed effects modeling framework for the registration and estimation of functional data with sparse structures. Data-driven basis expansion is used to model group-averaged curves while parametric modeling of time warping functions aligns curves. Penalized estimation with the Adaptive Lasso penalty provides a unified criterion for curve registration, model selection, and estimation. Furthermore, the Adaptive Lasso estimators possess attractive sampling properties. The performance of the modeling framework and its role in medicinal herb research are demonstrated through its application to two chromatographic data sets.
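
As an aside on the penalization step, the adaptive lasso can be computed with a standard lasso solver by rescaling the design columns. The sketch below shows only this step on simulated data and omits the basis expansion and curve registration of the full model; the data-generating values and tuning parameters are illustrative assumptions.

    # Hedged sketch: adaptive lasso via column rescaling. Weighting the penalty
    # by initial estimates is equivalent to an ordinary lasso on rescaled
    # columns, followed by rescaling the solution back.
    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    rng = np.random.default_rng(6)
    n, p = 200, 30
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:3] = [3.0, -2.0, 1.5]                            # sparse truth
    y = X @ beta + rng.normal(scale=0.5, size=n)

    # Initial (unpenalized) estimates define the weights w_j = 1 / |b_j|**gamma.
    b_init = LinearRegression().fit(X, y).coef_
    gamma = 1.0
    w = 1.0 / (np.abs(b_init) ** gamma + 1e-8)

    # Scale columns, fit a standard lasso, then unscale the coefficients.
    X_scaled = X / w
    fit = Lasso(alpha=0.05).fit(X_scaled, y)
    beta_hat = fit.coef_ / w
    print("selected predictors:", np.flatnonzero(np.abs(beta_hat) > 1e-6))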


Kaitlin Woo, MS

Year Graduated: 2013
Advisor:
Nandita Mitra, PhD
Justin Bekelman, MD
Dissertation Title:
Stage migration in prostate cancer

Qian Wu, PhD

Year Graduated: 2013
Advisor:
Hongzhe Li, PhD
Dissertation Title:
Statistical Methods for Analysis of Multi-sample Copy Number Variants and ChIP-seq Data
Abstract:


This dissertation addresses statistical problems related to multi-sample copy number variant (CNV) analysis and to the analysis of differential histone binding between two or more biological conditions based on Chromatin Immunoprecipitation and sequencing (ChIP-seq) data. The first part of the dissertation develops methods for identifying the copy number variants that are associated with trait values. We develop a novel method, CNVtest, to directly identify the trait-associated CNVs without the need of identifying sample-specific CNVs. Asymptotic theory is developed to show that CNVtest controls the type I error asymptotically and identifies the true trait-associated CNVs with a high probability. The performance of this method is demonstrated through simulations and an application to identify the CNVs that are associated with population differentiation.



The second part of the dissertation develops methods for detecting genes with differential histone binding regions between two or more experimental conditions based on ChIP-seq data. We apply several nonparametric methods to identify the genes with differential binding regions. The methods can be applied to ChIP-seq data on histone modification even without replicates. Our method is based on nonparametric hypothesis testing in order to capture spatial differences in protein-binding profiles. We demonstrate the method using ChIP-seq data from a comparative epigenomic profiling of adipogenesis of murine adipose stromal cells. Our method detects many genes with differential binding for the histone modification mark H3K27ac in gene promoter regions between proliferating preadipocytes and mature adipocytes in murine 3T3-L1 cells. The test statistics also correlate well with gene expression changes and are predictive of them, indicating that the identified differential binding regions are indeed biologically meaningful. We further extend these tests to time-course ChIP-seq experiments by evaluating the maximum and mean of the adjacent pair-wise statistics. We compare and evaluate different nonparametric tests for differential binding analysis and observe that the kernel-smoothing methods perform better in controlling the type I error, although the rankings of differentially bound genes are comparable across the different test statistics.


Yuehui Wu, PhD

Year Graduated: 2004
Advisor:
Kathleen J. Propert, ScD
Dissertation Title:
Design For Intervention Studies With Categorical Outcomes
Abstract:

This thesis covers three related topics in the optimal design of experiments, with applications to clinical trials and rodent toxicology experiments. In general, optimal design theory provides guidelines for study design that maximize the efficiency of statistical estimates and hypothesis tests. In many settings, this can provide substantial savings in required sample sizes and, thus, overall costs. The focus here is on optimal design for experiments with categorical outcomes, although many of the methods are directly extendable to other types of measures. The first topic addresses designs for dose-response experiments in which the outcome is an ordinal variable with many categories. Regression models based on the beta distribution are developed for these outcomes, and D-optimal designs are used to identify the optimal selection of doses. The methods are illustrated using data from a clinical trial. For the second topic, optimal designs are developed for animal experiments in regulated gene therapy. Four candidate dose-response models are explored that allow a dichotomous response to be a function of the doses of two simultaneous interventions, plus an interaction term. D-optimal designs are then identified for these models, and guidelines for the selection of doses are provided. The final topic addresses two-stage sequential designs in which the goal is to select one of a number of candidate treatments in the first stage for further testing in the second stage. The primary outcomes considered are Poisson outcomes, either independent, such as voiding frequency in urologic disorders, or paired, such as those seen in ophthalmology where one eye may be used as an internal control. First, under the assumption of normally distributed outcomes, required sample sizes at each stage are identified, given assumptions about the true parameter values and error rates. These results are then extended to the Poisson outcome cases using variance-stabilizing transformations and the Poisson difference distribution.


Jichun Xie, PhD

Year Graduated: 2011
Advisor:
Hongzhe Li, PhD
T. Tony Cai, PhD (Wharton)
Dissertation Title:
Methods For High Dimensional Inferences With Applications In Genomics
Abstract:

In this dissertation, I have developed several high dimensional inference and computational methods motivated by problems in genomics studies. It consists of two parts. The first part is motivated by analysis of data from genome-wide association studies (GWAS), where I have developed an optimal false discovery rate (FDR) controlling method for high dimensional dependent data. For short-ranged dependent data, I have shown that the marginal plug-in procedure has the optimal property of controlling the FDR while minimizing the false non-discovery rate (FNR). When applied to analysis of the neuroblastoma GWAS data, this procedure identified six more disease-associated variants compared to previous p-value based procedures such as the Benjamini and Hochberg procedure. I have further investigated the statistical issue of sparse signal recovery in the setting of GWAS and developed a rigorous procedure for sample size and power analysis in the framework of FDR and FNR for GWAS. In addition, I have characterized the almost complete discovery boundary in terms of signal strength and non-null proportion and developed a procedure to achieve the almost complete recovery of the signals. The second part of my dissertation was motivated by gene regulation network construction based on genetical genomics (eQTL) data. I have developed a sparse high dimensional multivariate regression model for studying the conditional independence relationships among a set of genes adjusting for possible genetic effects, as well as the genetic architecture that influences the gene expression. I have developed a covariate adjusted precision matrix estimation method (CAPME), which can be easily implemented by linear programming. Asymptotic convergence rates and sign consistency are established for the estimators of the regression coefficients and the precision matrix. Numerical performance of the estimator was investigated using both simulated and real data sets. Simulation results have shown that CAPME results in great improvements in both estimation and graph structure selection. I have applied CAPME to the analysis of a yeast eQTL data set in order to identify the gene regulatory network among a set of genes in the MAPK signaling pathway. Finally, I have also developed the R software package CAPME based on my dissertation work.


Rengyi Emily Xu, PhD

Year Graduated: 2017
Advisor:
Pamela A. Shaw, PhD
Devan V. Mehrotra, PhD
Dissertation Title:
Methods for survival analysis in small samples
Abstract:

Studies with time-to-event endpoints and small sample sizes are commonly seen; however, most statistical methods are based on large sample considerations. We develop novel methods for analyzing crossover and parallel study designs with small sample sizes and time-to-event outcomes. For two-period, two-treatment (2x2) crossover designs, we propose a method in which censored values are treated as missing data and multiply imputed using pre-specified parametric failure time models. The failure times in each imputed dataset are then log-transformed and analyzed using ANCOVA. Results obtained from the imputed datasets are synthesized for point and confidence interval estimation of the treatment ratio of geometric mean failure times using model averaging in conjunction with Rubin's combination rule. We use simulations to illustrate the favorable operating characteristics of our method relative to two other existing methods. We apply the proposed method to study the effect of an experimental drug relative to placebo in delaying a symptomatic cardiac-related event during a 10-minute treadmill walking test. For parallel designs comparing survival times between two groups in the setting of proportional hazards, we propose a refined generalized log-rank (RGLR) statistic by eliminating an unnecessary approximation in the development of Mehrotra and Roth's (2001) GLR approach. We show across a variety of simulated scenarios that the RGLR approach provides smaller bias than the commonly used Cox model, parametric models, and the GLR approach in small samples (up to 40 subjects per group), and has notably better efficiency relative to the Cox and parametric models in terms of mean squared error. The RGLR approach also consistently delivers adequate confidence interval coverage and type I error control. We further show that while the performance of the parametric model can be significantly influenced by misspecification of the true underlying survival distribution, the RGLR approach provides consistently low bias and high relative efficiency. We apply all competing methods to data from two clinical trials studying lung cancer and bladder cancer, respectively. Finally, we further extend the RGLR method to allow for stratification, where stratum-specific estimates are first obtained using RGLR and then combined across strata for overall estimation and inference using two different weighting schemes. We show through simulations that the stratified RGLR approach delivers smaller bias and higher efficiency than the commonly used stratified Cox model analysis in small samples, notably so when the assumption of a constant hazard ratio across strata is violated. A dataset is used to illustrate the utility of the proposed new method.
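
As a small illustration of the combination step in the first project, the sketch below applies Rubin's rule to hypothetical per-imputation ANCOVA estimates of the log treatment ratio. The inputs are made up, and the model-averaging step across imputation models is omitted.

    # Hedged sketch: Rubin's combination rule for a log geometric-mean ratio
    # estimated in each of M multiply imputed datasets.
    import numpy as np
    from scipy.stats import t

    def rubin_combine(estimates, variances, alpha=0.05):
        estimates, variances = np.asarray(estimates), np.asarray(variances)
        m = len(estimates)
        qbar = estimates.mean()                      # pooled point estimate
        ubar = variances.mean()                      # within-imputation variance
        b = estimates.var(ddof=1)                    # between-imputation variance
        total_var = ubar + (1 + 1 / m) * b
        df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2   # Rubin (1987) degrees of freedom
        half_width = t.ppf(1 - alpha / 2, df) * np.sqrt(total_var)
        return qbar, (qbar - half_width, qbar + half_width)

    log_ratio_hats = [0.21, 0.18, 0.25, 0.19, 0.23]   # hypothetical ANCOVA estimates
    log_ratio_vars = [0.010, 0.012, 0.009, 0.011, 0.010]
    est, ci = rubin_combine(log_ratio_hats, log_ratio_vars)
    print("geometric mean ratio:", np.exp(est), "95% CI:", np.exp(ci[0]), np.exp(ci[1]))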


Lingfeng Yang, PhD

Year Graduated: 2007
Advisor:
Mary D. Sammel, ScD
Thomas R. Ten Have, PhD
Dissertation Title:
Extensions Of Latent Class Trajectory Models
Abstract:


The latent variable model is a useful tool for longitudinal/multivariate data analysis. It not only deals with the trajectory of the entire response profile together, but also summarizes both continuous and categorical outcomes that may not be combined in a straightforward way. As a categorical latent variable model, the latent class model also classifies subjects according to their underlying heterogeneity and thus may facilitate further subgroup analysis if necessary.



This dissertation extends the existing latent class model methodology in two directions and applies the extensions to a longitudinal psychiatric dataset. The first project, driven by the limited identifiability of the between-level latent class in multilevel modeling, explores the possibility of a nested latent class model by reversing the conventional conditional order between the two levels. Although the straightforward interpretation of the between-level class no longer holds in this identifiability-driven approach, we can still gain meaningful clinical implications from the association between the two levels of latent classes.



The second and third projects employ a shared parameter model to assess the impact of drop-out under the non-ignorable assumption. The second project models continuous longitudinal responses, while the third models binary responses. Non-ignorable drop-out of the longitudinal responses is confirmed in both projects, but the impact of the drop-out differs in degree in terms of changes in the significance of model covariates. We propose that the difference is caused by the loss of information that results from changing the continuous outcome to a binary outcome.



All three projects employ a quasi-Newton algorithm to maximize the finite mixture likelihood function generated by the latent classes. Simulation studies have been performed to assess the validity of the algorithm for all three estimation problems. The most dominant challenge in the latent class model is identifiability. In addition to the problems mentioned above, limited identifiability also imposes difficulties for various model diagnostic and assessment approaches.


Yang Yang, MS

Year Graduated: 2014
Advisor:
Mingyao Li, PhD
Muredach Reilly, MD
Dissertation Title:
A flexible framework for differential isoform expression analysis in RNA-seq

Gui-shuang Ying, PhD

Year Graduated: 2004
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Prediction Of Event Times In Randomized Clinical Trials
Abstract:


In clinical trials with planned interim analyses, it can be valuable for a variety of reasons to predict the times of landmark events in advance of their occurrence. Bagiella and Heitjan (2001) proposed a parametric prediction model for failure-time outcomes assuming exponential survival and Poisson enrollment. There is concern that their model has limited application because of the strong distributional assumptions, and that the predictions may be inaccurate if distributional assumptions are wrong.



To address this concern, we first propose a nonparametric approach to making point and interval prediction of landmark dates during the course of the trial. We obtain point predictions using the Kaplan-Meier estimator to extrapolate the survival probability into the future, selecting the time when the expected number of events is equal to the landmark number. To construct prediction intervals, we use the Bayesian bootstrap to generate the predictive distribution of landmark times; predictive intervals are quantiles of this distribution. Monte Carlo simulation results demonstrate the superiority of the nonparametric method when the assumptions underlying the parametric model are incorrect.
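
A simplified sketch of the nonparametric point prediction is shown below. It works on the follow-up time scale, ignores ongoing accrual, handles the tail of the Kaplan-Meier curve naively, and uses simulated data, so it illustrates the idea rather than reproducing the proposed method.

    # Hedged sketch: use the Kaplan-Meier estimate to project, for each patient
    # still event-free, the probability of an event by a future time, and report
    # the earliest time at which the expected total event count reaches the landmark.
    import numpy as np

    def km(times, events):
        """Kaplan-Meier estimate: unique event times and the survival just after each."""
        uniq = np.unique(times[events == 1])
        surv, s = [], 1.0
        for u in uniq:
            at_risk = np.sum(times >= u)
            d = np.sum((times == u) & (events == 1))
            s *= 1 - d / at_risk
            surv.append(s)
        return uniq, np.array(surv)

    def predict_landmark(times, events, landmark):
        """Earliest KM event time at which observed plus expected future events reach the landmark."""
        t_ev, s_ev = km(times, events)

        def S(t):                                # step-function survival estimate
            idx = np.searchsorted(t_ev, t, side="right") - 1
            return 1.0 if idx < 0 else s_ev[idx]

        observed = int(events.sum())
        event_free = times[events == 0]          # follow-up times of patients without an event yet
        for t in t_ev:
            # Conditional probability of an event by t for a patient event-free at time c.
            expected = sum(1 - S(t) / S(c) for c in event_free if c <= t)
            if observed + expected >= landmark:
                return t
        return None                              # landmark not reached within the observed range

    rng = np.random.default_rng(4)
    n = 200
    latent = rng.exponential(12.0, n)            # latent event times (months)
    cens = rng.uniform(0.0, 18.0, n)             # current follow-up / censoring times
    times = np.minimum(latent, cens)
    events = (latent <= cens).astype(int)
    print("events so far:", events.sum())
    print("predicted time of 120th event:", predict_landmark(times, events, 120))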



Secondly, we generalize the exponential survival model to the two-parameter Weibull model. The survival probability in the future is estimated from the available data and the prior guesses for the values of two Weibull parameters. For interval prediction, we approximate the posterior distribution using the sampling-importance-resampling technique, and generate the predictive distribution of landmark times. Monte Carlo simulation results show that the Weibull prediction model works very well for the Weibull and gamma distributions, but not so well for the lognormal distribution.



Finally, we extend the constant enrollment rate model to a non-homogeneous Poisson process model. For the parametric prediction, we use a truncated exponential enrollment model. For nonparametric prediction, we generalize the enrollment model by using weighted sampling from previous enrollment time intervals. Monte Carlo simulation results illustrate the advantage of these generalizations over the constant rate model and their flexibility in predicting enrollment.



We demonstrate these methods using data from a trial in immunotherapy of chronic granulomatous disease.


Doyeong Yu, MS

Year Graduated: 2017
Advisor:
Wei-Ting Hwang, PhD

Jarcy Zee, PhD

Year Graduated: 2014
Advisor:
Sharon X. Xie, PhD
Dissertation Title:
Survival Analysis with Uncertain Endpoints using an Internal Validation Subsample
Abstract:

When a true survival endpoint cannot be assessed for some subjects, an alternative endpoint that measures the true endpoint with error may be collected instead, which often occurs when the true endpoint is too invasive or costly to obtain. We develop nonparametric and semiparametric estimated likelihood functions that incorporate both uncertain endpoints available for all participants and true endpoints available for only a subset of participants. We propose maximum estimated likelihood estimators of the discrete survival function of time to the true endpoint and of a hazard ratio representing the effect of a binary or continuous covariate assuming a proportional hazards model. We show that the proposed estimators are consistent and asymptotically normal and develop the analytical forms of the variance estimators. Through extensive simulations, we also show that the proposed estimators have little bias compared to the naïve estimator, which uses only uncertain endpoints, and are more efficient under moderate missingness compared to the complete-case estimator, which uses only the available true endpoints. We illustrate the proposed method by estimating the risk of developing Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative. Using our proposed semiparametric estimator, we develop optimal study design strategies to compare survival across treatment groups for a new trial with these data characteristics. We demonstrate how to calculate the optimal number of true events in the validation set needed to achieve a desired power using simulated data, assuming that the baseline distribution of the true event, the effect size, the correlation between outcomes, and the proportion of true outcomes that are missing can be estimated from pilot studies. We also propose a sample size formula that does not depend on the baseline distribution of the true event and show that power calculated by the formula matches well with simulation-based results. Using results from the Ginkgo Evaluation of Memory study, we calculate the number of true events in the validation set that would need to be observed for new studies comparing development of Alzheimer's disease among those with and without antihypertensive use, as well as the total number of subjects and the number in the validation set to be recruited for these new trials.


Bret Zeldow, PhD

Year Graduated: 2017
Advisor:
Jason A. Roy, PhD
Dissertation Title:
Bayesian Nonparametric Methods for Causal Inference and Prediction
Abstract:

In this thesis we present novel approaches to regression and causal inference using popular Bayesian nonparametric methods. Bayesian Additive Regression Trees (BART) is a Bayesian machine learning algorithm in which the conditional distribution is modeled as a sum of regression trees. We extend BART into a semiparametric generalized linear model framework so that a portion of the covariates are modeled nonparametrically using BART and a subset of the covariates have parametric form. This presents an attractive option for research in which only a few covariates are of scientific interest but other covariates must be controlled for. Under certain causal assumptions, this model can be used as a structural mean model. We demonstrate this method by examining the effect that initiating certain antiretroviral medications has on mortality among HIV/HCV coinfected subjects. In later chapters, we propose a joint model for a continuous longitudinal outcome and baseline covariates using penalized splines and an enriched Dirichlet process (EDP) prior. This joint model decomposes into local linear mixed models for the outcome given the covariates and marginals for the covariates. The EDP prior placed on the regression parameters and the parameters of the covariate distributions induces clustering among subjects determined by similarity in their regression parameters and, nested within those clusters, sub-clusters based on similarity in the covariate space. When there are a large number of covariates, we find improved prediction over the same model with Dirichlet process (DP) priors. Since the model clusters based on regression parameters, it also serves as a functional clustering algorithm in which one does not have to choose the number of clusters beforehand. We use the method to estimate incidence rates of diabetes when longitudinal laboratory values from electronic health records are used to augment diagnostic codes for outcome identification. We later extend this work by using our EDP model in a causal inference setting with the parametric g-formula. We demonstrate this using electronic health record data on subjects initiating second-generation antipsychotics.


Jiameng Zhang, PhD

Year Graduated: 2004
Advisor:
Daniel F. Heitjan, PhD
Dissertation Title:
Sensitivity Analysis Of Nonignorable Coarsening
Abstract:

Missing and censored data are common types of coarse data. An important consequence of the stochastic nature of the coarsening process is nonignorability. Failure to properly account for a nonignorable coarsening mechanism could vitiate inferences. One approach to this problem is to perform a sensitivity analysis to see how inferences change when the coarsening mechanism departs from ignorability. In this dissertation, I apply a simple sensitivity analysis tool, the index of sensitivity to nonignorability (ISNI, Troxel et al. 2004), to the evaluation of nonignorability of the coarsening process based on the general coarse data model (Heitjan and Rubin, 1991). Moreover, I extend ISNI for MLE to ISNI for Bayesian inference. I also propose a graphical method to check sensitivity, which can be used as a first step in judging the robustness of key inferences to nonignorable coarsening. Simulation studies show that this sensitivity analysis procedure is valid for practical use. I illustrate the procedure through application to two real data sets, one involving censoring by the end of study in a randomized clinical trial and the other involving competing risks in an observational study.


Mingyuan Zhang, PhD

Year Graduated: 2009
Advisor:
Marshall Joffe, MD, MPH, PhD
Dylan Small, PhD
Dissertation Title:
Causal Inference In Discretely Observed Continuous Time Processes
Abstract:

In causal inference for longitudinal data, standard methods usually assume that the underlying processes are discrete time processes, and that the observational time points are the time points when the processes change values. The identification of these standard models often relies on the sequential randomization assumption, which assumes that the treatment assignment at each time point depends only on current covariates and the covariates and treatments observed in the past. However, in many real world data sets, it is more reasonable to assume that the underlying processes are continuous time processes that are only observed at discrete time points. When this happens, the sequential randomization assumption may not be true even if it is still a reasonable abstraction of the treatment decision mechanism at the continuous time level. For example, in a multi-round survey study, the decision of treatment can be made by the subject and the subject's physician in continuous time, while the treatment level and covariates are only collected at discrete times by a third-party survey organization. The mismatch between the treatment decision times and the observation times makes the sequential randomization assumption false in the observed data. In this dissertation, we show that the standard methods can produce severely biased estimates, and we explore what further assumptions need to be made to warrant the use of standard methods. If these assumptions are false, we advocate the use of the controlling-the-future method of Joffe and Robins (2009) when we are able to reconstruct the potential outcomes from the discretely observed data; when we are not, we propose a full modeling approach and demonstrate it with an example estimating the effect of vitamin A deficiency on children's respiratory infection. We also provide a semi-parametric analysis of the controlling-the-future method, deriving the semi-parametric efficient estimator.


Rongmei Zhang, PhD

Year Graduated: 2011
Advisor:
Thomas R. Ten Have, PhD
Dissertation Title:
Causal And Design Issues In Clinical Trials
Abstract:

The first part of my dissertation focuses on post-randomization modification of intent-to-treat effects. For example, in the field of behavioral science, investigations involve the estimation of the effects of behavioral interventions on final outcomes for individuals stratified by post-randomization moderators measured during the early stages of the intervention (e.g., landmark analyses in cancer research). Motivated by this, we address several questions on the use of standard and causal approaches to assessing the modification of intent-to-treat effects of a randomized intervention by a post-randomization factor. First, we show analytically the bias of the estimators of the corresponding interaction and meaningful main effects for the standard regression model under different combinations of assumptions. Such results show that the assumption of independence between two factors involved in an interaction, which has been assumed in the literature, is not necessary for unbiased estimation. Then, we present a structural nested distribution model estimated with G-estimation equations, which does not assume that the post-randomization variable is effectively randomized to individuals. We show how to obtain efficient estimators of the parameters of the structural distribution model. Finally, we confirm with simulations the performance of these optimal estimators and further assess our approach with data from a randomized cognitive therapy trial.



The second part of my dissertation is on optimal and adaptive designs for dose-finding experiments in clinical trials with multiple correlated responses. For instance, in phase I/II studies, efficacy and toxicity are often the primary endpoints; they are observed simultaneously and need to be evaluated together. Accordingly, we focus on bivariate responses with one continuous and one categorical outcome. We adopt the bivariate probit dose-response model and study locally optimal, two-stage optimal, and fully adaptive designs under different cost constraints. We assess the performance of the different designs through simulations and suggest that the two-stage designs are as efficient as, and may be more efficient than, the fully adaptive designs when the initial-stage sample size is moderate. In addition, two-stage designs are easier to construct and implement, and thus can be a useful approach in practice.


Huaqing Zhao, PhD

Year Graduated: 2011
Advisor:
Wensheng Guo, PhD
Dissertation Title:
Analyzing Population Based Genetic Association Studies With Propensity Score Approach
Abstract:

In population based genetic association studies, confounding due to population stratification (PS) arises when differences in both allele and disease frequencies exist in a population of mixed racial/ethnic subpopulations. Propensity scores are often used to address confounding in observational studies. However, they have not been adapted to correct bias due to PS in genetic association studies. Currently, genomic control, structured association, principal components analysis (PCA), and multidimensional scaling (MDS) approaches have been proposed to address this bias using genetic markers. We propose a genomic propensity score (GPS) approach to correct for bias due to PS that considers both genetic and non-genetic factors such as patient characteristics. We further propose an extended genomic propensity score (eGPS) approach that allows one to estimate a genotype effect under various genetic models in candidate gene studies. Finally, we propose a new approach that combines principal components analysis and the propensity score (PCAPS) to correct for bias due to PS in genome-wide association studies (GWAS). Simulations show that our approach can adequately adjust for bias due to confounding and preserve coverage probability, type I error and power. We illustrate these approaches in a case-control GWAS of testicular germ cell tumors. We provide a novel and broadly applicable strategy for obtaining less biased estimates of genetic associations.
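
The sketch below illustrates the genomic propensity score idea on simulated data with population stratification. The marker model, the covariates, and the simple adjustment (including the estimated score as a covariate) are illustrative assumptions rather than the dissertation's exact implementation.

    # Hedged sketch: model the candidate genotype as a function of
    # ancestry-informative markers and patient characteristics, then adjust the
    # genotype-disease association for the estimated score.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n, n_markers = 2000, 20
    pop = rng.binomial(1, 0.5, n)                        # latent subpopulation
    freqs = np.where(pop[:, None] == 1, 0.4, 0.2)        # marker allele frequency differs by subpopulation
    markers = rng.binomial(2, freqs, (n, n_markers))     # ancestry-informative markers
    age = rng.normal(50, 10, n)

    g = rng.binomial(2, np.where(pop == 1, 0.45, 0.15))  # candidate genotype (stratified)
    # Disease depends on the subpopulation (confounding) but NOT on the genotype.
    y = rng.binomial(1, 1 / (1 + np.exp(-(-2 + 1.5 * pop))))

    # Genomic propensity score: expected genotype given markers and covariates.
    X_ps = sm.add_constant(np.column_stack([markers, age]))
    gps = sm.OLS(g, X_ps).fit().fittedvalues

    naive = sm.Logit(y, sm.add_constant(g)).fit(disp=0)
    adjusted = sm.Logit(y, sm.add_constant(np.column_stack([g, gps]))).fit(disp=0)
    # The adjusted coefficient should be attenuated toward zero relative to the naive one.
    print("naive genotype log-OR:       ", round(naive.params[1], 3))
    print("GPS-adjusted genotype log-OR:", round(adjusted.params[1], 3))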


Jing Zhao, PhD

Year Graduated: 2004
Advisor:
Andrea Foulkes, PhD
Ed George, PhD (Wharton)
Dissertation Title:
Exploratory Bayesian Modeling Methods For Genetics Data
Abstract:

The last decade has been characterized by an explosion of biological sequence information. Because the requirements for applying traditional statistical methods fail for many of the data types arising in genomics, exploratory methods for characterizing genetic data are sorely needed. Developing such methods is an analytic challenge, however, due to the large number of potentially relevant biomarkers and the complex, uncharacterized relationships among them. This dissertation presents exploratory methods for modeling a variety of types of genetic data within a Bayesian framework. The work is divided into two parts, and both parts make extensive use of Bayesian model (variable) selection methods. First, we propose Markov modeling for characterizing the evolution of viral serial sequences. Second, we apply the common linear regression model for detecting high-order SNP-phenotype associations. In both parts we then present exploratory Bayesian model (variable) selection procedures that are based on different hierarchical Bayesian models. For large model spaces, i.e., a large number of multi-level biomarkers, we implement different Markov chain Monte Carlo (MCMC) stochastic search algorithms for finding promising models. In the second part, we not only provide a confirmatory permutation test to evaluate the findings obtained from the exploratory Bayesian approach, but also conduct simulation analyses to validate the method's ability to detect true underlying relationships. In the first part, we illustrate our methods by applying our procedure to explore the extent to which HIV-1 genetic changes occur independently over time. In the second part, we provide an application to phenotype and genotype data from patients at risk for cardiovascular disease, assuming known haplotype information. Finally, we extend our exploratory approach to account for unknown haplotype information.