PhD Lab Rotations

The overall goal of the rotations is to expose students to biomedical research, and in particular research related to statistical methodology, early in their training. Students are expected to participate in 3-5 rotations in total. First-year students will have 3 rotations: fall, spring, and summer. Students can expect to spend 20+ hours per week during the fall/spring semesters working on their lab rotation and are expected to work full time (40+ hours per week) during the summer. In the second year it is expected that students will “settle” into a relationship with a potential dissertation advisor, with the goal of focusing their research in an area related to that of the advisor.

Available Research/Lab Rotation Matches

Please contact the mentor for more information on any rotation you are interested in.

Mentor Title/Description Semesters Available
Matt Bryan, Ph.D. Methodological Challenges in Studying Risk Factors for Obesity in Children Spring 2016
Studying obesity in children presents numerous methodological challenges that have led many studies to use biased analytical approaches. This lab project will focus on the comparison of methods for identifying risk factors for obesity-related outcomes measured in a longitudinal setting. The goal of the project is to highlight the statistical and scientific concerns with analyzing the different outcome measures currently used in the pediatric growth literature. In particular, we will compare the use of raw weight trajectories to dichotomized obesity definitions and standardized weight scores. We aim to show that using raw weight trajectories analyzed with the newly developed longitudinal rate regression model is most appropriate. The results of this work will be formulated in a paper targeted at an epidemiological or clinical audience in order to promote effective statistical analysis in this area. Student contributions to this work may include literature review, data analysis of example datasets, simulation studies, and statistical programming, depending on the level of training and interest of the student.
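As a toy illustration of why the choice of outcome scale matters, the hedged sketch below (in R, with entirely hypothetical sample sizes and effect sizes, not the rotation's actual data or model) compares the power to detect a risk factor when a child's weight change is analyzed on its continuous scale versus after dichotomizing into an obesity indicator:

```r
set.seed(1)
one_rep <- function(n = 150) {
  risk  <- rbinom(n, 1, 0.5)                 # hypothetical binary risk factor
  slope <- 2 + 0.4 * risk + rnorm(n)         # individual weight-gain rate (kg/year)
  w0    <- rnorm(n, 30, 4)                   # baseline weight
  w3    <- w0 + 3 * slope + rnorm(n, 0, 2)   # weight after 3 years of follow-up
  obese <- as.integer(w3 > quantile(w3, 0.85))   # dichotomized outcome
  p_cont <- summary(lm(w3 - w0 ~ risk))$coefficients["risk", 4]
  p_dich <- summary(glm(obese ~ risk, family = binomial))$coefficients["risk", 4]
  c(continuous = p_cont < 0.05, dichotomized = p_dich < 0.05)
}
colMeans(t(replicate(500, one_rep())))       # empirical power: continuous > dichotomized
```

Dichotomizing typically discards information, which is one of the concerns a comparison like the one described above is designed to quantify.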
Jinbo Chen, Ph.D. Statistical Methods for Risk Prediction and Evaluation Spring 2016
I am interested in developing statistical methods for predicting cancer risk. We are currently developing statistical models for predicting the absolute risks of breast and ovarian cancer for women who carry BRCA1/2 mutations. Our models will accommodate variations in risk by age at salpingo-oophorectomy, mutation location, use of oral contraceptives, and other potential risk modifiers. We will use our model to inform the optimal age at which a woman should undergo oophorectomy, balancing the protective effect of oophorectomy against its various losses. The work will require the development of novel statistical methods for improved modeling of the age-dependent and lagged effect of oophorectomy on reducing the risk of breast cancer, refined risk prediction models that accommodate the time-varying effect of oophorectomy, and causal modeling of the effect of oophorectomy on reducing the risk of breast cancer. We will also be studying the usefulness of risk-modifying genetic variants identified to date for optimizing the age of oophorectomy. I have submitted an R01 proposal as PI to support this line of work. I am also analyzing data from a consortium study that aims to quantify BRCA1/2-related breast and ovarian cancer risk in Southern Asia. The statistical work involves penetrance estimation using family data. I am very excited about this work, since the outcome of this project may have a direct impact on the clinical management of BRCA1/2 mutation carriers in Southern Asia.
Jinbo Chen, Ph.D. Statistical Methods in Genetic Epidemiology Spring 2016
My research in this direction currently focuses on two projects. The first is the development of powerful statistical methods for analyzing multi-marker SNP genotype data with univariate or multivariate phenotypes. The phenotypes can be imaging phenotypes or multiple biomarkers collected on a pathway. The second is the development of powerful statistical methods for exploiting precision covariates and known genetic susceptibility variants in testing genetic associations. I have submitted an R01 proposal as PI to support this line of work.
Jinbo Chen, Ph.D. Statistical Methods for Outcome-dependent Sampling Designs Spring 2016
I have a longstanding interest in efficient biomedical study designs and analyses. I have two ongoing projects: the design and analysis of two-phase studies where the outcome variable cannot be precisely measured, and risk prediction under the case-control sampling design. This line of work is supported by my standing R01 award.
Yong Chen, Ph.D. Evaluating the impact of outcome misclassification and covariate measurement error on association studies using EHR data Spring and Summer 2016
Motivation: The deployment of electronic health record (EHR) systems offers the promise of aggregating information for clinical research and improved health care delivery. Since EHRs contain large populations with diverse disease conditions, they have the potential to act as platforms for generating sets of cases and controls for clinical and translational research. However, it is well known that EHRs are not designed for research purposes, so the disease outcomes and important clinical covariates identified from EHRs are subject to measurement error. Method: We propose to first empirically evaluate the impact of outcome misclassification and covariate measurement error, and then to develop rigorous methods to deal with these problems, with a focus on modified hypothesis testing procedures and bias-corrected estimation methods. More specifically, we will incorporate partial information on misclassification and measurement error from related studies into the testing/estimation procedures (a minimal illustration of the misclassification problem follows this entry).

Tentative timeline: 2-3 weeks (literature review); 2-3 weeks (method development); 3 weeks (simulation studies); 6 weeks (data analysis and manuscript writing).
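A minimal sketch of the outcome-misclassification problem described above, assuming the sensitivity and specificity of the EHR-derived phenotype are known from a validation study (all values hypothetical; this is not the proposed testing/estimation procedure):

```r
set.seed(2)
n  <- 5000
x  <- rbinom(n, 1, 0.5)                   # exposure recorded in the EHR
y  <- rbinom(n, 1, plogis(-2 + 0.8 * x))  # true (unobserved) outcome
se <- 0.85; sp <- 0.95                    # assumed known from a validation study
y_obs <- ifelse(y == 1, rbinom(n, 1, se), rbinom(n, 1, 1 - sp))   # misclassified outcome

p_obs  <- tapply(y_obs, x, mean)                # observed outcome rates by exposure
p_corr <- (p_obs + sp - 1) / (se + sp - 1)      # misclassification-corrected rates

rbind(truth = tapply(y, x, mean), naive = p_obs, corrected = p_corr)
```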

Yong Chen, Ph.D. Statistical modeling of the fundamental diagram Spring and Summer 2016
Background: Transportation systems suffer from a lack of efficiency. These inefficiencies have a tremendous impact: 2.8 billion gallons of wasted fuel and 5.5 billion wasted hours in 2011; nearly one third of CO2 emissions in the US; and deaths from accidents (130,000 in 2013, the fourth leading cause of death in the US). New technologies allow big data collection from infrastructure and user sensors. Although these data have been analyzed in a number of studies, a comprehensive approach based on a synergistic combination of statistical inference, modeling, and numerical and analytic tools is still missing. This project aims to use model-guided statistical analysis of traffic data to provide innovative tools to support transport agencies' decision-making.

Proposed method: It is well acknowledged that there is substantially more variation in flux when the density is above a critical density than below it. To account for this extra variation, we propose to model the density-flux relation as a mixture of distributions by allowing the critical density and maximum flux to depend on measured external conditions (e.g., weather, day of the week, an indicator of a special event, number of lanes, etc.). Specifically, we will propose a semiparametric model for the density-flux relation at a particular sensor. The proposed model has several important advantages. First, the density-flux relation is modeled by a flexible, data-driven approach using B-spline smoothing. This strategy avoids subjective assumptions on the fundamental diagram and hence potential bias due to model misspecification. Second, by allowing heterogeneity in the critical density and maximum flux, we can account for the extra variation in flux when the density is above the critical density. Third, the model parameters quantify the relations between the predictors and the critical density (or the maximum flux), which are of primary interest; these parameters play important roles in decision making for traffic management. (A minimal B-spline sketch follows the timeline below.)

Tentative timeline: 2-3 weeks (literature review); 2-3 weeks (method development); 3 weeks (simulation studies); 6 weeks (data analysis and manuscript writing).
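A minimal sketch of the B-spline idea in the proposed method, using simulated loop-detector data with larger flux variation above a hypothetical critical density (not the actual semiparametric mixture model):

```r
library(splines)
set.seed(3)
density <- runif(800, 0, 120)                       # vehicles per km (hypothetical)
crit    <- 40                                       # critical density
flux_mu <- ifelse(density < crit,
                  20 * density,                                  # free-flow branch
                  20 * crit * (120 - density) / (120 - crit))    # congested branch
flux <- flux_mu + rnorm(800, 0, ifelse(density < crit, 60, 250)) # extra noise when congested

fit  <- lm(flux ~ bs(density, df = 6))              # flexible, data-driven curve
grid <- data.frame(density = seq(min(density), max(density), length.out = 200))
plot(density, flux, col = "grey", xlab = "density", ylab = "flux")
lines(grid$density, predict(fit, newdata = grid), lwd = 2)
```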

Yong Chen, Ph.D. Statistical Inference of Copas Selection Model for Publication Bias Correction with Partial Identification Spring and Summer 2016
Background and method: Publication bias occurs when published research results are systematically unrepresentative of the population of studies that have been conducted, and it is a potential threat to meaningful meta-analysis. The Copas selection model provides a flexible framework for correcting estimates and offers considerable insight into publication bias. However, maximizing the observed likelihood under the Copas selection model is challenging because the observed data contain very little information on the latent variable. We propose a novel statistical inference procedure that accounts for the partial identification of the Copas model. In particular, we will provide an information bound for tests of publication bias in small or moderate sample size settings. (A small simulation of the Copas selection mechanism follows the timeline below.)

Tentative timeline: 2 weeks (literature review); 6 weeks (method development); 4 weeks (simulation studies); 4 weeks (data analysis and manuscript writing).
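A small simulation of the Copas-type selection mechanism, with hypothetical parameter values, showing how selective publication of small studies biases a naive fixed-effect meta-analytic mean (this is not the proposed inference procedure):

```r
set.seed(4)
K   <- 2000                          # candidate studies
mu  <- 0.2                           # true common effect
s   <- runif(K, 0.05, 0.5)           # within-study standard errors
eps <- rnorm(K)                      # standardized estimation errors
y   <- mu + s * eps                  # observed effect estimates

a <- -1; b <- 0.15; rho <- 0.8       # hypothetical selection parameters
delta <- rho * eps + sqrt(1 - rho^2) * rnorm(K)   # selection noise, correlated with eps
published <- (a + b / s + delta) > 0              # small studies published mainly if "favourable"

c(true = mu,
  all_studies    = weighted.mean(y, 1 / s^2),
  published_only = weighted.mean(y[published], 1 / s[published]^2))
```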

Phyllis Gimotty, Ph.D. Develop biomarkers and prognostic/predictive models for use in clinical settings Spring 2016
My research program focuses on methods to develop biomarkers and prognostic/predictive models for use in clinical settings. My research has two major themes: 1) the development of statistical methods for risk assessment (including diagnostic, prognostic, and predictive models) and biomarker evaluation, addressing statistical issues such as the comparison and validation of prognostic models; and 2) the development of statistical methods to evaluate new biomarkers (from the histopathological to the molecular genetic) for detection, diagnosis, prognosis, and treatment. Current applications for which we are investigating statistical issues include 1) assessing the impact of recent new therapies for melanoma patients on survival and time-conditional survival, 2) evaluating nontraditional biomarkers where the likelihood of the event is not a linear function of the biomarker, 3) investigating methodological approaches related to time-conditional survival, where survival analyses condition on the time survived after diagnosis and treatment, and 4) investigating statistical issues related to the analysis of outcomes for patients with rare diseases. Possible laboratory rotation activities may include comparison of alternative methodological approaches, statistical literature reviews, statistical simulation and/or bootstrap studies, and tasks associated with manuscripts and proposal development. Methods investigated will be applied to research projects of Abramson Cancer Center investigators and research projects of the Penn-CHOP Blood Center for Patient Care and Discovery.
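As a small illustration of the time-conditional survival idea mentioned in item 3, the sketch below (hypothetical data, not a melanoma dataset) computes a 5-year survival probability both overall and conditional on having already survived 2 years, using S(t | T > s) = S(t)/S(s) from a Kaplan-Meier fit:

```r
library(survival)
set.seed(5)
time_to_event <- rexp(400, rate = 0.15)          # hypothetical years to event
cens          <- runif(400, 0, 12)               # censoring times
obs    <- pmin(time_to_event, cens)
status <- as.integer(time_to_event <= cens)

km <- survfit(Surv(obs, status) ~ 1)
S  <- summary(km, times = c(2, 5))$surv          # S(2) and S(5)
c(five_year = S[2], five_year_given_two = S[2] / S[1])
```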
Wei-Ting Hwang, Ph.D. Statistical issues for Superfund research: Asbestos exposure, remediation and adverse health effects Spring 2016
Available research opportunities are projects within the University of Pennsylvania Superfund Research and Training (Penn SRP) Center. Penn SRP is a National Institute of Environmental Health Sciences (NIEHS)-funded program that addresses issues related to exposure, remediation, and adverse health effects of asbestos hazardous waste, a concern of a nearby community in Ambler, PA (~40 miles from Philadelphia), one of the largest asbestos Superfund sites in the country. The Center currently has two inter-related environmental science projects, one community-based epidemiological/sociological project, and three biomedical science projects that involve investigators across both the School of Medicine (PSOM) and the School of Arts and Sciences (SAS). The environmental projects focus on the remediation and mobility of asbestos particles in soil and water using greenhouse experiments and fluid dynamics models. The community-based study will conduct both cohort and case-control studies to identify occupational, environmental, social, and lifestyle factors that are associated with asbestos exposure.
Maureen Maguire, Ph.D. Approaches to Constructing a Primary Outcome Measure for a Clinical Trial When the Measurement Is Not Ideal Spring 2016
Designation of the primary outcome measure is a key component in the design of randomized clinical trials. In the DRy Eye Assessment and Management (DREAM) Study, an assessment of patient symptoms using the Ocular Surface Disease Index (OSDI) was chosen as the key measure. The psychometric properties of the questionnaire have been evaluated: there is considerable variation in scores on test-retest evaluations and on testing at two points in time in "stable" patients, and the change in score required for a patient to perceive change varies with the baseline value. However, there are no other instruments with better properties. The lab rotation will focus on the impact of 1) categorizing the change scores into improved or not relative to the mean change; and 2) using the mean of repeated testing for the baseline value or the follow-up value (average of 6- and 12-month scores). Different assumptions for the distribution of change scores conditional on the baseline value and for the emergence of a treatment effect over time will be investigated to identify the optimal approach to assessing a treatment effect on the OSDI score. The results should be applicable to other clinical trials in which the measurement properties of the outcome measure are not ideal.
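A hedged sketch of point 2 above (hypothetical numbers, not DREAM data): averaging two noisy follow-up scores before forming the change score reduces measurement error and tightens the estimated treatment effect.

```r
set.seed(6)
n     <- 200
trt   <- rbinom(n, 1, 0.5)
truth <- rnorm(n, 40, 10)                       # latent symptom severity
base  <- truth + rnorm(n, 0, 8)                 # single noisy baseline score
f6    <- truth - 5 * trt + rnorm(n, 0, 8)       # 6-month score (treatment lowers symptoms)
f12   <- truth - 5 * trt + rnorm(n, 0, 8)       # 12-month score

single <- lm(f12 - base ~ trt)                  # change using one follow-up score
avgd   <- lm((f6 + f12) / 2 - base ~ trt)       # change using the averaged follow-up
rbind(single   = summary(single)$coefficients["trt", 1:2],
      averaged = summary(avgd)$coefficients["trt", 1:2])   # smaller SE with averaging
```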
Hongzhe Li, Ph.D. Analysis of Very Large but Sparse Count data Spring 2016
In microbiome and metagenomic studies, the data are often summarized as a very large table of sparse counts. The counts may represent counts of k-mers, species, or marker genes. Such data are extremely sparse. We want to develop new statistical methods for analyzing such sparse count data, including methods from the text-mining literature such as latent Dirichlet allocation models. We are also interested in developing methods to link such large sparse count tables to clinical outcomes and to understand the dependency structures of such data. The methods will be applied to data from the Penn-CHOP microbiome program.
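A minimal sketch of the latent Dirichlet allocation view of such data (hypothetical dimensions): each sample's count vector over many taxa is a multinomial draw whose probabilities mix a few latent community profiles, which naturally produces an extremely sparse count table.

```r
set.seed(7)
rdirichlet <- function(alpha) { g <- rgamma(length(alpha), alpha); g / sum(g) }

n_taxa <- 2000; n_topics <- 5; n_samples <- 50
topics <- t(replicate(n_topics, rdirichlet(rep(0.02, n_taxa))))    # sparse taxon profiles
mix    <- t(replicate(n_samples, rdirichlet(rep(0.5, n_topics))))  # per-sample topic weights

counts <- t(sapply(1:n_samples, function(i)
  rmultinom(1, size = 5000, prob = drop(mix[i, ] %*% topics))))    # observed count table

mean(counts == 0)   # proportion of zero cells: extremely sparse, as in microbiome data
```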
Hongzhe Li, Ph.D. Integrative Analysis of GWAS and eQTL data Spring 2016
Suppose we have both genome-wide association (genetic variant) data and gene expression data on the same or different sets of samples. We aim to use genetic variation as a systematic perturbation of the biological system in order to identify the genes and pathways that are associated with disease onset and progression. Methods such as instrumental variable (IV) regression and mediation analysis can be applied to such data; however, we have to deal with the high dimensionality of the genomic data. The project involves developing sparse simultaneous signal identification methods to identify possible causal genes. The methods will be applied to the analysis of Penn heart failure genomic studies.
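A minimal sketch of the instrumental variable idea for a single gene (no high-dimensional selection, all values hypothetical): a SNP perturbs expression, expression affects the trait, and two-stage least squares removes the confounding that biases the naive regression.

```r
set.seed(8)
n    <- 1000
snp  <- rbinom(n, 2, 0.3)                      # genotype (0/1/2) used as the instrument
u    <- rnorm(n)                               # unmeasured confounder
expr <- 0.5 * snp + u + rnorm(n)               # gene expression
y    <- 0.8 * expr + u + rnorm(n)              # trait; true causal effect = 0.8

naive  <- coef(lm(y ~ expr))[["expr"]]         # biased by confounding
stage1 <- lm(expr ~ snp)                       # first stage: expression on genotype
tsls   <- coef(lm(y ~ fitted(stage1)))[[2]]    # two-stage least squares (SEs need correction)
c(naive = naive, tsls = tsls)
```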
Sarah Ratcliffe, Ph.D. Functional Data Analysis of Infant RFM Data Spring 2016
Description: For infants born needing resuscitation, respiratory function monitoring (RFM) has been shown to be a feasible tool for measuring the characteristics of assisted ventilations in the delivery room. However, analyzing the multichannel streams of data, and making them useful in a clinical setting, poses numerous statistical challenges. These include: 1) how to align/register streams from multiple infants, 2) the irregular timing of streams due to successful ventilation or death, and 3) the need to determine predictive characteristics of a successful resuscitation in real time. We are interested in developing new statistical methods to analyze these data and applying them to infant RFM recordings. Specific laboratory rotation activities may include statistical literature reviews, simulation studies, comparison of methods on a sample of data, and activities related to grant proposal development.
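As a tiny illustration of challenge 1, the sketch below (hypothetical signals, not RFM recordings) resamples two irregularly sampled pressure streams onto a common time grid, a typical preprocessing step before any alignment or functional analysis:

```r
set.seed(15)
t1 <- sort(runif(80, 0, 60));  p1 <- sin(t1 / 3) + rnorm(80, 0, 0.1)         # infant 1 stream
t2 <- sort(runif(95, 0, 60));  p2 <- sin((t2 - 2) / 3) + rnorm(95, 0, 0.1)   # infant 2, time-shifted

grid <- seq(0, 60, by = 0.5)                                   # common time grid
aligned <- cbind(infant1 = approx(t1, p1, xout = grid)$y,
                 infant2 = approx(t2, p2, xout = grid)$y)
cor(aligned[, 1], aligned[, 2], use = "complete.obs")          # similarity before registration
```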
Jason Roy, Ph.D. Bayesian methods for causal inference Spring 2016
This lab will focus on statistical methods for estimating causal effects from observational studies, with an emphasis on Bayesian methods. Students will be introduced to core concepts, such as potential outcomes, various definitions of causal effects, confounding, directed acyclic graphs, and typical causal assumptions. Experience in implementing some standard methods will be gained using either real or simulated data sets.
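A minimal (non-Bayesian) sketch of the potential-outcomes setup the rotation introduces, with hypothetical data: both potential outcomes are simulated, so the naive difference in means can be compared with a confounder-standardized estimate of the average treatment effect.

```r
set.seed(9)
n  <- 5000
L  <- rnorm(n)                               # confounder
A  <- rbinom(n, 1, plogis(L))                # treatment is more likely when L is high
Y1 <- 1 + L + rnorm(n)                       # potential outcome under treatment
Y0 <- 0 + L + rnorm(n)                       # potential outcome under control
Y  <- ifelse(A == 1, Y1, Y0)                 # observed outcome; true average effect = 1

naive <- mean(Y[A == 1]) - mean(Y[A == 0])   # biased by confounding through L
fit   <- lm(Y ~ A + L)
std   <- mean(predict(fit, data.frame(A = 1, L = L)) -
              predict(fit, data.frame(A = 0, L = L)))   # standardization (g-formula)
c(true = 1, naive = naive, standardized = std)
```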
Taki Shinohara, Ph.D. Frontiers in Biomedical Imaging Statistics Spring, Summer 2016
The Penn Statistical Imaging and Visualization Endeavor (PennSIVE) is a group of statisticians studying etiology and clinical practice through medical imaging. Our primary goals include developing robust and generalizable statistical methods for the analysis of multimodal biomedical imaging data, integrating complex medical imaging data and other biomarkers to study health, and building clinical tools for the assessment of disease diagnosis, progression, and prognosis through cross-sectional and longitudinal imaging studies. Our group specializes in neurological and psychiatric diseases. This laboratory rotation will begin with a literature review of the clinical and methodological aspects of a project agreed upon by the principal investigator and the trainee. The trainee will then lead the development of a novel statistical method, or the implementation of an existing method to answer new scientific questions.
Justine Shults, Ph.D. Simulating longer vectors of longitudinal binary data via the multinomial sampling method in R. Spring 2016
Longitudinal discrete data are more challenging to simulate than continuous measurements, due to the lack of an analogue to the multivariate normal distribution for discrete random variables. The multinomial sampling method is a general approach for simulating binary data, but it has only been implemented for vectors of length four, and for longer vectors only after first making restrictive assumptions such as first-order antedependence. I have developed a general algorithm that is easy to implement and have applied it in Stata. In this lab rotation we would do a literature review and implement this approach in R (preferably).
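For the simplest bivariate case, the multinomial sampling idea can be sketched as follows (hypothetical marginal means and correlation; the rotation concerns extending this to longer vectors): work out the four joint cell probabilities implied by the margins and correlation, then sample outcomes from them.

```r
p1 <- 0.3; p2 <- 0.6; r <- 0.2                                       # hypothetical margins and correlation
p11 <- p1 * p2 + r * sqrt(p1 * (1 - p1) * p2 * (1 - p2))             # P(Y1 = 1, Y2 = 1)
probs <- c("11" = p11, "10" = p1 - p11, "01" = p2 - p11, "00" = 1 - p1 - p2 + p11)

set.seed(10)
cells <- sample(names(probs), 10000, replace = TRUE, prob = probs)   # multinomial sampling of cells
y1 <- as.numeric(substr(cells, 1, 1))
y2 <- as.numeric(substr(cells, 2, 2))
c(mean1 = mean(y1), mean2 = mean(y2), corr = cor(y1, y2))            # recovers p1, p2, and r
```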
Sharon Xie, Ph.D. Evaluating and comparing rates of decline in dementia patients Spring 2016
Longitudinal studies are crucial for evaluating rates of decline in patients with Alzheimer's disease and other aging-related disorders. We aim to explore and compare several statistical approaches for identifying and comparing rates of decline in dementia patients.
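One standard approach that such a comparison might include (a hedged sketch with hypothetical data, using the lme4 package; not necessarily among the methods this rotation will compare) is a linear mixed-effects model with random slopes, in which the time-by-group interaction is the difference in rates of decline:

```r
library(lme4)
set.seed(11)
n_subj <- 120
id    <- rep(1:n_subj, each = 5)
time  <- rep(0:4, n_subj)                           # annual visits
group <- rep(rbinom(n_subj, 1, 0.5), each = 5)      # e.g., two dementia subtypes
slope <- rep(rnorm(n_subj, -1, 0.5), each = 5)      # subject-specific decline rate
score <- 28 + (slope - 0.8 * group) * time + rnorm(length(id), 0, 1)

fit <- lmer(score ~ time * group + (time | id),
            data = data.frame(score, time, group, id = factor(id)))
fixef(fit)["time:group"]   # estimated extra decline per year in group 1 (truth: -0.8)
```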
Rui Feng, Ph.D. A Variable Selection Model with an Application to a Sleep Study Summer 2016
About 30% of US adults sleep less than necessary (7-8 hours) per night. Sleep deprivation (SD) is a known risk factor for obesity, diabetes, cardiovascular diseases, and depression. Cumulative SD also increases sleep propensity, destabilizes the wake state, and impairs psychomotor and cognitive performance. Specific neurocognitive domains, including cognitive speed, attention, memory, and executive functions, are all affected by SD. Interestingly, we observed considerable and replicable individual differences in the vulnerability of sleepiness and cognitive performance to SD: some healthy adults had substantial cognitive deficits while others showed little change. However, it is not known why some individuals are more vulnerable to SD than others. To answer this question, we have collected multimodal neuroimaging data (fMRI) and comprehensive behavioral testing data (100 variables) in a sample of 54 healthy subjects after normal sleep, following one night of total SD, and after two nights of recovery sleep. Additional neuroimaging and behavioral testing data were also acquired from 17 control subjects without SD. The overall aims of the project are to (1) understand the neural mechanisms by which SD affects behavioral performance and (2) investigate how the effects of SD vary across individuals of differential vulnerability. In the first rotation project, we will develop and apply a variable selection method to identify the brain regions correlated with vigilance attention (VA) at baseline and after sleep deprivation. Dr. Feng will be available in summer 2016 to guide the project.
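A hedged sketch of a variable-selection analysis of this kind, using the lasso via the glmnet package on simulated data (hypothetical dimensions, not the actual fMRI study; the rotation's method may differ):

```r
library(glmnet)
set.seed(12)
n <- 54; p <- 200                               # subjects and candidate brain regions (hypothetical)
x <- matrix(rnorm(n * p), n, p)                 # region-level imaging summaries
beta <- c(1, -0.8, 0.6, rep(0, p - 3))          # only 3 regions truly related to the outcome
va <- drop(x %*% beta) + rnorm(n)               # vigilance attention score

cvfit <- cv.glmnet(x, va, alpha = 1)            # lasso with cross-validated penalty
b <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]
which(b != 0)                                   # indices of the selected regions
```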

Previous Research/Lab Rotation Matches

Mentor Title/Description Student Semester
Jinbo Chen, Ph.D. Risk estimation with correlated data from a mixture population with known mixture proportion Lingjiao Zhang Fall 2015
Yong Chen, Ph.D. Test of association with misclassified binary outcomes using EHR Data Rui Duan Fall 2015
Over the past few decades, a dramatic increase in the incidence of obesity has become a worldwide health issue; obesity is a significant risk factor for many diseases. Many individuals participate in web-based weight loss programs in which their weights, physical activities, and diets are self-reported. The data generated by such web-based programs pose new challenges to statistical modeling and inference, including subject-specific self-reporting times and outcome-dependent missingness. These challenges are known as biased sampling problems in the statistical literature and can lead to substantial bias in inference. This rotation will focus on a novel framework that deals with some of the existing challenges and will also describe the new challenges in the context of “big data”.
Rebecca Hubbard, Ph.D. Statistical Approaches to Estimating Overdiagnosis in Breast Cancer Screening Carrie Caswell Fall 2015
Screening for breast cancer has been proven to reduce morbidity and mortality by detecting disease at an earlier and more treatable stage. However, ongoing evaluation of breast cancer screening using observational data is needed in order to characterize harms and benefits as screening and treatment patterns evolve. Evaluating cancer screening using observational data gives rise to a number of methodological challenges including confounding, selection bias, missing data, and informative censoring. This project aims to develop, evaluate, and apply alternative statistical methods for evaluating the performance of breast cancer screening modalities. We will use data on breast cancer screening to identify common patterns of missingness and screening drop-out as well as biases associated with these patterns. Statistical simulation studies will then be used to characterize the performance of alternative estimators for harms, including missed cancers and unnecessary diagnostic evaluation, and benefits, including screen-detected cancers. The best performing statistical methods will be applied to a study of screening mammography harms and benefits.
Pam Shaw, Ph.D. Analysis of Failure Time Data with Outcome Measurement Error Eric Oh Fall 2015
This project evaluates the impact of measurement error in outcomes on the bias of regression coefficients, specifically for failure time data. Cox and Weibull regression models are considered to determine whether either of the two is more vulnerable to bias from error in the time outcomes, followed by exploration of various statistical fixes to the problem.
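A minimal version of this comparison might look like the sketch below (hypothetical multiplicative error in the recorded event times, no censoring): fit Cox and Weibull models to the true and error-prone times and compare the estimated log hazard ratios.

```r
library(survival)
set.seed(13)
n <- 2000
x <- rbinom(n, 1, 0.5)
t_true <- rexp(n, rate = 0.1 * exp(0.5 * x))      # true event times; true log hazard ratio = 0.5
t_err  <- t_true * exp(rnorm(n, 0, 0.5))          # multiplicative error in the recorded times
status <- rep(1, n)                               # no censoring, to keep the sketch simple

cox <- c(true  = coef(coxph(Surv(t_true, status) ~ x))[["x"]],
         error = coef(coxph(Surv(t_err,  status) ~ x))[["x"]])
wb_t <- survreg(Surv(t_true, status) ~ x, dist = "weibull")
wb_e <- survreg(Surv(t_err,  status) ~ x, dist = "weibull")
weibull <- c(true  = -coef(wb_t)[["x"]] / wb_t$scale,   # convert AFT coefficients to log hazard ratios
             error = -coef(wb_e)[["x"]] / wb_e$scale)
rbind(cox, weibull)
```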
Taki Shinohara, Ph.D. PREVAIL: Predicting Recovery through Estimation and Visualization of Active and Incident Lesions Jordan Dworkin Fall 2015
The Penn Statistical Imaging and Visualization Endeavor (PennSIVE) is a group of statisticians studying etiology and clinical practice through medical imaging. Our primary goals include developing robust and generalizable statistical methods for the analysis of multimodal biomedical imaging data, integrating complex medical imaging data and other biomarkers to study health, and building clinical tools for the assessment of disease diagnosis, progression, and prognosis through cross-sectional and longitudinal imaging studies. Our group specializes in neurological and psychiatric diseases. This laboratory rotation will begin with a literature review of the clinical and methodological aspects of a project agreed upon by the principal investigator and the trainee. The trainee will then lead the development of a novel statistical method, or the implementation of an existing method to answer new scientific questions.
Haochang Shou, Ph.D. Depicting Activity Profiles via Multilevel Functional Principal Component Analysis: Association and Prediction Jiarui Lu Fall 2015
As part of the Penn Statistical Imaging and Visualization Endeavor (PennSIVE), my research interests focus on multilevel functional data analysis. The lab rotation will start with a literature review of fundamentals and recent advances in functional data analysis, with applications in neuroimaging and physical activity data. The student will then have the opportunity to practice analytical tools for imaging and activity data using R to identify disease-specific activity patterns and regions of interest in the context of mental disorders or chronic renal disease.
Andrea Troxel, Sc.D. An evaluation of treatment effect in opt-in versus opt-out consent frameworks under a mixture of patient motivation levels Alessandra Valcarcel Fall 2015
I have several ideas related to clinical trials methods motivated by research in behavioral economics. The first relates to the concept of evidence-based evolutionary testing (EBET), an extension of the standard randomized clinical trial (RCT) in which multiple versions of an intervention are tested sequentially against a control group. Issues that arise include defining the relevant hypotheses (which versions to test against control and each other), determination of the optimal randomization ratio, and definition of appropriate Type I error rates. The second idea relates to statistical issues in opt-in versus opt-out consent; these include generalizability of trial results, changes in accrual rates, differences in effect size, and the combined impact these might have on power. The third idea concerns characterizing longitudinal medication adherence patterns using daily pill-taking behavior, and the possibility of defining adherence “personalities” and classifying individuals using these categories.
Wei (Peter) Yang, Ph.D. ECG Measures Predict Cardiovascular and Non-cardiovascular Death in Patients with CKD Wenli Sun Fall 2015
Time-dependent confounding is an issue that must be dealt with when estimating the effect of a time-varying treatment on an outcome. It occurs when time-varying covariates confound the treatment-outcome association while also being affected by prior treatment. Standard techniques such as regression cannot handle time-dependent confounding appropriately. The marginal structural model (MSM) is a model for the joint effects of a time-varying treatment and can handle time-dependent confounding appropriately (a minimal weighting sketch follows the examples below).

The project I propose is to identify the effects of high blood pressure on the risk of renal disease in patients with chronic kidney disease using data from the Chronic Renal Insufficiency Cohort (CRIC) study. CRIC is a multi-center prospective cohort study started in 2003. It enrolled about 4000 participants from 2003-2008, who are then followed up once a year. The objective of the study is to identify risk factors for renal disease progression and cardiovascular diseases. The outcomes collected include renal function, e.g., estimated glomerular filtration rate (eGFR), measured annually, and clinical endpoints including end-stage renal disease (ESRD) and death. One challenge of using ESRD as the single outcome is the competing risk due to death. Another challenge is that the incidence rate of ESRD is low, so the study may not have enough power to detect significant associations.

We propose to use all the information related to renal function, including the repeated eGFR measures, ESRD, and death. To do so, participants can be classified into a few stages based on their eGFR values according to clinical guidelines. For example, we can classify participants into early chronic kidney disease (CKD), intermediate CKD, and advanced CKD stages, in addition to the two clinical end stages: ESRD and death. The goal of this project is to identify the effect of blood pressure on the transitions among these stages during follow-up. Preliminary results have shown that blood pressure may have different effects on the transition probabilities depending on the current stage a participant is in. A natural follow-up question is to identify a blood pressure control target, potentially a function of the current stage and other patient characteristics, that minimizes patients' risk of ESRD, death, or both. We will also consider other statistical methods, including structural nested models, to tackle the problem.

CRIC has very rich data and unique statistical challenges. In addition to the typical information collected from the case report forms (CRFs), CRIC also has very rich biomarker data assayed from blood and urine samples, functional data (e.g., ECG data), and genetic data. There are many interesting statistical problems the student can work on. Here are a few additional examples:

1. Disease risk prediction for competing survival outcomes
2. Joint modeling of longitudinal and survival outcomes
3. Models for longitudinal functional data (e.g., ECG)
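A minimal sketch of the marginal structural model idea discussed above, for a hypothetical two-visit setting with a time-dependent confounder (not CRIC data): stabilized inverse-probability-of-treatment weights are estimated and used in a weighted regression.

```r
set.seed(14)
n  <- 10000
L0 <- rnorm(n)
A0 <- rbinom(n, 1, plogis(0.6 * L0))             # treatment at visit 0
L1 <- 0.5 * A0 + 0.5 * L0 + rnorm(n)             # confounder affected by prior treatment
A1 <- rbinom(n, 1, plogis(0.6 * L1))             # treatment at visit 1
Y  <- A0 + A1 + 0.7 * L0 + 0.7 * L1 + rnorm(n)   # outcome; true marginal effects: A0 ~ 1.35, A1 ~ 1

# stabilized weights: P(A | past treatment) / P(A | past treatment and confounders)
d0 <- fitted(glm(A0 ~ L0, family = binomial));      n0 <- fitted(glm(A0 ~ 1, family = binomial))
d1 <- fitted(glm(A1 ~ L1 + A0, family = binomial)); n1 <- fitted(glm(A1 ~ A0, family = binomial))
sw <- (ifelse(A0 == 1, n0, 1 - n0) / ifelse(A0 == 1, d0, 1 - d0)) *
      (ifelse(A1 == 1, n1, 1 - n1) / ifelse(A1 == 1, d1, 1 - d1))

coef(lm(Y ~ A0 + A1))                 # standard regression: confounded
coef(lm(Y ~ A0 + A1, weights = sw))   # weighted MSM fit: near the marginal effects (robust SEs needed)
```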