# Thesis work in statistics supervised by Mette Langaas

I'm a statistician with special interest in *analysing data from biostatistics and genomics* but I'm also involved in other application areas (like insurance, and multivariate data from sensors). Teaching and learning in STEM (science, technology, engineering and mathematics) is also very close to my heart.

The statistical methods I mainly use (and develop) are within *statistical inference and learning*, in particular, methods for regression and classification (mainly generalized linear and mixed models (GLM+LMM) and regularization), and methods for calculating valid p-values and correcting for multiple testing. I'm also interested in understanding black-box models, using methods within explainable AI.

I am an active user of the statistical language R, and all of my project thesis suggestions will have a strong R component. However, using Python is of course possible (but then I will not be of much help with detailed coding).

Do you want to read more about me and my research? See: Contact information

For the 2019-2020 study year I'm located at the University of Oslo, Department of Mathematics (sabbatical), but I can be reached by email and will drop by Trondheim from time to time, so it is possible to meet face-to-face. I am back in my office in Sentralbygg 2, room 1236, from August 2020.

Below you find specific suggestions for thesis projects, and below that I list courses that go well with my projects. If you have other ideas for a thesis project within my fields of research (as listed above) you may contact me so we can discuss thesis supervision.

**For all projects:** we meet approximately weekly, and discuss reading material, progress, R coding, and data analyses. Writing should be done in LaTeX (Overleaf) and in English, and R code should preferably be maintained with RStudio in combination with version control using GitHub or Bitbucket.

## Bachelor thesis project suggestions for the 2020-2021 study year

These suggestions are for Bachelor level (year 3), aka course MA2002.

#### Explainable AI

- What is a black-box model, and what does it mean that an explainable AI (XAI) method is model-agnostic?
- We look at one black-box model, possibly xgboost or a feed-forward neural network (or just pretend and use regression).
- We choose 1-2 popular methods for XAI. These could include partial dependency plots, LIME, some version of Shapley values.
- We find a data set, fit our black-box model and test out our XAI method - using R (and available packages in R). This involves making nice plots, possibly using ggplot in R.
- Finally, we sum up what we have learned and what not.
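To make "model-agnostic" concrete, here is a minimal sketch of a one-dimensional partial dependence curve. The project itself is meant to be done in R; Python with only the standard library is used here just to keep the sketch self-contained. The `black_box` function and the toy data are hypothetical stand-ins for a fitted model and a real data set.

```python
from statistics import mean


def black_box(x):
    """Hypothetical stand-in for a fitted black-box model (e.g. xgboost)."""
    x1, x2 = x
    return 2.0 * x1 + x1 * x2


def partial_dependence(predict, data, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`.

    The function only calls `predict` and never looks inside the model,
    which is what makes the method model-agnostic.
    """
    curve = []
    for value in grid:
        predictions = []
        for row in data:
            modified = list(row)
            modified[feature] = value  # overwrite the feature of interest
            predictions.append(predict(modified))
        curve.append(mean(predictions))
    return curve


# Toy data set with two features (made up for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0)]
grid = [0.0, 1.0, 2.0]
pd_curve = partial_dependence(black_box, data, feature=0, grid=grid)
print(pd_curve)
```

Because the second feature has mean 2 in the toy data, the curve is linear in the first feature with slope 4. In the thesis the same idea would be applied through existing R packages rather than hand-written code.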

Good to have before or while writing the thesis: TMA4268 Statistical learning.

It would be nice if the text and R code could be kept in a (possibly private) GitHub repository through RStudio (it is always good to know version control). Hello-world GitHub

Possible literature to start with: Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.

This project is useful if you plan to later write a Master's thesis within statistical learning or deep learning.

#### Relationship inference in forensic sciences

In this project we will work to understand

- what is the genetic kinship of two or more DNA samples (learning about genetic concepts)
- how can this be calculated using genetic data and statistical methods (learning about statistical methods)
- and how is this done in R with the familias package (performing simple analyses in practice)

Classical examples where relationship inference is used include paternity cases, family reunions and complex identification cases (maybe from a crime scene).
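As a small illustration of how a likelihood enters a paternity case, the sketch below computes a likelihood ratio for one genetic marker. It compares "the alleged father is the biological father" against "the father is a random, unrelated man". This is not the Familias package, and it is written in Python only to be self-contained. It assumes Hardy-Weinberg equilibrium and no mutations, and the genotypes and allele frequencies are made up.

```python
def transmission(genotype):
    """Mendelian probability of passing on each allele of a genotype."""
    probs = {}
    for allele in genotype:
        probs[allele] = probs.get(allele, 0.0) + 0.5
    return probs


def child_probability(child, maternal, paternal):
    """P(child genotype) given distributions of maternal and paternal alleles."""
    c1, c2 = child
    prob = maternal.get(c1, 0.0) * paternal.get(c2, 0.0)
    if c1 != c2:  # the two alleles may also have arrived the other way round
        prob += maternal.get(c2, 0.0) * paternal.get(c1, 0.0)
    return prob


def paternity_lr(child, mother, alleged_father, frequencies):
    """Likelihood ratio: alleged father versus a random unrelated man."""
    maternal = transmission(mother)
    numerator = child_probability(child, maternal, transmission(alleged_father))
    denominator = child_probability(child, maternal, frequencies)
    return numerator / denominator


# Hypothetical marker with three alleles and made-up population frequencies.
frequencies = {"a": 0.5, "b": 0.1, "c": 0.4}
lr = paternity_lr(child=("a", "b"), mother=("a", "a"),
                  alleged_father=("b", "c"), frequencies=frequencies)
print(lr)
```

Here the child must have received allele `b` from the father. The alleged father passes it on with probability 0.5, while a random man does so with probability 0.1, so the data are five times more likely under paternity. With several independent markers the per-marker ratios are multiplied.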

The project will be based on selected chapters of Relationship inference with Familias and R, by Thore Egeland, Daniel Kling and Petter Mostad, and use of the Familias R package available from CRAN. Webpage for book and resources: http://familias.name.

The only required background is the courses ST1101 and ST1201; a good understanding of the concept of a likelihood is important.

This project is useful if you plan to later write a Master's thesis within biostatistics or statistical genomics.

## Thesis project suggestions for the 2020-2021 study year

These suggestions are for theses at the master level (year 4/5 of study: TMA4500, TMA4900, MA3911 and master thesis for MLREAL in statistics).

Project A is the most theoretical, followed by D and B, while C and E are the most applied.

#### Project A: Calculating p-values with the score test in binary regression with application to genome-wide association studies

*Background:* We start by considering a generalized linear model with two types of covariates: those present under the null hypothesis and those additionally potentially present under the alternative hypothesis. We then derive the score test statistic for testing hypotheses about single or multiple parameters (related to one, some or all covariates additionally present under the alternative hypothesis). Under the null hypothesis the score test statistic has a first-order asymptotic chi-square approximation, but this requires large and balanced sample sizes (for binary data) and that the test statistic is not very large (far out in the tail). When these conditions fail, more advanced asymptotic methods are needed, and that is the topic of this project.
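The simplest instance of this setup can be sketched as follows: a logistic regression where only the intercept is present under the null hypothesis and one covariate (for example a SNP coded 0/1/2) is added under the alternative. The sketch is in Python with only the standard library, the project itself would use R, and the data are made up. It uses the first-order chi-square approximation with one degree of freedom, which is exactly the approximation this project aims to improve on.

```python
import math


def score_test_logistic(x, y):
    """Score test of H0: beta = 0 in logit P(Y=1) = alpha + beta * x.

    Under H0 the fitted probability is the sample mean of y, so no
    iterative model fitting is needed.
    """
    n = len(y)
    p_hat = sum(y) / n
    x_bar = sum(x) / n
    # Score for beta evaluated at the null estimate.
    score = sum(xi * (yi - p_hat) for xi, yi in zip(x, y))
    # Information for beta, adjusted for the estimated intercept.
    info = p_hat * (1 - p_hat) * sum((xi - x_bar) ** 2 for xi in x)
    statistic = score ** 2 / info
    # Upper tail of chi-square with 1 df: P(Z^2 > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up toy data: genotype (0/1/2) and binary disease status.
x = [0, 0, 1, 1, 2, 2, 0, 1]
y = [0, 0, 0, 1, 1, 1, 0, 1]
statistic, p_value = score_test_logistic(x, y)
print(statistic, p_value)
```

With only eight observations the chi-square p-value should not be trusted. The example only shows how the statistic is formed.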

*Method:* We study improved score tests and second-order methods (Bartlett and saddlepoint approximations) that correct the chi-square approximation when calculating p-values in the far tails of the (exact) distribution of the score test statistic. The methods will be tested on simulated data motivated by genome-wide association studies.

*Useful courses:* TMA4295 Statistical inference, TMA4315 Generalized linear models (can be taken in the autumn of 2020)

*Remark:* This project has main focus on statistical theory, and implementation and use of the theory in R.

#### Project B: Statistical analysis of interaction effects between environmental and genetic factors

*Background:* Decline in daily physical activity (PA) is thought to be a key contributor to the global cardiovascular disease (CVD) epidemic. However, the impact of sedentariness on CVD may in part be determined by a person’s genetic constitution. Hence, PA may modify the genetic effects that give rise to increased risk of CVD. To identify CVD risk variants whose effects are modified by PA, we will perform interaction analyses between genetic markers (in the form of single-nucleotide polymorphisms (SNPs)) previously associated with CVD and PA in 70,000 individuals from the HUNT study. It has been hypothesized that SNPs showing the strongest main effect associations with the outcome variable in genome-wide association studies (GWAS) may be the least sensitive to environmental and lifestyle influences, and may therefore not be the best candidates for interactions (ref). Our hypothesis is therefore that the SNPs previously found to have a modest association with CVD are influenced by PA, and that PA contributes to a significant reduction in the genetic risk of CVD. Our secondary hypothesis is that sex-specific SNP×PA interactions exist in CVD.

*Methods:* We will study different ways to code the interaction effects in a joint statistical model for all data. Statistical hypothesis testing will be performed using score tests, and also correcting for multiple testing. To evaluate the interaction coding of the statistical models we will also analyse the data using decision trees and stratified analyses on PA-levels and sex.
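As an illustration only, one common way to code a SNP×PA interaction for a binary outcome is a logistic regression with a product term. The notation is an assumption made here, not taken from the project: $g_i$ is the SNP genotype of individual $i$ (for example coded 0, 1, 2), $e_i$ is the PA level, and $Y_i$ is CVD status.

```latex
\operatorname{logit} P(Y_i = 1) = \beta_0 + \beta_G\, g_i + \beta_E\, e_i + \beta_{GE}\, g_i e_i,
\qquad H_0\colon \beta_{GE} = 0
```

Other codings (for example treating genotype or PA as categorical) change the number of interaction parameters and hence the degrees of freedom of the test.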

*Useful courses:* TMA4295 Statistical Inference, TMA4315 Generalized linear models (can be taken in the autumn of 2020), TMA4275 Lifetime analysis (spring 2020)

*Co-supervisor:* Senior researcher Anja Bye, Department of Circulation and Medical Imaging, Faculty of Medicine and Health Sciences

#### Project C: Risk prediction with statistical learning

*Background:* Although cardiovascular disease (CVD) mortality rates are currently declining in most European countries (ref), non-fatal CVD incidence is increasing, especially among females and younger individuals. However, CVD can be prevented, delayed or even well controlled through a number of lifestyle changes and pharmaceuticals when it is diagnosed at an early stage of the atherosclerotic process. Hence, there is a need for more precise and early identification of high-risk individuals to reduce the burden of CVD, allowing more effective intervention and thus more disease-free years (ref).
Risk prediction models for CVD used in clinical practice are quite simple and their performance remains a matter of concern. Recently, machine learning methods have been introduced and applied in various healthcare applications, including CVD risk prediction (ref). Compared to traditional risk prediction scores, machine learning techniques exploit the majority of the available data, building much more complex models that include more variables than only the typical CVD risk factors.

*Aim:* Explore the potential of using classification methods from statistical learning for CVD risk assessment of healthy adults.

*Methods:* Data from 50,000 individuals in HUNT will be used. Different statistical and machine learning classifiers (classification trees, random forest, boosting, and possibly deep learning) will be trained and evaluated against 10-year CVD incidence, in comparison with the NORRISK2 risk prediction score. The aim is to achieve the best overall performance. However, there is also a need to explain the prediction method, possibly by using available methods from explainable AI.

*More about NORRISK2:* Video by Wasim Zahid and NORRISK2 calculator.

*Useful courses:* TMA4268 Statistical learning, TMA4315 Generalized linear models (can be taken in the autumn of 2020), TMA4275 Lifetime analysis (spring 2020)

*Co-supervisor:* Senior researcher Anja Bye, Department of Circulation and Medical Imaging, Faculty of Medicine and Health Sciences

#### Project D: Expression quantitative trait loci analysis for inflammatory bowel disease data

*Background:* The human genome is encoded in the form of DNA. The genome of all humans is largely the same, with most of the individual variation in the form of single nucleotide polymorphisms (SNPs). A SNP is a specific position in the genome where variation in a single base pair (nucleotide) is seen. A gene is a sequence of nucleotides in the DNA that encodes the recipe for a gene product, usually a protein. Gene expression refers to the process, consisting of several steps, where the information in the gene is used to build the functional gene product. In molecular biology the gene expression level is often measured to evaluate the biological activity of a study object.

The aim of expression quantitative trait loci analysis is to detect positions in the genome (SNPs) that may contribute to explaining the expression level of a gene, thereby identifying hereditary causes of differences in activity between objects. Given data on one SNP and the gene expression of one specific gene, regression methods may be used to detect eQTLs by using the measured gene expression as response and the SNP as covariate. If the SNP significantly explains the expression level of the gene, the gene-SNP pair is referred to as an eQTL (expression quantitative trait locus). Identification of such eQTLs is of interest in order to understand diseases in humans. The focus of this project is to detect eQTLs in inflammatory bowel disease (IBD), a term describing multiple chronic bowel diseases, of which the two most common are Crohn’s disease and ulcerative colitis.

Different laboratory methods exist to measure gene expression, and in this project we will look at data from next generation sequencing for counting the abundance of RNA molecules, called RNA-seq. Statistical analysis of RNA-seq data has led to the suggestion that a negative binomial distribution may be used to model the variability of the data, as implemented in the DESeq2 method, but other competing assumptions and methods also exist. When relating RNA-seq data from one gene to data from one SNP in a regression model, the interesting hypothesis to test is whether the regression coefficient for the SNP is different from zero. If the null hypothesis of a zero coefficient is rejected, an eQTL is detected.
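In symbols, a simplified version of such a model for one gene and one SNP could be written as below. The notation is an assumption made here for illustration: $K_i$ is the read count for individual $i$, $x_i$ the SNP genotype, $\mu_i$ the mean and $\alpha$ the dispersion. Normalization (such as the size factors used by DESeq2) and additional covariates are left out.

```latex
K_i \sim \mathrm{NB}(\mu_i, \alpha), \qquad
\operatorname{Var}(K_i) = \mu_i + \alpha \mu_i^2, \qquad
\log \mu_i = \beta_0 + \beta_1 x_i, \qquad
H_0\colon \beta_1 = 0
```

The term $\alpha \mu_i^2$ is what allows the variance to exceed the Poisson variance $\mu_i$.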

The eQTL analysis may aim to test every possible pair of SNP and gene expression, or one specific gene against a set of SNPs (we will focus on the latter). In both cases this leads to performing many statistical hypothesis tests. The false discovery rate (FDR) is a multiple testing concept generalizing the type I error (rejecting a true null hypothesis) when many hypotheses are tested, and is defined as the expected proportion of type I errors among the rejected hypotheses. In eQTL analysis it is common to control the FDR at some chosen level (often 5%), often using the method of Benjamini and Hochberg.
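The Benjamini-Hochberg step-up procedure is short enough to sketch directly. The version below is in Python with only the standard library and uses made-up p-values; in R the same adjustment is available through `p.adjust(p, method = "BH")`.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure.

    Sort the m p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * alpha / m:
            k_max = rank  # keep the largest rank that passes
    return sorted(order[:k_max])


# Made-up p-values, e.g. one gene tested against five SNPs.
pvalues = [0.01, 0.039, 0.035, 0.005, 0.5]
rejected = benjamini_hochberg(pvalues, alpha=0.05)
print(rejected)
```

Note the step-up behaviour: the p-value 0.035 exceeds its own threshold (3 × 0.05 / 5 = 0.03), but it is still rejected because the fourth smallest p-value, 0.039, is below its threshold of 0.04.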

*Methods:* The project will start by getting to understand the terms SNP, RNA-seq and eQTL, and then move on to gain a statistical understanding of DESeq2 and competing methods for analysing RNA-seq data using the SNP status as the covariate of interest. Other covariates (disease status, age, sex, clinical variables) may also enter the regression model. Extensions for handling multiple measurements from each IBD patient may be studied, as well as methods for evaluating asymptotically calculated p-values. The project will involve analysing data in R and building an R Shiny app to be used by medical researchers.

*Literature to start with:* Modern statistics for modern biology by Susan Holmes and Wolfgang Huber (Introduction and chapters 6 and 8), and the master's thesis of Kristine Lund Mathisen: Identifying Expression Quantitative Trait Loci in Patients with Inflammatory Bowel Disease

*Useful courses:* TMA4315 Generalized linear models (can be taken in the autumn of 2020), TMA4275 Lifetime analysis (spring 2020), MOL8008 Bioinformatics Methods for next Generation Sequencing Analysis

*Co-supervisor:* Researcher Atle van Beelen Granlund, Department of Clinical and Molecular Medicine.

#### Project E: Pricing of non-life insurance (at If)

*Background:* Insurance prices are often modelled with generalized linear models (GLMs), and this has become an industry standard. One models the claim payments, and the price then equals the claim cost plus a loading for other costs and profit. The standard procedure is to first model claim frequency with a Poisson model, and then claim size with a Gamma model. The product of these two models gives a model for the claim cost.
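The standard frequency-severity procedure can be sketched as below, in Python with only the standard library. All coefficients are made up for illustration and are not taken from If or from any fitted model. Both GLMs are assumed to use a log link, and claim frequency and claim size are assumed independent, so that the expected claim cost is the product of the two means.

```python
import math

# Hypothetical coefficients on the log scale (not from any real model).
frequency_coef = {"intercept": math.log(0.10), "young_driver": math.log(1.5)}
severity_coef = {"intercept": math.log(20000.0), "young_driver": math.log(1.2)}


def glm_mean(coef, covariates):
    """Mean of a log-link GLM: exp(intercept + sum of coefficient * covariate)."""
    eta = coef["intercept"] + sum(coef[name] * value
                                  for name, value in covariates.items())
    return math.exp(eta)


def expected_claim_cost(covariates):
    """Expected claim cost = expected claim frequency * expected claim size."""
    frequency = glm_mean(frequency_coef, covariates)  # Poisson mean
    severity = glm_mean(severity_coef, covariates)    # Gamma mean
    return frequency * severity


base = expected_claim_cost({"young_driver": 0})
young = expected_claim_cost({"young_driver": 1})
print(base, young)
```

With these made-up numbers the base customer has an expected claim cost of about 2000 (0.10 claims per year times 20000 per claim) and the young driver about 3600. With log links the two effects multiply, here 1.5 × 1.2 = 1.8.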

At If we use a slightly different approach that is not entirely ‘by the book’, but that works well in practice. We have a lot of information about the insured object and about the customer, and work is continuously being done to improve the data.

The focus of this thesis will be on extensions of, and competing methods to, the GLMs used today, and there will be many exciting things to look into.

*Method:* One can start by getting to know the method that If uses today, describe its advantages and disadvantages/weaknesses, and understand why it works. One can then move on to test alternative methods, or find improvements to the current method. Examples include regularization of GLMs, or machine learning techniques.

*Recommended background:* TMA4268 Statistical learning, TMA4300 Computer intensive statistical methods, TMA4315 Generalized linear models.

It is desirable/possible to sit at If in Oslo (at Vækerø) during spring 2021. There will be an opportunity for a summer job ahead of the thesis (summer 2020).

*Supervisor at If:* Anne Randi Syversveen. She can be contacted at anne [dot] randi [dot] syversveen [at] if [dot] no, phone 99774986, if you would like to hear more about the project and its practical arrangements.

## Supporting courses for the thesis projects

**Spring semester:**

**Statistics:**

- KLMED8008 Analysis of repeated measurements. Offered by the MH faculty, taught only during a short period, and gives 5 credits (STP).
- MA8701 General statistical methods: statistical learning at PhD level, next given in spring 2021.

**Medicine/data/mathematics:**

- TDT4300 Data warehousing and data mining. Note that this requires that you first take TDT4120 Algorithms and data structures, then TDT4100 Object-oriented programming, and then TDT4145 Data modelling and database systems.

**Autumn semester:**

**Statistics:**

**Medicine/data/mathematics:**

- TDT4173 Machine learning and case-based reasoning - some overlap with TMA4268 in topics, but not in STP.
- TDT4287 Algorithms for bioinformatics (strong emphasis on programming with Python, and you need to be skilled in algorithms and data structures).
- TDT4117 Information retrieval (see recommended background)

Also read about the possibility to do an additional profile (tilleggsprofil) within Health by choosing appropriate courses: https://innsida.ntnu.no/wiki/-/wiki/Norsk/Tilleggsprofiler+IKT-studier+IME

## Former master students

(IndMat): MTFYMA (Master of Technology, Physics and Mathematics) with Industrial Mathematics from year 3.

(LUR): MLREAL

(MathS): MSMNFMA (International Master in Mathematical Sciences) with focus on Statistics.

- June 2019: Kristine Lund Mathisen (IndMat), Identifying Expression Quantitative Trait Loci in Patients with Inflammatory Bowel Disease. Co-supervised by Atle van Beelen Granlund, MH, NTNU.
- June 2019: Amirhossein Kazami (IndMat), A Semi-Supervised Approach to the Application of Sensor-based Change-Point Detection for Failure Prediction in Industrial Instruments. Co-supervised by Martin Høy at DNV GL.
- June 2019: Elisabeth Hetlelid (IndMat), Modelling biomarker development during pregnancy for women with rheumatic diseases using linear mixed models and bootstrapping. Co-supervised by Mona Høysæter Fenstad, MH, NTNU.
- June 2019: Dag Johnsrud Kristiansen (IndMat), Detecting Neuronal Activity with Lasso Penalized Logistic Regression. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2018: Kristin B. Bakka (IndMat), Changepoint model selection in Gaussian data by maximization of approximate Bayes Factors with the Pruned Exact Linear Time algorithm.
- June 2018: Pål Vegard Johnsen (IndMat), Stochastic modelling and analysis of neural tumour evolution. Co-supervisor: Thea Bjørnland, IMF, NTNU.
- March 2018: Kristian Aga (IndMat), Modelling Neuronal Activity with Jittered Generalised Linear Models. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2017: Haris Fawad (IndMat), Modelling Neuronal Activity using Lasso Regularized Logistic Regression. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2017: Marie Klevjar (LUR), Statistical modelling and analysis of energy expenditure among severely overweight. Co-supervisor: Ingrid Løvold Mostad, MH, NTNU and St. Olavs Hospital.
- June 2017: Martina Hall (IndMat), Statistical Methods for early Prediction of Cerebral Palsy based on Data from Computer-based Video Analysis. Joint supervision with Turid Follestad, co-supervisor: Lars Adde, MH, NTNU.
- June 2017: Julia B. Debik (IndMat), Using Ensemble Methods to Improve the Performance of Prediction Statistical Analysis of a Myocardial Infarction Data Set from the HUNT Study. Co-supervisor: Anja Bye, MH, NTNU.
- June 2016: Fredrik Lohne Aanes (IndMat), Testing equality of the success probabilities in two independent binomial distributions. Co-supervisor: Øyvind Bakke, IMF, NTNU.
- June 2016: Lene Maria Sundbakk (IndMat), Statistical methods to detect genotype-phenotype association using genetic similarity matrices.
- June 2016: Marthe Larsen (IndMat), Statistical analysis of factors influencing early neurological deterioration after acute ischemic stroke using sparse logistic regression. Co-supervisor: Turid Follestad, MH, NTNU.
- June 2014: Thea Bjørnland (IndMat), Statistical Methods for Genetic Association Studies under the Extreme Phenotype Sampling Design.
- March 2014: Karen Sofie Sollie Holt (IndMat), Statistical Modelling and Inference for Long Gene Expression Time Series.
- June 2013: Marit Runde (IndMat), Statistical methods for detecting genotype-phenotype association in the presence of environmental covariates.
- July 2012: Christian Magnus Page (MathS), Estimating Time-Continuous Gene Expression Profiles Using the Linear Mixed Effects Framework.
- June 2012: Kari Krizak Halle (IndMat), Statistical Methods for Multiple Testing in Genome-Wide Association Studies.
- June 2011: Eirin Tangen Østgård (IndMat), Statistical analysis of biological data using linear mixed effects data.
- June 2011: Tonje Gulbrandsen Lien (IndMat), Statistical modelling of qPCR data.
- June 2010: Anita Sklander Fineid (IndMat), Genetic Association in Case-Control Studies: Methods for Calculating Statistical Significance. Main supervisor: Øyvind Bakke (IMF, NTNU).
- June 2010: Ida Henriette Caspersen (Bio), Leukocyte count and expression of diabetes markers in response to dietary carbohydrate restriction. Institute of Biology, NTNU. Main supervisor: Berit Johansen.
- July 2008: Torbjørn Lilleheier (IndMat), Analysis of common cause failures in complex safety instrumented systems. Joint supervisor: Marvin Rausand, NTNU.
- June 2008: Marita Risberg (IndMat), Combining the MAX test with methods for Family-wise error rate.
- June 2008: Erik Edsberg (IndMat), A statistical simulation-based framework for sample size considerations in case-control SNP association studies.
- June 2007: Solfrid Håbrekke (IndMat), Metodar og data til estimering av parametrar i risikomodellen for transport av farleg gods. Joint supervisor: Jørn Vatn, IVT, IPK, NTNU.
- June 2007: Ingvild Bore (IndMat), Statistisk analyse av celleprøver innen kreftdiagnose: Multinomisk logistisk regresjon - modelltilpasning og prediksjon. Joint supervisor: Stian Lydersen, MH, NTNU.
- June 2006: Øystein Widar Bråthen (IndMat), Multilevel Analysis Applied to Fetal Growth Data with Missing Values. Joint supervisor: Stian Lydersen, MH, NTNU.
- February 2005: Marit Sellie Eriksen (IndMat), Computing expression summary measures for Affymetrix microarray data.
- June 2003: Egil Ferkingstad (IndMat), Estimating the proportion of true null hypotheses, with application to DNA microarray data. Joint supervisor: Bo Lindqvist, IMF, NTNU.
- June 2003: Hilde-Gunn Bruu (IndMat), Statistical designs for cDNA microarray experiments. Joint supervisor: John Tyssedal, IMF, NTNU.