# Thesis work in statistics supervised by Mette Langaas

**(UPDATE: I will not be able to take on new students in 2024 due to illness. However, the project marked EXTERNAL with Pål Vegard Johnsen at SINTEF Digital will be available in 2024.)**

I'm a statistician with special interest in *analysing data from biostatistics, medicine and genomics* but I'm also involved in other application areas (like insurance, and multivariate data from sensors). Teaching and learning in STEM (science, technology, engineering and mathematics) is also very close to my heart.

The statistical methods I mainly use (and develop) are within *statistical inference and learning*, in particular methods for regression and classification (mainly generalized linear and mixed models (GLM+LMM), ensemble methods like boosting, and methods including different versions of regularization), and methods for calculating valid p-values and correcting for multiple testing. I'm also interested in understanding black-box models, using methods within explainable AI.

I am an active user of the statistical language R, and all of my project thesis suggestions will have a strong R component. However, using Python is of course possible (but then I will not be of much help with detailed coding).

Do you want to read more about me and my research? See: Contact information

Below you find specific suggestions for thesis projects, and below that I list courses that go well with my projects. If you have other ideas for a thesis project within my fields of research (as listed above), you may contact me so we can discuss thesis supervision.

**For all projects:** we meet approximately weekly to discuss reading material, progress, R coding, and data analyses. Writing should be done in LaTeX (Overleaf) and in English (may be Norwegian for bachelor theses), and R code should preferably be maintained with RStudio in combination with version control using GitHub or similar.

## Bachelor thesis project suggestions

**Sorry, but I am ill and cannot supervise in 2024!**

A bachelor thesis typically involves the student studying a method or topic that may be relevant in society today, or something that is a small extension of what the student has learned so far. Alternatively, it can be interesting to try out everything you have learned in practice by analysing a data set.

I have only supervised two bachelor theses so far; the titles are listed at the bottom of this page.

I am currently (spring 2024) particularly interested in investigating how well large language models (LLMs) such as ChatGPT4 can analyse data, and I would like to build a system for evaluating the quality of such analyses. Since the LLM may have been trained on data that include public data sets, the system should be built on simulated data (the LLM has not seen these before, and the ground truth is known because you have specified the model the data are simulated from).
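
As a minimal sketch of the simulated-data idea, the Python snippet below simulates data from a known simple linear model and scores a set of reported estimates against the true parameters. The model, the function names and the error measure are illustrative assumptions, not a description of any existing system; here an ordinary least squares fit stands in for the estimates that would be extracted from an LLM's answer.

```python
import random

def simulate_linear_data(n, intercept, slope, noise_sd, seed):
    """Simulate (x, y) pairs from y = intercept + slope * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def ols_fit(xs, ys):
    """Ordinary least squares for simple linear regression (reference analysis)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def score_reported_estimates(reported, truth):
    """Absolute error of each reported parameter against the simulation truth."""
    return {name: abs(reported[name] - truth[name]) for name in truth}

# The true parameters are known because we chose them ourselves.
truth = {"intercept": 2.0, "slope": 0.5}
xs, ys = simulate_linear_data(500, truth["intercept"], truth["slope"], 1.0, seed=1)

# Stand-in for the LLM: in a real system `reported` would be parsed
# from the model's answer (hypothetical step, not shown here).
b0, b1 = ols_fit(xs, ys)
reported = {"intercept": b0, "slope": b1}
errors = score_reported_estimates(reported, truth)
print(errors)
```

A full evaluation system would repeat this over many simulated data sets and model types, so that the quality of an analysis is judged on average rather than on a single draw.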

Send me an email and we will arrange a meeting to see whether there is overlap between what you would like to work on and what I am competent to supervise. My interests are shown in the information at the top of this page, in the project suggestions, and in the titles of theses I have supervised earlier.

## Thesis project suggestions (only the last project, marked EXTERNAL, is available for 2024/2025; contact Pål Vegard, see below)

These suggestions are for theses at the master level (year 4/5 of study: TMA4500, TMA4900, MA3911 and master thesis for MLREAL in statistics).

#### Project A: Two-sided p-values for non-symmetric distribution of test statistics

*Background:* Two-sided statistical tests and p-values are well defined only when the test statistic in question has a symmetric distribution. What are the possible solutions for non-symmetric distributions?

*Method:* We will start by studying the possible ways to calculate two-sided p-values in the discrete situation for the Fisher exact test. We will use this as motivation for considering solutions for both discrete and continuous test statistics with non-symmetric distributions under the null hypothesis.
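
To illustrate why the choice matters, the sketch below computes two common two-sided p-values for Fisher's exact test on a 2x2 table: doubling the smaller one-sided p-value, and summing the probabilities of all outcomes that are at most as probable as the observed one. These two definitions are chosen here as examples, and the table is made up; the project may consider other definitions as well.

```python
from math import comb

def hypergeom_pmf(row1, row2, col1):
    """Null distribution of the top-left cell given fixed margins."""
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    total = comb(row1 + row2, col1)
    return {
        k: comb(row1, k) * comb(row2, col1 - k) / total
        for k in range(lowest, highest + 1)
    }

def two_sided_p_values(table):
    """Two candidate two-sided p-values for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    pmf = hypergeom_pmf(a + b, c + d, a + c)
    lower_tail = sum(p for k, p in pmf.items() if k <= a)
    upper_tail = sum(p for k, p in pmf.items() if k >= a)
    # Method 1: double the smaller one-sided p-value (capped at 1).
    doubling = min(1.0, 2.0 * min(lower_tail, upper_tail))
    # Method 2: sum all outcomes no more probable than the observed one.
    # The small relative tolerance guards against floating point ties.
    cutoff = pmf[a] * (1.0 + 1e-7)
    point_probability = sum(p for p in pmf.values() if p <= cutoff)
    return {"doubling": doubling, "point_probability": point_probability}

# Made-up example table with a non-symmetric null distribution.
p_values = two_sided_p_values([[4, 1], [1, 6]])
print(p_values)
```

For this table the two methods give different answers, because the null distribution of the top-left cell is not symmetric; for a symmetric null distribution they coincide.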

*Useful courses:* TMA4294 Statistical inference, TMA4315 Generalized linear models (can be taken in the autumn of 2021)

*Literature:* It is useful to start by understanding enumeration-type methods (one example of such a method is Fisher's exact test), as explained in Chapters 2 and 3 of the master thesis of Fredrik L. Aanes, with popular science explanations in the book "The lady tasting tea" described here https://en.wikipedia.org/wiki/Lady_tasting_tea, and the bachelor thesis of Mathias Dåsvand.

*Alternative supervisor:* This work is a collaboration with Øyvind Bakke, and he may be chosen as the main supervisor for this project.

*Remark:* This project focuses mainly on statistical theory, and on implementation and use of the theory in R. We are interested in this problem because of work in progress on using higher order asymptotics for the score test in logistic regression to perform hypothesis tests for genetic markers, which requires two-sided p-values.

#### Project B: Prediction of patients at high-risk for imminent suicidal behaviour using supervised machine learning approaches

*Background:* Suicide is a major health problem with multiple causes that is poorly understood. The predictive ability of known risk factors for suicidal behaviour is low. Can risk algorithms for suicidal behaviour be constructed using machine learning methods from a large number of predictors?

*Methods:* The machine learning methods to investigate are tree ensembles, in particular random forest and xgboost. The methods are to be used on a data set from two prospective and register-based cohort studies.

*Useful courses:* TMA4315 Generalized linear models, TMA4268 Statistical learning, and also TMA4295 Statistical Inference, TMA4275 Lifetime analysis

*Co-supervisor:* Terje Torgersen, Institutt for psykisk helse and St. Olav, and Linde Melby (ph.d. student), St. Olavs hospital, Klinikk for psykisk helsevern

*Earlier work:* See thesis by Marthe B. Ludvigsen below. Gorm Finne Engelsen is also working with this type of data in the 2023/2024 study year.

#### Project C: CorFemina - risk of cardiovascular disease in women

*Background:* Existing risk prediction models for cardiovascular disease are imprecise, especially in women. This may lead to more women developing these diseases without being aware that they are at risk. New risk prediction algorithms for women need to be developed! This is part of the CorFemina project.

*Methods (not all to be used - need to select):* Logistic regression and lasso logistic regression, tree ensembles, xgboost, XAI-methods.

*Useful courses:* TMA4315 Generalized linear models (can be at the same time as the thesis work), TMA4275 Lifetime analysis, TDT4173 Machine learning and case based reasoning

*Co-supervisor:* Senior researcher Anja Bye, Department of Circulation and Medical Imaging, Faculty of Medicine and Health, NTNU, and ph.d. student Virginia de Martin Topranin.

*Remark:* To choose this project the student should have an interest in working with medical data and in understanding genetic concepts. A large part of the project will be data analysis in R (some within the HUNT cloud virtual environment, some using the NICE-1 safe NTNU storage). Experience with working on the command line is useful: http://swcarpentry.github.io/shell-novice/

*Earlier work:* Atle Wiig-Fisketjøn wrote his master thesis in 2021 on similar data, and the approach and methods he used are also relevant for this project: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2784245. Emma Botten is writing about risk prediction with the lasso in the 2023/2024 study year.

#### Project D: Statistical inference for the precision-recall (PR) curve and the area under the PR-curve

*Background:* Classical statistical analyses (methods for classification and regression, hypothesis tests) play an important role in medicine, and these methods are well understood from a formal statistical point of view. Recently, methods from statistical learning and machine learning (often referred to as artificial intelligence methods) have gained popularity in the field of risk prediction, with the ability to model non-linear effects, perform automatic model selection and automatically include interactions between potential risk factors in the estimated risk prediction algorithm. However, several challenges exist.

Risk prediction methods are often evaluated and compared using the receiver operating characteristic (ROC) curve and the area under the ROC curve (ROCAUC). These two concepts are rather well understood, and valid methods for statistical inference are available. One way to test hypotheses about the ROCAUC of one or several risk prediction methods is based on the permutation test; rank-based tests are also important.
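
As a small illustration of these two ideas, the sketch below computes the ROCAUC in its rank-based (Mann-Whitney) form and runs a Monte Carlo permutation test. The null hypothesis tested here, that the scores carry no information about disease status, is one simple choice made for illustration, and the scores and labels are made up.

```python
import random

def roc_auc(scores, labels):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), the Mann-Whitney form."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(scores, labels, n_perm=2000, seed=0):
    """One-sided Monte Carlo permutation p-value for H0: no discrimination."""
    rng = random.Random(seed)
    observed = roc_auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Permuting the labels breaks any link between score and outcome.
        rng.shuffle(shuffled)
        if roc_auc(scores, shuffled) >= observed:
            hits += 1
    # Adding one to numerator and denominator keeps the p-value valid.
    return observed, (hits + 1) / (n_perm + 1)

# Made-up predicted risks and disease indicators for ten patients.
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
auc, p_value = permutation_p_value(scores, labels)
print(auc, p_value)
```

Comparing two risk prediction methods on the same patients would need a different permutation scheme (for example swapping the two methods' scores within patients), which is not shown here.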

However, for low prevalence diseases it is known that adding patients without disease and with low test predictions of disease will improve the ROC, but without any improvement in the sensitivity or the positive predictive value. Therefore, in machine learning, the Precision-Recall (PR) curve and the area under this curve (PRAUC) are now commonly used.
Statistical inference for PR-curves and the PRAUC is not well understood and there is a need to formalize the use of these assessment methods based on statistical theory. This will be the topic of this master thesis project.
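
The effect of adding easy-to-classify patients without disease can be seen in a small made-up example. The sketch below uses average precision as one common estimator of the PRAUC (an assumption; other estimators exist) and compares it with the ROCAUC before and after adding twenty disease-free patients with very low predicted risk.

```python
def roc_auc(scores, labels):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision at the rank of each diseased patient.

    Assumes no tied scores between diseased and disease-free patients.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            precisions.append(true_positives / rank)
    return sum(precisions) / len(precisions)

# Made-up predicted risks and disease indicators for ten patients.
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Add twenty disease-free patients with a very low predicted risk.
scores_more = scores + [0.01] * 20
labels_more = labels + [0] * 20

auc_before = roc_auc(scores, labels)
auc_after = roc_auc(scores_more, labels_more)
ap_before = average_precision(scores, labels)
ap_after = average_precision(scores_more, labels_more)
print(auc_before, auc_after, ap_before, ap_after)
```

In this example the ROCAUC increases while the average precision is unchanged, because the added patients are ranked below every diseased patient and therefore do not affect the precision at any diseased patient's rank.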

*Methods:* Work will be based on first understanding the theory for the ROCAUC (including the rank based and permutation based tests), and then working from several angles to arrive at valid methods for inference for the PRAUC.

*Useful courses:* TMA4315 Generalized linear models (can be at the same time as the thesis work), TMA4268 Statistical learning (required), and also TMA4295 Statistical Inference, if possible also MA8701 (next time spring 2025).

#### Project EXTERNAL

*Background:* The focus is on the construction of forecasting models on process industry data.
In particular, as a continuation of research done in the Analytics & AI group at SINTEF Digital, we want to improve on our
recently developed "chunk-based ensemble model". We use real sensor data from a wastewater treatment plant in Norway
to test the ability of the forecasting model. We want to use the forecasting model in the context of control systems (such
as model predictive controls).

*Method:* The idea is that models are trained on different disjoint intervals of the dataset to learn and focus on different relationships in the data.
At prediction time, the forecast is a linear combination of the predictions from the previously
trained models. How to weight the models is of particular interest, and at this point the weights are assigned
based on how the models perform on recent data (formulated as a quadratic programming optimization problem).
We now want to test a clustering method such that similar dynamics in the data are placed in the same cluster,
and train a model for each cluster. The thesis will focus on how these clusters should be made, and how
the models trained on each cluster should be combined at prediction time.
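
One way such a weighting problem could look is sketched below. The exact formulation used in the chunk-based ensemble model is not given here, so the objective (squared error on recent data) and the constraints (non-negative weights that sum to one) are assumptions made for illustration, as are all function names. The quadratic program is solved with a simple projected gradient method in pure Python; a dedicated QP solver would normally be used in practice.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        candidate = (cumulative - 1.0) / i
        if ui - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]

def fit_ensemble_weights(predictions, targets, n_iter=2000):
    """Minimise 0.5 * ||targets - sum_m w_m * predictions[m]||^2 on the simplex.

    predictions[m][t] is the forecast of model m at recent time point t.
    """
    m = len(predictions)
    n = len(targets)
    # Quadratic term (Gram matrix) and linear term of the objective.
    gram = [[sum(pi[t] * pj[t] for t in range(n)) for pj in predictions]
            for pi in predictions]
    lin = [sum(p[t] * targets[t] for t in range(n)) for p in predictions]
    # The largest absolute row sum bounds the largest eigenvalue of gram,
    # so its inverse is a safe step size.
    step = 1.0 / max(sum(abs(g) for g in row) for row in gram)
    w = [1.0 / m] * m
    for _ in range(n_iter):
        grad = [sum(gram[i][j] * w[j] for j in range(m)) - lin[i]
                for i in range(m)]
        w = project_to_simplex([wi - step * gi for wi, gi in zip(w, grad)])
    return w

# Made-up recent forecasts from two previously trained models.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [4.0, 3.0, 2.0, 1.0]
# Recent observations constructed as 0.75 * model_a + 0.25 * model_b.
recent = [0.75 * a + 0.25 * b for a, b in zip(model_a, model_b)]
weights = fit_ensemble_weights([model_a, model_b], recent)
print(weights)
```

With one model per cluster, the same kind of optimisation could be run over the cluster-specific models, possibly with weights that depend on which cluster the recent data resemble.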

*Recommended literature/courses*: TMA4285 Time series, TMA4268 Statistical learning,
TTK4105 Control Systems, TTK4135 Optimization and Control

*Supervisor:* The project will be supervised by Pål Vegard Johnsen at SINTEF Digital, and Thea Bjørnland will be the formal supervisor at IMF. Contact Pål if you want to know more!

## Supporting courses for the thesis projects

**Spring semester:**

**Statistics:**

- TMA4268 Statistical learning (required).
- TMA4300 Computational statistics (nice to have)
- TMA4275 Lifetime analysis (will move to autumn; the last spring offering is 2024, and the next offering is autumn 2025)
- KLMED8008 Analysis of repeated measurements. From MH faculty, only taught during a short period and gives 5 STP. (Have not checked if the course will run in 2024)
- MA8701 Advanced statistical methods in inference and learning (next time spring 2025)

**Medicine/data/mathematics:**

- TDT4300 Data warehousing and data mining. Note that this requires that you first have TDT4120 Algorithms and data structures, then TDT4100 Object-oriented programming, and then TDT4145 Data modelling and database systems

**Autumn semester:**

**Statistics:**

- TMA4315 Generalized linear models (required, but may be taken in parallel with the thesis)
- From the 2025 autumn semester the Lifetime analysis course (TMA4275) will move from spring to autumn.

**Medicine/data/mathematics:**

- TDT4173 Machine learning and case-based reasoning - some overlap with TMA4268 in topics, but not in STP.
- TDT4117 Information retrieval (see recommended background)

## Former master students

(IndMat): MTFYMA (Master of Technology, Physics and Mathematics) with Industrial Mathematics from year 3.

(LUR): MLREAL

(MathS): MSMNFMA (International Master in Mathematical Sciences) with focus on Statistics.

- February 2024: Emma Botten (IndMat): Identifying Risk Factors for Cardiovascular Disease in Women Using Penalized Regression. Co-supervised by Anja Bye, MH, NTNU.
- June 2023: Marte Bøe Ludvigsen (IndMat): Suicide Crisis Syndrome in a Norwegian Acute Psychiatric Unit: Exploring Risk Factors using Statistical Learning and Inference. Co-supervised by Linde Melby and Terje Torgersen.
- June 2022: Sebastian Øiungen Ankill (IndMat): Statistical method for analysis of gene expression count data applied to a Crohn's disease dataset. Co-supervised by Atle van Beelen Granlund, MH, NTNU.
- June 2022: Lene Tillerli Omdal (IndMat): Statistical Analysis of the Association between MicroRNAs in Breast Milk, Perinatal Probiotic Supplement and the Development of Atopic Dermatitis. Main supervisor Turid Follestad, Co-supervisor: Melanie Rae Simpson and Mette Langaas.
- June 2021: Atle Wiig-Fisketjøn (IndMat), Risk Prediction of Cardiovascular Disease with Statistical Learning Methods. Co-supervised by Anja Bye, MH, NTNU.
- June 2021: Lisa Erfjord (IndMat), Statistical Analysis of Interaction Effects Between Environmental and Genetic Factors, Can physical activity reduce the effects of genetic predispositions to cardiovascular disease? Data set from the HUNT Study. Co-supervised by Anja Bye, MH, NTNU.
- June 2019: Kristine Lund Mathisen (IndMat), Identifying Expression Quantitative Trait Loci in Patients with Inflammatory Bowel Disease. Co-supervised by Atle van Beelen Granlund, MH, NTNU.
- June 2019: Amirhossein Kazami (IndMat), A Semi-Supervised Approach to the Application of Sensor-based Change-Point Detection for Failure Prediction in Industrial Instruments. Co-supervised by Martin Høy at DNV GL.
- June 2019: Elisabeth Hetlelid (IndMat), Modelling biomarker development during pregnancy for women with rheumatic diseases using linear mixed models and bootstrapping. Co-supervised by Mona Høysæter Fenstad, MH, NTNU.
- June 2019: Dag Johnsrud Kristiansen (IndMat), Detecting Neuronal Activity with Lasso Penalized Logistic Regression. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2018: Kristin B. Bakka (IndMat), Changepoint model selection in Gaussian data by maximization of approximate Bayes Factors with the Pruned Exact Linear Time algorithm.
- June 2018: Pål Vegard Johnsen (IndMat), Stochastic modelling and analysis of neural tumour evolution. Co-supervisor: Thea Bjørnland, IMF, NTNU.
- March 2018: Kristian Aga (IndMat), Modelling Neuronal Activity with Jittered Generalised Linear Models. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2017: Haris Fawad (IndMat), Modelling Neuronal Activity using Lasso Regularized Logistic Regression. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2017: Marie Klevjar (LUR), Statistical modelling and analysis of energy expenditure among severely overweight. Co-supervisor: Ingrid Løvold Mostad, MH, NTNU and St.Olavs Hospital
- June 2017: Martina Hall (IndMat), Statistical Methods for early Prediction of Cerebral Palsy based on Data from Computer-based Video Analysis. Joint supervision with Turid Follestad, co-supervisor: Lars Adde, MH, NTNU.
- June 2017: Julia B. Debik (IndMat), Using Ensemble Methods to Improve the Performance of Prediction Statistical Analysis of a Myocardial Infarction Data Set from the HUNT Study. Co-supervisor: Anja Bye, MH, NTNU.
- June 2016: Fredrik Lohne Aanes (IndMat). Testing equality of the success probabilities in two independent binomial distributions. Co-supervisor: Øyvind Bakke, IMF, NTNU.
- June 2016: Lene Maria Sundbakk (IndMat), Statistical methods to detect genotype-phenotype association using genetic similarity matrices.
- June 2016: Marthe Larsen (IndMat), Statistical analysis of factors influencing early neurological deterioration after acute ischemic stroke using sparse logistic regression. Co-supervisor: Turid Follestad, MH, NTNU.
- June 2014: Thea Bjørnland (IndMat), Statistical Methods for Genetic Association Studies under the Extreme Phenotype Sampling Design.
- March 2014: Karen Sofie Sollie Holt (IndMat), Statistical Modelling and Inference for Long Gene Expression Time Series.
- June 2013: Marit Runde (IndMat), Statistical methods for detecting genotype-phenotype association in the presence of environmental covariates.
- July 2012: Christian Magnus Page (MathS), Estimating Time-Continuous Gene Expression Profiles Using the Linear Mixed Effects Framework. Co-supervisor Torunn Bruland, MH, NTNU.
- June 2012: Kari Krizak Halle (IndMat), Statistical Methods for Multiple Testing in Genome-Wide Association Studies.
- June 2011: Eirin Tangen Østgård (IndMat), Statistical analysis of biological data using linear mixed effects data.
- June 2011: Tonje Gulbrandsen Lien (IndMat), Statistical modelling of qPCR data.
- June 2010: Anita Sklander Fineid (IndMat), Genetic Association in Case-Control Studies: Methods for Calculating Statistical Significance. Main supervisor: Øyvind Bakke (IMF, NTNU).
- June 2010: Ida Henriette Caspersen (Bio), Leukocyte count and expression of diabetes markers in response to dietary carbohydrate restriction. Institute of Biology, NTNU. Main supervisor: Berit Johansen.
- July 2008: Torbjørn Lilleheier (IndMat), Analysis of common cause failures in complex safety instrumented systems. Joint supervisor: Marvin Rausand, NTNU.
- June 2008: Marita Risberg (IndMat), Combining the MAX test with methods for Family-wise error rate.
- June 2008: Erik Edsberg (IndMat), A statistical simulation-based framework for sample size considerations in case-control SNP association studies.
- June 2007: Solfrid Håbrekke (IndMat), Metodar og data til estimering av parametrar i risikomodellen for transport av farleg gods. Joint supervisor: Jørn Vatn, IVT, IPK, NTNU.
- June 2007: Ingvild Bore (IndMat), Statistisk analyse av celleprøver innen kreftdiagnose: Multinomisk logistisk regresjon - modelltilpasning og prediksjon. Joint supervisor: Stian Lydersen, MH, NTNU.
- June 2006: Øystein Widar Bråthen (IndMat), Multilevel Analysis Applied to Fetal Growth Data with Missing Values. Joint supervisor: Stian Lydersen, MH, NTNU.
- February 2005: Marit Sellie Eriksen (IndMat), Computing expression summary measures for Affymetrix microarray data.
- June 2003: Hilde-Gunn Bruu (IndMat), Statistical designs for cDNA microarray experiments. Joint supervisor: John Tyssedal, IMF, NTNU.
- June 2003: Egil Ferkingstad (IndMat), Estimating the proportion of true null hypotheses, with application to DNA microarray data. Joint supervisor: Bo Lindqvist, IMF, NTNU.

## Former bachelor students

- May 2021: Rasmus Hilmer Henninen (BMAT), Explainable Artificial Intelligence. Explaining predictions with partial dependence and accumulated local effects plots.
- May 2021: Haakon Muggerud (BMAT), Feed-forward neural networks and how to explain their predictions.