Thesis work in statistics supervised by Mette Langaas
I'm a statistician with a special interest in analysing data from biostatistics, medicine and genomics, but I'm also involved in other application areas (such as insurance and multivariate sensor data). Teaching and learning in STEM (science, technology, engineering and mathematics) is also very close to my heart.
The statistical methods I mainly use (and develop) are within statistical inference and learning, in particular methods for regression and classification (mainly generalized linear and mixed models (GLM+LMM) and regularization), and methods for calculating valid p-values and correcting for multiple testing. I'm also interested in understanding black-box models, using methods within explainable AI.
I am an active user of the statistical language R, and all of my thesis project suggestions will have a strong R component. However, using Python is of course possible (but then I will not be of much help with detailed coding).
Do you want to read more about me and my research? See: Contact information
Below you will find specific suggestions for thesis projects, followed by a list of courses that go well with my projects. If you have other ideas for a thesis project within my fields of research (as listed above), you may contact me so we can discuss thesis supervision.
For all projects: we meet approximately weekly and discuss reading material, progress, R coding and data analyses. Writing should be done in LaTeX (Overleaf) and in English, and R code should preferably be maintained with RStudio in combination with version control using GitHub or Bitbucket.
Bachelor thesis project suggestions
A bachelor thesis typically involves the student studying a method or topic that may be relevant in society today, or something that is a small extension of what the student has learned so far. Alternatively, it can be interesting to try out in practice everything you have learned, by analysing a data set.
I have only supervised two bachelor theses so far; the titles are listed at the bottom of this page.
Send me an email and we will arrange a meeting to see whether there is overlap between what you would like to work on and what I have the competence to supervise. My interests are given by the information at the top of the page, the thesis suggestions, and the titles of theses I have supervised earlier.
I am not able to supervise bachelor theses in the spring of 2023 (fully booked).
Thesis project suggestions for the 2024-2025 study year
These suggestions are for theses at the master level (year 4/5 of study: TMA4500, TMA4900, MA3911 and master thesis for MLREAL in statistics).
Project A: Two-sided p-values for non-symmetric distribution of test statistics
Background: Two-sided statistical tests and p-values are well defined only when the test statistic in question has a symmetric distribution. What are the possible solutions for non-symmetric distributions?
Method: We will start by studying the possible ways to calculate two-sided p-values in the discrete situation for the Fisher exact test. We will use this as motivation for considering solutions for both discrete and continuous test statistics with non-symmetric distributions under the null hypothesis.
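To illustrate why the choice matters, here is a minimal sketch (not part of the project description) comparing two common conventions for a two-sided p-value in Fisher's exact test on a 2x2 table: doubling the smaller one-sided tail, and summing the probabilities of all outcomes that are no more likely than the observed one. The function names and the example table are hypothetical, and the sketch uses Python with only the standard library; in R, `fisher.test` provides a built-in two-sided p-value.

```python
from math import comb


def hypergeom_pmf(k, row1, row2, col1):
    """P(X = k) for the top-left cell of a 2x2 table with fixed margins."""
    return comb(row1, k) * comb(row2, col1 - k) / comb(row1 + row2, col1)


def two_sided_p_values(a, b, c, d):
    """Two-sided p-values for the table [[a, b], [c, d]] under no association.

    Returns two conventions (both illustrative, assumed choices):
    - "doubling": twice the smaller one-sided tail, capped at 1
    - "point_probability": total probability of outcomes that are
      at most as likely as the observed one
    """
    row1, row2, col1 = a + b, c + d, a + c
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    pmf = {k: hypergeom_pmf(k, row1, row2, col1)
           for k in range(lowest, highest + 1)}

    lower_tail = sum(p for k, p in pmf.items() if k <= a)
    upper_tail = sum(p for k, p in pmf.items() if k >= a)
    doubling = min(1.0, 2 * min(lower_tail, upper_tail))

    # A small relative tolerance guards against floating point ties.
    p_obs = pmf[a]
    point_prob = sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-7))

    return {"doubling": doubling, "point_probability": min(1.0, point_prob)}


if __name__ == "__main__":
    # Hypothetical table with a non-symmetric null distribution.
    print(two_sided_p_values(7, 1, 2, 5))
```

For this table the null distribution of the top-left cell is non-symmetric, and the two conventions fall on opposite sides of 0.05. When the null distribution is symmetric (for instance the table [[3, 1], [1, 3]]), the two conventions agree.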
Useful courses: TMA4294 Statistical inference, TMA4315 Generalized linear models (can be taken in the autumn of 2021)
Literature: It is useful to start by understanding enumeration-type methods (one example of such a method is Fisher's exact test), as explained in Chapters 2 and 3 of the master thesis of Fredrik L. Aanes, with a popular science explanation in the book "The lady tasting tea", described here: https://en.wikipedia.org/wiki/Lady_tasting_tea.
Alternative supervisor: This work is a collaboration with Øyvind Bakke, and he may be chosen as the main supervisor for this project.
Remark: This project has its main focus on statistical theory, and on implementation and use of the theory in R. Our interest in this problem comes from work in progress on using higher-order asymptotics for the score test in logistic regression to perform hypothesis tests for genetic markers, which requires two-sided p-values.
Project B: Prediction of patients at high-risk for imminent suicidal behaviour using supervised machine learning approaches
Background: Suicide is a major health problem with multiple causes, and it is poorly understood. The predictive ability of known risk factors for suicidal behaviour is low. Can risk algorithms for suicidal behaviour be constructed using machine learning methods from a large number of predictors?
Methods: The machine learning methods to investigate are tree ensembles, in particular random forest and xgboost. The methods are to be used on a data set from two prospective and register-based cohort studies.
Useful courses: TMA4315 Generalized linear models, TMA4268 Statistical learning, and also TMA4295 Statistical Inference, TMA4275 Lifetime analysis
Co-supervisors: Terje Torgersen, Institutt for psykisk helse and St. Olavs hospital, and Linde Melby (PhD student), St. Olavs hospital, Klinikk for psykisk helsevern
Earlier work: See thesis by Marthe B. Ludvigsen below.
Project C: CorFemina - risk of cardiovascular disease in women
Background: Existing risk prediction models for cardiovascular disease are imprecise, especially in women. This may lead to more women developing these diseases without being aware that they are at risk. New risk prediction algorithms for women need to be developed!
Methods (not all to be used - need to select): Logistic regression and lasso logistic regression, tree ensembles, xgboost, XAI-methods. Evaluation with area under precision-recall curves.
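As a minimal sketch of the evaluation measure mentioned above, the area under the precision-recall curve can be summarised by the average precision: the mean of the precision values at the ranks where a true case is found. The function name and toy data below are hypothetical, the sketch uses plain Python rather than R, and ties in the predicted scores are simply broken by input order (an assumption that a real analysis should handle more carefully).

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision).

    y_true: list of 0/1 outcomes; scores: predicted risks (higher = riskier).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive case")

    # Rank observations from highest to lowest predicted risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            # Precision among the top `rank` predictions.
            total += true_positives / rank
    return total / n_pos


if __name__ == "__main__":
    # Toy example: 2 cases among 5 individuals.
    y = [1, 0, 1, 0, 0]
    risk = [0.9, 0.8, 0.7, 0.2, 0.1]
    print(average_precision(y, risk))
```

Unlike the area under the ROC curve, the reference level for this measure depends on how common the outcome is: a model with no predictive ability scores roughly the proportion of cases in the data. This makes the measure informative for rare outcomes.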
Useful courses: TMA4315 Generalized linear models (can be taken in the autumn of 2023), TMA4275 Lifetime analysis, TDT4173 Machine learning and case based reasoning
Co-supervisor: Senior researcher Anja Bye, Department of Circulation and Medical Imaging, Faculty of Medicine and Health, NTNU.
Remark: To choose this project, the student should have an interest in working with medical data and in understanding genetic concepts. A large part of the project will be data analysis (in R) within the virtual environment HUNT Cloud. Knowledge of working on the command line is useful: http://swcarpentry.github.io/shell-novice/
Earlier work: Atle Wiig-Fisketjøn wrote his master thesis in 2021 on similar data, and the approach and methods he used are also relevant for this project: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2784245.
Supporting courses for the thesis projects
- KLMED8008 Analysis of repeated measurements. From MH faculty, only taught during a short period and gives 5 STP.
- TDT4173 Machine learning and case-based reasoning - some overlap with TMA4268 in topics, but not in STP.
- TDT4287 Algorithms for bioinformatics (strong emphasis on programming with Python, and you need to be skilled in algorithms and data structures).
- TDT4117 Information retrieval (see recommended background)
Former master students
(IndMat): MTFYMA (Master of Technology, Physics and Mathematics) with Industrial Mathematics from year 3.
(MathS): MSMNFMA (International Master in Mathematical Sciences) with focus on Statistics.
- June 2023: Marte Bøe Ludvigsen (IndMat): Suicide Crisis Syndrome in a Norwegian Acute Psychiatric Unit: Exploring Risk Factors using Statistical Learning and Inference. Co-supervised by Linde Melby and Terje Torgersen.
- June 2022: Sebastian Øiungen Ankill (IndMat): Statistical method for analysis of gene expression count data applied to a Crohn's disease dataset. Co-supervised by Atle van Beelen Granlund, MH, NTNU.
- June 2022: Lene Tillerli Omdal (IndMat): Statistical Analysis of the Association between MicroRNAs in Breast Milk, Perinatal Probiotic Supplement and the Development of Atopic Dermatitis. Main supervisor Turid Follestad, Co-supervisor: Melanie Rae Simpson and Mette Langaas.
- June 2021: Atle Wiig-Fisketjøn (IndMat), Risk Prediction of Cardiovascular Disease with Statistical Learning Methods. Co-supervised by Anja Bye, MH, NTNU.
- June 2021: Lisa Erfjord (IndMat), Statistical Analysis of Interaction Effects Between Environmental and Genetic Factors, Can physical activity reduce the effects of genetic predispositions to cardiovascular disease? Data set from the HUNT Study. Co-supervised by Anja Bye, MH, NTNU.
- June 2019: Kristine Lund Mathisen (IndMat), Identifying Expression Quantitative Trait Loci in Patients with Inflammatory Bowel Disease. Co-supervised by Atle van Beelen Granlund, MH, NTNU.
- June 2019: Amirhossein Kazami (IndMat), A Semi-Supervised Approach to the Application of Sensor-based Change-Point Detection for Failure Prediction in Industrial Instruments. Co-supervised by Martin Høy at DNV GL.
- June 2019: Elisabeth Hetlelid (IndMat), Modelling biomarker development during pregnancy for women with rheumatic diseases using linear mixed models and bootstrapping. Co-supervised by Mona Høysæter Fenstad, MH, NTNU.
- June 2019: Dag Johnsrud Kristiansen (IndMat), Detecting Neuronal Activity with Lasso Penalized Logistic Regression. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2018: Kristin B. Bakka (IndMat), Changepoint model selection in Gaussian data by maximization of approximate Bayes Factors with the Pruned Exact Linear Time algorithm.
- June 2018: Pål Vegard Johnsen (IndMat), Stochastic modelling and analysis of neural tumour evolution. Co-supervisor: Thea Bjørnland, IMF, NTNU.
- March 2018: Kristian Aga (IndMat), Modelling Neuronal Activity with Jittered Generalised Linear Models. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2017: Haris Fawad (IndMat), Modelling Neuronal Activity using Lasso Regularized Logistic Regression. Co-supervised by Benjamin Dunn, IMF, NTNU.
- June 2017: Marie Klevjar (LUR), Statistical modelling and analysis of energy expenditure among severely overweight. Co-supervisor: Ingrid Løvold Mostad, MH, NTNU and St.Olavs Hospital
- June 2017: Martina Hall (IndMat), Statistical Methods for early Prediction of Cerebral Palsy based on Data from Computer-based Video Analysis. Joint supervision with Turid Follestad, co-supervisor: Lars Adde, MH, NTNU.
- June 2017: Julia B. Debik (IndMat), Using Ensemble Methods to Improve the Performance of Prediction Statistical Analysis of a Myocardial Infarction Data Set from the HUNT Study. Co-supervisor: Anja Bye, MH, NTNU.
- June 2016: Fredrik Lohne Aanes (IndMat), Testing equality of the success probabilities in two independent binomial distributions. Co-supervisor: Øyvind Bakke, IMF, NTNU.
- June 2016: Lene Maria Sundbakk (IndMat), Statistical methods to detect genotype-phenotype association using genetic similarity matrices.
- June 2016: Marthe Larsen (IndMat), Statistical analysis of factors influencing early neurological deterioration after acute ischemic stroke using sparse logistic regression. Co-supervisor: Turid Follestad, MH, NTNU.
- June 2014: Thea Bjørnland (IndMat), Statistical Methods for Genetic Association Studies under the Extreme Phenotype Sampling Design.
- March 2014: Karen Sofie Sollie Holt (IndMat), Statistical Modelling and Inference for Long Gene Expression Time Series.
- June 2013: Marit Runde (IndMat), Statistical methods for detecting genotype-phenotype association in the presence of environmental covariates.
- July 2012: Christian Magnus Page (MathS), Estimating Time-Continuous Gene Expression Profiles Using the Linear Mixed Effects Framework. Co-supervisor Torunn Bruland, MH, NTNU.
- June 2012: Kari Krizak Halle (IndMat), Statistical Methods for Multiple Testing in Genome-Wide Association Studies.
- June 2011: Eirin Tangen Østgård (IndMat), Statistical analysis of biological data using linear mixed effects data.
- June 2011: Tonje Gulbrandsen Lien (IndMat), Statistical modelling of qPCR data.
- June 2010: Anita Sklander Fineid (IndMat), Genetic Association in Case-Control Studies: Methods for Calculating Statistical Significance. Main supervisor: Øyvind Bakke (IMF, NTNU).
- June 2010: Ida Henriette Caspersen (Bio), Leukocyte count and expression of diabetes markers in response to dietary carbohydrate restriction. Institute of Biology, NTNU. Main supervisor: Berit Johansen.
- July 2008: Torbjørn Lilleheier (IndMat), Analysis of common cause failures in complex safety instrumented systems. Joint supervisor: Marvin Rausand, NTNU.
- June 2008: Marita Risberg (IndMat), Combining the MAX test with methods for Family-wise error rate.
- June 2008: Erik Edsberg (IndMat), A statistical simulation-based framework for sample size considerations in case-control SNP association studies.
- June 2007: Solfrid Håbrekke (IndMat), Metodar og data til estimering av parametrar i risikomodellen for transport av farleg gods. Joint supervisor: Jørn Vatn, IVT, IPK, NTNU.
- June 2007: Ingvild Bore (IndMat), Statistisk analyse av celleprøver innen kreftdiagnose: Multinomisk logistisk regresjon - modelltilpasning og prediksjon. Joint supervisor: Stian Lydersen, MH, NTNU.
- June 2006: Øystein Widar Bråthen (IndMat), Multilevel Analysis Applied to Fetal Growth Data with Missing Values. Joint supervisor: Stian Lydersen, MH, NTNU.
- February 2005: Marit Sellie Eriksen (IndMat), Computing expression summary measures for Affymetrix microarray data.
- June 2003: Hilde-Gunn Bruu (IndMat), Statistical designs for cDNA microarray experiments. Joint supervisor: John Tyssedal, IMF, NTNU.
- June 2003: Egil Ferkingstad (IndMat), Estimating the proportion of true null hypotheses, with application to DNA microarray data. Joint supervisor: Bo Lindqvist, IMF, NTNU.
Former bachelor students
- May 2021: Rasmus Hilmer Henninen (BMAT), Explainable Artificial Intelligence. Explaining predictions with partial dependence and accumulated local effects plots.
- May 2021: Haakon Muggerud (BMAT), Feed-forward neural networks and how to explain their predictions.