Master and bachelor projects offered by Jarle Tufto

Approximate frequentist prediction intervals for GLMs

For generalized linear models (GLMs), current methods for constructing prediction intervals resort to simulation-based approaches, typically relying on the approximate/asymptotic normality of \(\hat{\boldsymbol{\beta}}\) or on parametric bootstrapping, and then computing the prediction interval from quantiles of the simulated response conditional on simulated realisations of the linear predictor.

In project 2, problem 1c in TMA4315 Generalized linear models 2021, we saw that for linear models (LMs) the limits of a prediction interval are equal to the critical values of a kind of hypothesis test of whether a particular observation is an outlier.

This suggests an alternative method for constructing (approximate) prediction intervals also for GLMs, via the critical values of a similar kind of outlier test based on the asymptotic chi-square distribution of the likelihood ratio for the null hypothesis that an observation is not an outlier against the alternative that it is. This idea seems to have received only limited theoretical attention; see for example Hinkley 1979 and Bjørnstad 1990. The idea may apply similarly to other statistical models, such as ARIMA models in time series analysis.

An aim of this project would be to compare the coverage of prediction intervals constructed from this idea with that of intervals constructed by existing methods. A related topic is computational methods for finding profile-likelihood-based confidence intervals for predicted probabilities in, for example, logistic regression, as I discuss here.
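As a concrete illustration of the outlier-test construction (a sketch only, not the project's method: an i.i.d. Poisson sample stands in for a fitted GLM, and the data are made up), the prediction interval consists of all values of a new observation that are not rejected by the likelihood-ratio outlier test at the 5% level:

```python
import math

# Hypothetical i.i.d. Poisson sample standing in for a fitted GLM
data = [3, 5, 4, 6, 2, 4, 5, 3, 4, 4]

def pois_loglik(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def lr_stat(y_new, data):
    """LR statistic for H0: y_new shares the common Poisson mean,
    against H1: y_new has its own mean (i.e. is an 'outlier')."""
    n = len(data)
    lam0 = (sum(data) + y_new) / (n + 1)          # MLE under H0
    ll0 = sum(pois_loglik(y, lam0) for y in data) + pois_loglik(y_new, lam0)
    lam1 = sum(data) / n                          # MLE for the data under H1
    lam_new = max(y_new, 1e-12)                   # MLE for y_new under H1
    ll1 = sum(pois_loglik(y, lam1) for y in data) + pois_loglik(y_new, lam_new)
    return 2 * (ll1 - ll0)

# Approximate 95% prediction interval: all y not rejected by the outlier test
chi2_95 = 3.841  # 0.95 quantile of the chi-square distribution with 1 df
interval = [y for y in range(0, 30) if lr_stat(y, data) < chi2_95]
print(min(interval), max(interval))
```

The same construction carries over to a real GLM by profiling the likelihood over the regression coefficients under both hypotheses.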

Stochastic volatility models

Stochastic volatility (SV) models are an alternative to GARCH models (covered in the time series course). The model is a form of state space model in which the log of the variance (of the noise term in the process we observe) is modelled as a separate autonomous stochastic process. In Martino et al. 2010 the log-variance follows an AR(1) process; more generally, we can let the log-variance follow its own ARMA process. This state process is not directly observed and must therefore be integrated out, and one way to do this is via a Laplace approximation computed using TMB (see below).

The square of a GARCH process is itself an ARMA process, so in that sense one might think GARCH is as flexible as an SV model, but there are certain restrictions on the parameters in the ARMA interpretation of the GARCH model.

The task is to implement an SV model and investigate how it works in practice (with both simulated and real data: how good is the Laplace approximation of the marginal likelihood, etc.), and perhaps to compare with GARCH models.
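Simulating from the basic model is straightforward; the sketch below (a minimal illustration with made-up parameter values, AR(1) log-variance as in Martino et al. 2010) generates a series together with its latent log-variance path:

```python
import math, random

def simulate_sv(n, mu=-1.0, phi=0.95, sigma_eta=0.3, seed=1):
    """Simulate an SV model: the log-variance h_t follows a stationary
    AR(1) process and the observed series is y_t | h_t ~ N(0, exp(h_t))."""
    rng = random.Random(seed)
    # draw h_0 from the stationary distribution of the AR(1) process
    h = mu + sigma_eta / math.sqrt(1 - phi ** 2) * rng.gauss(0, 1)
    ys, hs = [], []
    for _ in range(n):
        ys.append(math.exp(h / 2) * rng.gauss(0, 1))
        hs.append(h)
        h = mu + phi * (h - mu) + sigma_eta * rng.gauss(0, 1)
    return ys, hs

y, h = simulate_sv(1000)
```

Fitting the model then requires integrating out the entire latent path \(h_1,\dots,h_n\), which is where the Laplace approximation and TMB come in.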

Sampling from improper posteriors

This has traditionally been viewed as meaningless, but some authors argue that it can make sense; see Taraldsen, Tufto, Lindqvist (2021). One task is then to develop methods for computing such posteriors. One approach is to align density estimates based on separate restricted MCMC runs as described in Tufto et al. 2012, Appendix S4, but other computational approaches can certainly be constructed. That is the task of this project.

A traditional prior for a multiple regression model \(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}\) where \(\mathbf{e}\sim N(0,\sigma^2 \mathbf{I})\) is to let \(\boldsymbol\beta \sim N(\boldsymbol{\beta}_0,\boldsymbol{\Sigma}_0)\). If we, in an attempt to be non-informative, let the prior variances of \(\boldsymbol{\beta}\) go to infinity and use an improper scale prior \(\pi(\sigma^2)=1/\sigma^2\) on \(\sigma^2\), credible intervals based on the resulting posterior coincide with frequentist confidence and prediction intervals. The limiting form of this prior nevertheless has several undesirable properties, in that it leads to overfitting and no shrinkage of the regression coefficients.

An alternative approach to be investigated in this project is to specify the prior of the multiple regression model in terms of the model reparameterized as a multivariate normal distribution covering not only the response but also the covariates, that is, \(\mathbf{z}=(z_1,z_2,\dots,z_{p+1})=(y,x_1,x_2,\dots,x_p) \sim N(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma})\mathbf{R}\operatorname{diag}(\boldsymbol{\sigma}) )\), where \(\mathbf{R}\) is a matrix containing the \(p(p+1)/2\) correlations and \(\boldsymbol{\sigma}=(\sigma_1,\sigma_2,\dots,\sigma_{p+1})\) are the marginal standard deviations. A possible prior is then to assign independent improper scale priors (Berger, 1985, p. 86) to each marginal standard deviation \(\sigma_i\), perhaps a jointly uniform informative prior on the elements of \(\mathbf{R}\) within the space of positive definite correlation matrices, and a uniform improper prior over \(\mathbb{R}^{p+1}\) on the location vector \(\boldsymbol{\mu}\).
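In this reparameterization the regression coefficients are derived parameters; for \(p=1\) (one covariate) the mapping from the joint-normal parameters is explicit. A minimal sketch, with made-up parameter values:

```python
# Hypothetical joint-normal parameters for p = 1 (one covariate):
# (y, x) jointly normal with means (mu_y, mu_x), marginal standard
# deviations (s_y, s_x) and correlation rho.
mu_y, mu_x = 1.0, 2.0
s_y, s_x, rho = 2.0, 0.5, 0.6

# Regression of y on x, derived from the conditional distribution y | x:
beta1 = rho * s_y / s_x                 # slope
beta0 = mu_y - beta1 * mu_x             # intercept
resid_var = s_y ** 2 * (1 - rho ** 2)   # conditional variance of y given x
```

A prior on \((\boldsymbol\mu, \boldsymbol\sigma, \mathbf{R})\) thus induces a (generally dependent, shrinking) prior on \((\beta_0, \beta_1, \sigma^2)\) through this mapping.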

Preliminary MCMC results for \(p=1\) (one covariate) indicate that the posterior becomes proper with only \(n=2\) observations, unlike the traditional prior above, for which the model degenerates to a perfect fit. More generally, like Lasso- and Ridge-based techniques, this prior shrinks the regression coefficients (in the reparameterized model these are derived parameters). Unlike Lasso- and Ridge-based methods, however, this prior incorporates prior negative correlations between the regression coefficients, which seems reasonable: if one regression coefficient is large, other coefficients are a priori more likely to be small.

Interestingly, for \(p=0\) (no covariates) and \(n=2\) observations, the predictive density of a new observation \(y_3\) satisfies a consistency criterion in that \[ P(y_3 \le y_{(1)}|y_1,y_2) = P(y_{(1)} \lt y_3 \le y_{(2)}|y_1,y_2) = P(y_{(2)}\lt y_3|y_1,y_2)=1/3 \] where \(y_{(i)}\), \(i=1,\dots, n\), are order statistics of the observed values. Does the predictive density have similar consistency properties more generally?
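For \(p=0\) the prior above reduces to the classical \(\pi(\mu,\sigma)\propto 1/\sigma\), for which the predictive of \(y_3\) is the standard result \(\bar y + s\sqrt{1+1/n}\,t_{n-1}\), a Cauchy distribution when \(n=2\). The 1/3 property can then be checked directly (arbitrary observed values):

```python
import math

y1, y2 = 1.3, 4.1                        # arbitrary observed values
ybar = (y1 + y2) / 2
s = abs(y1 - y2) / math.sqrt(2)          # sample sd with n - 1 = 1
scale = s * math.sqrt(1 + 1 / 2)         # predictive scale, n = 2

def cauchy_cdf(x, loc, sc):
    # CDF of the Cauchy (t with 1 df) distribution
    return 0.5 + math.atan((x - loc) / sc) / math.pi

p_low = cauchy_cdf(min(y1, y2), ybar, scale)
p_high = 1 - cauchy_cdf(max(y1, y2), ybar, scale)
p_mid = 1 - p_low - p_high
# p_low, p_mid and p_high each equal 1/3 regardless of (y1, y2)
```

The standardized distance from \(\bar y\) to either order statistic is \(1/\sqrt{3}\), and the Cauchy CDF at \(\pm 1/\sqrt{3}\) gives exactly the thirds.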

Counterintuitively, classical methods of variable selection in multiple regression tell us to exclude covariates with small effects, thus throwing away information. One aim of the above prior is to avoid covariate selection in this sense. Unlike AIC-based model selection and the Lasso, covariates are never excluded completely from the model, even for \(p>n\).

Template model builder

Template model builder (TMB) is an R package for fitting complex non-linear latent variable models (for example state-space and mixed models), either by maximising the Laplace approximation of the marginal likelihood (the probability of the observed data with the latent random effects integrated out) or by using this approximation of the likelihood in further MCMC-based methods (as in the R package tmbstan). TMB computes the Laplace approximation very efficiently using automatic differentiation and automatic sparsity detection, based on a user-supplied C++ function computing the joint likelihood of the data and random effects. In the more recent package RTMB, this function is instead supplied by the user as an R function. For many models the Laplace approximation works well, even for non-Gaussian likelihoods, since most likelihoods become approximately Gaussian unless the sample size is small. A possible project would be to use TMB to fit some form of latent variable model and then study how well the Laplace approximation works. One way to do this would be to compare posterior distributions computed via tmbstan with and without the Laplace approximation applied to the random effects.
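TMB itself works with C++/R, but the Laplace idea can be illustrated in a few lines. The sketch below (a toy Poisson-lognormal model with a single random effect, not a TMB model; all values made up) compares the Laplace approximation of the log marginal likelihood with brute-force numerical integration:

```python
import math

# Toy latent variable model: y | u ~ Poisson(exp(beta + u)), u ~ N(0, tau^2).
# Compare the Laplace approximation of log integral p(y|u) p(u) du
# with brute-force numerical integration over u.

def joint_nll(u, y, beta, tau):
    """Negative joint log-density of (y, u)."""
    lam = math.exp(beta + u)
    nll_y = lam - y * (beta + u) + math.lgamma(y + 1)   # -log p(y|u)
    nll_u = 0.5 * (u / tau) ** 2 + 0.5 * math.log(2 * math.pi * tau ** 2)
    return nll_y + nll_u

def laplace_marginal(y, beta, tau):
    # locate the mode of the joint log-density in u by Newton's method
    u = 0.0
    for _ in range(50):
        lam = math.exp(beta + u)
        grad = lam - y + u / tau ** 2
        hess = lam + 1 / tau ** 2
        u -= grad / hess
    hess = math.exp(beta + u) + 1 / tau ** 2  # curvature at the mode
    return -joint_nll(u, y, beta, tau) + 0.5 * math.log(2 * math.pi / hess)

def exact_marginal(y, beta, tau, lim=10.0, n=4000):
    # trapezoidal integration of exp(-joint_nll) over u in [-lim, lim]
    du = 2 * lim / n
    s = 0.0
    for i in range(n + 1):
        u = -lim + i * du
        w = 0.5 if i in (0, n) else 1.0
        s += w * math.exp(-joint_nll(u, y, beta, tau))
    return math.log(s * du)

print(laplace_marginal(3, 0.0, 1.0), exact_marginal(3, 0.0, 1.0))
```

In real applications the latent vector is high-dimensional and the brute-force integral is infeasible, which is exactly why TMB's sparse Laplace machinery is needed; the one-dimensional comparison above is the kind of check the project would generalize.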

Approximate Bayesian Computation

In various applied settings, the mechanism generating the data is well understood but leads to a likelihood which cannot easily be computed, and MCMC may be too slow. Simulating data from the model may nevertheless be easy. For an example, see this paper. One way to do inference in such cases is approximate Bayesian computation (ABC). In its simplest form, this involves simulating samples from the prior distribution of the parameters, simulating data given these parameter values, and accepting those samples for which certain carefully chosen summary statistics are within sufficiently small tolerance limits of the values computed from the observed data. This can be shown to produce samples from the posterior distribution of the parameters, given not the full data but the summary statistics. While this typically entails some loss of the information in the data, it avoids the need to compute the exact likelihood. Improvements of the algorithm such as ABC-MCMC lead to more efficient sampling from the posterior.
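The simplest rejection form of the algorithm fits in a few lines. A minimal sketch, assuming a toy normal model where the likelihood is of course available (all parameter choices hypothetical):

```python
import random, statistics

def abc_rejection(y_obs, n_samples=20000, tol=0.1, seed=42):
    """Rejection ABC for the mean of a N(mu, 1) model with prior
    mu ~ Uniform(-5, 5), using the sample mean as summary statistic."""
    rng = random.Random(seed)
    s_obs = statistics.mean(y_obs)
    accepted = []
    for _ in range(n_samples):
        mu = rng.uniform(-5, 5)                      # draw from the prior
        sim = [rng.gauss(mu, 1) for _ in y_obs]      # simulate data
        if abs(statistics.mean(sim) - s_obs) < tol:  # keep if summaries match
            accepted.append(mu)
    return accepted

rng = random.Random(0)
y_obs = [rng.gauss(2.0, 1) for _ in range(20)]  # synthetic "observed" data
post = abc_rejection(y_obs)
```

Here the sample mean is a sufficient statistic, so as the tolerance shrinks the accepted draws approach the exact posterior; with insufficient summaries the accepted draws target the posterior given the summaries only, as described above.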

Evolutionary responses to fluctuating selection

Different species may respond to environmental change through genetic evolution as envisioned by Darwin, plasticity, diversifying bet-hedging, as well as, more recently recognized, phenotypic evolution through epigenetic or maternal effects (see e.g. Tufto 2015 and Chevin et al. 2015 and references therein). Understanding what conditions favour these adaptations is important for predicting how biological populations will respond to ongoing anthropogenic global warming. Possible projects could involve either theoretical stochastic modelling of the evolutionary responses arising for specific patterns of temporal and possibly spatial environmental variation, or statistical inference of the parameters of such models, for example using state space modelling. You should have some interest in theoretical evolutionary biology or have enjoyed the course in time series modelling.

Spatially explicit capture-recapture models

Traditional methods for estimating population density in biology rely on marking a known number \(n\) of individuals, constituting an unknown fraction \(p\) of the population. The fraction of marked individuals in a recapture sample can then be used to estimate \(p\) and the total population size \(N\). With repeated recapture samples, survival parameters can also be estimated. Such methods rest on the assumption that all individuals mix completely within an often arbitrarily defined study area.
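The classical estimator implied by this description is Lincoln-Petersen; a minimal sketch with made-up numbers (the Chapman variant reduces small-sample bias):

```python
def lincoln_petersen(n_marked, n_sample, m_in_sample):
    """Classical Lincoln-Petersen estimate of population size N: the
    marked fraction in the recapture sample estimates p = n_marked / N,
    so N is estimated by n_marked * n_sample / m_in_sample."""
    return n_marked * n_sample / m_in_sample

def chapman(n_marked, n_sample, m_in_sample):
    """Chapman's bias-corrected variant of the same estimator."""
    return (n_marked + 1) * (n_sample + 1) / (m_in_sample + 1) - 1

# Hypothetical numbers: 100 marked; a recapture sample of 80 contains 20 marked
print(lincoln_petersen(100, 80, 20))  # 400.0
```

The spatially explicit models of this project generalize exactly this logic by replacing the complete-mixing assumption with an explicit model of individual movement and trap attraction.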

Independently of this, capture-recapture data have been used to estimate how far individuals move from one generation to the next (an important parameter in various theoretical spatial evolutionary and ecological models).

By modelling the movements of the individuals in a sample explicitly, MCMC methods can be used to simultaneously estimate lifetime and dispersal parameters, population density, and parameters characterizing the attraction properties of each trap. A possible project would be to develop such estimation methods further.

See this paper for more details.

Probabilistic multiple-choice tests

Multiple-choice tests where the participant ticks one of a given number of alternatives on each question provide imprecise information about the participant's level of knowledge, especially if the participant knows little about the subject of the test. An alternative test format is to let the participant state their degree of belief in the different alternatives (subjective probabilities). Let the probabilities the participant assigns to the correct alternatives on questions \(1,2,\dots,n\) be \(p_1,p_2,\dots,p_n\). It is then optimal for the participant to report their true subjective probabilities if the score awarded is \(\sum_i \log p_i\) (see e.g. Bernardo, 1997).
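A small numerical illustration of why the log score makes honest reporting optimal (the beliefs below are hypothetical; the inequality is the non-negativity of the Kullback-Leibler divergence between the true and reported distributions):

```python
import math

def expected_log_score(q, p):
    """Expected log score on one question when the participant's true
    beliefs over the alternatives are q but the reported probabilities
    are p: sum_k q_k * log(p_k)."""
    return sum(qk * math.log(pk) for qk, pk in zip(q, p))

q = [0.6, 0.3, 0.1]  # hypothetical true beliefs over three alternatives
honest = expected_log_score(q, q)

# Any distorted report scores worse in expectation, since
# honest - expected_log_score(q, p) = KL(q || p) >= 0.
for p in ([0.8, 0.15, 0.05], [1 / 3, 1 / 3, 1 / 3], [0.5, 0.4, 0.1]):
    assert expected_log_score(q, p) <= honest
```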

A possible project would be to carry out such a probabilistic multiple-choice test (e.g. among fellow students and staff) and investigate how well such a scoring system works compared to traditional multiple-choice tests (both in theory and in practice). Since each participant can compute their own (subjectively) expected score, this test format also yields information on whether each participant under- or overestimates their own abilities.

Estimating transitive and intransitive dominance hierarchies

The Bradley-Terry model can be used to model preferences in choices between two alternatives, dominance hierarchies in biology, or probabilities of winning or losing in games with two players at a time. It is of interest to model the strength parameters of the different individuals/players as explained by characteristics of those individuals (covariates). This can lead to hierarchies that are intransitive (see here for examples). At the same time, it is natural to assume that the strength parameters of different individuals/players, or residual variation not explained by covariates, come from a common distribution. This creates dependence in the data. A possible project would be to estimate the parameters of such a hierarchical model using Gibbs sampling or similar. A suitable task could also be to implement this in TMB (see the project above).
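The basic (transitive, fixed-effects) Bradley-Terry model is easy to fit by the classical MM/Zermelo iteration; a minimal sketch with a made-up win matrix, as a starting point before adding covariates or random effects:

```python
import math

def bt_prob(lam_i, lam_j):
    """Bradley-Terry probability that player i beats player j,
    with strengths on the log scale."""
    return 1 / (1 + math.exp(lam_j - lam_i))

def fit_bt(wins, n_players, iters=500):
    """Fit Bradley-Terry strengths (positive scale) by the classical
    MM/Zermelo iteration; wins[i][j] = number of times i beat j."""
    w = [1.0] * n_players
    for _ in range(iters):
        new = []
        for i in range(n_players):
            total_wins = sum(wins[i][j] for j in range(n_players) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (w[i] + w[j])
                        for j in range(n_players) if j != i)
            new.append(total_wins / denom if denom > 0 else w[i])
        s = sum(new)
        w = [x * n_players / s for x in new]  # fix the scale (sum = n_players)
    return w

# Hypothetical results: player 0 beats 1 and 2 most of the time, etc.
wins = [[0, 8, 9], [2, 0, 7], [1, 3, 0]]
w = fit_bt(wins, 3)
```

Note that this plain model is always transitive by construction; the intransitive and hierarchical extensions described above require additional structure on the win probabilities.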

2024-03-20, Jarle Tufto