Projects and master projects offered by Jarle Tufto

I'm currently on sabbatical leave at CEFE, CNRS in Montpellier (autumn of 2018 and spring 2019) but will be available to supervise project and masterstudents in the autumn of 2019 and spring 2020. In general, if any of the projects below interest you, you should contact me via email or Skype and we will discuss further details.

Sampling from improper posteriors

This has traditionally been viewed as meaningless but some authors argue that this can make sense, see Taraldsen, Tufto, Lindqvist (submitted). One task is then to develop methods for computing such posteriors. One approach is to align density estimates based on separate restricted MCMC runs as described in Tufto et. al. 2012 Appendix S4 but other computational approaches can certainly be constructed. That is the task of this project.

A traditional prior for a multiple regression model \(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}\) where \(\mathbf{e}\sim N(0,\sigma^2 \mathbf{I})\) is to let \(\boldsymbol\beta \sim N(\boldsymbol{\beta}_0,\boldsymbol{\Sigma}_0)\). If we, in an attempt to be non-informative, let the prior variances on \(\boldsymbol{\beta}\) go to infinity and we use a improper scale prior \(\pi(\sigma^2)=1/\sigma^2\) on \(\sigma^2\), credible intervals based on the resulting posterior coincide with frequentist confidence and prediction intervals. The limiting form of this prior has several undesirable properties, however, in that it leads to overfitting and no shrinkage of the regression coefficients.

An alternative approach to be investigated in this project is to specify the prior of the multiple regression model in terms of the model reparameterized as a multivariate normal distribution, that is, not only the response but also the covariates, that is, \(\mathbf{z}=(z_1,z_2,\dots,z_{p+1})=(y,x_1,x_2,\dots,x_p) \sim N(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma})\mathbf{R}\operatorname{diag}(\boldsymbol{\sigma}) )\) where \(\mathbf{R}\) is a matrix containing the \(p(p+1)/2\) correlations, and \(\boldsymbol{\sigma}=(\sigma_1,\sigma_2,\dots,\sigma_{p+1})\) are the marginal standard deviations. A possible prior is then to assign independent, improper scale priors (Berger, 1985, p. 86 on each marginal standard deviation \(\sigma_i\), perhaps a jointly uniform informative prior on the elements of \(\mathbf{R}\) within the space of positive definite correlation matrices, and a uniform improper prior over \(\mathbb{R}^{p+1}\) on the location vector \(\boldsymbol{\mu}\).

Preliminary MCMC results for \(p=1\) (one covariate) indicate that the posterior become proper with only \(n=2\) observations unlike the above traditional prior which degenerate to a perfectly fitting model. More generally, like Lasso- and Ridge-based techniques, this prior shrinks the regression coefficients (in the reparameterized model, these are derived parameters). Unlike Lasso- and Ridge-based methods, however, this prior incorporates prior negative correlations between the regression coefficients which seems reasonable – if one regression coefficient is large, other coefficients are a priori more likely to be small.

Interestingly, for \(p=0\) (no covariates) and \(n=2\) observations, the predictive density of a new observation \(y_3\) fulfills some consistency criteria in that \[ P(y_3 \le y_{(1)}|y_1,y_2) = P(y_{(1)} \lt y_3 \le y_{(2)}|y_1,y_2) = P(y_{(2)}\lt y_3|y_1,y_2)=1/3 \] where \(y_{(i)}\), \(i=1,\dots, n\), are order statistics of the observed values. Does the predictive density more generally have similar consistency properties?

Unintuitively, classical methods of variable selection in multiple regression tells us to exclude covariates with small effects, thus throwing away information. One aim of the above prior is to avoid covariate selection in this sense. Unlike AIC based model selection and Lasso, covariates are never excluded completely from the model even for \(p>n\).

Template model builder

Template model builder (TMB) is a R-package for fitting complex non-linear latent variable models (for example state-space and mixed models) by either maximising the Laplace approximation of the marginal likelihood (the probability of the observed data with latent random effects integrated out) or by using this approximation of the likelihood in further MCMC based methods (as in R-package tmbstan). Template model builder computes the Laplace approximation very efficiently using automatic differentiation and automatic sparsity detection based on a C++ template function computing the joint likelihood of the data and random effects provided by the user. For many models the Laplace approximation work wells, even for non-Guassian likelihoods since most likelihoods become approximately Gaussian unless the sample size is small. A possible project would be to use TMB to fit some form of latent variable model and then study how well the Laplace approximation works. One way to to this would be to compare posterior distribution computed via tmbstan with random effects integrated out with and without the Laplace approximation.

Approximate Bayesian Computation

In various applied settings, the mechanism generating the data is well understood but leads to to a likelihood which cannot easily be computed. MCMC may also be too slow. Simulating data from the model may be easy however. For an example, see this paper. One way to do inference in such cases is to use ABC. In its simplest form, this involves simulating samples from the prior distribution of the parameters, simulating data given these parameter values and accepting those samples for which certain carefully chosen summary statistics are within sufficiently small tolerance limits of the the values computed for the observed data. This can be shown to produce samples from the posterior distribution of the parameters, not given all the data, but given the summary statistics. While this typically leads to loss of some of the information in the data, this avoid the need to compute the exact likelihood. Improvements of the algorithm such as ABC-MCMC leads to more efficient sampling from the posterior.

Evolutionary responses to fluctuating selection

Different species may respond to environmental change through genetic evolution as envisioned by Darwin, plasticity, diversifying bet-hedging, as well as more recently, phenotypic evolution through epigenetic or maternal effects (see e.g. Tufto 2015 and Chevin et. al. 2015 and references therein). Understanding what conditions favour these adaptations is important in terms of predicting how biological populations will respond to ongoing anthropogenic global warming. Possible projects could be either theoretical stochastic modelling of the evolutionary responses arising for specific patterns of temporal and possibly spatial environmental variation, or could involve doing statistical inference of the parameters of such models, for example, using state space modelling. You should have some interest in theoretical evolutionary biology or you liked the course in time series modelling.

Romlig eksplisitte fangst-gjenfangstmodeller

Tradisjonelle metoder for estimering av populasjonsstetthet i biologi baserer seg på at en visst antall \(n\) men en ukjent andel \(p\) individ i en populasjon merkes. Andelen merkede individ i et gjenfangst-sample kan så brukes til å estimere andelen \(p\) og hele populasjonsstørrelsen \(N\). Ved gjentatte gjenfangstsample kan også overlevelsesparametere estimeres. Slike metoder bygger på antakelsen om at alle individ blander seg fullstendig med hverandre innenfor et ofte vilkårlig definert studieområde.

Uavhengig av dette har fangst-gjenfangstdata vært brukt for å estimere hvor mye individ forflytter seg fra en generasjon til neste (en viktig parameter i forskjellige teoretiske romlige evolusjonære og økologiske modeller).

Ved å modellere bevegelsene til ulike individ i et sample eksplisitt kan en ved hjelp av MCMC-metoder simultant estimere levetids- og spredningsparametere, populasjonstetthet, samt parametere som karakteriserer attraksjonsegenskapene til hver felle. Et mulig prosjekt vil være å videreutvikle slike estimeringsmetoder.

Probabilistiske flervalgsprøver

Flervalgsprøver hvor deltaker krysser av ett av et gitt antall alternativ på hvert spørsmål gir lite presis informasjon om kunnskapsnivået til deltaker, spesielt dersom deltaker har lite kunnskap i emnet pr?ven dreier seg om. En alternativ prøveform vil være å la deltaker oppgi sin grad av tro på ulike svaralternativer (subjektive sannsynligheter). La sannsynlighetene deltaker oppgir p? de korrekte svaralternativene på spørsmål 1,2,…,n være \(p_1,p_2,\dots,p_n\). Det er da optimalt for deltaker å oppgi sine subjektive sannsynligheter hvis det i poengsum gis \(\sum_i \log p_i\) (se f.eks., Bernardo, 1997).

Et mulig prosjekt kan være å gjennomføre en slik probabilistisk flervalgsprøve (f.eks. blant medstudenter og ansatte) og undersøke hvor godt et slikt poengsystem fungerer i forhold til tradisjonelle flervalgsprøver (både i teori og praksis). Siden hver deltaker selv kan beregne sin (subjektivt) forventede poengsum får en ved denne prøveformen også informasjon om hvorvidt hver deltaker har for liten eller for stor tro på egne evner.

Estimering av transitive og intransitive dominanshierarkier

Bradley Terry-modellen kan brukes for å modellere preferanse i valg mellom to alternativer, dominanshierarkier i biologi, eller sannsynlighet for seier eller tap i spill med to deltakere av gangen. Det er ønskelig å modellere styrkeparametere til ulike individ/spillere som forklart av egenskaper ved disse individene (kovariater). Dette kan føre til hierarkier som er intransitive (se her for eksempler). Samtidig er det naturlig å tenke seg at styrkeparametere til ulike individ/spillere eller residual variasjon som ikke er forklart av kovariater kommer fra en felles fordeling. Dette vil skape avhengighet i dataene. En mulig prosjektoppgave kan være å estimere parametere i en slik hierarkiske modell ved hjelp av Gibbs-sampling e.l. En passende oppgave kan også være å implementere dette i TMB (se oppgave ovenfor)

2019-02-06, Jarle Tufto