Master and bachelor projects offered by Jarle Tufto

GARMA versus Latent Gaussian models

Combining ideas of classical ARMA models and Generalized linear models, Benjamin et. al. 2003 introduced the generalized autoregressive moving average model (GARMA). This model is useful for modelling time series of counts, e.g. number of insurance claims or number of daily cases of a disease in a population. In the GARMA model the link-function (e.g. the log) of the expectation \(\mu_t\) of observations in the exponential family (e.g. the Poisson) conditional on the past is modelled as \[ g(\mu_t)=\eta_t=\beta + \sum_{j=1}^p\phi_j(g(y_{t-j}^*)-\beta) + \sum_{j=1}^q \theta_j (g(y_{t-j}^*)-\eta_{t-j}), \tag{1} \] that is, as linear combination of past observations and past "prediction errors". Note that \(y_{t-j}\) must be replaced by \(y_{t-j}^*=\operatorname{max}(y_{t-j},c)\) when the log-link is used where the threshold \(c\) is an additional parameter that must be chosen arbitrarily or perhaps estimated.

An obvious alternative model is to instead assume e.g. that \[ y_t \sim \operatorname{Poisson}(\mu_t) \tag{2a} \] for \(t=1,2,\dots,n\) and independent conditional on \(\{\mu_t\}\) and that \[\{\ln\mu_t\}\sim \text{Gaussian ARMA}(p,q). \tag{2b} \] It is well known that adding white noise to e.g. an AR(1) model leads to a an ARMA(1,1) model. This suggests that a GARMA model would need an extraneous MA term to fit data if the data is truly generated by (2) and that (2) will perhaps more parsimoniously fit real data if the conditional independence assumption (2a) holds.

Model (2) is a Latent Gaussian model that can be implemented in TMB and fitted by maximising the Laplace approximation of the marginal likelihood (see below). A project could involve implementing this model and see how well it works on real as well as simulated data compared to the above GARMA model.

Approximate frequentist prediction intervals for GLMs

For generalized linear models (GLMs), current methods for constructing prediction intervals resort to simulation based methods, typically relying on approximate/asymptotic normality of \(\hat{\mathbf{\beta}}\) or relying on parametric bootstrapping and then calculating the prediction interval from quantiles of the simulated response conditional on simulated realisations of the linear predictor.

In project 2, problem 1c in TMA4315 Generalized linear models 2021, for linear models (LMs), we saw that the limit(s) of a prediction interval are equal to the critical value(s) in a kind of hypothesis test that a particular observation is an outlier.

This suggests an alternative method for constructing (approximate) prediction intervals also for GLMs, via the critical values of a similar kind of outlier-test based on the asymptotic chi-square distribution of the likelihood ratio of the null hypothesis that an observation is not an outlier versus the alternative that it is. This idea seem to have gained only some theoretical attention, see for example Hinkley 1979 and Bjørnstad 1990. This idea may similarly apply to other statistical models such as ARIMA models in time series analysis.

An aim of this project would be to examine the coverage of prediction intervals constructed based on this idea with prediction intervals constructed based on already existing methods. A related topic is computational methods for finding profile-likelihood-based confidence intervals for predicted probabilities in for example logistic regression as I discuss here.

Stokastiske volatilitets modeller

Stokastisk volatilitets (SV) modeller er et alternativ til GARCH modeller (del av kurset i tidsrekkemodeller). Modellen er en form for state space modell hvor log til variansen (av støyleddet i prosessen vi observer) modelleres som en egen autonom stokastisk prosess. I Martino et. al. 2010 følger log til variansen en AR(1) prosess. Mer generelt kan vi tenke oss at log til variansen følger en egen ARMA prosess. Denne state-prosessen er ikke direkte observert og må derfor integreres ut og en måte å gjøre dette på vil kan være via Laplace approksimasjon beregnet v.h.a. TMB (se nedenfor).

Kvadratet av en GARCH prosess er i seg selv en ARMA prosess så sånn sett skulle vi kanskje tro vi GARCH er like fleksibelt som en SV-modell men det er imidlertid visse begrensinger på parametrene i ARMA tolkningen av GARCH modellen.

Oppgaven er implementere en SV modell og undersøke hvordan dette fungerer i praksis (både med simulerte og reelle data, hvor god er Laplace approksimasjon av marginalt likelihood etc.) og å kanskje sammenligne med GARCH modeller.

Tidsirreversible stokastiske volatilitetsmodeller

Modellene over (at log volatilitet følger f.eks. en egen Gaussisk ARMA prosess) innebærer en implisitt antakelse om at volatiliteten følger en tidsreversibel prossess, altså at f.eks. \(v_1,v_2,\dots, v_n\) har samme fordeling som \(v_n,v_{n-1},\dots, v_1\). Men antakelsen om tidsreversibilitet er trolig en dårlig antakelse. Det vi i stedet vanligvis observerer er at volatiliteten går opp som følge av "sjokk" i økonomien hvorpå volatiliteten avtar langsomt.

En aktuell modell vil kunne være at forekomstene av sjokk følger en Poissonprosess, at størrelsen på sjokkene kommer fra en viss parametrisk fordeling, og at virkningen av sjokkene på volatiliteten avtar eksponentielt med en viss rate. MCMC (kanskje en form for Gibbs sampling) vil trolig være en aktuell metode for å tilpasse en slik type modell til observerte data. An alternative model would be to assume that the log of the volatility follows an AR(1) process but with non-Gaussian innovations, e.g. Gamma distributed innovations with a small shape parameter which also make the process

En alternative modell som kan utforskes kan være å anta at log volatiliteten følger en AR(1) prosess men med ikke-Gaussisk støy, f.eks. iid Gamma fordelt støy med liten form-parameter. Det kan da vises at den kausale stasjonære løsningen er en tidsirreversibel prosess i motsetning til hvis prossesen er Gaussisk. Slike ikke-Gaussiske latente variable kan implementeres i TMB ved hjelp av "the transformation trick", see https://kaskr.github.io/adcomp/Introduction.html, men det er uklart hvor godt Laplace-approksimasjon av det marginale likelihoodet vil fungere i forhold til MCMC for en slik modellvariant.

Sampling from improper posteriors

This has traditionally been viewed as meaningless but some authors argue that this can make sense, see Taraldsen, Tufto, Lindqvist (2021). One task is then to develop methods for computing such posteriors. One approach is to align density estimates based on separate restricted MCMC runs as described in Tufto et. al. 2012 Appendix S4 but other computational approaches can certainly be constructed. That is the task of this project.

A traditional prior for a multiple regression model \(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}\) where \(\mathbf{e}\sim N(0,\sigma^2 \mathbf{I})\) is to let \(\boldsymbol\beta \sim N(\boldsymbol{\beta}_0,\boldsymbol{\Sigma}_0)\). If we, in an attempt to be non-informative, let the prior variances on \(\boldsymbol{\beta}\) go to infinity and we use a improper scale prior \(\pi(\sigma^2)=1/\sigma^2\) on \(\sigma^2\), credible intervals based on the resulting posterior coincide with frequentist confidence and prediction intervals. The limiting form of this prior has several undesirable properties, however, in that it leads to overfitting and no shrinkage of the regression coefficients.

An alternative approach to be investigated in this project is to specify the prior of the multiple regression model in terms of the model reparameterized as a multivariate normal distribution, that is, not only the response but also the covariates, that is, \(\mathbf{z}=(z_1,z_2,\dots,z_{p+1})=(y,x_1,x_2,\dots,x_p) \sim N(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma})\mathbf{R}\operatorname{diag}(\boldsymbol{\sigma}) )\) where \(\mathbf{R}\) is a matrix containing the \(p(p+1)/2\) correlations, and \(\boldsymbol{\sigma}=(\sigma_1,\sigma_2,\dots,\sigma_{p+1})\) are the marginal standard deviations. A possible prior is then to assign independent, improper scale priors (Berger, 1985, p. 86 on each marginal standard deviation \(\sigma_i\), perhaps a jointly uniform informative prior on the elements of \(\mathbf{R}\) within the space of positive definite correlation matrices, and a uniform improper prior over \(\mathbb{R}^{p+1}\) on the location vector \(\boldsymbol{\mu}\).

Preliminary MCMC results for \(p=1\) (one covariate) indicate that the posterior become proper with only \(n=2\) observations unlike the above traditional prior which degenerate to a perfectly fitting model. More generally, like Lasso- and Ridge-based techniques, this prior shrinks the regression coefficients (in the reparameterized model, these are derived parameters). Unlike Lasso- and Ridge-based methods, however, this prior incorporates prior negative correlations between the regression coefficients which seems reasonable – if one regression coefficient is large, other coefficients are a priori more likely to be small.

Interestingly, for \(p=0\) (no covariates) and \(n=2\) observations, the predictive density of a new observation \(y_3\) fulfills some consistency criteria in that \[ P(y_3 \le y_{(1)}|y_1,y_2) = P(y_{(1)} \lt y_3 \le y_{(2)}|y_1,y_2) = P(y_{(2)}\lt y_3|y_1,y_2)=1/3 \] where \(y_{(i)}\), \(i=1,\dots, n\), are order statistics of the observed values. Does the predictive density more generally have similar consistency properties?

Unintuitively, classical methods of variable selection in multiple regression tells us to exclude covariates with small effects, thus throwing away information. One aim of the above prior is to avoid covariate selection in this sense. Unlike AIC based model selection and Lasso, covariates are never excluded completely from the model even for \(p>n\).

Template model builder

Template model builder (TMB) is a R-package for fitting complex non-linear latent variable models (for example state-space and mixed models) by either maximising the Laplace approximation of the marginal likelihood (the probability of the observed data with latent random effects integrated out) or by using this approximation of the likelihood in further MCMC based methods (as in R-package tmbstan). Template model builder computes the Laplace approximation very efficiently using automatic differentiation and automatic sparsity detection based on a C++ function computing the joint likelihood of the data and random effects provided by the user. In the more recent package RTMB, this function is instead provided by the user as an R-function. For many models the Laplace approximation work wells, even for non-Guassian likelihoods since most likelihoods become approximately Gaussian unless the sample size is small. A possible project would be to use TMB to fit some form of latent variable model and then study how well the Laplace approximation works. One way to to this would be to compare posterior distribution computed via tmbstan with random effects integrated out with and without the Laplace approximation.

Approximate Bayesian Computation

In various applied settings, the mechanism generating the data is well understood but leads to to a likelihood which cannot easily be computed. MCMC may also be too slow. Simulating data from the model may be easy however. For an example, see this paper. One way to do inference in such cases is to use ABC. In its simplest form, this involves simulating samples from the prior distribution of the parameters, simulating data given these parameter values and accepting those samples for which certain carefully chosen summary statistics are within sufficiently small tolerance limits of the the values computed for the observed data. This can be shown to produce samples from the posterior distribution of the parameters, not given all the data, but given the summary statistics. While this typically leads to loss of some of the information in the data, this avoid the need to compute the exact likelihood. Improvements of the algorithm such as ABC-MCMC leads to more efficient sampling from the posterior.

Evolutionary responses to fluctuating selection

Different species may respond to environmental change through genetic evolution as envisioned by Darwin, plasticity, diversifying bet-hedging, as well as more recently, phenotypic evolution through epigenetic or maternal effects (see e.g. Tufto 2015 and Chevin et. al. 2015 and references therein). Understanding what conditions favour these adaptations is important in terms of predicting how biological populations will respond to ongoing anthropogenic global warming. Possible projects could be either theoretical stochastic modelling of the evolutionary responses arising for specific patterns of temporal and possibly spatial environmental variation, or could involve doing statistical inference of the parameters of such models, for example, using state space modelling. You should have some interest in theoretical evolutionary biology or you liked the course in time series modelling.

Romlig eksplisitte fangst-gjenfangstmodeller

Tradisjonelle metoder for estimering av populasjonsstetthet i biologi baserer seg på at en visst antall \(n\) men en ukjent andel \(p\) individ i en populasjon merkes. Andelen merkede individ i et gjenfangst-sample kan så brukes til å estimere andelen \(p\) og hele populasjonsstørrelsen \(N\). Ved gjentatte gjenfangstsample kan også overlevelsesparametere estimeres. Slike metoder bygger på antakelsen om at alle individ blander seg fullstendig med hverandre innenfor et ofte vilkårlig definert studieområde.

Uavhengig av dette har fangst-gjenfangstdata vært brukt for å estimere hvor mye individ forflytter seg fra en generasjon til neste (en viktig parameter i forskjellige teoretiske romlige evolusjonære og økologiske modeller).

Ved å modellere bevegelsene til ulike individ i et sample eksplisitt kan en ved hjelp av MCMC-metoder simultant estimere levetids- og spredningsparametere, populasjonstetthet, samt parametere som karakteriserer attraksjonsegenskapene til hver felle. Et mulig prosjekt vil være å videreutvikle slike estimeringsmetoder.

Se denne artikkelen for mer detaljer.

Probabilistiske flervalgsprøver

Flervalgsprøver hvor deltaker krysser av ett av et gitt antall alternativ på hvert spørsmål gir lite presis informasjon om kunnskapsnivået til deltaker, spesielt dersom deltaker har lite kunnskap i emnet pr?ven dreier seg om. En alternativ prøveform vil være å la deltaker oppgi sin grad av tro på ulike svaralternativer (subjektive sannsynligheter). La sannsynlighetene deltaker oppgir p? de korrekte svaralternativene på spørsmål 1,2,…,n være \(p_1,p_2,\dots,p_n\). Det er da optimalt for deltaker å oppgi sine subjektive sannsynligheter hvis det i poengsum gis \(\sum_i \log p_i\) (se f.eks., Bernardo, 1997).

Et mulig prosjekt kan være å gjennomføre en slik probabilistisk flervalgsprøve (f.eks. blant medstudenter og ansatte) og undersøke hvor godt et slikt poengsystem fungerer i forhold til tradisjonelle flervalgsprøver (både i teori og praksis). Siden hver deltaker selv kan beregne sin (subjektivt) forventede poengsum får en ved denne prøveformen også informasjon om hvorvidt hver deltaker har for liten eller for stor tro på egne evner.

Estimering av transitive og intransitive dominanshierarkier

Bradley Terry-modellen kan brukes for å modellere preferanse i valg mellom to alternativer, dominanshierarkier i biologi, eller sannsynlighet for seier eller tap i spill med to deltakere (eller lag) av gangen. Det er ønskelig å modellere styrkeparametere til ulike individ/spillere som forklart av egenskaper ved disse individene (kovariater). Ved å inkludere en viss type interaksjonsledd mellom disse kovariatene kan hierarkier som er intransitive modelleres (se Tufto, Ringsby & Solberg (1998) for eksempler).

Samtidig er det naturlig å tenke seg at styrkeparametere til ulike individ/spillere ikke er fullstendig forklart av observerte kovariater. Ikke-forklarte egenskaper ved hver spiller bør da modelleres som blandede effekter (random effects) slik at vi får en form for GLMM. Denne modelvarianten er håndtert av R-pakken BradleyTerry2.

En interessant utvidelse av en slik modell vil være å la disse blandede effektene for hver enkelt spiller være vektorer med samspillsledd mellom komponentene som over. Intransitive hierarkier kan da modelleres uten observerte kovariater. Chen & Joachim, 2016 og siterende artikler ser på en slike modeller. En oppgave kan også være å forsøke implementere dette i TMB (se oppgave ovenfor) eller ved hjelp av MCMC. En utfordring med oppgaven vil være å gjøre en slik modell identifiserbar.

2025-01-27, Jarle Tufto