Project 1

You may work in groups of up to two students. The final report should be be written in R markdown or LaTeX using knitr (preferably) and uploaded as a single pdf file to this dropbox folder by October 2. The first page must include your email adresse(s) and your candidate number(s) (not your studentnumber).

All questions should be posted via the piazza forum.

1

a) Derive the log likelihood, the score function, and the observed and expected Fisher information (expressed both as sums and in matrix notation if appropriate) for a GLM where the response variable is Poisson distributed for the canonical choice of link function.

b) Write a function myglm in R that fits this GLM using the Fisher scoring algorithm. The function should take three input arguments:

formula: a model formula

data: a data frame containing the data

start: an optional vector containing starting values defaulting to \(\boldsymbol{0}\).

The function should return a list containg the following elements

coefficents: A matrix containing the maximum likelihood estimates of each parameter in the first column and estimates of their standard errors based on the final Fisher information matrix. Assign sensible names to each row and column (colnames( ), rownames( )).

deviance: The deviance of the model, that is twice the difference in log likelihood between the fitted and the saturated model.

vcov: The estimated variance matrix of the coefficients.

Note that the design matrix \(\mathbf{X}\) needed in the computations can easily by constructed using X ← model.matrix(formula, data). A quadratic effect of t can be specified by a term I(t^2) in the model formula (see ?formula).

c) Test that your function works by comparing the results on simulated data to those obtain using the glm and vcov functions in R. You'll need the rpois function to simulate from different Poisson distributions.

2

Load the data.frame data into your R workspace by issuing the command

load(url("https://www.math.ntnu.no/emner/TMA4315/2020h/hoge-veluwe.Rdata"))

This data frame contains data on reproductive success of 135 females in the Dutch Hoge Veluwe great tit population during the summer of 2005. Different females initiate breeding (lays the first egg) on different dates (the variable t in the dataset (number of days after April 1)) and this can potentially influence the number of fledglings leaving the nest (the variable y) because the timing of breeding need to match the timing of important food resources (caterpillars).

We are interested in fitting a model assuming that the number of fledglings produced by each female follows a Poisson distribution and the expected number of fledglings \(\lambda_i\) produced by each female depends on \(t_i\) via a Gaussian function \[ \lambda_0 \exp(-\frac{(t_i-\theta)^2}{2\omega^2}), \] where \(\omega\) and \(\lambda_0\) are positive parameters. Our aim is to estimate the unknown parameters \(\lambda_0\), \(\theta\) and \(\omega\).

a) How would you interpret the parameters \(\lambda_0\), \(\theta\) and \(\omega\)?

b) Explain why this, after reparameterization, is a GLM and find the relations between the GLM parameters contained in \(\boldsymbol\beta\) and \((\lambda_0,\theta,\omega)\). Hint.

c) Fit the model using the function in problem 1b.

d) Is there evidence of a quadratic effect of \(t\)?

e) Using the observed deviance, test the goodness-of-fit of the fitted model. How well does the model fit the data? Examine the distribution of the \(y\) more closely and discuss how the assumptions of the model may be violated.

f) Relying on the fitted model, explain why you can now easily find the maximum likelihood estimates of \(\theta\) and \(\omega\). Compute these estimates and, using the delta method, their standard errors based on asymptotic theory.

g) As a result of global warming and warmer spring temperatures, optimal breeding dates may have advanced towards earlier dates in many populations with mean breeding dates lagging behind if the evolutionary response to changing environmental conditions are too slow. Is the mean value of t in this population significantly different from the estimated optimal date based on the fitted model? What additional assumptions or approximations do you need to make to answer this question? Update: I see that the above can be misunderstood: Ideally you should take into account the uncertainty in the estimates of both the optimal breeding time and in the mean breeding time.

3

Estimate \(\operatorname{Var}\hat{\boldsymbol\beta}\) instead by parametric bootstrapping (separately refitting the model to 1000 different realisations of \(y_1,y_2,\dots,y_n\) simulated from the model estimated in 2c). Compare the results to those obtained via the expected Fisher information.

2020-10-01, Jarle Tufto