Project 1

You may work in groups of up to two students. The final report should be be written in R markdown or LaTeX using knitr (preferably) and uploaded as a single pdf file to this dropbox folder by October 1. The first page must include your email adresse(s) and your candidate number(s) (not your studentnumber). Edit Sept. 30: The candidate numbers should appear on studweb sometime today according to the exam administration.

The report should be a self-contained document written in english or norwegian written as a proper mathematical text and including R-code and output.

You can get assistance with the project at Banachrommet, each friday at 11-12 or ask questions in the Discourse forum.

1

a) Derive the log likelihood, the score function, and the expected Fisher information (expressed both as sums and in matrix notation if appropriate) for a GLM where the response variable is Bernoulli distributed for the probit choice of link function.

b) Write a function myglm in R that fits this GLM using the Fisher scoring algorithm. The function should take three input arguments:

formula: a model formula

data: a data frame containing the data

start: an optional vector containing starting values defaulting to \(\boldsymbol{0}\).

The function should return a list containing the following elements

coefficents: A matrix containing the maximum likelihood estimates of each parameter in the first column and estimates of their standard errors based on the final Fisher information matrix. Assign sensible names to each row and column (colnames( ), rownames( )).

deviance: The deviance of the model, that is twice the difference in log likelihood between the fitted and the saturated model.

vcov: The estimated variance matrix of the coefficients.

Note that the design matrix \(\mathbf{X}\) needed in the computations can easily by constructed using

X <- model.matrix(formula, data)

Also note that the standard normal cdf \(\Phi\) and pdf \(\phi\) are available in R as the functions pnorm and dnorm.

c) Test that your function works by comparing the results on simulated data to those obtain using the glm and vcov functions in R. You'll need the rbinom function to simulate from different Bernoulli distributions.

2

In this problem we will use the dataset juul.girl containing variables from an investigation performed by Anders Juul at the danish Rigshospital on a group of healthy school children. Load the data using the following R code:

library(ISwR) # Install the package if needed
data(juul)
juul$menarche <- juul$menarche - 1
juul.girl <- subset(juul,age>8 & age<20 & complete.cases(menarche))

We will use the variable menarche as the response variable. This variable indicates for each girl whether or not she has had her first period coded as 0 for "no" and 1 for "yes" at the time of the investigation. The variable age is the age of the each girl at the time of the investigation in years.

a) Fit the probit model either using the code from problem 1 or the built-in function glm in R. Use a Wald-test to test if there is an effect of age on the probability that each girl has had her first period.

b) Explain how the fitted model corresponds to a latent variable model where each girl's age at her first period are latent normally distributed random variables \(T\) with mean \(\mu\) and standard deviation \(\sigma\). What is the relationship between \(\mu\) and \(\sigma\) and the regression parameters \(\beta_0\) and \(\beta_1\). Compute the MLEs of \(\mu\) and \(\sigma\) and derive and compute their standard errors via the delta method.

c) Fit a model instead using the logit link function. This implicitly amounts to making another assumption about the distribution of the latent ages \(T\). Which distribution? Compute estimates of the mean and standard deviation of \(T\) based on this model. Known properties of the distribution in question are listed in wikipedia as well as `Tabeller og formler i statistikk` and you do not need to rederive these.

d) Since the latent ages discussed in point b are necessarily positive random variables, it may be more reasonable assume that these ages instead follow a log-normal distribution. Fit this alternative model to the data and compute estimates of the mean and standard deviation of \(T\) based on this model.

e) Finally, show that a GLM including the log(age) as explanatory variable and a cloglog as the link function corresponds to assuming that \(T\) has Weibull distribution. Use this to estimate the Weibull shape and scale parameter and the mean and standard deviation of \(T\).

f) Discuss possible reasons for preferring each of the models in point a, c, d and e.