Project 2 (due October 29)

Problem 1 (40%)

In this problem we will model the relation between brain and body size for 62 species of mammals (within-species averages). Load the data using the following command:

  mammals <- read.table(
    "https://www.math.ntnu.no/~jarlet/statmod/mammals.dat",
    header = TRUE)

a) Find a good linear model (LM) modelling the conditional distribution of brain size (in grams) for a given body mass (in kg). You're allowed to apply any transformation to both brain size and body mass. You'll probably want to create scatter plots of different candidate transformations of brain and body size.
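
As a possible starting point for the exploration in a), here is a minimal sketch in R, not a definitive solution. It uses simulated stand-in data so that it runs without network access. The column names `body` and `brain` are assumptions and should be checked against `names(mammals)`, and the log-log scale is only one candidate transformation.

```r
# Simulated stand-in for the real data. The column names `body` (kg) and
# `brain` (g) are assumptions and may differ in mammals.dat.
set.seed(1)
sim <- data.frame(body = exp(rnorm(62, mean = 1, sd = 3)))
sim$brain <- exp(2 + 0.75 * log(sim$body) + rnorm(62, sd = 0.7))

# Compare the raw and the log-log scale side by side
# (only drawn in an interactive session).
if (interactive()) {
  par(mfrow = c(1, 2))
  plot(brain ~ body, data = sim, main = "Raw scale")
  plot(log(brain) ~ log(body), data = sim, main = "Log-log scale")
}

# Candidate LM: log brain size is linear in log body mass.
fit_a <- lm(log(brain) ~ log(body), data = sim)
print(summary(fit_a))
```

With the real data, replace `sim` by `mammals` and inspect residual plots (for example `plot(fit_a)`) before settling on a transformation.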

b) Extend your model (\(H_0\)) from point a) to include an additional parameter such that the expected human brain size differs from the expected brain size for a mammal with the same body mass as humans. How much larger is brain size in humans compared to the expectations based on the other mammals in the data set? Carry out an exact one-sided test of the hypothesis that this difference is positive (\(H_1\)).
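
One way to set up the extension in b) is an indicator covariate for the human observation. The sketch below assumes a log-log LM and uses simulated data in which row 1 plays the role of humans. The names `body`, `brain` and `human` are hypothetical, and in the real data the human row has to be located by its species label.

```r
# Simulated stand-in data; row 1 is treated as the "human" row (hypothetical).
set.seed(2)
n <- 62
sim <- data.frame(body = exp(rnorm(n, mean = 1, sd = 3)),
                  human = as.numeric(seq_len(n) == 1))
sim$brain <- exp(2 + 0.75 * log(sim$body) + 1 * sim$human + rnorm(n, sd = 0.7))

# Extended model: beta_2 is the human deviation on the log scale.
fit_b <- lm(log(brain) ~ log(body) + human, data = sim)
est <- coef(summary(fit_b))["human", ]

# One-sided p-value for H1: beta_2 > 0, based on the t distribution
# (exact under the normal LM assumptions).
t_obs <- unname(est["t value"])
p_one_sided <- pt(t_obs, df = df.residual(fit_b), lower.tail = FALSE)

# On the original scale, exp(beta_2) is a multiplicative factor.
ratio <- exp(unname(est["Estimate"]))
print(c(t = t_obs, p = p_one_sided, ratio = ratio))
```

The one-sided p-value is half the two-sided one reported by `summary` when the estimate is positive, and one minus that half otherwise.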

c) Assuming that human brain size is distributed in the same way as in other mammals, derive an exact one-sided \((1-\alpha)\)-prediction interval \((-\infty,U)\) for human brain size based on all mammals in the data set except humans and let \(A\) denote the event that this interval does not include the observed human brain size in the data. Let \(B\) denote the event that we reject the above \(H_0\) in favour of \(H_1\) at a significance level of \(\alpha\) using the test from point b. Are the two events \(A\) and \(B\) equivalent? Why or why not? Hint: For the extended model in point b, consider the profile log-likelihood \(l_p(\beta_0,\beta_1)=\sup_{\beta_2} l(\beta_0,\beta_1,\beta_2)\) to find the MLE of \((\beta_0,\beta_1)\) and the MLE of \(\beta_2\) given \(\hat\beta_0,\hat\beta_1\).

d) As an alternative to the extended LM in point b, fit a similar GLM assuming instead that mammalian brain size for a given body size follows a gamma distribution.
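
A gamma GLM comparable to a log-log LM could look like the following sketch. The log link is an assumption chosen here so that the linear predictor has the same form as in a log-log LM, and the data are simulated with hypothetical column names `body`, `brain` and `human`.

```r
# Simulated gamma-distributed stand-in data (hypothetical column names).
set.seed(3)
n <- 62
sim <- data.frame(body = exp(rnorm(n, mean = 1, sd = 3)),
                  human = as.numeric(seq_len(n) == 1))
mu <- exp(2 + 0.75 * log(sim$body) + 1 * sim$human)
shape <- 2
sim$brain <- rgamma(n, shape = shape, rate = shape / mu)

# Gamma GLM with log link: log E(brain) is linear in log(body) and human.
fit_d <- glm(brain ~ log(body) + human,
             family = Gamma(link = "log"), data = sim)
print(summary(fit_d))

# summary() reports a moment (Pearson) estimate of the dispersion phi;
# the gamma shape parameter is roughly 1 / phi.
phi_hat <- summary(fit_d)$dispersion
shape_hat <- 1 / phi_hat
print(shape_hat)
```

Note that the dispersion reported by `summary` is not the maximum likelihood estimate, which may matter when likelihoods are compared later.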

e) Based on the models in b and d, carry out Wald and likelihood ratio tests (or tests that are equivalent to these tests) of the null hypothesis that the 3/4-scaling law for metabolic rate vs. body mass (see for example West et al. 1997) also applies to the relation between brain and body size in mammals (except perhaps for the human species). Briefly explain why and for which of the models the Wald and likelihood ratio tests differ.
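
The restriction "slope equals 3/4" can be imposed with an offset term, which gives a nested model to compare against. The sketch below assumes log-log models with hypothetical column names and simulated data. For the GLM it uses the deviance difference scaled by the estimated dispersion, which is what R's `anova(..., test = "LRT")` computes and which is only one way to carry out a likelihood-ratio type test here.

```r
set.seed(4)
n <- 62
sim <- data.frame(body = exp(rnorm(n, mean = 1, sd = 3)),
                  human = as.numeric(seq_len(n) == 1))
sim$brain <- exp(2 + 0.75 * log(sim$body) + sim$human + rnorm(n, sd = 0.7))

# LM: full model and restricted model with the slope fixed at 3/4.
fit1 <- lm(log(brain) ~ log(body) + human, data = sim)
fit0 <- lm(log(brain) ~ offset(0.75 * log(body)) + human, data = sim)
b <- coef(summary(fit1))["log(body)", ]
t_wald <- unname((b["Estimate"] - 0.75) / b["Std. Error"])
p_wald <- 2 * pt(abs(t_wald), df.residual(fit1), lower.tail = FALSE)
f_stat <- anova(fit0, fit1)$F[2]   # F test of the restriction
print(c(t_squared = t_wald^2, F = f_stat, p = p_wald))

# Gamma GLM: the same restriction via an offset.
g1 <- glm(brain ~ log(body) + human,
          family = Gamma(link = "log"), data = sim)
g0 <- glm(brain ~ offset(0.75 * log(body)) + human,
          family = Gamma(link = "log"), data = sim)
gb <- coef(summary(g1))["log(body)", ]
z_wald <- unname((gb["Estimate"] - 0.75) / gb["Std. Error"])
lr_stat <- (deviance(g0) - deviance(g1)) / summary(g1)$dispersion
p_lr <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
print(c(z_squared = z_wald^2, LR = lr_stat, p_LR = p_lr))
```

In the LM the squared Wald t statistic coincides with the F statistic, and the likelihood ratio statistic is a monotone function of F, so the tests are equivalent. In the GLM the squared Wald statistic and the deviance-based statistic generally differ.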

f) Compute the AIC for the two model alternatives in b and d to assess which is the "best" model. Note that you need to be careful to make the log likelihoods for the two model alternatives comparable. Why? Also derive an expression for the skew of the log of a gamma-distributed random variable \(Y\). Based on the estimated gamma GLM in point d, compute an estimate of the skew of log mammalian brain size for a given body size. How does this compare to the sample skew of the residuals from the LM fitted in point a? Hint: To derive an expression for the theoretical skew, first find the moment generating function of \(\ln Y\), \(M_{\ln Y}(t)=Ee^{t\ln Y}=EY^t\), and then use the corresponding cumulant generating function to derive the third central moment. You'll then encounter the polygamma functions.
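
The route suggested in the hint can be sketched as follows. The parametrisation by shape \(k\) and scale \(\theta\) is an assumption made here for illustration; in a gamma GLM with dispersion \(\phi\) one would take \(k=1/\phi\). Here \(\psi_m\) denotes the polygamma function of order \(m\), that is, the \((m+1)\)-th derivative of \(\ln\Gamma\). The second display assumes that the LM was fitted to \(\ln y\).

```latex
% Assumption: Y ~ Gamma with shape k and scale \theta (mean k\theta).
\begin{align*}
M_{\ln Y}(t) &= E\,Y^t = \theta^t\,\frac{\Gamma(k+t)}{\Gamma(k)}, \qquad t > -k,\\
K_{\ln Y}(t) &= \ln M_{\ln Y}(t) = t\ln\theta + \ln\Gamma(k+t) - \ln\Gamma(k),\\
\kappa_2 &= K_{\ln Y}''(0) = \psi_1(k), \qquad
\kappa_3 = K_{\ln Y}'''(0) = \psi_2(k),\\
\operatorname{skew}(\ln Y) &= \frac{\kappa_3}{\kappa_2^{3/2}}
  = \frac{\psi_2(k)}{\psi_1(k)^{3/2}}.
\end{align*}

% Change of variables when the LM models Z = \ln Y instead of Y:
\begin{align*}
f_Y(y) &= \frac{1}{y}\, f_{\ln Y}(\ln y)
\quad\Longrightarrow\quad
l_Y = l_{\ln Y} - \sum_{i=1}^n \ln y_i .
\end{align*}
```

The skew depends only on the shape \(k\), not on the scale, and is negative since \(\psi_2(k)<0\). In R the polygamma functions are available through `psigamma(x, deriv)`, so the expression can be evaluated as `psigamma(k, 2) / psigamma(k, 1)^1.5`.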

Problem 2 (60%)

In this problem we will apply ordinal multinomial regression to data from the chess tournament Norway Chess 2021. This is a link to the tournament regulations.

We will need to turn the tournament result data into a form suitable for ordinal regression. To do this we will use this Google Docs spreadsheet. You will need to finish filling in the data in this spreadsheet before you can start analysing the data in R. Wins to white, draws and wins to black are encoded as \(y=1\), \(y=2\) and \(y=3\) respectively. There are two types of matches, classic and armageddon, with armageddon played only if the classic match in a given round ends in a draw.

The aim of the modelling exercise is to find a good ordinal multinomial regression model for the outcome of each game (this can be judged by the AIC (or AICc) of different candidate models). The exercise is thus quite open-ended and you are free to model this whichever way you think is reasonable. You should report a precise description of the assumptions of the model(s) in suitable mathematical notation, how you arrived at your best model, statistical tests of model extensions and restrictions that you find are of interest, and mathematical descriptions and verbal interpretations of these tests.

Your model should probably involve parameters characterising the relative strength of each of the six players that participated in the tournament, perhaps when playing as white and black respectively. Or you may perhaps assume that a single parameter characterises the strength of each player and that we instead have a single parameter for the effect of playing white versus black. Page 102 in the lecture notes (a revised version of part of the lecture on October 7) has some discussion of this. The Bradley-Terry model is a closely related model (but only binomial so not directly relevant here).

Some players may also be particularly strong in matches of type armageddon, so this can perhaps also be built into the model. Alternatively, the probabilities of white wins, draws and black wins may be altered for armageddon games because the players adapt to certain scoring rules in the tournament regulations, so you may want to study those rules carefully. You may of course also consider and add other covariates that you think influence the game outcomes.

To fit ordinal regression models in R, use the function vglm in the VGAM package, with family=cumulative(parallel = TRUE, link="logitlink") or similar supplied as an argument. You may of course consider other link functions. You should perhaps also think about other choices for the parallel argument. R functions you may find useful include AIC (or maybe AICc), fitted, summary, relevel, factor( , ordered=TRUE), read.csv, cbind, pnorm, plogis, pchisq, model.matrix, substr, colnames, rmultinom, table, vcov, coef.
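
The call described above might be used as in the following sketch, which assumes that the VGAM package is installed. The data are simulated, and the data frame `games` with columns `type` and `y` is hypothetical and does not reflect the layout of the spreadsheet. The model has one intercept per cumulative probability and a single common effect of game type.

```r
library(VGAM)  # assumed to be installed

# Simulate outcomes from a cumulative logit model:
# logit P(Y <= j) = theta_j + beta * x,  j = 1, 2.
set.seed(5)
n <- 200
games <- data.frame(
  type = factor(sample(c("classic", "armageddon"), n, replace = TRUE)))
theta <- c(-1, 1)
beta <- 0.5
x <- as.numeric(games$type == "classic")
u <- runif(n)
y <- 1 + (u > plogis(theta[1] + beta * x)) + (u > plogis(theta[2] + beta * x))
games$y <- factor(y, levels = 1:3, ordered = TRUE)

# Proportional odds model: parallel = TRUE gives one common slope.
fit <- vglm(y ~ type,
            family = cumulative(parallel = TRUE, link = "logitlink"),
            data = games)
print(coef(fit))          # two intercepts and one slope
print(AIC(fit))
print(head(fitted(fit)))  # estimated P(Y = 1), P(Y = 2), P(Y = 3)
```

Player strengths could enter in the same way through additional columns in the formula, for example indicator variables built with `model.matrix`.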

Additional information

As in project 1 you may work in groups of up to two students. The final report should be written in R markdown or LaTeX using knitr (preferably) and uploaded as a single pdf file to this dropbox folder by October 29. The first page must include your email address(es) and your candidate number(s) (not your student number).

The report should be a self-contained document in English or Norwegian, written as a proper mathematical text and including R code and output.

You can get assistance with the project at Banachrommet each Friday at 11-12, or ask questions in the Discourse forum.

2022-08-01, Per Kristian Hove