Project 2

You may work in groups of up to two students. The final report should be written in R markdown or LaTeX using knitr (preferably) and uploaded as a single pdf file to this dropbox folder by October 30. The first page must include your email address(es) and your candidate number(s) (not your student number).

All questions should be posted via the Piazza forum.

In this project we will use multinomial regression to analyse a dataset of marital status among individuals of European ethnicity in New Zealand. Load the VGAM package and the dataset by doing

 library(VGAM)
 attach(marital.nz)

We will use two of the variables in this dataset: mstatus, which has four levels (Divorced/Separated, Married/Partnered, Single, Widowed), and age (16-88). We will try to build a model predicting the probabilities of these four categories based on age.

a) First fit a multinomial regression model with a linear effect of age by doing

 mod1 <- vglm(mstatus ~ age, multinomial)
 summary(mod1)

Briefly summarise the precise assumptions of this model in a combination of English and suitable mathematical notation.
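As a notational starting point only, the linear predictors of a baseline-category multinomial logit model could be written as below. The symbols \(\beta_{0r}\), \(\beta_{1r}\) and \(R\) are assumptions introduced here for illustration; the choice of the last level (Widowed) as reference follows the default of VGAM's multinomial family.

\[
\log\frac{\pi_{ir}}{\pi_{iR}} = \beta_{0r} + \beta_{1r}\,\mathrm{age}_i, \qquad r = 1,\dots,R-1, \quad R = 4, \qquad \sum_{r=1}^{R}\pi_{ir} = 1 .
\]

A complete summary would also state the distributional and independence assumptions on the responses.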

Next explain how the parameter estimates relating to age can be interpreted as certain odds ratios.

Does this interpretation also apply to how \(\pi_{ir}/(1-\pi_{ir})\) for a given category \(r\) changes with age? For the above model with a linear effect of age only, are the probabilities of belonging to the different categories necessarily monotonic functions of age? Explain why or why not.

Test if the linear effect of age is statistically significant with an appropriate test (either do the calculations manually or use anova).
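The manual calculation could be sketched as a likelihood ratio test against an intercept-only model. This is one possible approach, not necessarily the intended one; the names mod0, lr, df and p_value are illustrative, and data = marital.nz is passed explicitly instead of relying on attach.

```r
library(VGAM)

# Model from a) and an intercept-only null model (mod0 is an illustrative name)
mod1 <- vglm(mstatus ~ age, multinomial, data = marital.nz)
mod0 <- vglm(mstatus ~ 1, multinomial, data = marital.nz)

# Likelihood ratio statistic: twice the difference in maximised log-likelihoods
lr <- as.numeric(2 * (logLik(mod1) - logLik(mod0)))

# Degrees of freedom: number of extra parameters in mod1 (one age slope per linear predictor)
df <- length(coef(mod1)) - length(coef(mod0))

# p-value from the asymptotic chi-squared reference distribution
p_value <- pchisq(lr, df = df, lower.tail = FALSE)
c(statistic = lr, df = df, p.value = p_value)
```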

b) Next, use the model fitted in a) to compute predicted probabilities for ages \(16,17,\dots,88\). This can be done by using the generic predict function, which calls predictvglm when applied to model objects of the vglm class. You will need to set the additional arguments newdata and type to appropriate values (see the help page of predictvglm). Plot the predicted probabilities against age (e.g. using matplot or some ggplot2 equivalent).
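A minimal sketch of the prediction and plotting step, assuming type = "response" is the appropriate setting for obtaining probabilities. The names ages and probs are illustrative, and the model is refitted with data = marital.nz so that the snippet is self-contained.

```r
library(VGAM)

mod1 <- vglm(mstatus ~ age, multinomial, data = marital.nz)

# New data must contain a column named like the covariate in the formula
ages <- data.frame(age = 16:88)

# Matrix of predicted probabilities: one row per age, one column per category
probs <- predict(mod1, newdata = ages, type = "response")

# One line per category, coloured in column order
matplot(ages$age, probs, type = "l", lty = 1, col = 1:ncol(probs),
        xlab = "Age", ylab = "Predicted probability")
legend("top", legend = colnames(probs), col = 1:ncol(probs), lty = 1)
```

Each row of probs should sum to one, which is a quick sanity check on the output.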

Do the estimated probabilities and how they depend on age look reasonable?

c) To improve the model, we will consider polynomial regression by including powers of age up to different orders in the linear predictors. This can be done by including e.g. poly(age, 3) on the right-hand side of the model formula; see the help page of poly.

Make plots similar to those in b) for each polynomial order and compute the AIC for each model alternative.
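One way to compute the AIC across polynomial orders is sketched below. The range 1:4 is an arbitrary illustrative choice (the text does not fix a maximum order), and the names orders and aic are hypothetical. The formula is built from text so that each fitted model carries its own literal degree.

```r
library(VGAM)

# Illustrative range of polynomial orders; extend as needed
orders <- 1:4

aic <- sapply(orders, function(k) {
  # e.g. mstatus ~ poly(age, 3) when k = 3
  f <- as.formula(paste0("mstatus ~ poly(age, ", k, ")"))
  fit <- vglm(f, multinomial, data = marital.nz)
  AIC(fit)
})
names(aic) <- paste0("order_", orders)
aic
```

The corresponding plots can be produced by storing each fit and repeating the predict and matplot steps from b).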

What does model selection based on AIC aim to minimise?

Does the "best" model in terms of the AIC criterion look reasonable judged by the predicted relationships with age?