## Project 2

You may work in groups of up to two students. The final report should be written in R Markdown or LaTeX using knitr (preferably) and uploaded as **a single pdf file** to this dropbox folder by October 30. The first page must include **your email address(es) and your candidate number(s)** (not your student number).

**All questions should be posted via the piazza forum.**

In this project we will use multinomial regression to analyse a dataset on marital status among individuals of European ethnicity in New Zealand. Load the `VGAM` package and the dataset by doing

    library(VGAM)
    attach(marital.nz)

We will use two of the variables in this dataset: `mstatus`, which has four levels (Divorced/Separated, Married/Partnered, Single, Widowed), and `age` (16-88). We will try to build a model predicting the probabilities of these four categories based on `age`.

**a)** First fit a multinomial regression model with a linear effect of age by doing

    mod1 <- vglm(mstatus ~ age, multinomial)
    summary(mod1)

Briefly summarise the precise assumptions of this model in a combination of English and suitable mathematical notation.

Next explain how the parameter estimates relating to age can be interpreted as certain odds ratios.
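As a notational sketch only: assume that VGAM's default reference category (the last level, here Widowed, numbered 4) is used, and write \(x_i\) for the age of individual \(i\). The symbols \(\beta_{0r}\) and \(\beta_{1r}\) are introduced here for illustration and do not appear elsewhere in this text. Under those assumptions the linear predictors take the form

\[
\log\frac{\pi_{ir}}{\pi_{i4}} = \beta_{0r} + \beta_{1r} x_i, \qquad r = 1, 2, 3, \qquad \sum_{r=1}^{4} \pi_{ir} = 1,
\]

with the responses assumed independent across individuals. Comparing two ages one year apart then gives

\[
\frac{\pi_{r}(x+1)/\pi_{4}(x+1)}{\pi_{r}(x)/\pi_{4}(x)} = e^{\beta_{1r}},
\]

so \(e^{\beta_{1r}}\) is a ratio of odds of category \(r\) *relative to the reference category*, not of category \(r\) against all other categories combined.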

Does this interpretation also apply to how \(\pi_{ir}/(1-\pi_{ir})\) for a given category \(r\) changes with age?

For the above model with a linear effect of age only, are the probabilities of belonging to the different categories necessarily monotonic functions of age? Explain why or why not.
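To build intuition for the monotonicity question, it can help to evaluate the category probabilities for some made-up coefficients. The sketch below uses three hypothetical categories `A`, `B`, `C` with invented intercepts and slopes (not estimates from `marital.nz`), and base R only.

```r
# Hypothetical coefficients for illustration; category "A" is the reference,
# so its linear predictor is fixed at 0.
age <- 16:88
eta <- cbind(A = 0,
             B = -3 + 0.10 * age,
             C = -8 + 0.20 * age)

# Multinomial logit probabilities: exponentiate and normalise each row.
probs <- exp(eta) / rowSums(exp(eta))

# Age at which the probability of the middle category "B" is largest.
peak_age_B <- age[which.max(probs[, "B"])]
peak_age_B

# Plot all three curves against age.
matplot(age, probs, type = "l", lty = 1, col = 1:3,
        xlab = "Age", ylab = "Probability")
legend("right", legend = colnames(probs), col = 1:3, lty = 1, bty = "n")
```

With these invented values the log-odds of each category against the reference are linear in age, yet the curve for `B` first rises and then falls.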

Test whether the linear effect of age is statistically significant using an appropriate test (either do the calculations manually or use `anova`).
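One possible manual calculation is a likelihood ratio test against an intercept-only model. The sketch below assumes that this is the intended comparison; the names `mod0`, `lr_stat` and `lr_df` are chosen here for illustration, and `data = marital.nz` is passed explicitly instead of relying on `attach`.

```r
library(VGAM)

# Model from a) and an intercept-only null model.
mod1 <- vglm(mstatus ~ age, multinomial, data = marital.nz)
mod0 <- vglm(mstatus ~ 1,   multinomial, data = marital.nz)

# Likelihood ratio statistic: twice the gain in maximised log-likelihood.
lr_stat <- as.numeric(2 * (logLik(mod1) - logLik(mod0)))

# Degrees of freedom: number of extra parameters in the larger model
# (one age slope per non-reference category).
lr_df <- length(coef(mod1)) - length(coef(mod0))

# Upper-tail probability of the chi-squared reference distribution.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p.value = p_value)
```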

**b)** Next, use the model fitted in a) to compute predicted probabilities for ages \(16,17,\dots,88\). This can be done by using the generic `predict` function, which calls `predictvglm` when applied to model objects of the `vglm` class. You will need to set the additional arguments `newdata` and `type` to appropriate values (see the help page of `predictvglm`). Plot the predicted probabilities against age (e.g. using `matplot` or some `ggplot2` equivalent).
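A minimal sketch of this step, assuming `type = "response"` is the appropriate setting for obtaining probabilities; the object names `ages` and `probs` and the plotting details are illustrative choices, not requirements.

```r
library(VGAM)

# Refit the model from a), passing the data explicitly.
mod1 <- vglm(mstatus ~ age, multinomial, data = marital.nz)

# New data must contain a column named like the covariate in the formula.
ages  <- 16:88
probs <- predict(mod1, newdata = data.frame(age = ages), type = "response")

# One curve per category; columns of probs are named after the levels.
matplot(ages, probs, type = "l", lty = 1, col = 1:4,
        xlab = "Age", ylab = "Predicted probability")
legend("topright", legend = colnames(probs), col = 1:4, lty = 1, bty = "n")
```

Each row of `probs` corresponds to one age and should sum to one, which is a quick sanity check before plotting.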

Do the estimated probabilities and how they depend on age look reasonable?

**c)** To improve the model, we will consider polynomial regression by including powers of `age` up to different orders in the linear predictors. This can be done by including e.g. `poly(age, 3)` on the right-hand side of the model formula; see the help page of `poly`.
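To see what `poly` actually puts into the design matrix, the following base R sketch compares its default orthogonal basis with plain powers (`raw = TRUE`). The names `P_orth` and `P_raw` are illustrative.

```r
age <- 16:88

# Default: orthogonal polynomial basis of degree 3.
P_orth <- poly(age, 3)

# raw = TRUE: the plain powers age, age^2, age^3.
P_raw <- poly(age, 3, raw = TRUE)

# The orthogonal columns are orthonormal, so their cross-product
# is (numerically) the 3 x 3 identity matrix.
round(crossprod(P_orth), 10)

# The raw columns reproduce the powers of age exactly.
all.equal(as.numeric(P_raw[, 2]), as.numeric(age)^2)
```

Both bases span the same space of cubic polynomials in age, so they give the same fitted probabilities; only the individual coefficients differ, and the orthogonal version is usually better conditioned numerically.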

Make plots similar to those in b) for each polynomial order and compute the AIC for each model alternative.
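One way to organise the comparison is to fit all candidate orders in a loop and collect their AIC values, where AIC is \(-2\) times the maximised log-likelihood plus \(2\) times the number of parameters. The sketch below is only an outline: the upper limit `max_order` is a hypothetical choice (the text does not fix one), and the object names are illustrative.

```r
library(VGAM)

max_order <- 4  # hypothetical upper limit; adjust as needed

# Fit one multinomial model per polynomial order.
fits <- lapply(seq_len(max_order), function(k) {
  f <- as.formula(paste0("mstatus ~ poly(age, ", k, ")"))
  vglm(f, multinomial, data = marital.nz)
})

# Collect the number of parameters and the AIC of each model.
n_par <- sapply(fits, function(m) length(coef(m)))
aic   <- sapply(fits, function(m) AIC(m))
data.frame(order = seq_len(max_order), n_par = n_par, AIC = aic)
```

The fitted objects in `fits` can then be passed to `predict` in the same way as in b) to produce one plot per order.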

What does model selection based on AIC aim to minimise?

Does the "best" model according to the AIC criterion look reasonable, judged by the predicted relationships with age?