# ST2304 spring 2018

## Messages

May 28: Preliminary solution to today's exam.

May 27: If you have questions before tomorrow's exam, use the Google Groups discussion forum.

May 24: The solution to the August 2017 exam has been posted. Note that some answers in the solution contain material beyond what the problem set specifically asks about. It will be useful to study this material.

**Question day, May 25**: We will be available most of the day from 9:00 to 15:00 in auditorium S7 to answer questions, e.g. about previous exam problems.

If needed, send e-mail to jarle.tufto@gmail.com, bob.ohara@ntnu.no (available Friday morning) or christoffer.h.hilde@ntnu.no.

May 9: The help session on May 11 from 12:15-16:00 will take place in S1 instead of K5.

May 4: Remember the lecture today at 14:15 in S5. There is also a session (where you can get help with exercise 8) in K5 between 12:15 and 14:00 as usual.

April 25: On May 2, I will go through problem 2c from the August 2017 exam and problems 2a and 2b from August 2016. In addition, I will summarise the two types of generalized linear models (Poisson regression and logistic regression, aka binomial regression), hypothesis testing for GLMs, and methods for testing and correcting for overdispersion.

April 25: In addition to today's lecture, there will be an extra lecture where I go through previous exam questions, probably in S5 on Wednesday, May 2 at 14:15 (same time and place). The exercise sessions on Fridays 14:15-16:00 continue on Friday, April 27, ~~May 4~~, ~~(and possibly)~~ and May 11. On Friday, May 25 (before the exam on May 28), Christophe and I will be available to answer questions between 9:00 and 15:00.

## Syllabus

There are 14 lectures, topics and slides (pdf and R Markdown) listed below - with reference to the Dalgaard textbook and links to additional learning material.

Textbook: Peter Dalgaard, Introductory Statistics with R (an ebook from Springer that NTNU has access to, but you need to be logged in from NTNU to download it).

As **a supplement to the slides from the lectures**, we also recommend chapter 3 (excluding section 3.5) in Introduction to Statistical Learning (James et al. 2013), which covers much of the same material as lectures 2-9, that is, simple and multiple linear regression, categorical covariates and interactions, as well as details on how such models are fitted with R, all through examples. This book is used in a course in statistical learning, and also comes with video lectures by the authors (youtube playlist for Chapter 3). The book is available as an ebook from Springer.

A nice and gentle, example-oriented introduction to logistic regression (binomial response) and Poisson regression (lectures 11-13) is Chapters 2 and 3 in Faraway (2006), Extending the Linear Model with R.

### Lecture 1: Basic knowledge about R

**Slides:** Lecture1.pdf and Lecture1.rmd

**Topics:** Different data types (vectors, factors, data.frames, lists), vectorised operations, getting data into R with read.csv, read.table etc. (Dalgaard ch. 1 and ch. 2, except 2.3). Built-in functions for dealing with different probability distributions (e.g. dnorm, pnorm, qnorm, rnorm) (Dalgaard ch. 3), summary statistics and graphics (Dalgaard ch. 4).
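As a small illustration of the d/p/q/r naming convention for probability distributions and of vectorised operations, here is a minimal sketch using the normal distribution. The numbers and the seed are arbitrary choices for illustration, not taken from the lecture slides.

```r
# Density, cumulative probability and quantile for the standard normal
dens <- dnorm(0)        # density at 0
prob <- pnorm(1.96)     # P(X <= 1.96)
quan <- qnorm(0.975)    # 97.5% quantile, the inverse of pnorm

# Random draws from a normal distribution with mean 10 and sd 2
set.seed(1)             # arbitrary seed, only for reproducibility
x <- rnorm(5, mean = 10, sd = 2)

# Vectorised operation: standardise the whole vector in one expression
z <- (x - mean(x)) / sd(x)
```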

**Additional resources:** Rbeginner.html and Rbeginner.Rmd.
Parts of Rintermediate.html and Rintermediate.Rmd may also be useful (perhaps ignore how to plot things using the ggplot2 package, although some of you may want to learn this on your own at some stage).

### Lectures 2 and 3: Statistical inference

**Slides:**

**Topics:** Maximum likelihood.

**Additional resources:**

- Maximum likelihood ST0103 (pdf) (in Norwegian)
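To make the idea of maximum likelihood concrete, here is a minimal sketch under stated assumptions: the data are simulated Poisson counts (not course data), and the log-likelihood is maximised numerically with `optimize` only to show that it agrees with the known closed-form estimate, the sample mean. The function name `loglik` is chosen for this example.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
y <- rpois(50, lambda = 3)        # simulated counts, true lambda = 3 (assumed)

# Log-likelihood of lambda given the observed counts
loglik <- function(lambda) sum(dpois(y, lambda, log = TRUE))

# Maximise the log-likelihood over a plausible interval
fit <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE, tol = 1e-8)
lambda_hat <- fit$maximum

# For the Poisson model the MLE has a closed form: the sample mean
c(numerical = lambda_hat, closed_form = mean(y))
```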

### Lectures 4 to 6: Simple linear regression with normal response

**Slides:**

**Topics:** Simple linear regression (lecture 3 to 5, Dalgaard ch. 6). What are the assumptions, what is the principle behind estimation of the parameters? (the mathematical derivation of the maximum likelihood estimators is not part of the curriculum). This is also covered in ST0103.
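The following sketch fits a simple linear regression to simulated data and compares the output of `lm` with the estimates computed by hand. The true parameter values, sample size and seed are assumptions made for the example. With normally distributed errors, the least squares estimates of intercept and slope coincide with the maximum likelihood estimates.

```r
set.seed(1)                            # arbitrary seed
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)    # assumed true intercept 2 and slope 0.5

# Fit the model y = beta0 + beta1 * x + error
fit <- lm(y ~ x)
coef(fit)

# The same estimates computed directly
b1 <- cov(x, y) / var(x)               # slope
b0 <- mean(y) - b1 * mean(x)           # intercept
```

Calling `summary(fit)` additionally gives standard errors, and `plot(fit)` gives residual plots that can be used to check the model assumptions.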

### Lectures 6 to 7: Multiple regression

### Lecture 8: Categorical covariates

**Slides:** Lecture8.pdf and Lecture8.rmd

**Topics:** Essentially a form of linear regression model, since categorical covariates (called factors in R) can be represented through numerical 0/1 dummy variables (Dalgaard ch. 7 except 7.1.1, 7.1.4, 7.2 and 7.4 and ch. 12.3).

Important points: How is the model specified as a model formula in R? How is the model written in mathematical notation? What is the interpretation of the parameters (e.g. when running summary on the fitted model object in R)? Specifically, you should know why we can’t estimate the intercept simultaneously with the effect of _all_ levels of a factor and that we instead typically impose the constraint that the effect of the first “reference” level of a factor is zero.
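The sketch below shows how R represents a factor through 0/1 dummy variables and how the coefficients relate to group means under the default reference-level constraint. The data are simulated, and the names `dat`, `group` and `means` are chosen for this example only.

```r
set.seed(1)                                   # arbitrary seed
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 10)),
  y     = rnorm(30, mean = rep(c(5, 7, 4), each = 10))  # assumed group means
)

fit <- lm(y ~ group, data = dat)

# Design matrix: an intercept column plus dummies for levels b and c.
# There is no dummy for the reference level a; its effect is fixed to zero.
X <- model.matrix(fit)
head(X)

# Intercept = mean of reference level a;
# groupb and groupc = differences from the mean of level a
coef(fit)
means <- tapply(dat$y, dat$group, mean)
```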

### Lecture 9: Interactions

**Slides:** Lecture9.pdf and Lecture9.rmd

**Topics:** Interactions between two numerical covariates; one numerical and one categorical covariate; two categorical covariates (Dalgaard ch. 12.5).
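As an illustration of an interaction between a numerical and a categorical covariate, here is a minimal sketch on simulated data (all parameter values are assumptions). The formula `y ~ x * g` is shorthand for `y ~ x + g + x:g`, and the interaction coefficient is the difference in slope between the two groups.

```r
set.seed(1)                                    # arbitrary seed
n <- 60
dat <- data.frame(x = runif(n),
                  g = factor(rep(c("a", "b"), each = n / 2)))
in_b <- dat$g == "b"                           # indicator for group b

# Assumed truth: slope 2 in group a, slope 2 + 1.5 in group b
dat$y <- 1 + 2 * dat$x + 0.5 * in_b + 1.5 * dat$x * in_b + rnorm(n, sd = 0.2)

fit <- lm(y ~ x * g, data = dat)
names(coef(fit))    # "(Intercept)" "x" "gb" "x:gb"

# Estimated slope in group b = slope in reference group + interaction term
slope_b <- unname(coef(fit)["x"] + coef(fit)["x:gb"])

# This equals the slope from fitting group b on its own
fit_b <- lm(y ~ x, data = subset(dat, g == "b"))
```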

### Lecture 10: Model selection

**Slides:** Lecture10.pdf and Lecture10.rmd

**Topics:** F-distribution (handout 1), Approximate/asymptotic chi-square distribution of 2 times difference of maximum log-likelihoods under H_0 and H_1 (handout 5, section 2.2), AIC (handout 1, section 5), BIC

**Additional material:** handout-1.pdf and handout-5.pdf
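The tools listed for this lecture can be sketched as follows for two nested linear models. The data are simulated, with `x2` having no true effect by assumption, so this is only an illustration of the function calls and not a worked example from the handouts.

```r
set.seed(1)                          # arbitrary seed
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + rnorm(n)        # x2 has no effect in the assumed truth

fit0 <- lm(y ~ x1)                   # model under H_0
fit1 <- lm(y ~ x1 + x2)              # model under H_1

# F test between the nested models
anova(fit0, fit1)

# 2 times the difference in maximised log-likelihoods,
# approximately chi-square with 1 degree of freedom under H_0
lr   <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
p_lr <- pchisq(lr, df = 1, lower.tail = FALSE)

# Information criteria (lower is better)
AIC(fit0, fit1)
BIC(fit0, fit1)
```

Note that the parameter count used by AIC and BIC for `lm` fits includes the residual variance, so `fit0` has 3 parameters.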

### Lectures 11-13: Generalised linear models

**Slides:**

**Topics:** Poisson response (log link) (Dalgaard ch. 15), binomial response (logit, probit and cloglog link functions) (Dalgaard ch. 13). Why do we need link functions? Theoretical reasons for using different link functions. Again, you should be able to write down the model in mathematical notation and interpret parameter estimates.
Based on the summary output for a fitted model, you should be able to compute model predictions for models with the above link functions (so you need to know about the inverse of different link functions). Deviance, how to use change in deviance in tests between nested models. Overdispersion (what is it, how do we test if there is overdispersion, how do we correct for overdispersion).
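The sketch below shows how a prediction can be computed by hand from the estimated coefficients by applying the inverse link function to the linear predictor. The data are simulated and all parameter values, as well as the covariate value `x0`, are assumptions made for the example.

```r
set.seed(1)                                     # arbitrary seed
n <- 200
x <- runif(n, -2, 2)
y_pois <- rpois(n, lambda = exp(0.5 + 0.7 * x))            # assumed truth
y_bin  <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.2 * x))

fit_pois <- glm(y_pois ~ x, family = poisson)   # log link (default)
fit_bin  <- glm(y_bin ~ x, family = binomial)   # logit link (default)

# Linear predictor eta = beta0 + beta1 * x0 at a new covariate value
x0 <- 1
eta_pois <- sum(coef(fit_pois) * c(1, x0))
eta_bin  <- sum(coef(fit_bin) * c(1, x0))

# Apply the inverse link to get back to the scale of the response
mu_pois <- exp(eta_pois)                # inverse of the log link
p_bin   <- 1 / (1 + exp(-eta_bin))      # inverse of the logit link

# For other binomial links the inverses are:
#   probit:  pnorm(eta)
#   cloglog: 1 - exp(-exp(eta))
```

The same values are returned by `predict(fit, newdata = data.frame(x = x0), type = "response")`.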

Most of this material is also covered in Handout 4, in addition to the chapters in Dalgaard and the slides from the lectures.

**Additional material:** handout-4.pdf and ch. 2 and 3 in Faraway.
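As a hedged illustration of the change-in-deviance test and of overdispersion, the sketch below uses simulated negative binomial counts, which are overdispersed relative to the Poisson by construction. The checks shown (residual deviance against a chi-square distribution, and a quasi-Poisson fit) are one common approach; the handout and lectures may present the details differently.

```r
set.seed(1)                                   # arbitrary seed
n <- 200
x <- runif(n)
y <- rnbinom(n, mu = exp(1 + x), size = 2)    # assumed overdispersed counts

fit0 <- glm(y ~ 1, family = poisson)
fit1 <- glm(y ~ x, family = poisson)

# Test between nested models using the change in deviance (1 extra parameter)
dev_change <- deviance(fit0) - deviance(fit1)
p_dev <- pchisq(dev_change, df = 1, lower.tail = FALSE)

# Overdispersion check: residual deviance compared to its degrees of freedom
p_over <- pchisq(deviance(fit1), df = df.residual(fit1), lower.tail = FALSE)

# Estimated dispersion parameter from the Pearson residuals
phi_hat <- sum(residuals(fit1, type = "pearson")^2) / df.residual(fit1)

# Correcting for overdispersion: a quasi-Poisson fit gives the same
# coefficient estimates but scales the standard errors by sqrt(phi_hat)
fit1_q <- glm(y ~ x, family = quasipoisson)
summary(fit1_q)$dispersion
```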

### Lectures 14, 15, 16: Summing up

Jarle Tufto

## Material not covered this year:

- The delta method (propagation of uncertainty).
- The multinomial distribution and contingency tables.
- Numerical methods for maximising likelihood functions of non-standard models.
- How to obtain approximate standard errors of maximum likelihood estimates based on asymptotic theory (from the Hessian matrix, i.e. the second partial derivatives of the log-likelihood function at the MLEs). This was covered only to a small extent, in lecture 2.

This means that questions related to these topics will not be part of this year's exam.

## Exercises and solutions:

## Datasets used in Lectures/Exercises

Each of the following lines of code reads one of the data sets from the web and returns it as a data.frame. Alternatively, click on the links and save the files to a local folder on your computer.

```
read.csv("https://www.math.ntnu.no/emner/ST2304/2018v/BirdEggs.csv")
read.csv("https://www.math.ntnu.no/emner/ST2304/2018v/31396_Bumpus_English_Sparrow_Data.csv")
read.csv("https://www.math.ntnu.no/emner/ST2304/2018v/Birdbrains.csv")
read.csv("https://www.math.ntnu.no/emner/ST2304/2018v/HastingsData.csv")
read.csv("https://www.math.ntnu.no/emner/ST2304/2018v/Himmicanes.csv")
read.csv("https://www.math.ntnu.no/emner/ST2304/2018v/LifeExpectancy.csv") # Healthcare data
read.csv("https://www.math.ntnu.no/emner/ST2304/2018v/Dpileatus.csv") # Pileated Woodpecker Data
```

## Previous years' exams:

Parts of previous years' exams will also be relevant preparation for this year's exam (questions that are not relevant this year are specified below).

Trial exam 2011: bokmål, solution (not 1a-c, last question of 2c, 3a-e)

June 2011: english, bokmål, nynorsk, solution (not 1a-c, 3a-c)

June 2012: english, bokmål, nynorsk, solution (not 1a-d, 2e-f)

August 2012: bokmål, solution (not 1a-d, 3a-c)

June 2013: english, bokmål, nynorsk, solution (~~all questions relevant~~ not 1a-d, 2c)

August 2013: bokmål, solution (not 1a-c, 2a-d)

May 2014: bokmål, solution (not 1-a, 3c)

May 2015: english, bokmål, nynorsk, solution (not 1a-c, second question of 2c and 2e-g)

June 2016: english, bokmål, nynorsk, solution (not 1a-b, not second sentence of 2b, 2d, 4a)

August 2016: bokmål, nynorsk, solution (not 1a-c, 3b-c)

May 2017: bokmål, solution (~~All questions relevant~~ not 1a-c)

August 2017: bokmål, nynorsk, solution* (~~All questions relevant~~ not 1a-c)

June 2018: bokmål, nynorsk, english

\* Feel free to make changes in Google Docs if you see any errors.

## Permitted aids on the exam

Support material code C: One yellow A4 sheet with your own handwritten notes (available on the 7th floor of Sentralbygg II), an approved calculator, Tabeller og formler i statistikk (Tapir forlag), and Matematisk formelsamling (K. Rottmann). In addition, the exam itself may include help pages for specific R functions that you may need.