====== TMA4315 Generalized linear models, autumn 2020 ======

===== Messages =====

June 23: There will be an **oral continuation exam** sometime in August, [[jarle.tufto@ntnu.no|let me know]] if you plan to sign up.

January 15: The grades are out. The solution is also available through the link at the bottom of the page.

^ Grade ^ Count ^
| A | 8 |
| B | 12 |
| C | 6 |
| D | 4 |
| E | 0 |
| F | 1 |

November 27: The exam will be a graded digital home exam (via Inspera). All written aids, computer aids and use of written sources on the internet are permitted. The exam questions will (as usual) aim to test understanding rather than memorisation of the course material. I will be available via zoom during the exam in case any of the questions need clarification (you will have to wait in a "zoom waiting room" if you want assistance). Any form of collaboration is not allowed, including giving assistance to other students; see [[https://innsida.ntnu.no/wiki/-/wiki/English/cheating+on+exams|these regulations for further details]]. The exam problems given to different students may differ, but I will not disclose further details on this. In accordance with the regulations of the IE faculty, after the grades have been announced, you may be required to attend a meeting to verify that what you have handed in is your own work. There will be no control interview after the exam.

November 16: Today's and tomorrow's lectures will be digital only (no physical lecture). We will probably provide guidance for project 3 both via physical attendance in the Banach room and [[https://ntnu.zoom.us/j/96224909229?pwd=a1dNOUdTQ0xtSlV4WXhVYmI2TGw2UT09|via zoom (use this link)]] (in addition to the piazza forum).

November 2: In addition to Michail from 11-12, I'll be available to answer questions about project 3 the next Wednesdays from 10-11. The next lectures are on November 16 and 17.

October 27: The link to Wood (2015) was broken and has been fixed.

October 26: As of now, we hope that the exam can take place as planned on December 15, but it remains to be seen whether the NTNU administration will be able to provide enough rooms for this to happen within the regulations of the health authorities. Plan B is a graded 4-hour home exam (no collaboration allowed).

August 31: When you attend lectures, follow [[https://innsida.ntnu.no/wiki/-/wiki/Norsk/ntnu+check-in+questions+and+answers|these instructions]].

August 24: Fill in preferred timepoints for the weekly exercise/project via the link provided below.

August 11: In accordance with [[https://innsida.ntnu.no/wiki/-/wiki/English/Infection+control+-+lecturers+responsibilities|NTNU Covid-19 regulations]], please fill in the form at [[https://tinyurl.com/glmntnu]] _every_ time you attend a lecture or exercise.

===== Practical information =====

Lectures: Tuesdays 10:15-12:00 in R9 and Mondays 16:15-18:00\(^*\) in KJL1 (* only until 17:00 in weeks 39-41).

Guidance with exercises: Wednesdays 11-12 in Sentralbygg I, room 265, until week 37; time and place after that to be decided - please fill in possible timepoints [[https://docs.google.com/spreadsheets/d/1JpZeAF3LqTm41hLgaMsMtN1qFkzp1jnNGafys9SkH6M/edit?usp=sharing|here]]. In addition we will use [[https://piazza.com/ntnu.no/fall2020/tma4315/home|this piazza forum]] (I think you need to sign up with your @stud.ntnu.no email address).

If you can't attend lectures or exercises physically, use this [[https://ntnu.zoom.us/j/97635406419?pwd=TWE0MVp0bStTMzROT0lGSGVPMVNldz09|link]]. You need to authenticate yourself using "Sign in with SSO", entering "ntnu" as "Your company name" and then your NTNU username and password.

Lecturer: [[https://piazza.com/ntnu.no/fall2020/tma4315/home|Jarle Tufto]]

Teaching assistant: [[https://www.ntnu.edu/employees/michail.spitieris|Michail Spitieris]].

Reference group: [[|NN]], [[|NN]], [[|NN]] (send me an [[jarle.tufto@ntnu.no|email]] if you want to be in the reference group).

===== Obligatory exercises =====

There will be three obligatory exercises (counting for 30% of the final grade).

[[tma4315:2020h:project-1|Project 1]]

[[Project 2]]

[[Project 3]]

You'll get help with the projects in the [[https://wiki.math.ntnu.no/drift/stud/omdatalab|Banach room on the third floor]] (each Wednesday 11-12 from week 38). If you don't have access to Nullrommet, please fill in your information in this [[https://forms.gle/3iryqAvyMYM55HF96|form]]. Due to the pandemic, the number of students that can be present in this room is limited to 25. We will make the necessary arrangements if we reach this limit.

===== Tentative curriculum =====

Fahrmeir et al. (2013) [[https://link.springer.com/book/10.1007%2F978-3-642-34333-9|(freely available on SpringerLink)]], ch. 2.1-2.4, B.4, 5.1-5.4, 5.8.2, 6, 7.1-7.3, 7.5, 7.7. We will also use some material from [[https://www.maths.ed.ac.uk/~swood34/core-statistics.pdf|Wood (2015)]], see below. This covers ordinary linear and multiple regression (mostly repetition from [[https://wiki.math.ntnu.no/tma4267|Linear statistical models]]), binary regression, Poisson and gamma regression, the exponential family and generalised linear models in general, categorical regression (including contingency tables and log-linear models, multinomial and ordinal regression), linear mixed effects models, and generalized linear mixed effects models. Also see [[https://www.ntnu.edu/studies/courses/TMA4315#tab=omEmnet|the official NTNU course info]].

===== Lectures =====

[[https://www.math.ntnu.no/emner/TMA4315/2020h/r-code-lectures.html|R code from the lectures (markdown)]] [[https://www.math.ntnu.no/emner/TMA4315/2019h/code-from-lectures.R|(2019 version)]]

[[https://drive.google.com/file/d/1iK3t23rUBzcBUB_lD30YUnuosPVvjSg5/view?usp=sharing|Lecture notes (handwritten in Notability)]]

August 18: Introduction to GLMs (ch. 2.1-2.3), the exponential family (ch. 5.4.1). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=397a7d4b-21d3-4fe5-9b06-ac1b00ada0a1|Video (first half has bad audio)]].

August 24: More on the exponential family (ch. 5.4.1). Review of the theory of linear models (ch. 3). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=fece30a3-a4b0-4953-9d7b-ac210103e698|Video (only second half)]]

August 25: Geometric views of the linear model. Sampling distributions associated with the linear model (the chi-square, t- and F-distributions). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=26c6b5a6-496c-44be-bcf1-ac2200a372c2|Video]]

August 31: Testing and fitting linear hypotheses, either via the quadratic form for \(C\hat\beta-d\) (Box 3.13 in Fahrmeir) or via the F-test based on the sums of squares of the two model alternatives (with the restricted model fitted via the Lagrange method (pp. 172-173) or using the solution to problem 2 below). Design matrices for interactions between numeric and categorical covariates. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=2dd6f397-d8fb-474b-a042-ac28010b5469|Video (unfortunately, screen sharing in zoom on my ipad would not work after the break)]]
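As a small illustration of the two equivalent approaches above (a sketch on simulated data, not part of the official lecture code; the variable names and the particular hypothesis are made up):

<code r>
# Test H0: beta_2 = beta_3 in y = beta_0 + beta_1*x1 + beta_2*x2 + beta_3*x3 + eps,
# i.e. the linear hypothesis C beta = d with C = (0,0,1,-1) and d = 0.
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 0.5*x1 + 2*x2 + 2*x3 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3)
C <- matrix(c(0, 0, 1, -1), nrow = 1)
d <- 0

# Quadratic form in C betahat - d (cf. Box 3.13); since vcov(full) equals
# sigmahat^2 (X'X)^{-1}, dividing by the number of restrictions r gives a
# statistic that is F(r, n - p) distributed under H0.
betahat <- coef(full)
W <- t(C %*% betahat - d) %*% solve(C %*% vcov(full) %*% t(C)) %*% (C %*% betahat - d)
F1 <- drop(W) / nrow(C)

# F-test based on the sums of squares of the two model alternatives
restricted <- lm(y ~ x1 + I(x2 + x3))  # imposes beta_2 = beta_3
F2 <- anova(restricted, full)$F[2]

c(F1, F2)  # the two statistics coincide
</code>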
September 1: Binary regression (ch. 5). Logit, probit and cloglog models with examples. Binary regression continued. Score function of the binary regression model. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=fefc247e-93b8-431a-98b7-ac2900a5daa1|Video]].

September 7: Some general properties of the expected log likelihood (sec. 4.1 in [[http://www.maths.bris.ac.uk/~sw15190/core-statistics.pdf|Wood (2015)]]). Expected and observed Fisher information and iterative computation of MLEs for binary regression (the Fisher scoring algorithm). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=03bb8f75-2eed-4639-ae89-ac2f0112f40e|Video]].
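A minimal sketch of the Fisher scoring iteration for logistic regression (simulated data; not from the lecture code), compared with the estimates returned by glm():

<code r>
# Fisher scoring for the logit link; with the canonical link the expected and
# observed Fisher information coincide, with weights mu_i(1 - mu_i).
set.seed(1)
n <- 100
x <- rnorm(n)
X <- cbind(1, x)                       # design matrix with intercept
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))

beta <- c(0, 0)                        # starting value
for (i in 1:25) {
  mu <- plogis(X %*% beta)             # fitted probabilities
  W <- diag(as.vector(mu * (1 - mu)))  # Var(y_i) under the current fit
  score <- t(X) %*% (y - mu)           # score function
  Fisher <- t(X) %*% W %*% X           # expected Fisher information
  beta <- beta + solve(Fisher, score)  # Fisher scoring update
}
cbind(fisher.scoring = drop(beta),
      glm = coef(glm(y ~ x, family = binomial)))  # should agree closely
</code>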
September 8: A minimal example of divergence of the Fisher scoring algorithm. Binary regression continued. Asymptotic properties of MLEs. Likelihood ratio, Wald, and score tests. Deviance and testing goodness-of-fit. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=d927be96-b7f2-4ed4-ad66-ac3000b0ce00|Video]].

September 14: More on the deviance and the saturated model. Deviance residuals. Estimating the overdispersion parameter. We'll also go through a sketch of a proof of the asymptotic distribution of the LR test statistic (section 4.4 in Wood). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=5b287ec9-82b7-496d-8614-ac360102cd40|Video]].

September 15: Example (lung cancer rates) illustrating model selection via AIC, model parsimony, and Wald and likelihood ratio testing. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=8ea00539-421e-4f5f-9a32-ac37009abd7b|Video]]

September 21: No lecture.

September 22: Theory behind AIC (Wood sec. 4.6). Poisson regression. Fisher scoring vs. Newton-Raphson for Poisson regression with the non-canonical identity link (see the R code for further illustrations). See e.g. [[https://ete-online.biomedcentral.com/articles/10.1186/1742-7622-10-14|this paper]] for motivation. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=7edb8f07-276a-45eb-b3bd-ac3e00aad3cf|Video (some technical problems this time)]].

September 28 (1 hour only): Gamma and lognormal regression. GLMs in general and IRLS (ch. 5.4, 5.8.2). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=6263b08b-8d42-401b-b3e7-ac440108a343|Video]].

September 29: Quasi-likelihood models (ch. 5.5). The F-test between nested alternatives (see [[https://link.springer.com/book/10.1007%2F978-0-387-21706-2|Venables and Ripley 2002, eq. 7.10]]). QAIC (see [[https://link.springer.com/book/10.1007%2Fb97636|Burnham & Anderson 2002, p. 70]]). Linear separation. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=68212e3d-8294-4f58-ab3b-ac4500b0d785|Video]].

October 5 (1 hour only): Offset variables, profile likelihood confidence intervals. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=b9888071-59ad-4d87-b27e-ac4b0101ab4c|Video]].
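A small hedged illustration of both topics from October 5 (simulated data; the variable names are made up): a Poisson regression with the log of an exposure variable as an offset, followed by profile likelihood and Wald confidence intervals.

<code r>
# Counts y with expectation exposure * exp(beta_0 + beta_1 * x); the offset
# log(exposure) enters the linear predictor with its coefficient fixed at 1.
set.seed(1)
n <- 200
exposure <- runif(n, 1, 10)             # e.g. person-years at risk
x <- rnorm(n)
y <- rpois(n, exposure * exp(-1 + 0.3 * x))

fit <- glm(y ~ x + offset(log(exposure)), family = poisson)
summary(fit)

confint(fit)          # profile likelihood intervals
confint.default(fit)  # Wald intervals, for comparison
</code>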
October 6: Categorical regression models (ch. 6). Categorical regression models continued. Multinomial models as discrete choice latent utility models. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=6ec57589-3f76-4192-b3ce-ac4c00a81966|Video]].

October 12: Ordinal regression models. General treatment of multinomial GLMs. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=d1284013-ec24-4f9f-93cb-ac5201069dbe|Video]].

October 13: Introduction to mixed models (ch. 7.1 and 7.2) (I must have forgotten to press record, so there is no video; see the lecture notes).

October 19: Mixed models continued. ML and restricted likelihood based on \(\mathbf{A}\mathbf{y}\) (REML) estimation (7.3). Bayesian/marginal likelihood interpretation of the restricted likelihood ([[https://www.math.ntnu.no/emner/TMA4315/2020h/Harville-1974.pdf|Harville 1974]]). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=16968791-c398-498f-af04-ac59010e815d|Video]].

October 20: Mixed models continued. The connection between the profile and restricted likelihood (7.3.2) (again, see Harville 1974). Hypothesis testing (7.3.4). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=a85182b4-c3d8-4f19-a8a7-ac5a00a78766|Video]].

October 26: Mixed models continued: conditional distributions/BLUPs of random effects (7.3.1, 7.3.3). [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=48e5568c-6987-4a25-b190-ac600119aa5a|Video]].

October 27: Generalized linear mixed models (GLMMs). Differences between individual/cluster- and population-level effect sizes (i.e. marginal vs. conditional models), including the approximation in [[https://www.math.ntnu.no/emner/TMA4315/2019h/agresti.pdf|Agresti, 2002, eq. 12.8]]. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=56a9e319-3f59-4bf1-bfa3-ac6100b57877|Video]].

November 2: Methods of inference for GLMMs: marginal likelihood based on numerical integration (Gauss-Hermite quadrature, [[https://www.ntnu.no/studier/emner/TMA4215#tab=omEmnet|TMA4215]]). Laplace approximation ([[https://www.jstatsoft.org/article/view/v070i05|Kristensen et al. 2016]]) of the marginal likelihood and its computation via [[https://en.wikipedia.org/wiki/Automatic_differentiation|automatic differentiation]] (see e.g. Wood ch. 5.5.3) as opposed to [[https://en.wikipedia.org/wiki/Numerical_differentiation|numerical]] and symbolic differentiation. Laplace approximation of the restricted likelihood (REML) for GLMMs (available in the R package glmmTMB). Penalized quasi-likelihood. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=ce826c8c-dd2d-4f47-8bd0-ac6701181e2c|Video]].
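A minimal sketch (on simulated clustered binary data, assuming the lme4 and glmmTMB packages are installed) comparing three of these approaches for a random-intercept logistic GLMM: the Laplace approximation, adaptive Gauss-Hermite quadrature, and Laplace-approximate REML via glmmTMB.

<code r>
library(lme4)
library(glmmTMB)

set.seed(1)
m <- 50                                  # number of clusters
n <- 10                                  # observations per cluster
id <- rep(1:m, each = n)
u <- rnorm(m, sd = 1)                    # random intercepts
x <- rnorm(m * n)
y <- rbinom(m * n, 1, plogis(-1 + 0.5 * x + u[id]))
dat <- data.frame(y, x, group = factor(id))

fit.laplace <- glmer(y ~ x + (1 | group), data = dat, family = binomial)             # nAGQ = 1, i.e. Laplace
fit.aghq    <- glmer(y ~ x + (1 | group), data = dat, family = binomial, nAGQ = 20)  # adaptive Gauss-Hermite, 20 nodes
fit.reml    <- glmmTMB(y ~ x + (1 | group), data = dat, family = binomial, REML = TRUE)

# Compare the fixed effect estimates from the three fits
cbind(laplace = fixef(fit.laplace), aghq = fixef(fit.aghq), reml = fixef(fit.reml)$cond)
</code>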
November 3, 9 and 10: Work with project 3.

November 16: Summary of the course. I'll go through some previous exams: [[https://www.math.ntnu.no/emner/TMA4315/Exam/tma4315-2017-en.pdf|2017, problem 1]].

November 17: [[https://www.math.ntnu.no/emner/TMA4315/Exam/tma4315-2006-no.pdf|2006, problem 1 (Norwegian version)]] ([[https://translate.google.com/translate?hl=&sl=no&tl=en&u=https%3A%2F%2Fwww.math.ntnu.no%2Femner%2FTMA4315%2FExam%2Ftma4315-2006-no.pdf|English translation]]) (ordinal regression). [[https://folk.ntnu.no/jarlet/statmod/exams/2016v/eksamen-english.pdf|ST2304 2016, problem 3 (offset variables)]]. [[https://ntnu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=5833d829-16f1-403d-a64a-ac7600c367b0|Video]].

===== Recommended exercises =====

1. Determine whether or not the gamma, the inverse Gaussian, the negative binomial and the uniform distribution on \((0,\theta)\) belong to the exponential family (Fahrmeir, ch. 5.4.1) and find their mean and variance via the general formulas \(EY=b'(\theta)\) and \(\operatorname{Var}Y=\frac{\phi}{w} b''(\theta)\).

1b. Show that the [[https://en.wikipedia.org/wiki/Central_moment|third central moment]] of a random variable \(X\) equals the third [[https://en.wikipedia.org/wiki/Cumulant|cumulant]].

2. For the linear model \(Y=X\beta + \epsilon\), one method for estimating the parameters under a linear hypothesis \(C\beta = d\) (\(r\) linear constraints on the \(p\) model parameters) is given in Fahrmeir, pp. 172-173. Show that an alternative approach is to rewrite the model in the alternative form \(Y' = X_0\beta_0+\epsilon\), where \(X_0\) has dimension \(n \times (p-r)\), \(\beta_0\) has dimension \((p-r) \times 1\), and \(Y'\) is equal to the original \(Y\) minus some offset vector. Hint: https://en.wikipedia.org/wiki/Block_matrix#Block_matrix_multiplication.

2b. When testing nested linear model alternatives, show that the \(F\)-test is equivalent to (leads to the same conclusions as) a test using the likelihood ratio as the test statistic.

3. Assume that the \(n\times p\) matrix \(X\) has rank \(p\) and let \(H=X(X^T X)^{-1}X^T\). Show that \(H\) and \(I-H\) are both idempotent and symmetric matrices. Use this to show that the eigenvalues of both matrices are either 1 or 0.

4. Suppose that \(Y\) is Bernoulli distributed with parameter \(q\) and that \(X|Y=i \sim N(\mu_i,\sigma^2)\) for \(i=0,1\). Show that the logit of the conditional probability \(P(Y=1|X=x)\) can be written as \(\beta_0 + \beta_1 x\). Find \(\beta_0,\beta_1\) expressed in terms of \(q,\mu_0,\mu_1,\sigma\).

5. In R, fit the probit model to the juul data (the help page ''?juul'' describes the data) using

<code r>
library(ISwR)  # install the package if needed
data(juul)
juul$menarche <- factor(juul$menarche, labels=c("No","Yes"))
juul.girl <- subset(juul, age>8 & age<20 & complete.cases(menarche))
attach(juul.girl)
plot(age, menarche)
glm(menarche ~ age, binomial(link="probit"))
detach(juul.girl)
</code>

and compute estimates of the mean \(\mu\) and standard deviation \(\sigma\) of the underlying normal distribution of the latent time of menarche from the estimates of \(\beta_0\) and \(\beta_1\) of the glm. Also find approximate estimates of the standard errors of \(\hat \mu\) and \(\hat \sigma\) using the delta method.

5c. If using the logit instead of the probit link in problem 5, what implicit assumption are we making about the underlying distribution of the time of menarche? Compute the mean and standard deviation of the latent times for this alternative model. Which estimates do you trust the most?

5d. Show that the cloglog choice of link function in problem 5, combined with log-transforming age, would correspond to the assumption that the underlying latent age of menarche has a Weibull distribution.

5b. For linear models, the regression can be forced through the origin by omitting the first column of the design matrix \(\mathbf{X}\). Is there any way to force a binomial regression using the logit link through the origin such that \(p_i=0\) for \(x_i=0\)?
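A hedged sketch of the delta-method computation asked for in problem 5, assuming you have already derived from the latent-variable formulation that \(\hat\mu=-\hat\beta_0/\hat\beta_1\) and \(\hat\sigma=1/\hat\beta_1\) (treat this as a starting point, not a full solution):

<code r>
library(ISwR)
data(juul)
juul$menarche <- factor(juul$menarche, labels = c("No", "Yes"))
juul.girl <- subset(juul, age > 8 & age < 20 & complete.cases(menarche))

fit <- glm(menarche ~ age, binomial(link = "probit"), data = juul.girl)
b <- coef(fit)
V <- vcov(fit)                        # estimated covariance matrix of (beta0hat, beta1hat)

mu.hat    <- unname(-b[1] / b[2])
sigma.hat <- unname(1 / b[2])

# Jacobian of (mu, sigma) with respect to (beta0, beta1)
J <- rbind(c(-1 / b[2],  b[1] / b[2]^2),
           c( 0,        -1 / b[2]^2))
se <- sqrt(diag(J %*% V %*% t(J)))    # delta-method standard errors
cbind(estimate = c(mu = mu.hat, sigma = sigma.hat), se = se)
</code>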
6. Derive the likelihood, the score function and the expected and observed Fisher information for the binary regression model \(y_i\sim \mbox{bin}(1,\pi_i)\) with the probit choice of link function. Express the observed Fisher information in terms of the standard normal cdf \(\Phi()\) and pdf \(\phi()\). Note also that \(\frac d{dz}\phi(z)=-z\phi(z)\). Verify that your answer for \(H(\beta)\) is correct by taking the expectation of the final expression.

7. A model is [[https://en.wikipedia.org/wiki/Identifiability|identifiable]] if there is a one-to-one mapping between the parameter vector \(\beta\) and the distribution of the data. If \(X\) has rank \(p-1\), why is a glm not identifiable? In such cases, what shape will the log likelihood surface \(l(\beta)\) take? (Consider the no-intercept model \(\eta_i = \beta_1 x_{i1} + \beta_2 x_{i2}\) as an example.) How does this agree with the fact that the observed Fisher information is only positive semi-definite, and why is this the case?

8. For a glm with the canonical choice of link function, write the log likelihood in exponential family form and derive the observed Fisher information \(H(\beta)\). Assuming that \(X\) has full rank, is \(H(\beta)\) positive definite? Based on this, what can we say about the overall shape of the log likelihood function? How many optima are possible in the interior of the parameter space?

9. For the linear model \(y = X\beta + \epsilon\), \(\epsilon \sim N(0,\sigma^2 I_n)\), where \(X\) has rank \(p\)