Final reading list MA8701V2023

ESL = Hastie, Tibshirani, Friedman: The Elements of Statistical Learning, 2nd edition, 12th printing (2017): https://hastie.su.domains/ElemStatLearn/download.html

Part 1: Core concepts (this part checked on 2023.01.28 and found to be final reading list)

  • ESL ch 2.4, 7.1-7.6, 7.10-7.12
  • Handbook of Missing Data Methodology: Chapters 12.1.2, 12.2, 12.3.3. Available for download (60 pages per day) from Oria at NTNU (choose EBSCOweb, then PDF full text, and then download chapter 12).
  • Van Buuren: "Flexible Imputation of Missing Data", 2nd edition https://stefvanbuuren.name/fimd/. Ch 1.1, 1.2, 1.3, 1.4, 2.2.4, 2.3.2 (similar to Handbook 12.2), 3.2.1, 3.2.2 (Algo 3.1), 3.4 (Algo 3.3), 4.5.1, 4.5.2.

Part 2: Shrinkage and regularization in LM and GLM (this part checked on 2023.02.20 and found to be final reading list)

  • ESL 3.2.2, 3.2.3, 3.4.1-3.4.2, 4.4.1-4.4.4.
  • Hastie, Tibshirani, Wainwright (HTW): "Statistical Learning with Sparsity: The Lasso and Generalizations". https://hastie.su.domains/StatLearnSparsity/. Chapters 2.1-2.5, 3.1-3.2, 3.7, Chapter 4 (4.1-4.3, 4.6) only on an overview level (practical level), 6.0, 6.1 (overview), 6.2, 6.5.
  • Taylor and Tibshirani (2015): Statistical learning and selective inference, PNAS, vol 112, no 25, pages 7629-7634. https://www.pnas.org/content/112/25/7629 (a light version of HTW 6.3.2, skip sequential stopping rule and PCR)
  • Single/multi-sample splitting part of Dezeure, Bühlmann, Meier, Meinshausen (2015). "High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi". Statistical Science, Vol. 30, No. 4, 533–558. DOI: 10.1214/15-STS527 (2.1.1 and 2.2 for linear regression and use of the method in practice).

Part 3: Ensembles (this part checked on 2023.03.13 and found to be final reading list)

Part 4: Explainable AI/Interpretable Machine Learning (this is the final version)

First edition: Chapters 2, 5 (not 5.8), 6.1 (this was the reading list in 2021, but the book is now in its second edition). Second edition: Chapters 3, 6, 8 (not 8.3, 8.4, 8.6, 8.7), 9 (not 9.4, 9.6.3).

Central theoretical concepts

One learning goal is … "Understand and explain the central theoretical aspects in statistical inference and learning." So, what are those central theoretical aspects?

Part 1: Core concepts

  • [L1]: Statistical decision theoretic framework: minimize EPE(f). For quadratic loss the optimal choice is \(f(X)=E(Y \mid X)\), the conditional expectation. If the joint density of \((X,Y)\) is multivariate normal, then \(f(X)=E(Y\mid X)\) is a linear function of \(X\).
  • [L2]: Continue with the statistical decision theoretic framework. For 0-1 loss the optimal choice \(\hat{G}(X)\) is the class with maximal posterior probability \(P(G=g \mid X=x)\). EPE (now also referred to as Err) may be viewed either unconditionally or conditionally on the training set. We have seen that the training error (often called the apparent error) is not a good estimate of EPE.
  • [L3]: The average optimism is given as \(\frac{2}{N}\sum_{i=1}^N \text{Cov}(y_i, \hat{y}_i)\), and this can be used to generalize the concept of effective degrees of freedom. Here the in-sample error is the starting point, and the results are used for model selection (not assessment).
  • [L4]: Cross-validation may estimate Err and can be used for both model selection and model assessment; 5- or 10-fold is preferred. For bootstrapping, the 0.632 and 0.632+ rules may be used, and these also estimate Err.
  • [L5]: MCAR/MAR/MNAR. Complete case analysis is ok for MCAR. MAR: multiple imputation! Rubin's rules for pooling (see the pooling sketch after this list).
  • [L6]: Bayesian explanation for Rubin's rules. Bayesian linear regression and PMM (predictive mean matching). Fully conditional specification.
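
To make Rubin's pooling rules from [L5]/[L6] concrete, here is a minimal Python sketch (function name and toy numbers are purely illustrative) of how m completed-data estimates and their variances are combined into a pooled estimate and total variance T = W + (1 + 1/m)B; the degrees of freedom for the t-reference distribution are omitted for brevity.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m completed-data estimates of a scalar parameter with Rubin's rules.

    estimates: length-m array of point estimates from each imputed data set
    variances: length-m array of the corresponding squared standard errors
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)

    qbar = estimates.mean()            # pooled point estimate
    w = variances.mean()               # within-imputation variance W
    b = estimates.var(ddof=1)          # between-imputation variance B
    total = w + (1 + 1 / m) * b        # total variance T = W + (1 + 1/m) B
    return qbar, total

# toy usage: estimates and squared standard errors from m = 5 imputed analyses
est = [1.02, 0.95, 1.10, 0.99, 1.05]
se2 = [0.04, 0.05, 0.04, 0.05, 0.04]
qbar, T = rubin_pool(est, se2)
print(f"pooled estimate {qbar:.3f}, pooled variance {T:.3f}")
```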

Part 2: Shrinkage and regularization

  • [L7]: Gauss-Markov theorem. Derive the ridge linear regression estimator; understand both the budget and the penalty version. Compare the variance of LS to the variance of ridge. Relevance of the SVD for understanding ridge. Ridge for an orthogonal design matrix (see the ridge sketch after this list).
  • [L8]: Understand the budget and penalty version of the lasso. Derive the explicit lasso formula for one covariate and for orthogonal covariates. Explain how cyclic coordinate descent is used to find the lasso estimates (see the coordinate descent sketch after this list). Properties of the lasso estimate. When to use the lasso vs when to use ridge?
  • [L9]: Compare ridge and lasso to \(\ell_q\) penalization (bridge regression). What are the elastic net, the group lasso and the sparse group lasso, and when should each be used?
  • [L10]: GLM set-up and logistic regression in particular. Penalized logistic regression. Extensions of cyclic coordinate descent. Elastic net logistic regression.
  • [L11]: Debiased lasso. Bayesian lasso. Bootstrapping.
  • [L12]: Multisample splitting, forward selection to motivate the polyhedral result, the reproducibility crisis.
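
As a concrete illustration of the ridge estimator in [L7], here is a minimal numpy sketch on simulated (hypothetical) data, assuming centred response and centred predictors so that no intercept is needed; it computes \(\hat{\beta}^{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty\) and compares it with least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                 # centre the columns, so no intercept is needed
beta_true = np.array([2.0, 0.0, -1.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)
y = y - y.mean()

lam = 5.0
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)                    # least squares
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # (X'X + lambda I)^{-1} X'y
# for an orthogonal design with X'X = I, beta_ridge reduces to beta_ls / (1 + lam)

print("LS   :", np.round(beta_ls, 3))
print("ridge:", np.round(beta_ridge, 3))   # shrunken towards zero
```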
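
For the cyclic coordinate descent in [L8], here is a minimal sketch of the soft-thresholding updates for the lasso objective \(\frac{1}{2N}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\), again assuming centred data and no intercept; the function names are for illustration only and are not taken from glmnet.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent for
    (1/(2N)) * ||y - X beta||^2 + lam * ||beta||_1,
    assuming centred y and centred predictor columns (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n       # (1/N) * sum_i x_ij^2 for each column j
    r = y - X @ beta                        # full residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]       # partial residual: add back coordinate j
            rho = X[:, j] @ r / n           # (1/N) * x_j' r
            beta[j] = soft_threshold(rho, lam) / col_ss[j]
            r = r - X[:, j] * beta[j]       # update residual with the new beta_j
    return beta

# usage on centred data, e.g. the simulated X and y from the ridge sketch:
# beta_hat = lasso_cd(X, y, lam=0.1)
```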

Part 3: Ensembles

  • [L13]: Wisdom of the crowds, bootstrap aggregation, methods suitable for bagging, trees, random forest.
  • [L14]: AdaBoost.M1, additive model, forward stagewise modeling, exponential loss, gradient boosting, tree depth, regularization.
  • Video: Boosting vs L1 regularization, 2nd order GTB, XGBoost, hyperparameter tuning for XGBoost.
  • [L15]: Stacked ensembles, why CV is needed, level-one data, base learners and the metalearner. Oracle property.
  • [L16]: Hyperparameter tuning. Grid search. Iterative search. Surrogate model. Gaussian processes and the multivariate normal distribution, Bayesian optimization with the expected improvement acquisition function (see the expected improvement sketch after this list). Algorithm for Bayesian optimization. Know that a DOE (design of experiments) can provide data for a surrogate model in response surface models.
  • [L17]: Data-rich situation (separate test set) with classification: CI with Clopper-Pearson and Blaker (see the Clopper-Pearson sketch after this list), McNemar to compare two methods, DeLong CI and test for ROC-AUC. With regression it is unclear how a CI and test for Err on the test set can be performed. Data-poor situation (CV) with classification: unclear how to do inference for the misclassification rate; ROC-AUC CI by LeDell. With regression we only showed correlations within and between folds, and no solution for CIs and tests for Err.
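
For the Bayesian optimization part of [L16], here is a minimal sketch of the expected improvement acquisition function for minimisation, assuming a surrogate model (e.g. a Gaussian process) that supplies a posterior mean and standard deviation at candidate points; the surrounding fit-propose-evaluate loop is not shown.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Expected improvement (for minimisation) at candidate points, given the
    surrogate posterior mean mu and standard deviation sigma, and the best
    (smallest) objective value f_best observed so far."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improve = f_best - mu
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improve / sigma
        ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)   # convention: EI = 0 where there is no uncertainty
```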
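
For the data-rich classification setting in [L17], here is a minimal sketch of the exact (Clopper-Pearson) confidence interval for a misclassification rate, using the standard beta-quantile formulation; the test-set counts in the usage line are hypothetical.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion,
    e.g. the misclassification rate with k errors on a test set of size n."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# toy usage: 23 misclassified out of 200 test observations
print(clopper_pearson(23, 200))
```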

Part 4: Explainable AI

  • [L18]: Interpretable parametric models (correlation between covariates hinders interpretability), global vs local methods, model-specific vs model-agnostic methods. Global model-specific methods: Shapley regression, Gini importance. Global model-agnostic methods: ICE to PDP plots, enhanced to ALE plots.
  • [L19]: Who is interested in the explanations? (end user, system builder, regulatory bodies, end consumers). LIME. Role of fictitious data and local approximation. Gower distance. Counterfactuals. Which features should be altered to obtain a different decision? Changes should be few, small and likely. Three methods, involving both multiobjective optimization and conditional probability models (similar to the FCS models).
  • [L20]: Shapley values: understand and calculate by hand for 3-4 players (see the Shapley sketch after this list). Efficiency, symmetry, dummy and linearity properties. Approximation methods for Shapley regression. Shapley for prediction: the contribution function is estimated via conditional distributions. SHAP. Theoretical result for linear regression with independent covariates. Two challenges: computational complexity of the sum over possible models, and estimating the contribution function. Acknowledge what the challenges are, but do not go into details on the solutions.
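
For [L20], here is a minimal sketch of the exact Shapley value computed by enumerating all coalitions in a small, hypothetical 3-player game; the computed values sum to v({1,2,3}), illustrating the efficiency property.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Shapley value of each player, given a value function v on coalitions.
    v takes a frozenset of players and returns the coalition's worth."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(S | {i}) - v(S))   # weighted marginal contribution
        phi[i] = total
    return phi

# toy 3-player game: coalition worths given directly by a dictionary
worth = {
    frozenset(): 0, frozenset({1}): 10, frozenset({2}): 20, frozenset({3}): 30,
    frozenset({1, 2}): 40, frozenset({1, 3}): 50, frozenset({2, 3}): 60,
    frozenset({1, 2, 3}): 90,
}
print(shapley_values([1, 2, 3], lambda S: worth[S]))
# -> {1: 20.0, 2: 30.0, 3: 40.0}, which sum to v({1,2,3}) = 90 (efficiency)
```
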
2023-04-24, Mette Langaas