### Erich Schubert: Machine Learning Lecture Recordings

I have uploaded *most* of my Machine Learning lecture to YouTube. The slides are in English, but the audio is in German. Some very basic contents (e.g., a demo of standard k-means clustering) were left out of this advanced class; instead, only a link to recordings from an earlier class was given, because in this class I wanted to focus on the improved (accelerated) algorithms. Those earlier recordings are not included here (yet). I believe some contents covered in this class you will find nowhere else (yet). The first unit is pretty long (I have not split it further yet); the later units are shorter recordings.

ML F1: Principles in Machine Learning

- Principles in Machine Learning
- Principles in Machine Learning /2
- Occam's Razor: Principle of Parsimony
- Simple Models
- Computational Learning Theory
- Probably Approximately Correct Learning (PAC Learning) /1
- Probably Approximately Correct Learning (PAC Learning) /2
- PAC Learnable Examples
- VC Dimension (Vapnik-Chervonenkis Dimension)
- VC Dimension Example
- Error Bounds and the VC Dimension
- No Free Lunch
- No Free Lunch Theorem
- No Free Lunch Theorem Explanation
- Bias-Variance Tradeoff
- Bias-Variance Tradeoff
- Bias vs. Variance
- Bias-Variance Decomposition
- Bias-Variance Illustration
- Different Kinds of Bias
- Data Often Has Bias
- AI Can Be Sexist and Racist

- Relationships
- Correlation does not Imply Causation
- Correlation does not Imply Causation /2
- Correlation does not Imply Causation /3
- Correlation with Statistics Classes
- Multiple Testing Problem
- Bonferroni's Principle: Multiple Testing Problem
- Multiple Testing Problem

- Overfitting
- Underfitting and Overfitting
- Overfitting Decision Tree
- Overfitting Due to Noise
- Overfitting Due to Insufficient Examples

- Curse of Dimensionality
- Combinatorial Explosion
- Concentration of Distances
- Data is in the Margins
- Illustration: Shrinking Hyperspheres
- Illustration: Shrinking Hyperspheres /2
- Effect on Search in High Dimensionality
- Summary

- Intrinsic Dimensionality
- Estimating Intrinsic Dimensionality
- Angle-Based Intrinsic Dimensionality Intuition
- Angle-Based Intrinsic Dimensionality (ABID) /2
- Consequences & Solutions

- Distance Functions
- Distances, Metrics and Similarities
- Distances, Metrics and Similarities /2
- Distance Functions
- Distance Functions /2
- Similarity Functions
- Distances for Binary Data
- Jaccard Coefficient for Sets
- Example Distances for Categorical Data
- Mahalanobis Distance
- Scaling & Normalization
- To Scale, or not to Scale?
- To Scale, or not to Scale? /2
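As a small companion to the distance-functions unit above, here is a minimal sketch of the Jaccard coefficient for sets (one of the listed slides); the function name and the empty-set convention are my own choices, not from the slides.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0  # convention: two empty sets are identical
    return 1.0 - len(a & b) / len(a | b)

# Two sets sharing 2 of 4 distinct elements: distance 1 - 2/4 = 0.5
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))  # → 0.5
```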

- Classification
- Prediction Problems
- Classification: A Multi-Stage Process
- Classification Problem
- Example
- Process of Constructing a Model
- Process of Applying the Model

- Evaluation and Selection of Classifiers
- Quick Recap: Classification
- Classifier Evaluation: Confusion Matrix
- Classifier Evaluation: Accuracy and Error-Rate
- Precision, Recall, and F-measure
- Classifier Evaluation: Multi-Class Confusion Matrix
- Training Accuracy vs. Accuracy on New Data
- The Need for Validation
- Holdout Validation
- Cross-Validation
- Bootstrap Validation
- Considerations for Selecting a Model
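To accompany the evaluation unit above, a minimal k-fold cross-validation sketch in plain Python; the function names and the `fit`/`score` callback interface are illustrative assumptions, not the lecture's notation.

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled folds (sizes differ by at most 1)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, score, X, y, k=5):
    """Average held-out score over k folds.

    fit(X_train, y_train) -> model; score(model, X_test, y_test) -> float.
    Each fold serves as test set exactly once; the rest is training data.
    """
    folds = kfold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in test_idx],
                            [y[j] for j in test_idx]))
    return sum(scores) / k
```

A trivial majority-class "classifier" (`fit = lambda X, y: max(set(y), key=y.count)`) is enough to exercise this harness end to end.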

- Bayesian Classification
- Bayes Classification: Motivation
- Bayes Theorem: Review
- Optimal Bayes Classifier
- Naïve Bayes Classifier
- Probability Models for a Single Attribute
- Multivariate Gaussian Bayes Classification
- Naïve Bayes Classifier: Example
- Naïve Bayes Classifier: Computational Aspects
- Naïve Bayes Classifier: Comments & Discussion

- Nearest-Neighbor Classification
- Nearest Neighbor Classifier Motivation
- Nearest Neighbor Classifier: Foundations
- Nearest Neighbor Classifier: Example
- Nearest Neighbor Classification: Example
- Nearest Neighbor Decision Rules

- Nearest-Neighbor as Density Estimation
- Nearest Neighbor Classification and Density Estimation
- Predicting with Kernel Density Estimation with k=1,3,5,15
- Error Probability of Nearest Neighbors
- Nearest Neighbor Regression
- Nearest-Neighbor Classification: Comments & Discussion

- Decision Tree Learning
- Example (Variant of a Dataset in )
- Decision Tree Example
- Decision Trees as Rule-based Systems
- Basic Notions
- Constructing a Decision Tree /1
- Visual Interpretation of Decision Trees on R
- Constructing a Decision Tree /2
- Decision Tree Classification: Example

- Decision Tree Splitting
- Split for Categorical Attributes
- Split for Numeric Attributes
- Best Split Example
- Quality Measures for Splits
- Measure of Impurity: Gini Index
- Gini-Index: Example
- Information Gain
- Information Gain: Example
- Information Gain: Gain-Ratio
- Gain-Ratio: Example
- Classification Error
- Gini, Entropy and Classification Error
- Comparing Split Selection Measures
- Splits for Numerical Attributes
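The split-quality measures listed above (Gini index, weighted split impurity) can be sketched in a few lines; this is a minimal illustration, with function names of my own choosing, not code from the lecture.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes c of p_c^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini impurity of a binary split (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A 50/50 mix has maximal binary impurity 0.5; a pure split scores 0.
print(gini(["a", "a", "b", "b"]))            # → 0.5
print(gini_split(["a", "a"], ["b", "b"]))    # → 0.0
```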

- Ensembles and Meta-Learning
- Ensembles and Meta-Learning
- Error-Rate of Ensembles
- Random Forests
- Boosting
- Random Forest Classification: Example
- Gradient Boosting Classification: Example

- Support Vector Machine Motivation
- Support Vector Machines
- Support Vector Machines /2
- Finding the Best Separating Hyperplane
- Maximum Margin Hyperplane

- Maximum Margin Hyperplane
- A Naïve Attempt
- Support Vectors: Separable Data
- Computing the Maximum Margin Hyperplane (MMH)
- Computing the Maximum Margin Hyperplane (MMH) /2
- Boundary of the Maximum Margin Hyperplane (MMH)
- Deriving the Primal SVM Optimization Problem

- Training Support Vector Machines
- Optimization Problem
- Karush-Kuhn-Tucker (KKT) Conditions
- Switching to the Dual Problem
- Classification with the Dual SVM
- Optimizing the αᵢ
- Optimizing SVMs
- Sequential Minimal Optimization
- Further Improvements

- Non-linear SVM and the Kernel Trick
- Nonlinear SVM
- Nonlinear SVM /2
- Kernel Functions
- Soft Margin SVM Classifier
- Soft Margin SVM Classifier /2
- Soft Margin SVM Classifier /3
- Soft Margin SVM Classifier /4

- SVM Extensions and Conclusions
- Separation of more than 2 Classes
- Support Vector Regression
- Support Vector Regression Optimization Problem
- Support Vector Regression Dual
- Support Vector Data Description (SVDD)
- SVDD Dual Problem
- Support Vector Clustering
- SVMs: Comments & Discussion

- Threshold Logic Units
- Threshold Logic Units (TLUs)
- Threshold Logic Units Example
- Geometric Interpretation of TLUs
- Exclusive-Or (XOR) Problem
- Exclusive-Or (XOR) Problem /2
- Exclusive-Or (XOR) Problem /3
- Universality of TLUs
- Mark I Perceptron

- General Artificial Neural Networks
- Simplifying Threshold Logic Units
- Weight Matrices
- From TLUs to Multilayer Perceptrons
- Some Activation Functions
- Some Activation Functions /2
- Some Activation Functions /3
- Some Activation Functions /4

- Learning Neural Networks with Backpropagation
- Basic Gradient Descent
- Stochastic Gradient Descent
- Learning Single-Layer Perceptrons
- Backpropagation
- Training with Backpropagation

- Deep Neural Networks
- Universal Approximation Theorem
- Deep vs. Wide Neural Networks
- High vs. Low Dimensionality
- (Early) Problems of Deep Learning
- Autoencoders
- Layer-wise Pre-Training of Deep Neural Networks
- Dropout Regularization
- Batch Normalization
- Choosing Activation Functions

- Recurrent Neural Networks
- Recurrent Neural Networks (RNNs) on Sequences
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- Further Developments

- Cluster Analysis Introduction
- What is Clustering?
- What is Clustering? /2
- Applications of Clustering
- Basic Steps for Clustering

- Hierarchical Agglomerative Clustering
- Distance of Clusters
- AGNES (Agglomerative Nesting)
- AGNES (Agglomerative Nesting) /2
- Extracting Clusters from a Dendrogram
- Benefits and Limitations of HAC

- Accelerating Hierarchical Clustering
- Complexity of Hierarchical Clustering
- Anderberg's Caching
- AGNES vs. Anderberg, NNChain, SLINK
- Example: Hierarchical Clustering with Anderberg

- K-means Clustering
- The Sum of Squares Objective
- The Standard Algorithm (Lloyd's Algorithm)
- Non-determinism & Non-optimality
- Initialization
- Initialization /2
- Complexity of k-Means Clustering
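The standard (Lloyd's) algorithm from the unit above alternates an assignment step and an update step until the centers stop moving. A minimal sketch for 2D points, assuming simple random initialization (the k-means++ initialization covered in the next unit is not shown here):

```python
import random

def lloyd_kmeans(points, k, iters=100, seed=0):
    """Standard (Lloyd's) k-means on a list of 2D tuples; returns the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive init: k distinct data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each center to the mean of its assigned points.
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:  # converged: assignments can no longer change
            break
        centers = new
    return centers
```

Note that the objective minimized is the sum of squared deviations, not Euclidean distance — a distinction the "Extensions of k-Means Clustering" unit below returns to.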

- Accelerating k-Means Clustering
- k-Means++: Weighted Random Initialization
- Making k-means Faster
- Bounding the Distances (Elkan and Hamerly)
- Hamerly's k-means
- Example: k-Means Clustering with Hamerly's Algorithm
- Speedup with Hamerly, Elkan, and Exponion

- Limitations of k-Means Clustering
- Benefits and Drawbacks of k-Means
- Choosing the Optimum k for k-Means
- Limitations of k-Means

- Extensions of k-Means Clustering
- k-Means and Distances
- k-Means Minimizes Sum of Squares, not Euclidean Distance!
- k-Means Variations for Other Distances
- Spherical k-Means for Text Clustering
- Pre-processing and Post-processing

- Partitioning Around Medoids (k-Medoids)
- k-medoids Clustering
- Partitioning Around Medoids
- Algorithm: Partitioning Around Medoids
- Algorithm: Partitioning Around Medoids /2
- Change in TD
- Finding the Best Swap Faster
- k-Medoids, k-Means style
- Example for the Inferiority of k-Means Style k-Medoids

- Gaussian Mixture Modeling Introduction
- Expectation-Maximization in Clustering
- Fitting Multiple Gaussian Distributions
- Gaussian Mixture Modeling as E-M-Optimization
- Algorithm: EM Clustering
- Numerical Issues in GMM

- BIRCH and BETULA
- BIRCH Clustering
- BIRCH Clustering Features
- BIRCH Distances
- BIRCH CF-Tree
- BETULA Cluster Features
- BETULA Distance Computations
- Accelerating k-Means with BIRCH and BETULA
- Accelerating GMM with BETULA

- Density-Based Clustering Fundamentals
- Density-based Clustering: Foundations
- Density-based Clustering: Foundations /2
- Density-based Clustering: Foundations /3
- Density-reachability and Density-connectivity
- Density-reachability

- DBSCAN
- Clustering Approach
- Abstract DBSCAN Algorithm
- DBSCAN Algorithm
- DBSCAN Algorithm /2
- DBSCAN Algorithm /3
- DBSCAN in Context

- DBSCAN Parameterization
- Choosing DBSCAN parameters
- Choosing DBSCAN parameters /2
- Choosing DBSCAN parameters /3

- DBSCAN Extensions
- Generalized Density-based Clustering
- Grid-based Accelerated DBSCAN
- Anytime Density-Based Clustering (AnyDBC)
- Hierarchical DBSCAN* (HDBSCAN*)
- Improved DBSCAN Variations

- OPTICS Clustering
- Density-based Hierarchical Clustering
- Density-based Hierarchical Clustering /2
- OPTICS Clustering
- Cluster Order
- OPTICS Algorithm

- Cluster Extraction from OPTICS Plots
- OPTICS Reachability Plots
- Extracting Clusters from OPTICS Reachability Plots
- Role of the Parameters ε and minPts

- Understanding the OPTICS Cluster Order
- Properties of the OPTICS Cluster Order
- Cluster Order as Serialized Spanning Tree
- OPTICS as Density Spanning Trees
- Cluster Order to Dendrograms

- Spectral Clustering
- Minimum Cuts
- Graph Laplacian
- From Clustering Graphs to Clustering Data
- Spectral Clustering
- Spectral Clustering is Related to DBSCAN

- Biclustering and Subspace Clustering
- Biclustering & Subspace Clustering
- Bicluster Patterns
- Density-based Subspace Clustering
- Subspace Clustering with Apriori-Style Search
- Correlation Clustering
- 4C: Computing Correlation Connected Clusters
- Hough Transform
- CASH: Robust Clustering in Arbitrarily Oriented Subspaces