**Word embeddings** map each sparse one-hot vector, $\boldsymbol{x}_{nt} \in \{0, 1\}^V$, to a lower-dimensional dense vector, $\boldsymbol{e}_{nt} \in \mathbb{R}^K$, using $\boldsymbol{e}_{nt} = \textbf{E} \boldsymbol{x}_{nt}$, where $\textbf{E} \in \mathbb{R}^{K \times V}$ is learned such that semantically similar words are placed close together. Once we have an embedding matrix, we can represent a variable-length text document as a **bag of word embeddings**. We can then convert this to a fixed-length vector by summing the embeddings:

$$ \overline{\boldsymbol{e}}_n = \sum_t \boldsymbol{e}_{nt} = \textbf{E} \sum_t \boldsymbol{x}_{nt} = \textbf{E} \tilde{\boldsymbol{x}}_n $$

where $\tilde{\boldsymbol{x}}_n$ is the bag of words representation. We can use this inside a logistic regression classifier. The overall model has the form

$$ p(y = c | \boldsymbol{x}_n, \boldsymbol{\theta}) = \mathrm{softmax}_c(\textbf{W} \textbf{E} \tilde{\boldsymbol{x}}_n) $$
We often use a **pre-trained word embedding** matrix $\textbf{E}$, in which case the model is linear in $\textbf{W}$, which simplifies parameter estimation.
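As a concrete (and purely illustrative) sketch of this pipeline, the snippet below builds the count vector $\tilde{\boldsymbol{x}}_n$, applies a fixed embedding matrix, and passes the summed embedding through a softmax classifier. The toy vocabulary and the randomly initialised `E` and `W` are assumptions made for the example; in practice $\textbf{E}$ would be pre-trained and $\textbf{W}$ learned from labelled data.

```python
import numpy as np

# Illustrative sketch: bag-of-word-embeddings + softmax classifier.
# The vocabulary, E, and W below are made-up toy values.
rng = np.random.default_rng(0)

vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3, "dog": 4}
V = len(vocab)   # vocabulary size
K = 3            # embedding dimension
C = 2            # number of classes

E = rng.normal(size=(K, V))   # embedding matrix (would be pre-trained)
W = rng.normal(size=(C, K))   # classifier weights (would be learned)

def bag_of_words(tokens):
    """Sum of one-hot vectors: the count vector x_tilde in R^V."""
    x_tilde = np.zeros(V)
    for t in tokens:
        x_tilde[vocab[t]] += 1
    return x_tilde

def predict_proba(tokens):
    """p(y | x) = softmax(W E x_tilde); linear in W when E is held fixed."""
    x_tilde = bag_of_words(tokens)
    e_bar = E @ x_tilde            # summed word embeddings, shape (K,)
    logits = W @ e_bar             # shape (C,)
    logits -= logits.max()         # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

print(predict_proba(["the", "cat", "sat"]))
```

Because `E` is held fixed here, the logits are a linear function of `W`, which is why using a pre-trained embedding matrix reduces fitting to ordinary (multinomial) logistic regression.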
#### Dealing with novel words
A word encountered at test time that was not in the training vocabulary is said to be **out of vocabulary (OOV)**. A standard heuristic for handling this is to replace every novel word with the special symbol **UNK** (unknown). However, this loses information, since we can often deduce something about a novel word from its suffix or root. To address this, we can break words down into their substructure and work with **subword units** or **wordpieces**. These are often created using **byte-pair encoding**, a form of data compression that creates new symbols to represent common substrings.
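Below is a minimal sketch of the byte-pair-encoding merge loop. The toy corpus of word frequencies, the `</w>` end-of-word marker, and the number of merges are made-up illustrative choices, not values from the text. Each iteration finds the most frequent adjacent pair of symbols and fuses it into a new symbol, so common substrings gradually become single subword units.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols, mapped to its frequency.
corpus = Counter({
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
})

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the corpus; return the most common."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

num_merges = 10
for _ in range(num_merges):
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(pair, corpus)
    print("merged", pair)
```

The learned merge rules can then be applied, in order, to segment any new word into subword units, so a genuinely novel word is represented by pieces that were seen during training rather than a single UNK token.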
## 9.1 - Introduction

In this chapter, we consider models of the following form:
$$ p(y = c | \boldsymbol{x}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{x} | y = c, \boldsymbol{\theta})p(y = c | \boldsymbol{\theta})}{\sum_{c'} p(\boldsymbol{x} | y = c', \boldsymbol{\theta}) p(y = c' | \boldsymbol{\theta})} $$
The term $p(y = c | \boldsymbol{\theta})$ is the prior over class labels, and the term $p(\boldsymbol{x} | y = c, \boldsymbol{\theta})$ is called the **class conditional density** for class $c$.
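As a small illustration of this rule (not an example from the text), the sketch below plugs made-up Gaussian class-conditional densities and a made-up prior into Bayes' rule to compute the posterior over two classes; this is also the basic computation behind the Gaussian discriminant analysis models of the next section.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy generative classifier: the prior and the Gaussian class-conditionals
# are made-up values, chosen only to show how the posterior is computed.
priors = np.array([0.6, 0.4])                         # p(y = c)
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]  # class-conditional means
covs = [np.eye(2), np.eye(2)]                         # class-conditional covariances

def posterior(x):
    """p(y = c | x) ∝ p(x | y = c) p(y = c), normalised over the classes."""
    likelihoods = np.array([
        multivariate_normal.pdf(x, mean=m, cov=S) for m, S in zip(means, covs)
    ])
    joint = likelihoods * priors
    return joint / joint.sum()

print(posterior(np.array([1.5, 1.0])))   # posterior probability of each class
```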
## 9.2 - Gaussian discriminant analysis
### 9.2.1 - Quadratic decision boundaries
### 9.2.2 - Linear decision boundaries
### 9.2.3 - The connection between LDA and logistic regression
### 9.2.4 - Model fitting
### 9.2.5 - Nearest centroid classifier
### 9.2.6 - Fisher’s linear discriminant analysis *
## 9.3 - Naive Bayes classifiers
### 9.3.1 - Example models
### 9.3.2 - Model fitting
### 9.3.3 - Bayesian naive Bayes
### 9.3.4 - The connection between naive Bayes and logistic regression
## 9.4 - Generative vs discriminative classifiers
### 9.4.1 - Advantages of discriminative classifiers