**Word embeddings** map each sparse one-hot vector, $\boldsymbol{x}_{nt} \in \{0, 1\}^V$, to a lower-dimensional dense vector, $\boldsymbol{e}_{nt} \in \mathbb{R}^K$, using $\boldsymbol{e}_{nt} = \textbf{E} \boldsymbol{x}_{nt}$, where $\textbf{E} \in \mathbb{R}^{K \times V}$ is learned such that semantically similar words are placed close together. Once we have an embedding matrix, we can represent a variable-length text document as a **bag of word embeddings**. We can then convert this to a fixed-length vector by summing the embeddings:

$$ \overline{\boldsymbol{e}}_n = \sum_t \boldsymbol{e}_{nt} = \textbf{E} \sum_t \boldsymbol{x}_{nt} = \textbf{E} \tilde{\boldsymbol{x}}_n $$

where $\tilde{\boldsymbol{x}}_n$ is the bag of words representation. We can use this inside a logistic regression classifier. The overall model has the form

$$ p(y = c | \boldsymbol{x}_n, \boldsymbol{\theta}) = \mathrm{softmax}_c(\textbf{W} \textbf{E} \tilde{\boldsymbol{x}}_n) $$
We often use a **pre-trained word embedding** matrix $\textbf{E}$, in which case the model is linear in $\textbf{W}$, which simplifies parameter estimation.
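As a concrete (and purely illustrative) sketch of this pipeline, the snippet below builds the count vector $\tilde{\boldsymbol{x}}_n$, applies a fixed embedding matrix, and passes the summed embedding through a softmax classifier. The toy vocabulary and the randomly initialised `E` and `W` are assumptions made for the example; in practice $\textbf{E}$ would be pre-trained and $\textbf{W}$ learned from labelled data.

```python
import numpy as np

# Illustrative sketch: bag-of-word-embeddings + softmax classifier.
# The vocabulary, E, and W below are made-up toy values.
rng = np.random.default_rng(0)

vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3, "dog": 4}
V = len(vocab)   # vocabulary size
K = 3            # embedding dimension
C = 2            # number of classes

E = rng.normal(size=(K, V))   # embedding matrix (would be pre-trained)
W = rng.normal(size=(C, K))   # classifier weights (would be learned)

def bag_of_words(tokens):
    """Sum of one-hot vectors: the count vector x_tilde in R^V."""
    x_tilde = np.zeros(V)
    for t in tokens:
        x_tilde[vocab[t]] += 1
    return x_tilde

def predict_proba(tokens):
    """p(y | x) = softmax(W E x_tilde); linear in W when E is held fixed."""
    x_tilde = bag_of_words(tokens)
    e_bar = E @ x_tilde            # summed word embeddings, shape (K,)
    logits = W @ e_bar             # shape (C,)
    logits -= logits.max()         # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

print(predict_proba(["the", "cat", "sat"]))
```

Because `E` is held fixed here, the logits are a linear function of `W`, which is why using a pre-trained embedding matrix reduces fitting to ordinary (multinomial) logistic regression.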
#### Dealing with novel words
A word encountered at test time that was not in the training vocabulary is said to be **out of vocabulary (OOV)**. A standard heuristic for handling this is to replace every novel word with the special symbol **UNK** (unknown). However, this loses information, since we can often deduce something about a novel word from its suffix or root. To address this, we can break words down into their substructure and work with **subword units** or **wordpieces**. These are often created using **byte-pair encoding**, a form of data compression that creates new symbols to represent common substrings.
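Below is a minimal sketch of the byte-pair-encoding merge loop. The toy corpus of word frequencies, the `</w>` end-of-word marker, and the number of merges are made-up illustrative choices, not values from the text. Each iteration finds the most frequent adjacent pair of symbols and fuses it into a new symbol, so common substrings gradually become single subword units.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols, mapped to its frequency.
corpus = Counter({
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
})

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the corpus; return the most common."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

num_merges = 10
for _ in range(num_merges):
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(pair, corpus)
    print("merged", pair)
```

The learned merge rules can then be applied, in order, to segment any new word into subword units, so a genuinely novel word is represented by pieces that were seen during training rather than a single UNK token.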
## 9.1 - Introduction

In this chapter, we consider models of the following form:
$$ p(y = c | \boldsymbol{x}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{x} | y = c, \boldsymbol{\theta})p(y = c | \boldsymbol{\theta})}{\sum_{c'} p(\boldsymbol{x} | y = c', \boldsymbol{\theta}) p(y = c' | \boldsymbol{\theta})} $$
The term $p(y = c | \boldsymbol{\theta})$ is the prior over class labels, and the term $p(\boldsymbol{x} | y = c, \boldsymbol{\theta})$ is called the **class conditional density** for class $c$.
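As a small illustration of this rule (not an example from the text), the sketch below plugs made-up Gaussian class-conditional densities and a made-up prior into Bayes' rule to compute the posterior over two classes; this is also the basic computation behind the Gaussian discriminant analysis models of the next section.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy generative classifier: the prior and the Gaussian class-conditionals
# are made-up values, chosen only to show how the posterior is computed.
priors = np.array([0.6, 0.4])                         # p(y = c)
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]  # class-conditional means
covs = [np.eye(2), np.eye(2)]                         # class-conditional covariances

def posterior(x):
    """p(y = c | x) ∝ p(x | y = c) p(y = c), normalised over the classes."""
    likelihoods = np.array([
        multivariate_normal.pdf(x, mean=m, cov=S) for m, S in zip(means, covs)
    ])
    joint = likelihoods * priors
    return joint / joint.sum()

print(posterior(np.array([1.5, 1.0])))   # posterior probability of each class
```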
## 9.2 - Gaussian discriminant analysis
### 9.2.1 - Quadratic decision boundaries
### 9.2.2 - Linear decision boundaries
### 9.2.3 - The connection between LDA and logistic regression
### 9.2.4 - Model fitting
### 9.2.5 - Nearest centroid classifier
### 9.2.6 - Fisher’s linear discriminant analysis *
## 9.3 - Naive Bayes classifiers
### 9.3.1 - Example models
### 9.3.2 - Model fitting
### 9.3.3 - Bayesian naive Bayes
### 9.3.4 - The connection between naive Bayes and logistic regression
## 9.4 - Generative vs discriminative classifiers
### 9.4.1 - Advantages of discriminative classifiers