# Word Embeddings

## Word Representation

• 1-hot representation: the inner product of any two different one-hot vectors is $0$, so it captures no similarity between words
• Featurized representation: word embedding

### Visualizing word embeddings

t-SNE algorithm: $300\mathrm{D} \to 2\mathrm{D}$

t-SNE places concepts that feel like they should be related close to each other in the 2D map

## Using word embeddings

### Named entity recognition example

The training set for the new task is typically much smaller than the corpus used to learn the embeddings, so word embeddings allow you to carry out transfer learning.

## Transfer learning and word embeddings

• Learn word embeddings from a large text corpus ($1$–$100\mathrm B$ words).

• Transfer the embedding to a new task with a smaller training set (say, $100\text{k}$ words).

• Optional: continue to fine-tune the word embeddings with the new data.

## Properties of Word Embeddings

### Analogies

$\text{Man} \to \text{Woman}$ as $\text{King} \to \,?$

$e_{\text{man}} - e_{\text{woman}} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix} \approx e_{\text{king}} - e_{\text{queen}}$

$e_? \approx e_\text{king} - e_\text{man} + e_\text{woman} \approx e_{\text{queen}}$

find the word $w$ that satisfies $\arg\max_w \text{sim}(e_w, e_\text{king} - e_\text{man} + e_\text{woman})$

• Cosine similarity

$\text{sim}(u, v) = \frac{u^{T}v}{||u||_2 ||v||_2}$
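As a minimal sketch, assuming `word_to_vec_map` is a dict from word to NumPy embedding vector (a hypothetical stand-in for a loaded GloVe file), the analogy search can be written as:

```python
import numpy as np

def cosine_similarity(u, v):
    # sim(u, v) = u.v / (||u||_2 * ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, word_to_vec_map):
    """Solve a -> b as c -> ?, i.e. argmax_w sim(e_w, e_c - e_a + e_b)."""
    e_a, e_b, e_c = word_to_vec_map[a], word_to_vec_map[b], word_to_vec_map[c]
    target = e_c - e_a + e_b
    best_word, best_sim = None, -np.inf
    for w, e_w in word_to_vec_map.items():
        if w in (a, b, c):                   # skip the input words themselves
            continue
        sim = cosine_similarity(e_w, target)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# complete_analogy("man", "woman", "king", word_to_vec_map)  -> ideally "queen"
```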

## Embedding Matrix

The embedding matrix $E$ (e.g. $300 \times 10000$) maps a one-hot vector $O_j$ to the word's embedding: $e_j = E \, O_j$. In practice you read out column $j$ directly rather than doing the matrix multiplication.
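A tiny sketch (with a random $E$ standing in for a learned one) of why the multiplication is just a column lookup:

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
E = np.random.randn(emb_dim, vocab_size)    # learned in practice; random here

j = 6527                                    # e.g. the index of "orange"
O_j = np.zeros(vocab_size)
O_j[j] = 1                                  # one-hot vector for word j

e_j = E @ O_j                               # e_j = E O_j ...
assert np.allclose(e_j, E[:, j])            # ... is exactly column j of E
```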

# Learning Word Embeddings: Word2vec & GloVe

## Learning Word Embeddings

• Neural language model

mask a word, build a network to predict it, and keep the learned parameters as the embeddings

• Other context/target pairs

Context: last 4 words / 4 words on left & right / last 1 word / nearby 1 word (skip-gram)

$\text{a glass of orange } \underline{?} \text{ to go along with}$

## Word2Vec

### Skip-grams

come up with a few context-to-target pairs to create our supervised learning problem

• Model

$\text{Vocab size} = 10000$

Context $c$: "orange" (6527) $\to$ Target $t$: "juice" (4834)

$O_c \to E \to e_c \,(= E\, O_c) \to \text{softmax} \to \hat y$

$\text{softmax}: P(t | c) = \frac{e^{\theta_t^T e_c}}{\sum_{j = 1}^{10000} e^{\theta_j^T e_c}}$

$\theta_t$ is a parameter vector associated with output $t$

$\text{Loss}: \mathcal L(\hat y, y) = - \sum_{i = 1}^{10000} y_i \log \hat y_i$
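A NumPy sketch of this forward pass and loss (randomly initialized $E$ and $\theta$ stand in for learned parameters):

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
E = 0.01 * np.random.randn(emb_dim, vocab_size)      # embedding matrix
theta = 0.01 * np.random.randn(vocab_size, emb_dim)  # one theta_t per output word

def skipgram_loss(c, t):
    """-log P(t | c) for context index c and target index t."""
    e_c = E[:, c]                            # e_c = E O_c, done as a lookup
    logits = theta @ e_c                     # theta_j^T e_c for every word j
    logits -= logits.max()                   # for numerical stability
    y_hat = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
    return -np.log(y_hat[t])                 # cross-entropy with a one-hot y

loss = skipgram_loss(6527, 4834)             # context "orange" -> target "juice"
```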

• Problems with softmax classification

computation cost is too high

• Solutions to the softmax cost

a hierarchical softmax classifier (a tree of binary classifiers, so the cost scales with the log of the vocabulary size)

## Negative Sampling

| context | word  | target? |
| ------- | ----- | ------- |
| orange  | juice | 1       |
| orange  | king  | 0       |
| orange  | book  | 0       |
| orange  | the   | 0       |
| orange  | of    | 0       |

### Defining a new learning problem & model

• pick a context word and a target word to get a positive example;

• pick $k$ random words from the dictionary as fake targets for the same context word to get $k$ negative examples.

$k = \begin{cases} 5 \sim 20 & (\text{small dataset})\\ 2 \sim 5 & (\text{large dataset}) \end{cases}$

• train $10000$ binary classifiers, updating only the $k+1$ of them that appear in each example, instead of one $10000$-way softmax classification (the computation cost is much lower).

### Selecting negative examples

$P(w_i) = \frac{f(w_i)^{3/ 4}}{\sum_{j = 1}^{10000} f(w_j)^{3/4}}$

$f(w_i)$ represents the frequency of $w_i$.
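A sketch of building the $k+1$ examples, with negatives drawn from this $f(w_i)^{3/4}$ distribution; `freqs` (an array of word frequencies indexed by word id) is an assumed input:

```python
import numpy as np

def negative_sampling_examples(context, target, freqs, k=4, rng=None):
    """Return (context, word, label) triples: 1 positive and k sampled negatives."""
    rng = rng or np.random.default_rng()
    p = freqs ** 0.75
    p = p / p.sum()                        # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)
    examples = [(context, target, 1)]      # the true context/target pair
    for w in rng.choice(len(freqs), size=k, p=p):
        examples.append((context, int(w), 0))   # fake targets labeled 0
    return examples
```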

## GloVe Word Vectors

GloVe(global vectors for word representation)

$X_{ct} = X_{ij} = \text{number of times } i \text{ appears in the context of } j$

With a symmetric context window, $X_{ij} = X_{ji}$; it measures how often $i$ and $j$ appear close to each other.

$\min \sum_{i = 1}^{n} \sum_{j = 1}^n f(X_{ij})(\theta_i^T e_j + b_i + b_j' - \log X_{ij})^2$

$f(X_{ij})$ is a weighting term: it is $0$ when $X_{ij} = 0$, avoids giving excessive weight to frequent stopwords (this, is, of, a, $\cdots$), and still gives meaningful weight to rare words (durian, $\cdots$).

(using the convention $0 \log 0 = 0$, so the $X_{ij} = 0$ terms drop out)

$\theta_i$ and $e_j$ play symmetric roles, so you can average them:
$\displaystyle e_w^{\text{final}} = \frac{e_w + \theta_w}{2}$.
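A sketch of the objective; the specific cap $x_{\max} = 100$ and exponent $3/4$ in $f$ are the choices from the GloVe paper, not from these notes:

```python
import numpy as np

def f(X, x_max=100.0, alpha=0.75):
    """Weighting term: 0 at X = 0, grows with X, capped at 1 for frequent pairs."""
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0)

def glove_loss(X, theta, e, b, b_prime):
    """sum_ij f(X_ij) (theta_i^T e_j + b_i + b'_j - log X_ij)^2 with 0 log 0 = 0."""
    # X: float co-occurrence matrix (n, n); theta, e: (n, d); b, b_prime: (n,)
    logX = np.log(X, where=X > 0, out=np.zeros_like(X))   # 0 where X_ij = 0
    residual = theta @ e.T + b[:, None] + b_prime[None, :] - logX
    return np.sum(f(X) * residual ** 2)                   # f zeroes the X_ij = 0 terms
```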

# Applications Using Word Embeddings

## Sentiment Classification

Average the word embeddings of the sentence and use a softmax to predict the sentiment. But because averaging ignores word order, it makes mistakes, e.g. "Completely lacking in good taste, good service, and good ambience." reads as positive since "good" appears three times.
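A minimal sketch of this averaging baseline, assuming a `word_to_vec_map` dict and already-trained softmax parameters `W` (shape `(n_classes, emb_dim)`) and `b`:

```python
import numpy as np

def predict_sentiment(sentence, word_to_vec_map, W, b):
    """Average the sentence's word embeddings, then apply a softmax classifier."""
    words = sentence.lower().split()
    avg = np.mean([word_to_vec_map[w] for w in words], axis=0)
    z = W @ avg + b
    y_hat = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax
    return int(np.argmax(y_hat))    # word order is lost, hence the failure above
```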

### RNN for sentiment classification

Using a many-to-one RNN on the word embeddings solves this problem, since the RNN takes word order into account.

## Debiasing word embeddings

Word embeddings can reflect gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model.

### Addressing bias in word embeddings

• Identify the bias direction

average
$\begin{cases} e_{\text{he}} - e_{\text{she}}\\ e_{\text{male}} - e_{\text{female}}\\ \dots \end{cases}$

bias direction ($1$D)

non-bias direction ($(n-1)$D)

SVD (singular value decomposition, similar to PCA) can be used to find this direction

• Neutralize: for every word that is not definitional, project it to get rid of the bias (see the sketch after this list)

(you need to figure out which words should be neutralized; a classifier such as an SVM can be trained first to pick out the definitional words)

• Equalize pairs.

grandmother and grandfather should end up with the same similarity and distance to every gender-neutral word

you can hand-pick these pairs (there are not many of them)
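A sketch of the bias direction and the neutralize step (the projection formula is the one used in the course homework; `word_to_vec_map` is the assumed embedding dict):

```python
import numpy as np

def bias_direction(pairs, word_to_vec_map):
    """Average e_a - e_b over definitional pairs like (he, she), (male, female)."""
    return np.mean([word_to_vec_map[a] - word_to_vec_map[b] for a, b in pairs],
                   axis=0)

def neutralize(e, g):
    """Remove the component of embedding e along the bias direction g."""
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g   # projection of e onto g
    return e - e_bias                            # now orthogonal to g
```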

# Homework - Emojify

## Building the Emojifier-V2

```python
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# Input, LSTM, Dropout, Dense, Activation and Model come from tensorflow.keras;
# pretrained_embedding_layer and maxLen are defined earlier in the assignment.
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense, Activation
from tensorflow.keras.models import Model

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.

    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """

    ### START CODE HERE ###
    # Define sentence_indices as the input of the graph.
    # It should be of shape input_shape and dtype 'int32' (as it contains indices, which are integers).
    sentence_indices = Input(input_shape, dtype='int32')

    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)

    # Propagate sentence_indices through your embedding layer
    # (See additional hints in the instructions).
    embeddings = embedding_layer(sentence_indices)

    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state.
    # The returned output should be a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with 128-dimensional hidden state.
    # The returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer with 5 units
    X = Dense(5)(X)
    X = Activation('softmax')(X)

    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices, outputs=X)

    ### END CODE HERE ###

    return model

model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()
```

```
Model: "functional_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 10)]              0
_________________________________________________________________
embedding_3 (Embedding)      (None, 10, 50)            20000050
_________________________________________________________________
lstm_2 (LSTM)                (None, 10, 128)           91648
_________________________________________________________________
dropout_2 (Dropout)          (None, 10, 128)           0
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               131584
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 645
_________________________________________________________________
activation_1 (Activation)    (None, 5)                 0
=================================================================
Total params: 20,223,927
Trainable params: 223,877
Non-trainable params: 20,000,050
_________________________________________________________________
```


Compile it

```python
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```


Train it

```python
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C=5)
model.fit(X_train_indices, Y_train_oh, epochs=50, batch_size=32, shuffle=True)
```
