NLP

Context-free language

Context-free language: Given \(G\) as a CFG, the language of \(G\), denoted as \(\mathcal L(G)\), is the set of strings derivable by \(G\).

Every regular language is context-free.

Example of a language that is not context-free: \(L=\{a^nb^nc^n|n>0\}\).

Chomsky normal form (CNF): each production is of the form \(A\to BC\) (binary) or \(A\to a\).

CKY parsing: map a sentence to its parse tree by dynamic programming.
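
A minimal CKY recognizer sketch in Python, assuming the CNF grammar is given as two dicts (`lexical` for \(A\to a\) rules, `binary` for \(A\to BC\) rules); the helper names and the toy grammar are illustrative:

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """CKY recognition for a CNF grammar.

    lexical: dict mapping a terminal word -> set of nonterminals A with A -> word
    binary : dict mapping a pair (B, C)   -> set of nonterminals A with A -> B C
    """
    n = len(words)
    chart = defaultdict(set)                  # chart[(i, j)] = nonterminals deriving words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(lexical.get(w, ()))
    for span in range(2, n + 1):              # width of the span
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # split point
                for B in chart[(i, k)]:
                    for C in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((B, C), set())
    return start in chart[(0, n)]

# toy grammar: S -> NP VP, VP -> V NP, NP -> "they" | "fish", V -> "fish"
lexical = {"they": {"NP"}, "fish": {"V", "NP"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}
print(cky_recognize(["they", "fish", "fish"], lexical, binary))  # True
```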

Probabilistic context-free grammar (PCFG): probabilities are assigned to each rule.


Latent semantic analysis (LSA).

Term-document matrix: \(W_{i,j}\) indicates how many times word \(i\) appears in document \(j\).

Two problems:

  1. related words don’t always appear together.
  2. sparsity.

Instead of directly using \(W\), we want to infer "latent" vectors for words and documents such that \(W_{i,j}\approx\max(lv(w_i)lv(d_j)^T,0)\)

Singular value decomposition: \(W=U\Sigma V^T\).

TF-IDF normalization: define term frequency as \(\dfrac{\text{the number of times word }i\text{ appears in document }j}{\text{the number of words in document }j}\), and inverse document frequency as \(\log(\dfrac{\text{the number of documents }+1}{\text{the number of documents containing word }i+1})+1\), let \(count'(i,j)=tf\cdot idf\). (A word \(i\) has more importance in document \(j\) if it appears very frequently in document \(j\) but not so frequently in other documents)
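
A small numpy sketch of TF-IDF weighting followed by a truncated SVD for LSA; the `tfidf` and `lsa` helpers are illustrative names, and the idf smoothing follows the formula above:

```python
import numpy as np

def tfidf(counts):
    """counts: (num_words, num_docs) term-document count matrix."""
    tf = counts / counts.sum(axis=0, keepdims=True)        # term frequency per document
    df = (counts > 0).sum(axis=1, keepdims=True)            # document frequency of each word
    n_docs = counts.shape[1]
    idf = np.log((n_docs + 1) / (df + 1)) + 1                # smoothed idf, as defined above
    return tf * idf

def lsa(W, k):
    """Rank-k latent vectors for words and documents via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]        # rows: latent word vectors
    doc_vecs = Vt[:k, :].T              # rows: latent document vectors
    return word_vecs, doc_vecs

counts = np.array([[2., 0., 1.],
                   [0., 3., 0.],
                   [1., 1., 1.]])
word_vecs, doc_vecs = lsa(tfidf(counts), k=2)
print(word_vecs.shape, doc_vecs.shape)   # (3, 2) (3, 2)
```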


E-M algorithm

Gaussian mixture model: select a mixture component with probability \(\pi\), and sample from that component's Gaussian, i.e., \(\mathbb{P}(z=c)=\pi_c\), \(\mathbb{P}(x|z=c)=\mathcal N(x;\mu_c,\sigma_c)\), where \(\pi,\mu,\sigma\) are learnable parameters.

We want to maximize the log-likelihood of data:

\[\log p(x)=\log\sum\limits_{z}p(x,z) \]

For any distribution \(q\) on \(z\), by Jensen's inequality, we have

\[\log p(x;\theta)=\log\sum\limits_{z}q(z)\cdot\dfrac{p(x,z;\theta)}{q(z)}\ge\sum\limits_{z}q(z)\log(\dfrac{p(x,z;\theta)}{q(z)})=\sum\limits_{z}q(z)\log p(x,z;\theta)+\text{entropy}(q) \]

equality holds when \(q(z)=p(z|x;\theta)\).

EM algorithm:

  • E-step: compute \(q(z)=p(z|x;\theta)\).
  • M-step: update \(\theta\) for \(\max\sum\limits_{z}q(z)\log p(x,z;\theta)\) (see the Gaussian-mixture sketch below).
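
A minimal EM sketch for a one-dimensional Gaussian mixture, following the two steps above (numpy only; the initialization and iteration count are arbitrary choices):

```python
import numpy as np

def em_gmm(x, n_components, n_iters=50):
    """EM for a 1-D Gaussian mixture model."""
    rng = np.random.default_rng(0)
    pi = np.full(n_components, 1.0 / n_components)
    mu = rng.choice(x, n_components, replace=False)
    sigma = np.full(n_components, x.std())
    for _ in range(n_iters):
        # E-step: responsibilities q(z = c | x_i) under the current parameters
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize E_q[log p(x, z; theta)], which has a closed form here
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

x = np.concatenate([np.random.normal(-2, 0.5, 200), np.random.normal(3, 1.0, 300)])
print(em_gmm(x, 2))
```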

HMM for part of speech tagging

Given \(n\) observations \(o_1,o_2,\cdots,o_n\), the probability that hidden labels \(q_1,q_2,\cdots,q_n\) correspond to \(o_1,o_2,\cdots,o_n\) is

\[\mathbb{P}(Q,O)=\mathbb{P}(q_1)\mathbb{P}(q_2|q_1)\mathbb{P}(q_3|q_2)\cdots\mathbb{P}(q_n|q_{n-1})\mathbb{P}(o_1|q_1)\mathbb{P}(o_2|q_2)\cdots\mathbb{P}(o_n|q_n) \]

Denote all hidden-to-hidden transition probabilities by \(A\), and all emission probabilities by \(B\).

The most probable tag sequence \(q_1,q_2,\cdots,q_n\) can be found by dynamic programming (the Viterbi algorithm).
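
A Viterbi sketch in log space, assuming \(\pi\), \(A\), \(B\) are numpy arrays with the shapes noted in the docstring (illustrative, not a reference implementation):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable hidden state sequence.

    obs: list of observation indices
    pi : (S,)   initial state probabilities
    A  : (S, S) transition probabilities, A[i, j] = P(q_t = j | q_{t-1} = i)
    B  : (S, V) emission probabilities,   B[i, w] = P(o_t = w | q_t = i)
    """
    S, n = len(pi), len(obs)
    logd = np.full((n, S), -np.inf)           # best log-prob of a path ending in each state
    back = np.zeros((n, S), dtype=int)        # backpointers
    logd[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, n):
        # scores[i, j] = best path ending in i at t-1, then i -> j, then emit obs[t] from j
        scores = logd[t - 1][:, None] + np.log(A) + np.log(B[:, obs[t]])[None, :]
        back[t] = scores.argmax(axis=0)
        logd[t] = scores.max(axis=0)
    path = [int(logd[-1].argmax())]
    for t in range(n - 1, 0, -1):             # follow the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```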

How to get \(\pi,A,B\)?

Supervised learning:

  • \(\pi_i=\dfrac{\#(q_1=i)}{\#\text{sequences}}\)
  • \(A_{i,j}=\dfrac{\#(q_{t-1}=i,q_t=j)}{\#(q_{t-1}=i,q_t=*)}\)
  • \(B_{i,w}=\dfrac{\#(q_t=i,o_t=w)}{\#(q_t=i)}\)

Unsupervised learning: optimize \(\log\mathbb{P}(O|\theta)=\log\sum\limits_{Q}\mathbb{P}(O,Q|\theta)\). Apply E-M algorithm:

  • E-step: \(Q(Z)=\mathbb{P}(Q|O,\theta)\), which can be computed from the current parameters \(\pi,A,B\) of \(\theta\) (the forward-backward algorithm).
  • M-step: update \(\theta\) for \(\max\sum\limits_{Z}Q(Z)\log \mathbb{P}(O,Z|\theta)\), apply Lagrange multiplier.

\(N\)-gram LM

Language model: A language model assigns a probability to any sequence of words.

Unigram: each word is independent of all others. \(P_{unigram}(w_1,...,w_T)=P(w_1)P(w_2)...P(w_T)\)

\(N\)-gram: the probability of each word depends on the previous \(n-1\) words. For example, \(P_{tri}(w_t|w_{t-2}w_{t-1})=\dfrac{\text{count}(w_{t-2}w_{t-1}w_t)}{\text{count}(w_{t-2}w_{t-1}*)}\)

Two special tokens:

  • <eos>: end of sentence.
  • <unk>: out of vocabulary token.

Add-\(k\) smoothing: \(P(w_t|w_{t-N+1}\cdots w_{t-1})=\dfrac{\text{count}(w_{t-N+1}\cdots w_{t-1}w_t)+k}{\text{count}(w_{t-N+1}\cdots w_{t-1}*)+k|V|}\).

Interpolation: \(\hat P(w_t|w_{t-2}w_{t-1})=\lambda_1P_{tri}(w_t|w_{t-2}w_{t-1})+\lambda_2P_{bi}(w_t|w_{t-1})+\lambda_3P_{uni}(w_t)\), where \(\lambda_1+\lambda_2+\lambda_3=1\) are tunable parameters.

Backoff: \(P_{triBO}(w_t|w_{t-2}w_{t-1})=\begin{cases}P_{tri}^*(w_t|w_{t-2}w_{t-1})&(\text{count}(w_{t-2}w_{t-1}w_t)>0)\\\alpha(w_{t-2}w_{t-1})P_{bi}(w_t|w_{t-1})&(\text{otherwise})\end{cases}\)

Perplexity: \(PPL(W)=2^{-l}\), where \(l=\dfrac{\log_2(P(W))}{\text{token_len}(W)}\)
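
A toy bigram model with add-\(k\) smoothing and the perplexity formula above, as a rough sketch; using `<eos>` as both the start and end marker is a simplification:

```python
import math
from collections import Counter

def train_bigram(corpus):
    """corpus: list of token lists; returns bigram and unigram counts."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        toks = sent + ["<eos>"]
        unigrams.update(toks)
        bigrams.update(zip(["<eos>"] + sent, toks))   # pairs (w_{t-1}, w_t)
    return bigrams, unigrams

def bigram_prob(w_prev, w, bigrams, unigrams, vocab_size, k=1.0):
    """Add-k smoothed P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

def perplexity(sent, bigrams, unigrams, vocab_size, k=1.0):
    toks = sent + ["<eos>"]
    log_p = sum(math.log2(bigram_prob(p, w, bigrams, unigrams, vocab_size, k))
                for p, w in zip(["<eos>"] + sent, toks))
    return 2 ** (-log_p / len(toks))          # 2^{-l}, l = log2 P(W) / token_len(W)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
bi, uni = train_bigram(corpus)
print(perplexity(["the", "cat", "sat"], bi, uni, vocab_size=len(uni)))
```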


Word2Vec

Skip-gram: learn representations that predict the context given a word. \(p_{\theta}(\text{out}|\text{input})=\dfrac{\exp(u_{\text{out}}\cdot w_{\text{input}})}{\sum_{v\in V}\exp(u_v\cdot w_{\text{input}})}\), loss \(L_t=-\log p_{\theta}(x_{t-s}|x_t)-\cdots-\log p_{\theta}(x_{t+s}|x_t)\). Apply the gradient update: \(w_x\leftarrow w_x+\eta(u_y-E_{p_\theta(v|x)}[u_v])\). Negative sampling: for each true pair \((x,y)\), sample \(k\) negative samples \(y'\sim P_n\) and maximize \(\log\sigma(u_y\cdot w_x)+\sum\limits_{i=1}^{k}\mathbb{E}_{y'\sim P_n}[\log\sigma(-u_{y'}\cdot w_x)]\), where \(\sigma\) is the sigmoid function and \(P_n\) is a unigram noise distribution.
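
A single skip-gram-with-negative-sampling update as a numpy sketch; real implementations batch updates and usually draw negatives from a smoothed (3/4-power) unigram distribution, which is omitted here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, neg_ids, lr=0.05):
    """One negative-sampling update for a (center, context) pair.

    W_in[x] = w_x (input/center embedding), W_out[y] = u_y (output/context embedding),
    neg_ids = indices of k negative samples drawn from the noise distribution.
    """
    w = W_in[center]
    ids = np.concatenate(([context], neg_ids))
    labels = np.zeros(len(ids)); labels[0] = 1.0      # 1 for the true pair, 0 for negatives
    scores = sigmoid(W_out[ids] @ w)                  # sigma(u_y . w_x)
    grad = scores - labels                            # d/d(score) of the negated objective
    W_in[center] -= lr * grad @ W_out[ids]            # update w_x
    W_out[ids]  -= lr * np.outer(grad, w)             # update u_y and the u_{y'}
    return W_in, W_out

rng = np.random.default_rng(0)
V, d = 100, 16
W_in, W_out = rng.normal(0, 0.1, (V, d)), rng.normal(0, 0.1, (V, d))
sgns_step(W_in, W_out, center=3, context=7, neg_ids=rng.integers(0, V, 5))
```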

CBOW (Continuous Bag-of-Words): learn representations that predict a word given its context. \(p_\theta(x_t|x_{t-s}\cdots x_{t+s})=\dfrac{\exp(u_{x_t}\cdot \frac{1}{2s}\sum_{j\ne 0}w_{x_{t+j}})}{Z}\), loss \(L_t=-\log p_\theta(x_t|x_{t-s},...,x_{t+s})\).


Neural network

Dropout: A regularization technique for neural networks that randomly drops out a unit at training time with a specified probability \(p\). (much less mentioned in very recent LLM reports)

Parallel computing (CUDA)

Class-based LM: Cluster the tokens into \(\sqrt{|V|}\) classes. Decompose the prediction of a token into first predicting its class and then the token within that class.


Recurrent neural network

Recurrent neural network: encodes the whole history \(w_0,w_1,\cdots,w_{t-1}\) in a hidden state.

\[h_t=\sigma(W_{ih}x_t+W_{hh}h_{t-1}+b_h) \]

\[y_t=\text{softmax}(W_{ho}h_t+b_o) \]

\[L(w)=-\sum\limits_{i}\log\mathbb{P}(w_i|w_{0\cdots i-1}) \]

The parameters \(W_{ih}\) and \(W_{hh}\) are shared across timestamps.

Back-propagation through time: do back-propagation in the reverse topological order.

  • Gradient exploding: gradient clipping, \(\text{clip}(\nabla L)=\min(1,\dfrac{\gamma}{\lVert \nabla L\rVert})\nabla L\), where \(\gamma\) is a hyper-parameter, usually set to be \(1\) or \(0.5\).

  • Gradient vanishing: long short-term memory (LSTM).

Gated recurrent unit (GRU):

  • Update gate: \(z_t=\sigma(W_z\cdot [x_t,h_{t-1}]+b_z)\).
  • Reset gate: \(r_t=\sigma(W_r\cdot [x_t,h_{t-1}]+b_r)\).
  • Candidate hidden state: \(\hat h_t=\tanh(W\cdot [r_t*h_{t-1},x_t]+b)\)
  • \(h_t=z_t\cdot \hat h_t+(1-z_t)\cdot h_{t-1}\)

The \((1-z_t)\cdot h_{t-1}\) term alleviates gradient vanishing.
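
A GRU cell forward pass in numpy, following the gate equations above; the gates act on \([x_t,h_{t-1}]\) and the candidate on \([r_t*h_{t-1},x_t]\), and the concatenation order inside each product is just a convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, Wz, bz, Wr, br, W, b):
    """One GRU step for a single example."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(Wz @ xh + bz)                                      # update gate
    r = sigmoid(Wr @ xh + br)                                      # reset gate
    h_hat = np.tanh(W @ np.concatenate([r * h_prev, x_t]) + b)     # candidate hidden state
    return z * h_hat + (1 - z) * h_prev                            # interpolate new and old state

d_x, d_h = 4, 3
rng = np.random.default_rng(0)
Wz, Wr, W = (rng.normal(size=(d_h, d_x + d_h)) for _ in range(3))
bz, br, b = (np.zeros(d_h) for _ in range(3))
h = gru_cell(rng.normal(size=d_x), np.zeros(d_h), Wz, bz, Wr, br, W, b)
print(h.shape)   # (3,)
```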

Parallel training for RNN: parallelize across sentences.

Sampling with RNNLM: At each time step \(t\), sample \(w_t\) from \(\mathbb{P}(w_t|\cdots)\), and feed it to the next timestamp (autoregressive LM).

Residual network: Add a direct link between hidden layers, \(h_{l+1}=h_l+F(h_l)\).

Bi-directional RNN: \(h_t\) has context from both the left and the right (CANNOT be used for autoregressive language modelling because it is forbidden to utilize information from the future)


seq2seq

Encoder-decoder model: use bi-RNN encoder for the input sequence, and use a uni-RNN decoder
for the output. Average the encoder's hidden vectors for the input of the decoder RNN.

Attention: in machine translation, we may want to pay attention to different parts of the input in different timesteps.

At decoder timestep \(t\), for each encoder state \(h_i^{enc}\), compute an alignment score \(\hat{a}_i=(h_i^{enc})^TW_ah_{t-1}^{dec}\), where \(W_a\) is shared across timesteps. Normalize the scores with a softmax to get weights \(a\), reweight the encoder states, and pass the context vector \(\sum_i a_ih_i^{enc}\) to the decoder.

Decoding with beam search: Maintain a number of beams. At each time step, expand the current beams, sort the candidates, and keep only the \(k\) with the largest log-probability.
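
A beam-search sketch; `step_fn` is an assumed callback that maps a prefix of token ids to a vector of next-token log-probabilities:

```python
import numpy as np

def beam_search(step_fn, bos, eos, beam_size=4, max_len=20):
    """step_fn(prefix) -> 1-D array of log-probs over the vocabulary (assumed interface)."""
    beams = [([bos], 0.0)]                        # (token sequence, total log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_fn(seq)
            for tok in np.argsort(log_probs)[-beam_size:]:     # expand only the top tokens
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        # keep the k candidates with the largest log-probability
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```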


VAE-LM

Train an encoder \(q_{\theta}\) that generates \(z\) from \(x\), and a decoder \(p_{\theta}\) that reconstructs \(x\) from \(z\).

Generation: sample \(z\) from a prior distribution \(p(z)\), then generate \(x\) from \(p_{\theta}(x|z)\).

What is the training objective for VAE? Consider the maximum likelihood, we have

\[\begin{aligned} &\ln p(x)\\ =&\ln\int_zp(x,z)\\ =&\ln\int_zp(x,z)\dfrac{q(z|x)}{q(z|x)}\\ \ge&\mathbb{E}_{z\sim q(z|x)}[\ln(\dfrac{p(x,z)}{q(z|x)})]\\ =&\mathbb{E}_{z\sim q(z|x)}[\ln(\dfrac{p(x|z)p(z)}{q(z|x)})]\\ =&\mathbb{E}_{z\sim q(z|x)}\ln(p(x|z))-\text{KL}(q(z|x)\parallel p(z)) \end{aligned} \]

This is called ELBO (Evidence lower bound). When we are maximizing ELBO, we are also learning a good
posterior \(q\).

Reparameterization trick: the term \(\mathbb{E}_{z\sim q(z|x)}\ln(p(x|z))\) is not convenient for BP, but in practice, \(p(z),p_{\theta}(x|z)\) and \(q_{\theta}(z|x)\) are all Gaussian, so we can sample \(\epsilon\sim\mathcal N(0,1)\) and let \(z=\mu+\sigma\epsilon\), then do BP.
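
A numpy sketch of the reparameterized sample and of the closed-form Gaussian KL term in the ELBO (forward computation only, no backprop here):

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """Sample z ~ q(z|x) = N(mu, sigma^2) differentiably: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def gaussian_kl(mu, log_sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma)

rng = np.random.default_rng(0)
mu, log_sigma = np.array([0.3, -0.1]), np.array([-1.0, -0.5])
z = reparameterize(mu, log_sigma, rng)
print(z, gaussian_kl(mu, log_sigma))
```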


Subword tokenization

Byte pair encoding (BPE):

  • Start with a unigram vocabulary of all characters in the data.
  • In each iteration, find the most frequent pair, merge it, and add to the vocabulary.
  • Stop when the vocabulary reaches a pre-determined size (see the sketch below).
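
A minimal BPE training loop over word frequencies, as a sketch (the `</w>` end-of-word marker and the toy corpus are illustrative):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """words: Counter of word -> frequency; returns the list of learned merges."""
    # start from characters, with a word-final marker so merges don't cross word boundaries
    corpus = {tuple(w) + ("</w>",): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym_seq, freq in corpus.items():
            for pair in zip(sym_seq, sym_seq[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_corpus = {}
        for sym_seq, freq in corpus.items():      # apply the merge everywhere
            out, i = [], 0
            while i < len(sym_seq):
                if i + 1 < len(sym_seq) and (sym_seq[i], sym_seq[i + 1]) == best:
                    out.append(sym_seq[i] + sym_seq[i + 1]); i += 2
                else:
                    out.append(sym_seq[i]); i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return merges

print(learn_bpe(Counter({"low": 5, "lower": 2, "newest": 6, "widest": 3}), 4))
```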

Transformer

Self attention:

  • Train three matrices \(W_q,W_k\) and \(W_v\); for each word embedding \(x_i\), compute \(q_i=W_qx_i\), \(k_i=W_kx_i\), \(v_i=W_vx_i\), and denote the \(n\times d\) matrices whose rows are the \(q_i,k_i,v_i\) by \(Q,K,V\).
  • For every pair compute \(q_i\cdot k_j\) and apply softmax, i.e., compute matrix \(A=\text{softmax}(\dfrac{QK^T}{\sqrt{d}})\).
  • Get the outputs by weighting the value vectors, i.e., \(z_i=\sum\limits_{j}A_{i,j}v_j\).

Can be computed in parallel.
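
A single-head scaled dot-product self-attention sketch in numpy, in the row-per-token convention; the optional causal mask is the one used later for autoregressive LMs:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=False):
    """Single-head self-attention. X: (n, d_model); rows are token embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # (n, d) each
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # scores[i, j] = q_i . k_j / sqrt(d)
    if causal:                                        # mask future tokens before the softmax
        scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -1e9)
    A = softmax(scores, axis=-1)                      # attention distribution per query
    return A @ V                                      # z_i = sum_j A[i, j] v_j

rng = np.random.default_rng(0)
n, d_model, d = 5, 8, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv, causal=True).shape)   # (5, 4)
```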

Layer normalization: \(\text{Layernorm}(h)=\alpha\cdot\dfrac{h-\text{mean}(h)}{\text{std}(h)}+\beta\), where \(\text{std}(h)\) is the standard deviation over all dimensions of \(h\) and \(\alpha,\beta\) are learnable parameters.

Positional embedding: an embedding added to the word embedding which contains positional information. For position \(t\): \(p_{t,2i}=\sin(\dfrac{t}{10000^{2i/d}}),p_{t,2i+1}=\cos(\dfrac{t}{10000^{2i/d}})\).


BERT

Masked language modeling (MLM): randomly mask several tokens in each sentence and ask the transformer to predict the masked token on the top layer via standard cross-entropy loss.

  • Some of the selected tokens are kept unchanged rather than masked, and some are replaced with random tokens.

Next sentence prediction (NSP): ask the model to predict whether sentence \(2\) is the next sentence of sentence \(1\).

ELECTRA: Instead of masking the input, we corrupt it by replacing some tokens with plausible
alternatives sampled from a small generator network.

Longformer: limit the attention to a fixed local window of size \(w\).

Transformer for autoregressive LM: add a mask for unseen tokens. (applied before softmax so that the attention distribution is still normalized)

Transformer encoder-decoder: Each decoder layer is a self-attention followed by a cross-attention. The query vector for a transformer decoder's cross-attention head is from the output of the previous decoder layer, and the key and value vectors are from the encoder's outputs.


Maximum Likelihood Estimation

Language GAN: \(\min\limits_{G}\max\limits_{D}V(D,G)\), where \(V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_z(z)}[\log(1-D(G(z)))]\).

However, if \(G\) is an RNNLM, we CANNOT directly apply GAN training for language generation because the generator outputs discrete tokens (one-hot vectors), so \(G(z)\) is not differentiable.

Solution 1: Gumbel-softmax reparameterization: return the probability distribution \(y_i=\dfrac{\exp((\log\pi_i+g_i)/\tau)}{\sum\exp((\log\pi_j+g_j)/\tau)}\), where \(\pi\) is the probability distribution \(G\) returns, \(g_i\) are i.i.d. samples from a Gumbel\((0,1)\) distribution, and \(\tau\) is the temperature.
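
A Gumbel-softmax sampling sketch in numpy; as \(\tau\to 0\) the output approaches a one-hot sample from \(\pi\):

```python
import numpy as np

def gumbel_softmax(pi, tau, rng):
    """Soft, relaxed sample from the categorical distribution pi."""
    g = -np.log(-np.log(rng.uniform(size=pi.shape)))        # Gumbel(0, 1) noise
    logits = (np.log(pi) + g) / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()                                       # soft one-hot vector y

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.2, 0.1])
print(gumbel_softmax(pi, tau=0.5, rng=rng))     # approaches a one-hot vector as tau -> 0
```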

Solution 2: Policy gradient.

Three popular sampling algorithms (quality-diversity trade-off); assume tokens are indexed in decreasing order of probability:

  • Top \(K\): \(\hat{p_i}=\dfrac{p_i\cdot 1\{i\le K\}}{Z}\).
  • Nucleus: \(\hat{p_i}=\dfrac{p_i\cdot 1\{\sum\limits_{j=1}^{i-1}p_j<P\}}{Z}\).
  • Tempered: \(\hat{p_i}=\dfrac{\exp(\log(p_i)/T)}{Z}\).

Properties:

  • Order preservation: \(p_i\ge p_j\Rightarrow\hat{p_i}\ge\hat{p_j}\).
  • Entropy reduction: \(\mathcal H(\hat{p})\le\mathcal H(p)\).
  • Slope preservation: \(\dfrac{\log p_i-\log p_j}{\log p_i-\log p_k}=\dfrac{\log \hat{p_i}-\log \hat{p_j}}{\log \hat{p_i}-\log \hat{p_k}}\).
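
A numpy sketch of the three transforms above; sorting is done inside each function, so the input need not be pre-sorted:

```python
import numpy as np

def top_k(p, k):
    idx = np.argsort(p)[::-1][:k]          # indices of the k most probable tokens
    q = np.zeros_like(p); q[idx] = p[idx]
    return q / q.sum()

def nucleus(p, top_p):
    order = np.argsort(p)[::-1]
    keep = np.cumsum(p[order]) - p[order] < top_p   # smallest prefix whose mass reaches top_p
    q = np.zeros_like(p); q[order[keep]] = p[order[keep]]
    return q / q.sum()

def tempered(p, T):
    logits = np.log(p) / T
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k(p, 2), nucleus(p, 0.8), tempered(p, 0.7), sep="\n")
```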

Tackling bad behaviors:

  • Biased decoding: discourage repeating tokens.
  • Unlikelihood training for repetition
  • The MMI criterion
  • Negative training

GPT3

In-context learning: ask the LLM to perform a task in the same way as demonstrated in the prompt.

Chain of thoughts:

  • CoT with self-consistency: sample multiple reasoning paths from the LLM with temperature sampling and then take a majority vote over the answers.
  • Tree of thoughts: maintain and expand a thought tree. For each existing step, we prompt the LLM to propose multiple next steps, and also to judge which path is more promising.

Instruction tuning: After pretraining, finetune the language model on a good amount of "instruction following" data. We hope the model can generalize to unseen task types.


RLHF

Prompting: Pros: training-free; Cons: no guarantee that the model will precisely follow the instructions, and careful prompt design is required.

Best-of-N: Pros: Do not need to train the policy model, simple and powerful; Cons: not efficient and you might need a large \(N\).

RLHF: collect samples from the model, and ask labelers to rank them. These ranks are used to train the reward model.

Reward model: given input \(x\) and response \(y\), return a reward value \(r\). Training objective:

\[\mathcal L_R(r_{\phi},D)=-\mathbb{E}_{(x,y_w,y_l)\in D}[\log\sigma(r_{\phi}(x,y_w)-r_{\phi}(x,y_l))] \]

where \(\phi\) is the parameter of the reward model and \(\sigma\) is the sigmoid function.

During the RL phase, the language model tries to maximize the reward function, but should not be too far away from the reference model \(\pi_{ref}\). (RLHF-PPO)

\[\begin{aligned} &\max\limits_{\pi_{\theta}}\mathbb{E}_{x\sim D,y\sim\pi_{\theta}(y|x)}[r_{\phi}(x,y)]-\beta\text{KL}(\pi_{\theta}(\cdot|x)\parallel\pi_{ref}(\cdot|x))\\ =&\max\limits_{\pi_{\theta}}\mathbb{E}_{x\sim D,y\sim\pi_{\theta}(y|x)}[r_{\phi}(x,y)-\beta(\log\pi_{\theta}(y|x)-\log\pi_{ref}(y|x))] \end{aligned} \]

DPO: does not need a reward model or a value model.

\[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \in D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \right] \]

The KL-constrained objective above has a closed-form optimal policy, which DPO uses to eliminate the explicit reward model:

\[\pi^*(y | x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y | x) \exp\left( \frac{1}{\beta} r(x, y) \right) \]
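
A sketch of the DPO loss for one preference pair, assuming the summed log-probabilities under the policy and the frozen reference model have already been computed (the numeric values are made up):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a single (x, y_w, y_l) preference pair.

    Each argument is the summed log-probability of the chosen (w) or rejected (l)
    response under the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))    # -log sigmoid(margin)

# toy numbers: the policy already prefers y_w slightly more than the reference does
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.5))
```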


PET

Parameter-efficient tuning: only tune a small set of parameters (5%, 1% or lower) for a given task.

  • Adapter: Add linear transformations \(d_{model}\to m\to d_{model}\), where \(m\) is small, and only tune the adapter parameters during finetuning.
  • Bitfit: train only the bias-term and task-specific classification layer.
  • LoRA: For each weight matrix \(W_0\), which is kept frozen, apply a low-rank reparameterization \(W_0+BA\), where \(W_0\in\mathbb{R}^{d\times k},B\in\mathbb{R}^{d\times r},A\in\mathbb{R}^{r\times k}\). (zero extra inference latency)
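
A LoRA forward-pass sketch in numpy; the \(\alpha/r\) scaling and the zero-initialization of \(B\) follow common practice and are assumptions here:

```python
import numpy as np

class LoRALinear:
    """y = (W0 + (alpha / r) * B @ A) x, with W0 frozen and only A, B trainable."""
    def __init__(self, W0, r=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d, k = W0.shape
        self.W0 = W0                                    # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, k))            # low-rank factors
        self.B = np.zeros((d, r))                       # B starts at zero, so W = W0 initially
        self.scale = alpha / r

    def __call__(self, x):
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(np.random.randn(64, 32), r=4)
print(layer(np.random.randn(32)).shape)    # (64,)
```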