BGD, SGD and MBGD
1 BGD (Batch Gradient Descent)
BGD computes the gradient over the entire training dataset in each iteration, so it makes exactly one parameter update per epoch.
$$\theta_{j+1}=\theta_{j}-\eta\cdot\frac{1}{m}\sum_{i=1}^{m}\nabla J_{i}(\theta_{j})$$
where $J_{i}$ is the loss on the $i$-th training example and $m$ is the dataset size.
pros:
- stable convergence with a consistent gradient direction
- can achieve optimal convergence for convex surfaces
cons:
- computationally expensive
- memory intensive
- can't be updated online
Use BGD when the dataset is small; a sketch follows below.
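A minimal sketch of BGD in NumPy, assuming a linear regression model with squared-error loss; the function name, learning rate, epoch count, and synthetic data are illustrative choices, not from the original text.

```python
import numpy as np

def bgd(X, y, eta=0.1, n_epochs=100):
    """Batch gradient descent: one update per epoch over the full dataset."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        # Gradient of the MSE loss averaged over all m training examples.
        grad = (1.0 / m) * X.T @ (X @ theta - y)
        theta -= eta * grad
    return theta

# Toy usage: recover weights [2, 3] from noisy synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0]) + 0.01 * rng.normal(size=200)
print(bgd(X, y))
```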
2 SGD (Stochastic Gradient Descent)
- Computes the gradient using one randomly chosen training example at a time
- Makes many updates per epoch (one per example)
$$\theta_{j+1}=\theta_{j}-\eta\cdot\nabla J_{i}(\theta_{j})$$
where $J_{i}$ is the loss on the single randomly selected example $i$.
pros:
- computationally efficient
- can escape local minima
- online learning capacity
cons:
- high variance
- may never converge
- requires careful learning rate scheduling
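A minimal sketch of SGD on the same linear regression setup as above; shuffling each epoch and the specific learning rate and epoch count are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=20):
    """Stochastic gradient descent: one update per randomly drawn example."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):                # visit examples in random order
            grad = X[i] * (X[i] @ theta - y[i])     # gradient on a single example
            theta -= eta * grad
    return theta

# Toy usage on the same synthetic data as the BGD sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0]) + 0.01 * rng.normal(size=200)
print(sgd(X, y))
```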
3 MBGD (Mini Batch Gradient Descent)
- Computes the gradient using small batches (typically 32-512 samples)
- Balances BGD's stability with SGD's efficiency
$$\theta_{j+1}=\theta_{j}-\eta\cdot\frac{1}{|B|}\sum_{i\in B}\nabla J_{i}(\theta_{j})$$
where $B$ is the mini-batch sampled at step $j$ and $|B|$ is its size.
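A minimal sketch of MBGD, again on the linear regression setup; batch_size=32 is an illustrative choice within the typical 32-512 range mentioned above.

```python
import numpy as np

def mbgd(X, y, eta=0.05, n_epochs=50, batch_size=32):
    """Mini-batch gradient descent: one update per mini-batch."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                    # reshuffle each epoch
        for start in range(0, m, batch_size):
            B = idx[start:start + batch_size]       # indices of the current mini-batch
            # Gradient averaged over the mini-batch B.
            grad = (1.0 / len(B)) * X[B].T @ (X[B] @ theta - y[B])
            theta -= eta * grad
    return theta

# Toy usage on the same synthetic data as the earlier sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0]) + 0.01 * rng.normal(size=200)
print(mbgd(X, y))
```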
Thank you!
