BGD, SGD and MBGD
1 BGD (Batch Gradient Descent)
BGD computes the gradient over the entire training dataset in each iteration, so it makes exactly one parameter update per epoch.
$$\theta_{j+1}=\theta_{j}-\eta\cdot\frac{1}{m}\sum_{i=1}^{m}\nabla J_{i}(\theta_{j})$$
where $J_{i}$ is the loss on the $i$-th training example and $m$ is the dataset size.
pros:
- stable convergence with a consistent gradient direction
- can achieve optimal convergence for convex surfaces
cons:
- computationally expensive
- memory intensive
- can't be updated online
Use BGD when the dataset is small; a sketch follows below.
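A minimal sketch of BGD in NumPy, assuming a linear regression model with squared-error loss; the function name, learning rate, epoch count, and synthetic data are illustrative choices, not from the original text.

```python
import numpy as np

def bgd(X, y, eta=0.1, n_epochs=100):
    """Batch gradient descent: one update per epoch over the full dataset."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        # Gradient of the MSE loss averaged over all m training examples.
        grad = (1.0 / m) * X.T @ (X @ theta - y)
        theta -= eta * grad
    return theta

# Toy usage: recover weights [2, 3] from noisy synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0]) + 0.01 * rng.normal(size=200)
print(bgd(X, y))
```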
2 SGD (Stochastic Gradient Descent)
- Computes the gradient using one randomly chosen training example at a time
- Makes many updates per epoch (one per example)
$$\theta_{j+1}=\theta_{j}-\eta\cdot\nabla J_{i}(\theta_{j})$$
where $J_{i}$ is the loss on the single randomly selected example $i$.
pros:
- computationally efficient
- can escape local minima
- online learning capacity
cons:
- high variance
- may never converge
- requires careful learning rate scheduling
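A minimal sketch of SGD on the same linear regression setup as above; shuffling each epoch and the specific learning rate and epoch count are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=20):
    """Stochastic gradient descent: one update per randomly drawn example."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):                # visit examples in random order
            grad = X[i] * (X[i] @ theta - y[i])     # gradient on a single example
            theta -= eta * grad
    return theta

# Toy usage on the same synthetic data as the BGD sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0]) + 0.01 * rng.normal(size=200)
print(sgd(X, y))
```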
3 MBGD (Mini Batch Gradient Descent)
- Computes the gradient using small batches (typically 32-512 samples)
- Balances BGD's stability with SGD's efficiency
$$\theta_{j+1}=\theta_{j}-\eta\cdot\frac{1}{|B|}\sum_{i\in B}\nabla J_{i}(\theta_{j})$$
where $B$ is the mini-batch sampled at step $j$ and $|B|$ is its size.
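A minimal sketch of MBGD, again on the linear regression setup; batch_size=32 is an illustrative choice within the typical 32-512 range mentioned above.

```python
import numpy as np

def mbgd(X, y, eta=0.05, n_epochs=50, batch_size=32):
    """Mini-batch gradient descent: one update per mini-batch."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                    # reshuffle each epoch
        for start in range(0, m, batch_size):
            B = idx[start:start + batch_size]       # indices of the current mini-batch
            # Gradient averaged over the mini-batch B.
            grad = (1.0 / len(B)) * X[B].T @ (X[B] @ theta - y[B])
            theta -= eta * grad
    return theta

# Toy usage on the same synthetic data as the earlier sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0]) + 0.01 * rng.normal(size=200)
print(mbgd(X, y))
```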
Thank you!
