BGD, SGD and MBGD

1 BGD (Batch Gradient Descent)

BGD computes the gradient using the entire training dataset in each iteration, so it makes one parameter update per epoch.

$$\theta_{j+1}=\theta_{j}-\eta\cdot\frac{1}{m}\sum_{i=1}^{m}\nabla J_i(\theta_{j})$$

where $m$ is the number of training examples and $J_i$ is the loss on example $i$.

pros:

  • stable convergence with a consistent gradient direction
  • converges to the optimum on convex loss surfaces

cons:

  • computationally expensive
  • memory intensive
  • cannot be updated online (the full dataset is needed for every step)

Use BGD when the dataset is small.
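A minimal NumPy sketch of BGD on linear regression with MSE loss; the synthetic data, learning rate, and iteration count are illustrative assumptions, not part of the original notes:

```python
import numpy as np

# Batch gradient descent on linear regression with MSE loss (illustrative setup).
rng = np.random.default_rng(0)
m = 200                                           # number of training examples
X = np.c_[np.ones(m), rng.normal(size=(m, 1))]    # bias column + one feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta = np.zeros(2)
eta = 0.1

for _ in range(100):
    # gradient averaged over the ENTIRE dataset -> one update per epoch
    grad = X.T @ (X @ theta - y) / m
    theta -= eta * grad

print(theta)   # close to [2.0, -3.0]
```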

 

2 SGD (Stochastic Gradient Descent)

  • Computes the gradient using a single randomly chosen training example at a time

  • Makes many updates per epoch (one per example)

$$\theta_{j+1}=\theta_{j}-\eta\cdot\nabla J_i(\theta_{j})$$

where $i$ is the index of a randomly sampled training example.

pros:

  • computationally efficient
  • can escape local minima
  • supports online learning (can update from new examples as they arrive)

cons:

  • high variance in the updates
  • may never fully converge, oscillating around the minimum
  • requires careful learning rate scheduling
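A minimal sketch of SGD on the same illustrative linear-regression setup; the decaying learning-rate schedule here is just one assumed choice:

```python
import numpy as np

# Stochastic gradient descent: one update per randomly chosen example.
rng = np.random.default_rng(0)
m = 200
X = np.c_[np.ones(m), rng.normal(size=(m, 1))]
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta = np.zeros(2)
eta0 = 0.1
step = 0

for epoch in range(20):
    for i in rng.permutation(m):              # m updates per epoch, one per example
        step += 1
        eta = eta0 / (1 + 0.01 * step)        # simple decaying learning rate (assumed)
        grad = (X[i] @ theta - y[i]) * X[i]   # gradient from a single example
        theta -= eta * grad

print(theta)
```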

 

3 MBGD (Mini-Batch Gradient Descent)

  • Computes the gradient using small batches (typically 32–512 samples)
  • Balances the stability of BGD with the efficiency of SGD

$$\theta_{j+1}=\theta_{j}-\eta\cdot\frac{1}{|B|}\sum_{i\in B}\nabla J_i(\theta_{j})$$

where $B$ is a mini-batch sampled from the training set and $|B|$ is its size.
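A minimal sketch of MBGD with batch size 32 on the same illustrative setup:

```python
import numpy as np

# Mini-batch gradient descent: average the gradient over a small batch per update.
rng = np.random.default_rng(0)
m = 200
X = np.c_[np.ones(m), rng.normal(size=(m, 1))]
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta = np.zeros(2)
eta = 0.1
batch_size = 32

for epoch in range(50):
    idx = rng.permutation(m)                  # reshuffle each epoch
    for start in range(0, m, batch_size):
        B = idx[start:start + batch_size]     # indices of the current mini-batch
        grad = X[B].T @ (X[B] @ theta - y[B]) / len(B)   # gradient averaged over B
        theta -= eta * grad

print(theta)
```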

 
