activation functions summary and comparison

foreword

Any nonlinear function with well-behaved derivatives can, in principle, serve as an activation function, so this post only compares a few classic ones.

summary

For each activation function below we list its formula, main attributes, advantages, disadvantages, and typical usage.

sigmoid

formula: σ(x) = 1 / (1 + e^(−x))

attributes
  • output is in (0, 1)
  • gradient is in (0, 0.25]

advantages
  • smooth and easy to differentiate, no jumps in the output

disadvantages
  • gradient vanishing
  • relatively expensive to compute (exponential), so it is slow

usage
  • binary classification
 
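To make the two attributes above concrete (bounded output, gradient capped at 0.25), here is a minimal NumPy sketch of sigmoid and its derivative; the function names are only illustrative.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)): output always lies strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigma(x) = sigma(x) * (1 - sigma(x)): maximum value 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10.0, 10.0, 1001)
print(sigmoid(x).min(), sigmoid(x).max())  # stays inside (0, 1)
print(sigmoid_grad(x).max())               # ~0.25: stacking many sigmoid layers shrinks gradients fast
```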
tanh

formula: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

attributes
  • output is in (−1, 1)
  • gradient is in (0, 1]
  • zero-centred output (mean(output) ≈ 0), which makes subsequent optimization easier

advantages
  • smooth

disadvantages
  • gradient vanishing
  • relatively expensive to compute, slow

usage
  • RNNs
 
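A matching sketch for tanh (again with illustrative names), showing the zero-centred output in (−1, 1) and the derivative 1 − tanh²(x), which peaks at 1 but still decays in the tails:

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2: equals 1 at x = 0, approaches 0 for large |x|
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-5.0, 5.0, 1001)
y = np.tanh(x)
print(y.min(), y.max())    # bounded in (-1, 1)
print(y.mean())            # ~0 for symmetric inputs: the zero-centred property
print(tanh_grad(x).max())  # 1.0, larger than sigmoid's 0.25, but it still vanishes in the tails
```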
softmax

formula: softmax(x)_i = e^(x_i) / Σ_j e^(x_j)

attributes
  • gradient is in [0, 1]
  • converts the outputs into a probability distribution: every entry is in (0, 1) and the entries sum to 1

advantages
  • smooth

disadvantages
  • gradient vanishing
  • relatively expensive to compute, slow

usage
  • multi-class classification

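The sketch below implements softmax over a vector of logits; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something specific to this post.

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability; the result is mathematically unchanged
    e = np.exp(z - np.max(z))
    # every entry lands in (0, 1) and the entries sum to 1: a probability distribution
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)        # approx [0.659 0.242 0.099]
print(p.sum())  # 1.0
```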
ReLU

formula: ReLU(x) = max(0, x)

attributes
  • output is in [0, +∞)
  • gradient is either 0 or 1

advantages
  • prevents gradient vanishing (no saturation for positive inputs)
  • cheap to compute, fast convergence

disadvantages
  • dead ReLU problem (units that only receive negative inputs get zero gradient and stop updating)

usage
  • LeNet-5
  • AlexNet
  • VGG
 
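A minimal sketch of ReLU and its (sub)gradient; it shows both the non-saturating positive side and the region where the gradient is exactly 0, which is where the dead-ReLU problem comes from.

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient: 1 for x > 0, 0 otherwise (units stuck here stop learning: "dead ReLU")
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]  -- no shrinking factor on the positive side
```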
ReLU6

formula: ReLU6(x) = min(max(0, x), 6)

attributes
  • output is limited to [0, 6]

advantages
  • high efficiency; the bounded range is friendly to low-precision inference

disadvantages
  • the clipped output limits expressiveness

usage
  • mobile devices (e.g. MobileNet)
 
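ReLU6 is just ReLU clipped at 6, so a one-line sketch is enough; the fixed output range is what makes it convenient for low-precision inference on mobile hardware.

```python
import numpy as np

def relu6(x):
    # ReLU6(x) = min(max(0, x), 6): linear on [0, 6], clipped outside that range
    return np.minimum(np.maximum(0.0, x), 6.0)

print(relu6(np.array([-2.0, 3.0, 9.0])))  # [0. 3. 6.]
```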
SwiGLU

formula: SwiGLU(x) = Swish(xW) ⊗ (xV), where Swish(z) = z · sigmoid(z)

advantages
  • performance improvement: better than Swish, GLU, etc., including in the vision field
  • dynamic gating mechanism (one linear branch is modulated by a Swish-activated branch)

disadvantages
  • unknown

usage
  • PaLM (Google)
  • LLaMA 2 (Meta)
 
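Following the formula above, here is a rough NumPy sketch of a SwiGLU layer; the weight shapes, names, and sizes are illustrative assumptions, not taken from PaLM or LLaMA 2.

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish(x) = x * sigmoid(beta * x): smooth and non-monotonic around 0
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    # SwiGLU(x) = Swish(xW) * (xV): the Swish branch gates the plain linear branch
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16                 # illustrative sizes
x = rng.normal(size=(4, d_model))     # a batch of 4 input vectors
W = rng.normal(size=(d_model, d_ff))
V = rng.normal(size=(d_model, d_ff))
print(swiglu(x, W, V).shape)          # (4, 16)
```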

 

comparison 

Each pairwise comparison below lists what distinguishes the two functions and what they have in common.

sigmoid vs tanh
  • tanh: gradient in (0, 1], output in (−1, 1), zero-centred output, used in RNNs
  • sigmoid: gradient in (0, 0.25], output in (0, 1), mainly used for binary classification
  • in common: both are smooth, both suffer from gradient vanishing and high computation cost

sigmoid vs softmax
  • softmax: gradient in [0, 1], outputs a probability distribution (entries in (0, 1) summing to 1), used for multi-class classification
  • sigmoid: gradient in (0, 0.25], output in (0, 1), mainly used for binary classification
  • in common: both are smooth, both suffer from gradient vanishing and high computation cost

sigmoid vs ReLU
  • ReLU: gradient is 0 or 1, output in [0, +∞), widely used in the vision field, high efficiency, prevents gradient vanishing
  • sigmoid: gradient in (0, 0.25], output in (0, 1), mainly used for binary classification, lower efficiency

sigmoid vs ReLU6
  • ReLU6: gradient is 0 or 1, output in [0, 6], used on mobile devices, high efficiency
  • sigmoid: gradient in (0, 0.25], output in (0, 1), mainly used for binary classification, lower efficiency

sigmoid vs SwiGLU
  • SwiGLU: unbounded output, high performance, used in recent large language models
  • sigmoid: gradient in (0, 0.25], output in (0, 1), mainly used for binary classification

tanh vs softmax
  • softmax: outputs a probability distribution (entries in (0, 1) summing to 1), used for multi-class classification
  • tanh: output in (−1, 1), zero-centred, used in RNNs
  • in common: gradient in [0, 1]

tanh vs ReLU
  • ReLU: gradient is 0 or 1, output in [0, +∞), widely used in the vision field, high efficiency, prevents gradient vanishing
  • tanh: gradient in (0, 1], output in (−1, 1), zero-centred, used in RNNs

tanh vs ReLU6
  • ReLU6: output in [0, 6], used on mobile devices, high efficiency
  • tanh: output in (−1, 1), zero-centred, used in RNNs
  • in common: gradient in [0, 1]

tanh vs SwiGLU
  • SwiGLU: unbounded output, high performance
  • tanh: gradient in (0, 1], output in (−1, 1), zero-centred, used in RNNs

softmax vs ReLU
  • ReLU: gradient is 0 or 1, widely used in the vision field, high efficiency, prevents gradient vanishing
  • softmax: gradient in [0, 1], used for multi-class classification, suffers from gradient vanishing
  • in common: non-negative output

softmax vs ReLU6
  • ReLU6: output in [0, 6], used on mobile devices, high efficiency
  • softmax: output in (0, 1), used for multi-class classification, suffers from gradient vanishing
  • in common: gradient in [0, 1]

softmax vs SwiGLU
  • SwiGLU: high performance
  • softmax: gradient in [0, 1], suffers from gradient vanishing
  • in common: both are smooth

 

ReLU vs ReLU6
  • ReLU6: gradient is 0 or 1, output in [0, 6], used on mobile devices
  • ReLU: gradient is 0 or 1, output in [0, +∞), widely used in the vision field
  • in common: high efficiency

ReLU vs SwiGLU
  • SwiGLU: high performance
  • ReLU: gradient is 0 or 1, high efficiency, prevents gradient vanishing
  • in common: output is unbounded above

ReLU6 vs SwiGLU
  • SwiGLU: unbounded output, high performance
  • ReLU6: gradient is 0 or 1, output in [0, 6], used on mobile devices, high efficiency
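As a quick numerical cross-check of the gradient ranges quoted above, the sketch below evaluates the maximum derivative of sigmoid, tanh, and ReLU on a grid; the 0.25 vs 1.0 gap is the heart of the gradient-vanishing comparison.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 10001)
sig = 1.0 / (1.0 + np.exp(-x))

grads = {
    "sigmoid": sig * (1.0 - sig),    # peaks at 0.25
    "tanh": 1.0 - np.tanh(x) ** 2,   # peaks at 1.0
    "ReLU": (x > 0).astype(float),   # 0 or 1: no shrinking factor for positive inputs
}
for name, g in grads.items():
    print(f"{name:8s} max gradient = {g.max():.2f}")
```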

 
