# Notes on Warmup

## Why use warmup

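• It helps reduce premature overfitting to the mini-batches seen in the initial phase of training, which keeps the distribution stable
• It helps keep the deeper layers of the model stable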

## learning rate schedule

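Warmup is similar to a learning rate schedule; the two differ only in how the learning rate changes over the training steps, as shown in the figure.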
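
To make the comparison concrete, the sketch below treats a schedule as any function from step to learning rate and puts a linear warmup in front of it. The names `constant` and `with_linear_warmup` are hypothetical helpers chosen for illustration; they are not part of TensorFlow.

```python
from typing import Callable

# A schedule maps a training step to a learning rate.
Schedule = Callable[[int], float]


def constant(lr: float) -> Schedule:
    """A trivial schedule that always returns the same learning rate."""
    return lambda step: lr


def with_linear_warmup(schedule: Schedule, warmup_steps: int) -> Schedule:
    """Scale `schedule` linearly from 0 up to its own value over `warmup_steps`."""
    def warmed(step: int) -> float:
        if step < warmup_steps:
            return schedule(step) * step / warmup_steps
        return schedule(step)
    return warmed


warm = with_linear_warmup(constant(0.001), warmup_steps=4000)
for step in (0, 1000, 2000, 4000, 8000):
    print(step, warm(step))
```

During the first 4000 steps the wrapped schedule rises linearly; after that it simply returns whatever the underlying schedule returns, so any decay rule could be swapped in for `constant`.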

### learning rate schedule

TensorFlow provides several built-in learning rate schedules. The three shown in the figure above are used as examples here; the official documentation lists more.

    import matplotlib.pyplot as plt
    import tensorflow as tf


    # Custom schedule: linear warmup, then decay proportional to 1 / step.
    # Defined first so the plotting code below can use it.
    class MySchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
        def __init__(self, initial_learning_rate, warmup_steps=4000):
            super().__init__()
            self.initial_learning_rate = initial_learning_rate
            self.warmup_steps = warmup_steps

        def __call__(self, step):
            # Optimizers pass an integer step, so cast before dividing
            step = tf.cast(step, tf.float32)
            warmup = tf.cast(self.warmup_steps, tf.float32)
            # tf.where replaces the Python if/else so that tensors of steps
            # and graph mode are both handled
            return tf.where(
                step > warmup,
                self.initial_learning_rate * warmup / step,
                self.initial_learning_rate * step / warmup,
            )


    steps = tf.range(40000, dtype=tf.float32)

    # CosineDecay
    cosine_learning_rate_schedule = tf.keras.optimizers.schedules.CosineDecay(0.001, 4000)
    plt.plot(cosine_learning_rate_schedule(steps), label="cosine")

    # ExponentialDecay
    exp_learning_rate_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        0.001, decay_steps=4000, decay_rate=0.9, staircase=False
    )
    plt.plot(exp_learning_rate_schedule(steps), label="exp")

    # PiecewiseConstantDecay (evaluated one step at a time, so this loop is slow)
    boundaries = [10000, 20000, 30000]
    values = [0.001, 0.0008, 0.0004, 0.0001]
    piecewise_learning_rate_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
        boundaries, values
    )
    plt.plot([piecewise_learning_rate_schedule(step) for step in steps], label="piecewise")

    # Custom schedule
    my_learning_rate_schedule = MySchedule(0.001)
    plt.plot(my_learning_rate_schedule(steps), label="warmup")

    plt.title("Learning rate schedule")
    plt.ylabel("Learning Rate")
    plt.xlabel("Train Step")
    plt.legend()
    plt.show()
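
As a cross-check without TensorFlow, the four curves can be written as plain Python functions. This is a minimal sketch that assumes the default arguments of the Keras schedules (`alpha=0` for `CosineDecay`, `staircase=False` for `ExponentialDecay`); the function names are hypothetical and the parameter values are the ones used in the plot.

```python
import bisect
import math


def cosine_decay(step, initial_lr=0.001, decay_steps=4000):
    """Half a cosine wave from initial_lr down to 0, then flat at 0."""
    progress = min(step, decay_steps) / decay_steps
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def exponential_decay(step, initial_lr=0.001, decay_steps=4000, decay_rate=0.9):
    """Multiply by decay_rate once per decay_steps, interpolated smoothly."""
    return initial_lr * decay_rate ** (step / decay_steps)


def piecewise_constant(step, boundaries=(10000, 20000, 30000),
                       values=(0.001, 0.0008, 0.0004, 0.0001)):
    """values[i] applies up to and including boundaries[i]; the last value after that."""
    return values[bisect.bisect_left(boundaries, step)]


def warmup_then_inverse_decay(step, initial_lr=0.001, warmup_steps=4000):
    """Linear rise to initial_lr at warmup_steps, then decay proportional to 1 / step."""
    if step > warmup_steps:
        return initial_lr * warmup_steps / step
    return initial_lr * step / warmup_steps


for step in (0, 2000, 4000, 10000, 10001, 35000):
    print(step,
          cosine_decay(step),
          exponential_decay(step),
          piecewise_constant(step),
          warmup_then_inverse_decay(step))
```

Only the warmup curve starts at 0 and rises; the three built-in schedules all start at the initial learning rate and only decrease, which is the difference the plot is meant to show.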


### warmup in transformer

Noam Optimizer

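$lr = \alpha \cdot \frac{1}{\sqrt{d_{\text{model}}}} \cdot \min\left(\frac{1}{\sqrt{t}}, \frac{t}{w^{3/2}}\right)$

Here $t$ is the training step, $w$ is the number of warmup steps, and $\alpha$ is a constant scale factor. With $\alpha = 1$ this is the same as:

$lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$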
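
A short worked example of the formula in plain Python may help. It is a sketch under the assumptions $d_{model} = 128$ and $warmup\_steps = 4000$ (the values used in the code that follows), and `noam_lr` is a hypothetical name. The two arguments of $\min$ are equal exactly at $step\_num = warmup\_steps$, so the learning rate peaks there at $(d_{model} \cdot warmup\_steps)^{-0.5}$.

```python
def noam_lr(step, d_model=128, warmup_steps=4000):
    """Learning rate from the formula above; step must be >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


warmup_steps = 4000
d_model = 128

# At step == warmup_steps both branches of min() coincide: this is the peak.
peak = noam_lr(warmup_steps)
closed_form_peak = (d_model * warmup_steps) ** -0.5

for step in (1, 1000, 4000, 16000):
    print(step, noam_lr(step))
print("peak:", peak, "closed form:", closed_form_peak)
```

Before the peak the rate grows linearly with the step (at step 1000 it is a quarter of the peak); after it the rate falls as the inverse square root of the step (at step 16000 it is half of the peak).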

    import matplotlib.pyplot as plt
    import tensorflow as tf


    class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
        def __init__(self, d_model, warmup_steps=4000):
            super().__init__()
            self.d_model = tf.cast(d_model, tf.float32)
            self.warmup_steps = warmup_steps

        def __call__(self, step):
            # Optimizers pass an integer step; rsqrt needs a float
            step = tf.cast(step, tf.float32)
            arg1 = tf.math.rsqrt(step)
            arg2 = step * (self.warmup_steps ** -1.5)

            return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)


    d_model = 128  # assumed value, the same one used for the plot below
    learning_rate = CustomSchedule(d_model)
    # Adam settings as in the TensorFlow Transformer tutorial (reference 3)
    optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                         epsilon=1e-9)

    temp_learning_rate_schedule = CustomSchedule(128)

    plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
    plt.ylabel("Learning Rate")
    plt.xlabel("Train Step")
    plt.show()

### About the warmup parameters

References

[1] Why is the warmup strategy effective in neural networks, and is there a theoretical explanation? Answer by 香侬科技, Zhihu. https://www.zhihu.com/question/338066667/answer/771252708

[2] TensorFlow official documentation, tf.keras.optimizers.schedules. https://www.tensorflow.org/versions/r2.6/api_docs/python/tf/keras/optimizers/schedules

[3] Transformer model for language understanding, TensorFlow tutorial. https://www.tensorflow.org/tutorials/text/transformer#优化器（optimizer）

[4] A discussion of learning rate warmup (linear warmup). https://cloud.tencent.com/developer/article/1929850

Posted 2022-04-10 16:13 by 鱼与鱼