Vpegasus
E-mail: pegasus.wenjia@foxmail.com
Reinforcement Learning (5): Temporal-Difference Learning

Temporal-Difference Learning

TD learning occupies a central place in reinforcement learning: it combines ideas from dynamic programming (DP) and Monte Carlo (MC) methods. Like MC, TD can learn directly from raw experience without a model of the environment's dynamics. Like DP, it does not have to wait for a final outcome before learning: it bootstraps, meaning each estimate is updated in part on the basis of other, previously learned estimates.

The simplest form of TD update is:

\[V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1} ) - V(S_t)] \]

This is known as TD(0), or one-step TD.

# Tabular TD(0) for estimating v_pi
Input: the policy pi to be evaluated
Algorithm parameter: step size alpha in (0,1]
Initialize V(s), for all s in S_plus, arbitrarily except that V(terminal) = 0

Loop for each episode:
    Initialize S
    Loop for each step of episode:
        A = action given by pi for S
        Take action A, observe R, S'
        V(S) = V(S) + alpha * [R + gamma * V(S') - V(S)]
        S = S'
        if S == terminal:
            break
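
As a concrete illustration, here is a minimal Python sketch of tabular TD(0). The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the policy(state) callable are assumptions made for this example, not part of the algorithm box above.

from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) evaluation of `policy`, assuming a hypothetical env API:
    env.reset() -> state, env.step(action) -> (next_state, reward, done)."""
    V = defaultdict(float)  # value estimates; unvisited states (incl. terminal) stay at 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(S) toward the bootstrapped target R + gamma * V(S')
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V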

TD error:

\[\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \]

At each time step, the TD error is the error in the estimate made at that time: the difference between the current estimate V(S_t) and the better-informed target R_{t+1} + \gamma V(S_{t+1}).

Advantages of TD Prediction Methods

TD methods have two practical advantages: unlike DP they require no model of the environment's reward and transition dynamics, and unlike MC they are naturally online and fully incremental, updating after every step rather than waiting until the end of an episode. On stochastic tasks they have also usually been found to converge faster than constant-alpha MC methods.

Sarsa: On-policy TD Control

\[Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t,A_t)] \]

Sarsa takes its name from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}), and its update expresses the relationship among these five elements. The corresponding TD error can be written as

\[\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t,A_t) \]

# Sarsa (on-policy TD control) for estimating Q ≈ q*
Algorithm parameters: step size alpha in (0,1], small epsilon > 0
Initialize Q(s,a), for all s in S_plus, a in A(s), arbitrarily except that Q(terminal,.) = 0

Loop for each episode:
     Initialize S
     Choose A from S using policy derived from Q (e.g., epsilon-greedy)
     Loop for each step of episode:
          Take action A, observe R, S'
          Choose A' from S' using policy derived from Q (e.g., epsilon-greedy)
          Q(S,A) = Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]
          S = S'; A = A'
          if S == terminal:
              break
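
A minimal Python sketch of Sarsa under the same assumed env interface as the TD(0) sketch above; the actions list and the epsilon_greedy helper are introduced here purely for illustration.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Random action with probability epsilon, otherwise greedy w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Sarsa (on-policy TD control), assuming env.reset()/env.step() as above."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target: uses the action A' actually chosen by the behavior policy
            target = reward + gamma * (0.0 if done else Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q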

Q-learning: Off-policy TD Control

\[Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t,A_t)] \]

# Q-learning (off-policy TD control) for estimating pi ≈ pi*
Algorithm parameters: step size  alpha in (0,1], small epsilon > 0
Initialize Q(s,a), for all s in S_plus, a in A(s), arbitrarily except that Q(terminal,.) = 0

Loop for each episode:
     Initialize S
     Loop for each step of episode:
          Choose A from S using policy derived from Q (e.g., epsilon-greedy)
          Take action A, observe R, S'
          Q(S,A) = Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
          S = S'
          if S == terminal:
              break

Q-learning directly approximates q*, the optimal action-value function, independent of the behavior policy being followed.
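
For comparison, a Python sketch of Q-learning under the same assumed interface; the only change from the Sarsa sketch is the target, which maximizes over actions instead of using the next action the behavior policy will actually take.

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Q-learning (off-policy TD control), assuming env.reset()/env.step() as above."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy target: greedy (max over a'), regardless of what is taken next
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q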

Expected Sarsa

\[\begin{aligned} Q(S_t,A_t) &\leftarrow Q(S_t,A_t) + \alpha\big[R_{t+1} + \gamma \mathbb{E}_\pi[ Q(S_{t+1}, A_{t+1})\mid S_{t+1}] - Q(S_t,A_t)\big]\\ &\leftarrow Q(S_t,A_t) +\alpha \big[R_{t+1} + \gamma \sum_{a}\pi(a\mid S_{t+1})Q(S_{t+1},a) - Q(S_t,A_t)\big] \end{aligned}\]
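
Expected Sarsa replaces the sampled Q(S_{t+1}, A_{t+1}) with its expectation under the policy, which removes the variance caused by randomly selecting A_{t+1}. A small illustrative helper for computing this target under an epsilon-greedy policy; the convention of a Q dict keyed by (state, action) follows the sketches above and is an assumption of this example.

def expected_sarsa_target(Q, reward, next_state, actions, gamma=1.0, epsilon=0.1):
    """Expected Sarsa target R + gamma * sum_a pi(a|S') * Q(S', a) for an
    epsilon-greedy policy pi derived from Q (Q keyed by (state, action))."""
    greedy = max(actions, key=lambda a: Q[(next_state, a)])
    expected_q = 0.0
    for a in actions:
        # epsilon-greedy: epsilon/|A| mass on every action, plus (1 - epsilon) on the greedy one
        prob = epsilon / len(actions) + ((1.0 - epsilon) if a == greedy else 0.0)
        expected_q += prob * Q[(next_state, a)]
    return reward + gamma * expected_q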

Double Q-learning

\[Q_1(S_t,A_t) \leftarrow Q_1(S_t,A_t) + \alpha[R_{t+1} + \gamma Q_2(S_{t+1}, \arg\max_a Q_1(S_{t+1},a)) - Q_1(S_t,A_t)] \]

# Double Q-learning, for estimating Q1 ≈ Q2 ≈ q*

Algorithm parameters: step size alpha in (0,1], small epsilon > 0
Initialize Q1(s,a) and Q2(s,a), for all s in S_plus, a in A(s), such that Q1(terminal,.) = Q2(terminal,.) = 0

Loop for each episode:
     Initialize S
     Loop for each step of episode:
          Choose A from S using the policy epsilon-greedy in Q1 + Q2
          Take action A, observe R, S'
          With 0.5 probability:
               Q1(S,A) = Q1(S,A) + alpha * (R + gamma * Q2(S', argmax_a Q1(S',a)) - Q1(S,A))
          else:
               Q2(S,A) = Q2(S,A) + alpha * (R + gamma * Q1(S', argmax_a Q2(S',a)) - Q2(S,A))
          S = S'
          if S == terminal:
              break
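
Ordinary Q-learning uses the same estimate both to select and to evaluate the maximizing action, which introduces a maximization bias; Double Q-learning removes it by keeping two independent estimates and letting one choose the greedy action while the other evaluates it. A Python sketch under the same assumed env interface:

import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Double Q-learning with two tabular estimates, assuming env.reset()/env.step() as above."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            if random.random() < 0.5:
                # Q1 selects the greedy next action, Q2 evaluates it
                best = max(actions, key=lambda a: Q1[(next_state, a)])
                target = reward + gamma * (0.0 if done else Q2[(next_state, best)])
                Q1[(state, action)] += alpha * (target - Q1[(state, action)])
            else:
                # Roles swapped: Q2 selects, Q1 evaluates
                best = max(actions, key=lambda a: Q2[(next_state, a)])
                target = reward + gamma * (0.0 if done else Q1[(next_state, best)])
                Q2[(state, action)] += alpha * (target - Q2[(state, action)])
            state = next_state
    return Q1, Q2
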
posted on 2018-08-12 23:58 by Vpegasus