Basic Concepts of Reinforcement Learning (1)

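Reinforcement learning has two basic problems:

(1) Prediction problem: given a specific policy, evaluate how much reward it will collect

Estimate the value function of an unknown MDP

(2) Control problem: find the optimal policy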

Optimize the value function of an unknown MDP
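To make the two problems concrete, here is a minimal sketch on a made-up toy corridor (states 0..3, state 3 terminal, reward -1 per step). The environment, the function names, and the hyperparameters are all assumptions for illustration. Prediction is done with every-visit Monte-Carlo evaluation of a fixed random policy, and control with tabular Q-learning; both are just one possible choice of method, and both only see sampled transitions, as in an unknown MDP.

```python
import random

N_STATES = 4          # hypothetical corridor: states 0..3, state 3 is terminal
ACTIONS = (-1, +1)    # move left or move right
GAMMA = 1.0           # undiscounted, episodes always terminate

def step(state, action):
    """Sample one transition; the agent never reads the model directly."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, -1.0, done

def random_policy(state, rng):
    """The fixed policy we want to evaluate: pick left/right uniformly."""
    return rng.choice(ACTIONS)

def predict_mc(policy, episodes=5000, rng=None):
    """Prediction: every-visit Monte-Carlo estimate of V^pi for a given policy."""
    rng = rng or random.Random(0)
    total = [0.0] * N_STATES
    count = [0] * N_STATES
    for _ in range(episodes):
        state, done, trajectory = 0, False, []
        while not done:
            next_state, reward, done = step(state, policy(state, rng))
            trajectory.append((state, reward))
            state = next_state
        g = 0.0
        for s, r in reversed(trajectory):   # accumulate returns backwards
            g = r + GAMMA * g
            total[s] += g
            count[s] += 1
    return [total[s] / count[s] if count[s] else 0.0 for s in range(N_STATES)]

def control_q_learning(episodes=2000, alpha=0.1, epsilon=0.2, rng=None):
    """Control: tabular Q-learning, returns the greedy action per non-terminal state."""
    rng = rng or random.Random(0)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)                        # explore
            else:
                a = max((0, 1), key=lambda i: q[state][i])  # exploit
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return [ACTIONS[max((0, 1), key=lambda i: q[s][i])] for s in range(N_STATES - 1)]

if __name__ == "__main__":
    print("V under random policy:", predict_mc(random_policy))
    print("greedy policy after control:", control_q_learning())
```

Prediction answers "how good is this policy?", while control answers "which policy should I follow?". In this corridor the learned greedy policy should end up always moving right.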

 

 

On-policy learning

"Learn on the job"

The policy we're following is the policy we learn about

Off-policy learning

"Look over someone's shoulder"

Evaluate target policy π(a|s) to compute Vπ(s) or Qπ(s,a) while following behavior policy μ(a|s)
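One way to see the difference is in the bootstrap target of a one-step update. The sketch below uses SARSA as the on-policy example and Q-learning as the off-policy example; these two algorithms, the function names, and the numbers are my own choice for illustration and are not from the notes above.

```python
def sarsa_target(reward, q_next, next_action, gamma=0.9):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + gamma * q_next[next_action]

def q_learning_target(reward, q_next, gamma=0.9):
    """Off-policy: bootstrap from the greedy target policy, whatever mu did."""
    return reward + gamma * max(q_next)

# Hypothetical Q-values of the next state for actions 0 and 1.
q_next = [1.0, 5.0]

# The behavior policy explored and picked action 0 in the next state.
on_policy = sarsa_target(0.0, q_next, next_action=0)
off_policy = q_learning_target(0.0, q_next)
print(on_policy, off_policy)
```

With the same transition, SARSA learns about the exploratory policy it is following, while Q-learning learns about the greedy policy π even though the data came from μ.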

Why off-policy is important:

(1) Learn from observing humans or other agents

(2) Reuse experience generated from old policies π1,π2,π3,...

(3) Learn about optimal policy while following exploratory policy

(4) Learn about multiple policies while following one policy

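Monte-Carlo is a poor fit for off-policy learning: correcting a full-episode return with importance sampling multiplies one π(a|s)/μ(a|s) ratio per step, which can increase variance dramatically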
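A rough way to check the variance problem numerically: sample episodes from a behavior policy μ and multiply the per-step ratios π/μ, as Monte-Carlo importance sampling would. The two-action setup and the probabilities (π goes right with 0.9, μ with 0.5) are made-up values for this sketch, and the state is ignored to keep it short.

```python
import random
import statistics

def is_ratio(episode_len, rng, pi_right=0.9, mu_right=0.5):
    """Product of pi(a|s) / mu(a|s) along one episode sampled from mu."""
    ratio = 1.0
    for _ in range(episode_len):
        if rng.random() < mu_right:      # mu chose "right"
            ratio *= pi_right / mu_right
        else:                            # mu chose "left"
            ratio *= (1.0 - pi_right) / (1.0 - mu_right)
    return ratio

def ratio_stats(episode_len, n=20000, seed=0):
    """Empirical mean and variance of the importance weight over n episodes."""
    rng = random.Random(seed)
    ratios = [is_ratio(episode_len, rng) for _ in range(n)]
    return statistics.mean(ratios), statistics.pvariance(ratios)

if __name__ == "__main__":
    for length in (1, 5, 10):
        mean, var = ratio_stats(length)
        print(f"T={length:2d}  mean={mean:.3f}  variance={var:.2f}")
```

The mean of the weight stays near 1 for every episode length, but its variance grows quickly with the number of steps, and the Monte-Carlo return is multiplied by exactly this weight. A one-step TD update only needs a single ratio, which is one reason TD is usually preferred in the off-policy case.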

posted @ 2017-09-13 15:33  swagger2016