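Basic Concepts of Reinforcement Learning (1)
Reinforcement learning has two basic problems:
(1) Prediction: given a specific policy, evaluate how much reward it will collect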
Estimate the value function of an unknown MDP
(2) Control: find an optimal policy
Optimize the value function of an unknown MDP
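The difference between the two problems can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the one-step environment `step`, its reward numbers, and the helper names `predict` and `control` do not come from the notes. Prediction averages sampled returns under a fixed policy; control estimates the value of each action and then acts greedily.

```python
import random

def step(action, rng):
    """Hypothetical one-step environment; the agent can only sample from it."""
    if action == 0:
        return 1.0                                # safe action: always reward 1
    return 3.0 if rng.random() < 0.5 else 0.0     # risky action: mean reward 1.5

def predict(policy_probs, episodes, rng):
    """Prediction: Monte-Carlo estimate of the value of a fixed policy."""
    total = 0.0
    for _ in range(episodes):
        action = 0 if rng.random() < policy_probs[0] else 1
        total += step(action, rng)
    return total / episodes

def control(episodes, rng):
    """Control: estimate Q(a) from samples, then pick the greedy action."""
    sums = [0.0, 0.0]
    counts = [0, 0]
    for _ in range(episodes):
        action = rng.randrange(2)                 # explore both actions uniformly
        sums[action] += step(action, rng)
        counts[action] += 1
    q = [sums[a] / counts[a] for a in range(2)]
    return q, max(range(2), key=lambda a: q[a])

rng = random.Random(0)
v_uniform = predict([0.5, 0.5], 20000, rng)       # true value is 0.5*1.0 + 0.5*1.5 = 1.25
q_values, greedy_action = control(20000, rng)     # the greedy action should be the risky one
```

In both functions the agent never reads the reward distribution directly; it only samples `step`, which is what "unknown MDP" means here.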
On-policy learning
"Learn on the job"
The policy we're following is the policy we learn about
Off-policy learning
"Look over someone's shoulder"
Evaluate target policy π(a|s) to compute Vπ(s) or Qπ(s,a) while following behavior policy μ(a|s)
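One common way to evaluate π from data generated by μ is importance sampling, a technique the notes above do not name; it is used here only as an illustration. The sketch reweights each sampled reward by π(a)/μ(a). The one-step environment `step`, the two policies, and the helper `off_policy_value` are all hypothetical.

```python
import random

def step(action, rng):
    """Hypothetical one-step environment (an assumption for this sketch)."""
    if action == 0:
        return 1.0
    return 3.0 if rng.random() < 0.5 else 0.0     # mean reward 1.5

def off_policy_value(target, behavior, episodes, rng):
    """Estimate the value of `target` using actions drawn from `behavior`."""
    total = 0.0
    for _ in range(episodes):
        action = 0 if rng.random() < behavior[0] else 1
        ratio = target[action] / behavior[action]  # importance-sampling weight
        total += ratio * step(action, rng)
    return total / episodes

pi = [0.1, 0.9]    # target policy: what we want to evaluate
mu = [0.5, 0.5]    # behavior policy: what actually generates the data
rng = random.Random(1)
v_pi = off_policy_value(pi, mu, 20000, rng)        # true value is 0.1*1.0 + 0.9*1.5 = 1.45
```

The weights are only defined when μ(a) > 0 for every action that π can take, so the behavior policy has to cover the target policy.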
Why off-policy is important:
(1) Learn from observing humans or other agents
(2) Reuse experience generated from old policies π1, π2, π3, ...
(3) Learn about optimal policy while following exploratory policy
(4) Learn about multiple policies while following one policy
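Monte-Carlo is a poor fit for off-policy learning: correcting a full-episode return requires multiplying importance-sampling ratios over every step of the episode, so the estimate has very high variance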
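The variance problem can be made concrete with a small calculation. The sketch assumes a single-state problem in which every step's action is drawn independently from μ, so the per-step ratios π(a)/μ(a) are i.i.d.; the function name `weight_variance` and the example policies are hypothetical. Under that assumption each ratio has mean 1, and the variance of the product over T steps is (E[ρ²])^T − 1, which grows exponentially with episode length.

```python
def weight_variance(pi, mu, horizon):
    """Exact variance of the product of `horizon` i.i.d. importance ratios.

    Assumes mu(a) > 0 wherever pi(a) > 0, so each ratio has mean 1 under mu.
    """
    # Second moment of one ratio under mu: sum_a mu(a) * (pi(a) / mu(a))^2
    second_moment = sum(m * (p / m) ** 2 for p, m in zip(pi, mu))
    # Var(prod) = E[prod^2] - (E[prod])^2 = second_moment^T - 1
    return second_moment ** horizon - 1.0

pi = [0.9, 0.1]    # target policy
mu = [0.5, 0.5]    # behavior policy
variances = {t: weight_variance(pi, mu, t) for t in (1, 5, 10, 20)}
for t, var in variances.items():
    print(f"episode length {t:2d}: variance of the return weight = {var:.2f}")
```

When the two policies are identical every ratio equals 1 and the variance is zero, which matches the on-policy case.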
