Offline RL | Deadly Triad issue
Recommended reading: Why is there a Deadly Triad issue and how to handle it?
- Bootstrapping
- Off-policy learning
- Function approximation
When these three are combined, the learned value function can become unstable or diverge.
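Why:

Function approximation generalizes across inputs, so it can implicitly tie Q(S, a) ≈ Q(S', a1): an update that changes Q(S, a) also changes Q(S', a1). With bootstrapping, Q(S', a1) is part of the target for Q(S, a), so the target moves with every update. With off-policy data, Q(S', a1) may never receive its own corrective updates.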
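The coupling between Q(S, a) and Q(S', a1) can be illustrated with a small sketch. It is a toy setup and not taken from the recommended reading: a single weight `w`, hypothetical features chosen so that Q(S, a) = w and Q(S', a1) = 2w, a reward of 0, and an off-policy data distribution that contains only the transition (S, a) -> S'. The function name `semi_gradient_td` and all parameter values are assumptions made for illustration.

```python
def semi_gradient_td(gamma, alpha=0.1, w=1.0, steps=100):
    """Apply the semi-gradient TD(0) update repeatedly to one transition (S, a) -> (S', a1)."""
    # Hypothetical linear features: Q(S, a) = 1 * w and Q(S', a1) = 2 * w.
    phi_sa, phi_next = 1.0, 2.0
    reward = 0.0
    for _ in range(steps):
        # Bootstrapped target uses the current estimate of Q(S', a1).
        td_error = reward + gamma * phi_next * w - phi_sa * w
        # Only (S, a) is ever updated; (S', a1) never gets its own update (off-policy data).
        w += alpha * td_error * phi_sa
    return w


w_large_gamma = semi_gradient_td(gamma=0.99)  # weight grows at every step
w_small_gamma = semi_gradient_td(gamma=0.4)   # weight shrinks toward 0
print(w_large_gamma, w_small_gamma)
```

In this toy setup each update multiplies `w` by 1 + alpha * (2 * gamma - 1), so the weight grows without bound whenever gamma > 0.5, even though the true value under a reward of 0 is 0. Removing any one of the three ingredients in this sketch (for example, also updating from S' so that Q(S', a1) is corrected) would be expected to remove the growth.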

