CS294-112 Deep Reinforcement Learning, Fall Semester (Berkeley), No. 9: Learning policies by imitating optimal controllers
































The controller has to compromise between staying close to the learned policy and achieving minimal cost!
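One way to picture this compromise is a toy one-dimensional problem. The sketch below is only an illustration under assumptions that are not in the notes: the cost is taken to be quadratic around a hypothetical minimum-cost action `u_min_cost`, the disagreement with the learned policy is a squared distance to a hypothetical `u_policy`, and `lam` is an assumed weight on that disagreement.

```python
def compromise_action(u_min_cost, u_policy, lam):
    """Return the u that minimizes (u - u_min_cost)**2 + lam * (u - u_policy)**2.

    All three arguments are hypothetical names for this illustration:
    u_min_cost is the action with the lowest cost, u_policy is the action
    the learned policy would take, and lam >= 0 weights agreement with
    the learned policy.
    """
    # Setting the derivative to zero gives
    # 2 * (u - u_min_cost) + 2 * lam * (u - u_policy) = 0,
    # so the minimizer is a weighted average of the two actions.
    return (u_min_cost + lam * u_policy) / (1.0 + lam)


# Sweep the weight: small lam favors minimal cost, large lam favors the policy.
for lam in (0.0, 1.0, 100.0):
    print(lam, compromise_action(2.0, 0.0, lam))
```

With `lam = 0` the result is the minimum-cost action itself, and as `lam` grows the result moves toward the learned policy's action.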






π̂ (pi hat) takes states as input.
π_θ (pi theta) takes observations as input.
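The difference in inputs can be sketched with a toy imitation setup. Everything here is an assumption made for illustration, not taken from the lecture: the state is a (position, velocity) pair, the teacher standing in for π̂ is a hand-picked linear feedback law with made-up gains `k_p` and `k_v`, the observation is the position alone, and the student standing in for π_θ is a single weight fitted by least squares.

```python
import random


def teacher_action(state, k_p=2.0, k_v=0.5):
    """Stand-in for pi hat: acts on the full state (position, velocity)."""
    p, v = state
    return -(k_p * p + k_v * v)


def observe(state):
    """Hypothetical observation model: only the position is visible."""
    p, _v = state
    return p


def fit_student(states):
    """Stand-in for pi theta: fit u = w * observation to the teacher's actions."""
    obs = [observe(s) for s in states]
    acts = [teacher_action(s) for s in states]
    # One-parameter least squares: w = sum(o * a) / sum(o * o).
    num = sum(o * a for o, a in zip(obs, acts))
    den = sum(o * o for o in obs)
    return num / den


rng = random.Random(0)  # fixed seed so the run is repeatable
states = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(2000)]
w = fit_student(states)
print("student weight on the observation:", w)
```

Because the velocity is hidden from the student in this toy, it can only reproduce the part of the teacher's behavior that is predictable from the position, so the fitted weight lands close to `-k_p`.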









