Reinforcement Learning - 01 Notation
Suppose one episode of the game has \(n\) steps, and denote the rewards within the episode by \(R_1,\cdots,R_n\). Then at time \(t\):
- Discounted return: \(U_t=\sum_{k=t}^n \gamma^{k-t} \cdot R_k\) (see the sketch after this list)
- Action-value function: \(Q_\pi\left( s_t, a_t \right) = \mathbb{E}\left[ U_t\mid S_t=s_t,A_t=a_t \right]\)
- State-value function: \(V_\pi\left( s_t \right) = \mathbb{E}_{A_t \sim \pi\left( \cdot \mid s_t;\theta \right) }\left[ Q_\pi\left( s_t, A_t \right) \right]\)
- Objective function: \(J\left( \theta \right)=\mathbb{E}_S \left[ V_\pi\left( S \right) \right]\)
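As a concrete illustration of the discounted return, here is a minimal Python sketch (the function name and the example rewards are my own, not from the text) that computes \(U_t\) for every step of an episode using the recursion \(U_t = R_t + \gamma \cdot U_{t+1}\):

```python
def discounted_returns(rewards, gamma):
    """Compute U_t = sum_{k=t}^{n} gamma^(k-t) * R_k for every step t.

    Works backwards through the episode using U_t = R_t + gamma * U_{t+1}.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a 4-step episode (R_1..R_4) with gamma = 0.9
print(discounted_returns([1.0, 0.0, 0.0, 2.0], gamma=0.9))
# U_1 = 1 + 0.9^3 * 2 = 2.458, U_2 = 1.62, U_3 = 1.8, U_4 = 2.0
```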
Bellman equation for the action-value function: \(Q_\pi\left( s_t,a_t \right) = \mathbb{E}_{S_{t+1}\sim p\left( \cdot\mid s_t,a_t \right)} \left[ R_t+\gamma\cdot V_\pi\left( S_{t+1} \right) \right]\)
Bellman equation for the state-value function: \(V_\pi\left( s_t \right)=\mathbb{E}_{A_t\sim \pi\left( \cdot\mid s_t;\theta \right)}\left[ \mathbb{E}_{S_{t+1}\sim p\left( \cdot\mid s_t,A_t \right)} \left[ R_t+\gamma\cdot V_\pi\left( S_{t+1} \right) \right]\right]\) (the inner expectation conditions on the sampled action, so it is the random variable \(A_t\), not a fixed \(a_t\))
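Both Bellman equations can be checked numerically on a small tabular MDP. The sketch below uses a hypothetical 2-state, 2-action MDP of my own (`P`, `R`, `policy`, and `gamma` are illustrative assumptions, not from the text) and iterates the state-value Bellman backup until \(V_\pi\) converges:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = p(s' | s, a); R[s, a] = expected immediate reward R_t.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
policy = np.array([[0.6, 0.4],   # pi(a | s=0)
                   [0.3, 0.7]])  # pi(a | s=1)
gamma = 0.9

# Iterate V(s) = sum_a pi(a|s) * [ R(s,a) + gamma * sum_s' p(s'|s,a) * V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V             # Bellman equation for Q_pi
    V_new = (policy * Q).sum(axis=1)  # expectation over A_t ~ pi(.|s)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print("Q_pi:\n", Q)
print("V_pi:", V)
```

Because the backup is a \(\gamma\)-contraction, the iteration converges to the unique fixed point \(V_\pi\); the final `Q` and `V` then satisfy both Bellman equations above to numerical precision.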
