AI Course Preview: 7.3 Dynamic Programming Algorithms

Dynamic Programming Algorithms

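Dynamic programming (DP) is a method for solving optimization problems over multi-stage decision processes. It decomposes the original problem into simpler subproblems, solves those first, and then assembles their solutions into a solution of the original problem.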

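Requirements: the problem must have optimal substructure and overlapping subproblems.

Optimal substructure: the principle of optimality applies, so an optimal solution can be decomposed into optimal solutions of subproblems.

Overlapping subproblems: the same subproblems recur many times, so their solutions can be cached and reused.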
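To make the two requirements concrete, here is a minimal sketch on a hypothetical toy problem (not from the course material): the cheapest path through a small cost grid, moving only right or down. The grid `COST` and the function names are assumptions chosen for illustration. The recursion expresses optimal substructure, and the cache exploits the overlapping subproblems.

```python
from functools import lru_cache

# Hypothetical cost grid; the path starts at (0, 0) and ends at (2, 2),
# moving only right or down.
COST = [
    [1, 3, 1],
    [1, 5, 1],
    [4, 2, 1],
]

calls = {"naive": 0, "memo": 0}

def naive(i, j):
    """Cheapest cost to reach cell (i, j), recomputing every subproblem."""
    calls["naive"] += 1
    if i == 0 and j == 0:
        return COST[0][0]
    best = float("inf")
    if i > 0:
        best = min(best, naive(i - 1, j))
    if j > 0:
        best = min(best, naive(i, j - 1))
    # Optimal substructure: the best path to (i, j) extends a best path
    # to one of its predecessors.
    return COST[i][j] + best

@lru_cache(maxsize=None)
def memo(i, j):
    """Same recursion, but each subproblem is solved once and cached."""
    calls["memo"] += 1
    if i == 0 and j == 0:
        return COST[0][0]
    best = float("inf")
    if i > 0:
        best = min(best, memo(i - 1, j))
    if j > 0:
        best = min(best, memo(i, j - 1))
    return COST[i][j] + best

naive_result = naive(2, 2)
memo_result = memo(2, 2)
print(naive_result, memo_result, calls)
```

Both versions return the same cost, but the cached version evaluates each cell only once, which is the saving that a stored value function provides in an MDP.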

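Markov decision processes (MDPs) satisfy both properties: the Bellman equations provide the recursive decomposition, and the value function stores and reuses the solutions.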

\[V_\pi(s) = \sum_{a \in A} \pi(a \mid s) \left( R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V_\pi(s') \right), \quad Q_\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s') Q_\pi(s', a'), \]

\[V^*(s) = \max_{a \in A} \left( R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^*(s') \right), \quad Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q^*(s', a'). \]

Two kinds of problems can be posed on an MDP:

Prediction
- Input: the MDP model $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a given policy $\pi$
- Output: the value function $v_\pi(s)$ of that policy
Control
- Input: the MDP model $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- Output: the optimal value function $v_*$ or the optimal policy $\pi_*$

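Iterative Policy Evaluation

Given the MDP model, evaluate a policy $\pi$, that is, compute its value function $v_\pi$.

Iterative policy evaluation: apply the Bellman expectation equation repeatedly as an update rule.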

\[v_{k+1}(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a v_k(s') \right) \]

\[v_1 \to v_2 \to \cdots \to v_\pi \]

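Synchronous iteration: the value function at step $k+1$, $v_{k+1}(s)$, is computed from the step-$k$ value function $v_k$ for all states $s \in \mathcal{S}$ in a single sweep.

Further reading: when $\gamma < 1$, the contraction mapping theorem guarantees that this iteration converges to the unique fixed point $v_\pi$.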

\[v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \right) \]

\[\mathbf{v}^{k+1} = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi \mathbf{v}^k \]
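The matrix form can be sketched in a few lines of NumPy. The two-state reward process below (`P_pi`, `R_pi`, `gamma`) is a hypothetical example, not taken from the course; it stands for the transition matrix and reward vector already averaged over some fixed policy $\pi$. The sketch iterates the update synchronously and compares the result with the direct linear solve $(I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$.

```python
import numpy as np

# Hypothetical 2-state reward process induced by a fixed policy pi.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])   # P_pi[s, s'] = transition probability under pi
R_pi = np.array([1.0, -1.0])    # expected immediate reward in each state
gamma = 0.9                     # discount factor (< 1, so the update is a contraction)

def iterative_evaluation(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Repeat v <- R + gamma * P v until the largest change falls below tol."""
    v = np.zeros(len(R))
    for _ in range(max_iter):
        v_next = R + gamma * P @ v   # synchronous update of all states
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

v_iter = iterative_evaluation(P_pi, R_pi, gamma)

# Direct solution of the linear system (I - gamma * P) v = R.
v_exact = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

print(v_iter, v_exact)
```

The two vectors agree up to the tolerance, which illustrates that the iteration converges to the same fixed point that the linear system defines.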

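I just worked through this by hand, and it reminds me of the "Markov chain" problems from high school: find a recurrence, then take the limit on both sides (which really amounts to applying the transition matrix many times) to get the steady-state value. The same result can also be obtained by eigendecomposing the matrix and letting $n$ tend to infinity. Policy evaluation feels similar: iterate until the values settle, and the limit is the value under this policy.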

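$e.g.1$ $\langle \mathcal{S}, \mathcal{P} \rangle$ There are two breakfast shops downstairs from Xiao Ming's home: A specializes in xiaolongbao (soup dumplings) and B in jianbing guozi (savory crepes). If Xiao Ming picks $A$ one day, then the next day he picks $\begin{cases} A & \text{with probability } 40\% \\ B & \text{with probability } 60\% \end{cases}$ ; if he picks $B$, then the next day he picks $\begin{cases} A & \text{with probability } 50\% \\ B & \text{with probability } 50\% \end{cases}$ . If he picks A on day 1, what is the probability that he picks A on day $n$?

$\mathcal{S} = \{ A , B \}$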

$\mathcal{P} : P (A | A) = 0.4, P( B | A ) = 0.6, P(A|B) = 0.5, P(B|B) = 0.5$
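As a quick numerical check of $e.g.1$, the sketch below raises the transition matrix to the power $n-1$ and compares the result with the closed form obtained from the recurrence $p_{n+1} = 0.4\,p_n + 0.5\,(1 - p_n) = 0.5 - 0.1\,p_n$, whose fixed point is $5/11$. The function names are assumptions made for illustration.

```python
import numpy as np

# Rows: today's shop (A, B); columns: tomorrow's shop (A, B).
P = np.array([[0.4, 0.6],
              [0.5, 0.5]])

def prob_A_on_day(n):
    """Probability of choosing A on day n, given A on day 1."""
    start = np.array([1.0, 0.0])                      # day 1: A with certainty
    return (start @ np.linalg.matrix_power(P, n - 1))[0]

def closed_form(n):
    """Solution of p_{n+1} = 0.5 - 0.1 * p_n with p_1 = 1."""
    return 5 / 11 + (6 / 11) * (-0.1) ** (n - 1)

for n in (1, 2, 3, 10):
    print(n, prob_A_on_day(n), closed_form(n))
```

The deviation from $5/11$ shrinks by a factor of ten each day, so the probability settles at $5/11$ almost immediately.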

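$e.g.2$ Four players, Jia, Yi, Bing, and Ding (甲, 乙, 丙, 丁), practice passing a ball. The ball starts at Jia's feet and is passed to one of the other three players with equal probability; whoever receives it again passes to one of the other three players with equal probability, and so on. Assume every pass is received.
Let $P_n\ (n \in \mathbb{N}^*)$ be the probability that the ball is at Jia's feet just before the $n$-th pass; clearly $P_1 = 1, P_2 = 0$.
(1) Derive the recurrence and the closed form for $P_n$.
(2) Let $Q_n$ be the probability that the ball is at Yi's feet just before the $n$-th pass; compare $P_n$ and $Q_n\ (n \geq 2)$.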
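
A short worked sketch of $e.g.2$, using only the problem statement. The ball is with the starting player (甲) before pass $n+1$ exactly when someone else held it before pass $n$ and that player chose 甲, which happens with probability $1/3$:

\[P_{n+1} = \frac{1}{3}\left(1 - P_n\right) \;\Longrightarrow\; P_{n+1} - \frac{1}{4} = -\frac{1}{3}\left(P_n - \frac{1}{4}\right) \;\Longrightarrow\; P_n = \frac{1}{4} + \frac{3}{4}\left(-\frac{1}{3}\right)^{n-1}. \]

The same recurrence holds for the second player (乙), but with $Q_1 = 0$:

\[Q_{n+1} = \frac{1}{3}\left(1 - Q_n\right) \;\Longrightarrow\; Q_n = \frac{1}{4} - \frac{1}{4}\left(-\frac{1}{3}\right)^{n-1}, \qquad P_n - Q_n = \left(-\frac{1}{3}\right)^{n-1}. \]

So for $n \geq 2$, $P_n > Q_n$ when $n$ is odd and $P_n < Q_n$ when $n$ is even, and both tend to $1/4$. As a check, $P_2 = 0 < Q_2 = 1/3$ and $P_3 = 1/3 > Q_3 = 2/9$.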

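Policy Iteration

Policy evaluation: given the current policy $\pi$, compute its value function $v_\pi$, for example by iterative policy evaluation or by solving the linear system directly.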

\[v_\pi(s) = \mathbb{E} \left[ R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s \right] \]

Policy improvement: in every state $s$, act greedily with respect to $v_\pi$, giving $\pi' = \text{greedy}(v_\pi)$.

\[\begin{align*} \pi'(s) &= \arg\max_{a} q_\pi(s, a) \\ &= \arg\max_{a} \mathbb{E} \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right] \\ &= \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right], \end{align*} \]

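Convergence: the policy improvement theorem guarantees that $\pi'$ is at least as good as $\pi$ in every state, and a finite MDP has only finitely many deterministic policies, so alternating evaluation and improvement stops at an optimal policy $\pi^*$. The contraction mapping theorem is what guarantees that the evaluation step itself converges to the unique fixed point $v_\pi$.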

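Value Iteration

Suppose the optimal values $v_*(s')$ of the subproblems are already known.

Then the optimal value $v_*(s)$ of state $s$ follows from a one-step lookahead: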

\[v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \right) \]

\[v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_* \]

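Value iteration: apply the update above repeatedly.

Unlike policy iteration, value iteration keeps no explicit policy; the intermediate value functions may not correspond to any actual policy.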

\[v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \right) \]

\[\mathbf{v}_{k+1} = \max_{a \in \mathcal{A}} \left( \mathcal{R}^\mathbf{a} + \gamma \mathcal{P}^\mathbf{a} \mathbf{v}_k \right) \]
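The update can be sketched on a small deterministic gridworld. The setup below mirrors the 3x3 grid used in this post's code example (terminal cell `(2, 1)`, reward -1 per step and 0 on entering the terminal cell, $\gamma = 1$), but it is rewritten to be self-contained; the upper-case names and the ASCII action labels are assumptions for illustration. Each sweep is synchronous: the new table is built entirely from the old one.

```python
GRID = (3, 3)
TERMINAL = (2, 1)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 1.0
STATES = [(i, j) for i in range(GRID[0]) for j in range(GRID[1])]

def step(s, a):
    """Deterministic transition: return (next state, reward)."""
    if s == TERMINAL:
        return s, 0.0
    di, dj = ACTIONS[a]
    ni, nj = s[0] + di, s[1] + dj
    # Moves that would leave the grid keep the agent in place.
    next_s = (ni, nj) if 0 <= ni < GRID[0] and 0 <= nj < GRID[1] else s
    return next_s, (0.0 if next_s == TERMINAL else -1.0)

def backup(V, s, a):
    """One-step lookahead value of taking action a in state s."""
    next_s, r = step(s, a)
    return r + GAMMA * V[next_s]

def value_iteration(theta=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        # Synchronous sweep: V_new depends only on the previous table V.
        V_new = {
            s: 0.0 if s == TERMINAL else max(backup(V, s, a) for a in ACTIONS)
            for s in STATES
        }
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < theta:
            return V

V_star = value_iteration()

# A policy is only extracted at the end, by acting greedily on V_star
# (ties are broken by the order of ACTIONS).
greedy = {
    s: max(ACTIONS, key=lambda a: backup(V_star, s, a))
    for s in STATES if s != TERMINAL
}
print(V_star)
print(greedy)
```

No policy appears inside the loop, which matches the remark that value iteration works on value functions only and recovers a policy afterwards.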

Code Example

import numpy as np
import pandas as pd

# Environment parameters
grid_size = (3, 3)

# States
states = [(i, j) for i in range(grid_size[0]) for j in range(grid_size[1])]
terminal = (2, 1)   # terminal state

# Action set A
actions = {
    '↑': (-1, 0),
    '↓': (1, 0),
    '←': (0, -1),
    '→': (0, 1),
}

# Discount factor
gamma = 1.0

# Initial policy: uniform random
policy = {s: {a: 1/4 for a in actions} for s in states if s != terminal}

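# Transition and reward
def step(s, a):
    # State transition (the terminal state is absorbing)
    if s == terminal:
        return s, 0
    di, dj = actions[a]
    ni, nj = s[0] + di, s[1] + dj
    if 0 <= ni < grid_size[0] and 0 <= nj < grid_size[1]:
        next_s = (ni, nj)
    else:
        # Moves that would leave the grid keep the agent in place
        next_s = s
    # Immediate reward: 0 on entering the terminal state, -1 otherwise
    reward = 0 if next_s == terminal else -1
    # Return the next state and the reward for this step
    return next_s, reward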

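# Iterative policy evaluation
def policy_evaluation(policy, theta=1e-4):
    # Initialize the value function to 0 everywhere
    V = {s: 0.0 for s in states}
    # Sweep until the values stop changing
    while True:
        delta = 0
        for s in states:
            if s == terminal:
                continue
            # New value of state s
            v_new = 0.0
            # Remember the old value to measure the change
            v = V[s]
            # Weight each action by its probability under the policy
            for a, pa in policy[s].items():
                next_s, reward = step(s, a)
                v_new += pa * (reward + gamma * V[next_s])
            # In-place update: later states in this sweep already see the new value
            V[s] = v_new
            # Track the largest change in this sweep
            delta = max(delta, abs(V[s] - v))
        # Stop once the values have converged
        if delta < theta:
            break
    return V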

# Policy improvement
def policy_improvement(V):
    new_policy = {}
    # Try every action in every non-terminal state
    for s in states:
        if s == terminal:
            continue
        # Compute the Q-value of each action
        q_vals = {}
        for a in actions:
            next_s, reward = step(s, a)
            q_vals[a] = reward + gamma * V[next_s]
        # Collect the actions that attain the maximum Q-value
        max_q = max(q_vals.values())
        best_as = [a for a, q in q_vals.items() if np.isclose(q, max_q)]
        prob = 1 / len(best_as)
        # Update the policy: split probability evenly among the best actions
        new_policy[s] = {a: (prob if a in best_as else 0.0) for a in actions}
    return new_policy

# Policy iteration
def policy_iteration():
    current_policy = policy.copy()
    while True:
        # 1. Evaluate the value function of the current policy
        V = policy_evaluation(current_policy)
        # 2. Improve the policy
        new_policy = policy_improvement(V)
        # 3. Stop when the policy no longer changes
        if all(new_policy[s] == current_policy[s] for s in new_policy):
            return V, new_policy
        current_policy = new_policy

# Run
V_opt, policy_opt = policy_iteration()

# Print the results
v_matrix = np.array([[V_opt[(i, j)] for j in range(3)] for i in range(3)])
p_matrix = np.array([[ ''.join([a for a in actions if policy_opt.get((i,j), {}).get(a,0)>0])
                       for j in range(3)] for i in range(3)])

df_values = pd.DataFrame(v_matrix, index=[f"Row {i}" for i in range(3)],
                         columns=[f"Col {j}" for j in range(3)])
df_policy = pd.DataFrame(p_matrix, index=[f"Row {i}" for i in range(3)],
                         columns=[f"Col {j}" for j in range(3)])

print("Optimal Value Function V*：")
print(df_values)
print("\nOptimal Policy π*：")
print(df_policy)

Output:

Optimal Value Function V*：
       Col 0  Col 1  Col 2
Row 0   -2.0   -1.0   -2.0
Row 1   -1.0    0.0   -1.0
Row 2    0.0    0.0    0.0

Optimal Policy π*：
      Col 0 Col 1 Col 2
Row 0    ↓→     ↓    ↓←
Row 1    ↓→     ↓    ↓←
Row 2     →           ←

posted @ 2025-06-20 15:33 Antimerry
