Reinforcement Learning (11) - Actor-Critic

I. Concepts

1. Actor-Critic fuses value-based learning algorithms (e.g., Q-Learning) with policy-based learning algorithms (e.g., Policy Gradient).

2. The Actor corresponds to Policy Gradient: it picks suitable actions probabilistically (which also works for continuous action spaces), traditionally updates once per episode, and adjusts the action-selection probabilities according to the scores it receives.

3. The Critic corresponds to Q-Learning: it updates at every single step and is used to score the actions taken.


II. From Policy Gradient to Actor-Critic

1. Policy Gradient has no Critic, but it can use the Monte Carlo method to compute the value of each step, which covers part of the Critic's job; this, however, restricts it to episodic settings.

2. As in DQN, a learned value function can replace the Monte Carlo estimate and serve as a general-purpose Critic.

3. This introduces two function approximations (written out just below):

a policy function approximation

a value function approximation
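
In symbols (a standard formulation; the notation is added here, not in the original): a policy network with parameters θ approximates the policy, and a value network with parameters w approximates the state-value function:

π_θ(a|s) ≈ π(a|s)
V_w(s) ≈ V_π(s)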

4. Summary

The Critic uses its value network (the analogue of DQN's Q network) to compute the value Vt of the current state.

The Actor uses Vt to iteratively update the policy parameters θ, then selects actions with the updated policy, receiving rewards and new states.

The Critic uses the reward and the new state to update its value-network parameters w, which in turn sharpens its estimate of the state value.
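
A minimal sketch of the resulting interaction loop (schematic only: actor, critic, and env are assumed to behave like the classes defined in section VI below):

# one episode of the actor-critic loop (see section VI for the class definitions)
s, info = env.reset()
while True:
    a = actor.choose_action(s)                        # sample a ~ pi_theta(.|s)
    s_, r, terminated, truncated, info = env.step(a)  # act in the environment
    td_error = critic.learn(s, r, s_)                 # update w; returns r + gamma*V(s_) - V(s)
    actor.learn(s, a, td_error)                       # update theta along td_error * grad log pi
    s = s_
    if terminated or truncated:
        break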


III. Pros and Cons

1. Advantage: it can update at every single step, so it learns faster than traditional Policy Gradient, which has to wait for the episode to end.

2. Disadvantage: everything depends on the Critic's evaluations, but the Critic itself is hard to get to converge, so the agent may fail to learn anything.


IV. Alternative forms of the Actor-Critic update

1. based on the state value

2. based on the action value

3. based on the TD error

4. based on the advantage function

5. based on the TD(λ) error (all five are summarized below)
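
All five variants share the same policy-gradient template and differ only in the weighting term Ψ (a standard summary; the notation is added here):

θ ← θ + α · ∇_θ log π_θ(s, a) · Ψ

with Ψ chosen, respectively, as:

V(s): the state value
Q(s, a): the action value
δ = r + γ·V(s′) − V(s): the TD error
A(s, a) = Q(s, a) − V(s): the advantage function
δ(λ): the TD(λ) error, i.e., TD errors weighted by eligibility traces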


V. Workflow

1. Actor: takes the state as input and outputs predicted action probabilities, then optimizes its action choices using the td_error supplied by the Critic.

(figure: Actor network, which maps the state to action probabilities)


2. Critic: takes the state as input, outputs the value of that state, and produces the td_error.

(figure: Critic network, which maps the state to a state value)
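
In formula form (these match the learn methods in the code below), the Critic's TD error and loss are

td_error = r + γ·V(s′) − V(s)
critic_loss = td_error²

and the Actor performs gradient ascent on the TD-error-weighted log-probability:

actor_loss = −log π_θ(a|s) · td_error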


VI. Code

import numpy as np
import tensorflow as tf
import gymnasium as gym

np.random.seed(2)
tf.random.set_seed(2)  # reproducible

# Hyperparameters
OUTPUT_GRAPH = False  # unused in this script
MAX_EPISODE = 3000
DISPLAY_REWARD_THRESHOLD = 200  # render the environment once the running episode reward is greater than this threshold
MAX_EP_STEPS = 1000   # maximum time step in one episode
RENDER = False  # rendering wastes time
GAMMA = 0.9     # reward discount in TD error
LR_A = 0.001    # learning rate for actor
LR_C = 0.01     # learning rate for critic

env = gym.make('CartPole-v1', render_mode='human')  # note: with render_mode='human', gymnasium renders every step regardless of the RENDER flag; pass render_mode=None to train headless
env = env.unwrapped

N_F = env.observation_space.shape[0]
N_A = env.action_space.n


class Actor(tf.keras.Model):
    def __init__(self, n_features, n_actions, lr=0.001):
        super(Actor, self).__init__()
        self.n_actions = n_actions
        
        # Build the network
        self.dense1 = tf.keras.layers.Dense(
            20, 
            activation='relu',
            kernel_initializer=tf.keras.initializers.RandomNormal(0., 0.1),
            bias_initializer=tf.keras.initializers.Constant(0.1)
        )
        
        self.dense2 = tf.keras.layers.Dense(
            n_actions,
            activation='softmax',
            kernel_initializer=tf.keras.initializers.RandomNormal(0., 0.1),
            bias_initializer=tf.keras.initializers.Constant(0.1)
        )
        
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    
    def call(self, state):
        x = self.dense1(state)
        return self.dense2(x)
    
    def learn(self, s, a, td_error):
        s = tf.expand_dims(s, 0)
        
        with tf.GradientTape() as tape:
            acts_prob = self(s)
            log_prob = tf.math.log(acts_prob[0, a])
            loss = -log_prob * td_error  # negative because we want to maximize
        
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        return loss
    
    def choose_action(self, s):
        s = tf.expand_dims(s, 0)
        probs = self(s)
        return np.random.choice(self.n_actions, p=probs.numpy().ravel())


class Critic(tf.keras.Model):
    def __init__(self, n_features, lr=0.01):
        super(Critic, self).__init__()
        
        # Build the network
        self.dense1 = tf.keras.layers.Dense(
            20,
            activation='relu',
            kernel_initializer=tf.keras.initializers.RandomNormal(0., 0.1),
            bias_initializer=tf.keras.initializers.Constant(0.1)
        )
        
        self.dense2 = tf.keras.layers.Dense(
            1,
            activation=None,  # linear activation for value function
            kernel_initializer=tf.keras.initializers.RandomNormal(0., 0.1),
            bias_initializer=tf.keras.initializers.Constant(0.1)
        )
        
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    
    def call(self, state):
        x = self.dense1(state)
        return self.dense2(x)
    
    def learn(self, s, r, s_):
        s = tf.expand_dims(s, 0)
        s_ = tf.expand_dims(s_, 0)
        
        # Get next state value
        v_next = self(s_)
        
        with tf.GradientTape() as tape:
            v_current = self(s)
            td_error = r + GAMMA * v_next - v_current
            loss = tf.square(td_error)
        
        gradients = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        
        return td_error.numpy()[0, 0]  # return scalar td_error


# Initialize actor and critic
actor = Actor(n_features=N_F, n_actions=N_A, lr=LR_A)
critic = Critic(n_features=N_F, lr=LR_C)     # we need a good teacher, so the teacher should learn faster than the actor

for i_episode in range(MAX_EPISODE):
    s, info = env.reset()
    t = 0
    track_r = []
    while True:
        if RENDER: env.render()

        a = actor.choose_action(s)

        s_, r, terminated, truncated, info = env.step(a)
        done = terminated or truncated

        if done: r = -20  # large negative reward when the pole falls, to speed up learning

        track_r.append(r)

        td_error = critic.learn(s, r, s_)  # gradient = grad[r + gamma * V(s_) - V(s)]
        actor.learn(s, a, td_error)     # true_gradient = grad[logPi(s,a) * td_error]

        s = s_
        t += 1

        if done or t >= MAX_EP_STEPS:
            ep_rs_sum = sum(track_r)

            if 'running_reward' not in globals():
                running_reward = ep_rs_sum
            else:
                running_reward = running_reward * 0.95 + ep_rs_sum * 0.05
            if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True  # rendering
            print("episode:", i_episode, "  reward:", int(running_reward))
            break
