Reinforcement Learning (7): Q-Learning
I. Concepts
1. The Q table is a table of values: rows are states, columns are actions, and each cell stores the value (reward estimate) of taking that action in that state. In the current state, choose the action with the largest value in the Q table.
2. The control problem of reinforcement learning: given five elements (state set S, action set A, immediate reward R, discount factor γ, exploration rate ϵ), find the optimal action-value function q∗ and the optimal policy π∗.
3. Q-Learning is an off-policy temporal-difference algorithm for solving the reinforcement learning control problem.
Action selection: ϵ-greedy.
Value-function update: greedy, i.e. it learns the optimal policy directly by using the action that maximizes Q in the target. Sarsa, by contrast, keeps exploring and uses ϵ-greedy in the update as well; this is the essential difference from Sarsa (see the sketch after this list).
4. Drawbacks:
Only applicable to discrete actions, not continuous ones.
The Q table becomes too large, so it only suits small reinforcement learning problems.
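A minimal sketch of this difference, contrasting the two update targets (the array layout Q[state, action] and all names here are illustrative assumptions, not code from the post):

import numpy as np

def q_learning_target(Q, r, s_, gamma):
    # Off-policy target: always use the best action in the next state,
    # regardless of which action the epsilon-greedy policy actually takes next
    return r + gamma * np.max(Q[s_, :])

def sarsa_target(Q, r, s_, a_, gamma):
    # On-policy target: use the action a_ actually chosen (epsilon-greedily) in the next state
    return r + gamma * Q[s_, a_]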
II. Update Formula
1. Estimate of Q(s1, a2): the current table entry Q(s1, a2).
2. Target ("reality") for Q(s1, a2): r + γ * max_a Q(s2, a).
3. Error = target − estimate.
4. Update: new Q(s1, a2) = old Q(s1, a2) + α * error. A worked numeric example follows.
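For instance, plugging in some made-up numbers (α = 0.1, γ = 0.9; none of these values come from the post):

alpha, gamma = 0.1, 0.9
q_estimate = 0.5            # current table entry Q(s1, a2)
r, max_q_next = 1.0, 2.0    # reward, and max over Q(s2, ·)

q_target = r + gamma * max_q_next                      # 1.0 + 0.9 * 2.0 = 2.8
q_new = q_estimate + alpha * (q_target - q_estimate)   # 0.5 + 0.1 * 2.3 = 0.73
print(round(q_new, 4))                                 # 0.73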

III. The Q Function
1. Also called the action-value function.
2. Input: a state-action pair, i.e. a state together with an action taken in it.
3. Two ways to represent it (see the sketch below):
With only a state as input, the output is one value per action.
With a state-action pair as input, the output is a single scalar.
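A tiny sketch of the two forms (the 2x3 Q table and the function names are made up for illustration):

import numpy as np

# Toy Q table: 2 states x 3 actions, arbitrary values
Q = np.array([[0.1, 0.5, 0.2],
              [0.0, 0.3, 0.9]])

def q_values(state):
    # State-only form: one value per action
    return Q[state, :]

def q_value(state, action):
    # State-action form: a single scalar
    return Q[state, action]

print(q_values(0))    # [0.1 0.5 0.2]
print(q_value(1, 2))  # 0.9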
IV. One-Dimensional Example Code
1. Initialize the Q table with all zeros; rows are states, columns are actions.
2. Choose an action, e.g. with 90% probability take the action with the largest value and with 10% probability take a random action.
3. Get the environment's feedback, i.e. the reward and the next state.
4. Loop over episodes.
import numpy as np
import pandas as pd
import time

np.random.seed(2)

# Global parameters
N_STATES = 6                   # length of the 1-D world
ACTIONS = ['left', 'right']    # available actions
EPSILON = 0.9                  # greedy rate
ALPHA = 0.1                    # learning rate
LAMBDA = 0.9                   # discount factor
MAX_EPISODE = 13               # number of episodes
FRESH_TIME = 0.01              # interval between moves


# Build the Q table
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),
        columns=actions,
    )
    # print(table)
    return table


# Choose an action (epsilon-greedy)
def choose_action(state, q_table):
    state_action = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_action == 0).all()):
        action_name = np.random.choice(ACTIONS)
    else:
        action_name = state_action.idxmax()  # idxmax() returns the column name rather than the positional index
    return action_name


# Get the environment's feedback
def get_env_feedback(S, A):
    if A == 'right':
        if S == N_STATES - 2:
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:
        R = 0
        if S == 0:
            S_ = S
        else:
            S_ = S - 1
    return S_, R


# This is how the environment is updated
def update_env(S, episode, step_counter):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode + 1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')  # clear the printed line
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)


def rl():
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODE):
        step_counter = 0
        S = 0
        is_terminal = False
        update_env(S, episode, step_counter)
        while not is_terminal:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)
            q_predict = q_table.loc[S, A]
            if S_ != 'terminal':
                q_target = R + LAMBDA * q_table.iloc[S_, :].max()
            else:
                q_target = R
                is_terminal = True
            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            S = S_
            update_env(S, episode, step_counter + 1)
            step_counter += 1
    return q_table


if __name__ == '__main__':
    q_table = rl()
    print(q_table)
V. Two-Dimensional Maze Code
1. Algorithm update (RL_brain.py)
import numpy as np
import pandas as pd


class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions          # list of action indices
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def choose_action(self, observation):
        self.check_state_exist(observation)
        if np.random.uniform() < self.epsilon:
            # Exploit: pick the action with the largest Q value
            state_action = self.q_table.loc[observation, :]
            # Shuffle first so that ties are broken randomly
            state_action = state_action.reindex(np.random.permutation(state_action.index))
            action = state_action.idxmax()
        else:
            # Explore: pick a random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()
        else:
            q_target = r
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)

    def check_state_exist(self, state):
        # Add a new all-zero row when a state is seen for the first time
        if state not in self.q_table.index:
            self.q_table = pd.concat(
                [self.q_table,
                 pd.Series(
                     [0] * len(self.actions),
                     index=self.q_table.columns,
                     name=state,
                 ).to_frame().T]
            )
2. Decision making (the main training loop)
from maze_env import Maze
from RL_brain import QLearningTable


def update():
    for episode in range(100):
        # Reset the environment at the start of each episode
        observation = env.reset()
        while True:
            env.render()
            # Choose an action based on the current observation
            action = RL.choose_action(str(observation))
            # Take the action; get the next observation, reward and done flag
            observation_, reward, done = env.step(action)
            # Learn from this transition
            RL.learn(str(observation), action, reward, str(observation_))
            observation = observation_
            if done:
                break
    print('game over')
    env.destroy()


if __name__ == '__main__':
    env = Maze()
    RL = QLearningTable(actions=list(range(env.n_actions)))
    env.after(100, update)
    env.mainloop()
Reference:
https://www.cnblogs.com/pinard/p/9669263.html