Reinforcement Learning Theory - Lecture 5 - Monte Carlo Methods

The previous lectures were all model-based; this one covers model-free methods.

1. From model-based to model-free:
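
The comparison behind this move is the standard policy-iteration one, restated here for completeness. The model-based evaluation step

\[ q_{\pi_k}(s,a) = \sum_{r} p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s') \]

requires the model \(p(r \mid s,a)\) and \(p(s' \mid s,a)\). The model-free route instead uses the equivalent definition \(q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]\) and estimates this expectation from sampled episodes.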


2. Computing \(q_{\pi_k}(s,a)\):
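
Writing \(g^{(i)}(s,a)\) for the discounted return of the \(i\)-th episode that starts from \((s,a)\) (notation assumed here), the Monte Carlo estimate is the sample average

\[ q_{\pi_k}(s,a) \approx \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s,a), \]

which converges to the true value as \(N \to \infty\) by the law of large numbers.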

3. MC Basic algorithm:


MC Basic iterates the same two steps as policy iteration: policy evaluation, then policy improvement \(\pi_{k+1}(s) = \arg\max_a q_{\pi_k}(s,a)\). Step 1 differs from the model-based case: \(q_{\pi_k}\) is estimated from sampled episodes instead of computed from the model. The later steps are identical; the mc_basic method below implements this directly.

4. MC Exploring Starts algorithm:


This introduces the concept of a visit: each occurrence of a state-action pair \((s,a)\) in an episode counts as one visit. MC Basic uses each episode only to estimate the pair it starts from (the initial visit), which wastes data; first-visit and every-visit strategies reuse the rest of the episode, as sketched below.
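
As a minimal sketch of how a single episode can yield a return sample for every pair it visits: the helper below is hypothetical (not part of the class shown later), and it assumes the same episode format as that code, a list of dicts with 'state', 'action', and 'reward' keys.

    def first_visit_returns(episode, gamma):
        # Walk the episode backwards so the discounted return g is built
        # incrementally: g = r_t + gamma * g.
        g = 0.0
        G = {}
        for step in reversed(episode):
            g = step['reward'] + gamma * g
            # Overwriting on every occurrence means that, when the loop ends,
            # G holds the return of the earliest (first) visit of each pair.
            G[(step['state'], step['action'])] = g
        return G  # (state, action) -> first-visit return

For the every-visit variant, append g to a list instead of overwriting, so every occurrence contributes a sample.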

4.1 Update the policy:

There are two schemes: the first collects all the episodes for a pair and only then computes the average return; the second takes a single episode, computes an approximate value from it, and improves the policy immediately, as shown in the sketch below.
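
For the episode-by-episode scheme there is no need to store all past returns; a standard incremental-mean update (a generic sketch, not tied to the class below) is enough:

    def update_mean(mean, count, g):
        # Running average: mean_N = mean_{N-1} + (g_N - mean_{N-1}) / N.
        count += 1
        mean += (g - mean) / count
        return mean, count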

4.2 MC Exploring Starts:

In practice it is hard to guarantee that every \((s,a)\) pair is used as an episode start (the exploring-starts requirement), which motivates the following improvement.

5. MC Epsilon-Greedy:

A policy is called soft if the probability of taking any action in any state is positive.
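
The \(\epsilon\)-greedy policy used in mc_epsilon_greedy below is soft. Matching the probabilities assigned in that code,

\[ \pi(a \mid s) = \begin{cases} 1 - \dfrac{|\mathcal{A}| - 1}{|\mathcal{A}|}\,\epsilon, & a = a^{*}, \\[6pt] \dfrac{\epsilon}{|\mathcal{A}|}, & a \neq a^{*}, \end{cases} \]

where \(a^{*}\) is the greedy action and \(|\mathcal{A}|\) the number of actions. Setting \(\epsilon = 0\) recovers the greedy policy; \(\epsilon = 1\) gives the uniform policy.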



    def mc_basic(self, length = 30, epochs = 10):
        # MC Basic: for every (state, action), sample one episode of fixed
        # length, take its discounted return as the estimate of q(s, a),
        # then improve the policy greedily state by state.
        # self.gama is the discount factor gamma.
        for epoch in range(epochs):
            for state in range(self.state_space_size):
                for action in range(self.action_space_size):
                    episode = self.obtain_episode(self.policy, state, action, length)
                    # Accumulate the discounted return backwards: g = r_t + gamma * g.
                    g = 0
                    for step in range(len(episode) - 1, -1, -1):
                        g = episode[step]['reward'] + self.gama * g
                    self.qvalue[state][action] = g
                # Policy improvement: put all probability on the greedy action.
                qvalue_star = self.qvalue[state].max()
                action_star = self.qvalue[state].tolist().index(qvalue_star)
                self.policy[state] = np.zeros(shape = self.action_space_size)
                self.policy[state, action_star] = 1

    def mc_exploring_starts(self, length = 10):
        # MC Exploring Starts: every (state, action) pair is used as an
        # episode start, and every pair visited inside an episode receives a
        # return sample, so a single episode updates many q-values.
        policy = self.mean_policy.copy()
        qvalue = self.qvalue.copy()
        # returns[s][a] collects the sampled returns for pair (s, a); start
        # empty so the mean is not biased by a fake 0 sample, and size the
        # table from the class instead of hardcoding 25 x 5.
        returns = [[[] for _ in range(self.action_space_size)]
                   for _ in range(self.state_space_size)]
        # Repeat until the policy stops changing (L1 distance below tolerance).
        while np.linalg.norm(policy - self.policy, ord = 1) > 0.001:
            policy = self.policy.copy()
            for state in range(self.state_space_size):
                for action in range(self.action_space_size):
                    visit_list = []
                    g = 0
                    episode = self.obtain_episode(policy=self.policy, start_state=state,
                                                  start_action=action, length=length)
                    # Walk the episode backwards, accumulating the discounted
                    # return. Local names s, a keep the outer loop variables
                    # intact (reusing state/action here was a bug).
                    for step in range(len(episode) - 1, -1, -1):
                        reward = episode[step]['reward']
                        s = episode[step]['state']
                        a = episode[step]['action']
                        g = self.gama * g + reward
                        # One sample per pair per episode (a first-visit-style
                        # check; scanning backwards keeps the visit nearest
                        # the episode's end).
                        if [s, a] not in visit_list:
                            visit_list.append([s, a])
                            returns[s][a].append(g)
                            qvalue[s, a] = np.array(returns[s][a]).mean()
                            # Greedy policy improvement for the visited state.
                            qvalue_star = qvalue[s].max()
                            action_star = qvalue[s].tolist().index(qvalue_star)
                            self.policy[s] = np.zeros(shape = self.action_space_size)
                            self.policy[s, action_star] = 1
                    
    def mc_epsilon_greedy(self, length = 10, epsilon = 0):
        # MC epsilon-greedy: same sampling as exploring starts, but the
        # improved policy stays soft, so exploring starts are not required.
        # Note the default epsilon = 0 degenerates to the greedy update.
        policy = self.mean_policy.copy()
        qvalue = self.qvalue.copy()
        returns = [[[] for _ in range(self.action_space_size)]
                   for _ in range(self.state_space_size)]
        while np.linalg.norm(policy - self.policy, ord = 1) > 0.001:
            policy = self.policy.copy()
            for state in range(self.state_space_size):
                for action in range(self.action_space_size):
                    visit_list = []
                    g = 0
                    episode = self.obtain_episode(policy=self.policy, start_state=state,
                                                  start_action=action, length=length)
                    # Backward pass over the episode; s, a are local so the
                    # outer loop variables are not clobbered.
                    for step in range(len(episode) - 1, -1, -1):
                        reward = episode[step]['reward']
                        s = episode[step]['state']
                        a = episode[step]['action']
                        g = self.gama * g + reward
                        if [s, a] not in visit_list:
                            visit_list.append([s, a])
                            returns[s][a].append(g)
                            qvalue[s, a] = np.array(returns[s][a]).mean()
                            qvalue_star = qvalue[s].max()
                            action_star = qvalue[s].tolist().index(qvalue_star)
                            # Epsilon-greedy improvement: the greedy action gets
                            # 1 - (|A|-1)/|A| * epsilon, the others epsilon/|A|.
                            for a2 in range(self.action_space_size):
                                if a2 == action_star:
                                    self.policy[s, a2] = 1 - (
                                            self.action_space_size - 1) / self.action_space_size * epsilon
                                else:
                                    self.policy[s, a2] = 1 / self.action_space_size * epsilon
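
A hypothetical driver, only to show how the methods would be called: the enclosing class (named Solver here) and its constructor are not shown in this post, so everything in this snippet is an assumption.

    # Hypothetical: assumes a Solver class exposing state_space_size,
    # action_space_size, gama, policy, qvalue, mean_policy and obtain_episode.
    solver = Solver()
    solver.mc_epsilon_greedy(length=1000, epsilon=0.1)  # longer episodes help cover all states
    print(solver.policy)  # one row per state, each row a distribution over actions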