
Reinforcement Learning

Reinforcement learning

https://en.wikipedia.org/wiki/Reinforcement_learning

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.[1] Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

Reinforcement learning differs from supervised learning in not needing labelled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).[2]

The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context use dynamic programming techniques.[3] The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.
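
To make the MDP formulation concrete, a small MDP can be written down explicitly as states, actions, transition probabilities, and rewards. The sketch below is a hypothetical two-state example; all names and numbers are illustrative assumptions, not taken from any of the referenced projects.

import random

# A tiny, hypothetical two-state MDP.
# transition[state][action] -> list of (probability, next_state, reward)
transition = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.5)],
        "move": [(1.0, "s0", 0.0)],
    },
}

def step(state, action):
    # Sample the next state and reward from the model.
    r, cumulative = random.random(), 0.0
    for prob, next_state, reward in transition[state][action]:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return next_state, reward  # guard against floating-point round-off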

 

Q-learning

https://en.wikipedia.org/wiki/Q-learning#Algorithm

Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state.[1] Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy.[1] "Q" refers to the function that the algorithm computes - the expected rewards for an action taken in a given state.[2]
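
The Q function is learned by repeatedly applying the tabular update Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)). Below is a minimal sketch of that single update step; the function and variable names are illustrative.

import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    # One tabular Q-learning step:
    # Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = np.max(q_table[next_state])
    q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])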

 

 

RPS High Level Environment

https://github.com/dennylslee/rock-paper-scissors-DeepRL

Let's return to the RPS (rock-paper-scissors) problem.

Here, the state is designed around the win/lose outcome.

The environment of this game play is depicted below. It follows the classical RL environment definition.

[Figure: the classical RL agent-environment loop applied to the RPS game]

Inside the "environment" is the embedded player 2 (i.e. the opponent). This player might adopt different types of play strategy. The interaction contains the following:

  1. action space: either rock, paper, or scissors that the AI agent (player 1) puts out.
  2. rewards: this is an indication from the environment back to player 1. The reward is simply a value of 1 if it is a win for player 1, and 0 otherwise.
  3. state: this is where the fun is and some creativity comes into play (which might affect player 1's winning outcome). In this setup, we have designed the state space to be the following (a sketch of this state vector follows the list):
    • win, tie, lost indicators: one of the three can be set to a 1 for a particular state
    • winRateTrend, tieRateTrend, lostRateTrend: this is an indicator which reflects a positively-trending moving average (set to 1) or not (set to 0). All three indicators are assessed independently.
    • winRateMovingAvg, tieRateMovingAvg, lostRateMovingAvg: floating point value between 0 and 1 which indicates the rate. This rate is calculated based on a configured moving average window size.
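
A rough sketch of how such a state vector could be assembled, assuming a moving-average window of 20; the names and window handling below are illustrative assumptions, not code from the referenced repository.

from collections import deque

WINDOW = 20  # assumed moving-average window size

results = deque(maxlen=WINDOW)   # each entry is "win", "tie", or "lost"
prev_rates = {"win": 0.0, "tie": 0.0, "lost": 0.0}

def build_state(latest_result):
    # Returns: one-hot result indicators, trend flags, and moving-average rates.
    results.append(latest_result)
    keys = ("win", "tie", "lost")
    one_hot = [1 if latest_result == k else 0 for k in keys]
    rates = {k: sum(r == k for r in results) / len(results) for k in keys}
    trends = [1 if rates[k] > prev_rates[k] else 0 for k in keys]
    prev_rates.update(rates)
    return one_hot + trends + [rates[k] for k in keys]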

 

Using the Opponent's Move as the State

https://stats.stackexchange.com/questions/291906/can-reinforcement-learning-be-stateless

Here it is suggested to use the opponent's move as the state.

If you wrote a rock-paper-scissors agent to play against a human opponent, you might actually formulate it as a reinforcement learning problem, taking the last N plays as the state, because that could learn to take advantage of human players' poor judgement of randomness.
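
One simple way to do this is to pack the last N opponent plays into a single discrete state index, for example with a base-3 encoding. A minimal sketch follows; the function name is hypothetical.

MOVES = ["R", "P", "S"]

def encode_state(last_n_plays):
    # Pack the last N opponent plays into one base-3 integer,
    # e.g. ["R", "P"] -> 0 * 3 + 1 = 1.  The state space size is 3 ** N.
    index = 0
    for play in last_n_plays:
        index = index * 3 + MOVES.index(play)
    return index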

 

https://github.com/raul1991/rock-paper-scissors-RL

This example is one implementation that uses the opponent's move as the state.

However, the place where it updates the Q value is problematic: it uses max Q(action) as the expected future reward of the next state, which is incorrect.

It is just a simple version of a rock-paper-scissors game. However, to make the game deterministic it takes in the move of the player and learns to maximize its wins.
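
The essence of the problem described above can be sketched as a pair of update functions. The "buggy" form below is a reconstruction from the description, not code copied from that repository.

import numpy as np

def buggy_update(q_table, state, action, reward, alpha=0.5, gamma=0.2):
    # Bootstraps from the current state's own row; the next state never enters the update.
    q_table[state, action] += alpha * (reward + gamma * np.max(q_table[state]) - q_table[state, action])

def correct_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.2):
    # Standard Q-learning: bootstrap from the best action value of the next state.
    q_table[state, action] += alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state, action])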

 

Corrected Version

https://github.com/fanqingsong/boilerplate-rock-paper-scissors-RL

import random
import numpy as np


"""
reference:
https://github.com/raul1991/rock-paper-scissors-RL
the state should be the opponent's play; using the last N plays would be even better
https://stats.stackexchange.com/questions/291906/can-reinforcement-learning-be-stateless
Note: this code uses only the last play as the state; against abbey and kris the winning rate is not noticeably improved,
but most of the time it can beat all players.
--------- you vs quincy ----------
Final results: {'p1': 386, 'p2': 140, 'tie': 474}
Player 1 win rate: 73.38403041825094%
--------- you vs abbey ----------
Final results: {'p1': 3525, 'p2': 3306, 'tie': 3169}
Player 1 win rate: 51.60298638559509%
--------- you vs kris ----------
Final results: {'p1': 3295, 'p2': 3262, 'tie': 3443}
Player 1 win rate: 50.251639469269485%
--------- you vs mrugesh ----------
Final results: {'p1': 609, 'p2': 230, 'tie': 161}
Player 1 win rate: 72.58641239570917%
The state can also be designed as WIN / LOSE / TIE:
https://github.com/dennylslee/rock-paper-scissors-DeepRL
"""

class Bot(object):
    # our states are the opponent's previous play: ROCK, PAPER, or SCISSORS
    state_space = 3

    # three actions by our player
    action_space = 3

    q_table = np.random.uniform(low=-2, high=5, size=(3, 3))
    total_reward, reward = 0, 0
    avg_rewards_list = []
    avg_reward = 0
    result = 'DRAW'
    tags = ["R", "P", "S"]
    # loses-to map: the key's move loses to the value's move
    loses_to = {
        "0": 1,  # rock loses to paper
        "1": 2,  # paper loses to scissors
        "2": 0  # scissors loses to rock
    }

    def __init__(self, alpha=0.5, gamma=0.2, epsilon=0.8, min_eps=0, episodes=1000, verbose=False):
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.min_eps = min_eps
        self.episodes = episodes
        # Calculate episodic reduction in epsilon
        self.reduction = (epsilon - min_eps) / episodes

        self.verbose = verbose

    # either explore or exploit; either way, return the next action
    def bot_move(self, player_move):
        action = 0
        # Determine next action - epsilon greedy strategy
        if np.random.random() < 1 - self.epsilon:
            if self.verbose:
                print("Exploiting....")

            action = np.argmax(self.q_table[player_move])
        else:
            if self.verbose:
                print("Exploring.....")

            action = np.random.randint(0, self.action_space)

        # Decay epsilon
        if self.epsilon > self.min_eps:
            self.epsilon -= self.reduction

        if self.verbose:
            print("choose ", self.tags[action])

        return action

    def get_reward(self, player, bot):
        reward = 0

        if self.get_result(player, bot) == 'WIN':
            reward = 5
        elif self.get_result(player, bot) == 'LOSE':
            reward = -2
        else:
            # Draw case
            reward = 4

        return reward

    # update q_table with the standard Q-learning rule:
    # Q(s, a) += alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    def update_experience(self, state, action, reward, player_next_move):
        max_next_q = np.max(self.q_table[player_next_move])
        delta = self.alpha * (reward + self.gamma * max_next_q - self.q_table[state, action])
        self.q_table[state, action] += delta

    def print_stats(self, player, bot, reward):
        if self.verbose:
            print("Player move : {0}, bot: {1}, reward: {2}, result: {3}, total_reward: {4}".format(self.tags[player],
                                                                                                    self.tags[bot], reward,
                                                                                                    self.result,
                                                                                                    self.total_reward))
            print(self.q_table)

    # returns WIN, LOSE, or DRAW from the bot's (player 1's) perspective
    def get_result(self, player_move, bot_move):
        if bot_move == player_move:
            self.result = 'DRAW'
        elif self.loses_to[str(bot_move)] == player_move:
            self.result = 'LOSE'
        else:
            self.result = 'WIN'

        return self.result

    def get_avg_rewards(self):
        return self.avg_rewards_list

    def learn(self, player_move, bot_move, player_next_move):
        # add reward
        reward = self.get_reward(player_move, bot_move)

        self.total_reward += reward
        self.avg_rewards_list.append(reward)

        # update experience
        self.update_experience(player_move, bot_move, reward, player_next_move)
        self.print_stats(player_move, bot_move, reward)


# When a new opponent starts, opponent_history will be an empty list.
# At that moment, we should create a new bot to learn that opponent's pattern.

bot_player = None


def player(opponent_prev_play, opponent_history, me_prev_play, me_history, num_games, verbose=False):
    # print("call player")
    # print(prev_play)
    # print(len(opponent_history))

    global bot_player

    play_list = ["R", "P", "S"]
    win_dict = {"R": "P", "P": "S", "S": "R"}

    if len(opponent_history) == 0:
        bot_player = Bot(verbose=verbose, episodes=num_games)

    # assume the opponent played R before the real first round
    opponent_prev_play_index = 0

    if opponent_prev_play in play_list:
        if len(opponent_history) > 0:
            opponent_prev_prev_play = opponent_history[-1]
            opponent_prev_prev_play_index = play_list.index(opponent_prev_prev_play)

        opponent_history.append(opponent_prev_play)
        opponent_prev_play_index = play_list.index(opponent_prev_play)

    if me_prev_play in play_list:
        if len(me_history) > 1:
            me_prev_prev_play = me_history[-1]
            me_prev_prev_play_index = play_list.index(me_prev_prev_play)

        me_history.append(me_prev_play)

    # once at least three opponent plays are known, we have a full (state, action, next_state) transition
    if len(opponent_history) >= 3:
        state = opponent_prev_prev_play_index
        next_state = opponent_prev_play_index
        action = me_prev_prev_play_index
        bot_player.learn(state, action, next_state)

    me_play_index = bot_player.bot_move(opponent_prev_play_index)

    if verbose:
        print(f"bot chooses play index {me_play_index}")

    me_play = play_list[me_play_index]

    return me_play
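
A minimal driver loop to exercise player() against a scripted opponent might look like the sketch below. The random opponent and the loop itself are assumptions for illustration; the actual project is driven by its own game harness.

import random

def play_match(num_games=1000):
    # Drive player() against a purely random opponent (illustrative only).
    play_list = ["R", "P", "S"]
    beats = {"R": "P", "P": "S", "S": "R"}   # value beats key
    opponent_history, me_history = [], []
    opponent_prev, me_prev = "", ""
    score = {"p1": 0, "p2": 0, "tie": 0}

    for _ in range(num_games):
        me_play = player(opponent_prev, opponent_history, me_prev, me_history, num_games)
        opponent_play = random.choice(play_list)
        if me_play == opponent_play:
            score["tie"] += 1
        elif beats[opponent_play] == me_play:
            score["p1"] += 1
        else:
            score["p2"] += 1
        opponent_prev, me_prev = opponent_play, me_play

    return score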

 

 

Principle

 

 

 

 

Recommended Project

https://github.com/readyforchaos/Reinforcement-QLearning-Epsilon-Greedy

How the pirate ship learned to find the treasure - a reinforcement learning story

Reinforcement learning is a fun topic, especially when you set out to create something that is visually appealing while maintaining the logical part of things. We set out to use Q-learning, where we represent scored actions (in states) for an agent that tries to get a reward. In our case, this means a pirate ship trying to get to the treasure!

We begin by mapping actions for every state (location) into a table (the Q-table). This table essentially becomes the "answer" to the optimal path (set of actions) towards the reward.

The tricky part here is that not all the actions give a positive reward. We placed a bomb in the environment to indicate to the agent that this is a dangerous path to follow. To learn this, it must explore. An interesting note here is that you can easily see that the ship takes a lot of seemingly random actions while exploring the environment. After a while, the ship gets smarter and makes more sensible decisions, which originate from the Q-table behind the scenes.

We chose to build a visual environment, although everything could be done without one. The reason is to illustrate how reinforcement learning and Q-learning work in an appealing manner. We used HTML/CSS/JavaScript to build out this sample.
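
A condensed sketch of that kind of setup, written here as a 1-D "treasure hunt" with a bomb. The layout, rewards, and hyperparameters are illustrative assumptions, and the language is Python for brevity; the original project was built with HTML/CSS/JavaScript.

import numpy as np

# Cells 0..4: the ship starts at 2, the bomb is at 0 (-10), the treasure at 4 (+10).
N_STATES, N_ACTIONS = 5, 2          # actions: 0 = left, 1 = right
START, BOMB, TREASURE = 2, 0, 4
alpha, gamma, epsilon = 0.5, 0.9, 0.1
q_table = np.zeros((N_STATES, N_ACTIONS))

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    if next_state == TREASURE:
        return next_state, 10.0, True
    if next_state == BOMB:
        return next_state, -10.0, True
    return next_state, -1.0, False   # small step cost favors short paths

for episode in range(500):
    state, done = START, False
    while not done:
        if np.random.random() < epsilon:
            action = np.random.randint(N_ACTIONS)        # explore
        else:
            action = int(np.argmax(q_table[state]))      # exploit
        next_state, reward, done = step(state, action)
        target = reward + gamma * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state

# After training, argmax over each row of q_table gives the learned policy:
# from the start cell, always move right toward the treasure and away from the bomb.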

 
