Beating a professional player at Go (an ancient Chinese strategy board game) was a long-standing challenge in Artificial Intelligence (AI). Before explaining what Reinforcement Learning (RL) is, it helps to see where it sits in the hierarchy: like many other techniques in AI, RL is a subset of Machine Learning (ML).
What is Machine Learning?
In simple terms, it is just statistics: the mathematics of data.
What is Intelligence?
This is a highly debatable question. It means different things in different disciplines, and the answer depends on whom you ask: a neuroscientist, a physicist, a mathematician, a computer scientist, a psychologist, and so on. In this segment, I will define intelligence the way I see it: a measure of an agent's ability to achieve goals across a wide range of tasks in various domains without any prior knowledge (tasks it has never seen before). If we think about human-level AGI, then we are talking about a system that can do pretty much the full spectrum of cognitive tasks that humans can, at least as well as humans can do them. We humans are agents of this kind: we carry a very powerful general-purpose learning algorithm, the brain, and the human mind is proof that this level of generality is possible.
What is Reinforcement Learning?
RL is learning what to do in an environment: how to map situations to actions so as to maximize a numerical reward signal, which measures how well the agent is doing. The learner, i.e. the agent, is not told which actions to take to achieve a goal; instead it must discover which actions yield the most reward by trying them. The RL problem is simply to take actions over time so as to maximise that reward signal. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.
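A minimal sketch of this interaction loop, assuming a made-up toy `Environment` with Gym-style reset/step methods and an agent that has not learned anything yet:

```python
import random

class Environment:
    """Toy environment (made up for illustration): the 'good' action is 1."""
    def reset(self):
        self.t = 0
        return 0                                # initial state

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0    # numerical reward signal
        done = self.t >= 10                     # episode ends after 10 steps
        return self.t, reward, done             # next situation, reward, end flag

env = Environment()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])              # a (not yet learned) policy
    state, reward, done = env.step(action)
    total_reward += reward                      # the quantity the agent tries to maximise
print(total_reward)
```

By trial and error, a learning agent would discover that action 1 yields more reward than action 0 and adjust its behaviour accordingly.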
Major components of Reinforcement Learning
Beyond the agent and the environment, a reinforcement learning system has four main sub-elements: a policy, a reward signal, a value function, and, optionally, a model of the environment.
Policy: A policy defines the learning agent’s way of behaving at a given time. It’s a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action.
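As a sketch, a stochastic policy can be represented simply as a table of action probabilities per perceived state (the state and action names below are made up for illustration):

```python
import random

# A stochastic policy: for each perceived state, a probability for each action.
policy = {
    "low_battery":  {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def act(state):
    """Sample an action according to the policy's probabilities for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("low_battery"))   # usually "recharge"
```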
Reward: The reward signal is one of the distinctive things about RL; it defines the goal of a reinforcement learning problem. On each time step, the environment sends the agent a single number called the reward. The agent's goal is simply to take actions that maximise the total reward it receives over time. The reward signal thus defines what the good and bad events are for the agent. In a biological system, we might think of rewards as analogous to the experiences of pleasure or pain. They are the immediate and defining features of the problem faced by the agent. The reward signal is the primary basis for altering the policy; if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future.
Value function: Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true.
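In sketch form, the value of a state is the expected discounted sum of future rewards; a Monte Carlo estimate simply averages the returns observed after visiting that state (the discount factor and the sample rewards below are illustrative):

```python
gamma = 0.9  # discount factor

def discounted_return(rewards):
    """Return G = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A state with low immediate reward can still have high value
# if high rewards reliably follow it.
rewards_after_state = [0.0, 0.0, 10.0]          # rewards observed after the state
print(discounted_return(rewards_after_state))   # 0 + 0.9*0 + 0.81*10 = 8.1
```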
Model: A model mimics the behaviour of the environment, or, more generally, allows inferences to be made about how the environment will behave. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.
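A sketch of planning with a hypothetical one-step model: given a state and a candidate action, the model predicts the next state and reward, and the agent picks the action whose predicted outcome looks best according to its current value estimates (all states, actions, and numbers here are made up):

```python
# Hypothetical learned model: (state, action) -> (predicted next state, predicted reward)
model = {
    ("s0", "left"):  ("s1", 0.0),
    ("s0", "right"): ("s2", 1.0),
}
# Current value estimates for states (would normally be learned).
V = {"s1": 5.0, "s2": 0.5}
gamma = 0.9

def plan(state, actions):
    """One-step lookahead: choose the action with the best predicted reward + value."""
    def backed_up_value(a):
        next_state, reward = model[(state, a)]
        return reward + gamma * V[next_state]
    return max(actions, key=backed_up_value)

print(plan("s0", ["left", "right"]))   # "left": 0 + 0.9*5.0 beats 1 + 0.9*0.5
```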
Challenges In Reinforcement Learning
One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration–exploitation dilemma has been intensively studied by mathematicians for many decades, yet remains unresolved. For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in the purest forms of these paradigms.
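A common, simple way to balance the two is epsilon-greedy action selection: exploit the action with the highest estimated reward most of the time, but explore a random action with small probability epsilon. A minimal sketch on a two-armed stochastic bandit (the reward probabilities are made up):

```python
import random

true_reward_prob = [0.3, 0.7]            # unknown to the agent
Q = [0.0, 0.0]                           # estimated reward of each action
N = [0, 0]                               # how many times each action was tried
epsilon = 0.1

for step in range(1000):
    if random.random() < epsilon:
        a = random.randrange(2)                      # explore
    else:
        a = max(range(2), key=lambda i: Q[i])        # exploit
    r = 1.0 if random.random() < true_reward_prob[a] else 0.0
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                        # incremental average of observed rewards

print(Q)   # estimates approach [0.3, 0.7]; action 1 is progressively favoured
```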
Other Approaches
There are various other research directions in RL, such as exploration and adaptation to new tasks via transfer learning, which remains a major challenge in RL.
There are many transfer learning approaches in RL, such as reward shaping, learning from demonstrations, policy transfer, transfer via policy reuse, inter-task mapping, reusing representations, etc. An important caveat is that transfer learning in RL has limitations: sometimes an algorithm performs well beyond human level in a simulated environment, yet fails to adapt when transferred to the real world, as often happens in robotics. The advantage of a transferable representation is that even if the initial learning takes a large amount of time and training data, this cost can be amortised by allowing a policy to apply to a wide variety of similar environments (a rough sketch of this idea follows below).
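As a rough sketch of reusing a learned representation, one common pattern is to copy the feature-extraction layers of a network trained on a source task into the agent for a target task and fine-tune only a new head. The PyTorch-style module below is purely illustrative (the layer sizes and network structure are assumptions, not taken from any of the cited papers):

```python
import torch.nn as nn

# Hypothetical source-task policy network: shared features + task-specific head.
class PolicyNet(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(8, 64), nn.ReLU())  # shared representation
        self.head = nn.Linear(64, n_actions)                        # task-specific output

    def forward(self, x):
        return self.head(self.features(x))

source = PolicyNet(n_actions=4)
# ... train `source` on the source task ...

target = PolicyNet(n_actions=6)                                 # new task, new action space
target.features.load_state_dict(source.features.state_dict())  # reuse the representation
for p in target.features.parameters():
    p.requires_grad = False     # optionally freeze it and fine-tune only the new head
```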
AlphaGo
The main innovation that made AlphaGo such a strong player is that it selected moves using a novel version of Monte Carlo tree search (MCTS) that was guided by both a policy and a value function, learned by reinforcement learning with function approximation provided by deep convolutional ANNs. Another key feature is that instead of reinforcement learning starting from random network weights, it started from weights that were the result of previous supervised learning from a large collection of human expert moves.
The DeepMind team called AlphaGo’s modification of basic MCTS “asynchronous policy and value MCTS,” or APV-MCTS. It selected actions via basic MCTS as described above but with some twists in how it extended its search tree and how it evaluated action edges. In contrast to basic MCTS, which expands its current search tree by using stored action values to select an unexplored edge from a leaf node, APV-MCTS, as implemented in AlphaGo, expanded its tree by choosing an edge according to probabilities supplied by a 13-layer deep convolutional ANN, called the SL-policy network, trained previously by supervised learning to predict moves contained in a database of nearly 30 million human expert moves.
Then, also in contrast to basic MCTS, which evaluates the newly-added state node solely by the return of a rollout initiated from it, APV-MCTS evaluated the node in two ways: by this return of the rollout, but also by a value function learned previously by a reinforcement learning method. The two evaluations were mixed as

v(s) = (1 − η) · v_θ(s) + η · G,

where G was the return of the rollout and η controlled the mixing of the values resulting from these two evaluation methods. In AlphaGo, the v_θ values were supplied by the value network, another 13-layer deep convolutional ANN trained to output estimated values of board positions. APV-MCTS's rollouts in AlphaGo were simulated games with both players using a fast rollout policy provided by a simple linear network, also trained by supervised learning before play. Throughout its execution, APV-MCTS kept track of how many simulations passed through each edge of the search tree, and when its execution completed, the most-visited edge from the root node was selected as the action to take, here the move AlphaGo actually made in a game.
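A sketch of that mixed leaf evaluation and of the final move choice; the `value_network` and `rollout` functions and the node/edge fields are stand-ins for illustration, not AlphaGo's actual code:

```python
def evaluate_leaf(leaf_state, value_network, rollout, eta=0.5):
    """APV-MCTS-style leaf value: mix the value network's estimate with a rollout return."""
    v_theta = value_network(leaf_state)     # learned value function estimate
    G = rollout(leaf_state)                 # return of a fast-rollout simulated game
    return (1 - eta) * v_theta + eta * G    # v = (1 - eta) * v_theta + eta * G

def select_move(root):
    """After the search budget is spent, play the most-visited edge from the root."""
    return max(root.children, key=lambda edge: edge.visit_count).move
```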
The "AlphaGo pipeline," as the DeepMind team called it, comprises the networks used by AlphaGo and the steps taken to train them. All these networks were trained before any live game play took place, and their weights remained fixed throughout live play.
As a note on the game itself: generalised Go is at least PSPACE-hard, and the number of possible board configurations exceeds the number of atoms in the observable universe.
AlphaGo Zero
In contrast to AlphaGo, this program used no human data or guidance beyond the basic rules of the game (hence the Zero in its name). It learned exclusively from self-play reinforcement learning, with input giving just "raw" descriptions of the placements of stones on the Go board. AlphaGo Zero implemented a form of policy iteration, interleaving policy evaluation with policy improvement. A major difference between AlphaGo Zero and AlphaGo is that AlphaGo Zero used MCTS to select moves throughout self-play reinforcement learning, whereas AlphaGo used MCTS only for live play, not during the learning process. Other differences, besides not using any human data or human-crafted features, are that AlphaGo Zero used only one deep convolutional ANN and a simpler version of MCTS.
AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include rollouts of complete games, and therefore did not need a rollout policy. Each iteration of AlphaGo Zero’s MCTS ran a simulation that ended at a leaf node of the current search tree instead of at the terminal position of a complete game simulation. But as in AlphaGo, each iteration of MCTS in AlphaGo Zero was guided by the output of a deep convolutional network.
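A sketch of that difference: in AlphaGo Zero a simulation stops at a leaf of the current search tree, and the single network supplies both move priors and a value for that leaf, so no rollout policy is needed. The `network` callable and the node methods below are illustrative assumptions, not the actual DeepMind implementation:

```python
def expand_and_evaluate(leaf, network):
    """AlphaGo Zero-style leaf handling: no rollout to the end of the game.
    The single network returns move priors p and a value v for the leaf position."""
    p, v = network(leaf.state)               # one deep convolutional net, two outputs
    for move, prior in p.items():
        leaf.add_child(move, prior=prior)    # priors guide later tree traversal
    return v                                 # backed up along the path visited in this simulation
```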
References:
Reinforcement Learning lectures by David Silver
Reinforcement Learning: An Introduction by Sutton & Barto
Mastering the game of Go without human knowledge, DeepMind
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, DeepMind
Transfer Learning in Deep Reinforcement Learning by Zhuangdi Zhu, Kaixiang Lin, and Jiayu Zhou