Classifications of RL

Three popular classifications of Reinforcement Learning

Reinforcement Learning [RL] is a family of algorithms that teach agent(s) how to behave in a specific environment to achieve the best outcome.

It requires the environment to follow a Markov Decision Process [MDP], which means the environment of RL shall have the following elements (a toy example follows the list):

  • State [$S$]: The current state of the environment

  • Action [$A$]: The action from the agent

  • Probability [$P_a$]: The probability of transitioning at time $t$ from state $S_t$ to $S_{t+1}$, given action $a$

  • Reward [$R_a$]: The reward from the environment for the transition $S_t \rightarrow S_{t+1}$
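To make the four elements concrete, here is a minimal sketch of an MDP written as plain data. The two states, two actions and all the numbers are assumptions for illustration only, not from this article; `P[s][a]` lists `(probability, next state, reward)` outcomes.

```python
import random

# Toy two-state MDP (illustrative assumption): P[s][a] -> [(prob, next_state, reward), ...]
mdp = {
    "states":  ["sunny", "rainy"],
    "actions": ["walk", "bus"],
    "P": {
        "sunny": {
            "walk": [(0.8, "sunny", +1.0), (0.2, "rainy", -1.0)],
            "bus":  [(1.0, "sunny",  0.0)],
        },
        "rainy": {
            "walk": [(0.9, "rainy", -2.0), (0.1, "sunny", +1.0)],
            "bus":  [(0.5, "rainy",  0.0), (0.5, "sunny",  0.0)],
        },
    },
}

def sample_transition(mdp, s, a):
    """Sample S_{t+1} and R_a according to the transition probabilities P_a."""
    probs, outcomes = zip(*[(p, (s2, r)) for p, s2, r in mdp["P"][s][a]])
    return random.choices(outcomes, weights=probs, k=1)[0]

print(sample_transition(mdp, "sunny", "walk"))   # e.g. ('sunny', 1.0)
```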

Model-Based & Model-Free

The first classification is about the RL algorithm's agnosticism towards the environment.

A classical RL problem is playing simple games such as Tic-Tac-Toe, where the scenarios are very limited and a best move for every scenario can easily be found through the Minimax strategy.

A typical RL algorithm for conditions where a model of the environment is accessible is Monte-Carlo Tree Search [MCTS]. It consists of four components:

  1. Selection: Given an action $a$, get the state transition $s \overset{a}{\longrightarrow} s'$

  2. Expansion: Expand the tree from state $s$ by adding unexplored child nodes

  3. Simulation: Heuristically or randomly act until reaching a terminal state

  4. Backpropagation: Backpropagate the reward to each selected node

Once we have a set of node values from reward backpropagation, we can deliver the best action accordingly. However, this requires the RL algorithm to have access to a model of the environment, in this case the transition probability $P(s\overset{a}{\rightarrow}s')$, and that is the reason why MCTS is a Model-Based algorithm.

By the way, because it backpropagates rewards into node values, MCTS is also a value-based algorithm.
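Below is a minimal sketch of the four MCTS components, assuming a toy single-player "count to 10" game that is not from this article (each move adds 1 or 2; landing exactly on 10 gives reward 1, overshooting gives 0):

```python
import math
import random

# Toy game (illustrative assumption): each move adds 1 or 2 to the state.
def legal_actions(state):
    return [1, 2]

def step(state, action):
    return state + action                      # deterministic s --a--> s'

def is_terminal(state):
    return state >= 10

def rollout_reward(state):
    # Simulation: act randomly until a terminal state, then score it
    while not is_terminal(state):
        state = step(state, random.choice(legal_actions(state)))
    return 1.0 if state == 10 else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                     # action -> child Node
        self.visits, self.value = 0, 0.0

    def ucb1(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts(root_state, iterations=500):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down the tree following the highest-UCB child
        while node.children and not is_terminal(node.state):
            node = max(node.children.values(), key=Node.ucb1)
        # 2. Expansion: add the children of the reached leaf
        if not node.children and not is_terminal(node.state):
            for a in legal_actions(node.state):
                node.children[a] = Node(step(node.state, a), parent=node)
            node = random.choice(list(node.children.values()))
        # 3. Simulation: random rollout from the newly reached node
        reward = rollout_reward(node.state)
        # 4. Backpropagation: push the reward back up through every selected node
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Deliver the best action according to the accumulated node statistics
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print("Best first move from state 0:", mcts(0))
```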

Value-Based & Policy-Based

Having a state-action value mapping [$V(s, a)$] is one great way to go. Another method is to directly update the Policy $\pi$ towards the optimal actions. A policy is a set of rules for choosing the best action given the current state.
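As a rough sketch of the contrast (the tiny table sizes and random parameters below are assumptions for illustration only): a value-based agent acts by looking up and maximizing stored values, while a policy-based agent samples directly from a parameterized $\pi(a|s)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Value-based view: keep a table of state-action values and act greedily on it
Q = rng.normal(size=(n_states, n_actions))      # V(s, a), learned elsewhere
def act_value_based(s):
    return int(np.argmax(Q[s]))                 # best action = highest value

# Policy-based view: keep parameters of pi(a|s) and sample an action from it
theta = rng.normal(size=(n_states, n_actions))  # policy parameters, learned elsewhere
def act_policy_based(s):
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()   # softmax pi(a|s)
    return int(rng.choice(n_actions, p=probs))

print(act_value_based(2), act_policy_based(2))
```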

In the Vanilla Policy Gradient [VPG], the policy is parametric. The algorithm starts off with randomized trajectories, i.e. sequences {state, action, reward, state, ...} collected with the current policy. With the reward information, we can calculate the policy gradient and update the policy until its parameters are (near-)optimal. A minimal sketch follows the list below.

  1. Trajectories: Start off with randomized actions to generate a trajectory; repeat multiple times to collect a set of trajectories

  2. Advantages: Use the rewards from the trajectories to calculate Advantages, i.e. how much better an action's return is compared with the average

  3. Gradient: Calculate the policy gradient weighted by the achieved Advantages

  4. Update: Update the policy parameters with the gradient
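Here is a minimal VPG/REINFORCE-style sketch on an assumed toy one-state, three-action problem; the reward means, batch size and learning rate are illustrative assumptions, not from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
theta = np.zeros(n_actions)                 # parametric policy pi_theta(a)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_reward(a):
    # Toy rewards (assumption): action 2 is the best on average
    return rng.normal(loc=[0.0, 0.5, 1.0][a], scale=0.1)

for epoch in range(200):
    # 1. Trajectories: sample a batch of actions from the current policy
    probs = softmax(theta)
    actions = rng.choice(n_actions, size=32, p=probs)
    rewards = np.array([sample_reward(a) for a in actions])
    # 2. Advantages: reward relative to the batch average (a simple baseline)
    advantages = rewards - rewards.mean()
    # 3. Gradient: grad log pi(a) * advantage, averaged over the batch
    grad = np.zeros(n_actions)
    for a, adv in zip(actions, advantages):
        grad += (np.eye(n_actions)[a] - probs) * adv   # d/dtheta log softmax(theta)[a]
    grad /= len(actions)
    # 4. Update: gradient ascent on the expected reward
    theta += 0.5 * grad

print("Learned action probabilities:", softmax(theta).round(3))
```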

Please note that Value-Based and Policy-Based are not mutually exclusive. There are algos which are hybrids of both, such as Advantage Actor Critic (A2C).

Actor and Critic

Actor-Critic

Another important policy-based algo is Deep Deterministic Policy Gradient [DDPG]. It introduces two networks:

  • Actor: Takes state $s$ as input and outputs an action $a$

  • Critic: Takes state $s$ and action $a$ as input and outputs a Q-value, i.e. the quality of the given state-action combination

DDPG adopts both the Actor and the Critic network to improve the quality of the actions. The steps are as follows (a sketch of the update step is given after the list):

  1. Initialize: Initialize the Actor network and the Critic network with random parameters, along with Target copies of them

  2. Buffer: Store a series of experiences (state, action, reward, next state) from the Actor network interacting with the environment in a replay buffer

  3. Evaluate: Sample some experiences from the replay buffer and calculate gradients based on the Critic network

  4. Backpropagate: Use the sampled experiences to calculate the gradients and backpropagate. Update the Actor and Critic networks directly, but only soft-update the Target networks.

  5. Converge: Loop the process of Buffer, Evaluate and Backpropagate until convergence.
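Below is a minimal PyTorch sketch of the Evaluate/Backpropagate step, assuming tiny network sizes, made-up dimensions and a fake batch standing in for a real replay buffer:

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005   # illustrative assumptions

# 1. Initialize: Actor, Critic and their Target copies
actor  = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    """One Evaluate/Backpropagate step on a batch sampled from the replay buffer."""
    s, a, r, s2, done = batch
    # 3. Evaluate: target Q from the Target networks, predicted Q from the Critic
    with torch.no_grad():
        q_targ = r + gamma * (1 - done) * critic_targ(torch.cat([s2, actor_targ(s2)], dim=1))
    q_pred = critic(torch.cat([s, a], dim=1))
    critic_loss = ((q_pred - q_targ) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # 4. Backpropagate through the Critic to improve the Actor's actions
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft-update the Target networks (Polyak averaging with rate tau)
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

# 2. Buffer would normally supply this; here a fake batch of 8 transitions
batch = (torch.randn(8, obs_dim), torch.rand(8, act_dim) * 2 - 1,
         torch.randn(8, 1), torch.randn(8, obs_dim), torch.zeros(8, 1))
ddpg_update(batch)
```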

Critic-Only

After the most complicated Actor-Critic network, the next one is the Critic-Only network, which means there is no Actor, only a Critic producing critiques. The representative work here is Deep Q-Network [DQN]. At a rough glance, it feels very much like DDPG without the Actor network: basically the Q network picks the actions and the Target network provides stable target values for evaluating them:

  1. Initialize: Initialize the Q network and the Target network with the same weights

  2. Buffer: The Q network picks the best actions, and the collected experiences are stored in the replay buffer

  3. Evaluate: Use the Target network to calculate a target Q value and the Q network to predict a Q value from the sampled experiences

  4. Backpropagate: The difference between the target and the predicted Q value is backpropagated into the Q network at every step, while the Target network is only updated (synced with the Q network) periodically

  5. Converge: Loop until both the Q network and the Target network converge

Because the whole point of DQN is to train a Q network that predicts the best Q-values in place of a Q-table, and at the end of the training only the Q network is kept, DQN is a value-based, Critic-only algorithm.
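A minimal PyTorch sketch of these pieces follows, with made-up dimensions, a fixed epsilon and a fake batch standing in for the replay buffer; the epsilon-greedy exploration is a common addition and an assumption beyond the steps above.

```python
import copy
import random
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, epsilon = 4, 2, 0.99, 0.1   # illustrative assumptions

# 1. Initialize: Q network and a Target copy with the same weights
q_net      = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
target_net = copy.deepcopy(q_net)
optimizer  = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def pick_action(state):
    """2. Buffer: the Q network picks actions (here with epsilon-greedy exploration)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(q_net(state).argmax())

def dqn_update(batch):
    s, a, r, s2, done = batch
    # 3. Evaluate: target Q from the Target network, predicted Q from the Q network
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_pred - q_target) ** 2).mean()
    # 4. Backpropagate the difference into the Q network at every step
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def sync_target():
    # Periodically copy the Q-network weights into the Target network
    target_net.load_state_dict(q_net.state_dict())

# Fake batch of 8 transitions standing in for samples from the replay buffer
batch = (torch.randn(8, obs_dim), torch.randint(n_actions, (8,)),
         torch.randn(8), torch.randn(8, obs_dim), torch.zeros(8))
dqn_update(batch)
print(pick_action(torch.randn(1, obs_dim)))
```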

In case you are curious: Why is a target network required?

https://stackoverflow.com/questions/54237327/why-is-a-target-network-required

Actor-Only

Similar to the Critic-Only network, Actor-Only means that only an Actor network exists. The most famous Actor-Only algo is VPG, which has been described above.

On-Policy & Off-Policy

The most cited On-Policy algo is Proximal Policy Optimization [PPO]. Let's hop into the steps first:

  1. Initialize: An Actor (Policy) network and a Critic network are initialized with random parameters

  2. Evaluation: At every step, the difference between the reward and the Critic network's prediction is calculated as the Advantage of the action made by the Policy network, and an Objective function (the Clipped Surrogate Objective) for optimization is calculated based on the Advantages. The same applies to the gradient for the Critic network

  3. Optimize: Use gradient ascent to optimize the policy based on the Objective

  4. Backpropagate: Use the gradient to update the Critic network parameters

  5. Converge: Repeat until convergence

As you may have observed, unlike DDPG or DQN, there is no target network in PPO. The main reason is that all RL algos want a stable learning pace: the target network ensures this through soft updates, whereas PPO achieves it through the objective function. PPO is On-Policy; DDPG and DQN, on the contrary, are Off-Policy algorithms.
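Here is a minimal sketch of the Clipped Surrogate Objective in PyTorch; the log-probabilities and Advantages below are made-up numbers, whereas in practice they come from the collected trajectories and the Critic.

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the minimum means an overly large policy step earns no extra objective,
    # which keeps the learning pace stable without needing a target network
    return torch.min(unclipped, clipped).mean()

# Made-up batch values standing in for data from collected trajectories
logp_old   = torch.log(torch.tensor([0.50, 0.40, 0.20]))
logp_new   = torch.log(torch.tensor([0.70, 0.30, 0.25]))
advantages = torch.tensor([1.0, -0.5, 2.0])

print(clipped_surrogate(logp_new, logp_old, advantages))      # maximize via gradient ascent
```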

To conclude: whether the Exploration (data collection) and the Exploitation (generating actions from learned knowledge) come from the same policy is the distinction between On-Policy and Off-Policy. Off-Policy algos use a different policy/network for collecting data than for evaluation, which naturally provides the randomness for Exploration. Meanwhile, On-Policy algos manage to encourage exploration via clipping learning steps, including designed randomness, etc.
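As an illustration of the Off-Policy case, here is a tiny tabular Q-learning sketch (the two-state toy environment and hyper-parameters are assumptions for illustration): the epsilon-greedy behaviour policy collects the data, while the greedy max inside the update is the target policy actually being learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
epsilon, alpha, gamma = 0.3, 0.1, 0.9

def step(s, a):
    # Toy dynamics: action 1 moves to the other state and pays +1, action 0 stays and pays 0
    return (1 - s, 1.0) if a == 1 else (s, 0.0)

s = 0
for _ in range(2000):
    # Behaviour policy (Exploration): epsilon-greedy over the current Q-table
    a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
    s2, r = step(s, a)
    # Target policy (Exploitation): greedy max over Q, regardless of how a was chosen
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2

print(Q.round(2))   # action 1 ends up with the higher value in both states
```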

Conclusion

An example table for an overview of the algorithms discussed:

| On/Off-Policy | Value-Based              | Policy-Based                 | Hybrid             |
| ------------- | ------------------------ | ---------------------------- | ------------------ |
| On-Policy     | SARSA (Critic-Only)      | Policy Gradient (Actor-Only) | A2C (Actor-Critic) |
| Off-Policy    | Q-Learning (Critic-Only) | DDPG (Actor-Critic)          | SAC (Actor-Critic) |

References

  • Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems (Vol. 37, p. 14). Cambridge, UK: University of Cambridge, Department of Engineering.

  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.

  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... & Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937). PMLR.

  • Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8, 279-292.

  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

  • Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018, July). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning (pp. 1861-1870). PMLR.

  • Meng, T. L., & Khushi, M. (2019). Reinforcement learning in financial markets. Data, 4(3), 110.
