Classifications of RL
Three popular classifications of Reinforcement Learning
Reinforcement Learning [RL] is a family of algorithms that train agent(s) to behave in a specific environment so as to achieve the best outcome.
It requires the environment to follow a Markov Decision Process [MDP], which means the environment of RL shall have the following components (a toy example follows the list):
State $s_t$: the current state of the environment at time $t$
Action $a_t$: the action taken by the agent
Probability $P(s_{t+1} \mid s_t, a_t)$: the probability of transitioning at time $t$ from state $s_t$ to state $s_{t+1}$ given action $a_t$
Reward $R(s_t, a_t, s_{t+1})$: the reward from the environment for the transition from $s_t$ to $s_{t+1}$
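To make these components concrete, here is a minimal Python sketch of a made-up two-state environment; the states, actions, transition probabilities, and reward values are purely illustrative.

```python
# A toy MDP: two states, two actions, explicit transition probabilities and rewards.
import random

states = ["s0", "s1"]
actions = ["left", "right"]

# P[(s, a)] -> list of (next_state, probability)
P = {
    ("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
    ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s1", 1.0)],
}

# R[(s, a, s')] -> scalar reward for that transition (default 0)
R = {
    ("s0", "right", "s1"): 1.0,  # reaching s1 from s0 is rewarded
}

def step(state, action):
    """Sample the next state from P and look up the reward for the transition."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    reward = R.get((state, action, next_state), 0.0)
    return next_state, reward

state = "s0"
for _ in range(5):
    action = random.choice(actions)  # a purely random agent, for illustration
    state, reward = step(state, action)
    print(action, state, reward)
```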
Model-Based & Model-Free
The first classification concerns the agent's agnosticism towards the environment, i.e. whether it has access to a model of the environment's dynamics.
A classical RL problem is to play simple games such as Tic-Tac-Toe, where the number of scenarios is very limited and a best move for every scenario can easily be found through a Minimax strategy.
A typical RL algorithm for situations where a model of the environment is accessible is Monte-Carlo Tree Search [MCTS]. It consists of four components:
Selection: starting from the root, repeatedly pick an action at each node and follow the resulting state transition until a leaf node is reached
Expansion: expand the leaf node by adding child nodes for its untried actions
Simulation: heuristically or randomly act from the new node until reaching a terminal state
Backpropagation: backpropagate the terminal reward to each node along the selected path
Once we have a set of node values from reward backpropagation, we can deliver the best action accordingly. However, this requires the RL algorithm to know the model of the environment, in this case the transition probabilities $P(s_{t+1} \mid s_t, a_t)$, and that is the reason why MCTS is a Model-Based algorithm.
Incidentally, because it backpropagates rewards into node values, MCTS is also a value-based method. A simplified sketch is shown below.
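The following is a simplified, illustrative MCTS sketch. The `game` object (with `legal_actions`, `step`, `is_terminal`, and `reward` methods) is a hypothetical model of the environment; having access to such a model is exactly what makes this Model-Based.

```python
# Simplified MCTS: selection (UCB), expansion, random simulation, backpropagation.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # sum of backpropagated rewards

def ucb(child, parent_visits, c=1.4):
    """Upper confidence bound used for selection; unvisited children go first."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(game, root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down the tree, picking the child with the best UCB score
        while node.children and not game.is_terminal(node.state):
            action, node = max(node.children.items(), key=lambda kv: ucb(kv[1], node.visits))
        # 2. Expansion: add one child per legal action of the selected leaf
        if not game.is_terminal(node.state):
            for action in game.legal_actions(node.state):
                node.children[action] = Node(game.step(node.state, action), parent=node)
            node = random.choice(list(node.children.values()))
        # 3. Simulation: random rollout until a terminal state
        state = node.state
        while not game.is_terminal(state):
            state = game.step(state, random.choice(game.legal_actions(state)))
        reward = game.reward(state)
        # 4. Backpropagation: push the rollout reward back up the selected path
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Deliver the best action: the most visited child of the root
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```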
Value-Based & Policy-Based
Having a state-action value mapping $Q(s, a)$ is one great way to go. Another approach is to update the Policy directly towards the optimal actions. A Policy stands for a set of rules for what the best action is given the current state.
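As a toy illustration of the difference: a value-based agent keeps a $Q(s, a)$ mapping and acts greedily on it, while a policy-based agent keeps the decision rule itself. The states, actions, and numbers below are made up.

```python
# Value-based view: keep Q(s, a) and act greedily with respect to it.
Q = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.7,
    ("s1", "left"): 0.4, ("s1", "right"): 0.2,
}

def greedy_action(state, actions=("left", "right")):
    return max(actions, key=lambda a: Q[(state, a)])

# Policy-based view: keep the rule itself, state -> action (or a distribution over actions).
policy = {"s0": "right", "s1": "left"}

assert greedy_action("s0") == policy["s0"]
```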
In Vanilla Policy Gradient [VPG], the policy is parametric. The algorithm starts off with randomized trajectories, where a trajectory is the sequence {state, action, reward, state, ...} generated by the current policy. With the reward information, we can calculate the policy gradient and keep updating the policy until its parameters are optimal.
Trajectories: start off with randomized actions to roll out a trajectory, and repeat multiple times to collect a set of trajectories
Advantages: use the rewards from the trajectories to calculate Advantages, i.e. how much better each action's reward is compared to the average
Gradient: calculate the policy gradient based on how much Advantage is achieved
Update: update the policy with the gradient (a minimal sketch follows)
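Below is a minimal REINFORCE-style sketch of these four steps, assuming PyTorch and Gymnasium (with the CartPole-v1 environment) are available; the network sizes and hyperparameters are illustrative, not tuned values.

```python
# Vanilla policy gradient (REINFORCE with a mean-return baseline) on CartPole-v1.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    # 1. Trajectory: roll out one episode with the current (stochastic) policy
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    # 2. Advantages: discounted returns minus their mean as a simple baseline
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + 0.99 * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    advantages = returns - returns.mean()

    # 3. + 4. Gradient and update: maximize log-probability weighted by advantage
    loss = -(torch.stack(log_probs) * advantages).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```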
Please note that Value-Based and Policy-Based are not mutually exclusive. There are algorithms that are Value-Policy hybrids, such as Advantage Actor Critic (A2C).
Actor and Critic
Actor-Critic
Another important policy-based algo is called Deep Deterministic Policy Gradient [DDPG]. It introduces two networks:
Actor: takes the state as input and outputs an action
Critic: takes the state and action as input and outputs a Q-value, i.e. the quality of the given state-action combination (both networks are sketched below)
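Sketched in PyTorch, the two networks might look like the following; the layer sizes and the tanh action bound are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a (deterministic) continuous action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # action bounded to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```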
DDPG adopts both the Actor and Critic networks to improve the quality of the actions. The steps are as follows:
Initialize: initialize the Actor network and the Critic network, plus target copies of both (the Target networks), with random parameters
Buffer: store a series of experiences (state, action, reward, next state) from the Actor network interacting with the environment in a replay buffer
Evaluate: sample some experiences from the replay buffer and calculate the gradients based on the Critic network
Backpropagate: use the sampled experiences to calculate the gradients and backpropagate. Update the Actor and Critic networks directly, but only soft-update the Target networks
Converge: loop the Buffer, Evaluate and Backpropagate steps until convergence (a compact sketch of one update step follows)
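Here is a compact, illustrative sketch of one DDPG update step. It assumes the replay buffer has already been filled (by the Actor acting with exploration noise) with transitions stored as tensors; the dimensions, learning rates, gamma, and tau are placeholder values.

```python
import copy
import random
from collections import deque
import torch
import torch.nn as nn

state_dim, action_dim, gamma, tau = 3, 1, 0.99, 0.005

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Each entry: (state, action, reward, next_state) as float tensors of
# shapes (state_dim,), (action_dim,), (1,), (state_dim,), filled elsewhere.
replay_buffer = deque(maxlen=100_000)

def update(batch_size=64):
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2 = (torch.stack(x) for x in zip(*batch))

    # Critic: regress Q(s, a) towards the TD target computed with the *target* networks
    with torch.no_grad():
        target_q = r + gamma * critic_target(torch.cat([s2, actor_target(s2)], dim=-1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: follow the gradient that makes the Critic's Q-value as high as possible
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update: the Target networks slowly track the learned networks
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```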
Critic-Only
After the more complicated Actor-Critic networks, the next class is Critic-Only, which means there is no Actor, only a Critic producing critiques. The representative work here is the Deep Q-Network [DQN]. At a rough look, it feels much like a DDPG without the Actor network: the Q network picks actions, while a Target network provides stable evaluation targets:
Initialize: initialize the Q network and copy its weights into the Target network
Buffer: the Q network picks the best actions (with some exploration), and the resulting experiences are stored in the replay buffer
Evaluate: use the Target network to calculate target Q-values and the Q network to predict Q-values from the sampled experiences
Backpropagate: the difference between the target and the predicted Q-value is backpropagated to the Q network at every step, while the Target network is only updated periodically
Converge: loop until both the Q network and the Target network converge
Because the whole point of DQN is to train a Q network that predicts the best Q-value for every state-action pair (effectively a learned Q-table), and only the Q network is kept at the end of training, DQN is a value-based, Critic-only algorithm. A minimal sketch of the update is shown below.
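A minimal sketch of the Evaluate and Backpropagate steps in PyTorch might look as follows; the dimensions, discount factor, and learning rate are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)  # starts with the same weights as the Q network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on a sampled batch of transitions (actions as a LongTensor)."""
    # Q network's prediction for the actions that were actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target network provides a stable bootstrap: r + gamma * max_a' Q_target(s', a')
    with torch.no_grad():
        q_target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Periodically copy the Q network weights into the Target network."""
    target_net.load_state_dict(q_net.state_dict())
```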
Actor-Only
Similar to the Critic-Only network, Actor-Only means that only an Actor network exists. The most famous Actor-Only algo is VPG, which has been described above.
On-Policy & Off-Policy
The most cited On-Policy algo is Proximal Policy Optimization [PPO]. Let's jump into the steps first:
Initialize: an Actor (Policy) network and a Critic network are initialized with random parameters
Evaluation: at every step, the difference between the observed reward and the Critic network's prediction is calculated as the Advantage of the action made by the Policy network. An Objective function (the Clipped Surrogate Objective) is then built from the Advantages, and a gradient is likewise computed for the Critic network
Optimize: use gradient ascent to optimize the policy based on the Objective
Backpropagate: use the gradient to update the Critic network parameters
Converge: repeat until convergence
As you may have observed, unlike DDPG or DQN, there is no Target network in PPO. The main reason is that every RL algo wants a stable learning pace: in DDPG and DQN the Target network and its soft updates provide that stability, whereas PPO achieves it through the clipped objective function. By contrast, DDPG and DQN are Off-Policy algorithms.
To conclude: whether Exploration (data collection) and Exploitation (generating actions from knowledge) come from the same policy is the distinction between On-Policy and Off-Policy. Off-Policy algos collect data with a policy (or network) different from the one being evaluated, which naturally provides randomness for Exploration. Meanwhile, On-Policy algos manage to encourage exploration by clipping the learning steps, including designed randomness, etc. PPO's clipped objective is sketched below.
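For reference, the Clipped Surrogate Objective can be sketched as below. The clipping keeps the new policy close to the old one, which is what replaces the target-network trick for stability; the epsilon of 0.2 is a commonly used default, and the function signature is a simplification.

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate objective (minimizing this loss is
    gradient ascent on the PPO objective itself)."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```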
Conclusion
An example table for a quick overview:
| | Example algorithms |
| --- | --- |
| On-Policy | SARSA (Critic-Only), Policy Gradient (Actor-Only), A2C (Actor-Critic) |
| Off-Policy | Q-Learning (Critic-Only), DDPG (Actor-Critic), SAC (Actor-Critic) |
References
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems (Vol. 37, p. 14). Cambridge, UK: University of Cambridge, Department of Engineering.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... & Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937). PMLR.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8, 279-292.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018, July). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning (pp. 1861-1870). PMLR.
Meng, T. L., & Khushi, M. (2019). Reinforcement learning in financial markets. Data, 4(3), 110.