This paper is a comprehensive study of several model-free policy gradient methods:
- Trust Region Policy Optimization (TRPO)
- Deep Deterministic Policy Gradients (DDPG)
- Proximal Policy Optimization (PPO)
- Actor Critic using Kronecker-Factored Trust Region (ACKTR)
The study shows that network architecture is tightly coupled to algorithm methodology. For example, using a large network in PPO required retuning other hyperparameters, such as the trust region clipping. This suggests a need for hyperparameter-agnostic algorithms.
Large and sparse rewards can lead to saturation and inefficient learning. Reward scaling was shown to have a large effect, but the results were inconsistent across environments and scaling values. Is there a more principled approach?
How do environment properties affect the variability in reported RL algorithm performance? Algorithm performance can vary across environments, and it is not always clear which algorithm performs best across all of them. It is therefore important to present results across multiple environments. It is also important to show the learnt policy in action: an algorithm may only converge to a local optimum rather than reach the desired behaviour.
Small implementation details are often not reflected in publications, but they can have a dramatic effect on performance.
What metrics should an RL algorithm report?
- Average cumulative reward (average returns): misleading, because the range of performance across random seeds and trials is unknown
- Maximum reward achieved over a fixed number of timesteps: inadequate under high variance
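To make the weakness of both metrics concrete, here is a minimal sketch using made-up final returns for two hypothetical algorithms (`algo_a` and `algo_b` are illustrative names, and the numbers are not from the paper). Both have the same average return, yet very different spread across seeds, and the maximum favours the high-variance one.

```python
import statistics

# Hypothetical final returns, one per random seed (illustrative numbers only).
runs = {
    "algo_a": [3000.0, 3100.0, 2900.0, 3050.0, 2950.0],
    "algo_b": [5200.0, 800.0, 4900.0, 1100.0, 3000.0],
}


def summarize(returns):
    """Summary statistics across seeds for one algorithm."""
    return {
        "mean": statistics.fmean(returns),
        "stdev": statistics.stdev(returns),  # sample standard deviation
        "min": min(returns),
        "max": max(returns),
    }


summaries = {name: summarize(returns) for name, returns in runs.items()}

for name, stats in summaries.items():
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

Reporting only the mean would make the two algorithms look identical, and reporting only the maximum would make `algo_b` look better even though it fails on two of five seeds.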
Perhaps one can use bootstrap and significance testing to add confidence intervals. Some techniques:
Obtain a bootstrap estimator by resampling with replacement many times to generate a statistically relevant mean and confidence bound. TRPO and PPO are the most stable with small confidence bounds from the bootstrap.
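The resampling step can be sketched as follows. This is a minimal percentile bootstrap using only the Python standard library; the function name `bootstrap_mean_ci`, the number of resamples, and the sample returns are assumptions for illustration, not the paper's exact procedure.

```python
import random
import statistics


def bootstrap_mean_ci(returns, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean return.

    Returns (sample_mean, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(returns)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(returns, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(returns), means[lower_idx], means[upper_idx]


# Hypothetical final returns from five random seeds (illustrative only).
returns = [3120.0, 2890.0, 3410.0, 2750.0, 3300.0]
mean, lower, upper = bootstrap_mean_ci(returns)
print(f"mean={mean:.1f}, 95% CI=[{lower:.1f}, {upper:.1f}]")
```

A narrow interval indicates that the algorithm's performance is stable across seeds. With only a handful of runs the percentile interval tends to be too narrow, so it is best read as a rough indication.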
If we use our sample and give it some uniform lift (e.g. scaling everything by 1.25), we can run many bootstrap simulations and determine what percentage of simulations result in statistically significant values with the lift. If there is a small percentage of significant values, more trials need to be run.
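A rough sketch of this lift-based power check is shown below. It assumes a Welch-style statistic compared against a normal critical value, which is a simplification chosen to stay within the standard library and is optimistic for small samples; the function names, the 2000 simulations, and the sample returns are all hypothetical.

```python
import math
import random
import statistics


def is_significant(a, b, alpha=0.05):
    """Two-sided Welch-style test using a normal approximation (an assumption)."""
    diff = statistics.fmean(b) - statistics.fmean(a)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0.0:
        # Degenerate resample with no variance: any difference counts.
        return diff != 0.0
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return abs(diff / se) > z_crit


def bootstrap_power(returns, lift=1.25, n_simulations=2000, alpha=0.05, seed=0):
    """Fraction of bootstrap simulations in which the lifted sample is
    significantly different from the original sample."""
    rng = random.Random(seed)
    n = len(returns)
    lifted = [lift * r for r in returns]  # uniform lift applied to every run
    hits = 0
    for _ in range(n_simulations):
        a = rng.choices(returns, k=n)  # resample the original runs
        b = rng.choices(lifted, k=n)   # resample the lifted runs
        hits += is_significant(a, b, alpha)
    return hits / n_simulations


# Hypothetical final returns from five random seeds (illustrative only).
returns = [3120.0, 2890.0, 3410.0, 2750.0, 3300.0]
power = bootstrap_power(returns)
print(f"fraction of significant simulations: {power:.2f}")
```

If the printed fraction is low, the sample is too small or too noisy to detect a lift of that size, and more trials are needed.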
In supervised learning, the k-fold t-test and the corrected resampling t-test are among the significance tests used, but these make assumptions about the underlying data that do not necessarily hold in RL.
- Find the working hyperparameter set that matches the original reported performance
- New baseline algorithm implementations should match original codebase results if possible
- Average multiple runs over different random seeds to get better insight into algorithm performance
- Report all hyperparameters, implementation details, experimental setup, and evaluation methods