Jethro's Braindump

Transfer Learning

Reinforcement Learning ⭐

Prior understanding of problem structure can help us solve complex tasks quickly. Perhaps solving prior tasks would help acquire useful knowledge for solving a new task.

Transfer learning is the use of experience from one set of tasks for faster learning and better performance on a new task.

Few-Shot Learning

“shot” refers to the number of attempts in the target domain. For example, in 0-shot learning, a policy trained in the source domain works in the target domain.

This typically requires assumptions of similarity between source and target domain.

Taxonomy of Transfer Learning

“forward” transfer
train on one task, transfer to a new task - Randomizing source domain - Fine-tuning
multi-task transfer
train on many tasks, transfer to a new task - generate highly randomized source domains - model-based RL - model distillation - contextual policies - modular policy networks
multi-task meta-learning
learn to learn from many tasks - RNN-based/Gradient-based meta-learning

Forward Transfer


Key Idea: Train on the source task, then train some more on the target task, for example, by retraining the weights on the last layer. Lower layers are likely to learn representations from the source task that are useful in the target task. This works well if the source task is broad and diverse.

Fine tuning is popular in the supervised learning setting.

  • Why fine-tuning doesn’t work well for RL

    1. RL tasks tend to be narrow (not broad and diverse), and features are less general
    2. RL methods tend to learn deterministic policies, the policies that are optimal in the fully-observed MDP.
      1. Low-entropy policies adapt very slowly to new settings
      2. Little exploration at convergence

    To increase diversity and entropy, we can do maximum-entropy learning which acts as randomly as possible while collecting high rewards (Haarnoja et al. 2017):

    \begin{equation} \pi(a|s) = \mathrm{exp} (Q_\phi(s,a)-V(s)) \end{equation}

    This optimizes

    \begin{equation} \sum_t E_{\pi(s_t, a_t)}[r(s_t, a_t)] + E_{\pi(s_t)}[\mathcal{H}(\pi(a_t|s_t))] \end{equation}

Manipulating the Source Domain

This is used where we can design the source domain (e.g. training in a simulator, which can be tweaked). Injecting randomness/diversity in the source tends to be helpful.


Multi-task Transfer

Some things remain constant across tasks, such as the laws of physics. Model-based RL may learn these laws, and transfer this knowledge across multiple tasks.

In model distillation, multiple policies are combined into one, for concurrent multi-task learning. Policy distillation (Rusu et al. 2015) is able to:

  1. Compress policies learnt on single games into smaller models
  2. Build agents capable of playing multiple games
  3. Improve the stability of the DQN learning algorithm by distilling online the policy of the best performing agent


Haarnoja, Tuomas, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. “Reinforcement Learning with Deep Energy-Based Policies.” In Proceedings of the 34th International Conference on Machine Learning, edited by Doina Precup and Yee Whye Teh, 70:1352–61. Proceedings of Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR.

Rusu, Andrei A., Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. 2015. “Policy Distillation.” CoRR.