A Distributional Code for Value in Dopamine-based Reinforcement Learning

paper: https://www.nature.com/articles/s41586-019-1924-6
related: Reinforcement Learning ⭐

Abstract

In standard temporal-difference learning, the reward prediction error is a single scalar quantity. In distributional RL, it instead consists of a diverse set of channels that together predict multiple possible future outcomes. Each channel applies a different relative scaling to positive and negative reward prediction errors. This imbalance drives each channel to learn a different value prediction, so that collectively the channels represent a distribution over possible rewards.
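
To make the asymmetric-scaling idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a one-step (bandit-style) setting with no bootstrapping, and the names `learn_channel_values`, `taus`, and `base_lr` are hypothetical. Each channel has an asymmetry `tau`: positive errors are scaled by `tau` and negative errors by `1 - tau`, so channels with high `tau` become optimistic and channels with low `tau` become pessimistic.

```python
import random


def learn_channel_values(rewards, taus, base_lr=0.02):
    """Learn one value estimate per channel from a shared reward stream.

    Assumption (illustrative only): a one-step setting, so the prediction
    error for each channel is simply reward minus that channel's value.
    """
    values = [0.0] * len(taus)
    for r in rewards:
        for i, tau in enumerate(taus):
            delta = r - values[i]  # this channel's reward prediction error
            # Asymmetric scaling: positive errors weighted by tau,
            # negative errors weighted by (1 - tau).
            scale = tau if delta > 0 else (1.0 - tau)
            values[i] += base_lr * scale * delta
    return values


# Hypothetical bimodal reward: 1.0 or 10.0 with equal probability.
rng = random.Random(0)
rewards = [rng.choice([1.0, 10.0]) for _ in range(20000)]

taus = [0.1, 0.3, 0.5, 0.7, 0.9]  # pessimistic -> optimistic channels
values = learn_channel_values(rewards, taus)

for tau, v in zip(taus, values):
    print(f"tau={tau:.1f}  learned value={v:.2f}")
```

Under this update rule, the symmetric channel (`tau = 0.5`) settles near the mean reward, while the other channels spread out above and below it. Read together, the set of learned values carries information about the shape of the reward distribution rather than only its mean.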

Resources

Distributional RL – Simple Machine Learning