A Distributional Code for Value in Dopamine-based Reinforcement Learning
Reward prediction errors are typically represented by a single scalar quantity, as in temporal-difference learning. In distributional reinforcement learning (RL), the reward prediction error instead consists of a diverse set of channels that predict multiple future outcomes. Different channels scale positive and negative reward prediction errors differently. The imbalance between these scalings causes each channel to learn a different value prediction, and together the channels represent a distribution over possible rewards.
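The effect of asymmetric scaling can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the model described here: the function name `learn_channel_values`, the parameter `tau` (the share of a channel's learning rate applied to positive errors), the base learning rate, and the toy task in which each trial pays a reward of 1 or 10 with equal probability.

```python
import random


def learn_channel_values(taus, num_steps=20000, base_lr=0.01, seed=0):
    """Train one value estimate per channel with asymmetric error scaling.

    Hypothetical setup: each trial pays a reward of 1.0 or 10.0 with equal
    probability. Channel i scales positive prediction errors by
    base_lr * taus[i] and negative ones by base_lr * (1 - taus[i]).
    """
    rng = random.Random(seed)
    values = [0.0] * len(taus)
    for _ in range(num_steps):
        reward = 10.0 if rng.random() < 0.5 else 1.0
        for i, tau in enumerate(taus):
            # Each channel computes its own reward prediction error.
            delta = reward - values[i]
            # The scaling depends on the sign of the error.
            lr = base_lr * tau if delta > 0 else base_lr * (1.0 - tau)
            values[i] += lr * delta
    return values


taus = [0.1, 0.5, 0.9]
values = learn_channel_values(taus)
for tau, value in zip(taus, values):
    print(f"tau={tau:.1f}  learned value={value:.2f}")
```

In this toy task, a channel's expected update is zero when `tau * 0.5 * (10 - V) = (1 - tau) * 0.5 * (V - 1)`, that is, at `V = 1 + 9 * tau`. A channel that weights negative errors more heavily (small `tau`) therefore settles near the low reward, one that weights positive errors more heavily (large `tau`) settles near the high reward, and the balanced channel settles near the mean of 5.5. Read together, the learned values outline the spread of the reward distribution rather than only its mean.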