Is Bang-Bang Control All You Need?

Reinforcement learning (RL) methods for continuous control typically employ distributions whose support covers the entire action space. While it is colloquially known that trained agents often prefer actions at the boundaries of that space, we lack a deeper understanding of the underlying phenomenon. In this work, we draw theoretical connections to the emergence of bang-bang behaviour in optimal control, and provide extensive empirical evaluation across a variety of recent RL algorithms. We replace the standard Gaussian by a Bernoulli distribution that solely considers the extremes along each action dimension, which yields a bang-bang controller. Surprisingly, we find that this achieves state-of-the-art performance on several continuous control benchmarks, in stark contrast to real robotic hardware, where energy and maintenance costs affect controller choices. Since exploration, learning, and the final solution are entangled in RL, we provide additional experiments built on imitation learning to reduce the impact of exploration on our analysis. Finally, we show that our observations generalize to environments that aim to model real-world challenges, and we evaluate which factors mitigate the emergence of bang-bang solutions. Our findings indicate the need for refining current continuous control benchmarks and for re-evaluating the metrics used to judge the performance of RL algorithms, particularly in light of potential transfer to real-world robotic systems.
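To make the Bernoulli parameterization concrete, the following is a minimal sketch of how a bang-bang policy head might sample actions. It is not the authors' implementation: the function names `log_sigmoid` and `sample_bang_bang`, the per-dimension `logits` input, and the `low`/`high` bounds are all hypothetical, and the sketch assumes independent Bernoulli choices per action dimension.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def sample_bang_bang(logits, low, high, rng):
    """Sample each action dimension from {low, high} (hypothetical helper).

    logits: one real number per action dimension; sigmoid(logit) is taken
            to be the probability of choosing the upper bound.
    low, high: per-dimension action bounds.
    rng: a random.Random instance.
    Returns the sampled action and its log-probability under the policy.
    """
    action, log_prob = [], 0.0
    for logit, lo, hi in zip(logits, low, high):
        p_high = math.exp(log_sigmoid(logit))
        if rng.random() < p_high:
            action.append(hi)
            log_prob += log_sigmoid(logit)    # log P(upper bound)
        else:
            action.append(lo)
            log_prob += log_sigmoid(-logit)   # log P(lower bound) = log(1 - p_high)
    return action, log_prob
```

In an actual agent the logits would presumably come from a state-conditioned network, and the log-probability would feed into whichever policy-gradient or actor-critic objective is in use. Unlike a Gaussian head, such a policy can never emit an action in the interior of the action space.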

Authors' notes