First do no harm: counterfactual objective functions for safe & ethical AI

Safe and ethical AI must be able to reason about harm and avoid harmful actions. In this article we derive the first statistical definition of harm that can be incorporated into algorithmic decision making. We argue that harm is fundamentally a counterfactual quantity, and show that standard machine learning algorithms are guaranteed to pursue harmful policies in certain environments. To resolve this, we derive a family of counterfactual objective functions that robustly mitigate harm, and demonstrate our approach with an interpretable model for identifying optimal drug doses. While standard algorithms that maximize causal treatment effects result in needlessly harmful doses, our counterfactual algorithm identifies doses that are far less harmful without sacrificing efficacy. Our results show that causal modelling and counterfactual reasoning are key ingredients for developing safe and ethical AI.
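The contrast between the two kinds of objective can be sketched in a toy dosing model. This is an illustrative assumption on our part, not the paper's actual model: patient types, functional forms, and the penalty weight `lam` are all made up for the example. The key idea it demonstrates is that counterfactual harm compares the outcome under a dose with the outcome the *same* patient would have had under the default action, whereas a standard objective only looks at the expected treatment effect.

```python
# Toy structural causal model for dosing (illustrative only; all
# quantities and functional forms are hypothetical, not from the paper).

P_SENSITIVE = 0.1          # fraction of patients harmed by high doses
DOSES = [0.0, 1.0, 2.0]    # candidate doses; 0.0 is the default (no treatment)

def outcome(dose, sensitive):
    """Structural equation Y(dose, u): benefit grows with dose,
    but sensitive patients deteriorate above dose 1."""
    if sensitive:
        return dose - 3.0 * max(0.0, dose - 1.0)
    return dose

def expected_outcome(dose):
    """Standard objective: expected outcome across the population."""
    return ((1 - P_SENSITIVE) * outcome(dose, False)
            + P_SENSITIVE * outcome(dose, True))

def counterfactual_harm(dose, default=0.0):
    """E[max(0, Y(default, u) - Y(dose, u))]: expected shortfall relative
    to what the same patient would have experienced under the default."""
    harm = 0.0
    for prob, sensitive in [(1 - P_SENSITIVE, False), (P_SENSITIVE, True)]:
        harm += prob * max(0.0, outcome(default, sensitive)
                                - outcome(dose, sensitive))
    return harm

def counterfactual_objective(dose, lam=10.0):
    """Harm-penalized objective: expected benefit minus lam * expected harm."""
    return expected_outcome(dose) - lam * counterfactual_harm(dose)

standard_dose = max(DOSES, key=expected_outcome)
safe_dose = max(DOSES, key=counterfactual_objective)
print(standard_dose, safe_dose)  # the harm-blind objective picks the higher dose
```

Here the harm-blind objective selects dose 2.0 (expected outcome 1.7), even though it leaves sensitive patients worse off than no treatment at all, while the counterfactual objective selects dose 1.0, which is nearly as effective on average but harms no one relative to the default.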

Authors' notes