Reconstructing Training Data with Informed Adversaries: Attacks and Mitigations

Given access to a machine learning model, can an adversary reconstruct the model's training data? This work studies this question through the lens of an informed adversary who knows all the training data points except one. By instantiating concrete attacks, we show that, under such a threat model, it is often feasible to reconstruct the remaining data point given white-box access to the model parameters. For convex models (e.g.\ logistic regression), reconstruction attacks are simple and can be derived in closed form. For more general models (e.g.\ neural networks), we propose a novel attack strategy based on training a reconstructor network that receives as input the weights of the model under attack and produces as output the target data point. We demonstrate the effectiveness of this attack on image classification models trained on MNIST and CIFAR-10, and systematically investigate which factors of standard machine learning pipelines affect reconstruction success. Finally, we investigate, both empirically and theoretically, what level of differential privacy suffices to mitigate reconstruction attacks by informed adversaries; since the informed adversary is the strongest in this setting, this result also implies resilience against weaker but more realistic adversaries. Our work presents the first effective reconstruction attack against generic machine learning models, beyond the specialized settings considered in previous works (e.g.\ generative language models or access to training gradients). It shows that standard models have the capacity to store enough information to enable high-fidelity reconstruction of training data points, and it demonstrates that differential privacy can successfully mitigate such attacks in a parameter regime where utility degradation is minimal.
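To give some intuition for the convex case, consider L2-regularized logistic regression: at the optimum the training gradient vanishes, so an informed adversary can subtract the gradient contributions of the known points and isolate the target point's contribution, which is a scalar multiple of the target itself. The sketch below (synthetic data, hypothetical hyperparameters; not the paper's exact derivation) recovers the target's direction this way:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative setup: n points in d dimensions; the informed adversary
# knows all of them except the last one, (x_z, y_z).
n, d, lam = 20, 5, 0.1
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
x_z, y_z = X[-1], y[-1]

# Train L2-regularized logistic regression (no bias term) to convergence
# by gradient descent; this is the model released to the adversary.
w = np.zeros(d)
for _ in range(50000):
    w -= 0.05 * ((sigmoid(X @ w) - y) @ X + lam * w)

# Stationarity at the optimum: lam * w = -sum_i (sigmoid(w @ x_i) - y_i) x_i.
# Subtracting the known points' terms isolates the target's contribution:
#   r = (sigmoid(w @ x_z) - y_z) * x_z
r = -lam * w - (sigmoid(X[:-1] @ w) - y[:-1]) @ X[:-1]

# r is a scalar multiple of x_z, so the direction is recovered exactly;
# the remaining scale can be obtained by solving a one-dimensional equation
# in the unknown scalar sigmoid(w @ x_z) - y_z.
cos = abs(r @ x_z) / (np.linalg.norm(r) * np.linalg.norm(x_z))
print(cos)
```

In this toy run the residual vector aligns with the held-out point up to numerical precision, illustrating why releasing exact parameters of a convex model leaks the last training point to an informed adversary.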

Authors' notes