Paper
Publication
Red Teaming Language Models with Language Models
Abstract

Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM. We evaluate the target LM’s replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot’s own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.

Authors' notes

In our recent paper, we show that it is possible to automatically find inputs that elicit harmful text from language models by generating inputs using language models themselves. Our approach provides one tool for finding harmful model behaviours before users are impacted, though we emphasize that it should be viewed as one component alongside many other techniques that will be needed to find harms and mitigate them once found.

Large generative language models like GPT-3 and Gopher have a remarkable ability to generate high-quality text, but they are difficult to deploy in the real world. Generative language models come with a risk of generating very harmful text, and even a small risk of harm is unacceptable in real-world applications.

For example, in 2016, Microsoft released the Tay Twitter bot to automatically tweet in response to users. Within 16 hours, Microsoft took Tay down after several adversarial users elicited racist and sexually-charged tweets from Tay, which were sent to over 50,000 followers. The outcome was not for lack of care on Microsoft’s part:

Although we had prepared for many types of abuses of the system, we had made a critical oversight for this specific attack.”

Peter Lee
VP, Microsoft

The issue is that there are so many possible inputs that can cause a model to generate harmful text. As a result, it’s hard to find all of the cases where a model fails before it is deployed in the real world. Previous work relies on paid, human annotators to manually discover failure cases (Xu et al. 2021, inter alia). This approach is effective but expensive, limiting the number and diversity of failure cases found.

We aim to complement manual testing and reduce the number of critical oversights by finding failure cases (or ‘red teaming’) in an automatic way. To do so, we generate test cases using a language model itself and use a classifier to detect various harmful behaviors on test cases, as shown below:

Our approach uncovers a variety of harmful model behaviors:

  1. Offensive Language: Hate speech, profanity, sexual content, discrimination, etc.
  2. Data Leakage: Generating copyrighted or private, personally-identifiable information from the training corpus.
  3. Contact Information Generation: Directing users to unnecessarily email or call real people.
  4. Distributional Bias: Talking about some groups of people in an unfairly different way than other groups, on average over a large number of outputs.
  5. Conversational Harms: Offensive language that occurs in the context of a long dialogue, for example.

To generate test cases with language models, we explore a variety of methods, ranging from prompt-based generation and few-shot learning to supervised finetuning and reinforcement learning. Some methods generate more diverse test cases, while other methods generate more difficult test cases for the target model. Together, the methods we propose are useful for obtaining high test coverage while also modeling adversarial cases.

Once we find failure cases, it becomes easier to fix harmful model behavior by:

  1. Blacklisting certain phrases that frequently occur in harmful outputs, preventing the model from generating outputs that contain high-risk phrases.
  2. Finding offensive training data quoted by the model, to remove that data when training future iterations of the model.
  3. Augmenting the model’s prompt (conditioning text) with an example of the desired behavior for a certain kind of input, as shown in our recent work.
  4. Training the model to minimize the likelihood of its original, harmful output for a given test input.

Overall, language models are a highly effective tool for uncovering when language models behave in a variety of undesirable ways. In our current work, we focused on red teaming harms that today’s language models commit. In the future, our approach can also be used to preemptively discover other, hypothesized harms from advanced machine learning systems, e.g., due to inner misalignment or failures in objective robustness. This approach is just one component of responsible language model development: we view red teaming as one tool to be used alongside many others, both to find harms in language models and to mitigate them. We refer to Section 7.3 of Rae et al. 2021 for a broader discussion of other work needed for language model safety. For more details on our approach and results, as well as the broader consequences of our findings, read our red teaming paper here.