Building more secure dialogue agents

Training an AI to communicate in a way that is helpful, correct, and harmless

In recent years, large language models (LLMs) have achieved success on a range of tasks such as question answering, summarization, and dialogue. Dialogue is a particularly interesting task because it involves flexible, interactive communication. However, dialogue agents powered by LLMs can express inaccurate or invented information, use discriminatory language, or encourage unsafe behavior.

To create safer dialogue agents, we need to be able to learn from human feedback. Applying reinforcement learning based on input from research participants, we are exploring new ways to train dialogue agents that hold promise for a safer system.

In our latest paper, we present Sparrow, a dialogue agent that is useful and reduces the risk of unsafe and inappropriate answers. Our agent is designed to talk with a user, answer questions, and search the internet using Google when it is helpful to look up evidence to inform its responses.

Our new conversational AI model responds on its own to an initial human prompt.
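As a rough illustration of this evidence-seeking behavior, the sketch below shows one way a single dialogue turn could decide whether to search, retrieve a snippet, and condition its reply on the result. The helper functions are placeholders standing in for a real language model and search API; this is not Sparrow's actual implementation.

```python
# A minimal sketch of an evidence-seeking dialogue turn (illustration only,
# not Sparrow's actual implementation). The helpers below are placeholders
# for a real dialogue language model and a web search API.

def lm_generate(prompt: str) -> str:
    """Placeholder: sample a continuation from a dialogue language model."""
    return "..."

def lm_wants_evidence(prompt: str) -> bool:
    """Placeholder: the model's decision about whether to consult the internet."""
    return True

def web_search(query: str) -> str:
    """Placeholder: return a short evidence snippet for a search query."""
    return "..."

def respond(history: str, user_message: str) -> str:
    context = f"{history}\nUser: {user_message}"
    if lm_wants_evidence(context):
        # Propose a search query, retrieve evidence, and append it to the context.
        query = lm_generate(f"{context}\nSearch Query:")
        snippet = web_search(query)
        context += f"\nSearch Query: {query}\nSearch Result: {snippet}"
    # The final reply is conditioned on the retrieved evidence, if any.
    return lm_generate(f"{context}\nSparrow:")
```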

Sparrow is a research model and proof of concept, designed with the goal of training dialogue agents to be more helpful, correct, and harmless. By learning these qualities in a general dialogue setting, Sparrow advances our understanding of how we can train agents to be safer and more useful, and ultimately, to help build safer and more useful artificial general intelligence (AGI).

Sparrow declines to answer a potentially harmful question.

How does Sparrow work?

Training conversational AI is an especially tricky problem because it is difficult to pinpoint what makes a dialogue successful. To address this problem, we turn to a form of reinforcement learning (RL) based on human feedback, using study participants' preference feedback to train a model of how useful an answer is.

To collect this data, we show our participants multiple model answers to the same question and ask them which answer they like the most. Because we show answers with and without evidence retrieved from the internet, this model can also determine when an answer should be supported with evidence.
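Pairwise preference data like this is commonly used to train a reward model with a logistic (Bradley-Terry style) loss, so that the preferred answer scores higher than the rejected one. The sketch below illustrates that general idea under our own simplifying assumptions: a linear scoring head over precomputed answer embeddings and random toy data. It is not Sparrow's actual training code.

```python
# A minimal sketch of preference-based reward modeling (illustration only).
# Each human comparison says which of two candidate answers was preferred;
# the reward model is trained so the preferred answer receives a higher score.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # In practice the scoring head sits on top of a pretrained LLM; here a
        # single linear layer over precomputed answer embeddings stands in.
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.head(embeddings).squeeze(-1)  # one scalar score per answer

def preference_loss(model, preferred_emb, rejected_emb):
    """-log sigmoid(score_preferred - score_rejected), averaged over the batch."""
    margin = model(preferred_emb) - model(rejected_emb)
    return -nn.functional.logsigmoid(margin).mean()

# Toy training step on random embeddings, for illustration only.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
preferred, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model, preferred, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```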

We ask study participants to rate and interact with Sparrow, either naturally or adversarially, continually expanding the dataset used to train Sparrow.

But increasing usefulness is only part of the story. To make sure the model behaves safely, we must constrain its behavior. So we define an initial, simple set of rules for the model, such as "Don't make threatening statements" and "Don't make hateful or insulting comments".

We also provide rules around potentially harmful advice and not claiming to be a person. These rules were informed by studying existing work on language harms and consulting with experts. We then ask study participants to talk to our system, with the aim of tricking it into breaking the rules. These conversations allow us to train a separate "rule model" that indicates when Sparrow's behavior breaks any of the rules.
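One simple way to use such a rule model alongside the preference model during reinforcement learning, sketched below purely as an illustration (the actual reward shaping in our paper may differ), is to subtract a penalty from the usefulness score whenever the rule model estimates a high probability of a violation. The `penalty` and `threshold` values here are hypothetical.

```python
# Illustration only: combining a usefulness (preference) score with a rule
# model's estimated probability that a response breaks any rule.
import torch

def combined_reward(pref_score: torch.Tensor,
                    violation_prob: torch.Tensor,
                    penalty: float = 5.0,
                    threshold: float = 0.5) -> torch.Tensor:
    """RL reward: usefulness score minus a penalty for likely rule violations."""
    breaks_rule = (violation_prob > threshold).float()
    return pref_score - penalty * breaks_rule

# Two candidate responses: similar usefulness, but the second likely breaks a rule.
pref = torch.tensor([1.8, 2.3])
viol = torch.tensor([0.05, 0.90])
print(combined_reward(pref, viol))  # the second response is heavily penalized
```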

Towards better AI and better judgment

Verifying Sparrow's answers for correctness is difficult even for experts. Instead, we ask our participants to judge whether Sparrow's answers are plausible and whether the evidence Sparrow provides actually supports the answer. According to our participants, Sparrow provides a plausible answer supported by evidence 78% of the time when asked a factual question. This is a big improvement over our baseline models. Still, Sparrow isn't immune to making mistakes, like hallucinating facts and giving answers that are off-topic at times.

Sparrow also has room for better rule-following. After training, participants were still able to trick it into breaking our rules 8% of the time, but compared to simpler approaches, Sparrow is better at following our rules under adversarial probing. For instance, our original dialogue model broke the rules roughly three times more often than Sparrow when participants tried to trick it into doing so.

Sparrow answers a question and a follow-up question using evidence, then follows the "Do not pretend to have a human identity" rule when asked a personal question (sample from September 9, 2022).

Our goal with Sparrow was to build flexible machinery to enforce rules and norms in dialogue agents, but the particular rules we use are preliminary. Developing a better and more complete set of rules will require both expert input on many topics (including policy makers, sociologists, and ethicists) and participatory input from a diverse range of affected users and groups. We believe our methods will still apply to a more rigorous rule set.

Sparrow is a significant step forward in understanding how to train dialogue agents to be more useful and safer. However, successful communication between people and dialogue agents should not only avoid harm but actually align with human values for effective and beneficial communication, as discussed in recent work on aligning language models with human values.

We also emphasize that a good agent will still decline to answer questions in contexts where it is appropriate to defer to humans, or where declining has the potential to deter harmful behavior. Finally, our initial research focused on an English-speaking agent, and further work is needed to ensure similar results across other languages and cultural contexts.

In the future, we hope that conversations between humans and machines will lead to better judgments of AI behavior, allowing people to align and improve systems that may be too complex to understand without the help of a machine.

Keen to explore a conversational path to safe AI? We are currently recruiting Research Scientists for a Scalable Alignment team.
