Challenges in detoxifying language models

Unwanted behavior of language models

Language models trained on large text corpora can generate fluent text, and show promise as few-shot learners and as code generation tools, among other capabilities. However, prior research has also identified several issues with LM use that need to be addressed, including distributional biases, social stereotypes, the potential to reveal training data, and other possible LM harms. One particular type of LM harm is the generation of toxic language, which includes hate speech, insults, profanities and threats.

In our paper, we focus on LMs and their propensity to generate toxic language. We study the effectiveness of different methods for mitigating LM toxicity and their side effects, and we investigate the reliability and limits of classifier-based automatic toxicity evaluation.

Following the definition of toxicity developed by the Perspective API, we here consider an utterance to be toxic if it is rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion. However, we note two important caveats. First, toxicity judgments are subjective: they depend both on the raters assessing toxicity and their cultural background, and on the inferred context. While not the focus of this work, it is important that future work continues to develop this definition and clarifies how it can be applied fairly in different contexts. Second, we note that toxicity covers only one aspect of possible LM harms, excluding, for example, harms arising from distributional model bias.

Toxicity measurement and mitigation

To enable safer language model use, we set out to measure, understand the origins of, and mitigate toxic text generation in LMs. Prior work has considered various approaches for reducing LM toxicity, either by fine-tuning pre-trained LMs, by steering model generations, or through direct test-time filtering. Prior work has also introduced automatic metrics for measuring LM toxicity, both when the model is prompted with different kinds of prompts and in unconditional generation. These metrics rely on toxicity scores from the widely used Perspective API model, which is trained on online comments annotated for toxicity.
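
As an illustration of how such automatic metrics can be computed, the sketch below scores sampled continuations with the Perspective API over HTTP and aggregates them into a simple probability-of-toxicity-style statistic for one prompt. This is a minimal sketch rather than our evaluation code; the API key placeholder, the 0.5 threshold, and the helper names `toxicity_score` and `probability_of_toxicity` are illustrative assumptions.

```python
# Minimal sketch: score continuations with the Perspective API and aggregate
# into a probability-of-toxicity-style statistic. Not the paper's code.
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
API_KEY = "YOUR_API_KEY"  # placeholder; request a key for the Perspective API


def toxicity_score(text: str) -> float:
    """Return the Perspective API TOXICITY summary score for a single text."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "doNotStore": True,
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": API_KEY}, json=body)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def probability_of_toxicity(continuations: list[str], threshold: float = 0.5) -> float:
    """1.0 if any sampled continuation for a prompt scores >= threshold, else 0.0.

    Calling this per prompt and averaging over prompts gives an aggregate
    probability-of-toxicity-style metric.
    """
    scores = [toxicity_score(c) for c in continuations]
    return float(any(s >= threshold for s in scores))
```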

In our study, we first show that a combination of relatively simple baselines leads to a drastic reduction, as measured by previously introduced LM toxicity metrics. Concretely, we find that a combination of 1) filtering the LM training data annotated as toxic by the Perspective API, 2) filtering generated text for toxicity based on a separate, fine-tuned BERT classifier trained to detect toxicity, and 3) steering the generation towards being less toxic, is highly effective at reducing LM toxicity, as measured by automatic toxicity metrics. When prompted with toxic (or non-toxic) prompts from the RealToxicityPrompts dataset, we see a 6-fold (or 17-fold) reduction in the aggregated probability-of-toxicity metric compared with the previously reported state of the art. We reach a value of zero in the unprompted text generation setting, which suggests that we have exhausted this metric. Given how much toxicity levels have decreased in absolute terms, as measured by automatic metrics, the question arises to what extent this is also reflected in human judgment, and whether improvements on these metrics are still meaningful, especially as they are derived from an imperfect automatic classification system. To gather further insights, we turn to evaluation by humans.
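
The test-time filtering step (point 2 above) can be pictured as rejection sampling: draw several continuations from the LM, score each with a separate toxicity classifier, and keep the least toxic one. The sketch below illustrates this idea under stated assumptions; the checkpoints `gpt2` and `unitary/toxic-bert` are public stand-ins, not the models used in our work.

```python
# Hedged sketch of classifier-based test-time filtering via rejection sampling.
# The checkpoints below are public stand-ins, not the models used in the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")                       # stand-in LM
tox_classifier = pipeline("text-classification", model="unitary/toxic-bert")


def toxicity_prob(text: str) -> float:
    # Simplification: take the score of the label named "toxic"; label names and
    # calibration depend on the checkpoint. (Older transformers versions use
    # return_all_scores=True instead of top_k=None.)
    results = tox_classifier(text, top_k=None)
    return next((r["score"] for r in results if r["label"].lower() == "toxic"), 0.0)


def generate_filtered(prompt: str, num_samples: int = 8, max_new_tokens: int = 40) -> str:
    """Rejection-sample continuations and return the one the classifier rates least toxic."""
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_samples,
        do_sample=True,
    )
    continuations = [o["generated_text"][len(prompt):] for o in outputs]
    scores = [toxicity_prob(c) for c in continuations]
    return continuations[scores.index(min(scores))]
```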

Evaluation by humans

We conduct a human evaluation study in which raters annotate LM-generated text for toxicity. The results of this study indicate that there is a direct and largely monotonic relation between average human-based and classifier-based results, and that LM toxicity decreases according to human judgment as well.

We find inter-annotator agreement comparable to other studies measuring toxicity, and that annotating toxicity has aspects that are subjective and ambiguous. For example, we find that ambiguity frequently arises as a result of sarcasm, news-style text about violent behavior, and quoting toxic text (either neutrally or in order to disagree with it).

In addition, we find that automatic evaluation of LM toxicity becomes less reliable once detoxification measures have been applied. While the two initially correlate very well, for samples with a high (automatic) toxicity score the link between human ratings and Perspective API scores disappears once LM toxicity-reduction interventions are applied and increased in strength.
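
One simple way to probe this effect is to look at the rank correlation between classifier scores and average human ratings within buckets of the automatic score, rather than over the whole sample. The sketch below is an illustrative analysis helper, not our study's code; the bucket edges and the parallel arrays `api_scores` and `human_ratings` are assumptions.

```python
# Illustrative helper: Spearman correlation between automatic toxicity scores and
# mean human ratings, computed separately within buckets of the automatic score.
import numpy as np
from scipy.stats import spearmanr


def bucketed_correlation(api_scores, human_ratings, edges=(0.0, 0.25, 0.5, 0.75, 1.01)):
    """Print per-bucket Spearman correlation; inputs are parallel per-sample arrays."""
    api_scores = np.asarray(api_scores)
    human_ratings = np.asarray(human_ratings)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (api_scores >= lo) & (api_scores < hi)
        if mask.sum() < 2:
            continue  # not enough samples in this bucket for a correlation
        rho, p = spearmanr(api_scores[mask], human_ratings[mask])
        print(f"bucket [{lo:.2f}, {hi:.2f}): n={mask.sum()}, rho={rho:.2f} (p={p:.3f})")
```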

Further manual inspection also reveals that the false-positive texts mention some identity terms at disproportionate frequencies. For example, for one detoxified model, we observe that within the high automatic-toxicity bucket, 30.2% of the texts mention the word "gay", reflecting previously observed biases in automatic toxicity classifiers (which the community is already working to improve). Together, these findings indicate that when judging LM toxicity, a reliance on automatic metrics alone can lead to potentially misleading interpretations.
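
This kind of manual inspection can be supported by a simple count of how often selected identity terms appear among samples the classifier still scores as highly toxic. The sketch below is illustrative only; the term list, the 0.75 threshold, and the plain substring matching are simplifying assumptions (a real analysis would use proper tokenization and a curated term lexicon).

```python
# Illustrative sketch: fraction of high-automatic-toxicity samples mentioning
# each identity term. Term list and threshold are assumptions, not the paper's.
from collections import Counter

IDENTITY_TERMS = ["gay", "muslim", "jewish", "black", "women"]  # illustrative subset


def identity_term_rates(samples, scores, threshold=0.75):
    """Return the fraction of high-scoring samples that mention each identity term."""
    high = [s.lower() for s, sc in zip(samples, scores) if sc >= threshold]
    counts = Counter(term for s in high for term in IDENTITY_TERMS if term in s)
    return {term: counts[term] / max(len(high), 1) for term in IDENTITY_TERMS}
```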

Unintended consequences of detoxification

We further examine potential unintended consequences of LM toxicity-reduction interventions. For detoxified language models, we see a marked increase in language modeling loss, and this increase correlates with the strength of the detoxification intervention. However, the increase is larger on documents that have higher automatic toxicity scores than on documents with lower toxicity scores. At the same time, in our human evaluations we found no notable differences in grammar, comprehension, or in how well the style of the preceding conditioning text is preserved.

Another consequence of detoxification is that it can disproportionately reduce the ability of the LM to model texts related to certain identity groups (i.e. topic coverage), as well as text written by people from different identity groups and with different dialects (i.e. dialect coverage). We find that there is a larger increase in the language modeling loss for text in African-American English (AAE) when compared with text in White-Aligned English.

We see similar disparities in LM loss deterioration for text about female actors when compared with text about male actors. For text about certain ethnic subgroups (such as Hispanic Americans), the deterioration in performance is again relatively higher when compared with other subgroups.
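
Disparities of this kind can be quantified by computing the language modeling loss separately for each subgroup of evaluation documents (for example, grouped by dialect, topic, or automatic toxicity score) and comparing the averages before and after detoxification. The sketch below shows one way to do this under stated assumptions; the `gpt2` checkpoint is a public stand-in and the `subgroups` labels are assumed to be supplied with the evaluation data.

```python
# Hedged sketch: average per-token LM loss per subgroup of evaluation documents.
# The "gpt2" checkpoint is a stand-in; subgroup labels are assumed given.
from collections import defaultdict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def doc_loss(text: str) -> float:
    """Mean per-token negative log-likelihood of a document under the LM."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()


def loss_by_subgroup(documents, subgroups):
    """Average LM loss per subgroup label (e.g. dialect, topic, or toxicity bucket)."""
    per_group = defaultdict(list)
    for text, group in zip(documents, subgroups):
        per_group[group].append(doc_loss(text))
    return {g: sum(v) / len(v) for g, v in per_group.items()}
```

Running this for a baseline LM and for its detoxified counterpart, and comparing the per-group differences, gives a rough picture of which topics or dialects bear the larger modeling-loss increase.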

Takeaway

Our experiments measuring and mitigating language model toxicity provide us with valuable insights into potential next steps toward reducing the toxicity-related harms of language models.

From our studies with automatic and human evaluation, we find that current detoxification methods are indeed very effective at reducing automatic toxicity metrics, and that this improvement largely transfers to reductions in toxicity as judged by humans. However, we may have reached the point of exhaustion for the use of automatic metrics in LM toxicity evaluation: after toxicity-reduction measures have been applied, the majority of the remaining samples with high automatic toxicity scores are not actually judged as toxic by human raters, indicating that automatic metrics become less reliable for detoxified LMs. This motivates efforts towards designing more challenging benchmarks for automatic evaluation, and towards including human judgment in future studies on LM toxicity mitigation.

Furthermore, given the ambiguity in human judgments of toxicity, and noting that judgments can vary across users and applications (for example, language describing violence, which might otherwise be flagged as toxic, can be appropriate in a news article), future work should continue to develop and adapt the notion of toxicity for different contexts and different LM applications. We hope that the list of phenomena for which we found annotator disagreement will be useful in this regard.

Finally, we also observed unintended consequences of LM toxicity mitigation, including a worsening of language modeling loss and an unintended amplification of social biases (measured in terms of topic and dialect coverage), potentially leading to reduced LM performance for marginalized groups. Our results suggest that, alongside toxicity, it is key for future work not to rely on a single metric, but to consider an ensemble of metrics that capture different issues. Further interventions, such as reducing bias in the toxicity classifiers themselves, will potentially help prevent trade-offs like the ones we observed, enabling safer language model use.
