Ethical Guidance of LLMs

From FDHwiki
Jump to navigation Jump to search

Abstract

Constitutional AI is a framework for creating artificial systems that can align with human values and preferences, without violating ethical principles. However, most existing methods for constitutional AI rely on human intervention, which can be costly, biased, and inconsistent. In this exploratory project, we replicate and extend the constitutional AI pipeline proposed by Anthropic, using Meta's Llama 2, a large language model with 7 billion parameters. We fine-tune a quantised Llama 2 on a set of ethical principles and corresponding unethical principles, using a critique-revision loop in supervised learning. The critique-revision loop involves generating answers to ethical dilemmas which are used to finetune the model. We then use a generated dataset of ideal answers to generate a preference dataset to train our reward model. We then introduce a reinforcement learning model based on the policy generated by the preference model, which is trained using RLAIF (Reinforcement Learning from AI Feedback). RLAIF leverages the feedback from Llama 2 to improve its own behavior and alignment with human values. We explore the ethical spectrum with regards to LLMs by inverting the values and measuring the impact on the outputs.

[ADD A SENTENCE ABOUT RESULTS]


Introduction

“Success in creating AI would be the biggest event in human history. Unfortunately, it might also be the last, unless we learn how to avoid the risks.” - Stephen Hawking

Large language models (LLMs) pose ethical challenges and risks for human society and values. How can we align them with human interests and norms? How can we prevent or mitigate their misuse, bias, manipulation, or deception? How can we foster trust, accountability, and transparency in their development and deployment?

So why exactly do LLMs make us shake in our boots? LLMs have the potential to be misused in various ways, which can lead to ethical and social risks. For example, LLMs can be used to impersonate the style of speech of specific individuals or groups, which can be abused at scale to mislead potential victims into placing their trust in the hands of criminal actors. Additionally, LLMs can be employed for malicious purposes, including generating harmful content, impersonating individuals, or facilitating cyberattacks. The risks associated with LLMs are not limited to security concerns. LLMs can perpetuate stereotypes, unfair discrimination, exclusionary norms, toxic language, and lower performance by social group. They can also reproduce biases and generate offensive responses that create further risk for businesses. In healthcare, LLMs pose risks related to the accuracy of the model and the privacy implications of its usage. In education, LLMs can be used to plagiarize content and spamming. In finance, LLMs can be used to generate false answers, leading to a direct threat to science. In law, LLMs can be used to impersonate individuals and groups, leading to data breaches or unauthorised dissemination of proprietary information.

By exploring the potential risks and challenges associated with LLMs, this project aims to identify ways to mitigate them and to promote responsible use of LLMs. The project’s goal is to foster trust, accountability, and transparency in the development and deployment of LLMs. By fine-tuning the Llama2 model with a set of pre-defined values, the project aims to test the limits of LLMs across the ethical spectrum and to identify the benefits and challenges of embedding ethical values into LLMs. The project’s findings can help researchers and developers create LLMs that are more ethical and aligned with human values. Overall, this project has the potential to make a significant contribution to the field of digital humanities by addressing the ethical implications of LLMs and their impact on society.

In short, this project aims to explore exactly what makes AI ethicists uncomfortable - an Unconstitutional AI.

Project Plan and Milestones

Overview

Contributions

Methodology

Results

Limitations

Future Work

Conclusion

Appendix

Acknowledgments

Github Codebase

Few Shot Attempts