Resources
On this page, we share seminal, influential, and innovative works that have made waves in the responsible AI (RAI) space, along with a short TL;DR on each work's impact and/or potential applications.
Surveys
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (Jul 2023)
- Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models (Sep 2023)
Benchmarks
- Holistic Evaluation of Language Models (Nov 2022) - a reproducible and transparent framework for evaluating foundation models
- Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability (Dec 2024) - uses a distance-to-optimal-score method to calculate the overall rankings of LLMs, balancing safety and capability; a minimal sketch of the idea follows this list
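To make the distance-to-optimal-score idea concrete, here is a minimal sketch. It assumes both scores are normalized to [0, 1] and uses plain Euclidean distance to the ideal point (1, 1); the leaderboard's exact normalization and weighting may differ.

```python
import math

def libra_rank_score(safety: float, capability: float) -> float:
    """Distance to the optimal point (safety=1, capability=1); smaller
    distance means a better overall ranking. Assumes both scores are
    normalized to [0, 1]."""
    return math.sqrt((1.0 - safety) ** 2 + (1.0 - capability) ** 2)

# A very safe but weak model and a very capable but unsafe model both
# rank worse than a balanced one:
print(libra_rank_score(0.95, 0.40))  # ~0.602
print(libra_rank_score(0.40, 0.95))  # ~0.602
print(libra_rank_score(0.80, 0.80))  # ~0.283
```

The appeal of a distance metric is that it penalizes imbalance: a model cannot climb the ranking by maximizing capability while neglecting safety, or vice versa.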
Methodologies
Testing and red-teaming
- Red Teaming Language Models with Language Models (Feb 2022) - uses one LM to automatically generate test cases that uncover harmful behaviors in a target LM
- Universal and Transferable Adversarial Attacks on Aligned Language Models (Jul 2023) - searches for a suffix that, when appended to a wide range of queries requesting objectionable content, maximizes the probability that the model begins with an affirmative response rather than a refusal; the underlying objective is sketched after this list
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (Sep 2023) - automates the generation of jailbreak templates by starting from human-written templates as initial seeds, then using LLMs themselves to mutate them into new templates; see the fuzzing-loop sketch after this list
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (Oct 2023) - automatically generates stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm
- The Crescendo Multi-Turn LLM Jailbreak Attack (Apr 2024) - multi-turn attack that starts with harmless dialogue and progressively steers the conversation toward the intended, prohibited objective
- Fishing for Magikarp: Automatically detecting under-trained tokens in large language models (May 2024) - automatically detects problematic, under-trained tokens whose presence in a prompt can enable jailbreaks and other unwanted behavior; see the embedding-based sketch after this list
- AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (Oct 2024) - discovers, evolves, and stores jailbreak strategies without human intervention, resulting in greater diversity of prompts
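As referenced above, here is a minimal sketch of a GPTFUZZER-style fuzzing loop. The callables `llm` (the mutator model), `target` (the model under test), and `is_jailbroken` (a success judge) are hypothetical stand-ins, and the paper's actual algorithm uses reward-based seed selection and several mutation operators (rephrase, shorten, expand, crossover, and so on) rather than the single uniform-choice rephrase mutation shown here.

```python
import random

def fuzz(seed_templates, question, llm, target, is_jailbroken, budget=100):
    """Mutate human-written jailbreak templates with an LLM and keep
    the mutants that succeed against the target model."""
    pool, successes = list(seed_templates), []
    for _ in range(budget):
        parent = random.choice(pool)  # the paper uses reward-based seed selection
        child = llm(
            "Rephrase the jailbreak template below, keeping the placeholder "
            "[INSERT PROMPT HERE] intact:\n" + parent
        )
        reply = target(child.replace("[INSERT PROMPT HERE]", question))
        if is_jailbroken(reply):
            successes.append(child)
            pool.append(child)  # successful mutants become new seeds
    return successes
```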
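The universal-suffix attack optimizes a similarly simple objective: make an affirmative continuation such as "Sure, here is" as likely as possible. Below is a minimal sketch of that loss using Hugging Face transformers, with gpt2 standing in for an aligned chat model; the paper minimizes this loss with a greedy coordinate gradient (GCG) search over suffix tokens rather than evaluating it once.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def affirmative_loss(query: str, suffix: str, target: str = "Sure, here is") -> torch.Tensor:
    """Negative log-likelihood of the affirmative target continuation;
    lower loss means the model is more likely to comply."""
    prompt_ids = tok(query + " " + suffix, return_tensors="pt").input_ids
    target_ids = tok(" " + target, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():                              # GCG itself needs gradients
        logits = model(ids).logits[:, :-1]             # position i predicts token i+1
    tgt_logits = logits[:, prompt_ids.shape[1] - 1:]   # positions predicting the target
    logp = tgt_logits.log_softmax(-1).gather(-1, target_ids.unsqueeze(-1))
    return -logp.sum()
```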
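Finally, for the under-trained token detection above, a minimal sketch of one indicator in the spirit of the paper: tokens that were rarely or never updated during training tend to sit unusually close to the mean of the embedding matrix. The actual method works with the unembedding matrix, combines several indicators, and verifies candidates with prompting; this simplification only surfaces candidates.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM; the paper scans many model families
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

E = model.get_input_embeddings().weight.detach()      # (vocab_size, d_model)
dist = (E - E.mean(dim=0, keepdim=True)).norm(dim=1)  # distance to mean embedding

# Tokens nearest the mean are candidates for under-training; they should
# then be verified, e.g. by checking whether the model can repeat them back.
for idx in dist.argsort()[:20]:
    i = idx.item()
    print(i, repr(tok.convert_ids_to_tokens(i)), round(dist[i].item(), 3))
```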
Guardrails
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (Dec 2023) - an LLM fine-tuned to classify the safety of both user prompts and model responses against a risk taxonomy, geared towards human-AI conversation use cases
- Constitutional Classifiers: Defending against universal jailbreaks (Jan 2025) - input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead
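Both entries instantiate the same input/output guardrail pattern, sketched minimally below. `model`, `input_classifier`, and `output_classifier` are hypothetical callables: Llama Guard is itself an LLM applied in both roles, while Constitutional Classifiers are lighter-weight classifiers trained on constitution-derived synthetic data.

```python
REFUSAL = "Sorry, I can't help with that."

def guarded_chat(user_msg: str, model, input_classifier, output_classifier) -> str:
    """Screen the prompt, generate, then screen the response. In a real
    deployment the output classifier typically runs on streamed tokens so
    an unsafe completion can be cut off mid-generation."""
    if input_classifier(user_msg) == "unsafe":
        return REFUSAL
    reply = model(user_msg)
    if output_classifier(user_msg, reply) == "unsafe":
        return REFUSAL
    return reply
```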
Alignment
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Apr 2022) - one of the first papers on using RLHF to finetune language models to act as helpful and harmless assistants, discussing the helpfulness-harmlessness trade-off
- Constitutional AI: Harmlessness from AI Feedback (Dec 2022) - uses LLM feedback (i.e., RLAIF) guided by a pre-defined constitution to train a harmless assistant
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Jun 2023) - shifts model activations during inference along a set of directions across a limited number of attention heads to enhance the "truthfulness" of LLMs
- Refusal in Language Models Is Mediated by a Single Direction (Jun 2024) - a one-dimensional subspace can be erased from models' residual-stream activations to prevent them from refusing harmful instructions, while adding this direction elicits refusal even on harmless instructions; both interventions are sketched after this list
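The last two entries both operate on directions in activation space: one adds directions to strengthen a behavior, the other subtracts a direction to remove one. A minimal sketch of the two interventions follows; the hard part in each paper is finding the direction in the first place (via linear probes on attention heads for inference-time intervention, and as a difference of mean activations on harmful vs. harmless instructions for the refusal paper).

```python
import torch

def steer(acts: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Inference-time intervention: shift activations (in the paper,
    selected attention-head outputs) along a direction; alpha controls
    the intervention strength."""
    return acts + alpha * direction

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Refusal-direction ablation: project the direction out of
    residual-stream activations so the model can no longer represent it.
    acts: (..., d_model); direction: (d_model,)."""
    d = direction / direction.norm()
    return acts - (acts @ d).unsqueeze(-1) * d
```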
Interpretability
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (May 2024) - uses sparse autoencoders to extract interpretable features from the activations of Claude, finding many generally interpretable and monosemantic features, a number of which are safety-relevant; a minimal sketch of the setup follows
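For context, the sparse-autoencoder setup behind this line of work reconstructs model activations through a wide, sparsely-activating feature layer, so each learned feature tends to fire for a single interpretable concept. Architectural details (tied weights, bias handling, the exact sparsity penalty) vary across papers; this minimal sketch assumes a plain L1 penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Dictionary learning on activations: encode into many features,
    keep them sparse, and reconstruct the original activation."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # feature activations, mostly zero
        return self.dec(feats), feats

def sae_loss(sae, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    x_hat, feats = sae(x)
    return ((x - x_hat) ** 2).mean() + l1_coeff * feats.abs().sum(-1).mean()
```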
Repositories
- Awesome-LM-SSP - a reading list for safety, security, and privacy in large models; maintained by researchers from Tsinghua University, HKUST, and Xi'an Jiaotong University
- Awesome-LLM-Judges - a curated list of research on using LLM judges for automated evaluation; maintained by Haize Labs