Research Papers
A curated list of key publications shaping the field of AI safety and alignment.
Papers We've Written
Research contributions from PAIA members advancing the field of AI safety and alignment.
CCS-Lib: A Python package to elicit latent knowledge from LLMs
Laurito et al. • 2025
A Python package for implementing Contrast-Consistent Search (CCS) to extract truthful beliefs from language models, addressing the challenge of eliciting latent knowledge.
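Below is a minimal sketch of the CCS objective the package implements, written directly in PyTorch rather than against the CCS-Lib API (whose exact interface is not assumed here). The activations are random stand-ins for real LLM hidden states of a statement and its negation.

```python
# Minimal sketch of the Contrast-Consistent Search (CCS) objective.
# `pos_acts` / `neg_acts` are placeholder tensors standing in for hidden
# states of a statement and its negation, shape (n_examples, hidden_dim).
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # Map hidden states to a probability that the statement is true.
        return torch.sigmoid(self.linear(acts))

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: a statement and its negation should get complementary probabilities.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

hidden_dim = 768
pos_acts = torch.randn(256, hidden_dim)  # stand-in for real activations
neg_acts = torch.randn(256, hidden_dim)

probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(pos_acts), probe(neg_acts))
    loss.backward()
    opt.step()
```

Because the loss is unsupervised, the probe recovers a truth-like direction without any labels, which is what makes CCS a tool for eliciting latent knowledge.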
Prompt-Character Divergence: A Responsibility Compass for Human-AI Creative Collaboration
Maggie Wang, Wouter Haverals • 2025
A lightweight metric that quantifies semantic drift in AI-generated images, helping creators determine when outputs reflect their intent versus model-driven biases. Published at NeurIPS Creative AI Track 2025.
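For intuition only, here is one simple way to quantify prompt-image semantic drift using CLIP embeddings. This is an illustrative assumption for exposition, not the divergence metric defined in the paper.

```python
# Illustrative only: measure prompt-image semantic drift as one minus the
# CLIP cosine similarity between the prompt and the generated image.
# This is NOT the paper's metric, just a simple baseline for the idea.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def semantic_drift(prompt: str, image: Image.Image) -> float:
    # Embed the prompt and the image in CLIP's shared space.
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    cos = torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
    # Higher drift means the output has moved further from the creator's prompt.
    return 1.0 - cos

drift = semantic_drift("a watercolor fox in a snowy forest", Image.open("generated.png"))
print(f"semantic drift: {drift:.3f}")
```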
Dynamic Risk Assessment for Offensive Cybersecurity Agents
Wei et al. • 2025
A framework for dynamically assessing and managing risks in offensive cybersecurity agents, ensuring safe deployment of AI systems in security-critical contexts. Published at NeurIPS 2025 Datasets & Benchmarks Track.

Large Language Models Develop Novel Social Biases Through Adaptive Exploration
Wu et al. • 2025
Demonstrates that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist, resulting in highly stratified task allocations. These biases stem from exploration-exploitation trade-offs and are exacerbated by newer, larger models. Published at NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models (MTI-LLM).
Demo: Statistically Significant Results on Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Liu et al. • 2025
Develops an infrastructure to probe medical chatbots using automatically generated queries across patient demographics, histories, and disorders. Finds that LLM annotators exhibit low agreement scores, and only specific LLM pairs yield statistically significant differences. Recommends using multiple LLM evaluators and publishing inter-LLM agreement metrics. Published at the NeurIPS 2025 Workshop on GenAI for Health: Potential, Trust, and Policy Compliance.
Foundation
Essential readings to build a strong foundation for understanding the alignment problem.
AI Alignment: A Comprehensive Survey
Ji et al. • 2024
Provides a comprehensive yet beginner-friendly review of alignment research.
AI Governance: A Research Agenda
Dafoe • 2018
Outlines key questions and challenges relating to AI governance and policy.
Concrete Problems in AI Safety
Amodei et al. • 2016
Presents practical research problems in AI safety.
The Alignment Problem from a Deep Learning Perspective
Ngo et al. • 2024
Examines the alignment problem for advanced AI systems trained with deep learning, and the challenges of keeping such systems aligned with human values and intentions.
An Overview of Catastrophic AI Risks
Hendrycks et al. • 2023
Provides an overview of the main sources of catastrophic AI risks.
Unsolved Problems in ML Safety
Hendrycks et al. • 2022
Identifies four key areas of unsolved problems in machine learning safety.
Technical
The engineering/mathematical side of AI safety and alignment.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton et al. • 2024
A major milestone in the mechanistic interpretability of large neural networks, showing that sparse autoencoders can extract interpretable features from a production-scale model.
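As a rough illustration, here is a minimal sparse autoencoder of the kind the paper scales up. The dimensions, sparsity coefficient, and random activations are placeholders, not Anthropic's actual setup.

```python
# Minimal sketch of a sparse autoencoder (SAE) used to decompose model
# activations into an overcomplete set of sparse, interpretable features.
# All hyperparameters and data here are toy placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # Feature activations are non-negative and pushed toward sparsity.
        f = torch.relu(self.encoder(x))
        return self.decoder(f), f

d_model, d_features = 512, 8192           # features form an overcomplete basis
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                            # sparsity pressure

acts = torch.randn(1024, d_model)          # stand-in for real residual-stream activations
for _ in range(200):
    opt.zero_grad()
    recon, feats = sae(acts)
    # Reconstruction loss plus an L1 penalty on feature activations.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    loss.backward()
    opt.step()
```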
Training Language Models to Follow Instructions with Human Feedback
Ouyang et al. • 2022
Presents RLHF (reinforcement learning from human feedback), the prevailing technique used to align language models with human preferences and intent.
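At the core of RLHF is a reward model trained on human preference comparisons; the sketch below shows that pairwise loss with toy stand-ins for response embeddings (the reward model architecture and shapes are placeholders).

```python
# Minimal sketch of the pairwise reward-model loss used in RLHF:
# the reward model learns to score the human-preferred response above
# the rejected one. Embeddings here are random placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

rm = TinyRewardModel(d_model=256)
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)

# Stand-ins for embeddings of (prompt, chosen response) and (prompt, rejected response).
chosen = torch.randn(32, 256)
rejected = torch.randn(32, 256)

opt.zero_grad()
# Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
opt.step()
```

The learned reward is then maximized with reinforcement learning (PPO in the paper), regularized by a KL penalty toward the original model to prevent the policy from drifting too far.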
Weak-To-Strong Generalization
Burns et al. • 2023
Introduces a research direction for studying whether weaker supervisors, standing in for humans, can align much stronger models, standing in for superhuman intelligence.
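The experimental setup can be sketched with toy models: train a small "weak supervisor" on ground truth, train a larger "strong student" only on its noisy labels, then check how much performance the student recovers. The models and dataset below are stand-ins, not the paper's GPT-scale experiments.

```python
# Minimal sketch of the weak-to-strong generalization setup with toy models.
# Weak supervisor: small model trained on ground truth.
# Strong student: larger model trained only on the weak supervisor's labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=40, n_informative=10, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Weak supervisor sees only a small amount of ground-truth data.
weak = LogisticRegression(max_iter=1000).fit(X_weak[:200], y_weak[:200])
weak_labels = weak.predict(X_train)

# Strong student is supervised solely by the weak model's (imperfect) labels.
strong = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=500, random_state=0)
strong.fit(X_train, weak_labels)

# Compare both against held-out ground truth to measure how much of the
# weak-to-strong gap the student recovers.
print("weak supervisor accuracy:", weak.score(X_test, y_test))
print("strong student accuracy: ", strong.score(X_test, y_test))
```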