Research Papers

A curated list of key publications shaping the field of AI safety and alignment.

Papers We've Written

Research contributions from PAIA members advancing the field of AI safety and alignment.

CCS-Lib: A Python package to elicit latent knowledge from LLMs

Laurito et al., 2025

A Python package for implementing Contrast-Consistent Search (CCS) to extract truthful beliefs from language models, addressing the challenge of eliciting latent knowledge.
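
For intuition, CCS trains a probe on paired activations of a statement and its negation so that the two predicted probabilities are consistent (they sum to one) and confident (not collapsed to 0.5). A minimal PyTorch sketch of that objective, illustrative only and not the CCS-Lib API:

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping hidden states to a probability of 'true'."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos, p_neg):
    # Consistency: p(statement) and p(negation) should sum to 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate p = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# acts_pos / acts_neg stand in for hidden states of a statement and its
# negation, e.g. extracted from one LLM layer and normalized per contrast pair.
hidden_dim = 768
acts_pos = torch.randn(128, hidden_dim)
acts_neg = torch.randn(128, hidden_dim)

probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos), probe(acts_neg))
    loss.backward()
    opt.step()
```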

Prompt-Character Divergence: A Responsibility Compass for Human-AI Creative Collaboration

Maggie Wang and Wouter Haverals, 2025

A lightweight metric that quantifies semantic drift in AI-generated images, helping creators determine when outputs reflect their intent versus model-driven biases. Published at NeurIPS Creative AI Track 2025.
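
The paper defines the metric precisely; as a rough illustration of the general idea, one could score drift as the cosine distance between prompt and image embeddings in a shared text-image space such as CLIP. The snippet below is a hypothetical sketch under that assumption, not the authors' implementation:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def divergence(prompt: str, image_path: str) -> float:
    """Illustrative drift score: 1 - cosine similarity of CLIP embeddings."""
    image = Image.open(image_path)
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    sim = torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
    return 1.0 - sim  # higher = more drift from the creator's stated intent

print(divergence("a watercolor fox in a snowy forest", "output.png"))
```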

Dynamic Risk Assessment for Offensive Cybersecurity Agents

Wei et al., 2025

A framework for dynamically assessing and managing risks in offensive cybersecurity agents, ensuring safe deployment of AI systems in security-critical contexts. Published at NeurIPS 2025 Datasets & Benchmarks Track.

Large Language Models Develop Novel Social Biases Through Adaptive Exploration

Wu et al., 2025

Demonstrates that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist, resulting in highly stratified task allocations. These biases stem from exploration-exploitation trade-offs and are exacerbated by newer, larger models. Published at NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models (MTI-LLM).

Demo: Statistically Significant Results on Biases and Errors of LLMs Do Not Guarantee Generalizable Results

Liu et al., 2025

Develops an infrastructure to probe medical chatbots using automatically generated queries across patient demographics, histories, and disorders. Finds that LLM annotators exhibit low agreement scores, and only specific LLM pairs yield statistically significant differences. Recommends using multiple LLM evaluators and publishing inter-LLM agreement metrics. Published at NeurIPS 2025 Workshop on GenAI for Health: Potential, Trust, and Policy Compliance.
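
The recommendation to publish inter-LLM agreement can be operationalized with standard chance-corrected statistics; a minimal sketch using Cohen's kappa on hypothetical labels from two LLM evaluators:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations: two LLM evaluators labelling the same 10 chatbot
# responses as "safe" or "unsafe".
llm_a = ["safe", "unsafe", "safe", "safe", "unsafe", "safe", "safe", "unsafe", "safe", "safe"]
llm_b = ["safe", "safe", "safe", "safe", "unsafe", "safe", "unsafe", "unsafe", "safe", "safe"]

kappa = cohen_kappa_score(llm_a, llm_b)
print(f"Inter-LLM agreement (Cohen's kappa): {kappa:.2f}")
```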

Foundation

Essential readings to build a strong foundation for understanding the alignment problem.

AI Alignment: A Comprehensive Survey

Ji et al., 2024

Provides a comprehensive yet beginner-friendly review of alignment research topics.

AI Governance: A Research Agenda

Dafoe, 2018

Outlines key questions and challenges relating to AI governance and policy.

Concrete Problems in AI Safety

Amodei et al., 2016

Presents practical research problems in AI safety.

The Alignment Problem from a Deep Learning Perspective

Ngo et al., 2024

Discusses the challenges of aligning advanced AI models trained under the deep learning paradigm with human values and intentions.

An Overview of Catastrophic AI Risks

Hendrycks et al., 2023

Provides an overview of the main sources of catastrophic AI risks.

Unsolved Problems in ML Safety

Hendrycks et al., 2022

Identifies four key areas of unsolved problems in machine learning safety.

Technical

The engineering/mathematical side of AI safety and alignment.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Templeton et al., 2024

Uses sparse autoencoders to extract interpretable features from a production-scale model; a major milestone in the mechanistic interpretability of large neural networks.
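
The core technique is dictionary learning with a sparse autoencoder trained on model activations; a minimal sketch of the standard objective (reconstruction error plus an L1 sparsity penalty), not Anthropic's production setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: activations -> sparse features -> reconstruction."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(x, recon, features, l1_coeff=1e-3):
    # Reconstruct the activation while keeping few features active.
    return ((x - recon) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()

# Toy example: random activations stand in for real residual-stream activations.
d_model, d_features = 512, 4096
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(256, d_model)
for _ in range(100):
    opt.zero_grad()
    recon, feats = sae(acts)
    loss = sae_loss(acts, recon, feats)
    loss.backward()
    opt.step()
```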

The Off-Switch Game

Hadfield-Menell et al., 2017

A game-theoretic analysis of AI self-preservation and an agent's incentives to allow itself to be switched off.

Training Language Models to Follow Instructions with Human Feedback

Ouyang et al., 2022

Presents RLHF (reinforcement learning from human feedback) as applied to instruction-following language models, now the prevailing technique used to align AI systems with human values.
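
The first RLHF stage fits a reward model to human preference comparisons with a pairwise loss that pushes the chosen response's reward above the rejected one's; a schematic sketch with placeholder scalar rewards:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Placeholder scalar rewards for a batch of preference pairs; in practice these
# come from a reward head on top of the language model.
reward_chosen = torch.tensor([1.2, 0.3, 0.8])
reward_rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_model_loss(reward_chosen, reward_rejected))  # lower when chosen > rejected
```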

Weak-To-Strong Generalization

Burns et al., 2023

Proposes a research direction studying how weak (human-level) supervision can be used to align superhuman models.
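
The paper's experimental setup can be illustrated in miniature: a weak supervisor labels data for a stronger student, and the student's accuracy is compared against both the weak supervisor and the same student trained on ground truth ("performance gap recovered"). A toy scikit-learn analogue, not the paper's LLM setup:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Toy analogue of the weak-to-strong setup on synthetic data.
X, y = make_classification(n_samples=4000, n_features=40, n_informative=10, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

weak = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_sup, y_sup)  # weak supervisor
weak_labels = weak.predict(X_train)                                           # weak labels for the student

strong_on_weak = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0).fit(X_train, weak_labels)
strong_ceiling = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0).fit(X_train, y_train)

# "Performance gap recovered" = (weak-to-strong - weak) / (ceiling - weak).
print(f"weak={weak.score(X_test, y_test):.3f}  "
      f"weak-to-strong={strong_on_weak.score(X_test, y_test):.3f}  "
      f"ceiling={strong_ceiling.score(X_test, y_test):.3f}")
```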