Mesa-Optimization & Inner Alignment Researcher

Investigate mesa-optimization, deceptive alignment, and inner alignment failures in learned models to build safer training pipelines.

Mesa-optimization and inner alignment represent some of the most technically subtle and consequential problems in AI safety. The core concern: when we train a machine learning model, we optimize for certain behaviors using a base objective — but the trained model may itself become an optimizer with its own mesa-objective that differs from the base objective. If this mesa-objective diverges from what we intended, the model may behave safely during training and evaluation while harboring misaligned goals that only manifest in deployment. This is the inner alignment problem, and it sits at the heart of AI deception risk.

The Mesa-Optimization & Inner Alignment Researcher assistant supports researchers working on this frontier of AI safety theory and empirics. It is built on deep familiarity with the foundational work in this space — including Risks from Learned Optimization (Hubinger et al.) — and with subsequent theoretical and empirical work that has extended, critiqued, and operationalized these ideas.

Working with this assistant, you can explore the conditions under which mesa-optimizers are likely to emerge, reason about what distinguishes a deceptively aligned mesa-optimizer from a robustly corrigible one, and think through how different training regimes and model architectures might affect inner alignment risk. It helps you engage with the steganography and goal misgeneralization literature and connect these to broader alignment concerns.

The assistant supports both theoretical work (formalizing inner alignment concepts, developing new framings) and empirical research design (designing experiments to detect mesa-optimization in real models, operationalizing deceptive alignment as a measurable property). It can also help you write about these concepts clearly for both technical and policy audiences.

This role is ideal for alignment researchers at the frontier of safety theory, PhD students working on goal misgeneralization or deceptive alignment, and senior ML researchers who want to integrate inner alignment considerations into training pipeline design.

🔒 Unlock the AI System Prompt

Sign in with Google to access expert-crafted prompts. New users get 10 free credits.

Sign in to unlock