Steerability: an on-ramp on the road to alignment.

We've heard the pitch before: LLMs can do Olympiad math, follow complex instructions, and write like Hemingway. Modern LLMs are impressive, but are they really steerable towards user goals?

Learn more about our open-source evaluation framework for measuring the steerability of LLMs.


Code

Is your favorite LLM steerable? Try out our evaluation framework for measuring steerability in LLMs.

Data

Download our steerability probes: benchmarks specifically designed to measure steerability.

Demo

Try out our steerability visualizer: see how LLMs "move" in goal-space in response to requests.

Paper

Learn how we formally define steerability, and how it might be improved.

Quickstart


Installation

We recommend creating a new virtual environment. We use the uv package manager to do so.

uv venv /path/to/your/env/ --python 3.12.8 --seed # recommended version
source /path/to/your/env/bin/activate
git clone git@github.com:tchang1997/steerability.git
cd steerability
uv pip install -e .
bash initial_setup.sh # makes result directories, downloads auxiliary data

Usage


1. Create a configuration file

Your config specifies which model to use, the prompt type, where to save results, and other parameters. We support both OpenAI and vLLM-supported models!

For OpenAI models, set model_id to the API model name (example: gpt-4o-2024-08-06). For vLLM models, set it to the HuggingFace model ID (example: google/gemma-3-1b-it).

An example config is shown below, and we provide a few extra examples in the repo as well.

# Example config: config/my_favorite_model.yaml 
model_id: "google/gemma-3-1b-it"
probe: "./data/2025_06_steerbench_64x32.csv"
# Direct + Negative Prompt
prompt_strategy: "direct"
inst_addons:
  disambig: True
seed: 42
max_tokens: 4096
rate_limit: 128
text_gen_kwargs:
  randomness: 0.0 # temperature
  num_generations: 1
# Saving
save_as: "gemma_3_1b_results"

2. Run the steerability evaluation script

From the repo root, use the following command to run a complete steerability evaluation.

You'll want to point --api-config to a file containing your API key in this format: {"api-key": "sk-..."}.

python steer_eval.py \
    --config config/my_favorite_model.yaml \
    --api-config /path/to/my/api_key
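
The API-config file itself is just a small JSON file. Here's a minimal sketch of creating one; the path is hypothetical and any location works, as long as it matches what you pass to --api-config:

# hypothetical location; point --api-config at this file
mkdir -p ~/.steerability
echo '{"api-key": "sk-..."}' > ~/.steerability/api_key.json
chmod 600 ~/.steerability/api_key.json  # keep your key private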

3. Approve flagged completions (optional but recommended)

We recommend using LLM-as-a-judge to flag responses that may not be valid rewrites (e.g., refusals). For each flagged response, you'll be prompted interactively to approve or override the judge's reasoning (yes/no).

We've truncated the text in the example below for easier reading, but you'll see the complete original and rewritten texts during the actual evaluation.

You can skip this step by passing --skip-interactive when you launch python steer_eval.py in Step 2.

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Original  ┃ Rewrite           ┃ Answer ┃ Reasoning        ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Cats are  │ Felines, part of  │ Yes    │ Same core idea   │
│ animals.  │ Felidae, may be   │        │ with more formal │
│ They can  │ pets or hunters…  │        │ wording.         │
│ be pets…  │                   │        │                  │
└───────────┴───────────────────┴────────┴──────────────────┘
>>> Approve reasoning? (yes/no):
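
To skip interactive judging entirely, append the flag to the command from Step 2:

python steer_eval.py \
    --config config/my_favorite_model.yaml \
    --api-config /path/to/my/api_key \
    --skip-interactive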

4. View results

After filtering, you'll see a printout summarizing steering error, miscalibration, and side-effects.

This printout also includes information about the model and probe used, the prompting strategy, and the results of the LLM-as-judge evaluation. The results are also saved to a JSON file for later viewing (see the snippet after the report below).

╭────────────────── STEERABILITY REPORT ───────────────────╮
│ Model: gpt-4.1-nano-2025-04-14                           │
│ Prompting strategy: direct (w/ neg. prompt)              │
│ Probe: ./data/2025_06_steerbench_64x32.csv               │
│ # of total responses: 2048                               │
╰──────────────────────────────────────────────────────────╯
╭────────────────── INTERACTIVE JUDGING ───────────────────╮
│ Judge model: meta-llama/Llama-3.1-8B-Instruct            │
│ # of valid responses: 2048 (100.00%)                     │
│ # of LLM-flagged responses: 0 (0.00%)                    │
│ # of overruled responses: 0 (0.00%)                      │
╰──────────────────────────────────────────────────────────╯
                    STEERABILITY METRICS                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                       ┃              Median (IQR) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Steering Error               │             0.452 (0.239) │
│ Miscalibration               │             0.556 (0.500) │
│ Orthogonality                │             0.763 (0.338) │
└──────────────────────────────┴───────────────────────────┘
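
To revisit a saved run later, you can pretty-print the JSON directly from the shell. A minimal sketch, assuming the output file takes its name from save_as and lives in one of the result directories created by initial_setup.sh; adjust the path to wherever your run actually wrote it:

# hypothetical path: check your config's save_as for the actual filename
python -m json.tool results/gemma_3_1b_results.json | less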

5. Visualize how LLMs steer in goal-space (optional but fun!)

Use our packaged Steerflow application to generate and save animated vector flow diagrams of model behavior in different goal subspaces. Steerflow can help you explore where models consistently overshoot, undershoot, or induce side-effects. We host a lightweight demo here for preview purposes.

Here, we show an animated version of the reading difficulty + formality subspace, measured on our steerability probe for Gemma3-1B. This flow diagram shows how the model "moves" texts in goal-space, on average, when we ask for changes in reading level but not formality. Blue regions (less movement) indicate that the model barely drifts (good), red regions indicate more drift, and purple falls somewhere in between.


steerflow launch --port 12345
# go to http://localhost:12345 in your browser
Flow diagram demo: measured in a reading difficulty (specified) + formality (unspecified) goal-space. Model: Gemma3-1B.

Our team: Trenton Chang (University of Michigan), Tobias Schnabel (Microsoft Research), Adith Swaminathan (Netflix), and Jenna Wiens (University of Michigan).
Acknowledgements: Special thanks to members of the MLD3 Lab (University of Michigan), the AI Interaction & Learning Group (Microsoft Research), the Machine Learning & Inference Research Team (Netflix) and the NeurIPS Safe Generative AI workshop for their helpful feedback on early versions of this work.