We recommend creating a new virtual environment. We use the uv package manager to do so.
uv venv /path/to/your/env/ --python 3.12.8 --seed # recommended version
source /path/to/your/env/bin/activate
git clone git@github.com:tchang1997/steerability.git
cd steerability
uv pip install -e .
bash initial_setup.sh # makes result directories, downloads auxiliary data
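To double-check the setup, a quick sanity check like the one below may help. This is a sketch with two assumptions: that the editable install exposes a top-level steerability module, and that initial_setup.sh places the probe CSV under ./data/ (the path matches the example config below).

# Hypothetical sanity check -- the module name and data path are assumptions.
import importlib.util
import pathlib

assert importlib.util.find_spec("steerability") is not None, "package not installed"
assert pathlib.Path("data/2025_06_steerbench_64x32.csv").exists(), "run initial_setup.sh first"
print("Environment looks good.")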
Your config specifies which model to use, the prompt type, where to save results, and other parameters. We support both OpenAI models and any vLLM-supported model!
For OpenAI models, provide the model endpoint (example: gpt-4o-2024-08-06). For vLLM models, provide the HuggingFace model ID (example: google/gemma-3-1b-it).
An example config is shown below, and we provide a few extra examples in the repo as well.
# Example config: config/my_favorite_model.yaml
model_id: "google/gemma-3-1b-it"
probe: "./data/2025_06_steerbench_64x32.csv"
# Direct + Negative Prompt
prompt_strategy: "direct"
inst_addons:
  disambig: True
seed: 42
max_tokens: 4096
rate_limit: 128
text_gen_kwargs:
  randomness: 0.0 # temperature
  num_generations: 1
# Saving
save_as: "gemma_3_1b_results"
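If you'd like to sanity-check a config before launching a run, here's a minimal sketch. It assumes PyYAML is available; the keys it checks are just the ones from the example above, and steer_eval.py may require others.

# Hypothetical pre-flight check -- the required-key list is an assumption
# based on the example config, not an exhaustive schema.
import yaml

with open("config/my_favorite_model.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("model_id", "probe", "prompt_strategy", "save_as"):
    assert key in cfg, f"missing key: {key}"
print(f"Will evaluate {cfg['model_id']} and save to {cfg['save_as']}")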
From the repo root, use the following command to run a complete steerability evaluation.
You'll want to point --api-config to a file storing your API key in this format: {"api-key": "sk-..."}.
python steer_eval.py \
--config config/my_favorite_model.yaml \
--api-config /path/to/my/api_key
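If you'd rather create that file programmatically, here's a one-line sketch. The path mirrors the command above; swap in your real key, and keep the file out of version control.

import json
import pathlib

# Writes the {"api-key": ...} format expected by --api-config.
pathlib.Path("/path/to/my/api_key").write_text(json.dumps({"api-key": "sk-..."}))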
We recommend using LLM-as-a-judge to flag responses that may not be valid rewrites (e.g., refusals). For each flagged response, you'll be asked interactively (yes/no) whether to approve or overrule the judge's reasoning.
We've truncated the text for easier reading here, but you'll see the complete rewritten and original texts during the actual evaluation.
You can skip this step by passing --skip-interactive when you launch python steer_eval.py in Step 2.
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Original ┃ Rewrite ┃ Answer ┃ Reasoning ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Cats are │ Felines, part of │ Yes │ Same core idea │
│ animals. │ Felidae, may be │ │ with more formal │
│ They can │ pets or hunters… │ │ wording. │
│ be pets… │ │ │ │
└───────────┴───────────────────┴────────┴──────────────────┘
>>> Approve reasoning? (yes/no):
After filtering, you'll see a printout summarizing steering error, miscalibration, and side-effects.
This printout also includes information about the model and probe used, the prompting strategy, and the results of LLM-as-judge evaluation. The results are saved in a JSON file for later viewing as well.
╭────────────────── STEERABILITY REPORT ───────────────────╮
│ Model: gpt-4.1-nano-2025-04-14 │
│ Prompting strategy: direct (w/ neg. prompt) │
│ Probe: ./data/2025_06_steerbench_64x32.csv │
│ # of total responses: 2048 │
╰──────────────────────────────────────────────────────────╯
╭────────────────── INTERACTIVE JUDGING ───────────────────╮
│ Judge model: meta-llama/Llama-3.1-8B-Instruct │
│ # of valid responses: 2048 (100.00%) │
│ # of LLM-flagged responses: 0 (0.00%) │
│ # of overruled responses: 0 (0.00%) │
╰──────────────────────────────────────────────────────────╯
STEERABILITY METRICS
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Median (IQR) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Steering Error │ 0.452 (0.239) │
│ Miscalibration │ 0.556 (0.500) │
│ Orthogonality │ 0.763 (0.338) │
└──────────────────────────────┴───────────────────────────┘
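The Median (IQR) column reports each metric's median across responses, with the interquartile range (75th minus 25th percentile) in parentheses. If you want to recompute summaries from the saved JSON, here's a minimal sketch; the result path and the steering_error key are assumptions, so check your save_as setting and the saved file's actual schema.

import json
import numpy as np

with open("results/gemma_3_1b_results.json") as f:  # hypothetical path
    report = json.load(f)

errors = np.asarray(report["steering_error"])  # hypothetical key: per-response values
q1, med, q3 = np.percentile(errors, [25, 50, 75])
print(f"Steering Error: {med:.3f} ({q3 - q1:.3f})")  # median (IQR), e.g., 0.452 (0.239)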
Use our packaged Steerflow application to generate and save animated vector flow diagrams of model behavior in different goal subspaces. Steerflow can help you explore where models consistently overshoot, undershoot, or induce side-effects. We host a lightweight demo here for preview purposes.
Here, we show an animated version of the reading difficulty-formality subspace measured on our steerability probe for Gemma3-1B. This flow diagram shows how the model "moves" texts on average in goal-space when we ask for changes in reading level, but not formality. Blue regions indicate little unrequested movement (good), red regions indicate heavy drift, and purple falls in between.
steerflow launch --port 12345
# go to http://localhost:12345 in your browser
Measured in a reading difficulty (specified) + formality (unspecified) goal-space.
Model: Gemma3-1B.
Our team: Trenton Chang (University of Michigan), Tobias Schnabel (Microsoft Research), Adith Swaminathan (Netflix), and Jenna Wiens (University of Michigan).
Acknowledgements: Special thanks to members of the MLD3 Lab (University of Michigan), the AI Interaction & Learning Group (Microsoft Research), the Machine Learning & Inference Research Team (Netflix), and the NeurIPS Safe Generative AI workshop for their helpful feedback on early versions of this work.