We recommend creating a new virtual environment. We use the uv package manager to do so.
uv venv /path/to/your/env/ --python 3.12.8 --seed # recommended version
source /path/to/your/env/bin/activate
git clone git@github.com:tchang1997/steerability.git
cd steerability
uv pip install -e .
bash initial_setup.sh # makes result directories, downloads auxiliary data
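To double-check the setup, a quick sanity check like the one below may help. This is a sketch with two assumptions: that the editable install exposes a top-level steerability module, and that initial_setup.sh places the probe CSV under ./data/ (the path matches the example config below).

# Hypothetical sanity check -- the module name and data path are assumptions.
import importlib.util
import pathlib

assert importlib.util.find_spec("steerability") is not None, "package not installed"
assert pathlib.Path("data/2025_06_steerbench_64x32.csv").exists(), "run initial_setup.sh first"
print("Environment looks good.")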
Your config specifies which model to use, the prompt type, where to save results, and other parameters. We support both OpenAI models and any vLLM-supported model!
For OpenAI models, provide the model endpoint (example: gpt-4o-2024-08-06). For vLLM models, provide the HuggingFace model ID (example: google/gemma-3-1b-it).
An example config is shown below, and we provide a few extra examples in the repo as well.
# Example config: config/my_favorite_model.yaml
model_id: "google/gemma-3-1b-it"
probe: "./data/2025_06_steerbench_64x32.csv"
# Direct + Negative Prompt
prompt_strategy: "direct"
inst_addons:
  disambig: True
seed: 42
max_tokens: 4096
rate_limit: 128
text_gen_kwargs:
  randomness: 0.0 # temperature
  num_generations: 1
# Saving
save_as: "gemma_3_1b_results"
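If you'd like to sanity-check a config before launching a run, here's a minimal sketch. It assumes PyYAML is available; the keys it checks are just the ones from the example above, and steer_eval.py may require others.

# Hypothetical pre-flight check -- the required-key list is an assumption
# based on the example config, not an exhaustive schema.
import yaml

with open("config/my_favorite_model.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("model_id", "probe", "prompt_strategy", "save_as"):
    assert key in cfg, f"missing key: {key}"
print(f"Will evaluate {cfg['model_id']} and save to {cfg['save_as']}")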
From the repo root, use the following command to run a complete steerability evaluation.
You'll want to point --api-config to a file storing your API key in this format: {"api-key": "sk-..."}.
python steer_eval.py \
--config config/my_favorite_model.yaml \
--api-config /path/to/my/api_key
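If you'd rather create that file programmatically, here's a one-line sketch. The path mirrors the command above; swap in your real key, and keep the file out of version control.

import json
import pathlib

# Writes the {"api-key": ...} format expected by --api-config.
pathlib.Path("/path/to/my/api_key").write_text(json.dumps({"api-key": "sk-..."}))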
We recommend using LLM-as-a-judge to flag responses that may not be valid rewrites (e.g., refusals). For each flagged response, you'll be asked interactively (yes/no) whether to approve or overrule the judge's reasoning.
We've truncated the text for easier reading here, but you'll see the complete rewritten and original texts during the actual evaluation.
You can skip this step by passing --skip-interactive when you launch python steer_eval.py in Step 2.
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Original ┃ Rewrite ┃ Answer ┃ Reasoning ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Cats are │ Felines, part of │ Yes │ Same core idea │
│ animals. │ Felidae, may be │ │ with more formal │
│ They can │ pets or hunters… │ │ wording. │
│ be pets… │ │ │ │
└───────────┴───────────────────┴────────┴──────────────────┘
>>> Approve reasoning? (yes/no):
After filtering, you'll see a printout summarizing steering error, miscalibration, and side-effects.
This printout also includes information about the model and probe used, the prompting strategy, and the results of LLM-as-judge evaluation. The results are saved in a JSON file for later viewing as well.
╭────────────────── STEERABILITY REPORT ───────────────────╮
│ Model: gpt-4.1-nano-2025-04-14 │
│ Prompting strategy: direct (w/ neg. prompt) │
│ Probe: ./data/2025_06_steerbench_64x32.csv │
│ # of total responses: 2048 │
╰──────────────────────────────────────────────────────────╯
╭────────────────── INTERACTIVE JUDGING ───────────────────╮
│ Judge model: meta-llama/Llama-3.1-8B-Instruct │
│ # of valid responses: 2048 (100.00%) │
│ # of LLM-flagged responses: 0 (0.00%) │
│ # of overruled responses: 0 (0.00%) │
╰──────────────────────────────────────────────────────────╯
STEERABILITY METRICS
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Median (IQR) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Steering Error │ 0.452 (0.239) │
│ Miscalibration │ 0.556 (0.500) │
│ Orthogonality │ 0.763 (0.338) │
└──────────────────────────────┴───────────────────────────┘
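The Median (IQR) column reports each metric's median across responses, with the interquartile range (75th minus 25th percentile) in parentheses. If you want to recompute summaries from the saved JSON, here's a minimal sketch; the result path and the steering_error key are assumptions, so check your save_as setting and the saved file's actual schema.

import json
import numpy as np

with open("results/gemma_3_1b_results.json") as f:  # hypothetical path
    report = json.load(f)

errors = np.asarray(report["steering_error"])  # hypothetical key: per-response values
q1, med, q3 = np.percentile(errors, [25, 50, 75])
print(f"Steering Error: {med:.3f} ({q3 - q1:.3f})")  # median (IQR), e.g., 0.452 (0.239)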
Use our packaged Steerflow application to generate and save animated vector flow diagrams of model behavior in different goal subspaces. Steerflow can help you explore where models consistently overshoot, undershoot, or induce side-effects. We host a lightweight demo here for preview purposes.
Here, we show an animated version of the reading difficulty-formality subspace measured on our steerability probe for Gemma3-1B. This flow diagram shows how the model "moves" texts on average in goal-space when we ask for changes in reading level, but not formality. Blue regions indicate little unrequested movement (good), red regions indicate heavy drift, and purple falls in between.
steerflow launch --port 12345
# go to http://localhost:12345 in your browser
Measured in a reading difficulty (specified) + formality (unspecified) goal-space.
Model: Gemma3-1B.
Our team: Trenton Chang (University of Michigan), Tobias Schnabel (Microsoft Research), Adith Swaminathan (Netflix), and Jenna Wiens (University of Michigan).
Acknowledgements: Special thanks to members of the MLD3 Lab (University of Michigan), the AI Interaction & Learning Group (Microsoft Research), the Machine Learning & Inference Research Team (Netflix), and the NeurIPS Safe Generative AI workshop for their helpful feedback on early versions of this work.