This project is part of my master's thesis at the German Aerospace Center (DLR). It implements a training-free approach for creating steering vectors for large language models (LLMs). We aim to improve upon existing training-free steering methods along two axes: (i) better steering outcomes and (ii) improved interpretability.
The core idea is to construct interpretable steering vectors from Sparse Autoencoder (SAE) latents: given contrastive sets of positive and negative prompts, the pipeline identifies SAE latents whose activation patterns are most characteristic of the target concept, validates them semantically via a logit lens projection and an LLM judge, and combines the selected latent decoder directions into a final steering vector.
We benchmark our approach in two settings. First, we adopt the benchmark from another DLR paper concerned with steering Ekman's six basic emotions. Second, we evaluate our method on Stanford's AxBench benchmark (see our AxBench fork here). Across both benchmarks, our approach significantly outperforms other training-free methods.
There is no paper (yet), but results and analysis can be found in the docs. If you want to run our pipeline yourself, follow our setup below!
Our approach consists of five main stages:
- Activation Extraction — Feed contrastive prompt sets (positive / negative) into the base LLM and collect residual-stream hidden states at a chosen layer.
- SAE Encoding — Pass the extracted activations through the SAE encoder to obtain sparse latent representations for both sets.
- Latent Ranking — Score each SAE latent by how much more strongly / frequently it activates on the positive set and select the top-k candidates.
- Logit Lens Projection — Decode each candidate latent direction back into vocabulary space via the unembedding matrix to get an interpretable token signature — no extra forward passes required.
- Semantic Validation — An LLM judge filters and re-ranks the candidates by semantic fit, then combines the selected decoder directions (weighted by activation strength and logit-lens signal) into the final steering vector.
Again, we refer to our documentation for a more detailed explanation.
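For intuition, the logit lens step above amounts to a single matrix product between SAE decoder directions and the model's unembedding matrix. A minimal sketch (all shapes and tensors below are illustrative stand-ins, not the pipeline's actual variables):

```python
import torch

# Illustrative shapes: a tiny "model" with 16 residual dims and 50 vocab tokens.
d_model, d_vocab, n_latents = 16, 50, 8

W_dec = torch.randn(n_latents, d_model)  # SAE decoder directions, one per candidate latent
W_U = torch.randn(d_model, d_vocab)      # model unembedding matrix

# Project each candidate latent direction into vocabulary space:
logits = W_dec @ W_U                     # (n_latents, d_vocab)

# The top tokens per latent form its interpretable "token signature".
top_tokens = logits.topk(5, dim=-1).indices
```

Because this reuses weights the model already has, no extra forward passes are needed, which is what makes the validation step cheap.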
The following steps get you up and running to reproduce our emotion steering experiments. For custom concepts, models, or SAEs, see Pipeline Usage below.
```shell
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

The datasets and evaluation prompts are stored via Git LFS. To download them:

```shell
git lfs install
git lfs pull
```

Copy `.env.example` to `.env`, add your OpenAI API key, and set your device:

```shell
cp .env.example .env
```

All paths default to subdirectories of `resources/` and do not need to be changed unless you want to use custom locations. The full list of configurable variables:
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | Required for the LLM judge in semantic validation |
| `MODEL_STORE` | Directory containing local model weights |
| `SAE_STORE` | Directory containing local SAE weights |
| `DATASET_STORE` | Directory with input datasets (`.pkl`) |
| `ACTIVATION_STORE` | Output for raw LLM activations |
| `SAE_ACTIVATION_STORE` | Output for SAE-encoded activations |
| `STEERING_VECTOR_STORE` | Output for steering vectors |
| `GENERATION_STORE` | Output for steered generations |
| `EVAL_PROMPT_STORE` | Input prompts for evaluation |
| `RANKING_ANALYSIS_STORE` | Output for latent ranking analysis |
| `DEVICE` | PyTorch device (e.g. `cuda:0`) |
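For the default layout, a minimal `.env` only needs the two non-path variables (values below are illustrative placeholders):

```shell
OPENAI_API_KEY=sk-your-key-here
DEVICE=cuda:0
```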
Once the environment is set up, run the full emotion steering pipeline end-to-end:
```shell
python pipeline/activation_extraction.py
python pipeline/sae_activation_extraction.py
python pipeline/steering_vector_creation.py
python pipeline/generation.py
python pipeline/evaluation.py
```

This runs the pre-configured emotion steering example using `google/gemma-2-9b-it` and `gemma-scope-9b-it-res-canonical`. Afterwards, inspect the results in the notebooks:

- `notebooks/latent_ranking.ipynb` — compare ranking mechanisms and candidate latents across our defined metrics
- `notebooks/steering_outcomes.ipynb` — visualise emotion scores and fluency across concepts and steering strengths
```
SAELogits/
├── config.py                        # Central path & device configuration
├── .env                             # Local path and API key overrides
├── pipeline/
│   ├── activation_extraction.py     # Stage 1: Extract LLM hidden states
│   ├── sae_activation_extraction.py # Stage 2: Encode activations via SAE
│   ├── steering_vector_creation.py  # Stage 3–5: Rank, validate & build steering vectors
│   ├── generation.py                # Steered text generation
│   └── evaluation.py                # Emotion & fluency evaluation
├── notebooks/
│   ├── latent_ranking.ipynb         # Latent ranking analysis
│   └── steering_outcomes.ipynb      # Steering outcome visualisation
└── resources/
    ├── datasets/                    # Input datasets (.pkl)
    ├── activations/                 # Extracted LLM activations
    ├── sae_activations/             # SAE-encoded activations
    ├── steering_vectors/            # Constructed steering vectors
    ├── ranking_analysis/            # Feature ranking analysis data
    ├── generations/                 # Steered model completions
    └── eval_prompts/                # Prompts used for evaluation
```
Model names can be a local name in `MODEL_STORE`, a HuggingFace model ID, or an absolute path. The pipeline has been tested with:
| Model | SAE |
|---|---|
| `google/gemma-2-9b-it` | `gemma-scope-9b-it-res-canonical` (via SAELens) |
| `meta-llama/Llama-3.1-8B-Instruct` | `andyrdt/saes-llama-3.1-8b-instruct` (via HuggingFace) |
SAEs can be loaded from SAELens (via the `release` parameter), HuggingFace, or a local path in `SAE_STORE`.
Extracts the last-token hidden states for each sample from the outgoing residual stream of the specified layers.
```python
from pipeline.activation_extraction import extract_and_save_activations_batch

extract_and_save_activations_batch(
    dataset_name='goemotions',
    model_name='google/gemma-2-9b-it',
    layer_indices=[31],
    batch_size=32,
)
```

Output: `ACTIVATION_STORE/<model>/<dataset>/activations_<i>.pkl`
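Conceptually, this stage boils down to selecting the final token's residual vector at each chosen layer. A minimal sketch, using random tensors in place of a real HuggingFace forward pass:

```python
import torch

# Stand-in for the tuple a HF model returns with output_hidden_states=True:
# hidden_states[l] has shape (batch, seq_len, d_model).
batch, seq_len, d_model = 4, 10, 16
hidden_states = tuple(torch.randn(batch, seq_len, d_model) for _ in range(5))

layer_indices = [3]
# Keep only the residual-stream state of the last token per sample.
last_token_acts = {l: hidden_states[l][:, -1, :] for l in layer_indices}
```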
Encodes the extracted activations through a Sparse Autoencoder to obtain sparse feature representations.
```python
from pipeline.sae_activation_extraction import load_sae, get_sae_activations

sae = load_sae(
    sae_id='layer_31/width_131k/canonical',
    release='gemma-scope-9b-it-res-canonical',
    device='cuda:0',
)
get_sae_activations(
    sae=sae,
    device='cuda:0',
    sae_layer=31,
    model_name='google/gemma-2-9b-it',
    dataset_name='goemotions',
)
```

Output: `SAE_ACTIVATION_STORE/<model>/<dataset>/sae_activations_<file>.pkl`
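The encoding itself follows the standard SAE formulation; a sketch with random weights in place of the loaded SAE's parameters:

```python
import torch

d_model, d_sae = 16, 64
x = torch.randn(8, d_model)         # extracted residual activations

W_enc = torch.randn(d_model, d_sae)
b_enc = torch.randn(d_sae)
b_dec = torch.randn(d_model)

# Standard SAE encoder: ReLU((x - b_dec) @ W_enc + b_enc) yields
# sparse, non-negative latent activations.
latents = torch.relu((x - b_dec) @ W_enc + b_enc)
```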
Three methods are available:
Ranks SAE latents by their specificity to the target concept, computes logit-lens token predictions per latent, and uses an LLM judge to select a semantically pure subset. The final vector is a weighted sum of the selected SAE decoder directions.
```python
from pipeline.steering_vector_creation import create_steering_vector

create_steering_vector(
    target_concept='joy',
    activation_path=SAE_ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    sae_model=sae,
    model_name='google/gemma-2-9b-it',
    layer=31,
    ranking_methods=['power_alpha'],
    alphas_sweep=[1.0],
    llm_model='gpt-4o',
)
```

Output: `STEERING_VECTOR_STORE/<model>/layer_<n>/sae_logits/<concept>.pkl`
This is the core method and the primary extension point of our pipeline. The two main control knobs for adapting the pipeline to new concepts or use cases are:
- Ranking mechanism — the `ranking_methods` and `alphas_sweep` parameters control how SAE latents are scored and selected. New ranking strategies can be added to `rank_sae_features` in `steering_vector_creation.py` and plugged in via `ranking_methods`.
- LLM judge prompt — the prompt passed to the judge (in `create_steering_vector`) determines the filtering and re-ranking criteria. Adjusting it is the most direct way to tune semantic precision or to target different concept types.
In addition to the steering vector itself, `create_steering_vector` always saves a companion analysis file (`<concept>_analysis.pkl`) at the same location. It contains the full ranked candidate list, per-latent logit-lens token signatures, the raw LLM judge response, the final selected indices, and the per-latent weights used to construct the vector — making every steering vector fully reproducible and inspectable.
Available ranking methods: `power_alpha`, `additive`, `harmonic`, `diff_mean`, `frequency`, `magnitude`
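As an illustration of the scoring idea, a `diff_mean`-style score can be computed directly from the SAE activations of the two prompt sets (simplified; the pipeline's actual scoring lives in `rank_sae_features`):

```python
import torch

n_latents = 128
pos = torch.rand(200, n_latents)  # SAE activations on positive prompts
neg = torch.rand(200, n_latents)  # SAE activations on negative prompts

# Score each latent by how much more strongly it fires on the positive set,
# then keep the top-k candidates for logit-lens and LLM-judge validation.
scores = pos.mean(dim=0) - neg.mean(dim=0)
top_k = scores.topk(10).indices
```

The other methods vary the statistic (activation frequency, magnitude, or combinations thereof) but follow the same score-then-select pattern.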
Computes the difference of mean activations between target and contrastive samples.
```python
from pipeline.steering_vector_creation import create_steering_vector_baseline

create_steering_vector_baseline(
    target_concept='joy',
    activation_path=ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    model_name='google/gemma-2-9b-it',
    layer=31,
)
```

Output: `STEERING_VECTOR_STORE/<model>/layer_<n>/baseline/<concept>.pkl`
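This baseline reduces to a single difference of means over the raw residual activations (illustrative tensors in place of the stored activations):

```python
import torch

d_model = 16
pos_acts = torch.randn(100, d_model)  # activations on target-concept samples
neg_acts = torch.randn(100, d_model)  # activations on contrastive samples

# Difference-of-means steering vector.
steering_vector = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
```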
To compare all ranking mechanisms and inspect which latents each method surfaces, use `create_latent_ranking_analysis`. Unlike `create_steering_vector`, this runs all ranking methods over a full alpha sweep and saves detailed per-latent information for downstream analysis in `notebooks/latent_ranking.ipynb`.
```python
from pipeline.steering_vector_creation import create_latent_ranking_analysis

create_latent_ranking_analysis(
    target_concept='joy',
    activation_path=SAE_ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    sae_model=sae,
    model_name='google/gemma-2-9b-it',
    layer=31,
    alphas_sweep=[0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0],
    ranking_methods=['power_alpha', 'additive', 'harmonic', 'diff_mean', 'frequency', 'magnitude'],
    llm_model='gpt-4o',
)
```

Output: `RANKING_ANALYSIS_STORE/<model>/layer_<n>/concept_<concept>_analysis.pkl`
Injects a steering vector into a specified layer during inference and generates completions for a set of prompts.
```python
from pipeline.generation import run_experiment

run_experiment(
    model_name='google/gemma-2-9b-it',
    sv_path='path/to/joy.pkl',
    prompt_path=EVAL_PROMPT_STORE / 'subjective_prompts.csv',
    layer_idx=31,
    alpha=200.0,
    experiment_name='gemma/goemo/sae_logits/joy',
    device='cuda:0',
)
```

Output: `GENERATION_STORE/<experiment_name>/<model>__layer_<n>__alpha_<a>.pkl`
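The injection step can be sketched with a PyTorch forward hook that adds `alpha * v` to a layer's output. Below, a toy linear layer stands in for the LLM block; the real pipeline hooks the chosen residual layer of the LLM:

```python
import torch
import torch.nn as nn

d_model = 16
block = nn.Linear(d_model, d_model)  # stand-in for a transformer block
v = torch.randn(d_model)             # steering vector
alpha = 2.0

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + alpha * v

handle = block.register_forward_hook(add_steering)
x = torch.randn(1, d_model)
steered = block(x)
handle.remove()
unsteered = block(x)
```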
Evaluates generated completions on two axes:
- Emotion scores — via `j-hartmann/emotion-english-distilroberta-base` (HuggingFace repo)
- Fluency score — via an LLM judge (Gemma 3 12B) on a 0–2 scale (similar to AxBench)
```python
from pipeline.evaluation import evaluate_emotions, evaluate_fluency_llm_judge

evaluate_emotions(pickle_path='path/to/completions.pkl', device_map='cuda:0')
evaluate_fluency_llm_judge(
    pickle_path='path/to/completions.pkl',
    model=model,
    processor=processor,
)
```

Results are written back to the same `.pkl` file as new columns `emotion_scores` and `fluency_score`.
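After evaluation, the scores can be inspected with plain pandas, assuming the `.pkl` holds a DataFrame as the column-based output suggests (the frame below is a purely illustrative stand-in, not real evaluation output):

```python
import pandas as pd

# Illustrative stand-in for an evaluated completions file;
# in practice you would use pd.read_pickle(<path>) instead.
df = pd.DataFrame({
    'completion': ['What a wonderful day!', 'It is fine, I suppose.'],
    'emotion_scores': [0.91, 0.34],  # illustrative scalar scores
    'fluency_score': [2, 1],         # 0–2 LLM-judge scale
})
mean_fluency = df['fluency_score'].mean()
```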
