
SAELogits - SAE Assisted Contrastive Steering

This project is part of my master's thesis at the German Aerospace Center (DLR). It implements a training-free approach for creating steering vectors for large language models (LLMs). We aim to improve upon existing training-free steering methods along two axes: (i) better steering outcomes and (ii) improved interpretability.

The core idea is to construct interpretable steering vectors from Sparse Autoencoder (SAE) latents: given contrastive sets of positive and negative prompts, the pipeline identifies SAE latents whose activation patterns are most characteristic of the target concept, validates them semantically via a logit lens projection and an LLM judge, and combines the selected latent decoder directions into a final steering vector.

We benchmark our approach in two settings. First, we adopt the benchmark from another DLR paper concerned with steering Ekman's six basic emotions. Second, we evaluate our method on Stanford's AxBench benchmark (see our AxBench fork). Across both benchmarks, our approach significantly outperforms other training-free methods.

There is no paper (yet), but results and analysis can be found in the docs. If you want to run our pipeline yourself, follow our setup below!

Overview

Our approach consists of five main stages:

Pipeline Stages

  1. Activation Extraction — Feed contrastive prompt sets (positive / negative) into the base LLM and collect residual-stream hidden states at a chosen layer.
  2. SAE Encoding — Pass the extracted activations through the SAE encoder to obtain sparse latent representations for both sets.
  3. Latent Ranking — Score each SAE latent by how much more strongly / frequently it activates on the positive set and select the top-k candidates.
  4. Logit Lens Projection — Decode each candidate latent direction back into vocabulary space via the unembedding matrix to get an interpretable token signature — no extra forward passes required.
  5. Semantic Validation — An LLM judge filters and re-ranks the candidates by semantic fit, then combines the selected decoder directions (weighted by activation strength and logit-lens signal) into the final steering vector.
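To make the data flow concrete, here is a minimal numerical sketch of the five stages. It uses toy dimensions, a random untrained SAE, and a plain difference-of-means ranking, so it is purely illustrative and not the repo's exact scoring:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, vocab, n = 16, 64, 32, 8   # toy sizes; real models are far larger

# Stage 1: residual-stream activations for positive / negative prompt sets
h_pos = rng.normal(size=(n, d_model))
h_neg = rng.normal(size=(n, d_model))

# Stage 2: encode through a (random, untrained) SAE with ReLU sparsity
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
z_pos = np.maximum(h_pos @ W_enc, 0.0)
z_neg = np.maximum(h_neg @ W_enc, 0.0)

# Stage 3: rank latents (here: simple difference of means) and keep the top-k
scores = z_pos.mean(axis=0) - z_neg.mean(axis=0)
top_k = np.argsort(scores)[::-1][:5]

# Stage 4: logit-lens signature -- project decoder rows through the unembedding
W_U = rng.normal(size=(d_model, vocab))
signatures = W_dec[top_k] @ W_U            # argmax per row ~ top tokens per latent

# Stage 5: combine selected decoder rows, weighted by activation strength
# (the LLM-judge filtering step is omitted in this sketch)
weights = z_pos[:, top_k].mean(axis=0)
steering_vector = (weights[:, None] * W_dec[top_k]).sum(axis=0)
```

The resulting `steering_vector` lives in the model's residual-stream space (`d_model`), so it can be added directly to hidden states at the chosen layer.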

Again, we refer to our documentation for a more detailed explanation.

Setup

The following steps get you up and running to reproduce our emotion steering experiments. For custom concepts, models, or SAEs, see Pipeline Usage below.

Installation

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The datasets and evaluation prompts are stored via Git LFS. To download them:

git lfs install
git lfs pull

Environment Variables

Copy .env.example to .env, add your OpenAI API key, and set your device:

cp .env.example .env

All paths default to subdirectories of resources/ and do not need to be changed unless you want to use custom locations. The full list of configurable variables:

Variable                 Description
OPENAI_API_KEY           Required for the LLM judge in semantic validation
MODEL_STORE              Directory containing local model weights
SAE_STORE                Directory containing local SAE weights
DATASET_STORE            Directory with input datasets (.pkl)
ACTIVATION_STORE         Output for raw LLM activations
SAE_ACTIVATION_STORE     Output for SAE-encoded activations
STEERING_VECTOR_STORE    Output for steering vectors
GENERATION_STORE         Output for steered generations
EVAL_PROMPT_STORE        Input prompts for evaluation
RANKING_ANALYSIS_STORE   Output for latent ranking analysis
DEVICE                   PyTorch device (e.g. cuda:0)
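A minimal .env could look like the following (placeholder values; only the API key and device usually need to be set):

```shell
# Required
OPENAI_API_KEY=sk-...
DEVICE=cuda:0

# Optional overrides (default to subdirectories of resources/)
# MODEL_STORE=/data/models
# SAE_STORE=/data/saes
```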

Quick Start

Once the environment is set up, run the full emotion steering pipeline end-to-end:

python pipeline/activation_extraction.py
python pipeline/sae_activation_extraction.py
python pipeline/steering_vector_creation.py
python pipeline/generation.py
python pipeline/evaluation.py

This runs the pre-configured emotion steering example using google/gemma-2-9b-it and gemma-scope-9b-it-res-canonical. Afterwards, inspect the results in the notebooks:

  • notebooks/latent_ranking.ipynb — compare ranking mechanisms and candidate latents across our defined metrics
  • notebooks/steering_outcomes.ipynb — visualise emotion scores and fluency across concepts and steering strengths

Project Structure

SAELogits/
├── config.py                            # Central path & device configuration
├── .env                                 # Local path and API key overrides
├── pipeline/
│   ├── activation_extraction.py         # Stage 1: Extract LLM hidden states
│   ├── sae_activation_extraction.py     # Stage 2: Encode activations via SAE
│   ├── steering_vector_creation.py      # Stage 3–5: Rank, validate & build steering vectors
│   ├── generation.py                    # Steered text generation
│   └── evaluation.py                    # Emotion & fluency evaluation
├── notebooks/
│   ├── latent_ranking.ipynb             # Latent ranking analysis
│   └── steering_outcomes.ipynb          # Steering outcome visualisation
└── resources/
    ├── datasets/                        # Input datasets (.pkl)
    ├── activations/                     # Extracted LLM activations
    ├── sae_activations/                 # SAE-encoded activations
    ├── steering_vectors/                # Constructed steering vectors
    ├── ranking_analysis/                # Feature ranking analysis data
    ├── generations/                     # Steered model completions
    └── eval_prompts/                    # Prompts used for evaluation

Pipeline Usage

Supported Models & SAEs

A model name can be a local name in MODEL_STORE, a HuggingFace model ID, or an absolute path. The pipeline has been tested with:

Model                              SAE
google/gemma-2-9b-it               gemma-scope-9b-it-res-canonical (via SAELens)
meta-llama/Llama-3.1-8B-Instruct   andyrdt/saes-llama-3.1-8b-instruct (via HuggingFace)

SAEs can be loaded from SAELens (via release parameter), HuggingFace, or a local path in SAE_STORE.


Stage 1 – Activation Extraction

Extracts the last-token hidden states for each sample from the outgoing residual stream of the specified layers.

from pipeline.activation_extraction import extract_and_save_activations_batch

extract_and_save_activations_batch(
    dataset_name='goemotions',
    model_name='google/gemma-2-9b-it',
    layer_indices=[31],
    batch_size=32,
)

Output: ACTIVATION_STORE/<model>/<dataset>/activations_<i>.pkl


Stage 2 – SAE Activation Extraction

Encodes the extracted activations through a Sparse Autoencoder to obtain sparse feature representations.

from pipeline.sae_activation_extraction import load_sae, get_sae_activations

sae = load_sae(
    sae_id='layer_31/width_131k/canonical',
    release='gemma-scope-9b-it-res-canonical',
    device='cuda:0'
)

get_sae_activations(
    sae=sae,
    device='cuda:0',
    sae_layer=31,
    model_name='google/gemma-2-9b-it',
    dataset_name='goemotions',
)

Output: SAE_ACTIVATION_STORE/<model>/<dataset>/sae_activations_<file>.pkl


Stage 3 – Steering Vector Creation

Three methods are available:

SAE-based

Ranks SAE latents by their specificity to the target concept, computes logit-lens token predictions per latent, and uses an LLM judge to select a semantically pure subset. The final vector is a weighted sum of the selected SAE decoder directions.

from pipeline.steering_vector_creation import create_steering_vector

create_steering_vector(
    target_concept='joy',
    activation_path=SAE_ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    sae_model=sae,
    model_name='google/gemma-2-9b-it',
    layer=31,
    ranking_methods=['power_alpha'],
    alphas_sweep=[1.0],
    llm_model='gpt-4o',
)

Output: STEERING_VECTOR_STORE/<model>/layer_<n>/sae_logits/<concept>.pkl

This is the core method and the primary extension point of our pipeline. The two main control knobs for adapting the pipeline to new concepts or use cases are:

  • Ranking mechanism — the ranking_methods and alphas_sweep parameters control how SAE latents are scored and selected. New ranking strategies can be added to rank_sae_features in steering_vector_creation.py and plugged in via ranking_methods.
  • LLM judge prompt — the prompt passed to the judge (in create_steering_vector) determines the filtering and re-ranking criteria. Adjusting it is the most direct way to tune semantic precision or to target different concept types.

In addition to the steering vector itself, create_steering_vector always saves a companion analysis file (<concept>_analysis.pkl) at the same location. It contains the full ranked candidate list, per-latent logit-lens token signatures, the raw LLM judge response, the final selected indices, and the per-latent weights used to construct the vector — making every steering vector fully reproducible and inspectable.

Available ranking methods: power_alpha, additive, harmonic, diff_mean, frequency, magnitude
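As a rough illustration of how such scoring functions can differ, here is a sketch of a few of them. The formulas below are illustrative guesses, not the repo's definitions; see rank_sae_features in steering_vector_creation.py for the real ones:

```python
import numpy as np

def rank_scores(z_pos: np.ndarray, z_neg: np.ndarray,
                method: str, alpha: float = 1.0) -> np.ndarray:
    """Per-latent scores over SAE activations of shape (samples, latents)."""
    if method == "diff_mean":    # difference of mean activations
        return z_pos.mean(0) - z_neg.mean(0)
    if method == "frequency":    # how much more often the latent fires
        return (z_pos > 0).mean(0) - (z_neg > 0).mean(0)
    if method == "magnitude":    # mean activation on positives only
        return z_pos.mean(0)
    if method == "power_alpha":  # magnitudes tempered by alpha before contrasting
        return z_pos.mean(0) ** alpha - z_neg.mean(0) ** alpha
    raise ValueError(f"unknown method: {method}")

# Toy sparse activations: positives fire slightly more strongly
rng = np.random.default_rng(1)
z_pos = np.maximum(rng.normal(0.5, 1.0, size=(32, 128)), 0.0)
z_neg = np.maximum(rng.normal(0.0, 1.0, size=(32, 128)), 0.0)
top = np.argsort(rank_scores(z_pos, z_neg, "diff_mean"))[::-1][:10]
```

The choice of scoring function trades off sensitivity to activation strength versus firing frequency, which is exactly what the `alphas_sweep` parameter lets you explore.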

Baseline (mean difference)

Computes the difference of mean activations between target and contrastive samples.

from pipeline.steering_vector_creation import create_steering_vector_baseline

create_steering_vector_baseline(
    target_concept='joy',
    activation_path=ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    model_name='google/gemma-2-9b-it',
    layer=31,
)

Output: STEERING_VECTOR_STORE/<model>/layer_<n>/baseline/<concept>.pkl

Latent Ranking Analysis (optional)

To compare all ranking mechanisms and inspect which latents each method surfaces, use create_latent_ranking_analysis. Unlike create_steering_vector, this runs all ranking methods over a full alpha sweep and saves detailed per-latent information for downstream analysis in notebooks/latent_ranking.ipynb.

from pipeline.steering_vector_creation import create_latent_ranking_analysis

create_latent_ranking_analysis(
    target_concept='joy',
    activation_path=SAE_ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    sae_model=sae,
    model_name='google/gemma-2-9b-it',
    layer=31,
    alphas_sweep=[0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0],
    ranking_methods=['power_alpha', 'additive', 'harmonic', 'diff_mean', 'frequency', 'magnitude'],
    llm_model='gpt-4o',
)

Output: RANKING_ANALYSIS_STORE/<model>/layer_<n>/concept_<concept>_analysis.pkl


Stage 4 – Steered Generation

Injects a steering vector into a specified layer during inference and generates completions for a set of prompts.

from pipeline.generation import run_experiment

run_experiment(
    model_name='google/gemma-2-9b-it',
    sv_path='path/to/joy.pkl',
    prompt_path=EVAL_PROMPT_STORE / 'subjective_prompts.csv',
    layer_idx=31,
    alpha=200.0,
    experiment_name='gemma/goemo/sae_logits/joy',
    device='cuda:0',
)

Output: GENERATION_STORE/<experiment_name>/<model>__layer_<n>__alpha_<a>.pkl
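Mechanically, this kind of injection can be done with a PyTorch forward hook that adds alpha times the steering vector to the chosen layer's residual output. The following is a generic sketch with a toy module, not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy stand-in for a transformer layer passing the residual stream through."""
    def forward(self, x):
        return x

model = nn.Sequential(Block(), Block())
steering_vector = torch.randn(16)
alpha = 200.0

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output
    return output + alpha * steering_vector

handle = model[1].register_forward_hook(add_steering)  # hook the "steered" layer
x = torch.zeros(3, 4, 16)                              # (batch, seq, d_model)
y = model(x)
handle.remove()                                        # detach after generation
```

Removing the handle afterwards is important so that unsteered generations are not contaminated by a lingering hook.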


Stage 5 – Evaluation

Evaluates generated completions on two axes:

  • Emotion scores — via j-hartmann/emotion-english-distilroberta-base (HuggingFace repo)
  • Fluency score — via an LLM judge (Gemma 3 12B) on a 0–2 scale (similar to AxBench)

from pipeline.evaluation import evaluate_emotions, evaluate_fluency_llm_judge

evaluate_emotions(pickle_path='path/to/completions.pkl', device_map='cuda:0')

evaluate_fluency_llm_judge(
    pickle_path='path/to/completions.pkl',
    model=model,          # LLM judge (Gemma 3 12B), loaded beforehand
    processor=processor,  # matching tokenizer/processor
)

Results are written back to the same .pkl file as new columns emotion_scores and fluency_score.
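For a quick look at the evaluation output, something like the following works. The toy DataFrame below only mimics the expected columns (its contents, and the exact schema of emotion_scores, are invented for illustration); in practice you would load the completions file with pd.read_pickle:

```python
import pandas as pd

# Toy stand-in for the evaluated completions file; in practice:
# df = pd.read_pickle('path/to/completions.pkl')
df = pd.DataFrame({
    "completion": ["I feel great!", "What a day."],
    "emotion_scores": [{"joy": 0.91, "sadness": 0.02},
                       {"joy": 0.40, "sadness": 0.30}],
    "fluency_score": [2, 2],
})

# Aggregate per-emotion scores and fluency across completions
mean_joy = df["emotion_scores"].apply(lambda s: s["joy"]).mean()
mean_fluency = df["fluency_score"].mean()
```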
