
SAELogits - SAE Assisted Contrastive Steering

This project is part of my master's thesis at the German Aerospace Center (DLR). It implements a training-free approach for creating steering vectors for large language models (LLMs). We aim to improve upon existing training-free steering methods along two axes: (i) better steering outcomes and (ii) improved interpretability.

The core idea is to construct interpretable steering vectors from Sparse Autoencoder (SAE) latents: given contrastive sets of positive and negative prompts, the pipeline identifies SAE latents whose activation patterns are most characteristic of the target concept, validates them semantically via a logit lens projection and an LLM judge, and combines the selected latent decoder directions into a final steering vector.

We benchmark our approach in two settings. First, we adopt the benchmark from another DLR paper concerned with steering Ekman's six basic emotions. Second, we evaluate our method on Stanford's AxBench benchmark (see our AxBench fork). Across both benchmarks, our approach significantly outperforms other training-free methods.

There is no paper (yet), but results and analysis can be found in the docs. If you want to run our pipeline yourself, follow our setup below!

Overview

Our approach consists of five main stages:

Pipeline Stages

  1. Activation Extraction — Feed contrastive prompt sets (positive / negative) into the base LLM and collect residual-stream hidden states at a chosen layer.
  2. SAE Encoding — Pass the extracted activations through the SAE encoder to obtain sparse latent representations for both sets.
  3. Latent Ranking — Score each SAE latent by how much more strongly / frequently it activates on the positive set and select the top-k candidates.
  4. Logit Lens Projection — Decode each candidate latent direction back into vocabulary space via the unembedding matrix to get an interpretable token signature — no extra forward passes required.
  5. Semantic Validation — An LLM judge filters and re-ranks the candidates by semantic fit, then combines the selected decoder directions (weighted by activation strength and logit-lens signal) into the final steering vector.
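To make the data flow concrete, here is a minimal numerical sketch of the five stages. It uses toy dimensions, a random untrained SAE, and a plain difference-of-means ranking, so it is purely illustrative and not the repo's exact scoring:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, vocab, n = 16, 64, 32, 8   # toy sizes; real models are far larger

# Stage 1: residual-stream activations for positive / negative prompt sets
h_pos = rng.normal(size=(n, d_model))
h_neg = rng.normal(size=(n, d_model))

# Stage 2: encode through a (random, untrained) SAE with ReLU sparsity
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
z_pos = np.maximum(h_pos @ W_enc, 0.0)
z_neg = np.maximum(h_neg @ W_enc, 0.0)

# Stage 3: rank latents (here: simple difference of means) and keep the top-k
scores = z_pos.mean(axis=0) - z_neg.mean(axis=0)
top_k = np.argsort(scores)[::-1][:5]

# Stage 4: logit-lens signature -- project decoder rows through the unembedding
W_U = rng.normal(size=(d_model, vocab))
signatures = W_dec[top_k] @ W_U            # argmax per row ~ top tokens per latent

# Stage 5: combine selected decoder rows, weighted by activation strength
# (the LLM-judge filtering step is omitted in this sketch)
weights = z_pos[:, top_k].mean(axis=0)
steering_vector = (weights[:, None] * W_dec[top_k]).sum(axis=0)
```

The resulting `steering_vector` lives in the model's residual-stream space (`d_model`), so it can be added directly to hidden states at the chosen layer.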

Again, we refer to our documentation for a more detailed explanation.

Setup

The following steps get you up and running to reproduce our emotion steering experiments. For custom concepts, models, or SAEs, see Pipeline Usage below.

Installation

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

The datasets and evaluation prompts are stored via Git LFS. To download them:

git lfs install
git lfs pull

Environment Variables

Copy .env.example to .env, add your OpenAI API key, and set your device:

cp .env.example .env

All paths default to subdirectories of resources/ and do not need to be changed unless you want to use custom locations. The full list of configurable variables:

Variable                 Description
OPENAI_API_KEY           Required for the LLM judge in semantic validation
MODEL_STORE              Directory containing local model weights
SAE_STORE                Directory containing local SAE weights
DATASET_STORE            Directory with input datasets (.pkl)
ACTIVATION_STORE         Output for raw LLM activations
SAE_ACTIVATION_STORE     Output for SAE-encoded activations
STEERING_VECTOR_STORE    Output for steering vectors
GENERATION_STORE         Output for steered generations
EVAL_PROMPT_STORE        Input prompts for evaluation
RANKING_ANALYSIS_STORE   Output for latent ranking analysis
DEVICE                   PyTorch device (e.g. cuda:0)
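A minimal .env could look like the following (placeholder values; only the API key and device usually need to be set):

```shell
# Required
OPENAI_API_KEY=sk-...
DEVICE=cuda:0

# Optional overrides (default to subdirectories of resources/)
# MODEL_STORE=/data/models
# SAE_STORE=/data/saes
```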

Quick Start

Once the environment is set up, run the full emotion steering pipeline end-to-end:

python pipeline/activation_extraction.py
python pipeline/sae_activation_extraction.py
python pipeline/steering_vector_creation.py
python pipeline/generation.py
python pipeline/evaluation.py

This runs the pre-configured emotion steering example using google/gemma-2-9b-it and gemma-scope-9b-it-res-canonical. Afterwards, inspect the results in the notebooks:

  • notebooks/latent_ranking.ipynb — compare ranking mechanisms and candidate latents across our defined metrics
  • notebooks/steering_outcomes.ipynb — visualise emotion scores and fluency across concepts and steering strengths

Project Structure

SAELogits/
├── config.py                            # Central path & device configuration
├── .env                                 # Local path and API key overrides
├── pipeline/
│   ├── activation_extraction.py         # Stage 1: Extract LLM hidden states
│   ├── sae_activation_extraction.py     # Stage 2: Encode activations via SAE
│   ├── steering_vector_creation.py      # Stage 3–5: Rank, validate & build steering vectors
│   ├── generation.py                    # Steered text generation
│   └── evaluation.py                    # Emotion & fluency evaluation
├── notebooks/
│   ├── latent_ranking.ipynb             # Latent ranking analysis
│   └── steering_outcomes.ipynb          # Steering outcome visualisation
└── resources/
    ├── datasets/                        # Input datasets (.pkl)
    ├── activations/                     # Extracted LLM activations
    ├── sae_activations/                 # SAE-encoded activations
    ├── steering_vectors/                # Constructed steering vectors
    ├── ranking_analysis/                # Feature ranking analysis data
    ├── generations/                     # Steered model completions
    └── eval_prompts/                    # Prompts used for evaluation

Pipeline Usage

Supported Models & SAEs

A model name can be a local name in MODEL_STORE, a HuggingFace model ID, or an absolute path. The pipeline has been tested with:

Model                              SAE
google/gemma-2-9b-it               gemma-scope-9b-it-res-canonical (via SAELens)
meta-llama/Llama-3.1-8B-Instruct   andyrdt/saes-llama-3.1-8b-instruct (via HuggingFace)

SAEs can be loaded from SAELens (via release parameter), HuggingFace, or a local path in SAE_STORE.


Stage 1 – Activation Extraction

Extracts the last-token hidden states for each sample from the outgoing residual stream of the specified layers.

from pipeline.activation_extraction import extract_and_save_activations_batch

extract_and_save_activations_batch(
    dataset_name='goemotions',
    model_name='google/gemma-2-9b-it',
    layer_indices=[31],
    batch_size=32,
)

Output: ACTIVATION_STORE/<model>/<dataset>/activations_<i>.pkl


Stage 2 – SAE Activation Extraction

Encodes the extracted activations through a Sparse Autoencoder to obtain sparse feature representations.

from pipeline.sae_activation_extraction import load_sae, get_sae_activations

sae = load_sae(
    sae_id='layer_31/width_131k/canonical',
    release='gemma-scope-9b-it-res-canonical',
    device='cuda:0'
)

get_sae_activations(
    sae=sae,
    device='cuda:0',
    sae_layer=31,
    model_name='google/gemma-2-9b-it',
    dataset_name='goemotions',
)

Output: SAE_ACTIVATION_STORE/<model>/<dataset>/sae_activations_<file>.pkl


Stage 3 – Steering Vector Creation

Three methods are available:

SAE-based

Ranks SAE latents by their specificity to the target concept, computes logit-lens token predictions per latent, and uses an LLM judge to select a semantically pure subset. The final vector is a weighted sum of the selected SAE decoder directions.

from pipeline.steering_vector_creation import create_steering_vector

create_steering_vector(
    target_concept='joy',
    activation_path=SAE_ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    sae_model=sae,
    model_name='google/gemma-2-9b-it',
    layer=31,
    ranking_methods=['power_alpha'],
    alphas_sweep=[1.0],
    llm_model='gpt-4o',
)

Output: STEERING_VECTOR_STORE/<model>/layer_<n>/sae_logits/<concept>.pkl

This is the core method and the primary extension point of our pipeline. The two main control knobs for adapting the pipeline to new concepts or use cases are:

  • Ranking mechanism — the ranking_methods and alphas_sweep parameters control how SAE latents are scored and selected. New ranking strategies can be added to rank_sae_features in steering_vector_creation.py and plugged in via ranking_methods.
  • LLM judge prompt — the prompt passed to the judge (in create_steering_vector) determines the filtering and re-ranking criteria. Adjusting it is the most direct way to tune semantic precision or to target different concept types.

In addition to the steering vector itself, create_steering_vector always saves a companion analysis file (<concept>_analysis.pkl) at the same location. It contains the full ranked candidate list, per-latent logit-lens token signatures, the raw LLM judge response, the final selected indices, and the per-latent weights used to construct the vector — making every steering vector fully reproducible and inspectable.

Available ranking methods: power_alpha, additive, harmonic, diff_mean, frequency, magnitude
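As a rough illustration of how such scoring functions can differ, here is a sketch of a few of them. The formulas below are illustrative guesses, not the repo's definitions; see rank_sae_features in steering_vector_creation.py for the real ones:

```python
import numpy as np

def rank_scores(z_pos: np.ndarray, z_neg: np.ndarray,
                method: str, alpha: float = 1.0) -> np.ndarray:
    """Per-latent scores over SAE activations of shape (samples, latents)."""
    if method == "diff_mean":    # difference of mean activations
        return z_pos.mean(0) - z_neg.mean(0)
    if method == "frequency":    # how much more often the latent fires
        return (z_pos > 0).mean(0) - (z_neg > 0).mean(0)
    if method == "magnitude":    # mean activation on positives only
        return z_pos.mean(0)
    if method == "power_alpha":  # magnitudes tempered by alpha before contrasting
        return z_pos.mean(0) ** alpha - z_neg.mean(0) ** alpha
    raise ValueError(f"unknown method: {method}")

# Toy sparse activations: positives fire slightly more strongly
rng = np.random.default_rng(1)
z_pos = np.maximum(rng.normal(0.5, 1.0, size=(32, 128)), 0.0)
z_neg = np.maximum(rng.normal(0.0, 1.0, size=(32, 128)), 0.0)
top = np.argsort(rank_scores(z_pos, z_neg, "diff_mean"))[::-1][:10]
```

The choice of scoring function trades off sensitivity to activation strength versus firing frequency, which is exactly what the `alphas_sweep` parameter lets you explore.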

Baseline (mean difference)

Computes the difference of mean activations between target and contrastive samples.

from pipeline.steering_vector_creation import create_steering_vector_baseline

create_steering_vector_baseline(
    target_concept='joy',
    activation_path=ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    model_name='google/gemma-2-9b-it',
    layer=31,
)

Output: STEERING_VECTOR_STORE/<model>/layer_<n>/baseline/<concept>.pkl

Latent Ranking Analysis (optional)

To compare all ranking mechanisms and inspect which latents each method surfaces, use create_latent_ranking_analysis. Unlike create_steering_vector, this runs all ranking methods over a full alpha sweep and saves detailed per-latent information for downstream analysis in notebooks/latent_ranking.ipynb.

from pipeline.steering_vector_creation import create_latent_ranking_analysis

create_latent_ranking_analysis(
    target_concept='joy',
    activation_path=SAE_ACTIVATION_STORE / 'google--gemma-2-9b-it' / 'goemotions',
    sae_model=sae,
    model_name='google/gemma-2-9b-it',
    layer=31,
    alphas_sweep=[0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0],
    ranking_methods=['power_alpha', 'additive', 'harmonic', 'diff_mean', 'frequency', 'magnitude'],
    llm_model='gpt-4o',
)

Output: RANKING_ANALYSIS_STORE/<model>/layer_<n>/concept_<concept>_analysis.pkl


Stage 4 – Steered Generation

Injects a steering vector into a specified layer during inference and generates completions for a set of prompts.

from pipeline.generation import run_experiment

run_experiment(
    model_name='google/gemma-2-9b-it',
    sv_path='path/to/joy.pkl',
    prompt_path=EVAL_PROMPT_STORE / 'subjective_prompts.csv',
    layer_idx=31,
    alpha=200.0,
    experiment_name='gemma/goemo/sae_logits/joy',
    device='cuda:0',
)

Output: GENERATION_STORE/<experiment_name>/<model>__layer_<n>__alpha_<a>.pkl
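Mechanically, this kind of injection can be done with a PyTorch forward hook that adds alpha times the steering vector to the chosen layer's residual output. The following is a generic sketch with a toy module, not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy stand-in for a transformer layer passing the residual stream through."""
    def forward(self, x):
        return x

model = nn.Sequential(Block(), Block())
steering_vector = torch.randn(16)
alpha = 200.0

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output
    return output + alpha * steering_vector

handle = model[1].register_forward_hook(add_steering)  # hook the "steered" layer
x = torch.zeros(3, 4, 16)                              # (batch, seq, d_model)
y = model(x)
handle.remove()                                        # detach after generation
```

Removing the handle afterwards is important so that unsteered generations are not contaminated by a lingering hook.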


Stage 5 – Evaluation

Evaluates generated completions on two axes:

  • Emotion scores — via j-hartmann/emotion-english-distilroberta-base (HuggingFace repo)
  • Fluency score — via an LLM judge (Gemma 3 12B) on a 0–2 scale (similar to AxBench)

from pipeline.evaluation import evaluate_emotions, evaluate_fluency_llm_judge

evaluate_emotions(pickle_path='path/to/completions.pkl', device_map='cuda:0')

evaluate_fluency_llm_judge(
    pickle_path='path/to/completions.pkl',
    model=model,          # LLM judge (Gemma 3 12B), loaded beforehand
    processor=processor,  # matching tokenizer/processor
)

Results are written back to the same .pkl file as new columns emotion_scores and fluency_score.
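For a quick look at the evaluation output, something like the following works. The toy DataFrame below only mimics the expected columns (its contents, and the exact schema of emotion_scores, are invented for illustration); in practice you would load the completions file with pd.read_pickle:

```python
import pandas as pd

# Toy stand-in for the evaluated completions file; in practice:
# df = pd.read_pickle('path/to/completions.pkl')
df = pd.DataFrame({
    "completion": ["I feel great!", "What a day."],
    "emotion_scores": [{"joy": 0.91, "sadness": 0.02},
                       {"joy": 0.40, "sadness": 0.30}],
    "fluency_score": [2, 2],
})

# Aggregate per-emotion scores and fluency across completions
mean_joy = df["emotion_scores"].apply(lambda s: s["joy"]).mean()
mean_fluency = df["fluency_score"].mean()
```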
