Mechanistic interpretability (often shortened to mech interp or MI) is a subfield of research within explainable artificial intelligence which seeks to fully reverse-engineer neural networks (akin to reverse-engineering a compiled binary of a computer program), with the ultimate goal of understanding the mechanisms underlying their computations.[1][2][3] The field is particularly focused on large language models.
History
Chris Olah is generally credited with coining the term "mechanistic interpretability" and spearheading early development of the field.[4] In the 2018 paper The Building Blocks of Interpretability,[5] Olah (then at Google Brain) and his colleagues combined existing interpretability techniques, including feature visualization, dimensionality reduction, and attribution, with human-computer interface methods to explore the features represented by neurons in the vision model Inception v1. In the March 2020 paper Zoom In: An Introduction to Circuits,[6] Olah and the OpenAI Clarity team described "an approach inspired by neuroscience or cellular biology", hypothesizing that features, like individual cells, are the basis of computation for neural networks and connect to form circuits, which can be understood as "sub-graphs in a network". In this paper, the authors described their line of work as understanding the "mechanistic implementations of neurons in terms of their weights".
In 2021, Chris Olah co-founded the company Anthropic and established its Interpretability team, which publishes its results on the Transformer Circuits Thread.[7] In December 2021, the team published A Mathematical Framework for Transformer Circuits, reverse-engineering toy attention-only transformers with one and two layers. Notably, they discovered the complete algorithm of induction circuits, responsible for in-context learning of repeated token sequences. The team further elaborated on this result in the March 2022 paper In-context Learning and Induction Heads.[8]
Notable results in mechanistic interpretability from 2022 include the theory of superposition wherein a model represents more features than there are directions in its representation space;[9] a mechanistic explanation for grokking, the phenomenon where test-set loss begins to decay only after a delay relative to training-set loss;[10] and the introduction of sparse autoencoders, a sparse dictionary learning method to extract interpretable features from LLMs.[11][12]
Mechanistic interpretability has garnered significant interest, talent, and funding in the AI safety community. In 2021, Open Philanthropy called for proposals that advanced "mechanistic understanding of neural networks", alongside other projects aimed at reducing risks from advanced AI systems.[13] The interpretability topic prompt in the request for proposals was written by Chris Olah.[14] The ML Alignment & Theory Scholars (MATS) program, a research seminar focused on AI alignment, has historically supported numerous projects in mechanistic interpretability. In its summer 2023 cohort, for example, 20% of the research projects were on mechanistic interpretability.[15]
Many organizations and research groups work on mechanistic interpretability, often with the stated goal of improving AI safety. Max Tegmark runs the Tegmark AI Safety Group at MIT, which focuses on mechanistic interpretability.[16] In February 2023, Neel Nanda started the mechanistic interpretability team at Google DeepMind. Apollo Research, an AI evals organization with a focus on interpretability research, was founded in May 2023.[17] EleutherAI has published multiple papers on interpretability.[18] Goodfire, an AI interpretability startup, was founded in 2024.[19]
In recent years, mechanistic interpretability has grown considerably in scope, number of practitioners, and attention within the ML community. In July 2024, the first ICML Mechanistic Interpretability Workshop was held, aiming to bring together "separate threads of work in industry and academia".[20] In November 2024, Chris Olah discussed mechanistic interpretability on the Lex Fridman podcast as part of the Anthropic team.[21]
Definition
The term mechanistic interpretability designates both a class of technical methods and a research community. Chris Olah is usually credited with coining the term. His motivation was to differentiate this nascent approach to interpretability from the established saliency-map-based approaches that dominated computer vision at the time.[22]
In-field explanations of the goal of mechanistic interpretability make an analogy to reverse-engineering computer programs,[3][23] the argument being that, rather than being arbitrary functions, neural networks are composed of independent, reverse-engineerable mechanisms that are compressed into the weights.
Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.
— Chris Olah, "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases"[1] [emphasis added]
One emerging approach for understanding the internals of neural networks is mechanistic interpretability: reverse engineering the algorithms implemented by neural networks into human-understandable mechanisms, often by examining the weights and activations of neural networks to identify circuits [Cammarata et al., 2020, Elhage et al., 2021] that implement particular behaviors.
— Mechanistic Interpretability Workshop 2024[24] [emphasis added]
Mechanistic interpretability's early development was rooted in the AI safety community, though the term is increasingly adopted by academia at large. In “Mechanistic?”, Saphra and Wiegreffe identify four senses of “mechanistic interpretability”:[4]
- Narrow technical definition: A technical approach to understanding neural networks through their causal mechanisms.
- Broad technical definition: Any research that describes the internals of a model, including its activations or weights.
- Narrow cultural definition: Any research originating from the MI community.
- Broad cultural definition: Any research in the field of AI—especially LM—interpretability.
As the scope and popular recognition of mechanistic interpretability increase, many[who?] have begun to recognize that other communities such as natural language processing researchers have pursued similar objectives in their work.
Key concepts
[edit]Linear representation
The latent space of an LLM is generally conceptualized as a vector space. There is a growing body of work around the linear representation hypothesis, which posits that high-level features correspond to approximately linear directions in a model's activation space.[25][26]
Superposition
Superposition is the phenomenon where many unrelated features are "packed" into the same subspace or even into single neurons, making a network highly over-complete yet still linearly decodable after nonlinear filtering.[9] Recent formal analysis links the amount of polysemanticity to feature "capacity" and input sparsity, predicting when neurons become monosemantic or remain polysemantic.[27]
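The interaction between superposition and sparsity can be illustrated with a small numerical sketch; the feature count, dimensionality, and sparsity level below are illustrative choices, not figures from the cited papers. Many more features than dimensions are stored along random, nearly orthogonal directions, and a linear read-out followed by a ReLU recovers the sparsely active features with modest interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims, p_active = 200, 40, 0.02   # many features, few dimensions

# Random unit directions: with n_features > n_dims they cannot be orthogonal,
# but in moderately high dimension they are nearly orthogonal.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse feature vector: each feature is active only ~2% of the time.
f = rng.uniform(size=n_features) * (rng.uniform(size=n_features) < p_active)

x = f @ W                        # superposed representation in n_dims dimensions
f_hat = np.maximum(x @ W.T, 0)   # linear read-out; the ReLU filters most interference

active = f > 0
print("mean error on active features:  ", np.abs(f_hat[active] - f[active]).mean())
print("mean noise on inactive features:", f_hat[~active].mean())
```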
Methods
[edit]Probing
Probing involves training a linear classifier on model activations to test whether a feature is linearly decodable at a given layer or subset of neurons.[28] Generally, a linear probe is trained on a labelled dataset encoding the desired feature. While linear probes are popular amongst mechanistic interpretability researchers, their introduction dates back to 2016 and they have been widely used in the NLP community.[29]
Nanda, Lee & Wattenberg (2023) showed that world-model features, such as the board state in the game-playing sequence model Othello-GPT, emerge as linearly decodable directions during training, strengthening the case for linear probes as faithful monitors of internal state.
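A minimal sketch of a linear probe, using synthetic stand-ins for the cached activations and labels (the array names, sizes, and the planted direction are illustrative, not drawn from any specific paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_model = 2000, 512

# Stand-ins: in practice `activations` would be residual-stream activations cached
# at one layer, and `labels` the feature of interest (e.g. statement truth value).
labels = rng.integers(0, 2, size=n)
planted_direction = rng.normal(size=d_model)          # pretend ground-truth direction
activations = rng.normal(size=(n, d_model)) + np.outer(labels, planted_direction)

# The probe is an ordinary linear classifier trained on the activations.
probe = LogisticRegression(max_iter=1000).fit(activations[:1500], labels[:1500])
print("held-out probe accuracy:", probe.score(activations[1500:], labels[1500:]))

# Its weight vector can be read as a candidate direction for the probed feature.
feature_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```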
Difference-in-means
Difference-in-means, or diff-in-means, constructs a steering vector by subtracting the mean activation for one class of examples from the mean for another. Unlike learned probes, diff-in-means has no trainable parameters and often generalises better out-of-distribution.[30] Diff-in-means has been used to isolate model representations of refusal/compliance, truth/falsehood, and sentiment.[31][30][32]
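A minimal sketch of the difference-in-means construction on synthetic stand-in activations (the variable names and the artificial class offset are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Stand-ins for activations cached at one layer on two classes of prompts,
# e.g. prompts the model refuses versus prompts it complies with.
offset = rng.normal(size=d_model)                     # artificial class difference
acts_refuse = rng.normal(size=(200, d_model)) + offset
acts_comply = rng.normal(size=(200, d_model))

# No trainable parameters: the direction is simply the difference of class means.
direction = acts_refuse.mean(axis=0) - acts_comply.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projections onto the direction separate the two classes.
print("mean projection, refuse prompts:", (acts_refuse @ direction).mean())
print("mean projection, comply prompts:", (acts_comply @ direction).mean())
```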
Steering
Steering adds or subtracts a direction (often obtained via probing, diff-in-means, or K-means) from the residual stream to causally change model behaviour.
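A minimal PyTorch sketch of activation steering with a forward hook; the single linear layer stands in for a transformer block whose output feeds the residual stream, and the direction and coefficient are illustrative:

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 64
block = nn.Linear(d_model, d_model)       # stand-in for one transformer block

# A unit-norm steering direction, e.g. from a probe, diff-in-means, or K-means.
direction = torch.randn(d_model)
direction /= direction.norm()

def steering_hook(module, inputs, output, alpha=4.0):
    # Add alpha times the direction to the block's output at every position.
    return output + alpha * direction

handle = block.register_forward_hook(steering_hook)
x = torch.randn(2, 5, d_model)            # (batch, seq, d_model)
steered = block(x)
handle.remove()                           # restore unmodified behaviour
unsteered = block(x)
print("mean shift along direction:", ((steered - unsteered) @ direction).mean().item())
```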
Activation/attribution patching
Activation patching replaces or ablates the activations of model components or neurons to trace which parts of the model are causally necessary for a behaviour. Automated circuit discovery (ACDC) prunes the computational graph by iteratively patch-testing edges, localising minimal sub-circuits without manual inspection.[33]
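A minimal sketch of activation patching on a toy network; in a real setting the components are attention heads or MLP blocks, the inputs are clean and corrupted prompts, and the metric is typically a logit difference. All names here are illustrative:

```python
import torch
from torch import nn

torch.manual_seed(0)
d = 16
embed = nn.Linear(d, d)
head_a = nn.Linear(d, d)     # two components writing into the residual stream
head_b = nn.Linear(d, d)
unembed = nn.Linear(d, 1)

def run(x, patch_a=None):
    h = embed(x)
    a = head_a(h) if patch_a is None else patch_a   # optionally patch component A
    return unembed(h + a + head_b(h))

clean, corrupted = torch.randn(d), torch.randn(d)
clean_a = head_a(embed(clean))          # cache component A's activation on the clean run

out_clean = run(clean)
out_corrupted = run(corrupted)
out_patched = run(corrupted, patch_a=clean_a)   # corrupted run with A patched from clean

# Fraction of the clean behaviour restored by patching only component A: a large value
# suggests A is causally important for the behaviour being studied.
restored = (out_patched - out_corrupted) / (out_clean - out_corrupted)
print("fraction restored by patching A:", restored.item())
```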
Sparse decomposition methods
Sparse autoencoders (SAEs) and dictionary learning
Sparse autoencoders (SAEs) for mechanistic interpretability were proposed to address the superposition problem by decomposing the feature space into an overcomplete basis (i.e. with more features than dimensions) of monosemantic concepts. The underlying intuition is that features can only be manipulable under superposition if they are sparsely activated (otherwise, interference between features would be too high).[11]
Given a vector $x \in \mathbb{R}^{d}$ representing an activation collected from some model component (in a transformer, usually the MLP inner activation or the residual stream), the sparse autoencoder computes the following: $z = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}})$, $\hat{x} = W_{\mathrm{dec}} z + b_{\mathrm{dec}}$. Here, $W_{\mathrm{enc}} \in \mathbb{R}^{n \times d}$ projects the activation into an $n$-dimensional latent space (with $n > d$), $\mathrm{ReLU}$ applies the nonlinearity, and finally the decoder $W_{\mathrm{dec}} \in \mathbb{R}^{d \times n}$ aims to reconstruct the original activation from this latent representation. The bias terms are $b_{\mathrm{enc}}$ and $b_{\mathrm{dec}}$; the latter is omitted in some formulations. The encoder and decoder matrices may also be tied.
Given a dataset of activations $\{x_i\}$, the SAE is trained with gradient descent to minimise the following loss function: $\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$, where the first term is the reconstruction loss (i.e. the standard autoencoding objective) and the second is a sparsity loss on the latent representation which aims to minimise its $\ell_1$-norm, with $\lambda$ controlling the trade-off.
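A minimal PyTorch sketch of a sparse autoencoder matching the formulation above; the layer sizes, expansion factor, and sparsity coefficient are illustrative:

```python
import torch
from torch import nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)   # W_enc x + b_enc
        self.decoder = nn.Linear(d_latent, d_model)   # W_dec z + b_dec

    def forward(self, x):
        z = torch.relu(self.encoder(x))               # sparse latent representation
        x_hat = self.decoder(z)                       # reconstruction of the activation
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    reconstruction = (x - x_hat).pow(2).sum(-1).mean()    # ||x - x_hat||^2
    sparsity = z.abs().sum(-1).mean()                     # ||z||_1
    return reconstruction + l1_coeff * sparsity

# One training step on a batch of cached activations (random stand-ins here).
sae = SparseAutoencoder(d_model=512, d_latent=4096)       # overcomplete: 8x expansion
x = torch.randn(64, 512)
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
loss.backward()
```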
Alternative designs
Several works motivate alternative nonlinearities to ReLU based on improved downstream performance or training stability.
- TopK, which keeps only the top-$k$ most strongly activating latents and zeroes out the rest; this also allows the sparsity loss to be dropped entirely.[34]
- JumpReLU, defined as $\mathrm{JumpReLU}_{\theta}(z) = z \cdot H(z - \theta)$, where $H$ is the Heaviside step function and $\theta$ a learnable threshold (both nonlinearities are sketched below).[35]
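A minimal sketch of the two alternative nonlinearities; the shapes and the fixed threshold are illustrative, and in practice the JumpReLU threshold is a learnable per-latent parameter:

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k largest pre-activations per example; zero out the rest.
    values, indices = pre_acts.topk(k, dim=-1)
    return torch.zeros_like(pre_acts).scatter(-1, indices, values)

def jumprelu(pre_acts: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # z * H(z - theta): pass a value through only if it exceeds the threshold theta.
    return pre_acts * (pre_acts > theta)

pre_acts = torch.randn(4, 16)                        # stand-in encoder pre-activations
print(topk_activation(pre_acts, k=3))
print(jumprelu(pre_acts, theta=torch.full((16,), 0.5)))
```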
Evaluation
The core metrics for evaluating SAEs are sparsity, measured by the $\ell_0$-norm of the latent representations over the dataset, and fidelity, which may be the MSE reconstruction error as in the loss function, or a downstream metric obtained by substituting the SAE output back into the model, such as loss recovered or the KL-divergence from the original model behaviour.[36]
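A minimal sketch of these two metrics on a batch of activations and SAE outputs; the names are illustrative, and loss recovered additionally requires running the full model with the reconstruction spliced in, which is omitted here:

```python
import torch

def l0_sparsity(z: torch.Tensor) -> float:
    # Average number of active (non-zero) latents per example.
    return (z != 0).float().sum(-1).mean().item()

def fraction_variance_unexplained(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    # MSE-based fidelity, normalised by the variance of the original activations.
    mse = (x - x_hat).pow(2).sum(-1).mean()
    variance = (x - x.mean(0)).pow(2).sum(-1).mean()
    return (mse / variance).item()

x = torch.randn(64, 512)                              # stand-in activations
x_hat = x + 0.1 * torch.randn_like(x)                 # stand-in SAE reconstructions
z = torch.relu(torch.randn(64, 4096))                 # stand-in SAE latents
print("L0:", l0_sparsity(z), " FVU:", fraction_variance_unexplained(x, x_hat))
```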
SAE latents are usually labelled using an autointerpretability pipeline. Most such pipelines feed highly-activating dataset exemplars (i.e. inputs whose activations produce a large latent value for a given feature, repeated over all features) to a large language model, which generates a natural-language description based on the contexts in which the latent is active.
Early works directly adapt the neuron-labelling and evaluation pipeline of Bills et al. (2023) and report higher interpretability scores than alternative methods (the standard basis, PCA, etc.).[11][12] However, this can produce misleading scores, since explanations achieve high recall but usually low precision. Later works therefore introduced more nuanced evaluation metrics: neuron-to-graph explanations (or other approaches) that report both precision and recall,[34] and intervention-based metrics that measure the downstream effect of manipulating a latent feature.[37][38]
Transcoders
Transcoders are formulated identically to SAEs, with the caveat that they seek to approximate the input-output behaviour of a model component (usually the MLP).[39] This is useful for measuring how latent features in different layers of the model affect each other in an input-invariant manner (i.e. by directly comparing encoder and decoder weights). A transcoder thus computes $z = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}})$, $\hat{y} = W_{\mathrm{dec}} z + b_{\mathrm{dec}}$, and is trained to minimise the loss $\mathcal{L}(x) = \lVert \mathrm{MLP}(x) - \hat{y} \rVert_2^2 + \lambda \lVert z \rVert_1$, where $\mathrm{MLP}(x)$ is the output of the approximated component. When ignoring or holding attention components constant (which may obscure some information), transcoders trained on different layers of a model can then be used to conduct circuit analysis without having to process individual inputs and collect latent activations, unlike SAEs.
Transcoders generally outperform SAEs, achieving lower loss and better automated interpretability scores.[40][39]
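A minimal PyTorch sketch of the transcoder objective described above; the sizes, coefficient, and the random stand-ins for the MLP input-output pairs are illustrative:

```python
import torch
from torch import nn

class Transcoder(nn.Module):
    # Same architecture as an SAE, but trained to predict a component's output
    # (e.g. MLP(x)) from the component's input x, rather than to reconstruct x.
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        return self.decoder(z), z

transcoder = Transcoder(d_model=512, d_latent=4096)
x = torch.randn(64, 512)                  # stand-in MLP inputs
y = torch.randn(64, 512)                  # stand-in MLP outputs, i.e. MLP(x)
y_hat, z = transcoder(x)
loss = (y - y_hat).pow(2).sum(-1).mean() + 1e-3 * z.abs().sum(-1).mean()
loss.backward()
```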
Sparse crosscoders
A disadvantage of single-layer SAEs and transcoders is that they produce duplicate features when trained on multiple layers, if those features persist throughout the residual stream. This complicates understanding layer-to-layer feature propagation and also wastes latent parameters. Crosscoders were introduced to enable cross-layer representation of features, which minimises these issues.[41]
A crosscoder computes a cross-layer latent representation from a set of layer-wise activations $x^{1}, \dots, x^{L}$ over $L$ layers obtained from some input as follows: $z = \mathrm{ReLU}\left( \sum_{l=1}^{L} W_{\mathrm{enc}}^{l} x^{l} + b_{\mathrm{enc}} \right)$. The reconstruction is done independently for each layer using this cross-layer representation: $\hat{x}^{l} = W_{\mathrm{dec}}^{l} z + b_{\mathrm{dec}}^{l}$. Alternatively, the targets may be layer-wise component outputs if using the transcoder objective. The model is then trained to minimise a loss of the form $\mathcal{L} = \sum_{l} \lVert x^{l} - \hat{x}^{l} \rVert_2^2 + \lambda \sum_{j} |z_j| \sum_{l} \lVert W_{\mathrm{dec},j}^{l} \rVert_2$. Note that the regularisation term combines each latent's per-layer decoder norms with an $\ell_1$-norm (a plain sum); the $\ell_2$-norm is an alternative choice, considered but not used in the original paper.[41]
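A minimal PyTorch sketch of a crosscoder with the loss above; the number of layers, sizes, and sparsity coefficient are illustrative:

```python
import torch
from torch import nn

class Crosscoder(nn.Module):
    # One shared latent space that reads from, and reconstructs, several layers at once.
    def __init__(self, d_model: int, d_latent: int, n_layers: int):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, d_latent, bias=False) for _ in range(n_layers))
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.decoders = nn.ModuleList(
            nn.Linear(d_latent, d_model) for _ in range(n_layers))

    def forward(self, xs):            # xs: one (batch, d_model) tensor per layer
        z = torch.relu(sum(enc(x) for enc, x in zip(self.encoders, xs)) + self.b_enc)
        return [dec(z) for dec in self.decoders], z

crosscoder = Crosscoder(d_model=512, d_latent=4096, n_layers=3)
xs = [torch.randn(32, 512) for _ in range(3)]           # stand-in layer activations
x_hats, z = crosscoder(xs)

reconstruction = sum((x - xh).pow(2).sum(-1).mean() for x, xh in zip(xs, x_hats))
# Each latent is penalised by its activation times the sum over layers of its decoder norm.
decoder_norms = torch.stack([d.weight.norm(dim=0) for d in crosscoder.decoders]).sum(0)
loss = reconstruction + 1e-3 * (z.abs() * decoder_norms).sum(-1).mean()
loss.backward()
```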
Circuit tracing and automated graph discovery
Circuit tracing substitutes parts of the model, in particular the MLP blocks, with more interpretable components called transcoders. The goal is to recover explicit computational graphs. Like SAEs, circuit tracing uses sparse dictionary learning techniques; instead of reconstructing model activations, however, transcoders aim to predict the output of non-linear components given their input. The technique was introduced in the paper "Circuit Tracing: Revealing Computational Graphs in Language Models", published in April 2025 by Anthropic. Circuit tracing has been used to understand how a model plans the rhyme in a poem, performs medical diagnoses, and produces unfaithful chains of thought.[42]
Critique
Critiques of mechanistic interpretability can be roughly divided into two categories: critiques of the real-world impact of interpretability, and critiques of the mechanistic approach compared to others.
Critics have argued that current circuits-level analyses generalise poorly and risk giving false confidence about alignment, citing cherry-picking, small-scale demos, and absence of worst-case guarantees.[43][44] In response to Dario Amodei's blogpost "The Urgency of Interpretability", which claimed that researchers are on track to achieve MI's goal of developing "the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model", Neel Nanda, who leads Google DeepMind's mechanistic interpretability team, argued that mechanistic interpretability alone won't enable high-reliability monitoring for safety-relevant features such as deception.[45][46] In particular, Nanda highlighted that only a small portion of the model seems to be interpretable and that current methods cannot guarantee the absence of particular features or circuits of interest.
Some have framed mechanistic interpretability as "bottom-up interpretability", where the emphasis is on neuron-level circuits, in contrast to "top-down interpretability", which focuses on emergent high-level concepts.[47] They argue that LLMs are complex systems whose high-level behaviour cannot be simulated, predicted, or understood purely in terms of low-level components. As one critic puts it, "If you wanted to explain why England won World War II using particle physics, you would just be on the wrong track."[48]
References
- ^ a b Olah, Chris (June 2022). "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases". Transformer Circuits Thread. Anthropic. Retrieved 28 March 2025.
- ^ Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020). "Zoom In: An Introduction to Circuits". Distill. doi:10.23915/distill.00024.001.
- ^ a b Elhage, Nelson; et al. (2021). "A Mathematical Framework for Transformer Circuits". Transformer Circuits Thread. Anthropic.
- ^ a b Saphra, Naomi; Wiegreffe, Sarah (2024). "Mechanistic?". arXiv:2410.09087 [cs.AI].
- ^ Olah, Chris; et al. (2018). "The Building Blocks of Interpretability". Distill. doi:10.23915/distill.00010.
- ^ Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020-03-10). "Zoom In: An Introduction to Circuits". Distill. 5 (3). doi:10.23915/distill.00024.001. ISSN 2476-0757.
- ^ "Transformer Circuits Thread". transformer-circuits.pub. Retrieved 2025-05-12.
- ^ Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova; Henighan, Tom; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom; Drain, Dawn; Ganguli, Deep; Hatfield-Dodds, Zac; Hernandez, Danny; Johnston, Scott; Jones, Andy; Kernion, Jackson; Lovitt, Liane; Ndousse, Kamal; Amodei, Dario; Brown, Tom; Clark, Jack; Kaplan, Jared; McCandlish, Sam; Olah, Chris (2022). "In-context Learning and Induction Heads". arXiv:2209.11895 [cs.LG].
- ^ a b Elhage, Nelson; Hume, Tristan; Olsson, Catherine; Schiefer, Nicholas; Henighan, Tom; Kravec, Shauna; Hatfield-Dodds, Zac; Lasenby, Robert; Drain, Dawn; Chen, Carol; Grosse, Roger; McCandlish, Sam; Kaplan, Jared; Amodei, Dario; Wattenberg, Martin; Olah, Christopher (2022). "Toy Models of Superposition". arXiv:2209.10652 [cs.LG].
- ^ Nanda, Neel; Chan, Lawrence; Lieberum, Tom; Smith, Jess; Steinhardt, Jacob (2023). "Progress measures for grokking via mechanistic interpretability". arXiv:2301.05217 [cs.LG].
- ^ a b c Cunningham, Hoagy; Ewart, Aidan; Riggs, Logan; Huben, Robert; Sharkey, Lee (May 7–11, 2024). "Sparse Autoencoders Find Highly Interpretable Features in Language Models". The Twelfth International Conference on Learning Representations (ICLR 2024). Vienna, Austria: OpenReview.net. Retrieved 2025-04-29.
- ^ a b Bricken, Trenton; et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning". Transformer Circuits Thread. Retrieved 2025-04-29.
- ^ "Request for proposals for projects in AI alignment that work with deep learning systems". Open Philanthropy. Retrieved 2025-05-12.
- ^ "Interpretability". Alignment Forum. 2021-10-29.
- ^ Gil, Juan; Kidd, Ryan; Smith, Christian (December 1, 2023). "MATS Summer 2023 Retrospective".
- ^ "Tegmark Group". tegmark.org. Retrieved 2025-05-12.
- ^ Hobbhahn, Marius; Millidge, Beren; Sharkey, Lee; Bushnaq, Lucius; Braun, Dan; Balesni, Mikita; Scheurer, Jérémy (2023-05-30). "Announcing Apollo Research".
- ^ "Interpretability". EleutherAI. 2024-02-06. Retrieved 2025-05-12.
- ^ Sullivan, Mark (2025-04-22). "This startup wants to reprogram the mind of AI—and just got $50 million to do it". Fast Company. Archived from the original on 2025-05-04. Retrieved 2025-05-12.
- ^ "ICML 2024 Mechanistic Interpretability Workshop". icml2024mi.pages.dev. Retrieved 2025-05-12.
- ^ "Mechanistic Interpretability explained – Chris Olah and Lex Fridman". YouTube. 14 November 2024. Retrieved 2025-05-03.
- ^ @ch402 (July 29, 2024). "I was motivated by many of my colleagues at Google Brain being deeply skeptical of things like saliency maps. When I started the OpenAI interpretability team, I used it to distinguish our goal: understand how the weights of a neural network map to algorithms" (Tweet) – via Twitter.
- ^ Nanda, Neel (January 31, 2023). "Mechanistic Interpretability Quickstart Guide". Neel Nanda. Retrieved 28 March 2025.
- ^ "Mechanistic Interpretability Workshop 2024". 2024.
- ^ Park, Kiho; Choe, Yo Joong; Veitch, Victor (2024). "The Linear Representation Hypothesis and the Geometry of Large Language Models". arXiv:2311.03658.
- ^ Elhage, Nelson; Nanda, Neel; Olsson, Catherine; et al. (2021-12-22). "A Mathematical Framework for Transformer Circuits". Transformer Circuits Thread. Retrieved 2025-05-30.
- ^ Scherlis, Adam (2025). "Polysemanticity and Capacity in Neural Networks". arXiv:2210.01892.
- ^ Bereska & Gavves (2024), p. 14.
- ^ Alain, Guillaume; Bengio, Yoshua (2018). "Understanding intermediate layers using linear classifier probes". arXiv:1610.01644.
- ^ a b Marks, Samuel; Tegmark, Max (2024). "The Geometry of Truth: Emergent Linear Structure in LLM Representations". arXiv:2310.06824.
- ^ Arditi, Andy; Obeso, Oscar; Syed, Aaquib; Paleka, Daniel; Panickssery, Nina; Gurnee, Wes; Nanda, Neel (2024). "Refusal in Language Models Is Mediated by a Single Direction". arXiv:2406.11717.
- ^ Tigges, Curt; Hollinsworth, Oskar John; Geiger, Atticus; Nanda, Neel (2023). "Linear Representations of Sentiment in Large Language Models". arXiv:2310.15154.
- ^ Conmy, Arthur (2023). "Towards Automated Circuit Discovery for Mechanistic Interpretability". arXiv:2304.14997.
- ^ a b Gao, Leo; et al. (2024). "Scaling and evaluating sparse autoencoders". arXiv:2406.04093.
- ^ Rajamanoharan, Senthooran; et al. (2024). "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders". arXiv:2407.14435.
- ^ Karvonen, Adam; et al. (2025). "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability". arXiv:2503.09532.
- ^ Paulo, Gonçalo; et al. (2024). "Automatically Interpreting Millions of Features in Large Language Models". arXiv:2410.13928.
- ^ Wu, Zhengxuan; et al. (2025). "AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders". arXiv:2501.17148.
- ^ a b Dunefsky, Jacob; et al. (December 10–15, 2024). "Transcoders find interpretable LLM feature circuits". Advances in Neural Information Processing Systems 38 (NeurIPS 2024). Vancouver, BC, Canada. Retrieved 2025-04-29.
- ^ Paulo, Gonçalo; et al. (2025). "Transcoders Beat Sparse Autoencoders for Interpretability". arXiv:2501.18823 [cs.LG].
- ^ a b Lindsey, Jack; et al. (2024). "Sparse Crosscoders for Cross-Layer Features and Model Diffing". Transformer Circuits Thread. Anthropic. Retrieved 2025-04-30.
- ^ "Circuit Tracing: Revealing Computational Graphs in Language Models". Transformer Circuits. Retrieved 2025-06-30.
- ^ Casper, Stephen (2023-02-17). "EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety". AI Alignment Forum.
- ^ Segerie, Charbel-Raphael (17 August 2023). "Against Almost Every Theory of Impact of Interpretability". AI Alignment Forum. Retrieved 31 May 2025.
- ^ Amodei, Dario (April 2025). "The Urgency of Interpretability". www.darioamodei.com. Retrieved 2025-05-29.
- ^ Nanda, Neel (2025-05-04). "Interpretability Will Not Reliably Find Deceptive AI". AI Alignment Forum. Retrieved 2025-05-30.
- ^ Hendrycks, Dan; Hiscott, Laura (May 15, 2025). "The Misguided Quest for Mechanistic AI Interpretability". AI Frontiers. Retrieved May 31, 2025.
- ^ Patel, Dwarkesh (May 22, 2025). "How Does Claude 4 Think? — Sholto Douglas & Trenton Bricken". Dwarkesh Podcast. Retrieved May 31, 2025.
Sources
- Bereska, Leonard; Gavves, Efstratios (2024). "Mechanistic Interpretability for AI Safety -- A Review". arXiv:2404.14082 [cs.AI].
- Bills, Steven; et al. (2023). "Language models can explain neurons in language models". OpenAI. Retrieved 2025-04-29.
- Nanda, Neel; Lee, Andrew; Wattenberg, Martin (2023). "Emergent Linear Representations in World Models of Self-Supervised Sequence Models". arXiv:2309.00941 [cs.LG].
- Rai, Daking; et al. (2025). "A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models". arXiv:2407.02646 [cs.AI].