Interpretability of Large Language Models#

We’d really like the individual neurons in LLMs to be easily interpretable. Unfortunately, they’re mostly not. Neurons – and, more broadly, the collections of neurons called circuits – get reused in different ways as part of seemingly unrelated computations. This property is called “polysemanticity” and makes it difficult for us to interpret how a neuron or circuit is being used.

Anthropic Paper#

https://www.anthropic.com/news/mapping-mind-language-model

Dictionary Learning#

https://www.youtube.com/watch?app=desktop&v=Ri0ComuqS7Y&t=0
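
The classical dictionary learning problem (see question 1 below) is to find a set of basis vectors (“atoms”) such that each data vector is well approximated by a sparse linear combination of them. A small illustrative sketch using scikit-learn; the sizes and penalty here are arbitrary placeholders, not settings from any of the sources above.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.randn(200, 64)               # stand-in for real data vectors
dl = DictionaryLearning(n_components=128,  # number of atoms to learn
                        alpha=1.0,         # sparsity penalty on the codes
                        max_iter=20)
codes = dl.fit_transform(X)                # sparse coefficients, shape (200, 128)
atoms = dl.components_                     # learned atoms, shape (128, 64)
X_hat = codes @ atoms                      # sparse approximation of X
```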

Monosemantic Features#

https://transformer-circuits.pub/2023/monosemantic-features/index.html

Feature Decomposition#

The underlying model has some activations, \(x^j\), that we approximate as a linear combination of more fundamental features:

\[ x^j \approx b + \sum_i f_i(x^j) \cdot d_i \]

where the sum runs over the learned features, \(x^j\) is the activation vector of length \(d_{MLP}\) for datapoint \(j\), \(f_i(x^j)\) is the (non-negative) activation of feature \(i\), \(d_i\) is a unit vector in activation space giving that feature’s direction, and \(b\) is a bias. The MLP here is the multi-layer perceptron that sits on top of the model’s single attention block; it has a ReLU activation and 512 neurons, so \(d_{MLP} = 512\).
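
The paper learns this decomposition with a sparse autoencoder: an encoder produces the feature activations \(f_i(x^j)\), and the decoder’s direction vectors play the role of the \(d_i\). Below is a minimal PyTorch sketch in that spirit; the sizes, initialization, and L1 coefficient are illustrative assumptions, not the paper’s exact settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over MLP activations (illustrative sizes)."""

    def __init__(self, d_mlp: int = 512, n_features: int = 4096):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_mlp, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_mlp) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_mlp))  # plays the role of b

    def forward(self, x: torch.Tensor):
        # f_i(x): non-negative feature activations
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # x_hat = b + sum_i f_i(x) * d_i, with the d_i as rows of W_dec
        # (the paper also constrains each d_i to unit norm; omitted here)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coef * sparsity

# Usage: x would be a batch of real MLP activation vectors; random here.
x = torch.randn(64, 512)
sae = SparseAutoencoder()
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```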

Further Questions#

  1. What is the classical problem of “dictionary learning”?

  2. I don’t quite understand the “features as decompositions” section, i.e., the exact math behind the decomposition.

  3. Do you just replace the actual model activations with your simplified linear combination of feature activations? This would allow you to measure how well you approximate the actual model (a rough sketch of such a check follows below).
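
One way to make that check concrete is to compare the reconstruction against the true activations, and to splice the reconstruction back into the forward pass to see how much the language-model loss degrades. A rough, framework-agnostic sketch; `get_mlp_acts` and `loss_with_mlp_acts` are hypothetical helpers (the first would return the model’s MLP activations for a batch of tokens, the second would run the model with those activations overwritten and return its loss).

```python
import torch

@torch.no_grad()
def evaluate_substitution(model, sae, tokens, get_mlp_acts, loss_with_mlp_acts):
    x = get_mlp_acts(model, tokens)        # true MLP activations x^j
    x_hat, f = sae(x)                      # sparse-feature reconstruction

    # (a) How close is the reconstruction in activation space?
    mse = ((x - x_hat) ** 2).mean()
    frac_var_unexplained = mse / x.var()   # rough normalization

    # (b) How much worse does the model get when its MLP activations
    #     are overwritten with the reconstruction?
    loss_orig = loss_with_mlp_acts(model, tokens, x)
    loss_subst = loss_with_mlp_acts(model, tokens, x_hat)

    return {
        "mse": mse.item(),
        "frac_variance_unexplained": frac_var_unexplained.item(),
        "loss_increase": (loss_subst - loss_orig).item(),
    }
```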

Interesting sources/further reading:#

  1. This property is closely related to the desiderata of Causality, Generality, and Purity discussed in Cammarata et al. [25], and those provide an example of how we might make this property concrete in a specific instance.