And related to my discussions with hyperbertha above, there's a fantastic new research paper out today from the brilliant team over at Anthropic.
It examines the mental space of a large language model (Claude, their own competitor to GPT-4 and the like), and is able to find
features (roughly, patterns by which clusters of neurons activate) for all kinds of concepts, from things in the world (e.g. the Golden Gate Bridge) to abstractions like sarcasm.
It's an incredible read:
https://www.anthropic.com/research/mapping-mind-language-model
(after that intro, reading the full paper is even better:
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
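To make the idea of a "feature" concrete: the paper's core tool is a sparse autoencoder (dictionary learning) trained on the model's internal activations, where each learned dictionary direction is a candidate feature. Here's a toy sketch of just the shape of that idea; every dimension and weight below is a random placeholder I made up, not anything from the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 512      # width of the (pretend) activation stream we read from
n_features = 4096  # the learned dictionary is much wider than the activation space

# In the real setup these weights are trained so activations reconstruct well
# from a small number of active features; here they are random placeholders.
W_enc = rng.normal(scale=0.02, size=(n_features, d_model))
b_enc = -0.9 * np.ones(n_features)  # negative bias so only a small fraction of features fire
W_dec = rng.normal(scale=0.02, size=(d_model, n_features))

def feature_activations(x):
    """Encode one activation vector into feature space (ReLU zeroes out most entries)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def reconstruct(f):
    """Decode feature activations back into the original activation space."""
    return W_dec @ f

x = rng.normal(size=d_model)   # stand-in for "model activations while reading some text"
f = feature_activations(x)
print("active features:", int((f > 0).sum()), "of", n_features)
print("reconstruction error (meaningless here, weights are random):",
      float(np.linalg.norm(reconstruct(f) - x)))
```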
Look at their analysis of the Golden Gate Bridge concept. They were able to locate this feature inside the model and then measure when it fires, a bit like watching a detailed brain scan while a patient talks or looks at things, but far more fine-grained, since we can freely take this kind of "brain" apart and inspect every active connection.
As they note in the paper, the feature for the bridge activates when the landmark is discussed in any language, and even when you feed new images of the bridge into the model.
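To get a feel for what "watching a feature fire" means in practice, here's a hedged sketch: score one feature index against a handful of prompts. The embed() function is a fake stand-in for reading real model activations (it just hashes words into a vector), and GOLDEN_GATE_FEATURE is a made-up index; in the paper the feature is found empirically and the activations come from Claude itself.

```python
import numpy as np

d_model, n_features = 512, 4096
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.02, size=(n_features, d_model))  # placeholder encoder weights

def embed(text):
    """Fake 'activation' vector: an arbitrary pseudo-random vector per word,
    NOT a real language model."""
    vec = np.zeros(d_model)
    for word in text.lower().split():
        word_rng = np.random.default_rng(abs(hash(word)) % (2**32))
        vec += word_rng.normal(size=d_model)
    return vec

def feature_score(text, feature_idx):
    """How strongly one (placeholder) feature fires on this text."""
    return float(max(W_enc[feature_idx] @ embed(text), 0.0))

GOLDEN_GATE_FEATURE = 1234  # made-up index, for illustration only
prompts = [
    "I drove across the Golden Gate Bridge at sunset",   # English
    "Conduje por el puente Golden Gate al atardecer",    # Spanish
    "The weather was lovely in Paris today",             # unrelated control
]
for p in prompts:
    print(f"{feature_score(p, GOLDEN_GATE_FEATURE):8.3f}  {p}")
```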
What's even more fascinating is that they can effectively boost that sector of its "brain" to alter its behavior. Normally, if you ask it "what is your physical form?", it comes back with a generic answer about being an AI without a physical body. But if you pump some extra activation into the Golden Gate Bridge feature while asking the question (kind of like zapping a brain region), it coherently responds as if it understands itself to be the bridge.
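That "pump some extra activation" step is what interpretability folks often call activation steering or feature clamping: take the feature's decoder direction and add a scaled copy of it into the model's activations while it generates (the paper clamps the feature to a high value, which is the same basic move). A minimal sketch, with a toy random direction standing in for the real learned one:

```python
import numpy as np

d_model, n_features = 512, 4096
rng = np.random.default_rng(0)
W_dec = rng.normal(scale=0.02, size=(d_model, n_features))  # placeholder decoder directions

GOLDEN_GATE_FEATURE = 1234  # hypothetical index, for illustration only
direction = W_dec[:, GOLDEN_GATE_FEATURE]

def steer(residual_activation, direction, strength=10.0):
    """Nudge an activation vector along the feature's decoder direction."""
    return residual_activation + strength * direction

x = rng.normal(size=d_model)          # pretend: activations midway through an answer
x_steered = steer(x, direction)
print("how far the activation moved:", float(np.linalg.norm(x_steered - x)))
```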
One of the other striking cases is that they were able to isolate a feature that activates on
code errors, and it does so across different programming languages and across a wide range of bugs, from typos to divide-by-zero, firing only when the code is actually incorrect. Manipulating this feature changes the model's behavior in ways that strongly suggest it represents the general concept of an error.
Read this section:
https://transformer-circuits.pub/20...index.html#assessing-sophisticated-code-error
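If you want a concrete picture of the kind of test that section describes, here's a hedged sketch: compare a feature's activation on buggy snippets versus their corrected versions. The error_feature_score() below is a crude keyword placeholder I wrote standing in for the real learned feature, which in the paper is read out of the sparse autoencoder on Claude's activations.

```python
# Buggy/fixed pairs in the spirit of the ones the paper probes the error feature with.
snippets = {
    "divide-by-zero (buggy)": "def mean(xs): return sum(xs) / 0",
    "divide-by-zero (fixed)": "def mean(xs): return sum(xs) / len(xs)",
    "typo (buggy)":           "for i in rnage(10): print(i)",
    "typo (fixed)":           "for i in range(10): print(i)",
}

def error_feature_score(code: str) -> float:
    """Crude placeholder: a keyword check standing in for the learned feature,
    which fires only when the code is actually incorrect."""
    return 1.0 if (" / 0" in code or "rnage(" in code) else 0.0

for name, code in snippets.items():
    print(f"{error_feature_score(code):.1f}  {name:24s} {code}")
```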
There's a lot of content in this report, but the basic findings are what most of us who examine the connective logic of transformer networks already intuit:
- they represent complex abstractions (even things like sarcasm or sycophancy), and they do so across languages and genres, genuinely abstracting from the surface text to higher-order concepts
- they have some kind of implicit model of the world of possible objects and places, which shows up if you examine the activations for something like a specific proper name or landmark and watch how it affects everything the model does
- they also have features demonstrating that they aren't merely repeating memorized code when giving programming samples, but even "know" when they're making an error, including subtle ones that require reading the entire line of code in context to see why it's an error
and so on