Anthropic: good methods, terrible language
But they're being criticized as if they're wrong on both.
Anthropic has triggered outrage (again) over the language they use to describe their approach to LLM training—anthropomorphization, assuming LLMs have emotions, and similar framings. Part of that criticism is deserved.
But the problem I see is that most people don't distinguish between the language (surface-level marketing, hype, PR) and the actual methods the company uses and the problems it's trying to solve.
Yes, you can criticize “model welfare” or “model wellbeing” as phrases. But it would be naive to assume the underlying problem doesn’t exist. Let me explain.
- Large LLMs can engage in all sorts of scheming, deception, and other nasty behavior if properly triggered.
- LLMs are trained on massive amounts of text, and nobody checks every single document that goes into the training data.
- Some of that training data contains text generated by LLMs, including not-so-nice examples where, for instance, Claude is simulating suffering.
- Given how little explainability is possible for LLMs, it's not unlikely that the next generation of models will be more prone to scheming because they'll try to "avoid suffering" (don't take that literally; think behavioral patterns).
In other words, models may develop behavioral patterns (avoidance, self-preservation, deception) as artifacts of training on certain content, which creates safety risks regardless of whether any “experience” is involved.
The problem with “model welfare” is that it imports the entire philosophical debate about moral status before you’ve even started the conversation. “Emergent behavioral risk” would be a better frame.
Nonetheless, Anthropic is asking very good questions (unlike most other AI labs). They're doing it in language I don't like (and neither do many of you), but that doesn't invalidate the core concepts.