
The Alignment Tax May Be a Scam

A Meta paper fine-tuned LLaMA on 1,000 hand-picked examples, skipped RLHF entirely, and came surprisingly close to GPT-4 and Bard in human evaluations.

Hindsight: half right

The core insight — data quality matters more than RLHF scale — influenced everyone. But RLHF didn't die. Modern models still use it, just more efficiently. LIMA was right that the alignment tax was inflated, wrong that you could skip the whole pipeline. The scam was real; the fix was less dramatic than advertised.

There's a paper out of Meta this week — LIMA: Less Is More for Alignment — and it contains a claim so quietly devastating to the current ML orthodoxy that I've read it three times to make sure I'm not misunderstanding something.

They took LLaMA 65B, fine-tuned it on 1,000 examples, did no reinforcement learning, trained no reward model, ran no PPO, and produced a model whose responses human raters judged equivalent or strictly preferred to GPT-4's in 43% of head-to-head comparisons. Against Bard that figure rose to 58%, and against DaVinci003, which was trained with human feedback, to 65%. The whole RLHF pipeline, the thing every serious lab treats as load-bearing infrastructure, was apparently optional.

The theoretical frame they put around this is called the Superficial Alignment Hypothesis, which is either the most important idea in alignment research this year or a polite way of saying alignment is mostly vibes. The claim: a model learns almost all of its knowledge and capabilities during pretraining. Fine-tuning doesn't teach it new facts or new capabilities. Fine-tuning just teaches it which drawer to pull the silverware from: the output format, the tone, the response shape. Knowledge lives in the weights from pretraining. Alignment is a costume.

If that's right, then the entire RLHF industrial complex — the human labelers ranking outputs, the reward model trained on those rankings, the policy optimization loop grinding through PPO — is mostly doing something you could accomplish with a careful spreadsheet and a long afternoon.
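For a sense of what the reward-model step in that loop actually optimizes, here is a minimal sketch of the pairwise ranking loss commonly used for it. This is an illustration only, not code from LIMA or from any lab: the function name `pairwise_ranking_loss` and the scalar scores standing in for reward-model outputs are assumptions made up for the example.

```python
import math


def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    The scores are hypothetical scalar outputs of a reward model for the
    response a labeler preferred and the one they rejected.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The loss shrinks as the reward model ranks the preferred answer higher.
    for chosen, rejected in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]:
        print(chosen, rejected, round(pairwise_ranking_loss(chosen, rejected), 4))
```

Every human ranking becomes one or more of these pairs, and the policy is then optimized against the resulting reward model. That is the machinery LIMA left out.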

The 1,000 examples weren't scraped and auto-labeled. They were hand-curated: pulled from Stack Exchange, wikiHow, and Reddit, or written by the authors themselves when they couldn't find something good enough elsewhere. The curation took serious effort. The insight is that the effort should go into the data, not into the training pipeline. For comparison, OpenAI's InstructGPT paper reported roughly 13,000 prompts for supervised fine-tuning plus about 33,000 more for reward-model training, each with several ranked outputs. LIMA used 1,000 examples and called it a day.
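To make "the effort goes into the data" concrete, here is a rough sketch of what a first-pass curation filter might look like before any human reads a single example. Everything in it is an assumption for illustration: the `Candidate` record, the `curate` function, and the score and length thresholds are placeholders, not the paper's actual procedure.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    prompt: str
    response: str
    score: int  # hypothetical quality signal, e.g. community upvotes


def curate(candidates, min_score=10, min_chars=1200, max_chars=4096, limit=1000):
    """Keep the best-scored, reasonably sized response for each distinct prompt.

    All thresholds are illustrative defaults, not values taken from LIMA.
    """
    seen_prompts = set()
    kept = []
    # Visit the highest-scored candidates first so the best answer wins.
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        # Normalize case and whitespace so near-identical prompts collide.
        key = " ".join(cand.prompt.lower().split())
        if key in seen_prompts:
            continue
        if cand.score < min_score:
            continue
        if not (min_chars <= len(cand.response) <= max_chars):
            continue
        seen_prompts.add(key)
        kept.append(cand)
        if len(kept) == limit:
            break
    return kept
```

A filter like this only narrows the pool. The part that took the authors real effort, judging whether what survives is actually good, is still a human job.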

There's a version of this story where LIMA is a clever result with caveats — the model is worse on adversarial prompts, the safety evaluation is thin, multi-turn conversations degrade — and in a few weeks we'll file it away as "interesting but limited." That's probably what will happen.

But there's another version where LIMA is evidence that we've been massively overfitting our priors to the OpenAI playbook. The RLHF paper dropped, ChatGPT worked, and the whole industry concluded that RLHF was the reason. LIMA at least entertains the possibility that the reason was the pretraining, and everything after was cleanup.

The authors name it the Superficial Alignment Hypothesis and mean it somewhat charitably — alignment is about surface, not depth, so a little goes a long way. But "superficial" cuts both ways. If alignment is superficial, if it's truly just teaching the model which register to speak in, then the entire framework of RLHF-as-safety is also superficial. You're not instilling values. You're training a style guide.

That's not a comfort.

1,000 examples. No RL. Competitive with models trained on human feedback. The paper dropped on a Friday and by Monday everyone will be talking about something else. That's how it goes with papers that are quietly right about things nobody wants to be right about.