expectedwrong hindsight

gzip Beats Your Classifier

Fourteen lines of Python and a compression ratio walk into a benchmark.

2 min read 331 words #nlp #compression #text-classification #machine-learning #acl2023
hindsight — i was wrong

The gzip paper got partially debunked. The accuracy numbers used a non-standard calculation that inflated results — on one dataset the method went from best-performing to worst-performing once corrected. A bag-of-words baseline matched or beat it. The finding was real enough to be interesting and wrong enough to be embarrassing.

The file compression utility you use to make email attachments smaller is a competitive text classifier. That's it. That's the finding.

The paper — out of ACL Findings this week — takes gzip, concatenates two strings, checks whether the compressed length of the pair is closer to the compressed length of either individual string, and uses that as a distance metric for kNN classification. The math is called Normalized Compression Distance. It's not new. The application to text classification at this scale, beating BERT variants on out-of-domain tasks, is.

Here's the entire model:

Cx1x2 = len(gzip.compress(x1x2.encode()))
ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)

The intuition is almost offensively simple — if two texts compress well together, they share structure, they're probably the same class. No gradients. No tokenizer. No GPU. No hyperparameter sweep you'll forget to document. Just: does gzip have an easier time with these strings when they're next to each other.

The out-of-domain numbers are what should make people uncomfortable. The whole story of neural text classification, the reason you fine-tune a large pretrained model instead of something cheaper, is that it learns something general about language that transfers. And here is gzip, which learned nothing, trained on nothing, running in 14 lines of Python with no dependencies beyond the standard library, quietly doing the same job.

There are caveats. It's slow — O(n) against every training example at inference time, which is a real problem if your training set isn't tiny. The in-domain numbers aren't uniformly better. But "uniformly better" is a high bar, and gzip clears it on enough benchmarks that the question worth asking is not "why does gzip work so well" but "why did we assume it wouldn't."

The honest answer is that nobody thought to try. Or rather, people assumed the gap between information-theoretic similarity and semantic similarity was too wide to bridge with a compression ratio. It isn't, apparently, or at least not as wide as we thought.

We have been training very large models to learn things that zip files already know.