I Was Doing This in 2019

Generative synthetic data was not invented this year, no matter how many breathless tweets you saw about it.

There's a specific experience — not quite grief, not quite vindication — where you watch an idea you worked on four years ago get rediscovered by the industry and treated like a divine revelation, presented at conferences with the energy of someone who just invented fire.

Generative synthetic data is having that moment right now.

I was doing this in 2019. Not gesturing at it, not writing a speculative blog post about it, but doing it and publishing it, arXiv'd and timestamped. The paper exists. The dates are in the PDF.

And now, in the summer of 2023, synthetic data is everywhere. Every lab is using it. Every fine-tuning pipeline is using it. The discourse has produced roughly ten thousand posts explaining, with great confidence, that generating synthetic training data with language models is a new idea that the field is just now grasping.

The field was not just now grasping it.

The frustrating part isn't being early. Early is fine; early is just a coordinate in time. The frustrating part is that being early and quiet about it is indistinguishable, from the outside, from not having done it at all. The work sat in a corner of the internet being correct and unread, which is basically the same as not existing.

This is not a complaint. This is a timestamp.

Counterpoints