LinkedIn Has 1 Billion Résumés and Just Decided to Use Them

The most boring social network on the internet turns out to have been sitting on the most valuable training corpus in the world.

2 min read 436 words #ai #training-data #linkedin #microsoft #data-ethics
hindsight — nailed it

LinkedIn did use the data. The billion-résumé corpus became training material. The outrage about opt-out faded. Focusing on the more interesting story, the immaculate self-labeled dataset, was the right call.

LinkedIn updated its privacy settings this week to allow member data to be used for training generative AI models — opt-out, not opt-in, already enabled for most users, buried in a settings screen nobody opens.

The reaction was mostly outrage about the opt-out thing. The outrage is correct but also misses the more interesting story.

LinkedIn has a billion users. A billion professionals who have spent fifteen years carefully curating their career history, their skills, their industry vocabulary, their job titles, their connections, their company affiliations. Nobody made them do this. They typed it in willingly, to get jobs and feel validated and argue with strangers about hustle culture. The corpus is immaculate — structured, self-labeled, continuously updated, spanning every industry and seniority level on the planet.

This is not like scraping Reddit. Reddit is mostly thirty-year-olds arguing about whether a hot dog is a sandwich. LinkedIn is a database of what every professional role in every industry actually involves, described in the words of people who do those roles, indexed by company size and geography and career stage, updated in real time as the economy shifts.

Nobody was really thinking about LinkedIn as an AI training player. That was the surprise. Microsoft acquired it in 2016 for $26 billion, and everyone kind of filed it away as a weird LinkedIn thing and moved on. Then OpenAI happened, and suddenly Microsoft was sitting on a dataset that nobody else has, that cannot be reconstructed from public web crawls, and that carries genuine signal about professional knowledge, communication, and career trajectory.

A model trained heavily on LinkedIn data will know things that a general-purpose model trained on public web crawls like Common Crawl doesn't know. Not trivia. Practical things. What a VP of Product at a Series B company actually does all day. What language a compliance officer at a European bank uses when they're worried about something. How hiring patterns at defense contractors differ from hiring patterns at consumer apps. This is not nothing.

The opt-out controversy will blow over in a week. People will forget or not bother. The training will happen. And sometime in the next eighteen months something will ship — some professional AI tool, some copilot feature, some resume assistant — and it will be uncannily good at understanding the texture of professional life in a way that feels qualitatively different from the generic models, and most people won't know why.

The database of professional ambition, built for free by a billion people trying to get promoted, is now a training set. There is something almost cosmically funny about this that I don't have words for.