
Gemini 1.5 Will Remember Your Day Better Than You Do

Raw audio in, 22 hours long, one pass — and GPT-4 can't keep up.

1 min read 206 words #gemini #audio #context-length #llm
hindsight — half right

the audio capabilities expanded — gemini processes raw audio natively now. but the 'record your whole day' use case didn't materialize as a product. the tech was real. the privacy and practical barriers were larger than the context window.

Gemini 1.5 takes raw audio. Not a transcript. The actual audio. Up to 22 hours of it.

GPT-4 needs you to run Whisper first, caps out at 11 hours, and still misses things. It's doing a fundamentally different job: reasoning over a text artifact of your audio, not the audio itself. The needle-in-a-haystack benchmarks make this concrete. Gemini keeps finding what's buried. The Whisper-plus-GPT-4 pipeline keeps missing it.

The implication sitting quietly underneath all of this is that when this model ships, you could record your entire waking day — every meeting, every conversation, every moment of thinking out loud in the car — and feed it to Gemini as a single unbroken context window. One pass. Full reasoning across the whole thing.

Not a transcript you have to search. Not a summary some pipeline produced for you. The actual day, held in memory, available for questions.
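
The "one pass" claim is just arithmetic on the limits quoted above. A rough sketch, assuming a 16-hour waking day (my number, not a benchmark figure) and treating the 22-hour and 11-hour limits as hard caps on a single context window:

```python
import math

# Limits as quoted in this post; treat them as illustrative, not as specs.
GEMINI_AUDIO_LIMIT_HOURS = 22
WHISPER_GPT4_LIMIT_HOURS = 11

# Assumption: a typical waking day is about 16 hours of recording.
WAKING_HOURS = 16


def passes_needed(recording_hours: float, limit_hours: float) -> int:
    """Return how many separate context windows a recording would need."""
    return math.ceil(recording_hours / limit_hours)


gemini_passes = passes_needed(WAKING_HOURS, GEMINI_AUDIO_LIMIT_HOURS)
pipeline_passes = passes_needed(WAKING_HOURS, WHISPER_GPT4_LIMIT_HOURS)

print(f"Gemini 1.5: {gemini_passes} pass(es)")
print(f"Whisper + GPT-4: {pipeline_passes} pass(es)")
```

Under those assumptions the whole day fits in one Gemini window, while the transcript pipeline has to split it, and anything that spans the split is lost to cross-referencing.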

There's something unsettling about the fact that this is now a compute problem and not a fundamental impossibility. The barrier wasn't "can a model understand speech". Whisper mostly solved that back in 2022. The barrier was context length, and Gemini 1.5 just made it irrelevant.

Your phone has been able to record ambient audio since voice memos existed. Nobody processed a whole day of it because nobody could. Now somebody can.