
JPMorgan Teaches a Language Model to Read Like a Bank

DocLLM skips the vision encoder entirely and beats GPT-4 on the documents that actually matter.

2 min read 291 words #nlp #llm #document-ai #finance #architecture
hindsight — evolved

the problem was real — documents have layout. but vision-language models solved it differently. GPT-4o and claude now just look at the page. the bounding-box-without-vision approach lost to actually using vision.

JPMorgan is back. Last week it was the earlier DocLLM work. This week it's the full paper — same lab, same problem, more benchmarks.

The problem is one of those things that sounds trivial until you think about it for thirty seconds: enterprise documents have layout. A form is not a paragraph. An invoice is not a story. The bounding box of a field labeled "Total Due" carries information that the text alone doesn't — specifically, that it's in the bottom-right corner next to a number, which is where totals live, which is the whole point.

Standard LLMs ignore this. They flatten everything to a token stream and pray.
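To make the failure concrete, here is a minimal sketch. The `OcrToken` type, the coordinates, and the `value_right_of` helper are all made up for illustration; none of it comes from the paper. The point is only that once OCR emits a two-column block column by column, the flat string no longer says which number belongs to which label, while the boxes still do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrToken:
    text: str
    # Hypothetical normalized page coordinates: (left, top, right, bottom).
    box: tuple


def flatten(tokens):
    """What a text-only model sees: the words, with the boxes thrown away."""
    return " ".join(t.text for t in tokens)


def value_right_of(tokens, label):
    """Use the boxes to find the nearest token to the right of `label` on its row."""
    anchor = next(t for t in tokens if t.text == label)
    _, a_top, a_right, a_bottom = anchor.box
    same_row = [
        t for t in tokens
        if t is not anchor
        and t.box[0] >= a_right                       # starts right of the label
        and t.box[1] < a_bottom and t.box[3] > a_top  # overlaps it vertically
    ]
    return min(same_row, key=lambda t: t.box[0] - a_right).text


# Labels in a left column, amounts in a right column, emitted column by column.
tokens = [
    OcrToken("Subtotal", (0.60, 0.80, 0.75, 0.83)),
    OcrToken("Total Due", (0.60, 0.90, 0.75, 0.93)),
    OcrToken("90.00", (0.80, 0.80, 0.90, 0.83)),
    OcrToken("100.00", (0.80, 0.90, 0.90, 0.93)),
]

print(flatten(tokens))                      # Subtotal Total Due 90.00 100.00
print(value_right_of(tokens, "Total Due"))  # 100.00
```

The flat string is ambiguous about which amount is the total. The geometry is not.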

The multimodal crowd's answer is to shove the page image through a vision encoder first. DocLLM's answer is: don't. Just take the OCR output, attach bounding box coordinates, and inject them directly into the attention mechanism. The spatial signal is disentangled from the text attention and computed alongside it: four separate interaction types (content-to-content, content-to-spatial, spatial-to-content, spatial-to-spatial), summed into a single attention score.
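The four interaction types can be sketched in a few lines of plain Python. This is a toy, not the paper's implementation: the function names, the `lam` mixing weights, the causal mask, and the choice to take values from the text stream only are assumptions made for the example, and the "embeddings" are hand-picked two-dimensional vectors rather than learned projections.

```python
import math


def dot_all(queries, keys):
    """Score matrix: entry [i][j] is the dot product of queries[i] and keys[j]."""
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]


def disentangled_scores(q_text, k_text, q_box, k_box, lam=(1.0, 1.0, 1.0)):
    """Sum the four interaction terms into one score matrix (before softmax)."""
    lam_cs, lam_sc, lam_ss = lam             # assumed scalar mixing weights
    cc = dot_all(q_text, k_text)             # content-to-content
    cs = dot_all(q_text, k_box)              # content-to-spatial
    sc = dot_all(q_box, k_text)              # spatial-to-content
    ss = dot_all(q_box, k_box)               # spatial-to-spatial
    n = len(cc)
    return [
        [cc[i][j] + lam_cs * cs[i][j] + lam_sc * sc[i][j] + lam_ss * ss[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def attend(q_text, k_text, v_text, q_box, k_box, lam=(1.0, 1.0, 1.0)):
    """Causal softmax over the combined scores; values come from text only (assumption)."""
    scores = disentangled_scores(q_text, k_text, q_box, k_box, lam)
    scale = math.sqrt(len(q_text[0]))
    weights = []
    for i, row in enumerate(scores):
        visible = [s / scale for s in row[: i + 1]]  # token i sees tokens 0..i
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    width = len(v_text[0])
    output = [
        [sum(w * v[c] for w, v in zip(w_row, v_text)) for c in range(width)]
        for w_row in weights
    ]
    return weights, output


# Three tokens with identical text features, so text alone cannot rank them.
text = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# Toy box features: token 0 sits in the header, tokens 1 and 2 share a line.
boxes = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [2.0], [3.0]]

text_only, _ = attend(text, text, values, boxes, boxes, lam=(0.0, 0.0, 0.0))
with_layout, _ = attend(text, text, values, boxes, boxes, lam=(0.0, 0.0, 1.0))
print(text_only[2])    # uniform over the three tokens
print(with_layout[2])  # shifted toward the tokens on the same line
```

With the spatial terms switched off, the last token spreads its attention evenly. Turn on the spatial-to-spatial term and it leans toward its neighbor on the same line, with no pixels involved.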

No ViT. No CNN. No expensive visual encoding step.

The model is LLaMA-2 7B with these spatial attention parameters bolted on, pre-trained via text infilling on visually rich documents, then instruction-tuned on four task types across a pile of enterprise benchmarks. Sixteen datasets in the main evaluation, and the paper reports beating state-of-the-art LLM baselines on fourteen of them. It outperforms GPT-4 with vision on several.
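Text infilling is easy to picture with a toy example. The sketch below shows a generic block-infilling setup, assuming the document arrives as a list of OCR text blocks. The `[M]`, `[S]`, and `[E]` token names and the `make_infilling_example` helper are placeholders for illustration, and the paper's exact masking scheme may differ.

```python
MASK, START, END = "[M]", "[S]", "[E]"  # placeholder special tokens (assumed)


def make_infilling_example(blocks, masked):
    """Split a document into a masked context and the targets to reconstruct.

    blocks: list of token lists, one per text block on the page.
    masked: set of block indices to hide from the context.
    """
    context, targets = [], []
    for index, block in enumerate(blocks):
        if index in masked:
            context.append(MASK)                   # hole left in the context
            targets.extend([START, *block, END])   # hidden block, to be predicted
        else:
            context.extend(block)
    return context, targets


# A hypothetical invoice, one list per text block.
blocks = [["Invoice", "No."], ["4821"], ["Total", "Due"], ["100.00"]]
context, targets = make_infilling_example(blocks, masked={1, 3})

print(context)  # ['Invoice', 'No.', '[M]', 'Total', 'Due', '[M]']
print(targets)  # ['[S]', '4821', '[E]', '[S]', '100.00', '[E]']
```

In a setup like this the model reads the context, then generates the targets left to right. It has to fill in a field using what surrounds it on the page, not just what came before it in the stream.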

A 7B model, built at a bank, reading bank documents better than GPT-4.

The architecture choice is almost insultingly direct — the location of text on a page is information, so represent it as information, feed it to the attention mechanism, done — and it works precisely because nobody at the frontier labs needed to solve this problem badly enough to solve it well. JPMorgan needed to. Their documents are the dataset.