JPMorgan Teaches a Language Model to Read Like a Bank

DocLLM skips the vision encoder entirely and beats GPT-4 on the documents that actually matter.

JPMorgan is back. Last week it was the earlier DocLLM work. This week it's the full paper — same lab, same problem, more benchmarks.

The problem is one of those things that sounds trivial until you think about it for thirty seconds: enterprise documents have layout. A form is not a paragraph. An invoice is not a story. The bounding box of a field labeled "Total Due" carries information that the text alone doesn't — specifically, that it's in the bottom-right corner next to a number, which is where totals live, which is the whole point.

Standard LLMs ignore this. They flatten everything to a token stream and pray.

The multimodal crowd's answer is to shove the page image through a vision encoder first. DocLLM's answer is: don't. Take the OCR output, attach bounding box coordinates, and inject them directly into the attention mechanism. The spatial attention is disentangled from the text attention and runs in parallel: four interaction types (content-to-content, content-to-spatial, spatial-to-content, spatial-to-spatial) are scored separately and combined at the end.

No ViT. No CNN. No expensive visual encoding step.
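
For the curious, here is roughly what "four interaction types combined at the end" could look like, as a toy sketch in plain Python. This is an illustration of the idea, not the paper's code. It assumes the bounding boxes have already been embedded to the same width as the text vectors, skips the learned projection matrices, and omits masking. The names lam_ts, lam_st and lam_ss are hypothetical mixing weights for the three terms that involve layout, and the square-root scaling is borrowed from standard attention.

```python
import math

def matmul_t(a, b):
    """Return a @ b^T for two lists of equal-width row vectors."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def disentangled_scores(q_text, k_text, q_box, k_box,
                        lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
    """Sum four score matrices: text-text, text-box, box-text, box-box.

    lam_* are hypothetical scalar weights on the layout-aware terms.
    Setting all three to 0 recovers plain text-only attention scores.
    """
    tt = matmul_t(q_text, k_text)  # content-to-content
    ts = matmul_t(q_text, k_box)   # content-to-spatial
    st = matmul_t(q_box, k_text)   # spatial-to-content
    ss = matmul_t(q_box, k_box)    # spatial-to-spatial
    n = len(q_text)
    scale = math.sqrt(len(q_text[0]))
    return [[(tt[i][j] + lam_ts * ts[i][j] + lam_st * st[i][j]
              + lam_ss * ss[i][j]) / scale
             for j in range(n)] for i in range(n)]

def attention_weights(scores):
    """Row-wise softmax turns scores into attention weights."""
    return [softmax(row) for row in scores]

# Toy input: two tokens, 2-d text vectors and 2-d box embeddings.
q_text = [[1.0, 0.0], [0.0, 1.0]]
k_text = [[1.0, 0.0], [0.0, 1.0]]
q_box = [[1.0, 1.0], [1.0, 1.0]]
k_box = [[0.5, 0.5], [2.0, 0.0]]

text_only = disentangled_scores(q_text, k_text, q_box, k_box, 0.0, 0.0, 0.0)
with_layout = disentangled_scores(q_text, k_text, q_box, k_box)
weights = attention_weights(with_layout)
```

The point of the toy: the layout terms change who attends to whom without a single pixel entering the model.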

The model is LLaMA-2 7B with these spatial attention parameters bolted on, pre-trained via text infilling on visually rich documents, then instruction-tuned on four task types (visual question answering, natural language inference, key information extraction, document classification) across a pile of enterprise benchmarks. Fourteen datasets total. It outperforms GPT-4 with vision on several of them.
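
Text infilling, in case the term is unfamiliar: hide a chunk of the document, show the model everything around it, and make it write the missing chunk. A minimal sketch of how one training example might be assembled follows. The sentinel tokens, the function name, and the idea of masking whole OCR blocks by index are assumptions for illustration, not the paper's exact recipe, and the bounding boxes that would travel alongside each token are left out.

```python
# Hypothetical sentinel tokens; a real tokenizer would define its own.
MASK, START, END = "<mask>", "<start>", "<end>"

def make_infilling_example(blocks, masked):
    """Build one infilling training sequence from OCR text blocks.

    blocks: list of token lists, one per text block on the page.
    masked: indices of the blocks to hide.
    Returns (sequence, target_start): the visible context with a
    placeholder per hidden block, followed by the hidden blocks to
    predict, plus the index where the prediction targets begin.
    """
    masked = set(masked)
    context, targets = [], []
    for i, block in enumerate(blocks):
        if i in masked:
            context.append(MASK)                   # placeholder in the page
            targets.extend([START, *block, END])   # what the model must write
        else:
            context.extend(block)
    return context + targets, len(context)

# Toy page with three blocks; hide the "Total Due" label.
blocks = [["Invoice", "#", "1042"], ["Total", "Due"], ["$", "41.00"]]
sequence, target_start = make_infilling_example(blocks, [1])
```

Because the hidden block is moved to the end, the model sees text on both sides of the gap before it has to fill it, which suits documents where the relevant context is often to the right of or below the field.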

A 7B model, built at a bank, reading bank documents better than GPT-4.

The architecture choice is almost insultingly direct — the location of text on a page is information, so represent it as information, feed it to the attention mechanism, done — and it works precisely because nobody at the frontier labs needed to solve this problem badly enough to solve it well. JPMorgan needed to. Their documents are the dataset.

Counterpoints