{"version":"v1","site":{"name":"expectedwrong","url":"https://expectedwrong.com"},"links":{"collection":"https://expectedwrong.com/api/public/posts","rss":"https://expectedwrong.com/rss.xml","llms":"https://expectedwrong.com/llms.txt"},"post":{"slug":"jpmorgan-docllm-layout-aware-llm","title":"JPMorgan Teaches a Language Model to Read Like a Bank","subtitle":"DocLLM skips the vision encoder entirely and beats GPT-4 on the documents that actually matter.","url":"https://expectedwrong.com/jpmorgan-docllm-layout-aware-llm","api_url":"https://expectedwrong.com/api/public/posts/jpmorgan-docllm-layout-aware-llm","published_at":1704715200,"published_at_iso":"2024-01-08T12:00:00.000Z","updated_at":1771537045,"updated_at_iso":"2026-02-19T21:37:25.000Z","tags":["nlp","llm","document-ai","finance","architecture"],"excerpt":"DocLLM skips the vision encoder entirely and beats GPT-4 on the documents that actually matter.","meta_description":"DocLLM skips the vision encoder entirely and beats GPT-4 on the documents that actually matter.","reading_time_minutes":2,"word_count":291,"engagement":{"signals":0,"counterpoints":0},"body_markdown":"JPMorgan is back. Last week it was the earlier DocLLM work. This week it's the full paper — same lab, same problem, more benchmarks.\n\nThe problem is one of those things that sounds trivial until you think about it for thirty seconds: enterprise documents have *layout*. A form is not a paragraph. An invoice is not a story. The bounding box of a field labeled \"Total Due\" carries information that the text alone doesn't — specifically, that it's in the bottom-right corner next to a number, which is where totals live, which is the whole point.\n\nStandard LLMs ignore this. They flatten everything to a token stream and pray.\n\nThe multimodal crowd's answer is to shove the page image through a vision encoder first. DocLLM's answer is: don't. 
Just take the OCR output, attach bounding box coordinates, and inject them directly into the attention mechanism — disentangled from the text attention, running in parallel, four separate interaction types (content-to-content, content-to-spatial, spatial-to-content, spatial-to-spatial) combined at the end.\n\nNo ViT. No CNN. No expensive visual encoding step.\n\nThe model is LLaMA-2 7B with these spatial attention parameters bolted on, pre-trained via text infilling on visually rich documents, then instruction-tuned on four task types across a pile of enterprise benchmarks. Sixteen datasets total. It outperforms GPT-4 with vision on several of them.\n\nA 7B model, built at a bank, reading bank documents better than GPT-4.\n\nThe architecture choice is almost insultingly direct — the location of text on a page is information, so represent it as information, feed it to the attention mechanism, done — and it works precisely because nobody at the frontier labs needed to solve this problem badly enough to solve it well. JPMorgan needed to. Their documents are the dataset.","body_text":"JPMorgan is back. Last week it was the earlier DocLLM work. This week it's the full paper — same lab, same problem, more benchmarks. The problem is one of those things that sounds trivial until you think about it for thirty seconds: enterprise documents have layout. A form is not a paragraph. An invoice is not a story. The bounding box of a field labeled \"Total Due\" carries information that the text alone doesn't — specifically, that it's in the bottom-right corner next to a number, which is where totals live, which is the whole point. Standard LLMs ignore this. They flatten everything to a token stream and pray. The multimodal crowd's answer is to shove the page image through a vision encoder first. DocLLM's answer is: don't. 
Just take the OCR output, attach bounding box coordinates, and inject them directly into the attention mechanism — disentangled from the text attention, running in parallel, four separate interaction types (content-to-content, content-to-spatial, spatial-to-content, spatial-to-spatial) combined at the end. No ViT. No CNN. No expensive visual encoding step. The model is LLaMA-2 7B with these spatial attention parameters bolted on, pre-trained via text infilling on visually rich documents, then instruction-tuned on four task types across a pile of enterprise benchmarks. Sixteen datasets total. It outperforms GPT-4 with vision on several of them. A 7B model, built at a bank, reading bank documents better than GPT-4. The architecture choice is almost insultingly direct — the location of text on a page is information, so represent it as information, feed it to the attention mechanism, done — and it works precisely because nobody at the frontier labs needed to solve this problem badly enough to solve it well. JPMorgan needed to. Their documents are the dataset.","hindsight":{"verdict":"evolved","note":"the problem was real — documents have layout. but vision-language models solved it differently. GPT-4o and claude now just look at the page. the bounding-box-without-vision approach lost to actually using vision.","links":[],"at":1739980800,"at_iso":"2025-02-19T16:00:00.000Z"}}}