The problem with documents
Most business information lives in documents — PDFs, scanned forms, email attachments, printed reports. The information is there. A human can read it and act on it. But software cannot, not directly. You cannot run a formula across 500 invoice PDFs the way you would run one across a spreadsheet column. You cannot query a scanned form the way you would query a database table.
Structured data extraction is what bridges that gap: it reads documents and converts their contents into a format that software can actually use.
What "structured" actually means
The distinction between structured and unstructured data is simpler than it sometimes sounds. Structured data has an explicit schema — columns with names, rows representing discrete records. A spreadsheet, a database table, a CSV file. You can query it, filter it, aggregate it, and feed it into other systems without interpretation.
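To make "explicit schema" concrete, here is a minimal sketch using Python's standard `csv` module. The column names and the three sample rows are invented for illustration; the point is only that named columns let you filter and aggregate without interpreting anything.

```python
import csv
import io

# Hypothetical sample data: a tiny CSV whose header row is the schema.
CSV_TEXT = """invoice_number,vendor,total
10042,Acme Ltd,1200.00
10043,Globex,350.50
10044,Acme Ltd,99.50
"""

# Each row becomes a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Filter: keep only one vendor's records.
acme_rows = [r for r in rows if r["vendor"] == "Acme Ltd"]

# Aggregate: sum a numeric column across the filtered records.
acme_total = sum(float(r["total"]) for r in acme_rows)

print(len(acme_rows), acme_total)
```

Nothing here needs to know what an invoice looks like on the page; the column names carry all the structure.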
Unstructured data (a PDF invoice, a photograph of a receipt, a scanned contract) contains information arranged visually rather than in defined rows and columns. The information is real and often important, but extracting it requires understanding layout, context, and meaning. That kind of interpretation is what language models are well suited to.
How modern extraction works
The process combines two layers. First, the document is converted from an image or PDF into machine-readable text. For scans and photographs this is the OCR step, which handles the basic character recognition; digitally generated PDFs usually carry a text layer that can be read directly. Second, a language model reads that text and identifies the value for each requested field based on semantic understanding rather than fixed positions.
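The two layers can be sketched as two functions chained together. Everything in this sketch is a placeholder: `ocr_to_text`, `extract_fields`, and `LABEL_SYNONYMS` are hypothetical names, the "OCR" step just decodes bytes, and a small synonym lookup stands in for the language model so that the example runs offline. A real system would call an OCR engine and an LLM at those two points.

```python
import re
from typing import Dict, List, Optional

# Hypothetical label synonyms. A language model would infer these from
# context; the lookup table is only a stand-in so the sketch runs offline.
LABEL_SYNONYMS = {
    "invoice_number": ["Invoice #", "Ref.", "Invoice No."],
    "total": ["Total", "Amount due"],
}


def ocr_to_text(document_bytes: bytes) -> str:
    """Layer 1 placeholder: a real pipeline would run OCR here.
    In this sketch the 'document' is already UTF-8 text."""
    return document_bytes.decode("utf-8")


def extract_fields(text: str, fields: List[str]) -> Dict[str, Optional[str]]:
    """Layer 2 placeholder: stands in for the language model call.
    Returns one value (or None) per requested field."""
    record: Dict[str, Optional[str]] = {}
    for field in fields:
        record[field] = None
        for label in LABEL_SYNONYMS.get(field, []):
            # Find the label anywhere in the text, not at a fixed position.
            match = re.search(re.escape(label) + r"\s*:?\s*([\w.,]+)", text)
            if match:
                record[field] = match.group(1)
                break
    return record


def process(document_bytes: bytes, fields: List[str]) -> Dict[str, Optional[str]]:
    """Chain the two layers: bytes -> text -> one record."""
    return extract_fields(ocr_to_text(document_bytes), fields)


sample = b"ACME LTD\nRef. 10042\nAmount due: 1,200.00\n"
record = process(sample, ["invoice_number", "total", "due_date"])
print(record)
```

The useful part of the design is the interface rather than the stand-in logic: the caller names the fields it wants, and gets back one record with an explicit `None` for anything that could not be found.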
What the LLM adds is interpretation. It can recognise that "Ref. 10042" and "Invoice #10042" refer to the same type of information. It can infer a due date from a payment term and an issue date. It can extract a line-item table even when the column headers are non-standard. This contextual reasoning is what lets modern extraction cope with documents that vary in layout, which in practice describes most real-world documents.
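The due-date inference is ordinary date arithmetic once the two inputs have been identified. As a rough sketch, assuming the payment term has the common "Net N" form (the function name and the sample values are illustrative, and an LLM would handle far more phrasings than this pattern does):

```python
import re
from datetime import date, timedelta
from typing import Optional


def infer_due_date(issue_date: date, payment_terms: str) -> Optional[date]:
    """Derive a due date from terms such as 'Net 30' or 'NET30'.
    Returns None when the terms do not match that pattern."""
    match = re.search(r"net\s*(\d+)", payment_terms, flags=re.IGNORECASE)
    if match is None:
        return None
    return issue_date + timedelta(days=int(match.group(1)))


due = infer_due_date(date(2024, 1, 15), "Net 30")
print(due)  # 2024-02-14
```

Returning `None` for unrecognised terms, rather than guessing, keeps the gap visible in the output table so it can be reviewed.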
The output
The result is a table where each row corresponds to one source document and each column corresponds to one extracted field. This table can be imported into any system that accepts structured data: accounting software, CRMs, ERPs, or a simple Excel workbook. Fields can be joined to other datasets using common identifiers, aggregated for reporting, or passed to downstream automation.
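As an illustration of that shape, the sketch below writes two extracted records to CSV with the standard `csv` module. The field names, file names, and values are made up for the example; a `source_file` column is included on the assumption that you will want to trace each row back to its document.

```python
import csv
import io

# Hypothetical field list: one column per extracted field.
FIELDS = ["source_file", "invoice_number", "vendor", "total"]

# Hypothetical extraction results: one dict per source document.
records = [
    {"source_file": "inv_001.pdf", "invoice_number": "10042",
     "vendor": "Acme Ltd", "total": "1200.00"},
    {"source_file": "inv_002.pdf", "invoice_number": "10043",
     "vendor": "Globex", "total": "350.50"},
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text can be opened directly in Excel or loaded by any tool that accepts CSV.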
The underrated value of this output format is not just the import; it is the queryability. Once documents become a table, questions that were previously impractical to answer become trivial: what is the total invoice value from this vendor over the last year? Which expense receipts are missing a VAT breakdown? Which deliveries do not have a matching purchase order? Those are simple queries on a spreadsheet, but the answers are effectively buried inside a folder of PDFs.
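Those three questions translate almost line for line into code once the data is tabular. The sketch below uses plain Python lists of dicts with invented rows and column names; the same logic could equally be a spreadsheet filter or a SQL query.

```python
from collections import defaultdict

# Hypothetical extracted tables (all values invented for illustration).
invoices = [
    {"vendor": "Acme Ltd", "invoice_date": "2022-11-20", "total": 500.0},
    {"vendor": "Acme Ltd", "invoice_date": "2024-03-01", "total": 1200.0},
    {"vendor": "Acme Ltd", "invoice_date": "2024-06-15", "total": 300.0},
    {"vendor": "Globex", "invoice_date": "2024-04-10", "total": 75.0},
]
receipts = [
    {"receipt_id": "R1", "vat_amount": 4.20},
    {"receipt_id": "R2", "vat_amount": None},
]
deliveries = [
    {"delivery_id": "D1", "po_number": "PO-100"},
    {"delivery_id": "D2", "po_number": "PO-999"},
]
purchase_orders = [{"po_number": "PO-100"}]

# 1. Total invoice value per vendor within a date window.
#    ISO-formatted dates compare correctly as plain strings.
since = "2024-01-01"
totals_by_vendor = defaultdict(float)
for row in invoices:
    if row["invoice_date"] >= since:
        totals_by_vendor[row["vendor"]] += row["total"]

# 2. Receipts with no VAT breakdown.
missing_vat = [r["receipt_id"] for r in receipts if r["vat_amount"] is None]

# 3. Deliveries whose PO number matches no known purchase order.
known_pos = {po["po_number"] for po in purchase_orders}
unmatched = [d["delivery_id"] for d in deliveries
             if d["po_number"] not in known_pos]

print(dict(totals_by_vendor), missing_vat, unmatched)
```

The third query is the join on a common identifier: it only works because both tables carry a `po_number` column.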
What kinds of documents work well
Any document type with a consistent repeating structure is a strong candidate: invoices, purchase orders, expense receipts, bank statements, healthcare claim forms, customs declarations, survey responses, rental agreements. The key requirement is that the same fields appear across instances — not that every instance looks identical.
Highly irregular documents — free-form correspondence, narrative reports, documents with no consistent field structure — are harder to extract in a structured way. Some people use LLMs to pull key information from these too, but the output tends to be less consistent. For those cases, search and summarisation are probably better starting points than structured field extraction.
Why it matters
If your business regularly processes documents that contain information you need to act on, every hour spent manually reading and re-typing that information is an hour not spent on something more valuable. The compounding effect is significant: at low volumes the time cost is manageable; at scale it becomes a bottleneck. Structured extraction eliminates the bottleneck without requiring a change to the documents themselves — it meets them where they are.