19 March 2026 | 16 min read

How to Extract Tables from PDFs in Python Using Camelot

A deep dive into Camelot's four parsing strategies — Lattice, Stream, Network, and Hybrid — with advanced configuration, visual debugging, and production tips for Python table extraction.

Why PDF Table Extraction Is Hard

Tables are one of the most information-dense structures in business documents — financial reports, invoices, scientific papers, and regulatory filings all rely on them. Yet extracting table data programmatically from PDF files remains surprisingly difficult. The reason lies in how the PDF format works internally: there is no concept of a "table" inside a PDF. What a human perceives as neatly organized rows and columns is actually a collection of individual characters placed at specific X-Y coordinates on a canvas. When a standard text extraction tool reads the file, it returns a flat stream of characters with no awareness of the spatial relationships that form the table.

This is exactly the problem that Camelot was designed to solve. Camelot is an open-source Python library that uses computer vision and text analysis techniques to detect table boundaries, reconstruct the row-column grid, and deliver clean results directly as Pandas DataFrames. In this guide, we will cover how Camelot works under the hood, walk through its four parsing strategies, and explore the advanced configuration options that make it one of the most powerful open-source table extraction tools available.

How Camelot Detects Tables Under the Hood

Before diving into code, it is worth understanding why Camelot is more than a simple text parser. The library employs fundamentally different algorithmic strategies depending on the visual properties of the table, which is why it offers multiple parsing "flavors." The diagram below illustrates the decision flow.

[Figure 1 shows Camelot's parser-selection flow: a PDF page is first checked for visible grid lines. If they exist, the Lattice parser applies morphological transforms via OpenCV, detects line segments, intersections, and contours, and maps coordinates back to PDF space. Otherwise, if text elements form alignment networks, the Network parser identifies text bounding box alignments and grows the table from a seed element; if not, the Stream parser groups words into text rows by Y-axis overlap and infers column boundaries from whitespace. All paths end in structured DataFrame output.]

Figure 1: Camelot's internal decision tree. The parser selection determines whether table detection relies on image processing (Lattice), text coordinate analysis (Network/Stream), or a combination (Hybrid).

Lattice: Line-Based Detection

The Lattice parser is designed for tables that have explicit ruled lines — the kind of table where you can clearly see horizontal and vertical borders forming a grid. Rather than analyzing the text layer, Lattice converts the entire PDF page into an image and then applies a series of morphological transformations (erosion and dilation) using OpenCV to isolate the line segments. Once horizontal and vertical lines are detected, the parser computes their intersections by performing a logical AND operation on pixel intensities at overlay points. These intersections define the cell corners of the table. Table boundaries are determined by performing a logical OR across all detected segments, producing the outer contour.
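The intersection step can be illustrated with a toy NumPy sketch. This is not Camelot's actual implementation: the binary array stands in for the rasterized page, and the "full row/column of dark pixels" test stands in for OpenCV's erosion and dilation with long, thin structuring elements.

```python
import numpy as np

# Toy binary "page image": 1 = dark pixel, 0 = background.
# Two horizontal rules (rows 2 and 6) and two vertical rules
# (columns 1 and 8) form a simple one-cell grid.
img = np.zeros((9, 10), dtype=np.uint8)
img[2, :] = 1   # top horizontal line
img[6, :] = 1   # bottom horizontal line
img[:, 1] = 1   # left vertical line
img[:, 8] = 1   # right vertical line

# Isolate line segments by direction. In this toy version, a row or
# column that is entirely dark counts as a detected line.
h_mask = np.zeros_like(img)
h_mask[img.sum(axis=1) == img.shape[1], :] = 1
v_mask = np.zeros_like(img)
v_mask[:, img.sum(axis=0) == img.shape[0]] = 1

# Cell corners = logical AND of the two masks: pixels that lie on
# both a horizontal and a vertical line.
joints = np.argwhere(h_mask & v_mask)
print(joints.tolist())  # [[2, 1], [2, 8], [6, 1], [6, 8]]
```

The four detected joints are exactly the corners of the single cell, which is the information Lattice uses to define the grid.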

Because the PDF page dimensions and the rasterized image dimensions differ, Lattice then scales and translates all detected coordinates back into PDF coordinate space. Finally, it assigns the text characters found on the page to each cell based on their X-Y positions relative to the reconstructed grid. This approach is highly deterministic: if the lines exist, Lattice will find them.
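The mapping back to PDF space is a scale in each axis plus a y-axis flip, since image coordinates grow downward from the top-left while PDF user space grows upward from the bottom-left. The helper below is a hypothetical illustration of that arithmetic, not part of Camelot's API.

```python
def image_to_pdf_coords(x_img, y_img, img_width, img_height,
                        pdf_width, pdf_height):
    """Map a pixel coordinate (origin top-left, y grows downward)
    into PDF user space (origin bottom-left, y grows upward)."""
    scale_x = pdf_width / img_width
    scale_y = pdf_height / img_height
    x_pdf = x_img * scale_x
    y_pdf = pdf_height - y_img * scale_y  # flip the y-axis
    return x_pdf, y_pdf

# A 1190x1684 rendering of a 595x842 point (A4) page is scaled
# by 0.5 in each axis, and the y-axis is flipped.
print(image_to_pdf_coords(200, 400, 1190, 1684, 595, 842))  # (100.0, 642.0)
```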

Stream: Whitespace-Based Inference

Stream handles the opposite scenario — tables that rely on whitespace rather than visible borders to separate cells. This parser works entirely on the text layer, grouping characters into words and sentences using the text layout engine from PDFMiner. The algorithm proceeds in several stages:

  1. Words are grouped into text rows based on overlapping Y-axis coordinates.
  2. "Text edges" — consistent vertical alignments across multiple rows — are calculated to identify likely table regions. This technique draws from Anssi Nurminen's research on automatic table recognition in digital documents.
  3. The number of columns is estimated by computing the mode of word counts per row, then selecting representative words from each row to establish column X-coordinate ranges.
  4. These column ranges are iteratively extended by incorporating words that fall just inside or outside the current boundaries.
  5. A final table structure is assembled from the row Y-ranges and column X-ranges, and each word is assigned to its corresponding cell.
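Steps 1 and 3 above can be sketched in a few lines of plain Python. The word tuples and the exact-y grouping are invented for illustration; the real parser takes PDFMiner layout objects and tolerates small y offsets.

```python
from collections import defaultdict
from statistics import mode

# Hypothetical word boxes: (x0, y_baseline, text). In Camelot these
# would come from PDFMiner's layout analysis.
words = [
    (50, 700, "Item"), (200, 700, "Qty"), (300, 700, "Price"),
    (50, 680, "Bolt"), (200, 680, "12"),  (300, 680, "0.40"),
    (50, 660, "Nut"),  (200, 660, "30"),  (300, 660, "0.15"),
]

# Step 1: group words into text rows by (near-)equal y-coordinate.
rows = defaultdict(list)
for x0, y, text in words:
    rows[round(y)].append((x0, text))

# Step 3: estimate the column count as the mode of words-per-row.
n_cols = mode(len(r) for r in rows.values())

# Assemble the grid: rows sorted top-to-bottom (descending y in PDF
# space), cells within each row sorted left-to-right by x.
table = [[text for x0, text in sorted(cells)]
         for y, cells in sorted(rows.items(), reverse=True)]
print(n_cols)    # 3
print(table[0])  # ['Item', 'Qty', 'Price']
```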

Because Stream relies on heuristics rather than deterministic line detection, it is inherently less precise than Lattice. However, it is the only viable option when the table has no visible borders.

Network: Alignment-Based Discovery

The Network parser takes a different approach from both Lattice and Stream. Instead of looking for graphical lines or whitespace gaps, it analyzes whether the bounding boxes of text elements share common coordinate alignments — top, center, bottom on the horizontal axis, or left, right, center on the vertical axis. Text elements that participate in alignments along both axes form a "network," indicating they are likely part of a structured table.

The parser prunes isolated elements that are aligned along only one axis (since, for example, lines of a paragraph are all left-aligned but do not form a table). It then identifies the element with the most connections as a seed and iteratively grows the table bounding box outward, absorbing nearby aligned elements until no more are found. This makes the Network parser particularly effective at discovering tables embedded within mixed-content pages where other layout elements surround the table.
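The pruning idea can be sketched with counters over shared edge coordinates. The bounding boxes below are invented, and real alignment detection uses tolerances and multiple alignment types (left, right, center) rather than exact matches.

```python
from collections import Counter

# Hypothetical text bounding boxes: (left_x, top_y, text).
boxes = [
    (50, 700, "Name"), (150, 700, "Score"),
    (50, 680, "Ada"),  (150, 680, "91"),
    (50, 660, "Alan"), (150, 660, "87"),
    (50, 600, "This paragraph line shares only the left edge."),
]

left_counts = Counter(b[0] for b in boxes)  # vertical alignments (columns)
top_counts = Counter(b[1] for b in boxes)   # horizontal alignments (rows)

# Keep only elements aligned along BOTH axes. The paragraph line is
# left-aligned with the table but shares no row, so it is pruned.
network = [b for b in boxes
           if left_counts[b[0]] > 1 and top_counts[b[1]] > 1]

# The most-connected element would serve as the seed for growing
# the table bounding box outward.
seed = max(network, key=lambda b: left_counts[b[0]] + top_counts[b[1]])
print([b[2] for b in network])  # ['Name', 'Score', 'Ada', '91', 'Alan', '87']
```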

Hybrid: The Best of Both Worlds

The Hybrid parser combines Network and Lattice by running both independently and then merging their results. Where both parsers detect a table in the same region, Hybrid uses the Network parser's text-based cell identification but enhances it with the more geometrically precise row and column boundaries from Lattice's line detection. This combination tends to produce the most accurate results on documents that have partial grid lines — a border at the top and bottom of the table but no internal cell borders, for instance.

Installing Camelot

Camelot requires a few system-level dependencies beyond the Python package itself. The primary external dependency is Ghostscript, which Camelot uses as one of its available PDF-to-image conversion backends (though newer versions default to pypdfium2).

Start by installing the Python package with OpenCV support, which is necessary for the Lattice parser's morphological operations:

pip install "camelot-py[cv]"

Next, install Ghostscript on your operating system if it is not already present:

  • macOS: brew install ghostscript
  • Linux (Ubuntu/Debian): sudo apt-get install ghostscript
  • Windows: Download the installer from the Ghostscript website.

If you plan to use the visual debugging features covered later in this guide, also install the plotting extension:

pip install "camelot-py[plot]"

To verify the installation, open a Python shell and import the library:

import camelot
print(camelot.__version__)

If this prints the version number without errors, you are ready to proceed.

Extracting Tables with Each Parser

Lattice Mode: Ruled Tables

The Lattice parser is the right choice when your table has clearly visible grid lines. The following code reads a financial report and extracts all tables from the first page using line-based detection.

import camelot

# Read tables from page 1 using the Lattice parser
# Lattice detects explicit ruled lines in the PDF
tables = camelot.read_pdf(
    "financial_report.pdf",
    pages="1",
    flavor="lattice"
)

# Print how many tables were found on the page
print(f"Tables found: {tables.n}")

Each element in the returned TableList object is a Table with a .df property that holds the data as a Pandas DataFrame. Camelot also provides built-in quality metrics — accuracy and whitespace — that let you programmatically assess extraction quality without manually inspecting every table.

# Inspect the first table
table = tables[0]

# Access the DataFrame representation
print(table.df.head())

# Check extraction quality metrics
# Higher accuracy and lower whitespace indicate a cleaner extraction
print(f"Accuracy: {table.parsing_report['accuracy']}")
print(f"Whitespace: {table.parsing_report['whitespace']}")

Stream Mode: Borderless Tables

When a table uses whitespace to visually separate columns instead of drawn lines, the Lattice parser will find nothing. Switch to the Stream flavor instead. Because Stream relies on inferring column positions from text alignment, it benefits from specifying table_areas when the automatic detection picks up surrounding content.

import camelot

# Use the Stream parser for tables without visible borders
tables = camelot.read_pdf(
    "whitespace_table.pdf",
    pages="1",
    flavor="stream"
)

print(f"Tables found: {tables.n}")
print(tables[0].df)

Network Mode: Complex Layouts

The Network parser excels on pages where multiple content types coexist — paragraphs, figures, and tables sharing the same page. It identifies tables by finding clusters of text elements that exhibit systematic alignment patterns.

import camelot

# Use the Network parser for mixed-content pages
tables = camelot.read_pdf(
    "research_paper.pdf",
    pages="3",
    flavor="network"
)

print(f"Tables found: {tables.n}")
for i, t in enumerate(tables):
    print(f"\nTable {i + 1}:")
    print(t.df.head())

Hybrid Mode: Partial Borders

For tables that have some visible lines but not a complete grid, the Hybrid parser combines the precision of Lattice's line detection with Network's text-alignment analysis.

import camelot

# Hybrid uses both line detection and text alignment
tables = camelot.read_pdf(
    "mixed_borders.pdf",
    pages="1",
    flavor="hybrid"
)

print(f"Tables found: {tables.n}")
print(tables[0].df)

Exporting Extracted Tables

Once tables have been extracted into DataFrames, Camelot provides built-in export methods for the most common data formats used in downstream analytics and ETL pipelines.

The following snippet exports all detected tables to a compressed archive of CSV files and a single table to JSON format:

# Export all tables as a ZIP containing individual CSV files
tables.export("extracted_tables.csv", f="csv", compress=True)

# Export a specific table to JSON
tables[0].to_json("first_table.json")

# Additional supported formats: Excel, HTML, Markdown, SQLite
tables[0].to_excel("first_table.xlsx")
tables[0].to_html("first_table.html")

Because each table is a standard Pandas DataFrame, you can also bypass the built-in exporters entirely and use any Pandas I/O method you need — writing to a SQL database, Parquet, or a cloud storage connector.
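As a sketch of that last point, the snippet below writes a DataFrame to SQLite using Pandas alone. The DataFrame here is a stand-in for a real `tables[0].df`, and the table name is invented.

```python
import sqlite3
import pandas as pd

# Stand-in for `tables[0].df`; in a real pipeline this comes from
# camelot.read_pdf(...).
df = pd.DataFrame({"item": ["Bolt", "Nut"], "qty": [12, 30]})

# Write straight to SQLite, bypassing Camelot's built-in exporters.
conn = sqlite3.connect(":memory:")
df.to_sql("extracted", conn, index=False, if_exists="replace")

count = pd.read_sql("SELECT COUNT(*) AS n FROM extracted", conn)["n"][0]
print(count)  # 2

# Parquet works the same way (requires pyarrow or fastparquet):
# df.to_parquet("extracted.parquet")
```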

Advanced Configuration and Debugging

Specifying Table Areas

When automatic table detection incorrectly includes headers, footers, or adjacent content in the extracted table, you can manually define the exact bounding box using the table_areas parameter. Coordinates follow PDF coordinate space, where the origin (0, 0) is at the bottom-left corner of the page.

import camelot

# Specify an exact bounding box: "x1,y1,x2,y2"
# (x1, y1) = top-left corner, (x2, y2) = bottom-right corner
tables = camelot.read_pdf(
    "complex_layout.pdf",
    flavor="stream",
    table_areas=["316,499,566,337"]
)

print(tables[0].df)

Processing Background Lines

Some PDFs render table lines as background elements rather than foreground strokes. By default, Lattice ignores these. To include them, enable process_background:

tables = camelot.read_pdf(
    "background_lines.pdf",
    flavor="lattice",
    process_background=True
)

Visual Debugging with Plot

One of Camelot's most valuable features for troubleshooting is the plot() method, which generates matplotlib visualizations of the detected elements. This allows you to see exactly what Camelot "sees" on the page and adjust parameters accordingly.

import camelot

tables = camelot.read_pdf("financial_report.pdf", flavor="lattice")

# Visualize what Camelot detected
# Supported kinds: 'text', 'grid', 'contour', 'line', 'joint', 'textedge'
camelot.plot(tables[0], kind="grid").show()

The kind parameter controls which detection layer is visualized: 'text' shows identified text regions, 'grid' overlays the reconstructed table grid, 'line' and 'joint' display the detected line segments and their intersections (Lattice only), and 'textedge' shows the inferred text alignment boundaries (Stream only). This is an invaluable tool for understanding why a particular table was not extracted correctly.

Multi-Page Extraction

Batch processing across multiple pages is straightforward. The pages parameter accepts individual page numbers, ranges, and comma-separated combinations:

import camelot

# Extract tables from pages 1 through 5
tables = camelot.read_pdf("annual_report.pdf", pages="1-5", flavor="lattice")

print(f"Total tables across all pages: {tables.n}")

# Iterate through all extracted tables
for i, table in enumerate(tables):
    print(f"\nTable {i + 1} (Page {table.page}):")
    print(f"  Shape: {table.df.shape}")
    print(f"  Accuracy: {table.parsing_report['accuracy']}")

For very large documents, extracting all pages at once can be memory-intensive. In these cases, processing the document in smaller batches (for example, passing pages="1-50", then pages="51-100") keeps resource usage under control.
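A small helper can generate those batch strings. This is a hypothetical utility, not part of Camelot; only the commented-out loop at the end touches the Camelot API.

```python
def page_batches(total_pages, batch_size):
    """Yield Camelot-style page-range strings ("1-50", "51-100", ...)
    so a large document can be processed one chunk at a time."""
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        yield str(start) if start == end else f"{start}-{end}"

print(list(page_batches(120, 50)))  # ['1-50', '51-100', '101-120']

# Typical usage, processing and releasing one batch at a time:
# for pages in page_batches(120, 50):
#     tables = camelot.read_pdf("annual_report.pdf", pages=pages,
#                               flavor="lattice")
#     ...  # export or persist, then let the batch go out of scope
```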

Where Camelot Falls Short

Despite its sophistication, Camelot has fundamental limitations that are important to understand before committing to it in a production pipeline.

Scanned PDFs are invisible to Camelot. Because every parser depends on either the text layer (Stream, Network) or rasterized line detection followed by text-coordinate mapping (Lattice), Camelot cannot extract anything from a PDF that is a scanned image. There is no embedded text to read. Running OCR with a tool like pytesseract before passing the PDF to Camelot is technically possible, but the OCR process typically destroys the spatial precision that Camelot depends on for accurate table reconstruction.

Merged cells and nested headers cause fragmentation. Camelot's cell assignment algorithm maps words to grid cells based on coordinate overlap. When cells span multiple rows or columns, the boundaries become ambiguous, and the resulting DataFrame often splits what should be a single cell across multiple entries or assigns content to the wrong cell entirely.
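When the fragmentation follows a predictable pattern, it can often be repaired after extraction. The snippet below shows one common cleanup: forward-filling a column where a merged cell came out as empty strings on the continuation rows. This is a generic Pandas pattern, not a Camelot feature, and the sample data is invented.

```python
import pandas as pd

# A fragmented extraction where a merged "Region" cell spanning two
# rows came out as an empty string on the second row.
df = pd.DataFrame({
    "Region": ["North", "", "South"],
    "Product": ["Bolts", "Nuts", "Bolts"],
    "Sales": [120, 80, 95],
})

# Treat empty strings as missing, then forward-fill so each row
# carries the value of the merged cell above it.
df["Region"] = df["Region"].replace("", pd.NA).ffill()
print(df["Region"].tolist())  # ['North', 'North', 'South']
```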

Background artifacts interfere with Lattice detection. Watermarks, shading, colored backgrounds, and decorative elements can produce spurious line segments during the morphological transformation step, causing Lattice to detect phantom table borders. The process_background flag mitigates some of these issues, but heavily styled documents remain challenging.

System dependencies create deployment friction. Ghostscript and Tkinter are not trivial to install on headless servers or inside Docker containers, which can make Camelot difficult to integrate into containerized CI/CD pipelines or serverless functions.

For a broader perspective on how Camelot compares to other extraction tools, see our comparison of the top 5 Python PDF libraries.

When Rule-Based Parsing Reaches Its Limits

Camelot works excellently when the documents you process are consistent — the same vendor always sends the same table format, the same financial report template is reused quarter after quarter. The moment document layouts start varying, however, rule-based tools require constant recalibration. Tweaking table_areas, adjusting line_scale for Lattice, or tuning edge_tol for Stream on a per-template basis quickly becomes a maintenance burden that scales with the number of document sources you handle.

This is the fundamental difference between rule-based extraction and structured data extraction powered by machine learning. Where Camelot requires you to describe the visual structure of the table through parameters, a model-based approach learns to recognize table structures from the visual and semantic content of the document itself — regardless of whether the PDF is digital or scanned, and regardless of how many different layouts appear in the pipeline.

nolainocr uses Vision Transformers and large language models to process documents end-to-end: layout detection, table reconstruction, cell merging, and field extraction happen in a single pass. It handles scanned documents, digital PDFs, merged cells, and multi-line content automatically — without coordinate tuning, Ghostscript dependencies, or custom parsing logic per document type. If your workflow involves processing tables from diverse sources, it is worth considering a solution that does not require parameter maintenance for every new document template.

Frequently Asked Questions

What types of tables does Camelot work best on?
Camelot excels at tables in digital (non-scanned) PDFs. The Lattice parser is most reliable on tables with complete, visible grid lines. The Stream parser handles borderless tables where columns are separated by whitespace. The newer Network and Hybrid parsers improve accuracy on complex pages with partial borders or mixed content.
Can Camelot extract tables from scanned PDFs?
No. Camelot requires an embedded text layer to function. Scanned PDFs contain only image data with no selectable text. You would need to OCR the document first, but this typically degrades the spatial precision needed for accurate table detection. For scanned documents, a Vision Transformer-based approach like nolainocr is more reliable.
How do I choose between Lattice, Stream, Network, and Hybrid?
Use Lattice when the table has clear, drawn borders. Use Stream when the table relies on whitespace alignment. Use Network for tables embedded in mixed-content pages. Use Hybrid when tables have partial borders. If you are unsure, start with Lattice and fall back to Stream or Network based on the results.
Why is Camelot returning empty results?
The most common causes are: the PDF is scanned (no text layer), the wrong parser flavor is selected (e.g., using Lattice on a borderless table), or the table is located in a region that the automatic detection algorithm missed. Use the plot() method to visualize what Camelot detects on the page, and try specifying table_areas manually.
How does Camelot compare to Tabula-py?
Both extract tables from digital PDFs. Camelot provides more configuration options and generally better accuracy on complex tables, but requires Ghostscript. Tabula-py wraps the Java Tabula engine, is faster on straightforward grids, but requires a Java Runtime. For a detailed comparison, see our Python PDF libraries comparison.
Can I use Camelot inside a Docker container?
Yes, but you will need to install Ghostscript and the OpenCV system dependencies in the container image. A typical Debian-based Dockerfile would include apt-get install -y ghostscript python3-tk alongside the pip install step. Some teams find this dependency chain adds unwanted complexity to their deployment pipeline.
What output formats does Camelot support?
Camelot can export tables to CSV, JSON, Excel (.xlsx), HTML, Markdown, and SQLite. Since extracted tables are Pandas DataFrames, you can also use any Pandas I/O method to write to SQL databases, Parquet, or cloud storage connectors.

Free tools from nolainocr

Need to work with PDF files before or after table extraction? nolainocr offers free browser-based tools — no sign-up required:

  • Merge PDF — combine multiple PDFs into one before processing
  • Split PDF — extract the specific pages containing your tables
  • PDF ↔ Images — convert PDF pages to PNG for visual inspection
  • Delete PDF pages — remove cover pages or appendices before extraction

For scanned PDFs where Camelot cannot be used, nolainocr uses Vision Transformers to extract tables from image-based documents without requiring an embedded text layer.


Ready to automate your documents?

Process your first batch free — no credit card required.


© 2025–2026 NOLAIN OCR. ALL RIGHTS RESERVED.