PDF Text Extraction with pdfplumber

The single most common way an automated grant pipeline corrupts itself is by treating a federal solicitation PDF as a flat stream of text. Announcements from the National Institutes of Health (NIH), the National Science Foundation (NSF), and the Department of Defense (DoD) encode compliance-critical rules — page limits, font floors, budget caps, submission deadlines — inside multi-column layouts, nested tables, running headers, and watermarked footers. A naive extractor reads a running header (“Page 7 of 42”) as body text, merges a two-column eligibility sidebar into one unreadable line, or detaches a footnote from the clause it modifies. Every one of those defects propagates silently downstream, so the extraction stage is the compliance foundation of the entire RFP ingestion and parsing workflow: if the coordinates are wrong here, no amount of later validation can recover the correct meaning. This page documents how to use pdfplumber as a coordinate-aware extraction engine that reconstructs the spatial hierarchy of a solicitation instead of flattening it.

Unlike parsers that hand you an undifferentiated blob, pdfplumber reads the underlying PDF content operators and rebuilds page geometry from the character up. Every glyph carries its bounding box, font name, and font size; every table can be recovered as a grid of rows and columns; every header band can be excluded by vertical position. That positional fidelity is exactly what lets a pipeline distinguish a mandatory instruction from administrative boilerplate before the text is ever handed to the NLP section boundary detector or the Pydantic schema validation layer.

Prerequisites and environment setup

The patterns on this page assume Python 3.10 or newer, because the code uses union syntax in type hints (str | None) and @dataclass(slots=True), both of which older interpreters reject. Pin the extraction dependencies explicitly so that a solicitation parsed today yields the same extraction when it is re-parsed after an agency amendment. The command below pins pdfplumber exactly; the pydantic specifier is only a lower bound, so freeze it to an exact version as well if you need strict reproducibility:

bash

python -m venv .venv
source .venv/bin/activate
pip install "pdfplumber==0.11.4" "pydantic>=2.6"

pdfplumber sits on top of pdfminer.six, which it installs transitively; you do not add it yourself. Two assumptions about the source documents matter for everything that follows:

Format. Federal solicitations are distributed as text-based PDFs, not scanned images. pdfplumber extracts embedded text and geometry — it does not perform optical character recognition. If a DoD Broad Agency Announcement (BAA) arrives as a scanned image, a page will yield zero characters, and the pipeline must route it to an OCR pre-step rather than accept an empty parse.
Version locking. Because NIH, NSF, and DoD revise their rules on independent cycles, the parser must record which document version it read. Hash every file at acquisition and store that hash alongside the extracted record, so the audit trail described in the parent ingestion workflow can prove which edition of a rule was in force.

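Open a document once and reuse the handle. pdfplumber.open() returns an object that works as a context manager and parses each page only when it is accessed, but every page caches its parsed objects after first use. On a 400-page application guide, memory stays bounded only if you process pages one at a time, do not hold references to every page object at once, and release each page's cache when you are finished with it (recent releases document page.close() for this purpose).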

Core mechanism — how pdfplumber reconstructs a page

pdfplumber exposes two complementary ways to pull text off a page, and choosing the wrong one is the most frequent cause of a broken extractor. The distinction is entirely about which metadata each method preserves:

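page.extract_words() groups characters into word tokens and returns dicts whose keys include text, x0, top, x1, bottom, doctop, and upright. It is convenient for reading order, but by default it exposes no font size and no font name. Passing extra_attrs=["fontname", "size"] adds them per word, at the cost of splitting words wherever either attribute changes.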
page.chars is the list of per-character dicts, each carrying text, x0, top, x1, bottom, size, fontname, upright, and more. Use this whenever a compliance rule depends on typography — for example, detecting that a section heading is set in a larger font, or verifying that body text never drops below an agency font floor.

The coordinate origin sits at the top-left of the page, so top increases downward and a smaller top means “higher on the page.” That single fact drives zone classification: a running header lives in the top band (small top), a footer in the bottom band (large top), and the compliance-bearing body sits between them. Font size separates a section title from the paragraph beneath it. Combining vertical position with font size lets a parser label each glyph before reading order is ever reconstructed.

python

import pdfplumber

def describe_page_geometry(pdf_path: str, page_index: int = 0) -> None:
    """Print the raw character metadata pdfplumber exposes for one page.

    Demonstrates why page.chars is the direct source when a compliance rule
    depends on font size or font name: extract_words() omits both by default.
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        print(f"page size: {page.width:.0f} x {page.height:.0f} pt")
        for ch in page.chars[:5]:
            # 'size' and 'fontname' are always present on page.chars;
            # extract_words() only carries them when requested via extra_attrs
            print(
                f"{ch['text']!r:>6}  x0={ch['x0']:.1f}  top={ch['top']:.1f}"
                f"  size={ch['size']:.1f}  font={ch['fontname']}"
            )

The values printed here are the primitives every later stage depends on. Once you can see the font size and vertical band of each glyph, rule-based zoning becomes a matter of filtering rather than guessing.
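As a rough sketch of that filtering step, the function below labels a single character dict by vertical band first and font size second. It works on plain dicts shaped like page.chars entries, so it runs without a PDF; the zone names and the 12-point heading threshold are assumptions for illustration, not values taken from any agency guide.

```python
from typing import Literal

Zone = Literal["header", "footer", "heading", "body"]


def classify_zone(
    ch: dict,
    page_height: float,
    header_margin: float = 72.0,
    footer_margin: float = 72.0,
    heading_min_size: float = 12.0,  # assumed threshold, tune per agency
) -> Zone:
    """Label one page.chars-style dict by vertical band, then by font size."""
    # 'top' is measured from the top edge of the page, so small values are high up
    if ch["top"] < header_margin:
        return "header"
    if ch["top"] > page_height - footer_margin:
        return "footer"
    if ch["size"] >= heading_min_size:
        return "heading"
    return "body"
```

Checking the vertical band before the font size matters: a large-font banner in the header should be discarded as marginalia, not promoted to a section heading.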

Coordinate-aware implementation

The production pattern extracts characters with their geometry, discards anything in the header or footer bands, and keeps only glyphs at or above a font-size floor. Grouping the survivors back into words is deliberately left to the caller, because the grouping tolerance depends on the column layout — a tight two-column NSF solicitation needs a smaller horizontal tolerance than a single-column NIH narrative.

python

import pdfplumber
from dataclasses import dataclass

@dataclass(slots=True)
class ExtractedChar:
    page: int
    text: str
    x0: float
    top: float
    x1: float
    bottom: float
    fontname: str
    size: float

def extract_zoned_chars(
    pdf_path: str,
    min_font_size: float = 9.0,
    header_margin: float = 72.0,
    footer_margin: float = 72.0,
) -> list[ExtractedChar]:
    """Extract compliance-relevant characters, dropping header/footer bands.

    Uses page.chars because extract_words() omits 'size' and 'fontname'
    by default. Vertical margins are in points (72 pt = 1 inch); anything
    within header_margin of the top or footer_margin of the bottom is
    treated as running header/footer marginalia and discarded.
    """
    kept: list[ExtractedChar] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            body_top = header_margin
            body_bottom = page.height - footer_margin
            for ch in page.chars:
                if ch["size"] < min_font_size:
                    continue
                if not (body_top <= ch["top"] <= body_bottom):
                    continue
                kept.append(
                    ExtractedChar(
                        page=page_num,
                        text=ch["text"],
                        x0=ch["x0"],
                        top=ch["top"],
                        x1=ch["x1"],
                        bottom=ch["bottom"],
                        fontname=ch.get("fontname", "unknown"),
                        size=ch["size"],
                    )
                )
    return kept

Spatial filtering here is what prevents page numbers, running titles, and watermarks from being ingested as if they were substantive requirements, a class of defect that otherwise triggers false compliance flags downstream. The kept characters come back in the order the PDF content stream emits them, which usually but not always matches visual reading order, so sort them by (top, x0) within each page before grouping. That sort yields the correct reading order for single-column body text; multi-column pages need a column-split pass before grouping, which is where the table extraction walkthrough becomes relevant.
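The grouping pass left to the caller might look like the sketch below. It assumes single-column text and works on plain dicts with text, x0, x1, and top keys; the function names and both tolerance defaults are illustrative assumptions, not pdfplumber settings.

```python
def group_into_lines(chars: list[dict], y_tolerance: float = 3.0) -> list[list[dict]]:
    """Cluster characters into lines by 'top', then order each line left to right."""
    lines: list[list[dict]] = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Compare against the first character of the current line to avoid drift
        if lines and abs(ch["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    for line in lines:
        line.sort(key=lambda c: c["x0"])
    return lines


def line_to_words(line: list[dict], x_tolerance: float = 1.5) -> list[str]:
    """Split one left-to-right line into words at spaces or wide horizontal gaps."""
    words: list[str] = []
    current = ""
    prev_x1: float | None = None
    for ch in line:
        gap = ch["x0"] - prev_x1 if prev_x1 is not None else 0.0
        if ch["text"].isspace() or gap > x_tolerance:
            if current:
                words.append(current)
            current = "" if ch["text"].isspace() else ch["text"]
        else:
            current += ch["text"]
        prev_x1 = ch["x1"]
    if current:
        words.append(current)
    return words
```

A tight two-column layout would call line_to_words with a smaller x_tolerance, and only after the characters have been split by column; feeding both columns through together would interleave their lines.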

A large share of grant compliance data does not live in prose at all — it lives in tables. NIH Funding Opportunity Announcements (FOAs) tabulate page-limit matrices, scoring rubrics, and budget caps; NSF program solicitations table submission windows; DoD BAAs grid their evaluation criteria. Flattening those cells into a linear string destroys the row-column relationships that the rules depend on. pdfplumber recovers the grid directly from ruling lines:

python

import pdfplumber

def extract_compliance_tables(pdf_path: str) -> list[list[list[str | None]]]:
    """Recover tabular grids from a solicitation, preserving row/column structure.

    'lines' strategy keys on the PDF's ruled borders, which federal forms
    almost always draw; the intersection tolerance (5 pt vertically here)
    absorbs small misalignment between horizontal and vertical rules.
    """
    all_tables: list[list[list[str | None]]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.find_tables(
                table_settings={
                    "vertical_strategy": "lines",
                    "horizontal_strategy": "lines",
                    "intersection_y_tolerance": 5,
                }
            )
            for table in tables:
                extracted = table.extract()
                if extracted:
                    all_tables.append(extracted)
    return all_tables

The "lines" strategy works when a form draws its cell borders. Borderless tables laid out purely by whitespace need either the "text" strategy, which infers cell edges from the alignment of words, or the "explicit" strategy with column coordinates supplied through explicit_vertical_lines. These approaches, and the harder cases of rotated headers and cross-page table stitching, are covered in depth in extracting tables from NIH FOA PDFs using pdfplumber.

Agency-specific configuration

The same extraction code base must behave differently per agency, because NIH, NSF, and DoD lay out their documents on different conventions. Hard-coding one agency’s margins and tolerances guarantees a misparse of the others. Treat the parameters below as versioned configuration keyed by agency, not as constants.

Extraction parameter	NIH (FOA + application guide)	NSF (program solicitation + PAPPG)	DoD (Broad Agency Announcement)
Typical column layout	Single-column narrative	Frequent two-column sections	Mixed; dense multi-column tables
Header/footer band	~54–72 pt, agency banner + page count	~72 pt, solicitation number footer	Variable; distribution-statement footer
Table border style	Ruled lines ("lines" strategy)	Mix of ruled and whitespace tables	Often whitespace-aligned ("text" strategy)
Font-size floor to keep	9.0 pt	10.0 pt (PAPPG enforces larger body)	8.0 pt (fine-print clauses matter)
Section-number scheme	Named + mixed Roman/Arabic	PAPPG codes (e.g. II.C.2)	FAR/DFARS clause numbering
Compliance triggers in text	Human-subjects, clinical-trial	Data-management, mentoring	ITAR, EAR, security classification

The font-size floor is deliberately agency-specific: a DoD BAA buries binding export-control language in fine print that an NIH-tuned 9-point floor would silently drop, so the DoD profile keeps glyphs down to 8 points. The section-number scheme feeds a later stage rather than extraction itself. Those numbering conventions are decoded by the NIH FOA schema mapping and DoD BAA requirement extraction work in the taxonomy section, but extraction has to preserve the numbering glyphs intact for that decoding to be possible. Encoding these as a typed profile keeps the branching explicit:

python

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class AgencyExtractionProfile:
    agency: str
    min_font_size: float
    header_margin: float
    footer_margin: float
    table_strategy: str = "lines"

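# Keys are upper-cased because profile_for() normalises its input with
# .upper(); a mixed-case key such as "DoD" would never match.
PROFILES: dict[str, AgencyExtractionProfile] = {
    "NIH": AgencyExtractionProfile("NIH", 9.0, 60.0, 60.0, "lines"),
    "NSF": AgencyExtractionProfile("NSF", 10.0, 72.0, 72.0, "lines"),
    "DOD": AgencyExtractionProfile("DoD", 8.0, 54.0, 54.0, "text"),
}

def profile_for(agency: str) -> AgencyExtractionProfile:
    """Look up a profile case-insensitively ("DoD", "dod", and "DOD" all match)."""
    try:
        return PROFILES[agency.strip().upper()]
    except KeyError as exc:
        raise ValueError(f"No extraction profile registered for {agency!r}") from exc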

Error handling and edge cases

Federal PDFs fail in predictable ways, and a production extractor treats each failure as a routed outcome rather than an unhandled exception that aborts a batch. The recurring cases:

Scanned or image-only pages. page.chars returns an empty list and page.extract_text() returns an empty string. Detect a zero-character page and flag the document for OCR instead of emitting a silent empty parse.
Corrupt or truncated files. pdfplumber.open() (via pdfminer.six) raises on a malformed cross-reference table. Catch it, log the source hash, and route the file to review — never let one bad document kill a nightly run.
Rotated pages. Landscape evaluation tables report characters with upright set to false and swapped axes. Check page.rotation and normalize before applying vertical-band filtering, or the header margin will clip the wrong edge.
Ligatures and hyphenation. Glyphs like ﬁ extract as a single character, and line-break hyphens split a word across two top values. Normalize ligatures and de-hyphenate during the word-grouping pass so that keyword matching downstream does not miss financial or co-operative.
Overlapping watermarks. “DRAFT” or “PRE-DECISIONAL” watermarks add large-font characters at odd angles; filter them by fontname, by the upright flag, or by a rotated text matrix (the matrix entry on each character) rather than by size alone.
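The ligature and hyphenation case can be sketched with the standard library, operating on text that has already been grouped into lines joined by newlines. The function name is hypothetical, and the rule of joining every hyphen that sits directly before a line break is a simplifying assumption: it will also remove a genuine hyphen that happens to fall at a line end.

```python
import re

# Latin presentation-form ligatures mapped to their plain-letter spellings
_LIGATURES = str.maketrans(
    {
        "\ufb00": "ff",
        "\ufb01": "fi",
        "\ufb02": "fl",
        "\ufb03": "ffi",
        "\ufb04": "ffl",
    }
)

# A hyphen between two word characters with a newline immediately after it
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n(\w)")


def normalize_extracted_text(text: str) -> str:
    """Expand ligature glyphs, then rejoin words split by a line-end hyphen."""
    expanded = text.translate(_LIGATURES)
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", expanded)
```

An explicit mapping is used here instead of Unicode compatibility normalization so that other characters, such as superscript footnote markers, are left untouched.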

python

import logging
import pdfplumber

logger = logging.getLogger(__name__)

class ExtractionError(Exception):
    """Raised when a document cannot yield usable text."""

def safe_extract(pdf_path: str, profile: "AgencyExtractionProfile") -> list["ExtractedChar"]:
    try:
        chars = extract_zoned_chars(
            pdf_path,
            min_font_size=profile.min_font_size,
            header_margin=profile.header_margin,
            footer_margin=profile.footer_margin,
        )
    except Exception as exc:  # pdfminer raises varied low-level errors
        logger.error("Unreadable PDF %s: %s", pdf_path, exc)
        raise ExtractionError(f"Could not open or parse {pdf_path}") from exc

    if not chars:
        logger.warning("Zero characters from %s — routing to OCR queue", pdf_path)
        raise ExtractionError(f"No embedded text in {pdf_path}; OCR required")
    return chars

The rule is uniform: fail loudly, attach the source path, and let the orchestrator decide between OCR, review, or retry. A silent empty result is the one outcome that must never reach validation, because an empty parse passes shallow checks while omitting every requirement.
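For the orchestrator to choose between OCR, review, and retry, it needs to tell the failure modes apart. One possible sketch, assuming a small exception hierarchy layered on ExtractionError: the subclass names, the route labels, and route_document are hypothetical, and the base class is redefined here only so the example runs on its own.

```python
from collections.abc import Callable


class ExtractionError(Exception):
    """Base class; mirrors the ExtractionError defined earlier on this page."""

    route = "review"


class OcrRequiredError(ExtractionError):
    """The file opened cleanly but yielded no embedded text."""

    route = "ocr"


class UnreadablePdfError(ExtractionError):
    """The file could not be opened or parsed at all."""

    route = "review"


def route_document(pdf_path: str, extract: Callable[[str], object]) -> str:
    """Run one extraction and return the name of the queue the file goes to next."""
    try:
        extract(pdf_path)
    except ExtractionError as exc:
        return exc.route
    except OSError:
        # Assumed to be transient (network share, locked file), so try again later
        return "retry"
    return "segmentation"
```

Because the route lives on the exception class, adding a new failure mode means adding a subclass rather than editing the orchestrator's branching.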

Integration with the downstream pipeline

Extraction is stage two of five, and its only job is to hand the next stage a faithful, geometry-preserved representation of the document. The zoned characters and recovered tables serialize to an intermediate JSON structure that carries coordinates and font metadata forward, so that segmentation can reason about layout rather than re-deriving it. That intermediate feeds the NLP section boundary detector, which splits the stream into evaluation criteria, eligibility, and formatting sections; the segmented result is then gated by schema validation with Pydantic before any of it is trusted. When hundreds of solicitations arrive in a single funding cycle, the extractor runs under the async batch processor for large RFPs, which parallelizes the CPU-heavy open/extract work across a process pool while holding memory bounded. The rules that ultimately act on this output live in the compliance validation rule engines, including the page-limit and font enforcement checks that consume the very font-size metadata this stage preserves.

The contract between stages is intentionally narrow: extraction never interprets a rule, it only preserves the evidence a rule will need. That separation is what keeps the pipeline auditable — a later reviewer can trace any compliance decision back to the exact page, bounding box, and font size the extractor recorded.

Testing and verification

Extraction code earns trust through fixtures, not assertions about live agency documents that change without notice. Commit a small, redistribution-safe sample PDF that reproduces each edge case — a two-column page, a ruled table, a header band, an image-only page — and assert against known coordinates. The suite below sketches the checks that confirm compliant output before handoff:

python

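import pytest
import pdfplumber

# Hypothetical module path: adjust to wherever the functions above live.
from rfp_pipeline.extraction import (
    ExtractionError,
    extract_compliance_tables,
    extract_zoned_chars,
    profile_for,
    safe_extract,
)

FIXTURE = "tests/fixtures/sample_foa.pdf"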

def test_chars_expose_font_metadata() -> None:
    with pdfplumber.open(FIXTURE) as pdf:
        ch = pdf.pages[0].chars[0]
    assert "size" in ch and "fontname" in ch  # guards the extract_words() trap

def test_header_band_is_discarded() -> None:
    profile = profile_for("NIH")
    chars = extract_zoned_chars(
        FIXTURE,
        min_font_size=profile.min_font_size,
        header_margin=profile.header_margin,
        footer_margin=profile.footer_margin,
    )
    assert all(c.top >= profile.header_margin for c in chars)

def test_font_floor_drops_fine_print() -> None:
    chars = extract_zoned_chars(FIXTURE, min_font_size=10.0)
    assert all(c.size >= 10.0 for c in chars)

def test_table_grid_preserves_columns() -> None:
    tables = extract_compliance_tables(FIXTURE)
    assert tables, "expected at least one ruled table in the fixture"
    assert all(len(row) == len(tables[0][0]) for row in tables[0])

def test_empty_document_raises() -> None:
    with pytest.raises(ExtractionError):
        safe_extract("tests/fixtures/scanned_image_only.pdf", profile_for("NIH"))

Beyond the automated suite, a manual acceptance checklist catches regressions the fixtures miss: confirm that no running header text survives, that every recovered table has a consistent column count per row, that the font-size floor matches the agency profile, and that a deliberately corrupt file routes to review rather than crashing the batch. Only once those pass should the intermediate JSON be released to segmentation.

Automating PDF extraction with pdfplumber removes manual transcription error, accelerates compliance triage, and gives the rest of the pipeline a deterministic, coordinate-anchored foundation. Paired with zoned filtering, ruled-table recovery, agency-specific profiles, and routed error handling, it turns unstructured federal documentation into audit-ready evidence that later stages can validate with confidence.

Related

Extracting tables from NIH FOA PDFs using pdfplumber — rotated headers, whitespace tables, and cross-page stitching.
NLP section boundary detection — segmenting the extracted stream into compliance-critical sections.
Schema validation with Pydantic — the gate that hardens parsed output before assembly.
Async batch processing for large RFPs — running extraction across a full funding cycle.
Compliance validation rule engines — where the font and layout metadata this stage preserves is finally enforced.
