Extracting tables from NIH FOA PDFs using pdfplumber

A National Institutes of Health (NIH) Funding Opportunity Announcement (FOA) hides its most consequential numbers inside tables: the modular versus detailed budget ceilings for an R01, the page-limit matrix that governs the Research Strategy, the scoring rubric that reviewers apply. When a budget table breaks across a page boundary or a merged “Budget Period” cell empties out on continuation rows, a naive parser silently detaches those figures from their headers, and the wrong Total Award ceiling flows straight into a pre-submission audit. The defect then surfaces only when eRA Commons rejects the package days before the Grants.gov deadline. This page walks through the transformation that prevents it, as one focused workflow inside the broader PDF text extraction with pdfplumber stage: tune tolerance-aware table detection, normalize and stitch the grid across pages, resolve the agency-specific edge cases, then validate the reconstructed rows before any downstream handoff.

Phase 1 — Configure tolerance-aware table detection

The mapping starts by teaching pdfplumber where the cells actually are, rather than accepting the defaults that fail on government-typeset documents. NIH FOAs right-align dollar figures while left-aligning their descriptors, repeat header rows on every page, and embed footnotes that shift a compliance interpretation. Snap tolerances that are too loose pull neighboring rules together and merge tightly packed budget line items into one unreadable cell; tolerances that are too tight leave near-duplicate rules unsnapped, so a single value can split across two sliver columns.
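To make the effect of a snap tolerance concrete, the sketch below clusters nearby edge coordinates and moves each cluster to its mean. It is a simplified illustration of the idea and not pdfplumber's own code; `snap_positions` and the sample coordinates are hypothetical.

```python
def snap_positions(positions: list[float], tolerance: float) -> list[float]:
    """Snap each coordinate to the mean of its cluster.

    A coordinate joins the current cluster when it lies within
    `tolerance` of the previous coordinate in sorted order.
    (Illustrative only; names and data are assumptions.)
    """
    if not positions:
        return []
    order = sorted(range(len(positions)), key=positions.__getitem__)
    clusters: list[list[int]] = [[order[0]]]
    for idx in order[1:]:
        prev = clusters[-1][-1]
        if positions[idx] - positions[prev] <= tolerance:
            clusters[-1].append(idx)
        else:
            clusters.append([idx])

    snapped = list(positions)
    for cluster in clusters:
        mean = sum(positions[i] for i in cluster) / len(cluster)
        for i in cluster:
            snapped[i] = mean
    return snapped


# Two rules drawn 1.5 pt apart, plus a distinct column boundary at 140 pt.
edges = [100.0, 101.5, 140.0]
loose = snap_positions(edges, tolerance=3)   # the first two edges collapse
tight = snap_positions(edges, tolerance=1)   # all three edges stay distinct
```

With the looser value the doubled rule becomes one boundary; with the tighter value it survives as two boundaries 1.5 pt apart, which is how sliver columns appear.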

Implementation steps:

Choose a ruling strategy. NIH budget forms draw their cell borders, so the "lines" strategy keyed on ruled edges is authoritative; reserve the "text" strategy for the borderless, whitespace-aligned tables a Department of Defense (DoD) Broad Agency Announcement (BAA) often uses.
Tighten the snap tolerances. Small snap_x_tolerance and snap_y_tolerance values stop right-aligned numerics from bleeding into the adjacent descriptor column.
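Keep blank characters. This text setting tells pdfplumber to treat whitespace as part of a word rather than as a separator, so spacing inside a cell survives extraction. Merged regions typically still come back as empty or None cells on their continuation rows, and Phase 2 depends on those placeholders to reconstruct hierarchy.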
Extract per page, tagged with a page index. Continuation logic in the next phase needs to know which page each fragment came from.

python

import pdfplumber
from collections.abc import Iterator

def extract_foa_tables(pdf_path: str) -> Iterator[list[list[list[str | None]]]]:
    """Yield the tables detected on each page with NIH-tuned tolerances.

    'lines' keys on the ruled borders NIH budget forms draw; the tight
    snap tolerances stop right-aligned dollar figures from merging into
    the left-aligned descriptor column.
    """
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
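            yield page.extract_tables(table_settings={
                "vertical_strategy": "lines",
                "horizontal_strategy": "lines",
                "snap_x_tolerance": 3,
                "snap_y_tolerance": 2,
                # Recent pdfplumber releases pass text options with a
                # "text_" prefix; older ones spelled this "keep_blank_chars".
                "text_keep_blank_chars": True,
            })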

Because this stage consumes the coordinate-anchored geometry the parent extractor already preserved, it never re-derives layout — it only interprets the grid. That separation keeps the table logic auditable.

Phase 2 — Normalize headers and stitch across pages

The core transformation turns raw nested lists into a single logical grid. extract_tables() returns each page fragment as an independent object, so a table that spans three pages arrives as three disconnected structures with the header repeated. Two operations reconnect them: header normalization maps ragged labels onto a canonical vocabulary, and a continuity check stitches fragments whose header signatures match.

Normalize first, because the stitch decision compares normalized signatures rather than brittle raw strings:

python

import re

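# \b keeps "direct costs" from matching inside "indirect costs";
# without it, every indirect column would be labeled direct_costs.
NIH_HEADER_MAP: dict[str, str] = {
    r"(?i)\bdirect\s*costs?": "direct_costs",
    r"(?i)\bindirect\s*costs?": "indirect_costs",
    r"(?i)\btotal\s*award": "total_award",
    r"(?i)\bbudget\s*period": "budget_period",
}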

def normalize_headers(raw_headers: list[str | None]) -> list[str]:
    """Map ragged FOA header cells onto a canonical compliance vocabulary."""
    normalized: list[str] = []
    for header in raw_headers:
        if not header or not header.strip():
            normalized.append("unknown_column")
            continue
        cleaned = header.strip().replace("$", "").replace(",", "")
        mapped = next(
            (v for pat, v in NIH_HEADER_MAP.items() if re.search(pat, cleaned)),
            cleaned,
        )
        normalized.append(mapped)
    return normalized

Standardized headers are what let an automated check compare an extracted award ceiling against an institution’s negotiated indirect cost rate, and they feed the same canonical field names the NIH FOA schema mapping process expects. With signatures normalized, the continuity pass decides whether an adjacent fragment continues the previous table or begins a new one, then forward-fills merged cells:

python

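# Only categorical label columns are forward-filled; a blank dollar
# cell must stay blank so that no figure is invented.
FILL_COLUMNS: frozenset[str] = frozenset({"budget_period"})

def stitch_and_propagate(
    pages_tables: list[list[list[list[str | None]]]],
) -> list[list[str]]:
    """Reconnect page-split tables and forward-fill merged label cells."""
    stitched: list[list[str]] = []
    header_rows: set[int] = set()      # indices of header rows in `stitched`
    last_signature: list[str] = []

    for page_tables in pages_tables:
        for table in page_tables:
            if not table:
                continue
            headers = normalize_headers(table[0])
            body = [[c or "" for c in row] for row in table[1:]]
            if not (last_signature and headers == last_signature):
                # New table: record where its header sits.
                header_rows.add(len(stitched))
                stitched.append(headers)
                last_signature = headers
            # A continuation contributes only its body (repeated header dropped).
            stitched.extend(body)

    # Merged "Budget Period" cells arrive blank on continuation rows;
    # propagate the last known value downward to rebuild the hierarchy.
    fill_idx: set[int] = set()
    for i, row in enumerate(stitched):
        if i in header_rows:
            fill_idx = {j for j, name in enumerate(row) if name in FILL_COLUMNS}
            continue
        if (i - 1) in header_rows:
            continue                   # never copy a header label into the body
        prev = stitched[i - 1]
        for j in fill_idx:
            # Bounds checks guard against ragged rows.
            if j < len(row) and j < len(prev) and not row[j].strip():
                row[j] = prev[j]
    return stitched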

Figure: the stitch decision. Every page fragment is normalized, then either stitched onto the running table or started fresh; both paths then reconverge to forward-fill merged cells and validate.

Phase 3 — Edge cases and agency-specific overrides

Two classes of problem break a stitcher that only reads the happy path: layout anomalies that the ruling strategy cannot see, and the fact that NIH table conventions are not shared by other agencies.

Layout anomalies. Rotated landscape evaluation tables report characters with swapped axes; check page.rotation and normalize orientation before detection, or the column boundaries land on the wrong edge. Borderless tables laid out purely by whitespace return nothing under the "lines" strategy — fall back to "text" with explicit column coordinates. A “DRAFT” or “PRE-DECISIONAL” watermark can inject stray large-font glyphs into a cell; filter those by font name before normalization. Where a header row itself fails to repeat on a continuation page, the signature check returns no match and the fragment is wrongly started fresh — guard against it by also comparing column count and column x-positions, not the header text alone.
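The last guard above, comparing column count and x-positions when a header fails to repeat, can be sketched as follows. The column edges are assumed to be collected upstream (for example from the cell bounding boxes that pdfplumber's `find_tables()` exposes); `same_column_geometry`, `is_continuation`, and the 2 pt tolerance are hypothetical choices, not values from the FOA or the library.

```python
def same_column_geometry(
    prev_edges: list[float],
    next_edges: list[float],
    tolerance: float = 2.0,   # assumed slack in PDF points
) -> bool:
    """True when two fragments share column count and left-edge positions."""
    if len(prev_edges) != len(next_edges):
        return False
    pairs = zip(sorted(prev_edges), sorted(next_edges))
    return all(abs(a - b) <= tolerance for a, b in pairs)


def is_continuation(
    prev_signature: list[str],
    next_signature: list[str],
    prev_edges: list[float],
    next_edges: list[float],
) -> bool:
    """Treat a fragment as a continuation on header match OR geometry match."""
    if next_signature == prev_signature:
        return True
    # Header did not repeat: fall back to comparing the column layout.
    return same_column_geometry(prev_edges, next_edges)
```

A fragment accepted on geometry alone keeps its first row as data rather than as a header. Two unrelated tables can share a layout, so a geometry-only match is a reasonable candidate for manual review.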

Agency overrides. The Direct/Indirect/Total budget triad and modular ceiling are NIH constructs. The National Science Foundation (NSF) tables its budget through the fielded SF424 line items governed by the Proposal & Award Policies & Procedures Guide (PAPPG), and a DoD BAA grids evaluation criteria per announcement. Keep the strategy and tolerances data-driven per agency rather than hardcoded, cross-checking the sibling models built under DoD BAA requirement extraction and the budget justification format standards reference so one extraction shape serves all three.

Table parameter	NIH (FOA budget)	NSF (PAPPG budget)	DoD (BAA)
Border style	Ruled lines (`"lines"`)	Ruled SF424 grid (`"lines"`)	Often whitespace (`"text"`)
Header repeats per page	Yes	Yes	Variable
Merged cells	Budget-period grouping	Rare	Evaluation-criteria spans
Canonical cost fields	Direct / Indirect / Total	Detailed line items	Per-BAA
Authority of record	Activity-code FOA	PAPPG (current)	Announcement

Figure: a page-split budget table. The repeated header is dropped and the merged Year 1 value forward-fills the blank Budget Period cells below it, reconstructing one logical grid.

Phase 4 — Validate the reconstructed rows and emit an audit record

The stitched grid only earns trust once each budget row is coerced, cross-checked, and bound to the exact bytes it came from. Model the row with Pydantic v2 so a malformed or internally inconsistent figure fails loudly at construction rather than silently mis-validating a submission — the same Pydantic validation layer the ingestion pipeline relies on downstream:

python

from pydantic import BaseModel, field_validator, model_validator

class BudgetRow(BaseModel):
    """One reconstructed NIH budget row with an internal-consistency check."""
    direct_costs: float
    indirect_costs: float
    total_award: float

    @field_validator("direct_costs", "indirect_costs", "total_award", mode="before")
    @classmethod
    def _coerce_currency(cls, v: str | float) -> float:
        if isinstance(v, str):
            return float(v.replace("$", "").replace(",", "").strip())
        return v

    @model_validator(mode="after")
    def _totals_reconcile(self) -> "BudgetRow":
        if abs((self.direct_costs + self.indirect_costs) - self.total_award) > 0.01:
            raise ValueError("direct + indirect does not reconcile to total award")
        return self
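The stitched grid from Phase 2 is a list of lists, while the model takes keyword fields, so something has to bridge the two. A minimal sketch, assuming the grid holds a single table whose first row is the normalized header; `grid_to_rows` is a hypothetical helper, not part of the original pipeline.

```python
def grid_to_rows(
    grid: list[list[str]],
) -> tuple[list[dict[str, str]], list[int]]:
    """Turn a header-plus-body grid into header-keyed dicts.

    Returns the usable rows and the 1-based body indices of ragged rows
    (rows whose cell count differs from the header), which are set
    aside instead of being zipped against the wrong columns.
    """
    if not grid:
        return [], []
    header, *body = grid
    rows: list[dict[str, str]] = []
    ragged: list[int] = []
    for offset, cells in enumerate(body, start=1):
        if len(cells) != len(header):
            ragged.append(offset)
            continue
        rows.append(dict(zip(header, cells)))
    return rows, ragged
```

Pydantic v2 ignores undeclared fields by default, so a dict that also carries `budget_period` can be passed to `BudgetRow(**row)` as long as the model's configuration has not been changed to forbid extras.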

Wrap the extraction in a routine that validates every row, quarantines the failures, and writes an immutable record keyed to the source hash — the reproducible baseline the compliance validation rule engines consume, including the page-limit and font enforcement checks:

python

import hashlib
import logging
from datetime import datetime, timezone
from pydantic import ValidationError

logger = logging.getLogger(__name__)

def validate_and_log(pdf_path: str, rows: list[dict[str, str]]) -> dict:
    """Validate reconstructed rows and return an audit-safe record."""
    with open(pdf_path, "rb") as fh:
        source_hash = hashlib.sha256(fh.read()).hexdigest()

    valid: list[BudgetRow] = []
    quarantine: list[dict] = []
    for idx, row in enumerate(rows):
        try:
            valid.append(BudgetRow(**row))
        except ValidationError as exc:
            logger.warning("Row %d failed in %s: %s", idx, source_hash, exc)
            quarantine.append({"row": idx, "errors": exc.errors()})

    return {
        "source_sha256": source_hash,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "rows_valid": len(valid),
        "rows_quarantined": len(quarantine),
        "is_clean": not quarantine,
        "quarantine": quarantine,
    }
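The function above returns the record but does not persist it. One way to approximate a write-once record is to create the file in exclusive mode, sketched below; `write_audit_record` and the one-file-per-hash naming are assumptions, and exclusive creation only prevents accidental overwrites, so real immutability would still need storage-level controls.

```python
import json
from pathlib import Path


def write_audit_record(record: dict, out_dir: str) -> Path:
    """Persist a record once; refuse to overwrite an existing file."""
    target = Path(out_dir) / f"{record['source_sha256']}.json"
    # Mode "x" raises FileExistsError instead of truncating an old record.
    with open(target, "x", encoding="utf-8") as fh:
        # default=str covers values JSON cannot encode natively, such as
        # exception objects nested inside validation error details.
        json.dump(record, fh, indent=2, sort_keys=True, default=str)
    return target
```

Under this naming scheme a second run over the same PDF raises rather than replacing the first record; including the timestamp in the file name would be the alternative if repeated checks need to coexist.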

Before releasing the record, walk a short acceptance checklist that fixtures alone miss: confirm every recovered row has a consistent column count, that merged “Budget Period” cells resolved to a non-empty value, that each Direct + Indirect pair reconciles to its Total, and that a low-density or irregular table routed to the quarantine queue rather than crashing the batch. Only a grid that clears every row is safe to hand to the next stage; at funding-cycle volume that hand-off runs under the async batch processor for large RFPs.
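The checklist above can be expressed as a function that reports problems instead of raising, so an irregular table is routed to quarantine rather than crashing the batch. This is a sketch under stated assumptions: the grid holds one table with a normalized header in row 0, and `acceptance_issues` is a hypothetical name.

```python
def acceptance_issues(grid: list[list[str]], tolerance: float = 0.01) -> list[str]:
    """Return human-readable problems; an empty list means the grid is clean."""
    if not grid:
        return ["grid is empty"]
    header, *body = grid
    issues: list[str] = []

    def money(text: str) -> float:
        return float(text.replace("$", "").replace(",", "").strip())

    for n, cells in enumerate(body, start=1):
        # 1. Consistent column count.
        if len(cells) != len(header):
            issues.append(f"row {n}: expected {len(header)} columns, found {len(cells)}")
            continue
        row = dict(zip(header, cells))
        # 2. Merged Budget Period cells resolved to a value.
        if "budget_period" in row and not row["budget_period"].strip():
            issues.append(f"row {n}: budget_period is empty")
        # 3. Direct + Indirect reconciles to Total.
        try:
            gap = (
                money(row["direct_costs"])
                + money(row["indirect_costs"])
                - money(row["total_award"])
            )
        except (KeyError, ValueError):
            issues.append(f"row {n}: cost fields missing or not numeric")
            continue
        if abs(gap) > tolerance:
            issues.append(f"row {n}: direct + indirect does not reconcile to total")
    return issues
```

A caller would release the record only when the returned list is empty and send the grid to the quarantine queue otherwise.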

Frequently asked questions

Why do NIH budget tables split across pages break the extractor?

pdfplumber treats each page fragment as an independent object, so a table that spans a page boundary arrives as two disconnected grids with the header repeated. Without a stitch pass, continuation rows lose their column headers and their Total Award figure detaches from the ceiling it belongs to. Comparing normalized header signatures across fragments reconnects them into one logical table.

When should I use the "text" strategy instead of "lines"?

Use "lines" whenever the form draws ruled cell borders, which NIH and NSF budget forms almost always do. Switch to "text" with explicit column coordinates only for borderless tables laid out purely by whitespace — a pattern more common in DoD Broad Agency Announcements. Running "lines" against a borderless table returns nothing, so detect the empty result and fall back rather than emitting a silent blank parse.

How are merged cells recovered?

Merged regions typically come back from extraction as empty or None cells on their continuation rows, so normalize those to an empty-string placeholder rather than dropping them. Then forward-fill: for each empty cell in a label column, propagate the last known value in that column downward, and leave empty dollar cells alone so no figure is invented. That rebuilds the implicit hierarchy, for example a “Budget Period” label that visually spans several rows, so auditors receive fully populated categorical breakdowns instead of fragments.

Can one extraction shape serve NIH, NSF, and DoD budget tables?

Yes, if the strategy and tolerances are treated as per-agency configuration rather than constants. Keep the stitch-and-normalize logic generic and drive the ruling strategy, snap tolerances, and canonical field map from an agency profile. The NIH FOA, the NSF PAPPG budget, and a DoD BAA then differ only in their configuration values, not in the code path.

What belongs in the audit record beyond the extracted numbers?

Bind the record to the SHA-256 of the source PDF, a UTC timestamp, the count of valid versus quarantined rows, and the structured validation errors for anything that failed. That lets a compliance officer reconstruct months later precisely why a package was flagged, and proves which edition of the FOA the figures were read from.

Related

PDF text extraction with pdfplumber: the parent extraction stage this table workflow plugs into.
Schema validation with Pydantic: the typed layer that hardens the reconstructed rows.
NIH FOA schema mapping: where the canonical budget fields feed the compiled rule model.
Budget justification format standards: the cross-agency shape the extracted ceilings must satisfy.
Async batch processing for large RFPs: running table extraction across a full funding cycle.
