Parsing NSF PAPPG Section Headers Programmatically

A National Science Foundation (NSF) proposal is rejected structurally, not scientifically, when its section headers arrive out of order, mislabelled, or missing entirely — and Research.gov surfaces that failure only in the final compliance sweep, often hours before the deadline. The Proposal & Award Policies & Procedures Guide (PAPPG) fixes the exact heading hierarchy every competitive application must follow, but as prose it cannot gate a submission. This page shows how to extract, normalize, and validate that hierarchy in code so a misordered A. under a missing I., or a Data Management Plan header that never appears, is caught while the draft is still editable. It is the header-recognition detail behind the parent NSF Proposal Guide Taxonomy, which consumes the typed section records this parser produces.

Phase 1 — Decompose the document and normalize both formats

Header parsing begins before any regular expression runs: the pipeline must first reduce two incompatible source formats to one clean text stream. NSF proposal documents reach the parser as either DOCX (author drafts) or PDF (the compiled package Research.gov assembles), and each exposes structure differently.

Route by format. DOCX files expose heading structure through the Office Open XML paragraph-style and numbering definitions, so python-docx can walk styles directly. PDF files carry no semantic hierarchy, so text must be recovered by coordinate-aware PDF text extraction with pdfplumber before anything downstream can read it.
Collapse to a common shape. Emit each candidate line as an ordered tuple — (header_text, style_or_bbox, inferred_depth, page_number) — so the classifier that follows never has to know which format produced a line.
Normalize aggressively. Apply Unicode NFKC normalization, collapse repeated whitespace, strip citation markers and non-breaking spaces, and standardize the trailing punctuation NSF templates apply inconsistently. Skipping this step is the most common reason a header that “should” match does not.

python

import re
import unicodedata


def normalize_line(raw: str) -> str:
    """Reduce an extracted line to a stable form before pattern matching."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\u00a0", " ")           # non-breaking space (NFKC already folds it; explicit guard)
    text = re.sub(r"\[\d+\]", "", text)          # citation markers
    text = re.sub(r"\s+", " ", text)             # collapse whitespace
    return text.strip()

Because both ingestion paths converge on the same normalized tuples, the classification logic in Phase 2 is written once and reused for every source format.
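
One way to picture that common shape is a small typed record plus a router keyed on file extension. This is only a sketch under stated assumptions: `CandidateLine` and `route_format` are hypothetical names introduced here, and the real extraction calls (python-docx for DOCX, pdfplumber for PDF) are deliberately left out so the example stays dependency-free.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class CandidateLine(NamedTuple):
    """Format-neutral record for one candidate header line (hypothetical name)."""

    header_text: str
    style_or_bbox: object          # DOCX style name, or a PDF bounding-box tuple
    inferred_depth: Optional[int]  # None until the classifier has run
    page_number: int


def route_format(path: str) -> str:
    """Pick the ingestion path from the file extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".docx":
        return "docx"
    if suffix == ".pdf":
        return "pdf"
    raise ValueError(f"unsupported proposal format: {suffix or '(none)'}")
```

Whichever extractor runs, it would emit `CandidateLine` records, so nothing downstream needs to branch on the source format.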

Phase 2 — Classify each header against the PAPPG hierarchy

The PAPPG uses a rigid alphanumeric hierarchy: primary sections take Roman numerals (I., II.), secondary sections take uppercase letters (A., B.), and tertiary levels take Arabic numerals (1., 2.). A handful of flat headers (appendices, the Biographical Sketch, Current & Pending Support) sit outside the numbered tree entirely. A robust classifier captures all four cases in one anchored expression. Each numbered branch requires the label, a period, and at least one whitespace character, so legacy templates that drop the period must have their punctuation standardized during Phase 1 normalization before they can match.

python

import re

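# Numbered labels stay case-sensitive so "i." or "a." in body text is not read
# as a header; only the flat names are matched case-insensitively.
# Caveat: C, D, I, L, M, V and X are both Roman numerals and letters, and the
# Roman branch is tried first, so "C." is captured as roman without context.
HEADER_PATTERN = re.compile(
    r"^(?P<roman>[IVXLCDM]+)\.\s+"
    r"|^(?P<alpha>[A-Z])\.\s+"
    r"|^(?P<arabic>\d{1,2})\.\s+"
    r"|^(?P<flat>(?i:Appendix|Budget Justification"
    r"|Biographical Sketch|Current & Pending Support))\b",
    re.MULTILINE,
)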


def classify_depth(match: re.Match[str]) -> int:
    """Map a matched header to its PAPPG hierarchy depth (1=Roman ... 4=flat)."""
    if match.group("roman"):
        return 1
    if match.group("alpha"):
        return 2
    if match.group("arabic"):
        return 3
    return 4  # flat / non-numbered headers

Keeping this layer deterministic is deliberate. A probabilistic boundary model belongs later in the pipeline, in NLP section boundary detection, where a human still reviews the output; a compliance gate that is wrong on one proposal in fifty is worse than no gate at all, so the header classifier stays rule-based and reproducible.
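
One gap a single anchored expression cannot close on its own is that the labels C., D., I., L., M., V., and X. are both valid Roman numerals and valid uppercase letters. A deterministic tie-break can use the sequence position instead. The sketch below is an assumption about how that could work, not part of the parser described above: `roman_to_int` and `resolve_single_letter` are hypothetical helpers, and the caller is assumed to track the last primary section number and the last letter seen under it.

```python
from typing import Optional

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(label: str) -> int:
    """Convert an uppercase Roman numeral to an integer (no validity checks)."""
    total, previous = 0, 0
    for char in reversed(label):
        value = ROMAN_VALUES[char]
        # A smaller value left of a larger one is subtractive (IV, IX, XL).
        total += -value if value < previous else value
        previous = value
    return total


def resolve_single_letter(
    label: str, last_roman: int, last_alpha: Optional[str]
) -> int:
    """Return depth 1 (Roman) or 2 (letter) for a one-character uppercase label.

    last_roman is the value of the most recent primary section (0 if none);
    last_alpha is the most recent letter seen under it (None if none yet).
    """
    if label not in ROMAN_VALUES:
        return 2  # A, B, E, ... can only be letters
    if roman_to_int(label) == last_roman + 1:
        return 1  # continues the Roman sequence: I first, V after IV
    if last_alpha is not None and ord(label) == ord(last_alpha) + 1:
        return 2  # continues the letter sequence: C after B, D after C
    return 1  # unresolved: fall back to the Roman-first order of the pattern
```

The rule stays reproducible because it depends only on the ordered labels already seen, never on a score or a model. A label such as V. directly after section IV with a preceding U. remains genuinely ambiguous, and this sketch resolves it as Roman.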

Phase 3 — Edge cases and agency-specific overrides

Real extracted text is messier than the PAPPG’s examples, and the header conventions differ across agencies, so the parser has to fail predictably on the cases that actually occur in an institutional queue:

Merged and orphaned cells. Two-column layouts and merged table cells can absorb header text, and page breaks can split one header across two extraction chunks. Resolution: run a sliding-window context check that inspects adjacent lines for continuity and flags orphaned numbering rather than silently dropping it.
Style-inheriting footnotes. Footnotes that inherit a heading style produce false positives. Resolution: reject candidates whose text after the numbering label is under three characters or reads as all-uppercase boilerplate.
PAPPG version drift. Section numbering and page thresholds change between revisions (for example, PAPPG 24-1), so a proposal drafted against last year’s guide can violate a rule that moved. Resolution: stamp every parsed header with the PAPPG revision it was validated against and refuse to score stale records without an explicit override.
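
The merged-and-orphaned case can be sketched as a two-line window over already-normalized lines. The version below assumes a split header shows up as a bare numbering label on its own line, which is only one of the shapes a real extractor can produce; `BARE_LABEL` and `rejoin_split_headers` are hypothetical names introduced for illustration.

```python
import re

# A line that is nothing but a numbering label, e.g. "II.", "B." or "3."
BARE_LABEL = re.compile(r"^(?:[IVXLCDM]+|[A-Z]|\d{1,2})\.$")


def rejoin_split_headers(lines: list[str]) -> tuple[list[str], list[str]]:
    """Rejoin bare labels with the following line; report labels with no title."""
    merged: list[str] = []
    orphans: list[str] = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if BARE_LABEL.match(line):
            following = lines[i + 1].strip() if i + 1 < len(lines) else ""
            if following and not BARE_LABEL.match(following):
                # Label and title were split across chunks: emit one header.
                merged.append(f"{line} {following}")
                i += 2
                continue
            # No usable title follows: flag the label instead of dropping it.
            orphans.append(line)
            i += 1
            continue
        merged.append(line)
        i += 1
    return merged, orphans
```

Returning the orphans as a separate list keeps the failure visible, so a reviewer sees the dangling label rather than a silently shortened header list.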

The enforced parent-child depth constraints map directly onto the alphanumeric hierarchy: a Roman-numeral section may parent an uppercase letter, which may parent an Arabic numeral; a tertiary header may never appear without a direct secondary parent; and flat headers are excluded from depth inheritance entirely.

The header labels are also agency-specific, which is why this parser lives beside the broader Core Architecture & RFP Taxonomy rather than hard-coding NSF conventions inline:

Header trait	NSF (PAPPG)	NIH (SF424 / FOA)	DoD (BAA)
Primary numbering	Roman numeral / Arabic	Section-letter (`A.`, `B.`)	Per-BAA (often numbered `1.0`)
Governing document	Proposal & Award Policies & Procedures Guide	Funding Opportunity Announcement	Broad Agency Announcement
Flat sections	Biographical Sketch, Current & Pending	Specific Aims, Bibliography	Statement of Work, DFARS refs
Submission portal	Research.gov	eRA Commons / Grants.gov	eBRAP / SAM.gov

The National Institutes of Health (NIH) label set is modeled by the NIH FOA schema mapping process, and the Department of Defense (DoD) conventions — with their export-control and Statement-of-Work wrinkles — by DoD BAA requirement extraction.

Phase 4 — Validate the sequence and verify before handoff

Classification produces depths; it does not produce trust. Before the header list is handed downstream, coerce each entry into a typed model and run a sequence pass that enforces the parent-child rules. Using a Pydantic v2 model means an invalid depth fails loudly at construction time rather than propagating a bad hierarchy into a submission decision.

python

from pydantic import BaseModel, Field, field_validator

ACTIVE_PAPPG_VERSION = "24-1"


class ParsedHeader(BaseModel):
    text: str = Field(min_length=1, max_length=120)
    depth: int = Field(ge=1, le=4)
    page_number: int = Field(ge=1)
    pappg_version: str

    @field_validator("pappg_version")
    @classmethod
    def version_must_be_current(cls, v: str) -> str:
        if v != ACTIVE_PAPPG_VERSION:
            raise ValueError(
                f"header parsed against PAPPG {v}, active revision is {ACTIVE_PAPPG_VERSION}"
            )
        return v


def find_sequence_violations(headers: list[ParsedHeader]) -> list[str]:
    """Enforce PAPPG parent-child depth rules across an ordered header list."""
    problems: list[str] = []
    open_depths: set[int] = set()  # depths with a still-open ancestor branch
    for h in headers:
        if h.depth == 4:
            continue  # flat headers are exempt from inheritance
        if h.depth > 1 and (h.depth - 1) not in open_depths:
            problems.append(
                f"'{h.text}' (depth {h.depth}) appears without a depth-{h.depth - 1} parent"
            )
        # A new header closes every deeper branch, so a later child needs a
        # fresh direct parent rather than one seen under an earlier section.
        open_depths = {d for d in open_depths if d < h.depth}
        open_depths.add(h.depth)
    return problems

Confirm the checklist below holds before releasing the parsed hierarchy to the compliance validation rule engine:

Every line passed NFKC normalization and whitespace collapsing before matching.
Each header carries a depth, a page number, and the active pappg_version stamp.
No tertiary (1.) header exists without a direct secondary (A.) parent, and no A. exists without an I. parent.
Flat headers (appendices, Biographical Sketch) are flagged and excluded from depth inheritance.
Split or orphaned headers from column and table layouts are reconciled, not dropped.
Every extraction anomaly is logged with a SHA-256 file hash, character offset, and ISO-8601 timestamp for audit.
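
The audit item in the checklist can be sketched with the standard library alone. The field names and the `anomaly_record` helper are assumptions made for illustration; the checklist only requires that the hash, the character offset, and the timestamp are all present.

```python
import hashlib
from datetime import datetime, timezone


def anomaly_record(file_bytes: bytes, char_offset: int, message: str) -> dict[str, object]:
    """Build one audit entry for an extraction anomaly (hypothetical helper)."""
    return {
        "sha256": hashlib.sha256(file_bytes).hexdigest(),       # identifies the exact file
        "char_offset": char_offset,                             # where the anomaly occurred
        "timestamp": datetime.now(timezone.utc).isoformat(),    # ISO-8601, UTC
        "message": message,
    }
```

Hashing the raw bytes rather than the extracted text ties each record to the exact file that was parsed, so a re-uploaded draft cannot be confused with the one that produced the anomaly.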

A minimal test suite pins the rules that most often regress when a new PAPPG revision lands:

python

import pytest
from pydantic import ValidationError


def _h(**kw) -> ParsedHeader:
    base = dict(text="I. Project Summary", depth=1, page_number=1,
                pappg_version=ACTIVE_PAPPG_VERSION)
    base.update(kw)
    return ParsedHeader(**base)


def test_orphan_tertiary_is_flagged() -> None:
    headers = [_h(text="1. Objectives", depth=3)]
    assert find_sequence_violations(headers)


def test_stale_version_rejected() -> None:
    with pytest.raises(ValidationError):
        _h(pappg_version="23-1")

Frequently asked questions

Why not just detect headings by font size instead of regex?

Font size alone is unreliable across the two ingestion paths: a DOCX heading style and a PDF glyph size do not map to the same depth, and institutional templates routinely override the PAPPG’s typography. The alphanumeric label (I., A., 1.) is the one signal both formats preserve after normalization, so the anchored regex is the portable classifier. Font size is still worth capturing — it feeds the separate page-limit and font enforcement checks — but it should not decide hierarchy depth.

Is the PAPPG heading numbering safe to hard-code?

No. Section numbering and thresholds move between PAPPG revisions, and individual program solicitations can add required sections the base guide does not. Stamp every parsed header with its pappg_version and keep the active revision in one constant so a mid-cycle policy update is a one-line change, not a hunt through regexes.

How should the parser treat appendices and the Biographical Sketch?

Treat them as flat (depth 4) headers that match the flat alternation, and exclude them from parent-child inheritance. They are legitimately not part of the numbered I. → A. → 1. tree, so applying sequence rules to them would produce false violations. Flag their presence for the mandatory-section check instead.

What happens when a header is split across a page break?

Coordinate-aware extraction can return the two halves as separate chunks on different pages. The sliding-window context check in Phase 3 detects a fragment that ends without terminal punctuation and rejoins it with the following line before classification, rather than emitting a truncated header that fails to match.

Should section-header validation block a submission on its own?

It should gate the record before downstream stages run, but the authoritative pass/fail belongs to the shared engine. Hand trusted ParsedHeader records to the compliance validation rule engine, which combines header order with mandatory-section coverage and page-limit checks to reach a single verdict — see mapping mandatory sections for NSF CAREER proposals for the coverage half.

Related

NSF Proposal Guide Taxonomy — the section model these parsed headers are coerced into.
PDF text extraction with pdfplumber — the coordinate-aware extraction that feeds this parser.
NLP section boundary detection — the probabilistic boundary layer that runs after deterministic header recognition.
Mapping mandatory sections for NSF CAREER proposals — the coverage check that confirms every required header is present.
DoD BAA requirement extraction — the same header problem under Broad Agency Announcement conventions.
