Required Section Mapping

Federal grant proposals are rejected for structure before they are ever read for science. A submission that omits a mandatory heading, orders its attachments incorrectly, or embeds prohibited material inside the narrative fails the automated intake screen at the National Institutes of Health (NIH), the National Science Foundation (NSF), and the Department of Defense (DoD) — regardless of scientific merit. Required section mapping is the compliance stage that prevents that failure: it converts the structural mandates of a Funding Opportunity Announcement (FOA) into a machine-enforceable schema, then verifies that an assembled document presents every required section, in the correct sequence, with dependencies satisfied. Within the broader Compliance Validation & Rule Engines layer, this mapping is the first gate every proposal passes through, because every downstream measurement — page counts, font checks, completeness reports — depends on knowing precisely where each section begins and ends.

This page covers how to build a deterministic section mapper in Python: the schema model that encodes agency structure as data, the matcher that reconciles a document’s heading tree against that schema, the agency-specific differences you must parameterize, the failure modes that produce false positives, and how the mapper feeds the rest of the pipeline.

Prerequisites and environment setup

Section mapping operates on already-extracted document text, so it sits downstream of ingestion and upstream of the typographic checks. Assume a Python 3.10+ environment and the following dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2.6" "pyyaml>=6.0" "pdfplumber>=0.11"
```

Three input assumptions drive the design:

- Document format. Proposals arrive as PDF (the mandatory submission format for NIH SF424 and NSF Research.gov packages) or as pre-render LaTeX/DOCX/Markdown during assembly. Section mapping runs at both stages: on the source during drafting for fast feedback, and on the final PDF for authoritative sign-off. Extract the raw text first: the PDF Text Extraction with pdfplumber workflow supplies the coordinate-aware text and heading candidates that this stage consumes.
- Schema source. The rule set is authored once per FOA and version-controlled as YAML. It is not hardcoded: a new solicitation ships a new schema file, not a code change.
- Determinism. The same document plus the same schema must always yield the same verdict, so no network calls, model randomness, or wall-clock dependencies are permitted inside the validator.

Core mechanism: schema as data, structure as verdict

The central idea is that agency policy is data and evaluation is code. Rather than scattering literals like "Research Strategy" or ordering rules through scripts, a production mapper loads a typed schema and runs one generic engine over it. Each required section is described by a rule: an identifier, a pattern that recognizes its heading, whether it is mandatory, its position in the canonical sequence, an optional dependency on another section, and an optional page allowance that later stages consume.

Encode that rule as a Pydantic model, which serves as the validation layer, so that a malformed schema fails loudly at load time rather than silently mis-validating every proposal in a submission cycle:

```python
from __future__ import annotations

import re

from pydantic import BaseModel, ConfigDict, field_validator


class SectionRule(BaseModel):
    """One mandatory or optional section in an agency structural schema."""

    model_config = ConfigDict(frozen=True)

    id: str
    title_pattern: str          # regex that recognizes the section heading
    order_index: int            # canonical position in the required sequence
    required: bool = True
    depends_on: str | None = None   # id of a section that must also be present
    max_pages: int | None = None    # carried for later stages, not checked here

    @field_validator("title_pattern")
    @classmethod
    def _pattern_must_compile(cls, value: str) -> str:
        try:
            re.compile(value)
        except re.error as exc:
            raise ValueError(f"invalid heading regex {value!r}: {exc}") from exc
        return value

    @field_validator("order_index")
    @classmethod
    def _order_non_negative(cls, value: int) -> int:
        if value < 0:
            raise ValueError("order_index must be >= 0")
        return value


class ComplianceSchema(BaseModel):
    """A complete structural schema for one FOA."""

    agency: str
    foa_id: str
    sections: list[SectionRule]

    @field_validator("sections")
    @classmethod
    def _orders_and_ids_unique(cls, value: list[SectionRule]) -> list[SectionRule]:
        # Duplicate positions would make the canonical sequence ambiguous.
        orders = [rule.order_index for rule in value]
        if len(orders) != len(set(orders)):
            raise ValueError("order_index values must be unique within a schema")
        # Duplicate ids would make depends_on references ambiguous.
        ids = [rule.id for rule in value]
        if len(ids) != len(set(ids)):
            raise ValueError("section ids must be unique within a schema")
        return value
```

Authoring the schema in YAML keeps the structural policy readable to the grant administrators who own it and diffable across funding cycles:

```yaml
agency: NIH
foa_id: PA-25-123
sections:
  - id: specific_aims
    title_pattern: '^\s*Specific\s+Aims\b'
    order_index: 0
    max_pages: 1
  - id: research_strategy
    title_pattern: '^\s*Research\s+Strategy\b'
    order_index: 1
    depends_on: specific_aims
    max_pages: 12
  - id: bibliography
    title_pattern: '^\s*Bibliography\s+(and|&)\s+References\s+Cited\b'
    order_index: 2
    required: true
```

Calling ComplianceSchema.model_validate(yaml.safe_load(text)) on that file's contents yields a fully typed, self-checked schema that the matcher can trust.
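The model above checks that ids and order indexes are unique, but it does not check that every depends_on value names a section that exists. One way to close that gap is sketched below. It is written against the plain mapping that yaml.safe_load would return, so it runs with the standard library alone; the function name find_dependency_errors is hypothetical and not part of the schema code above.

```python
from __future__ import annotations


def find_dependency_errors(raw_schema: dict) -> list[str]:
    """Return messages for depends_on values that are unknown or circular."""
    sections = raw_schema.get("sections", [])
    known = {s["id"] for s in sections}
    parent = {s["id"]: s.get("depends_on") for s in sections}
    errors: list[str] = []

    # A dependency must point at a section declared in the same schema.
    for section_id, dep in parent.items():
        if dep is not None and dep not in known:
            errors.append(f"{section_id}: depends_on unknown section {dep!r}")

    # Walk each dependency chain; revisiting a node means the chain loops.
    for start in parent:
        seen = {start}
        current = parent.get(start)
        while current is not None and current in parent:
            if current in seen:
                errors.append(f"{start}: circular dependency via {current!r}")
                break
            seen.add(current)
            current = parent.get(current)
    return errors


raw = {
    "agency": "NIH",
    "foa_id": "PA-25-123",
    "sections": [
        {"id": "specific_aims", "order_index": 0},
        {"id": "research_strategy", "order_index": 1, "depends_on": "specific_aims"},
        {"id": "bibliography", "order_index": 2, "depends_on": "refs"},  # typo on purpose
    ],
}
dependency_errors = find_dependency_errors(raw)
```

Running a check like this before model_validate would surface a mistyped dependency when the schema is authored, rather than as a DEPENDENCY_FAIL on every proposal in the cycle.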

Rule-aware section matcher implementation

The matcher reconciles the schema against the extracted document text. It records where each section heading actually appears, so that ordering is judged by real document position rather than assumed from the schema, and it reports three distinct violation classes — missing required sections, sequence mismatches, and unsatisfied dependencies — rather than collapsing them into a single boolean. Precise, typed output is what lets Automated Checklist Generation route each specific failure to the person who can fix it.

```python
from __future__ import annotations

import re

from pydantic import BaseModel


class StructuralReport(BaseModel):
    foa_id: str
    is_compliant: bool
    detected_order: list[str]
    violations: list[str]


class SectionMapper:
    """Validates a document's heading structure against a compliance schema."""

    def __init__(self, schema: ComplianceSchema) -> None:
        self.schema = schema
        self._patterns = {
            rule.id: re.compile(rule.title_pattern, re.IGNORECASE | re.MULTILINE)
            for rule in schema.sections
        }

    def validate_structure(self, document_text: str) -> StructuralReport:
        violations: list[str] = []
        positions: dict[str, int] = {}

        # 1. Presence: locate each section by the character offset of its heading.
        for rule in self.schema.sections:
            match = self._patterns[rule.id].search(document_text)
            if match is not None:
                positions[rule.id] = match.start()
            elif rule.required:
                violations.append(f"MISSING: required section '{rule.id}' not found")

        # 2. Sequence: compare observed order against the canonical order,
        #    restricted to the sections that were actually found.
        detected_order = [
            sid for sid, _ in sorted(positions.items(), key=lambda kv: kv[1])
        ]
        expected_order = [
            rule.id
            for rule in sorted(self.schema.sections, key=lambda r: r.order_index)
            if rule.id in positions
        ]
        if detected_order != expected_order:
            violations.append(
                f"ORDER_MISMATCH: expected {expected_order}, detected {detected_order}"
            )

        # 3. Dependencies: a present section whose prerequisite is absent.
        for rule in self.schema.sections:
            if (
                rule.id in positions
                and rule.depends_on is not None
                and rule.depends_on not in positions
            ):
                violations.append(
                    f"DEPENDENCY_FAIL: '{rule.id}' requires '{rule.depends_on}'"
                )

        return StructuralReport(
            foa_id=self.schema.foa_id,
            is_compliant=not violations,
            detected_order=detected_order,
            violations=violations,
        )
```

Because the matcher keys ordering on match.start(), it correctly flags a proposal that places the Bibliography before the Research Strategy even when both sections are present — a common and silently disqualifying error when authors reshuffle attachments late in a drafting cycle. The max_pages field is carried but not evaluated here; it is handed to Page Limit & Font Enforcement, which measures each mapped section against its allowance, and the tolerance it applies is set by Threshold Tuning for Compliance.
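Downstream stages need section boundaries, not just heading offsets. The sketch below shows one way to derive them from the positions the matcher records. It assumes a section runs from its heading to the next detected heading, or to the end of the text; the helper name section_spans is hypothetical.

```python
from __future__ import annotations


def section_spans(positions: dict[str, int], text_length: int) -> dict[str, tuple[int, int]]:
    """Turn heading offsets into half-open (start, end) character spans."""
    ordered = sorted(positions.items(), key=lambda kv: kv[1])
    spans: dict[str, tuple[int, int]] = {}
    for index, (section_id, start) in enumerate(ordered):
        # A section ends where the next detected heading starts.
        end = ordered[index + 1][1] if index + 1 < len(ordered) else text_length
        spans[section_id] = (start, end)
    return spans


text = "Specific Aims\naims body\nResearch Strategy\nstrategy body\n"
positions = {
    "specific_aims": 0,
    "research_strategy": text.index("Research Strategy"),
}
spans = section_spans(positions, len(text))
```

Under this assumption, any heading that the schema does not describe is absorbed into the span of the section before it, so a page measurement for that section would include the unmapped material.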

Agency-specific configuration

The engine is generic; the differences live entirely in the schema files. NIH, NSF, and DoD diverge on which sections are mandatory, how strictly ordering is enforced, what nomenclature authoritatively names each heading, and which materials must be uploaded as separate documents rather than embedded in the narrative. Encode these differences per agency rather than branching in code.

| Structural concern | NIH (SF424 R&R / PHS 398) | NSF (PAPPG) | DoD (BAA / white paper) |
| --- | --- | --- | --- |
| Core narrative section | Research Strategy (Significance, Innovation, Approach) | Project Description (Intellectual Merit, Broader Impacts) | Technical Volume / Technical Approach |
| Preceding required section | Specific Aims (1 page) | Project Summary (Overview, Intellectual Merit, Broader Impacts) | White paper or quad chart, per BAA |
| Ordering enforcement | Fixed attachment sequence in the application package | Strict; Data Management and Sharing Plan is a separate supplementary document | BAA-defined; volumes often submitted as separate files |
| Conditional sections | Human Subjects, Vertebrate Animals triggered by study type | Results from Prior NSF Support (conditional on prior award) | Classified annex, ITAR/EAR export-control statement |
| Heading nomenclature source | NIH application guide and the active FOA | NSF Proposal & Award Policies & Procedures Guide (PAPPG) | Individual BAA and component addenda |
| Embedding prohibitions | References excluded from the page limit | URLs and hyperlinks disallowed in the Project Description | Cost material must not appear in the Technical Volume |

The nomenclature column matters most for the matcher: NSF headings follow the NSF Proposal Guide Taxonomy derived from the PAPPG, NIH headings follow its application guide, and DoD headings are defined per solicitation — so DoD schemas are best generated from the extracted requirements produced by the DoD BAA requirement extraction workflow rather than authored by hand. Building the canonical schema itself upstream is the job of NIH FOA schema mapping; required section mapping is the enforcement counterpart that checks a document against it.

Error handling and edge cases

Naive heading matching produces false positives that destroy administrator trust faster than missed violations do. Production mappers need explicit handling for the ways real documents deviate from a clean heading tree.

- Heading numbering variants. Authors and institutional templates render the same section as ## 3. Research Strategy, ### III. Research Strategy, or a bare Research Strategy. Patterns must tolerate optional leading numerals, Roman numerals, and markup: anchor on the section name, not the decoration.
- Ambiguous or duplicated matches. A phrase like “research strategy” appearing inside a sentence can be mistaken for a heading. Constrain patterns to line starts (re.MULTILINE), and when extraction supplies coordinates, prefer heading candidates in the top zone of a page or those carrying larger font sizes, as surfaced by the NLP Section Boundary Detection stage.
- Conditional sections that must not misfire. A missing Human Subjects section is a violation only when the study involves human subjects; a missing Results from Prior NSF Support section is a violation only for a PI with a prior award. Model these as dependency-gated rules whose required status is resolved from proposal metadata, not asserted unconditionally.
- Fallback chains for unrecognized headings. When the primary regex fails, escalate rather than immediately declaring the section missing: try a secondary relaxed pattern, then a semantic heading classifier, and only then flag for manual review. This graceful degradation prevents a single template quirk from cascading into dozens of spurious violations.
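The conditional-section rule in the list above can be sketched as follows. Every name here is an assumption for illustration: the ConditionalRule class, its required_when field, and the metadata keys involves_human_subjects and has_prior_nsf_award do not appear in the schema model shown earlier.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionalRule:
    section_id: str
    required: bool = True
    required_when: str | None = None  # hypothetical name of a metadata flag


def resolve_required(rule: ConditionalRule, metadata: dict[str, bool]) -> bool:
    """Decide whether a section is required for this particular proposal."""
    if rule.required_when is None:
        return rule.required
    # A missing flag is treated as False here; a stricter engine might
    # refuse to validate until the metadata is complete.
    return bool(metadata.get(rule.required_when, False))


rules = [
    ConditionalRule("research_strategy"),
    ConditionalRule("human_subjects", required_when="involves_human_subjects"),
    ConditionalRule("prior_support", required_when="has_prior_nsf_award"),
]
metadata = {"involves_human_subjects": True, "has_prior_nsf_award": False}
required_ids = [r.section_id for r in rules if resolve_required(r, metadata)]
```

Resolving the flag before the presence check keeps the matcher itself unchanged: it still sees a plain required value, computed per proposal.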

A fallback resolver makes the escalation explicit. Each strategy is tried in order, and the first hit wins; only when every strategy misses does the section fall through to manual review, and the resolver records which strategy fired so ambiguous matches can be audited later:

```python
from __future__ import annotations

from collections.abc import Callable


class HeadingResolver:
    """Resolves a section heading through an ordered chain of strategies."""

    def __init__(self, strategies: list[tuple[str, Callable[[str], int | None]]]) -> None:
        # Each strategy returns the heading offset, or None if it cannot match.
        self._strategies = strategies

    def resolve(self, section_id: str, text: str) -> tuple[int | None, str]:
        # section_id is not used by the loop itself; it stays in the signature
        # so callers can tag audit entries with the section being resolved.
        for name, strategy in self._strategies:
            offset = strategy(text)
            if offset is not None:
                return offset, name          # (position, winning strategy)
        return None, "manual_review"          # every strategy missed
```

Wiring a strict regex, a relaxed regex that tolerates numbering and markup, and a trained classifier into that chain lets one resolver serve NIH, NSF, and DoD documents while keeping the reason for every match inspectable.
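A minimal sketch of the first two links in that chain, under some assumptions: the classifier is left out, the resolver's first-hit-wins loop is restated as a plain function so the block runs on its own, and the helper regex_strategy and the DECORATION pattern are illustrative names rather than part of the code above.

```python
from __future__ import annotations

import re
from collections.abc import Callable
from typing import Optional

Strategy = Callable[[str], Optional[int]]


def regex_strategy(pattern: str) -> Strategy:
    """Wrap a heading regex as a strategy that returns the heading offset."""
    compiled = re.compile(pattern, re.IGNORECASE | re.MULTILINE)

    def find(text: str) -> int | None:
        match = compiled.search(text)
        return match.start() if match else None

    return find


# Optional markdown markers, then an optional Arabic or Roman numeral.
DECORATION = r"^[ \t]*(?:#{1,6}[ \t]+)?(?:(?:\d+|[IVXLC]+)[.)][ \t]+)?"

strategies: list[tuple[str, Strategy]] = [
    ("strict_regex", regex_strategy(r"^[ \t]*Research[ \t]+Strategy\b")),
    ("relaxed_regex", regex_strategy(DECORATION + r"Research[ \t]+Strategy\b")),
]


def resolve(text: str) -> tuple[int | None, str]:
    """Same first-hit-wins loop as HeadingResolver.resolve, inlined."""
    for name, strategy in strategies:
        offset = strategy(text)
        if offset is not None:
            return offset, name
    return None, "manual_review"


plain = resolve("Intro\nResearch Strategy\n")
numbered = resolve("Intro\n### III. Research Strategy\n")
in_prose = resolve("Our research strategy is described later.\n")
```

The patterns use [ \t] rather than \s so that a match cannot start on a preceding blank line, which keeps the reported offset on the heading's own line.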

Every violation should carry enough context — the section id, the offending position, and the rule that fired — for downstream annotation. Log the resolved schema foa_id and an internal rule-engine version with each run so that a verdict can always be reproduced against the exact policy that produced it.
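One possible shape for that context is sketched below, assuming a structured record in place of the plain strings the matcher emits. The Violation fields, the ENGINE_VERSION constant, and the schema hash are illustrative choices, not part of the matcher above.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass

ENGINE_VERSION = "0.0.0-example"  # hypothetical internal version string


@dataclass(frozen=True)
class Violation:
    code: str               # e.g. "MISSING", "ORDER_MISMATCH", "DEPENDENCY_FAIL"
    section_id: str
    offset: int | None      # character offset of the heading, if it was found
    rule: str               # which schema rule fired


def run_record(foa_id: str, schema_text: str, violations: list[Violation]) -> str:
    """Serialize one validation run as a deterministic JSON line."""
    record = {
        "foa_id": foa_id,
        "engine_version": ENGINE_VERSION,
        # Hash of the exact schema file so the policy can be matched later.
        "schema_sha256": hashlib.sha256(schema_text.encode("utf-8")).hexdigest(),
        "violations": [asdict(v) for v in violations],
    }
    return json.dumps(record, sort_keys=True)


line = run_record(
    "PA-25-123",
    "agency: NIH\nfoa_id: PA-25-123\n",
    [Violation("DEPENDENCY_FAIL", "research_strategy", 412, "depends_on=specific_aims")],
)
```

The record deliberately omits a timestamp, so the same document and schema serialize to the same line, in keeping with the determinism requirement.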

Integration with the downstream compliance pipeline

Section mapping is the entry point of a two-pass evaluation. The structural pass confirms presence, sequence, and dependency resolution; only sections that pass structurally are handed to the dimensional pass, which measures rendered size against each section’s max_pages allowance. Separating the passes means a page-limit failure is reported against a correctly identified section rather than being conflated with a mislabeled-heading error.

The pipeline moves from extracted document text through schema matching to a single decision: a clean document hands its mapped sections to the dimensional checks, while any violation is emitted as typed output and routed into a remediation checklist item.

The structural report feeds two consumers. The mapped section boundaries and their max_pages allowances flow to Page Limit & Font Enforcement, and the violation list flows to Automated Checklist Generation, which converts each entry into an FOA-specific, human-readable deficiency that names the fix and the responsible owner. When institutions process solicitations in bulk, run the mapper through the async batch processor so a full portfolio can be validated in one overnight pass.
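As a rough illustration of the routing step, the sketch below groups the matcher's violation strings by their class prefix. The OWNERS table and its role names are hypothetical; a real deployment would take them from the institution's own checklist configuration.

```python
from __future__ import annotations

# Hypothetical routing table: violation class -> role that can fix it.
OWNERS = {
    "MISSING": "principal_investigator",
    "ORDER_MISMATCH": "grants_administrator",
    "DEPENDENCY_FAIL": "principal_investigator",
}


def route_violations(violations: list[str]) -> dict[str, list[str]]:
    """Group violation messages by the owner of their class prefix."""
    routed: dict[str, list[str]] = {}
    for message in violations:
        # The matcher formats each message as "<CLASS>: <detail>".
        code, _, _detail = message.partition(":")
        owner = OWNERS.get(code.strip(), "manual_triage")
        routed.setdefault(owner, []).append(message)
    return routed


routed = route_violations(
    [
        "MISSING: required section 'bibliography' not found",
        "ORDER_MISMATCH: expected ['a', 'b'], detected ['b', 'a']",
        "something unexpected",
    ]
)
```

Messages with an unrecognized prefix fall through to a manual_triage bucket rather than being dropped.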

Testing and verification

Because the mapper gates every submission, it earns exhaustive tests. Fixture documents should exercise each violation class independently so a regression names exactly which behavior broke:

```python
import pytest

from mapper import ComplianceSchema, SectionMapper, SectionRule


@pytest.fixture
def nih_schema() -> ComplianceSchema:
    return ComplianceSchema(
        agency="NIH",
        foa_id="PA-25-123",
        sections=[
            SectionRule(id="aims", title_pattern=r"^\s*Specific\s+Aims\b", order_index=0),
            SectionRule(
                id="strategy",
                title_pattern=r"^\s*Research\s+Strategy\b",
                order_index=1,
                depends_on="aims",
            ),
        ],
    )


def test_compliant_document_passes(nih_schema: ComplianceSchema) -> None:
    doc = "Specific Aims\n...\nResearch Strategy\n..."
    report = SectionMapper(nih_schema).validate_structure(doc)
    assert report.is_compliant
    assert report.detected_order == ["aims", "strategy"]


def test_missing_required_section_is_flagged(nih_schema: ComplianceSchema) -> None:
    doc = "Research Strategy\n..."  # Specific Aims absent
    report = SectionMapper(nih_schema).validate_structure(doc)
    assert not report.is_compliant
    assert any("MISSING" in v for v in report.violations)


def test_out_of_order_sections_are_flagged(nih_schema: ComplianceSchema) -> None:
    doc = "Research Strategy\n...\nSpecific Aims\n..."  # reversed
    report = SectionMapper(nih_schema).validate_structure(doc)
    assert any("ORDER_MISMATCH" in v for v in report.violations)


def test_dependency_failure_is_flagged() -> None:
    schema = ComplianceSchema(
        agency="NSF",
        foa_id="PAPPG-24-1",
        sections=[
            SectionRule(id="prior", title_pattern=r"^\s*Results\s+from\s+Prior", order_index=0),
            SectionRule(
                id="continuation",
                title_pattern=r"^\s*Continuation",
                order_index=1,
                depends_on="prior",
            ),
        ],
    )
    doc = "Continuation\n..."  # depends on a missing 'prior' section
    report = SectionMapper(schema).validate_structure(doc)
    assert any("DEPENDENCY_FAIL" in v for v in report.violations)
```

A release checklist complements the unit tests: confirm every active FOA schema loads without a validation error, replay the previous cycle’s submissions and diff the verdicts against recorded receipts, and verify that each violation class round-trips into a checklist item. Treating section mapping as a version-controlled, deterministic compliance layer lets research institutions eliminate structural rework and guarantee intake-screen readiness across every federal funding mechanism they submit to.
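The replay step in that checklist can be sketched as a verdict diff. The example assumes recorded receipts are available as a mapping from submission id to the is_compliant flag; the function name diff_verdicts and the ids shown are hypothetical.

```python
from __future__ import annotations


def diff_verdicts(
    recorded: dict[str, bool], replayed: dict[str, bool]
) -> dict[str, tuple[bool | None, bool | None]]:
    """Return submissions whose compliance verdict changed between runs."""
    changed: dict[str, tuple[bool | None, bool | None]] = {}
    # Union of keys so a submission missing from either side is also reported.
    for submission_id in sorted(recorded.keys() | replayed.keys()):
        before = recorded.get(submission_id)
        after = replayed.get(submission_id)
        if before != after:
            changed[submission_id] = (before, after)
    return changed


recorded = {"sub-001": True, "sub-002": False, "sub-003": True}
replayed = {"sub-001": True, "sub-002": True, "sub-003": True}
drift = diff_verdicts(recorded, replayed)
```

An empty result means the new engine or schema version reproduces every recorded verdict; any entry is a behavior change that should be explained before release.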

Related resources

- Page Limit & Font Enforcement: measures each mapped section against its page and typography allowance.
- Threshold Tuning for Compliance: sets the tolerance bands that separate real violations from rendering artifacts.
- Automated Checklist Generation: turns structural violations into routed, human-readable deficiencies.
- Mapping mandatory sections for NSF CAREER proposals: a worked agency-specific reference implementation.
- NIH FOA Schema Mapping: builds the canonical schema this stage enforces against.
