Mapping mandatory sections for NSF CAREER proposals

The National Science Foundation (NSF) Faculty Early Career Development (CAREER) program enforces one of the most rigid structural frameworks among federal funding mechanisms, and the failure mode it punishes is entirely mechanical. A proposal that omits the Departmental Letter, mislabels the integrated education plan, or presents its attachments out of sequence is returned without review by the Research.gov intake screen, no matter how strong the science. This page shows how to build a deterministic section mapper in Python that catches those structural defects before submission. It is a focused application of the required section mapping workflow, and it sits inside the broader compliance validation and rule engines layer, where every downstream measurement (page counts, font checks, completeness reports) depends first on knowing precisely where each mandatory CAREER section begins and ends.

Unlike a standard research grant, a CAREER proposal (currently governed by program solicitation NSF 22-586 and the active NSF Proposal and Award Policies and Procedures Guide (PAPPG)) requires a specific attachment set in a fixed order, plus two artifacts unique to the mechanism: a signed Departmental Letter and a Project Description that explicitly integrates research with education. The mapper’s job is to convert that structural mandate into machine-enforceable rules and verify an assembled package against them.

Phase 1 — Decompose the CAREER package into a canonical schema

Section mapping runs on already-extracted document text, so it sits downstream of ingestion and upstream of the typographic checks. Assume a Python 3.10+ environment:

bash

python -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2.6" "pyyaml>=6.0"

Before writing any code, enumerate the canonical CAREER structure as an ordered, immutable sequence. Each entry becomes one rule the engine will enforce:

Project Summary — 1 page, with three separately labeled subsections: Overview, Intellectual Merit, Broader Impacts.
Project Description — 15 pages, and it must describe an integrated research-and-education plan (the CAREER-specific constraint).
References Cited — no page limit, but must be contiguous and complete.
Biographical Sketch(es) — SciENcv-generated, per the page limit in the active PAPPG.
Budget and Budget Justification — up to 5 pages; follow shared budget justification format standards so caps and cost categories stay consistent across agencies.
Current and Pending (Other) Support — SciENcv or the NSF fillable format.
Facilities, Equipment and Other Resources.
Data Management Plan — a mandatory 2-page supplementary document.
Departmental Letter — mandatory for CAREER, signed by the department chair, filed under Other Supplementary Docs.

The implementation steps to turn that list into a schema are:

Assign every section a stable section_id and a position index so ordering is a comparison, not a guess.
Attach a heading_pattern (a case-insensitive regular expression) that recognizes the section’s heading in extracted text.
Mark each section mandatory or optional, and record any depends_on relationship (for example, the Budget Justification depends on the Budget existing).
Store the schema as version-controlled YAML keyed to the solicitation number, so a new CAREER cycle ships a new file rather than a code change.
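
As a rough illustration of the four steps above, the sketch below embeds a small schema document and loads it with PyYAML. The field names follow the steps (section_id, position, heading_pattern, mandatory, depends_on), but the file layout, the position values, and the load_rules helper are assumptions made for this example, not a format NSF publishes.

```python
import yaml  # PyYAML, installed in the setup step above

# Hypothetical schema file contents, keyed to the solicitation number.
# Entries are deliberately listed out of order to show that position decides.
SCHEMA_YAML = r"""
solicitation: "NSF 22-586"
sections:
  - section_id: budget_justification
    position: 6
    heading_pattern: '^budget justification$'
    mandatory: true
    depends_on: budget
  - section_id: project_summary
    position: 1
    heading_pattern: '^project summary$'
    mandatory: true
  - section_id: budget
    position: 5
    heading_pattern: '^budget$'
    mandatory: true
"""


def load_rules(text: str) -> tuple[str, list[dict]]:
    """Parse the schema and return (solicitation, rules sorted by position)."""
    raw = yaml.safe_load(text)
    rules = sorted(raw["sections"], key=lambda rule: rule["position"])
    return raw["solicitation"], rules


solicitation, rules = load_rules(SCHEMA_YAML)
print(solicitation, [rule["section_id"] for rule in rules])
```

Because ordering comes from the position field rather than from file order, a reviewer can reorder or append entries in the YAML without changing what the engine enforces.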

Figure: The nine CAREER attachments in canonical position order. Each becomes one SectionRule; the two highlighted entries, the education-integrated Project Description and the chair-signed Departmental Letter, are the CAREER-specific gates the mapper must enforce.

Phase 2 — Encode the schema as data and evaluate structure as a verdict

The central discipline is that agency policy is data and evaluation is code: never scatter string literals like "Project Description" through scripts. Encode each rule as a model in the Pydantic validation layer so a malformed schema fails loudly at load time rather than silently mis-validating every proposal in a submission cycle.

python

from __future__ import annotations

import re

from pydantic import BaseModel, ConfigDict, field_validator


class SectionRule(BaseModel):
    """One mandatory or optional section in the CAREER structural schema."""

    model_config = ConfigDict(frozen=True)

    section_id: str
    heading_pattern: str
    position: int
    mandatory: bool = True
    depends_on: str | None = None
    max_pages: int | None = None

    @field_validator("heading_pattern")
    @classmethod
    def _pattern_must_compile(cls, value: str) -> str:
        # Reject a schema whose regex cannot compile, before it is ever run.
        try:
            re.compile(value, re.IGNORECASE)
        except re.error as exc:
            raise ValueError(f"invalid heading_pattern: {exc}") from exc  # noqa: TRY003
        return value


class CareerSchema(BaseModel):
    solicitation: str
    sections: list[SectionRule]

    @field_validator("sections")
    @classmethod
    def _positions_unique_and_ordered(cls, rules: list[SectionRule]) -> list[SectionRule]:
        positions = [r.position for r in rules]
        if len(set(positions)) != len(positions):
            raise ValueError("duplicate position index in CAREER schema")
        return sorted(rules, key=lambda r: r.position)
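
One gap that load-time validation could also close: a depends_on value that names a section_id absent from the schema can never be satisfied at mapping time, so every package would report an unmet dependency. The sketch below shows one possible cross-field check using Pydantic's model_validator. The Rule and CheckedSchema classes are trimmed, hypothetical stand-ins for the models above, not a replacement for them.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, model_validator


class Rule(BaseModel):
    """Trimmed stand-in for SectionRule: only the fields this check needs."""

    section_id: str
    position: int
    depends_on: Optional[str] = None


class CheckedSchema(BaseModel):
    solicitation: str
    sections: list[Rule]

    @model_validator(mode="after")
    def _dependencies_must_resolve(self) -> "CheckedSchema":
        # Every depends_on must point at a section_id declared in this schema.
        known = {rule.section_id for rule in self.sections}
        dangling = sorted(
            rule.section_id
            for rule in self.sections
            if rule.depends_on is not None and rule.depends_on not in known
        )
        if dangling:
            raise ValueError(f"depends_on target missing from schema for: {dangling}")
        return self


try:
    CheckedSchema(
        solicitation="NSF 22-586",
        sections=[Rule(section_id="budget_justification", position=5, depends_on="budget")],
    )
except ValidationError as exc:
    print(exc.errors()[0]["msg"])
```

Rejecting the schema at load time keeps a data-entry slip in the YAML from showing up later as a false defect on every proposal.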

The matcher then scans the extracted heading list once per rule, records where each recognized section was observed, and produces a verdict object rather than raising on the first defect, because grant offices need the full list of problems, not the first one:

python

import re  # also imported in the schema block; repeated so this block reads on its own
from dataclasses import dataclass, field


@dataclass
class MappingVerdict:
    missing: list[str] = field(default_factory=list)
    out_of_order: list[str] = field(default_factory=list)
    unmet_dependencies: list[str] = field(default_factory=list)

    @property
    def is_compliant(self) -> bool:
        return not (self.missing or self.out_of_order or self.unmet_dependencies)


def map_sections(headings: list[str], schema: CareerSchema) -> MappingVerdict:
    """Reconcile a document's heading list against the canonical CAREER schema."""
    verdict = MappingVerdict()
    found_at: dict[str, int] = {}

    for rule in schema.sections:
        pattern = re.compile(rule.heading_pattern, re.IGNORECASE)
        idx = next((i for i, h in enumerate(headings) if pattern.search(h)), None)
        if idx is None:
            if rule.mandatory:
                verdict.missing.append(rule.section_id)
            continue
        found_at[rule.section_id] = idx

    # Ordering: observed document order must match canonical position order.
    observed = sorted(found_at, key=lambda sid: found_at[sid])
    expected = [r.section_id for r in schema.sections if r.section_id in found_at]
    if observed != expected:
        verdict.out_of_order = [s for s, e in zip(observed, expected) if s != e]

    # Dependencies: a section present without its prerequisite is a defect.
    present = set(found_at)
    for rule in schema.sections:
        if rule.section_id in present and rule.depends_on and rule.depends_on not in present:
            verdict.unmet_dependencies.append(rule.section_id)

    return verdict

The branching the mapper drives — presence, ordering, and the CAREER-specific education-plan and Departmental Letter gates — resolves as follows:

Figure: Three sequential gates (mandatory-section presence, the Departmental Letter, and the education plan inside the Project Description) each short-circuit to a single critical failure node; only a package clearing all three plus canonical ordering is handed downstream.

Phase 3 — Edge cases and agency-specific overrides

PDF generation variability and directorate-level differences produce predictable false positives. The mapper must neutralize each one deterministically rather than trusting raw extracted text:

Merged or split headings. PDFs produced from LaTeX or Microsoft Word frequently extract with a heading merged into the following text on a single line, such as PROJECT DESCRIPTION Intellectual Merit, or with one heading split across two runs. Detect consecutive lines sharing an identical font size and weight, then re-segment them with a keyword set ({"PROJECT", "DESCRIPTION", "SUMMARY", "REFERENCES"}) before the matcher runs. Heading candidates should come from a coordinate-aware source such as the pdfplumber text extraction stage, which preserves per-character font size.
Directorate naming variance. The integrated education component is labeled differently across NSF directorates — “Education Plan,” “Education and Outreach Plan,” or “Career Development Plan.” The heading_pattern for that rule must be an alternation (education (and outreach )?plan|career development plan) rather than a literal, or a valid proposal will be flagged as missing its most CAREER-defining element.
Non-ASCII encoding. Normalize heading text, including any author names or institutional affiliations it carries, with unicodedata.normalize("NFKC", text) before regex evaluation, so a ligature or accented character in a heading does not silently break a match.
Versioned policy references. Font floors are typeface-dependent under the current PAPPG: Arial, Courier New, and Palatino Linotype are permitted at 10 points or larger, but Times New Roman must be 11 points or larger. Because NSF revises these rules on its own cycle, pin the font table to the schema’s solicitation key and re-verify against the active PAPPG at deploy time — never hardcode a single global minimum. This mirrors the typeface-aware approach used when enforcing NIH page-limit and font rules programmatically, where NIH’s flat 11-point floor differs from NSF’s conditional one.
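
The heading clean-up described in the first and third items above might be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list of (text, font_size) pairs from some coordinate-aware extractor, font weight is ignored for brevity, and HEADING_KEYWORDS, normalize, and resegment are hypothetical names introduced here.

```python
import unicodedata

# Hypothetical keyword set; extend it to cover every canonical heading token.
HEADING_KEYWORDS = {"PROJECT", "DESCRIPTION", "SUMMARY", "REFERENCES", "CITED"}


def normalize(text: str) -> str:
    """Fold compatibility characters (ligatures, no-break spaces) and tidy spacing."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def resegment(lines):
    """Join split headings and cut merged ones.

    Each input item is (text, font_size). Consecutive keyword-only lines with
    the same font size are joined; a line that starts with keywords and then
    continues with other words is cut at the first non-keyword token.
    """
    out = []
    prev_size = None  # font size of the previous line if it was keyword-only
    for text, size in lines:
        tokens = normalize(text).split()
        lead = 0
        while lead < len(tokens) and tokens[lead].upper() in HEADING_KEYWORDS:
            lead += 1
        keyword_only = lead > 0 and lead == len(tokens)
        if keyword_only and prev_size == size:
            out[-1] = f"{out[-1]} {' '.join(tokens)}"  # second half of a split heading
        elif 0 < lead < len(tokens):
            out.extend([" ".join(tokens[:lead]), " ".join(tokens[lead:])])
        else:
            out.append(" ".join(tokens))
        prev_size = size if keyword_only else None
    return out


print(resegment([("PROJECT DESCRIPTION Intellectual Merit", 12.0)]))
print(resegment([("PROJECT", 14.0), ("DESCRIPTION", 14.0), ("Body text", 11.0)]))
```

One known limitation of this sketch: it would also join two complete headings that follow each other with no body text in between, so a production version would need to check joined candidates against the canonical heading list.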

When extraction confidence for a mandatory section falls below a defined threshold, route the document to manual review rather than proceeding with corrupted state.
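
A minimal sketch of that routing rule, assuming the extraction stage reports a confidence score between 0 and 1 for each section. The Extraction record, the 0.85 threshold, and the needs_manual_review helper are all hypothetical; the source does not define how confidence is computed.

```python
from dataclasses import dataclass

# Hypothetical threshold; tune it against a labelled sample of real packages.
MIN_CONFIDENCE = 0.85


@dataclass(frozen=True)
class Extraction:
    section_id: str
    mandatory: bool
    confidence: float  # 0.0 to 1.0, as reported by the (assumed) extraction stage


def needs_manual_review(extractions: list[Extraction]) -> list[str]:
    """Return the mandatory sections whose extraction confidence is too low."""
    return [
        e.section_id
        for e in extractions
        if e.mandatory and e.confidence < MIN_CONFIDENCE
    ]


flagged = needs_manual_review(
    [
        Extraction("project_summary", mandatory=True, confidence=0.97),
        Extraction("departmental_letter", mandatory=True, confidence=0.41),
    ]
)
print(flagged)
```

A non-empty result would send the package to a human before the mapper's verdict is trusted.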

Phase 4 — Validation and verification before handoff

Before the mapper’s output feeds the page-and-font checks, confirm it deterministically. The following test asserts that a compliant package passes and demonstrates the Departmental Letter gate, the CAREER-specific check the matcher enforces directly through the mandatory flag:

python

def test_career_mapping_gates() -> None:
    schema = CareerSchema(
        solicitation="NSF 22-586",
        sections=[
            SectionRule(section_id="project_summary", heading_pattern=r"project summary", position=1),
            SectionRule(section_id="project_description", heading_pattern=r"project description", position=2, max_pages=15),
            SectionRule(section_id="references_cited", heading_pattern=r"references cited", position=3),
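            # The prerequisite must itself be a rule, or depends_on can never resolve.
            # Anchored so it cannot match "Budget Justification"; slot 4 is free
            # because this trimmed schema omits the Biographical Sketch.
            SectionRule(section_id="budget", heading_pattern=r"^budget$", position=4),
            SectionRule(section_id="budget_justification", heading_pattern=r"budget justification", position=5, depends_on="budget"),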
            SectionRule(section_id="departmental_letter", heading_pattern=r"departmental letter", position=9),
        ],
    )
    headings = [
        "Project Summary", "Project Description", "References Cited",
        "Budget", "Budget Justification", "Departmental Letter",
    ]
    verdict = map_sections(headings, schema)
    assert verdict.is_compliant

    # Remove the Departmental Letter → CAREER-specific CRITICAL failure.
    broken = map_sections([h for h in headings if h != "Departmental Letter"], schema)
    assert "departmental_letter" in broken.missing

For sign-off, run this checklist against the MappingVerdict before routing to Research.gov:

missing is empty (every mandatory section, including the Departmental Letter and Data Management Plan, is present).
out_of_order is empty (attachments follow canonical position order).
unmet_dependencies is empty (no Budget Justification without a Budget).
The integrated education plan heading matched inside the Project Description.
Every extraction and validation event is written to append-only, WORM-compliant storage with a UTC timestamp, the file’s SHA-256 hash, the observed section boundaries, and the schema solicitation version — so a sponsored-projects office can reconstruct the exact compliance posture at submission time.
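
The audit entry in the last checklist item could take a shape like the one below. This is only a sketch: the field names and the build_audit_record helper are assumptions, and the write to append-only storage is left out because it depends on the storage layer in use.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(pdf_bytes, solicitation, boundaries):
    """Serialize one audit entry as a JSON line.

    boundaries maps section_id -> (first_page, last_page); this shape is an
    assumption for illustration.
    """
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "solicitation": solicitation,
        "section_boundaries": {sid: list(span) for sid, span in boundaries.items()},
    }
    # sort_keys gives a stable field order, which keeps log lines easy to diff.
    return json.dumps(record, sort_keys=True)


line = build_audit_record(b"%PDF-1.7 ...", "NSF 22-586", {"project_summary": (1, 1)})
print(line)
```

Hashing the exact bytes that were validated ties the recorded verdict to one specific file, so a later edit to the PDF is detectable.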

Frequently asked questions

Why does the mapper treat the Departmental Letter as a CRITICAL failure rather than a warning?

The Departmental Letter is a mandatory attachment unique to the CAREER mechanism, and Research.gov's automated intake screen returns proposals that omit it without review. Because a missing letter guarantees a return-without-review outcome, it must halt routing exactly like a missing Project Description — a warning that a reviewer might ignore is not sufficient.

How should the schema handle the education plan being named differently across directorates?

Use a regular-expression alternation for that rule's heading_pattern (for example, education (and outreach )?plan|career development plan) instead of a literal string. Directorates label the integrated education component inconsistently, so a literal match will report a valid proposal as missing its most CAREER-defining element.
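
As a sketch of how that alternation might be applied, the function below searches only the headings that fall inside the Project Description span. The start and end indices are assumed to come from the mapper's observed section boundaries, and education_plan_present is a hypothetical helper name.

```python
import re

# The alternation quoted above, compiled case-insensitively.
EDUCATION_PLAN = re.compile(
    r"(?:education (?:and outreach )?plan|career development plan)",
    re.IGNORECASE,
)


def education_plan_present(headings: list[str], start: int, end: int) -> bool:
    """True if an education-plan heading appears in headings[start:end]."""
    return any(EDUCATION_PLAN.search(h) for h in headings[start:end])


headings = [
    "Project Description",
    "Research Plan",
    "Education and Outreach Plan",
    "References Cited",
]
# Indices 1..3 cover the subheadings between Project Description and References Cited.
print(education_plan_present(headings, 1, 3))
```

Restricting the search to the Project Description span keeps an education-plan phrase that appears elsewhere in the package, for example in a letter, from satisfying the check.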

Is there a single minimum font size I can enforce for every CAREER attachment?

No. Under the current PAPPG the floor is typeface-dependent: Arial, Courier New, and Palatino Linotype are allowed at 10 points, but Times New Roman must be at least 11 points. Pin the font table to the solicitation version in the schema and re-verify against the active PAPPG at deploy time rather than hardcoding one global minimum.
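
One way to pin the font table to the solicitation is a lookup keyed by the schema's solicitation string, sketched below. The values are the ones quoted in this answer and should be re-verified against the active PAPPG. Treating any typeface missing from the table as a failure is an assumption of this sketch, as are the FONT_FLOORS and font_ok names.

```python
# Minimum point sizes per typeface, keyed by the schema's solicitation string.
# Values are the ones quoted above; re-verify against the active PAPPG.
FONT_FLOORS = {
    "NSF 22-586": {
        "Arial": 10.0,
        "Courier New": 10.0,
        "Palatino Linotype": 10.0,
        "Times New Roman": 11.0,
    },
}


def font_ok(solicitation: str, typeface: str, size_pt: float) -> bool:
    """True if the typeface is listed for the solicitation and meets its floor."""
    floor = FONT_FLOORS.get(solicitation, {}).get(typeface)
    # Assumption: an unknown solicitation or typeface fails closed.
    return floor is not None and size_pt >= floor


print(font_ok("NSF 22-586", "Arial", 10.0))
print(font_ok("NSF 22-586", "Times New Roman", 10.0))
```

Failing closed on an unknown solicitation forces someone to add a verified table for each new cycle instead of inheriting stale numbers.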

Should section mapping run on the draft source or only on the final PDF?

Both. Run it on the LaTeX, DOCX, or Markdown source during drafting for fast feedback, and again on the final assembled PDF for authoritative sign-off. The same schema drives both passes; only the heading-extraction front end differs.
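
To make the idea of a swappable front end concrete, here is a rough sketch of two draft-source extractors that return plain heading strings for the mapper. Both regular expressions are simplifying assumptions: the LaTeX pattern ignores nested braces, the Markdown pattern does not skip fenced code blocks, and a DOCX front end is omitted because it needs a third-party reader.

```python
import re

# Hypothetical front ends: each returns plain heading strings for the mapper.
_LATEX = re.compile(r"\\(?:sub)*section\*?\{([^}]*)\}")
_MARKDOWN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def latex_headings(source: str) -> list[str]:
    """Collect the titles of section and subsection commands, starred or not."""
    return [m.group(1).strip() for m in _LATEX.finditer(source)]


def markdown_headings(source: str) -> list[str]:
    """Collect ATX-style headings (lines starting with one to six # marks)."""
    return [m.group(1).strip() for m in _MARKDOWN.finditer(source)]


tex = r"\section{Project Summary} Some text. \subsection*{Overview}"
md = "# Project Summary\nSome text.\n## Overview\n"
print(latex_headings(tex))
print(markdown_headings(md))
```

Because both functions return the same list-of-strings shape, either result can be passed to map_sections with the same schema.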

Why does map_sections return a verdict object instead of raising on the first defect?

Grant offices need the complete list of structural problems in one pass so a writer can fix them all before resubmitting. Raising on the first missing or out-of-order section would force a slow fix-one-rerun loop; collecting missing, out_of_order, and unmet_dependencies into a MappingVerdict surfaces every defect at once.

Related

Required section mapping: the schema-as-data pattern and matcher this page specializes for CAREER.
Enforcing NIH 12-page limit rules programmatically: the downstream page and font check.
Parsing NSF PAPPG section headers programmatically: how the canonical heading patterns are derived.
Validating parsed RFP JSON against agency schemas: the Pydantic validation techniques reused here.
Automated checklist generation: turning a MappingVerdict into a reviewer-facing checklist.
