NSF Proposal Guide Taxonomy

A single misread heading in a National Science Foundation (NSF) solicitation is enough to route a Data Management Plan into the wrong validator, silently exceed a character limit, or drop a mandatory section from the pre-submission manifest — and the sponsored programs office finds out only when Research.gov rejects the package hours before the deadline. The NSF Proposal & Award Policies & Procedures Guide (PAPPG) encodes the structure every competitive application must follow, but as prose it is unenforceable. This page defines a machine-readable taxonomy that turns the PAPPG’s narrative blocks, formatting mandates, and budget categories into typed, validated objects a pipeline can act on. It sits inside the broader Core Architecture & RFP Taxonomy domain, where each agency’s requirement model is normalized to a shared contract so downstream compliance engines never have to special-case NSF by hand. Get the taxonomy right and structural violations surface while a draft is still editable — not after Research.gov has already said no.

Prerequisites and environment setup

The taxonomy is built as a small Python package with strict typing throughout. Everything below assumes Python 3.10 or newer, because the code writes unions with the X | Y syntax in annotations (for example, int | None), which Pydantic evaluates at runtime.

bash

python3 --version          # must report 3.10 or higher
python3 -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2.6" pdfplumber

Three assumptions about the source documents shape every parser in this page:

Format. NSF distributes the PAPPG and individual program solicitations as PDF. Author-submitted proposal documents (Project Description, Data Management Plan, budget justification) also arrive as PDF once compiled by Research.gov. Raw text is only available after a coordinate-aware extraction pass — covered in PDF Text Extraction with pdfplumber — so this taxonomy consumes extracted text, never the binary PDF directly.
Version. The PAPPG is revised roughly annually and each revision carries an effective date (for example, PAPPG 24-1). Section numbering and page-limit thresholds change between revisions, so every parsed record must be stamped with the PAPPG version it was validated against.
Scope. This page models the proposal-preparation chapter (PAPPG Chapter II) — the sections, formatting rules, and budget categories a proposal must satisfy. Award-management rules (Chapters V–XII) are out of scope here.

Pin the PAPPG revision in one place so a mid-cycle policy update is a one-line change rather than a hunt through regexes:

python

from typing import Final

ACTIVE_PAPPG_VERSION: Final[str] = "24-1"
ACTIVE_PAPPG_EFFECTIVE: Final[str] = "2024-05-20"

Core mechanism — the PAPPG decomposed into addressable nodes

The taxonomy treats the PAPPG not as a document but as a tree of addressable nodes. Each node is one required proposal component — Project Summary, Project Description, References Cited, Biographical Sketches, Budget and Budget Justification, Data Management Plan, and so on — and each carries the constraints NSF attaches to it: a stable identifier, a page or character limit, allowed fonts and margins, and whether the section is mandatory or conditional on the program.

The parsing mechanism has two stages. First, deterministic header recognition segments the extracted text into candidate blocks. NSF headers follow a small set of predictable patterns (1., II., (a), Section C), which anchored regular expressions can capture without the ambiguity that free-text NLP would introduce at this layer. The pattern shown here covers the numbered, Roman-numeral, and lettered forms; parenthesized and Section-prefixed headers need additional alternatives. Second, each captured block is mapped onto a known node in the taxonomy so its constraints can be looked up and enforced.

python

import re
from dataclasses import dataclass


@dataclass
class SectionBlock:
    header: str
    content: str
    char_count: int


# Anchored to line start; matches "1. Project Summary", "II. Project Description",
# "C. Broader Impacts". Deliberately narrow to avoid capturing inline enumerations.
# Spaces and tabs are spelled out instead of \s, which would also match newlines
# and let a title run on into the following line.
_HEADER_PATTERN = re.compile(
    r"^(?:[IVX]+\.|[0-9]+\.|[A-Z]\.)[ \t]+([A-Za-z][A-Za-z &/-]{2,60})$",
    re.MULTILINE,
)


def segment_pappg_blocks(raw_text: str) -> list[SectionBlock]:
    """Split extracted PAPPG/proposal text into header-anchored blocks.

    Character counts strip Markdown link and emphasis markup and numeric citation
    markers so the number approximates what NSF's character-limit rules count.
    """
    blocks: list[SectionBlock] = []
    matches = list(_HEADER_PATTERN.finditer(raw_text))
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw_text)
        content = raw_text[start:end].strip()
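        # Keep visible link text and drop the URL, then strip emphasis marks and [n] citations.
        clean = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", content)
        clean = re.sub(r"[*_~`]|\[\d+\]", "", clean)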
        blocks.append(
            SectionBlock(
                header=match.group(0).strip(),
                content=content,
                char_count=len(clean),
            )
        )
    return blocks

Keeping segmentation deterministic at this stage matters: a probabilistic boundary detector may be right 98% of the time, but a compliance gate that is wrong on one proposal in fifty is worse than useless. Fuzzy boundary work belongs later in the pipeline, in NLP Section Boundary Detection, where a human still reviews the output; the taxonomy’s header layer stays rule-based so its verdicts are reproducible. The full header-recognition ruleset — including how to handle appendices, letters of collaboration, and single-copy documents — is developed in Parsing NSF PAPPG section headers programmatically.

Rule-aware implementation — a validated NSF section model

Segmentation produces blocks; it does not produce trust. To make the taxonomy safe to hand to a compliance engine, each block is coerced into a Pydantic v2 model that encodes NSF’s rules as field validators. Invalid data fails loudly at construction time rather than propagating a bad character count into a submission decision.

python

from enum import Enum
from pydantic import BaseModel, Field, field_validator, model_validator


class NsfSectionId(str, Enum):
    PROJECT_SUMMARY = "project_summary"
    PROJECT_DESCRIPTION = "project_description"
    REFERENCES_CITED = "references_cited"
    BIOSKETCH = "biographical_sketch"
    BUDGET_JUSTIFICATION = "budget_justification"
    DATA_MANAGEMENT_PLAN = "data_management_plan"


# Page limits per PAPPG Chapter II. A Project Summary is capped by character count
# (4600 across the three required elements); the Project Description is capped at
# 15 pages; the Data Management Plan at 2 pages.
_PAGE_LIMITS: dict[NsfSectionId, int | None] = {
    NsfSectionId.PROJECT_SUMMARY: None,      # governed by char limit, not pages
    NsfSectionId.PROJECT_DESCRIPTION: 15,
    NsfSectionId.REFERENCES_CITED: None,     # no limit
    NsfSectionId.BIOSKETCH: None,            # 3-page cap removed in PAPPG 24-1
    NsfSectionId.BUDGET_JUSTIFICATION: 5,
    NsfSectionId.DATA_MANAGEMENT_PLAN: 2,
}

# Minimum body-text size per approved typeface. The PAPPG also approves the
# Computer Modern family at 11 pt; it is left out of this table because its
# embedded font names vary and need matching on the extraction side.
_FONT_MIN_PT: dict[str, float] = {
    "Arial": 10.0,
    "Courier New": 10.0,
    "Palatino Linotype": 10.0,
    "Times New Roman": 11.0,
}
_ALLOWED_FONTS = frozenset(_FONT_MIN_PT)


class NsfSection(BaseModel):
    section_id: NsfSectionId
    heading_text: str = Field(min_length=3, max_length=120)
    char_count: int = Field(ge=0)
    page_count: int | None = Field(default=None, ge=0)
    font_name: str
    font_size_pt: float = Field(gt=0)
    pappg_version: str

    @field_validator("font_name")
    @classmethod
    def font_must_be_allowed(cls, v: str) -> str:
        if v not in _ALLOWED_FONTS:
            raise ValueError(
                f"Font '{v}' is not NSF-approved; allowed: {sorted(_ALLOWED_FONTS)}"
            )
        return v

    @model_validator(mode="after")
    def font_size_floor(self) -> "NsfSection":
        # The floor depends on the typeface, so this runs after field validation:
        # 10 pt for Arial, Courier New, and Palatino Linotype; 11 pt for Times New Roman.
        floor = _FONT_MIN_PT[self.font_name]
        if self.font_size_pt < floor:
            raise ValueError(
                f"Font size {self.font_size_pt}pt is below the NSF {floor:g}pt "
                f"minimum for {self.font_name}"
            )
        return self

    def limit_violations(self) -> list[str]:
        """Return human-readable violations against this section's own limits."""
        problems: list[str] = []
        if self.section_id is NsfSectionId.PROJECT_SUMMARY and self.char_count > 4600:
            problems.append(
                f"Project Summary is {self.char_count} chars (max 4600)"
            )
        limit = _PAGE_LIMITS.get(self.section_id)
        if limit is not None and self.page_count is not None and self.page_count > limit:
            problems.append(
                f"{self.section_id.value} is {self.page_count} pages (max {limit})"
            )
        return problems

Because the model rejects a non-approved or undersized font at construction and reports limit breaches through limit_violations, the page-limit and font rules live in exactly one place. When the shared compliance validation rule engine later evaluates a proposal, it consumes NsfSection instances it can trust rather than re-deriving NSF’s thresholds. The reusable font-and-page ruleset that this model expresses is maintained under Page-Limit & Font Enforcement, and the mandatory-section coverage check (is every required node actually present?) is handled by Mapping mandatory sections for NSF CAREER proposals.

Agency-specific configuration

The reason this taxonomy is worth building rather than hard-coding NSF rules inline is portfolio scale: a research office rarely submits to only one agency. The same abstract nodes (a narrative cap, a font floor, a budget schema) carry different values at NSF, the National Institutes of Health (NIH), and the Department of Defense (DoD). Encoding those differences as data — not as branching code — is what lets one engine serve all three.

Requirement node	NSF (PAPPG)	NIH (SF424 / FOA)	DoD (BAA)
Governing document	Proposal & Award Policies & Procedures Guide	Funding Opportunity Announcement + application guide	Broad Agency Announcement
Main narrative limit	Project Description ≤ 15 pages	Research Strategy ≤ 12 pages (R01)	Set per-BAA (often white-paper then full)
Summary/abstract	Project Summary ≤ 4600 chars, 3 elements	Abstract ≤ 30 lines	Varies; often ≤ 1 page
Font floor	10 pt (Arial, Courier New, Palatino Linotype); 11 pt (Times New Roman, Computer Modern)	11 pt (Arial/Georgia/Helvetica/Palatino)	Per-BAA, commonly 11–12 pt
Margins	1 inch all sides	0.5 inch all sides	Per-BAA
Merit criteria	Intellectual Merit + Broader Impacts	Significance, Innovation, Approach, etc.	Mission relevance + technical merit
Submission portal	Research.gov	eRA Commons / Grants.gov	SAM.gov / eBRAP / agency portal

The NIH column is modeled in full by the NIH FOA Schema Mapping process, and the DoD column — with its export-control and classification wrinkles — by DoD BAA Requirement Extraction. Loading the active values as a versioned profile keeps the switch declarative:

python

from pydantic import BaseModel


class AgencyProfile(BaseModel):
    agency: str
    doc_version: str
    narrative_page_limit: int
    summary_char_limit: int | None
    font_size_floor_pt: float
    margin_inches: float


NSF_PROFILE = AgencyProfile(
    agency="NSF",
    doc_version="PAPPG 24-1",
    narrative_page_limit=15,
    summary_char_limit=4600,
    font_size_floor_pt=10.0,  # lowest floor; Times New Roman and Computer Modern need 11 pt
    margin_inches=1.0,
)

Error handling and edge cases

Real extracted text is messier than the PAPPG’s examples. The taxonomy has to fail predictably on the cases that actually occur in an institutional queue:

Ghost headers from two-column layouts. When a PDF renders text in columns, a coordinate-unaware extractor can interleave a right-column line into a left-column header, producing a spurious match. Resolution: run segmentation on text that has already passed coordinate-aware zoning, and reject header candidates whose captured group is under three characters or is all uppercase boilerplate.
Conditional sections. Some nodes are mandatory only when a trigger fires: under PAPPG 24-1 a Mentoring Plan is required only when the budget requests funding for postdoctoral scholars or graduate students, whereas a Data Management Plan is always required but its content rules vary by directorate. Resolution: mark nodes as mandatory, conditional, or optional and evaluate the trigger before flagging an absence as a violation.
PAPPG version drift. A proposal drafted against last year’s guide can silently violate a threshold the new revision changed. Resolution: stamp every NsfSection with pappg_version and refuse to validate a record whose version does not match ACTIVE_PAPPG_VERSION without an explicit override.
Character-count disagreements. NSF counts the Project Summary’s characters, but hyperlinks, citation markers, and non-breaking spaces inflate a naive len(). Resolution: normalize before counting by stripping citation brackets, collapsing whitespace, and decoding entities. segment_pappg_blocks covers only the markup and citation stripping, so the whitespace and entity steps need their own pass.

python

class PappgVersionMismatch(ValueError):
    """Raised when a section was validated against a stale PAPPG revision."""


def assert_current(section: NsfSection, *, override: bool = False) -> None:
    if not override and section.pappg_version != ACTIVE_PAPPG_VERSION:
        raise PappgVersionMismatch(
            f"section validated against PAPPG {section.pappg_version}, "
            f"active revision is {ACTIVE_PAPPG_VERSION}"
        )
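
The character-count normalization described in the list above (strip citation brackets, collapse whitespace, decode entities) might look like the following sketch. countable_length is a hypothetical helper, and whether NSF's own counter treats each of these cases the same way is an assumption to verify against a real submission.

```python
import html
import re

_CITATION = re.compile(r"\[\d+\]")
_WHITESPACE = re.compile(r"\s+")


def countable_length(text: str) -> int:
    """Approximate a character count after normalizing extraction artifacts."""
    decoded = html.unescape(text)          # "&amp;" -> "&", "&nbsp;" -> U+00A0
    no_cites = _CITATION.sub("", decoded)  # drop numeric markers such as [12]
    # In Python 3 both \s and str.strip() treat U+00A0 as whitespace.
    collapsed = _WHITESPACE.sub(" ", no_cites).strip()
    return len(collapsed)
```

Decoding entities before collapsing whitespace matters: a non-breaking space only becomes collapsible once the entity has been turned into the actual character.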

Integration with downstream pipeline

The taxonomy is one stage in a sequential pipeline: extracted text enters, validated NsfSection records leave, and each subsequent stage refuses to run if the prior one reported violations. Budget schema alignment cross-references line items against NSF’s allowable categories (Personnel, Equipment, Travel, Participant Support, Other Direct Costs, Indirect Costs), and the pre-submission audit emits a manifest of everything still wrong.
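
The gating behaviour, where a stage refuses to run once an earlier one has reported violations, can be sketched as a short runner. Stage, run_pipeline, and the two toy stages are hypothetical; real stages would wrap the segmentation, section-model, and budget checks.

```python
from collections.abc import Callable

# A stage takes the working payload and returns a list of violation messages.
Stage = Callable[[dict[str, object]], list[str]]


def run_pipeline(
    payload: dict[str, object],
    stages: list[tuple[str, Stage]],
) -> tuple[bool, dict[str, list[str]]]:
    """Run stages in order and stop at the first one that reports violations."""
    manifest: dict[str, list[str]] = {}
    for name, stage in stages:
        violations = stage(payload)
        manifest[name] = violations
        if violations:
            return False, manifest  # later stages never see a failing payload
    return True, manifest


def check_sections(payload: dict[str, object]) -> list[str]:
    """Toy stage: flag a missing Project Summary."""
    return [] if payload.get("has_summary") else ["Project Summary is missing"]


def check_budget(payload: dict[str, object]) -> list[str]:
    """Toy stage: flag a budget that failed its category check."""
    return [] if payload.get("budget_ok") else ["Budget has unrecognized categories"]
```

The manifest records only the stages that actually ran, so a missing key is itself a signal that the pipeline stopped early.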

The budget stage keeps its own allow-list so a disallowed category or a thin justification is caught before the fiscal reviewers ever see the package. Its cross-agency normalization — so an NSF budget and a DoD budget land in the same shape — is the subject of Budget Justification Format Standards.

python

ALLOWED_NSF_BUDGET_CATEGORIES: frozenset[str] = frozenset({
    "Personnel", "Equipment", "Travel", "Participant Support",
    "Other Direct Costs", "Indirect Costs",
})


def validate_budget_justification(
    line_items: list[dict[str, str]],
) -> tuple[bool, list[str]]:
    """Validate NSF budget line items against categorical compliance rules."""
    violations: list[str] = []
    for item in line_items:
        category = item.get("category", "").strip()
        justification = item.get("justification", "").strip()
        if category not in ALLOWED_NSF_BUDGET_CATEGORIES:
            violations.append(f"Unrecognized category: '{category}'")
        if len(justification) < 25:
            violations.append(f"Insufficient justification for '{category}' (min 25 chars)")
        if category == "Participant Support" and "tuition" in justification.lower():
            violations.append("NSF prohibits tuition charges under Participant Support")
    return not violations, violations

Once a proposal clears validation, the same NsfSection records feed the schema-conformance layer described in Schema Validation with Pydantic, which enforces the JSON contract the assembly and submission stages depend on.

Testing and verification

A compliance taxonomy earns trust through tests that encode the rules as executable expectations. The suite below pins the thresholds that most often regress when a new PAPPG revision lands — the font floor, the Project Summary character cap, and the version guard.

python

import pytest
from pydantic import ValidationError


def _section(**overrides) -> NsfSection:
    base = dict(
        section_id=NsfSectionId.PROJECT_SUMMARY,
        heading_text="1. Project Summary",
        char_count=4200,
        page_count=None,
        font_name="Arial",
        font_size_pt=11.0,
        pappg_version=ACTIVE_PAPPG_VERSION,
    )
    base.update(overrides)
    return NsfSection(**base)


def test_sub_minimum_font_rejected() -> None:
    with pytest.raises(ValidationError):
        _section(font_size_pt=9.5)


def test_unapproved_font_rejected() -> None:
    with pytest.raises(ValidationError):
        _section(font_name="Comic Sans MS")


def test_project_summary_char_cap() -> None:
    over = _section(char_count=5000)
    assert any("4600" in v for v in over.limit_violations())


def test_stale_pappg_version_blocks_validation() -> None:
    stale = _section(pappg_version="23-1")
    with pytest.raises(PappgVersionMismatch):
        assert_current(stale)

Before promoting a proposal package out of the taxonomy stage, confirm the checklist below holds:

Every mandatory node for the target program is present; conditional nodes are present when their trigger fires.
No section exceeds its page or character limit under the active PAPPG revision.
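All body text uses an approved font at or above its minimum size (10 pt for Arial, Courier New, and Palatino Linotype; 11 pt for Times New Roman and Computer Modern) with 1-inch margins.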
Every budget line item maps to an allowed category with a substantive justification.
Each NsfSection is stamped with ACTIVE_PAPPG_VERSION; no stale records pass assert_current.

Treating the PAPPG as a versioned data contract rather than a prose reference is what shifts a research office from reactive, deadline-night editing to continuous validation. With the taxonomy in place, the same rules that gate a single NSF proposal scale unchanged across an entire agency portfolio.
