NIH FOA Schema Mapping

A National Institutes of Health (NIH) application can be administratively rejected before a study section ever weighs its science if the package violates a structural rule buried in the Funding Opportunity Announcement (FOA): a missing SF424 (R&R) form, a modular budget requested above the direct-cost ceiling, or a Research Strategy that runs past its page limit. NIH FOA schema mapping is the workflow stage that prevents those failures by translating the unstructured, narrative directives of an announcement into a typed, machine-readable contract that downstream tooling can validate deterministically. It sits inside the broader Core Architecture & RFP Taxonomy, which treats every solicitation as a data contract. This page covers the NIH-specific half of that contract: how an FOA’s forms, budget model, deadlines, and format rules become a Pydantic object that fails fast when a proposal does not conform.

The compliance failure this stage addresses is specific: manual transcription of FOA requirements into a checklist is where institutional pipelines silently drift out of compliance. A grant administrator who reads “modular budget, ≤ $250,000 direct costs per year” and hard-codes it into a template will miss the reissue that changed it, or apply an R01 rule to an R21. Schema mapping replaces that human transcription with a single source of truth extracted directly from the announcement text and pinned to its FOA number, so the same object that describes a requirement is the object that enforces it.

Prerequisites and Environment Setup

Schema mapping assumes the FOA has already been reduced to text upstream. NIH publishes announcements as HTML on the Grants.gov and NIH Guide sites and as companion PDF application guides; the PDF text extraction with pdfplumber stage handles the PDF case, while HTML FOAs are fetched and stripped to text before they reach this layer. This page assumes you are starting from a plain-text or lightly structured representation of a single FOA.
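For the HTML case, a minimal sketch of the "stripped to text" step might look like the following. It uses only the standard library; the names `_TextExtractor` and `html_to_text` are hypothetical and not part of the pipeline described here, and a real fetcher would also have to deal with encodings and page chrome that this ignores.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> bodies."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML page, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Emitting one text node per line keeps field labels and their values close together, which helps the field-anchored regexes used later on this page.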

The implementation targets Python 3.10 or newer (the X | None union syntax is used) and Pydantic v2. Install the runtime dependencies:

bash

python -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2.6" python-dateutil

Three document assumptions hold throughout:

Activity code drives the rules. NIH mechanisms (R01, R21, R03, F31, K99/R00, and so on) carry different page limits and budget expectations. The activity code is parsed first because it selects which constraint set applies.
Budget model is announcement-declared. An FOA states whether it permits a modular budget (research grants requesting ≤ $250,000 direct costs per year, in $25,000 modules) or requires a detailed SF424 (R&R) budget. The schema must model both paths and refuse to accept a modular request above the ceiling.
Form sets are versioned. NIH reissues form packages (the “FORMS-H” family, for example). A mapped schema pins the form-set version so a proposal validated today can be diffed against a later reissue rather than silently re-interpreted.
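The first assumption, that the activity code selects the constraint set, can be sketched as a plain lookup. The `MechanismRules` class and `rules_for` function are hypothetical names; the page limits are the R01 and R21 values quoted on this page, and the `modular_allowed` flags are assumptions that should be confirmed against the announcement itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MechanismRules:
    research_strategy_pages: int
    modular_allowed: bool  # assumption; the FOA is the authority


# Illustrative registry only; a real one would cover every supported mechanism.
RULES_BY_ACTIVITY_CODE = {
    "R01": MechanismRules(research_strategy_pages=12, modular_allowed=True),
    "R21": MechanismRules(research_strategy_pages=6, modular_allowed=True),
}


def rules_for(activity_code: str) -> MechanismRules:
    """Return the rule set for an activity code, failing loudly if unknown."""
    try:
        return RULES_BY_ACTIVITY_CODE[activity_code.upper()]
    except KeyError:
        raise LookupError(
            f"No rule set registered for activity code {activity_code!r}"
        ) from None
```

Raising on an unknown code, rather than falling back to a default, keeps an R01 rule from being silently applied to a mechanism the registry has never seen.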

Core Mechanism: From FOA Directives to Typed Fields

Schema mapping decomposes an announcement into discrete, typed fields, each carrying a validator that encodes one NIH rule. The rule schema is expressed as a Pydantic model: field types enforce shape, Field constraints enforce bounds, field_validator methods enforce per-field rules, and a model_validator enforces the cross-field logic that plain types cannot. The model below is the canonical surface: a mapped FOA is valid only if it constructs one of these objects without raising.

python

from datetime import date
from enum import Enum
from pydantic import BaseModel, Field, field_validator, model_validator


class BudgetFormat(str, Enum):
    MODULAR = "modular"
    DETAILED = "detailed"


class NIHFOASchema(BaseModel):
    # The FOA number, e.g. "PA-25-303" or "RFA-CA-24-018".
    opportunity_id: str = Field(..., alias="FOA_Number")
    # NIH activity code, e.g. "R01", "R21", "F31" — selects the rule set.
    activity_code: str
    budget_format: BudgetFormat
    mandatory_forms: list[str]
    # Application due date parsed from the FOA; None for rolling/standard-date FOAs.
    submission_deadline: date | None = None
    # NIH modular budgets cap at $250,000 direct costs/year in $25,000 modules.
    modular_cap: int | None = Field(None, ge=25_000, le=250_000)
    # Pins the constraint set to a specific form package reissue.
    form_set_version: str

    @field_validator("mandatory_forms")
    @classmethod
    def enforce_nih_forms(cls, v: list[str]) -> list[str]:
        required = {"SF424", "PHS398"}
        missing = required - set(v)
        if missing:
            raise ValueError(f"Missing mandatory NIH forms: {sorted(missing)}")
        return v

    @field_validator("modular_cap")
    @classmethod
    def enforce_module_increment(cls, v: int | None) -> int | None:
        if v is not None and v % 25_000 != 0:
            raise ValueError("Modular budget must be a multiple of $25,000")
        return v

    @model_validator(mode="after")
    def modular_requires_cap(self) -> "NIHFOASchema":
        # A modular FOA must declare a cap; a detailed FOA must not carry one.
        if self.budget_format is BudgetFormat.MODULAR and self.modular_cap is None:
            raise ValueError("Modular FOA must declare a modular_cap")
        if self.budget_format is BudgetFormat.DETAILED and self.modular_cap is not None:
            raise ValueError("Detailed-budget FOA must not carry a modular_cap")
        return self

Three mechanisms carry the NIH-specific meaning here. The field_validator on mandatory_forms rejects any FOA whose form set omits the SF424 (R&R) or PHS 398 components. The Field(ge=25_000, le=250_000) bound plus the increment validator encode the modular-budget rule that trips up so many applicants. The model_validator(mode="after") enforces the cross-field invariant that budget format and cap must agree — a rule no single field type can express. This same field-level decomposition, applied at R01 granularity with full narrative-length rules, is worked through step by step in how to map NIH R01 FOA requirements to JSON.
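The modular arithmetic behind the bound and the increment check is simple enough to work through directly. The sketch below applies the same two rules to a proposal's requested yearly amount rather than to the FOA-level cap; `module_count` is a hypothetical helper, not part of the schema.

```python
MODULE_SIZE = 25_000        # one module of direct costs
MODULAR_CEILING = 250_000   # yearly direct-cost ceiling for modular budgets


def module_count(direct_costs: int) -> int:
    """Return how many $25,000 modules a yearly direct-cost request contains."""
    if not MODULE_SIZE <= direct_costs <= MODULAR_CEILING:
        raise ValueError(
            f"Request must be between ${MODULE_SIZE:,} and ${MODULAR_CEILING:,}"
        )
    if direct_costs % MODULE_SIZE:
        raise ValueError("Request must be a multiple of $25,000")
    return direct_costs // MODULE_SIZE
```

A $225,000 request is 9 modules and $250,000 is the maximum of 10. A $260,000 request fails the range check, and $240,000 is in range but fails the increment check.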

Rule-Aware Implementation

In production the schema is not hand-populated; it is built from extracted FOA text by a mapper that pulls each field with a targeted rule and then lets Pydantic reject anything malformed. Isolating extraction from validation keeps the parser dumb and the contract strict — the mapper’s only job is to locate candidate values, and the schema’s only job is to decide whether they are legal.

python

import re
from datetime import date  # used in the _extract_deadline return annotation

from dateutil import parser as dateparser
from pydantic import ValidationError


ACTIVITY_CODE_RE = re.compile(r"\b([A-Z]\d{2})\b")           # R01, R21, F31, K99...
FOA_NUMBER_RE = re.compile(r"\b((?:PA|PAR|PAS|RFA|NOSI)-[A-Z]{0,3}-?\d{2}-\d{3})\b")
MODULAR_RE = re.compile(r"modular budget", re.IGNORECASE)


def map_foa_text(text: str, form_set_version: str) -> NIHFOASchema:
    """Extract NIH FOA fields from raw announcement text into a validated schema.

    Raises ValidationError if the extracted values violate any NIH rule.
    """
    foa_match = FOA_NUMBER_RE.search(text)
    activity_match = ACTIVITY_CODE_RE.search(text)
    if not foa_match or not activity_match:
        raise ValueError("Could not locate FOA number or activity code in text")

    is_modular = bool(MODULAR_RE.search(text))
    payload: dict = {
        "FOA_Number": foa_match.group(1),
        "activity_code": activity_match.group(1),
        "budget_format": "modular" if is_modular else "detailed",
        "mandatory_forms": _extract_form_ids(text),
        "submission_deadline": _extract_deadline(text),
        "modular_cap": 250_000 if is_modular else None,
        "form_set_version": form_set_version,
    }
    # Construction is validation: an illegal FOA never becomes an object.
    return NIHFOASchema.model_validate(payload)


def _extract_form_ids(text: str) -> list[str]:
    forms: list[str] = []
    if re.search(r"SF\s?424", text, re.IGNORECASE):
        forms.append("SF424")
    if re.search(r"PHS\s?398", text, re.IGNORECASE):
        forms.append("PHS398")
    return forms


def _extract_deadline(text: str) -> date | None:
    # Case-insensitive so headings such as "Due Date:" match as well as "due by".
    m = re.search(
        r"due\s+(?:date|by)[:\s]+([A-Za-z0-9,\s/]+?\d{4})", text, re.IGNORECASE
    )
    if not m:
        return None
    try:
        return dateparser.parse(m.group(1), fuzzy=True).date()
    except (ValueError, OverflowError):
        return None

The mapped object becomes the intermediate representation the rest of the architecture consumes. It is deliberately the same shape produced by the general schema validation with Pydantic stage, so an NIH FOA and an NSF or DoD solicitation can travel through one pipeline while retaining sponsor-specific constraints. Narrative-boundary work — knowing where “Research Strategy” begins so its length can be checked — is delegated to NLP section boundary detection rather than reimplemented here.

Agency-Specific Configuration

NIH schema mapping rarely runs alone. Institutional pipelines ingest National Science Foundation (NSF) and Department of Defense (DoD) solicitations through the same intake, so the NIH field set is one branch of a unified representation. The table below fixes the parameters that differ, which is exactly what a cross-agency adapter must switch on. The NSF Proposal Guide Taxonomy prioritizes narrative and broader-impacts validation over form matching, while DoD BAA Requirement Extraction adds security-classification and export-control fields the NIH schema never carries.

Mapping parameter	NIH (FOA)	NSF (PAPPG)	DoD (BAA)
Governing document	FOA / NOFO + SF424 (R&R) guide	Proposal & Award Policies & Procedures Guide (PAPPG), versioned	Broad Agency Announcement (BAA) + FAR/DFARS
Opportunity id pattern	`PA-/PAR-/RFA-/NOSI-YY-NNN`	Program solicitation `NSF NN-NNN`	BAA number, agency-specific
Selector field	Activity code (R01, R21, F31…)	Program + track	Topic / technical area
Budget model	Modular (≤ $250K/yr) or detailed R&R	Detailed R&R, mandatory	Cost-reimbursement, detailed
Mandatory forms	SF424 (R&R), PHS 398 set	NSF fillable PDFs via Research.gov	Volumes per BAA; no fixed form
Primary narrative limit	12 pp (R01), 6 pp (R21)	15 pp project description	Per-BAA, volume-based
Deadline model	Standard dates or FOA-specific	Target dates / windows	Rolling or white-paper gated
Export control	Rare, case-by-case	Fundamental-research exclusion	ITAR/EAR triggers common
Submission portal	Grants.gov → eRA Commons	Research.gov	Grants.gov / eBRAP
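
One way a cross-agency adapter could pick its branch is from the opportunity id pattern in the table. This is a rough sketch: `detect_sponsor` is a hypothetical name, the NIH pattern covers only a subset of the prefixes listed above, and DoD numbers are left unmatched because they are agency-specific.

```python
import re

# Subset of NIH prefixes, with an optional institute code (e.g. RFA-CA-24-018).
NIH_ID_RE = re.compile(r"^(?:PA|PAR|PAS|RFA)-(?:[A-Z]{2,3}-)?\d{2}-\d{3}$")
# NSF program solicitation numbers of the form "NSF NN-NNN".
NSF_ID_RE = re.compile(r"^NSF \d{2}-\d{3}$")


def detect_sponsor(opportunity_id: str) -> str:
    """Classify an opportunity id as NIH, NSF, or UNKNOWN."""
    oid = opportunity_id.strip().upper()
    if NIH_ID_RE.match(oid):
        return "NIH"
    if NSF_ID_RE.match(oid):
        return "NSF"
    # DoD BAA numbers vary by agency, so they need per-agency rules.
    return "UNKNOWN"
```

Returning "UNKNOWN" instead of guessing lets the intake layer route unrecognized ids to a person rather than to the wrong rule set.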

Numeric limits such as the modular ceiling and page counts are deliberately kept out of the code and in an external, version-tagged store, so the threshold tuning for compliance workflow can adjust them when NIH reissues a form set without a redeploy.
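
A version-tagged store can be as small as a JSON document keyed by form-set version. The sketch below inlines the JSON as a string so it runs on its own; in practice it would live outside the codebase. The layout, key names, and `load_limits` helper are all assumptions, and the values are the ones quoted on this page, shown only for illustration.

```python
import json

# Hypothetical store contents, inlined here for self-containment.
LIMITS_JSON = """
{
  "FORMS-H": {
    "modular_ceiling": 250000,
    "module_size": 25000,
    "research_strategy_pages": {"R01": 12, "R21": 6}
  }
}
"""


def load_limits(raw: str, form_set_version: str) -> dict:
    """Return the numeric limits recorded for one form-set version."""
    store = json.loads(raw)
    if form_set_version not in store:
        raise KeyError(f"No limits recorded for form set {form_set_version!r}")
    return store[form_set_version]
```

Because lookups fail for an unrecorded version, a reissued form set cannot be validated against stale numbers by accident.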

Error Handling and Edge Cases

Real FOA text is messy, and the mapper must fail loudly rather than emit a plausible-but-wrong contract. The recurring failure modes:

Ambiguous activity code. An FOA that references companion mechanisms (an R01 announcement that mentions an associated R21) can match multiple activity codes. Resolve by anchoring the regex to the FOA’s own “Activity Code” field rather than the first code in the body, and raise if more than one distinct code appears in that field.
Modular cap on a detailed FOA. If extraction wrongly infers a modular budget for a mechanism that forbids it (fellowships and many large center grants), the model_validator rejects the object. Treat that ValidationError as a routing signal to human review, not a value to coerce.
Missing or reissued forms. When mandatory_forms omits SF424 (R&R) or PHS 398, the form-set version is usually stale. The validator names the missing forms so the pipeline can re-fetch the correct package instead of guessing.
Unparseable deadlines. Standard-date FOAs list multiple cycle dates and rolling FOAs list none. _extract_deadline returns None on failure rather than raising, and the downstream scheduler flags a null deadline for manual confirmation instead of assuming one.
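
The ambiguous-activity-code case can be sketched as follows. The helper `extract_activity_code` and both regexes are hypothetical; they assume the announcement text contains a literal "Activity Code" label with the code on the same or the next line, which may not hold for every FOA layout.

```python
import re

# Capture the rest of the line that follows the "Activity Code" label.
ACTIVITY_FIELD_RE = re.compile(
    r"Activity Code(?:\(s\)|s)?\s*[:\-]?\s*(.+)", re.IGNORECASE
)
CODE_RE = re.compile(r"\b[A-Z]\d{2}\b")


def extract_activity_code(text: str) -> str:
    """Return the single activity code named in the FOA's own field."""
    field = ACTIVITY_FIELD_RE.search(text)
    if not field:
        raise ValueError("No 'Activity Code' field found")
    codes = sorted(set(CODE_RE.findall(field.group(1))))
    if len(codes) != 1:
        raise ValueError(f"Expected exactly one activity code, found {codes}")
    return codes[0]
```

A companion mechanism mentioned earlier in the body is ignored because only the labelled field is searched, and a field listing two codes raises instead of picking one.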

python

def safe_map(text: str, form_set_version: str) -> dict:
    """Map an FOA, converting any failure into a structured, routable result."""
    try:
        schema = map_foa_text(text, form_set_version)
        return {"status": "ok", "foa": schema.model_dump(mode="json")}
    except ValidationError as exc:
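        # Rule violation: the FOA text conflicts with NIH constraints.
        # For validator-raised errors, ctx holds the raw ValueError instance,
        # which json.dumps cannot encode, so it is dropped with the docs URL.
        errors = exc.errors(include_url=False, include_context=False)
        return {"status": "rule_violation", "errors": errors}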
    except ValueError as exc:
        # Extraction failure: fields could not be located at all.
        return {"status": "unmappable", "reason": str(exc)}

The distinction between rule_violation and unmappable matters downstream: the first means the announcement was read but breaks an expectation and needs a human to confirm the rule; the second means extraction failed and the document should return to the parsing stage. Enforcement of the rules the object carries is then handled by the compliance validation rule engines, which decide whether a given draft satisfies the mapped constraints.
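
A minimal sketch of that routing decision, assuming results shaped like the ones safe_map returns. The destination names are hypothetical placeholders for whatever queues an institution actually runs.

```python
def route(result: dict) -> str:
    """Return the destination for a mapping result based on its status."""
    status = result.get("status")
    if status == "ok":
        return "compliance_engine"   # mapped cleanly; evaluate drafts against it
    if status == "rule_violation":
        return "human_review"        # text was read but breaks an expected rule
    if status == "unmappable":
        return "reparse"             # extraction failed; back to the parsing stage
    raise ValueError(f"Unknown status: {status!r}")
```

An unrecognized status raises rather than defaulting to a queue, so a new failure category cannot be misrouted silently.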

Integration with the Downstream Pipeline

A validated NIHFOASchema is the handoff artifact. It feeds the unified intermediate representation, which the compliance engine evaluates and the document assembler renders against. Because the object carries its form_set_version, every downstream verdict can be traced back to the exact announcement edition that produced it. End to end, the path runs from raw announcement through text extraction, schema mapping, and compliance evaluation to institutional routing.

At portfolio scale, hundreds of FOAs arrive within a single funding cycle. Mapping them one at a time bottlenecks intake, so the mapper is fanned out through the asynchronous batch processing for large RFPs pattern, and the mapped objects flow into automated checklist generation so program staff see a human-readable requirement list per opportunity rather than raw JSON.
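
The fan-out can be sketched with the standard library's asyncio. Here `map_one` is a hypothetical stand-in for a blocking mapper such as safe_map, and `max_concurrency` is an assumed tuning knob; this is an illustration of the pattern, not the batch stage's actual implementation.

```python
import asyncio


def map_one(text: str) -> dict:
    """Stand-in for a blocking mapper; a real one would parse the FOA text."""
    return {"status": "ok" if "PA-" in text else "unmappable", "chars": len(text)}


async def map_many(texts: list[str], max_concurrency: int = 8) -> list[dict]:
    """Map many FOA texts concurrently, with a cap on work in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(text: str) -> dict:
        async with sem:
            # The mapper is synchronous, so run it off the event loop.
            return await asyncio.to_thread(map_one, text)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(t) for t in texts))
```

Calling `asyncio.run(map_many(texts))` drives the batch. Thread offloading helps most when mapping involves I/O such as fetching announcements; purely CPU-bound regex work would gain more from a process pool.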

Testing and Verification

Schema mapping is only trustworthy if a test suite pins each NIH rule to a concrete example. The pytest cases below verify that legal FOAs map cleanly and that every guarded rule actually rejects a violating input.

python

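import pytest
from pydantic import ValidationError

# Hypothetical module name; import from wherever the schema is defined.
from nih_foa_schema import BudgetFormat, NIHFOASchema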


def _base_payload(**overrides) -> dict:
    payload = {
        "FOA_Number": "PA-25-303",
        "activity_code": "R01",
        "budget_format": "modular",
        "mandatory_forms": ["SF424", "PHS398"],
        "modular_cap": 250_000,
        "form_set_version": "FORMS-H",
    }
    payload.update(overrides)
    return payload


def test_valid_modular_foa_maps() -> None:
    schema = NIHFOASchema.model_validate(_base_payload())
    assert schema.opportunity_id == "PA-25-303"
    assert schema.budget_format is BudgetFormat.MODULAR


def test_missing_mandatory_form_rejected() -> None:
    with pytest.raises(ValidationError, match="Missing mandatory NIH forms"):
        NIHFOASchema.model_validate(_base_payload(mandatory_forms=["SF424"]))


def test_modular_cap_over_ceiling_rejected() -> None:
    with pytest.raises(ValidationError):
        NIHFOASchema.model_validate(_base_payload(modular_cap=300_000))


def test_non_module_increment_rejected() -> None:
    with pytest.raises(ValidationError, match="multiple of"):
        NIHFOASchema.model_validate(_base_payload(modular_cap=260_000))


def test_detailed_budget_forbids_cap() -> None:
    with pytest.raises(ValidationError, match="must not carry a modular_cap"):
        NIHFOASchema.model_validate(
            _base_payload(budget_format="detailed", modular_cap=250_000)
        )

Beyond unit tests, verify a mapped FOA against this checklist before handing it downstream:

The FOA number and activity code were extracted from the announcement’s own identifier fields, not the first match in the body.
The budget format agrees with the mechanism (no modular cap on fellowships or center grants).
Every mandatory form in the pinned form-set version is present.
Deadlines parse to real dates, or a null deadline is explicitly flagged for confirmation.
The form_set_version is current, so the constraints are not validated against a superseded package.

Passing all five means the mapped object is a faithful, version-pinned contract — the deterministic foundation the rest of the assembly pipeline depends on.
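
Four of the five checks can be run mechanically against a dumped schema. The sketch below assumes a dict shaped like `model_dump(mode="json")` output; `preflight` and its parameters are hypothetical, and treating the "F" prefix as "no modular budget" is an assumption drawn from the fellowship example above. The first checklist item, whether values came from the announcement's own identifier fields, depends on extraction provenance and is not covered.

```python
def preflight(
    foa: dict,
    current_form_set: str,
    required_forms: frozenset = frozenset({"SF424", "PHS398"}),
    no_modular_prefixes: tuple = ("F",),  # assumption: fellowships only
) -> list[str]:
    """Return a list of findings; an empty list means no check failed."""
    findings: list[str] = []
    if foa["budget_format"] == "modular" and foa["activity_code"].startswith(
        no_modular_prefixes
    ):
        findings.append("modular budget on a mechanism assumed to forbid it")
    missing = required_forms - set(foa["mandatory_forms"])
    if missing:
        findings.append(f"missing forms: {sorted(missing)}")
    if foa.get("submission_deadline") is None:
        findings.append("null deadline: confirm manually")
    if foa["form_set_version"] != current_form_set:
        findings.append(
            f"form set {foa['form_set_version']!r} is not current "
            f"({current_form_set!r})"
        )
    return findings
```

Returning findings rather than raising lets a reviewer see every problem with an FOA in one pass.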

Related

How to map NIH R01 FOA requirements to JSON: the R01-specific decomposition in full detail
NSF Proposal Guide Taxonomy: model versioned PAPPG requirements programmatically
DoD BAA Requirement Extraction: isolate conditional defense obligations and export triggers
Budget Justification Format Standards: normalize agency financial schemas to one representation
Schema Validation with Pydantic: the general validation stage this mapping specializes
