How to map NIH R01 FOA requirements to JSON

A National Institutes of Health (NIH) R01 Funding Opportunity Announcement (FOA) is legally binding prose, and every page limit, font floor, and eligibility clause it carries is a rejection waiting to happen if a submission drifts out of bounds. The eRA Commons validation layer refuses an out-of-policy package outright — no score, no correction window, an entire funding cycle lost. Mapping that prose into a deterministic, machine-readable JSON model is what lets a pre-submission audit catch the error days before the Grants.gov deadline instead of after it. This page walks through the exact transformation as one workflow inside the broader NIH FOA schema mapping process: decompose the announcement into compliance domains, encode its conditional overrides, resolve the agency-specific edge cases, then validate an applicant payload against the compiled schema before any downstream handoff.

Phase 1 — Decompose the FOA into a hierarchical rule model

The mapping starts by breaking the announcement into discrete compliance domains rather than scraping it into one flat blob. An R01 FOA resolves cleanly into four: administrative metadata (activity code, receipt dates, eligible organization types), scientific-narrative constraints (Research Strategy page limit, font and margin floors), budgetary ceilings (modular vs. detailed thresholds, cost principles), and eligibility matrices (PI status, foreign-component rules). Each domain becomes a nested object that preserves the branching logic in the source text — the same domain classification the parent Core Architecture & RFP Taxonomy reference standardizes across agencies.

Implementation steps:

1. Segment the announcement. Split the FOA into its canonical sections (Key Dates, Eligibility, Section IV formatting, Section V review criteria) using layout-aware extraction that consumes the coordinate-anchored text produced by pdfplumber.
2. Classify each requirement into a domain. Route every extracted clause to one of the four domains so a page-limit rule never lands in the same bucket as an eligibility rule.
3. Model constraints as typed objects, not strings. A twenty-five-page ceiling that collapses to twelve pages under a specific program announcement code is two states of one rule; capture both, never a single hardcoded integer.
4. Preserve provenance. Tag every rule with the FOA number and the source clause so a failed check can be traced back to the sentence that produced it.

This structural discipline is what stops the most common mapping failure: a page limit or font specification hardcoded once and silently wrong for every institute that varies it.

One announcement decomposes into four typed domain objects, each carrying its own fields and provenance, that recompile into a single nested R01 schema.
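The decomposition can be sketched with plain dataclasses. This is a minimal illustration under stated assumptions, not the pipeline's actual model: the `Domain` enum, the `Rule` fields, and the `compile_schema` helper are hypothetical names, and `PA-23-XXX` is the same placeholder code used elsewhere on this page.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(str, Enum):
    """The four compliance domains an R01 FOA is split into."""
    ADMINISTRATIVE = "administrative"
    NARRATIVE = "narrative"
    BUDGET = "budget"
    ELIGIBILITY = "eligibility"


@dataclass(frozen=True)
class Rule:
    """One extracted requirement plus the provenance needed to trace it."""
    domain: Domain
    name: str
    value: object
    foa_number: str      # which announcement the rule came from
    source_clause: str   # which section or sentence produced it


def compile_schema(rules: list[Rule]) -> dict:
    """Group rules by domain into one nested dict, keeping provenance."""
    schema: dict = {domain.value: {} for domain in Domain}
    for rule in rules:
        bucket = schema[rule.domain.value]
        if rule.name in bucket:
            # Two clauses claiming the same field is a mapping error, not a merge.
            raise ValueError(f"duplicate rule {rule.name!r} in {rule.domain.value}")
        bucket[rule.name] = {
            "value": rule.value,
            "provenance": {"foa": rule.foa_number, "clause": rule.source_clause},
        }
    return schema


rules = [
    Rule(Domain.ADMINISTRATIVE, "activity_code", "R01", "PA-23-XXX", "Part 1"),
    Rule(Domain.NARRATIVE, "font_size_min_pt", 11, "PA-23-XXX", "Section IV"),
]
compiled = compile_schema(rules)
print(compiled["narrative"]["font_size_min_pt"]["provenance"]["clause"])
```

Raising on a duplicate name, rather than letting the later clause overwrite the earlier one, keeps a double-extracted requirement from silently replacing its own provenance.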

Phase 2 — Encode conditional logic and branching constraints

Federal announcements lean on conditional dependencies, so the core transformation is expressing “this limit applies unless that trigger fires” without letting two rules evaluate at once. Model the base constraint and its overrides as separate structures: a default value, then an ordered list of overrides whose triggers are evaluated first. Represent this either with JSON Schema draft-2020-12 if/then/else constructs or, for richer institutional logic, a custom override array that a schema compiler resolves before applying the base rule.

json

{
  "research_strategy": {
    "type": "object",
    "properties": {
      "page_limit": { "type": "integer", "default": 25 },
      "font_size_pt": { "type": "number", "minimum": 11 }
    },
    "conditional_overrides": [
      {
        "trigger": { "foa_code": { "enum": ["PA-23-XXX"] } },
        "action": { "page_limit": 12 }
      }
    ]
  }
}
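For comparison, the same override can be folded into standard if/then/else keywords. The sketch below is one possible translation, not a prescribed compiler: `to_if_then_else` is a hypothetical helper, and it assumes the applicant payload exposes `foa_code` and `research_strategy_pages` properties, neither of which is defined by the announcement itself.

```python
import json


def to_if_then_else(prop: str, base_limit: int, overrides: list[dict]) -> dict:
    """Fold an ordered override list into a nested if/then/else chain.

    The first override in the list ends up outermost, so it is tested first;
    the base limit sits in the innermost `else`.
    """
    schema: dict = {"properties": {prop: {"maximum": base_limit}}}
    for override in reversed(overrides):
        schema = {
            "if": {
                "properties": {"foa_code": {"enum": override["codes"]}},
                # Without "required", a payload with no foa_code would satisfy the "if".
                "required": ["foa_code"],
            },
            "then": {"properties": {prop: {"maximum": override["page_limit"]}}},
            "else": schema,
        }
    return schema


chain = to_if_then_else(
    "research_strategy_pages",
    base_limit=25,
    overrides=[{"codes": ["PA-23-XXX"], "page_limit": 12}],
)
print(json.dumps(chain, indent=2))
```

Because each override wraps the remaining chain in its `else`, exactly one ceiling is active for any payload, which is the property the custom override array has to guarantee by convention.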

Codify the same rule as a typed model so an out-of-policy configuration fails loudly at construction rather than silently mis-validating a submission. This uses Pydantic v2 field_validator syntax — the same Pydantic validation layer the ingestion pipeline relies on to harden parsed data:

python

from pydantic import BaseModel, Field, field_validator

class ConditionalOverride(BaseModel):
    """One trigger→action override lifted from an FOA clause."""
    foa_code: str
    page_limit: int = Field(gt=0)  # an override ceiling must be positive, like the base

class ResearchStrategyRule(BaseModel):
    """Base narrative constraint plus its ordered conditional overrides."""
    base_page_limit: int = 25
    font_size_pt: float = 11.0
    overrides: list[ConditionalOverride] = []

    @field_validator("base_page_limit")
    @classmethod
    def _positive_limit(cls, v: int) -> int:
        if v < 1:
            raise ValueError("page limit must be a positive ceiling")
        return v

    def resolve_limit(self, foa_code: str) -> int:
        """Overrides win over the base; first match is authoritative."""
        for rule in self.overrides:
            if rule.foa_code == foa_code:
                return rule.page_limit
        return self.base_page_limit

Resolving overrides before the base constraint is the whole point: a compiler that applies the default first and patches it afterward leaves two contradictory rules in play at once. The branching resolution runs as follows:

Overrides resolve first: a matching FOA code swaps the 25-page default for its 12-page ceiling before any content is validated against the active limit.
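A dependency-free sketch of that resolution order, with a small lint for overlapping overrides, might look like the following. The dictionary layout (`codes`, `page_limit`) and both function names are assumptions made for the illustration; the first-match rule mirrors the behaviour described above.

```python
def find_conflicts(overrides: list[dict]) -> dict[str, list[int]]:
    """Map each FOA code claimed by overrides with different limits to those limits."""
    claimed: dict[str, list[int]] = {}
    for override in overrides:
        for code in override["codes"]:
            claimed.setdefault(code, []).append(override["page_limit"])
    return {code: limits for code, limits in claimed.items() if len(set(limits)) > 1}


def resolve_limit(base_limit: int, overrides: list[dict], foa_code: str) -> int:
    """Check overrides in order; fall back to the base only if none matches."""
    for override in overrides:
        if foa_code in override["codes"]:
            return override["page_limit"]
    return base_limit


overrides = [{"codes": ["PA-23-XXX"], "page_limit": 12}]
print(resolve_limit(25, overrides, "PA-23-XXX"))   # override applies
print(resolve_limit(25, overrides, "SOME-OTHER"))  # base applies
print(find_conflicts(overrides))                   # no overlapping overrides
```

Running the lint when rules are compiled, rather than when a submission is validated, surfaces contradictory overrides before first-match ordering can hide them.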

Phase 3 — Edge cases and agency-specific overrides

Two classes of problem break a mapper that only reads the happy path: prose-encoded constraints that carry no clean number, and the fact that R01 rules are not universal across agencies.

Prose-encoded and cross-referenced rules. An FOA routinely defers to external authorities — the SF424 (R&R) Application Guide, the NIH Grants Policy Statement — that carry overlapping or superseding requirements. Treat each reference as a linked schema module keyed by version, not inline text, so authoritative rules are pulled once and never duplicated. Formatting mandates such as margin widths, allowed typefaces, and PDF/A archival expectations arrive as prose rather than integers; isolate the quantifiable boundary with deterministic extractors backed by NLP section-boundary detection, and apply strict type coercion that rejects an ambiguous value instead of guessing one.
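As an illustration of strict coercion, the sketch below pulls a single margin measurement out of a clause and converts it to points. The regular expression, the unit table, and the sample sentences are assumptions for the example; a production extractor would need to cover more phrasings. The point is the failure mode: zero or several distinct measurements raise instead of defaulting.

```python
import re

# 72 points per inch; every recognised unit converts to points.
POINTS_PER_UNIT = {
    "in": 72.0, "inch": 72.0, "inches": 72.0,
    "pt": 1.0, "point": 1.0, "points": 1.0,
}

_MEASURE = re.compile(
    r"(\d+(?:\.\d+)?|\.\d+)\s*-?\s*(inches|inch|in|points|point|pt)\b",
    re.IGNORECASE,
)


def margin_points(clause: str) -> float:
    """Return the one measurement in a clause, in points; raise if it is ambiguous."""
    found = {
        float(number) * POINTS_PER_UNIT[unit.lower()]
        for number, unit in _MEASURE.findall(clause)
    }
    if len(found) != 1:
        raise ValueError(
            f"expected exactly one measurement, found {sorted(found)}: {clause!r}"
        )
    return found.pop()


print(margin_points("Use 0.5 inch margins on all sides."))
```

A clause such as "Use one-half inch margins" contains no digits, so it raises and is left for manual review rather than being guessed at.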

Agency overrides. The 25/12-page logic and the modular budget ceiling are NIH constructs. The National Science Foundation (NSF) scopes its narrative and formatting rules through the Proposal & Award Policies & Procedures Guide (PAPPG), and a Department of Defense (DoD) Broad Agency Announcement (BAA) sets its own volume limits per announcement. Keep every ceiling data-driven and read from the notice, cross-checking sibling models built under the NSF proposal guide taxonomy and DoD BAA requirement extraction so one shared JSON shape serves all three agencies.

Mapped field	NIH R01	NSF (standard)	DoD BAA
Narrative section	Research Strategy	Project Description	Volume I technical
Page ceiling	12	15	Per-BAA
Min font size	11 pt	10 pt	Per-BAA
Min margin	0.5 in	1 in	Per-BAA
Budget model	Modular / detailed	Detailed	Per-BAA
Authority of record	Activity-code FOA	PAPPG (current)	Announcement
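One way to keep a single shape across the three agencies is to treat every "Per-BAA" cell as an unresolved value that must be supplied from the notice. In this sketch the `NarrativeProfile` class and `bind_notice` helper are hypothetical, the NIH and NSF numbers are transcribed from the table above, and the DoD values passed in at the end are placeholders, not figures from any real announcement.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class NarrativeProfile:
    """Narrative formatting constraints; None means 'read it from the notice'."""
    section: str
    page_ceiling: Optional[int]
    min_font_pt: Optional[float]
    min_margin_in: Optional[float]

    def unresolved(self) -> list[str]:
        fields = ("page_ceiling", "min_font_pt", "min_margin_in")
        return [name for name in fields if getattr(self, name) is None]


PROFILES = {
    "NIH": NarrativeProfile("Research Strategy", 12, 11.0, 0.5),
    "NSF": NarrativeProfile("Project Description", 15, 10.0, 1.0),
    "DoD": NarrativeProfile("Volume I technical", None, None, None),
}


def bind_notice(profile: NarrativeProfile, **values) -> NarrativeProfile:
    """Fill per-notice values and refuse to return a partially resolved profile."""
    bound = replace(profile, **values)
    if bound.unresolved():
        raise ValueError(f"notice did not supply: {bound.unresolved()}")
    return bound


# Placeholder values standing in for numbers parsed from a specific BAA.
baa = bind_notice(PROFILES["DoD"], page_ceiling=20, min_font_pt=12.0, min_margin_in=1.0)
print(baa)
```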

Phase 4 — Validate the payload and emit an audit record

The compiled schema only earns trust once an applicant payload is checked against it and the result is bound to a hash of the payload that was validated. Run the payload through the schema, catch every failure explicitly, and serialize an immutable, timestamped record that maps each error back to its originating FOA clause. That record is the reproducible baseline the compliance validation rule engines consume downstream.

python

import hashlib
import json
from datetime import datetime, timezone
from jsonschema import Draft202012Validator

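def validate_submission(payload: dict, schema: dict, foa_version: str) -> dict:
    """Validate one R01 payload and return an audit-safe compliance record."""
    validator = Draft202012Validator(schema)
    # Paths mix str keys and int indexes; compare them as strings so the
    # sort cannot raise TypeError on a payload that contains arrays.
    errors = sorted(
        validator.iter_errors(payload),
        key=lambda e: [str(part) for part in e.absolute_path],
    )
    record = {
        "foa_version": foa_version,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        # Hash a key-sorted serialization so equal payloads hash equally.
        "payload_sha256": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest(),
        "is_compliant": not errors,
        "violations": [
            {
                "path": list(e.absolute_path),
                "message": e.message,
                # A boolean subschema has no "$id"; fall back instead of raising.
                "rule_id": e.schema.get("$id", "UNKNOWN")
                if isinstance(e.schema, dict)
                else "UNKNOWN",
            }
            for e in errors
        ],
    }
    return record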

Before releasing the record to the audit log, walk a short acceptance checklist that fixtures alone will miss: confirm the page ceiling was read from the FOA and not hardcoded, that overrides resolved ahead of the base rule, that each violation carries a JSON path back to its clause, and that a missing external reference module routes the submission to manual review rather than crashing the batch. Only a payload that clears every domain — and produces a record with is_compliant: true — is safe to hand to the assembly and submission stage.
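Part of that checklist can be automated as a release gate over the audit record. The sketch below assumes the record shape returned by `validate_submission` above; the `release_decision` name, the three outcome strings, and the `sf424_guide` module key are hypothetical.

```python
def release_decision(record: dict, loaded_modules: set, required_modules: set) -> str:
    """Return 'release', 'reject', or 'manual_review' for one audit record."""
    # A missing external reference module means the schema was incomplete,
    # so neither a pass nor a fail can be trusted.
    if required_modules - loaded_modules:
        return "manual_review"
    # A violation with no rule id cannot be traced back to an FOA clause.
    if any(v.get("rule_id", "UNKNOWN") == "UNKNOWN" for v in record["violations"]):
        return "manual_review"
    return "release" if record["is_compliant"] else "reject"


clean = {"is_compliant": True, "violations": []}
print(release_decision(clean, {"sf424_guide"}, {"sf424_guide"}))
print(release_decision(clean, set(), {"sf424_guide"}))
```

Returning a decision string instead of raising keeps one incomplete submission from halting the rest of the batch.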

Frequently asked questions

Why map the FOA to JSON instead of validating against the PDF directly?

The PDF is prose with conditional exceptions buried in it; a validator cannot reason over sentences reliably. Compiling the announcement into a typed JSON model once, with overrides made explicit, gives every downstream check a single deterministic source of truth and lets a policy change become a data edit rather than a code change.

Is the 25-page Research Strategy limit safe to store as a constant?

No. The default collapses to 12 pages — and other ceilings under other mechanisms — depending on the program announcement code in force. Model it as a base value plus an ordered override list, resolve overrides first, and read the authoritative number from the FOA record every run.

How should the mapper handle a formatting rule written as prose, like "0.5 inch margins"?

Extract the quantifiable boundary with a deterministic extractor, coerce it to a typed numeric field (36 points), and reject any value that cannot be parsed unambiguously rather than defaulting one. An ambiguous input that silently defaults is worse than a loud failure, because it clears shallow checks while omitting the real requirement.

Can one JSON shape serve NIH, NSF, and DoD?

Yes, if the shape is domain-oriented rather than agency-specific. Keep the narrative, budget, and eligibility domains generic and drive every ceiling from the notice; the NIH R01, an NSF PAPPG proposal, and a DoD BAA then differ only in their data values, not their structure.

What belongs in the audit record beyond pass/fail?

Bind the record to the SHA-256 of the exact payload, the FOA version, a UTC timestamp, and a JSON path plus rule id for every violation. That is what lets a compliance officer reconstruct precisely why a package failed pre-submission checks months later, satisfying institutional record-retention policy and federal audit review.

Related

NIH FOA schema mapping: the parent workflow this R01 mapping plugs into.
Schema validation with Pydantic: the typed validation layer that hardens the compiled rules.
PDF text extraction with pdfplumber: the coordinate-aware extraction that feeds Phase 1.
DoD BAA requirement extraction: the sibling model for Broad Agency Announcements.
Compliance validation rule engines: where the compiled JSON schema is enforced against submissions.