Schema Validation with Pydantic

Schema validation with Pydantic is the deterministic gate that stops malformed solicitation data from reaching proposal assembly, catching bad opportunity IDs, timezone-naive deadlines, and out-of-range page limits before they corrupt downstream automation. In federal grant pipelines, unstructured documents from the National Institutes of Health (NIH), the National Science Foundation (NSF), and the Department of Defense (DoD) arrive with inconsistent formatting, nested appendices, and agency-specific terminology. Within the broader RFP ingestion and parsing workflows, the validation stage converts loosely typed parser output into a strict data contract, so that every field is either coerced and bounded into a conforming value or rejected with an explicit error the moment it fails to conform. This page covers how to build that contract in Pydantic v2, how to tune it per agency, and how to wire it into the stages on either side of it.

Without a validation layer, silent data corruption propagates unchecked: a deadline parsed as a string sorts lexically instead of chronologically, a page limit of 0 slips past a budget calculator, and a missing eligibility field surfaces only as a KeyError deep inside a submission package generator. Pydantic replaces that fragility with fail-fast contracts. Rather than treating the library as a passive type checker, grant automation platforms use it as a compliance enforcement engine that centralizes rules which would otherwise scatter across every downstream component.

One gate, two exits: valid documents become typed objects; every failure becomes a structured, auditable rejection.

Prerequisites and environment setup

The schema layer targets Python 3.10 or newer, which is required for the X | None union syntax and modern typing features used throughout these examples. Pydantic v2 is mandatory — its validation core is a compiled Rust engine (pydantic-core) that is roughly an order of magnitude faster than the pure-Python v1 series, which matters when validating thousands of parsed solicitations per funding cycle.

bash

python3 -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2.6,<3.0" email-validator python-dateutil

The email-validator package backs Pydantic’s EmailStr type, and python-dateutil helps normalize the heterogeneous date formats that appear across agency templates. Two assumptions govern the input to this stage. First, the data reaching Pydantic is already a Python dictionary or JSON string — the raw PDF has been converted upstream by PDF text extraction with pdfplumber and segmented by NLP section boundary detection. Second, field keys have been normalized to a canonical snake_case vocabulary before validation, so the schema validates structure and semantics rather than fighting over synonyms like “Closing Date” versus “Deadline.”

A critical v2 migration note: use field_validator and model_validator, not the v1 @validator decorator. In Pydantic v2, @validator survives only as a deprecated compatibility shim that emits a deprecation warning and is scheduled for removal in v3, so code that still imports validator from pydantic should be migrated rather than left on the legacy path.

Core mechanism — how Pydantic validation works internally

A Pydantic model is a subclass of BaseModel whose annotated attributes define a schema. When you call Model.model_validate(data), the engine walks each field in declaration order, coerces the input toward the annotated type (or refuses, under strict mode), runs Field constraints, then runs any field_validator functions (which by default run after type validation), and finally runs model_validator(mode="after") functions that see the whole object. A field failure does not abort the run: the engine keeps going, collects every field-level error, and raises a single ValidationError at the end, so one call reports all field problems in a document rather than only the first. After-mode model validators run only once every field has passed.

The two behaviors that make Pydantic a compliance tool rather than a convenience are ConfigDict(strict=True), which disables silent coercion (a string "15" will not become an int 15), and ConfigDict(extra="forbid"), which rejects any field the schema does not declare. Together they prevent schema drift, where an upstream parser starts emitting a new key that silently passes through unvalidated.

python

from pydantic import BaseModel, Field, ConfigDict, ValidationError

class MinimalRFP(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    opportunity_id: str = Field(..., min_length=3)
    page_limit: int = Field(..., gt=0, le=100)

try:
    MinimalRFP.model_validate({"opportunity_id": "NSF-24", "page_limit": "15"})
except ValidationError as exc:
    # exc.errors() returns a list of dicts: loc, msg, type, input
    for err in exc.errors():
        print(err["loc"], err["type"], err["msg"])
    # ('page_limit',) int_type Input should be a valid integer ...

The structured exc.errors() output is the feature that makes Pydantic auditable. Each entry carries a loc tuple pinpointing the offending field, a machine-stable type string, the human-readable msg, and the original input. That structure feeds directly into audit trails and reviewer-facing rejection reports, which is what distinguishes a compliance gate from a bare try/except.
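As a minimal sketch of that aggregation behavior, the block below feeds one payload with three independent problems to a model shaped like MinimalRFP and collects every error from a single call. The payload values and the surprise key are made up for illustration.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class MinimalRFP(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    opportunity_id: str = Field(..., min_length=3)
    page_limit: int = Field(..., gt=0, le=100)


# Illustrative payload: ID too short, page limit out of range, undeclared key.
bad_payload = {"opportunity_id": "NS", "page_limit": 0, "surprise": True}

try:
    MinimalRFP.model_validate(bad_payload)
    errors = []
except ValidationError as exc:
    errors = exc.errors()

# Flatten each loc tuple into a dotted path and keep the machine-stable type.
by_field = {".".join(str(p) for p in e["loc"]): e["type"] for e in errors}
for field, error_type in by_field.items():
    print(f"{field}: {error_type}")
```

Keying the report on the type string rather than the message keeps audit queries stable even if the human-readable wording changes between releases.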

Rule-aware schema implementation

The production pattern models a federal solicitation as a set of nested typed models: agency contacts, funding identifiers, deadlines, eligibility, and page limits each become fields with explicit constraints and validators. Nesting keeps each concern independently testable and lets a validation error point precisely at, for example, primary_contact.email rather than a flat blob. The schema below is the reference contract that parsed output must satisfy before advancing; it enforces opportunity-ID formatting, a future-dated deadline in UTC, a bounded page limit, and a closed vocabulary of eligible entity types.

Each field carries its own constraint; nesting lets a failure point precisely at primary_contact.email rather than a flat blob.

python

from datetime import datetime, timezone
from pydantic import BaseModel, Field, field_validator, ValidationError, ConfigDict

class AgencyContact(BaseModel):
    model_config = ConfigDict(strict=True)
    name: str = Field(..., min_length=2, max_length=100)
    email: str = Field(..., pattern=r"^[\w\.-]+@[\w\.-]+\.\w{2,}$")
    phone: str | None = Field(None, pattern=r"^\+?1?\d{9,15}$")

class RFPComplianceSchema(BaseModel):
    model_config = ConfigDict(extra="forbid")

    opportunity_id: str = Field(..., pattern=r"^[A-Z]{2,4}-\d{4}-[A-Z0-9]{3,10}$")
    agency_name: str = Field(..., min_length=3)
    title: str = Field(..., max_length=250)
    submission_deadline: datetime
    page_limit: int = Field(..., gt=0, le=100)
    eligible_entities: list[str] = Field(..., min_length=1)
    primary_contact: AgencyContact

    @field_validator("submission_deadline")
    @classmethod
    def validate_deadline_future(cls, v: datetime) -> datetime:
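        now = datetime.now(timezone.utc)
        # Naive values are assumed to be UTC; aware values are converted to UTC.
        aware_v = v if v.tzinfo is not None else v.replace(tzinfo=timezone.utc)
        aware_v = aware_v.astimezone(timezone.utc)
        if aware_v <= now:
            raise ValueError("Submission deadline must be in the future.")
        return aware_v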

    @field_validator("eligible_entities")
    @classmethod
    def validate_entity_types(cls, v: list[str]) -> list[str]:
        allowed = {"university", "nonprofit", "small_business", "tribal_nation", "government"}
        invalid = {e.lower() for e in v} - allowed
        if invalid:
            raise ValueError(f"Invalid eligibility categories: {', '.join(sorted(invalid))}")
        return [e.lower() for e in v]

    @field_validator("page_limit")
    @classmethod
    def reject_placeholder_limits(cls, v: int) -> int:
        # Parsers sometimes emit 99 as a sentinel when no limit was found.
        if v == 99:
            raise ValueError("Page limit 99 is a parser sentinel, not a real limit.")
        return v

# Example usage in the pipeline
raw_parsed_data = {
    "opportunity_id": "NSF-2026-ENG001",
    "agency_name": "National Science Foundation",
    "title": "Advanced Materials Research Initiative",
    "submission_deadline": "2026-11-15T17:00:00-05:00",
    "page_limit": 15,
    "eligible_entities": ["university", "small_business"],
    "primary_contact": {
        "name": "Dr. Elena Rostova",
        "email": "e.rostova@nsf.gov",
        "phone": "+17035550199",
    },
}

try:
    validated_rfp = RFPComplianceSchema.model_validate(raw_parsed_data)
    print("Schema validation passed. Data ready for downstream compliance checks.")
except ValidationError as exc:
    print(f"Compliance validation failed:\n{exc}")

Two design choices carry weight here. Normalizing the deadline to a timezone-aware UTC value inside the validator (returning aware_v, not the raw v) means every downstream comparison operates on a single, unambiguous clock, which removes a common source of off-by-one-day rejection errors near midnight deadlines. Normalizing eligible_entities to lowercase on the way out means the compliance validation rule engines downstream never have to re-case or re-guess the vocabulary. The opportunity_id pattern maps to the canonical identifiers produced by the NIH FOA schema mapping and NSF taxonomy work in the core architecture and RFP taxonomy domain, so IDs stay consistent from ingestion through submission.
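A minimal standard-library sketch of why a single clock matters near midnight. The helper name to_utc is hypothetical, and treating naive input as UTC is an assumption that mirrors the validator above; the deadline value is illustrative.

```python
from datetime import datetime, timezone


def to_utc(value: datetime) -> datetime:
    """Return an aware UTC datetime; naive input is assumed to already be UTC."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


# Illustrative deadline: 11:30 PM at a UTC-05:00 offset.
local_deadline = datetime.fromisoformat("2026-11-15T23:30:00-05:00")
utc_deadline = to_utc(local_deadline)

# Same instant, but the calendar date differs depending on the clock used.
print(local_deadline.date().isoformat())
print(utc_deadline.date().isoformat())
print(local_deadline == utc_deadline)
```

The two values compare equal because aware datetimes compare by instant, yet their calendar dates differ. Any downstream code that groups or filters by date therefore needs to agree on one clock, which is the case for normalizing once at the gate.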

Agency-specific configuration

A single rigid schema cannot express the real divergence between agencies. NIH, NSF, and DoD differ on identifier formats, page-limit governance, deadline semantics, and required attachments. The table below captures the differences this schema must accommodate.

Constraint	NIH	NSF	DoD
Opportunity ID format	`PA-##-###` / `RFA-XX-##-###`	`NSF ##-###` (program solicitation)	BAA / topic number (e.g. `HR001126S0001`)
Page-limit governance	Mechanism-specific (12-page R01 research strategy)	PAPPG-wide 15-page project description	Per-BAA white paper and full proposal caps
Deadline semantics	5:00 PM submitter local time	5:00 PM submitter local time	BAA-specified, often rolling windows
Portal / schema source	eRA Commons and Grants.gov	Research.gov and Grants.gov	eBRAP / SAM.gov
Cost governance	Modular vs. detailed budget	Fringe and indirect per PAPPG	FAR and DFARS attachment clauses

Rather than branching with if agency == ... throughout the codebase, keep one base contract and apply agency profiles as configuration. A registry of per-agency parameters lets the same validation core enforce different pattern strings and limits without duplicating logic.

python

from dataclasses import dataclass

@dataclass(frozen=True)
class AgencyProfile:
    id_pattern: str
    max_page_limit: int
    deadline_is_local: bool  # True: interpret naive deadline as submitter-local

AGENCY_PROFILES: dict[str, AgencyProfile] = {
    "NIH": AgencyProfile(r"^(PA|PAR|RFA-[A-Z]{2})-\d{2}-\d{3}$", 12, True),
    "NSF": AgencyProfile(r"^NSF \d{2}-\d{3}$", 15, True),
    "DoD": AgencyProfile(r"^[A-Z]{2}\d{6}[A-Z]\d{4}$", 20, False),
}

def profile_for(agency_code: str) -> AgencyProfile:
    try:
        return AGENCY_PROFILES[agency_code]
    except KeyError as exc:
        raise ValueError(f"No compliance profile registered for {agency_code!r}") from exc

The registry stays deliberately small and data-shaped so that a versioned PAPPG update or a new DoD Broad Agency Announcement (BAA) format is a one-line change reviewed against the DoD BAA requirement extraction rules, not a code refactor. Field-level mappings for each agency are enumerated in the validating parsed RFP JSON against agency schemas reference, which aligns these constraints with NIH modular budget rules, NSF merit-review criteria, and DoD FAR/DFARS attachment requirements.
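One way to apply a profile without branching is to pass it through Pydantic's validation context. The sketch below is an assumption about how the wiring could look, not the platform's actual implementation: ProfiledRFP, PROFILES, and validate_for are hypothetical names, and the registry is a trimmed copy with two agencies.

```python
import re
from dataclasses import dataclass

from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator


@dataclass(frozen=True)
class AgencyProfile:
    id_pattern: str
    max_page_limit: int


# Trimmed, illustrative registry; values mirror the table above.
PROFILES = {
    "NSF": AgencyProfile(r"^NSF \d{2}-\d{3}$", 15),
    "DoD": AgencyProfile(r"^[A-Z]{2}\d{6}[A-Z]\d{4}$", 20),
}


class ProfiledRFP(BaseModel):
    opportunity_id: str
    page_limit: int

    @field_validator("opportunity_id")
    @classmethod
    def id_matches_profile(cls, v: str, info: ValidationInfo) -> str:
        # info.context is None when the caller passes no context.
        profile = (info.context or {}).get("profile")
        if profile is not None and not re.fullmatch(profile.id_pattern, v):
            raise ValueError(f"ID {v!r} does not match the agency pattern")
        return v

    @field_validator("page_limit")
    @classmethod
    def limit_within_profile(cls, v: int, info: ValidationInfo) -> int:
        profile = (info.context or {}).get("profile")
        if profile is not None and v > profile.max_page_limit:
            raise ValueError(f"Page limit {v} exceeds agency cap {profile.max_page_limit}")
        return v


def validate_for(agency_code: str, payload: dict) -> ProfiledRFP:
    """Validate a payload against the profile registered for agency_code."""
    return ProfiledRFP.model_validate(payload, context={"profile": PROFILES[agency_code]})


ok = validate_for("NSF", {"opportunity_id": "NSF 24-001", "page_limit": 15})

try:
    validate_for("NSF", {"opportunity_id": "HR001126S0001", "page_limit": 18})
    profile_errors = []
except ValidationError as exc:
    profile_errors = sorted(str(e["loc"][0]) for e in exc.errors())
print(profile_errors)
```

Because the profile travels as context rather than as a field, the same model class serves every agency and a profile change never touches validator code.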

Error handling and edge cases

Well-formed inputs are the easy case; the schema earns its place by how it fails. The recurring edge cases in federal ingestion are timezone-naive deadlines, coercion surprises, aggregated multi-field failures, and unexpected extra fields. Parser sentinels masquerading as real values are a fifth case, handled by the reject_placeholder_limits validator shown earlier.

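Naive datetimes. A deadline like "2026-11-15T17:00:00" has no offset. The validate_deadline_future validator treats it as UTC, but if the true intent was submitter-local time you must attach the correct zone upstream before validation. Otherwise a 5:00 PM Eastern deadline is read as 5:00 PM UTC, four to five hours too early, and can be rejected as already past during its final hours.
Strict versus lax coercion. Under strict=True, "15" is not a valid int. That is usually what you want for machine-parsed data, because a stringified number often signals an upstream extraction bug worth surfacing rather than papering over. Note that RFPComplianceSchema sets strict=True only on the nested AgencyContact model: the top-level model stays lax so that ISO 8601 deadline strings in a Python dict still parse into datetime, which also means a stringified page_limit would be coerced there. Add Field(strict=True) to individual fields where coercion should be refused.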
Aggregated errors. Because model_validate collects all failures, iterate exc.errors() to build a complete, reviewer-friendly rejection report in one pass instead of fixing one field and re-running.
Extra fields. With extra="forbid", an unexpected key raises rather than being dropped. This is how you detect that an upstream parser changed its output shape.
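The strict-versus-lax distinction in the list above can be sketched with two one-field models. LaxLimit and StrictLimit are hypothetical names used only for this comparison.

```python
from pydantic import BaseModel, Field, ValidationError


class LaxLimit(BaseModel):
    # Default (lax) mode: a numeric string is coerced to int.
    page_limit: int = Field(..., gt=0, le=100)


class StrictLimit(BaseModel):
    # strict=True on the field disables str -> int coercion for this field only.
    page_limit: int = Field(..., gt=0, le=100, strict=True)


lax = LaxLimit.model_validate({"page_limit": "15"})
print(type(lax.page_limit).__name__, lax.page_limit)

try:
    StrictLimit.model_validate({"page_limit": "15"})
    strict_error = None
except ValidationError as exc:
    strict_error = exc.errors()[0]["type"]
print(strict_error)
```

Field-level strictness is a middle path when model-wide strict=True would also reject inputs you do want coerced, such as ISO date strings arriving in a Python dict.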

python

def validate_or_report(payload: dict) -> tuple[RFPComplianceSchema | None, list[dict]]:
    """Return (model, []) on success or (None, structured_errors) on failure."""
    try:
        return RFPComplianceSchema.model_validate(payload), []
    except ValidationError as exc:
        report = [
            {
                "field": ".".join(str(p) for p in err["loc"]),
                "error_type": err["type"],
                "message": err["msg"],
            }
            for err in exc.errors()
        ]
        return None, report

Returning a structured report instead of re-raising lets the caller decide policy: quarantine the record, route it to a human reviewer, or retry after a targeted re-parse. When each rejection reason is tied back to the agency rule it enforces, the report doubles as the documentation an institutional review or audit trail expects.
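A sketch of what such a caller-side policy might look like, operating on the report shape returned by validate_or_report. The Disposition enum, the two type sets, and the routing rules are all assumptions chosen for illustration; a real deployment would set its own policy.

```python
from enum import Enum


class Disposition(str, Enum):
    ACCEPT = "accept"
    RETRY_PARSE = "retry_parse"
    HUMAN_REVIEW = "human_review"
    QUARANTINE = "quarantine"


# Hypothetical policy: error types that suggest an extraction fault worth a
# targeted re-parse, versus types that suggest the parser's output shape changed.
RETRYABLE_TYPES = {"missing", "int_type", "string_type", "datetime_parsing"}
SHAPE_DRIFT_TYPES = {"extra_forbidden"}


def route(report: list[dict]) -> Disposition:
    """Map a structured error report to a handling decision."""
    if not report:
        return Disposition.ACCEPT
    types = {entry["error_type"] for entry in report}
    if types & SHAPE_DRIFT_TYPES:
        return Disposition.QUARANTINE      # upstream contract changed
    if types <= RETRYABLE_TYPES:
        return Disposition.RETRY_PARSE     # every failure looks like a parse fault
    return Disposition.HUMAN_REVIEW        # semantic failures need a person


sample_report = [
    {"field": "page_limit", "error_type": "int_type", "message": "Input should be a valid integer"},
]
print(route(sample_report).value)
```

Routing on the machine-stable error_type keeps the policy table small and independent of message wording.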

Integration with downstream pipeline

The validation gate sits between segmentation and the compliance engines. Parsed JSON enters, model_validate runs, and the branch is binary: a validated object advances to the next stage, or a ValidationError is caught, converted to a structured report, and logged to the audit trail. The diagram below shows how parsed JSON moves through the Pydantic validation gate before advancing.

The branch is binary: a validated object advances, or a structured error is caught, logged, and routed for review.

On the success path, the validated object is a typed, guaranteed-shape record that the compliance validation rule engines can consume without defensive checks: page-limit and font enforcement, required-section mapping, and threshold tuning all operate against fields they can trust. At ingestion scale, the validation call itself must not become the bottleneck. Pydantic v2's compiled core keeps per-document validation cheap, but validation is still CPU-bound work that runs under the GIL, so in the async batch processing for large RFPs stage it belongs in worker threads or processes rather than inline on the event loop. That way thousands of documents can be validated per batch without stalling I/O. Serialize the validated object with model_dump(mode="json") when handing off to a queue or portal-sync connector, so the canonical, normalized representation, not the raw parser output, is what travels forward.
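The hand-off described above might be sketched as follows. QueueRecord is a hypothetical, trimmed stand-in for RFPComplianceSchema, the batch payloads are made up, and asyncio.to_thread is one possible way to keep validation off the event loop, not necessarily what the batch stage uses.

```python
import asyncio
from datetime import datetime

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class QueueRecord(BaseModel):
    # Hypothetical stand-in for RFPComplianceSchema, kept small for the sketch.
    model_config = ConfigDict(extra="forbid")
    opportunity_id: str = Field(..., min_length=3)
    submission_deadline: datetime
    page_limit: int = Field(..., gt=0, le=100)


def validate_one(payload: dict) -> dict:
    """Validate one payload and return a JSON-safe envelope for the queue."""
    try:
        model = QueueRecord.model_validate(payload)
    except ValidationError as exc:
        errors = [
            {"field": ".".join(str(p) for p in e["loc"]), "error_type": e["type"]}
            for e in exc.errors()
        ]
        return {"ok": False, "errors": errors}
    # mode="json" converts datetimes and other rich types to JSON-safe values.
    return {"ok": True, "record": model.model_dump(mode="json")}


async def validate_batch(payloads: list[dict]) -> list[dict]:
    # Each validation runs in a worker thread so the event loop stays responsive.
    tasks = [asyncio.to_thread(validate_one, p) for p in payloads]
    return list(await asyncio.gather(*tasks))


batch = [
    {"opportunity_id": "NSF-2026-ENG001", "submission_deadline": "2030-01-15T17:00:00+00:00", "page_limit": 15},
    {"opportunity_id": "NSF-2026-ENG002", "submission_deadline": "2030-01-15T17:00:00+00:00", "page_limit": 0},
]
results = asyncio.run(validate_batch(batch))
print([r["ok"] for r in results])
```

Returning an envelope for both outcomes keeps the queue consumer simple: it never sees an exception, only a record or a structured list of errors.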

Testing and verification

Treat the schema as compliance-critical code and test it like a rule engine: assert that valid documents pass, and just as importantly, assert that each violation is rejected with the expected error type. The suite below uses pytest and checks the happy path plus four rejection cases: a past deadline, an unknown eligibility category, an unexpected extra field, and the sentinel page limit.

python

import pytest
from datetime import datetime, timedelta, timezone
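from pydantic import ValidationError

# RFPComplianceSchema is the model defined earlier on this page; import it
# from wherever your project defines it.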

def _base_payload() -> dict:
    future = (datetime.now(timezone.utc) + timedelta(days=30)).isoformat()
    return {
        "opportunity_id": "NSF-2026-ENG001",
        "agency_name": "National Science Foundation",
        "title": "Advanced Materials Research Initiative",
        "submission_deadline": future,
        "page_limit": 15,
        "eligible_entities": ["University"],
        "primary_contact": {"name": "A. Reviewer", "email": "a@nsf.gov", "phone": None},
    }

def test_valid_payload_passes():
    model = RFPComplianceSchema.model_validate(_base_payload())
    assert model.eligible_entities == ["university"]      # normalized lowercase
    assert model.submission_deadline.tzinfo is not None    # normalized to aware UTC

def test_past_deadline_rejected():
    payload = _base_payload()
    payload["submission_deadline"] = "2000-01-01T00:00:00+00:00"
    with pytest.raises(ValidationError) as exc:
        RFPComplianceSchema.model_validate(payload)
    assert "future" in str(exc.value)

def test_unknown_eligibility_rejected():
    payload = _base_payload()
    payload["eligible_entities"] = ["university", "hedge_fund"]
    with pytest.raises(ValidationError):
        RFPComplianceSchema.model_validate(payload)

def test_extra_field_forbidden():
    payload = _base_payload()
    payload["unexpected_key"] = "drift"
    with pytest.raises(ValidationError) as exc:
        RFPComplianceSchema.model_validate(payload)
    assert any(e["type"] == "extra_forbidden" for e in exc.value.errors())

def test_sentinel_page_limit_rejected():
    payload = _base_payload()
    payload["page_limit"] = 99
    with pytest.raises(ValidationError):
        RFPComplianceSchema.model_validate(payload)

Beyond unit tests, keep a small verification checklist for every schema change before it ships:

Every new field carries an explicit type and, where a range applies, a Field constraint.
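extra="forbid" remains set on the top-level model, and strict=True remains set wherever it was deliberately enabled (the nested AgencyContact model in the reference schema).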
Each field_validator uses the @classmethod form and returns the normalized value, not the raw input.
Each agency profile in the registry has a test asserting both an accepted and a rejected identifier.
The structured errors() output for at least one failing case is snapshot-tested, so reviewer-facing messages do not regress.
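The registry item in the checklist above can be sketched as a table-driven check. ID_PATTERNS is an illustrative copy of two registry patterns, the sample identifiers are made up, and check_profile_ids is a hypothetical helper; in a pytest suite the same cases would fit naturally into a parametrized test.

```python
import re

# Illustrative copy of two registry patterns.
ID_PATTERNS = {
    "NSF": r"^NSF \d{2}-\d{3}$",
    "DoD": r"^[A-Z]{2}\d{6}[A-Z]\d{4}$",
}

# (agency, identifier, should_match): one accepted and one rejected ID per agency.
CASES = [
    ("NSF", "NSF 24-001", True),
    ("NSF", "NSF-24-001", False),
    ("DoD", "HR001126S0001", True),
    ("DoD", "HR0011-26-S-0001", False),
]


def check_profile_ids() -> list[tuple[str, str]]:
    """Return (agency, identifier) pairs whose match result is not the expected one."""
    failures = []
    for agency, identifier, should_match in CASES:
        matched = re.fullmatch(ID_PATTERNS[agency], identifier) is not None
        if matched != should_match:
            failures.append((agency, identifier))
    return failures


failures = check_profile_ids()
print(failures)
```

Pairing every accepted identifier with a near-miss rejection guards against a pattern that is accidentally loosened during a profile update.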

Treating Pydantic as a compliance enforcement engine rather than a type checker is what lets a grant automation platform eliminate silent data corruption, produce audit-ready rejection reports, and hold strict alignment with federal submission standards from ingestion through submission.

Related pages

PDF text extraction with pdfplumber: the upstream stage that turns solicitation PDFs into the text this schema validates.
NLP section boundary detection: segments extracted text into the fields the schema maps.
Async batch processing for large RFPs: runs validation concurrently across high-volume ingestion.
Validating parsed RFP JSON against agency schemas: field-level agency mappings for NIH, NSF, and DoD.
Compliance validation rule engines: the downstream engines that consume validated RFP objects.

Up next in the workflow: return to RFP ingestion and parsing workflows for the full ingestion-to-submission pipeline.
