Compliance Validation & Rule Engines

Federal grant proposals live and die on formatting. A single missed page limit, a disallowed font, or an absent mandatory section triggers administrative rejection before a reviewer reads one word of scientific merit. For the National Institutes of Health (NIH), the National Science Foundation (NSF), and the Department of Defense (DoD), that screen happens automatically inside the submission portal, and it happens the same way at 4:59 PM on a deadline day as it does a week early. For research administrators, grant writers, university technology teams, and Python automation builders, programmatic compliance validation replaces manual checklist audits with deterministic rule engines that intercept violations during document assembly — while there is still time to fix them.

A compliance rule engine translates the textual mandates in a Notice of Funding Opportunity (NOFO) into executable logic. It is the enforcement layer that sits between the parsed requirements of a solicitation and the assembled proposal package. These systems do not merely flag errors after the fact; they enforce structural integrity, maintain immutable audit trails, and guarantee that every artifact moving through the pipeline adheres to the agency-specific mandates that govern it. This page is the map of that enforcement layer: how its parts fit together, where NIH, NSF, and DoD rules diverge, and how to implement the engine so that the same input always produces the same verdict.

How the compliance layer fits together

The compliance layer is organized around a single principle: policy is data, and evaluation is code. Rather than hardcoding “12 pages” or “11-point Arial” into scripts, a production system loads agency parameters from configuration and runs a common, deterministic engine over them. Four concerns make up the layer, each with its own dedicated workflow.

The first concern is establishing which parts of a document even count. Agencies distinguish countable narrative from exempt material — references, biosketches, and data management plans usually do not count against a page limit — so Required Section Mapping cross-references the mandatory headings named in the solicitation against the actual heading tree of the submitted document. Without accurate section delineation, every downstream measurement is wrong: a page counter overcounts exempt pages, and a completeness check flags sections that are present but mislabeled.
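As a rough illustration of the "present but mislabeled" problem, the sketch below normalizes headings before comparing them. The alias table and both function names are hypothetical, and a real mapping would be driven by the parsed solicitation rather than a hardcoded dictionary.

```python
import re

# Hypothetical alias table: variant headings mapped to the canonical name
# used in the solicitation. A real table would come from configuration.
HEADING_ALIASES = {
    "specific aims page": "specific aims",
    "references cited": "bibliography and references cited",
}


def normalize_heading(text: str) -> str:
    """Lowercase, drop leading numbering ('2.1 ', 'A. '), strip punctuation."""
    text = text.strip().lower()
    text = re.sub(r"^(?:\d+(?:\.\d+)*[.)]?|[a-z][.)])\s+", "", text)
    text = re.sub(r"[^a-z0-9& ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return HEADING_ALIASES.get(text, text)


def map_sections(required: list[str], headings: list[str]) -> dict[str, list[str]]:
    """Report which required sections appear in the document's heading tree."""
    present = {normalize_heading(h) for h in headings}
    wanted = {normalize_heading(r): r for r in required}
    return {
        "found": sorted(name for key, name in wanted.items() if key in present),
        "missing": sorted(name for key, name in wanted.items() if key not in present),
    }
```

With this normalization, a heading such as "3. References Cited:" satisfies a requirement for "Bibliography and References Cited" instead of being reported as missing.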

The second concern is the typographic and pagination envelope. Once countable content is isolated, Page Limit & Font Enforcement parses embedded font families, point sizes, line spacing, and margin offsets at the character level, comparing them against the active NOFO’s declared limits. This is where font substitution during PDF export, scaled point sizes, and multi-column figure captions have to be handled precisely rather than by eyeballing a rendered page.
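A minimal sketch of a character-level check, assuming the input is a list of character dictionaries shaped like the ones pdfplumber exposes on `page.chars` (each with `text`, `fontname`, and `size` keys). The helper names and the suffix-stripping heuristics are illustrative assumptions, not a complete treatment of PDF font naming.

```python
import re
from collections import Counter
from typing import Any

_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")   # e.g. 'ABCDEF+' on subset fonts
_STYLE_SUFFIX = re.compile(r"[-,].*$")        # e.g. '-BoldMT' or ',Italic'


def family_key(fontname: str) -> str:
    """Reduce an embedded font name to a comparable family key (heuristic)."""
    name = _SUBSET_PREFIX.sub("", fontname)
    name = _STYLE_SUFFIX.sub("", name)
    name = re.sub(r"(?:PS)?MT$", "", name)    # 'ArialMT' -> 'Arial'
    return re.sub(r"[^a-z]", "", name.lower())


def check_typography(
    chars: list[dict[str, Any]],
    approved: set[str],
    min_size: float,
    tolerance: float = 0.05,
) -> dict[str, Any]:
    """Count visible characters that use an unapproved family or are undersized."""
    approved_keys = {family_key(f) for f in approved}
    bad_fonts: Counter[str] = Counter()
    undersized = 0
    for ch in chars:
        if not ch.get("text", "").strip():
            continue  # whitespace glyphs carry no visible typography
        if family_key(ch["fontname"]) not in approved_keys:
            bad_fonts[ch["fontname"]] += 1
        if ch["size"] < min_size - tolerance:
            undersized += 1
    return {
        "bad_fonts": dict(bad_fonts),
        "undersized_chars": undersized,
        "passed": not bad_fonts and undersized == 0,
    }
```

Counting offending characters rather than returning a bare boolean lets a later stage decide whether one stray glyph in a figure caption deserves the same treatment as a whole section set in the wrong face.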

The third concern is calibration. Real documents carry rounding artifacts and conversion noise, so a rule engine that demands bit-exact conformance produces false positives that erode trust. Threshold Tuning for Compliance defines the tolerance bands — how many hundredths of an inch of margin drift, how much optical character recognition (OCR) confidence, how much point-size deviation — that separate a genuine violation from a harmless rendering artifact.
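One way to express a tolerance band is as a small, immutable configuration object consulted by each measurement. The sketch below is an assumption about shape only: the class name, the default values, and the three-way outcome are illustrative, not agency guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerance:
    # Illustrative defaults; real values would be tuned per institution.
    margin_in: float = 0.02           # allowed margin drift, in inches
    point_size: float = 0.05          # allowed point-size deviation
    min_ocr_confidence: float = 0.90  # below this, a measurement is not trusted


def classify_margin(
    measured_in: float,
    required_in: float,
    tol: Tolerance,
    ocr_confidence: float = 1.0,
) -> str:
    """Return 'pass', 'warning', 'violation', or 'unmeasured' for one margin."""
    if ocr_confidence < tol.min_ocr_confidence:
        return "unmeasured"  # fail closed: do not report a pass we cannot trust
    shortfall = required_in - measured_in
    if shortfall <= 0:
        return "pass"
    if shortfall <= tol.margin_in:
        return "warning"     # inside the band: likely a rendering artifact
    return "violation"
```

For example, a measured margin of 0.49 inch against a 0.5 inch requirement lands in the warning band, while 0.45 inch is reported as a violation.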

The fourth concern is turning verdicts into action. A boolean pass/fail is useless to a principal investigator racing a deadline. Automated Checklist Generation consumes the engine’s structured verdicts and produces a NOFO-specific, human-readable deficiency report that routes each failure to the person who can fix it.
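A small sketch of the routing idea, assuming verdicts arrive as plain dictionaries with `rule_id`, `passed`, and `message` keys. The routing table and owner labels are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Any

# Hypothetical routing table: which role can fix which rule.
ROUTING = {
    "FONT_CHECK": "grant writer",
    "REQ_SECTIONS": "principal investigator",
}
DEFAULT_OWNER = "research administrator"


def build_checklist(verdicts: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Group failed verdicts into per-owner checklist lines, in stable order."""
    by_owner: dict[str, list[str]] = defaultdict(list)
    for verdict in sorted(verdicts, key=lambda v: v["rule_id"]):
        if verdict["passed"]:
            continue  # only deficiencies appear on the checklist
        owner = ROUTING.get(verdict["rule_id"], DEFAULT_OWNER)
        by_owner[owner].append(f"[ ] {verdict['rule_id']}: {verdict['message']}")
    return {owner: by_owner[owner] for owner in sorted(by_owner)}
```

Sorting both the verdicts and the owners keeps the rendered report stable from run to run, which makes successive reports easy to compare.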

Underneath all four sits the same three-tier execution model — ingestion, evaluation, and resolution — that gives the engine its determinism.

Ingestion normalizes raw files (PDF, DOCX, XML, JSON) into a canonical representation that preserves formatting metadata, structural hierarchy, and embedded content. This stage typically reuses the same extraction machinery documented in PDF Text Extraction with pdfplumber, so the compliance layer and the intake layer agree on exactly what a “page” or a “span” is.

Evaluation executes rule sets compiled from parsed requirements, applying regular expressions, character-level text extraction, and document object model (DOM) traversal. Python implementations lean on python-docx, pdfplumber, lxml, and pandas to build modular checks.

Resolution generates a structured verdict, routes exceptions to human reviewers, or triggers automated remediation.

The critical design constraint is determinism, which this page also refers to as idempotency: the same input must always yield the same compliance verdict, regardless of execution order, batch size, or concurrent processing. A rule engine whose output depends on the order in which rules happened to register is not auditable, and an audit trail that cannot be reproduced is worthless in a dispute with a sponsored programs office.

The three-tier execution model: ingestion normalizes any input, evaluation applies deterministic rule sets, and resolution converges every path on one structured verdict.

Agency-specific constraint matrix

The reason compliance cannot be a single hardcoded rule set is that NIH, NSF, and DoD encode fundamentally different envelopes, express them in different governing documents, and enforce them through different portals. The engine must be parameterized so that swapping the active agency profile swaps every constraint at once. The matrix below captures the load-bearing differences for the core concern of this layer — the formatting and completeness rules that produce administrative rejection.

Constraint	NIH	NSF	DoD / DARPA
Governing document	FOA/NOFO plus the SF424 (R&R) Application Guide	Proposal & Award Policies & Procedures Guide (PAPPG), versioned, plus the program solicitation	Broad Agency Announcement (BAA) plus the specific topic or call
Research narrative limit	12 pages (R01 Research Strategy); 6 pages (R21)	15 pages (Project Description)	Per-BAA volume caps, often split across technical volumes
Project summary / abstract	1 page	1 page, structured as Overview, Intellectual Merit, Broader Impacts	Executive summary or abstract per BAA
Minimum font size	11 point	10 point	Per BAA, commonly 12 point
Approved typefaces	Arial, Helvetica, Palatino Linotype, Georgia	Arial, Courier New, Palatino Linotype, or similar	Times New Roman or equivalent serif, per BAA
Margins	0.5 inch on all sides	1 inch on all sides	Per BAA, commonly 1 inch
Mandatory ancillary plans	Data Management & Sharing Plan; Authentication of Key Resources	Data Management Plan; Mentoring Plan for postdocs; explicit Broader Impacts	Export-control (ITAR/EAR) statements; security classification markings; cost-reasonableness justification
Conditional triggers	Human-subjects and vertebrate-animal sections activate by scope	Special guidance activates by program (e.g. CAREER, MRI)	Dollar thresholds and foreign-collaborator involvement activate additional matrices
Submission portal	Grants.gov to eRA Commons	Research.gov (or Grants.gov)	eBRAP, Grants.gov, and SAM.gov registration

The agency identifiers in this table map directly onto the requirement schemas produced upstream. The NIH column is populated by the NIH FOA Schema Mapping process, the NSF column by the NSF Proposal Guide Taxonomy — which is also where versioned PAPPG deltas enter the system — and the DoD column by DoD BAA Requirement Extraction. Keeping the compliance engine agnostic to which column it is enforcing is what lets one institution run all three agencies through a single, testable pipeline. The full mapping from solicitation text to these profiles is owned by the Core Architecture & RFP Taxonomy layer that feeds this one.
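To make "swapping the active agency profile swaps every constraint at once" concrete, the sketch below loads two profiles from JSON and runs one page-limit check against either. The profile identifiers, field names, and JSON layout are assumptions for illustration; the numeric values are the NIH R01 and NSF entries from the matrix above, and DoD is omitted because its limits are set per BAA.

```python
import json
from dataclasses import dataclass

PROFILES_JSON = """
{
  "NIH-R01": {"narrative_page_limit": 12, "min_font_pt": 11, "margin_in": 0.5,
              "approved_fonts": ["Arial", "Helvetica", "Palatino Linotype", "Georgia"]},
  "NSF":     {"narrative_page_limit": 15, "min_font_pt": 10, "margin_in": 1.0,
              "approved_fonts": ["Arial", "Courier New", "Palatino Linotype"]}
}
"""


@dataclass(frozen=True)
class AgencyProfile:
    profile_id: str
    narrative_page_limit: int
    min_font_pt: float
    margin_in: float
    approved_fonts: frozenset[str]


def load_profiles(raw: str) -> dict[str, AgencyProfile]:
    """Parse profile JSON into immutable profile objects keyed by identifier."""
    return {
        key: AgencyProfile(
            profile_id=key,
            narrative_page_limit=int(value["narrative_page_limit"]),
            min_font_pt=float(value["min_font_pt"]),
            margin_in=float(value["margin_in"]),
            approved_fonts=frozenset(value["approved_fonts"]),
        )
        for key, value in json.loads(raw).items()
    }


def check_page_limit(profile: AgencyProfile, countable_pages: int) -> tuple[bool, str]:
    """The check itself contains no agency-specific numbers."""
    passed = countable_pages <= profile.narrative_page_limit
    return passed, f"{countable_pages} of {profile.narrative_page_limit} pages ({profile.profile_id})"
```

The same 13-page narrative fails under the `NIH-R01` profile and passes under `NSF` without any change to the checking code.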

Conditional logic and branching rules

Agency rules are not a flat list; they are a decision tree whose branches activate under specific conditions, and some branches override others. The engine has to resolve that tree deterministically before it can render a verdict. Three kinds of branching dominate.

Scope-activated requirements. Many rules exist only when a proposal touches a particular scope. An NIH application that involves human subjects activates an entire block of required forms and narrative sections that are simply absent for a bench-science R01. The engine must detect the scope flag, inject the corresponding rules into the active set, and only then evaluate — otherwise it either misses a genuine omission or flags a section that was never required.

Threshold-activated overrides. DoD solicitations are the sharpest example: crossing a dollar threshold activates cost-reasonableness narrative requirements, and the presence of foreign collaborators activates ITAR (International Traffic in Arms Regulations) and EAR (Export Administration Regulations) compliance matrices. These are override rules — when the trigger fires, a stricter constraint replaces the default rather than adding to it. Encoding override precedence explicitly, rather than relying on rule registration order, is what keeps the verdict reproducible.

Version-activated policy deltas. NSF revises its PAPPG on a schedule, and a proposal is judged against the version in force on its deadline, not the version in force when the pipeline was written. The engine therefore keys its rule set on an effective policy version, so that reprocessing a two-year-old submission reproduces the verdict it received then, not the verdict today’s rules would give.
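Selecting a rule set by effective version can be as simple as a sorted lookup on the proposal deadline. In the sketch below the version labels and effective dates are placeholders, not real PAPPG release dates.

```python
from bisect import bisect_right
from datetime import date

# Placeholder version table, sorted by effective date (not real PAPPG dates).
POLICY_VERSIONS: list[tuple[date, str]] = [
    (date(2023, 1, 1), "policy-2023"),
    (date(2024, 6, 1), "policy-2024"),
]


def effective_version(deadline: date) -> str:
    """Return the policy version in force on the proposal's deadline."""
    effective_dates = [d for d, _ in POLICY_VERSIONS]
    index = bisect_right(effective_dates, deadline) - 1
    if index < 0:
        raise ValueError(f"no policy version in force on {deadline.isoformat()}")
    return POLICY_VERSIONS[index][1]
```

Because the lookup depends only on the deadline stored with the artifact, reprocessing an archived submission selects the same version it was originally judged against.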

The diagram below shows how these branches compose into a single resolution path, from scope detection through threshold overrides to the final verdict.

Scope flags add rules to the set (solid green); DoD dollar-threshold and foreign-collaborator triggers replace the defaults (dashed coral). Encoding override precedence explicitly — not by registration order — is what keeps the verdict reproducible.
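The additive and replacing behaviors can be sketched as two passes over the active rule set. Everything named here (the rule identifiers, the dollar threshold, the `Override` class) is hypothetical; the point is only that precedence is carried by an explicit `priority` field rather than by list order.

```python
from dataclasses import dataclass
from typing import Any, Callable

Rules = dict[str, dict[str, Any]]

DEFAULTS: Rules = {"COST_NARRATIVE": {"required": False}, "EXPORT_CONTROL": {"required": False}}
SCOPE_RULES: dict[str, Rules] = {"human_subjects": {"HUMAN_SUBJECTS_SECTION": {"required": True}}}


@dataclass(frozen=True)
class Override:
    name: str
    priority: int  # higher priority wins when two overrides touch one rule
    applies: Callable[[dict[str, Any]], bool]
    replacements: Rules


OVERRIDES = [
    Override("foreign_collaborators", 20,
             lambda p: bool(p.get("foreign_collaborators")),
             {"EXPORT_CONTROL": {"required": True}}),
    Override("dollar_threshold", 10,
             lambda p: p.get("budget_usd", 0) > 1_000_000,  # placeholder threshold
             {"COST_NARRATIVE": {"required": True}}),
]


def resolve_rules(defaults: Rules, scope_rules: dict[str, Rules],
                  overrides: list[Override], proposal: dict[str, Any]) -> Rules:
    active = {rule_id: dict(params) for rule_id, params in defaults.items()}
    # Pass 1: scope flags add rules and never replace an existing one.
    for flag in sorted(proposal.get("scope_flags", [])):
        for rule_id, params in scope_rules.get(flag, {}).items():
            active.setdefault(rule_id, dict(params))
    # Pass 2: overrides replace defaults, applied in explicit priority order.
    for override in sorted(overrides, key=lambda o: (o.priority, o.name)):
        if override.applies(proposal):
            for rule_id, params in override.replacements.items():
                active[rule_id] = dict(params)
    return dict(sorted(active.items()))
```

Passing the overrides in a different list order yields the same resolved set, because the sort key, not the registration order, decides which replacement lands last.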

Production pipeline implementation

A production-ready rule engine separates rule definition from execution orchestration, validates its own inputs, and never lets one failing rule crash a batch. The implementation below uses Pydantic v2 models to make the rule and verdict schemas self-validating, and sorts rules by identifier so that verdict ordering is a function of the rules alone — never of the order evaluators were registered. This is the concrete realization of the idempotency constraint described above, and it shares its typed-model discipline with the broader Pydantic validation layer used across the ingestion pipeline.

python

from __future__ import annotations

import logging
from enum import Enum
from typing import Any, Callable

from pydantic import BaseModel, Field, field_validator

logging.basicConfig(level=logging.INFO, format="%(levelname)s | %(name)s | %(message)s")
logger = logging.getLogger("compliance_engine")


class Severity(str, Enum):
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"


class ValidationRule(BaseModel):
    rule_id: str
    description: str
    severity: Severity = Severity.ERROR
    required: bool = True

    @field_validator("rule_id")
    @classmethod
    def normalize_id(cls, value: str) -> str:
        token = value.strip().upper()
        if not token:
            raise ValueError("rule_id must be a non-empty token")
        return token


class ComplianceVerdict(BaseModel):
    rule_id: str
    passed: bool
    message: str
    severity: Severity
    metadata: dict[str, Any] = Field(default_factory=dict)


# An evaluator takes normalized document metadata and returns
# (passed, human message, structured metadata).
Evaluator = Callable[[dict[str, Any]], tuple[bool, str, dict[str, Any]]]


class RuleEngine:
    """Deterministic evaluator: identical input yields an identical verdict set."""

    def __init__(self, rules: list[ValidationRule]) -> None:
        # Sort by rule_id so verdict ordering never depends on registration order.
        self.rules = sorted(rules, key=lambda r: r.rule_id)
        self._registry: dict[str, Evaluator] = {}

    def register(self, rule_id: str, evaluator: Evaluator) -> None:
        self._registry[rule_id.strip().upper()] = evaluator

    def evaluate(self, doc: dict[str, Any]) -> list[ComplianceVerdict]:
        verdicts: list[ComplianceVerdict] = []
        for rule in self.rules:
            evaluator = self._registry.get(rule.rule_id)
            if evaluator is None:
                logger.warning("No evaluator for %s; recording as unresolved.", rule.rule_id)
                verdicts.append(ComplianceVerdict(
                    rule_id=rule.rule_id,
                    passed=False,
                    message="No evaluator registered for this rule.",
                    severity=rule.severity,
                    metadata={"unresolved": True},
                ))
                continue
            try:
                passed, message, meta = evaluator(doc)
            except Exception as exc:  # one bad rule must not sink the batch
                logger.error("Rule %s raised: %s", rule.rule_id, exc)
                passed, message, meta = False, f"Runtime error: {exc}", {"exception": str(exc)}
            verdicts.append(ComplianceVerdict(
                rule_id=rule.rule_id,
                passed=passed,
                message=message,
                severity=rule.severity,
                metadata=meta,
            ))
        return verdicts


# --- Agency-parameterized evaluators (loaded from config in production) -------

APPROVED_FONTS: dict[str, set[str]] = {
    "NIH": {"Arial", "Helvetica", "Palatino Linotype", "Georgia"},
    "NSF": {"Arial", "Courier New", "Palatino Linotype"},
    "DoD": {"Times New Roman"},
}


def check_required_sections(doc: dict[str, Any]) -> tuple[bool, str, dict[str, Any]]:
    required = set(doc.get("required_sections", []))
    present = set(doc.get("sections", []))
    missing = sorted(required - present)
    return (
        not missing,
        "All required sections present." if not missing else f"Missing sections: {missing}",
        {"missing": missing},
    )


def check_font_compliance(doc: dict[str, Any]) -> tuple[bool, str, dict[str, Any]]:
    agency = doc.get("agency", "NIH")
    allowed = APPROVED_FONTS.get(agency, set())
    used = set(doc.get("fonts_used", []))
    violations = sorted(used - allowed)
    return (
        not violations,
        "Font compliance verified." if not violations else f"Non-compliant fonts for {agency}: {violations}",
        {"agency": agency, "violations": violations},
    )


if __name__ == "__main__":
    engine = RuleEngine([
        ValidationRule(rule_id="req_sections", description="Verify mandatory headings"),
        ValidationRule(rule_id="font_check", description="Validate typography against NOFO specs"),
    ])
    engine.register("req_sections", check_required_sections)
    engine.register("font_check", check_font_compliance)

    sample_doc = {
        "agency": "NIH",
        "required_sections": ["Project Summary", "Budget Justification", "Biosketch"],
        "sections": ["Project Summary", "Budget Justification"],
        "fonts_used": ["Arial", "Calibri"],
    }

    for verdict in engine.evaluate(sample_doc):
        status = "PASS" if verdict.passed else "FAIL"
        logger.info("[%s] %s: %s", status, verdict.rule_id, verdict.message)

Two design choices carry the deterministic guarantee. Sorting the rules by identifier in the constructor makes the order of the verdict list a pure function of the rule set, so two runs on different machines with different registration order produce identically ordered output, provided the evaluators themselves are deterministic. Wrapping each evaluator call in a try/except that records a failed verdict, rather than propagating the exception, means a single malformed rule degrades one line of the report instead of aborting an overnight batch. For pattern-heavy checks, compile regular expressions once and prefer bounded quantifiers (or possessive quantifiers and atomic groups, which Python's re module supports from version 3.11) so that adversarial or malformed input cannot trigger catastrophic backtracking during high-volume validation.
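As a hedged illustration of the regular-expression advice, the sketch below compiles one pattern at import time and uses only bounded quantifiers. The pattern, the size cap, and the function name are assumptions chosen for the example, not part of the engine above.

```python
import re

# Compiled once at import time. Every quantifier is bounded, so the amount of
# backtracking per starting position is limited regardless of the input.
PAGE_LIMIT = re.compile(r"(?<!\d)(\d{1,3})\s{0,3}-?\s{0,3}pages?\b", re.IGNORECASE)

MAX_INPUT_CHARS = 2_000_000  # illustrative cap on the text handed to the check


def find_page_limits(text: str) -> list[int]:
    """Extract page counts from phrases such as '12 pages' or '15-page'."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds the size this check is calibrated for")
    return [int(match.group(1)) for match in PAGE_LIMIT.finditer(text)]
```

Capping the input length is a second, independent guard: even a well-behaved pattern costs time proportional to the text it scans.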

Institutional scale and failure modes

A rule engine that works on one proposal at a developer’s desk behaves very differently against a research portfolio moving hundreds of submissions through a shared deadline. Several failure modes appear only at institutional scale.

Portfolio contention on shared deadlines. NIH cycle deadlines and NSF program deadlines concentrate load into narrow windows. A single-threaded validator that is comfortable at ten proposals per hour will not clear a queue of three hundred the night before a due date. Compliance validation is embarrassingly parallel — each artifact is independent — so it belongs behind the async batch processor that fans documents across workers while keeping per-verdict output deterministic.
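A minimal sketch of fanning a batch across workers while keeping the output order fixed, using only the standard library. The `validate` callable stands in for whatever per-document check the pipeline runs, and the function name and worker count are assumptions.

```python
import asyncio
from typing import Any, Callable

Verdicts = list[Any]


async def validate_batch(
    docs: dict[str, dict[str, Any]],
    validate: Callable[[dict[str, Any]], Verdicts],
    max_workers: int = 8,
) -> dict[str, Verdicts]:
    """Validate documents concurrently; results are keyed and ordered by id."""
    gate = asyncio.Semaphore(max_workers)  # bound the number of in-flight checks

    async def run_one(doc_id: str) -> tuple[str, Verdicts]:
        async with gate:
            # Run the blocking check in a worker thread.
            return doc_id, await asyncio.to_thread(validate, docs[doc_id])

    # gather() returns results in the order the awaitables were passed in,
    # so sorting the ids fixes the output order regardless of completion order.
    pairs = await asyncio.gather(*(run_one(doc_id) for doc_id in sorted(docs)))
    return dict(pairs)
```

Threads suit checks dominated by file reads or extraction libraries that release the interpreter lock; a CPU-bound workload would more likely call for a process pool, with the same sort-then-gather ordering.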

Malformed and unextractable inputs. At scale, some fraction of submitted PDFs will be scanned images, encrypted, or produced by a generator that does not embed a clean font dictionary. The pipeline must degrade gracefully rather than crash: attempt an alternative extraction library (for example PyMuPDF after pdfplumber returns empty text), fall back to OCR, and if all fail, route the artifact to a manual review queue enriched with diagnostic metadata rather than silently marking it compliant. Failing closed — treating “could not measure” as “not yet passed” — is the only safe default when a rejection is irreversible.
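The fallback chain can be sketched without committing to any particular library by passing extractors in as callables. In practice each callable would wrap pdfplumber, PyMuPDF, or an OCR step; here they are stand-ins, and the result dictionary layout is an assumption.

```python
from typing import Any, Callable

Extractor = Callable[[bytes], str]


def extract_with_fallback(
    pdf_bytes: bytes,
    extractors: list[tuple[str, Extractor]],
) -> dict[str, Any]:
    """Try each extractor in order; route to manual review if none yields text."""
    attempts: list[dict[str, str]] = []
    for name, extract in extractors:
        try:
            text = extract(pdf_bytes)
        except Exception as exc:  # a broken extractor must not end the chain
            attempts.append({"extractor": name, "outcome": f"error: {exc}"})
            continue
        if text.strip():
            attempts.append({"extractor": name, "outcome": "ok"})
            return {"status": "extracted", "text": text, "attempts": attempts}
        attempts.append({"extractor": name, "outcome": "empty"})
    # Fail closed: nothing measurable means "not yet passed", never "compliant".
    return {"status": "manual_review", "text": None, "attempts": attempts}
```

The `attempts` list is the diagnostic metadata a reviewer sees, showing which extractors ran and how each one failed.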

Multi-PI and multi-project drift. A large center grant or a multi-institution BAA response assembles sections authored by many principals, each with their own templates and font habits. Without a shared canonical representation, the same nominal “11-point Arial” arrives as three subtly different font names. Normalizing at ingestion, before evaluation, is what keeps the font and section checks from producing a blizzard of false positives.

Versioned policy skew. When an agency updates its policy mid-cycle, a portfolio can briefly contain proposals bound to two different rule versions at once. The engine’s rule set must be selected per artifact by effective policy version, never by a global “current” flag, or in-flight proposals get judged against rules that did not exist when they were written.

Audit and version control

Because an administrative rejection is defensible only if you can show exactly what was checked, the compliance layer treats every verdict as a durable, versioned record rather than a transient console line. Three practices make the audit trail hold up.

Version the rule set, not just the code. Each evaluation run records the agency profile identifier and the effective policy version it used, alongside a hash of the compiled rule set. Reprocessing an archived submission with the same profile and version must reproduce the original verdict exactly; if it does not, either the input or the rules changed, and the hash tells you which: a matching rule-set hash points to the input, a mismatch points to the rules. This is the same reproducibility guarantee the Pydantic validation layer provides upstream, extended across funding cycles.
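One plausible way to compute such a hash is over a canonical JSON serialization of the rules, so that neither key order nor rule order affects the digest. The record layout below is an assumption, and the input digest is an addition made for this sketch so that the two possible causes of a changed verdict can be told apart directly.

```python
import hashlib
import json
from typing import Any


def rule_set_hash(rules: list[dict[str, Any]]) -> str:
    """SHA-256 over a canonical serialization: sorted rules, sorted keys."""
    canonical = json.dumps(
        sorted(rules, key=lambda r: r["rule_id"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(agency: str, policy_version: str,
               rules: list[dict[str, Any]], input_bytes: bytes) -> dict[str, str]:
    """Audit record stored with every evaluation run (illustrative layout)."""
    return {
        "agency": agency,
        "policy_version": policy_version,
        "rule_set_sha256": rule_set_hash(rules),
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }
```

Comparing two records for the same submission then answers the audit question: if `rule_set_sha256` differs the rules changed, and if `input_sha256` differs the document did.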

Diff verdicts across drafts. As a proposal iterates toward its deadline, the interesting signal is not the current verdict but the change from the last one — which violations were resolved, which were newly introduced, and which thresholds moved a section from warning into failure. Storing structured ComplianceVerdict records rather than rendered text makes a verdict diff a straightforward set comparison, and that diff is what drives the deficiency updates in Automated Checklist Generation.
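The set comparison can be sketched in a few lines, assuming each verdict is available as a dictionary with `rule_id` and `passed` keys (for Pydantic v2 models, `model_dump()` produces such a dictionary). The function name and output keys are illustrative.

```python
from typing import Any


def diff_verdicts(
    previous: list[dict[str, Any]],
    current: list[dict[str, Any]],
) -> dict[str, list[str]]:
    """Compare the failing rule ids of two drafts."""
    prev_failed = {v["rule_id"] for v in previous if not v["passed"]}
    curr_failed = {v["rule_id"] for v in current if not v["passed"]}
    return {
        "resolved": sorted(prev_failed - curr_failed),
        "introduced": sorted(curr_failed - prev_failed),
        "persisting": sorted(prev_failed & curr_failed),
    }
```

A draft that fixes its fonts but drops a required section would show the font rule under `resolved` and the section rule under `introduced`.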

Enable rollback of compliance state. When a policy delta is applied and turns out to be misconfigured, teams need to pin an affected portfolio back to the prior rule version while the fix is validated. Keeping rule sets immutable and addressable by version — rather than editing them in place — makes that rollback a pointer change instead of a data-recovery exercise. Shifting from retrospective manual audits to this proactive, versioned model eliminates preventable submission failures, standardizes compliance across funding mechanisms, and produces the defensible record that sponsored programs offices need when a submission is contested.

Related

Required Section Mapping: isolate countable narrative from exempt material against the solicitation's heading tree.
Page Limit & Font Enforcement: character-level checks for fonts, point sizes, margins, and page counts.
Threshold Tuning for Compliance: tolerance bands that separate rendering artifacts from real violations.
Automated Checklist Generation: turn structured verdicts into routed, human-readable deficiency reports.
Core Architecture & RFP Taxonomy: how solicitation text becomes the agency profiles this engine enforces.
