Threshold Tuning for Compliance

A compliance engine that demands bit-exact conformance to an agency’s formatting rules fails in the field for a mundane reason: real Portable Document Format (PDF) files carry rounding noise and naming drift. An eleven-point heading exports as 10.98 pt, a one-inch margin renders as 0.997 in, and a section the agency calls “Research Strategy” arrives labeled “Project Narrative.” Treat every one of those as a violation and the pipeline drowns research administrators in false alarms until they stop trusting it; ignore them all and a genuinely over-length narrative slips through to the portal. Threshold tuning is the calibration discipline that draws the line between a harmless rendering artifact and an actionable violation, and it is one of the four enforcement concerns inside the Compliance Validation & Rule Engines layer. This page covers how to model tolerance bands as validated data, how to derive their widths from labeled historical submissions rather than guesswork, and how the graduated verdicts they produce feed the rest of the assembly pipeline.

The failure this stage prevents is the erosion of trust that follows a high false-positive rate. When Page Limit & Font Enforcement flags a compliant document because a PDF exporter rounded a point size, and Required Section Mapping rejects a correctly ordered proposal because an institutional template renamed a heading, coordinators learn to override the engine — and once they override reflexively, the engine stops catching the real defects it exists to catch. Calibrated thresholds keep the automatic screen precise enough that a flag means something.

Prerequisites and environment setup

Threshold tuning is pure numeric and rule logic — it consumes scores and deltas that the extraction and enforcement stages already produced, so it has no heavy PDF dependencies of its own. Target Python 3.10 or later for the union-type syntax and structural pattern matching used throughout, and pin Pydantic v2 for the schema layer that keeps a malformed band configuration from silently passing every document.

bash

python3 --version          # expect 3.10.x or newer
python3 -m venv .venv && source .venv/bin/activate
pip install "pydantic>=2.6,<3" "pytest>=8.0"

This stage assumes two inputs already exist. First, per-dimension raw measurements from the upstream checks: an effective point size in points, a minimum margin in inches, a section-similarity score in the unit interval, and an optical character recognition (OCR) confidence in the unit interval. Second, a small corpus of labeled historical submissions: documents a human already judged compliant or non-compliant against the National Institutes of Health (NIH), the National Science Foundation (NSF), and the Department of Defense (DoD) rules. Calibration is only as good as that ground truth, so the corpus should span the institutional templates and PDF toolchains your submitters actually use.

Core mechanism — how tolerance bands work

The central idea is to replace a boolean check with a three-band verdict. Every compliance dimension yields a raw measurement; the band model maps that measurement to one of three states — PASS, REVIEW, or FAIL — instead of a single true/false. The PASS band is what the engine accepts automatically, the FAIL band is what it rejects automatically, and the REVIEW band in between is the deliberate holding zone for borderline documents that a human resolves. Widening the REVIEW band trades throughput for safety; narrowing it does the reverse. That trade is the entire subject of tuning.

Two kinds of thresholds coexist, and conflating them is the most common tuning error. Absolute-delta thresholds apply to physical measurements with real units: a point size may be allowed to fall 0.05 pt below the nominal floor, a margin 0.01 in inside the declared minimum, because those deltas are the known magnitude of PDF-export rounding. Normalized-score thresholds apply to similarity and confidence values already squeezed into the unit interval: a section-mapping score of 0.85 or higher auto-approves, a score from 0.65 up to but not including 0.85 routes to review, and anything below 0.65 fails. The band model below handles the normalized case directly; physical measurements are converted to a normalized headroom score before they enter it, so one engine reasons about every dimension uniformly.

python

from __future__ import annotations

from enum import Enum

from pydantic import BaseModel, Field, field_validator, model_validator


class Verdict(str, Enum):
    PASS = "pass"      # auto-accept
    REVIEW = "review"  # route to a human queue
    FAIL = "fail"      # auto-reject


class ToleranceBand(BaseModel):
    """Graduated thresholds for one compliance dimension.

    A raw score is nudged by ``tolerance_delta`` to absorb known rounding
    noise, then classified against two floors. All scores live in [0, 1].
    """

    dimension: str
    hard_floor: float = Field(description="At or above -> PASS")
    review_floor: float = Field(description="At or above (but below hard_floor) -> REVIEW")
    tolerance_delta: float = Field(default=0.0, ge=0.0, le=0.25)

    @field_validator("hard_floor", "review_floor")
    @classmethod
    def within_unit_interval(cls, value: float) -> float:
        if not 0.0 <= value <= 1.0:
            raise ValueError("band floors must lie in the closed interval [0.0, 1.0]")
        return value

    @model_validator(mode="after")
    def floors_are_ordered(self) -> "ToleranceBand":
        if self.review_floor > self.hard_floor:
            raise ValueError("review_floor cannot exceed hard_floor")
        return self

    def classify(self, raw_score: float) -> Verdict:
        adjusted = min(1.0, max(0.0, raw_score + self.tolerance_delta))
        if adjusted >= self.hard_floor:
            return Verdict.PASS
        if adjusted >= self.review_floor:
            return Verdict.REVIEW
        return Verdict.FAIL

Modeling the band as a Pydantic record rather than a loose tuple is deliberate: the floors_are_ordered validator makes an inverted configuration — a review floor above the hard floor, which would make every document either pass or fail with no review path — impossible to load, and the tolerance_delta ceiling stops an operator from quietly setting a delta so large that it swallows genuine violations. A compliance parameter that a batch job can silently misconfigure is not auditable.

The three graduated states reflect how a scored document is routed through those two floors.

One raw score, two floors, three verdicts: the review floor and the hard floor carve the [0, 1] axis into auto-reject, human-review, and auto-accept regions, and the tolerance delta shifts a borderline score just far enough to cross a floor.
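The absolute-delta case described earlier in this section has no code of its own, so here is a minimal sketch of it, not the engine's actual check. The function name `meets_floor_with_slack`, the `EPSILON` guard, and the sample values are assumptions chosen for illustration. It compares a physical measurement against its floor in native units, before any normalization happens.

```python
# Hypothetical helper: the name and the epsilon guard are illustrative assumptions.
EPSILON = 1e-9  # absorbs binary floating-point representation error only


def meets_floor_with_slack(measured: float, floor: float, slack: float) -> bool:
    """True when ``measured`` is no more than ``slack`` below ``floor``.

    All three arguments share one physical unit (points or inches).
    """
    if slack < 0:
        raise ValueError("slack must be non-negative")
    return measured >= floor - slack - EPSILON


# An 11 pt heading exported as 10.98 pt is inside the 0.05 pt slack.
print(meets_floor_with_slack(10.98, floor=11.0, slack=0.05))  # True
# A 10.5 pt body font is a real violation, not export rounding.
print(meets_floor_with_slack(10.50, floor=11.0, slack=0.05))  # False
# A one-inch margin rendered as 0.997 in is inside the 0.01 in slack.
print(meets_floor_with_slack(0.997, floor=1.0, slack=0.01))   # True
```

Keeping the slack in the measurement's own unit makes the configured value directly comparable to the rounding error a given PDF exporter is known to introduce.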

Calibration-aware implementation

Choosing the floors and the delta by intuition is where tuning goes wrong. The production pattern is to derive them from labeled history so the band hits an explicit error target. Frame it as a classification problem: on the labeled corpus, a false positive is a truly compliant document the engine would auto-FAIL, and a false negative is a truly non-compliant document the engine would auto-PASS. Raising tolerance_delta drives the false-positive rate down and the false-negative rate up; the calibrator walks the delta upward and stops at the smallest value that keeps the false-positive rate under a target while the false-negative rate stays inside its own ceiling.

python

from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledCase:
    """One historical measurement with its human-assigned ground truth."""

    raw_score: float
    truly_compliant: bool


def _error_rates(
    cases: list[LabeledCase], band: ToleranceBand
) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) over the corpus."""
    compliant = [c for c in cases if c.truly_compliant]
    non_compliant = [c for c in cases if not c.truly_compliant]

    # A truly-compliant doc auto-failed is a false positive.
    fp = sum(band.classify(c.raw_score) is Verdict.FAIL for c in compliant)
    # A truly-non-compliant doc auto-passed is a false negative.
    fn = sum(band.classify(c.raw_score) is Verdict.PASS for c in non_compliant)

    fpr = fp / len(compliant) if compliant else 0.0
    fnr = fn / len(non_compliant) if non_compliant else 0.0
    return fpr, fnr


def calibrate_delta(
    cases: list[LabeledCase],
    dimension: str,
    hard_floor: float,
    review_floor: float,
    target_fpr: float = 0.03,
    max_fnr: float = 0.0,
    step: float = 0.005,
    max_delta: float = 0.25,
) -> ToleranceBand:
    """Smallest tolerance_delta meeting the false-positive target without
    admitting a genuine violation beyond ``max_fnr``."""
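    if step <= 0:
        raise ValueError("step must be positive")
    # If even a zero delta breaches max_fnr, this zero-delta band is returned
    # unchanged; the floors, not the delta, need revisiting in that case.
    best = ToleranceBand(
        dimension=dimension, hard_floor=hard_floor,
        review_floor=review_floor, tolerance_delta=0.0,
    )
    # Derive each delta from an integer index so float error never accumulates
    # and the final candidate (max_delta itself) is not skipped.
    for i in range(int(max_delta / step + 1e-9) + 1):
        band = ToleranceBand(
            dimension=dimension, hard_floor=hard_floor,
            review_floor=review_floor, tolerance_delta=round(i * step, 4),
        )
        fpr, fnr = _error_rates(cases, band)
        if fnr > max_fnr:
            break  # further widening would only admit more real violations
        best = band
        if fpr <= target_fpr:
            return band
    return best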

The default max_fnr=0.0 encodes a hard rule for federal submission: never let the tolerance band pass a document a human labeled non-compliant, even if that costs some false positives. A missing biosketch or an over-limit narrative is a categorical rejection, so the calibrator refuses to buy a lower false-positive rate by admitting a real defect. For advisory dimensions where a near-miss is genuinely harmless, an operator can relax max_fnr, but that relaxation is now an explicit, reviewable argument rather than a magic constant buried in the enforcement code.
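To make the trade-off concrete, the sketch below sweeps three deltas over a tiny invented corpus and prints both error rates. It is an illustration under stated assumptions: the `classify` function is a standard-library stand-in that mirrors ToleranceBand.classify so the example runs on its own, and the corpus scores are made up, not real submission data.

```python
from __future__ import annotations


# Stand-in mirroring ToleranceBand.classify so this sketch runs on its own.
def classify(score: float, hard_floor: float, review_floor: float, delta: float) -> str:
    adjusted = min(1.0, max(0.0, score + delta))
    if adjusted >= hard_floor:
        return "pass"
    if adjusted >= review_floor:
        return "review"
    return "fail"


# Invented corpus of (raw_score, truly_compliant) pairs, for illustration only.
CORPUS = [
    (0.64, True), (0.60, True), (0.90, True), (0.70, True),
    (0.80, False), (0.50, False),
]


def rates(
    delta: float, hard_floor: float = 0.85, review_floor: float = 0.65
) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at one delta."""
    good = [s for s, ok in CORPUS if ok]
    bad = [s for s, ok in CORPUS if not ok]
    fp = sum(classify(s, hard_floor, review_floor, delta) == "fail" for s in good)
    fn = sum(classify(s, hard_floor, review_floor, delta) == "pass" for s in bad)
    return fp / len(good), fn / len(bad)


for delta in (0.0, 0.02, 0.06):
    fpr, fnr = rates(delta)
    print(f"delta={delta:.2f}  fpr={fpr:.2f}  fnr={fnr:.2f}")
# delta=0.00  fpr=0.50  fnr=0.00
# delta=0.02  fpr=0.25  fnr=0.00
# delta=0.06  fpr=0.00  fnr=0.50
```

On this invented corpus the false-positive rate only reaches zero at a delta that also lifts the 0.80 non-compliant case over the hard floor. With max_fnr=0.0 the calibrator would stop before that point and return a band that still misses the false-positive target, which is a signal to revisit the floors or the labels rather than to widen the delta further.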

Physical measurements enter this machinery through a normalization step so the same band logic serves point sizes and similarity scores alike. Headroom above a floor becomes a unit-interval score, and the previously separate point-size delta from Page Limit & Font Enforcement — the point_tolerance field on its formatting rule — is exactly this dimension’s tolerance_delta expressed in points instead of a normalized fraction:

python

def normalize_headroom(measured: float, floor: float, span: float) -> float:
    """Map a physical measurement to a [0, 1] compliance score.

    ``span`` is the measurement range treated as fully compliant headroom,
    e.g. 1.0 pt above an 11 pt floor. Values below the floor go negative
    before clamping, which the band then reads as FAIL territory.
    """
    if span <= 0:
        raise ValueError("span must be a positive measurement range")
    return min(1.0, max(0.0, (measured - floor) / span))
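The unit conversion implied above, from a tolerance in points to a normalized tolerance_delta, can be sketched as a one-line division by the same span. The helper name `points_to_normalized_delta` is hypothetical and the values are illustrative; this is an assumption about how the two fields relate, not code taken from the enforcement stage.

```python
def points_to_normalized_delta(point_tolerance: float, span: float) -> float:
    """Express a tolerance given in physical units as a fraction of ``span``."""
    if span <= 0:
        raise ValueError("span must be positive")
    if point_tolerance < 0:
        raise ValueError("point_tolerance must be non-negative")
    return point_tolerance / span


# 0.05 pt of export slack over a 1.0 pt headroom span.
print(points_to_normalized_delta(0.05, span=1.0))  # 0.05
# The same slack over a 2.0 pt span is half as large in normalized terms.
print(points_to_normalized_delta(0.05, span=2.0))  # 0.025
```

One caveat worth checking in your own pipeline: because normalize_headroom clamps every below-floor measurement to 0.0, a delta added after normalization cannot tell 10.98 pt from 10.50 pt. If export rounding below the floor must be absorbed, one option is to apply the slack in native units before normalizing.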

The verdicts these bands emit are consumed directly by Automated Checklist Generation, which reflects each dimension’s calibrated width in the compliance_weight it assigns — a near-miss on a low-weight dimension surfaces as a warning, a FAIL on a weight-1.0 dimension hard-blocks packaging.

Agency-specific configuration

The three agencies do not merely set different limits; they justify different tolerance widths, because their documents travel through different toolchains and their screens punish different things. NIH’s grants portal is aggressive about point size and countable pages, so those bands run tight; NSF draws its countable/exempt boundary differently and its structural naming drifts more across programs, so its section-mapping band is wider; DoD restates its envelope in every Broad Agency Announcement (BAA), so its bands are hydrated per solicitation rather than fixed. The table captures representative starting points that calibration then refines against local history.

Threshold dimension	NIH	NSF	DoD (Broad Agency Announcement)
Point-size tolerance	11 pt floor, ±0.05 pt for export rounding, never below 10.95 pt	10 pt floor, ±0.05 pt	Per-BAA floor (often 10–12 pt), ±0.05 pt
Margin tolerance	0.5 in floor, 0.01 in slack for renderer drift	1.0 in floor, 0.01 in slack	Per-BAA (commonly 1.0 in), 0.01 in slack
Section-similarity `hard_floor`	0.88 (stable heading nomenclature)	0.82 (more template variance)	0.85, per-BAA heading set
Section-similarity `review_floor`	0.68	0.62	0.65
OCR-confidence floor for image PDFs	0.90 before trusting extracted text	0.90	0.90
Target false-positive rate	0.02 (deadline-day volume)	0.03	0.03

python

NIH_SECTION_BAND = ToleranceBand(
    dimension="section_similarity",
    hard_floor=0.88, review_floor=0.68, tolerance_delta=0.0,
)

NSF_SECTION_BAND = ToleranceBand(
    dimension="section_similarity",
    hard_floor=0.82, review_floor=0.62, tolerance_delta=0.0,
)

# Placeholder starting point: DoD bands are hydrated from the extracted solicitation before use.
DOD_SECTION_BAND = ToleranceBand(
    dimension="section_similarity",
    hard_floor=0.85, review_floor=0.65, tolerance_delta=0.0,
)

The DoD row is the volatile case. Because each BAA restates its own formatting clause, the DoD bands are almost never used at their defaults; they are hydrated from the extracted solicitation requirements produced by the DoD BAA requirement extraction workflow before the engine runs. NIH and NSF starting points, by contrast, trace to the standing envelopes described in the NIH FOA schema mapping and NSF proposal guide taxonomy references. NSF's Proposal & Award Policies & Procedures Guide (PAPPG) is versioned, and that versioning is what makes NSF’s section naming drift and thus its band wider.
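Per-solicitation hydration can be sketched as a merge of extracted values over the standing defaults. Everything named here is an assumption for illustration: the `resolve_section_band` function, the shape of the override dictionary, the `provenance` field, and the solicitation identifier `BAA-EXAMPLE-001` are hypothetical, and the real extractor output may look different. The floor values are the starting points from the table.

```python
from __future__ import annotations

# Starting points copied from the table above.
AGENCY_SECTION_DEFAULTS: dict[str, dict[str, float]] = {
    "NIH": {"hard_floor": 0.88, "review_floor": 0.68},
    "NSF": {"hard_floor": 0.82, "review_floor": 0.62},
    "DoD": {"hard_floor": 0.85, "review_floor": 0.65},
}


def resolve_section_band(
    agency: str,
    overrides: dict[str, float] | None = None,
    solicitation_id: str | None = None,
) -> dict[str, float | str]:
    """Merge solicitation-specific floors over the standing agency default."""
    floors = dict(AGENCY_SECTION_DEFAULTS[agency])
    provenance = f"{agency} standing default"
    if overrides:
        floors.update(overrides)  # the more specific document wins
        provenance = f"solicitation {solicitation_id or 'unidentified'}"
    if floors["review_floor"] > floors["hard_floor"]:
        raise ValueError("review_floor cannot exceed hard_floor")
    return {**floors, "provenance": provenance}


# Hypothetical BAA that tightens the hard floor and keeps the review floor.
print(resolve_section_band("DoD", {"hard_floor": 0.90}, "BAA-EXAMPLE-001"))
```

The two floors in the result can then be passed to ToleranceBand, with the provenance string stored alongside the band so a reviewer can see where the threshold came from.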

Error handling and edge cases

Calibrated bands still meet inputs that break naive logic, and each needs an explicit branch rather than a default that silently mislabels a document.

Boundary thrash. A document sitting within rounding distance of a floor (10.98 pt against an 11 pt floor) must not flip between PASS and FAIL across re-exports of the same source. The REVIEW band is the anti-thrash mechanism: anything within a delta of the floor lands in review and stays there until a human resolves it, rather than oscillating. If a dimension instead thrashes across the review boundary itself, widen that band rather than nudging the delta.
Hysteresis on re-runs. When the manifest is a live ledger and a document is re-scored after revision, applying a slightly different threshold than the prior run makes the audit diff noisy. Pin the band configuration by version and record which version scored each run, so a status change always reflects a document change, never a silent threshold change.
Empty or image-only pages. A scanned page yields no extractable spans, which a naive normalized score reads as a trivial 1.0 “pass.” Gate every dimension behind the OCR-confidence band first: if confidence is below its floor, the document routes to the OCR fallback or a manual queue before any typographic or structural band even runs.
Missing calibration corpus. For a brand-new solicitation with no labeled history, calibrate_delta has nothing to fit. Fall back to the conservative agency defaults in the table above and mark the band provisional, so downstream reporting can note that its widths are untuned until real submissions accumulate.
Conflicting rule sources. When a solicitation-specific clause overrides a standing agency default — a BAA permitting a wider margin than the standing DoD template — the more specific document wins, and the band it produces must record its provenance so a reviewer can see why the threshold is what it is. That precedence lives with the requirement extractor, not here; this stage consumes whichever band the resolved rule handed it.
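Two of the branches above, version pinning and the OCR gate, are small enough to sketch with the standard library. The function names `band_config_version` and `bands_may_run`, the plain-dictionary band representation, and the 12-character fingerprint length are assumptions made for this example rather than part of the engine.

```python
from __future__ import annotations

import hashlib
import json


def band_config_version(bands: list[dict]) -> str:
    """Stable fingerprint of a band configuration for the audit ledger.

    Each band is serialized with sorted keys and the serialized bands are
    sorted, so neither key order nor list order changes the fingerprint.
    """
    canonical = json.dumps(sorted(json.dumps(b, sort_keys=True) for b in bands))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def bands_may_run(ocr_confidence: float | None, ocr_floor: float = 0.90) -> bool:
    """False means: route to OCR fallback or a manual queue before scoring."""
    if ocr_confidence is None:  # no confidence recorded: do not trust the text
        return False
    return ocr_confidence >= ocr_floor


config = [
    {"dimension": "section_similarity", "hard_floor": 0.88, "review_floor": 0.68},
    {"dimension": "ocr_confidence", "hard_floor": 0.90, "review_floor": 0.90},
]
print(band_config_version(config))  # record this value with every scoring run
print(bands_may_run(0.95))          # True
print(bands_may_run(0.42))          # False
```

Recording the fingerprint next to each verdict set lets an audit diff distinguish a document change from a threshold change.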

Integration with downstream pipeline

Threshold tuning is a stage, not an endpoint. Its per-dimension verdicts converge into an aggregate routing decision, and that decision is what the rest of the assembly pipeline reads. The band verdicts feed Automated Checklist Generation, which turns each REVIEW or FAIL into a routed, clause-traceable deficiency line; the aggregate gate feeds the document assembler, which blocks packaging while any weight-1.0 dimension is failing. Upstream, the raw scores these bands consume come from the coordinate-aware geometry of PDF text extraction with pdfplumber and the boundary detection of NLP section boundary detection, so a “score” means the same thing at measurement time and at classification time.

The diagram below traces how the calibrated bands sit between raw measurement and the routing decision the pipeline acts on: four independent dimensions are classified in parallel, their verdicts collapse into one aggregate under a fixed precedence, and that aggregate fans out to the pipeline consumers — with the review queue looping resolved verdicts back in.

Each dimension is classified independently, the aggregate router collapses the four verdicts under a fixed precedence, and only the review queue feeds a verdict back — so a document changes state only when the document, or a reviewer, actually changes.
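The aggregate step can be sketched as follows. The precedence is an assumption: this example lets the worst verdict win (FAIL over REVIEW over PASS), which is one plausible reading of "fixed precedence" and not a rule stated by the pipeline. The `Verdict` enum is repeated so the sketch runs on its own, and `aggregate`, `blocks_packaging`, and the weights are hypothetical names and values.

```python
from __future__ import annotations

from enum import Enum


class Verdict(str, Enum):  # copy of the enum defined earlier on this page
    PASS = "pass"
    REVIEW = "review"
    FAIL = "fail"


# Assumed precedence: a higher number dominates a lower one.
_SEVERITY = {Verdict.PASS: 0, Verdict.REVIEW: 1, Verdict.FAIL: 2}


def aggregate(verdicts: dict[str, Verdict]) -> Verdict:
    """Collapse per-dimension verdicts into one routing decision."""
    if not verdicts:
        return Verdict.REVIEW  # nothing was measured, so ask a human
    return max(verdicts.values(), key=_SEVERITY.__getitem__)


def blocks_packaging(verdicts: dict[str, Verdict], weights: dict[str, float]) -> bool:
    """True while any weight-1.0 dimension is failing."""
    return any(
        verdict is Verdict.FAIL and weights.get(dimension, 0.0) >= 1.0
        for dimension, verdict in verdicts.items()
    )


verdicts = {
    "point_size": Verdict.PASS,
    "margin": Verdict.PASS,
    "section_similarity": Verdict.REVIEW,
    "ocr_confidence": Verdict.PASS,
}
weights = {"point_size": 1.0, "margin": 1.0, "section_similarity": 0.6}
print(aggregate(verdicts).value)            # review
print(blocks_packaging(verdicts, weights))  # False
```

Keeping the packaging block separate from the aggregate verdict means a FAIL on a low-weight dimension can still surface in the checklist without halting assembly.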

Testing and verification

Because the band model is pure numeric logic, it tests deterministically with no fixtures or mocked I/O. The suite pins three properties the pipeline relies on: a malformed band fails at construction, classification lands each score in the correct state, and the calibrator honors its error targets on a known corpus.

python

import pytest
from pydantic import ValidationError


def test_inverted_floors_are_rejected_at_load() -> None:
    with pytest.raises(ValidationError):
        ToleranceBand(dimension="x", hard_floor=0.60, review_floor=0.80)


def test_classification_lands_in_correct_band() -> None:
    band = ToleranceBand(dimension="section", hard_floor=0.85, review_floor=0.65)
    assert band.classify(0.90) is Verdict.PASS
    assert band.classify(0.74) is Verdict.REVIEW
    assert band.classify(0.50) is Verdict.FAIL


def test_tolerance_delta_absorbs_boundary_rounding() -> None:
    # 0.84 nudged by 0.02 clears a 0.85 hard floor.
    band = ToleranceBand(
        dimension="font", hard_floor=0.85, review_floor=0.65, tolerance_delta=0.02,
    )
    assert band.classify(0.84) is Verdict.PASS


def test_calibrator_never_admits_a_real_violation() -> None:
    cases = [
        LabeledCase(raw_score=0.83, truly_compliant=True),   # rounding false alarm
        LabeledCase(raw_score=0.86, truly_compliant=True),
        LabeledCase(raw_score=0.60, truly_compliant=False),  # genuine defect
    ]
    band = calibrate_delta(
        cases, dimension="section", hard_floor=0.85, review_floor=0.65,
    )
    # The 0.60 non-compliant case must never be nudged into PASS.
    assert band.classify(0.60) is not Verdict.PASS


def test_normalize_headroom_clamps_below_floor() -> None:
    # 10.90 pt against an 11 pt floor is below the envelope -> 0.0.
    assert normalize_headroom(10.90, floor=11.0, span=1.0) == 0.0
    assert normalize_headroom(11.5, floor=11.0, span=1.0) == 0.5

Beyond the unit suite, a pre-submission verification checklist keeps the tuning honest against real solicitations:

Every band’s hard_floor and review_floor trace to a documented agency envelope or a calibrated value, never a bare literal in the enforcement code.
Re-scoring an unchanged document under a pinned band version produces an identical verdict set — no threshold drift between runs.
The measured false-positive rate on the holdout corpus is at or below the agency target in the configuration table.
No band admits a document the ground-truth corpus labeled non-compliant.
A provisional band with no calibration history is flagged as such in the deficiency report.
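The holdout item in this checklist presumes that some labeled documents were kept out of calibration. A minimal sketch of one way to do that is below; the hash-based split, the 20 percent default, and the names `in_holdout` and `holdout_false_positive_rate` are assumptions for illustration, not a prescribed procedure.

```python
from __future__ import annotations

import hashlib


def in_holdout(document_id: str, holdout_fraction: float = 0.2) -> bool:
    """Deterministically assign a document to the holdout set by hashing its id.

    The same id always lands on the same side, so re-running calibration
    never leaks a holdout document into the fitting set.
    """
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < holdout_fraction * 10_000


def holdout_false_positive_rate(results: list[tuple[str, bool]]) -> float:
    """``results`` holds (verdict, truly_compliant) pairs for holdout documents."""
    compliant = [verdict for verdict, ok in results if ok]
    if not compliant:
        return 0.0
    return sum(verdict == "fail" for verdict in compliant) / len(compliant)


# Invented holdout results, for illustration only.
results = [("fail", True), ("pass", True), ("review", True), ("pass", True), ("fail", False)]
print(holdout_false_positive_rate(results))  # 0.25
```

The measured rate is then compared against the agency target from the configuration table; a rate above target on the holdout set suggests the band was fitted too closely to the calibration documents.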

Calibrated tolerance bands are what let an automatic screen stay both strict and trusted: strict because a genuine violation never slips through, trusted because a flag is rarely a false alarm. Treating thresholds as versioned, corpus-derived parameters rather than hardcoded constants keeps that balance stable across funding cycles and agency guidance updates.

Related

Page Limit & Font Enforcement — produces the point-size and margin measurements these bands calibrate.
Required Section Mapping — produces the section-similarity scores classified here.
Automated Checklist Generation — consumes band verdicts and reflects each width in an item’s compliance weight.
DoD BAA Requirement Extraction — hydrates the per-solicitation DoD bands that have no fixed default.
NLP Section Boundary Detection — supplies the boundary confidence that feeds the section-similarity dimension.
