Training spaCy for grant proposal section detection

A section detector that misreads a heading fails a proposal silently: every downstream page-limit, font, and required-section check then fires against the wrong span of text, and the error only surfaces when the National Institutes of Health (NIH) eRA Commons or the Grants.gov validator rejects the package after the deadline. Deterministic regex parsers break the moment a principal investigator renames “Specific Aims” or an NSF applicant nests a subsection the template did not anticipate. Training a spaCy spancat (span categorizer) component on real solicitations replaces that brittleness with a learned boundary model that generalizes across formatting drift. This page is the model-training endpoint of the NLP section boundary detection workflow: it walks corpus assembly, spancat configuration and training, agency-specific edge cases, and audit-safe validation, so that a National Science Foundation (NSF) or Department of Defense (DoD) submission is segmented into typed, machine-actionable sections before any compliance rule runs against it.

The build has four phases: assemble and annotate a corpus, configure and train the spancat component, resolve adversarial and cross-agency edge cases, then validate the detector’s output before it feeds the pipeline.

Phase 1 — Assemble and annotate the training corpus

A spancat model is only as reliable as the boundaries it was shown, so the corpus has to capture both the canonical structure of each funding mechanism and the real-world deviations that break naive parsers. This stage consumes already-extracted plain text with reading order preserved — feed raw agency PDFs through PDF text extraction with pdfplumber first so headers, footers, and multi-column layout are reconstructed before a single token reaches the annotator.

Implementation steps:

Aggregate historical submissions. Compile a representative set of funded and rejected proposals across every target mechanism (NIH R01, NSF CAREER, DoD Broad Agency Announcement responses). Strip watermarks, normalize Unicode whitespace such as the non-breaking space and zero-width space before annotation begins, and preserve paragraph boundaries so the character offsets recorded later stay valid.
Define a span-labeling schema. Align labels to the agency compliance matrices the pipeline already enforces — SPECIFIC_AIMS, RESEARCH_STRATEGY, BUDGET_JUSTIFICATION, BROADER_IMPACTS — so a detected span maps one-to-one onto a downstream rule.
Annotate exact character offsets. Record start and end offsets for both section headers and their content spans, and label transitional phrases (“The following section addresses…”) as boundary markers so the model learns the boundary as a property of surrounding language, not line shape.
Sample adversarial structure deliberately. Include proposals with misplaced appendices, merged subsections, unauthorized formatting, and missing mandatory sections. These teach the detector to recognize violations rather than memorize a static heading string.
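
The normalization and offset steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the helper names `normalize_whitespace` and `check_annotation` and the sample text are hypothetical. The point it shows is that whitespace has to be normalized before offsets are recorded, otherwise a stored span no longer selects the text the annotator saw.

```python
# Hypothetical helpers for illustration; the names are not from the original pipeline.
NBSP = "\u00a0"  # non-breaking space
ZWSP = "\u200b"  # zero-width space


def normalize_whitespace(text: str) -> str:
    """Replace non-breaking spaces with plain spaces and drop zero-width spaces."""
    return text.replace(NBSP, " ").replace(ZWSP, "")


def check_annotation(text: str, start: int, end: int, expected: str) -> bool:
    """Return True when the offsets select exactly the annotated surface string."""
    return 0 <= start < end <= len(text) and text[start:end] == expected


# Made-up sample: a heading containing a non-breaking space and a trailing zero-width space.
raw = "Specific\u00a0Aims\u200b\nAim 1: characterize the pathway."
clean = normalize_whitespace(raw)

# An annotation record made against the normalized text.
header = {"label": "SPECIFIC_AIMS", "start": 0, "end": 13, "text": "Specific Aims"}

offsets_valid_on_clean = check_annotation(clean, header["start"], header["end"], header["text"])
offsets_valid_on_raw = check_annotation(raw, header["start"], header["end"], header["text"])
```

Running the check against both strings shows why the order matters: the same offsets match the normalized text but not the raw extraction.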

Serialize annotated documents to spaCy’s DocBin for efficient, version-controllable I/O, and keep an audit trail of every annotation decision so the training set itself can survive a compliance review.
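
A minimal sketch of that serialization step, assuming annotations arrive as (phrase, label) pairs and that spans are stored under the `sc` key used later in the config. The sample text and the `annotated` list are made up for illustration; in practice the `DocBin` would be written with `to_disk("./data/train.spacy")` rather than round-tripped through bytes.

```python
import spacy
from spacy.tokens import DocBin

# Made-up sample document and annotations, for illustration only.
text = (
    "Specific Aims\nAim 1: characterize the pathway.\n"
    "Budget Justification\nPersonnel costs."
)
annotated = [
    ("Specific Aims", "SPECIFIC_AIMS"),
    ("Budget Justification", "BUDGET_JUSTIFICATION"),
]

nlp = spacy.blank("en")  # tokenizer only; no trained components are needed here
doc = nlp.make_doc(text)

spans = []
for phrase, label in annotated:
    start = text.index(phrase)
    # char_span returns None when the offsets do not align with token boundaries.
    span = doc.char_span(start, start + len(phrase), label=label)
    if span is None:
        raise ValueError(f"Offsets for {label!r} do not align with token boundaries")
    spans.append(span)

doc.spans["sc"] = spans  # same key as spans_key in the training config

doc_bin = DocBin(docs=[doc])
payload = doc_bin.to_bytes()

# Round-trip to confirm that the span group and its labels survive serialization.
restored = list(DocBin().from_bytes(payload).get_docs(nlp.vocab))
restored_labels = [span.label_ for span in restored[0].spans["sc"]]
```

Raising on a `None` span, rather than skipping it, keeps a misaligned annotation from silently disappearing from the training set.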

Phase 2 — Configure and train the spancat component

spancat is the correct primitive for this task rather than a named-entity recognizer, because section headers and their content spans overlap and a single token can belong to more than one label. The core transformation is expressed declaratively in config.cfg: a shared tok2vec feeds a span categorizer whose predictions land under a named spans key.

ini

[paths]
train = "./data/train.spacy"
dev = "./data/dev.spacy"

[system]
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["tok2vec", "spancat"]

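[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.spancat]
factory = "spancat"
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

# Listen to the shared tok2vec component instead of embedding a private copy.
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

# Default suggester: proposes only 1- to 3-token candidates.
# Widen sizes (or register a custom suggester) if labeled spans run longer.
[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1, 2, 3]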

[training]
max_epochs = 20
# patience counts training steps without improvement, not epochs
patience = 1600

[training.optimizer]
@optimizers = "Adam.v1"

Run the training loop with early stopping and validation checkpointing:

bash

python -m spacy train config.cfg --output ./output --gpu-id 0

Monitor the spancat loss together with spans_sc_f, spans_sc_p, and spans_sc_r (spaCy names the span scores after the spans key, here sc), and target a development F1 score of at least 0.88 before promoting the model to staging. The component writes its predictions to doc.spans["sc"], the same string given as spans_key in the config, and the per-span confidence values live in the parallel doc.spans["sc"].attrs["scores"] array, which Phase 4 uses for gating. The end-to-end flow runs from corpus aggregation through training to the F1 gate, and only a model that clears the gate moves on to deployment.
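
The promotion gate can be checked mechanically. The sketch below assumes the trained pipeline directory (for example ./output/model-best) contains the meta.json that spaCy writes, and that the span F-score is recorded there under performance as spans_sc_f, following spaCy's convention of naming span scores after the spans key. The function name `passes_f1_gate` is hypothetical.

```python
import json
from pathlib import Path

F1_GATE = 0.88  # promotion threshold stated in the text


def passes_f1_gate(
    model_dir: str, metric: str = "spans_sc_f", gate: float = F1_GATE
) -> bool:
    """Read one metric from the pipeline's meta.json and compare it with the gate.

    A missing metric counts as a failure so an unevaluated model is never promoted.
    """
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    score = meta.get("performance", {}).get(metric)
    return score is not None and score >= gate
```

A deployment script could call `passes_f1_gate("./output/model-best")` and refuse to copy the model to staging when it returns False.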

Phase 3 — Edge cases and agency-specific overrides

Two forces break a section detector in production: adversarial document structure the training set under-represents, and the fact that a section schema tuned for one agency does not transfer to another. The inference wrapper therefore reads scores explicitly and stays deterministic about how spans become sections:

python

import spacy


class GrantSectionDetector:
    """Load a trained spancat model and emit typed, scored section spans."""

    def __init__(self, model_path: str) -> None:
        self.nlp = spacy.load(model_path)

    def extract_sections(self, text: str) -> dict[str, list[dict]]:
        doc = self.nlp(text)
        group = doc.spans.get("sc")
        scores = group.attrs.get("scores", []) if group is not None else []
        sections: dict[str, list[dict]] = {}
        for idx, span in enumerate(group or []):
            confidence = float(scores[idx]) if idx < len(scores) else None
            sections.setdefault(span.label_, []).append({
                "text": span.text,
                "start_char": span.start_char,
                "end_char": span.end_char,
                "confidence": confidence,
            })
        return sections

Cross-agency schema drift. The same conceptual section carries different names, and sometimes has no equivalent, across mechanisms. A model trained only on NIH labels will silently drop an NSF BROADER_IMPACTS section or mislabel a DoD cost volume. Keep the label set a superset of all target agencies, drive per-agency expectations from a configuration table rather than the model weights, and cross-check each agency's column against its required section mapping:

Section concept	NIH R01 label	NSF (PAPPG) label	DoD BAA label
Objectives statement	`SPECIFIC_AIMS`	folded into Project Description	`OBJECTIVES`
Narrative core	`RESEARCH_STRATEGY`	`PROJECT_DESCRIPTION`	`TECHNICAL_APPROACH`
Societal value	not required	`BROADER_IMPACTS`	not required
Budget rationale	`BUDGET_JUSTIFICATION`	`BUDGET_JUSTIFICATION`	`COST_VOLUME`
Prior support	varies by resubmission	`RESULTS_PRIOR_SUPPORT`	not required
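
One way to hold that table as data rather than in the model is sketched below. The labels come from the table above; which of them count as mandatory for each agency is an assumption made for illustration (conditional sections such as prior support are left out), and the names `AGENCY_REQUIRED` and `missing_sections` are hypothetical.

```python
# Hypothetical configuration table. The labels are taken from the table above;
# the choice of which ones are mandatory per agency is an illustrative assumption.
AGENCY_REQUIRED: dict[str, frozenset[str]] = {
    "NIH_R01": frozenset({"SPECIFIC_AIMS", "RESEARCH_STRATEGY", "BUDGET_JUSTIFICATION"}),
    "NSF_PAPPG": frozenset({"PROJECT_DESCRIPTION", "BROADER_IMPACTS", "BUDGET_JUSTIFICATION"}),
    "DOD_BAA": frozenset({"OBJECTIVES", "TECHNICAL_APPROACH", "COST_VOLUME"}),
}


def missing_sections(agency: str, found_labels: list[str]) -> list[str]:
    """Return the required labels for the agency that the detector did not find."""
    if agency not in AGENCY_REQUIRED:
        raise KeyError(f"No required-section entry for agency {agency!r}")
    return sorted(AGENCY_REQUIRED[agency] - set(found_labels))


# Example: an NSF proposal in which no BROADER_IMPACTS span was detected.
nsf_gaps = missing_sections("NSF_PAPPG", ["PROJECT_DESCRIPTION", "BUDGET_JUSTIFICATION"])
```

Because the table lives outside the weights, adding a mechanism or changing a requirement is a data change and does not require retraining.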

Adversarial structure at inference time. OCR artifacts from scanned legacy proposals, merged headings, and unauthorized markdown formatting all produce low-confidence spans rather than clean failures. Rather than trust them, gate on the per-span confidence score and route ambiguous documents to review, the same discipline the NSF CAREER mandatory-section mapping applies when a required span is missing. For deployment, load the model once and keep it in memory behind a FastAPI endpoint or Celery workers; when a batch spans hundreds of solicitations overnight, hand the queue to the asyncio batch processor so cold-start latency amortizes across the run.
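
The batch hand-off might look roughly like the sketch below. It is not the asyncio batch processor referenced above, only an illustration of the idea: a blocking detector callable is run over many texts with bounded concurrency. `detect_batch` and `fake_detect` are hypothetical names, and `fake_detect` stands in for a loaded model's `extract_sections` method. Whether one loaded spaCy pipeline can safely be shared across threads is something to verify for the actual deployment.

```python
import asyncio
from typing import Callable

Sections = dict[str, list[dict]]


async def detect_batch(
    texts: list[str],
    detect: Callable[[str], Sections],
    max_concurrency: int = 4,
) -> list[Sections]:
    """Run a blocking detector over many texts, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(text: str) -> Sections:
        async with semaphore:
            # to_thread keeps the event loop responsive while the model runs.
            return await asyncio.to_thread(detect, text)

    # gather returns results in the same order as the input texts.
    return await asyncio.gather(*(run_one(text) for text in texts))


def fake_detect(text: str) -> Sections:
    """Stand-in for a real detector; returns one span covering the whole text."""
    return {"SPECIFIC_AIMS": [{"text": text, "start_char": 0, "end_char": len(text)}]}


batch_results = asyncio.run(detect_batch(["doc one", "doc two", "doc three"], fake_detect, 2))
```

The semaphore bounds how many documents are in flight at once, which keeps memory use predictable when the queue holds hundreds of solicitations.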

Phase 4 — Validation and audit-safe verification

A detected section set only earns trust once it is cross-checked against the agency’s required-section schema and the result is bound to the exact input that produced it. Run rule-based validation over the extracted sections, then serialize an immutable, timestamped record — the reproducible baseline the compliance validation rule engines consume downstream. The typed record itself is hardened by the same Pydantic validation layer the rest of the ingestion pipeline relies on.

python

import hashlib
import json
from datetime import datetime, timezone

# NIH R01 defaults; swap in the per-agency set before validating NSF or DoD documents.
REQUIRED_SECTIONS: set[str] = {
    "SPECIFIC_AIMS",
    "RESEARCH_STRATEGY",
    "BUDGET_JUSTIFICATION",
}
CONFIDENCE_FLOOR = 0.65


def validate_compliance(
    extracted: dict[str, list[dict]], raw_text: str, doc_id: str
) -> dict:
    """Cross-check detected sections against the schema and emit an audit record."""
    found = set(extracted.keys())
    violations: list[dict] = []

    missing = REQUIRED_SECTIONS - found
    if missing:
        violations.append(
            {"type": "MISSING_SECTION", "details": sorted(missing), "severity": "CRITICAL"}
        )

    for label, spans in extracted.items():
        if len(spans) > 1:
            violations.append(
                {"type": "DUPLICATE_SECTION", "label": label, "severity": "WARNING"}
            )
        for span in spans:
            score = span.get("confidence")
            if score is not None and score < CONFIDENCE_FLOOR:
                violations.append(
                    {"type": "LOW_CONFIDENCE", "label": label, "severity": "REVIEW"}
                )

    return {
        "doc_id": doc_id,
        "compliant": not violations,
        "violations": violations,
        "input_sha256": hashlib.sha256(raw_text.encode()).hexdigest(),
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }

Before releasing the record, walk a short acceptance checklist that fixtures alone miss:

Confirm every required section for the document’s agency is present, not just the NIH default set.
Confirm no span below the 0.65 confidence floor was accepted silently — low-confidence documents belong in the manual review queue with their boundaries highlighted for adjudication.
Confirm a duplicate label is treated as a merged-heading signal, not a benign repeat.
Confirm that when the model returns zero spans, a deterministic regex fallback fires against known agency heading patterns and the fallback invocation is logged as a compliance deviation rather than passing an empty result downstream.
Confirm the record binds the SHA-256 of the exact input and a UTC timestamp, and that it lands in append-only storage so every structural deviation traces back to a precise model version and input snapshot.

Only a document that clears every check — and produces a record with compliant: true — is safe to hand to the assembly and submission stage.
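
The regex fallback named in the checklist can be sketched as below. The heading patterns are hypothetical examples rather than a vetted list of agency headings, and the function names are illustrative. The sketch returns sections in the same shape as the inference wrapper, with confidence set to None, and logs every invocation so the deviation is visible.

```python
import logging
import re

logger = logging.getLogger("section_detector")


def _heading(phrase: str) -> re.Pattern[str]:
    """Compile a pattern matching a line that contains only the heading phrase."""
    return re.compile(
        rf"^[ \t]*({phrase})[ \t]*:?[ \t]*$", re.IGNORECASE | re.MULTILINE
    )


# Hypothetical heading patterns, for illustration only.
FALLBACK_PATTERNS: dict[str, re.Pattern[str]] = {
    "SPECIFIC_AIMS": _heading(r"specific aims"),
    "RESEARCH_STRATEGY": _heading(r"research strategy"),
    "BUDGET_JUSTIFICATION": _heading(r"budget justification"),
}


def regex_fallback(text: str, doc_id: str) -> dict[str, list[dict]]:
    """Find known headings with regex and log the call as a compliance deviation."""
    logger.warning("Regex fallback used for %s; logging compliance deviation", doc_id)
    sections: dict[str, list[dict]] = {}
    for label, pattern in FALLBACK_PATTERNS.items():
        for match in pattern.finditer(text):
            sections.setdefault(label, []).append({
                "text": match.group(1),
                "start_char": match.start(1),
                "end_char": match.end(1),
                "confidence": None,  # regex matches carry no model score
            })
    return sections


def with_fallback(
    model_sections: dict[str, list[dict]], text: str, doc_id: str
) -> dict[str, list[dict]]:
    """Use the model's sections when present; otherwise fall back to regex."""
    return model_sections if model_sections else regex_fallback(text, doc_id)
```

Because fallback spans carry no confidence value, the low-confidence check in the validation function skips them, so a record produced this way should be flagged by the logged deviation rather than treated as a normal pass.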

Frequently asked questions

Why use spancat instead of a named-entity recognizer for sections?

Named-entity recognition assigns each token to at most one non-overlapping label, but a section header and its content span overlap, and a heading line can belong to more than one conceptual boundary. spancat is built for overlapping, multi-label spans, so it can label both the header and the enclosing content region without forcing a single flat segmentation.
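
The difference is easy to see on a toy document. In this sketch the HEADER label and the sample sentence are made up for illustration: two spans that share tokens can be stored together in a span group, while assigning the same pair to doc.ents raises an error because entities may not overlap.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp.make_doc("Specific Aims Aim 1 tests the hypothesis.")

# Two overlapping spans: the heading itself and the region that encloses it.
header = Span(doc, 0, 2, label="HEADER")  # hypothetical label
content = Span(doc, 0, len(doc), label="SPECIFIC_AIMS")

# A span group accepts overlapping spans without complaint.
doc.spans["sc"] = [header, content]

# doc.ents requires non-overlapping spans, so the same pair is rejected.
try:
    doc.ents = [header, content]
    ents_accepted = True
except ValueError:
    ents_accepted = False
```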

What development F1 score is safe before promoting the model?

Target at least 0.88 on a held-out development set that includes adversarial samples, not only clean proposals. A model that scores well only on canonical structure will collapse on OCR artifacts and merged headings, so the dev set has to mirror the messiness of real submissions before the number means anything.

Where does the per-span confidence score actually live?

spancat does not attach a score to each Span object. The confidences are stored in the parallel doc.spans["sc"].attrs["scores"] array, aligned by index to the spans in doc.spans["sc"]. Read them by index, as the inference wrapper in Phase 3 does, rather than expecting a score attribute on the span itself.

Should the pipeline ever fall back to regex?

Yes — as a bounded safety net, never as the primary parser. When the model returns zero spans or every span falls below the confidence floor, a deterministic regex pass against known agency heading patterns keeps a malformed document from producing a silently empty result. Log every fallback as a compliance deviation so the gap in model coverage is visible and can be annotated back into the training set.

Can one trained model serve NIH, NSF, and DoD documents?

One model can, if its label set is a superset of every target agency’s sections and per-agency required-section expectations are driven from a configuration table rather than baked into the weights. Keep the labels generic and the agency rules data-driven, and the NIH R01, the NSF PAPPG proposal, and the DoD BAA response differ only in which labels are mandatory, not in the model you load.

Related

NLP section boundary detection — the parent workflow this trained model plugs into.
PDF text extraction with pdfplumber — the coordinate-aware extraction that feeds the corpus and inference input.
Required section mapping — the schema the detected sections are validated against.
Schema validation with Pydantic — the typed layer that hardens the audit record.
Asyncio patterns for processing 100 RFPs overnight — how batch inference scales across hundreds of solicitations.
