Async Batch Processing for Large RFPs

When a federal funding cycle drops dozens of solicitations across the National Institutes of Health (NIH), the National Science Foundation (NSF), and the Department of Defense (DoD) inside a two-week window, the failure mode is not a single bad parse — it is a missed deadline caused by a serial pipeline that cannot clear its queue overnight. This is the specific compliance risk that asynchronous batch processing exists to eliminate: turning a backlog of hundreds of unstructured PDFs into pre-validated compliance matrices before research administrators arrive at 06:00 ET. It is the throughput layer of the broader RFP Ingestion & Parsing Workflows system, sitting between raw document intake and the structured records that feed proposal assembly. A batch processor that blocks on a single 400-page DoD Broad Agency Announcement (BAA), leaks memory across a run, or silently drops a malformed NSF announcement does not just run slowly — it produces an incomplete compliance picture that no downstream validator can recover.

The core engineering problem is decoupling I/O-bound work (disk reads, portal polling, database writes) from CPU-bound work (PDF text extraction and natural-language section detection) so that neither starves the other. Python’s asyncio event loop coordinates the I/O and scheduling; a process pool absorbs the CPU-heavy extraction. Get that separation right and throughput scales close to linearly with cores while memory stays bounded; get it wrong and the event loop stalls behind a synchronous parser, back-pressure disappears, and a single 500 MB solicitation exhausts the heap.

Prerequisites and environment setup

The pipeline targets Python 3.10 or newer, because it relies on structural pattern matching, @dataclass(slots=True), and modern type-hint syntax such as str | None and list[Path] without typing imports. asyncio.TaskGroup is preferred on 3.11+, but the examples on this page use asyncio.gather, which also runs on 3.10. Install the runtime dependencies:

bash

python -m pip install "pdfplumber>=0.11" "pydantic>=2.6" "aiofiles>=23.2" "anyio>=4.2"

Three assumptions about the source documents shape every design choice below:

File format is PDF, not HTML or XML. NIH Funding Opportunity Announcements (FOAs), NSF program solicitations, and DoD BAAs are distributed as PDFs — often scanned or multi-column — so extraction is CPU-bound and must run outside the event loop. The extraction mechanics themselves live in PDF Text Extraction with pdfplumber; this page treats that extractor as a synchronous callable to be offloaded.
Batch size is unbounded and bursty. A quarter-end cycle can deliver 300+ documents in a day. The processor must apply concurrency limits rather than launching one task per file, or it will exceed portal rate limits and the host’s file-descriptor ceiling.
Documents are heterogeneous in size. A single-page NSF Dear Colleague Letter and a 400-page DoD BAA appendix travel through the same queue, so per-task timeouts and memory ceilings are mandatory, not optional.

A minimal project layout keeps the async orchestration, the synchronous extractors, and the validation schema in separate modules so the CPU-bound code can be pickled and shipped to worker processes cleanly:

text

ingest/
  orchestrator.py   # asyncio event loop, queue, semaphore, DB writes
  extractors.py     # synchronous pdfplumber calls (run in ProcessPoolExecutor)
  schema.py         # Pydantic v2 models for the compliance matrix

Core mechanism — offloading CPU work without blocking the loop

The defining rule of this architecture is that no synchronous, CPU-heavy call ever runs directly inside a coroutine. pdfplumber renders pages and reconstructs layout geometry synchronously; calling it on the event loop would freeze scheduling, progress tracking, and database writes for every other document in the batch. The fix is to hand extraction to a ProcessPoolExecutor via loop.run_in_executor, which returns an awaitable the loop can suspend on while workers grind through pages in parallel across cores.

python

import asyncio
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from extractors import extract_text_sync  # synchronous, CPU-bound pdfplumber call


async def dispatch_pdf_parsing(
    file_paths: list[Path],
    max_workers: int = 4,
) -> list[str | BaseException]:
    """Offload CPU-heavy PDF parsing to a process pool, awaited concurrently.

    ProcessPoolExecutor (not ThreadPoolExecutor) sidesteps the GIL for
    genuinely parallel text extraction. return_exceptions=True ensures one
    corrupt PDF cannot abort the entire overnight batch.
    """
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        tasks = [
            loop.run_in_executor(executor, extract_text_sync, path)
            for path in file_paths
        ]
        return await asyncio.gather(*tasks, return_exceptions=True)

Two choices in that block are load-bearing. ProcessPoolExecutor rather than ThreadPoolExecutor is required because text extraction is CPU-bound and the Global Interpreter Lock would serialize threads; separate processes give true parallelism. And return_exceptions=True converts a per-document failure into a returned value. Without it, the first exception propagates out of gather and the results of every other document are lost to the caller, so one malformed BAA could sink a 300-document run. With it, the failures are inspected and routed afterward.

Threads still have a place. For genuinely I/O-bound steps — reading files off disk, POSTing results to a database — asyncio.to_thread() is lighter than spawning processes, because there is no pickling cost and no GIL contention on work that spends its time waiting. The rule of thumb: process pool for parsing, threads or native async for I/O.
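For the I/O side, here is a minimal sketch of asyncio.to_thread wrapping a blocking file read. read_bytes_async and read_many are hypothetical helper names; the point is only that the loop stays free while a thread waits on disk.

```python
import asyncio
from pathlib import Path


async def read_bytes_async(path: Path) -> bytes:
    """Read a file in a worker thread so the event loop is not blocked."""
    # Path.read_bytes is blocking I/O; to_thread runs it in the default thread pool.
    return await asyncio.to_thread(path.read_bytes)


async def read_many(paths: list[Path]) -> list[bytes]:
    """Read several files concurrently, preserving input order."""
    return await asyncio.gather(*(read_bytes_async(p) for p in paths))
```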

Production concurrency implementation

A real overnight run needs three properties the naive gather above lacks: a bounded concurrency limit so the pool is never handed 300 files at once, a per-document timeout so a pathological PDF cannot hang a worker forever, and structured exception routing so failures land in a review queue instead of scrolling past in a log. A semaphore-guarded worker coroutine delivers all three, and pairs cleanly with the extraction and validation stages that surround it.

python

import asyncio
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
from pathlib import Path

from extractors import extract_text_sync
from schema import RFPComplianceMatrix, validate_compliance
from sections import parse_sections  # assumed module name for the section detector


@dataclass(slots=True)
class BatchResult:
    path: Path
    matrix: RFPComplianceMatrix | None
    error: str | None


async def process_one(
    path: Path,
    executor: ProcessPoolExecutor,
    sem: asyncio.Semaphore,
    timeout_s: float = 90.0,
) -> BatchResult:
    """Extract, then validate a single solicitation under a concurrency cap."""
    loop = asyncio.get_running_loop()
    async with sem:  # cap simultaneous work regardless of batch size
        try:
            raw_text = await asyncio.wait_for(
                loop.run_in_executor(executor, extract_text_sync, path),
                timeout=timeout_s,
            )
        except asyncio.TimeoutError:
            return BatchResult(path, None, f"extraction timeout after {timeout_s}s")
        except Exception as exc:  # noqa: BLE001 — quarantine, never abort the batch
            return BatchResult(path, None, f"extraction failed: {exc!r}")

    try:
        matrix = validate_compliance(parse_sections(raw_text))
    except ValueError as exc:
        return BatchResult(path, None, f"validation failed: {exc}")
    return BatchResult(path, matrix, None)


async def run_batch(
    file_paths: list[Path],
    max_workers: int = 4,
    max_concurrency: int = 8,
) -> list[BatchResult]:
    sem = asyncio.Semaphore(max_concurrency)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        coros = [process_one(p, executor, sem) for p in file_paths]
        return await asyncio.gather(*coros)

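The max_concurrency semaphore is deliberately decoupled from max_workers: the pool has, say, four CPU workers, but eight documents can hold a semaphore slot at once, four being extracted and four already queued so that no worker idles between jobs. Validation runs after the slot is released, so a document being validated does not count against the cap. This keeps the CPU cores saturated without letting the in-memory working set grow without bound. One caveat: the wait_for clock starts at submission, so time spent queued behind other documents counts against timeout_s.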

Between extraction and validation sits parse_sections, which locates eligibility criteria, submission deadlines, budget caps, and evaluation rubrics inside the raw text. Because federal solicitations never share a consistent layout, static regular expressions are not reliable here; the boundary logic belongs to NLP Section Boundary Detection, which the batch processor invokes per document and which can itself be offloaded to the pool for transformer-based classifiers. Its output is a plain dict handed straight to the validation layer.

That validation layer is where structure becomes contractual. Every parsed solicitation is coerced into a strict Pydantic v2 model before it is allowed near the review queue, using the pattern documented in Schema Validation with Pydantic:

python

from datetime import date

from pydantic import BaseModel, Field, ValidationError, field_validator


class RFPComplianceMatrix(BaseModel):
    solicitation_id: str = Field(pattern=r"^[A-Z0-9\-]+$")
    agency: str
    deadline: date
    budget_cap_usd: float | None = Field(default=None, ge=0)
    eligibility_criteria: list[str] = Field(min_length=1)
    evaluation_rubric: dict[str, float] = Field(default_factory=dict)

    @field_validator("agency")
    @classmethod
    def known_agency(cls, v: str) -> str:
        allowed = {"NIH", "NSF", "DoD"}
        if v not in allowed:
            raise ValueError(f"agency must be one of {sorted(allowed)}, got {v!r}")
        return v


def validate_compliance(raw_data: dict) -> RFPComplianceMatrix:
    try:
        return RFPComplianceMatrix.model_validate(raw_data)
    except ValidationError as exc:
        # Surface a flat message so the batch worker can quarantine this record.
        raise ValueError(str(exc)) from exc

Using Pydantic v2 field_validator (not the deprecated @validator) keeps the schema forward-compatible, and raising a plain ValueError from validate_compliance lets process_one treat a schema breach exactly like an extraction failure: quarantine the record, keep the batch moving.

Agency-specific configuration

The three agencies differ in how fast they will let an automated client poll, how large their documents run, and where the parsed record ultimately syncs. Tuning the concurrency cap, per-document timeout, and poll interval per agency prevents both rate-limit lockouts and wasted overnight hours. Treat the table below as starting values validated against typical solicitation profiles, not hard federal limits.

Parameter	NIH	NSF	DoD
Primary portal	Grants.gov + eRA Commons	Research.gov	SAM.gov / eBRAP
Typical document size	30–80 pp FOA	15–40 pp solicitation	100–400 pp BAA
Suggested `max_concurrency`	8	10	4
Suggested per-doc `timeout_s`	90	60	240
Poll interval (portal watcher)	15 min	15 min	30 min
Dominant parse hazard	Multi-column FOA tables	Dense reference sections	Scanned appendices / OCR

python

from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class AgencyProfile:
    agency: str
    max_concurrency: int
    timeout_s: float
    poll_interval_s: int


AGENCY_PROFILES: dict[str, AgencyProfile] = {
    "NIH": AgencyProfile("NIH", max_concurrency=8, timeout_s=90.0, poll_interval_s=900),
    "NSF": AgencyProfile("NSF", max_concurrency=10, timeout_s=60.0, poll_interval_s=900),
    "DoD": AgencyProfile("DoD", max_concurrency=4, timeout_s=240.0, poll_interval_s=1800),
}

The DoD profile intentionally lowers concurrency and raises the timeout: BAA appendices are large and frequently scanned, so each document consumes more memory and more wall-clock time, and running fewer of them in parallel keeps the resident set inside the host’s ceiling.

Error handling and edge cases

A batch that runs unattended overnight has no operator to nurse it, so every foreseeable failure must resolve to a routed record rather than a crash. The recurring hazards:

Corrupt or password-protected PDFs. pdfplumber raises on open. Because process_one wraps extraction in a broad except, the document is quarantined with its error string and the batch continues. Never let one file’s exception propagate into the shared gather.
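Runaway documents. A malformed page tree can send a parser into effectively unbounded work. The asyncio.wait_for timeout is the backstop; on TimeoutError the record is quarantined and the semaphore slot released. The timeout cancels only the awaiting side, though: a ProcessPoolExecutor job that has already started cannot be cancelled, so the worker process stays busy until the extraction call returns. If hangs are common, enforce a deadline inside the worker or recycle the pool. Set the timeout from the agency profile, not a global constant.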
Memory pressure on large batches. The semaphore caps how many documents are resident at once, but the process pool itself should be sized to min(cpu_count, memory_budget // peak_doc_mb). For DoD BAAs, peak per-document memory dominates the calculation.
Silent partial extraction. A page that yields empty text is not an exception — it is worse, because it validates as a thin but structurally valid record. Guard against it before validation by asserting a minimum extracted character count and routing suspiciously short documents to manual review.
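Poison-pill records that kill a worker. A segfault in a native PDF dependency can kill a pool process outright. ProcessPoolExecutor then raises BrokenProcessPool for every job still pending, and the pool cannot be reused. Because BrokenProcessPool is an ordinary Exception subclass, the broad except in process_one would record it as a per-document failure; re-raise it there so it can be caught at the run_batch level, where the pool is rebuilt and only the unfinished paths are resubmitted rather than restarting the whole run.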

python

from pathlib import Path

MIN_CHARS = 500  # below this, treat extraction as failed, not merely sparse


def guard_extraction(path: Path, raw_text: str) -> str:
    """Reject extractions too short to be a real solicitation."""
    n_chars = len(raw_text.strip())
    if n_chars < MIN_CHARS:
        raise ValueError(f"suspiciously short extraction from {path.name} ({n_chars} chars)")
    return raw_text

The consistent principle: convert every failure into a BatchResult carrying an error, then report the quarantine set as a first-class output. A run that processes 297 of 300 documents and clearly names the three it could not is a success; a run that aborts on document 41 is not.

Integration with the downstream pipeline

The batch processor is a stage, not an endpoint. Its output, a list of validated RFPComplianceMatrix records plus a quarantine set, feeds three consumers: the persistence layer that writes each matrix to the grant-management database, the Compliance Validation & Rule Engines that apply agency-specific pass/fail rules to the structured fields, and the review dashboard that surfaces quarantined documents for human triage. The event loop owns the database writes, through an async driver or asyncio.to_thread, because they are I/O-bound and do not belong in the CPU pool.

The diagram below shows how the async dispatch layer separates I/O scheduling from CPU-bound extraction, then rejoins the streams at validation before handing off downstream.

For the concurrency internals behind this stage — event-loop tuning, back-pressure, and memory management across a full 100-document overnight run — see Asyncio patterns for processing 100+ RFPs overnight.

Testing and verification

Concurrency bugs hide until load exposes them, so the test suite must exercise the failure paths deliberately rather than trusting a happy-path run. Verify three properties: that one bad document never aborts a batch, that the timeout actually fires, and that the concurrency cap is respected under a large input.

python

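import asyncio
import shutil
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pytest

import orchestrator
from orchestrator import process_one, run_batch

# Assumption: a known-good sample solicitation is checked in beside the tests.
SAMPLE_PDF = Path(__file__).parent / "fixtures" / "sample_nih_foa.pdf"


@pytest.mark.asyncio
async def test_one_corrupt_pdf_does_not_abort_batch(tmp_path: Path) -> None:
    good = [tmp_path / f"nih_{i}.pdf" for i in range(5)]
    for path in good:
        shutil.copy(SAMPLE_PDF, path)  # real, parseable inputs
    bad = tmp_path / "corrupt.pdf"
    bad.write_bytes(b"%PDF-1.4 not really a pdf")
    results = await run_batch(good + [bad], max_workers=2, max_concurrency=3)

    assert len(results) == 6
    failed = [r.path for r in results if r.error is not None]
    assert failed == [bad]  # only the corrupt one
    assert all(r.matrix is not None for r in results if r.error is None)


@pytest.mark.asyncio
async def test_timeout_is_enforced(monkeypatch, tmp_path: Path) -> None:
    def slow_extract(_path: Path) -> str:
        time.sleep(0.5)  # far longer than the 0.1 s budget below
        return ""

    # Patch the name process_one looks up; a thread pool keeps the patch in-process.
    monkeypatch.setattr(orchestrator, "extract_text_sync", slow_extract)

    # A hanging extraction must resolve to a quarantined result, not a hang.
    with ThreadPoolExecutor(max_workers=1) as executor:
        result = await asyncio.wait_for(
            process_one(tmp_path / "slow.pdf", executor=executor,
                        sem=asyncio.Semaphore(1), timeout_s=0.1),
            timeout=2.0,
        )
    assert result.error and "timeout" in result.error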
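
The third property, the concurrency cap, can be checked by counting in-flight workers. The sketch below uses a stand-in coroutine rather than the real process_one, so measure_peak_in_flight and its parameters are assumptions for illustration; the same counter could be placed inside a patched extractor.

```python
import asyncio


async def measure_peak_in_flight(n_items: int, cap: int) -> int:
    """Run n_items stand-in jobs under a semaphore and return the peak concurrency."""
    sem = asyncio.Semaphore(cap)
    in_flight = 0
    peak = 0

    async def job() -> None:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # yield so other jobs get a chance to start
            in_flight -= 1

    await asyncio.gather(*(job() for _ in range(n_items)))
    return peak
```

A test would assert that the returned peak never exceeds the cap, for example with 50 jobs and a cap of 3.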

Beyond automated tests, a pre-handoff checklist confirms the batch is safe to release downstream:

Every input path appears exactly once in the results (no dropped documents).
The quarantine set is non-empty only for genuinely unparseable files, each with a specific error string.
Peak resident memory stayed within the host budget for the largest DoD BAA in the batch.
Every non-quarantined record passed RFPComplianceMatrix.model_validate with a recognized agency.
Wall-clock time came in well below the serial baseline, with a speedup roughly in line with the number of pool workers.

Only when those hold does the batch clear for the compliance engines and proposal-assembly stages that depend on it.

Related pages

PDF Text Extraction with pdfplumber — the synchronous extractor this pipeline offloads to the process pool.
NLP Section Boundary Detection — locates eligibility, deadline, and budget sections in each extracted document.
Schema Validation with Pydantic — the strict model every parsed solicitation must satisfy.
Asyncio patterns for processing 100+ RFPs overnight — event-loop tuning and memory management at scale.
Compliance Validation & Rule Engines — the downstream consumer that applies agency pass/fail rules to the structured output.

