Categorization stage (two-pass + single-pass)

First real implementation of the categorization pipeline stage. Ports two of the three legacy Gemini-via-OpenRouter strategies. Introduces the backend's first LLM integration, a per-item DB table, DB-backed categorizer settings, and a Storage abstraction with a local-disk implementation.

Spec Draft 2026-05-19 · pending user review

TL;DR · Summary

What this changes

The categorization stage is a stub today (features/pipeline/stages.py:25). Legacy has three Gemini-via-OpenRouter variants under checkin-pipeline/app/steps/classify/; we port two-pass (current legacy default) and single-pass.

The slice does more than translate two strategy modules. Categorization is the first stage that needs a per-item DB record (deal_items; segmentation populates it), calls an LLM (new integrations/openrouter/, predictable home for appraisal next), and reads files via storage (new core/storage.py protocol; LocalFsStorage only — Azure backend is a follow-up slice that just implements the protocol). Strategy selection mirrors segmentation: a singleton categorizer_settings row + build_categorizer() factory. Per-item failures don't fail the stage; re-runs reprocess pending + failed.

§1 Context

What exists

features/pipeline/stages.py:25 — categorization_stage is a return None stub. STAGES/NEXT_STAGE already route SEGMENTATION → CATEGORIZATION → APPRAISAL.
features/segmentation/ — precedent we mirror. step.py entry, config.py with kind enum + build_segmenter() factory, settings.py over a singleton DB row.
features/pipeline/models.py — PipelineRun audit row. No token-usage columns today.
core/settings.py:19 — data_dir: Path = Path("data"). The only storage primitive in the backend; segmentation pokes pathlib.Path directly.
features/deal_images/ — stores URLs of source photos, not bytes on disk. Not relevant to crop storage.
No backend code imports any LLM SDK; httpx>=0.27 is the only HTTP dependency.

Legacy categorization (the source we are porting)

checkin-pipeline/app/steps/classify/two_pass/ — two Gemini vision calls (category → category-scoped subcategory). Legacy default.
checkin-pipeline/app/steps/classify/single_pass/ — one Gemini vision JSON call returning both fields.
checkin-pipeline/app/steps/classify/describe_classify/ — vision-describe + text-classify. Not ported in this slice.
_shared.py — CLASSIFICATION_SCHEMA, resolve_subcategory(), format_size_suffix(), strip_code_fences().
deal_step.py — orchestrator: registry-based variation loader, per-deal Semaphore(10), token accumulation on pipeline_runs.
app/gemini/client.py — OpenRouter HTTP client with three wrappers (vision_request, vision_json_request, text_json_request).
config/categories.toml (11 categories) + subcategories.csv (per-category list with hint metadata).
app/steps/_prompts.py reads prompt .md files on every call (no caching); injects {categories} and {subcategory_tree}.

§2 Scope (in / out)

In scope

New core/storage.py: Storage protocol + LocalFsStorage. VCC_STORAGE_KIND=local (default).
New api/routers/files.py: GET /files/{key:path} proxy for LocalFsStorage.
New integrations/openrouter/: client.py (lifted), schemas.py.
New features/categorization/: models, settings, categorizer protocol, two strategies, factory, taxonomy loader, stage entry, prompts, TOML+CSV.
Extend run_segmentation_stage: insert one DealItem per crop; compute physical width_cm / height_cm from app_settings.photo_area.* + source-image dims; refactor I/O onto Storage.
New generic app_settings KV table (key/JSONB) + get_setting / set_setting helpers in core/. Seeds photo_area.width_cm / photo_area.height_cm as JSON null.
Migration 0016_categorization.py: deal_items (incl. width_cm / height_cm), categorizer_settings, app_settings + seed rows, llm_usage JSONB NULL on pipeline_runs.
Wire categorization_stage in features/pipeline/stages.py.
Worker startup validates VCC_OPENROUTER_API_KEY.
Manual smoke script backend/scripts/smoke_categorize.py (one fixture image; hand-run).

Out of scope

Describe-then-classify strategy.
Azure Blob Storage implementation (follow-up).
Source-image download from deal_images.url into storage.
Admin UI for editing categorizer_settings or app_settings.photo_area.*.
Retrofitting existing typed singletons (segmenter_settings) onto app_settings — future slice once we have 3+ such tables.
Prompt-management or taxonomy-editing UI.
Per-deal categorizer-strategy override.
Live OpenRouter calls in CI.
Per-LLM-call llm_calls audit table.

§3 Architecture

Directory layout

backend/src/vcc_backend/
├── core/
│   └── storage.py                   # NEW — Storage Protocol + LocalFsStorage
├── api/
│   └── routers/
│       └── files.py                 # NEW — GET /files/{key:path}
├── integrations/
│   └── openrouter/                  # NEW — first LLM integration
│       ├── __init__.py
│       ├── client.py                # lifted from checkin-pipeline/app/gemini/client.py
│       └── schemas.py               # ChatResponse, JsonResponse, TokenUsage
├── features/
│   ├── segmentation/
│   │   └── step.py                  # EXTENDED — storage I/O + DealItem insert
│   ├── categorization/              # NEW
│   │   ├── __init__.py
│   │   ├── models.py                # DealItem, CategorizerSettings ORM
│   │   ├── schemas.py               # pydantic Settings types
│   │   ├── settings.py              # get_categorizer_settings(session) + upsert
│   │   ├── categorizer.py           # Categorizer Protocol, dataclasses
│   │   ├── two_pass.py              # TwoPassCategorizer
│   │   ├── single_pass.py           # SinglePassCategorizer
│   │   ├── factory.py               # build_categorizer(settings, http)
│   │   ├── taxonomy.py              # loads TOML+CSV; resolve_subcategory()
│   │   ├── step.py                  # run_categorization_stage(session, deal)
│   │   ├── config/
│   │   │   ├── categories.toml
│   │   │   └── subcategories.csv
│   │   └── prompts/
│   │       ├── two_pass.category.md
│   │       ├── two_pass.subcategory.md
│   │       └── single_pass.md
│   └── pipeline/
│       └── stages.py                # EDITED — wire run_categorization_stage

Sequence

process_deal(deal_id)                                   [Procrastinate worker]
  │
  ▼
pipeline.orchestrator.run_pipeline(session, deal)
  │
  ├─► run_segmentation_stage(session, deal)
  │     │
  │     ├─► load photo_area.{width_cm,height_cm} from app_settings
  │     ├─► storage.write_bytes(key, crop_png_bytes)                      [per crop]
  │     └─► INSERT deal_items (status=pending, crop_key,
  │           width_cm  = round(bbox.w / img_w_px * area_w_cm, 1) | NULL,
  │           height_cm = round(bbox.h / img_h_px * area_h_cm, 1) | NULL, …)  [per crop]
  │
  ├─► COMMIT (segmentation transaction)
  │
  ├─► run_categorization_stage(session, deal)
  │     │
  │     ├─► load DealItem WHERE status IN ('pending','failed')
  │     ├─► settings = await get_categorizer_settings(session)
  │     ├─► categorizer = await build_categorizer(settings, http)   [Two-pass | Single-pass]
  │     ├─► async with TaskGroup, Semaphore(settings.concurrency):
  │     │     ├─► storage.read_bytes(item.crop_key)
  │     │     ├─► size_hint = SizeHint(item.width_cm, item.height_cm) if both else None
  │     │     ├─► categorizer.classify(image_bytes, size_hint)
  │     │     ├─► → 1 or 2 calls to OpenRouterClient
  │     │     ├─► → resolve_subcategory(label) → canonical id
  │     │     └─► UPDATE deal_item SET category/subcategory/…, status='classified'
  │     │         (per-item exception → SET status='failed', log; do NOT raise,
  │     │          except auth/quota errors, which abort the stage)
  │     ├─► PipelineRun.llm_usage = per-model aggregates
  │     └─► (orchestrator commits stage transaction)
  │
  └─► run_appraisal_stage(…)                            [stub today]

Why a core/storage.py seam

Segmentation today writes crops via raw pathlib.Path under data_dir. Categorization needs to read those crops. Adding a second feature that pokes data_dir directly would mean rewriting both when Azure lands. A Storage protocol contains the blast radius: Azure becomes a new class, not a refactor. LocalFsStorage is the only implementation in this slice; the API surface is pinned so the future Azure backend is mechanical.

Why integrations/openrouter/ and not features/categorization/llm.py

Mirrors integrations/hubspot/. Appraisal will be the next LLM consumer; co-locating now avoids a move later. The client is portable (httpx, no domain assumptions), so it fits the integration boundary.

§4 Data model

New table — `deal_items`

| Column | Type | Notes |
| --- | --- | --- |
| id | UUID PK | `uuid4()` default |
| deal_id | UUID FK → `deals.id` | ON DELETE CASCADE; indexed |
| source_image | text NOT NULL | Filename of the original photo |
| crop_key | text NOT NULL | Storage-agnostic key, e.g. `deals/{deal_id}/crops/IMG_0421/item_03.png` |
| bbox | JSONB NULL | `{x, y, w, h}` from segmentation; nullable for forward-compat |
| width_cm | double NULL | Physical width, computed by segmentation from `bbox` + `photo_area.width_cm`. NULL when `photo_area.*` is unset |
| height_cm | double NULL | Physical height; same provenance |
| category | text NULL | Populated by categorization |
| category_confidence | double NULL | |
| subcategory | text NULL | Free-text label from the model |
| subcategory_id | text NULL | Canonical id from `resolve_subcategory()` (NULL if unresolved) |
| subcategory_confidence | double NULL | |
| status | enum (`pending\|classified\|failed`) NOT NULL | default `pending` |
| error | text NULL | On `failed`, last error truncated to 2000 chars |
| created_at / updated_at | timestamptz NOT NULL | `func.now()` server default; `onupdate` on updated_at |

Lives in features/categorization/models.py. Segmentation is the row creator but does not own the table — same as how PipelineRun lives in features/pipeline/ even though other features write to it.

New table — `categorizer_settings` (singleton)

Mirrors segmenter_settings. Single-row table; pydantic-serialized settings JSONB plus updated_at / updated_by.

class TwoPassConfig(BaseModel):
    model: str = "google/gemini-2.5-flash"
    timeout_seconds: float = 30.0
    temperature: float = 0.0

class SinglePassConfig(BaseModel):
    model: str = "google/gemini-2.5-flash"
    timeout_seconds: float = 30.0
    temperature: float = 0.0

class CategorizerSettings(BaseModel):
    kind: Literal["two_pass", "single_pass"] = "two_pass"
    concurrency: int = 10
    two_pass: TwoPassConfig = Field(default_factory=TwoPassConfig)
    single_pass: SinglePassConfig | None = None

New table — `app_settings` (generic key-value)

| Column | Type | Notes |
| --- | --- | --- |
| key | text PK | Dotted-namespace convention, e.g. `photo_area.width_cm` |
| value | JSONB NOT NULL | Any JSON scalar or object. Use JSON `null` for "explicitly unset" |
| updated_at | timestamptz NOT NULL | `func.now()` server default + `onupdate` |
| updated_by | text NULL | Free-text label (admin user / `"cron"` / `"migration"`) |

Lives in core/app_settings.py (cross-cutting). Two helpers:

async def get_setting[T](session, key: str, *, as_: type[T], default: T | None = None) -> T | None:
    row = await session.get(AppSetting, key)
    if row is None or row.value is None:
        return default
    return TypeAdapter(as_).validate_python(row.value)

async def set_setting(session, key: str, value: Any, *, updated_by: str | None = None) -> None:
    row = await session.get(AppSetting, key)
    payload = TypeAdapter(type(value)).dump_python(value, mode="json")
    if row is None:
        session.add(AppSetting(key=key, value=payload, updated_by=updated_by))
    else:
        row.value = payload
        row.updated_by = updated_by

Migration seeds photo_area.width_cm and photo_area.height_cm, both value = JSON null. The seeded rows are documentation-by-existence — they tell a future admin endpoint "here are the keys we know about" without forcing a non-null default for an unconfigured photo rig.

Why introduce this now and not later

The two photo-area scalars are the smallest, flattest thing that doesn't deserve its own singleton table. Once a generic KV table exists, future flat scalars (feature flags, cron cadences, ad-hoc thresholds) land in it for free. segmenter_settings and categorizer_settings stay as typed singletons for this slice; consolidating them onto app_settings rows is a future slice once we have 3+ such tables and the pattern is worth the migration churn.

Modified table — `pipeline_runs`

| Column | Type | Notes |
| --- | --- | --- |
| llm_usage | JSONB NULL | Per-model breakdown of all LLM calls in this stage execution |

Shape:

{
  "google/gemini-2.5-flash": {"input_tokens": 12500, "output_tokens": 320, "calls": 12},
  "google/gemini-2.5-pro":   {"input_tokens": 200,   "output_tokens": 100, "calls": 1}
}

Why JSONB and not flat columns

A single stage execution may use multiple models (two-pass with mixed flash/pro, future appraisal stages, etc.). Flat input_tokens / output_tokens / model columns would either lose model attribution or have to multiply rows. JSONB keeps pipeline_runs at one row per stage execution while preserving per-model detail. Aggregate via jsonb_each / jsonb_path_query. Older rows stay NULL.
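For reporting code that reads llm_usage back into Python, the per-model shape above can be summed across rows with a small fold. This is only a sketch: merge_llm_usage is a hypothetical helper name, not part of the slice, and the input list stands in for the llm_usage values of several pipeline_runs rows (including NULL for older rows).

```python
from collections import defaultdict

FIELDS = ("input_tokens", "output_tokens", "calls")


def merge_llm_usage(rows):
    """Sum llm_usage payloads ({model: {input_tokens, output_tokens, calls}}) per model.

    Hypothetical helper for illustration; `rows` is a list of dicts or None.
    """
    totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for usage in rows:
        if usage is None:  # older pipeline_runs rows keep llm_usage = NULL
            continue
        for model, counts in usage.items():
            for field in FIELDS:
                totals[model][field] += counts.get(field, 0)
    return dict(totals)


rows = [
    {"google/gemini-2.5-flash": {"input_tokens": 12500, "output_tokens": 320, "calls": 12}},
    None,
    {
        "google/gemini-2.5-flash": {"input_tokens": 500, "output_tokens": 80, "calls": 2},
        "google/gemini-2.5-pro": {"input_tokens": 200, "output_tokens": 100, "calls": 1},
    },
]
merged = merge_llm_usage(rows)
```

Because model attribution is kept per row, the same fold works whether a stage used one model or several.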

Migration `0016_categorization.py`

One Alembic revision. Up creates deal_items (incl. width_cm / height_cm), categorizer_settings, app_settings (with seed rows for photo_area.width_cm and photo_area.height_cm at JSON null), and llm_usage JSONB NULL on pipeline_runs. Down drops in reverse. Append deal_items, categorizer_settings, app_settings to TRUNCATE_TABLES in backend/tests/conftest.py.

§5 Storage abstraction

`Storage` protocol

# core/storage.py
class Storage(Protocol):
    async def read_bytes(self, key: str) -> bytes: ...
    async def write_bytes(self, key: str, data: bytes) -> None: ...
    async def exists(self, key: str) -> bool: ...
    async def delete_prefix(self, prefix: str) -> None: ...
    def signed_url(self, key: str, ttl_seconds: int = 3600) -> str: ...

Keys are POSIX-style strings (e.g. deals/{deal_id}/crops/IMG_0421/item_03.png). They never contain backslashes and never start with /. The protocol is intentionally minimal (no streaming, no metadata); extend it when a concrete need appears.

`LocalFsStorage(root: Path)`

root is Settings.data_dir (default Path("data")).
read_bytes / write_bytes via aiofiles; parent dirs created on write.
delete_prefix recursively removes root / prefix; safe against .. escape (validated).
signed_url(key) returns f"/files/{quote(key)}". The ttl_seconds argument is accepted but ignored (documented).
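The key rules and the .. escape check above could be implemented roughly as follows. This is a minimal sketch under stated assumptions: the class name LocalFsStorageSketch and the _resolve helper are hypothetical, and it uses asyncio.to_thread from the standard library where the spec names aiofiles, purely to keep the example dependency-free.

```python
import asyncio
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import quote


class LocalFsStorageSketch:
    """Illustrative only; the real LocalFsStorage is specified to use aiofiles."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()

    def _resolve(self, key: str) -> Path:
        # Keys are POSIX-style: no backslashes, no leading slash, no ".." parts.
        if not key or "\\" in key or key.startswith("/") or ".." in PurePosixPath(key).parts:
            raise ValueError(f"invalid storage key: {key!r}")
        path = (self.root / key).resolve()
        if not path.is_relative_to(self.root):  # second check, e.g. against symlinks
            raise ValueError(f"key escapes storage root: {key!r}")
        return path

    async def write_bytes(self, key: str, data: bytes) -> None:
        path = self._resolve(key)
        path.parent.mkdir(parents=True, exist_ok=True)  # parent dirs created on write
        await asyncio.to_thread(path.write_bytes, data)

    async def read_bytes(self, key: str) -> bytes:
        return await asyncio.to_thread(self._resolve(key).read_bytes)

    def signed_url(self, key: str, ttl_seconds: int = 3600) -> str:
        self._resolve(key)  # reject invalid keys before building a URL
        return f"/files/{quote(key)}"  # ttl_seconds is accepted but ignored


async def demo() -> tuple[bytes, str]:
    with tempfile.TemporaryDirectory() as tmp:
        storage = LocalFsStorageSketch(Path(tmp))
        key = "deals/42/crops/IMG 0421/item_03.png"
        await storage.write_bytes(key, b"png-bytes")
        return await storage.read_bytes(key), storage.signed_url(key)


data, url = asyncio.run(demo())
```

Validating the key before touching the filesystem keeps the same check reusable by the /files proxy route.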

`GET /files/{key:path}` proxy route

Resolves the key under data_dir, validates it stays under data_dir, streams the file with sniffed Content-Type. 404 for missing keys, 400 for invalid keys. Documented as dev/local only — in production with Azure, the proxy is unmounted and the Azure backend issues real signed URLs.

Selection

Settings.storage_kind: Literal["local"] = "local" (env: VCC_STORAGE_KIND). Each kind gets its own pydantic-settings field group (only local exists today). core/storage.py exposes get_storage(), which is lru_cached and mirrors get_db_engine.

Future Azure backend (out of slice)

AzureBlobStorage(account, container, credential) implements the same protocol. signed_url issues SAS URLs. No code change is needed in segmentation or categorization; the contract is the seam.

§6 OpenRouter client (`integrations/openrouter/`)

Lift

Copy checkin-pipeline/app/gemini/client.py to integrations/openrouter/client.py. Rename methods onto a class:

gemini_vision_request → OpenRouterClient.vision_request(image, system, model, …) → ChatResponse
gemini_vision_json_request → OpenRouterClient.vision_json_request(image, system, schema, model, …) → JsonResponse
gemini_text_json_request → OpenRouterClient.text_json_request(prompt, schema, model, …) → JsonResponse

The class takes an httpx.AsyncClient, the API key, and the base URL in __init__; instances are constructed per stage (cheap, no global state).

Typed errors

# integrations/openrouter/client.py
class OpenRouterError(Exception):
    """Base. Carries status + first 500 bytes of body."""
class OpenRouterAuthError(OpenRouterError):    pass   # 401, 403
class OpenRouterQuotaError(OpenRouterError):   pass   # 429, 402
class OpenRouterServerError(OpenRouterError):  pass   # 5xx
class OpenRouterParseError(OpenRouterError):   pass   # JSON-mode response didn't parse

The categorization stage uses these to distinguish per-item failures (caught, item → failed) from stage-level failures (auth/quota, re-raised past the per-item try/except).
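One way the client could map HTTP statuses onto these typed errors, and the stage could tell stage-level failures from per-item ones, is sketched below. The helpers error_for_status and is_stage_fatal are hypothetical names introduced for illustration, and the constructor signature is an assumption; only the class names and status groupings come from the listing above.

```python
class OpenRouterError(Exception):
    """Base error. Carries the HTTP status and a truncated response body."""

    def __init__(self, status: int, body: str = "") -> None:
        super().__init__(f"OpenRouter error (status={status})")
        self.status = status
        self.body = body[:500]  # assumption: truncate by characters, not bytes


class OpenRouterAuthError(OpenRouterError):
    pass  # 401, 403


class OpenRouterQuotaError(OpenRouterError):
    pass  # 429, 402


class OpenRouterServerError(OpenRouterError):
    pass  # 5xx


def error_for_status(status: int, body: str = "") -> OpenRouterError:
    """Hypothetical helper: pick the typed error for a non-2xx HTTP status."""
    if status in (401, 403):
        return OpenRouterAuthError(status, body)
    if status in (402, 429):
        return OpenRouterQuotaError(status, body)
    if 500 <= status <= 599:
        return OpenRouterServerError(status, body)
    return OpenRouterError(status, body)


def is_stage_fatal(exc: Exception) -> bool:
    """Auth and quota errors abort the stage; everything else fails one item."""
    return isinstance(exc, (OpenRouterAuthError, OpenRouterQuotaError))
```

A 5xx therefore marks only the affected item as failed, while a 401 or 429 stops the whole stage for the orchestrator to retry.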

Schemas

# integrations/openrouter/schemas.py
class TokenUsage(BaseModel):
    input_tokens: int
    output_tokens: int

class ChatResponse(BaseModel):
    text: str
    usage: TokenUsage
    model: str

class JsonResponse(BaseModel):
    data: dict[str, Any]      # already JSON-parsed; raises on bad JSON
    usage: TokenUsage
    model: str

Settings additions (`core/settings.py`)

openrouter_api_key:        SecretStr | None = None
openrouter_base_url:       str = "https://openrouter.ai/api/v1"
openrouter_default_model:  str = "google/gemini-2.5-flash"
storage_kind:              Literal["local"] = "local"

Worker startup adds VCC_OPENROUTER_API_KEY to the existing required-env validation block (next to HUBSPOT_ACCESS_TOKEN).

§7 Categorizer strategies

`Categorizer` protocol

# features/categorization/categorizer.py
class SizeHint(BaseModel):
    width_cm:  float
    height_cm: float

class ItemClassification(BaseModel):
    category:               str | None
    category_confidence:    float | None
    subcategory:            str | None
    subcategory_id:         str | None
    subcategory_confidence: float | None

class LlmCallRecord(BaseModel):
    model: str
    usage: TokenUsage

class Categorizer(Protocol):
    async def classify(
        self, *, image_bytes: bytes, size_hint: SizeHint | None
    ) -> tuple[ItemClassification, list[LlmCallRecord]]: ...

list[LlmCallRecord] (rather than a single summed TokenUsage) lets the stage attribute every round-trip to its model. Two-pass appends two records; single-pass appends one; a future mixed-model strategy appends as many as it makes.

`TwoPassCategorizer`

class TwoPassCategorizer:
    def __init__(self, client, taxonomy, cfg): ...

    async def classify(self, *, image_bytes, size_hint):
        # 1. category call
        sys1 = render_prompt("two_pass.category.md",
            categories=self.taxonomy.categories_block(),
            size_suffix=format_size_suffix(size_hint))
        r1 = await self.client.vision_json_request(image_bytes, sys1, CATEGORY_SCHEMA,
            model=self.cfg.model, timeout=self.cfg.timeout_seconds, temperature=self.cfg.temperature)
        category = r1.data["category"]; cat_conf = r1.data["confidence"]

        # 2. subcategory call (scoped to chosen category)
        sys2 = render_prompt("two_pass.subcategory.md",
            subcategory_block=self.taxonomy.subcategory_block(category),
            category=category, size_suffix=format_size_suffix(size_hint))
        r2 = await self.client.vision_json_request(image_bytes, sys2, SUBCATEGORY_SCHEMA,
            model=self.cfg.model, timeout=self.cfg.timeout_seconds, temperature=self.cfg.temperature)
        subcat_label = r2.data["subcategory"]; sub_conf = r2.data["confidence"]
        subcat_id = self.taxonomy.resolve_subcategory(category, subcat_label)

        return (ItemClassification(category=category, category_confidence=cat_conf,
                                   subcategory=subcat_label, subcategory_id=subcat_id,
                                   subcategory_confidence=sub_conf),
                [LlmCallRecord(model=r1.model, usage=r1.usage),
                 LlmCallRecord(model=r2.model, usage=r2.usage)])

`SinglePassCategorizer`

class SinglePassCategorizer:
    async def classify(self, *, image_bytes, size_hint):
        sys = render_prompt("single_pass.md",
            tree=self.taxonomy.tree_block(), size_suffix=format_size_suffix(size_hint))
        r = await self.client.vision_json_request(image_bytes, sys, SINGLE_PASS_SCHEMA,
            model=self.cfg.model, timeout=self.cfg.timeout_seconds, temperature=self.cfg.temperature)
        cat = r.data["category"]; subcat_label = r.data["subcategory"]; conf = r.data["confidence"]
        subcat_id = self.taxonomy.resolve_subcategory(cat, subcat_label)
        # single-pass returns one confidence; copy to both fields for shape parity
        return (ItemClassification(category=cat, category_confidence=conf,
                                   subcategory=subcat_label, subcategory_id=subcat_id,
                                   subcategory_confidence=conf),
                [LlmCallRecord(model=r.model, usage=r.usage)])

`build_categorizer(settings, http)`

# features/categorization/factory.py
async def build_categorizer(settings: CategorizerSettings, http: httpx.AsyncClient) -> Categorizer:
    cfg = get_settings()
    if not cfg.openrouter_api_key:
        raise RuntimeError("VCC_OPENROUTER_API_KEY not configured")
    client = OpenRouterClient(http, api_key=cfg.openrouter_api_key.get_secret_value(),
                              base_url=cfg.openrouter_base_url)
    taxonomy = Taxonomy.load()  # reads TOML+CSV each call
    if settings.kind == "two_pass":
        return TwoPassCategorizer(client, taxonomy, settings.two_pass)
    if settings.kind == "single_pass":
        return SinglePassCategorizer(client, taxonomy, settings.single_pass or SinglePassConfig())
    raise ValueError(f"unknown categorizer kind: {settings.kind}")

`Taxonomy` loader

Taxonomy.load() reads config/categories.toml + config/subcategories.csv from disk every call (no cache — matches legacy).
categories_block() renders category names + descriptions.
subcategory_block(category) renders category-scoped subcategory list with hint metadata.
tree_block() renders the full category→subcategory tree.
resolve_subcategory(category, label) does a case-insensitive match within the category, preferring an exact match over a substring match; returns the canonical CSV id or None.

Prompts

Copied from checkin-pipeline/app/steps/classify/{two_pass,single_pass}/*.md, placeholder names normalized:

prompts/two_pass.category.md — {categories}, {size_suffix}
prompts/two_pass.subcategory.md — {category}, {subcategory_block}, {size_suffix}
prompts/single_pass.md — {tree}, {size_suffix}

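render_prompt(name, **kwargs) re-reads the prompt file on each call and fills in the placeholders. Caveat: string.Template.safe_substitute only recognizes $name / ${name}, so it would leave the {name} placeholders listed above untouched. Either the prompt files switch to ${name} placeholders or render_prompt uses a brace-aware substitution.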
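Because string.Template substitutes $name and ${name} but not {name}, a renderer for the brace-style placeholders listed above might look like the following sketch. render_braces is a hypothetical name, and the sample prompt text is made up for the example; the regex only matches {identifier}, so literal JSON braces in a prompt are left alone.

```python
import re
from string import Template

_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_braces(text: str, **values: str) -> str:
    """Hypothetical helper: fill {name} placeholders, leaving unknown ones untouched."""
    return _PLACEHOLDER.sub(lambda m: str(values.get(m.group(1), m.group(0))), text)


prompt = 'Pick one of:\n{categories}\n{size_suffix}\nAnswer as {"category": "..."}'

# string.Template finds no $-placeholders here, so the text comes back unchanged.
untouched = Template(prompt).safe_substitute(categories="- watches", size_suffix="")

rendered = render_braces(
    prompt,
    categories="- watches",
    size_suffix="Physical size: approximately 4.0cm × 4.0cm.",
)
```

Avoiding str.format here is deliberate: format would raise on the literal JSON braces that classification prompts tend to contain.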

§8 Stage entry (`features/categorization/step.py`)

async def run_categorization_stage(session: AsyncSession, deal: Deal) -> None:
    items = (await session.scalars(
        select(DealItem).where(DealItem.deal_id == deal.id,
                               DealItem.status.in_([ItemStatus.PENDING, ItemStatus.FAILED]))
    )).all()
    if not items:
        logger.info("categorization_skipped_no_items", extra={"deal_id": str(deal.id)})
        return

    settings      = await get_categorizer_settings(session)
    storage       = get_storage()
    usage_by_model: dict[str, dict[str, int]] = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "calls": 0})

    async with httpx.AsyncClient() as http:
        categorizer = await build_categorizer(settings, http)
        sem = asyncio.Semaphore(settings.concurrency)

        async def classify_one(item: DealItem) -> None:
            async with sem:
                try:
                    image_bytes  = await storage.read_bytes(item.crop_key)
                    size_hint = (
                        SizeHint(width_cm=item.width_cm, height_cm=item.height_cm)
                        if item.width_cm is not None and item.height_cm is not None
                        else None
                    )
                    result, calls = await categorizer.classify(image_bytes=image_bytes, size_hint=size_hint)
                    item.category               = result.category
                    item.category_confidence    = result.category_confidence
                    item.subcategory            = result.subcategory
                    item.subcategory_id         = result.subcategory_id
                    item.subcategory_confidence = result.subcategory_confidence
                    item.status                 = ItemStatus.CLASSIFIED
                    item.error                  = None
                    for call in calls:  # one entry per underlying LLM round-trip
                        bucket = usage_by_model[call.model]
                        bucket["input_tokens"]  += call.usage.input_tokens
                        bucket["output_tokens"] += call.usage.output_tokens
                        bucket["calls"]         += 1
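                except (OpenRouterAuthError, OpenRouterQuotaError):
                    raise  # stage-level failure: abort the TaskGroup, orchestrator rolls back
                except Exception as exc: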
                    item.status = ItemStatus.FAILED
                    item.error  = repr(exc)[:2000]
                    logger.warning("categorization_item_failed",
                                   extra={"deal_id": str(deal.id),
                                          "item_id": str(item.id),
                                          "error": repr(exc)})
                finally:
                    await session.flush()  # keep ORM state consistent within the open tx

        async with asyncio.TaskGroup() as tg:
            for item in items:
                tg.create_task(classify_one(item))

    # Record per-model usage onto the in-progress PipelineRun row
    run = await _current_pipeline_run(session, deal, PipelineStage.CATEGORIZATION)
    if run is not None:
        run.llm_usage = dict(usage_by_model)  # JSONB: {model: {input_tokens, output_tokens, calls}}
    # commit happens in the orchestrator

_current_pipeline_run(...) selects the in-flight pipeline_runs row (status=running) for this deal+stage; the orchestrator at features/pipeline/orchestrator.py:31-38 creates and flushes that row before calling the stage.

Per-call attribution

Categorizer.classify returns tuple[ItemClassification, list[LlmCallRecord]] instead of a single TokenUsage. Each record carries model: str + usage: TokenUsage for one underlying LLM round-trip. Two-pass returns two records, single-pass one. Future mixed-model strategies just append more records, with no signature change. LlmCallRecord is defined next to the Categorizer protocol in features/categorization/categorizer.py (see §7).

Where size_hint comes from

The legacy rig assumption is that every photo is taken with the camera framed on a known physical area (the table/surface). The new backend stores that calibration in two app_settings keys, photo_area.width_cm and photo_area.height_cm, both nullable. When run_segmentation_stage inserts each DealItem, it reads those keys plus the source image's pixel dimensions (already loaded for SAM3/recursive) and writes width_cm = round(bbox.w / img_w_px * area_w_cm, 1) and height_cm = round(bbox.h / img_h_px * area_h_cm, 1). If either photo_area.* key is null, both columns stay NULL. Categorization just reads them off the item, with no recomputation. Changing photo_area.* later does not backfill existing items; re-run segmentation to recompute. Matches legacy semantics (checkin-pipeline/app/steps/crop/deal_step.py:28-113).

Flush vs commit

session.flush() per item only pushes pending ORM changes into the open transaction so subsequent reads see them; it does not commit. If the stage itself raises (auth/quota/build error), the orchestrator's per-stage rollback wipes every flushed update, returning items to their original pending/failed state for the next retry. Consistent with the orchestrator's documented transaction model.

§9 Error handling

Per-item LLM failure (timeout, OpenRouter 5xx, JSON parse error): item marked failed, error captured, stage proceeds. Re-runs reprocess pending + failed.
Per-item storage read failure: same as per-item LLM failure.
Settings / build_categorizer / get_storage failure: bubble up. Orchestrator rolls back the stage transaction. Deal stays at SEGMENTATION. Procrastinate retries with backoff.
OpenRouter auth (401/403) / quota (429/402): bubble up on the first occurrence, since it is not worth burning items on a known-bad config. OpenRouterClient raises typed OpenRouterAuthError / OpenRouterQuotaError on these statuses; classify_one must re-raise them ahead of its generic except Exception handler. All other exceptions are caught per item.
TaskGroup cancellation: if any task re-raises (auth/quota cases), the TaskGroup cancels the rest. In-flight item updates already flushed survive in memory; the stage transaction then rolls back. Items reprocess on retry.

§10 Concurrency

asyncio.Semaphore(settings.concurrency) per deal (default 10).
No global cap across deals. Worker is single-process today; concurrent deals are not a real concern.
When the worker scales, a global cap will likely live on the OpenRouterClient — out of slice.

§11 Configuration

Environment

| Env var | Default | Required | Purpose |
| --- | --- | --- | --- |
| VCC_STORAGE_KIND | `local` | no | Storage backend selector |
| VCC_OPENROUTER_API_KEY | (none) | worker startup | OpenRouter auth |
| VCC_OPENROUTER_BASE_URL | `https://openrouter.ai/api/v1` | no | |
| VCC_OPENROUTER_DEFAULT_MODEL | `google/gemini-2.5-flash` | no | Fallback for settings defaults |

Runtime (DB-backed)

categorizer_settings.settings: kind, concurrency, per-kind model / timeout_seconds / temperature.

app_settings:

| Key | Type | Default | Purpose |
| --- | --- | --- | --- |
| photo_area.width_cm | float \| null | `null` | Physical width of the camera-framed area |
| photo_area.height_cm | float \| null | `null` | Physical height of the camera-framed area |

Both seeded by migration as JSON null. When unset, items get width_cm = height_cm = NULL and categorization omits the size suffix.

Repo (no env)

features/categorization/config/categories.toml and config/subcategories.csv are copied verbatim from legacy; prompts/*.md are copied with placeholder names normalized (see §7). All are edited via PR.

§12 Testing

1 · Unit — strategies (`test_two_pass.py`, `test_single_pass.py`)

Mock OpenRouterClient.vision_json_request to return canned JsonResponses. Assert:

Two-pass makes exactly two calls; single-pass makes one.
Prompt text contains injected {categories} / {subcategory_block} / {tree}.
size_suffix appended when size_hint provided, omitted when not.
Known subcategory label resolves to right id; case-insensitive match works; unknown label → subcategory_id=None (label preserved).
Returned list[LlmCallRecord] has length 2 for two-pass (one per round-trip, each carrying its own model) and length 1 for single-pass.
When size_hint is provided, the rendered prompt contains "Physical size: approximately {W}cm × {H}cm."; when None, it does not.

2 · Unit — app_settings (`tests/core/test_app_settings.py`)

get_setting / set_setting round-trip for a float scalar, an object, and None (JSON null). Default fallback when the key is absent.

3 · Unit — taxonomy (`test_taxonomy.py`)

Load the real shipped TOML + CSV. Assert that 11 categories load, that a known subcategory resolves from both an exact-case and a differently-cased label, and that the rendered blocks for each category are non-empty.

4 · Integration — stage (`test_step.py`, real Postgres, mocked LLM + storage)

Uses the existing testcontainers conftest. Per test:

Create a Deal, insert 3 DealItem rows (one with width_cm / height_cm set, two without — covers both prompt branches).
Patch OpenRouterClient.vision_json_request so the three items resolve as [success, raises httpx.ReadTimeout, success]; with two-pass, each successful item consumes two mocked responses.
Patch Storage.read_bytes (monkeypatch at get_storage() cache) to return a canned 64×64 PNG.
Run run_categorization_stage(session, deal).
Assert: 2 items classified with fields set, 1 item failed with classification NULL and error set, PipelineRun.llm_usage for CATEGORIZATION is a dict keyed by model with input_tokens / output_tokens / calls (two-pass on 2 successful items → calls=4 under that model), stage returned without raising.

Second test (test_step_rerun.py): pre-seed one failed and one classified item; run again; assert only failed retried, classified untouched.

5 · Unit — storage (`tests/core/test_storage.py`)

LocalFsStorage round-trip under tmp_path; delete_prefix recurses; signed_url returns /files/{key} with URL-quoting; path-escape attempt (..) rejected.

Not tested

Live OpenRouter in CI (manual backend/scripts/smoke_categorize.py handles that hand-run before merge). Full segmentation+categorization end-to-end pipeline (segmentation's existing test gets one new assertion that DealItem rows are inserted; orchestrator's existing test covers wiring).

Fixtures

tests/features/categorization/fixtures/sample.png — single 64×64 RGB PNG checked into the repo. ~200 bytes.

§13 Decisions made (during brainstorm)

DealItem rows come from segmentation. Cleaner contract than letting categorization scan disk. Categorization reads WHERE status IN ('pending','failed').
DB-backed CategorizerSettings with kind enum (two_pass | single_pass). Mirrors SegmenterSettings; consistent with the existing factory pattern.
LLM client lives in integrations/openrouter/. Parallels integrations/hubspot/; predictable home for appraisal next.
Categories / subcategories / prompts: copy legacy files as-is under features/categorization/. The files are re-read on each call, so edits take effect live in dev.
Token tracking: one llm_usage JSONB column on pipeline_runs with a per-model breakdown ({model: {input_tokens, output_tokens, calls}}). Captures multi-model stages without a separate llm_calls table. Strategy contract is tuple[ItemClassification, list[LlmCallRecord]] so per-call model attribution survives all the way from the OpenRouter client to the aggregate.
Per-deal concurrency: 10, configurable. Matches legacy.
Storage abstraction lands in this slice; LocalFsStorage only. Azure backend is a follow-up that only implements the protocol.
Describe-then-classify is not ported. The variation registry pattern leaves a seam for adding it later.
Per-item failures don't fail the stage. Items get status='failed' + error; re-runs reprocess them. Stage-level failures (auth, quota, config) bubble up for orchestrator backoff/retry.
Photo-rig calibration via generic app_settings table rather than a new singleton. Keys: photo_area.width_cm, photo_area.height_cm (both nullable, seeded as JSON null). Future flat scalars (feature flags, thresholds, cadences) land in the same table for free. Existing typed singletons (segmenter_settings, categorizer_settings) stay as-is for this slice; consolidation is a future slice.
Per-item physical dimensions computed by segmentation from bbox × photo_area, written to DealItem.width_cm / height_cm at insert time. Categorization reads them off the row; size_hint is derived in the stage entry. Stale-on-settings-change matches legacy semantics (re-run segmentation to recompute).
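
The token-tracking decision above reduces to folding per-call records into a per-model dict. A minimal sketch, assuming field names for LlmCallRecord (model, input_tokens, output_tokens) that this spec does not spell out; `aggregate_usage` and the model id `example/model` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmCallRecord:
    # Field names are assumptions; the spec only names the record type.
    model: str
    input_tokens: int
    output_tokens: int


def aggregate_usage(records: list[LlmCallRecord]) -> dict[str, dict[str, int]]:
    """Fold per-call records into {model: {input_tokens, output_tokens, calls}}."""
    usage: dict[str, dict[str, int]] = {}
    for rec in records:
        entry = usage.setdefault(
            rec.model, {"input_tokens": 0, "output_tokens": 0, "calls": 0}
        )
        entry["input_tokens"] += rec.input_tokens
        entry["output_tokens"] += rec.output_tokens
        entry["calls"] += 1
    return usage


# Two-pass on two successful items: two calls per item, all under one model.
records = [LlmCallRecord("example/model", 100, 20) for _ in range(4)]
usage = aggregate_usage(records)
```

Because each record carries its own model id, a stage that mixes models across passes would produce one entry per model without any change to the aggregation.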
