A Claude Code session, start to end

Asynchronous bulk prediction, built end to end.

Ten stacked pull requests that take an ML modeling platform from no bulk-prediction capability to a working pipeline: submit thousands of molecules, fan them across recommended models, run them on GPU workers that scale from zero, and merge the results back — driven, hardened, reviewed, and documented in one session.

10 stacked PRs · 2 new tables · 74 tests green · 0 lint + type errors · ~1M SMILES, max run

What we built

One request, fanned across models, run on demand.

A BulkRequest is the experiment id and the queryable index; it fans out into one BulkJob per resolved model. Durable state lives in S3; the database rows track lifecycle. Execution is Celery over Amazon SQS, with AWS Batch autoscaling GPU workers from zero on queue depth. The same one-shot worker drains both the inference and finalize queues, so the merge runs before the GPU node releases.

flowchart LR
  U["Scientist / client"] --> API["/api/bulk/"]
  API --> DB[("bulk_requests + bulk_jobs")]
  API -->|process_job on commit| Q1["SQS bulk-inference"]
  Q1 --> AL["CloudWatch alarm"]
  AL --> EB["EventBridge rule"]
  EB -->|SubmitJob| BW["AWS Batch GPU worker (scale from zero)"]
  BW -->|inference| S3J[("S3 per-job results")]
  BW -->|finalize| Q2["SQS bulk-finalize"]
  Q2 --> BW
  BW -->|merge| S3R[("S3 result.parquet")]
  BW -->|status CAS| DB

The stack

Ten PRs, each one thing, stacked on the last.

Every PR does exactly one job and builds on the one before it — from the app skeleton through the engine, the API, the worker image, the AWS infrastructure, the UI, and the docs.

PR 1 · #80 · App skeleton: models, migration, local stack · add
bulk-pr1 → dev

Adds the apps/bulk app skeleton: two tables and the local dev stack. Additive only — no registry/predict/web paths touched.

  • BulkRequest and BulkJob models with migration 0001_initial.
  • App registration, Django admin, factories, and model tests.
  • Local stack: a Redis broker service in docker-compose plus justfile recipes.
erDiagram
  bulk_requests ||--o{ bulk_jobs : "fans out to"
  bulk_requests {
    bigint id PK "exp id"
    text status
    text target
    text model_id FK
    text smiles_path
    int smiles_count
    json result_paths
  }
  bulk_jobs {
    bigserial id PK
    bigint request_id FK
    int index
    text status
    text model_id
    int completed_smiles
  }
PR 2 · #81 · Extract format_predictions into a shared module · refactor
bulk-pr2 → bulk-pr1

Moves prediction-result formatting out of the predict view into a shared module so the bulk pipeline can reuse it. Logic moved verbatim — no behaviour change.

  • New apps/predict/services/formatting.py with format_predictions / extract_prediction.
  • Predict view updated to import from the shared module; no other call sites.
graph LR
  subgraph Before
    V1["predict view"] --> F1["formatting logic (inline)"]
  end
  subgraph After
    V2["predict view"] --> M["formatting.py"]
    B["bulk process_job (PR 4)"] --> M
  end
PR 3 · #82 · Ingestion + routing validation services · add
bulk-pr3 → bulk-pr2

Adds the service layer that reads input SMILES and validates routing before a request is accepted.

  • storage_service — read SMILES from an inline list or an S3 source (txt/csv/parquet/sdf), save the input, merge per-job results.
  • validation — check routing (target/series/model_id) against recommendations and sample-check the SMILES.
flowchart LR
  IN["smiles[] or smiles_source"] --> ING["read SMILES"]
  ROUTE["target / series / model_id"] --> VAL["validate routing + SMILES"]
  ING --> OK{valid?}
  VAL --> OK
  OK -- yes --> RDY["ready to submit"]
  OK -- no --> ERR["400"]
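
The ingestion and sample-check steps above can be sketched roughly as follows. This is a minimal illustration, not the storage_service or validation code itself: the function names, the `smiles` CSV column, the sample size, and the placeholder validity check are all assumptions. A real check would presumably parse each sampled SMILES with a cheminformatics toolkit.

```python
import csv
import io
import random
from typing import Callable


def read_smiles(text: str, fmt: str) -> list[str]:
    """Parse SMILES from txt (one per line) or csv (assumed 'smiles' column)."""
    if fmt == "txt":
        return [line.strip() for line in text.splitlines() if line.strip()]
    if fmt == "csv":
        reader = csv.DictReader(io.StringIO(text))
        # `or ""` guards against short rows, where DictReader yields None.
        return [s for row in reader if (s := (row.get("smiles") or "").strip())]
    raise ValueError(f"unsupported format: {fmt}")


def looks_like_smiles(s: str) -> bool:
    """Placeholder check only (hypothetical): no whitespace, balanced brackets."""
    return (
        bool(s)
        and not any(c.isspace() for c in s)
        and s.count("(") == s.count(")")
        and s.count("[") == s.count("]")
    )


def sample_check(
    smiles: list[str],
    is_valid: Callable[[str], bool],
    sample_size: int = 100,
    seed: int = 0,
) -> list[str]:
    """Validate a random sample rather than every entry; return the invalid ones."""
    if len(smiles) <= sample_size:
        sample = smiles
    else:
        sample = random.Random(seed).sample(smiles, sample_size)
    return [s for s in sample if not is_valid(s)]
```

Checking a sample keeps submit latency flat for very large inputs; the trade-off is that a few malformed entries can still reach the workers, so the per-job code would need to tolerate them.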
PR 4 · #83 · Celery wiring + worker tasks + service engine · add
bulk-pr4 → bulk-pr3

Adds the Celery app, the worker task graph, and the service engine that runs a request end to end.

  • Task chain orchestrate -> process_job (one per model) -> finalize.
  • BulkService submit / cancel / retry; routing resolver shared with the preview endpoint.
  • Inference-engine and experiment-tracking integration.
  • Dispatch on transaction commit; cancel terminates running Batch jobs; empty routing fails fast; finalize idempotent via status check-and-set.
flowchart TD
  S["submit_request"] -->|on commit| O["orchestrate"]
  O -->|no models| F1["mark FAILED"]
  O -->|1..n models| P["process_job (per model)"]
  P --> J[("S3 job result")]
  P -->|last job| FIN["finalize"]
  FIN --> R[("S3 merged result")]
  FIN --> C["request COMPLETED"]
  X["cancel"] -. "TerminateJob + flag" .-> P
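
The "idempotent via status check-and-set" idea can be illustrated with a conditional UPDATE. The sketch below uses sqlite3 so it runs standalone; the real pipeline uses the Django ORM, and the intermediate status names RUNNING and FINALIZING are assumptions made for the example (only FAILED and COMPLETED appear in the source).

```python
import sqlite3
from typing import Callable


def try_transition(
    conn: sqlite3.Connection, request_id: int, expected: str, new: str
) -> bool:
    """Compare-and-set on status: succeeds only if the row is still `expected`."""
    cur = conn.execute(
        "UPDATE bulk_requests SET status = ? WHERE id = ? AND status = ?",
        (new, request_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1


def finalize(
    conn: sqlite3.Connection, request_id: int, merge: Callable[[int], None]
) -> bool:
    """Run the merge at most once, even if finalize is delivered twice."""
    if not try_transition(conn, request_id, "RUNNING", "FINALIZING"):
        # Another delivery already claimed it, or the request left RUNNING.
        return False
    merge(request_id)
    return try_transition(conn, request_id, "FINALIZING", "COMPLETED")
```

Because SQS delivers at least once, a duplicate finalize message is expected rather than exceptional; the conditional UPDATE lets the second delivery return without repeating the merge.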
PR 5 · #84 · /api/bulk/ REST API · add
bulk-pr5 → bulk-pr4

Exposes the bulk pipeline over /api/bulk/ (browser, Okta) and /api/service/bulk/ (PAT). Same views back both prefixes.

  • Views, serializers, URLs: submit (202), list (paginated, scoped to caller), status, cancel, retry, result (presigned downloads).
  • Per-caller ownership scoping; rate-limit and input-size breaches surface as 429 / 400.
flowchart LR
  C["client"] --> API["/api/bulk/"]
  API --> SUB["POST requests/ -> 202"]
  API --> LIST["GET requests/"]
  API --> RES["GET requests/{id}/result/"]
  API --> CAN["POST requests/{id}/cancel/"]
  SUB --> SVC["BulkService"]
  CAN --> SVC
  SVC --> DB[("bulk_requests / bulk_jobs")]
PR 6 · #85 · GPU worker image + cuda extra + CI build job · add
bulk-pr6 → bulk-pr5

Builds the GPU worker container image and wires its CI build. At this point the worker still drains bulk_gpu — queues are renamed in PR 7.

  • Dockerfile.bulk — CUDA image that runs the one-shot Celery worker (bulk_one_shot).
  • cuda optional-dependency extra: the GPU inference stack plus pycurl for the SQS consumer.
  • CI build-bulk job builds the image and pushes it to ECR.
flowchart LR
  CI["CI build-bulk"] --> IMG["Dockerfile.bulk (CUDA + cuda extra)"]
  IMG --> ECR["ECR: bulk-worker-dev"]
  ECR --> W["one-shot worker: bulk_one_shot"]
PR 7 · #86 · AWS infra provisioner + SQS broker wiring · refactor
bulk-pr7 → bulk-pr6

Provisions the AWS runtime and switches Celery onto the SQS broker. Resources are renamed from the PR 6 placeholders to descriptive, role-based names.

  • Idempotent provisioner: SQS queues + DLQs, Batch GPU compute environment / queue / job definition, CloudWatch alarm, EventBridge autoscale rule + IAM role.
  • Celery on the SQS broker, routed to bulk-inference / bulk-finalize.
  • Renamed: queues bulk_gpu / bulk_cpu, the worker CMD, and the task docstrings.
flowchart LR
  subgraph Before["Before (PR 6)"]
    W1["web / worker"] --> BG["SQS bulk_gpu"]
    W1 --> BC["SQS bulk_cpu"]
  end
  subgraph After["After (this PR)"]
    W2["web: submit"] -->|process_job| Q1["SQS bulk-inference"]
    Q1 --> AL["CloudWatch alarm"]
    AL --> EB["EventBridge rule"]
    EB -->|SubmitJob| BW["Batch GPU worker (scale from zero)"]
    BW --> S3[("S3 results")]
    BW -->|finalize| Q2["SQS bulk-finalize"]
    Q2 --> BW
  end
PR 8 · #87 · Bulk Predictions UI (frontend) · add
bulk-pr8 → bulk-pr7

Adds the Bulk Predict screen to the React app, styled to match Quick Predict.

  • frontend/src/bulk.jsx: submit form, requests list (refresh + client-side pagination), detail pane (progress, per-job status, cancel/retry, downloads).
  • Route and sidebar registration.
flowchart LR
  UI["Bulk Predict screen"] --> FORM["submit form"]
  UI --> LISTV["requests list"]
  UI --> DET["detail pane"]
  FORM -->|POST| A1["/api/bulk/requests/"]
  LISTV -->|GET| A1
  DET -->|GET| A2["requests/{id}/ and /result/"]
  DET -->|POST| A3["cancel / retry"]
PR 9 · #88 · Submit-form BFF endpoints: options / quota / preview-routing · add
bulk-pr9 → bulk-pr8

Adds the read-only helper endpoints the submit form needs.

  • GET options/ — routable targets and series, output formats, limits (cached 60s).
  • GET quota/ — the caller's per-user usage plus the global cluster cap.
  • GET preview-routing/ — the model count for a routing, via the shared resolve_model_ids so it never drifts from submit.
flowchart LR
  FORM["submit form"] --> OPT["GET options/"]
  FORM --> QUO["GET quota/"]
  FORM --> PRE["GET preview-routing/"]
  OPT --> REC[("recommendations")]
  QUO --> DB[("bulk_requests")]
  PRE --> RES["resolve_model_ids (shared with submit)"]
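
One way to read "never drifts from submit" is that preview and submit call the same function instead of each carrying its own copy of the routing rules. The sketch below is an assumption-heavy illustration: only the name resolve_model_ids comes from the source, while the Routing type, the in-memory RECOMMENDATIONS table, and the rule that an explicit model_id short-circuits the lookup are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Routing:
    target: Optional[str] = None
    series: Optional[str] = None
    model_id: Optional[str] = None


# Hypothetical stand-in for the recommendations table: (target, series) -> models.
RECOMMENDATIONS = {
    ("kinase-x", "series-a"): ["m1", "m2"],
    ("kinase-x", "series-b"): ["m2", "m3"],
}


def resolve_model_ids(routing: Routing) -> list[str]:
    """Single source of truth for which models a routing fans out to."""
    if routing.model_id:
        return [routing.model_id]
    ordered: dict[str, None] = {}  # dict keeps insertion order and de-duplicates
    for (target, series), model_ids in RECOMMENDATIONS.items():
        if routing.target and target != routing.target:
            continue
        if routing.series and series != routing.series:
            continue
        for model_id in model_ids:
            ordered.setdefault(model_id, None)
    return list(ordered)


def preview_routing(routing: Routing) -> dict:
    """Read-only preview: reports the count without creating anything."""
    return {"model_count": len(resolve_model_ids(routing))}


def submit(routing: Routing, smiles: list[str]) -> list[dict]:
    """Create one job description per resolved model; empty routing fails fast."""
    model_ids = resolve_model_ids(routing)
    if not model_ids:
        raise ValueError("routing resolves to no models")
    return [
        {"index": i, "model_id": m, "smiles_count": len(smiles)}
        for i, m in enumerate(model_ids)
    ]
```

With both paths going through one resolver, the count shown in the form and the number of jobs created at submit can only differ if the recommendations change between the two calls.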
PR 10 · #89 · Documentation: app, AWS resources, schema, ERD · add
bulk-pr10 → bulk-pr9

Documents the bulk feature across the reference docs. Docs only — no code changes.

  • schema.md (bulk tables), api-reference.md (/api/bulk/ endpoints), architecture.md (flow + diagrams).
  • code-organization.md (now five apps), bulk-aws-resources.md, README index, CLAUDE.md.
  • erd.html (v9) — the bulk_requests / bulk_jobs tables with their FK edges.
graph TD
  PR["PR 10: docs"] --> SCH["schema.md: bulk tables"]
  PR --> API["api-reference.md: /api/bulk/"]
  PR --> ARCH["architecture.md: flow"]
  PR --> ORG["code-organization.md: 5 apps"]
  PR --> AWS["bulk-aws-resources.md"]
  PR --> ERD["erd.html v9"]

The journey

How it came together.

The work wasn't just writing code — it was running it for real, finding where it lied (cancel), renaming the world, stripping the slop, and reviewing every line twice.

Run it for real
Drove the bulk pipeline end to end against the dev service API. Confirmed real predictions landing in S3, including a multi-model run (9,564 SMILES × 2 models = 19,128 rows merged correctly).
Polish the UI
Aligned the Bulk screen with Quick Predict: matched buttons, added section dividers, paginated the long Requests list, and dialled the auto-refresh back to 10s.
Make cancel actually cancel
Found that cancel was cosmetic: it did not stop running Batch jobs. Recorded AWS_BATCH_JOB_ID per job, called batch:TerminateJob on cancel, and added a cooperative status poll. Proven on a 195k-SMILES run cancelled at ~20%: progress froze, the job was marked FAILED, and the Batch job exited cleanly.
Rename the infrastructure
Renamed every AWS resource to a descriptive, role-based scheme, provisioned the new set, and smoke-tested a fresh run through the renamed queues, compute environment, and worker.
Delete the old, with approval
Presented the exact delete list and only removed the superseded SQS / Batch / alarm / rule / role / IAM / log group / ECR tag after explicit sign-off. Verified old gone, new intact.
Strip the AI slop
Ran a reviewer per branch to remove dead comments, redundant guards, and stray casts. De-duped the edits to each file's origin PR, then cascade-rebased the whole stack. Lint and mypy clean, tests green.
Review every PR — twice
Reviewed all 10 PRs against their stated purpose. Fixed a transaction-commit dispatch race, a misplaced worker command, a duplicated resolver, a missing quota cap, an empty-routing hang, a progress clamp, and a stale cancel claim in the docs.
Document it
Added the bulk tables to schema.md, the /api/bulk/ endpoints to api-reference.md, the fifth app to code-organization.md, and the two tables to the browsable ERD (v9). Rewrote all 10 PR descriptions to one brief, diagram-led format.
Branch for landing
Created a dev branch off main and retargeted the base of the stack to it, leaving the rest of the chain intact.

Proven end to end

Not a mock — real runs on dev.

Every claim was checked against real requests flowing through the live dev pipeline and real predictions written to S3.

Multi-model merge

One target across two recommended models: 9,564 SMILES × 2 = 19,128 rows, merged into one result and completed.

multi-model run · completed
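
The row arithmetic follows from stacking per-job results: each model scores every input once, so the merged result has (number of SMILES) × (number of models) rows. A minimal sketch, assuming per-job results are lists of row dicts and that the merge tags each row with its model; the function name and row shape are hypothetical, and the real merge writes Parquet to S3.

```python
def merge_job_results(job_results: dict[str, list[dict]]) -> list[dict]:
    """Stack per-model rows into one long table, one row per (model, SMILES)."""
    merged = []
    for model_id in sorted(job_results):  # sorted for a stable output order
        for row in job_results[model_id]:
            merged.append({"model_id": model_id, **row})
    return merged


# Same shape as the run above: 9,564 inputs scored by two models.
smiles = [f"C{i}" for i in range(9564)]  # placeholder strings, not real molecules
results = {
    model: [{"smiles": s, "prediction": 0.0} for s in smiles]
    for model in ("model-a", "model-b")
}
print(len(merge_job_results(results)))  # 9564 * 2 = 19128
```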

Rename smoke test

A fresh run drove the whole renamed stack — new queues, new compute environment, new worker — to a downloadable result.

smoke run · completed

Cancel that bites

A 997,547-SMILES run cancelled at ~20%: progress froze, the job went FAILED, and the AWS Batch job was terminated.

cancel test · cancelled
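
The two halves of cancel, terminating the Batch job and the cooperative poll, can be sketched as below. This is an illustration under assumptions rather than the worker's code: run_job, cancel, and the job dict shape are hypothetical, and terminate_job is injected so the example runs without AWS. With boto3 the injected callable would typically wrap the Batch client's terminate_job(jobId=..., reason=...).

```python
from typing import Callable, Iterable


def run_job(
    batches: Iterable[list[str]],
    predict: Callable[[list[str]], None],
    is_cancelled: Callable[[], bool],
) -> tuple[str, int]:
    """Process batches, polling the cancel flag before each one.

    Returns (status, completed_smiles). A cancelled job stops at a batch
    boundary, so progress freezes at the last completed batch.
    """
    completed = 0
    for batch in batches:
        if is_cancelled():
            return "FAILED", completed
        predict(batch)
        completed += len(batch)
    return "COMPLETED", completed


def cancel(
    job: dict,
    terminate_job: Callable[[str, str], None],
    set_flag: Callable[[int], None],
) -> None:
    """Set the cooperative flag first, then hard-stop the Batch job if one is recorded."""
    set_flag(job["id"])
    batch_job_id = job.get("batch_job_id")
    if batch_job_id:
        terminate_job(batch_job_id, "cancelled by user")
```

The flag covers the window where the worker is mid-batch and the terminate call has not yet taken effect; the terminate call covers a worker that never reaches its next poll.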