Platform Roadmap

This note tracks the migration from the current single-process demo flow to a deployment model that can run on Google Cloud, AWS, or a university-managed cluster without rewriting the synthesis logic.

Goals

  • Keep the synthesis API and preprocessing pipeline cloud-agnostic.
  • Replace in-memory session state with durable metadata and artifact storage.
  • Support both public-cloud auth and campus SSO with the same backend claims model.
  • Let the same job abstraction target Cloud Run Jobs, ECS/Fargate or AWS Batch, and Slurm/Kubernetes Jobs.

Phase 1: Durable Foundations

Status: started

  • Add environment-driven settings (web_app/settings.py).
  • Add a metadata persistence layer (web_app/metadata_store.py).
  • Route metadata store construction through a backend factory so SQLite local dev and future Postgres deployments share the same call sites.
  • Add an object storage abstraction with local and S3-compatible backends (web_app/object_storage.py).
  • Keep the existing FastAPI flow unchanged while the new primitives harden under tests.
  • Persist preview/inference bundles so confirmation survives in-memory session loss.
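The object storage abstraction above can be sketched as a small protocol with a local backend and an environment-driven factory. Names here (`ObjectStorage`, `LocalObjectStorage`, `storage_from_env`, the env vars) are illustrative, not the actual contents of web_app/object_storage.py:

```python
import os
import tempfile
from pathlib import Path
from typing import Protocol


class ObjectStorage(Protocol):
    """Minimal artifact interface; S3/GCS backends implement the same methods."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def exists(self, key: str) -> bool: ...


class LocalObjectStorage:
    """Filesystem-backed implementation for local development."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Keys stay relative so the same key works against an S3 bucket later.
        return self.root / key

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()

    def exists(self, key: str) -> bool:
        return self._path(key).is_file()


def storage_from_env() -> ObjectStorage:
    """Backend factory: environment-driven selection, local filesystem by default."""
    backend = os.environ.get("OBJECT_STORAGE_BACKEND", "local")
    if backend == "local":
        root = Path(os.environ.get("OBJECT_STORAGE_ROOT", tempfile.mkdtemp()))
        return LocalObjectStorage(root)
    raise ValueError(f"unsupported object storage backend: {backend}")
```

Because routes only see the protocol, swapping the local backend for an S3-compatible one is a configuration change, not a code change.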

Deliverables:

  • SQLite metadata store for local development.
  • Local artifact storage rooted under temp_synthesis_output/state/artifacts.
  • Tests that lock in user/job/artifact semantics.
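The user/job/artifact semantics those tests lock in can be illustrated with a hypothetical SQLite schema; the actual metadata_store schema may differ, but the ownership chain (users own jobs, jobs own artifacts) is the invariant under test:

```python
import sqlite3

# Hypothetical schema sketch: users own jobs, jobs own artifacts.
SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    sub TEXT NOT NULL,
    provider TEXT NOT NULL,
    UNIQUE (sub, provider)
);
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    state TEXT NOT NULL DEFAULT 'queued'
);
CREATE TABLE artifacts (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL REFERENCES jobs(id),
    storage_key TEXT NOT NULL
);
"""


def make_db() -> sqlite3.Connection:
    """In-memory SQLite database with foreign keys enforced, for tests."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn
```

Tests against this schema can then assert the semantics directly: duplicate (sub, provider) pairs are rejected, jobs start queued, and artifacts cannot reference a nonexistent job.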

Phase 2: Job Model

Status: in progress

  • Introduce explicit job states: queued, running, succeeded, failed, cancelled.
  • Convert /confirm_synthesis into job submission plus status polling.
  • Move synthesized CSVs and uploaded parquet files behind the object storage abstraction.
  • Keep inline execution as the local default backend to preserve the current dev UX.
  • Keep backend selection config-driven through PRIVSYN_JOB_BACKEND so routes do not have to change when Slurm or cloud backends land.
  • Persist confirmed run bundles so remote workers can consume portable input artifacts rather than route-local temp state.
  • Treat queued backends as first-class job submissions: only inline-complete runs should populate the legacy in-memory session payload.
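A minimal sketch of this job model, assuming an interface shaped roughly like the one described above (the class and function names are illustrative, and only the PRIVSYN_JOB_BACKEND variable comes from the plan itself):

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Protocol


class JobState(str, Enum):
    """The explicit job states introduced in Phase 2."""
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class Job:
    job_id: str
    state: JobState = JobState.QUEUED
    error: str | None = None


class JobBackend(Protocol):
    """Single submission interface shared by inline, Slurm, and cloud backends."""

    def submit(self, job_id: str, run: Callable[[], None]) -> Job: ...


class InlineBackend:
    """Local default: run synchronously in-process, preserving the current dev UX."""

    def submit(self, job_id: str, run: Callable[[], None]) -> Job:
        job = Job(job_id, JobState.RUNNING)
        try:
            run()
            job.state = JobState.SUCCEEDED
        except Exception as exc:  # record the failure instead of raising to the route
            job.state = JobState.FAILED
            job.error = str(exc)
        return job


def backend_from_env() -> JobBackend:
    """Config-driven selection; new backends slot in without route changes."""
    name = os.environ.get("PRIVSYN_JOB_BACKEND", "inline")
    if name == "inline":
        return InlineBackend()
    raise ValueError(f"unknown job backend: {name}")
```

Under this shape, /confirm_synthesis submits through `backend_from_env().submit(...)` and only inline-complete runs would populate the legacy session payload; queued backends return a job id for status polling instead.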

Phase 3: Auth Model

  • Add a normalized users table keyed by external subject (sub) and provider.
  • Accept OIDC-backed identity claims in the backend.
  • Map cloud auth providers and campus SSO into the same internal user record.
  • Add per-job ownership checks before artifact download and evaluation.
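The claims-normalization and ownership rules above can be sketched as follows; `UserRecord`, `user_from_claims`, and `check_job_owner` are hypothetical names, but the key idea (identity keyed by provider plus `sub`, checked before artifact access) is the one stated in this phase:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserRecord:
    """Normalized identity keyed by (provider, sub), independent of the IdP."""
    provider: str
    sub: str
    email: Optional[str] = None


def user_from_claims(provider: str, claims: dict) -> UserRecord:
    """Map raw OIDC claims from any provider into the same internal record.

    `sub` is the only claim OIDC guarantees; everything else is optional.
    """
    return UserRecord(provider=provider, sub=claims["sub"], email=claims.get("email"))


def check_job_owner(job_owner: UserRecord, requester: UserRecord) -> None:
    """Per-job ownership check gating artifact download and evaluation."""
    if (job_owner.provider, job_owner.sub) != (requester.provider, requester.sub):
        raise PermissionError("requester does not own this job")
```

Because both cloud logins and campus SSO collapse into the same record, the synthesis flow never branches on the identity provider.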

Phase 4: Platform Adapters

Google Cloud

  • Web/API: Cloud Run
  • Jobs: Cloud Run Jobs
  • Object storage: Google Cloud Storage
  • Database: Cloud SQL Postgres or external Postgres
  • Auth: Google login, Clerk, or another OIDC provider

AWS

  • Web/API: App Runner
  • Jobs: ECS/Fargate or AWS Batch
  • Object storage: S3
  • Database: RDS Postgres
  • Auth: Cognito or another OIDC provider

University-managed deployment

  • Web/API: campus VM or Kubernetes ingress
  • Jobs: Slurm or Kubernetes Jobs
  • Object storage: MinIO, Ceph, or shared storage behind the object storage interface
  • Database: campus Postgres
  • Auth: campus SSO via OIDC or SAML-to-OIDC bridge
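The adapter matrix above can be expressed as deployment profiles that differ only in configuration, since routes talk to the storage/job/auth interfaces rather than to platform SDKs. The backend identifiers and keys below are placeholders, not settled configuration values:

```python
# Hypothetical deployment profiles: each target is a set of env overrides.
DEPLOY_PROFILES = {
    "gcp": {
        "PRIVSYN_JOB_BACKEND": "cloud_run_jobs",
        "OBJECT_STORAGE_BACKEND": "gcs",
        "DATABASE_BACKEND": "postgres",
        "AUTH_MODE": "oidc",
    },
    "aws": {
        "PRIVSYN_JOB_BACKEND": "aws_batch",
        "OBJECT_STORAGE_BACKEND": "s3",
        "DATABASE_BACKEND": "postgres",
        "AUTH_MODE": "oidc",
    },
    "campus": {
        "PRIVSYN_JOB_BACKEND": "slurm",
        "OBJECT_STORAGE_BACKEND": "s3",  # MinIO/Ceph via the S3-compatible backend
        "DATABASE_BACKEND": "postgres",
        "AUTH_MODE": "oidc",
    },
}


def profile_env(name: str) -> dict:
    """Return the environment overrides for one deployment target."""
    return dict(DEPLOY_PROFILES[name])
```

Every profile sets the same keys, which is itself a useful test: a new platform adapter that needs a key the others lack is leaking platform details past the interfaces.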

Integration Rules

  • Business logic should not import cloud-specific SDKs directly.
  • Storage code should depend on storage adapters, not filesystem paths.
  • Job submission should go through one backend interface, even for inline local runs.
  • Authenticated user identity should enter the synthesis flow as a normalized user record, not as provider-specific fields.

Safe Rollout and Rollback

  • Introduce each new subsystem as a dual-write or read-fallback layer first.
  • Keep SessionStore and current local run directories working until metadata-backed paths are proven in tests.
  • Preserve current API response shapes while adding job and artifact metadata behind the scenes.
  • Only remove legacy state paths after one full release cycle of stable tests and manual validation.
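The dual-write/read-fallback pattern can be sketched as a thin wrapper over two stores; the class names and the generic key/value interface here are illustrative, not the actual SessionStore API:

```python
from typing import Optional, Protocol


class KVStore(Protocol):
    """Generic key/value view over either the legacy or the new store."""

    def get(self, key: str) -> Optional[bytes]: ...
    def put(self, key: str, value: bytes) -> None: ...


class DualWriteStore:
    """Write to both stores; read from the new one, falling back to legacy.

    Rollback is removing this wrapper; cutover is dropping the legacy store
    once the metadata-backed path has proven itself in tests.
    """

    def __init__(self, new: KVStore, legacy: KVStore) -> None:
        self.new = new
        self.legacy = legacy

    def put(self, key: str, value: bytes) -> None:
        self.new.put(key, value)
        self.legacy.put(key, value)

    def get(self, key: str) -> Optional[bytes]:
        value = self.new.get(key)
        if value is not None:
            return value
        return self.legacy.get(key)  # fallback for pre-migration entries


class DictStore:
    """Toy in-memory store used to exercise the wrapper."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
```

Because callers only see the `KVStore` shape before, during, and after the migration, API response shapes stay stable while the durable backend takes over underneath.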

University Deployment Checklist

  • Public entrypoint: confirm whether campus IT will host a VM, reverse proxy, or Kubernetes ingress.
  • Job submission: confirm whether web services may submit to Slurm or another scheduler.
  • Identity: confirm OIDC or SAML app registration path.
  • Data services: confirm Postgres and object storage availability.
  • Security review: confirm whether user-uploaded datasets require privacy or compliance review.

Immediate Next Steps

  1. Persist preview/inference artifacts so remote runners do not depend on SessionStore.
  2. Persist scheduler-side diagnostics (exit code, stderr pointer, submission host) back into job metadata.
  3. Add a GCS adapter and production Postgres deployment path behind the existing storage and metadata interfaces.
  4. Layer per-user ownership and auth checks on top of the durable job/artifact APIs.
  5. Add a higher-level deployment guide that compares local, Slurm, AWS Batch, and Cloud Run setup requirements side by side.