# Backend Guide

## Architecture Overview
- FastAPI application: `web_app/main.py`
  - `/synthesize` infers metadata and caches the original DataFrame plus draft domain info.
  - `/confirm_synthesis` reconstructs the DataFrame with user overrides and invokes the synthesizer.
  - `/jobs/{job_id}` returns persisted job metadata and registered output artifacts.
  - `/download_synthesized_data/{session_id}` streams the generated CSV for a confirmed synthesis session.
  - `/evaluate` calculates metadata-aware metrics using `web_app/data_comparison.py` (keyed by the same session ID).
- Synthesis service: `web_app/synthesis_service.py`
  - Bridges the cached inference bundle to the selected synthesizer.
  - Handles preprocessing (clipping, binning, categorical remapping) before handing off to PrivSyn or AIM.
Algorithm references: PrivSyn follows the approach in *PrivSyn: Differentially Private Data Synthesis*; the AIM adapter implements *AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data*.
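The two-phase flow above (cache on `/synthesize`, retrieve on `/confirm_synthesis`) can be sketched as a UUID-keyed in-memory store. This is a minimal illustration, not the actual session implementation; `SessionCache` and the bundle shape are hypothetical names.

```python
import uuid

class SessionCache:
    """Illustrative in-memory cache keyed by a temporary UUID, mirroring
    how /synthesize parks the inference bundle for /confirm_synthesis."""

    def __init__(self):
        self._sessions = {}

    def put(self, bundle) -> str:
        """Cache an inference bundle and return its session id."""
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = bundle
        return session_id

    def get(self, session_id):
        """Return the cached bundle, or None after eviction/restart
        (the real API then falls back to the persisted preview bundle)."""
        return self._sessions.get(session_id)

cache = SessionCache()
sid = cache.put({"dataframe": "...", "draft_domain": {"age": {"type": "int"}}})
assert cache.get(sid)["draft_domain"]["age"]["type"] == "int"
assert cache.get("expired-session") is None
```

The fallback behaviour for expired sessions is why the endpoints below also persist preview bundles and durable artifacts.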
## Key Modules
| Module | Role |
|---|---|
| `web_app/data_inference.py` | Detects column types, normalises metadata, and prepares draft domain/info payloads. |
| `web_app/synthesis_service.py` | Applies overrides, constructs the preprocessor, runs the synthesizer, and persists outputs. |
| `web_app/job_service.py` | Dual-writes inline synthesis runs into durable job + artifact records without changing the current UI flow. |
| `web_app/job_bundle.py` | Persists confirmed run inputs as portable bundle artifacts for future Slurm/cloud workers. |
| `web_app/job_runner.py` | Swappable synthesis execution backend; the current inline runner isolates execution from route-level preprocessing. |
| `web_app/slurm_plan.py` | Builds Slurm submission scripts from persisted confirmed job bundles without coupling Slurm details to the route layer. |
| `web_app/aws_batch_plan.py` | Builds `aws batch submit-job ...` commands from durable confirmed bundles without coupling AWS CLI details to the route layer. |
| `web_app/cloud_run_plan.py` | Builds `gcloud run jobs execute ...` commands from durable confirmed bundles without coupling Cloud Run CLI details to the route layer. |
| `web_app/aws_batch_control.py` | Wraps AWS Batch status and cancellation commands so queued Batch jobs can be observed and cancelled through the same API shape. |
| `web_app/cloud_run_control.py` | Wraps Cloud Run execution status and cancellation commands so queued cloud jobs can be observed and cancelled through the same API shape. |
| `web_app/job_execution.py` | Shared helpers for materializing run directories and reconstructing synthesis inputs inside either the API process or a remote worker. |
| `web_app/run_confirmed_job.py` | Worker entrypoint that rehydrates a persisted confirmed bundle and executes the synthesis run outside the web process. |
| `web_app/auth.py` | Request-time auth adapter layer; currently supports `none` and trusted-header identity injection. |
| `web_app/metadata_store.py` | Metadata store abstraction with SQLite and Postgres backends behind the same CRUD surface. |
| `web_app/object_storage.py` | Object storage abstraction with local and S3-compatible backends, including local materialization for remote workers. |
| `web_app/settings.py` | Environment-driven runtime configuration for state roots, storage backends, and future auth/job adapters. |
| `privsyn_platform/` | Shared platform-oriented import path for auth, storage, metadata, and settings reuse across future tabular/image/text apps. |
| `web_app/data_comparison.py` | Implements histogram-aware TVD and other evaluation metrics. |
| `method/synthesis/privsyn/privsyn.py` | PrivSyn implementation (marginal selection + GUM). |
| `method/api/base.py` | Core synthesizer API (`SynthRegistry`, `PrivacySpec`, `RunConfig`, `Synthesizer` protocol). |
| `method/api/utils.py` | Helper utilities used by adapters (e.g., `split_df_by_type`, schema enforcement). |
| `method/synthesis/AIM/adapter.py` | Adapter wiring AIM into the unified interface provided by `method/api`. |
| `method/preprocess_common/` | Shared discretizers (PrivTree, DAWA) and helper utilities. |
## Unified Synthesis Interface
`method/api/base.py` defines the shared contract every synthesis method must follow:

- `SynthRegistry` exposes `register`, `get`, and `list` helpers so adapters (e.g., `method/synthesis/privsyn/__init__.py`, `method/synthesis/AIM/__init__.py`) can self-register at import time.
- `PrivacySpec` and `RunConfig` capture the caller's DP/compute requirements and are passed through to each adapter.
- `_AdapterSynth` and `_AdapterFitted` wrap legacy prepare/run functions so existing method code needs minimal changes.

The backend dispatcher (`web_app/methods_dispatcher.py`) and tests such as `test/test_methods_dispatcher.py` rely on this registry to treat every method uniformly. Method-specific modules (`method/synthesis/<name>/native.py`, `config.py`, `parameter_parser.py`, etc.) stay alongside each algorithm because they encode behaviour that other methods do not share (e.g., PrivSyn's marginal-selection parameters or AIM's workload configuration). Keep the registry small and general, and let each method own its internal configuration files.
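The self-registration pattern can be sketched as follows. This is a simplified assumption of the shape of `SynthRegistry`, not the real code in `method/api/base.py`; the factory lambdas stand in for actual adapter constructors.

```python
class SynthRegistry:
    """Illustrative registry: adapters register a factory under a method
    name at import time, and the dispatcher looks methods up by name."""

    _methods = {}

    @classmethod
    def register(cls, name, factory):
        cls._methods[name] = factory

    @classmethod
    def get(cls, name):
        return cls._methods[name]

    @classmethod
    def list(cls):
        return sorted(cls._methods)

# Each adapter module would call register() when imported, so importing
# method/synthesis/<name>/__init__.py is enough to make it dispatchable:
SynthRegistry.register("privsyn", lambda **cfg: f"privsyn synthesizer {cfg}")
SynthRegistry.register("aim", lambda **cfg: f"aim synthesizer {cfg}")

assert SynthRegistry.list() == ["aim", "privsyn"]
assert SynthRegistry.get("privsyn")().startswith("privsyn")
```

The dispatcher never needs per-method imports or conditionals; adding a method means adding one self-registering package.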
## Endpoint Notes

### POST /synthesize
- Expects a multipart form (fields documented in `test/test_api_contract.py`).
- For sample runs, omit the file and set `dataset_name=adult`.
- Stores the uploaded DataFrame and inferred metadata under a temporary UUID in memory.
- Also persists the preview bundle (input parquet + inferred metadata + synthesis params) so `/confirm_synthesis` can recover after in-memory session loss.
- All columns from the uploaded table participate in metadata inference; the API no longer accepts or drops a distinct target column.
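A sample run needs only the documented `dataset_name` field. The sketch below prepares that form; the full field list lives in `test/test_api_contract.py`, and the commented HTTP call assumes a locally running API and an installed HTTP client.

```python
API_BASE = "http://127.0.0.1:8001"  # matches the Local Development section

def sample_run_form() -> dict:
    """Form fields for a sample run: omit the file upload and point the
    API at the bundled adult dataset, per the notes above."""
    return {"dataset_name": "adult"}

# With an HTTP client of your choice (e.g. requests, assumed installed):
#   resp = requests.post(f"{API_BASE}/synthesize", data=sample_run_form())
#   unique_id = resp.json()["unique_id"]  # consumed later by /confirm_synthesis

assert sample_run_form() == {"dataset_name": "adult"}
```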
### POST /confirm_synthesis
- Requires the `unique_id` returned by `/synthesize`.
- Accepts JSON strings for `confirmed_domain_data` and `confirmed_info_data`.
- Runs the chosen synthesizer (`privsyn` or `aim`) and writes the synthesized CSV + evaluation bundle to the temp directory.
- Also registers a durable job record plus synthesized artifact metadata so later platform adapters can replace inline execution without changing API contracts.
- Falls back to the persisted preview bundle when the in-memory inference session has expired or the process has restarted.
- Returns first-class job fields such as `job_id`, `status`, `status_url`, and `download_url`, while still preserving the legacy `session_id` alias for compatibility.
- Persists a confirmed input parquet plus a `job_request.json` artifact so remote workers can execute the same confirmed run without route-local state.
- Only populates the legacy in-memory evaluation session when the job finishes inline; queued backends rely on `/jobs/{job_id}`, durable artifacts, and the remote worker path instead.
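A client consuming this response can prefer the first-class job fields and fall back to the legacy alias. This is a hedged sketch of client-side handling under the field names documented above; `extract_job_info` is an illustrative helper, not part of the API.

```python
def extract_job_info(payload: dict) -> dict:
    """Pull the documented first-class job fields out of a
    /confirm_synthesis response, keeping the legacy session_id alias."""
    return {
        "job_id": payload["job_id"],
        "status": payload["status"],
        "poll_url": payload["status_url"],
        "download_url": payload["download_url"],
        # Older clients key downloads/evaluation on session_id; newer
        # payloads still carry it for compatibility.
        "session_id": payload.get("session_id", payload["job_id"]),
    }

resp = {
    "job_id": "job-123",
    "status": "succeeded",
    "status_url": "/jobs/job-123",
    "download_url": "/download_synthesized_data/sess-abc",
    "session_id": "sess-abc",
}
info = extract_job_info(resp)
assert info["poll_url"] == "/jobs/job-123"
assert info["session_id"] == "sess-abc"
```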
### GET /jobs/{job_id}
- Returns the persisted job state (`running`, `succeeded`, `failed`, etc.), metadata, and registered artifacts.
- Exposes whether the legacy in-memory session bundle is still available for evaluation.
- This endpoint is the bridge toward future remote execution backends such as Cloud Run Jobs, AWS Batch/ECS/Fargate, or Slurm.
- The current inline execution path already routes through `web_app/job_runner.py`, so future backends can be swapped in without rewriting the preprocessing route.
- Confirmed input artifacts and request bundles are now registered alongside synthesized outputs, which gives remote backends a stable payload to consume.
- For queued backends, this is the authoritative polling endpoint until a remote worker runs `python -m web_app.run_confirmed_job --job-id ... --job-request-key ...`.
- The generated Slurm script now exports the shared metadata/artifact roots plus object-storage backend settings so the worker writes back into the same durable state as the web/API tier.
- Slurm-backed jobs also opportunistically refresh `queued`/`running` state from `squeue` so the API reflects scheduler progress without waiting for a terminal worker callback.
- AWS Batch-backed jobs now use the same `/jobs/{job_id}` and `/jobs/{job_id}/cancel` routes for observation and cancellation, via `aws batch describe-jobs`, `cancel-job`, and `terminate-job`.
- Cloud Run job submission now goes through a parallel planning layer that emits `gcloud run jobs execute` commands; in production this still expects shared durable metadata/artifact backends rather than local-only paths.
- Cloud Run-backed jobs now use the same `/jobs/{job_id}` and `/jobs/{job_id}/cancel` routes for observation and cancellation, via `gcloud run jobs executions describe`/`cancel`.
- Remote submission backends now persist a lightweight submission diagnostic bundle in job metadata, including the backend name and the concrete CLI command used to enqueue the job.
- Metadata store construction now goes through a factory seam keyed by `PRIVSYN_METADATA_BACKEND`/`PRIVSYN_DATABASE_URL`, defaulting to file-backed SQLite for local development and accepting Postgres URLs for shared deployments.
- Request auth now goes through `PRIVSYN_AUTH_BACKEND`, with the first concrete non-demo adapter using trusted headers from a campus reverse proxy or other OIDC front end.
- Preview bundles, durable jobs, downloads, evaluation, and RC compatibility jobs all now carry owner metadata so authenticated deployments can enforce per-user access without rewriting the synthesis core.
### POST /jobs/{job_id}/cancel
- Cancels a Slurm-backed job via `scancel` and marks the durable job record as `cancelled`.
- Returns the same serialized job payload as `GET /jobs/{job_id}` so clients can reuse the polling shape after a cancel request.
### GET /download_synthesized_data/{session_id}
- Streams the generated CSV for a previously confirmed synthesis session.
- Reads from the legacy in-memory `SessionStore` when available, but now falls back to the persisted artifact registry so downloads survive session cleanup.
### POST /evaluate
- Accepts `session_id` (form field) and reuses the cached original/synthetic data to compute metrics (e.g., histogram TVD for numeric columns).
- Falls back to the persisted confirmed input parquet plus the synthesized CSV artifact when the in-memory session has already been evicted.
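The histogram TVD metric mentioned above can be sketched as the total variation distance between equal-width binned distributions of a numeric column. The binning details here are assumptions for illustration, not the exact implementation in `web_app/data_comparison.py`.

```python
def histogram_tvd(original, synthetic, bins=10):
    """Total variation distance between the binned distributions of two
    numeric samples, using shared equal-width bins over their joint range."""
    lo = min(min(original), min(synthetic))
    hi = max(max(original), max(synthetic))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)  # clamp the max value
            counts[idx] += 1
        total = len(values)
        return [c / total for c in counts]

    p, q = hist(original), hist(synthetic)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Identical samples have zero distance; disjoint ones are maximally far apart.
assert histogram_tvd([1, 2, 3], [1, 2, 3]) == 0.0
assert histogram_tvd([0.0] * 5, [100.0] * 5) == 1.0
```

TVD lies in [0, 1], which makes per-column scores easy to compare across columns with very different scales.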
## Local Development
```shell
uvicorn web_app.main:app --reload --port 8001

# Optionally set VITE_API_BASE_URL when running the frontend separately
export VITE_API_BASE_URL=http://127.0.0.1:8001
```
## Configuration Tips
- CORS origins are defined in `web_app/main.py`. Update the `allow_origins` list to include any new frontend domains.
- Set the `ADDITIONAL_CORS_ORIGINS` environment variable (comma-separated list) in production to append extra origins, especially for Vercel preview/prod URLs. `CORS_ALLOW_ORIGINS` is still accepted as a deprecated alias so older deploys do not break immediately.
- Temporary artifacts (original data, synthesized CSVs) land under `temp_synthesis_output/`. Keep an eye on disk usage during iterative testing.
- Use environment-variable overrides or `.env` files for production secrets (database URLs, etc.); the current setup only handles the stateless demo flow.
- For shared cloud or campus deployments, set `PRIVSYN_METADATA_BACKEND=postgres` and `PRIVSYN_DATABASE_URL=postgresql://...`; the backend normalizes this to the `psycopg` SQLAlchemy driver automatically.
- For internal deployments behind campus SSO or another trusted proxy, set `PRIVSYN_AUTH_BACKEND=trusted-header` and have the proxy inject `X-Privsyn-Subject` plus optional `X-Privsyn-Email`, `X-Privsyn-Name`, and `X-Privsyn-Admin`.
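The CORS merging described above can be sketched as follows. This is a hedged illustration of the documented behaviour (append `ADDITIONAL_CORS_ORIGINS`, honour the deprecated `CORS_ALLOW_ORIGINS` alias); `resolve_cors_origins` is a hypothetical helper, and the exact precedence rules in `web_app/main.py` may differ.

```python
import os

def resolve_cors_origins(base_origins):
    """Merge the base allow_origins list with comma-separated extras from
    the environment, preferring ADDITIONAL_CORS_ORIGINS over the
    deprecated CORS_ALLOW_ORIGINS alias."""
    raw = os.environ.get("ADDITIONAL_CORS_ORIGINS") or os.environ.get(
        "CORS_ALLOW_ORIGINS", ""
    )
    extras = [origin.strip() for origin in raw.split(",") if origin.strip()]
    seen, merged = set(), []
    for origin in list(base_origins) + extras:  # keep order, drop duplicates
        if origin not in seen:
            seen.add(origin)
            merged.append(origin)
    return merged

os.environ["ADDITIONAL_CORS_ORIGINS"] = (
    "https://app.example.vercel.app, https://preview.example.vercel.app"
)
origins = resolve_cors_origins(["http://localhost:5173"])
assert origins == [
    "http://localhost:5173",
    "https://app.example.vercel.app",
    "https://preview.example.vercel.app",
]
```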