Client Workflows

The following sections describe the typical Compox client workflows for execution (inference) and training.

Execution workflow

This section covers the client-facing flow for execution in Compox: upload data, start an execution, poll status, and retrieve results. All endpoints and payloads are derived from the current server code under compox/src/compox/routers.


Upload data files

Endpoint:

  • POST /api/v0/files

Behavior (from file_controller.py):

  • The request body is treated as raw bytes and must be a valid HDF5 file.

  • On success, returns { "file_id": "<uuid>" }.

Minimal example:

POST /api/v0/files
Content-Type: application/octet-stream

<HDF5 bytes>

Response:

{ "file_id": "..." }

Notes:

  • The server validates that the uploaded bytes open as HDF5.

  • Files are stored in the data-store bucket/collection.

  • By default, files expire after 1 day (configurable in S3Connection).
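
As a concrete illustration, the upload step can be scripted with a small Python client. This is a minimal sketch using the requests library; BASE_URL and the local file name are assumptions to adapt to your deployment:

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

# Upload raw HDF5 bytes; the server validates that they open as HDF5.
with open("scan.h5", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/v0/files",
        data=f.read(),
        headers={"Content-Type": "application/octet-stream"},
    )
resp.raise_for_status()
file_id = resp.json()["file_id"]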


Execute an algorithm

Endpoint:

  • POST /api/v0/execute-algorithm

Payload model: IncomingExecutionRequest

  • algorithm_id: string

  • input_dataset_ids: list of file IDs

  • checkpoint_id: optional checkpoint to load assets from

  • algorithm_minor_version: optional minor version to execute

  • execution_device_override: optional device override (e.g. "cpu", "cuda:0")

  • additional_parameters: dict (free-form)

  • session_token: optional session identifier

Example:

{
  "algorithm_id": "<algorithm_id>",
  "input_dataset_ids": ["<file_id_1>", "<file_id_2>"],
  "checkpoint_id": null,
  "algorithm_minor_version": null,
  "execution_device_override": null,
  "additional_parameters": {
    "threshold": 0.5,
    "tile_size": 512
  },
  "session_token": null
}

Response:

{ "execution_id": "..." }

Validation behavior (from execution_controller.py):

  • All referenced input file IDs must exist in data-store.

  • The algorithm ID must exist in algorithm-store (via find_algorithm_by_id).

Execution mode:

  • If inference.backend_settings.executor is fastapi_background_tasks, execution runs via execution_task_fastapi.

  • If the executor is celery, the request is queued as a Celery task (task).

Progress/status details (from TaskHandler):

  • Valid statuses are PENDING, STARTED, RUNNING, COMPLETED, FAILED, STOPPED.

  • execution_controller.py creates the record with status="PENDING" and progress=0.0.

  • TaskHandler.mark_as_completed() sets progress=1.0, time_completed, output_dataset_ids, and status="COMPLETED".

  • TaskHandler.mark_as_failed() sets status="FAILED", progress=1.0, clears output_dataset_ids, and stores the exception in the log.

  • If a stop request is posted, TaskHandler._check_for_stop_request() acknowledges it and calls mark_as_stopped(), which sets status="STOPPED" and raises TaskStoppedException.
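
Putting the endpoint and payload together, here is a minimal Python sketch of starting an execution (BASE_URL is assumed, as in the upload sketch; IDs and parameter values are illustrative):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

payload = {
    "algorithm_id": "<algorithm_id>",
    "input_dataset_ids": ["<file_id_1>", "<file_id_2>"],
    "checkpoint_id": None,
    "algorithm_minor_version": None,
    "execution_device_override": None,
    "additional_parameters": {"threshold": 0.5, "tile_size": 512},
    "session_token": None,
}
resp = requests.post(f"{BASE_URL}/api/v0/execute-algorithm", json=payload)
resp.raise_for_status()
execution_id = resp.json()["execution_id"]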


Sessions (optional)

The execution API supports an optional session_token that can be used to share an in‑memory cache across multiple executions.

Behavior (from TaskSession and TaskHandler):

  • If you omit session_token and the backend uses FastAPI background tasks, a new session token is generated and stored in the execution record.

  • You can retrieve it via GET /api/v0/executions/{execution_id} and pass it in subsequent executions to reuse the cache.

  • Sessions are in‑memory only (single process), expire after ~24 hours, and are capped in size/number of caches.

  • Celery does not support sessions: if you attempt to use session features in Celery mode, a NotImplementedError is raised internally and session_token will be None in the execution record.

Practical guidance:

  • Use sessions only when running with fastapi_background_tasks.

  • Treat sessions as a performance optimization (e.g., caching model intermediates), not a persistent store.
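
A sketch of session reuse under these assumptions (FastAPI background tasks backend; execution_id comes from a previous execution, BASE_URL as above):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address
execution_id = "<execution_id>"     # from a previous execution

# Read back the generated session token from the execution record...
record = requests.get(f"{BASE_URL}/api/v0/executions/{execution_id}").json()
session_token = record["session_token"]

# ...and pass it to the next execution to reuse the in-memory cache.
payload = {
    "algorithm_id": "<algorithm_id>",
    "input_dataset_ids": ["<file_id_2>"],
    "checkpoint_id": None,
    "algorithm_minor_version": None,
    "execution_device_override": None,
    "additional_parameters": {},
    "session_token": session_token,
}
requests.post(f"{BASE_URL}/api/v0/execute-algorithm", json=payload)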


Check execution status

Endpoint:

  • GET /api/v0/executions/{execution_id}

Response model: ExecutionRecord

  • Includes status, progress, log, and output_dataset_ids.

Example response (shape):

{
  "execution_id": "...",
  "algorithm_id": "...",
  "status": "RUNNING",
  "progress": 0.3,
  "time_started": "...",
  "time_completed": "",
  "log": "",
  "input_dataset_ids": ["<file_id_1>", "<file_id_2>"],
  "output_dataset_ids": [],
  "execution_device_override": null,
  "additional_parameters": {},
  "session_token": null,
  "checkpoint_id": null,
  "algorithm_minor_version": null
}

Notes:

  • output_dataset_ids is the key field for downstream retrieval of results.
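
A typical client polls this endpoint until a terminal status is reached. A minimal sketch (the polling interval is a client-side choice; BASE_URL is assumed):

import time
import requests

BASE_URL = "http://localhost:8000"  # assumed server address
execution_id = "<execution_id>"

TERMINAL = {"COMPLETED", "FAILED", "STOPPED"}
while True:
    record = requests.get(f"{BASE_URL}/api/v0/executions/{execution_id}").json()
    print(f"status={record['status']} progress={record['progress']:.0%}")
    if record["status"] in TERMINAL:
        break
    time.sleep(2)

output_dataset_ids = record["output_dataset_ids"]  # populated on COMPLETED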


Stop execution (optional)

Endpoint:

  • POST /api/v0/executions/{execution_id}/stop

Behavior:

  • Only executions in PENDING, STARTED, or RUNNING status can be stopped.

  • A stop request is posted to stop-requests, which the task checks.
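
In client code, requesting a stop is a single POST; the running task notices the stop request at its next check and transitions to STOPPED. A sketch (BASE_URL assumed):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address
execution_id = "<execution_id>"

# Posts a stop request, which the task acknowledges cooperatively.
requests.post(f"{BASE_URL}/api/v0/executions/{execution_id}/stop")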


Retrieve output datasets

Endpoint:

  • GET /api/v0/files/{file_id}

Behavior:

  • Returns the raw HDF5 bytes for the given file ID. Call it once for each ID listed in output_dataset_ids.
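
A sketch of downloading every output dataset to disk (BASE_URL and the output file names are assumptions):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address
output_dataset_ids = ["<file_id>"]  # from the execution record

for i, fid in enumerate(output_dataset_ids):
    resp = requests.get(f"{BASE_URL}/api/v0/files/{fid}")
    resp.raise_for_status()
    with open(f"output_{i}.h5", "wb") as f:
        f.write(resp.content)  # raw HDF5 bytes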


Delete files (optional)

Endpoint:

  • DELETE /api/v0/files/{file_id}

Behavior:

  • Deletes a file from data-store immediately.

Notes:

  • Files already expire automatically (default 1 day), but you may want to delete them earlier to free up storage.


End-to-end summary

  1. Upload HDF5 files → get file_ids

  2. Execute algorithm with algorithm_id + input_dataset_ids → get execution_id

  3. Optionally reuse a session_token across executions (FastAPI background tasks only)

  4. Poll execution record → read status + output_dataset_ids (and session_token if used)

  5. Optionally stop execution

  6. Download each output dataset by ID

  7. Optionally delete files to free up storage early

Training workflow

This section covers the client-facing flow for training in Compox: upload data, create training samples, start training, and retrieve results. All endpoints and payloads are derived from the current server code under compox/src/compox/routers.


Upload data files

Data files are uploaded exactly as in the execution workflow above: POST /api/v0/files with raw HDF5 bytes in the request body returns { "file_id": "<uuid>" }. The same validation and expiry rules (and the Python upload sketch above) apply.


Create training samples

Endpoint:

  • POST /api/v0/sample

Payload model: IncomingSampleRequest (see pydantic_models.py)

  • files: list of dicts mapping arbitrary keys to lists of file IDs

  • tags: list of strings (optional)

Example:

{
  "files": [
    { "input": ["<file_id_1>", "<file_id_2>"], "target": ["<file_id_3>"] }
  ],
  "tags": ["modality:ct", "anatomy:brain", "author:me"]
}

Response:

{ "sample_id": "..." }

Validation behavior:

  • Each referenced file ID must exist in data-store, otherwise the API returns 404.

Related endpoints:

  • GET /api/v0/sample/{sample_id} returns the stored sample record

  • GET /api/v0/sample/all?positive_tags=...&negative_tags=... filters by tags

  • DELETE /api/v0/sample/{sample_id} deletes a sample

Notes:

  • Files referenced by samples are not copied; the sample record just points to existing files.

  • Files referenced by a sample do not expire as long as the sample exists.
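
A minimal Python sketch of creating a sample (BASE_URL is assumed; the file IDs come from the upload step):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

payload = {
    "files": [
        {"input": ["<file_id_1>", "<file_id_2>"], "target": ["<file_id_3>"]}
    ],
    "tags": ["modality:ct", "anatomy:brain"],
}
resp = requests.post(f"{BASE_URL}/api/v0/sample", json=payload)
resp.raise_for_status()
sample_id = resp.json()["sample_id"]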


Start training

Endpoint:

  • POST /api/v0/train-algorithm

Payload model: IncomingTrainingRequest

  • algorithm_id: string

  • training_data: list of sample IDs

  • checkpoint_id: optional checkpoint to start from

  • algorithm_minor_version: optional minor version string

  • tags: list of strings

  • additional_parameters: dict (free-form)

Example:

{
  "algorithm_id": "<algorithm_id>",
  "training_data": ["<sample_id>"],
  "checkpoint_id": null,
  "algorithm_minor_version": null,
  "tags": ["experiment:42", "author:me"],
  "additional_parameters": {
    "learning_rate": 0.001,
    "batch_size": 4,
    "num_epochs": 10
  }
}

Response:

{ "training_id": "..." }

Validation behavior (from training_controller.py):

  • All referenced sample IDs must exist in sample-store.

  • All files referenced by those samples must exist in data-store.

  • The algorithm ID must exist in algorithm-store (via find_algorithm_by_id).

Execution mode:

  • If inference.backend_settings.executor is fastapi_background_tasks, training runs via training_task_fastapi.

  • If the executor is celery, the request is queued as a Celery task (training_task).

Progress/status details (from TrainingHandler and TaskHandler):

  • Status transitions are written to training-store via TrainingHandler.status (inherited from TaskHandler).

  • Valid statuses are PENDING, STARTED, RUNNING, COMPLETED, FAILED, STOPPED.

  • training_controller.py creates the record with status="PENDING" and progress=0.0.

  • TrainingHandler.mark_as_completed() sets progress=1.0, time_completed, updates the log, and sets status="COMPLETED".

  • If a stop request is posted, TaskHandler._check_for_stop_request() acknowledges it and calls mark_as_stopped(), which sets status="STOPPED" and raises TaskStoppedException.

  • TaskHandler.mark_as_failed() sets status="FAILED", progress=1.0, clears output_dataset_ids, and stores the exception in the log.
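
A minimal Python sketch of starting a training run (BASE_URL is assumed; IDs and hyperparameters are illustrative):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

payload = {
    "algorithm_id": "<algorithm_id>",
    "training_data": ["<sample_id>"],
    "checkpoint_id": None,
    "algorithm_minor_version": None,
    "tags": ["experiment:42", "author:me"],
    "additional_parameters": {
        "learning_rate": 0.001,
        "batch_size": 4,
        "num_epochs": 10,
    },
}
resp = requests.post(f"{BASE_URL}/api/v0/train-algorithm", json=payload)
resp.raise_for_status()
training_id = resp.json()["training_id"]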


Check training status

Endpoint:

  • GET /api/v0/training/{training_id}

Response model: TrainingRecord

  • Includes status, progress, log, output_checkpoint_ids, etc.

Example response (shape):

{
  "training_id": "...",
  "algorithm_id": "...",
  "status": "RUNNING",
  "progress": 0.3,
  "time_started": "...",
  "time_completed": null,
  "log": "",
  "training_data": ["<sample_id>"],
  "state": {},
  "tags": ["experiment:42"],
  "checkpoint_id": null,
  "algorithm_minor_version": null,
  "output_checkpoint_ids": []
}

Notes:

  • output_checkpoint_ids is the key field for downstream retrieval of training results.
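
Polling works the same way as for executions; a compact sketch that waits for a terminal status and collects the checkpoint IDs (BASE_URL is assumed):

import time
import requests

BASE_URL = "http://localhost:8000"  # assumed server address
training_id = "<training_id>"

while True:
    record = requests.get(f"{BASE_URL}/api/v0/training/{training_id}").json()
    if record["status"] in {"COMPLETED", "FAILED", "STOPPED"}:
        break
    time.sleep(5)

checkpoint_ids = record["output_checkpoint_ids"]  # populated on COMPLETED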


Stop training (optional)

Endpoint:

  • POST /api/v0/training/{training_id}/stop

Behavior:

  • Only training runs in PENDING, STARTED, or RUNNING status can be stopped.

  • A stop request is posted to stop-requests, which the training task checks.


Retrieve training results (checkpoints)

Endpoint:

  • GET /api/v0/checkpoint/{checkpoint_id}

  • GET /api/v0/checkpoint/all (filtering supported by query params; see checkpoint_controller.py)

Results:

  • Checkpoint metadata is returned (not the model bytes directly).

Checkpoint behavior (from TrainingHandler.save_checkpoint()):

  • save_checkpoint() validates that every asset path exists in the algorithm’s assets.

  • New assets are stored in asset-store, and a new checkpoint_id is created.

  • A checkpoint manifest is saved in algorithm-checkpoint-store.

  • The new checkpoint ID is appended to output_checkpoint_ids in the training handler, so it appears in the training record.

Checkpoint metadata shape (from AlgorithmCheckpointRecord):

{
  "checkpoint_id": "...",
  "training_id": "...",
  "parent_algorithm_id": "...",
  "created_at": "...",
  "properties": {},
  "tags": [],
  "parent_checkpoint_id": null
}

Details on properties and tags:

  • properties is a free-form dictionary provided by the algorithm when it calls save_checkpoint(assets, properties). This is the place to store metrics, hyperparameters, dataset IDs, evaluation scores, or any other metadata you want to query later.

  • tags are inherited from the training run tags (IncomingTrainingRequest.tags). When the checkpoint is created, TrainingHandler.save_checkpoint() copies the training record’s tags into the checkpoint manifest.

  • parent_checkpoint_id is copied from the training record’s checkpoint_id, so you can track lineage if you trained from an existing checkpoint.

Filtering by tags:

  • GET /api/v0/checkpoint/all?positive_tags=tag1&positive_tags=tag2

  • GET /api/v0/checkpoint/all?negative_tags=tag_to_exclude

  • You can combine both positive_tags and negative_tags in the same request.
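
With the requests library, repeated query parameters are expressed by passing lists in params. A sketch of a combined filter (BASE_URL and the tag values are assumptions):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

# Encodes to ?positive_tags=experiment:42&negative_tags=author:me
resp = requests.get(
    f"{BASE_URL}/api/v0/checkpoint/all",
    params={"positive_tags": ["experiment:42"], "negative_tags": ["author:me"]},
)
checkpoints = resp.json()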


Export trained algorithm (optional)

Endpoint:

  • GET /api/v0/algorithm/{algorithm_name}/{algorithm_major_version}/export

Query params:

  • algorithm_minor_version (optional)

  • checkpoint_id (optional; overrides assets with the checkpoint)

Response:

  • Streaming zip download (application/zip) of the algorithm package.

What algorithm_minor_version means:

  • The minor version is the build number stored for a given algorithm name + major version.

  • Supplying it lets you export a specific build; if you omit it, the latest build is exported.

What checkpoint_id means:

  • A checkpoint is a snapshot of trained assets (typically weights).

  • Supplying checkpoint_id tells the exporter to swap the algorithm’s assets with the checkpoint’s assets before packaging.

  • This is how you get a trained package out of a training run.

Export protection:

  • If the algorithm is marked with "exportable": false in its AlgorithmConfigSchema, the export endpoint returns 403 Forbidden.

What the zip file is:

  • A complete deployable algorithm package: Runner.py, pyproject.toml, and the assets under files/.

  • If checkpoint_id is provided, those asset files come from the checkpoint instead of the original algorithm assets.
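
Since the response is a streaming zip, a client should write it to disk in chunks rather than buffering it in memory. A sketch (BASE_URL, algorithm name, and major version are assumptions):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

resp = requests.get(
    f"{BASE_URL}/api/v0/algorithm/my_algorithm/1/export",
    params={"checkpoint_id": "<checkpoint_id>"},
    stream=True,
)
resp.raise_for_status()
with open("algorithm_package.zip", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        f.write(chunk)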


End-to-end summary

  1. Upload HDF5 files → get file_ids

  2. Create training sample(s) referencing file IDs → get sample_ids

  3. Start training with algorithm ID + sample IDs → get training_id

  4. Poll training record → read status + output_checkpoint_ids

  5. Optionally stop training or fetch checkpoint metadata

  6. Optionally export an algorithm package using checkpoint ID