Client Workflows

The following sections describe the typical Compox client workflows for execution (inference) and training.

Execution workflow

This section covers the client-facing flow for execution in Compox: upload data, start an execution, poll status, and retrieve results. All endpoints and payloads are derived from the current server code under compox/src/compox/routers.


Upload data files

Endpoint:

  • POST /api/v0/files

Behavior (from file_controller.py):

  • The request body is treated as raw bytes and must be a valid HDF5 file.

  • On success, returns { "file_id": "<uuid>" }.

Minimal example:

POST /api/v0/files
Content-Type: application/octet-stream

<HDF5 bytes>

Response:

{ "file_id": "..." }

Notes:

  • The server validates that the uploaded bytes open as HDF5.

  • Files are stored in the data-store bucket/collection.

  • By default, files expire after 1 day (configurable in S3Connection).
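
As a concrete illustration, the upload step can be scripted with a small Python client. This is a minimal sketch using the requests library; BASE_URL and the local file name are assumptions to adapt to your deployment:

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

# Upload raw HDF5 bytes; the server validates that they open as HDF5.
with open("scan.h5", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/v0/files",
        data=f.read(),
        headers={"Content-Type": "application/octet-stream"},
    )
resp.raise_for_status()
file_id = resp.json()["file_id"]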


Execute an algorithm

Endpoint:

  • POST /api/v0/execute-algorithm

Payload model: IncomingExecutionRequest

  • algorithm_id: string

  • input_dataset_ids: list of file IDs

  • checkpoint_id: optional checkpoint to load assets from

  • algorithm_minor_version: optional minor version to execute

  • execution_device_override: optional device override (e.g. "cpu", "cuda:0")

  • additional_parameters: dict (free-form)

  • session_token: optional session identifier

Example:

{
  "algorithm_id": "<algorithm_id>",
  "input_dataset_ids": ["<file_id_1>", "<file_id_2>"],
  "checkpoint_id": null,
  "algorithm_minor_version": null,
  "execution_device_override": null,
  "additional_parameters": {
    "threshold": 0.5,
    "tile_size": 512
  },
  "session_token": null
}

Response:

{ "execution_id": "..." }

Validation behavior (from execution_controller.py):

  • All referenced input file IDs must exist in data-store.

  • The algorithm ID must exist in algorithm-store (via find_algorithm_by_id).

Execution mode:

  • If inference.backend_settings.executor is fastapi_background_tasks, execution runs via execution_task_fastapi.

  • If the executor is celery, the request is queued as a Celery task (task).

Progress/status details (from TaskHandler):

  • Valid statuses are PENDING, STARTED, RUNNING, COMPLETED, FAILED, STOPPED.

  • execution_controller.py creates the record with status="PENDING" and progress=0.0.

  • TaskHandler.mark_as_completed() sets progress=1.0, time_completed, output_dataset_ids, and status="COMPLETED".

  • TaskHandler.mark_as_failed() sets status="FAILED", progress=1.0, clears output_dataset_ids, and stores the exception in the log.

  • If a stop request is posted, TaskHandler._check_for_stop_request() acknowledges it and calls mark_as_stopped(), which sets status="STOPPED" and raises TaskStoppedException.
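
Putting the endpoint and payload together, here is a minimal Python sketch of starting an execution (BASE_URL is assumed, as in the upload sketch; IDs and parameter values are illustrative):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

payload = {
    "algorithm_id": "<algorithm_id>",
    "input_dataset_ids": ["<file_id_1>", "<file_id_2>"],
    "checkpoint_id": None,
    "algorithm_minor_version": None,
    "execution_device_override": None,
    "additional_parameters": {"threshold": 0.5, "tile_size": 512},
    "session_token": None,
}
resp = requests.post(f"{BASE_URL}/api/v0/execute-algorithm", json=payload)
resp.raise_for_status()
execution_id = resp.json()["execution_id"]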


Sessions (optional)

The execution API supports an optional session_token that can be used to share an in‑memory cache across multiple executions.

Behavior (from TaskSession and TaskHandler):

  • If you omit session_token and the backend uses FastAPI background tasks, a new session token is generated and stored in the execution record.

  • You can retrieve it via GET /api/v0/executions/{execution_id} and pass it in subsequent executions to reuse the cache.

  • Sessions are in‑memory only (single process), expire after ~24 hours, and are capped in size/number of caches.

  • Celery does not support sessions: if you attempt to use session features in Celery mode, a NotImplementedError is raised internally and session_token will be None in the execution record.

Practical guidance:

  • Use sessions only when running with fastapi_background_tasks.

  • Treat sessions as a performance optimization (e.g., caching model intermediates), not a persistent store.
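
A sketch of session reuse under these assumptions (FastAPI background tasks backend; execution_id comes from a previous execution, BASE_URL as above):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address
execution_id = "<execution_id>"     # from a previous execution

# Read back the generated session token from the execution record...
record = requests.get(f"{BASE_URL}/api/v0/executions/{execution_id}").json()
session_token = record["session_token"]

# ...and pass it to the next execution to reuse the in-memory cache.
payload = {
    "algorithm_id": "<algorithm_id>",
    "input_dataset_ids": ["<file_id_2>"],
    "checkpoint_id": None,
    "algorithm_minor_version": None,
    "execution_device_override": None,
    "additional_parameters": {},
    "session_token": session_token,
}
requests.post(f"{BASE_URL}/api/v0/execute-algorithm", json=payload)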


Check execution status

Endpoint:

  • GET /api/v0/executions/{execution_id}

Response model: ExecutionRecord

  • Includes status, progress, log, and output_dataset_ids.

Example response (shape):

{
  "execution_id": "...",
  "algorithm_id": "...",
  "status": "RUNNING",
  "progress": 0.3,
  "time_started": "...",
  "time_completed": "",
  "log": "",
  "input_dataset_ids": ["<file_id_1>", "<file_id_2>"],
  "output_dataset_ids": [],
  "execution_device_override": null,
  "additional_parameters": {},
  "session_token": null,
  "checkpoint_id": null,
  "algorithm_minor_version": null
}

Notes:

  • output_dataset_ids is the key field for downstream retrieval of results.
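
A typical client polls this endpoint until a terminal status is reached. A minimal sketch (the polling interval is a client-side choice; BASE_URL is assumed):

import time
import requests

BASE_URL = "http://localhost:8000"  # assumed server address
execution_id = "<execution_id>"

TERMINAL = {"COMPLETED", "FAILED", "STOPPED"}
while True:
    record = requests.get(f"{BASE_URL}/api/v0/executions/{execution_id}").json()
    print(f"status={record['status']} progress={record['progress']:.0%}")
    if record["status"] in TERMINAL:
        break
    time.sleep(2)

output_dataset_ids = record["output_dataset_ids"]  # populated on COMPLETED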


Stop execution (optional)

Endpoint:

  • POST /api/v0/executions/{execution_id}/stop

Behavior:

  • Only executions in PENDING, STARTED, or RUNNING status can be stopped.

  • A stop request is posted to stop-requests, which the task checks.
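
In client code, requesting a stop is a single POST; the running task notices the stop request at its next check and transitions to STOPPED. A sketch (BASE_URL assumed):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address
execution_id = "<execution_id>"

# Posts a stop request, which the task acknowledges cooperatively.
requests.post(f"{BASE_URL}/api/v0/executions/{execution_id}/stop")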


Retrieve output datasets

Endpoint:

  • GET /api/v0/files/{file_id}

Behavior:

  • Returns the raw HDF5 bytes for the given file ID. Call it once for each ID listed in output_dataset_ids.
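
A sketch of downloading every output dataset to disk (BASE_URL and the output file names are assumptions):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address
output_dataset_ids = ["<file_id>"]  # from the execution record

for i, fid in enumerate(output_dataset_ids):
    resp = requests.get(f"{BASE_URL}/api/v0/files/{fid}")
    resp.raise_for_status()
    with open(f"output_{i}.h5", "wb") as f:
        f.write(resp.content)  # raw HDF5 bytes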


Delete files (optional)

Endpoint:

  • DELETE /api/v0/files/{file_id}

Behavior:

  • Deletes a file from data-store immediately.

Notes:

  • Files already expire automatically (default 1 day), but you may want to delete them earlier to free up storage.


End-to-end summary

  1. Upload HDF5 files → get file_ids

  2. Execute algorithm with algorithm_id + input_dataset_ids → get execution_id

  3. Optionally reuse a session_token across executions (FastAPI background tasks only)

  4. Poll execution record → read status + output_dataset_ids (and session_token if used)

  5. Optionally stop execution

  6. Download each output dataset by ID

  7. Optionally delete files to free up storage early

Training workflow

This section covers the client-facing flow for training in Compox: upload data, create training samples, start training, and retrieve results. All endpoints and payloads are derived from the current server code under compox/src/compox/routers.


Upload data files

Data files are uploaded exactly as in the execution workflow above: POST /api/v0/files with raw HDF5 bytes in the request body returns { "file_id": "<uuid>" }. The same validation and expiry rules (and the Python upload sketch above) apply.


Create training samples

Endpoint:

  • POST /api/v0/sample

Payload model: IncomingSampleRequest (see pydantic_models.py)

  • files: list of dicts mapping arbitrary keys to lists of file IDs

  • tags: list of strings (optional)

Example:

{
  "files": [
    { "input": ["<file_id_1>", "<file_id_2>"], "target": ["<file_id_3>"] }
  ],
  "tags": ["modality:ct", "anatomy:brain", "author:me"]
}

Response:

{ "sample_id": "..." }

Validation behavior:

  • Each referenced file ID must exist in data-store, otherwise the API returns 404.

Related endpoints:

  • GET /api/v0/sample/{sample_id} returns the stored sample record

  • GET /api/v0/sample/all?positive_tags=...&negative_tags=... filters by tags

  • DELETE /api/v0/sample/{sample_id} deletes a sample

Notes:

  • Files referenced by samples are not copied; the sample record just points to existing files.

  • Files referenced by a sample do not expire as long as the sample exists.
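
A minimal Python sketch of creating a sample (BASE_URL is assumed; the file IDs come from the upload step):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

payload = {
    "files": [
        {"input": ["<file_id_1>", "<file_id_2>"], "target": ["<file_id_3>"]}
    ],
    "tags": ["modality:ct", "anatomy:brain"],
}
resp = requests.post(f"{BASE_URL}/api/v0/sample", json=payload)
resp.raise_for_status()
sample_id = resp.json()["sample_id"]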


Start training

Endpoint:

  • POST /api/v0/train-algorithm

Payload model: IncomingTrainingRequest

  • algorithm_id: string

  • training_data: list of sample IDs

  • checkpoint_id: optional checkpoint to start from

  • algorithm_minor_version: optional minor version string

  • tags: list of strings

  • additional_parameters: dict (free-form)

Example:

{
  "algorithm_id": "<algorithm_id>",
  "training_data": ["<sample_id>"],
  "checkpoint_id": null,
  "algorithm_minor_version": null,
  "tags": ["experiment:42", "author:me"],
  "additional_parameters": {
    "learning_rate": 0.001,
    "batch_size": 4,
    "num_epochs": 10
  }
}

Response:

{ "training_id": "..." }

Validation behavior (from training_controller.py):

  • All referenced sample IDs must exist in sample-store.

  • All files referenced by those samples must exist in data-store.

  • The algorithm ID must exist in algorithm-store (via find_algorithm_by_id).

Execution mode:

  • If inference.backend_settings.executor is fastapi_background_tasks, training runs via training_task_fastapi.

  • If the executor is celery, the request is queued as a Celery task (training_task).

Progress/status details (from TrainingHandler and TaskHandler):

  • Status transitions are written to training-store via TrainingHandler.status (inherited from TaskHandler).

  • Valid statuses are PENDING, STARTED, RUNNING, COMPLETED, FAILED, STOPPED.

  • training_controller.py creates the record with status="PENDING" and progress=0.0.

  • TrainingHandler.mark_as_completed() sets progress=1.0, time_completed, updates the log, and sets status="COMPLETED".

  • If a stop request is posted, TaskHandler._check_for_stop_request() acknowledges it and calls mark_as_stopped(), which sets status="STOPPED" and raises TaskStoppedException.

  • TaskHandler.mark_as_failed() sets status="FAILED", progress=1.0, clears output_dataset_ids, and stores the exception in the log.
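
A minimal Python sketch of starting a training run (BASE_URL is assumed; IDs and hyperparameters are illustrative):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

payload = {
    "algorithm_id": "<algorithm_id>",
    "training_data": ["<sample_id>"],
    "checkpoint_id": None,
    "algorithm_minor_version": None,
    "tags": ["experiment:42", "author:me"],
    "additional_parameters": {
        "learning_rate": 0.001,
        "batch_size": 4,
        "num_epochs": 10,
    },
}
resp = requests.post(f"{BASE_URL}/api/v0/train-algorithm", json=payload)
resp.raise_for_status()
training_id = resp.json()["training_id"]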


Check training status

Endpoint:

  • GET /api/v0/training/{training_id}

Response model: TrainingRecord

  • Includes status, progress, log, output_checkpoint_ids, etc.

Example response (shape):

{
  "training_id": "...",
  "algorithm_id": "...",
  "status": "RUNNING",
  "progress": 0.3,
  "time_started": "...",
  "time_completed": null,
  "log": "",
  "training_data": ["<sample_id>"],
  "state": {},
  "tags": ["experiment:42"],
  "checkpoint_id": null,
  "algorithm_minor_version": null,
  "output_checkpoint_ids": []
}

Notes:

  • output_checkpoint_ids is the key field for downstream retrieval of training results.
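
Polling works the same way as for executions; a compact sketch that waits for a terminal status and collects the checkpoint IDs (BASE_URL is assumed):

import time
import requests

BASE_URL = "http://localhost:8000"  # assumed server address
training_id = "<training_id>"

while True:
    record = requests.get(f"{BASE_URL}/api/v0/training/{training_id}").json()
    if record["status"] in {"COMPLETED", "FAILED", "STOPPED"}:
        break
    time.sleep(5)

checkpoint_ids = record["output_checkpoint_ids"]  # populated on COMPLETED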


Stop training (optional)

Endpoint:

  • POST /api/v0/training/{training_id}/stop

Behavior:

  • Only training runs in PENDING, STARTED, or RUNNING status can be stopped.

  • A stop request is posted to stop-requests, which the training task checks.


Retrieve training results (checkpoints)

Endpoint:

  • GET /api/v0/checkpoint/{checkpoint_id}

  • GET /api/v0/checkpoint/all (filtering supported by query params; see checkpoint_controller.py)

Results:

  • Checkpoint metadata is returned (not the model bytes directly).

Checkpoint behavior (from TrainingHandler.save_checkpoint()):

  • save_checkpoint() validates that every asset path exists in the algorithm’s assets.

  • New assets are stored in asset-store, and a new checkpoint_id is created.

  • A checkpoint manifest is saved in algorithm-checkpoint-store.

  • The new checkpoint ID is appended to output_checkpoint_ids in the training handler, so it appears in the training record.

Checkpoint metadata shape (from AlgorithmCheckpointRecord):

{
  "checkpoint_id": "...",
  "training_id": "...",
  "parent_algorithm_id": "...",
  "created_at": "...",
  "properties": {},
  "tags": [],
  "parent_checkpoint_id": null
}

Details on properties and tags:

  • properties is a free-form dictionary provided by the algorithm when it calls save_checkpoint(assets, properties). This is the place to store metrics, hyperparameters, dataset IDs, evaluation scores, or any other metadata you want to query later.

  • tags are inherited from the training run tags (IncomingTrainingRequest.tags). When the checkpoint is created, TrainingHandler.save_checkpoint() copies the training record’s tags into the checkpoint manifest.

  • parent_checkpoint_id is copied from the training record’s checkpoint_id, so you can track lineage if you trained from an existing checkpoint.

Filtering by tags:

  • GET /api/v0/checkpoint/all?positive_tags=tag1&positive_tags=tag2

  • GET /api/v0/checkpoint/all?negative_tags=tag_to_exclude

  • You can combine both positive_tags and negative_tags in the same request.
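
With the requests library, repeated query parameters are expressed by passing lists in params. A sketch of a combined filter (BASE_URL and the tag values are assumptions):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

# Encodes to ?positive_tags=experiment:42&negative_tags=author:me
resp = requests.get(
    f"{BASE_URL}/api/v0/checkpoint/all",
    params={"positive_tags": ["experiment:42"], "negative_tags": ["author:me"]},
)
checkpoints = resp.json()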


Export trained algorithm (optional)

Endpoint:

  • GET /api/v0/algorithm/{algorithm_name}/{algorithm_major_version}/export

Query params:

  • algorithm_minor_version (optional)

  • checkpoint_id (optional; overrides assets with the checkpoint)

Response:

  • Streaming zip download (application/zip) of the algorithm package.

What algorithm_minor_version means:

  • The minor version is the build number stored for a given algorithm name + major version.

  • Supplying it lets you export a specific build; if you omit it, the latest build is exported.

What checkpoint_id means:

  • A checkpoint is a snapshot of trained assets (typically weights).

  • Supplying checkpoint_id tells the exporter to swap the algorithm’s assets with the checkpoint’s assets before packaging.

  • This is how you get a trained package out of a training run.

Export protection:

  • If the algorithm is marked with "exportable": false in its AlgorithmConfigSchema, the export endpoint returns 403 Forbidden.

What the zip file is:

  • A complete deployable algorithm package: Runner.py, pyproject.toml, and the assets under files/.

  • If checkpoint_id is provided, those asset files come from the checkpoint instead of the original algorithm assets.
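
Since the response is a streaming zip, a client should write it to disk in chunks rather than buffering it in memory. A sketch (BASE_URL, algorithm name, and major version are assumptions):

import requests

BASE_URL = "http://localhost:8000"  # assumed server address

resp = requests.get(
    f"{BASE_URL}/api/v0/algorithm/my_algorithm/1/export",
    params={"checkpoint_id": "<checkpoint_id>"},
    stream=True,
)
resp.raise_for_status()
with open("algorithm_package.zip", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        f.write(chunk)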


End-to-end summary

  1. Upload HDF5 files → get file_ids

  2. Create training sample(s) referencing file IDs → get sample_ids

  3. Start training with algorithm ID + sample IDs → get training_id

  4. Poll training record → read status + output_checkpoint_ids

  5. Optionally stop training or fetch checkpoint metadata

  6. Optionally export an algorithm package using checkpoint ID