Running a pipeline

Prerequisites

Always

  • Python 3.10+ with the package installed (pip install biocomposer).

  • An LLM API key, because connectors are generated by a language model. Supply one of GEMINI_API_KEY / GOOGLE_API_KEY / ANTHROPIC_API_KEY / OPENAI_API_KEY (see Configuration below).

  • The bv registry client, which fetches tool manifests and images:

    cargo install biov      # needs Rust (https://rustup.rs)
    

Local runs only

  • Docker, installed and running. Each tool runs in a Docker container, so if the daemon is down the run fails immediately. Start Docker Desktop (Mac/Windows) or run sudo systemctl start docker (Linux).

Cloud runs only

  • Modal, configured once with pip install modal && modal setup. On Modal you don’t need Docker or bv locally; the sandbox provides them.
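
The prerequisites above can be checked before a run. The following is a minimal sketch and not part of biocomposer: the function name missing_prerequisites is hypothetical, and it assumes the registry client is installed as an executable named bv and the Modal CLI as modal. It only checks that the executables are on PATH and that a key is present in the given environment. It does not verify that the Docker daemon is running, and it does not read a .env file.

```python
import os
import shutil

# Key names taken from the prerequisites list above.
KEY_NAMES = ("GEMINI_API_KEY", "GOOGLE_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def missing_prerequisites(cloud=False, env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means none found."""
    env = os.environ if env is None else env
    problems = []
    if not any(env.get(name) for name in KEY_NAMES):
        problems.append("no LLM API key set (one of: " + ", ".join(KEY_NAMES) + ")")
    # Assumption: local runs need `docker` and `bv` on PATH, cloud runs need `modal`.
    for exe in (("modal",) if cloud else ("docker", "bv")):
        if which(exe) is None:
            problems.append(f"executable not found on PATH: {exe}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

Pass cloud=True to check the Modal setup instead of the local one.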

Project layout

The CLI expects two folders in your working directory:

inputs/            # your FASTA / PDB / parameter files
run/
    run_pipeline.py   # your pipeline script

On a cloud run the inputs/ folder is uploaded and appears inside the sandbox at /vol/inputs/, which is why scripts reference inputs as /vol/inputs/db.fasta rather than a local path.
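
To make the path mapping concrete, here is a small sketch that converts a file under the local inputs/ folder into the path a script would reference inside the sandbox. The helper sandbox_path is hypothetical, and it assumes that subfolders of inputs/ keep their structure when uploaded, which the text above does not confirm.

```python
from pathlib import Path, PurePosixPath

SANDBOX_INPUTS = PurePosixPath("/vol/inputs")  # mount point stated above


def sandbox_path(local_file, inputs_dir="inputs"):
    """Map a file under the local inputs/ folder to its path inside the sandbox."""
    # relative_to raises ValueError if the file is not under inputs_dir.
    relative = Path(local_file).relative_to(inputs_dir)
    return str(SANDBOX_INPUTS.joinpath(*relative.parts))


print(sandbox_path("inputs/db.fasta"))
```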

Writing a run script

A run script is plain Python that builds a graph and executes it.

1. Import and create the graph.

import json
from biocomposer import Graph

g = Graph()

2. Add inputs, tools, and edges. Input-node keys must match the tool’s declared input names; read them with bv show <tool> --format json (see Keys must match the downstream tool). Use /vol/inputs/... paths for files you placed in inputs/.

seqs  = g.add_input_node(sequences="/vol/inputs/family.fasta")
align = g.add_node("clustalo")
g.add_edge((seqs, align))

3. Set the output and execute. execute() returns one result per output node.

g.set_output_node(align)
print(json.dumps(g.execute(), indent=2, default=str))

See Graph, Nodes, and Example pipelines for the full vocabulary.

Configuration

API keys. Connector generation calls an LLM, so at least one provider key is required. Create a .env file in your working directory, for example:

printf 'GEMINI_API_KEY=your-key-here\n' > .env

Any one of these keys is enough:

GEMINI_API_KEY=...        # Google Gemini
ANTHROPIC_API_KEY=...     # Anthropic Claude
OPENAI_API_KEY=...        # OpenAI

Point the CLI at the file with --env (it defaults to .env in the current directory):

biocomp run run/run_pipeline.py --env .env
biocomp run run/run_pipeline.py --env secrets/keys.env   # a different path

On a cloud run, the keys in this file are passed into the Modal sandbox as secrets. If no key is found, the run still starts, but every LLM-backed step (that is, every connector) fails, so set one before running. Run biocomp secrets for setup guidance.
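
Because a missing key only surfaces once a connector step fails, it can help to inspect the env file first. The sketch below is an illustration, not biocomposer's own loader: read_env_text and provider_keys_in are hypothetical helpers, and the parser handles only plain KEY=value lines with optional trailing comments. The real CLI may parse the file differently.

```python
from pathlib import Path

PROVIDER_KEYS = ("GEMINI_API_KEY", "GOOGLE_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def read_env_text(text):
    """Parse simple KEY=value lines, skipping blanks, comments, and trailing '# ...' notes."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment" and any surrounding quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def provider_keys_in(path=".env"):
    """Return the names of provider keys that have a non-empty value in the file."""
    file = Path(path)
    found = read_env_text(file.read_text()) if file.is_file() else {}
    return [name for name in PROVIDER_KEYS if found.get(name)]
```

An empty list from provider_keys_in(".env") means no usable key was found in that file.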

Which LLM is used. To pin the provider and model from inside a script, call Graph.set_llm() before execute():

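# Pick one of these calls. api_key is a string holding the key for that provider,
# which you must define yourself before this line.
g.set_llm("anthropic", "claude-haiku-4-5", api_key)
g.set_llm("openai", "gpt-4o-mini", api_key)
g.set_llm("google", "gemini-2.5-flash-lite", api_key)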
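
One way to obtain the arguments for set_llm() is to pick them from whichever key is set. This is a sketch under assumptions: choose_llm is a hypothetical helper, the mapping from environment variable to provider name is inferred from the key list and the calls above, and it assumes the key is visible in the process environment when the script runs, which the text does not state.

```python
import os

# (environment variable, provider, model). Provider and model strings are the ones
# used in the set_llm() calls above; the variable-to-provider mapping is an assumption.
CANDIDATES = (
    ("ANTHROPIC_API_KEY", "anthropic", "claude-haiku-4-5"),
    ("OPENAI_API_KEY", "openai", "gpt-4o-mini"),
    ("GEMINI_API_KEY", "google", "gemini-2.5-flash-lite"),
    ("GOOGLE_API_KEY", "google", "gemini-2.5-flash-lite"),
)


def choose_llm(env=None):
    """Return (provider, model, api_key) for the first key that is set."""
    env = os.environ if env is None else env
    for variable, provider, model in CANDIDATES:
        api_key = env.get(variable)
        if api_key:
            return provider, model, api_key
    raise RuntimeError("no provider API key found in the environment")


# Usage inside a run script, where g is the Graph created earlier:
#     g.set_llm(*choose_llm())
```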

Running it

Locally, tools run in Docker on your machine; outputs go to ./results/:

biocomp run run/run_pipeline.py --env .env

On the cloud (Modal), which is intended for GPU tools such as RFdiffusion and ColabFold, biocomposer uploads your script, inputs/, and the package to a Modal sandbox, runs there, and stores results on a persistent volume named biocomp:

biocomp run --modal run/run_pipeline.py --env .env

Retrieve a result file from the volume:

modal volume get biocomp results/<tool>_output_1/stdout/<file> ./<file>
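
When fetching several files, the command above can be assembled programmatically. The sketch below only builds the argument list. volume_get_command is a hypothetical helper, the file name alignment.fasta is a placeholder, and treating the _1 suffix as a configurable index is an assumption, since the text shows only _output_1.

```python
import shlex

VOLUME = "biocomp"  # volume name given above


def volume_get_command(tool, filename, run_index=1, dest="."):
    """Build the `modal volume get` argument list for one result file."""
    remote = f"results/{tool}_output_{run_index}/stdout/{filename}"
    local = f"{dest.rstrip('/')}/{filename}"
    return ["modal", "volume", "get", VOLUME, remote, local]


cmd = volume_get_command("clustalo", "alignment.fasta")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```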

Note

The sandbox is temporary, but the volume persists. Tool images are cached on it, so the second run of a heavy tool skips the multi-gigabyte re-download.

Flags

Flag             Scope   Meaning
--env <path>     both    env file with API keys (default .env)
--clean          both    delete previous results/ and temp files first
--modal          both    run on Modal instead of locally
--gpu <type>     modal   GPU type (default A10G; e.g. T4, A100)
--memory <MB>    modal   sandbox memory in MB (default 8192)
--shell          modal   drop into an interactive shell in the sandbox (debugging)