Running a pipeline¶

Prerequisites¶

Always

Python 3.10+ with the package installed (pip install biocomposer).
An LLM API key, because connectors are generated by a language model. Supply one of GEMINI_API_KEY / GOOGLE_API_KEY / ANTHROPIC_API_KEY / OPENAI_API_KEY (see Configuration below).

The bv registry client, which fetches tool manifests and images:

cargo install biov      # needs Rust (https://rustup.rs)

Local runs only

Docker, installed and running. Each tool runs in a Docker container; if the daemon is down, the run fails immediately. Start Docker Desktop (Mac/Windows) or run sudo systemctl start docker (Linux).
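
A preflight check along these lines can catch a missing prerequisite before a local run starts. This is only a sketch: preflight is a hypothetical helper, not part of biocomposer, and it assumes that the registry client is on PATH as bv and that docker info exits non-zero when the daemon is unreachable.

```python
import shutil
import subprocess
import sys


def preflight() -> dict[str, bool]:
    """Report which local-run prerequisites appear to be satisfied."""
    checks = {
        # The prerequisites above ask for Python 3.10 or newer.
        "python>=3.10": sys.version_info >= (3, 10),
        # Assumption: the registry client is installed on PATH as `bv`.
        "bv on PATH": shutil.which("bv") is not None,
        "docker on PATH": shutil.which("docker") is not None,
        "docker daemon up": False,
    }
    if checks["docker on PATH"]:
        try:
            # Assumption: `docker info` exits non-zero if the daemon is down.
            result = subprocess.run(
                ["docker", "info"], capture_output=True, timeout=30
            )
            checks["docker daemon up"] = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            checks["docker daemon up"] = False
    return checks


for name, ok in preflight().items():
    print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

On a Modal-only setup the bv and Docker rows can be ignored, since the sandbox provides both.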

Cloud runs only

Modal, configured once with pip install modal && modal setup. On Modal you don’t need Docker or bv locally; the sandbox provides them.

Project layout¶

The CLI expects two folders in your working directory:

inputs/            # your FASTA / PDB / parameter files
run/
    run_pipeline.py   # your pipeline script

On a cloud run the inputs/ folder is uploaded and appears inside the sandbox at /vol/inputs/, which is why scripts reference inputs as /vol/inputs/db.fasta rather than a local path.
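
The mapping from a local file to its sandbox path can be sketched as follows. Both helpers are hypothetical (they are not part of biocomposer), and the sketch assumes that a file directly under inputs/ keeps its name under /vol/inputs/; how nested subfolders are handled is not covered here.

```python
from pathlib import Path, PurePosixPath

SANDBOX_INPUTS = PurePosixPath("/vol/inputs")  # mount point named above


def missing_layout(workdir: Path, script: str = "run/run_pipeline.py") -> list[str]:
    """Return the expected paths that are absent from workdir."""
    expected = ["inputs", script]
    return [p for p in expected if not (workdir / p).exists()]


def sandbox_path(local_file: str, inputs_dir: str = "inputs") -> str:
    """Map inputs/<name> to the path a run script should reference."""
    relative = Path(local_file).relative_to(inputs_dir)
    return str(SANDBOX_INPUTS / relative.as_posix())


# An empty list means both inputs/ and run/run_pipeline.py were found.
print(missing_layout(Path.cwd()))
print(sandbox_path("inputs/db.fasta"))
```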

Writing a run script¶

A run script is plain Python that builds a graph and executes it.

1. Import and create the graph.

import json
from biocomposer import Graph

g = Graph()

2. Add inputs, tools, and edges. Input-node keys must match the tool’s declared input names; read them with bv show <tool> --format json (see Keys must match the downstream tool). Use /vol/inputs/... paths for files you placed in inputs/.

seqs  = g.add_input_node(sequences="/vol/inputs/family.fasta")
align = g.add_node("clustalo")
g.add_edge((seqs, align))

3. Set the output and execute. execute() returns one result per output node.

g.set_output_node(align)
print(json.dumps(g.execute(), indent=2, default=str))

See Graph, Nodes, and Example pipelines for the full vocabulary.
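
Because a mistyped input key is an easy mistake in step 2, a small check like the one below may help. It is a sketch under assumptions: show_manifest simply wraps the bv show command quoted above, the layout of the JSON it returns is not described here, and so the declared names are copied in by hand. Both helper names and the example key set are hypothetical.

```python
import json
import subprocess


def show_manifest(tool: str) -> object:
    """Return the parsed output of `bv show <tool> --format json`."""
    out = subprocess.run(
        ["bv", "show", tool, "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)


def check_keys(declared: set[str], provided: set[str]) -> tuple[set[str], set[str]]:
    """Return (keys the tool does not declare, declared inputs left unset)."""
    return provided - declared, declared - provided


# Hypothetical declared names, copied by hand from the manifest output.
declared = {"sequences"}
unknown, unset = check_keys(declared, {"sequence"})  # note the typo
print("unknown keys:", sorted(unknown))
print("unset inputs:", sorted(unset))
```

Calling show_manifest("clustalo") requires bv to be installed, so the example above only exercises the pure comparison.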

Configuration¶

API keys. Connector generation calls an LLM, so at least one provider key is required. Create a .env file in your working directory, for example:

printf 'GEMINI_API_KEY=your-key-here\n' > .env

Any one of these keys is enough:

GEMINI_API_KEY=...        # Google Gemini
ANTHROPIC_API_KEY=...     # Anthropic Claude
OPENAI_API_KEY=...        # OpenAI

Point the CLI at the file with --env (it defaults to .env in the current directory):

biocomp run run/run_pipeline.py --env .env
biocomp run run/run_pipeline.py --env secrets/keys.env   # a different path

On a cloud run, the keys in this file are passed into the Modal sandbox as secrets. If no key is found, the run still starts, but every LLM-backed step (that is, every connector) fails, so set one before running. biocomp secrets prints setup guidance.
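
Since a missing key only surfaces once a connector runs, it can be worth checking the env file up front. The parser below is a minimal sketch that handles plain KEY=value lines; it is not biocomposer's own loader, whose parsing rules may differ, and the helper names are hypothetical.

```python
PROVIDER_KEYS = (
    "GEMINI_API_KEY",
    "GOOGLE_API_KEY",
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment, then surrounding quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def provider_keys_set(text: str) -> list[str]:
    """List the provider keys that have a non-empty value."""
    env = parse_env(text)
    return [k for k in PROVIDER_KEYS if env.get(k)]


sample = "GEMINI_API_KEY=your-key-here\n# OPENAI_API_KEY=unused\n"
print(provider_keys_set(sample))
```

For a real file, pass the result of Path(".env").read_text() instead of the sample string; an empty list means no provider key was found.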

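Which LLM is used. To pin the provider and model from inside a script, call Graph.set_llm() before execute(), passing the provider name, the model name, and that provider’s API key (api_key below is a placeholder for the key string). Use one of: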

g.set_llm("anthropic", "claude-haiku-4-5", api_key)
g.set_llm("openai", "gpt-4o-mini", api_key)
g.set_llm("google", "gemini-2.5-flash-lite", api_key)
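
One way to choose the set_llm() arguments is to look at which key is available. The sketch below is illustrative only: pick_llm is a hypothetical helper, the priority order is arbitrary, and it assumes the keys are visible in the script's process environment, which is not confirmed here.

```python
import os

# (environment variable, provider name, model), using the names shown above.
CANDIDATES = [
    ("ANTHROPIC_API_KEY", "anthropic", "claude-haiku-4-5"),
    ("OPENAI_API_KEY", "openai", "gpt-4o-mini"),
    ("GEMINI_API_KEY", "google", "gemini-2.5-flash-lite"),
    ("GOOGLE_API_KEY", "google", "gemini-2.5-flash-lite"),
]


def pick_llm(env=None):
    """Return (provider, model, api_key) for the first key set, else None."""
    env = os.environ if env is None else env
    for var, provider, model in CANDIDATES:
        key = env.get(var)
        if key:
            return provider, model, key
    return None


choice = pick_llm({"OPENAI_API_KEY": "sk-example"})
print(choice)
# In a run script, before g.execute():
#     if choice is not None:
#         g.set_llm(*choice)
```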

Running it¶

Locally, tools run in Docker on your machine; outputs go to ./results/:

biocomp run run/run_pipeline.py --env .env

On the cloud (Modal), intended for GPU tools such as RFdiffusion and ColabFold: biocomposer uploads your script, inputs/, and the package to a Modal sandbox, runs there, and stores results on a persistent volume named biocomp:

biocomp run --modal run/run_pipeline.py --env .env

Retrieve a result file from the volume:

modal volume get biocomp results/<tool>_output_1/stdout/<file> ./<file>
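
When several files need to be fetched, the command above can be assembled in a script. This is a sketch that follows the path pattern shown; the helper names are hypothetical, and the tool and file names in the example are placeholders rather than real outputs.

```python
import subprocess

VOLUME = "biocomp"  # persistent volume name given above


def volume_get_command(tool: str, filename: str, dest: str = ".") -> list[str]:
    """Build the `modal volume get` argv for one result file."""
    remote = f"results/{tool}_output_1/stdout/{filename}"
    local = f"{dest.rstrip('/')}/{filename}"
    return ["modal", "volume", "get", VOLUME, remote, local]


def download(tool: str, filename: str, dest: str = ".") -> None:
    """Run the command; needs the Modal CLI installed and configured."""
    subprocess.run(volume_get_command(tool, filename, dest), check=True)


# Placeholder names for illustration only.
cmd = volume_get_command("clustalo", "aligned.fasta")
print(" ".join(cmd))
```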

Note

The sandbox is temporary, but the volume persists. Tool images are cached on it, so the second run of a heavy tool skips the multi-gigabyte re-download.

Flags¶

Flag	Scope	Meaning
`--env <path>`	both	env file with API keys (default `.env`)
`--clean`	both	delete previous `results/` and temp files first
`--modal`	both	run on Modal instead of locally
`--gpu <type>`	modal	GPU type (default `A10G`; e.g. `T4`, `A100`)
`--memory <MB>`	modal	sandbox memory in MB (default `8192`)
`--shell`	modal	drop into an interactive shell in the sandbox (debugging)
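
The flags can also be assembled programmatically, for example when launching runs from another script. The helper below is a hypothetical sketch: it rejects the Modal-only flags on local runs as listed in the table, and it places flags in the same order as the examples above, without claiming that other orders are rejected by the CLI.

```python
import shlex


def biocomp_run(script, env=".env", clean=False, modal=False,
                gpu=None, memory_mb=None, shell=False):
    """Build a `biocomp run` argv from the flags in the table."""
    if not modal and (gpu is not None or memory_mb is not None or shell):
        raise ValueError("--gpu, --memory and --shell apply to Modal runs only")
    argv = ["biocomp", "run"]
    if modal:
        argv.append("--modal")
    argv += [script, "--env", env]
    if clean:
        argv.append("--clean")
    if gpu is not None:
        argv += ["--gpu", gpu]
    if memory_mb is not None:
        argv += ["--memory", str(memory_mb)]
    if shell:
        argv.append("--shell")
    return argv


print(shlex.join(biocomp_run("run/run_pipeline.py")))
print(shlex.join(biocomp_run("run/run_pipeline.py", modal=True,
                             gpu="A100", memory_mb=16384)))
```

Pass the returned list to subprocess.run(argv, check=True) to launch the run.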