Running a pipeline
Prerequisites
Always
Python 3.10+ with the package installed (pip install biocomposer).
An LLM API key, because connectors are generated by a language model. Supply one of GEMINI_API_KEY / GOOGLE_API_KEY / ANTHROPIC_API_KEY / OPENAI_API_KEY (see Configuration below).
The bv registry client, which fetches tool manifests and images:
cargo install biov # needs Rust (https://rustup.rs)
Local runs only
Docker, installed and running. Each tool runs in a Docker container; if the daemon is down, the run fails immediately. Start Docker Desktop (Mac/Windows) or run sudo systemctl start docker (Linux).
Cloud runs only
Modal, configured once with pip install modal && modal setup. On Modal you don’t need Docker or bv locally; the sandbox provides them.
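The checklist above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of biocomposer: the function name missing_prerequisites is hypothetical, and it only tests whether the bv, docker, and modal commands are on your PATH. It cannot tell whether the Docker daemon is actually running or whether Modal has been configured.

```python
import shutil
import sys
from typing import Callable, Optional


def missing_prerequisites(
    modal: bool = False,
    which: Callable[[str], Optional[str]] = shutil.which,
    version: tuple = sys.version_info[:2],
) -> list[str]:
    """Return the names of prerequisites that look missing (hypothetical helper)."""
    missing = []
    if version < (3, 10):
        missing.append("Python 3.10+")
    # Local runs need the bv client and Docker on this machine;
    # Modal runs only need the modal CLI, since the sandbox provides the rest.
    commands = ["modal"] if modal else ["bv", "docker"]
    for command in commands:
        # A command on PATH does not prove the daemon is up or setup is done.
        if which(command) is None:
            missing.append(command)
    return missing


print(missing_prerequisites(modal=False))
```

An empty list means nothing obvious is missing; pass modal=True to check the cloud prerequisites instead.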
Project layout
The CLI expects two folders in your working directory:
inputs/                # your FASTA / PDB / parameter files
run/
    run_pipeline.py    # your pipeline script
On a cloud run the inputs/ folder is uploaded and appears inside the sandbox
at /vol/inputs/, which is why scripts reference inputs as
/vol/inputs/db.fasta rather than a local path.
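To make the path mapping concrete, here is a small sketch that converts a file under the local inputs/ folder into the path a script would reference in the sandbox. The helper name sandbox_path is hypothetical and not part of biocomposer. It also assumes that subfolders of inputs/ keep their relative structure under /vol/inputs/, which the text above does not confirm.

```python
from pathlib import PurePath, PurePosixPath

# Mount point for uploaded inputs, as described in the text above.
SANDBOX_INPUTS = PurePosixPath("/vol/inputs")


def sandbox_path(local_file: str, inputs_dir: str = "inputs") -> str:
    """Map a file under the local inputs/ folder to its sandbox path (hypothetical helper)."""
    # relative_to() raises ValueError if the file is not inside inputs_dir.
    relative = PurePath(local_file).relative_to(inputs_dir)
    return str(SANDBOX_INPUTS / relative.as_posix())


print(sandbox_path("inputs/db.fasta"))  # /vol/inputs/db.fasta
```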
Writing a run script
A run script is plain Python that builds a graph and executes it.
1. Import and create the graph.
import json
from biocomposer import Graph
g = Graph()
2. Add inputs, tools, and edges. Input-node keys must match the tool’s declared
input names; read them with bv show <tool> --format json (see
Keys must match the downstream tool). Use /vol/inputs/... paths for files you placed in
inputs/.
seqs = g.add_input_node(sequences="/vol/inputs/family.fasta")
align = g.add_node("clustalo")
g.add_edge((seqs, align))
3. Set the output and execute. execute() returns one result per output node.
g.set_output_node(align)
print(json.dumps(g.execute(), indent=2, default=str))
See Graph, Nodes, and Example pipelines for the full vocabulary.
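Because a mismatched input key is an easy mistake in step 2, it can help to compare your keys against the tool's declared names before building the graph. The sketch below is only an illustration: check_input_keys is a hypothetical helper, and the declared list is assumed to be copied by hand from the output of bv show <tool> --format json, whose exact layout is not described here.

```python
def check_input_keys(declared: list[str], provided: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every key is declared."""
    problems = []
    for key in provided:
        if key not in declared:
            problems.append(
                f"unknown input {key!r}; declared inputs are {sorted(declared)}"
            )
    return problems


# Assumed for illustration: the name used with clustalo in the example above.
declared = ["sequences"]

print(check_input_keys(declared, {"sequences": "/vol/inputs/family.fasta"}))  # []
print(check_input_keys(declared, {"seqs": "/vol/inputs/family.fasta"}))
```

The second call reports the misspelled key rather than leaving the mismatch to surface at execution time.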
Configuration
API keys. Connector generation calls an LLM, so at least one provider key is
required. Create a .env file in your working directory, for example:
printf 'GEMINI_API_KEY=your-key-here\n' > .env
Any one of these keys works (you only need one):
GEMINI_API_KEY=... # Google Gemini
ANTHROPIC_API_KEY=... # Anthropic Claude
OPENAI_API_KEY=... # OpenAI
Point the CLI at the file with --env (it defaults to .env in the current
directory):
biocomp run run/run_pipeline.py --env .env
biocomp run run/run_pipeline.py --env secrets/keys.env # a different path
On a cloud run, the keys in this file are passed into the Modal sandbox as secrets.
If no key is found, the run still starts, but every LLM-backed step (that is, every
connector) fails, so set one before running. Run biocomp secrets for setup guidance.
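Since a run without a key starts and only fails later, a quick look at the env file beforehand can save time. The following sketch is not how biocomposer loads the file. It is a deliberately simplified KEY=VALUE parser, with the hypothetical name keys_in_env_text, that reports which of the provider key names listed in this guide have a non-empty value.

```python
from pathlib import Path

# Key names mentioned in this guide.
PROVIDER_KEYS = (
    "GEMINI_API_KEY",
    "GOOGLE_API_KEY",
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
)


def keys_in_env_text(text: str) -> list[str]:
    """Return provider key names that have a non-empty value in .env-style text."""
    found = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Drop a trailing " # comment" and surrounding quotes before checking.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        if name.strip() in PROVIDER_KEYS and value:
            found.append(name.strip())
    return found


env_file = Path(".env")
if env_file.exists():
    # Only key names are printed, never the secret values.
    print(keys_in_env_text(env_file.read_text()) or "no provider key set")
```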
Which LLM is used. To pin the provider and model from inside a script,
call Graph.set_llm() before execute():
# Call one of these; api_key is the key string for the chosen provider.
g.set_llm("anthropic", "claude-haiku-4-5", api_key)
g.set_llm("openai", "gpt-4o-mini", api_key)
g.set_llm("google", "gemini-2.5-flash-lite", api_key)
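One way to obtain the api_key value is to read it from the environment and choose the matching provider. The sketch below is an illustration under several assumptions: pick_llm is a hypothetical helper, the preference order is arbitrary, pairing GEMINI_API_KEY with the "google" provider is inferred from the comments in this guide, and the keys are assumed to be visible in os.environ when the script runs.

```python
import os

# (environment variable, provider, model), using the pairs shown above.
# The order of preference is an arbitrary choice for this sketch.
CANDIDATES = [
    ("ANTHROPIC_API_KEY", "anthropic", "claude-haiku-4-5"),
    ("OPENAI_API_KEY", "openai", "gpt-4o-mini"),
    ("GEMINI_API_KEY", "google", "gemini-2.5-flash-lite"),
]


def pick_llm(env=os.environ):
    """Return (provider, model, api_key) for the first key present in env."""
    for variable, provider, model in CANDIDATES:
        api_key = env.get(variable)
        if api_key:
            return provider, model, api_key
    raise RuntimeError("no provider API key found in the environment")


# In a run script, before g.execute():
#     g.set_llm(*pick_llm())
```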
Running it
Locally, tools run in Docker on your machine; outputs go to ./results/:
biocomp run run/run_pipeline.py --env .env
On the cloud (Modal), intended for GPU tools (RFdiffusion, ColabFold), biocomposer
uploads your script, inputs/, and the package to a Modal sandbox, runs the pipeline there,
and stores results on a persistent volume named biocomp:
biocomp run --modal run/run_pipeline.py --env .env
Retrieve a result file from the volume:
modal volume get biocomp results/<tool>_output_1/stdout/<file> ./<file>
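If you fetch results often, the retrieval command can be assembled in Python. This is a minimal sketch with a hypothetical helper name, volume_get_command. It simply fills in the path pattern shown above, keeps the _output_1 suffix exactly as written there, and uses a made-up file name for the example.

```python
import shlex


def volume_get_command(tool: str, filename: str, volume: str = "biocomp") -> list[str]:
    """Build the argv for fetching one result file from the Modal volume."""
    # Path pattern copied from the command above; "_output_1" is kept as shown.
    remote = f"results/{tool}_output_1/stdout/{filename}"
    return ["modal", "volume", "get", volume, remote, f"./{filename}"]


# "aligned.fasta" is a placeholder file name for illustration.
argv = volume_get_command("clustalo", "aligned.fasta")
print(shlex.join(argv))
# To actually run it: subprocess.run(argv, check=True)
```

Building the command as a list keeps each argument separate, so it can be passed to subprocess.run without shell quoting concerns.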
Note
The sandbox is temporary, but the volume persists. Tool images are cached on it, so the second run of a heavy tool skips the multi-gigabyte re-download.
Flags
| Flag | Scope | Meaning |
|---|---|---|
| --env | both | env file with API keys (default .env) |
| | both | delete previous … |
| --modal | both | run on Modal instead of locally |
| | modal | GPU type (default …) |
| | modal | sandbox memory in MB (default …) |
| | modal | drop into an interactive shell in the sandbox (debugging) |