Running a pipeline
==================

Prerequisites
-------------

**Always**

- **Python 3.10+** with the package installed (``pip install biocomposer``).
- **An LLM API key.** Connectors are generated by a language model. Supply one
  of ``GEMINI_API_KEY`` / ``GOOGLE_API_KEY`` / ``ANTHROPIC_API_KEY`` /
  ``OPENAI_API_KEY`` (see `Configuration`_ below).
- **The bv registry client**, which fetches tool manifests and images::

      cargo install biov   # needs Rust (https://rustup.rs)

**Local runs only**

- **Docker**, installed and running. Each tool runs in a Docker container; if
  the daemon is down, the run fails immediately. Start Docker Desktop
  (Mac/Windows) or run ``sudo systemctl start docker`` (Linux).

**Cloud runs only**

- **Modal**, configured once with ``pip install modal && modal setup``. On
  Modal you don't need Docker or ``bv`` locally; the sandbox provides them.

Project layout
--------------

The CLI expects two folders in your working directory:

.. code-block:: text

   inputs/               # your FASTA / PDB / parameter files
   run/
     run_pipeline.py     # your pipeline script

On a cloud run the ``inputs/`` folder is uploaded and appears inside the
sandbox at ``/vol/inputs/``, which is why scripts reference inputs as
``/vol/inputs/db.fasta`` rather than a local path.

Writing a run script
--------------------

A run script is plain Python that builds a graph and executes it.

**1. Import and create the graph.**

.. code-block:: python

   import json
   from biocomposer import Graph

   g = Graph()

**2. Add inputs, tools, and edges.** Input-node keys must match the tool's
declared input names; read them with ``bv show --format json`` (see
:ref:`input-keys`). Use ``/vol/inputs/...`` paths for files you placed in
``inputs/``.

.. code-block:: python

   seqs = g.add_input_node(sequences="/vol/inputs/family.fasta")
   align = g.add_node("clustalo")
   g.add_edge((seqs, align))

**3. Set the output and execute.** ``execute()`` returns one result per output
node.

.. code-block:: python

   g.set_output_node(align)
   print(json.dumps(g.execute(), indent=2, default=str))

See :doc:`graph`, :doc:`nodes/index`, and :doc:`examples` for the full
vocabulary.

Configuration
-------------

**API keys.** Connector generation calls an LLM, so at least one provider key
is required. Create a ``.env`` file in your working directory, for example::

    printf 'GEMINI_API_KEY=your-key-here\n' > .env

Any one of these keys works (you only need one)::

    GEMINI_API_KEY=...       # Google Gemini
    ANTHROPIC_API_KEY=...    # Anthropic Claude
    OPENAI_API_KEY=...       # OpenAI

Point the CLI at the file with ``--env`` (it defaults to ``.env`` in the
current directory)::

    biocomp run run/run_pipeline.py --env .env
    biocomp run run/run_pipeline.py --env secrets/keys.env   # a different path

On a cloud run, the keys in this file are passed into the Modal sandbox as
secrets. If no key is found, the run still starts, but every LLM-backed step
(that is, every connector) fails, so set a key before running.
``biocomp secrets`` prints setup guidance.

**Which LLM is used.** To pin the provider and model from inside a script,
call :meth:`Graph.set_llm` before ``execute()``:

.. code-block:: python

   # Choose one provider/model pair; api_key holds the matching provider key.
   g.set_llm("anthropic", "claude-haiku-4-5", api_key)
   g.set_llm("openai", "gpt-4o-mini", api_key)
   g.set_llm("google", "gemini-2.5-flash-lite", api_key)

Running it
----------

**Locally.** Tools run in Docker on your machine; outputs go to
``./results/``:

.. code-block:: bash

   biocomp run run/run_pipeline.py --env .env

**On the cloud (Modal).** Use this for GPU tools (RFdiffusion, ColabFold).
biocomposer uploads your script, ``inputs/``, and the package to a Modal
sandbox, runs there, and stores results on a persistent volume named
``biocomp``:

.. code-block:: bash

   biocomp run --modal run/run_pipeline.py --env .env

Retrieve a result file from the volume::

    modal volume get biocomp results/_output_1/stdout/ ./

.. note::

   The sandbox is temporary, but the volume persists. Tool images are cached
   on the volume, so the second run of a heavy tool skips the multi-gigabyte
   re-download.

Flags
~~~~~

.. list-table::
   :header-rows: 1
   :widths: 22 12 66

   * - Flag
     - Scope
     - Meaning
   * - ``--env``
     - both
     - env file with API keys (default ``.env``)
   * - ``--clean``
     - both
     - delete previous ``results/`` and temp files first
   * - ``--modal``
     - both
     - run on Modal instead of locally
   * - ``--gpu``
     - modal
     - GPU type (default ``A10G``; e.g. ``T4``, ``A100``)
   * - ``--memory``
     - modal
     - sandbox memory in MB (default ``8192``)
   * - ``--shell``
     - modal
     - drop into an interactive shell in the sandbox (debugging)
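
When runs are launched from another script (a batch driver or CI job, say), it
can help to assemble the command line programmatically. The following is a
minimal sketch, not part of biocomposer: ``build_command`` is a hypothetical
helper, the placement of ``--clean``, ``--gpu``, and ``--memory`` after the
script path is an assumption, and the scope check simply mirrors the "modal"
scope in the flags table rather than confirmed CLI behaviour.

.. code-block:: python

   import shlex


   def build_command(script, env_file=".env", modal=False, gpu=None,
                     memory_mb=None, clean=False):
       """Assemble a ``biocomp run`` argv list (hypothetical helper)."""
       if not modal and (gpu is not None or memory_mb is not None):
           # The flags table lists --gpu and --memory with scope "modal".
           raise ValueError("--gpu and --memory only apply with --modal")

       argv = ["biocomp", "run"]
       if modal:
           argv.append("--modal")
       argv.append(script)
       argv += ["--env", env_file]
       if clean:
           argv.append("--clean")
       # Assumption: these options are accepted after the script path.
       if gpu is not None:
           argv += ["--gpu", gpu]
       if memory_mb is not None:
           argv += ["--memory", str(memory_mb)]
       return argv


   local = build_command("run/run_pipeline.py")
   cloud = build_command("run/run_pipeline.py", modal=True, gpu="A100",
                         memory_mb=16384, clean=True)

   # shlex.join renders the argv list as a copy-pasteable shell command.
   print(shlex.join(local))
   print(shlex.join(cloud))

The returned list can be handed to ``subprocess.run(argv, check=True)``.
Building a list rather than a single string avoids shell-quoting problems when
a path contains spaces.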