Running a pipeline
==================

Prerequisites
-------------

**Always**

- **Python 3.10+** with the package installed (``pip install biocomposer``).
- **An LLM API key**, because connectors are generated by a language model. Supply
  one of ``GEMINI_API_KEY``, ``GOOGLE_API_KEY``, ``ANTHROPIC_API_KEY``, or
  ``OPENAI_API_KEY`` (see `Configuration`_ below).
- **The bv registry client**, which fetches tool manifests and images::

      cargo install biov      # needs Rust (https://rustup.rs)

**Local runs only**

- **Docker**, installed and running. Each tool runs in a Docker container, so if the
  daemon is down the run fails immediately. Start Docker Desktop (macOS/Windows) or
  run ``sudo systemctl start docker`` (Linux).

**Cloud runs only**

- **Modal**, configured once with ``pip install modal && modal setup``. On Modal
  you don't need Docker or ``bv`` locally; the sandbox provides them.
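
The checklist above can be turned into a quick preflight check. The sketch below
is illustrative only and is not part of biocomposer: the helper ``missing_tools``
is hypothetical, and it assumes the executables are named ``bv``, ``docker``, and
``modal`` on your ``PATH``.

.. code-block:: python

    import shutil
    from typing import Callable, Optional

    # Executable names are assumptions; adjust them to match your installation.
    REQUIRED = {
        "local": ["bv", "docker"],
        "modal": ["modal"],
    }

    def missing_tools(
        mode: str,
        which: Callable[[str], Optional[str]] = shutil.which,
    ) -> list[str]:
        """Return the required executables that are not found on PATH."""
        return [name for name in REQUIRED[mode] if which(name) is None]

    if __name__ == "__main__":
        for mode in REQUIRED:
            missing = missing_tools(mode)
            status = "ok" if not missing else "missing " + ", ".join(missing)
            print(f"{mode}: {status}")

Finding ``docker`` on ``PATH`` only shows that the client is installed; it does not
confirm that the daemon is running.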

Project layout
--------------

The CLI expects two folders in your working directory:

.. code-block:: text

    inputs/            # your FASTA / PDB / parameter files
    run/
        run_pipeline.py   # your pipeline script

On a cloud run the ``inputs/`` folder is uploaded and appears inside the sandbox
at ``/vol/inputs/``, which is why scripts reference inputs as
``/vol/inputs/db.fasta`` rather than a local path.
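
To make the mapping concrete, here is a minimal sketch that converts a local path
under ``inputs/`` into the path a script would reference. The helper
``sandbox_path`` is hypothetical, and the sketch assumes that subfolders of
``inputs/`` keep their relative layout under ``/vol/inputs/``.

.. code-block:: python

    from pathlib import Path, PurePosixPath

    SANDBOX_INPUTS = PurePosixPath("/vol/inputs")  # mount point described above

    def sandbox_path(local_file: str, inputs_dir: str = "inputs") -> str:
        """Map a file under the local inputs folder to its sandbox path.

        Raises ValueError if local_file is not inside inputs_dir.
        """
        relative = Path(local_file).relative_to(inputs_dir)
        return str(SANDBOX_INPUTS / relative.as_posix())

    print(sandbox_path("inputs/db.fasta"))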

Writing a run script
--------------------

A run script is plain Python that builds a graph and executes it.

**1. Import and create the graph.**

.. code-block:: python

    import json
    from biocomposer import Graph

    g = Graph()

**2. Add inputs, tools, and edges.** Input-node keys must match the tool's declared
input names; read them with ``bv show <tool> --format json`` (see
:ref:`input-keys`). Use ``/vol/inputs/...`` paths for files you placed in
``inputs/``.

.. code-block:: python

    seqs  = g.add_input_node(sequences="/vol/inputs/family.fasta")
    align = g.add_node("clustalo")
    g.add_edge((seqs, align))

**3. Set the output and execute.** ``execute()`` returns one result per output node.

.. code-block:: python

    g.set_output_node(align)
    print(json.dumps(g.execute(), indent=2, default=str))

See :doc:`graph`, :doc:`nodes/index`, and :doc:`examples` for the full vocabulary.

Configuration
-------------

**API keys.** Connector generation calls an LLM, so at least one provider key is
required. Create a ``.env`` file in your working directory, for example::

    printf 'GEMINI_API_KEY=your-key-here\n' > .env

Any one of these keys is sufficient::

    GEMINI_API_KEY=...        # Google Gemini
    ANTHROPIC_API_KEY=...     # Anthropic Claude
    OPENAI_API_KEY=...        # OpenAI

Point the CLI at the file with ``--env`` (it defaults to ``.env`` in the current
directory)::

    biocomp run run/run_pipeline.py --env .env
    biocomp run run/run_pipeline.py --env secrets/keys.env   # a different path

On a cloud run, the keys in this file are passed into the Modal sandbox as secrets.
If no key is found, the run still starts, but every LLM-backed step (that is, every
connector) fails, so set a key before running. ``biocomp secrets`` prints setup
guidance.
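
Because a missing key only surfaces once a connector runs, it can help to check the
env file first. The following is a simplified sketch, not how biocomposer itself
reads the file: ``parse_env`` and ``provider_keys_present`` are hypothetical
helpers that handle only plain ``KEY=value`` lines.

.. code-block:: python

    from pathlib import Path

    PROVIDER_KEYS = (
        "GEMINI_API_KEY",
        "GOOGLE_API_KEY",
        "ANTHROPIC_API_KEY",
        "OPENAI_API_KEY",
    )

    def parse_env(text: str) -> dict[str, str]:
        """Parse simple KEY=value lines; skip blank lines and # comments."""
        values = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            value = value.split(" #", 1)[0]  # drop a trailing inline comment
            values[key.strip()] = value.strip().strip("'\"")
        return values

    def provider_keys_present(text: str) -> list[str]:
        """Return the provider key names that have a non-empty value."""
        env = parse_env(text)
        return [name for name in PROVIDER_KEYS if env.get(name)]

    if __name__ == "__main__":
        env_file = Path(".env")
        found = provider_keys_present(env_file.read_text()) if env_file.exists() else []
        print("provider keys found:", found or "none")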

**Which LLM is used.** To pin the provider and model from inside a script, call
:meth:`Graph.set_llm` before ``execute()``:

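.. code-block:: python

    # api_key holds the key string for the chosen provider.
    # Call set_llm once, with one of the following:
    g.set_llm("anthropic", "claude-haiku-4-5", api_key)
    # g.set_llm("openai", "gpt-4o-mini", api_key)
    # g.set_llm("google", "gemini-2.5-flash-lite", api_key)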
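
One way to obtain the three arguments is to pick them from whichever key is set.
This is a sketch under stated assumptions: ``pick_llm`` is a hypothetical helper,
the priority order is arbitrary, and it assumes the keys are visible in the
script's process environment, which this guide does not confirm.

.. code-block:: python

    import os
    from typing import Mapping, Optional

    # (environment variable, provider, model), using the names shown above.
    CANDIDATES = [
        ("ANTHROPIC_API_KEY", "anthropic", "claude-haiku-4-5"),
        ("OPENAI_API_KEY", "openai", "gpt-4o-mini"),
        ("GEMINI_API_KEY", "google", "gemini-2.5-flash-lite"),
        ("GOOGLE_API_KEY", "google", "gemini-2.5-flash-lite"),
    ]

    def pick_llm(
        env: Mapping[str, str] = os.environ,
    ) -> Optional[tuple[str, str, str]]:
        """Return (provider, model, api_key) for the first key that is set."""
        for var, provider, model in CANDIDATES:
            key = env.get(var)
            if key:
                return provider, model, key
        return None

In a run script, the result can be unpacked into the call, for example
``choice = pick_llm()`` followed by ``g.set_llm(*choice)`` when ``choice`` is not
``None``.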


Running it
----------

**Locally**, tools run in Docker on your machine; outputs go to ``./results/``:

.. code-block:: bash

    biocomp run run/run_pipeline.py --env .env

**On the cloud (Modal)**, which suits GPU tools such as RFdiffusion and ColabFold,
biocomposer uploads your script, ``inputs/``, and the package to a Modal sandbox,
runs the pipeline there, and stores results on a persistent volume named
``biocomp``:

.. code-block:: bash

    biocomp run --modal run/run_pipeline.py --env .env

Retrieve a result file from the volume::

    modal volume get biocomp results/<tool>_output_1/stdout/<file> ./<file>
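
When several files need fetching, the command can be assembled in a script. The
sketch below only builds the argument list. ``volume_get_command`` is a
hypothetical helper, the ``_output_1`` suffix simply follows the pattern shown
above, and the tool and file names in the example are placeholders.

.. code-block:: python

    import shlex

    VOLUME = "biocomp"  # volume name stated above

    def volume_get_command(tool: str, filename: str, dest_dir: str = ".") -> list[str]:
        """Build the `modal volume get` argument list for one result file."""
        remote = f"results/{tool}_output_1/stdout/{filename}"
        local = f"{dest_dir.rstrip('/')}/{filename}"
        return ["modal", "volume", "get", VOLUME, remote, local]

    # Placeholder names, for illustration only.
    print(shlex.join(volume_get_command("clustalo", "alignment.fasta")))

The returned list can be passed to ``subprocess.run(..., check=True)`` to perform
the download.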

.. note::

   The sandbox is temporary but the volume persists. Tool images are cached on it,
   so the second run of a heavy tool skips the multi-gigabyte re-download.

Flags
~~~~~

.. list-table::
   :header-rows: 1
   :widths: 22 12 66

   * - Flag
     - Scope
     - Meaning
   * - ``--env <path>``
     - both
     - env file with API keys (default ``.env``)
   * - ``--clean``
     - both
     - delete previous ``results/`` and temp files first
   * - ``--modal``
     - both
     - run on Modal instead of locally
   * - ``--gpu <type>``
     - modal
     - GPU type (default ``A10G``; e.g. ``T4``, ``A100``)
   * - ``--memory <MB>``
     - modal
     - sandbox memory in MB (default ``8192``)
   * - ``--shell``
     - modal
     - drop into an interactive shell in the sandbox (debugging)