How It Works
============

biocomposer decomposes the reasoning required to run a bioinformatics workflow into a stack of abstraction levels, so that only one kind of decision has to be made at each level. Composing tools normally forces the user to reason about everything at once: which tools to chain, how their formats and arguments line up, how to invoke each one, and how to provision its environment. biocomposer separates these concerns into layers, and each layer resolves its own class of decision before delegating the rest downward. As a result, the burden of reasoning about data flow moves step by step from the user's intent down to a concrete execution, and no single level holds all of the complexity.

Concretely, a pipeline is expressed against a small Python API; the API operates on a graph of typed nodes; and the nodes are backed by a layer of specializations (connector synthesis, scatter/gather, feedback loops via decision nodes). Each level is a thin surface over the one beneath it: the user calls :func:`~biocomposer.Graph.add_node`, which resolves a typed tool from the registry, which in turn pulls an image and triggers a container run.

*Figure: biocomposer architecture.* Each layer wraps the one below it: the API surface wraps the graph and node classes, which wrap a layer of specialized helpers; tool installation wraps the registry client, which wraps the container runtime; and the tool library wraps external image sources (BioContainers, Docker Hub).
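As a rough illustration of this layering, the sketch below models each level as a thin wrapper that makes one decision and delegates the rest downward. Every name in it (``ContainerRuntime``, ``RegistryClient``, ``LayeredGraph``, the ``trace`` list, and the ``example/clustalo:latest`` image string) is hypothetical and invented for this example; it does not show biocomposer's internal classes.

```python
from dataclasses import dataclass, field

trace = []  # records which layer did what, in call order


@dataclass
class ContainerRuntime:
    """Bottom layer (hypothetical): only knows how to pull an image."""

    def pull(self, image):
        trace.append(f"pull {image}")


@dataclass
class RegistryClient:
    """Middle layer (hypothetical): maps a tool name to its manifest."""

    runtime: ContainerRuntime
    manifests: dict  # tool name -> manifest dict

    def resolve(self, tool):
        manifest = self.manifests[tool]
        trace.append(f"resolve {tool}")
        # Delegate image handling to the layer below.
        self.runtime.pull(manifest["image"])
        return manifest


@dataclass
class LayeredGraph:
    """Top layer (hypothetical): the only surface the user touches."""

    registry: RegistryClient
    nodes: dict = field(default_factory=dict)

    def add_node(self, tool):
        # One call at the top fans out into the layers beneath it.
        self.nodes[tool] = self.registry.resolve(tool)
        return tool


# Placeholder manifest; the image name is made up for the example.
registry = RegistryClient(
    ContainerRuntime(),
    {"clustalo": {"image": "example/clustalo:latest"}},
)
graph = LayeredGraph(registry)
graph.add_node("clustalo")
print(trace)
```

The user-facing call is a single ``add_node``; the manifest lookup and the image pull happen in the layers below it, in that order, without the caller naming either.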
Pipeline Dataflow
-----------------

A pipeline is a **directed acyclic graph (DAG)**: tools are **nodes** and data hand-offs are **directed edges**.

.. mermaid::

   flowchart LR
       A(["inputs"]) --> B["align"]
       B --> C["trim"]
       C --> D["fold"]
       D --> E(["result"])

The DAG representation gives the system three properties. Dependencies are **explicit**: an edge ``A → B`` declares that ``B`` consumes ``A``'s output, and the engine derives run order by traversing the graph backward from the output node, so ordering is never specified by hand. The model **generalises past linear chains**: fan-out, fan-in, branching, and result sharing are all instances of the same node/edge structure. And it **localises data conversion**: mapping one tool's output to another's input is resolved per edge, against two typed schemas and the actual runtime data, rather than as global script logic.

The API surface
---------------

The top layer is a small graph-construction API: :func:`~biocomposer.Graph.add_input_node` for user-supplied values, :func:`~biocomposer.Graph.add_node` for tools, :func:`~biocomposer.Graph.add_edge` to wire them, :func:`~biocomposer.Graph.set_output_node` to mark the result, and :func:`~biocomposer.Graph.execute` to run. Construction is declarative: it records the graph, and no tool runs until ``execute()`` is called.

These calls are deliberately thin. ``add_node`` triggers manifest resolution and image setup beneath it, and ``execute`` drives the evaluation described below, so the user never reasons about either.

The graph and node layer
------------------------

Each node carries the typed input/output schema of its tool's manifest. The following node types are specified:

- :doc:`Input nodes ` hold user files and parameters.
- :doc:`Tool nodes ` wrap a single registry tool and run once.
- :doc:`Gather nodes ` run a one-input tool once per item of an upstream collection, then gather the results.
- :doc:`Decision nodes ` re-run a node with adjusted inputs until a scored condition holds. This feedback loop is contained *within* the node, so the graph as a whole stays acyclic. The score and modifier are each a Python function or a registry tool.
- :doc:`Subgraphs ` expose a one-in/one-out pipeline as a single node.

Edges are **generated rather than written**. Each edge is realised as a **connector** function, generated from the upstream output schema, the downstream input schema, and a snapshot of the files actually produced at run time. The connector performs format conversion, field renaming, and selection of the correct file from a directory. Connectors are generated once per edge and cached, so a feedback loop or a wide fan-out reuses a single function (:doc:`connectors`). When two edges supply the same input key, resolution is deterministic and governed by edge order (:ref:`merge-order`).

Map, decision, and connector behaviour are themselves a layer of specialized helpers beneath the node classes; the API never exposes them directly.

The tool and container layer
----------------------------

At the base, each tool is a **container image paired with a typed manifest**. ``add_node("clustalo")`` resolves the manifest and pulls the image from Docker Hub, BioContainers, or another image source. Executing each tool in its own container pins its exact software and versions, which makes runs reproducible and lets tools with incompatible dependencies coexist in one graph.

This layer is itself a wrapper: the tool library sits over external image sources (purpose-built images, `BioContainers `_, or `Docker Hub `_), so adding a tool means writing a manifest over an existing image rather than repackaging it (:doc:`tools`, :doc:`contributing_tools`).

What happens when you call ``execute()``
----------------------------------------

Building the graph (``add_node``, ``add_edge``, …) only *describes* the pipeline. Nothing runs until ``execute()`` is called.
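The split between describing a pipeline and running it can be sketched with a toy graph class. This is a simplified illustration under stated assumptions, not biocomposer's implementation: ``SketchGraph`` and its attributes are hypothetical, plain Python functions stand in for containerised tools, and the method names merely mirror the API calls named on this page.

```python
class SketchGraph:
    """Toy stand-in for a pipeline graph (hypothetical, for illustration)."""

    def __init__(self):
        self.funcs = {}     # node name -> callable standing in for a tool
        self.parents = {}   # node name -> list of upstream node names
        self.output = None  # name of the node whose result is returned
        self.run_log = []   # order in which nodes actually ran

    def add_node(self, name, func):
        # Only records the node; nothing is executed here.
        self.funcs[name] = func
        self.parents.setdefault(name, [])

    def add_edge(self, src, dst):
        self.parents[dst].append(src)

    def set_output_node(self, name):
        self.output = name

    def execute(self):
        cache = {}  # each node runs at most once, even if shared downstream

        def evaluate(name):
            if name not in cache:
                # Walk backward: upstream results are computed first.
                args = [evaluate(parent) for parent in self.parents[name]]
                self.run_log.append(name)
                cache[name] = self.funcs[name](*args)
            return cache[name]

        return evaluate(self.output)


g = SketchGraph()
g.add_node("inputs", lambda: "ACGT")       # placeholder for user input
g.add_node("align", lambda s: s.lower())   # placeholder "tools"
g.add_node("trim", lambda s: s[1:])
g.add_node("fold", lambda s: s[::-1])
g.add_edge("inputs", "align")
g.add_edge("align", "trim")
g.add_edge("trim", "fold")
g.set_output_node("fold")

log_before = list(g.run_log)  # still empty: construction ran nothing
result = g.execute()
print(log_before, g.run_log, result)
```

In this sketch the run order is never written down by the user. It falls out of the backward walk from the output node, and ``run_log`` stays empty until ``execute()`` is called.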
When ``execute()`` is called, biocomposer starts from the final output and works backwards, running each step that the output depends on. For each step:

.. mermaid::

   flowchart TD
       U["Upstream step finishes<br/>(produces an output)"] --> C["Connector maps that output<br/>into this step's inputs"]
       C --> S["A fresh sandbox is created"]
       S --> ST["Inputs are copied in;<br/>the command is assembled"]
       ST --> R["The tool runs in its container"]
       R --> H["Outputs are harvested<br/>back out of the sandbox"]
       H --> N["Result passed to the next step"]

1. **Run upstream first.** A step's inputs come from the steps before it, so those run first (results are reused if a step feeds more than one place).
2. **Connect.** The connector converts the upstream result into this step's input dictionary.
3. **Sandbox.** A temporary working directory (a *sandbox*) is created. Input files are copied in, output folders are pre-created, and the tool's command line is assembled from its manifest template (filling in placeholders like ``{input}`` and ``{output}``).
4. **Execute.** The tool runs inside its container, reading and writing only inside the sandbox.
5. **Harvest.** The files the tool wrote are copied out of the sandbox into a numbered results folder (``_output_1``, ``_2``, …), and returned as the step's output. The sandbox is discarded.

Under the hood: how a tool is actually invoked
----------------------------------------------

For each tool run, biocomposer:

* **Installs the tool** (``bv add`` + ``bv sync``) the first time it appears, pulling its image. Pulled images are cached on the persistent volume, so later runs restore them locally instead of re-downloading.
* **Reads the manifest** (``bv show --format json``) to learn the tool's inputs, outputs, command, and image.
* **Builds a sandbox**, a temporary working directory on fast local storage. Input files are copied in flat; output directories are pre-created; the command line is assembled from the manifest's ``args_template`` by filling ``{slot}`` placeholders with the staged filenames and scalar values.
* **Executes** with ``bv exec`` inside the tool's container, with the sandbox as the working directory.
* **Harvests** the declared outputs back out of the sandbox into a numbered ``results/`` folder, and returns them as the step's output dictionary.
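The sandbox and harvest steps above can be sketched with the standard library alone. This is a minimal illustration, assuming a manifest template that uses Python-style ``{slot}`` placeholders. The helper names ``stage_sandbox`` and ``harvest``, the ``run_N`` folder naming, and the example template are hypothetical, and the container run itself is replaced by a stand-in that writes one file.

```python
import shutil
import tempfile
from pathlib import Path


def stage_sandbox(input_files, scalars, output_dirs, args_template):
    """Copy inputs in flat, pre-create output dirs, and fill the template."""
    sandbox = Path(tempfile.mkdtemp(prefix="sandbox_"))
    slots = dict(scalars)
    for slot, src in input_files.items():
        staged = sandbox / Path(src).name  # flat copy: keep the basename only
        shutil.copy(src, staged)
        slots[slot] = staged.name
    for slot, dirname in output_dirs.items():
        (sandbox / dirname).mkdir(parents=True, exist_ok=True)
        slots[slot] = dirname
    # Fill {slot} placeholders with staged filenames and scalar values.
    return sandbox, args_template.format(**slots)


def harvest(sandbox, output_dirs, results_root, run_number):
    """Copy declared outputs into a numbered folder, then drop the sandbox."""
    dest = Path(results_root) / f"run_{run_number}"  # hypothetical naming
    for dirname in output_dirs.values():
        shutil.copytree(sandbox / dirname, dest / dirname)
    shutil.rmtree(sandbox)
    return {slot: dest / dirname for slot, dirname in output_dirs.items()}


# Demo with a throwaway input file.
work = Path(tempfile.mkdtemp(prefix="demo_"))
fasta = work / "seqs.fasta"
fasta.write_text(">a\nACGT\n")

sandbox, command = stage_sandbox(
    input_files={"input": fasta},
    scalars={"threads": 2},
    output_dirs={"output": "out"},
    args_template="--in {input} --out {output} --threads {threads}",
)

# Stand-in for the container run: pretend the tool wrote one output file.
(sandbox / "out" / "aligned.txt").write_text("done\n")

outputs = harvest(sandbox, {"output": "out"}, work / "results", 1)
print(command)
print(outputs["output"])
```

Because the command refers only to names relative to the sandbox, the same assembled string works wherever the sandbox is mounted as the working directory.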
In addition, directory-typed outputs are returned as the directory itself (so a connector can look inside it), and for tools that write to ``stdout``, that output is saved as a ``.txt`` file.

Local vs. cloud
---------------

The same pipeline script runs in two places (see :doc:`installation`):

* **Locally**, using Docker on your machine, which is fine for lightweight tools.
* **On the cloud (Modal)**, which spins up a remote machine with a GPU, needed for heavy tools like RFdiffusion or ColabFold. biocomposer uploads your script, your ``inputs/`` folder, and itself to a cloud **sandbox**, runs the pipeline there, and stores results on a persistent cloud volume that you can download from.

Because cloud machines are temporary, biocomposer caches downloaded tool images on the persistent volume, so the second run of a large tool does not re-pull the Docker image.

Checkpoints
-----------

``execute()`` returns the final step's output as a dictionary of named results, typically paths to the files each tool produced. Every intermediate step's output is also saved on disk under ``results/``, so nothing is lost between steps.

.. note::

   Tools are cited and used faithfully: each is run as its authors originally presented it, without modifying the original architecture or design choices.