Tools and the registry
======================

biocomposer uses the **bv** registry and client by default. Because tools are
just container images plus a manifest, the catalogue can wrap images from
anywhere: purpose-built images, BioContainers, or plain Docker Hub images.

.. note::

   The bv registry and ``bv`` client are developed by Tejas Prabhune; see the
   bv-registry project. biocomposer builds on bv for tool resolution,
   container management, and execution.

Finding tools
-------------

List or inspect tools with the ``bv`` client:

.. code-block:: bash

   bv search clustalo                   # find a tool
   bv show proteinmpnn                  # human-readable manifest
   bv show proteinmpnn --format json    # machine-readable (use this for input keys)

``bv show`` is the authoritative source for a tool's **input names**, the exact
keyword names you pass to ``add_input_node``. See :ref:`input-keys`.

Anatomy of a manifest
---------------------

A manifest has four parts: identity, **inputs**, **outputs**, and how to run
the tool (image + entrypoint). Below is the real ProteinMPNN manifest, trimmed
to the essential fields.

Identity
~~~~~~~~

.. code-block:: toml

   [tool]
   id = "proteinmpnn"
   version = "1.0.1"
   description = "Deep-learning protein sequence design (inverse folding) ..."
   homepage = "https://github.com/dauparas/ProteinMPNN"
   license = "MIT"

Inputs
~~~~~~

Each input is one ``[[tool.inputs]]`` block. These are the fields that matter
when building a pipeline:

.. code-block:: toml

   [[tool.inputs]]
   name = "pdb_path"        # the key you use in add_input_node / connectors
   type = "pdb"             # what kind of data it is
   required = true          # must be supplied
   cardinality = "one"      # exactly ONE item (not a list)
   description = "Single input backbone to redesign (--pdb_path) ..."

   [[tool.inputs]]
   name = "num_seq_per_target"
   type = "file"
   required = true
   cardinality = "one"
   description = "Number of sequences to generate per backbone (--num_seq_per_target) ..."

The four fields, and why they matter:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Field
     - Meaning
   * - ``name``
     - The key. ``add_input_node(pdb_path=...)`` must use this exact name, the
       connector maps onto it, and the command template references it as
       ``{pdb_path}``.
   * - ``type``
     - The data kind (``pdb``, ``fasta``, ``dir``, ``file``, …). Connectors use
       it to decide whether a format conversion is needed between two tools.
   * - ``required``
     - Whether the tool fails without it. Optional inputs can be omitted.
   * - ``cardinality``
     - ``one`` = a single item; ``many`` = a list. **This is the field that
       decides whether you need a gather node.** If an upstream node produces
       many items but this input is ``cardinality = "one"``, the tool consumes
       one item at a time, so wrap it in a :doc:`nodes/gather` to run it once
       per item.

Outputs
~~~~~~~

Outputs use the same fields. ProteinMPNN produces one output directory:

.. code-block:: toml

   [[tool.outputs]]
   name = "sequences"
   type = "dir"
   required = true
   cardinality = "one"
   description = "Output folder; contains seqs/.fa with the designed sequences ..."

An output's ``name`` is the key you see in the result dict. For a gather node,
it is also the value you pass as ``split_key`` when this collection should be
fanned out.

Image and entrypoint
~~~~~~~~~~~~~~~~~~~~~

These tell biocomposer which container to run and how to invoke the tool. The
``args_template`` is the command line, with ``{slot}`` placeholders that match
input/output ``name`` fields:

.. code-block:: toml

   [tool.image]
   backend = "docker"
   reference = "docker.io/rosettacommons/proteinmpnn:latest"

   [tool.hardware.gpu]
   required = false
   min_vram_gb = 4
   cuda_version = "11.3"

   [tool.entrypoint]
   command = "python"
   args_template = "/app/proteinmpnn/protein_mpnn_run.py --pdb_path {pdb_path} --out_folder {sequences} --num_seq_per_target {num_seq_per_target} --sampling_temp {sampling_temp}"

   [tool.binaries]
   exposed = ["python", "protein_mpnn_run.py", "parse_multiple_chains.py"]

At run time the placeholders are filled: ``{pdb_path}`` becomes the staged
input file, ``{sequences}`` the output directory, and scalar inputs like
``{num_seq_per_target}`` their literal values. A placeholder that nothing
supplies is stripped, which is why an input whose **key doesn't match a slot**
silently has no effect.

The ``exposed`` binaries list is what makes ``entrypoint_override`` possible: a
single tool image (e.g. ViennaRNA) can expose several programs, and you select
one with ``entrypoint_override="RNAplot"``. See :doc:`nodes/tool`.

How the manifest shapes your pipeline
-------------------------------------

Reading a manifest answers the three questions that come up while wiring a
pipeline:

#. **What do I name my input-node keys?** → the input ``name`` fields.
#. **Do I need a gather node here?** → is the downstream input
   ``cardinality = "one"`` while the upstream produces many? If so, yes.
#. **What comes out, and under what key?** → the output ``name`` fields (and
   the one you'd use as ``split_key``).
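
The three questions can also be answered programmatically. The sketch below is
illustrative only: it parses a trimmed copy of the ProteinMPNN manifest shown
earlier with Python's standard TOML parser and reads the ``name`` and
``cardinality`` fields. The helper names (``input_keys``, ``needs_gather``,
``output_keys``) and the ``upstream_cardinality`` argument are hypothetical and
are not part of biocomposer or ``bv``.

.. code-block:: python

   try:
       import tomllib  # standard library, Python 3.11+
   except ModuleNotFoundError:
       import tomli as tomllib  # third-party backport with the same API

   # Trimmed copy of the manifest fields shown in this section.
   MANIFEST = """
   [tool]
   id = "proteinmpnn"

   [[tool.inputs]]
   name = "pdb_path"
   type = "pdb"
   required = true
   cardinality = "one"

   [[tool.inputs]]
   name = "num_seq_per_target"
   type = "file"
   required = true
   cardinality = "one"

   [[tool.outputs]]
   name = "sequences"
   type = "dir"
   required = true
   cardinality = "one"
   """


   def input_keys(manifest):
       """Question 1: the names to use as input-node keys."""
       return [spec["name"] for spec in manifest["tool"].get("inputs", [])]


   def needs_gather(manifest, input_name, upstream_cardinality):
       """Question 2: upstream yields many items, but this input takes one."""
       for spec in manifest["tool"].get("inputs", []):
           if spec["name"] == input_name:
               return upstream_cardinality == "many" and spec["cardinality"] == "one"
       raise KeyError(f"no input named {input_name!r} in this manifest")


   def output_keys(manifest):
       """Question 3: the keys under which results appear."""
       return [spec["name"] for spec in manifest["tool"].get("outputs", [])]


   manifest = tomllib.loads(MANIFEST)
   print(input_keys(manifest))                      # ['pdb_path', 'num_seq_per_target']
   print(needs_gather(manifest, "pdb_path", "many"))  # True
   print(output_keys(manifest))                     # ['sequences']

Raising ``KeyError`` for an unknown input name is a deliberate choice in this
sketch: it surfaces a mistyped key early instead of letting it silently have no
effect.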