Tools and the registry¶
biocomposer uses the bv registry and client by default. Because tools are just container images plus a manifest, the catalogue can wrap images from anywhere: purpose-built images, BioContainers, or plain Docker Hub images.
Note
The bv registry and bv client are developed by Tejas Prabhune;
see the bv-registry project.
biocomposer builds on bv for tool resolution, container management, and execution.
Finding tools¶
List or inspect tools with the bv client:
bv search clustalo # find a tool
bv show proteinmpnn # human-readable manifest
bv show proteinmpnn --format json # machine-readable (use this for input keys)
bv show is the authoritative source for a tool’s input names: these are the exact
keyword names you pass to add_input_node. See Keys must match the downstream tool.
Anatomy of a manifest¶
A manifest has four parts: identity, inputs, outputs, and how to run it (image + entrypoint). Below is the real ProteinMPNN manifest, trimmed to the essential fields.
Identity¶
[tool]
id = "proteinmpnn"
version = "1.0.1"
description = "Deep-learning protein sequence design (inverse folding) ..."
homepage = "https://github.com/dauparas/ProteinMPNN"
license = "MIT"
Inputs¶
Each input is one [[tool.inputs]] block. These are the fields that matter when
building a pipeline:
[[tool.inputs]]
name = "pdb_path" # the key you use in add_input_node / connectors
type = "pdb" # what kind of data it is
required = true # must be supplied
cardinality = "one" # exactly ONE item (not a list)
description = "Single input backbone to redesign (--pdb_path) ..."
[[tool.inputs]]
name = "num_seq_per_target"
type = "file"
required = true
cardinality = "one"
description = "Number of sequences to generate per backbone (--num_seq_per_target) ..."
The four fields, and why they matter:
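| Field | Meaning |
|---|---|
| name | The key you use in add_input_node and in connectors. |
| type | The data kind (for example pdb, file, or dir). |
| required | Whether the tool fails without it. Optional inputs can be omitted. |
| cardinality | How many items the input accepts. "one" means exactly one item, not a list. |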
Outputs¶
Outputs use the same fields. ProteinMPNN produces one output directory:
[[tool.outputs]]
name = "sequences"
type = "dir"
required = true
cardinality = "one"
description = "Output folder; contains seqs/<name>.fa with the designed sequences ..."
An output’s name is the key you see in the result dict, and, for a gather node,
the value you pass as split_key when this collection should be fanned out.
Image and entrypoint¶
These tell biocomposer which container to run and how to invoke the tool. The
args_template is the command line, with {slot} placeholders that match
input/output name fields:
[tool.image]
backend = "docker"
reference = "docker.io/rosettacommons/proteinmpnn:latest"
[tool.hardware.gpu]
required = false
min_vram_gb = 4
cuda_version = "11.3"
[tool.entrypoint]
command = "python"
args_template = "/app/proteinmpnn/protein_mpnn_run.py --pdb_path {pdb_path} --out_folder {sequences} --num_seq_per_target {num_seq_per_target} --sampling_temp {sampling_temp}"
[tool.binaries]
exposed = ["python", "protein_mpnn_run.py", "parse_multiple_chains.py"]
At run time the placeholders are filled: {pdb_path} becomes the staged input
file, {sequences} the output directory, and scalar inputs like
{num_seq_per_target} their literal values. A placeholder that nothing supplies
is stripped, which is why an input whose key doesn’t match a slot silently has
no effect.
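Because a mismatched key fails silently, it can help to compare your keys against the template's slots before running anything. The following is a minimal sketch, not bv's actual substitution logic: template_slots and check_keys are hypothetical helpers, and the regular expression assumes slot names consist of letters, digits, and underscores.

```python
import re

# The args_template from the ProteinMPNN manifest above.
ARGS_TEMPLATE = (
    "/app/proteinmpnn/protein_mpnn_run.py --pdb_path {pdb_path} "
    "--out_folder {sequences} --num_seq_per_target {num_seq_per_target} "
    "--sampling_temp {sampling_temp}"
)


def template_slots(template: str) -> set[str]:
    """Collect the {slot} names that appear in an args_template."""
    return set(re.findall(r"\{(\w+)\}", template))


def check_keys(template: str, supplied: set[str]) -> tuple[set[str], set[str]]:
    """Return (keys with no matching slot, slots that nothing supplies)."""
    slots = template_slots(template)
    return supplied - slots, slots - supplied


# "pdb" is a deliberate typo for "pdb_path". "sequences" is an output slot;
# it is listed here on the assumption that the runtime fills it in.
unmatched, unfilled = check_keys(
    ARGS_TEMPLATE, {"pdb", "sequences", "num_seq_per_target"}
)
print(sorted(unmatched))  # ['pdb']
print(sorted(unfilled))   # ['pdb_path', 'sampling_temp']
```

Here the typo shows up twice: "pdb" matches no slot, and {pdb_path} is left with nothing to fill it.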
The exposed binaries list is what makes entrypoint_override possible: a
single tool image (e.g. ViennaRNA) can expose several programs, and you select one
with entrypoint_override="RNAplot". See Tool nodes.
How the manifest shapes your pipeline¶
Reading a manifest answers the three questions that come up while wiring a pipeline:
- What do I name my input-node keys? → the input name fields.
- Do I need a gather node here? → is the downstream input cardinality = "one" while the upstream produces many? If so, yes.
- What comes out, and under what key? → the output name fields (and the one you’d use as split_key).
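The gather-node question reduces to a one-line rule, sketched below. needs_gather is a hypothetical helper, not a biocomposer function, and the cardinality value "many" used in the last call is an assumption: the manifest excerpts here only show "one".

```python
def needs_gather(downstream_cardinality: str, upstream_item_count: int) -> bool:
    """True when a single-item input is fed by an upstream that produces several items."""
    return downstream_cardinality == "one" and upstream_item_count > 1


print(needs_gather("one", 8))   # True: eight items feeding a single-item input
print(needs_gather("one", 1))   # False: one item fits a single-item input
print(needs_gather("many", 8))  # False: "many" is an assumed value, see above
```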