Example pipelines

In the diagrams: rounded boxes are inputs, plain boxes are tools, a box marked (map) runs once per item, and a diamond is a decision loop.

Walkthrough 1: align a set of sequences

Task: align a set of protein sequences with Clustal Omega.

flowchart LR
    IN(["family.fasta"]) --> CL["clustalo"]
    CL --> OUT(["alignment"])

Step 1: inspect the tool. Every input-node key must match a name the tool declares, so start by reading the manifest:

$ bv show clustalo --format json

{
  "tool": {
    "id": "clustalo",
    "description": "Clustal Omega: fast and scalable multiple sequence aligner ...",
    "image": { "reference": "quay.io/biocontainers/clustalo:1.2.4--h503566f_10" },
    "inputs": [
      {
        "name": "sequences",
        "type": "fasta",
        "cardinality": "one",
        "description": "Unaligned input sequences (protein or nucleotide)"
      }
    ],
    "outputs": [
      { "name": "alignment", "type": "msa", "cardinality": "one",
        "description": "Multiple sequence alignment" }
    ],
    "entrypoint": {
      "command": "clustalo",
      "args_template": "--threads={cpu_cores} -i {sequences} -o {alignment} --outfmt=fa"
    }
  }
}

Step 2: pick the inputs. Clustal Omega declares a single input, sequences. The manifest carries no "required" flag for it; a tool's only input is implicitly required. In tools with several inputs, supply every one marked "required": true. The single output is alignment.
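Steps 1 and 2 can be checked mechanically before anything is uploaded. The sketch below is a minimal illustration, not part of biocomposer or bv: it parses a trimmed copy of the manifest JSON shown above and reports any input-node key the tool does not declare. The helper name unknown_input_keys is hypothetical.

```python
import json

# Trimmed copy of the `bv show clustalo --format json` output shown above.
MANIFEST_JSON = """
{
  "tool": {
    "id": "clustalo",
    "inputs":  [{"name": "sequences", "type": "fasta", "cardinality": "one"}],
    "outputs": [{"name": "alignment", "type": "msa", "cardinality": "one"}]
  }
}
"""


def unknown_input_keys(manifest: dict, keys: set[str]) -> set[str]:
    """Return the keys that the tool's manifest does not declare as inputs.

    Hypothetical helper for illustration; not a biocomposer function.
    """
    declared = {entry["name"] for entry in manifest["tool"]["inputs"]}
    return set(keys) - declared


manifest = json.loads(MANIFEST_JSON)
print(unknown_input_keys(manifest, {"sequences"}))  # set(): the key matches
print(unknown_input_keys(manifest, {"fasta"}))      # {'fasta'}: a type, not an input name
```

The second call shows the most likely slip: using the input's type (fasta) as the key instead of its name (sequences).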

Step 3: put your file in inputs/. The CLI uploads the inputs/ folder, which appears in the run as /vol/inputs/:

inputs/family.fasta

Step 4: write the pipeline. The input-node key is exactly the manifest name, sequences:

from biocomposer import Graph

g = Graph()
seqs  = g.add_input_node(sequences="/vol/inputs/family.fasta")  # key = manifest input name
align = g.add_node("clustalo")
g.add_edge((seqs, align))
g.set_output_node(align)
print(g.execute())

Step 5: run it:

biocomp run run/align.py --env .env

The result dict is keyed by the tool’s output name, alignment.
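Because the entrypoint writes FASTA (--outfmt=fa), a quick sanity check on the alignment is that every row has the same length. The sketch below assumes the alignment text has already been read from wherever the alignment entry of the result points; how that value is represented (path, URL, inline text) is not specified here, so the example works on a made-up string. Both function names are hypothetical.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def is_aligned(records: dict[str, str]) -> bool:
    """True when there is at least one row and all rows have equal length."""
    return len(records) > 0 and len({len(seq) for seq in records.values()}) == 1


# Made-up three-row alignment standing in for the real output file.
example = """>seq1
MKT-AYIAK
>seq2
MKTQAY-AK
>seq3
M-TQAYIAK
"""
print(is_aligned(read_fasta(example)))  # True: every row has 9 columns
```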

Walkthrough 2: design a sequence for each generated backbone

Task: generate protein backbones with RFdiffusion, then design an amino-acid sequence for each one with ProteinMPNN. This introduces two things: a tool with several inputs (some required), and a gather node, whose split_key is also read from a manifest.

flowchart LR
    RFI(["input_pdb, contigs,<br/>num_designs"]) --> RF["rfdiffusion"]
    KN(["num_seq_per_target,<br/>sampling_temp"]) -.-> MP
    RF -->|"designs (N backbones)"| MP["proteinmpnn (map)"]
    MP --> OUT(["sequences: [...]"])

Step 1: inspect both tools.

$ bv show rfdiffusion --format json

{
  "tool": {
    "id": "rfdiffusion",
    "inputs": [
      { "name": "input_pdb", "type": "pdb", "cardinality": "one", "required": false, ... },
      { "name": "contigs",   "type": "file", "cardinality": "one", "required": true,  ... },
      { "name": "num_designs","type": "file","cardinality": "one", "required": true,  ... }
    ],
    "outputs": [
      { "name": "designs", "type": "dir", "cardinality": "one",
        "description": "Output directory; one design_<n>.pdb backbone per design ..." }
    ]
  }
}

$ bv show proteinmpnn --format json

{
  "tool": {
    "id": "proteinmpnn",
    "inputs": [
      { "name": "pdb_path",          "type": "pdb",  "cardinality": "one", "required": true,  ... },
      { "name": "num_seq_per_target","type": "file", "cardinality": "one", "required": true,  ... },
      { "name": "sampling_temp",     "type": "file", "cardinality": "one", "required": true,  ... }
    ],
    "outputs": [
      { "name": "sequences", "type": "dir", "cardinality": "one", ... }
    ]
  }
}

Step 2: pick the inputs. Supply every "required": true input. For RFdiffusion that’s contigs and num_designs (input_pdb is optional but used here). For ProteinMPNN: pdb_path, num_seq_per_target, sampling_temp.
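A small sketch of the bookkeeping behind this step, using only the input names and "required" flags from the two manifests above. The helper still_required is hypothetical. It lists the required inputs that a given input node leaves uncovered, which is a convenient way to see what an upstream edge has to provide.

```python
# Input names and "required" flags copied from the two manifests above.
REQUIRED = {
    "rfdiffusion": {"input_pdb": False, "contigs": True, "num_designs": True},
    "proteinmpnn": {"pdb_path": True, "num_seq_per_target": True, "sampling_temp": True},
}


def still_required(tool: str, supplied: set[str]) -> set[str]:
    """Required inputs of `tool` that `supplied` does not cover (hypothetical helper)."""
    required = {name for name, flag in REQUIRED[tool].items() if flag}
    return required - supplied


# The RFdiffusion input node covers everything that is required.
print(still_required("rfdiffusion", {"input_pdb", "contigs", "num_designs"}))  # set()

# An input node carrying only ProteinMPNN's shared scalars leaves pdb_path open.
print(still_required("proteinmpnn", {"num_seq_per_target", "sampling_temp"}))  # {'pdb_path'}
```

In this walkthrough the remaining pdb_path is not typed in by hand: it arrives over the edge from RFdiffusion, one backbone per run.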

Step 3: why a gather node, and what split_key is. RFdiffusion’s designs output has "cardinality": "one", but it is a directory holding many backbones. ProteinMPNN’s pdb_path also has "cardinality": "one": it takes one backbone per run. ProteinMPNN is therefore added as a gather node (the box marked (map) in the diagram), and split_key names the upstream output that holds the collection to fan out over: here, "designs". (Read split_key from the producing tool’s outputs, not the consuming tool’s inputs.)
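The fan-out described in Step 3 can be pictured with a plain-Python stand-in. This is a conceptual sketch only, not how biocomposer implements gather nodes: fan_out and fake_proteinmpnn are hypothetical, the design files are placeholders written to a temporary directory, and numbering them from 0 is an assumption about the design_<n>.pdb pattern.

```python
import tempfile
from pathlib import Path


def fan_out(designs_dir: Path) -> list[Path]:
    """List the per-design backbone files in a stable order."""
    return sorted(designs_dir.glob("design_*.pdb"))


def fake_proteinmpnn(pdb_path: Path, num_seq_per_target: str, sampling_temp: str) -> dict:
    """Stand-in for one tool run; it only records what it was called with."""
    return {"sequences": f"{pdb_path.stem}: {num_seq_per_target} seqs at T={sampling_temp}"}


with tempfile.TemporaryDirectory() as tmp:
    designs = Path(tmp)
    for n in range(2):  # mirrors num_designs="2"
        (designs / f"design_{n}.pdb").write_text("REMARK placeholder\n")

    shared = {"num_seq_per_target": "2", "sampling_temp": "0.1"}
    # One run per backbone; the shared parameters are repeated for every run.
    runs = [fake_proteinmpnn(pdb_path=p, **shared) for p in fan_out(designs)]
    # Each run's output is collected into a list under the output name.
    gathered = {"sequences": [run["sequences"] for run in runs]}

print(gathered)
```

The split happens on the producer's output (the designs directory), while each run receives a single value for the consumer's pdb_path input.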

Step 4: put inputs in inputs/:

inputs/1l9h.pdb

Step 5: write the pipeline. Note the two input nodes: one feeds RFdiffusion, the other carries ProteinMPNN’s shared parameters. Every key below is a manifest input name.

from biocomposer import Graph

g = Graph()
rf_in = g.add_input_node(
    input_pdb="/vol/inputs/1l9h.pdb",   # rfdiffusion input names
    contigs="[150-150]",
    num_designs="2",
)
rfdiffusion = g.add_node("rfdiffusion")
g.add_edge((rf_in, rfdiffusion))

mpnn_in = g.add_input_node(             # proteinmpnn input names (shared scalars)
    num_seq_per_target="2",
    sampling_temp="0.1",
)
proteinmpnn = g.add_gather_node("proteinmpnn", split_key="designs")  # = rfdiffusion's output name
g.add_edge((rfdiffusion, proteinmpnn), (mpnn_in, proteinmpnn))
g.set_output_node(proteinmpnn)
print(g.execute())

Step 6: run it. The pipeline includes a GPU tool, so run it on Modal:

biocomp run --modal run/design.py --env .env

ProteinMPNN runs once per backbone; the result gathers each run’s sequences output into a list.

Multiple sequence alignment, then trim, then fold

Task: align an RNA family, trim noisy columns, predict a consensus secondary structure, and draw it.

This is a four-node chain; the last two nodes, RNAalifold and RNAplot, are both programs of the viennarna tool. Each edge’s connector handles the format changes between the aligner, the trimmer, and the ViennaRNA programs.

flowchart LR
    IN(["rna_family.fasta"]) --> CL["clustalo"]
    CL --> TR["trimal"]
    TR --> AF["RNAalifold"]
    AF --> PL["RNAplot"]

from biocomposer import Graph

g = Graph()
inp        = g.add_input_node(sequences="/vol/inputs/rna_family.fasta")
clustalo   = g.add_node("clustalo")
trimal     = g.add_node("trimal",
                        args_override="-in {alignment} -out {trimmed} -fasta -gappyout")
rnaalifold = g.add_node("viennarna", entrypoint_override="RNAalifold",
                        args_override="-f F --noPS {sequences} > {structures}")
rnaplot    = g.add_node("viennarna", entrypoint_override="RNAplot",
                        args_override="{sequences} > {structures}")

g.add_edge((inp, clustalo), (clustalo, trimal),
           (trimal, rnaalifold), (rnaalifold, rnaplot))
g.set_output_node(rnaplot)
print(g.execute())

viennarna exposes several programs, selected per node with entrypoint_override (see Tool nodes).
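When writing an args_override string it helps to list the names it references and compare them with the manifest, since a misspelled placeholder is easy to miss. The sketch below assumes the {name} placeholders follow the same brace syntax as Python format strings, which is an assumption about biocomposer's templates; the helper placeholders is hypothetical.

```python
from string import Formatter


def placeholders(template: str) -> list[str]:
    """Return the {name} fields of a template, in order of appearance."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


# The two override strings are copied from the pipeline above.
print(placeholders("-in {alignment} -out {trimmed} -fasta -gappyout"))
# ['alignment', 'trimmed']
print(placeholders("-f F --noPS {sequences} > {structures}"))
# ['sequences', 'structures']
```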

Backbones → sequences → predicted structures

Task: generate backbones, design sequences for each, then fold every designed sequence into a 3-D structure.

ColabFold predicts structure from sequence, so backbones first become sequences (ProteinMPNN), then each sequence is folded (ColabFold). Two map stages chain: the first fans over backbones, the second over the sequences the first produced.

flowchart LR
    RFI(["1l9h.pdb"]) --> RF["rfdiffusion"]
    RF -->|"designs"| MP["proteinmpnn (map)"]
    MP -->|"sequences"| CF["colabfold (map)"]
    CF --> OUT(["predicted structures"])

import json
from biocomposer import Graph

g = Graph()
rf_in = g.add_input_node(
    input_pdb="/vol/inputs/1l9h.pdb", contigs="[150-150]",
    num_designs="2", config_name="base", diffuser_T="20",
)
rfdiffusion = g.add_node("rfdiffusion")
g.add_edge((rf_in, rfdiffusion))

mpnn_in = g.add_input_node(num_seq_per_target="2", sampling_temp="0.1")
proteinmpnn = g.add_gather_node("proteinmpnn", split_key="designs")
g.add_edge((rfdiffusion, proteinmpnn), (mpnn_in, proteinmpnn))

cf_in = g.add_input_node(
    model_type="auto", num_models="1", num_recycles="3",
    msa_mode="single_sequence", num_seeds="1",
    rank_by="plddt", stop_at_score="100",
)
colabfold = g.add_gather_node("colabfold", split_key="sequences")
g.add_edge((proteinmpnn, colabfold), (cf_in, colabfold))
g.set_output_node(colabfold)
print(json.dumps(g.execute(), indent=2, default=str))

Self-consistency: does a design fold back to its intended shape?

Question: for each de-novo backbone, do the designed sequences fold back into that backbone? This is the standard self-consistency RMSD filter for protein designs.

The pipeline is the two-stage fan-out above (rfdiffusion -> map(proteinmpnn) -> map(colabfold)), followed by a geometry step that superposes each predicted structure onto its source backbone and reports the deviation (low = the design folds as intended).
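The geometry step is commonly done with the Kabsch algorithm: center both coordinate sets, find the optimal rotation from a singular value decomposition, then measure the remaining deviation. The sketch below is one way to do that with NumPy, not necessarily what the project's own script does. It assumes the two structures have already been reduced to matched (N, 3) arrays, for example one atom per residue in the same order; the function name kabsch_rmsd and the toy coordinates are illustrative.

```python
import numpy as np


def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two matched (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)          # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                     # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ P.T).T - Q          # rotate P onto Q and compare
    return float(np.sqrt((diff ** 2).sum() / len(P)))


# Toy "backbone": five points that are not coplanar.
backbone = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0],
                     [1.5, 1.5, 1.5], [0.0, 1.5, 3.0]])

# A copy rotated 90 degrees about z and shifted: same shape, different pose.
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = backbone @ rot_z.T + np.array([10.0, -5.0, 2.0])

# A copy with one point displaced: a genuinely different shape.
distorted = moved.copy()
distorted[4] += np.array([0.0, 0.0, 2.0])

print(kabsch_rmsd(backbone, moved))      # close to 0: pose differences are removed
print(kabsch_rmsd(backbone, distorted))  # clearly above 0
```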

flowchart LR
    RF["rfdiffusion"] -->|"backbones"| MP["proteinmpnn (map)"]
    MP -->|"sequences"| CF["colabfold (map)"]
    RF -. compare .-> SC
    CF --> SC{{"scRMSD<br/>predicted vs. backbone"}}
    SC --> OUT(["pass / fail per design"])

The fan-out uses two gather nodes as above; the RMSD comparison is a short Python step over the results (a join between each prediction and its source backbone, which the graph does not express as an edge). The full script is run/run_scrmsd_pipeline.py.
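The join itself is ordinary grouping. The sketch below is a hedged illustration rather than an excerpt of run/run_scrmsd_pipeline.py: the record layout is invented, the RMSD values are made up, scoring a backbone by its best sequence is one possible policy, and the 2.0 Å cutoff is a commonly used threshold chosen here as an assumption.

```python
from collections import defaultdict

# Hypothetical flat records: one per predicted structure, tagged with the
# backbone it was designed for and an RMSD (in angstroms) computed beforehand.
predictions = [
    {"backbone": "design_0", "sequence": 0, "rmsd": 1.2},
    {"backbone": "design_0", "sequence": 1, "rmsd": 3.4},
    {"backbone": "design_1", "sequence": 0, "rmsd": 4.1},
    {"backbone": "design_1", "sequence": 1, "rmsd": 2.7},
]


def self_consistency(records, cutoff=2.0):
    """Per backbone: lowest RMSD over its sequences, and whether it meets the cutoff."""
    by_backbone = defaultdict(list)
    for rec in records:
        by_backbone[rec["backbone"]].append(rec["rmsd"])
    return {
        name: {"best_rmsd": min(values), "passed": min(values) <= cutoff}
        for name, values in by_backbone.items()
    }


report = self_consistency(predictions)
for name, row in report.items():
    print(name, row)
# design_0 {'best_rmsd': 1.2, 'passed': True}
# design_1 {'best_rmsd': 2.7, 'passed': False}
```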

Note

This pipeline is heavy: ColabFold runs once per designed sequence. Use --modal with a GPU and start small (num_designs=2, diffuser_T=20).
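A back-of-the-envelope count makes the cost concrete. The sketch assumes every backbone yields exactly num_seq_per_target sequences and that each sequence triggers one ColabFold run, as stated in the note; the function name is hypothetical.

```python
def colabfold_runs(num_designs: int, num_seq_per_target: int) -> int:
    """Number of ColabFold runs under the one-run-per-sequence assumption."""
    return num_designs * num_seq_per_target


print(colabfold_runs(2, 2))    # 4: the small settings used in these examples
print(colabfold_runs(50, 8))   # 400: the count grows multiplicatively
```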