Example pipelines
=================

In the diagrams: rounded boxes are **inputs**, plain boxes are **tools**, a box
marked *(map)* runs **once per item**, and a diamond is a **decision loop**.

Walkthrough 1: align a set of sequences
----------------------------------------

**Task:** align a set of protein sequences with Clustal Omega.

.. mermaid::

   flowchart LR
     IN(["family.fasta"]) --> CL["clustalo"]
     CL --> OUT(["alignment"])

**Step 1: inspect the tool.** Every input-node key must match a name the tool
declares, so start by reading the manifest:

.. code-block:: console

   $ bv show clustalo --format json

.. code-block:: json

   {
     "tool": {
       "id": "clustalo",
       "description": "Clustal Omega: fast and scalable multiple sequence aligner ...",
       "image": {
         "reference": "quay.io/biocontainers/clustalo:1.2.4--h503566f_10"
       },
       "inputs": [
         {
           "name": "sequences",
           "type": "fasta",
           "cardinality": "one",
           "description": "Unaligned input sequences (protein or nucleotide)"
         }
       ],
       "outputs": [
         {
           "name": "alignment",
           "type": "msa",
           "cardinality": "one",
           "description": "Multiple sequence alignment"
         }
       ],
       "entrypoint": {
         "command": "clustalo",
         "args_template": "--threads={cpu_cores} -i {sequences} -o {alignment} --outfmt=fa"
       }
     }
   }

**Step 2: pick the inputs.** Clustal Omega declares a single input,
``sequences``. It carries no ``required`` field because a tool's only input is
implicitly required; in tools with several inputs, supply every one marked
``"required": true``. The single output is ``alignment``.

**Step 3: put your file in** ``inputs/``. The CLI uploads the ``inputs/``
folder, which appears in the run as ``/vol/inputs/``::

   inputs/family.fasta

**Step 4: write the pipeline.** The input-node key is exactly the manifest
name, ``sequences``:

.. code-block:: python

   from biocomposer import Graph

   g = Graph()

   # The keyword is the manifest input name.
   seqs = g.add_input_node(sequences="/vol/inputs/family.fasta")
   align = g.add_node("clustalo")

   g.add_edge((seqs, align))
   g.set_output_node(align)

   print(g.execute())

**Step 5: run it**::

   biocomp run run/align.py --env .env

The result dict is keyed by the tool's output name, ``alignment``.

Walkthrough 2: design a sequence for each generated backbone
-------------------------------------------------------------

**Task:** generate protein backbones with RFdiffusion, then design an
amino-acid sequence for each one with ProteinMPNN. This introduces two things:
a tool with **several inputs** (some required), and a **gather node**, whose
``split_key`` is also read from a manifest.

.. mermaid::

   flowchart LR
     RFI(["input_pdb, contigs,<br/>num_designs"]) --> RF["rfdiffusion"]
     KN(["num_seq_per_target,<br/>sampling_temp"]) -.-> MP
     RF -->|"designs (N backbones)"| MP["proteinmpnn (map)"]
     MP --> OUT(["sequences: [...]"])

**Step 1: inspect both tools.**

.. code-block:: console

   $ bv show rfdiffusion --format json

.. code-block:: text

   {
     "tool": {
       "id": "rfdiffusion",
       "inputs": [
         { "name": "input_pdb",   "type": "pdb",  "cardinality": "one", "required": false, ... },
         { "name": "contigs",     "type": "file", "cardinality": "one", "required": true,  ... },
         { "name": "num_designs", "type": "file", "cardinality": "one", "required": true,  ... }
       ],
       "outputs": [
         { "name": "designs", "type": "dir", "cardinality": "one",
           "description": "Output directory; one design_.pdb backbone per design ..." }
       ]
     }
   }

.. code-block:: console

   $ bv show proteinmpnn --format json

.. code-block:: text

   {
     "tool": {
       "id": "proteinmpnn",
       "inputs": [
         { "name": "pdb_path",           "type": "pdb",  "cardinality": "one", "required": true, ... },
         { "name": "num_seq_per_target", "type": "file", "cardinality": "one", "required": true, ... },
         { "name": "sampling_temp",      "type": "file", "cardinality": "one", "required": true, ... }
       ],
       "outputs": [
         { "name": "sequences", "type": "dir", "cardinality": "one", ... }
       ]
     }
   }

**Step 2: pick the inputs.** Supply every ``"required": true`` input. For
RFdiffusion that is ``contigs`` and ``num_designs`` (``input_pdb`` is optional
but used here). For ProteinMPNN it is ``pdb_path``, ``num_seq_per_target``, and
``sampling_temp``.

**Step 3: why a gather node, and what** ``split_key`` **is.** RFdiffusion's
output ``designs`` has ``cardinality = "one"`` *but it is a directory of many
backbones*, while ProteinMPNN's ``pdb_path`` also has ``cardinality = "one"``
and accepts one backbone per run. ProteinMPNN must therefore run once per
backbone, which makes it a **gather node**. Its ``split_key`` is the
**upstream output name** holding the collection to fan out over; here,
``"designs"``. (You read ``split_key`` from the *producing* tool's outputs, not
the consuming tool's inputs.)

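
The fan-out described in Step 3 can be pictured in plain Python. This is only a
conceptual sketch, not how ``biocomposer`` implements gather nodes:
``fan_out``, ``fake_proteinmpnn``, and the two backbone file names are
hypothetical, and the upstream ``designs`` directory is assumed to have already
been expanded into a list of backbones.

```python
from typing import Callable, Dict, List


def fan_out(upstream: Dict[str, list], split_key: str, item_input: str,
            shared: Dict[str, str], run_tool: Callable[..., dict]) -> List[dict]:
    """Call run_tool once per item in upstream[split_key], reusing `shared` each time."""
    return [run_tool(**{item_input: item}, **shared) for item in upstream[split_key]]


def fake_proteinmpnn(pdb_path: str, num_seq_per_target: str, sampling_temp: str) -> dict:
    """Stand-in for the real tool (hypothetical): records which backbone it received."""
    return {"sequences": f"seqs_for_{pdb_path}"}


# Hypothetical upstream result: the "designs" output, expanded into its backbones.
upstream = {"designs": ["design_a.pdb", "design_b.pdb"]}

# Shared scalars are passed unchanged to every run.
shared = {"num_seq_per_target": "2", "sampling_temp": "0.1"}

# split_key names the upstream output; item_input names the consuming tool's input.
runs = fan_out(upstream, "designs", "pdb_path", shared, fake_proteinmpnn)

# Gather: each run's "sequences" output is collected into one list.
gathered = {"sequences": [r["sequences"] for r in runs]}
print(gathered)
```

The point of the sketch is the direction of the lookup: ``split_key`` indexes
the *upstream* result, while each item is bound to the *consuming* tool's
single-valued input.
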

**Step 4: put inputs in** ``inputs/``::

   inputs/1l9h.pdb

**Step 5: write the pipeline.** Note the two input nodes: one feeds
RFdiffusion, the other carries ProteinMPNN's shared parameters. Every key
below is a manifest input name.

.. code-block:: python

   from biocomposer import Graph

   g = Graph()

   # rfdiffusion input names
   rf_in = g.add_input_node(
       input_pdb="/vol/inputs/1l9h.pdb",
       contigs="[150-150]",
       num_designs="2",
   )
   rfdiffusion = g.add_node("rfdiffusion")
   g.add_edge((rf_in, rfdiffusion))

   # proteinmpnn input names (shared scalars, reused for every backbone)
   mpnn_in = g.add_input_node(
       num_seq_per_target="2",
       sampling_temp="0.1",
   )

   # split_key is rfdiffusion's output name
   proteinmpnn = g.add_gather_node("proteinmpnn", split_key="designs")
   g.add_edge((rfdiffusion, proteinmpnn), (mpnn_in, proteinmpnn))

   g.set_output_node(proteinmpnn)
   print(g.execute())

**Step 6: run** (GPU tool → use Modal)::

   biocomp run --modal run/design.py --env .env

ProteinMPNN runs once per backbone; the result gathers each run's
``sequences`` output into a list.

Multiple sequence alignment, then trim, then fold
-------------------------------------------------

**Task:** align an RNA family, trim noisy columns, predict a consensus
secondary structure, and draw it. This is a four-tool chain. Each edge's
connector handles the format changes between the aligner, the trimmer, and the
ViennaRNA programs.

.. mermaid::

   flowchart LR
     IN(["rna_family.fasta"]) --> CL["clustalo"]
     CL --> TR["trimal"]
     TR --> AF["RNAalifold"]
     AF --> PL["RNAplot"]

.. code-block:: python

   from biocomposer import Graph

   g = Graph()

   inp = g.add_input_node(sequences="/vol/inputs/rna_family.fasta")
   clustalo = g.add_node("clustalo")
   trimal = g.add_node(
       "trimal",
       args_override="-in {alignment} -out {trimmed} -fasta -gappyout",
   )
   # Two nodes use the same tool but different programs.
   rnaalifold = g.add_node(
       "viennarna",
       entrypoint_override="RNAalifold",
       args_override="-f F --noPS {sequences} > {structures}",
   )
   rnaplot = g.add_node(
       "viennarna",
       entrypoint_override="RNAplot",
       args_override="{sequences} > {structures}",
   )

   g.add_edge(
       (inp, clustalo),
       (clustalo, trimal),
       (trimal, rnaalifold),
       (rnaalifold, rnaplot),
   )
   g.set_output_node(rnaplot)
   print(g.execute())

``viennarna`` exposes several programs, selected per node with
``entrypoint_override`` (see :doc:`nodes/tool`).

Backbones → sequences → predicted structures
--------------------------------------------

**Task:** generate backbones, design sequences for each, then fold every
designed sequence into a 3-D structure.

ColabFold predicts structure from *sequence*, so backbones first become
sequences (ProteinMPNN), then each sequence is folded (ColabFold). Two map
stages chain: the first fans out over backbones, the second over the sequences
the first produced.

.. mermaid::

   flowchart LR
     RFI(["1l9h.pdb"]) --> RF["rfdiffusion"]
     RF -->|"designs"| MP["proteinmpnn (map)"]
     MP -->|"sequences"| CF["colabfold (map)"]
     CF --> OUT(["predicted structures"])

.. code-block:: python

   import json

   from biocomposer import Graph

   g = Graph()

   rf_in = g.add_input_node(
       input_pdb="/vol/inputs/1l9h.pdb",
       contigs="[150-150]",
       num_designs="2",
       config_name="base",
       diffuser_T="20",
   )
   rfdiffusion = g.add_node("rfdiffusion")
   g.add_edge((rf_in, rfdiffusion))

   # First map stage: one ProteinMPNN run per backbone.
   mpnn_in = g.add_input_node(num_seq_per_target="2", sampling_temp="0.1")
   proteinmpnn = g.add_gather_node("proteinmpnn", split_key="designs")
   g.add_edge((rfdiffusion, proteinmpnn), (mpnn_in, proteinmpnn))

   # Second map stage: one ColabFold run per designed sequence.
   cf_in = g.add_input_node(
       model_type="auto",
       num_models="1",
       num_recycles="3",
       msa_mode="single_sequence",
       num_seeds="1",
       rank_by="plddt",
       stop_at_score="100",
   )
   colabfold = g.add_gather_node("colabfold", split_key="sequences")
   g.add_edge((proteinmpnn, colabfold), (cf_in, colabfold))

   g.set_output_node(colabfold)
   print(json.dumps(g.execute(), indent=2, default=str))

Self-consistency: does a design fold back to its intended shape?
----------------------------------------------------------------

**Question:** for each de-novo backbone, do the designed sequences fold back
into that backbone? This is the standard **self-consistency RMSD** filter for
protein designs.

The pipeline is the two-stage fan-out above
(``rfdiffusion -> map(proteinmpnn) -> map(colabfold)``), followed by a geometry
step that superposes each predicted structure onto its source backbone and
reports the deviation (low = the design folds as intended).

.. mermaid::

   flowchart LR
     RF["rfdiffusion"] -->|"backbones"| MP["proteinmpnn (map)"]
     MP -->|"sequences"| CF["colabfold (map)"]
     RF -. compare .-> SC
     CF --> SC{{"scRMSD<br/>predicted vs. backbone"}}
     SC --> OUT(["pass / fail per design"])

The fan-out uses two gather nodes as above; the RMSD comparison is a short
Python step over the results (a *join* between each prediction and its source
backbone, which the graph does not express as an edge). The full script is
``run/run_scrmsd_pipeline.py``.

.. note::

   This pipeline is heavy: ColabFold runs once per designed sequence. Use
   ``--modal`` with a GPU and start small (``num_designs=2``,
   ``diffuser_T=20``).
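
The geometry step can be sketched with NumPy as a Kabsch superposition followed
by an RMSD. This is an illustrative sketch, not the contents of
``run/run_scrmsd_pipeline.py``: the function name ``kabsch_rmsd``, the
synthetic coordinates, and the 2.0 Å pass threshold are all assumptions, and a
real script would first extract matching atoms (for example Cα) from the
predicted and source PDB files.

```python
import numpy as np


def kabsch_rmsd(predicted: np.ndarray, backbone: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    P = np.asarray(predicted, dtype=float)
    Q = np.asarray(backbone, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("both inputs must be (N, 3) arrays with the same N")

    # Remove translation by centring both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Kabsch: SVD of the covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Flip the last axis if needed so the result is a proper rotation, not a mirror.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    diff = Pc @ R.T - Qc
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


# Synthetic check: a rotated and shifted copy of a structure should score ~0.
rng = np.random.default_rng(0)
backbone = rng.normal(size=(50, 3))
angle = np.deg2rad(40.0)
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
predicted = backbone @ Rz.T + np.array([5.0, -3.0, 2.0])

THRESHOLD = 2.0  # assumed cutoff in angstroms; pick one appropriate to your designs
rmsd = kabsch_rmsd(predicted, backbone)
print("pass" if rmsd < THRESHOLD else "fail")
```

Because the graph does not express the join as an edge, a step like this would
run after ``g.execute()`` returns, pairing each predicted structure with the
backbone it was designed from.
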