BcForms: a toolkit for concretely describing macromolecular complexes

BcForms is a toolkit for concretely describing the molecular structure (atoms and bonds) of macromolecular complexes, including non-canonical monomeric forms, circular topologies, and crosslinks. BcForms was developed to help describe the semantic meaning of whole-cell computational models.

BcForms includes a grammar for describing forms of macromolecular complexes composed of DNA, RNA, protein, and small molecule subunits and crosslinks between the subunits. The DNA, RNA, and protein subunits can be described using BpForms and the small molecule subunits can be described using SMILES. BcForms also includes four software tools for verifying descriptions of complexes and calculating physical properties of complexes such as their molecular structure, formula, molecular weight, and charge: this website, a JSON REST API, a command line interface, and a Python API. BcForms is available open-source under the MIT license.

BcForms verifier/calculator

Enter a complex and its subunits

Calculated properties of the complex

Features

BcForms has the following features:

  • Concrete: To help researchers communicate and integrate data about macromolecules, the grammar can capture the primary structures of complexes, including non-canonical (NC) residues, caps, crosslinks, and nicks.
  • Abstract: To facilitate network research, BcForms uses alphabets of residues and an ontology of crosslinks to abstract the structures of polymers.
  • Extensible: To capture any complex, users can define residues and crosslinks inline or define custom alphabets and ontologies.
  • Structured coordinates: To compose residues and crosslinks into complexes, each subunit, residue and atom has a unique coordinate relative to its parent.
  • Context-free: To help integrate information about the processes which synthesize and modify macromolecules, the grammar captures the structures of macromolecules separately from the processes which generate them.
  • User-friendly: To ensure BcForms is easy to use, the grammar is human-readable, and BcForms includes a web application and a command-line program.
  • Machine-readable: The grammar is machine-readable to enable analyses of macromolecules.
  • Composable: To facilitate network research, BcForms includes protocols for composing the grammar with formats such as CellML and SBML.

Grammar for complexes

Overview

BcForms represents a complex as a set of subunits, including their stoichiometries, and a set of interchain/intersubunit crosslinks. Furthermore, BcForms can be combined with BpForms and SMILES descriptions of subunits to calculate properties of complexes.

BcForms descriptions of complexes consist of two parts:

  • Subunit composition of the complex: The subunit composition of the complex, including the stoichiometry of each subunit, is described as a linear expression (e.g., 3 * subunit_a + 2 * subunit_b).
  • Interchain crosslinks: Each crosslink is described as indicated below (e.g., | x-link: [...]).

The BcForms grammar is defined in Lark syntax, which is based on EBNF syntax.
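As a rough illustration of the first part of a description, the sketch below reads a subunit composition such as 3 * subunit_a + 2 * subunit_b into a table of stoichiometries. It uses a regular expression rather than the Lark grammar that BcForms defines. The function name parse_composition is hypothetical, and the sketch assumes that subunit identifiers consist of letters, digits, and underscores and that an optional "complex:" label may precede the expression.

```python
import re

# Hypothetical helper for illustration only; BcForms itself parses
# descriptions with a grammar written in Lark syntax.
_TERM = re.compile(r"^\s*(?:(\d+)\s*\*\s*)?([A-Za-z_]\w*)\s*$")


def parse_composition(text):
    """Map each subunit id to its stoichiometry."""
    # Keep only the part before the first pipe; crosslinks are ignored here.
    body = text.split("|", 1)[0]
    # Drop an optional "complex:" label.
    if body.strip().startswith("complex:"):
        body = body.split(":", 1)[1]
    composition = {}
    for term in body.split("+"):
        match = _TERM.match(term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        count, subunit = match.groups()
        # A missing coefficient means a stoichiometry of one.
        composition[subunit] = composition.get(subunit, 0) + int(count or 1)
    return composition


print(parse_composition("3 * subunit_a + 2 * subunit_b"))
print(parse_composition("complex: sub_a + sub_b"))
```

The sketch deliberately stops at the first pipe, so crosslinks still need to be handled separately.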

Examples

Heterodimer with no crosslinks
complex: sub_a + sub_b
sub_a: bpforms.ProteinForm(AC)
sub_b: bpforms.ProteinForm(MK)

Structure: C[C@H]([NH3+])C(=O)N[C@H](C(=O)O)CS.CSCC[C@H]([NH3+])C(=O)N[C@@H](CCCC[NH3+])C(=O)O
Formula: C17H38N5O6S2
Molecular weight: 472.64
Charge: 3

Homodimer with a crosslink
complex: 2 * sub_c | x-link: [
    l-bond-atom: sub_c(1)-1S11 |
    l-displaced-atom: sub_c(1)-1H11 |
    r-bond-atom: sub_c(2)-1S11 |
    r-displaced-atom: sub_c(2)-1H11
  ]
sub_c: bpforms.ProteinForm(CA)

Structure: C(=O)([C@@H]([NH3+])CSSC[C@@H](C(=O)N[C@@H](C)C(=O)O)[NH3+])N[C@@H](C)C(=O)O
Formula: C12H24N4O6S2
Molecular weight: 384.466
Charge: 2
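The formula and charge of the crosslinked homodimer can be checked by hand: sum the formulas of the two chains and subtract the atoms displaced by the crosslink. The sketch below does this arithmetic in plain Python. The per-chain formula (C6H13N2O3S, charge +1) was counted from the SMILES of the dipeptide CA and is an assumption of this sketch, as are the helper names; this is not how the BcForms software is implemented.

```python
from collections import Counter

# Formula and charge of one sub_c chain (the dipeptide CA), counted by hand
# from its SMILES; treat these values as an assumption of this sketch.
SUB_C_FORMULA = Counter({"C": 6, "H": 13, "N": 2, "O": 3, "S": 1})
SUB_C_CHARGE = 1


def crosslinked_formula(subunit_formula, stoichiometry, displaced_atoms):
    """Sum the subunit formulas and subtract the atoms displaced by crosslinks."""
    total = Counter()
    for element, count in subunit_formula.items():
        total[element] = count * stoichiometry
    total.subtract(Counter(displaced_atoms))
    # Drop elements whose count fell to zero.
    return {element: count for element, count in total.items() if count}


def format_formula(formula):
    """Write carbon first, hydrogen second, and the rest alphabetically."""
    order = sorted(formula, key=lambda e: (e not in ("C", "H"), e))
    return "".join(f"{e}{formula[e] if formula[e] != 1 else ''}" for e in order)


# The disulfide bond displaces one hydrogen from each cysteine.
formula = crosslinked_formula(SUB_C_FORMULA, 2, ["H", "H"])
charge = 2 * SUB_C_CHARGE  # both displaced hydrogens are uncharged
print(format_formula(formula), charge)
```

The result agrees with the formula and charge listed for the homodimer above.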

Crosslinks between subunits

The x-link attribute can be used to indicate a bond between atoms from different subunits. For example, this attribute can describe interchain disulfide bonds between cysteines in proteins and interstrand crosslinks in DNA.

Each crosslink can be described by enclosing attributes which indicate the atoms involved in the bond within square brackets and delimiting the attributes with pipes (e.g., "| x-link: [l-bond-atom: sub_a(1)-1C1 | r-bond-atom: sub_b(1)-3C2 | ...]").

BcForms allows two ways of defining inter-subunit crosslinks: inline definition and definition using our ontology of crosslinks.

Examples

User-defined crosslinks

Each crosslink can be described using the following attributes:

  • l-bond-atom and r-bond-atom: These attributes indicate the atoms involved in the bond. The value of each attribute consists of the subunit and, in parentheses, the copy of the subunit, followed by the position of the monomeric form within the sequence of the subunit, the element of the atom, the position of the atom within the monomeric form, and, optionally, the charge of the atom (e.g., sub_a(1)-8N3+1). Open Babel can be used to display the numbers of the atoms within monomeric forms.
  • l-displaced-atom and r-displaced-atom: These attributes indicate the atoms displaced by the formation of the bond. The values of these attributes have the same form as the values of the bond atom attributes.
  • order: This attribute can indicate the order (single, double, triple, aromatic) of the bond.
  • stereo: This attribute can indicate the stereochemistry of the bond (wedge, hash, up, down).
  • comments: This attribute can indicate comments about the crosslink, such as uncertainty about its location or structure.

Each crosslink can have one or more left and right bond atoms, and zero or more left and right displaced atoms. Each crosslink must have the same number of left and right bond atoms.

Examples

Interchain disulfide bond
| x-link: [ l-bond-atom: sub_c(1)-1S11 |
            l-displaced-atom: sub_c(1)-1H11 |
            r-bond-atom: sub_c(2)-1S11 |
            r-displaced-atom: sub_c(2)-1H11 ]
Interchain isopeptide bond
| x-link: [ l-bond-atom: b(1)-4C2 |
            r-bond-atom: a(2)-1N1-1  |
            l-displaced-atom: b(1)-4O1 |
            l-displaced-atom: b(1)-4H1 |
            r-displaced-atom: a(2)-1H1+1 |
            r-displaced-atom: a(2)-1H1 ]
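To make the structure of these attribute values concrete, the sketch below splits an inline crosslink into its atom attributes and checks that it has the same number of left and right bond atoms. The regular expression is inferred from the examples above and is not the official grammar; the function names are hypothetical, and the sketch handles only the four atom attributes (not order, stereo, or comments).

```python
import re
from collections import defaultdict

# Hypothetical pattern inferred from examples such as sub_a(1)-8N3+1:
# subunit(copy)-<residue position><element><atom position>[charge]
ATOM = re.compile(
    r"^(?P<subunit>\w+)\((?P<copy>\d+)\)-"
    r"(?P<residue>\d+)(?P<element>[A-Z][a-z]?)(?P<position>\d+)"
    r"(?P<charge>[+-]\d+)?$"
)


def parse_atom(text):
    """Split one atom coordinate into its parts; a missing charge means 0."""
    match = ATOM.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse atom: {text!r}")
    fields = match.groupdict()
    return {
        "subunit": fields["subunit"],
        "copy": int(fields["copy"]),
        "residue": int(fields["residue"]),
        "element": fields["element"],
        "position": int(fields["position"]),
        "charge": int(fields["charge"] or 0),
    }


def parse_crosslink(text):
    """Group the atoms of one inline crosslink by attribute name."""
    inner = text[text.index("[") + 1 : text.rindex("]")]
    atoms = defaultdict(list)
    for attribute in inner.split("|"):
        name, _, value = attribute.partition(":")
        atoms[name.strip()].append(parse_atom(value))
    n_left = len(atoms["l-bond-atom"])
    n_right = len(atoms["r-bond-atom"])
    if n_left == 0 or n_left != n_right:
        raise ValueError("a crosslink needs equal, non-zero numbers of bond atoms")
    return dict(atoms)


isopeptide = (
    "| x-link: [ l-bond-atom: b(1)-4C2 | r-bond-atom: a(2)-1N1-1 | "
    "l-displaced-atom: b(1)-4O1 | l-displaced-atom: b(1)-4H1 | "
    "r-displaced-atom: a(2)-1H1+1 | r-displaced-atom: a(2)-1H1 ]"
)
crosslink = parse_crosslink(isopeptide)
print(crosslink["r-bond-atom"])
```

Keeping the atoms grouped by attribute name makes the rule about matching numbers of left and right bond atoms a one-line check.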

Ontology definition of crosslinks

Each crosslink can alternatively be described by using our ontology with three attributes. The list of crosslinks defined in the ontology is available at bpforms.org/crosslink.

  • type: This attribute indicates the type of the crosslink.
  • l and r: These attributes indicate the position of the monomeric form within the sequence of the subunit.

Each crosslink must have one type, one left monomeric form, and one right monomeric form.

Complexes can have zero, one, or more crosslinks.

Examples

Interchain disulfide bond
| x-link: [ type: disulfide |
            l: sub_c(1)-1 |
            r: sub_c(2)-1 ]
Interchain isopeptide bond
| x-link: [ type: glycyl_lysine_isopeptide |
            l: b(1)-4 |
            r: a(2)-1 ]

Coordinate system

Each subunit, residue, and atom represented by BcForms has a unique coordinate. The coordinates of repeated subunits range from one to the stoichiometry of the subunit. The coordinate of each residue is a two-tuple of the coordinate of its parent subunit and its position within the residue sequence of its parent subunit. The coordinate of each atom is a three-tuple of the coordinate of its parent subunit, the position of its parent residue within the residue sequence of its parent polymer, and its position within the canonical SMILES ordering of its parent residue (which can be displayed by Open Babel).

Example

The example below illustrates the atom coordinates for the modified amino acid N5-methyl-L-arginine.

[id: "AA0305"
  | name: "N5-methyl-L-arginine"
  | structure: "OC(=O)[C@H](CCCN(C(=[NH2])N)C)[NH3+]"
  | l-bond-atom: N16-1
  | l-displaced-atom: H16+1
  | l-displaced-atom: H16
  | r-bond-atom: C2
  | r-displaced-atom: O1
  | r-displaced-atom: H1
  ]
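The coordinate system for subunits and residues can be sketched in a few lines of Python. The example below enumerates the residue coordinates of a complex from its stoichiometries and sequence lengths and writes them in the subunit(copy)-position form used by the crosslink attributes. The class and function names are hypothetical, and representing a subunit coordinate as an identifier plus a copy number is an assumption of this sketch.

```python
from typing import Dict, Iterator, NamedTuple


class ResidueCoordinate(NamedTuple):
    subunit: str   # identifier of the parent subunit
    copy: int      # ranges from 1 to the stoichiometry of the subunit
    position: int  # position within the residue sequence of the subunit


def residue_coordinates(
    composition: Dict[str, int], lengths: Dict[str, int]
) -> Iterator[ResidueCoordinate]:
    """Yield one coordinate per residue of each copy of each subunit."""
    for subunit, stoichiometry in composition.items():
        for copy in range(1, stoichiometry + 1):
            for position in range(1, lengths[subunit] + 1):
                yield ResidueCoordinate(subunit, copy, position)


def to_text(coordinate: ResidueCoordinate) -> str:
    """Format a coordinate like the l and r attributes, e.g. sub_c(1)-1."""
    return f"{coordinate.subunit}({coordinate.copy})-{coordinate.position}"


# Homodimer of a two-residue subunit, as in "2 * sub_c" with sub_c = CA.
coordinates = list(residue_coordinates({"sub_c": 2}, {"sub_c": 2}))
print([to_text(c) for c in coordinates])
```

An atom coordinate would extend this tuple with the position of the atom within its parent residue.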

Syntactic and semantic verification of descriptions of complexes

To help control the quality of information about macromolecules, the BcForms user interfaces include methods for verifying the syntactic and semantic correctness of descriptions of complexes:

  • Check that each residue has a defined structure, each atom that bonds an adjacent residue has a defined element and position which is consistent with the structure of its parent residue, and each pair of consecutive residues can form a bond. For example, this can identify invalid proteins that contain consecutive residues that cannot bond because the first residue lacks a carboxyl terminus or the second residue lacks an amino terminus.
  • Check that the element and position of each atom in each crosslink are consistent with the structure of its parent residue.
  • Check that each subunit is semantically concrete.

User interfaces

BcForms includes four software interfaces for verifying descriptions of complexes and calculating properties such as their molecular structures, formulae, molecular weights, and charges.

Webform

The webform above can be used to validate BcForms and calculate their properties.

JSON REST API

A JSON REST API is available at https://bcforms.org/api. Documentation is available by opening this URL in your browser.

Command line interface

A command line interface is available from PyPI. Installation instructions are available at docs.karrlab.org. Documentation is available inline by running bcforms --help.

Python library

A Python library is available from PyPI. Installation instructions and documentation are available at docs.karrlab.org.

Integrating BcForms into the CellML and SBML standards for kinetic models

BcForms can be used in conjunction with commonly used standards in systems biology. BcForms is also easy to embed into documents such as Excel workbooks and comma-separated tables.

CellML

BcForms can be used to concretely describe the meaning of CellML components which represent complexes. BcForms can be used with the RDF element of component objects.

Example

This example illustrates how to annotate the semantic meaning of a CellML component which represents a homotrimer of a protein which contains two phosphorylated serines (RESID: AA0037) at the fourth and eighth residues.
...

<component cmeta:id="complex" name="complex">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="#complex">
      <bcforms:BcForm xmlns:bcforms="https://bcforms.org">
        3 * subunit
      </bcforms:BcForm>
    </rdf:Description>
  </rdf:RDF>
</component>

<component cmeta:id="subunit" name="subunit">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="#subunit">
      <bpforms:ProteinForm xmlns:bpforms="https://bpforms.org">
        LID{AA0037}MAN{AA0037}FVGTR
      </bpforms:ProteinForm>
    </rdf:Description>
  </rdf:RDF>
</component>

...
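A consumer of such a model could recover the embedded descriptions with a standard XML parser. The sketch below wraps the two components in a minimal model element and extracts the BcForms and BpForms text with Python's xml.etree.ElementTree. The wrapping model element, its CellML 1.1 and cmeta namespace declarations, and the function name extract_forms are assumptions added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# The <model> wrapper and its namespaces are assumptions of this sketch;
# the two components are copied from the example above.
DOCUMENT = """
<model xmlns="http://www.cellml.org/cellml/1.1#"
       xmlns:cmeta="http://www.cellml.org/metadata/1.0#"
       name="example">
  <component cmeta:id="complex" name="complex">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="#complex">
        <bcforms:BcForm xmlns:bcforms="https://bcforms.org">
          3 * subunit
        </bcforms:BcForm>
      </rdf:Description>
    </rdf:RDF>
  </component>
  <component cmeta:id="subunit" name="subunit">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="#subunit">
        <bpforms:ProteinForm xmlns:bpforms="https://bpforms.org">
          LID{AA0037}MAN{AA0037}FVGTR
        </bpforms:ProteinForm>
      </rdf:Description>
    </rdf:RDF>
  </component>
</model>
"""

COMPONENT = "{http://www.cellml.org/cellml/1.1#}component"
FORM_TAGS = ("{https://bcforms.org}BcForm", "{https://bpforms.org}ProteinForm")


def extract_forms(xml_text):
    """Map each component name to the description embedded in its RDF."""
    root = ET.fromstring(xml_text.strip())
    forms = {}
    for component in root.iter(COMPONENT):
        for tag in FORM_TAGS:
            for element in component.iter(tag):
                # Strip the indentation that surrounds the description.
                forms[component.get("name")] = element.text.strip()
    return forms


print(extract_forms(DOCUMENT))
```

The extracted strings could then be passed to the BcForms and BpForms tools for verification.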

SBML

BcForms can be used to concretely describe the meaning of Systems Biology Markup Language (SBML) species elements which represent complexes. BcForms can be used with the annotation element of species elements.

Example

This example illustrates how to annotate the semantic meaning of an SBML species which represents a homodimer of a protein which contains a selenocysteine (U, RESID: AA0022) at the second residue.
...

<species name="complex">
  <annotation>
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="#complex">
        <bcforms:BcForm xmlns:bcforms="https://bcforms.org">
          2 * subunit
        </bcforms:BcForm>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>

<species name="subunit">
  <annotation>
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="#subunit">
        <bpforms:ProteinForm xmlns:bpforms="https://bpforms.org">
          A{U}CR
        </bpforms:ProteinForm>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>

...
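Annotations of this shape can also be generated programmatically. The sketch below builds a species element with an RDF annotation that carries a BcForms description, using Python's xml.etree.ElementTree. The function name annotate_species is hypothetical, and the sketch produces only the fragment shown above, without the enclosing SBML document or its namespace.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BCFORMS_NS = "https://bcforms.org"

# Ask ElementTree to serialize these namespaces with readable prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("bcforms", BCFORMS_NS)


def annotate_species(name, bc_form):
    """Build a species element whose annotation holds a BcForms description."""
    species = ET.Element("species", {"name": name})
    annotation = ET.SubElement(species, "annotation")
    rdf = ET.SubElement(annotation, f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"#{name}"},
    )
    form = ET.SubElement(description, f"{{{BCFORMS_NS}}}BcForm")
    form.text = bc_form
    return species


species = annotate_species("complex", "2 * subunit")
print(ET.tostring(species, encoding="unicode"))
```

ElementTree declares the namespaces on the outermost element rather than on the elements that use them, so the serialized text differs in layout from the example above while describing the same annotation.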

Resources for determining the structures of complexes

Below are several resources which can be helpful for determining the subunit and crosslink composition of complexes.

Subunit composition

  • BioCyc
  • Complex Portal
  • CORUM
  • Protein Data Bank (PDB)
  • UniProt

Crosslink composition

  • Protein Data Bank (PDB)
  • UniProt

Drawing chemical structures

  • ChemAxon Marvin
  • Open Babel

Tutorials, documentation, and help

Documentation for the grammar

Documentation for the grammar is available above. The definition of the grammar is available at GitHub.

Query builder for the REST API

A visual interface for building REST queries is available at bcforms.org/api.

Documentation for the REST API

Documentation for the REST API is available at bcforms.org/api.

Installation instructions for the CLI and Python API

Installation instructions are available at docs.karrlab.org. A minimal Dockerfile is also available from the Git repository for BpForms.

Documentation for the command line program

Documentation for the command line program is available inline by running bcforms --help.

Tutorial for the Python API

A Jupyter notebook with an interactive tutorial is available at sandbox.karrlab.org.

Documentation for the Python API

Detailed documentation for the Python API is available at docs.karrlab.org.

Questions

Please contact the Karr Lab with any questions.

Contributing to BcForms

To contribute to the software, please submit a Git pull request.

About BcForms

Source code

BcForms is available open-source from GitHub.

License

BcForms is released under the MIT license.

Citing BcForms

Coming soon!

Team

BcForms was developed by Jonathan Karr and Xiaoyue Zheng in the Karr Lab at the Icahn School of Medicine at Mount Sinai in New York, USA.

Acknowledgements

BcForms was supported by a National Institutes of Health P41 award, a National Institutes of Health MIRA R35 award, and a National Science Foundation INSPIRE award.

Questions/comments

Please contact the Karr Lab with any questions or comments.