A Rule-Based Library for the Verification and Modification of Synthetic DNA Sequences | AIChE

A Rule-Based Library for the Verification and Modification of Synthetic DNA Sequences

Authors 

Oberortner, E. - Presenter, Lawrence Berkeley National Laboratory
Deutsch, S., DOE Joint Genome Institute
Cheng, J. F., Lawrence Berkeley National Laboratory
Hillson, N. J., DOE Joint BioEnergy Institute

Many synthetic biology research groups design complex DNA sequences in silico according to biological discoveries and rules, but order the physical sequences from DNA synthesis vendors. At the latest before physical synthesis begins, the designed sequences must be verified against the rules imposed by DNA synthesis methods and technologies. If a sequence violates a rule, corrective actions are performed manually, and these depend on the type of rule and the designer's knowledge. Such manual sequence modifications are hard to reproduce, error-prone, and time-consuming, and they prevent fully automated workflows.

Various algorithms have been developed to verify sequences against rules in silico. Biological advances have shown that rules fall mainly into two categories: repeats and GC content. Violations of both types can occur across the entire sequence (i.e. globally) or in specific regions (i.e. locally). Direct repeats denote k-mers (e.g. polymers, dimers, trimers), and inverted repeats indicate potential undesired secondary structures (e.g. hairpins). Here, we demonstrate the Sequence Polishing Library (SPL) and its integration into an automated workflow at the DOE Joint Genome Institute. Currently, the SPL imports sequence information from the GenBank, FASTA, FASTQ, and SBOL file formats. Besides finding exact matches of sequence repeats, the SPL can find repeats up to a configurable degree of similarity. To this end, the SPL programmatically integrates the BBMap bioinformatics library (http://sourceforge.net/projects/bbmap).
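To make the two rule categories concrete, the following minimal Python sketch checks global and local GC content and finds exact direct and inverted repeats. It is an illustration only and not the SPL's implementation: all function names, the window size, and the thresholds are assumptions, and it covers exact matches only (the similarity-based search that the SPL delegates to BBMap is not reproduced here).

```python
from collections import defaultdict

def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def local_gc_violations(seq, window, low, high):
    """Start positions of windows whose GC fraction lies outside [low, high]."""
    return [i for i in range(len(seq) - window + 1)
            if not low <= gc_content(seq[i:i + window]) <= high]

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmer_positions(seq, k):
    """Map every k-mer in seq to the list of its start positions."""
    positions = defaultdict(list)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions

def direct_repeats(seq, k):
    """k-mers that occur more than once, with their start positions."""
    return {kmer: pos for kmer, pos in kmer_positions(seq, k).items()
            if len(pos) > 1}

def inverted_repeats(seq, k):
    """Pairs (i, j) of non-overlapping k-mers where the k-mer at j is the
    reverse complement of the k-mer at i (candidate hairpin arms)."""
    positions = kmer_positions(seq, k)
    seq = seq.upper()
    pairs = []
    for i in range(len(seq) - k + 1):
        rc = reverse_complement(seq[i:i + k])
        pairs.extend((i, j) for j in positions.get(rc, []) if j >= i + k)
    return pairs
```

A global check is simply `gc_content(seq)` compared against the allowed range, while `local_gc_violations` slides a fixed-size window over the sequence to report the regions that violate it.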

For the specification of rules, we developed a simple yet expressive language, which also supports the specification of corrective actions. The SPL differentiates between human, recommended, and automated corrective actions. By default, corrective actions must be performed by a human, since the appropriate action depends on where a violation occurs and what type it is. To modify sequences in an automated fashion, information about the structure of the sequences is needed. For example, if a coding sequence contains repeats, then organism-specific codon tables can be consulted to modify the DNA sequence. If a promoter violates a rule, then the SPL can recommend alternative promoters to the designer. Such structural information can be specified, for example, using the GenBank or SBOL format.
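As a rough illustration of an automated corrective action on a coding sequence, the sketch below pairs a rule with one of the three action levels and tries single synonymous codon swaps until the rule no longer fires. It does not show the SPL's rule language, which is not given here: the `Rule` and `Action` types, the `recode` helper, and the homopolymer rule are all hypothetical, and the codon table is a small excerpt of the standard genetic code rather than an organism-specific table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Action(Enum):
    HUMAN = "human"
    RECOMMENDED = "recommended"
    AUTOMATED = "automated"

@dataclass
class Rule:
    name: str
    violated: Callable[[str], bool]   # returns True if the sequence breaks the rule
    action: Action = Action.HUMAN     # human correction is the default level

# Excerpt of the standard genetic code, grouped by amino acid (assumption:
# a real workflow would load an organism-specific codon usage table).
SYNONYMS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "K": ["AAA", "AAG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}

def has_homopolymer(seq: str, n: int) -> bool:
    """True if seq contains a run of n identical bases."""
    return any(base * n in seq for base in "ACGT")

def recode(cds: str, rule: Rule) -> Optional[str]:
    """Try every single synonymous codon swap, in order, and return the first
    sequence that satisfies the rule; return None if no single swap suffices."""
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        aa = CODON_TO_AA.get(codon)
        if aa is None:
            continue  # codon not covered by the excerpt table
        for alt in SYNONYMS[aa]:
            if alt == codon:
                continue
            candidate = cds[:i] + alt + cds[i + 3:]
            if not rule.violated(candidate):
                return candidate
    return None

rule = Rule("no run of 6 identical bases",
            lambda s: has_homopolymer(s, 6), Action.AUTOMATED)
fixed = recode("AAAAAAGCT", rule)  # Lys-Lys-Ala with a run of six A's
```

When `recode` returns `None`, no single swap resolves the violation, and a workflow built this way would fall back to a recommended or human corrective action.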

The SPL can be invoked programmatically via an Application Programming Interface (API) and utilized manually via a web-based User Interface (UI). Both interfaces provide expressive feedback about rule violations and corrective actions. Therefore, we believe that synthetic biology researchers as well as DNA synthesis vendors can benefit from utilizing the SPL, contributing to the automation of internal and cross-organizational workflows.