Development and Implementation of an Automated Data Analysis Workflow for High Throughput Part Characterisation Data | AIChE

Authors 

Ainsworth, C. M. (Presenter), Imperial College London
Bultelle, M. A., Imperial College London
Sainz de Murieta, I., Imperial College London
Kitney, R. I., Imperial College London

Synthetic biology relies on constructing devices from small genetic parts, which are catalogued by various registries and databases. However, the characterisation data associated with these parts is often brief or missing, so it is unclear how the parts will function. The Centre for Synthetic Biology and Innovation at Imperial College has begun to develop automated characterisation pipelines for biological parts. These can generate large amounts of characterisation data (an automated characterisation platform can, for example, collect data for up to 24 constitutive promoters a day), so the data analysis methods must be equally high throughput. We describe a highly automated analysis pipeline used to handle this data stream. Once analysed, the data is presented on the SynBIS database for use by other scientists when designing synthetic constructs.

We have standardised the data input: the raw files are time-series data from plate readers and FCS files from flow cytometers. The files are read, and the analysis performed, in the R programming language. The data is stored according to the "tidy data" principles (Wickham, 2014), ensuring that the data objects are both human readable and amenable to a modular, flexible style of analysis. Just as modularity is becoming recognised as important when building biological devices in synthetic biology, it is also important for the analysis of characterisation data. Each step (accounting for backgrounds, fitting growth curves, calculating synthesis rates, and so on) is treated as a separate module, making the pipeline as flexible as possible; for example, the scripts can handle data in which backgrounds have already been accounted for. We present an algorithm developed for the analysis of constitutive and inducible biological part characterisation data, whether from time-series or flow cytometry sources.
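The modular, tidy-data style described above can be sketched in R as follows. This is an illustrative sketch only, not the pipeline's actual code: the column names (`well`, `time_s`, `od600`, `gfp`), the function names, and the toy plate-reader values are all assumptions, and the synthesis-rate module uses a deliberately simple finite-difference estimate.

```r
# Illustrative sketch of a tidy, modular plate-reader analysis.
# All names and numbers below are hypothetical, not from the pipeline itself.

# Tidy time-series data: one row per observation, one column per variable.
plate <- data.frame(
  well   = rep(c("A1", "A2", "B1"), each = 3),
  time_s = rep(c(0, 600, 1200), times = 3),
  od600  = c(0.05, 0.08, 0.13, 0.05, 0.09, 0.15, 0.04, 0.04, 0.05),
  gfp    = c(12, 30, 70, 10, 28, 65, 9, 10, 11)
)

# Module 1: subtract the media-only background (here, blank well "B1").
# Data that is already background-corrected can simply skip this module.
subtract_background <- function(df, blank_well) {
  blank <- df[df$well == blank_well, c("time_s", "od600", "gfp")]
  names(blank)[2:3] <- c("od_blank", "gfp_blank")
  df <- merge(df[df$well != blank_well, ], blank, by = "time_s")
  df$od600 <- df$od600 - df$od_blank
  df$gfp   <- df$gfp   - df$gfp_blank
  df[, c("well", "time_s", "od600", "gfp")]
}

# Module 2: a crude per-well synthesis rate, d(GFP)/dt normalised by OD.
synthesis_rate <- function(df) {
  do.call(rbind, lapply(split(df, df$well), function(w) {
    w <- w[order(w$time_s), ]
    rate <- diff(w$gfp) / diff(w$time_s) / head(w$od600, -1)
    data.frame(well = w$well[1], mean_rate = mean(rate))
  }))
}

# Because every module consumes and returns a tidy data frame,
# the modules compose freely and can be reordered or omitted.
rates <- synthesis_rate(subtract_background(plate, "B1"))
```

Keeping each step as a function over tidy data frames is what lets the pipeline skip, swap, or reorder modules without rewriting the surrounding code.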
The approach has allowed us to analyse large datasets quickly, calculate the key metrics, and extract the associated metadata so that the output can be presented on SynBIS. Because far more data can be compared simultaneously, a more data-driven approach becomes possible, for example excluding anomalous data identified by cluster analysis. As a result, we can analyse large amounts of characterisation data quickly and accurately and output the results directly to SynBIS, where the data is disseminated. This characterisation data will aid synthetic biologists in building new devices with predictable properties, reducing the time taken by the design-build-test cycle.

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10).
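One way the cluster-based exclusion of anomalous data mentioned above could look in R is sketched below. This is a minimal illustration with synthetic numbers, not the pipeline's actual method: the metric names (`max_od`, `mean_rate`) are assumptions, and base R's `kmeans` stands in for whatever clustering the pipeline applies.

```r
# Illustrative sketch: flag anomalous wells by clustering per-well summary
# metrics and discarding the minority cluster. Synthetic data throughout.
set.seed(1)
metrics <- data.frame(
  well      = sprintf("A%d", 1:12),
  max_od    = c(rnorm(11, mean = 1.2, sd = 0.05), 0.2),  # well A12 failed to grow
  mean_rate = c(rnorm(11, mean = 50, sd = 3), 5)
)

# Cluster the scaled metrics into two groups.
km <- kmeans(scale(metrics[, c("max_od", "mean_rate")]), centers = 2, nstart = 10)

# Treat the minority cluster as anomalous and exclude it before upload.
minority <- which.min(table(km$cluster))
clean    <- metrics[km$cluster != minority, ]
```

With many wells measured at once, this kind of screening catches failed or contaminated replicates that would be hard to spot when analysing one experiment at a time.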