Complete Genomics Analysis Platform

 

Our proprietary whole human genome sequencing platform consists of three technologies: our sequencing technology, our high-throughput process automation technology and our complete data management solution.

Proprietary Sequencing Technology

There are two primary components of our proprietary human genome sequencing technology: DNA nanoball arrays and combinatorial probe-anchor ligation reads.

DNA Nanoball Arrays

We have developed a novel approach to preparing fragmented DNA for reading on our sequencing instruments. Using a well-known biochemical process, we reproduce each DNA fragment in a manner that connects all of the copies together in a head-to-tail configuration, forming a long single molecule of connected nucleotides. We have developed proprietary techniques for causing each long single molecule to consolidate, or ball up, into a small particle of DNA that we call a DNA nanoball, or DNB™. The DNBs are approximately 200 nanometers in diameter on average. Each DNB contains hundreds of copies of the 70 bases of DNA we are seeking to read in each fragment.


Figure 1: Formation of DNA Nanoball (DNB) Arrays


Figure 2: Patterned Array

The small size and biochemical characteristics of our DNBs enable us to pack them together very tightly on a silicon chip. We use established photolithography processes developed in the semiconductor industry to create a silicon chip that has a grid pattern of small spots. The small spots are approximately 300 nanometers in diameter, and the center of each spot is approximately 700 nanometers from the centers of neighboring spots. Each silicon chip has approximately 2.8 billion spots in an area 25 millimeters wide and 75 millimeters long. We have developed a proprietary process to make DNA adhere to these spots, which we refer to as “sticky spots,” while preventing the DNA from adhering to the areas between the sticky spots. When a solution of DNBs is spread across the chip, the DNBs stick to the spots, one DNB per spot. We have also developed proprietary techniques to fill over 90% of the sticky spots with exactly one DNB. We refer to the silicon chip filled with DNBs as a DNA nanoball array. Each finished DNA nanoball array contains up to 180 billion bases of genomic DNA prepared for imaging.
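The array figures above can be cross-checked with a short calculation. The sketch below is illustrative only: it assumes a square grid at the stated 700-nanometer pitch when estimating how many spots could fit on the chip, and it uses exactly 90% as the fill fraction even though the text says "over 90%". All variable and function names are hypothetical.

```python
# Illustrative check of the array figures quoted in the text; not production code.
SPOTS_PER_ARRAY = 2.8e9      # approximately 2.8 billion sticky spots
FILL_FRACTION = 0.90         # "over 90%" of spots hold exactly one DNB (lower bound)
BASES_PER_DNB = 70           # bases read from each fragment
SPOT_PITCH_M = 700e-9        # center-to-center spacing
ARRAY_WIDTH_M = 25e-3
ARRAY_LENGTH_M = 75e-3


def grid_upper_bound(width_m, length_m, pitch_m):
    """Spots that would fit if a square grid covered the whole area (assumption)."""
    return (width_m / pitch_m) * (length_m / pitch_m)


def array_capacity_bases(spots, fill_fraction, bases_per_dnb):
    """Approximate number of bases available for imaging on one array."""
    return spots * fill_fraction * bases_per_dnb


upper_bound = grid_upper_bound(ARRAY_WIDTH_M, ARRAY_LENGTH_M, SPOT_PITCH_M)
capacity = array_capacity_bases(SPOTS_PER_ARRAY, FILL_FRACTION, BASES_PER_DNB)

print(f"Grid upper bound: {upper_bound / 1e9:.2f} billion spots")
print(f"Capacity at 90% fill: {capacity / 1e9:.1f} billion bases")
```

At a 90% fill the estimate comes to roughly 176 billion bases, which is in line with the "up to 180 billion bases" figure once the fill fraction rises slightly above 90%.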


Figure 3: DNA Nanoball Array Preparation

Combinatorial Probe-Anchor Ligation

To read the sequence of nucleotides in each DNB, we have developed a proprietary ligase-based DNA reading technology called combinatorial probe-anchor ligation, which we refer to as cPAL™. Our cPAL technology uses the naturally occurring ligase enzyme, which accurately distinguishes between A, C, T and G, to attach fluorescent molecules that light up red, blue, green and yellow to the nucleotides in each DNB.


Figure 4: Combinatorial Probe-Anchor Ligation (cPAL™) Chemistry

By imaging the colored light emitted from a DNB array and decoding the color images, we can determine the sequence of nucleotides in each DNB. A key characteristic of our cPAL technology is its high accuracy in reading very short 5-base sequences of DNA. We have developed a proprietary technique for preparing the DNA fragments so that we can read seven 5-base segments from each of the two ends of each DNA fragment, for a total of 70 bases from each fragment. We have also developed proprietary software that enables us to accurately reconstruct over 90% of a whole human genome from these 70-base reads.
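The read layout described above (seven 5-base segments from each of two ends) can be sketched in a few lines. This is a toy illustration, not the cPAL decoding software: the color-to-base assignment is arbitrary and hypothetical, since the text does not say which color corresponds to which nucleotide, and the sketch simply concatenates segments in the order given.

```python
SEGMENT_LENGTH = 5       # bases per segment
SEGMENTS_PER_END = 7     # segments read from each end of a fragment
ENDS_PER_FRAGMENT = 2

# Hypothetical color-to-base assignment, chosen only for illustration.
COLOR_TO_BASE = {"red": "A", "blue": "C", "green": "G", "yellow": "T"}


def decode_segment(colors):
    """Translate one 5-color segment into a 5-base string."""
    if len(colors) != SEGMENT_LENGTH:
        raise ValueError(f"expected {SEGMENT_LENGTH} colors, got {len(colors)}")
    return "".join(COLOR_TO_BASE[color] for color in colors)


def decode_end(segments):
    """Decode the seven segments from one end of a fragment into 35 bases."""
    if len(segments) != SEGMENTS_PER_END:
        raise ValueError(f"expected {SEGMENTS_PER_END} segments, got {len(segments)}")
    return "".join(decode_segment(segment) for segment in segments)


bases_per_fragment = SEGMENT_LENGTH * SEGMENTS_PER_END * ENDS_PER_FRAGMENT

example_segment = ["red", "blue", "green", "yellow", "red"]
example_end = [example_segment] * SEGMENTS_PER_END
end_read = decode_end(example_end)

print(bases_per_fragment)               # 70
print(decode_segment(example_segment))  # ACGTA
print(len(end_read))                    # 35
```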

High-Throughput Process Automation

There are five major components of our high-throughput process automation technology: high-throughput sample preparation, high-throughput sequencing instruments, high-performance computing infrastructure, workflow automation software and service delivery technology.

High-Throughput Sample Preparation

Our high-throughput sample preparation technology consists of step-by-step protocols for preparing DNA for sequencing and pipetting robots that automatically execute the standard protocols. We prepare genome samples in batches of 88 and load the samples into a 96-well plate (the other eight wells in the plate contain known, or reference, DNA that we use to monitor the quality of the sample preparation process). A sample preparation run processes four 88-sample plates for a total of 352 genomes per run. We have the instrument and staffing capacity to perform two runs in parallel, for a total of 704 genomes prepared for sequencing. The result of two parallel sample preparation runs is up to 704 genomes loaded onto flow slides, ready to be loaded on a sequencing instrument. Our capacity can be scaled by adding instruments and staff as needed.
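The batch sizes above follow from simple plate arithmetic, sketched here for reference. The constant names are hypothetical; the numbers are those quoted in the text.

```python
WELLS_PER_PLATE = 96
CONTROL_WELLS_PER_PLATE = 8   # wells holding known reference DNA
PLATES_PER_RUN = 4
PARALLEL_RUNS = 2

samples_per_plate = WELLS_PER_PLATE - CONTROL_WELLS_PER_PLATE
genomes_per_run = samples_per_plate * PLATES_PER_RUN
genomes_in_parallel = genomes_per_run * PARALLEL_RUNS

print(samples_per_plate)    # 88
print(genomes_per_run)      # 352
print(genomes_in_parallel)  # 704
```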

High-Throughput Sequencing Instruments

Our sequencing instruments consist of a fluidics robot that pipettes multiple types of chemical reagents (including fluorescent molecules) onto the flow slides and an imaging system that records images of the fluorescent molecules attached to the DNA. Each sequencing instrument processes 18 flow slides at a time. The 18 flow slides are robotically moved back and forth between the fluidics robot and the imaging system. While one flow slide is being imaged, the other 17 flow slides are prepared with reagents or wait for the imager to become available. A sequencing run takes approximately 12 days. Currently, our sequencing instruments can generate between 20 and 60 gigabases of usable data from each flow slide in a 12-day run. Sequencing a whole human genome at an average redundancy of 40 times requires 120 gigabases of usable data. We expect to make continued advancements in our technology that will further increase the amount of usable data we get from each flow slide.
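The throughput implied by these figures can be estimated as follows. This is a back-of-the-envelope sketch, assuming that usable data from several flow slides can be pooled toward one genome (the text does not state how genomes are distributed across slides). Function and constant names are hypothetical.

```python
import math

SLIDES_PER_INSTRUMENT = 18
RUN_DAYS = 12
GB_PER_SLIDE_LOW = 20      # gigabases of usable data per flow slide (low end)
GB_PER_SLIDE_HIGH = 60     # gigabases of usable data per flow slide (high end)
GB_PER_GENOME = 120        # usable gigabases needed per genome
REDUNDANCY = 40            # average redundancy ("40 times")


def genomes_per_run(gb_per_slide):
    """Genomes one instrument could cover in a single run at the given yield."""
    return SLIDES_PER_INSTRUMENT * gb_per_slide / GB_PER_GENOME


def slides_per_genome(gb_per_slide):
    """Whole flow slides needed to reach the per-genome data requirement."""
    return math.ceil(GB_PER_GENOME / gb_per_slide)


implied_genome_size_gb = GB_PER_GENOME / REDUNDANCY

print(f"Implied genome size: {implied_genome_size_gb:.1f} gigabases")
for gb in (GB_PER_SLIDE_LOW, GB_PER_SLIDE_HIGH):
    print(f"{gb} Gb/slide: {genomes_per_run(gb):.1f} genomes per {RUN_DAYS}-day run, "
          f"{slides_per_genome(gb)} slides per genome")
```

Under these assumptions, one instrument yields between three and nine genomes per 12-day run, and a single genome needs between two and six flow slides.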

High-Performance Computing Infrastructure

We have built a genomics data processing facility that consists of approximately 5,000 core processors and 1,500 terabytes (a terabyte is one thousand gigabytes) of high-speed disk storage. Our sequencing instruments are connected to our data center by a fiber-optic network connection that transfers data at a rate of 30 gigabits per second. Our data center has the capacity to perform all of the required computation, starting with the images generated by the sequencing instruments and ending with sequenced whole human genomes, for several hundred genomes per month. We plan to expand our data center as needed, and we expect to make continued advancements to our software that will further increase the efficiency of our data center.

Workflow Automation Software

Our workflow automation software tracks each sample from arrival at our facility to delivery of research-ready data to the customer. Sample tracking is accomplished through bar codes. Each 96-well plate of samples has a bar code, and each flow slide has a bar code. The instruments that process plates and flow slides have bar code readers attached to them. To perform processing, they read a bar code, and the workflow automation software instructs them what process steps to perform on each individual bar-coded sample. User interfaces to our workflow automation software allow us to track the progress of each sample throughout sample preparation, sequencing and computing.
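The bar-code lookup described above can be pictured with a small toy model. This is not the actual workflow automation software: the class, the stage names and the bar-code strings are all hypothetical, and the model simply advances every sample in a scanned container to its next stage.

```python
from dataclasses import dataclass, field

# Stage names are hypothetical labels for the steps named in the text.
STAGES = ("received", "sample_preparation", "sequencing", "computing", "delivered")


@dataclass
class WorkflowTracker:
    """Toy model: bar codes identify containers, and containers hold samples."""
    container_samples: dict = field(default_factory=dict)  # bar code -> sample IDs
    sample_stage: dict = field(default_factory=dict)       # sample ID -> current stage

    def register(self, barcode, sample_ids):
        """Associate a bar-coded plate or flow slide with the samples it holds."""
        self.container_samples[barcode] = list(sample_ids)
        for sample_id in sample_ids:
            self.sample_stage.setdefault(sample_id, STAGES[0])

    def scan(self, barcode):
        """Simulate an instrument scan: advance each sample and return its next step."""
        if barcode not in self.container_samples:
            raise KeyError(f"unknown bar code: {barcode}")
        instructions = {}
        for sample_id in self.container_samples[barcode]:
            current = STAGES.index(self.sample_stage[sample_id])
            next_stage = STAGES[min(current + 1, len(STAGES) - 1)]
            self.sample_stage[sample_id] = next_stage
            instructions[sample_id] = next_stage
        return instructions


tracker = WorkflowTracker()
tracker.register("PLATE-0001", ["sample-A", "sample-B"])
plate_steps = tracker.scan("PLATE-0001")
tracker.register("SLIDE-0001", ["sample-A"])
slide_steps = tracker.scan("SLIDE-0001")

print(plate_steps)
print(slide_steps)
print(tracker.sample_stage)
```

Keying the lookup on the container bar code while storing progress per sample mirrors the description in the text, where instruments scan plates and flow slides but instructions are issued for each individual sample.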

Service Delivery Technology

Our cloud-based data delivery system is based on our partnership with Amazon Web Services (AWS). We upload our customers’ finished genomic data to AWS. Our customers can then (1) download the genomic data from AWS onto their computers, (2) have AWS copy their data to hard disks and ship them the hard disks or (3) pay AWS to store their data on an ongoing basis. As we develop our analysis tools, we plan to host them on AWS so that customers can rapidly analyze their genomic data as soon as it is available. We are also developing a web-based customer portal to enable customers to track their projects in real time throughout the sequencing process.

Complete Data Management Solution

There are two major components of our complete data management solution: assembly software and analysis tools.

Assembly Software

Assembly is the process of using computers to organize all of the overlapping 70-base nucleotide sequences to reconstruct the complete genome. We have developed a proprietary approach to assembly that uses a combination of advanced data analysis algorithms and statistical modeling techniques to accurately reconstruct over 90% of the complete human genome from approximately two billion 70-base reads. Our assembly pipeline takes further steps past mapping and base calling to identify, summarize, and annotate variants of all types in each genome that differ from the human genome reference. We have designed our assembly software to run in parallel across our large network of Linux computers. 
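The scale of the assembly input can be estimated from the figures in the text. The sketch below is illustrative: it takes the genome size implied elsewhere in the text (120 gigabases of usable data at 40-times redundancy, or about 3 gigabases) and treats all two billion reads as contributing to coverage, which overstates usable coverage if some reads are discarded. Names are hypothetical.

```python
READS_PER_GENOME = 2e9          # approximately two billion reads
BASES_PER_READ = 70
GENOME_SIZE_BASES = 120e9 / 40  # implied by 120 gigabases at 40-times redundancy

raw_bases = READS_PER_GENOME * BASES_PER_READ
raw_coverage = raw_bases / GENOME_SIZE_BASES

print(f"Raw read data: {raw_bases / 1e9:.0f} gigabases")
print(f"Raw coverage: {raw_coverage:.1f}x")
```

The result, roughly 140 gigabases or about 47-times raw coverage, sits somewhat above the 120 gigabases of usable data quoted for 40-times redundancy.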

As reported in the Science article (Science, 1 January 2010, Vol. 327, no. 5961, pp. 78–81), we generated high-quality diploid base calls in as much as 95% of the genomes sequenced, identifying between 3.2 million and 4.5 million sequence variants per genome processed. Detailed validation of one genome dataset demonstrated a sequence accuracy of just one false variant per 100 kilobases.
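To put the reported accuracy in context, the false-variant rate can be compared with the number of variants identified. This is a rough estimate only, assuming the rate of one false variant per 100 kilobases applies uniformly across a genome of about 3 gigabases (the size implied by 120 gigabases at 40-times redundancy). Names are hypothetical.

```python
GENOME_SIZE_BASES = 3e9            # assumed genome size (about 3 gigabases)
FALSE_VARIANT_INTERVAL = 100_000   # one false variant per 100 kilobases
VARIANTS_LOW = 3.2e6               # variants identified per genome (low end)
VARIANTS_HIGH = 4.5e6              # variants identified per genome (high end)

expected_false_variants = GENOME_SIZE_BASES / FALSE_VARIANT_INTERVAL
false_fraction_high = expected_false_variants / VARIANTS_LOW
false_fraction_low = expected_false_variants / VARIANTS_HIGH

print(f"Expected false variants per genome: {expected_false_variants:,.0f}")
print(f"Share of called variants: {false_fraction_low:.2%} to {false_fraction_high:.2%}")
```

Under these assumptions, about 30,000 false variants would be expected per genome, or somewhat under 1% of the variants called.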

Analysis Tools

We have developed a suite of analysis tools that enables our customers to rapidly analyze the data we generate from their samples. This set of open-source tools allows researchers to compare variations between genomes, convert Complete Genomics native file formats to standard formats such as SAM/BAM, and perform other genomic data manipulations. See CGA Tools for more information.

Copyright © 2012 Complete Genomics Incorporated. All rights reserved.