Reads are first mapped to the reference genome using a fast algorithm. These initial mappings are then both expanded and refined by a form of local de novo assembly, which is applied to all regions of the genome that appear to contain variants (SNPs, indels, and block substitutions) based on the initial mappings. The de novo assembly leverages mate-pair information, which allows reads to be recruited into variant calling with higher sensitivity than genome-wide mapping methods provide. Assemblies are diploid, so we produce two separate result sequences for each locus in diploid regions. Variants are called by independently comparing each of the diploid assemblies to the reference. The process is described in our Science paper (Drmanac et al., Science, Jan 2010), and our assembly algorithms are described in detail in our publication in the Journal of Computational Biology (Carnevali et al., Journal of Computational Biology, 2011).
Copy number variable (CNV) regions are called based on depth-of-coverage analysis. Sequence coverage is averaged and corrected for GC bias over a fixed window and normalized relative to a set of standard genomes. In the case of a tumor-normal comparative analysis provided through our Cancer Sequencing Service, coverage in the tumor genome is normalized to coverage for the same region in the matched normal genome. A hidden Markov model (HMM) is used to classify segments of the genome as having 0, 1, 2, 3, or more copies, up to a maximum value.
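The depth-of-coverage idea can be illustrated with a minimal Python sketch. It is not the Complete Genomics implementation: the Gaussian emission model, the values of `sigma` and `stay_prob`, the function names, and the omission of GC correction are all assumptions made for illustration. It normalizes windowed coverage against a baseline and then uses Viterbi decoding over a small HMM to assign a copy number to each window.

```python
import math

def normalize_windows(sample_cov, baseline_cov, ploidy=2):
    """Scale per-window coverage by a baseline so that 'normal' windows sit near ploidy.
    GC correction is deliberately omitted in this sketch."""
    return [ploidy * s / b for s, b in zip(sample_cov, baseline_cov)]

def viterbi_copy_number(signal, max_copies=4, sigma=0.35, stay_prob=0.95):
    """Most likely copy-number path (0..max_copies) for a normalized coverage signal.
    Emissions are Gaussian around each copy number; sigma and stay_prob are
    illustrative assumptions, not published parameters."""
    if not signal:
        return []
    states = range(max_copies + 1)
    stay_logp = math.log(stay_prob)
    switch_logp = math.log((1.0 - stay_prob) / max_copies)

    def emit_logp(x, copies):
        # Gaussian log-density, dropping the constant shared by all states
        return -((x - copies) ** 2) / (2.0 * sigma ** 2)

    score = [emit_logp(signal[0], c) for c in states]  # uniform prior over states
    back = []
    for x in signal[1:]:
        new_score, pointers = [], []
        for c in states:
            candidates = [score[p] + (stay_logp if p == c else switch_logp)
                          for p in states]
            best_prev = max(states, key=candidates.__getitem__)
            new_score.append(candidates[best_prev] + emit_logp(x, c))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state
    state = max(states, key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

sample = [40, 41, 39, 40, 60, 61, 59, 60, 40, 40]   # hypothetical window coverage
baseline = [40] * 10                                 # hypothetical baseline coverage
print(viterbi_copy_number(normalize_windows(sample, baseline)))
# prints [2, 2, 2, 2, 3, 3, 3, 3, 2, 2]
```

The transition penalty is what turns noisy per-window values into segments: a single outlying window is cheaper to explain as noise than as two state changes, while a sustained shift in coverage is not.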
Structural variations (SVs) are detected by analyzing DNA nanoball (DNB) mappings found during the standard assembly process described above and identifying clusters of DNBs in which each arm maps uniquely to the reference genome, but with an unexpected mate-pair length or anomalous orientation. Local de novo assembly is applied to refine junction breakpoints and resolve the transition sequence. Novel insertions of mobile elements into the sequenced genome are identified as clusters of reads that uniquely map to the reference genome with one arm and to ubiquitous sequence with the other arm.
The location, type, and orientation of the inserted elements are identified using mate pairs that map in the vicinity of the insertion site, aligning each unmapped arm to sequences of a defined set of mobile elements. The process for CNV and SV detection is described in more detail in Complete Genomics Data File Formats.
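As a rough illustration of the first step of SV detection, the Python sketch below flags mate pairs whose arms map with an unexpected distance or orientation and groups them into clusters. It is a simplification under stated assumptions, not the Complete Genomics method: the `MatePair` fields, the expected length range, the assumption that both arms normally map to the same strand, and the clustering thresholds are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MatePair:
    # Hypothetical record of one uniquely mapped pair of arms
    chrom_a: str
    pos_a: int
    strand_a: str
    chrom_b: str
    pos_b: int
    strand_b: str

def is_discordant(mp, min_len=200, max_len=600):
    """True if the pair has an unexpected mate-pair length or orientation.
    The length range and same-strand expectation are illustrative assumptions."""
    if mp.chrom_a != mp.chrom_b:
        return True                      # arms on different chromosomes
    if mp.strand_a != mp.strand_b:
        return True                      # anomalous orientation (assumed)
    return not (min_len <= abs(mp.pos_b - mp.pos_a) <= max_len)

def cluster_discordant(pairs, window=500, min_support=3):
    """Group discordant pairs whose first arms lie within `window` bp of the
    previous pair; keep groups with at least `min_support` members.
    A fuller version would also require the second arms to agree."""
    discordant = sorted((p for p in pairs if is_discordant(p)),
                        key=lambda p: (p.chrom_a, p.pos_a))
    clusters, current = [], []
    for p in discordant:
        if current and (p.chrom_a != current[-1].chrom_a
                        or p.pos_a - current[-1].pos_a > window):
            if len(current) >= min_support:
                clusters.append(current)
            current = []
        current.append(p)
    if len(current) >= min_support:
        clusters.append(current)
    return clusters
```

Requiring several supporting pairs per cluster is what separates a candidate junction from isolated mis-mappings; each surviving cluster would then be passed to local assembly to resolve the breakpoint.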
Complete Genomics identifies small variants, including SNPs, indels, and block substitutions, as well as copy number variations (CNVs), structural variations (SVs), and mobile element insertions (MEIs). For the Standard Sequencing Service, all of these variation types are identified in comparison to the human genome reference. For the Cancer Sequencing Service, somatic small variants, CNVs, and SVs are each also identified in comparison to the baseline sample within a pair or trio.
In contrast to some other sequencing technologies, which have a high rate of within-read indels (most single-molecule sequencing and pyrosequencing-based methods have this attribute), the intra-read gaps in Complete Genomics data are relatively easy to handle. First, they always occur at, and only at, precisely known locations in each paired end. Second, the gap sizes are highly predictable, generally within +/- 1 base pair of the known mid-value. Thus, algorithms can readily be designed to map, assemble, and call variants in these reads. Complete Genomics’ analysis methods for these gapped reads have been shown to produce high-quality results for both SNP and indel variant calls.
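To see why gaps at known locations with predictable sizes are tractable, consider the following Python sketch. It enumerates every allowed combination of gap sizes for one arm and reports those under which all read segments match the reference exactly. The segment sequences, gap mid-values, and function name are hypothetical, and exact matching stands in for a real alignment scoring scheme.

```python
from itertools import product

def gapped_matches(reference, start, segments, gap_mids, tolerance=1):
    """Return the gap-size combinations under which every segment matches
    `reference` exactly, with the first segment placed at `start`.
    Each gap may deviate from its mid-value by at most `tolerance` bases."""
    choices = [range(max(0, mid - tolerance), mid + tolerance + 1)
               for mid in gap_mids]
    hits = []
    for gaps in product(*choices):
        pos = start
        matched = True
        for i, seg in enumerate(segments):
            if i > 0:
                pos += gaps[i - 1]       # skip the gap before this segment
            if reference[pos:pos + len(seg)] != seg:
                matched = False
                break
            pos += len(seg)
        if matched:
            hits.append(gaps)
    return hits

reference = "AAACCCGGGTTTACGTACGT"        # toy reference sequence
segments = ["AAAC", "GGT", "CGTA"]        # hypothetical segments of one arm
print(gapped_matches(reference, 0, segments, gap_mids=[2, 3]))
# prints [(3, 3)]
```

With a tolerance of one base, each gap contributes only three possibilities, so the search space for an arm stays small and fixed, unlike reads in which indels can occur at any position.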
Because of the gaps, coverage for comparable detection power does need to be modestly higher than if the reads had no gaps. However, this coverage requirement is balanced by the improved base-call accuracy (and consistent accuracy over the length of the read) in Complete Genomics sequences, improving the power of the data on a per base-call basis. This accuracy is enabled by the gapped construct that provides multiple sequencing reaction priming sites in each arm of each DNB. Because the sequencing of DNBs is highly cost-effective, Complete Genomics can also generate very deep coverage of these reads and thus produce high-quality variant calls over a large fraction of the genome.
We use stringent thresholds in our variant-calling algorithms that take into account base-call accuracy, mis-mapping probability, and both quantity and consistency of evidence. A fully called position is one where we have determined the full diploid sequence (that is, we have assembled both alleles) at these thresholds. In contrast to some other pipelines, Complete Genomics’ data analysis methods carefully distinguish regions of the genome that are confidently called homozygous reference from those that are no-called. This greatly facilitates comparison between genomes by reducing false negatives.
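The value of separating no-calls from homozygous-reference calls shows up when two genomes are compared position by position. The Python sketch below is illustrative only: the `Call` categories and the `compare_site` function are hypothetical names, and allele identity is ignored for brevity.

```python
from enum import Enum

class Call(Enum):
    HOM_REF = "homozygous reference"   # confidently called, matches reference
    VARIANT = "variant"                # confidently called, differs from reference
    NO_CALL = "no-call"                # evidence did not meet thresholds

def compare_site(call_a, call_b):
    """Classify one position across two genomes.
    Allele identity is ignored here; a real comparison would check it."""
    if Call.NO_CALL in (call_a, call_b):
        return "unknown"       # cannot say whether the genomes differ here
    if call_a == call_b:
        return "same"
    return "different"

print(compare_site(Call.VARIANT, Call.HOM_REF))  # prints different
print(compare_site(Call.VARIANT, Call.NO_CALL))  # prints unknown
```

If no-calls were silently reported as reference, the second comparison would come out as "different" and the variant would appear to be absent from the other genome, which is exactly the kind of false negative the distinction avoids.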
For clarification, when we measure a percentage of the genome called, we are referring to a percentage of the bases corresponding to the complete NCBI reference genome sequence. We are not referring to a fraction of the non-repetitive or non-degenerate genome, or to a fraction of the genome within a certain AT/GC range.
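As a small worked example of this definition, the Python sketch below computes the fraction called with the full reference length as the denominator. The interval representation, function name, and numbers are hypothetical.

```python
def fraction_called(called_intervals, reference_length):
    """Fraction of reference bases that are fully called.
    `called_intervals` holds non-overlapping, half-open (start, end) ranges.
    The denominator is the whole reference, including repetitive regions."""
    called_bases = sum(end - start for start, end in called_intervals)
    return called_bases / reference_length

# Toy reference of 2,000 bases with two fully called stretches
print(fraction_called([(0, 900), (1000, 1960)], 2000))  # prints 0.93
```

Restricting the denominator to non-repetitive sequence would raise the reported percentage without any change in the underlying calls, which is why the denominator is stated explicitly.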
Customers with genomes processed by Assembly Pipeline version 1.5.0 or later can order re-analysis of these genomes using Assembly Pipeline version 1.10 or later. Customers can indicate whether they prefer a specific version or reanalysis on the most current Assembly Pipeline version at the time of processing. Since Complete Genomics does not retain customer data, the complete and original data set must be shipped back to Complete Genomics via hard disk drive. For more information, see the Reanalysis Flyer, or contact us at firstname.lastname@example.org.
Complete Genomics’ data processing software is not distributed at this time.