Reads are initially mapped to the reference genome using a fast algorithm, and these initial mappings are both expanded and refined by a form of local de novo assembly, which is applied to all regions of the genome that appear to contain variation (SNPs, indels, and block substitutions) based on these initial mappings. The de novo assembly fully leverages mate-pair information, allowing reads to be recruited into variant calling with higher sensitivity than genome-wide mapping methods alone typically provide. Assemblies are diploid, and we produce two separate result sequences for each locus in diploid regions (exceptions: mitochondria are assembled as haploid and for males the nonpseudoautosomal regions in the sex chromosomes are assembled as haploid). Variants are called by independently comparing each of the diploid assemblies to the reference.
Copy number variable (CNV) regions are called based on depth-of-coverage analysis. Sequence coverage is averaged and corrected for GC bias over a fixed window and normalized relative to a set of standard genomes. A hidden Markov model (HMM) is used to classify segments of the genome as having 0, 1, 2, 3 copies…up to a maximum value.
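The windowing and normalization steps can be pictured with a minimal sketch. It is not the Complete Genomics implementation: the function names, the window size, and the toy baseline are hypothetical, GC correction is omitted, and simple rounding of the coverage ratio stands in for the HMM.

```python
from statistics import mean

def window_means(coverage, window):
    """Average per-base coverage over consecutive fixed-size windows."""
    return [mean(coverage[i:i + window])
            for i in range(0, len(coverage) - window + 1, window)]

def naive_copy_number(sample_windows, baseline_windows, max_copies=4, ploidy=2):
    """Estimate copies per window from the sample/baseline coverage ratio.

    Assumption: rounding replaces the HMM segmentation described in the text.
    """
    calls = []
    for sample, baseline in zip(sample_windows, baseline_windows):
        ratio = sample / baseline if baseline > 0 else 0.0
        calls.append(min(max_copies, round(ratio * ploidy)))
    return calls

coverage = [40] * 4 + [20] * 4 + [60] * 4   # toy per-base depth
baseline = [40.0, 40.0, 40.0]               # toy diploid baseline, one value per window
windows = window_means(coverage, window=4)
copies = naive_copy_number(windows, baseline)
```

In this toy input the three windows have half, equal, and one-and-a-half times the baseline coverage, so the sketch assigns them 2, 1, and 3 copies respectively. An HMM would additionally smooth calls across neighboring windows, which rounding does not.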
Structural variations (SVs) are detected by analyzing DNB (DNA nanoball) mappings found during the standard assembly process described above and identifying clusters of DNBs in which each arm maps uniquely to the reference genome, but with an unexpected mate-pair length or anomalous orientation. Local de novo assembly is applied to refine junction breakpoints and resolve the transition sequence. The process for CNV and SV detection is described in more detail in Complete Genomics Data File Formats.
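As a rough illustration of what "unexpected mate-pair length or anomalous orientation" means, the sketch below flags discordant pairs in a toy data set. The record layout, the field names, the distance bounds, and the rule that concordant arms share a strand are all assumptions made for the example, not details taken from the pipeline. Clustering and breakpoint assembly are not shown.

```python
from collections import namedtuple

# Hypothetical simplified record; field names are illustrative assumptions.
MatePair = namedtuple("MatePair", "chrom_a pos_a strand_a chrom_b pos_b strand_b")

def is_discordant(mp, min_gap, max_gap):
    """Return True if the arms map with an unexpected distance or orientation."""
    if mp.chrom_a != mp.chrom_b:
        return True                      # arms on different chromosomes
    if mp.strand_a != mp.strand_b:
        return True                      # assumption: concordant arms share a strand
    gap = abs(mp.pos_b - mp.pos_a)
    return not (min_gap <= gap <= max_gap)

pairs = [
    MatePair("chr1", 1000, "+", "chr1", 1400, "+"),   # expected distance
    MatePair("chr1", 2000, "+", "chr1", 9000, "+"),   # too far apart
    MatePair("chr1", 3000, "+", "chr1", 3400, "-"),   # anomalous orientation
    MatePair("chr1", 4000, "+", "chr2", 4400, "+"),   # different chromosomes
]
discordant = [mp for mp in pairs if is_discordant(mp, min_gap=200, max_gap=600)]
```

Only the first pair passes all three checks; the other three would be candidates for clustering into an SV call.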
In the summary file (summary-[ASM-ID].tsv), you will see a variety of metrics that may be helpful in understanding the quality of the delivered genome. For example:
There are additional biological metrics which one would expect to be roughly consistent across genomes from individuals of the same ethnicity (including genomes sequenced using other methods). These are also quite useful for quality control. They include:
Please note that while the application of these and other metrics to normal diploid genomes is relatively clear, correctly interpreting these and similar calculations for a cancer or non-diploid genome can be more difficult.
In the REPORTS directory of our data delivery, you will find several files reporting various aspects of the sequence data that can be used to assess the quality of the delivered genome. For example:
“Gross mapping yield” counts aligned bases within DNBs where at least one arm is mapped to the reference genome, excluding reads marked as overflow (large number of mappings to the reference genome indicative of highly repetitive sequence). “Both arms mapped yield” counts aligned bases within DNBs where both arms mapped to the reference genome on the correct strand and orientation and within the expected distance.
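The difference between the two yield metrics can be sketched on toy data. The `Dnb` record and its fields are hypothetical stand-ins, not the delivered file format, and the sketch assumes that the checks for strand, orientation, and distance have already been reduced to a single flag.

```python
from dataclasses import dataclass

@dataclass
class Dnb:
    """Hypothetical per-DNB summary; fields are illustrative assumptions."""
    left_bases: int     # aligned bases in the left arm (0 if unmapped)
    right_bases: int    # aligned bases in the right arm (0 if unmapped)
    overflow: bool      # too many mappings, indicative of repetitive sequence
    proper_pair: bool   # both arms mapped with expected strand, orientation, distance

def gross_mapping_yield(dnbs):
    """Aligned bases in DNBs with at least one mapped arm, excluding overflow."""
    return sum(d.left_bases + d.right_bases for d in dnbs
               if not d.overflow and (d.left_bases > 0 or d.right_bases > 0))

def both_arms_mapped_yield(dnbs):
    """Aligned bases in DNBs where both arms mapped as a proper pair.

    The text does not say whether overflow is filtered here; this sketch does not.
    """
    return sum(d.left_bases + d.right_bases for d in dnbs
               if d.left_bases > 0 and d.right_bases > 0 and d.proper_pair)

dnbs = [
    Dnb(35, 35, overflow=False, proper_pair=True),    # counts toward both metrics
    Dnb(35, 0, overflow=False, proper_pair=False),    # one arm only: gross yield only
    Dnb(35, 35, overflow=True, proper_pair=False),    # overflow: excluded from gross
    Dnb(0, 0, overflow=False, proper_pair=False),     # unmapped: counts toward neither
]
gross = gross_mapping_yield(dnbs)
both = both_arms_mapped_yield(dnbs)
```

Here the gross yield is 105 bases and the both-arms yield is 70 bases, which shows why the second metric is always the smaller or equal one under these assumptions.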
“Fully called” indicates that the assemblies of both diploid alleles meet the minimum required confidence thresholds, and thus both alleles are considered called. In this case, both alleles may be variant, or one may be reference and the other variant. If both are variant, they may be the same (homozygous) or different (heterozygous).
At a “partially called” or “half-called” site, only one allele meets the threshold to call the site confidently while the other does not. The Complete Genomics software reports this partial information for that locus (rather than no-calling the site entirely). Effectively, this is a statement that “we know this allele is present” but we can say little about what other allele is also present in a diploid region.
A “no-called” allele is one where we cannot determine the sequence of the sample at our minimum thresholds. See “What exactly is a reference call? How is this different from a no-call?”
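The three call states can be summarized in a small sketch. It assumes a simplified view of a diploid locus in which each allele is either a sequence string or `None` when it fails the confidence threshold; the function name and return labels are hypothetical and do not correspond to fields in the delivered files.

```python
def describe_locus(ref, allele1, allele2):
    """Classify a diploid locus; None marks an allele below the confidence threshold."""
    called = [a for a in (allele1, allele2) if a is not None]
    if len(called) == 0:
        return "no-called"
    if len(called) == 1:
        return "half-called"
    # Both alleles meet the threshold from here on.
    if allele1 == ref and allele2 == ref:
        return "fully called: reference"
    if allele1 == allele2:
        return "fully called: homozygous variant"
    if ref in (allele1, allele2):
        return "fully called: one reference, one variant"
    return "fully called: two different variants"

examples = [
    describe_locus("A", "G", "G"),
    describe_locus("A", "A", "G"),
    describe_locus("A", "G", "T"),
    describe_locus("A", "G", None),
    describe_locus("A", None, None),
]
```

The half-called case is the one worth noting: the known allele is still reported, and nothing is asserted about the other.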
The number of homozygous SNPs is calculated from the var-[ASM-ID].tsv file, and is equal to the number of diploid loci at which the same SNP is present on both alleles.
The number of heterozygous SNPs is calculated from the var-[ASM-ID].tsv file, and is equal to the sum of SNPs present in the following types of loci:
The total number of SNPs is calculated from the var-[ASM-ID].tsv file, and includes SNPs present in all of the following types of loci:
The exome is defined as the coding regions (CDS) of protein-coding genes, plus all of the untranslated genes, minus any transcripts (coding or otherwise) that are rejected by the annotation pipeline. A small percentage of transcripts in Build 36 and Build 37 are excluded from the annotation results due to one or more of the following reasons:
To obtain the list of excluded transcripts, please contact firstname.lastname@example.org.
The number of exonic SNPs is calculated from the var-[ASM-ID].tsv file and includes SNPs present in all of the following categories:
Each variation is counted only once per genome. If a variation has different functional consequences in different transcripts, only the most severe functional effect is counted.
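The once-per-genome rule can be sketched as follows. The severity ranking, the effect labels, and the input layout are hypothetical; the text does not give the actual ordering used, so treat `SEVERITY` purely as a placeholder.

```python
from collections import Counter

# Hypothetical ranking, most severe first; the real ordering is not given in the text.
SEVERITY = ["nonsense", "frameshift", "missense", "synonymous"]

def count_by_most_severe(annotations):
    """Count each variant once, under its most severe effect across transcripts.

    annotations: iterable of (variant_id, transcript_id, effect) tuples.
    """
    rank = {effect: i for i, effect in enumerate(SEVERITY)}
    worst = {}
    for variant_id, _transcript_id, effect in annotations:
        if variant_id not in worst or rank[effect] < rank[worst[variant_id]]:
            worst[variant_id] = effect
    return Counter(worst.values())

annotations = [
    ("v1", "t1", "synonymous"),
    ("v1", "t2", "missense"),     # v1 is counted as missense, not synonymous
    ("v2", "t1", "nonsense"),
]
counts = count_by_most_severe(annotations)
```

Variant `v1` appears in two transcripts but contributes a single count, under the more severe of its two effects.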
The following rules are used for counting variations: