A “variation” or “variant” refers to an allele sequence that differs from the reference, whether at a single base or over a longer (potentially much longer) interval. In general, the distinction between “variation” and “polymorphism” is that polymorphisms are, by definition, sites that vary within or between populations. “Variation” makes no assumption about the degree of polymorphism; it only reflects a comparison between a sample and the reference (recall that the reference sequence can be wrong at some sites). Thus, scientists will sometimes use the term Single Nucleotide Variant (SNV) instead of Single Nucleotide Polymorphism (SNP). However, we continue to use the acronym SNP as it is more widely used, if not entirely precise in this case.
SNPs, small insertions and deletions, and small block substitutions are indicated as variants in the Complete Genomics variation files, found in the ASM directory. “Small” here means that reported indels may be up to about 50 bases in length, although the precise upper limit varies by region and coverage. In addition to these variants, we also call Copy Number Variants (CNVs), Structural Variants (SVs), and Mobile Element Insertions (MEIs), which are reported in separate folders within the ASM directory. While these variants are all determined in comparison to the human genome reference, genomes submitted for the Cancer Sequencing Service are additionally analyzed for somatic variants, which are called in comparison to the baseline genome within the submitted pair or trio.
Complete Genomics makes a strong distinction between a no-call and a confident homozygous reference call. Some other pipelines identify variants in sequence but do not make this distinction. Where they fail to call variants, one must rely on rough surrogate measures (such as depth of coverage and mapping scores) to help interpret whether non-variant sites are homozygous reference or are simply not callable. This distinction can be one source of confusion when comparing data across technologies.
Errors in regions falsely called homozygous reference (false-negative variant calls) are included in Complete Genomics’ overall error rate estimates.
A “sub” is a block substitution, where a series of nearby reference bases have been replaced with a different series of bases in an allele. The sample’s allele and reference may be the same length (“length-conserving”) or not (“length-changing”). In data generated by Complete Genomics pipeline versions prior to 1.7 a “sub” was denoted as a “delins”.
Complete Genomics calls variants on each allele by comparing the assembly of that allele to the reference sequence. This process is repeated independently for each of the two diploid alleles at each autosomal locus. Bases in the genome with variants on either or both alleles in close proximity are grouped together as a single variant locus. For example, the middle three positions at the site below are considered one variant locus:
Reference: TAG TCG CCT
Allele 1:  TAG TTG CCT   (one ref + one SNP + one ref)
Allele 2:  TAG CAC CCT   (a 3-base block substitution)
Generally, if two or more reference bases on both alleles are called between two variant sequences, then the site is broken into smaller events.
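The grouping rule can be illustrated with a minimal Python sketch. This is not the Complete Genomics implementation: the function name `group_variant_loci` and the `min_ref_gap` parameter are hypothetical, and the sketch assumes the three sequences are already aligned and of equal length (so it covers only length-conserving events).

```python
def group_variant_loci(ref, allele1, allele2, min_ref_gap=2):
    """Return half-open (start, end) index ranges of variant loci.

    Hypothetical helper: assumes ref, allele1 and allele2 are aligned
    strings of equal length (length-conserving events only).
    """
    if not (len(ref) == len(allele1) == len(allele2)):
        raise ValueError("sequences must be aligned to the same length")

    # A position is variant if either allele differs from the reference.
    variant_positions = [
        i for i, (r, a, b) in enumerate(zip(ref, allele1, allele2))
        if a != r or b != r
    ]

    loci = []
    for pos in variant_positions:
        # pos - loci[-1][1] is the number of bases that match the reference
        # on both alleles since the last variant base of the current locus.
        if loci and pos - loci[-1][1] < min_ref_gap:
            loci[-1][1] = pos + 1        # extend the current locus
        else:
            loci.append([pos, pos + 1])  # start a new locus
    return [tuple(locus) for locus in loci]


# The example site above, with spaces removed (0-based positions).
print(group_variant_loci("TAGTCGCCT", "TAGTTGCCT", "TAGCACCCT"))  # [(3, 6)]
```

The three middle positions come back as a single locus, while two variant bases separated by two or more reference bases on both alleles would be returned as separate loci.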
No. The allele number is assigned arbitrarily at each locus and does not indicate phase. Where phase is determined, generally because variants are within the same vicinity, the haplink field in the variations file will be populated to indicate this. Variant alleles with the same haplink ID are known to be in cis-phase, that is, on the same parental chromosome.
Note that prior to pipeline version 1.8, the “allele” column in the variation file was called “haplotype”.
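As a rough illustration of how the haplink field might be used downstream, the sketch below collects alleles that share a haplink ID. The record layout (dictionaries with the keys `locus`, `allele`, and `haplink`) is an assumption made for this example and is not the actual variations file format.

```python
from collections import defaultdict


def group_by_haplink(allele_records):
    """Map each haplink ID to the (locus, allele) pairs that share it.

    allele_records: iterable of dicts with the hypothetical keys
    'locus', 'allele' and 'haplink' (None or '' when phase is unknown).
    """
    phased = defaultdict(list)
    for rec in allele_records:
        haplink = rec.get("haplink")
        if haplink:  # skip alleles with no phase information
            phased[haplink].append((rec["locus"], rec["allele"]))
    return dict(phased)


# Illustrative records only; the values are made up.
records = [
    {"locus": 101, "allele": 1, "haplink": "7"},
    {"locus": 101, "allele": 2, "haplink": None},
    {"locus": 102, "allele": 2, "haplink": "7"},
    {"locus": 103, "allele": 1, "haplink": ""},
]
print(group_by_haplink(records))  # {'7': [(101, 1), (102, 2)]}
```

In this made-up example, allele 1 at locus 101 and allele 2 at locus 102 are in cis, which shows why the allele number alone cannot be read as phase.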
An “N” indicates that a specific base could not be resolved on the allele in question; however the flanking (non-N) sequence may have been called. A “?” indicates that the unresolved region may include zero or more unknown bases. For example, “ATGC?” means that the exact number and composition of bases (if any) immediately after ATGC on that allele could not be determined.
Loci containing “N” or “?” characters (or both) will always be marked as no-call, no-call-rc (no-call, reference-consistent) or no-call-ri (no-call, reference-inconsistent) as appropriate (see Please explain “no-call-ri”, “no-call-rc”, “ref-consistent” and “ref-inconsistent”. How should I use these?).
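A small Python sketch of how these two symbols might be interpreted when reading an allele sequence. The function name and the returned keys are hypothetical, and the sketch assumes upper-case sequence strings.

```python
def describe_allele(seq):
    """Summarize the resolved and unresolved parts of an allele string.

    Hypothetical helper; assumes upper-case bases.
    """
    return {
        "called_bases": sum(ch in "ACGT" for ch in seq),
        "unresolved_bases": seq.count("N"),     # each N is exactly one unknown base
        "unknown_length_gaps": seq.count("?"),  # each ? is zero or more unknown bases
        "fully_called": "N" not in seq and "?" not in seq,
    }


print(describe_allele("ATGC?"))
```

For “ATGC?” this reports four called bases, no N positions, one gap of unknown length, and an allele that is not fully called.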
All no-call variant types indicate that the sequence could not be fully resolved, either because of limited or no information, or because of contradictory information. When some portions of the allele sequence can be called but others cannot, we indicate this as “no-call-rc” (no-call, reference-consistent) if the called portions are the same as the reference, and as “no-call-ri” (no-call, reference-inconsistent) if they are not. “Ref-consistent” and “ref-inconsistent” are the names used for no-call-rc and no-call-ri, respectively, by Complete Genomics pipeline versions prior to 1.7. We changed the names to highlight the fact that these alleles contain no-calls.
In some cases, one may wish to be conservative and consider any such region entirely no-called, and thus neither a match nor a mismatch between sample and reference.
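The conservative interpretation can be sketched as a three-way comparison in Python. The function name is hypothetical, and the sketch assumes that a fully called reference allele carries the type `ref`, which is not stated in the text above.

```python
# Variant types named in the text that mean the allele was not fully resolved.
NO_CALL_TYPES = {"no-call", "no-call-rc", "no-call-ri"}


def conservative_match(var_type):
    """Return 'match', 'mismatch' or 'unknown' for one allele's variant type.

    Assumption: fully called reference alleles are labeled 'ref'.
    """
    if var_type in NO_CALL_TYPES:
        return "unknown"   # neither a match nor a mismatch
    if var_type == "ref":
        return "match"
    return "mismatch"      # any fully called non-reference type, e.g. snp or sub


for t in ("ref", "snp", "no-call-rc", "no-call-ri"):
    print(t, conservative_match(t))
```

Treating no-call-rc and no-call-ri the same as a plain no-call gives up the partial information they carry, which is the trade-off the conservative approach accepts.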
For a small fraction of assembled loci, there can be support for one allele (reference or variant) but some ambiguity as to whether the other allele is supported by the data. This can happen, for example, when very few reads from one of the two chromosomes are seen. Also, in regions of low coverage, the algorithm may see reads consistent with a single allele (i.e., consistent with a homozygous call), but may judge that too few reads in total were seen to have had a good chance of sampling both chromosomes. In these cases the variation file reports a partial or half-called locus; a fully resolved allele (reference or variant) on one chromosome, but a no-call on the other.
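To make the idea of a half-called locus concrete, here is a minimal Python sketch that labels a locus from the variant types of its two alleles. The function name and the returned labels are hypothetical and are not the values used in the variation file.

```python
def locus_call_state(allele1_type, allele2_type):
    """Label a locus as 'full', 'half' or 'no-call'.

    Hypothetical helper: an allele counts as unresolved when its type is
    no-call, no-call-rc or no-call-ri (all start with 'no-call').
    """
    unresolved = [t.startswith("no-call") for t in (allele1_type, allele2_type)]
    if all(unresolved):
        return "no-call"
    if any(unresolved):
        return "half"   # one allele resolved, the other not
    return "full"


print(locus_call_state("snp", "no-call"))  # half
```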
Diploidy is not assumed when calling small variants. The small variant caller considers heterozygous hypotheses at a wide range of allele frequencies between 20% and 80%, including but not limited to 50%. This is to accommodate small variants that occur at sites of copy number variation as well as in samples that are not pure: for example, due to tumor heterogeneity or sample mosaicism.
Note that two variant scores are provided for each called allele: one derived from the probability of this call assuming variable allele fractions (allele1VarScoreVAF, allele2VarScoreVAF, or varScoreVAF), and one derived from the probability of the given call assuming equal allele fraction, or diploidy (allele1VarScoreEAF, allele2VarScoreEAF, or varScoreEAF). Additionally, triploid hypotheses are considered in the assembly optimization step, and the set of alleles in an evidenceInterval record may describe a triploid top hypothesis. Regardless of the models used to call small variants, the results for variation intervals where the top hypothesis is triploid will still be presented as two alleles at each locus.
Mitochondria and sex chromosomes are handled as special cases: see How does Complete Genomics handle mitochondrial sequences? and How does Complete Genomics handle the sex chromosomes?
The varScoreVAF and varScoreEAF are the best indicators of variant quality (these correspond to allele1VarScoreVAF, allele2VarScoreVAF, allele1VarScoreEAF, and allele2VarScoreEAF in the master variations file, masterVar). The varScoreEAF best reflects the quality of a call for variants at 50% allele fraction, while the varScoreVAF is a better score for variants at low allele fraction.
For reference-called positions, Complete Genomics provides scores in the coverageRefScore files in the REF directory, rather than the var or masterVar files. The reference scores within that directory are the best indicator of the quality of reference calls.
For variants, select the score based on the type of sample being studied, as follows:
When there is not enough information about the sample to determine the best score approach, Complete Genomics recommends using varScoreVAF as the general-purpose variant score.
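The guidance on choosing between the two scores can be sketched in Python as follows. The function name and its `equal_allele_fraction` parameter are hypothetical; only the column names varScoreVAF and varScoreEAF come from the text, and the score values in the example are made up.

```python
def pick_var_score(record, equal_allele_fraction=None):
    """Return (column_name, score) for a variant record.

    record: dict with 'varScoreVAF' and 'varScoreEAF' entries.
    equal_allele_fraction: True when variants are expected at about 50%
    allele fraction, False when low allele fractions are expected, and
    None when not enough is known about the sample.
    """
    # varScoreVAF is the general-purpose fallback when the sample is unknown.
    column = "varScoreEAF" if equal_allele_fraction else "varScoreVAF"
    return column, record[column]


record = {"varScoreVAF": 45, "varScoreEAF": 60}  # illustrative values only
print(pick_var_score(record))                              # ('varScoreVAF', 45)
print(pick_var_score(record, equal_allele_fraction=True))  # ('varScoreEAF', 60)
```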
Mitochondrial sequences are treated as having a ploidy of 1. The circular nature of mitochondrial DNA is taken into account so that coverage is not suppressed at the start and end of the mitochondrial chromosome.
In males, the majority of the X chromosome is treated as having a ploidy of 1, while in females the X chromosome has a ploidy of 2. In males, variants in the pseudoautosomal region of the Y chromosome are reported on the corresponding regions of the X chromosome, where ploidy 2 is assumed. The pseudoautosomal region of the Y chromosome itself will be indicated as “PAR-called-in-X” in the variant file.
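The ploidy rules for mitochondrial and sex chromosomes can be summarized in a short Python sketch. The function name, the chromosome labels (`chrM`, `chrX`, `chrY`), and the `in_par` flag are assumptions made for this example. The values returned for the non-pseudoautosomal Y chromosome and for Y in females are also assumptions, since the text above does not state them.

```python
def expected_ploidy(chrom, sex, in_par=False):
    """Return the ploidy assumed for a position, or None if it is called elsewhere.

    chrom: 'chrM', 'chrX', 'chrY' or an autosome name (assumed labels).
    sex: 'male' or 'female'.
    in_par: True if the position lies in a pseudoautosomal region.
    """
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    if chrom == "chrM":
        return 1
    if chrom == "chrX":
        # Male X is ploidy 1 except in the pseudoautosomal region.
        return 1 if sex == "male" and not in_par else 2
    if chrom == "chrY":
        if sex == "female":
            return 0      # assumption: no Y chromosome expected
        if in_par:
            return None   # reported on X (PAR-called-in-X)
        return 1          # assumption: not stated in the text
    return 2              # autosomes are reported as two alleles


print(expected_ploidy("chrX", "male"))               # 1
print(expected_ploidy("chrX", "male", in_par=True))  # 2
```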
Areas of the genome that are highly variable are currently assembled using the default reference sequence. Therefore, the no-call rate may be higher than in other locations of the genome. We are looking into improved calling methods for these regions in the future.
This is a locus in the genome where we cannot rule out the possibility that there is an insertion present.
Yes, we find both length-conserving and length-changing block substitutions in our assembly process, at both homozygous and heterozygous loci. In many genomes, we find a number of these substitutions where a portion of the locus is a known variant (such as a SNP), while the remainder of the substitution is novel and called with high confidence.
We presently call variants relative to the NCBI reference in each genome sequenced. This facilitates comparison between any set of samples desired.
Complete Genomics develops an open source tools package, Complete Genomics Analysis Tools (CGA™ Tools), for downstream analysis of Complete Genomics data. Currently, CGA Tools contains tools for comparing variants between two genomes. We are working on additional methods for multiple sample and other comparison tools. For more information on CGA Tools, see the Complete Genomics website.
Yes. Known variants are used as a supplementary source of seeds for local de novo assembly when searching for candidate variants. Knowledge of which variants are known and which are novel is not used in variant scoring. The set of known variants used to supplement local de novo assembly consists of indels and short block substitutions from dbSNP (dbSNP 130 for Build 36, and dbSNP 132 for Build 37) and the Complete Genomics Diversity Panel (69 genomes using assembly pipeline version 1.10).
No. Currently, Complete Genomics does not use any reference other than NCBI, nor can we directly assemble one genome against another.
Any pair of DNBs from the same library that have at least one arm whose initial mappings have a common mapping (based on chromosome, offset, and strand) are considered candidates for deduplication. Each pair of candidate duplicate DNBs is evaluated for sequence similarity. If the DNBs have at most four discordances for each arm (up to two discordances per read), allowing the gaps to differ by up to two bases (except for the clone end reads), they are considered duplicates, and one of the two DNBs is selected at random for removal. Deduplication is performed such that it does not affect the initial mappings or coverage, but it does apply to all of the following: