Complete Genomics local de novo assemblies are typically 30-40 bp long, although they can be smaller or much larger (hundreds of bp). Approximately 5-10% of the genome is typically assembled.
No, one assembly can contain multiple variant loci. It’s possible to phase closely neighboring variants if they are inside the same local assembly. The variant file has a haplink column that indicates phasing, if it can be determined.
Using the chromosome and position, you can find the appropriate row in the corresponding evidenceIntervals file. This row provides the assembler results (essentially, consensus sequences) for each allele, with a gapped alignment of each result against the reference. The underlying reads in each assembly can be found in the corresponding evidenceDnbs file, by looking up records (rows) using the evidence interval ID found in the evidenceIntervals file.
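The two-step lookup described above can be sketched as follows. This is a minimal illustration, not the official tooling: the inline tab-separated excerpts are made up, the column names (IntervalId, Chromosome, OffsetInChromosome, Length, ReadId) are assumptions about the file layout, and real files carry header metadata and many more columns.

```python
import csv
import io

# Hypothetical, simplified excerpts. Column names are assumptions made for
# this sketch; check them against the headers of your own files.
EVIDENCE_INTERVALS = (
    "IntervalId\tChromosome\tOffsetInChromosome\tLength\n"
    "101\tchr1\t1000\t35\n"
    "102\tchr1\t5000\t40\n"
)
EVIDENCE_DNBS = (
    "IntervalId\tReadId\n"
    "101\tdnb-a\n"
    "101\tdnb-b\n"
    "102\tdnb-c\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def find_interval(intervals, chromosome, position):
    """Return the first interval row covering a 0-based position, or None."""
    for row in intervals:
        start = int(row["OffsetInChromosome"])
        end = start + int(row["Length"])
        if row["Chromosome"] == chromosome and start <= position < end:
            return row
    return None


def reads_for_interval(dnbs, interval_id):
    """Collect the read identifiers recorded for one evidence interval ID."""
    return [row["ReadId"] for row in dnbs if row["IntervalId"] == interval_id]


intervals = read_tsv(EVIDENCE_INTERVALS)
dnbs = read_tsv(EVIDENCE_DNBS)

# Step 1: locate the interval by chromosome and position.
hit = find_interval(intervals, "chr1", 1010)
# Step 2: use its interval ID to pull the underlying reads.
supporting = reads_for_interval(dnbs, hit["IntervalId"]) if hit else []
```

For whole-genome files, a linear scan like this would be slow; indexing the intervals by chromosome and sorting by offset is the usual next step.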
The possibility that one or both alleles at a site are reference is always included as a hypothesis in the likelihood ratio tests. Essentially, we demand that the data disprove this hypothesis, which is an appropriate null because most sites in the genome of any sample are indeed reference.
As mentioned in What scores are produced for variant calls? What score thresholds are used?, the score reflects separation between the top hypothesis and a hypothesis of homozygous reference. Inclusion of the reference allele in the evidenceIntervals file can help one understand how strongly the reference was rejected.
The assembly process is both more sensitive and more specific than the initial mapping process. Reads that were sufficiently different from the reference (for example, those containing many indels or groups of SNPs) or that had multiple possible initial mapping locations may not be initially mapped to the variant site, but can be brought into the assembly using the mapping of the corresponding mate-pair.
Conversely, reads that initially map to a region but prove inconsistent with the preponderance of evidence in an assembly can be down-weighted or excluded. Moreover, reads must have their mate-pair mapped near a variant region to participate in an assembly, so reads with only one end mapping during the de novo assembly process are presently always excluded from assembly.
Complete Genomics does not produce assemblies (evidence intervals) for regions of the genome where the mapped reads are highly consistent with the reference sequence. Furthermore, assemblies are not reported for regions where no variants are found (an unusual case). Consequently, variation scores are not produced for these regions either.
Instead, Complete Genomics computes a “reference score” for every base in the reference genome and reports it in the coverageRefScore file. This score indicates whether the corresponding mapped reads are consistent with the reference sequence (positive values) or not (negative values). This score is an excellent predictor of the strength of evidence for homozygous reference calls. See What is the “Reference Score” and what is it used for?
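As a rough illustration of reading the sign convention described above, the sketch below labels per-base scores. The function name, the labels, and the treatment of a score of exactly zero are assumptions made for this example; the sample (offset, score) pairs are invented, not taken from a real coverageRefScore file.

```python
def classify_ref_score(score):
    """Label a per-base reference score by its sign.

    Positive: mapped reads are consistent with the reference.
    Negative: mapped reads are inconsistent with the reference.
    Zero is treated here as uninformative (an assumption of this sketch).
    """
    if score > 0:
        return "consistent"
    if score < 0:
        return "inconsistent"
    return "uninformative"


# Invented (offset, refScore) pairs for illustration only.
sample_scores = [(1000, 42), (1001, -7), (1002, 0), (1003, 15)]

labels = {offset: classify_ref_score(score) for offset, score in sample_scores}
inconsistent_offsets = [o for o, s in sample_scores if s < 0]
```

The magnitude of the score matters as well as its sign, so in practice a threshold suited to the analysis would replace the simple sign test.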
Complete Genomics provides an assembly for the reference allele for all sites, including those called diploid variant.
Determining evidence for other alternate hypotheses is not easily supported by our assembler output at this time. Users would need to perform read level analysis, essentially recapitulating the details of the recruitment and assembly processes at a site (as described in Drmanac et al., Science 2010).
In repetitive regions (such as highly conserved segmental duplications or interspersed repeats), the algorithm can allow reads to participate in more than one assembly, weighted by their mapping probabilities. See How are duplications and highly conserved repeats in the reference handled?
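One way to picture a read contributing to several assemblies is to normalize its per-location mapping likelihoods into weights that sum to one. The sketch below is only an illustration of that idea: the normalization scheme, the function name, and the example values are assumptions, not a description of the actual assembler.

```python
def assembly_weights(mapping_likelihoods):
    """Turn per-location mapping likelihoods into weights that sum to 1.

    mapping_likelihoods: dict of candidate location -> non-negative likelihood.
    Simple normalization is an assumption of this sketch.
    """
    total = sum(mapping_likelihoods.values())
    if total <= 0:
        # No usable evidence for any location: the read contributes nowhere.
        return {location: 0.0 for location in mapping_likelihoods}
    return {
        location: likelihood / total
        for location, likelihood in mapping_likelihoods.items()
    }


# Hypothetical read mapping to two copies of a segmental duplication.
weights = assembly_weights({"chr1:1000": 0.6, "chr5:8000": 0.2})
```

Under this scheme, a read that maps three times more strongly to one copy contributes three times more weight to that copy's assembly, and every read contributes a total weight of one.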