Analysis of the Genomes Published in Science on November 5, 2009
Complete Genomics has published three human genome sequences in the journal Science.
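Results presented in this paper and discussed below include a summary of the genomes sequenced, a summary of the variations detected, and concordance with HapMap and other orthogonal technologies.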
NOTE: Analysis of these data is ongoing, and we have made considerable additions to our production analysis software since this paper was written. Check back here for updated results.
Summary of Genomes Sequenced
Summary information from mapping and assembly of the three genomes.
| Sample | Mapped sequence (Gb) | Depth of coverage (fold) | Fully called | Partially called |
| NA19240 | 178 | 63 | 95% | 1% |
| NA07022 | 241 | 87 | 91% | 2% |
| NA20431 | 124 | 45 | 86% | 3% |
Results were obtained by mapping sequence reads to the human genome reference (NCBI Build 36) and assembling variants with custom algorithms specifically designed for Complete Genomics data. More information on Complete Genomics technology can be found here. Between 124 and 241 gigabases (Gb) were mapped per genome, for an overall depth of coverage of 45- to 87-fold. Fully called regions are those where both diploid alleles could be determined at high accuracy (see below), while partially called regions are those where one of the two alleles was determined but the second was not.
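As a rough consistency check on these figures, the mapped sequence and the depth of coverage can be related through a simple ratio. The sketch below assumes that depth of coverage equals mapped bases divided by the length of the reference that is covered; the page does not state the exact definition used, so the function name and the relation itself are illustrative only.

```python
# Mapped sequence (Gb) and depth of coverage (fold), as reported in the summary table.
genomes = {
    "NA19240": (178, 63),
    "NA07022": (241, 87),
    "NA20431": (124, 45),
}


def implied_reference_gb(mapped_gb, fold_coverage):
    """Assumed relation (not stated in the source): coverage = mapped bases / reference length."""
    return mapped_gb / fold_coverage


implied = {
    sample: implied_reference_gb(mapped_gb, fold)
    for sample, (mapped_gb, fold) in genomes.items()
}

for sample, size_gb in implied.items():
    print(f"{sample}: implied reference length of about {size_gb:.2f} Gb")
```

Under that assumption all three genomes imply a reference length a little under 3 Gb, which suggests the reported mapped totals and coverage values are mutually consistent.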
Summary of Variations
Variations detected relative to reference genome (NCBI Build 36).
| Category | Type | NA19240 | % novel | NA07022 | % novel | NA20431 | % novel |
| SNPs | All | 4,042,801 | 19% | 3,076,869 | 10% | 2,905,517 | 10% |
| | Homozygous | 1,297,601 | 4% | 1,097,899 | 2% | 965,029 | 1% |
| | Heterozygous | 2,639,864 | 27% | 1,800,287 | 15% | 1,657,540 | 16% |
| | Transitions | 3,635,882 | | 2,858,818 | | 2,658,112 | |
| | Transversions | 1,706,195 | | 1,316,837 | | 1,213,232 | |
| | Coding | 23,000 | 16% | 18,723 | 9% | 16,532 | 10% |
| | Non-synonymous | 11,400 | 19% | 9,286 | 11% | 8,215 | 12% |
| Indels | Short insertions | 242,391 | 40% | 168,909 | 37% | 136,786 | 37% |
| | Short deletions | 253,803 | 44% | 168,726 | 37% | 133,008 | 36% |
| | Total | 496,194 | | 337,635 | | 269,794 | |
| | Coding short indels | 549 | 56% | 556 | 58% | 435 | 59% |
| | Frameshifting short indels | 327 | 61% | 310 | 62% | 299 | 71% |
| Block substitutions | Length conserving | 54,054 | 39% | 40,103 | 42% | 38,449 | 33% |
| | Length changing | 34,432 | 64% | 22,680 | 61% | 18,166 | 60% |
Between 2.91 and 4.04 million single nucleotide polymorphisms (SNPs) were identified in each genome relative to the reference. Of these SNPs, 81% to 90% have been previously reported in dbSNP build 129. This is consistent with other reports of complete human genome sequences compared to this reference.
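The 81% to 90% range can be reproduced from the table, assuming the percentage listed beside each SNP total is the share not found in dbSNP build 129. The same table also allows a transition/transversion (Ti/Tv) ratio to be computed per genome; that ratio is not reported on this page and is shown here only as an illustrative derived figure. The dictionary layout and function names are hypothetical.

```python
# Values copied from the variation summary table.
# "novel_pct" is assumed to be the percentage of SNPs absent from dbSNP build 129.
snps = {
    "NA19240": {"novel_pct": 19, "transitions": 3_635_882, "transversions": 1_706_195},
    "NA07022": {"novel_pct": 10, "transitions": 2_858_818, "transversions": 1_316_837},
    "NA20431": {"novel_pct": 10, "transitions": 2_658_112, "transversions": 1_213_232},
}


def known_pct(novel_pct):
    """Percentage of SNPs already present in dbSNP, given the novel percentage."""
    return 100 - novel_pct


def ti_tv_ratio(transitions, transversions):
    """Ratio of transition to transversion counts."""
    return transitions / transversions


known = {s: known_pct(v["novel_pct"]) for s, v in snps.items()}
ratios = {s: ti_tv_ratio(v["transitions"], v["transversions"]) for s, v in snps.items()}

for sample in snps:
    print(f"{sample}: {known[sample]}% in dbSNP, Ti/Tv = {ratios[sample]:.2f}")
```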
Concordance with HapMap and other orthogonal technologies
The data generated show excellent concordance with SNP genotypes generated by the HapMap project, particularly with the highest quality Illumina Infinium™ subset. The HapMap paper can be found at: http://hapmap.org/downloads/presentations/nature_hapmap3.pdf; see Supplementary Table 3 for details of genotyping accuracy by technology and center: http://www.hapmap.org/downloads/presentations/nature_supp3.pdf. The high concordance of our genotypes with those generated using independent technologies affirms the accuracy of Complete Genomics' sequencing technology for discovery and validation of polymorphisms.
Sample NA19240
Genotype calls compared against HapMap Phase I and II genotypes and the HapMap Infinium subset.
| | | HapMap Phase I and II | HapMap Infinium subset |
| | # reported | 3.8M | 144K |
| | % called | 98.46% | 98.45% |
| | % locus concordance | 99.14% | 99.85% |
| HapMap genotype calls | Homozygous ref (% concordance) | 99.22% | 99.92% |
| | Heterozygous (% concordance) | 99.62% | 99.81% |
| | Homozygous alt (% concordance) | 98.26% | 99.79% |
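The rows of this table can be read as simple ratios over the set of reported loci. The sketch below shows one plausible way to compute them from paired genotype calls. The definitions are assumptions, since this page does not spell them out: "% called" is taken as the share of reported loci with a sequencing call, and concordance is taken over called loci only. The function name, the genotype class labels, and the toy data are hypothetical.

```python
from collections import Counter


def concordance_summary(reference_calls, test_calls):
    """Compare test genotype calls against reference genotype calls.

    reference_calls: dict of locus -> 'hom_ref' | 'het' | 'hom_alt'
    test_calls:      dict of locus -> same classes, or None for a no-call
    """
    reported = len(reference_calls)
    called = 0
    concordant = 0
    class_total = Counter()
    class_match = Counter()

    for locus, ref_gt in reference_calls.items():
        test_gt = test_calls.get(locus)
        if test_gt is None:  # no-call or locus missing: counts against % called only
            continue
        called += 1
        class_total[ref_gt] += 1
        if test_gt == ref_gt:
            concordant += 1
            class_match[ref_gt] += 1

    return {
        "reported": reported,
        "pct_called": 100 * called / reported if reported else 0.0,
        "pct_locus_concordance": 100 * concordant / called if called else 0.0,
        "per_class": {c: 100 * class_match[c] / n for c, n in class_total.items()},
    }


# Toy data for illustration only; not taken from the published genomes.
hapmap = {"rs1": "hom_ref", "rs2": "het", "rs3": "hom_alt", "rs4": "het"}
sequenced = {"rs1": "hom_ref", "rs2": "het", "rs3": "het", "rs4": None}

summary = concordance_summary(hapmap, sequenced)
print(summary)
```

Stratifying by the reference genotype class, as the table does, makes it visible when errors concentrate in one class (for example, heterozygous sites called as homozygous) rather than being spread evenly.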
Sample NA07022
Genotype calls compared against Infinium 1M, HapMap Phase I and II genotypes, and the HapMap Infinium subset. To determine whether discordances were due to errant calls in the Complete Genomics data or in the Infinium subset of HapMap, discordant loci were tested by Sanger sequencing.
| | | Infinium 1M | HapMap Phase I and II | HapMap Infinium subset | These data correct | These data incorrect | % affirmed |
| | # reported | 1M | 3.9M | 143K | | | |
| | % called | 95.98% | 94.39% | 96.00% | | | |
| | % locus concordance | 99.89% | 99.15% | 99.88% | | | |
| HapMap genotype calls | Homozygous ref (% concordance) | 99.96% | 99.34% | 99.96% | 18 | 2 | 90% |
| | Heterozygous (% concordance) | 99.78% | 99.39% | 99.80% | 28 | 46 | 38% |
| | Homozygous alt (% concordance) | 99.81% | 98.14% | 99.84% | 28 | 12 | 70% |
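The "% affirmed" column follows directly from the two Sanger count columns: it is the share of tested discordant loci at which Sanger sequencing agreed with the Complete Genomics call. A minimal sketch that reproduces the three values (the variable and function names are illustrative):

```python
# (these data correct, these data incorrect) per HapMap genotype class, from the table.
sanger_results = {
    "Homozygous ref": (18, 2),
    "Heterozygous": (28, 46),
    "Homozygous alt": (28, 12),
}


def pct_affirmed(correct, incorrect):
    """Percentage of Sanger-tested discordant loci where the sequencing call was confirmed."""
    return 100 * correct / (correct + incorrect)


affirmed = {cls: pct_affirmed(c, i) for cls, (c, i) in sanger_results.items()}

for cls, pct in affirmed.items():
    print(f"{cls}: {pct:.0f}% affirmed")
```

Rounded to whole percentages, this gives 90%, 38%, and 70%, matching the table.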
Additionally, to determine a whole-genome false positive rate, 291 novel non-synonymous variants (a category enriched for errors) were tested with Sanger sequencing. This approach yielded an extrapolated false positive rate of approximately 1 in 100 kilobases. For more detail, see Supplemental Table S8 in the Science publication.
Sample NA20431
Genotype calls compared to Affymetrix 500K SNP genotypes. Genotypes were assayed in duplicate, and only SNPs with identical calls between the replicates are considered.
| | | Affymetrix 500K |
| | # reported | 475K |
| | % called | 94.18% |
| | % locus concordance | 99.75% |
| Array genotype calls | Homozygous ref (% concordance) | 99.88% |
| | Heterozygous (% concordance) | 99.45% |
| | Homozygous alt (% concordance) | 99.78% |
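The replicate filter described for this sample (keeping only SNPs with identical calls in both array runs) can be sketched as follows. This is an illustration of the idea, not the pipeline actually used; the function name, the genotype encoding, and the toy data are hypothetical.

```python
def consistent_replicate_calls(replicate_1, replicate_2):
    """Keep loci that were called in both replicates with identical genotypes.

    Each argument maps locus -> genotype string, or None for a no-call.
    """
    return {
        locus: genotype
        for locus, genotype in replicate_1.items()
        if genotype is not None and replicate_2.get(locus) == genotype
    }


# Toy data for illustration only.
run_a = {"rs10": "AA", "rs11": "AG", "rs12": "GG", "rs13": None}
run_b = {"rs10": "AA", "rs11": "AA", "rs12": "GG", "rs13": "AG"}

usable = consistent_replicate_calls(run_a, run_b)
print(usable)
```

Filtering this way removes array calls that are not reproducible, so the remaining discordances are less likely to reflect array error.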