Complete Genomics has published a report in the journal Science describing three human genome sequences. Two of the samples were derived from potentially different passages of cell lines used in the International HapMap Project: (1) a Caucasian male of European descent (NA07022) and (2) a Yoruban female (NA19240). The third sample was generated from lymphoblast DNA from a Personal Genome Project (PGP) Caucasian male (NA20431). Sequencing of these genomes was conducted at Complete Genomics' commercial-scale genome center. The results presented in the paper are discussed below.
NOTE: Analysis of these data is ongoing, and we have made considerable additions to our production analysis software since this paper was written.
Summary information from mapping and assembly of the three genomes.
| Sample | Mapped Sequence (Gb) | Average Coverage Depth (fold) | Percent of Genome Called: Fully | Percent of Genome Called: Partially |
|---|---|---|---|---|
| NA19240 | 178 | 63 | 95% | 1% |
| NA07022 | 241 | 87 | 91% | 2% |
| NA20431 | 124 | 45 | 86% | 3% |
Results were obtained by mapping sequence reads to the human reference genome (NCBI Build 36) and assembling variants with custom algorithms designed specifically for Complete Genomics data. Between 124 and 241 gigabases (Gb) were mapped per genome, for a mean depth of coverage of 45- to 87-fold. Fully called regions are those where both diploid alleles could be determined with high accuracy (see below), while partially called regions are those where one of the two alleles was determined but the second was not.
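As a rough sanity check on the summary table, mean coverage depth is approximately the mapped sequence divided by the size of the reference being covered. The sketch below inverts that relationship to recover the reference size implied by each row. The names `samples` and `implied_genome_size_gb` are illustrative, and the exact reference size used for the published depth figures is not stated here, so the result is only an inference from rounded table values.

```python
# Mapped sequence (Gb) and average coverage depth (fold), copied from the summary table.
samples = {
    "NA19240": (178, 63),
    "NA07022": (241, 87),
    "NA20431": (124, 45),
}


def implied_genome_size_gb(mapped_gb, depth_fold):
    """Depth ~= mapped bases / reference size, so reference size ~= mapped bases / depth."""
    return mapped_gb / depth_fold


# Hypothetical helper output: one implied reference size per sample.
implied_sizes = {
    name: implied_genome_size_gb(mapped_gb, depth)
    for name, (mapped_gb, depth) in samples.items()
}

for name, size in implied_sizes.items():
    print(f"{name}: implied reference size ~{size:.2f} Gb")
```

All three rows imply a reference size of roughly 2.8 Gb, so the mapped-sequence and depth columns are mutually consistent.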
Variations detected relative to reference genome (NCBI Build 36).
| Variation type | | NA19240 Count | NA19240 Novel | NA07022 Count | NA07022 Novel | NA20431 Count | NA20431 Novel |
|---|---|---|---|---|---|---|---|
| SNPs | All | 4,042,801 | 19% | 3,076,869 | 10% | 2,905,517 | 10% |
| SNPs | Homozygous | 1,297,601 | 4% | 1,097,899 | 2% | 965,029 | 1% |
| SNPs | Heterozygous | 2,639,864 | 27% | 1,800,287 | 15% | 1,657,540 | 16% |
| SNPs | Transitions | 3,635,882 | | 2,858,818 | | 2,658,112 | |
| SNPs | Transversions | 1,706,195 | | 1,316,837 | | 1,213,232 | |
| SNPs | Coding | 23,000 | 16% | 18,723 | 9% | 16,532 | 10% |
| SNPs | Non-synonymous | 11,400 | 19% | 9,286 | 11% | 8,215 | 12% |
| Indels | Short Insertions | 242,391 | 40% | 168,909 | 37% | 136,786 | 37% |
| Indels | Short Deletions | 253,803 | 44% | 168,726 | 37% | 133,008 | 36% |
| Indels | Total | 496,194 | | 337,635 | | 269,794 | |
| Indels | Coding Short Indels | 549 | 56% | 556 | 58% | 435 | 59% |
| Indels | Frameshifting Short Indels | 327 | 61% | 310 | 62% | 299 | 71% |
| Block substitutions | Length conserving | 54,054 | 39% | 40,103 | 42% | 38,449 | 33% |
| Block substitutions | Length changing | 34,432 | 64% | 22,680 | 61% | 18,166 | 60% |
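Two quick consistency checks can be run on the variation table: the transition/transversion (Ti/Tv) ratio for each genome, and whether the short insertion and deletion counts add up to the reported indel totals. This is a minimal sketch using only the counts in the table; the dictionary names are illustrative. Note that the transition and transversion counts sum to more than the "All" SNP count, so they appear to be tallied on a different basis that is not stated here, and the ratio should be read with that caveat.

```python
# Counts copied from the variation table, keyed by sample.
transitions = {"NA19240": 3_635_882, "NA07022": 2_858_818, "NA20431": 2_658_112}
transversions = {"NA19240": 1_706_195, "NA07022": 1_316_837, "NA20431": 1_213_232}

short_insertions = {"NA19240": 242_391, "NA07022": 168_909, "NA20431": 136_786}
short_deletions = {"NA19240": 253_803, "NA07022": 168_726, "NA20431": 133_008}
reported_indel_totals = {"NA19240": 496_194, "NA07022": 337_635, "NA20431": 269_794}

# Ti/Tv ratio per sample: transitions divided by transversions.
titv = {s: transitions[s] / transversions[s] for s in transitions}

# True where insertions + deletions reproduce the reported indel total.
indel_totals_match = {
    s: short_insertions[s] + short_deletions[s] == reported_indel_totals[s]
    for s in reported_indel_totals
}

for s in titv:
    print(f"{s}: Ti/Tv = {titv[s]:.2f}, indel total consistent: {indel_totals_match[s]}")
```

The ratios come out between about 2.1 and 2.2 for all three genomes, and the indel totals match the sums of their two components.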
Between 2.91 and 4.04 million single nucleotide polymorphisms (SNPs) were identified in each genome relative to the reference. Of these SNPs, 81% to 90% have been previously reported in dbSNP build 129. This is consistent with reports of other complete human genome sequences from different ethnicities compared to this reference.
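The 81% to 90% figure follows directly from the "Novel" column for all SNPs in the variation table: the previously reported fraction is the complement of the novel fraction. The sketch below makes that arithmetic explicit. The function names are illustrative, and because the table percentages are rounded, the novel counts it derives are approximations rather than published values.

```python
# (total SNP count, novel percent) for the "All" SNPs row of the variation table.
all_snps = {
    "NA19240": (4_042_801, 19),
    "NA07022": (3_076_869, 10),
    "NA20431": (2_905_517, 10),
}


def known_percent(novel_percent):
    """Percent of SNPs already reported: the complement of the novel percent."""
    return 100 - novel_percent


def approx_novel_count(total, novel_percent):
    """Approximate number of novel SNPs, limited by the rounding of the percentage."""
    return round(total * novel_percent / 100)


known = {s: known_percent(pct) for s, (_, pct) in all_snps.items()}
novel_counts = {s: approx_novel_count(n, pct) for s, (n, pct) in all_snps.items()}

for s in all_snps:
    print(f"{s}: {known[s]}% previously reported, ~{novel_counts[s]:,} novel SNPs")
```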
The data show excellent concordance with SNP genotypes generated by the HapMap project, particularly with the highest-quality Illumina Infinium™ subset. The HapMap paper can be found at http://hapmap.org/downloads/presentations/nature_hapmap3.pdf; for details of genotyping accuracy by technology and center, see its Supplementary Table 3 at http://www.hapmap.org/downloads/presentations/nature_supp3.pdf. The high concordance of our genotypes with those generated using independent technologies affirms the accuracy of Complete Genomics' sequencing technology for discovery and validation of polymorphisms.
Genotype calls compared against HapMap Phase I and II genotypes and the HapMap Infinium subset.
| | HapMap Phase I & II SNPs | HapMap Infinium subset |
|---|---|---|
| # reported | 3.8M | 144K |
| % called | 98.46% | 98.45% |
| % locus concordance | 99.14% | 99.85% |
| HapMap genotype calls: Homozygous ref | 99.22% | 99.92% |
| HapMap genotype calls: Heterozygous | 99.62% | 99.81% |
| HapMap genotype calls: Homozygous alt | 98.26% | 99.79% |
Genotype calls compared against Infinium 1M, HapMap Phase I and II genotypes, and the HapMap Infinium subset. To determine whether discordances were due to errant calls in the Complete Genomics data or the Infinium subset of HapMap, discordant loci were tested by Sanger sequencing.
| | Infinium 1M | HapMap Phase I & II SNPs | HapMap Infinium subset | Sanger test of HapMap Infinium SNPs: these data correct | Sanger test of HapMap Infinium SNPs: these data incorrect | Sanger test of HapMap Infinium SNPs: % affirmed |
|---|---|---|---|---|---|---|
| # reported | 1M | 3.9M | 143K | | | |
| % called | 95.98% | 94.39% | 96.00% | | | |
| % locus concordance | 99.89% | 99.15% | 99.88% | | | |
| HapMap genotype calls: Homozygous ref (% concordance) | 99.96% | 99.34% | 99.96% | 18 | 2 | 90% |
| HapMap genotype calls: Heterozygous (% concordance) | 99.78% | 99.39% | 99.80% | 28 | 46 | 38% |
| HapMap genotype calls: Homozygous alt (% concordance) | 99.81% | 98.14% | 99.84% | 28 | 12 | 70% |
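The "% affirmed" column in the Sanger comparison is the share of tested discordant loci at which Sanger sequencing agreed with these data rather than with the HapMap Infinium call. A minimal sketch of that calculation follows, using the counts from the table; the names `sanger_results` and `percent_affirmed` are illustrative. These percentages describe only loci that were already discordant, so they are not an overall accuracy estimate.

```python
# (these data correct, these data incorrect) per HapMap genotype class, from the table.
sanger_results = {
    "Homozygous ref": (18, 2),
    "Heterozygous": (28, 46),
    "Homozygous alt": (28, 12),
}


def percent_affirmed(correct, incorrect):
    """Percent of Sanger-tested discordant loci resolved in favor of these data."""
    return 100 * correct / (correct + incorrect)


affirmed = {cls: percent_affirmed(c, i) for cls, (c, i) in sanger_results.items()}

# Pooled across the three genotype classes.
total_correct = sum(c for c, _ in sanger_results.values())
total_tested = sum(c + i for c, i in sanger_results.values())
pooled_affirmed = 100 * total_correct / total_tested

for cls, pct in affirmed.items():
    print(f"{cls}: {pct:.0f}% affirmed")
print(f"Pooled: {total_correct} of {total_tested} loci ({pooled_affirmed:.1f}%)")
```

Pooled across the three classes, 74 of the 134 tested discordant loci were resolved in favor of these data.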
Additionally, to determine a whole-genome false positive rate, 291 novel non-synonymous variants (a category enriched for errors) were tested with Sanger sequencing. This approach yielded an extrapolated false positive rate of approximately 1 in 100 kilobases. For more detail, see Supplemental Table S8 in the Science publication.
Genotype calls compared to Affymetrix® 500K SNP genotypes. Genotypes were assayed in duplicate, and only SNPs with identical calls between the replicates are considered.
| | 500K |
|---|---|
| # reported | 475K |
| % called | 94.18% |
| % locus concordance | 99.75% |
| Array genotype calls: Homozygous ref | 99.88% |
| Array genotype calls: Heterozygous | 99.45% |
| Array genotype calls: Homozygous alt | 99.78% |
©2010 Complete Genomics, Inc. All rights reserved. cPAL and DNB are trademarks of Complete Genomics, Inc. in the US and certain other countries. All other trademarks are the property of their respective owners.