Data Summary from Science Publication

Complete Genomics has published a report describing three human genome sequences in the journal Science. Two of the samples were derived from potentially different passages of cell lines used in the International HapMap project: 1) a Caucasian male of European descent (NA07022), and 2) a Yoruban female (NA19240). The third sample was generated from lymphoblast DNA from a Personal Genome Project (PGP) Caucasian male sample (NA20431). Sequencing of these genomes was conducted at Complete Genomics’ commercial-scale genome center. The results presented in that paper are summarized and discussed below.

NOTE: Analysis of these data is ongoing, and we have made considerable additions to our production analysis software since this paper was written.

Summary of Genomes Sequenced

Summary information from mapping and assembly of the three genomes.

Sample    Mapped Sequence (Gb)   Average Coverage Depth (fold)   Percent of Genome Called
                                                                 Fully   Partially
NA19240   178                    63                              95%     1%
NA07022   241                    87                              91%     2%
NA20431   124                    45                              86%     3%

Results were obtained by mapping sequence reads to the human genome reference (NCBI Build 36) and assembling variants with custom algorithms specifically designed for Complete Genomics data. Between 124 and 241 gigabases (Gb) were mapped per genome, for a mean depth of coverage of 45- to 87-fold. Fully called regions are those where both diploid alleles could be determined at high accuracy (see below), while partially called regions are those where one of the two alleles was determined but the second was not.
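As a rough consistency check on the summary table, dividing mapped sequence by average coverage depth gives the reference span implied by each row. The sketch below assumes that depth is simply mapped bases divided by covered reference length, which the source does not state; the function name is illustrative only.

```python
# Mapped sequence (Gb) and average coverage depth (fold), copied from the summary table.
GENOMES = {
    "NA19240": (178, 63),
    "NA07022": (241, 87),
    "NA20431": (124, 45),
}

def implied_span_gb(mapped_gb, depth_fold):
    """Reference span (Gb) implied by mapped bases / mean depth (an assumed relationship)."""
    return mapped_gb / depth_fold

for sample, (mapped_gb, depth_fold) in GENOMES.items():
    print(f"{sample}: implied span = {implied_span_gb(mapped_gb, depth_fold):.2f} Gb")
```

Under this assumption, all three rows imply a span of roughly 2.8 Gb, so the mapped-sequence and depth columns agree with one another.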

Summary of Variations

Variations detected relative to reference genome (NCBI Build 36).

Variation type                NA19240             NA07022             NA20431
                              Count       Novel   Count       Novel   Count       Novel
SNPs
  All                         4,042,801   19%     3,076,869   10%     2,905,517   10%
  Homozygous                  1,297,601   4%      1,097,899   2%      965,029     1%
  Heterozygous                2,639,864   27%     1,800,287   15%     1,657,540   16%
  Transitions                 3,635,882           2,858,818           2,658,112
  Transversions               1,706,195           1,316,837           1,213,232
  Coding                      23,000      16%     18,723      9%      16,532      10%
  Non-synonymous              11,400      19%     9,286       11%     8,215       12%
Indels
  Short Insertions            242,391     40%     168,909     37%     136,786     37%
  Short Deletions             253,803     44%     168,726     37%     133,008     36%
  Total                       496,194             337,635             269,794
  Coding Short Indels         549         56%     556         58%     435         59%
  Frameshifting Short Indels  327         61%     310         62%     299         71%
Block substitutions
  Length conserving           54,054      39%     40,103      42%     38,449      33%
  Length changing             34,432      64%     22,680      61%     18,166      60%

Between 2.91 and 4.04 million single nucleotide polymorphisms (SNPs) with respect to the reference genome were identified in each genome. Of these SNPs, 81% to 90% have been previously reported in dbSNP build 129. This is consistent with reports of other complete human genome sequences from different ethnicities compared to this reference.
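Two quantities can be derived directly from the variation table: the transition-to-transversion ratio for each sample, and the share of SNPs already in dbSNP, taken here as the complement of the "Novel" column. The following is a minimal sketch; the dictionary layout and function names are illustrative and are not part of the published analysis.

```python
# Per-sample SNP figures copied from the variation table.
SNPS = {
    "NA19240": {"transitions": 3_635_882, "transversions": 1_706_195, "novel_pct": 19},
    "NA07022": {"transitions": 2_858_818, "transversions": 1_316_837, "novel_pct": 10},
    "NA20431": {"transitions": 2_658_112, "transversions": 1_213_232, "novel_pct": 10},
}

def ti_tv_ratio(transitions, transversions):
    """Transition-to-transversion ratio."""
    return transitions / transversions

def known_pct(novel_pct):
    """Percent previously reported, assumed to be 100 minus the 'Novel' percentage."""
    return 100 - novel_pct

for sample, row in SNPS.items():
    ratio = ti_tv_ratio(row["transitions"], row["transversions"])
    print(f"{sample}: Ti/Tv = {ratio:.2f}, previously reported = {known_pct(row['novel_pct'])}%")
```

The previously reported percentages computed this way (81%, 90%, 90%) match the range quoted in the text.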

Concordance with HapMap and other orthogonal technologies

The data generated show excellent concordance with SNP genotypes generated by the HapMap project, particularly with the highest quality Illumina Infinium™ subset. The HapMap paper can be found at: http://hapmap.org/downloads/presentations/nature_hapmap3.pdf; see Supplementary Table 3 for details of genotyping accuracy by technology and center: http://www.hapmap.org/downloads/presentations/nature_supp3.pdf. The high concordance of our genotypes with those generated using independent technologies affirms the accuracy of Complete Genomics' sequencing technology for discovery and validation of polymorphisms.

Sample NA19240

Genotype calls compared against HapMap Phase I and II genotypes and the HapMap Infinium subset.


                         HapMap Phase I & II SNPs   HapMap Infinium subset
# reported               3.8M                       144K
% called                 98.46%                     98.45%
% locus concordance      99.14%                     99.85%
HapMap Genotype calls
  Homozygous ref         99.22%                     99.92%
  Heterozygous           99.62%                     99.81%
  Homozygous alt         98.26%                     99.79%

Sample NA07022

Genotype calls compared against Infinium 1M, HapMap Phase I and II genotypes, and the HapMap Infinium subset. To determine whether discordances were due to errant calls in the Complete Genomics data or the Infinium subset of HapMap, discordant loci were tested by Sanger sequencing.

 
                                   Infinium 1M   HapMap Phase I & II SNPs   HapMap Infinium subset   HapMap Infinium SNPs tested for accuracy by Sanger sequencing
                                                                                                     These data correct   These data incorrect   % affirmed
# reported                         1M            3.9M                       143K
% called                           95.98%        94.39%                     96.00%
% locus concordance                99.89%        99.15%                     99.88%
HapMap Genotype calls
  Homozygous ref (% concordance)   99.96%        99.34%                     99.96%                   18                   2                      90%
  Heterozygous (% concordance)     99.78%        99.39%                     99.80%                   28                   46                     38%
  Homozygous alt (% concordance)   99.81%        98.14%                     99.84%                   28                   12                     70%
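The "% affirmed" figures in the Sanger columns appear to be the share of tested discordant loci at which the Complete Genomics call was confirmed. The sketch below reproduces them under that assumption; the names are illustrative.

```python
# (these data correct, these data incorrect) per HapMap genotype class,
# copied from the Sanger columns of the NA07022 table.
SANGER = {
    "Homozygous ref": (18, 2),
    "Heterozygous": (28, 46),
    "Homozygous alt": (28, 12),
}

def percent_affirmed(correct, incorrect):
    """Share of Sanger-tested discordant loci where these data were correct."""
    return 100.0 * correct / (correct + incorrect)

for genotype, (correct, incorrect) in SANGER.items():
    print(f"{genotype}: {percent_affirmed(correct, incorrect):.0f}% affirmed")
```

Rounded to whole percentages, this gives 90%, 38%, and 70%, matching the table.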

Additionally, to determine a whole-genome false positive rate, 291 novel non-synonymous variants (a category enriched for errors) were tested with Sanger sequencing. This approach yielded an extrapolated false positive rate of approximately 1 in 100 kilobases. For more detail, see Supplemental Table S8 in the Science publication.

Sample NA20431

Genotype calls compared to Affymetrix® 500K SNP genotypes. Genotypes were assayed in duplicate, and only SNPs with identical calls between the replicates are considered.

 
                         500K
# reported               475K
% called                 94.18%
% locus concordance      99.75%
Array genotype calls
  Homozygous ref         99.88%
  Heterozygous           99.45%
  Homozygous alt         99.78%

©2010 Complete Genomics, Inc. All rights reserved. cPAL and DNB are trademarks of Complete Genomics, Inc. in the US and certain other countries. All other trademarks are the property of their respective owners.

 
