A diverse data set of whole human genomes is freely available for public use to enhance any genomic study or to evaluate Complete Genomics data results and file formats. The data set comprises 69 DNA samples sequenced using our Standard Sequencing Service, which includes whole genome sequencing, mapping of the resulting reads to a human reference genome, comprehensive detection of variations, scoring, and informative annotation. The 69-genome data set includes:
- A Yoruban trio
- A Puerto Rican trio
- A 17-member CEPH pedigree across three generations
- A diversity panel representing unrelated individuals from nine different populations
The CEPH samples within the pedigree and diversity sets are from the NIGMS Repository; the remaining samples are from the NHGRI Repository. Both repositories are housed at the Coriell Institute for Medical Research. These samples were sequenced to an average genome-wide coverage of 80X (range: 51X to 89X).
For each publicly available genome, the following whole genome information is provided:
- Variation Reporting, including:
  - SNPs, small insertions, deletions, substitutions, and complex small variants
  - Copy number variations (CNVs) and structural variations (SVs)
  - Mobile element insertions (MEIs)
- Annotations, including:
  - Genes and functional impact
  - Minor allele frequency (MAF) from the 1000 Genomes Project
  - Non-coding RNA
  - Known segmental duplication
  - Type of SV
- Extensive scoring
- Evidence files, including read support information for the variation
- Per base-pair coverage for each position in the reference genome
Additionally, Complete Genomics provides aggregate information for the small variants identified across the 69-genome data set, as well as across a 55-genome subset that excludes any closely related individuals. This aggregate information includes:
- A list of all of the variants in the set of genomes, generated using the CGA Tools (v1.5) listvariants command
- A table indicating which variants are present in which genomes, generated using the CGA Tools (v1.5) testvariants command
- A table indicating which variants are present in which genomes, including variant frequency, in VCF format
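As an illustration of what a per-variant frequency in a VCF table means, the sketch below derives an alternate-allele frequency from the genotype (GT) columns of a single VCF data line. It relies only on the general VCF layout (eight fixed columns, a FORMAT column, then one column per sample). The function name `alt_allele_frequency` and the sample line are hypothetical, and the frequency reported in the Complete Genomics file may be defined or stored differently (for example, in the INFO column).

```python
def alt_allele_frequency(vcf_line):
    """Fraction of called alleles that are non-reference in one VCF data line.

    Assumes tab-separated columns with FORMAT in column 9 and samples after it.
    Returns None when no allele is called in any sample.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")

    alt = 0
    called = 0
    for sample in fields[9:]:
        values = sample.split(":")
        if gt_index >= len(values):
            continue  # sample column has no genotype entry
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        for allele in values[gt_index].replace("|", "/").split("/"):
            if allele == ".":
                continue  # no-call alleles are excluded from the denominator
            called += 1
            if allele != "0":
                alt += 1
    return alt / called if called else None


# Made-up data line with four samples, for illustration only.
line = "chr1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\t1/1\t./.\t0/0"
freq = alt_allele_frequency(line)
print(freq)
```

No-called alleles are left out of the denominator here; counting them as reference instead would lower the reported frequency, so the choice should match whatever convention the downstream analysis expects.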
The data are freely available for use in a publication with the following stipulations:
- The Coriell and ATCC Repository number(s) of the cell line(s) or the DNA sample(s) must be cited in publications or presentations that are based on the use of these materials.
- The Complete Genomics Science paper must be referenced (R. Drmanac, et al. Science 327(5961), 78. [DOI: 10.1126/science.1181498]).
- The version number of the Complete Genomics assembly software with which the data were generated must be referenced. This can be found in the header of the summary.tsv file (# Software_Version).
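To help with the last stipulation, the following is a minimal sketch of how the software version could be pulled from the summary.tsv header. It assumes the header consists of lines starting with `#` at the top of the file, with the key and value separated by whitespace or a tab; the exact layout is an assumption, and the helper name `read_software_version` and the sample header values are hypothetical.

```python
import io


def read_software_version(lines):
    """Return the value on the Software_Version header line, or None.

    `lines` is any iterable of text lines, such as an open file handle.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if not line.startswith("#"):
            break  # assumed: the header ends at the first non-comment line
        # Drop the leading '#', then split into key and remaining value.
        parts = line.lstrip("#").strip().split(None, 1)
        if len(parts) == 2 and parts[0].lower() == "software_version":
            return parts[1].strip()
    return None


# Made-up header content, for illustration only.
sample = io.StringIO("# Software_Version\t0.0.0.0\n# Sample\tEXAMPLE\n\nmetric\tvalue\n")
version = read_software_version(sample)
print(version)
```

For a downloaded genome, the same function can be given an open handle on that genome's summary.tsv file in place of the in-memory sample.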
For further technical assistance:
Call toll-free: 1-855-CMPLETE or 1-855-267-5383
Additional documentation can be found here.