Complete Genomics offers free access to a variety of whole human genome data sets on its FTP server (ftp2.completegenomics.com). The genomes were sequenced at the Complete Genomics commercial genome sequencing center in Mountain View, California, and represent results from both our Standard Sequencing Service and our Cancer Sequencing Service. These data are largely consistent with the quality and attributes of other data provided to Complete Genomics customers.
The data are freely available for use in a publication with the following stipulations:
There are four sets of data sequenced using our Standard Sequencing Service: a Yoruba trio, a Puerto Rican trio, a 17-member pedigree across three generations, and a diversity panel representing nine different populations. The CEPH samples within the pedigree and diversity sets are from the NIGMS Repository; the remaining samples are from the NHGRI Repository, and both repositories are housed at the Coriell Institute for Medical Research. These samples were sequenced with an average genome-wide coverage of about 80X (range of 51X to 89X).
There are also two cancer pairs, consisting of matched tumor and normal genomes derived from cell-line samples of patients with breast cancer (invasive ductal carcinomas), which were sequenced using our Cancer Sequencing Service. The cell-line-derived DNA is housed at ATCC. Three of the samples were sequenced to an average genome-wide coverage of 123X, and HCC2218 BL to 92X.
By providing the research community with these whole genome data sets, we encourage researchers to validate our performance and to further improve data analysis and interpretation methods. Some samples in this data set have also been analyzed by the 1000 Genomes Project and the International HapMap Project.
For additional documentation and support, please refer to the Public Genome Data Repository Service Note.
Complete Genomics uses either NCBI build 36 (hg18) or NCBI build 37 (hg19¹) as a reference genome during its data analysis process.
These genomes were released in February 2011 using an earlier Complete Genomics Analysis Pipeline for calling variations across the genome (version 1.10.0), which was the most current pipeline version at that time. The current release of the dataset was generated with the most recent Analysis Pipeline (version 2.0.0), and some samples were resequenced with the latest library preparation process. Please note: the original 1.10 versions of the public genomes will only be maintained on the FTP site through January 31, 2012. To view the new features and enhancements in each release of the Complete Genomics Analysis Pipeline, please refer to the Release Notes.
For more information about this dataset including download and un-archiving instructions, please refer to the Public Genome Data Repository Service Note and the Quick Start Guide: Working with Complete Genomics Data. For detailed descriptions of the directory hierarchy and the files contained within the directories, please refer to either the Standard Sequencing Service Data File Formats or the Cancer Sequencing Service Data File Formats, depending on the sample group under study. Full documentation, including Frequently Asked Questions, can be found here.
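As a starting point before consulting the download instructions, the top level of the FTP server could be listed with a few lines of Python. This is only a minimal sketch: it assumes the server named above accepts anonymous logins and is still reachable, and the function name `list_top_level` and its `ftp_factory` parameter are illustrative choices, not part of any Complete Genomics tooling. The directory layout itself is described in the Data File Formats documents and is not assumed here.

```python
from ftplib import FTP

# Host name taken from the text above; availability is not guaranteed.
HOST = "ftp2.completegenomics.com"


def list_top_level(host=HOST, ftp_factory=FTP, timeout=30):
    """Return the sorted names found in the server's root directory.

    ftp_factory is a hypothetical hook that defaults to ftplib.FTP;
    it exists only so the function can be exercised without a network.
    """
    with ftp_factory(host, timeout=timeout) as ftp:
        ftp.login()          # no arguments means an anonymous login
        names = ftp.nlst()   # names in the current (root) directory
    return sorted(names)
```

Calling `list_top_level()` with no arguments would attempt a live anonymous connection to the host above; passing a different `host` or a stand-in `ftp_factory` lets the same code be tried against a mirror or a test double.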
For each publicly available genome, Complete Genomics provides the following whole genome information:
Additionally, Complete Genomics provides aggregate information for the small variants identified across the 69 genomes, as well as for a 54-genome subset that excludes any closely related individuals. These files include:
More information on these additional files can be found here.
¹ Note that hg19 uses a Yoruba mitochondrion sequence while Complete Genomics uses the Cambridge Reference Sequence.