Should a Rich Genome Variant File End the Storage of Raw Read Data?

Deep sequencing of a human genome produces datasets in the hundreds of gigabytes, while the associated genome variant file measures about half a gigabyte. Now that we are sequencing thousands of human genomes, has the time come to store and use only this variant file?

Generating an accurate and comprehensive genome variant file is a critical first step in extracting biological knowledge from sequence data. This file lists all the sequence differences between the analyzed genome and a commonly used human reference genome. Researchers compare files from a set of genomes to find the genetic variations that might lead to a specific disease. Since the early days of Complete Genomics’ human genome sequencing service, we have provided these research-ready genomic reports. Scientists have thereby obtained the data from complete human genome sequencing without the overhead and expense of large computational facilities to handle, map, and assemble raw reads, even though the raw data was always supplied as well.
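
As a rough illustration of what comparing variant files can mean in practice, the sketch below treats each genome's variant file as a set of variants and keeps those found in every affected genome and in no unaffected genome. The `Variant` record, the `candidate_variants` function, and the example data are all hypothetical and are not Complete Genomics' file format or analysis method. Real studies also account for zygosity and inheritance models, which this simple filter ignores.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not an actual file format)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def candidate_variants(
    affected: Iterable[Set[Variant]],
    unaffected: Iterable[Set[Variant]],
) -> Set[Variant]:
    """Return variants present in every affected genome and in no unaffected one."""
    affected = [set(g) for g in affected]
    if not affected:
        return set()
    shared = set.intersection(*affected)
    seen_in_unaffected = set().union(*(set(g) for g in unaffected))
    return shared - seen_in_unaffected


# Made-up example data, for illustration only.
a = Variant("chr1", 100, "A", "G")
b = Variant("chr2", 200, "C", "T")
c = Variant("chr3", 300, "G", "GA")

case_1 = {a, b, c}
case_2 = {a, b}
control = {b}

candidates = candidate_variants([case_1, case_2], [control])
print(candidates)
```

Because the comparison runs on variant sets rather than on reads, it needs only the small variant files as input.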

In fact, the first peer-reviewed papers published using our service employed the variant files and didn’t require the raw read data. The Institute for Systems Biology (ISB) sequenced a family where both children suffered from skeletal malformations caused by Miller syndrome, while both parents were unaffected. By sequencing these four genomes and comparing the variant files, ISB was able to identify the causal variant.1 The University of Texas Southwestern Medical Center used Complete Genomics’ service to solve a clinical case in which an 11‑month-old child presented with a very high cholesterol level (LDL‑C of 837). A standard blood test reported the infant was not lacking the protein that would bring down the cholesterol level. However, the variant file data revealed that, in fact, the protein must be absent.2

Preserving raw data is always a good policy when using a new technology. Initially, the sequencing cost per genome was high, and studies rarely included more than a dozen genomes. Structural variants and copy number variants were not reported, and the accuracy of the reported variants was either unknown or not trusted by the research community. Now, after two years of intense R&D and a substantial reduction in sequencing price, what were good reasons in the past may be unnecessary limitations today. Is it wise now to store only the variant file and delete all the raw data? Yes, but only if the variant file is rich enough and accurate enough.

At Complete Genomics, by using more than 40x read coverage and local de novo assembly, we provide a rich variant file that describes and fully annotates SNPs, indels, substitutions, copy number variants, structural variants, and mobile element insertions for virtually the entire genome. In addition, every reported variant has a precise confidence score that researchers can use to obtain the desired tradeoff between sensitivity and specificity. The increased comprehensiveness of variant detection and the flexibility to adjust the balance between false positives and false negatives mitigate much of the need to store an enormous amount of raw read data, including a quality score for each base in each read.
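
To make the score-based tradeoff concrete, here is a minimal sketch that applies a threshold to per-variant confidence scores and counts the resulting false positives and false negatives. The `ScoredCall` record, the score scale, and the truth labels are assumptions made up for illustration. They do not describe the actual scoring scheme in Complete Genomics' variant files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ScoredCall:
    variant_id: str
    score: float   # hypothetical confidence score; higher means more confident
    is_real: bool  # hypothetical truth label, e.g. from independent validation


def errors_at_threshold(calls: List[ScoredCall], threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) when keeping calls with score >= threshold."""
    kept = [c for c in calls if c.score >= threshold]
    true_pos = sum(c.is_real for c in kept)
    false_pos = len(kept) - true_pos
    false_neg = sum(c.is_real for c in calls) - true_pos
    return false_pos, false_neg


# Made-up calls, for illustration only.
calls = [
    ScoredCall("v1", 90, True),
    ScoredCall("v2", 60, True),
    ScoredCall("v3", 40, False),
    ScoredCall("v4", 30, True),
    ScoredCall("v5", 10, False),
]

lenient = errors_at_threshold(calls, 20)  # favors sensitivity
strict = errors_at_threshold(calls, 50)   # favors specificity
print(lenient, strict)
```

In this toy data, the lenient threshold keeps every real variant but admits one false call, while the strict threshold removes the false call at the cost of missing one real variant. The point is that the choice can be made and revisited from the variant file alone.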

I have heard the argument that it is important to save raw read data for reanalysis in the future with improved software. But wouldn’t it be cheaper to re-sequence and re-annotate selected genomes rather than to pay for storing raw read data for all genomes in all studies? Another argument offered in favor of keeping the reads is that scientific journals might require them as a condition of publication. This requirement is based on yesterday’s technology. Journals should instead require a rich variant file as support for biological and medical conclusions. The research community should lead a discussion about format and content standards for such a data file. In the end, there remains only one case where I would keep raw data: when the DNA samples initially sequenced are so small that re-sequencing is not an option.
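
The storage-versus-re-sequencing question can be framed as simple arithmetic. The sketch below is only that, a framing: every price and size in it is a placeholder chosen for illustration, not a quoted or measured figure, and the function names are hypothetical. It also ignores the cost of storing the variant files themselves and any cost of re-annotation.

```python
def raw_storage_cost(n_genomes: int, gb_per_genome: float,
                     cost_per_gb_year: float, years: float) -> float:
    """Cost of keeping raw reads for every genome in a study."""
    return n_genomes * gb_per_genome * cost_per_gb_year * years


def resequencing_cost(n_genomes: int, fraction_redone: float,
                      cost_per_genome: float) -> float:
    """Cost of re-sequencing only a selected fraction of the genomes later."""
    return n_genomes * fraction_redone * cost_per_genome


def break_even_fraction(gb_per_genome: float, cost_per_gb_year: float,
                        years: float, cost_per_genome: float) -> float:
    """Fraction of genomes that could be re-sequenced for the price of storing all raw reads."""
    return gb_per_genome * cost_per_gb_year * years / cost_per_genome


# Placeholder inputs (assumptions, not real prices).
N_GENOMES = 1000
GB_PER_GENOME = 300.0      # "hundreds of gigabytes" per genome
COST_PER_GB_YEAR = 0.5     # arbitrary currency units
YEARS = 5.0
COST_PER_GENOME = 5000.0   # arbitrary currency units

storage = raw_storage_cost(N_GENOMES, GB_PER_GENOME, COST_PER_GB_YEAR, YEARS)
fraction = break_even_fraction(GB_PER_GENOME, COST_PER_GB_YEAR, YEARS, COST_PER_GENOME)
print(storage, fraction)
```

Under these placeholder inputs, discarding raw reads comes out cheaper as long as fewer than the break-even fraction of genomes ever need to be re-sequenced. With different prices the conclusion can flip, which is why the inputs matter more than the formula.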

The concern many researchers have about handling the large, complex raw read datasets could practically vanish; they can work instead with the smaller and simpler-to-use variant file, complemented by a rapidly improving set of software tools. This will greatly simplify and reduce the cost of larger studies that will involve hundreds or thousands of complete genomes. These large studies will be required to fully identify and understand the molecular pathologies leading to human diseases.

Advances in the completeness, accuracy, and confidence levels of our genome variant files, coupled with the affordable sequencing price, make it acceptable to save and use only the rich and simple-to-use, research-ready variant file.

1 Roach JC, Glusman G, Smit AFA et al.: Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328(5978), 636-639 (2010).

2 Rios J, Stein E, Shendure J, Hobbs HH, Cohen JC: Identification by whole-genome resequencing of gene defect responsible for severe hypercholesterolemia. Hum. Mol. Genet. 19(22), 4313-4318 (2010).

Comments:


[...] over to Complete Genomics’ blog where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?” It’s an interesting perspective where he suggests that, as the article’s title [...]

Posted By: A Reply to: Should a Rich Genome Variant File End the Storage of Raw Read Data? | Fejes.ca on 07/28/2011 at 09:55 AM PDT

We share this vision at Portable Genomics.
Would you upload a demo rich variant file here? We'd love to use one within our interface.
Thanks.

Posted By: Portable Genomics on 07/29/2011 at 06:10 AM PDT

Using multiple assembly programs is an interesting idea. There is no perfect sequence assembly and variant calling software, and usually it is not cost-effective to be perfect. In the end, it is all about the cost-benefit ratio. The costs of repeated sequencing, additional assembly, raw data storage, and reduced sensitivity should be compared to decide which approach is best for a given study. Keeping a rich, highly accurate, and complete variant file and discarding raw read data is becoming one of the valuable options.

Posted By: Complete Genomics on 08/02/2011 at 10:29 AM PDT

All of our public data including variant file data is located on our FTP site.
Details are on our website: http://www.completegenomics.com/sequence-data/download-data/

Posted By: Complete Genomics on 08/02/2011 at 10:31 AM PDT

If you look at this publication http://www.nature.com/nbt/journal/v29/n8/full/nbt.1904.html it is clear that there is still room for improvement in the assembly of genomes. A rich variant file will not allow you to reassemble using better methods, but I don't think it matters with the rapidly dropping price of sequencing. Much more importantly, the quality of sequencing data keeps getting better and better. So not only is resequencing a possible money saver but you will have better data to assemble.

Really, the deciding criterion just boils down to how difficult it was to get the sample. Do you have more of it? It is one thing to pull a ready-to-sequence sample out of the freezer; it is a whole different thing if you have to procure a new one.

And eventually you simply just have to clean house. In general, I'd say keep it for a little while until it feels like there is no point.

Posted By: Ethan Ford on 08/10/2011 at 12:12 AM PDT



About This Blog

Dr. Clifford Reid
Chairman, President and CEO

Dr. Reid is a successful, serial entrepreneur. He enjoys commercializing disruptive computing and life sciences technologies.

Dr. Rade Drmanac
Chief Scientific Officer

Dr. Drmanac is a genome sequencing pioneer; his inventions include massively parallel DNA sequencing by hybridization and combinatorial probe ligation. As a group leader at the Argonne National Labs, he was part of the Human Genome Project. In 1993, he cofounded Hyseq, one of the first genomic companies.

Copyright © 2012 Complete Genomics Incorporated. All rights reserved.