A junction is defined as two separate regions of the reference genome that appear to be near each other in the genome being sequenced. Deletions are represented by a single junction (Figure 1), while other events such as inversions and intrachromosomal translocations can be represented by more than one junction (Figure 2).
Figure 1: Deletion of Segment BC in the Sequenced Genome Represented by Junction AD
As shown in Figure 1, deletion of segment BC in the sequenced genome would be represented by junction AD: a junction that connects sections A and D. leftStrand, leftPosition, rightStrand, rightPosition, and distance are fields reported in junction files. The leftStrand and rightStrand values indicate that the left and right sides of the junction have the same strand orientation, while the distance value of 2,000 indicates that the positions on the left and right sides of the junction closest to the breakpoint are 2,000 bp apart on the reference genome.
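The relationship between these fields can be illustrated with a minimal sketch. The chromosome fields, the coordinates, and the "+" strand symbols are assumptions chosen only to reproduce the 2,000 bp distance in Figure 1; this is not code from the Complete Genomics pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Junction:
    # left_chr and right_chr are assumed fields; the others mirror
    # leftPosition, leftStrand, rightPosition, and rightStrand.
    left_chr: str
    left_position: int
    left_strand: str
    right_chr: str
    right_position: int
    right_strand: str


def reference_distance(j: Junction) -> Optional[int]:
    """Distance in bp between the two sides on the reference.

    Returns None when the sides lie on different chromosomes.
    """
    if j.left_chr != j.right_chr:
        return None
    return abs(j.right_position - j.left_position)


def same_orientation(j: Junction) -> bool:
    """True when both sides of the junction have the same strand."""
    return j.left_strand == j.right_strand


# Hypothetical coordinates for junction AD in Figure 1.
junction_ad = Junction("chr1", 10_000, "+", "chr1", 12_000, "+")
print(reference_distance(junction_ad), same_orientation(junction_ad))
```

A junction with the same orientation on both sides and a positive reference distance is what a simple deletion would look like under these assumptions.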
Figure 2: Inversion of Segment BC in the Sequenced Genome Represented by Two Different Junctions
As shown in Figure 2, inversion of segment BC in the sequenced genome would be represented by two different junctions: junction AC, which connects sections A and C, and junction BD, which connects sections B and D. Note that, unlike what is shown in Figure 2, coordinates for paired junctions are not typically identical for real events. The leftStrand and rightStrand values indicate that, for both junctions, the left and right sides of the junction have opposite strand orientations, while the distance value of 2,000 indicates that the positions on the left and right sides of the junction closest to the breakpoint are 2,000 bp apart on the reference genome.
Complete Genomics provides two files, allSvEventsBeta and highConfidenceSvEventsBeta, that report structural variation events involving the junctions found in the allJunctionsBeta and highConfidenceJunctionBeta files, respectively. The CGA™ Tools junctions2events command is used to identify structural variation events such as deletions, inversions, and translocations from lists of junctions. It determines which event type a junction is consistent with by identifying possible relationships among the provided junctions. Single-sample junctions are rationalized into event types using this tool, but somatic junctions are not rationalized into event types at this time.
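To make the idea of relating junctions to one another concrete, here is a simplified sketch that pairs opposite-strand junctions whose two sides lie close together, as the two junctions of an inversion would. It is not the junctions2events algorithm. The function name, the 500 bp tolerance, the strand symbols, and the single-chromosome simplification are all assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple


class Junction(NamedTuple):
    junction_id: str
    left_position: int
    left_strand: str
    right_position: int
    right_strand: str


def opposite_strands(j: Junction) -> bool:
    return j.left_strand != j.right_strand


def pair_inversion_junctions(
    junctions: List[Junction], tolerance: int = 500
) -> List[Tuple[str, str]]:
    """Pair opposite-strand junctions whose left sides and right sides
    both lie within `tolerance` bp of each other (same chromosome assumed)."""
    candidates = [j for j in junctions if opposite_strands(j)]
    pairs = []
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            left_close = abs(a.left_position - b.left_position) <= tolerance
            right_close = abs(a.right_position - b.right_position) <= tolerance
            if left_close and right_close:
                pairs.append((a.junction_id, b.junction_id))
    return pairs


# Hypothetical junctions: AC and BD resemble Figure 2, AD resembles Figure 1.
junctions = [
    Junction("AC", 10_000, "+", 12_000, "-"),
    Junction("BD", 10_010, "-", 12_010, "+"),
    Junction("AD", 50_000, "+", 52_000, "+"),
]
print(pair_inversion_junctions(junctions))
```

The tolerance reflects the note above that coordinates of paired junctions are not typically identical for real events.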
Currently, Complete Genomics does not report somatic events. In other words, we do not attempt to rationalize somatic junctions into somatic events. The EventId, Type, and RelatedJunctions annotations in the somaticAllJunctionsBeta file and the somaticHighConfidenceJunctionBeta file refer to a description of the event identified in the tumor sample.
Currently, we do not attempt to call zygosity of the junction. However, zygosity can be inferred, to a certain extent, by interrogating the coverage in the junction region. For example, if coverage in a putative deletion junction region is near zero, you can infer that it is likely a homozygous deletion event.
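This kind of coverage check can be sketched as follows. The function name, the use of flanking coverage as a baseline, and both ratio thresholds are assumptions. Only the near-zero case corresponds to the inference described above; the intermediate category is an added illustration.

```python
from statistics import mean
from typing import Sequence


def infer_deletion_zygosity(
    region_coverage: Sequence[float],
    flank_coverage: Sequence[float],
    hom_max_ratio: float = 0.1,   # assumed cutoff for "near zero"
    het_max_ratio: float = 0.75,  # assumed cutoff for reduced coverage
) -> str:
    """Compare mean coverage inside a putative deletion with flanking coverage."""
    flank = mean(flank_coverage)
    if flank == 0:
        return "undetermined"
    ratio = mean(region_coverage) / flank
    if ratio <= hom_max_ratio:
        return "likely homozygous deletion"
    if ratio <= het_max_ratio:
        return "possibly heterozygous deletion"
    return "no coverage support for a deletion"


# Hypothetical per-window coverage values.
print(infer_deletion_zygosity([0, 1, 0, 0], [40, 38, 42, 40]))
```

Coverage in real data varies with GC content and mappability, so any such thresholds would need tuning per sample.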
Small insertion and deletion events are detected during the assembly process. They are reported only in the small variant files (e.g., var, masterVarBeta, vcfBeta) and are not repeated in the junction files, because they are not detected by the discordant mate pair mapping method used to detect larger structural variations.
These five files represent outputs at various steps of our SV detection pipeline. Junctions are detected by identifying clusters of DNBs in which each arm maps uniquely to the reference genome, but with an unexpected mate pair distance or anomalous orientation. If a cluster contains three or more DNBs, a junction is output. These junctions, together with annotations estimated from this initial clustering of DNBs (such as the putative junction breakpoints, the size of the structural variant, and the transition length), are reported in the evidenceJunctionClustersBeta file. Once junctions are detected, local de novo assembly is attempted on each junction. The junctions, with annotations refined by local de novo assembly (such as breakpoint, size of SV, transition sequence, and transition length), are reported in the allJunctionsBeta file. So, while the evidenceJunctionClustersBeta and allJunctionsBeta files report the same junctions, the annotations differ for junctions in which local de novo assembly was successful.
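The clustering step can be pictured with a toy sketch. Only the minimum of three DNBs per cluster comes from the description above. The greedy grouping, the 500 bp window, and the representation of a discordant mate pair as a pair of arm positions on one chromosome are assumptions, not the pipeline's actual method.

```python
from typing import List, Tuple

# (left arm position, right arm position); same chromosome assumed.
MatePair = Tuple[int, int]


def cluster_discordant_pairs(
    pairs: List[MatePair], window: int = 500, min_dnbs: int = 3
) -> List[List[MatePair]]:
    """Greedily group discordant mate pairs whose arms map near each other,
    then keep only clusters with at least `min_dnbs` members."""
    clusters: List[List[MatePair]] = []
    for left, right in sorted(pairs):
        for cluster in clusters:
            last_left, last_right = cluster[-1]
            if abs(left - last_left) <= window and abs(right - last_right) <= window:
                cluster.append((left, right))
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([(left, right)])
    return [c for c in clusters if len(c) >= min_dnbs]


# Hypothetical discordant mate pairs: one group of three, one group of two.
pairs = [(1000, 5000), (1050, 5040), (1100, 5100), (9000, 20000), (9050, 20010)]
print(cluster_discordant_pairs(pairs))
```

In this example only the first group reaches three DNBs, so it is the only one that would be output as a junction.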
A set of filtering criteria is applied to junctions in the allJunctionsBeta file to obtain a list of high-confidence junctions, which are then reported in the highConfidenceJunctionBeta file. So, the highConfidenceJunctionBeta file contains a subset of the junctions in the allJunctionsBeta file, and the annotations for junctions found in both files are the same.
For samples submitted for the Cancer Sequencing Service, two additional files are provided. The somaticAllJunctionsBeta file contains junctions that are identified in the allJunctionsBeta file for the tumor but not in the allJunctionsBeta file for the matched normal sample. The somaticHighConfidenceJunctionBeta file contains junctions that are identified in the highConfidenceJunctionBeta file for the tumor but not in the allJunctionsBeta file for the matched normal sample.
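The tumor-minus-normal logic behind these two files can be sketched as a filtered set difference. The tuple layout, the function names, and the 200 bp matching tolerance are assumptions for illustration; the actual criteria for deciding that two junctions are the same are not described here.

```python
from typing import List, Tuple

# (left chromosome, left position, right chromosome, right position)
Junction = Tuple[str, int, str, int]


def same_junction(a: Junction, b: Junction, tolerance: int = 200) -> bool:
    """Treat two junctions as the same if both sides agree within `tolerance` bp."""
    return (
        a[0] == b[0]
        and a[2] == b[2]
        and abs(a[1] - b[1]) <= tolerance
        and abs(a[3] - b[3]) <= tolerance
    )


def somatic_junctions(
    tumor: List[Junction], normal_all: List[Junction], tolerance: int = 200
) -> List[Junction]:
    """Keep tumor junctions with no counterpart among all normal junctions."""
    return [
        t for t in tumor
        if not any(same_junction(t, n, tolerance) for n in normal_all)
    ]


# Hypothetical junction lists.
tumor_all = [("chr1", 10_000, "chr1", 12_000), ("chr2", 5_000, "chr7", 9_000)]
tumor_high_conf = [("chr2", 5_000, "chr7", 9_000)]
normal_all = [("chr1", 10_050, "chr1", 12_020)]

# Both somatic lists subtract the normal sample's full junction list.
print(somatic_junctions(tumor_all, normal_all))
print(somatic_junctions(tumor_high_conf, normal_all))
```

Subtracting the full normal list, rather than only its high-confidence junctions, mirrors the description above: a tumor junction is excluded whenever the normal sample shows any evidence for it.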
There are several columns of information in the allJunctionsBeta and highConfidenceJunctionBeta files that can be used to gauge the confidence level of the called junction. These same metrics are used to filter for high-confidence junctions reported in the highConfidenceJunctionBeta file.
Our pipeline has known limitations that we are working to improve. These limitations are:
Yes. For samples analyzed using Analysis Pipeline version 2.0 or later, the SV baseline genome set comprises 52 genomes from the Complete Genomics Diversity Panel. You can download a file that summarizes the detected junctions and their frequencies across the SV baseline set from the Complete Genomics FTP site.
The accompanying SV Baseline Genome Dataset: Data Format Description document provides the identifiers for each genome in the SV baseline set and describes the data file format for the SV baseline genome composite file. Note that the same genomes are used to construct our CNV, SV, and MEI baseline sets.