Sequins built for the human genome.
Synthetic DNA standards representing common and clinically-important genes and mutations.
Sequins for human genome sequencing.
Next-generation sequencing (NGS) has become a central tool in biomedical research and clinical diagnosis. However, the complexity of the human genome, combined with errors that accumulate during NGS, confounds accurate analysis and diagnosis.
Sequins are synthetic DNA standards that are ‘spiked-in’ to your DNA sample, and act as internal controls during genome sequencing. Sequins allow you to identify errors, measure diagnostic performance and improve the analysis of every NGS library.
Any human sequence can be mirrored to form a sequin sequence:
Whilst the mirrored sequin sequence is distinct, it retains the same nucleotide composition, repetitiveness and architecture as the original human sequence.
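The mirroring idea can be sketched in a few lines of Python. This is a minimal illustration that assumes "mirroring" means reversing the nucleotide order without complementing; the function name `mirror` and the example sequence are made up for illustration and are not taken from the sequin design software.

```python
from collections import Counter


def mirror(sequence: str) -> str:
    """Return the sequence with its nucleotide order reversed (no complementing)."""
    return sequence[::-1]


# Made-up example sequence, not a real human locus.
human = "ATGGCGCGTTTTACAC"
sequin = mirror(human)

print(sequin)
# The mirrored sequence differs from the original...
print(sequin != human)
# ...but has exactly the same nucleotide composition.
print(Counter(sequin) == Counter(human))
# Mirroring twice restores the original sequence.
print(mirror(sequin) == human)
```

Because reversal only changes the order of bases, properties such as GC content and repeat unit length carry over to the sequin unchanged.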
Additionally, the sequin performs equivalently to the human sequence during laboratory steps (PCR, hybridisation, library preparation and sequencing) and bioinformatic steps (alignment, variant detection, etc.). Therefore, the sequin is an ideal reference standard for the original human genome sequence.
Using sequins in NGS.
Sequins are added at a fractional concentration (typically ~1%) to a human DNA sample. The sample and sequins then undergo library preparation and sequencing together. As the sequins accompany the sample through the NGS workflow, they accumulate the same errors and bias.
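As a rough worked example of the spike-in arithmetic, the sketch below assumes that the ~1% figure refers to the sequin share of the combined DNA mass, and that the share of sequencing reads roughly tracks that mass share. Both are simplifying assumptions, and the function names are hypothetical.

```python
def sequin_mass_ng(sample_mass_ng: float, fraction: float = 0.01) -> float:
    """Mass of sequins (ng) to add so they make up `fraction` of the combined DNA."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    # Solve s / (sample + s) = fraction for s.
    return sample_mass_ng * fraction / (1 - fraction)


def expected_sequin_reads(total_reads: int, fraction: float = 0.01) -> int:
    """Approximate number of reads expected to come from sequins."""
    return round(total_reads * fraction)


# Example: 990 ng of human DNA, sequins at 1% of the combined mass.
print(sequin_mass_ng(990))
# Example: a library of one million reads.
print(expected_sequin_reads(1_000_000))
```

In practice the required amount depends on the mixture's supplied concentration and the protocol in use, so the product instructions take precedence over this calculation.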
Following sequencing, the sequin reads can be distinguished from the human reads by their synthetic sequence in the output library. The sequins can then be analyzed as internal qualitative and quantitative controls.
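One way to picture this separation step is shown below. It is a sketch that assumes reads have already been aligned to a combined reference in which the sequin sequences appear as extra contigs; the contig name `chrQS`, the function name and the read records are all hypothetical, and real workflows would read this information from an alignment file.

```python
# Hypothetical name for the contig(s) holding the synthetic sequin sequences.
SEQUIN_CONTIGS = {"chrQS"}


def split_alignments(alignments, sequin_contigs=SEQUIN_CONTIGS):
    """Partition (read_name, contig) pairs into sequin-derived and human-derived reads."""
    sequin_reads, human_reads = [], []
    for read_name, contig in alignments:
        if contig in sequin_contigs:
            sequin_reads.append(read_name)
        else:
            human_reads.append(read_name)
    return sequin_reads, human_reads


# Made-up alignment records for illustration.
alignments = [
    ("read1", "chr1"),
    ("read2", "chrQS"),
    ("read3", "chr7"),
    ("read4", "chrQS"),
]
sequin_reads, human_reads = split_alignments(alignments)
sequin_fraction = len(sequin_reads) / len(alignments)

print(sequin_reads, human_reads, sequin_fraction)
```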
Sequins allow you to measure the diagnostic performance (including true-positive, false-positive, and false-negative rate) for detecting different variant types in each individual NGS library.
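Because the variants encoded in the sequins are known in advance, these measures reduce to a set comparison. The sketch below assumes variants are represented as `(contig, position, ref, alt)` tuples on a hypothetical sequin contig `chrQS`; the function name and all variant records are made up for illustration and do not reflect the anaquin implementation.

```python
def diagnostic_performance(expected: set, called: set) -> dict:
    """Compare called variants against the known sequin variants."""
    tp = len(expected & called)   # known variants that were called
    fp = len(called - expected)   # calls that match no known variant
    fn = len(expected - called)   # known variants that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Made-up truth set and call set on a hypothetical sequin contig.
expected = {
    ("chrQS", 101, "A", "G"),
    ("chrQS", 250, "C", "T"),
    ("chrQS", 399, "G", "GA"),
    ("chrQS", 512, "TC", "T"),
}
called = {
    ("chrQS", 101, "A", "G"),
    ("chrQS", 250, "C", "T"),
    ("chrQS", 399, "G", "GA"),
    ("chrQS", 777, "A", "C"),   # not in the truth set: a false positive
}

result = diagnostic_performance(expected, called)
print(result)
```

Stratifying the same comparison by variant type (SNV, indel, and so on) gives a per-type view of performance within a single library.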
Sequins can also be integrated with laboratory information systems to routinely monitor laboratory and bioinformatic performance, and thereby inform operational decisions.
Sequins can also measure and mitigate technical variation, enabling more accurate normalization between large patient cohorts.
These analyses (and more) can be performed using the anaquin software toolkit.
What is in the genome mixture?
The genome mixture includes over 1,200 different sequins that cover almost 2.3 Mb of the human genome.
This comprehensive mixture encompasses the most important human genome regions, including:
Common genetic variants.
Each human genome harbors many common variants. We have developed a set of sequins representing homozygous and heterozygous SNVs and indels that are commonly found within human populations. These sequins provide a reference by which to evaluate the detection of germline variants with whole-genome sequencing and your choice of downstream bioinformatics tools.
Difficult genetic variants.
Low-complexity or repetitive sequences are among the most polymorphic sites in the human genome, and these difficult variants have established roles in a range of diseases. However, it can be difficult to distinguish small variants from sequencing or alignment errors at repetitive sites or within GC- or AT-rich sequences. Therefore, we have developed sequins that represent germline variants occurring at simple repeats (mono-, di-, tri- and quad-nucleotide) or within GC-/AT-rich regions. These can be used to evaluate the detection of variants in challenging regions of the human genome, and to understand the limitations of alternative protocols (e.g., comparing the performance of standard vs PCR-free library preparations).
We have developed a set of large sequins (>6 kb) that represent a pair of paternal/maternal haplotype sequences for each human chromosome (single paternal for the Y-chromosome). Each sequin encodes multiple SNVs and indels that are either shared (homozygous) or unique to the maternal/paternal haplotype (heterozygous). These sequins can be used to evaluate the performance of bioinformatic tools for variant discovery and phasing, or the accuracy of de novo sequence assembly. Moreover, the large size of these sequins makes them useful for the analysis of long-read sequencing technologies (e.g., PacBio, Oxford Nanopore, 10X Genomics).
Structural variants (SVs) are a major form of human genomic variation and have recurrent roles in inherited diseases and cancer. However, the size and diversity of SVs, and their common inclusion of repetitive sequences, pose challenges for SV detection using NGS. We have developed a set of sequins that represent a broad selection of SV types, including large deletions, inversions and tandem duplications, chromosomal translocations, and insertions of both exogenous viral sequences and mobile elements. This set includes both common SVs, which can often be compared to matched examples within an accompanying human DNA sample, and clinically relevant SVs, such as known oncogenic translocations and viral insertions. For each synthetic SV, the non-affected allele is also represented, thereby emulating a heterozygous genotype.
SV sequins provide ground-truth references that can measure the sensitivity for detecting different SV types, sizes, and breakpoints. This includes the evaluation of software tools that identify different types of SVs from different sources of evidence, including read-depth (e.g., CNVnator), chimeric reads (e.g., LUMPY), de novo sequence assembly (e.g., Pamir) or long-read sequencing (e.g., Picky).
Translocations occur when two different chromosomes are aberrantly joined together. This can result in fusion genes, which have established roles in cancer. However, the detection of translocations using NGS can be difficult due to a high rate of false-positive detection, and the confounding presence of repetitive sequences. We have developed a set of sequins that represent a range of translocations that recur in blood cancers and solid tumors. In each case, the non-translocated allele is also represented, thereby emulating a heterozygous genotype.
Inherited disease genes.
Heritable mutations in many human genes cause disease. We have developed a set of sequins that represent the clinically informative domains/exons from over 90 genes that are associated with heritable human diseases, including cystic fibrosis, haemophilia, cardiac myopathies, hereditary cancer and triplet-expansion disorders. This includes most genes recommended by the ACMG for reporting of incidental findings in clinical exome and genome sequencing. In each case, we have represented the human reference sequence, thereby providing an internal standard with which to interpret candidate variants detected in the accompanying human DNA sample. For example, by representing the HTT gene, we provide a standard that can be used to assess the reliability of a possible triplet repeat expansion in this gene, detected with NGS.
Pharmacogenes are involved in drug metabolism, and variants in these genes can affect an individual's response to drug treatments and their efficacy. However, the genotyping of many pharmacogenes, such as CYP2D6, is confounded by the presence of pseudogenes and copy-number variation. We have developed sequins that represent many important pharmacogenes, thereby providing an internal reference by which to interpret candidate variants detected in the accompanying human DNA sample.
Many human genes have been causatively associated with cancer, and the detection of mutations in these genes can inform patient prognosis and treatment. We have developed a set of sequins that represent the clinically informative domains/exons from over 100 genes causally associated with human cancers, such as BRCA1, TP53, ERBB2 and ALK. For each gene, we have represented the wild-type sequence, providing a reference with which to interpret candidate germline and somatic mutations detected in the accompanying human DNA sample. Cancer gene exons can provide internal controls for WGS or exome sequencing, and can be used during the design, validation and on-going quality control of targeted oncology gene panels.
The accurate detection of somatic mutations in tumor DNA samples can inform prognosis and treatment for cancer patients. However, due to the presence of non-cancerous and clonal cell populations within a sample, somatic mutations often occur at low variant allele frequencies (VAFs). High sequencing coverage is therefore required to detect these low-VAF mutations that can be difficult to distinguish from sequencing errors. We have developed a set of sequins that represent known cancer driver mutations at staggered concentrations to form a quantitative VAF ladder (100%-0.1%). This ladder provides an internal scale to evaluate the sensitivity, precision and quantitative accuracy with which somatic mutations can be detected in a given NGS library. To ensure compatibility with popular bioinformatic tools (e.g., Strelka2, Mutect2) that call somatic mutations in tumor DNA by comparison to a matched-normal sample, we also provide a separate ‘normal’ sequin mixture, in which only the wild-type gene sequence at each cancer mutation is represented, thereby providing a background against which somatic mutations can be called.
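The quantitative use of such a ladder can be sketched as follows: compute the observed VAF at each ladder step from read counts, then fit a line to expected versus observed VAF on a log-log scale. A slope near 1 and an intercept near 0 would suggest that measured frequencies track the expected ones. The read counts, function names and the plain least-squares fit are illustrative assumptions, not the method used by any particular toolkit.

```python
import math


def observed_vaf(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that support the variant allele."""
    depth = ref_reads + alt_reads
    return alt_reads / depth if depth else 0.0


def loglog_fit(expected, observed):
    """Least-squares slope and intercept of log10(observed) against log10(expected)."""
    pairs = [(math.log10(e), math.log10(o))
             for e, o in zip(expected, observed) if e > 0 and o > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up ladder: expected VAF -> (reference reads, variant reads) at that step.
ladder = {
    1.0: (0, 1000),
    0.1: (900, 100),
    0.01: (990, 10),
    0.001: (999, 1),
}
expected = list(ladder)
observed = [observed_vaf(ref, alt) for ref, alt in ladder.values()]
slope, intercept = loglog_fit(expected, observed)

print(observed)
print(slope, intercept)
```

Ladder steps with no variant-supporting reads are skipped by the fit; the lowest step that is still detected gives a rough indication of the detection limit for that library.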
Microsatellite instability (MSI) is indicative of mismatch repair deficiency in cancer, informing patient prognosis and treatment decisions. MSI diagnosis involves the detection of insertions and deletions at microsatellite sequences (short tandem repeats) throughout the genome. However, repeats are refractory to NGS analysis, and prone to sequencing and alignment errors that confound MSI diagnosis. To evaluate the diagnosis of MSI with WGS or targeted sequencing approaches, we have designed sequins that represent both stable and unstable instances of microsatellite loci that are commonly used as markers for MSI profiling (Bethesda panel). These allow false-positive and false-negative results to be identified during MSI profiling, and can be used to assess the impact of technical variables during library preparation and sequencing (e.g., read-length, number of PCR cycles).
T & B cell receptors.
Sequencing of immunoglobulin and T-cell receptor loci following somatic recombination and hyper-mutation can reveal the immune-repertoire within a sample, and can indicate the presence of clonal immune-cell populations. However, due to the number, repetitiveness and complexity of possible clonotypes, immune-repertoire profiling remains challenging. To improve and standardize this technique, we have developed sequins that represent somatically rearranged immunoglobulin (IgH, IgL and IgK) and T-cell receptor genes (TCRA/D, TCRB and TCRG). These can be used to measure the accuracy with which clonotype sequences are determined, and can indicate the quantitative accuracy of measurements of clonal cell populations. Immune sequins are compatible with WGS, targeted immune repertoire sequencing, as well as emerging long-read sequencing techniques that enable single-molecule B- and T-cell receptor characterization.
Human leukocyte antigen (HLA) alleles.
The human leukocyte antigen (HLA) genes have established roles in autoimmune disease aetiology, adverse drug reactions and cancer. However, accurate genotyping of HLA genes can be difficult due to high rates of polymorphism, and the presence of additional homologous sequences in the genome. Accordingly, we have developed sequins that represent two common human alleles for each of the major HLA genes (HLA-A, HLA-B, HLA-C, HLA-DR and HLA-DQ). These provide internal reference standards for HLA-typing by WGS, exome or targeted HLA sequencing approaches, and can be used to benchmark HLA-typing software (e.g., HLAscan).
Viral (HPV) insertions.
The insertion of human papilloma virus (HPV) is associated with a range of cancers, including ovarian and neck and throat cancers. However, the detection of HPV insertions can be difficult due to foreign and repetitive sequences and structural variation at insertional sites. We have developed sequins that represent the insertion of HPV into various sites within the human genome. In each case, the non-affected allele is also represented, thereby emulating a heterozygous genotype. These HPV sequins provide useful ground-truth references to determine the sensitivity and accuracy by which viral insertions are resolved using NGS.
Use of synthetic DNA spike-in controls (sequins) for human genome sequencing. (2019) Blackburn et al.
Representing human genetic variation with synthetic DNA standards. (2016) Deveson et al.
Chiral DNA sequences as commutable reference standards for clinical genomics. (2019) Deveson et al.
Using sequins with human whole genome sequencing.