Introducing the HP Advanced Custom Recipe for NextSeq 1000/2000 XLEAP-SBS P3 and P4 300 cycle kits

Published October 3, 2024

At Illumina, innovation and research align with a commitment to scientific access, data, and community, and we continuously strive to bring the latest advancements in genomics to our users. We are happy to announce that the HP Advanced Custom Recipe is now available to download. Our research teams developed this recipe by modifying the standard clustering protocol to improve sequencing in difficult-to-read regions. The HP Advanced Custom Recipe may be of interest to users who need higher variant calling performance in certain classes of repetitive sequences for research purposes, so we have released it on our Advanced Research Protocols portal.

The HP Advanced Custom Recipe has been tested on the NextSeq 2000 P3 and P4 XLEAP-SBS reagent kits and has been demonstrated to significantly reduce errors and missed calls associated with strings of repeated nucleotides (homopolymers) and dinucleotide motifs.

Important notice

Sequencing recipes, scripts, and protocols released through the Advanced Research Protocols portal have been developed and tested by Illumina R&D scientists but have not gone through the formal product development process. As a result, official specifications may not apply when using these protocols; that is, reported Q30 score, output, and run time may vary relative to instrument specifications. The Illumina products mentioned herein are intended for research use only, not for use in diagnostic procedures. Support for Illumina Research scripts falls outside the scope of Illumina’s standard service plan coverage; however, select on-demand services may be available. Contact your sales representative for more information.

Relevance of homopolymers and dinucleotide repeats

Homopolymers are runs of a single repeated nucleotide (for example, AAAAAAAAAAA), while dinucleotide repeats are runs of a repeated two-nucleotide motif (for example, ATATATATAT). Repetitive sequences can create challenges for alignment and variant calling, owing to their low complexity, heterogeneity, movement, and duplication within the genome.1,2 Sequencing artifacts in these contexts are marked by a higher rate of mismatches and increased soft-clipping, and can negatively impact variant calling performance.3,4 Illumina DRAGEN secondary analysis software provides speed and accuracy for variant calling and incorporates specialized methods that address the challenges created by repetitive sequences.5,6

To assess the relevance of the sequence contexts addressed by this custom recipe, we cross-referenced genomic stratifications for these sequence contexts with the ClinVar database (August 25, 2024 release). Table 1 details the sizes of homopolymers, dinucleotide repeats, and associated flanking regions relative to the genome, as well as the density for each stratification of ClinVar germline pathogenic and likely pathogenic variants with review status ≥ 2 stars. While homopolymers, dinucleotide repeats, and their flanking regions have a lower density of ClinVar variants than the genome average, together they account for a total of 1,205 of these variants.

Table 1: Size and relevance of low-complexity regions
| Stratification | Definition | Region size (% of genome) | Number of ClinVar P+LP 2+ variants | Density of ClinVar P+LP 2+ variants (variants per Mb) |
| --- | --- | --- | --- | --- |
| Homopolymers (≥ 10 bp) | Perfect homopolymers of length ≥ 10 bp | 0.50% | 73 | 5.1 |
| Homopolymer flanks (50 bp) | 50 bp regions flanking perfect homopolymers of length ≥ 10 bp | 3.27% | 693 | 7.4 |
| Dinucleotide repeats ≥ 5 | Perfect repeats of dinucleotide motifs with size ≥ 5 repeats | 0.30% | 59 | 6.8 |
| Dinucleotide repeat flanks (50 bp) | 50 bp regions flanking perfect repeats of dinucleotide motifs with size ≥ 5 repeats | 1.53% | 380 | 8.6 |
| Exome | All exons | 3.3% | 59,450 | 616.8 |
| Genome-wide (autosomes) | All autosomal chromosomes (chr1-chr22) | 100% | 66,016 | 23.0 |
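The density column in Table 1 is the variant count divided by the region size in megabases. The following Python sketch recomputes it for the four low-complexity stratifications. The autosomal size used here is an assumption inferred from the genome-wide row (66,016 variants at 23.0 variants per Mb); it is not stated in the article, and the variable and function names are illustrative only.

```python
# Rows copied from Table 1:
# (region size as % of genome, ClinVar P+LP 2+ variant count, reported density)
TABLE_1 = {
    "Homopolymers (>= 10 bp)": (0.50, 73, 5.1),
    "Homopolymer flanks (50 bp)": (3.27, 693, 7.4),
    "Dinucleotide repeats >= 5": (0.30, 59, 6.8),
    "Dinucleotide repeat flanks (50 bp)": (1.53, 380, 8.6),
}

# Assumption: autosomal size (in Mb) inferred from the genome-wide row of
# Table 1 (66,016 variants / 23.0 variants per Mb); not given in the article.
AUTOSOME_MB = 66016 / 23.0


def density_per_mb(region_pct, variant_count, genome_mb=AUTOSOME_MB):
    """Return variants per Mb for a region covering region_pct % of the genome."""
    region_mb = genome_mb * region_pct / 100.0
    return variant_count / region_mb


# Sum of the variant counts across the four low-complexity stratifications.
total_variants = sum(count for _, count, _ in TABLE_1.values())

# Recomputed density for each stratification.
densities = {
    name: density_per_mb(pct, count) for name, (pct, count, _) in TABLE_1.items()
}

for name, (_, _, reported) in TABLE_1.items():
    print(f"{name}: recomputed {densities[name]:.2f}, reported {reported} variants per Mb")
print(f"Total variants across the four stratifications: {total_variants}")
```

Because the region sizes in Table 1 are rounded percentages, the recomputed densities can differ from the reported values in the first decimal place.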


How does the recipe perform?

To assess the performance of the HP Advanced Custom Recipe, we first sequenced NA24385 (HG002) with PCR-free whole-genome sequencing. NA24385 is a human cell line sample that has been well characterized by the Genome in a Bottle Consortium, which generated a truth set of small variants that can be used for benchmarking purposes.7

We prepared replicates of NA24385 (HG002) using the TruSeq DNA PCR-Free library prep kit and sequenced the libraries at 2 × 151 bp read length with NextSeq 2000 P4 XLEAP-SBS reagents. We sequenced the libraries in three runs performed with the HP Advanced Custom Recipe and three runs performed with the default recipe available in the NextSeq 1000/2000 Control Software Suite v1.7.1. We then downsampled the sequencing data to 30× coverage and analyzed it with the DRAGEN Germline v4.3.6 workflow for variant calling comparisons. Average run metrics across the three HP Advanced Custom Recipe runs are shown in Table 2.

Table 2: Average primary metrics and run times for three runs performed with the HP Advanced Custom Recipe
Note that specifications using the HP Advanced Custom Recipe may fall below published reagent specifications.
| Metric | HP Advanced Custom Recipe with XLEAP-SBS P4 Reagent Kit (300 Cycles) [c] | NextSeq 1000/2000 XLEAP-SBS P4 Reagent Kit (300 Cycles) Specification [a,b,c] |
| --- | --- | --- |
| Reads passing filter | 1.76 B | 1.8 B |
| Yield (Gb) | 515 Gb | 540 Gb |
| Quality score | 91.39% ≥ Q30 | 90% ≥ Q30 |
| Run time | 47 hours, 27 minutes | 44 hours |

a. Output specifications are based on an Illumina PhiX control library at supported cluster densities.
b. Quality scores are based on an Illumina PhiX control library; performance may vary based on library type and quality, insert size, loading concentration, and other experimental factors.
c. Run time includes cluster generation, sequencing, and base calling.
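To put the Table 2 values in relative terms, the short Python sketch below computes the percent difference between the averaged HP Advanced Custom Recipe metrics and the published P4 specification. The input numbers are copied from Table 2; the function names are illustrative, and the comparison is simple arithmetic rather than an Illumina-defined metric.

```python
def percent_difference(observed, specification):
    """Signed difference of observed from specification, as a percentage."""
    return 100.0 * (observed - specification) / specification


def to_minutes(hours, minutes=0):
    """Convert a run time given in hours and minutes to total minutes."""
    return hours * 60 + minutes


# Values copied from Table 2 (custom recipe average vs. kit specification).
reads_diff = percent_difference(1.76, 1.8)        # reads passing filter, billions
yield_diff = percent_difference(515, 540)         # yield, Gb
runtime_diff = percent_difference(to_minutes(47, 27), to_minutes(44))
q30_diff_points = 91.39 - 90.0                    # difference in percentage points

print(f"Reads passing filter: {reads_diff:+.1f}%")
print(f"Yield: {yield_diff:+.1f}%")
print(f"Run time: {runtime_diff:+.1f}%")
print(f"Bases >= Q30: {q30_diff_points:+.2f} percentage points")
```

Note that the specification values are based on a PhiX control library (footnotes a and b), whereas the custom recipe values come from human PCR-free libraries, so the differences are indicative only.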


Homopolymer resolution

To investigate correct resolution of homopolymer regions, we calculated the accuracy of the homopolymer length reported in sequenced reads. To exclude any potential variant from the accuracy calculation, we considered only homopolymers that lie in the confident regions of the NIST 4.2.1 truth set and are more than 50 bp away from any true variant. For those homopolymers, we compared the length reported in the reads that fully span the event to the length in the human reference GRCh38 to calculate an accuracy metric. This assessment demonstrated high accuracy for short homopolymers (< 10 bp) with both the XLEAP-SBS Standard Recipe and the HP Advanced Custom Recipe. For longer homopolymers, which are more challenging for sequencing technologies, the HP Advanced Custom Recipe yielded a significant improvement in accuracy compared to the XLEAP-SBS Standard Recipe.

Figure 1: Homopolymer length accuracy by homopolymer length—Performances for libraries sequenced on three runs with XLEAP-SBS Standard Recipe are shown in blue; performances for libraries sequenced on three runs with HP Advanced Custom Recipe are shown in orange.
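The homopolymer length comparison described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline used for Figure 1. It assumes that accuracy is the fraction of fully spanning reads whose observed run length equals the reference length (the article does not give the exact formula), and it operates on plain read strings rather than aligned BAM records. The function names and example reads are hypothetical.

```python
from collections import defaultdict


def observed_run_length(read, anchor):
    """Length of the homopolymer run covering read[anchor].

    Returns None when the run touches either end of the read, because such
    a read does not fully span the homopolymer.
    """
    base = read[anchor]
    left = anchor
    while left > 0 and read[left - 1] == base:
        left -= 1
    right = anchor
    while right < len(read) - 1 and read[right + 1] == base:
        right += 1
    if left == 0 or right == len(read) - 1:
        return None
    return right - left + 1


def accuracy_by_length(observations):
    """Fraction of spanning reads reporting the reference length.

    observations: iterable of (reference_length, observed_length) pairs;
    pairs with observed_length None (non-spanning reads) are skipped.
    Assumption: accuracy = exact-length matches / spanning reads.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ref_len, obs_len in observations:
        if obs_len is None:
            continue
        totals[ref_len] += 1
        if obs_len == ref_len:
            correct[ref_len] += 1
    return {length: correct[length] / totals[length] for length in sorted(totals)}


# Hypothetical reads covering a reference homopolymer of five A bases.
reads = ["GCAAAAATG", "GCAAAATG", "AAAAATG"]
pairs = [(5, observed_run_length(read, 2)) for read in reads]
print(accuracy_by_length(pairs))
```

In the example, the third read starts inside the homopolymer, so it is excluded as non-spanning, which mirrors the article's restriction to reads that fully span the event.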

Variant calling performance

We measured analytical sensitivity and specificity for small variants called with the DRAGEN 4.3.6 multigenome (graph) aligner against the NIST 4.2.1 benchmarking set. Variant calling errors are shown genome-wide (Figure 2), in homopolymers ≥ 10 bp and their 50 bp flanks (Figure 3), and in dinucleotide repeats ≥ 5 repeats and their 50 bp flanks (Figure 4).

On average, the custom recipe delivers an 8% reduction in small variant errors, with consistent benefits in low-complexity regions. Notably, the impact of improved resolution of homopolymers and their flanks can be only partially demonstrated using the NIST 4.2.1 benchmark, which covers only 69% of these regions.

Figure 2: Small variants called with DRAGEN 4.3.6 and measured against the NIST 4.2.1 benchmark

Figure 3: Small variants called with DRAGEN 4.3.6 measured against the NIST 4.2.1 benchmark in homopolymers ≥ 10 bp and 50 bp flanks

Figure 4: Small variants called with DRAGEN 4.3.6 measured against the NIST 4.2.1 benchmark in dinucleotide repeats ≥ 5 repeats and 50 bp flanks
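A relative error reduction such as the one reported above can be computed as in the Python sketch below. It assumes that variant calling errors are counted as false positives plus false negatives against the benchmark, which the article does not state explicitly. The counts are hypothetical placeholders chosen for illustration and are not results from this study.

```python
def total_errors(false_positives, false_negatives):
    """Assumption: errors = false positive calls + false negative (missed) calls."""
    return false_positives + false_negatives


def relative_error_reduction(baseline_errors, new_errors):
    """Percent reduction in errors relative to the baseline recipe."""
    return 100.0 * (baseline_errors - new_errors) / baseline_errors


# Hypothetical placeholder counts, NOT taken from the article.
standard_errors = total_errors(false_positives=400, false_negatives=600)
custom_errors = total_errors(false_positives=350, false_negatives=550)

reduction = relative_error_reduction(standard_errors, custom_errors)
print(f"Relative error reduction: {reduction:.1f}%")
```

The same calculation can be restricted to a stratification (for example, homopolymers and flanks) by counting only the errors that fall within those regions.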

Examples of improved support for variant calling

Figures 5–7 provide examples of locations of true variants in the NA24385 (HG002) genome where HP Advanced Custom Recipe has enabled correct variant calling through improved resolution of homopolymers and dinucleotide repeats.

Figure 5: A heterozygous CA>A deletion at the end of a homopolymer in the RARS2 gene is missed with the XLEAP-SBS Standard Recipe due to insufficient read support—Reads sequenced with the HP Advanced Custom Recipe have higher homopolymer length accuracy and enable correct calling of the variant.

Figure 6: A heterozygous A>ATG insertion at the start of a long dinucleotide repeat is missed with the XLEAP-SBS Standard Recipe due to insufficient read support—Reads sequenced with the HP Advanced Custom Recipe support the presence of an insertion on both strands and enable correct calling of the variant.

Figure 7: A multiallelic CTTT>CTT,C deletion at the start of a homopolymer is called with both XLEAP-SBS Standard Recipe and HP Advanced Custom Recipe sequencing data—However, due to higher noise in the XLEAP-SBS Standard Recipe reads, only the CTTT>CTT allele is detected with that data and the deletion is incorrectly genotyped. HP Advanced Custom Recipe data enables the detection of both alleles and correct genotyping.

How to access the HP Advanced Custom Recipe

The HP Advanced Custom Recipe has been tested only with the NextSeq 2000 P3 and P4 XLEAP-SBS Reagent Kits and is compatible only with those kits. It is not currently available on other platforms or kit configurations. To enable the custom recipe for a sequencing run, please visit the Advanced Research Protocols web page to download the recipe file.

References:

1. Liao X, Zhu W, Zhou J, et al. Repetitive DNA sequence detection and its role in the human genome. Commun Biol. 2023;6(954). doi:10.1038/s42003-023-05322-y

2. Rajan-Babu I-S, Dolzhenko E, Eberle MA, Friedman JM. Sequence composition changes in short tandem repeats: heterogeneity, detection, mechanisms and clinical implications. Nat Rev Genet. 2024;25:476-499. doi:10.1038/s41576-024-00696-z

3. Singer-Berk M, Gudmundsson S, Baxter S, et al. Advanced variant classification framework reduces the false positive rate of predicted loss-of-function variants in population sequencing data. Am J Hum Genet. 2023;110(9):1496-1508. doi:10.1016/j.ajhg.2023.08.005

4. Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom Bioinform. 2021;3(1). doi:10.1093/nargab/lqab019

5. Behera S, Catreux S, Rossi M, et al. Comprehensive and accurate genome analysis at scale using DRAGEN accelerated algorithms. Preprint. bioRxiv. 2024;2024.01.02.573821. Published 2024 Jan 6. doi:10.1101/2024.01.02.573821

6. Illumina. Fully featured genome: Expanding the hunt for genomic variation with DRAGEN STR. illumina.com/science/genomics-research/articles/str-expansionhunter.html. Published October 10, 2022. Accessed September 13, 2024.

7. National Institute of Standards and Technology. Genome in a Bottle. nist.gov/programs-projects/genome-bottle. Accessed September 13, 2024.