Multiomics interpretation and integration

Illumina Connected Multiomics provides a data science platform that streamlines 5-base methylation and genomic multiomic analyses. The platform enables teams to design, experiment, collaborate seamlessly, and interact with traditionally complex workflows in real time. Connected Multiomics transforms raw data into actionable biological insights. It digests DRAGEN outputs into a unified, multisample data structure that facilitates cohort-level analyses. This architecture simplifies common tasks such as data quality filtering, unsupervised clustering, and differential methylation analysis. Furthermore, it enables multiomic integration of informative methylation features and genomic variants. Here, we demonstrate a representative analysis workflow to showcase the capabilities of Connected Multiomics with an acute myeloid leukemia (AML) sample cohort.

Data quality control 

The platform first ingests the outputs of DRAGEN and summarizes the data set at the multisample cohort level. Figure 1 illustrates an automatically generated dashboard that visualizes the distribution of common whole-genome sequencing quality control metrics across the cohort. Percent methylation per sample is defined as the average methylation level across all CpG positions in the sample genome. Percent unmethylated control and percent methylated control represent the average methylation across all CpG positions in spiked-in control genomes and are used to assess methylation conversion efficiency. Higher methylation levels in the methylated control and lower methylation levels in the unmethylated control indicate improved conversion quality.

Figure 1: Quality control dashboard showing relevant methylation conversion and sequencing quality metrics
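
The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function name, the input layout (methylated and total read counts per CpG), and all counts are assumptions made up for the example.

```python
from statistics import mean


def percent_methylation(cpg_counts):
    """Average per-CpG methylation level, expressed as a percentage.

    cpg_counts: iterable of (methylated_reads, total_reads), one per CpG.
    Positions without coverage are skipped.
    """
    levels = [m / n for m, n in cpg_counts if n > 0]
    return 100.0 * mean(levels)


# Hypothetical read counts, for illustration only.
sample_cpgs = [(8, 10), (9, 10), (2, 10), (5, 10)]
methylated_control = [(99, 100), (98, 100)]
unmethylated_control = [(1, 100), (0, 100)]

sample_pct = percent_methylation(sample_cpgs)
meth_ctrl_pct = percent_methylation(methylated_control)
unmeth_ctrl_pct = percent_methylation(unmethylated_control)
```

In this toy case, the methylated control sits close to 100% and the unmethylated control close to 0%, which is the pattern that indicates good conversion.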

Figure 2 shows how you can visualize a histogram over a QC metric of interest and set custom filters. These filters can exclude samples to potentially improve the quality of downstream data analysis. 

Figure 2: Cohort filtering interface to exclude samples with poor quality control metrics
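
Conceptually, the cohort filter is a set of per-metric bounds applied to every sample. The sketch below shows one way to express that logic. The metric names, thresholds, and sample values are hypothetical and are not the platform's defaults.

```python
def filter_samples(qc_table, thresholds):
    """Split samples into kept and excluded lists.

    qc_table:   {sample_id: {metric_name: value}}
    thresholds: {metric_name: (minimum, maximum)}; None means unbounded.
    """
    kept, excluded = [], []
    for sample_id, metrics in qc_table.items():
        passes = True
        for metric, (lower, upper) in thresholds.items():
            value = metrics[metric]
            if lower is not None and value < lower:
                passes = False
            if upper is not None and value > upper:
                passes = False
        (kept if passes else excluded).append(sample_id)
    return kept, excluded


# Hypothetical QC values and cutoffs, for illustration only.
qc_table = {
    "S1": {"pct_methylated_control": 98.5, "pct_unmethylated_control": 0.4, "mean_coverage": 35.0},
    "S2": {"pct_methylated_control": 91.0, "pct_unmethylated_control": 0.5, "mean_coverage": 33.0},
    "S3": {"pct_methylated_control": 98.9, "pct_unmethylated_control": 2.8, "mean_coverage": 31.0},
    "S4": {"pct_methylated_control": 98.2, "pct_unmethylated_control": 0.6, "mean_coverage": 12.0},
}
thresholds = {
    "pct_methylated_control": (95.0, None),
    "pct_unmethylated_control": (None, 1.0),
    "mean_coverage": (20.0, None),
}

kept, excluded = filter_samples(qc_table, thresholds)
```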

Supervised and unsupervised clustering 

After a sample cohort is defined, you can perform exploratory analyses, such as clustering, to visualize global structure and heterogeneity within the data set. Connected Multiomics supports clustering at both single-CpG resolution and over aggregated genomic features, such as promoter regions, where CpG methylation is averaged across each feature. In addition, you can define custom feature sets tailored to the context of the study to enhance clustering performance further. 
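
As a rough sketch of feature-level aggregation, the snippet below averages per-CpG methylation levels inside each region of a feature set. The coordinates, region names, and data layout are assumptions for illustration. A production implementation would use an interval index rather than this simple scan.

```python
from statistics import mean


def aggregate_by_feature(cpg_levels, features):
    """Average CpG methylation within each genomic feature.

    cpg_levels: {(chrom, position): methylation level between 0 and 1}
    features:   {name: (chrom, start, end)} with inclusive coordinates
    Features without any covered CpG are reported as None.
    """
    aggregated = {}
    for name, (chrom, start, end) in features.items():
        inside = [
            level
            for (cpg_chrom, pos), level in cpg_levels.items()
            if cpg_chrom == chrom and start <= pos <= end
        ]
        aggregated[name] = mean(inside) if inside else None
    return aggregated


# Hypothetical CpG levels and promoter-like regions, for illustration only.
cpg_levels = {
    ("chr1", 100): 0.9,
    ("chr1", 150): 0.7,
    ("chr1", 900): 0.1,
    ("chr2", 120): 0.2,
    ("chr2", 180): 0.4,
}
features = {
    "promoter_1": ("chr1", 50, 200),
    "promoter_2": ("chr2", 100, 200),
    "promoter_3": ("chr3", 1, 500),
}

feature_levels = aggregate_by_feature(cpg_levels, features)
```

Swapping the `features` dictionary for a custom region set is all it takes to change the feature space used for clustering in this sketch.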

Figure 3 illustrates how you can evaluate principal component analysis (PCA) clustering performance using generic promoter regions or a custom region set of AML-specific epigenomic features. Notably, certain AML subtypes, including KMT2Ar and IDH-mutant cases, show improved separation when clustering is performed using AML-specific features. To enhance clustering performance further, non-linear dimensionality reduction methods, such as UMAP and t-SNE, are also supported. However, these methods often require parameter optimization.

Figure 3: Two different PCA visualizations based on different genomic features

For uniform manifold approximation and projection (UMAP), parameters such as the number of principal components and the number of nearest neighbors must be carefully tuned. Figure 4 illustrates how you can set up multiple UMAP optimizations and visualize the results together. From this UMAP parameter screen, UMAP parameter set 3 achieves strong separation of all AML subtypes. 

Figure 4: Typical parameter screening for UMAP clustering

To validate the clustering results, Figure 5 shows the application of k-means clustering over a range of cluster numbers, identifying five as the optimal number for this data set. You can then annotate the UMAP with the k-means cluster labels, with the number of clusters set to five. This quantitative agreement supports the biological relevance of the visually observed clusters.

Figure 5: K-means cluster parameter screening and confirmation of clusters from UMAP Parameter Set 3
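
A screen over cluster numbers can be sketched as follows. The text does not state which criterion the platform uses to pick the optimum, so the mean silhouette score used here is an assumption. The k-means routine is a plain Lloyd's algorithm with a deterministic farthest-point start, and the 2-D coordinates are toy data with three obvious groups, not the AML embedding (where five clusters were optimal).

```python
import math


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Add the point that is farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette_score(points, labels):
    """Mean silhouette over all points (0 for singleton clusters)."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                             # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())    # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D embedding with three well-separated groups (illustrative only).
points = [
    (0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (0.4, 0.4),
    (10.0, 10.0), (10.4, 9.8), (9.7, 10.3), (10.2, 10.5),
    (0.0, 10.0), (0.3, 10.4), (-0.2, 9.7), (0.5, 9.9),
]

scores_by_k = {k: silhouette_score(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores_by_k, key=scores_by_k.get)
best_labels = kmeans(points, best_k)
```

The labels in `best_labels` play the role of the k-means annotation that is overlaid on the UMAP.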

Differentially methylated region calling 

Connected Multiomics streamlines the identification of differentially methylated regions (DMRs) by integrating DSS (dispersion shrinkage for sequencing data), a widely used DMR caller, directly into its interactive sandbox environment. Sample groupings can be defined from metadata or from cluster labels produced by the PCA/UMAP tasks. DSS models methylation at each CpG position with a beta-binomial distribution, and statistically significant differentially methylated positions between sample groups are stitched together to create DMRs. Figure 6 shows how DMRs can be visualized and filtered for downstream analyses. Consistent with the literature, AML patients carrying IDH mutations typically have broadly hypermethylated phenotypes, which results in a larger number of hypermethylated DMRs than hypomethylated ones. The diff.Methy metric represents the average methylation difference between the two sample groups over a specific genomic region, and the length is the base-pair length of the DMR. The areaStat metric sums the test statistics of all the CpG positions in a DMR, so it is most strongly correlated with DMR length. Longer DMRs with larger methylation differences yield a larger absolute areaStat value. Significance labels are provided as a guide to help you interpret DMRs at a glance. However, biological context and study-specific priors must guide the interpretation of DMRs.
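
To make the beta-binomial model more concrete, the sketch below evaluates its probability mass function for the number of methylated reads at one CpG. It assumes a mean/dispersion parameterization (mean methylation `mu`, dispersion `phi`), which is one common way to write the model. The function names and example values are illustrative and are not taken from the DSS code.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(x, n, mu, phi):
    """P(X = x) methylated reads out of n total reads.

    Assumed parameterization: the underlying methylation level follows a
    beta distribution with mean mu and dispersion phi (0 < phi < 1), so
    alpha = mu * (1 - phi) / phi and beta = (1 - mu) * (1 - phi) / phi.
    """
    alpha = mu * (1 - phi) / phi
    beta = (1 - mu) * (1 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return exp(log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))


# Hypothetical CpG: 20 reads, mean methylation 0.7, dispersion 0.1.
n_reads, mu, phi = 20, 0.7, 0.1
pmf = [beta_binomial_pmf(x, n_reads, mu, phi) for x in range(n_reads + 1)]
expected_methylated = sum(x * p for x, p in enumerate(pmf))
variance = sum((x - expected_methylated) ** 2 * p for x, p in enumerate(pmf))
binomial_variance = n_reads * mu * (1 - mu)
```

The dispersion term makes the variance larger than that of a plain binomial with the same mean, which is how the model accommodates biological variability between samples.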

Figure 6: DSS DMR calling results volcano plot based on typically useful DMR metrics
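
The DMR metrics described above can be illustrated with a small summary function. This is a simplified sketch that mirrors the definitions in the text (areaStat as the sum of per-CpG test statistics). The input layout and all values are hypothetical.

```python
def summarize_dmr(cpgs):
    """Summarize one DMR from its per-CpG records.

    cpgs: list of dicts sorted by position, each with
          'pos', 'mean_a', 'mean_b' (group mean methylation) and
          'stat' (per-CpG test statistic).
    """
    return {
        "length": cpgs[-1]["pos"] - cpgs[0]["pos"] + 1,
        "nCG": len(cpgs),
        "diff.Methy": sum(c["mean_a"] - c["mean_b"] for c in cpgs) / len(cpgs),
        "areaStat": sum(c["stat"] for c in cpgs),
    }


# Hypothetical DMRs with the same per-CpG effect but different lengths.
short_dmr = [
    {"pos": 100, "mean_a": 0.8, "mean_b": 0.3, "stat": 4.0},
    {"pos": 150, "mean_a": 0.8, "mean_b": 0.3, "stat": 5.0},
    {"pos": 220, "mean_a": 0.8, "mean_b": 0.3, "stat": 4.5},
]
long_dmr = short_dmr + [
    {"pos": 300, "mean_a": 0.8, "mean_b": 0.3, "stat": 4.0},
    {"pos": 380, "mean_a": 0.8, "mean_b": 0.3, "stat": 5.0},
    {"pos": 450, "mean_a": 0.8, "mean_b": 0.3, "stat": 4.5},
]

short_summary = summarize_dmr(short_dmr)
long_summary = summarize_dmr(long_dmr)
```

Both toy DMRs have the same diff.Methy, but the longer one accumulates twice the areaStat, which shows why areaStat tracks DMR length so closely.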

Pathway analysis

Following DMR calling, Connected Multiomics facilitates the translation of DMRs into more functional inferences. Figure 7 shows how DMRs of interest can be filtered for high methylation differences (for example, greater than 0.2 methylation difference) and annotated with the names of genes whose transcription start or stop sites lie within 5 kb. You can customize the maximum genomic distance to tune the interpretation of DMR-gene associations relevant to the biological context of your study.
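
The filter-then-annotate step can be sketched as below, using the example thresholds from the text (0.2 methylation difference, 5 kb). The record layout, gene names, and coordinates are placeholders, and the linear scan stands in for whatever indexing the platform actually uses.

```python
def annotate_dmrs(dmrs, genes, min_abs_diff=0.2, max_distance=5000):
    """Keep DMRs with a large methylation difference and attach nearby genes.

    A gene is attached when its transcription start site (tss) or stop
    site (tes) lies within max_distance base pairs of the DMR.
    """
    annotated = []
    for dmr in dmrs:
        if abs(dmr["diff"]) <= min_abs_diff:
            continue
        hits = []
        for gene in genes:
            if gene["chrom"] != dmr["chrom"]:
                continue
            for site in (gene["tss"], gene["tes"]):
                # Distance from the DMR interval to the site (0 if inside).
                distance = max(dmr["start"] - site, site - dmr["end"], 0)
                if distance <= max_distance:
                    hits.append(gene["name"])
                    break
        annotated.append({**dmr, "genes": hits})
    return annotated


# Hypothetical DMRs and gene models, for illustration only.
dmrs = [
    {"chrom": "chr1", "start": 10_000, "end": 10_500, "diff": 0.35},
    {"chrom": "chr1", "start": 50_000, "end": 50_400, "diff": -0.05},
    {"chrom": "chr2", "start": 200_000, "end": 201_000, "diff": -0.40},
]
genes = [
    {"name": "GENE_A", "chrom": "chr1", "tss": 12_000, "tes": 30_000},
    {"name": "GENE_B", "chrom": "chr2", "tss": 150_000, "tes": 190_000},
    {"name": "GENE_C", "chrom": "chr2", "tss": 230_000, "tes": 204_000},
]

annotated = annotate_dmrs(dmrs, genes)
```

Raising or lowering `max_distance` changes which DMR-gene associations are reported, which is the tuning knob described above.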

Because DNA methylation typically regulates gene expression at promoters, most DMRs associated with genes are localized to transcription start site (TSS) regions. Depending on the applied filtering criteria, identified genes can exhibit either hypo- or hypermethylation relative to the IDH-mutant patient group. These gene-level findings can be contextualized further at the pathway level using the Connected Multiomics integrated gene set enrichment analysis. This functionality enables a broader interpretation of the underlying biological processes.

Figure 7: Annotation of DMRs of interest with nearest genes and gene set enrichment identifying gene pathways of interest
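
The text does not specify which enrichment statistic the integrated analysis uses. As one simple illustration of pathway-level contextualization, the sketch below runs a hypergeometric over-representation test, a swapped-in technique rather than the platform's method. All counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, selected, pathway_size, background):
    """One-sided hypergeometric p-value, P(X >= hits).

    hits:         DMR-associated genes that belong to the pathway
    selected:     total number of DMR-associated genes
    pathway_size: number of genes in the pathway
    background:   number of genes in the background set
    """
    upper = min(selected, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(background, selected)


# Hypothetical counts: 40 DMR-associated genes, 10 of them in a 50-gene
# pathway, against a background of 1000 genes (2 expected by chance).
p_enriched = enrichment_p_value(hits=10, selected=40, pathway_size=50, background=1000)
p_expected = enrichment_p_value(hits=2, selected=40, pathway_size=50, background=1000)
p_trivial = enrichment_p_value(hits=0, selected=40, pathway_size=50, background=1000)
```

In practice, p-values computed across many pathways would also need multiple-testing correction before interpretation.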

Multiomic analysis

Variant analysis modules 

Connected Multiomics provides a unified environment for integrating methylation and genomic variant analyses, unlocking the multiomic potential of the Illumina 5-base assay. The representative workflow described in this section overlays DMRs with genes containing small genomic variants, including single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). Figure 8 shows how variants can be filtered using standard variant call format (VCF) fields such as depth (DP). In addition, Connected Multiomics uses Illumina-specific and popular public databases to refine the variants of interest further. For example, gnomAD, the DRAGEN Haplotype Database, and PrimateAI can be used to remove germline variants from somatic variant calling results. PromoterAI can be used to predict gene activity. Figure 9 shows how variants can also be viewed at the cohort level to identify variants shared across samples.
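
Filtering on a standard VCF field can be sketched with plain text parsing. The snippet below keeps records whose INFO column carries a DP value at or above a cutoff. The cutoff of 20, the choice to read DP from INFO rather than from the per-sample FORMAT column, and the example records are assumptions for illustration.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=42;AF=0.48' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def filter_by_depth(vcf_lines, min_dp=20):
    """Return (chrom, pos, ref, alt, depth) for records with INFO DP >= min_dp.

    Header lines and records without a DP entry are skipped.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
        depth = parse_info(info).get("DP")
        if depth is not None and int(depth) >= min_dp:
            kept.append((chrom, int(pos), ref, alt, int(depth)))
    return kept


# Hypothetical VCF content, for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=42;AF=0.48",
    "chr1\t2000\t.\tC\tT\t12\tPASS\tDP=7;AF=0.15",
    "chr2\t3000\t.\tG\tGA\t60\tPASS\tAF=0.30",
    "chr2\t4000\t.\tT\tC\t70\tPASS\tDP=20",
]

passing_variants = filter_by_depth(vcf_lines, min_dp=20)
```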

Methylation and variant integration modules 

Connected Multiomics integrates methylation and variant information at the gene level, so both DMRs and variants must first be annotated with genes, as shown in Figures 7 and 10, respectively. This gene-centric integration prioritizes functionally relevant regions of the genome, with planned extensions to additional regulatory loci in future releases. Figure 11 shows the output table after DMRs and variants are intersected. This output has been embellished with a regional methylation view and additional graphics generated outside of Connected Multiomics to provide context. At this example locus, there is a cluster of variants at the HOXA9 gene in KMT2Ar patients, which correlates with hypomethylation of the HOXA9 gene. This correlation could imply that these HOXA9 variants have functional consequences, because gene hypomethylation is associated with gene expression. Thus, DMRs could provide functional context for interpreting variants of unknown significance.

Figure 8: Variant filtering functions to enrich for informative genomic variants
Figure 9: Cohort level summary of variants
Figure 10: Annotation of variants by proximity to genes
Figure 11: Output of multiomic intersection of DMRs and variant calls embellished with additional graphics
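
The gene-level intersection amounts to a join on gene name between the two annotated tables. The following sketch shows that idea. The record layout, placeholder gene names, sample IDs, and values are hypothetical and do not reproduce the Figure 11 output.

```python
from collections import defaultdict


def intersect_by_gene(annotated_dmrs, annotated_variants):
    """Report genes that carry both at least one DMR and at least one variant."""
    dmrs_by_gene = defaultdict(list)
    for dmr in annotated_dmrs:
        for gene in dmr["genes"]:
            dmrs_by_gene[gene].append(dmr)

    variants_by_gene = defaultdict(list)
    for variant in annotated_variants:
        for gene in variant["genes"]:
            variants_by_gene[gene].append(variant)

    rows = []
    for gene in sorted(set(dmrs_by_gene) & set(variants_by_gene)):
        gene_dmrs = dmrs_by_gene[gene]
        gene_variants = variants_by_gene[gene]
        rows.append({
            "gene": gene,
            "n_dmrs": len(gene_dmrs),
            "mean_diff": sum(d["diff"] for d in gene_dmrs) / len(gene_dmrs),
            "n_variants": len(gene_variants),
            "samples": sorted({v["sample"] for v in gene_variants}),
        })
    return rows


# Hypothetical gene-annotated DMRs and variants, for illustration only.
annotated_dmrs = [
    {"chrom": "chr1", "start": 10_000, "end": 10_500, "diff": -0.35, "genes": ["GENE_A"]},
    {"chrom": "chr1", "start": 11_000, "end": 11_200, "diff": -0.25, "genes": ["GENE_A"]},
    {"chrom": "chr2", "start": 200_000, "end": 201_000, "diff": 0.40, "genes": ["GENE_C"]},
]
annotated_variants = [
    {"sample": "S1", "chrom": "chr1", "pos": 12_345, "genes": ["GENE_A"]},
    {"sample": "S2", "chrom": "chr1", "pos": 12_360, "genes": ["GENE_A"]},
    {"sample": "S3", "chrom": "chr3", "pos": 777, "genes": ["GENE_D"]},
]

integrated = intersect_by_gene(annotated_dmrs, annotated_variants)
```

Only genes present in both tables survive the join, so each output row pairs a methylation change with the variants and samples observed at the same gene.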

Workflow visualization

Through the AML case study presented, we demonstrate an end-to-end analysis that starts with data quality control, as summarized in Figure 12. Connected Multiomics provides methylation and variant analysis tools that take advantage of the multiomic nature of the Illumina 5-base data type. You can perform rigorous clustering validation, DMR calling based on metadata and cluster labels, and contextualization of DMRs with gene and pathway information. In parallel, you can annotate and filter genomic variants and visualize variants at the cohort level. Variants can be annotated further with DMRs to provide a more complete interpretation of the regulatory and genetic drivers underlying disease. Figure 12 also highlights the transparency of a collaborative analysis because teams can track progress in real time and branch analyses. In summary, these capabilities demonstrate how Connected Multiomics brings multiomic data, analysis, and interpretation into a single transparent and collaborative environment, accelerating biological insights from Illumina 5-base data sets.

Figure 12: Representative workflow for the presented AML cohort analysis. Pink rectangles indicate the analysis modules featured in this blog.