SNV/Indel Processing

Methods for SNV and indel preprocessing, annotation, and prioritization via the IMPACT-SNV module
Authors and Affiliations

  • Nicholas Boehler, University of Toronto Mississauga
  • Hai-Ying Mary Cheng, University of Toronto Mississauga

Published: December 18, 2025

1 Overview

Note: IMPACT-SNV Pipeline Integration

This page documents SNV/Indel processing as implemented in the IMPACT-SNV upstream module. IMPACT-VIS receives preprocessed annotated GDS (aGDS) files with variants already scored via the IMPACT scoring system.

Upstream Repository: IMPACT-SNV (companion to IMPACT-VIS)

Expected Output: {sample_id}_SNV_IMPACT.gds + {sample_id}_SNV_IMPACT.gds_variant_states.rds

For reproducibility details, see the IMPACT-SNV Documentation.

2 Introduction

The IMPACT-SNV module preprocesses single-nucleotide variants (SNVs) and small indels for phenotype-aware interpretation. This section details the data preprocessing workflow, annotation approach, prioritization algorithm, and integration with IMPACT-VIS for interactive variant visualization.

3 Workflow Overview

The IMPACT-SNV module follows a standardized three-stage pipeline:

  1. Normalization & Conversion: Raw VCF → normalized → GDS format
  2. Annotation: Functional annotations via FAVOR database (190 attributes)
  3. Prioritization: Tier-based scoring with phenotype-specific gene relevance

4 Data Preprocessing

4.1 VCF Normalization

Raw VCF files from the DRAGEN 3.7.8 pipeline (GRCh38/hg38 reference) are normalized using bcftools:

  • Left-align variants to GRCh38 reference genome
  • Split multi-allelic sites into individual records
  • Merge chromosome-specific files
  • Prune extraneous INFO fields to reduce file size

The merged, normalized VCF is then partitioned into chromosome-specific files for efficient downstream processing.
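The normalization steps listed above can be sketched as a sequence of bcftools invocations. The following is a minimal illustration rather than the module's actual script: the function name, the intermediate file names, and the `info_to_drop` parameter are hypothetical, and the final per-chromosome partitioning step is omitted. The commands are only assembled as argument lists, not executed.

```python
from pathlib import Path


def build_normalization_commands(chrom_vcfs, reference_fasta, info_to_drop, out_dir):
    """Assemble bcftools command lines (argument lists) for the steps above.

    All file names are illustrative assumptions, not the pipeline's real paths.
    """
    out = Path(out_dir)
    commands = []
    normalized = []

    for vcf in chrom_vcfs:
        target = out / f"norm_{Path(vcf).name}"
        normalized.append(str(target))
        # -f left-aligns indels against the reference;
        # -m -any splits multi-allelic sites into one record per allele
        commands.append(["bcftools", "norm", "-f", str(reference_fasta),
                         "-m", "-any", "-Oz", "-o", str(target), str(vcf)])

    merged = out / "merged.norm.vcf.gz"
    # Merge the chromosome-specific files into one VCF
    commands.append(["bcftools", "concat", "-Oz", "-o", str(merged), *normalized])

    if info_to_drop:
        pruned = out / "merged.norm.pruned.vcf.gz"
        drop = ",".join(f"INFO/{name}" for name in info_to_drop)
        # -x removes the listed INFO fields to reduce file size
        commands.append(["bcftools", "annotate", "-x", drop,
                         "-Oz", "-o", str(pruned), str(merged)])

    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`; which INFO fields are safe to drop depends on the caller and is not specified here.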

4.2 GDS Conversion

Normalized VCF files are converted to Genomic Data Structure (GDS) format using the SeqArray framework.

Rationale: GDS provides better compression and faster random access than VCF, enabling efficient on-disk filtering during interactive visualization [SeqArray documentation].

File Format: CoreArray GDS, a hierarchical (HDF5-like) container optimized for high-throughput sequencing data

File Naming Convention:

  • {sample_id}_SNV_IMPACT.gds: Annotated GDS file (aGDS)
  • {sample_id}_SNV_IMPACT.gds_variant_states.rds: Per-sample annotation state (user notes, classifications)
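As a small illustration of the naming convention, the two expected paths for a sample can be derived from its identifier. The helper name and the `directory` parameter are hypothetical; only the file-name patterns come from the convention above.

```python
from pathlib import Path


def expected_outputs(sample_id, directory="."):
    """Return (aGDS path, variant-state RDS path) for one sample."""
    base = Path(directory)
    gds = base / f"{sample_id}_SNV_IMPACT.gds"
    # The state file name is the full GDS file name plus a suffix
    states = base / f"{gds.name}_variant_states.rds"
    return gds, states
```

A caller could check that both paths exist before handing a sample to the visualization module.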

4.3 Functional Annotation

Annotated GDS files are generated using favorannotator, which appends 190 functional attributes from the FAVOR database, including:

  • Predicted consequence (VEP/Gencode/RefSeq)
  • Allele frequency (gnomAD, BRAVO, 1000 Genomes)
  • Predicted deleteriousness and functional impact (CADD, REVEL, SIFT, PolyPhen)
  • Gene-level annotations (constraints, intolerance scores)
  • Clinical annotations (ClinVar assertions, disease associations)

The resulting annotated GDS (aGDS) files serve as the foundation for variant prioritization and visualization.

5 Variant Prioritization: IMPACT Scoring System

The IMPACT scoring system integrates variant-level pathogenicity with phenotype-specific gene relevance. Each variant is assigned to one of four tiers based on functional severity, and a final IMPACT score (0–100) is computed by combining the tier-specific base score with the gene–disease association (GDA) score from Open Targets.

5.1 Tier Classification

Variants are stratified into four tiers based on functional consequence:

Tier 1: Variants documented in ClinVar as “Pathogenic” or “Likely Pathogenic”

Tier 2: Variants predicted to be frameshift or stop-gain by Gencode, RefSeq, or UCSC

Tier 3: Other coding or splicing variants predicted to alter transcript structure

Tier 4: Remaining variants, scored using normalized FAVOR aPC protein impact metrics
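The tier rules above amount to a first-match cascade, which might be sketched as follows. The consequence labels and the shape of the inputs are assumptions made for illustration; the actual FAVOR field names and category strings used by IMPACT-SNV are not given here.

```python
# Hypothetical label sets; real annotation values may differ.
PATHOGENIC_LABELS = {"pathogenic", "likely pathogenic"}
TRUNCATING = {"frameshift", "stopgain"}
OTHER_CODING_OR_SPLICING = {"missense", "inframe_indel", "splicing", "stoploss"}


def assign_tier(clinvar_significance, consequences):
    """Return the tier (1-4) for one variant, checking tiers in order.

    clinvar_significance: ClinVar assertion string, or None if absent.
    consequences: iterable of predicted consequence labels from any
                  transcript set (Gencode, RefSeq, or UCSC).
    """
    label = (clinvar_significance or "").replace("_", " ").strip().lower()
    if label in PATHOGENIC_LABELS:  # exact match only
        return 1
    found = {c.lower() for c in consequences}
    if found & TRUNCATING:
        return 2
    if found & OTHER_CODING_OR_SPLICING:
        return 3
    return 4
```

Because the checks run in order, a ClinVar pathogenic frameshift variant lands in Tier 1 rather than Tier 2.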

5.2 IMPACT Scoring Formula

For each tier, a base score is combined with the gene–disease association (GDA) score from Open Targets to produce a final IMPACT score ranging from 0 to 100:

| Tier | Formula | Interpretation |
|------|---------|----------------|
| Tier 1 | \(\text{IMPACT} = 80 + 20 \times \text{GDA}\) | Clinically established pathogenic variants; GDA modulates priority |
| Tier 2 | \(\text{IMPACT} = 60 + 40 \times \text{GDA}\) | High-impact predicted variants; phenotype relevance strongly weighted |
| Tier 3 | \(\text{IMPACT} = 20 + 80 \times \text{GDA}\) | Moderate-impact variants; phenotype relevance is dominant factor |
| Tier 4 | \(\text{IMPACT} = 50 \times \text{aPC\_protein} + 50 \times \text{GDA}\) | Remaining variants; scored equally on predicted impact and phenotype relevance |

Variables:

  • \(\text{GDA}\) = Gene–disease association score (0–1, from Open Targets platform)
  • \(\text{aPC\_protein}\) = Normalized FAVOR aPC protein impact score (0–1)

Interpretation:

  • Tier 1 variants receive high baseline priority, as they have prior clinical documentation
  • Tier 2 variants with frameshift/stop-gain consequences receive moderate priority, enhanced by phenotype association
  • Tier 3 variants rely heavily on phenotype association (80% of score)
  • Tier 4 variants require both predicted functional impact and phenotype relevance for prioritization
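The per-tier formulas can be written as a single function. This is a sketch of the arithmetic only; the function name and the input validation are illustrative choices, not part of the documented module.

```python
def impact_score(tier, gda, apc_protein=None):
    """Compute the IMPACT score (0-100) from the tier formulas.

    gda: gene-disease association score in [0, 1].
    apc_protein: normalized aPC protein score in [0, 1]; used for Tier 4 only.
    """
    if not 0.0 <= gda <= 1.0:
        raise ValueError("gda must be between 0 and 1")
    if tier == 1:
        return 80 + 20 * gda
    if tier == 2:
        return 60 + 40 * gda
    if tier == 3:
        return 20 + 80 * gda
    if tier == 4:
        if apc_protein is None or not 0.0 <= apc_protein <= 1.0:
            raise ValueError("Tier 4 needs apc_protein between 0 and 1")
        return 50 * apc_protein + 50 * gda
    raise ValueError("tier must be 1, 2, 3, or 4")
```

For example, `impact_score(2, 0.25)` evaluates to 70.0, and `impact_score(4, 0.5, apc_protein=0.5)` evaluates to 50.0.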

5.3 High-Priority Threshold

Variants with an IMPACT score ≥ 70 are considered high-priority candidates for visualization and downstream interpretation in IMPACT-VIS.

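Rationale: With this threshold, a variant is prioritized on the basis of either:

  • Strong clinical evidence: every Tier 1 variant qualifies regardless of GDA, because its base score is 80, OR
  • Predicted impact combined with sufficient phenotype relevance (Tiers 2–4): Tier 2 requires GDA ≥ 0.25, Tier 3 requires GDA ≥ 0.625, and Tier 4 requires aPC_protein + GDA ≥ 1.4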
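To make the effect of the threshold concrete, the tier formulas can be inverted to find the smallest GDA at which a variant of a given tier reaches a score of 70. The helper below is a hypothetical illustration of that arithmetic, not part of IMPACT-SNV.

```python
HIGH_PRIORITY_THRESHOLD = 70

# (base score, GDA weight) taken from the Tier 1-3 formulas
TIER_BASE_AND_WEIGHT = {1: (80, 20), 2: (60, 40), 3: (20, 80)}


def min_gda_for_high_priority(tier, apc_protein=None,
                              threshold=HIGH_PRIORITY_THRESHOLD):
    """Smallest GDA (0-1) reaching the threshold, or None if unreachable."""
    if tier == 4:
        if apc_protein is None:
            raise ValueError("Tier 4 needs an aPC_protein value")
        # Tier 4 has no fixed base: the aPC term plays that role
        base, weight = 50 * apc_protein, 50
    else:
        base, weight = TIER_BASE_AND_WEIGHT[tier]
    needed = (threshold - base) / weight
    if needed > 1.0:
        return None  # even GDA = 1 cannot lift the score to the threshold
    return max(0.0, needed)
```

Under these formulas, `min_gda_for_high_priority(2)` returns 0.25 and `min_gda_for_high_priority(3)` returns 0.625, while a Tier 4 variant with an aPC_protein of 0.3 cannot reach the threshold at any GDA.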

6 Integration with IMPACT-VIS

Final output from IMPACT-SNV consists of per-sample SNV_IMPACT.gds files containing:

  • Complete variant coordinates and reference/alternate alleles
  • All functional annotations from FAVOR database
  • Assigned tier classification and IMPACT score
  • Sample-level genotype and quality information

These files are directly consumed by the IMPACT-VIS visualization module, enabling interactive exploration of prioritized variants alongside structural variants and copy number variants in a unified framework.

7 Computational Efficiency

The SNV module employs on-disk processing strategies to handle genome-scale data:

  • GDS on-disk filtering: Variant filtering performed at the GDS storage layer, minimizing memory usage
  • Chromosome-specific processing: Variants partitioned by chromosome for parallelization
  • Lazy evaluation: Annotations loaded only for variants meeting filter criteria
  • Annotation state caching: User classifications and notes stored in lightweight RDS files

This architecture enables responsive interactive visualization of large per-sample variant sets (30,000+ variants) on modest hardware resources.
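The lazy-evaluation idea listed above (filter on a cheap field first, then load full annotations only for variants that pass) can be illustrated with a generic generator. This sketch uses plain Python dictionaries and a caller-supplied loader; it does not use the SeqArray API, and all names in it are hypothetical.

```python
def prioritized_variants(scores_by_chrom, load_annotations, threshold=70):
    """Yield (variant_id, score, annotations) for variants passing the filter.

    scores_by_chrom: {chromosome: {variant_id: IMPACT score}}, a stand-in
                     for per-chromosome score arrays.
    load_annotations: callable fetching the full annotation record for one
                      variant; it is only invoked for passing variants.
    """
    for chrom, scores in scores_by_chrom.items():
        for variant_id, score in scores.items():
            if score >= threshold:
                # The expensive lookup happens only after the cheap filter
                yield variant_id, score, load_annotations(variant_id)
```

Because the function is a generator, a viewer could stop iterating after the first page of results without loading annotations for the remainder.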