SNV/Indel Processing
1 Overview
This page documents SNV/Indel processing as implemented in the IMPACT-SNV upstream module. IMPACT-VIS receives preprocessed annotated GDS (aGDS) files with variants already scored via the IMPACT scoring system.
Upstream Repository: IMPACT-SNV (companion to IMPACT-VIS)
Expected Output: {sample_id}_SNV_IMPACT.gds and {sample_id}_SNV_IMPACT.gds_variant_states.rds
For reproducibility details, see the IMPACT-SNV Documentation.
2 Introduction
The IMPACT-SNV module preprocesses single-nucleotide variants (SNVs) and small indels for phenotype-aware interpretation. This section details the data preprocessing workflow, annotation approach, prioritization algorithm, and integration with IMPACT-VIS for interactive variant visualization.
3 Workflow Overview
The IMPACT-SNV module follows a standardized three-stage pipeline:
- Normalization & Conversion: Raw VCF → normalized → GDS format
- Annotation: Functional annotations via FAVOR database (190 attributes)
- Prioritization: Tier-based scoring with phenotype-specific gene relevance
4 Data Preprocessing
4.1 VCF Normalization
Raw VCF files from the DRAGEN 3.7.8 pipeline (GRCh38/hg38 reference) are normalized using bcftools:
- Left-align variants to GRCh38 reference genome
- Split multi-allelic sites into individual records
- Merge chromosome-specific files
- Prune extraneous INFO fields to reduce file size
The merged, normalized VCF is then partitioned into chromosome-specific files for efficient downstream processing.
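The normalization steps above can be sketched as a small Python helper that assembles bcftools command lines without running them. This is a minimal illustration, not the IMPACT-SNV implementation: the source does not state the exact flags the module uses, and the function name, file names, and the `drop_info_fields` parameter are hypothetical.

```python
from typing import List


def build_normalization_commands(
    chrom_vcfs: List[str],
    reference_fasta: str,
    drop_info_fields: List[str],
    out_prefix: str,
) -> List[List[str]]:
    """Return bcftools argument lists for normalize -> merge -> prune.

    Nothing is executed here. All paths are hypothetical placeholders.
    """
    commands: List[List[str]] = []
    normalized: List[str] = []

    for vcf in chrom_vcfs:
        stem = vcf[: -len(".vcf.gz")] if vcf.endswith(".vcf.gz") else vcf
        out = stem + ".norm.vcf.gz"
        # -f left-aligns indels against the reference FASTA;
        # -m -any splits multi-allelic sites into one record per ALT allele.
        commands.append(
            ["bcftools", "norm", "-f", reference_fasta, "-m", "-any",
             "-Oz", "-o", out, vcf]
        )
        normalized.append(out)

    merged = f"{out_prefix}.merged.vcf.gz"
    # concat joins per-chromosome files that share the same samples.
    commands.append(["bcftools", "concat", "-Oz", "-o", merged, *normalized])

    if drop_info_fields:
        pruned = f"{out_prefix}.pruned.vcf.gz"
        # annotate -x removes the listed INFO tags.
        to_remove = ",".join(f"INFO/{name}" for name in drop_info_fields)
        commands.append(
            ["bcftools", "annotate", "-x", to_remove, "-Oz", "-o", pruned, merged]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where bcftools is installed. Which INFO fields to prune depends on the caller and is not specified in this page.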
4.2 GDS Conversion
Normalized VCF files are converted to Genomic Data Structure (GDS) format using the SeqArray framework.
Rationale: GDS provides superior compression and rapid random-access compared to VCF format, enabling efficient on-disk filtering during interactive visualization [SeqArray documentation].
File Format: Hierarchical on-disk container built on the CoreArray library, optimized for high-throughput sequencing data
File Naming Convention:
- {sample_id}_SNV_IMPACT.gds: Annotated GDS file (aGDS)
- {sample_id}_SNV_IMPACT.gds_variant_states.rds: Per-sample annotation state (user notes, classifications)
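As a small illustration of the naming convention, the following Python sketch derives both expected paths from a sample identifier. The function name and the `out_dir` parameter are hypothetical and not part of IMPACT-SNV.

```python
from pathlib import Path
from typing import Tuple


def expected_outputs(sample_id: str, out_dir: str = ".") -> Tuple[Path, Path]:
    """Return (aGDS path, variant-state RDS path) for one sample."""
    gds = Path(out_dir) / f"{sample_id}_SNV_IMPACT.gds"
    # The state file keeps the full .gds name and appends a suffix to it.
    states = gds.with_name(gds.name + "_variant_states.rds")
    return gds, states
```

A caller could use this to check that both files exist before loading a sample.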
4.3 Functional Annotation
Annotated GDS files are generated using favorannotator, which appends 190 functional attributes from the FAVOR database, including:
- Predicted consequence (VEP/Gencode/RefSeq)
- Allele frequency (gnomAD, BRAVO, 1000 Genomes)
- Predicted deleteriousness (CADD, REVEL, SIFT, PolyPhen)
- Gene-level annotations (constraints, intolerance scores)
- Clinical annotations (ClinVar assertions, disease associations)
The resulting annotated GDS (aGDS) files serve as the foundation for variant prioritization and visualization.
5 Variant Prioritization: IMPACT Scoring System
The IMPACT scoring system integrates variant-level pathogenicity with phenotype-specific gene relevance. Each variant is assigned to one of four tiers based on functional severity, and a final IMPACT score (0–100) is computed by combining the tier-specific base score with the gene–disease association (GDA) score from Open Targets.
5.1 Tier Classification
Variants are stratified into four tiers based on clinical evidence and predicted functional consequence:
Tier 1: Variants documented in ClinVar as “Pathogenic” or “Likely Pathogenic”
Tier 2: Variants predicted to be frameshift or stop-gain by Gencode, RefSeq, or UCSC annotations
Tier 3: Other coding or splicing variants predicted to alter transcript structure
Tier 4: Remaining variants, scored using normalized FAVOR aPC protein impact metrics
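The tier rules can be sketched as a first-match cascade. This is a simplified illustration under stated assumptions: the tiers are assumed to be checked in order (implied by "Other" and "Remaining" above), and the label strings, the `is_coding_or_splicing` flag, and the function name are hypothetical rather than actual FAVOR field values.

```python
from typing import Iterable

# Hypothetical, lower-cased labels; real annotation strings may differ.
PATHOGENIC = {"pathogenic", "likely pathogenic"}
TRUNCATING = {"frameshift", "stopgain"}


def assign_tier(
    clinvar: str,
    consequences: Iterable[str],
    is_coding_or_splicing: bool,
) -> int:
    """Return the tier (1-4) using the first rule that matches."""
    if clinvar.strip().lower() in PATHOGENIC:
        return 1  # documented in ClinVar as (likely) pathogenic
    labels = {c.strip().lower() for c in consequences}
    if labels & TRUNCATING:
        return 2  # predicted frameshift or stop-gain
    if is_coding_or_splicing:
        return 3  # other coding or splicing variant
    return 4  # everything else
```

For example, `assign_tier("", ["stopgain"], True)` returns 2, because the truncating rule is checked before the general coding rule.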
5.2 IMPACT Scoring Formula
For each tier, a base score is combined with the gene–disease association (GDA) score from Open Targets to produce a final IMPACT score ranging from 0 to 100:
| Tier | Formula | Interpretation |
|---|---|---|
| Tier 1 | \(\text{IMPACT} = 80 + 20 \times \text{GDA}\) | Clinically established pathogenic variants; GDA modulates priority |
| Tier 2 | \(\text{IMPACT} = 60 + 40 \times \text{GDA}\) | High-impact predicted variants; phenotype relevance strongly weighted |
| Tier 3 | \(\text{IMPACT} = 20 + 80 \times \text{GDA}\) | Moderate-impact variants; phenotype relevance is the dominant factor |
| Tier 4 | \(\text{IMPACT} = 50 \times \text{aPC\_protein} + 50 \times \text{GDA}\) | Remaining variants; predicted protein impact and phenotype relevance weighted equally |
Variables:
- \(\text{GDA}\) = Gene–disease association score (0–1, from the Open Targets platform)
- \(\text{aPC\_protein}\) = Normalized FAVOR aPC protein impact score (0–1)
Interpretation:
- Tier 1 variants receive the highest baseline priority (base score 80) because they have prior clinical documentation
- Tier 2 variants (frameshift or stop-gain) receive an intermediate baseline (base score 60), raised further by phenotype association
- Tier 3 variants rely heavily on phenotype association (up to 80 of the 100 points)
- Tier 4 variants have no fixed baseline and need both predicted functional impact and phenotype relevance to rank highly
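The four formulas translate directly into code. The sketch below is an illustration of the table rather than the module's own implementation; the function name and the input validation are assumptions.

```python
def impact_score(tier: int, gda: float, apc_protein: float = 0.0) -> float:
    """Compute the IMPACT score (0-100) from the tier formulas.

    gda and apc_protein are both expected in [0, 1]; apc_protein is only
    used for Tier 4.
    """
    if not 0.0 <= gda <= 1.0:
        raise ValueError("gda must be in [0, 1]")
    if tier == 1:
        return 80 + 20 * gda
    if tier == 2:
        return 60 + 40 * gda
    if tier == 3:
        return 20 + 80 * gda
    if tier == 4:
        if not 0.0 <= apc_protein <= 1.0:
            raise ValueError("apc_protein must be in [0, 1]")
        return 50 * apc_protein + 50 * gda
    raise ValueError("tier must be 1, 2, 3, or 4")
```

With GDA = 0.5, the four tiers give 90, 80, 60, and (for aPC_protein = 0.5) 50, which shows how the same phenotype relevance is weighted differently across tiers.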
5.3 High-Priority Threshold
Variants with an IMPACT score ≥ 70 are considered high-priority candidates for visualization and downstream interpretation in IMPACT-VIS.
Rationale: A variant reaches this threshold through either:
- Strong clinical evidence: every Tier 1 variant qualifies, because its base score of 80 already exceeds 70 regardless of GDA, OR
- Predicted impact combined with sufficient phenotype relevance (Tiers 2–4): the smaller the tier's contribution from predicted impact, the higher the GDA score needed
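To make the threshold concrete, the following sketch solves each tier formula for the smallest GDA that reaches a score of 70. It is derived only from the formulas in Section 5.2; the function name and the `threshold` parameter are hypothetical.

```python
from typing import Optional


def min_gda_for_high_priority(
    tier: int, apc_protein: float = 0.0, threshold: float = 70.0
) -> Optional[float]:
    """Smallest GDA in [0, 1] giving score >= threshold, or None if impossible.

    Every tier formula has the form score = base + weight * GDA.
    """
    fixed = {1: (80.0, 20.0), 2: (60.0, 40.0), 3: (20.0, 80.0)}
    if tier == 4:
        base, weight = 50.0 * apc_protein, 50.0
    elif tier in fixed:
        base, weight = fixed[tier]
    else:
        raise ValueError("tier must be 1, 2, 3, or 4")
    needed = (threshold - base) / weight
    if needed > 1.0:
        return None  # even GDA = 1 cannot reach the threshold
    return max(0.0, needed)
```

Under these formulas, Tier 1 needs no phenotype association (minimum GDA 0), Tier 2 needs GDA ≥ 0.25, and Tier 3 needs GDA ≥ 0.625. A Tier 4 variant with aPC_protein = 1.0 needs GDA ≥ 0.4, while one with aPC_protein = 0.3 cannot reach 70 at all.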
6 Integration with IMPACT-VIS
The final output of IMPACT-SNV consists of per-sample {sample_id}_SNV_IMPACT.gds files containing:
- Complete variant coordinates and reference/alternate alleles
- All functional annotations from FAVOR database
- Assigned tier classification and IMPACT score
- Sample-level genotype and quality information
These files are directly consumed by the IMPACT-VIS visualization module, enabling interactive exploration of prioritized variants alongside structural variants and copy number variants in a unified framework.
7 Computational Efficiency
The SNV module employs on-disk processing strategies to handle genome-scale data:
- GDS on-disk filtering: Variant filtering performed at the GDS storage layer, minimizing memory usage
- Chromosome-specific processing: Variants partitioned by chromosome for parallelization
- Lazy evaluation: Annotations loaded only for variants meeting filter criteria
- Annotation state caching: User classifications and notes stored in lightweight RDS files
This architecture enables responsive interactive visualization even for large variant sets (30,000+ variants per sample) on modest hardware.
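The lazy-evaluation idea can be illustrated with a plain Python generator: a lightweight score column is scanned first, and full annotations are fetched only for variants that pass the filter. This is a language-neutral sketch of the pattern, not the SeqArray API; `load_annotations` stands in for whatever on-disk reader the module actually uses, and all names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


def lazy_high_priority(
    scores: Iterable[Tuple[str, float]],
    load_annotations: Callable[[str], Dict[str, object]],
    threshold: float = 70.0,
) -> Iterator[Dict[str, object]]:
    """Yield annotated records only for variants at or above the threshold."""
    for variant_id, score in scores:
        if score >= threshold:
            # The (potentially expensive) annotation read happens here,
            # and only for variants that passed the cheap score filter.
            record = load_annotations(variant_id)
            yield {"variant_id": variant_id, "impact": score, **record}
```

Because the function is a generator, a consumer that displays only the first page of results triggers annotation reads for just those rows.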