CNV Processing

Methods for copy number variant annotation and phenotype-aware prioritization via the IMPACT-CNV module

Authors	Affiliation
Nicholas Boehler	University of Toronto Mississauga
Hai-Ying Mary Cheng	University of Toronto Mississauga

Published

June 1, 2026

1 Overview

IMPACT-CNV Pipeline Integration

This page documents CNV processing as implemented in the IMPACT-CNV upstream module. IMPACT-VIS receives preprocessed SCIP-format CNV text files with variants already validated and prioritized via phenotype-aware gene filtering.

Upstream Repository: IMPACT-CNV (companion to IMPACT-VIS)

Expected Output: {sample_id}_CNV_IMPACT.txt

For reproducibility details, see the IMPACT-CNV Documentation.

2 Introduction

The IMPACT-CNV module extends the Suite for CNV Interpretation and Prioritization (SCIP) framework to enable phenotype-aware filtering and prioritization of copy number variants (CNVs). This section details the CNV preprocessing workflow, quality validation, phenotype-specific filtering, and integration with IMPACT-VIS for interactive visualization.

3 Workflow Overview

The IMPACT-CNV module implements a multi-step pipeline:

VCF Conversion: CNV VCF → SCIP-compatible text format
SCIP Processing: Multi-step filtration and prioritization
Read-Depth Validation: CRAM-based verification of CNV calls
Phenotype-Aware Ranking: Phenotype relevance weighting
Integration with IMPACT-VIS: TXT output for interactive visualization

4 Data Preprocessing

4.1 CNV VCF to SCIP Conversion

CNV VCF files are converted to SCIP-compatible text inputs using a Python-based converter that:

Extracts sample identifiers from VCF header
Reformats variant coordinates (chr, start, end, CN)
Validates input format and coordinate ranges
Generates tab-separated text file for SCIP processing

Input Format: Standard VCF with CNV calls (e.g., from DRAGEN, Canvas, Manta, LUMPY)

Output Format: SCIP-compatible TXT with columns: chromosome, start, end, copy_number
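
The conversion steps above can be illustrated with a minimal sketch. It is not the IMPACT-CNV converter itself. It assumes a single-sample VCF in which each CNV record carries an `END` key in the INFO column and a `CN` key in the FORMAT/sample columns, which is common for CNV callers but not guaranteed. The function names `vcf_to_scip_rows` and `rows_to_tsv` are hypothetical, and the sketch passes the VCF `POS` through unchanged as the start coordinate because the source does not state which coordinate convention SCIP expects.

```python
def vcf_to_scip_rows(vcf_lines):
    """Parse VCF text lines into (sample_id, rows).

    Each row is (chromosome, start, end, copy_number).
    Assumes END lives in INFO and CN lives in the first sample column.
    """
    sample_id = None
    rows = []
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The first sample name follows the nine fixed VCF columns.
            sample_id = fields[9] if len(fields) > 9 else None
            continue
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no CN to extract
        chrom, start = fields[0], int(fields[1])
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if "END" not in info or sample.get("CN") in (None, "."):
            continue  # record lacks the fields this sketch relies on
        end, copy_number = int(info["END"]), int(sample["CN"])
        # Basic coordinate and copy-number validation.
        if start < 1 or end <= start or copy_number < 0:
            raise ValueError(f"invalid CNV record: {chrom}:{start}-{end} CN={copy_number}")
        rows.append((chrom, start, end, copy_number))
    return sample_id, rows


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text, one CNV per line."""
    return "".join(f"{c}\t{s}\t{e}\t{cn}\n" for c, s, e, cn in rows)
```

Records without `END` or `CN` are skipped rather than guessed at. A production converter would more likely log them or derive the copy number from caller-specific fields.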

4.2 SCIP Backend Processing

The SCIP backend applies a multi-step filtration and prioritization strategy informed by:

ClinGen dosage sensitivity maps
OMIM disease associations
gnomAD constraint metrics
Internal recurrence regions
Curated genomic annotations

5 Quality Control via Read-Depth Validation

To improve confidence in CNV calls, the pipeline performs validation using sample CRAM files and a high-quality control genome (NA12878):

Validation Steps:

Coverage Analysis: Verify expected read-depth shifts consistent with CN state
- Homozygous deletion (CN=0): ~0% coverage
- Heterozygous deletion (CN=1): ~50% coverage
- Normal (CN=2): ~100% coverage
- Duplication (CN=3): ~150% coverage
- Amplification (CN≥4): ≥200% coverage
Mapping Quality Assessment: Exclude regions with anomalous read signatures
- Low MAPQ scores indicating misalignment
- High soft-clip rates indicating sequence complexity
- Regions of significant segmental duplication
Paired-End Evidence Quantification: Assess supporting read pairs
- Concordant pairs within expected insert size
- Discordant pairs supporting breakpoints
- Split-read evidence at CNV boundaries
Control Genome Comparison: Cross-reference with NA12878 high-quality calls
- Identify artifacts common to control samples
- Flag potential false positives

CNV Flagging: CNVs lacking supporting reads or exhibiting anomalous signatures are flagged for manual review.
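
The coverage expectations listed above follow from a simple ratio: for a diploid baseline, the expected relative depth is CN / 2. The sketch below shows how that check might be expressed. The tolerance of 0.2, the `PASS`/`REVIEW` labels, the function names, and the use of a single baseline depth value are all illustrative assumptions rather than parameters of the pipeline.

```python
def expected_depth_ratio(copy_number, ploidy=2):
    """Expected coverage relative to the diploid baseline (CN=1 gives 0.5)."""
    return copy_number / ploidy


def depth_supports_call(observed_ratio, copy_number, tolerance=0.2):
    """Check whether an observed depth ratio is consistent with the called CN.

    The tolerance is an assumed value, not one taken from IMPACT-CNV.
    """
    if copy_number >= 4:
        # Amplifications only need to reach the CN=4 floor (>= 200% coverage).
        return observed_ratio >= expected_depth_ratio(4) - tolerance
    return abs(observed_ratio - expected_depth_ratio(copy_number)) <= tolerance


def qc_status(region_depth, baseline_depth, copy_number, tolerance=0.2):
    """Return 'PASS' if depth supports the call, else 'REVIEW'."""
    if baseline_depth <= 0:
        return "REVIEW"  # no usable baseline, so defer to manual review
    ratio = region_depth / baseline_depth
    return "PASS" if depth_supports_call(ratio, copy_number, tolerance) else "REVIEW"
```

For example, a region at 15x against a 30x baseline gives a ratio of 0.5, which supports a heterozygous deletion (CN=1). The same region at 30x would not support it and would be flagged.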

6 Phenotype-Aware Prioritization

6.1 Evidence Integration

Phenotype relevance is refined through integration of multiple evidence sources:

Evidence Category	Source	Application
Dosage Sensitivity	ClinGen	Identify genes intolerant to copy number change
Haploinsufficiency	ClinGen	Score genes sensitive to heterozygous deletion
Triplosensitivity	ClinGen	Score genes sensitive to duplication/amplification
OMIM Associations	OMIM	Map CNVs to known genetic diseases
Inheritance Patterns	GenCC	Filter based on autosomal/X-linked inheritance
Loss-of-Function Intolerance	gnomAD	Identify genes with pLI ≥ 0.9 or LOEUF ≤ 0.35
Disease Validity	GenCC	Assess strength of disease association

6.2 Prioritization Tiers

CNVs are assigned to one of three priority tiers based on accumulated evidence:

Tier 1 (High Priority):

- Fully contained within ClinGen dosage-sensitive regions, OR
- Overlapping genes with strong cumulative evidence (OMIM + high pLI + ClinGen + phenotype association)

Tier 2 (Moderate Priority):

- Partially overlapping ClinGen dosage-sensitive regions, OR
- Overlapping genes with moderate evidence strength

Tier 3 (Low Priority):

- Extensive overlap with common CNVs in gnomAD, OR
- Residing in low-quality genomic regions, OR
- Insufficient evidence for prioritization
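
The tier rules can be read as a small decision function. The sketch below is one possible encoding, not the SCIP implementation. The `CnvEvidence` fields and their string values are hypothetical. The precedence is also an assumption: the source does not say which rule wins when a CNV meets both a Tier 1 and a Tier 3 criterion, so this sketch checks the Tier 3 exclusions first.

```python
from dataclasses import dataclass


@dataclass
class CnvEvidence:
    """Hypothetical summary of the evidence gathered for one CNV."""
    clingen_overlap: str = "none"    # "full", "partial", or "none"
    gene_evidence: str = "none"      # "strong", "moderate", or "none"
    common_in_gnomad: bool = False   # extensive overlap with common CNVs
    low_quality_region: bool = False


def assign_tier(ev):
    """Map evidence to a tier: 1 (high), 2 (moderate), or 3 (low)."""
    # Assumed precedence: common or low-quality CNVs are deprioritized first.
    if ev.common_in_gnomad or ev.low_quality_region:
        return 3
    if ev.clingen_overlap == "full" or ev.gene_evidence == "strong":
        return 1
    if ev.clingen_overlap == "partial" or ev.gene_evidence == "moderate":
        return 2
    return 3  # insufficient evidence for prioritization
```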

6.3 Gene–Disease Association Weighting

CNVs overlapping genes with strong gene–disease association (GDA) scores from Open Targets are prioritized. Additional weighting is applied for genes with:

- High intolerance to loss-of-function (pLI ≥ 0.9 or LOEUF ≤ 0.35)
- ClinGen haploinsufficiency score of 3, the top of the 1-3 scale (applied to deletions)
- ClinGen triplosensitivity score of 3, the top of the 1-3 scale (applied to duplications/amplifications)
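
A minimal sketch of how these weighting criteria might combine is shown below. The thresholds come from the text. The additive scheme, the `bonus` value of 0.25, and the function names are assumptions made for illustration, since the source does not specify how the weights are combined. The dosage checks use equality with 3 rather than `>= 3` because ClinGen also uses the codes 30 and 40 for categories that do not indicate dosage sensitivity.

```python
def is_lof_intolerant(pli=None, loeuf=None):
    """True when either gnomAD constraint metric crosses its threshold."""
    return (pli is not None and pli >= 0.9) or (loeuf is not None and loeuf <= 0.35)


def gene_weight(gda_score, cnv_type, pli=None, loeuf=None,
                hi_score=None, ts_score=None, bonus=0.25):
    """Combine a GDA score with constraint and dosage bonuses.

    cnv_type is "deletion" or "duplication". The additive bonus is an
    illustrative assumption, not the pipeline's actual weighting.
    """
    weight = gda_score
    if is_lof_intolerant(pli, loeuf):
        weight += bonus
    # Haploinsufficiency only matters for losses, triplosensitivity for gains.
    if cnv_type == "deletion" and hi_score == 3:
        weight += bonus
    if cnv_type == "duplication" and ts_score == 3:
        weight += bonus
    return weight
```

Under these assumptions, a deletion over a gene with a GDA score of 0.5, pLI of 0.95, and haploinsufficiency score of 3 receives a weight of 1.0. A duplication over the same gene with no triplosensitivity evidence receives 0.75.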

7 Output Format

The final output of the IMPACT-CNV module consists of per-sample CNV_IMPACT.txt files in SCIP-compatible format, consolidating:

- **Variant coordinates**: Chromosome, start, end, copy number state
- **Functional annotations**: Gene overlap, regulatory context
- **SCIP annotations**: Dosage sensitivity, OMIM associations
- **QC metrics**: Read-depth validation results, coverage statistics
- **Priority tier**: High/Moderate/Low classification
- **Evidence summary**: Concise reasoning for prioritization

File Structure: Tab-separated values with columns: chr, start, end, cn, genes, tier, qc_status, evidence_summary
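
For downstream scripting, a file with this column layout could be read roughly as follows. The sketch assumes the file begins with a header row naming the eight columns exactly as listed, which the source does not confirm. The `CnvRecord` class and `read_cnv_impact` function are hypothetical names. The `genes` field is kept as a raw string because its internal delimiter is not documented here.

```python
import csv
from dataclasses import dataclass

COLUMNS = ["chr", "start", "end", "cn", "genes", "tier", "qc_status", "evidence_summary"]


@dataclass
class CnvRecord:
    chrom: str
    start: int
    end: int
    cn: int
    genes: str            # raw string; delimiter between genes is not assumed
    tier: str             # e.g. High / Moderate / Low
    qc_status: str
    evidence_summary: str


def read_cnv_impact(lines):
    """Parse tab-separated lines (header first) into CnvRecord objects."""
    reader = csv.DictReader(lines, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return [
        CnvRecord(
            chrom=row["chr"],
            start=int(row["start"]),
            end=int(row["end"]),
            cn=int(row["cn"]),
            genes=row["genes"],
            tier=row["tier"],
            qc_status=row["qc_status"],
            evidence_summary=row["evidence_summary"],
        )
        for row in reader
    ]
```

An open file handle can be passed directly as `lines`, since `csv.DictReader` accepts any iterable of strings.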

7.1 Integration with IMPACT-VIS

CNV_IMPACT.txt files are formatted for direct input into the IMPACT-VIS visualization module, enabling:

- **Unified variant review**: CNVs visualized alongside SNVs and SVs
- **Quality-aware display**: QC-flagged variants highlighted for manual review
- **Phenotype-aware prioritization**: High-priority CNVs automatically surfaced
- **Interactive exploration**: Click-through to genes, dosage maps, disease databases

- **Systematic curation**: Per-sample classifications and notes persisted for reproducibility

8 Manual Review Workflow

All high-priority CNVs identified by IMPACT-CNV undergo manual review for validation:

Curation Steps:

1. Verify read-depth and breakpoint evidence via CRAM inspection
2. Assess gene overlap and functional impact
3. Consult disease databases (OMIM, ClinGen, GenCC)
4. Assign final pathogenicity assessment
5. Document evidence and classification rationale

9 Computational Efficiency

The CNV module maintains efficiency through:

Vectorized operations: Batch processing of CNVs via SCIP backend
Lazy annotation: QC validation performed only on variants meeting initial filter criteria
Caching: Pre-computed dosage maps and constraint scores loaded once
Parallelization: Per-sample processing enables concurrent execution

Typical Runtime: 30-120 seconds per sample (varies with CNV count and CRAM file size)

Memory Usage: <2 GB per sample due to streaming CRAM access and lazy annotation loading
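
The caching and per-sample parallelization described in this section can be sketched with the Python standard library. This is an illustration under stated assumptions, not the module's code: `load_dosage_map`, `process_sample`, and `process_cohort` are hypothetical names, the dosage map contents are placeholder data, and a thread pool is used only to keep the example simple. A process pool may suit CPU-bound annotation better.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1)
def load_dosage_map():
    """Load the shared annotation table once; later calls reuse the cached result.

    The returned dict is placeholder data and should be treated as read-only.
    """
    return {"GENE_A": 3}


def process_sample(sample_id):
    """Stand-in for per-sample CNV processing; returns the output filename."""
    dosage_map = load_dosage_map()  # cache hit after the first load
    assert dosage_map  # a real pipeline would annotate CNVs here
    return f"{sample_id}_CNV_IMPACT.txt"


def process_cohort(sample_ids, max_workers=4):
    """Process samples concurrently, preserving input order in the results."""
    load_dosage_map()  # warm the cache before workers start
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_sample, sample_ids))
```

Warming the cache before the pool starts means the annotation table is loaded once no matter how many samples run concurrently.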