CNV Processing
1 Overview
This page documents CNV processing as implemented in the IMPACT-CNV upstream module. IMPACT-VIS receives preprocessed SCIP-format CNV text files with variants already validated and prioritized via phenotype-aware gene filtering.
Upstream Repository: IMPACT-CNV (companion to IMPACT-VIS)
Expected Output: {sample_id}_CNV_IMPACT.txt
For reproducibility details, see the IMPACT-CNV Documentation.
2 Introduction
The IMPACT-CNV module extends the Suite for CNV Interpretation and Prioritization (SCIP) framework to enable phenotype-aware filtering and prioritization of copy number variants (CNVs). This section details the CNV preprocessing workflow, quality validation, phenotype-specific filtering, and integration with IMPACT-VIS for interactive visualization.
3 Workflow Overview
The IMPACT-CNV module implements a multi-step pipeline:
- VCF Conversion: CNV VCF → SCIP-compatible text format
- SCIP Processing: Multi-step filtration and prioritization
- Read-Depth Validation: CRAM-based verification of CNV calls
- Phenotype-Aware Ranking: Phenotype relevance weighting
- Integration with IMPACT-VIS: TXT output for interactive visualization
4 Data Preprocessing
4.1 CNV VCF to SCIP Conversion
CNV VCF files are converted to SCIP-compatible text inputs using a Python-based converter that:
- Extracts sample identifiers from VCF header
- Reformats variant coordinates (chr, start, end, CN)
- Validates input format and coordinate ranges
- Generates tab-separated text file for SCIP processing
Input Format: Standard VCF with CNV calls (e.g., from DRAGEN, Canvas, Manta, LUMPY)
Output Format: SCIP-compatible TXT with columns: chromosome, start, end, copy_number
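As a minimal sketch of the conversion steps above (not the IMPACT-CNV converter itself), the following assumes a single-sample VCF in which each CNV record carries an `END` key in INFO and a `CN` key in the sample's FORMAT field. The function name `convert_cnv_vcf` is hypothetical, real callers differ in how they encode copy number, and the sketch uses POS directly as the start coordinate without adjusting for any padding base.

```python
def convert_cnv_vcf(vcf_lines):
    """Return (sample_id, rows); each row is 'chromosome<TAB>start<TAB>end<TAB>copy_number'."""
    sample_id = None
    rows = []
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Assumption: single-sample VCF, so the sample ID is the 10th column.
            if len(fields) < 10:
                raise ValueError("VCF header has no sample column")
            sample_id = fields[9]
            continue
        if len(fields) < 10:
            raise ValueError(f"Malformed record: {line!r}")
        chrom, start = fields[0], int(fields[1])
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if "END" not in info or sample.get("CN", ".") == ".":
            continue  # no usable end coordinate or copy number
        end, copy_number = int(info["END"]), int(sample["CN"])
        # Validate coordinate ranges before emitting the row.
        if start < 1 or end < start or copy_number < 0:
            raise ValueError(f"Invalid coordinates or copy number: {line!r}")
        rows.append(f"{chrom}\t{start}\t{end}\t{copy_number}")
    return sample_id, rows
```

Records without an `END` or `CN` value are skipped rather than rejected, which is a design choice of this sketch; a stricter converter might log or fail on them.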
4.2 SCIP Backend Processing
The SCIP backend applies a multi-step filtration and prioritization strategy informed by:
- ClinGen dosage sensitivity maps
- OMIM disease associations
- gnomAD constraint metrics
- Internal recurrence regions
- Curated genomic annotations
5 Quality Control via Read-Depth Validation
To improve confidence in CNV calls, the pipeline performs validation using sample CRAM files and a high-quality control genome (NA12878):
Validation Steps:
Coverage Analysis: Verify expected read-depth shifts consistent with CN state
- Homozygous deletion (CN=0): ~0% coverage
- Heterozygous deletion (CN=1): ~50% coverage
- Normal (CN=2): ~100% coverage
- Duplication (CN=3): ~150% coverage
- Amplification (CN≥4): ≥200% coverage
Mapping Quality Assessment: Exclude regions with anomalous read signatures
- Low MAPQ scores indicating misalignment
- High soft-clip rates indicating sequence complexity
- Regions of significant segmental duplication
Paired-End Evidence Quantification: Assess supporting read pairs
- Concordant pairs within expected insert size
- Discordant pairs supporting breakpoints
- Split-read evidence at CNV boundaries
Control Genome Comparison: Cross-reference with NA12878 high-quality calls
- Identify artifacts common to control samples
- Flag potential false positives
CNV Flagging: CNVs lacking supporting reads or exhibiting anomalous signatures are flagged for manual review.
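The expected coverage levels in the Coverage Analysis step follow from read depth scaling as CN/2 in a diploid region. The sketch below illustrates that check only; the function names and the `tolerance` value of 0.15 are hypothetical and not taken from IMPACT-CNV, and the observed ratio is assumed to be already normalized against a copy-neutral baseline.

```python
def expected_depth_ratio(copy_number):
    """Expected depth relative to a diploid baseline (CN=2 -> 1.0)."""
    return copy_number / 2.0


def depth_supports_cn(observed_ratio, copy_number, tolerance=0.15):
    """Return True if the observed depth ratio is consistent with the called CN.

    Assumptions: diploid autosomal region; `tolerance` is an illustrative value.
    """
    if copy_number >= 4:
        # Amplifications only need to reach roughly 200% coverage or more.
        return observed_ratio >= 2.0 - tolerance
    return abs(observed_ratio - expected_depth_ratio(copy_number)) <= tolerance


# A heterozygous deletion call with ~52% of baseline depth is supported;
# the same call with normal depth would be flagged for manual review.
supported = depth_supports_cn(0.52, 1)
unsupported = depth_supports_cn(1.0, 1)
```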
6 Phenotype-Aware Prioritization
6.1 Evidence Integration
Phenotype relevance is refined through integration of multiple evidence sources:
| Evidence Category | Source | Application |
|---|---|---|
| Dosage Sensitivity | ClinGen | Identify genes intolerant to copy number change |
| Haploinsufficiency | ClinGen | Score genes sensitive to heterozygous deletion |
| Triplosensitivity | ClinGen | Score genes sensitive to duplication/amplification |
| OMIM Associations | OMIM | Map CNVs to known genetic diseases |
| Inheritance Patterns | GenCC | Filter based on autosomal/X-linked inheritance |
| Loss-of-Function Intolerance | gnomAD | Identify genes with pLI ≥ 0.9 or LOEUF ≤ 0.35 |
| Disease Validity | GenCC | Assess strength of disease association |
6.2 Prioritization Tiers
CNVs are assigned to one of three priority tiers based on accumulated evidence:
Tier 1 (High Priority):
- Fully contained within ClinGen dosage-sensitive regions, OR
- Overlapping genes with strong cumulative evidence (OMIM + high pLI + ClinGen + phenotype association)
Tier 2 (Moderate Priority):
- Partially overlapping ClinGen dosage-sensitive regions, OR
- Overlapping genes with moderate evidence strength
Tier 3 (Low Priority):
- Extensive overlap with common CNVs in gnomAD, OR
- Residing in low-quality genomic regions, OR
- Insufficient evidence for prioritization
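The tier rules can be sketched as a small decision function. This is an illustration under stated assumptions, not the SCIP implementation: the `CnvEvidence` fields and their string labels are hypothetical, and the order of checks (Tier 3 demotion criteria first) is an assumption, since the text does not say which rule wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class CnvEvidence:
    clingen_overlap: str      # hypothetical labels: "full", "partial", or "none"
    gene_evidence: str        # hypothetical labels: "strong", "moderate", or "none"
    common_in_gnomad: bool    # extensive overlap with common gnomAD CNVs
    low_quality_region: bool  # CNV resides in a low-quality genomic region


def assign_tier(ev: CnvEvidence) -> int:
    """Return 1 (high), 2 (moderate), or 3 (low) priority."""
    # Assumption: common or low-quality CNVs are demoted before other checks.
    if ev.common_in_gnomad or ev.low_quality_region:
        return 3
    if ev.clingen_overlap == "full" or ev.gene_evidence == "strong":
        return 1
    if ev.clingen_overlap == "partial" or ev.gene_evidence == "moderate":
        return 2
    return 3  # insufficient evidence for prioritization
```

For example, `assign_tier(CnvEvidence("partial", "none", False, False))` returns 2 under these assumptions.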
6.3 Gene–Disease Association Weighting
CNVs overlapping genes with strong gene–disease association (GDA) scores from Open Targets are prioritized. Additional weighting is applied for genes with:
- High intolerance to loss-of-function (pLI ≥ 0.9 or LOEUF ≤ 0.35)
- ClinGen haploinsufficiency score of 3 (sufficient evidence; relevant to deletions)
- ClinGen triplosensitivity score of 3 (sufficient evidence; relevant to duplications/amplifications)
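A minimal sketch of checking which of these additional weighting criteria a gene meets is shown below. The function name, argument names, and returned labels are hypothetical, and the actual weights applied by IMPACT-CNV are not specified here, so the sketch only reports which criteria are satisfied. It treats a ClinGen score of exactly 3 as the qualifying value, since 3 is the top of the evidence scale.

```python
def weighting_criteria_met(pli=None, loeuf=None, hi_score=None, ts_score=None):
    """Return the list of additional weighting criteria a gene satisfies.

    Missing values (None) are treated as not meeting the criterion.
    """
    met = []
    lof_intolerant = (pli is not None and pli >= 0.9) or (
        loeuf is not None and loeuf <= 0.35
    )
    if lof_intolerant:
        met.append("lof_intolerant")
    if hi_score == 3:
        met.append("haploinsufficient")  # relevant to deletions
    if ts_score == 3:
        met.append("triplosensitive")    # relevant to duplications/amplifications
    return met


criteria = weighting_criteria_met(pli=0.99, loeuf=0.20, hi_score=3, ts_score=0)
```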
7 Output Format
The final output of the IMPACT-CNV module consists of per-sample CNV_IMPACT.txt files in SCIP-compatible format, consolidating:
- **Variant coordinates**: Chromosome, start, end, copy number state
- **Functional annotations**: Gene overlap, regulatory context
- **SCIP annotations**: Dosage sensitivity, OMIM associations
- **QC metrics**: Read-depth validation results, coverage statistics
- **Priority tier**: High/Moderate/Low classification
- **Evidence summary**: Concise reasoning for prioritization
File Structure: Tab-separated values with columns: chr, start, end, cn, genes, tier, qc_status, evidence_summary
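To illustrate how a downstream consumer might read this layout, the sketch below parses a tab-separated file with the columns listed above. It assumes the file starts with a header row using exactly those column names and that the tier column holds the labels High, Moderate, or Low; neither detail is confirmed here. The function name `read_cnv_impact`, the `qc_status` value, and the sample rows are made up for the example.

```python
import csv
import io

COLUMNS = ["chr", "start", "end", "cn", "genes", "tier", "qc_status", "evidence_summary"]


def read_cnv_impact(handle):
    """Parse a CNV_IMPACT-style TSV into a list of dicts with typed coordinates."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    records = []
    for row in reader:
        for key in ("start", "end", "cn"):
            row[key] = int(row[key])  # coordinates and copy number are integers
        records.append(row)
    return records


# Illustrative data only; gene names and evidence text are placeholders.
example = io.StringIO(
    "\t".join(COLUMNS) + "\n"
    "chr1\t1000\t5000\t1\tGENE_A\tHigh\tPASS\tplaceholder summary\n"
    "chr2\t200\t900\t3\tGENE_B\tLow\tPASS\tplaceholder summary\n"
)
records = read_cnv_impact(example)
high_priority = [r for r in records if r["tier"] == "High"]
```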
7.1 Integration with IMPACT-VIS
CNV_IMPACT.txt files are formatted for direct input into the IMPACT-VIS visualization module, enabling:
- **Unified variant review**: CNVs visualized alongside SNVs and SVs
- **Quality-aware display**: QC-flagged variants highlighted for manual review
- **Phenotype-aware prioritization**: High-priority CNVs automatically surfaced
- **Interactive exploration**: Click-through to genes, dosage maps, disease databases
- **Systematic curation**: Per-sample classifications and notes persisted for reproducibility
8 Manual Review Workflow
All high-priority CNVs identified by IMPACT-CNV undergo manual review for validation:
Curation Steps:
1. Verify read-depth and breakpoint evidence via CRAM inspection
2. Assess gene overlap and functional impact
3. Consult disease databases (OMIM, ClinGen, GenCC)
4. Assign final pathogenicity assessment
5. Document evidence and classification rationale
9 Computational Efficiency
The CNV module maintains efficiency through:
- Vectorized operations: Batch processing of CNVs via SCIP backend
- Lazy annotation: QC validation performed only on variants meeting initial filter criteria
- Caching: Pre-computed dosage maps and constraint scores loaded once
- Parallelization: Per-sample processing enables concurrent execution
Typical Runtime: 30-120 seconds per sample (varies with CNV count and CRAM file size)
Memory Usage: <2 GB per sample due to streaming CRAM access and lazy annotation loading
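The caching and per-sample parallelization ideas can be sketched with the Python standard library. This is an illustration, not the module's code: `load_dosage_map`, `process_sample`, and `process_cohort` are hypothetical names, the dosage map contents are placeholders, and the per-sample function only returns the expected output filename instead of doing real work. A thread pool is used for simplicity; CPU-bound processing would more likely use separate processes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1)
def load_dosage_map():
    """Placeholder for loading pre-computed dosage maps; cached after the first call."""
    return {"GENE_A": 3, "GENE_B": 1}


def process_sample(sample_id):
    """Stand-in for per-sample CNV processing; returns the expected output filename."""
    load_dosage_map()  # served from the cache, not reloaded per sample
    return sample_id, f"{sample_id}_CNV_IMPACT.txt"


def process_cohort(sample_ids, max_workers=4):
    """Process samples concurrently and map each sample ID to its output file."""
    load_dosage_map()  # warm the cache once before the workers start
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_sample, sample_ids))


outputs = process_cohort(["S1", "S2", "S3"])
```

Warming the cache before starting the pool avoids several workers loading the same resource at the same moment on first use.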