Data Preparation

Format, validate, and organize input data for IMPACT-VIS

Authors: Nicholas Boehler (University of Toronto Mississauga); Hai-Ying Mary Cheng (University of Toronto Mississauga)

Published: December 18, 2025

1 Overview

Important: Prerequisites (Upstream Preprocessing)

This guide assumes your variant data has already been preprocessed by the IMPACT pipeline modules:

  • IMPACT-SNV produces: *_SNV_IMPACT.gds files
  • IMPACT-SV produces: *_SV_IMPACT.tsv files
  • IMPACT-CNV produces: *_CNV_IMPACT.txt files

If you only have raw VCF files, you must first run the upstream preprocessing pipelines:

  • IMPACT-SNV for SNV/Indel VCF → GDS conversion and annotation
  • IMPACT-SV for SV VCF → AnnotSV annotation
  • IMPACT-CNV for CNV VCF → SCIP processing

This page focuses on organizing already-preprocessed files for IMPACT-VIS visualization.

This guide walks you through preparing genomic variant data for use with IMPACT-VIS. IMPACT-VIS accepts three data types:

  1. SNV/Indel Data: Genomic Data Structure (GDS) files (from IMPACT-SNV)
  2. Structural Variants: AnnotSV-annotated TSV files (from IMPACT-SV)
  3. Copy Number Variants: Tab-separated text files (from IMPACT-CNV)

All files must be placed in a single flat directory (app/data/) and named according to the convention described in Section 3.

2 Quick Reference: File Checklist

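Before loading data into IMPACT-VIS, verify you have:

  • One {sample_id}_SNV_IMPACT.gds file per sample (required)
  • Optionally, a {sample_id}_SV_IMPACT.tsv and/or {sample_id}_CNV_IMPACT.txt file with the same sample ID
  • All files placed directly in app/data/ (no subfolders)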

Troubleshooting: If files don’t appear in the sample selector after starting IMPACT-VIS, verify file permissions and naming convention, then click “Refresh Data” in the Sample Input panel.
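The two checks named in the troubleshooting note (naming convention and file permissions) can be scripted before starting the app. The following is a minimal sketch, not part of IMPACT-VIS: the function name check_data_file and the regular expression are illustrative assumptions based on the naming convention described in this guide. The automatically created *_variant_states.rds file is intentionally not covered by the pattern.

```python
import os
import re

# Naming convention from this guide: {sample_id}_{TYPE}_IMPACT.{ext}
# (hypothetical pattern; the app's own matching logic may differ)
NAME_PATTERN = re.compile(
    r"^(?P<sample_id>.+)_(?P<type>SNV|SV|CNV)_IMPACT\.(?P<ext>gds|tsv|txt)$"
)
EXPECTED_EXT = {"SNV": "gds", "SV": "tsv", "CNV": "txt"}


def check_data_file(path):
    """Return a list of human-readable problems for one candidate data file."""
    problems = []
    name = os.path.basename(path)
    match = NAME_PATTERN.match(name)
    if match is None:
        problems.append(f"{name}: does not match {{sample_id}}_{{TYPE}}_IMPACT.{{ext}}")
    elif EXPECTED_EXT[match["type"]] != match["ext"]:
        problems.append(f"{name}: {match['type']} files should end in .{EXPECTED_EXT[match['type']]}")
    if not os.path.isfile(path):
        problems.append(f"{name}: file not found")
    elif not os.access(path, os.R_OK):
        problems.append(f"{name}: not readable by the current user")
    return problems
```

An empty list means the file passed both checks; after fixing any reported problems, click “Refresh Data” as described above.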

3 File Organization

3.1 Directory Structure

IMPACT-VIS expects all sample data files to be placed directly in the app/data/ directory (not in subdirectories). Files are discovered and grouped by sample ID based on their filename pattern.

app/data/
├── sample001_SNV_IMPACT.gds
├── sample001_SNV_IMPACT.gds_variant_states.rds
├── sample001_SV_IMPACT.tsv
├── sample001_CNV_IMPACT.txt
├── sample002_SNV_IMPACT.gds
├── sample002_SNV_IMPACT.gds_variant_states.rds
├── sample002_SV_IMPACT.tsv
├── sample002_CNV_IMPACT.txt
└── ... (more samples)

Key Rules:

  • All files go in the app/data/ root directory (no subfolders)
  • Files must follow the naming convention: {sample_id}_{TYPE}_IMPACT.{ext}
  • IMPACT-VIS automatically discovers samples by scanning for *_SNV_IMPACT.gds files
  • The sample ID is extracted by removing the _SNV_IMPACT.gds suffix from filenames
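The discovery rule above (scan the root of app/data/ for *_SNV_IMPACT.gds and strip the suffix) can be previewed outside the app. This is a minimal sketch of that rule, assuming a hypothetical helper named discover_sample_ids; it is not the app's own implementation.

```python
from pathlib import Path

SNV_SUFFIX = "_SNV_IMPACT.gds"


def discover_sample_ids(data_dir):
    """Return sorted sample IDs found directly in data_dir (subfolders are not scanned)."""
    sample_ids = []
    for entry in Path(data_dir).iterdir():
        # Only regular files whose name ends exactly with the SNV suffix count;
        # the *_variant_states.rds companion file ends in .rds and is skipped.
        if entry.is_file() and entry.name.endswith(SNV_SUFFIX):
            sample_ids.append(entry.name[: -len(SNV_SUFFIX)])
    return sorted(sample_ids)
```

Under this rule a sample that has only SV or CNV files, or whose files sit in a subfolder, would not be listed.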

3.2 File Naming Convention

| Data Type | File Pattern | Required? |
|---|---|---|
| SNV/Indel | {sample_id}_SNV_IMPACT.gds | Yes (or data won’t load) |
| Annotation State | {sample_id}_SNV_IMPACT.gds_variant_states.rds | No (created automatically) |
| Structural Variants | {sample_id}_SV_IMPACT.tsv | No (SV tab will be empty) |
| Copy Number Variants | {sample_id}_CNV_IMPACT.txt | No (CNV tab will be empty) |
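To see at a glance which files a given sample has, the naming table above can be turned into a small lookup. This is an illustrative sketch only; sample_file_status and the status strings are assumptions introduced here, not IMPACT-VIS functions.

```python
from pathlib import Path

# (data type, filename suffix, required?) copied from the naming table above
FILE_TYPES = [
    ("SNV/Indel", "_SNV_IMPACT.gds", True),
    ("Annotation State", "_SNV_IMPACT.gds_variant_states.rds", False),
    ("Structural Variants", "_SV_IMPACT.tsv", False),
    ("Copy Number Variants", "_CNV_IMPACT.txt", False),
]


def sample_file_status(data_dir, sample_id):
    """Map each data type to 'present', 'missing (required)', or 'missing (optional)'."""
    status = {}
    for label, suffix, required in FILE_TYPES:
        if (Path(data_dir) / f"{sample_id}{suffix}").is_file():
            status[label] = "present"
        else:
            status[label] = "missing (required)" if required else "missing (optional)"
    return status
```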

4 GDS File Format (SNV/Indel)

4.1 Overview

GDS (Genomic Data Structure) is an optimized binary format for genomic variant data, accessed via the SeqArray package.

Note

GDS files are created from VCF files using SeqArray’s seqVCF2GDS() function in the IMPACT-SNV preprocessing module.

4.2 Required GDS nodes (what IMPACT-VIS validates)

IMPACT-VIS validates that a candidate SNV/Indel GDS file can be opened with SeqArray and has non-zero variant and sample counts.

At minimum, the file must contain these core SeqArray nodes:

| Node (SeqArray path) | Type (typical) | Description |
|---|---|---|
| variant.id | integer/character | Variant identifier vector used for filtering and subsetting |
| chromosome | integer/character | Per-variant chromosome identifier (stored as numeric in many SeqArray pipelines) |
| position | integer | 1-based genomic position |
| sample.id | character | Sample identifiers |
| genotype | integer | Genotype array (required by validator; used implicitly by SeqArray format) |

Note: Alleles in SeqArray

Standard SeqArray GDS files store alleles in a single allele node (reference and alternate alleles as one comma-separated string per variant, e.g., A,G), not in separate ref and alt nodes. IMPACT-VIS does not currently require allele for plotting, but it is expected in standard SeqArray GDS.

4.3 Required annotations (what the app actually reads)

IMPACT-VIS expects IMPACT-SNV/FAVORannotator-style annotation nodes under annotation/info/.

Validation requirement (hard): the validator requires at least one of:

  • annotation/info/impact_score, or
  • annotation/info/impact_score_calc

and it requires:

  • annotation/info/FunctionalAnnotation/VarInfo
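The validation rules described so far (core nodes, non-zero counts, one of two score nodes, and VarInfo) can be expressed as a simple set check. The sketch below is an assumption-laden illustration in Python: it takes a list of node paths and the variant/sample counts as plain inputs (obtained however you inspect the GDS, for example with SeqArray in R), and the function name missing_gds_requirements is hypothetical.

```python
CORE_NODES = {"variant.id", "chromosome", "position", "sample.id", "genotype"}
SCORE_NODES = {
    "annotation/info/impact_score",
    "annotation/info/impact_score_calc",
}
VARINFO_NODE = "annotation/info/FunctionalAnnotation/VarInfo"


def missing_gds_requirements(node_paths, n_variants, n_samples):
    """Return a list of unmet requirements; an empty list means all checks pass."""
    nodes = set(node_paths)
    problems = [f"missing core node: {node}" for node in sorted(CORE_NODES - nodes)]
    if not (SCORE_NODES & nodes):
        problems.append("missing score node: need impact_score or impact_score_calc")
    if VARINFO_NODE not in nodes:
        problems.append(f"missing annotation node: {VARINFO_NODE}")
    if n_variants <= 0:
        problems.append("file contains no variants")
    if n_samples <= 0:
        problems.append("file contains no samples")
    return problems
```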

Fields used for plotting/tooltips and filtering: when present, the loader reads the following nodes to build the SNV dataframe used in the plot and variant modal.

| Node (SeqArray path) | Used for | Notes |
|---|---|---|
| annotation/info/impact_score | Ranking + y-axis | Variants are ranked by this score and the top N are visualized |
| annotation/info/impact_score_calc | Tooltip | Displayed as the IMPACT scoring method/version |
| annotation/info/tier | Filtering + coloring | Tier filtering is applied on-disk via SeqArray filters |
| annotation/info/FunctionalAnnotation/VarInfo | Tooltip | Functional consequence string (e.g., VEP consequence terms) |
| annotation/info/FunctionalAnnotation/genecode_comprehensive_info | Gene filter + tooltip | Used for gene-based filtering and display |
| annotation/info/FunctionalAnnotation/clnsig | ClinVar filter + tooltip | Used for ClinVar-based filtering; stored values may use pipes/underscores |
| annotation/info/FunctionalAnnotation/clndn | Tooltip | ClinVar disease name |
| annotation/info/FunctionalAnnotation/bravo_af | Frequency filter + tooltip | Used for optional Bravo AF threshold filtering |
| annotation/info/FunctionalAnnotation/aloft_description | Tooltip | Note: IMPACT-VIS currently reads aloft_description (not aloft_prediction) |
| annotation/info/FunctionalAnnotation/aloft_value | Tooltip | ALoFT numerical score |
| $dosage | Tooltip (REF/ALT counts) | Dosage matrices are used to infer genotype counts for the first sample |
| $dosage_alt | Genotype filter + point shape | Used for genotype filtering (het vs hom-alt) and point shape (diamond for ALT ≥ 2) |

Important: Current genotype assumption in IMPACT-VIS

IMPACT-VIS treats the GDS as effectively single-sample for genotype-derived columns by taking the first sample in $dosage/$dosage_alt.

If you provide a multi-sample GDS, the app will still validate it, but genotype-driven fields (REF/ALT/Shape and genotype filters) will reflect only the first sample.
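To make the first-sample behavior concrete, the sketch below mimics it on plain Python lists. It assumes the dosage matrices are laid out with one row per sample and one column per variant, and that non-diamond points are drawn as circles; both the function name and the "circle" label are illustrative assumptions rather than the app's code.

```python
def first_sample_genotypes(dosage_ref, dosage_alt):
    """Build per-variant REF/ALT/Shape records from the first sample only.

    dosage_ref / dosage_alt: lists of rows (one row per sample, one value per variant).
    """
    if not dosage_ref or not dosage_alt:
        return []
    ref_counts, alt_counts = dosage_ref[0], dosage_alt[0]  # rows for later samples are ignored
    records = []
    for ref, alt in zip(ref_counts, alt_counts):
        shape = "diamond" if alt >= 2 else "circle"  # diamond marks ALT >= 2
        records.append({"REF": ref, "ALT": alt, "Shape": shape})
    return records
```

With a two-sample input, only the first row contributes, which is why a multi-sample GDS shows genotypes for a single individual.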

Note: ClinVar term matching

For ClinVar filtering, IMPACT-VIS normalizes both UI selections and clnsig values by lowercasing and collapsing punctuation/underscores/spaces. A record can contain multiple ClinVar terms (e.g., Pathogenic|Likely_pathogenic), and a variant matches if any term matches the selected filter.
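The matching rule can be sketched in a few lines. This is an approximation under stated assumptions: "collapsing" is interpreted as replacing each run of non-alphanumeric characters with a single space, multiple terms are assumed to be pipe-separated, and the function names are hypothetical.

```python
import re


def normalize_term(term):
    """Lowercase and collapse runs of punctuation, underscores, and spaces into one space."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def clnsig_matches(clnsig, selected_terms):
    """True if any pipe-separated term in clnsig equals any selected filter term."""
    wanted = {normalize_term(term) for term in selected_terms}
    record_terms = {normalize_term(term) for term in clnsig.split("|")}
    record_terms.discard("")  # ignore empty pieces such as a trailing pipe
    return bool(wanted & record_terms)
```

For example, a record stored as Pathogenic|Likely_pathogenic matches a UI selection of "Likely pathogenic", while Likely_benign does not match "Pathogenic" because whole terms are compared after normalization.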

5 SV Format (AnnotSV TSV)

5.1 Overview

Structural variant (SV) data are provided as tab-delimited AnnotSV tables generated by the IMPACT-SV preprocessing pipeline. IMPACT-VIS does not accept raw SV VCF directly; it expects the AnnotSV TSV output (with IMPACT-specific QC annotations).

Internally, IMPACT-VIS uses the TSV for:

  • Validation (presence of core AnnotSV columns and supported SV types).
  • Plotting SVs by type across the genome and filtering by per-variant QC.
  • Optional genotype-driven display/filtering using the first sample genotype column.

5.2 Required Columns

The following columns must be present for IMPACT-VIS to validate and load the SV file:

| Column | Type | Example | Description |
|---|---|---|---|
| AnnotSV_ID | character | TEST_10_1368865_1368865_INS_010 | Unique variant identifier (auto-generated by AnnotSV) |
| SV_chrom | character | 10, 11, X | Chromosome (1-22, X, Y) |
| SV_start | integer | 1368865 | Start position (1-based) |
| SV_end | integer | 1368865 | End position (1-based) |
| SV_type | character | DEL, DUP, INS, INV, TRA, BND | Structural variant type (must be one of: DEL, DUP, INS, INV, TRA, BND, CNV) |
| Samples_ID | character | test_sample | Sample identifier |

Note: Coordinate rule

For validation, IMPACT-VIS requires SV_start < SV_end in the first 100 rows it inspects.

This means that breakend-style rows (e.g., some BND/TRA representations where start == end) may currently fail validation.
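You can pre-check an SV table against the rules in this section before loading it. The sketch below is a hedged approximation, not the app's validator: validate_sv_tsv is a hypothetical name, and applying the SV_type check to the same first 100 rows as the coordinate check is an assumption.

```python
import csv
import io

REQUIRED_COLUMNS = ["AnnotSV_ID", "SV_chrom", "SV_start", "SV_end", "SV_type", "Samples_ID"]
SUPPORTED_TYPES = {"DEL", "DUP", "INS", "INV", "TRA", "BND", "CNV"}


def validate_sv_tsv(text, max_rows=100):
    """Return a list of problems found in the header and the first max_rows data rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return [f"missing required columns: {', '.join(missing)}"]
    problems = []
    for row_number, row in enumerate(reader, start=1):
        if row_number > max_rows:
            break
        if row["SV_type"] not in SUPPORTED_TYPES:
            problems.append(f"row {row_number}: unsupported SV_type {row['SV_type']!r}")
        try:
            start, end = int(row["SV_start"]), int(row["SV_end"])
        except (TypeError, ValueError):
            problems.append(f"row {row_number}: non-integer coordinates")
            continue
        if not start < end:
            problems.append(f"row {row_number}: SV_start must be < SV_end")
    return problems
```

Running this on a file where a row has identical start and end positions reports that row, which is the situation the coordinate note warns about.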

5.3 Columns Used by IMPACT-VIS

In addition to the required columns above, the following columns are used by the UI/plotting layer when present:

| Column | Why it matters in IMPACT-VIS |
|---|---|
| READ_SUPPORT_FILTERING | IMPACT-SV QC label. By default, the SV plot shows only rows with PASSED; users can enable “Show variants failing QC” to include the rest. |
| ACMG_class | Enables the “ACMG Class” filter for SVs (values are treated as strings; NA is handled explicitly). |
| Gene_name | Displayed in the SV modal and tooltips; commonly a semicolon-delimited gene list from AnnotSV. |
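The default QC behavior and the semicolon-delimited gene list can be illustrated as follows. This is a minimal sketch on rows represented as dictionaries keyed by column name; the function names are hypothetical, and how the app treats rows that lack READ_SUPPORT_FILTERING entirely is not specified here (this sketch hides them unless failing rows are shown).

```python
def visible_sv_rows(rows, show_failing_qc=False):
    """Keep only rows labeled PASSED unless the 'show failing QC' option is enabled."""
    if show_failing_qc:
        return list(rows)
    return [row for row in rows if row.get("READ_SUPPORT_FILTERING") == "PASSED"]


def gene_list(row):
    """Split a semicolon-delimited Gene_name value into individual gene symbols."""
    raw = row.get("Gene_name") or ""
    return [gene.strip() for gene in raw.split(";") if gene.strip()]
```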

5.4 Genotype column (how IMPACT-VIS reads it)

SV genotype in IMPACT-VIS is not sourced from a fixed column name. Instead, the app looks for the VCF-style columns:

  • It finds the FORMAT column.
  • It then treats the next column (typically the sample column, e.g., test_sample) as the genotype-bearing field.
  • The genotype is parsed as the substring before the first : and normalized by converting | to /. Phase information is therefore not retained; future versions may support phased genotypes.

This supports optional filtering (e.g., heterozygous vs homozygous alt) and display (diamond vs circle marker).
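The three lookup steps above can be sketched as a small parser. The function names are hypothetical, and the zygosity labels (including treating any two non-reference alleles as hom-alt) are simplifying assumptions for illustration, not the app's definition.

```python
def sv_genotype(header, fields):
    """Return the normalized genotype for one row, or None if it cannot be located.

    header: list of column names; fields: list of values for one row.
    """
    try:
        sample_index = header.index("FORMAT") + 1  # sample column follows FORMAT
    except ValueError:
        return None
    if sample_index >= len(fields):
        return None
    # Genotype is the text before the first ':'; phased '|' becomes '/'.
    return fields[sample_index].split(":", 1)[0].replace("|", "/")


def zygosity(genotype):
    """Classify a diploid genotype string such as '0/1' (assumed labels)."""
    if not genotype:
        return "unknown"
    alleles = genotype.split("/")
    if len(alleles) != 2 or "." in alleles:
        return "unknown"
    alt_count = sum(allele != "0" for allele in alleles)
    return {0: "hom-ref", 1: "het", 2: "hom-alt"}[alt_count]
```

For a row whose FORMAT value is GT:DP and whose sample value is 0|1:12, sv_genotype returns 0/1. If the FORMAT column has been dropped from the table, no genotype can be recovered.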

5.5 Optional but Useful Columns

These columns contain valuable annotations when present:

| Column | Description |
|---|---|
| CytoBand | Cytogenetic location (e.g., p15.3, q21) |
| Annotation_mode | full, split, or overlap |
| Gene_count | Number of genes overlapped |
| ACMG_class | ACMG pathogenicity classification |
| AnnotSV_ranking_score | Ranking score for prioritization |
| AnnotSV_ranking_criteria | Criteria used for ranking |
| B_loss_source | Population loss-of-function evidence |
| P_loss_phen | Phenotypes associated with losses |
| P_loss_hpo | HPO terms for loss-related phenotypes |

Tip: Keep the AnnotSV header intact

IMPACT-VIS reads the TSV with the original column names (it does not aggressively rename columns for you). Avoid exporting “cleaned” tables that change headers or drop FORMAT/sample columns if you want genotype-aware filtering.

6 CNV Format (TXT)

6.1 Overview

Copy number variant (CNV) data are provided as tab-delimited text in a curation format with no header row. Each row represents a curator action on a CNV, including QC status, classification, and evidence notes.

Note

This is NOT raw CNV data with genomic coordinates. Instead, it is a curation and classification record generated by IMPACT-CNV preprocessing and refined through manual review in IMPACT-VIS.

6.2 Column Structure

The CNV TXT file contains exactly 7 tab-separated columns (no header):

| Column # | Field Name | Type | Description |
|---|---|---|---|
| 1 | Timestamp | numeric | Unix timestamp of record creation |
| 2 | Sample_Interval | character | Variant ID: sample.chr.start.end.type (encoded genomic info) |
| 3 | Interpretation | character | QC result: Passed or Failed |
| 4 | Classification | character | Curation classification (Ruled Out, Further Review, etc.) |
| 5 | Evidence | character | Curated evidence summary or NA |
| 6 | Date_Time | character | Last modification timestamp (ISO 8601 format) |
| 7 | Username | character | Curator username |

6.3 Classification Categories

Common classifications found in the Classification column:

| Classification | Meaning | Recommended Action |
|---|---|---|
| Ruled Out - Quality Inadequate / Difficult to Assess | Failed QC checks | Exclude from reporting |
| Ruled Out - Incorrect Boundary, Fully Intronic | Breakpoints don’t overlap exons | Exclude from reporting |
| Ruled Out - Population Variation | Known benign CNV (gnomAD, etc.) | Exclude from reporting |
| Further Review - Not Likely Reportable | Likely benign; documented for completeness | Consider context before reporting |
| Further Review - Potentially Reportable | Requires additional review; possible pathogenicity | Prioritize for manual review |
| Not Evaluated | CNV not yet curated | Pending review |
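If you post-process curation files outside the app, the table above can be encoded as a lookup. This is only an illustrative sketch; the dictionary mirrors the table, while the function name and the fallback string for unlisted values are assumptions.

```python
# Classification -> recommended action, copied from the table above
RECOMMENDED_ACTION = {
    "Ruled Out - Quality Inadequate / Difficult to Assess": "Exclude from reporting",
    "Ruled Out - Incorrect Boundary, Fully Intronic": "Exclude from reporting",
    "Ruled Out - Population Variation": "Exclude from reporting",
    "Further Review - Not Likely Reportable": "Consider context before reporting",
    "Further Review - Potentially Reportable": "Prioritize for manual review",
    "Not Evaluated": "Pending review",
}


def recommended_action(classification):
    """Look up the recommended action; unlisted values get a placeholder string."""
    return RECOMMENDED_ACTION.get(classification.strip(), "Unknown classification")
```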

6.4 CNV File Example

Here’s a properly formatted CNV file with real test data:

1752094007.21964	test_sample.12.8863035.8865835.DEL	Failed	Ruled Out - Quality Inadequate / Difficult to Assess	NA	2025-07-09T16:46:47Z	nboehler
1752092239.12654	test_sample.12.2051740.2054739.DEL	Passed	Further Review - Potentially Reportable	Het DEL of CACNA1C. Further Review needed	2025-07-09T16:17:19Z	nboehler
1752091645.91913	test_sample.5.76790253.76840052.DUP	Passed	Further Review - Potentially Reportable	DUP of F2RL1. Further Review needed.	2025-07-09T16:07:25Z	nboehler
1752062920.41666	test_sample.7.33090339.33148281.DUP	Passed	Further Review - Potentially Reportable	DUP overlaps BBS9, RP9. No clear pathogenicity known.	2025-07-09T08:08:40Z	nboehler
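Because the file has no header, each line has to be split on tabs and mapped to the seven fields by position. The following sketch shows one way to do that in Python; CnvRecord and parse_cnv_line are hypothetical names, and rejecting Interpretation values other than Passed or Failed is an assumption drawn from the column description.

```python
from typing import NamedTuple


class CnvRecord(NamedTuple):
    timestamp: float
    sample_interval: str
    interpretation: str
    classification: str
    evidence: str
    date_time: str
    username: str


def parse_cnv_line(line):
    """Parse one tab-separated CNV curation line into a CnvRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated columns, found {len(fields)}")
    if fields[2] not in ("Passed", "Failed"):
        raise ValueError(f"unexpected Interpretation value: {fields[2]!r}")
    return CnvRecord(float(fields[0]), *fields[1:])
```

A column-count error from this parser usually means tabs were converted to spaces when the file was copied or edited.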

6.5 Decoding Sample_Interval

The Sample_Interval field (column 2) compactly encodes genomic location and variant type:

             ┌chr               ┌type
test_sample.12.8863035.8865835.DEL
     └sample      └start  └end 

Breakdown:

  • sample: Sample identifier (must match filename prefix)
  • chr: Chromosome (1-22, X, Y)
  • start: Start position (1-based integer)
  • end: End position (1-based integer)
  • type: Variant type (DEL = deletion, DUP = duplication)

This encoding allows IMPACT-VIS to extract coordinates without separate numeric columns.
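Decoding can be sketched by splitting the field on its dots. The helper below is an illustration, not the app's parser; it splits from the right so that a sample ID that itself contains dots (an assumption about possible inputs) would still decode correctly.

```python
def decode_sample_interval(value):
    """Split 'sample.chr.start.end.type' into its five components."""
    parts = value.rsplit(".", 4)  # split from the right: the last four fields are fixed
    if len(parts) != 5:
        raise ValueError(f"cannot decode Sample_Interval: {value!r}")
    sample, chrom, start, end, variant_type = parts
    return {
        "sample": sample,
        "chr": chrom,
        "start": int(start),
        "end": int(end),
        "type": variant_type,
    }
```

For the example above, this yields sample test_sample, chromosome 12, start 8863035, end 8865835, and type DEL.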

7 Next Steps

  • Ready to load? Start IMPACT-VIS and follow Quick Start
  • Curious about algorithms? See Methods

Document Version: 1.0.0
Last Updated: 2025-12-10