Data Preparation

Format, validate, and organize input data for IMPACT-VIS

Authors	Affiliation
Nicholas Boehler	University of Toronto Mississauga
Hai-Ying Mary Cheng	University of Toronto Mississauga

Published: December 18, 2025

1 Overview

Prerequisites: Upstream Preprocessing

This guide assumes your variant data has already been preprocessed by IMPACT pipeline modules:

IMPACT-SNV produces: *_SNV_IMPACT.gds files
IMPACT-SV produces: *_SV_IMPACT.tsv files
IMPACT-CNV produces: *_CNV_IMPACT.txt files

If you only have raw VCF files, you must first run the upstream preprocessing pipelines:

IMPACT-SNV for SNV/Indel VCF → GDS conversion and annotation
IMPACT-SV for SV VCF → AnnotSV annotation
IMPACT-CNV for CNV VCF → SCIP processing

This page focuses on organizing already-preprocessed files for IMPACT-VIS visualization.

This guide walks you through preparing genomic variant data for use with IMPACT-VIS. IMPACT-VIS accepts three data types:

SNV/Indel Data: Genomic Data Structure (GDS) files (from IMPACT-SNV)
Structural Variants: AnnotSV-annotated TSV files (from IMPACT-SV)
Copy Number Variants: Tab-separated text files (from IMPACT-CNV)

All data should be organized in a standardized directory structure.

2 Quick Reference: File Checklist

Before loading data into IMPACT-VIS, verify you have:

SNV/Indel GDS file (required): {sample_id}_SNV_IMPACT.gds
Annotation state file (optional, created automatically if missing): {sample_id}_SNV_IMPACT.gds_variant_states.rds
SV file (optional): {sample_id}_SV_IMPACT.tsv (AnnotSV format)
CNV file (optional): {sample_id}_CNV_IMPACT.txt (SCIP format)
File location: All files in app/data/ root directory (NOT subdirectories)
File permissions: Readable by the user running IMPACT-VIS

Troubleshooting: If files don’t appear in the sample selector after starting IMPACT-VIS, verify file permissions and naming convention, then click “Refresh Data” in the Sample Input panel.

3 File Organization

3.1 Directory Structure

IMPACT-VIS expects all sample data files to be placed directly in the app/data/ directory (not in subdirectories). Files are discovered and grouped by sample ID based on their filename pattern.

app/data/
├── sample001_SNV_IMPACT.gds
├── sample001_SNV_IMPACT.gds_variant_states.rds
├── sample001_SV_IMPACT.tsv
├── sample001_CNV_IMPACT.txt
├── sample002_SNV_IMPACT.gds
├── sample002_SNV_IMPACT.gds_variant_states.rds
├── sample002_SV_IMPACT.tsv
├── sample002_CNV_IMPACT.txt
└── ... (more samples)

Key Rules:

- All files go in the app/data/ root directory (no subfolders)
- Files must follow the naming convention: {sample_id}_{TYPE}_IMPACT.{ext}
- IMPACT-VIS automatically discovers samples by scanning for *_SNV_IMPACT.gds files
- The sample ID is extracted by removing the _SNV_IMPACT.gds suffix from filenames
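The discovery rule above can be sketched in a few lines of Python. This is an illustration of the naming logic only, not the IMPACT-VIS implementation; the function names `sample_ids_from_filenames` and `discover_samples` are hypothetical.

```python
from pathlib import Path

SNV_SUFFIX = "_SNV_IMPACT.gds"


def sample_ids_from_filenames(filenames):
    """Return sorted sample IDs for names that end in _SNV_IMPACT.gds."""
    ids = set()
    for name in filenames:
        # Require a non-empty sample ID in front of the suffix.
        if name.endswith(SNV_SUFFIX) and len(name) > len(SNV_SUFFIX):
            ids.add(name[: -len(SNV_SUFFIX)])
    return sorted(ids)


def discover_samples(data_dir="app/data"):
    """Scan only the top level of data_dir; subfolders are not searched."""
    root = Path(data_dir)
    return sample_ids_from_filenames(p.name for p in root.iterdir() if p.is_file())
```

Running `discover_samples()` before starting the app is a quick way to confirm that each sample appears under the ID you expect.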

3.2 File Naming Convention

Data Type	File Pattern	Required?
SNV/Indel	`{sample_id}_SNV_IMPACT.gds`	Yes (or data won’t load)
Annotation State	`{sample_id}_SNV_IMPACT.gds_variant_states.rds`	No (created automatically)
Structural Variants	`{sample_id}_SV_IMPACT.tsv`	No (SV tab will be empty)
Copy Number Variants	`{sample_id}_CNV_IMPACT.txt`	No (CNV tab will be empty)
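As a pre-flight check, the table above can be turned into a small script that reports missing or unreadable files for one sample. This is a sketch under the assumption that a missing required file is an error and a missing optional file is only a warning; `PATTERNS`, `expected_files`, and `check_sample` are hypothetical names, not part of IMPACT-VIS.

```python
import os
from pathlib import Path

# key -> (filename pattern, required?), mirroring the naming table
PATTERNS = {
    "snv": ("{sid}_SNV_IMPACT.gds", True),
    "state": ("{sid}_SNV_IMPACT.gds_variant_states.rds", False),
    "sv": ("{sid}_SV_IMPACT.tsv", False),
    "cnv": ("{sid}_CNV_IMPACT.txt", False),
}


def expected_files(sample_id):
    """Map each data type to the filename expected for sample_id."""
    return {key: pat.format(sid=sample_id) for key, (pat, _) in PATTERNS.items()}


def check_sample(sample_id, data_dir="app/data"):
    """Return (errors, warnings) for one sample's files in data_dir."""
    errors, warnings = [], []
    for pattern, required in PATTERNS.values():
        path = Path(data_dir) / pattern.format(sid=sample_id)
        if not path.is_file():
            (errors if required else warnings).append(f"missing: {path.name}")
        elif not os.access(path, os.R_OK):
            errors.append(f"not readable: {path.name}")
    return errors, warnings
```

For example, `check_sample("sample001")` returns two empty lists when all four files are present and readable by the current user.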

4 GDS File Format (SNV/Indel)

4.1 Overview

GDS (Genomic Data Structure) is an optimized binary format for genomic variant data, accessed via the SeqArray package.

Note

GDS files are created from VCF files using the SeqArray seqVCF2GDS() function in the IMPACT-SNV preprocessing module.

4.2 Required GDS nodes (what IMPACT-VIS validates)

IMPACT-VIS validates that a candidate SNV/Indel GDS file can be opened with SeqArray and has non-zero variant and sample counts.

At minimum, the file must contain these core SeqArray nodes:

Node (SeqArray path)	Type (typical)	Description
`variant.id`	integer/character	Variant identifier vector used for filtering and subsetting
`chromosome`	integer/character	Per-variant chromosome identifier (stored as numeric in many SeqArray pipelines)
`position`	integer	1-based genomic position
`sample.id`	character	Sample identifiers
`genotype`	integer	Genotype array (required by validator; used implicitly by SeqArray format)

Alleles in SeqArray

Standard SeqArray GDS stores alleles in a single allele node (one comma-separated REF,ALT string per variant), not in separate ref and alt nodes. IMPACT-VIS does not currently require allele for plotting, but it is expected in standard SeqArray GDS.

4.3 Required annotations (what the app actually reads)

IMPACT-VIS expects IMPACT-SNV/FAVORannotator-style annotation nodes under annotation/info/.

Validation requirement (hard): the validator requires at least one of:

annotation/info/impact_score, or
annotation/info/impact_score_calc

and it requires:

annotation/info/FunctionalAnnotation/VarInfo
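The validation rules in sections 4.2 and 4.3 amount to a handful of presence checks. The sketch below expresses them in Python over a list of node paths that you have already exported from the GDS file by whatever means you prefer. It does not open the GDS file itself, and `check_gds_nodes` is a hypothetical helper rather than the validator IMPACT-VIS ships.

```python
CORE_NODES = ("variant.id", "chromosome", "position", "sample.id", "genotype")
SCORE_NODES = (
    "annotation/info/impact_score",
    "annotation/info/impact_score_calc",
)
VARINFO_NODE = "annotation/info/FunctionalAnnotation/VarInfo"


def check_gds_nodes(node_paths, n_variants, n_samples):
    """Return a list of problems; an empty list means the checks passed."""
    present = set(node_paths)
    problems = []
    if n_variants <= 0 or n_samples <= 0:
        problems.append("file has zero variants or zero samples")
    for node in CORE_NODES:
        if node not in present:
            problems.append(f"missing core node: {node}")
    # At least one of the two score nodes must exist.
    if not any(node in present for node in SCORE_NODES):
        problems.append("missing impact_score and impact_score_calc")
    if VARINFO_NODE not in present:
        problems.append(f"missing annotation node: {VARINFO_NODE}")
    return problems
```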

Fields used for plotting/tooltips and filtering: when present, the loader reads the following nodes to build the SNV dataframe used in the plot and variant modal.

Node (SeqArray path)	Used for	Notes
`annotation/info/impact_score`	Ranking + y-axis	Variants are ranked by this score and the top $N$ are visualized
`annotation/info/impact_score_calc`	Tooltip	Displayed as the IMPACT scoring method/version
`annotation/info/tier`	Filtering + coloring	Tier filtering is applied on-disk via SeqArray filters
`annotation/info/FunctionalAnnotation/VarInfo`	Tooltip	Functional consequence string (e.g., VEP consequence terms)
`annotation/info/FunctionalAnnotation/genecode_comprehensive_info`	Gene filter + tooltip	Used for gene-based filtering and display
`annotation/info/FunctionalAnnotation/clnsig`	ClinVar filter + tooltip	Used for ClinVar-based filtering; stored values may use pipes/underscores
`annotation/info/FunctionalAnnotation/clndn`	Tooltip	ClinVar disease name
`annotation/info/FunctionalAnnotation/bravo_af`	Frequency filter + tooltip	Used for optional Bravo AF threshold filtering
`annotation/info/FunctionalAnnotation/aloft_description`	Tooltip	Note: IMPACT-VIS currently reads `aloft_description` (not `aloft_prediction`)
`annotation/info/FunctionalAnnotation/aloft_value`	Tooltip	ALoFT numerical score
`$dosage`	Tooltip (REF/ALT counts)	Dosage matrices are used to infer genotype counts for the first sample
`$dosage_alt`	Genotype filter + point shape	Used for genotype filtering (het vs hom-alt) and point shape (diamond for ALT$\ge 2$)

Current genotype assumption in IMPACT-VIS

IMPACT-VIS treats the GDS as effectively single-sample for genotype-derived columns by taking the first sample in $dosage/$dosage_alt.

If you provide a multi-sample GDS, the app will still validate it, but genotype-driven fields (REF/ALT/Shape and genotype filters) will reflect only the first sample.
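To make the first-sample behavior concrete, here is a small Python sketch that derives a genotype label and point shape from ALT dosages. It assumes a variant-major layout (one list per variant, one value per sample) purely for illustration. The labels and the `circle` fallback shape are assumptions; only the diamond rule for ALT dosage of 2 or more comes from the table above.

```python
def first_sample_genotype(alt_dosage_rows):
    """Return (label, shape) per variant using only the first sample.

    alt_dosage_rows: one sequence per variant holding the ALT dosage of
    each sample; None marks a missing genotype.
    """
    results = []
    for row in alt_dosage_rows:
        alt = row[0]  # samples after the first are ignored
        if alt is None:
            label = "missing"
        elif alt >= 2:
            label = "hom_alt"
        elif alt == 1:
            label = "het"
        else:
            label = "hom_ref"
        shape = "diamond" if label == "hom_alt" else "circle"
        results.append((label, shape))
    return results
```

With input `[[1, 2], [2, 0]]`, the second sample's values (2 and 0) have no effect on the result, which is the behavior to keep in mind when loading a multi-sample GDS.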

ClinVar term matching

For ClinVar filtering, IMPACT-VIS normalizes both UI selections and clnsig values by lowercasing and collapsing punctuation/underscores/spaces. A record can contain multiple ClinVar terms (e.g., Pathogenic|Likely_pathogenic), and a variant matches if any term matches the selected filter.
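A minimal Python sketch of this matching rule follows. It assumes that multi-term records are split on `|` (as in the example above) before normalization, and that a match means the whole normalized term equals a selected term. Both are assumptions about details the description leaves open, and the function names are hypothetical.

```python
import re


def normalize_term(term):
    """Lowercase and collapse punctuation, underscores, and spaces."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def clinvar_terms(clnsig):
    """Split a clnsig value on '|' and normalize each non-empty term."""
    terms = (normalize_term(part) for part in clnsig.split("|"))
    return {term for term in terms if term}


def matches_clinvar(clnsig, selected):
    """True if any term in clnsig equals any selected filter term."""
    wanted = {normalize_term(item) for item in selected}
    return bool(clinvar_terms(clnsig) & wanted)
```

Under these assumptions, `Pathogenic|Likely_pathogenic` matches a filter of "Likely pathogenic", while `Likely_pathogenic` alone does not match a filter of "Pathogenic".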

5 SV Format (AnnotSV TSV)

5.1 Overview

Structural variant (SV) data are provided as tab-delimited AnnotSV tables generated by the IMPACT-SV preprocessing pipeline. IMPACT-VIS does not accept raw SV VCF directly; it expects the AnnotSV TSV output (with IMPACT-specific QC annotations).

Internally, IMPACT-VIS uses the TSV for:

Validation (presence of core AnnotSV columns and supported SV types).
Plotting SVs by type across the genome and filtering by per-variant QC.
Optional genotype-driven display/filtering using the first sample genotype column.

5.2 Required Columns

The following columns must be present for IMPACT-VIS to validate and load the SV file:

Column	Type	Example	Description
`AnnotSV_ID`	character	TEST_10_1368865_1368865_INS_010	Unique variant identifier (auto-generated by AnnotSV)
`SV_chrom`	character	10, 11, X	Chromosome (1-22, X, Y)
`SV_start`	integer	1368865	Start position (1-based)
`SV_end`	integer	1368865	End position (1-based)
`SV_type`	character	DEL, DUP, INS, INV, TRA, BND	Structural variant type (must be one of: DEL, DUP, INS, INV, TRA, BND, CNV)
`Samples_ID`	character	test_sample	Sample identifier

Coordinate rule

For validation, IMPACT-VIS requires SV_start < SV_end in the first 100 rows it inspects.

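This means that rows where SV_start equals SV_end may currently fail validation. That includes breakend-style rows (e.g., some BND/TRA representations) and single-position insertions such as the INS example in the table above.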
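You can approximate these checks before loading a file. The Python sketch below tests for the required columns, the supported SV types, and the coordinate rule on the first 100 rows. Whether IMPACT-VIS also limits the type check to those rows is not stated here, so treat that as an assumption; `validate_sv_rows` and `validate_sv_file` are hypothetical names.

```python
import csv
from itertools import islice

REQUIRED = ("AnnotSV_ID", "SV_chrom", "SV_start", "SV_end", "SV_type", "Samples_ID")
SV_TYPES = {"DEL", "DUP", "INS", "INV", "TRA", "BND", "CNV"}


def validate_sv_rows(header, rows, max_rows=100):
    """Return a list of problems found in the header and first max_rows rows."""
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    for i, row in enumerate(islice(rows, max_rows), start=1):
        if row["SV_type"] not in SV_TYPES:
            problems.append(f"row {i}: unsupported SV_type {row['SV_type']!r}")
        try:
            start, end = int(row["SV_start"]), int(row["SV_end"])
        except (TypeError, ValueError):
            problems.append(f"row {i}: non-integer coordinates")
            continue
        if not start < end:
            problems.append(f"row {i}: SV_start ({start}) is not < SV_end ({end})")
    return problems


def validate_sv_file(path):
    """Read a tab-delimited AnnotSV table and validate it."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return validate_sv_rows(reader.fieldnames or [], reader)
```

An empty list from `validate_sv_file("app/data/sample001_SV_IMPACT.tsv")` suggests the file should pass these particular checks.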

5.3 Columns Used by IMPACT-VIS

In addition to the required columns above, the following columns are used by the UI/plotting layer when present:

Column	Why it matters in IMPACT-VIS
`READ_SUPPORT_FILTERING`	IMPACT-SV QC label. By default, the SV plot shows only rows with `PASSED`; users can enable “Show variants failing QC” to include the rest.
`ACMG_class`	Enables the “ACMG Class” filter for SVs (values are treated as strings; `NA` is handled explicitly).
`Gene_name`	Displayed in the SV modal and tooltips; commonly a semicolon-delimited gene list from AnnotSV.

5.4 Genotype column (how IMPACT-VIS reads it)

SV genotype in IMPACT-VIS is not sourced from a fixed column name. Instead, the app locates the VCF-style genotype columns as follows:

It finds the FORMAT column.
It then treats the next column (typically the sample column, e.g., test_sample) as the genotype-bearing field.
The genotype is parsed as the substring before the first : and normalized by converting | to /, so phasing is not preserved. Future versions may support phased genotypes.

This supports optional filtering (e.g., heterozygous vs homozygous alt) and display (diamond vs circle marker).
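The lookup and parsing steps above can be sketched in Python as follows. The `genotype_column` and `parse_genotype` helpers follow the description directly. The `zygosity` classification is an assumption added for illustration, since the exact rule IMPACT-VIS uses to separate heterozygous from homozygous-alt calls is not spelled out here.

```python
def genotype_column(header):
    """Return the name of the column right after FORMAT, or None."""
    if "FORMAT" not in header:
        return None
    idx = header.index("FORMAT") + 1
    return header[idx] if idx < len(header) else None


def parse_genotype(sample_field):
    """Take the text before the first ':' and replace '|' with '/'."""
    return sample_field.split(":", 1)[0].replace("|", "/")


def zygosity(genotype):
    """Classify a normalized genotype string (assumed rule, see lead-in)."""
    alleles = genotype.split("/")
    if "." in alleles or "" in alleles:
        return "unknown"
    if all(a == "0" for a in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"
```

For a sample field such as `0|1:12,8:20`, `parse_genotype` yields `0/1`, which `zygosity` labels as `het`.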

5.5 Optional but Useful Columns

These columns contain valuable annotations when present:

Column	Description
`CytoBand`	Cytogenetic location (e.g., p15.3, q21)
`Annotation_mode`	full, split, or overlap
`Gene_count`	Number of genes overlapped
`ACMG_class`	ACMG pathogenicity classification
`AnnotSV_ranking_score`	Ranking score for prioritization
`AnnotSV_ranking_criteria`	Criteria used for ranking
`B_loss_source`	Population loss-of-function evidence
`P_loss_phen`	Phenotypes associated with losses
`P_loss_hpo`	HPO terms for loss-related phenotypes

Keep the AnnotSV header intact

IMPACT-VIS reads the TSV with the original column names (it does not aggressively rename columns for you). Avoid exporting “cleaned” tables that change headers or drop FORMAT/sample columns if you want genotype-aware filtering.

6 CNV Format (TXT)

6.1 Overview

Copy number variant data are provided as tab-delimited text (a curation format with no header row). Each row represents a curator action on a CNV, including QC status, classification, and evidence notes.

Note

This is NOT raw CNV data with genomic coordinates. Instead, it is a curation and classification record generated by IMPACT-CNV preprocessing and refined through manual review in IMPACT-VIS.

6.2 Column Structure

The CNV TXT file contains exactly 7 tab-separated columns (no header):

Column #	Field Name	Type	Description
1	`Timestamp`	numeric	Unix timestamp of record creation
2	`Sample_Interval`	character	Variant ID: `sample.chr.start.end.type` (encoded genomic info)
3	`Interpretation`	character	QC result: `Passed` or `Failed`
4	`Classification`	character	Curation classification (Ruled Out, Further Review, etc.)
5	`Evidence`	character	Curated evidence summary or `NA`
6	`Date_Time`	character	Last modification timestamp (ISO 8601 format)
7	`Username`	character	Curator username
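Because the file has no header, a reader has to supply the field names itself. The Python sketch below parses such lines into named records and rejects rows that do not have exactly 7 fields. `CnvRecord` and `parse_cnv_lines` are hypothetical names, and disabling quote handling is an assumption that evidence text is never quoted.

```python
import csv
from typing import NamedTuple


class CnvRecord(NamedTuple):
    timestamp: float
    sample_interval: str
    interpretation: str
    classification: str
    evidence: str
    date_time: str
    username: str


def parse_cnv_lines(lines):
    """Parse headerless, tab-separated CNV curation lines into records."""
    records = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for number, fields in enumerate(reader, start=1):
        if not fields:  # skip blank lines
            continue
        if len(fields) != 7:
            raise ValueError(f"line {number}: expected 7 fields, got {len(fields)}")
        records.append(CnvRecord(float(fields[0]), *fields[1:]))
    return records
```

To read a file, pass an open handle: `with open(path, newline="") as fh: records = parse_cnv_lines(fh)`. The `evidence` field keeps the literal string `NA` when no evidence was recorded.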

6.3 Classification Categories

Common classifications found in the Classification column:

Classification	Meaning	Recommended Action
`Ruled Out - Quality Inadequate / Difficult to Assess`	Failed QC checks	Exclude from reporting
`Ruled Out - Incorrect Boundary, Fully Intronic`	Breakpoints don’t overlap exons	Exclude from reporting
`Ruled Out - Population Variation`	Known benign CNV (gnomAD, etc.)	Exclude from reporting
`Further Review - Not Likely Reportable`	Likely benign; documented for completeness	Consider context before reporting
`Further Review - Potentially Reportable`	Requires additional review; possible pathogenicity	Prioritize for manual review
`Not Evaluated`	CNV not yet curated	Pending review

6.4 CNV File Example

Here’s a properly formatted CNV file with real test data:

1752094007.21964	test_sample.12.8863035.8865835.DEL	Failed	Ruled Out - Quality Inadequate / Difficult to Assess	NA	2025-07-09T16:46:47Z	nboehler
1752092239.12654	test_sample.12.2051740.2054739.DEL	Passed	Further Review - Potentially Reportable	Het DEL of CACNA1C. Further Review needed	2025-07-09T16:17:19Z	nboehler
1752091645.91913	test_sample.5.76790253.76840052.DUP	Passed	Further Review - Potentially Reportable	DUP of F2RL1. Further Review needed.	2025-07-09T16:07:25Z	nboehler
1752062920.41666	test_sample.7.33090339.33148281.DUP	Passed	Further Review - Potentially Reportable	DUP overlaps BBS9, RP9. No clear pathogenicity known.	2025-07-09T08:08:40Z	nboehler

6.5 Decoding Sample_Interval

The Sample_Interval field (column 2) compactly encodes genomic location and variant type:

             ┌chr               ┌type
test_sample.12.8863035.8865835.DEL
     └sample      └start  └end

Breakdown:

sample: Sample identifier (must match filename prefix)
chr: Chromosome (1-22, X, Y)
start: Start position (1-based integer)
end: End position (1-based integer)
type: Variant type (DEL = deletion, DUP = duplication)

This encoding allows IMPACT-VIS to extract coordinates without separate numeric columns.
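Decoding the field is a matter of splitting on dots. The Python sketch below splits from the right so that a sample ID containing dots would still decode. That is a design choice made for this example and not a documented IMPACT-VIS behavior; `decode_sample_interval` is a hypothetical name.

```python
def decode_sample_interval(value):
    """Split 'sample.chr.start.end.type' into typed components."""
    # Split from the right: the last four fields have a fixed meaning,
    # and everything before them is the sample identifier.
    parts = value.rsplit(".", 4)
    if len(parts) != 5:
        raise ValueError(f"cannot decode Sample_Interval: {value!r}")
    sample, chrom, start, end, variant_type = parts
    return sample, chrom, int(start), int(end), variant_type
```

For the example above, the function returns `("test_sample", "12", 8863035, 8865835, "DEL")`.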

7 Next Steps

Ready to load? Start IMPACT-VIS and follow Quick Start
Curious about algorithms? See Methods

Document Version: 1.0.0
Last Updated: 2025-12-10