Data Preparation
1 Overview
This guide assumes your variant data has already been preprocessed by IMPACT pipeline modules:
- IMPACT-SNV produces `*_SNV_IMPACT.gds` files
- IMPACT-SV produces `*_SV_IMPACT.tsv` files
- IMPACT-CNV produces `*_CNV_IMPACT.txt` files
If you only have raw VCF files, you must first run the upstream preprocessing pipelines:
- IMPACT-SNV for SNV/Indel VCF → GDS conversion and annotation
- IMPACT-SV for SV VCF → AnnotSV annotation
- IMPACT-CNV for CNV VCF → SCIP processing
This page focuses on organizing already-preprocessed files for IMPACT-VIS visualization.
This guide walks you through preparing genomic variant data for use with IMPACT-VIS. IMPACT-VIS accepts three data types:
- SNV/Indel Data: Genomic Data Structure (GDS) files (from IMPACT-SNV)
- Structural Variants: AnnotSV-annotated TSV files (from IMPACT-SV)
- Copy Number Variants: Tab-separated text files (from IMPACT-CNV)
All data should be organized in a standardized directory structure.
2 Quick Reference: File Checklist
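Before loading data into IMPACT-VIS, verify that each sample has the files listed under File Naming Convention (Section 3.2) and that they sit directly in app/data/.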
Troubleshooting: If files don’t appear in the sample selector after starting IMPACT-VIS, verify file permissions and naming convention, then click “Refresh Data” in the Sample Input panel.
3 File Organization
3.1 Directory Structure
IMPACT-VIS expects all sample data files to be placed directly in the app/data/ directory (not in subdirectories). Files are discovered and grouped by sample ID based on their filename pattern.
app/data/
├── sample001_SNV_IMPACT.gds
├── sample001_SNV_IMPACT.gds_variant_states.rds
├── sample001_SV_IMPACT.tsv
├── sample001_CNV_IMPACT.txt
├── sample002_SNV_IMPACT.gds
├── sample002_SNV_IMPACT.gds_variant_states.rds
├── sample002_SV_IMPACT.tsv
├── sample002_CNV_IMPACT.txt
└── ... (more samples)
Key Rules:
- All files go in the app/data/ root directory (no subfolders)
- Files must follow the naming convention: `{sample_id}_{TYPE}_IMPACT.{ext}`
- IMPACT-VIS automatically discovers samples by scanning for `*_SNV_IMPACT.gds` files
- The sample ID is extracted by removing the `_SNV_IMPACT.gds` suffix from filenames
3.2 File Naming Convention
| Data Type | File Pattern | Required? |
|---|---|---|
| SNV/Indel | `{sample_id}_SNV_IMPACT.gds` | Yes (or data won’t load) |
| Annotations State | `{sample_id}_SNV_IMPACT.gds_variant_states.rds` | No (created automatically) |
| Structural Variants | `{sample_id}_SV_IMPACT.tsv` | No (SV tab will be empty) |
| Copy Number Variants | `{sample_id}_CNV_IMPACT.txt` | No (CNV tab will be empty) |
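The discovery rule described above can be sketched in a few lines of Python. This is an illustration only, not the app's own code: the function name `discover_samples` and the labels in `OPTIONAL_SUFFIXES` are hypothetical, and the sketch simply reports which optional companion files sit next to each GDS file.

```python
from pathlib import Path
import tempfile

SNV_SUFFIX = "_SNV_IMPACT.gds"
# Optional companion files; the dictionary keys are labels used only in this sketch.
OPTIONAL_SUFFIXES = {
    "variant_states": "_SNV_IMPACT.gds_variant_states.rds",
    "sv": "_SV_IMPACT.tsv",
    "cnv": "_CNV_IMPACT.txt",
}


def discover_samples(data_dir):
    """Map each sample ID to flags for the optional files found beside its GDS."""
    data_dir = Path(data_dir)
    samples = {}
    # Only the top level is scanned, mirroring the "no subfolders" rule.
    for gds in sorted(data_dir.glob("*" + SNV_SUFFIX)):
        sample_id = gds.name[: -len(SNV_SUFFIX)]
        samples[sample_id] = {
            label: (data_dir / (sample_id + suffix)).is_file()
            for label, suffix in OPTIONAL_SUFFIXES.items()
        }
    return samples


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("sample001_SNV_IMPACT.gds", "sample001_SV_IMPACT.tsv"):
            (Path(tmp) / name).touch()
        print(discover_samples(tmp))
```

A sample with an SV or CNV file but no `_SNV_IMPACT.gds` file does not appear in the result, which matches the "Yes (or data won’t load)" entry in the table.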
4 GDS File Format (SNV/Indel)
4.1 Overview
GDS (Genomic Data Structure) is an optimized binary format for genomic variant data, accessed via the SeqArray package.
GDS files are created from VCF files using SeqArray's seqVCF2GDS() function in the IMPACT-SNV preprocessing module.
4.2 Required GDS nodes (what IMPACT-VIS validates)
IMPACT-VIS validates that a candidate SNV/Indel GDS file can be opened with SeqArray and has non-zero variant and sample counts.
At minimum, the file must contain these core SeqArray nodes:
| Node (SeqArray path) | Type (typical) | Description |
|---|---|---|
| `variant.id` | integer/character | Variant identifier vector used for filtering and subsetting |
| `chromosome` | integer/character | Per-variant chromosome identifier (stored as numeric in many SeqArray pipelines) |
| `position` | integer | 1-based genomic position |
| `sample.id` | character | Sample identifiers |
| `genotype` | integer | Genotype array (required by validator; used implicitly by SeqArray format) |
The schema reference describes alleles via the allele node (ref/alt as a 2-column character matrix), not separate ref and alt nodes. IMPACT-VIS does not currently require allele for plotting, but it is expected in standard SeqArray GDS.
4.3 Required annotations (what the app actually reads)
IMPACT-VIS expects IMPACT-SNV/favorannotator-style annotation nodes under annotation/info/.
Validation requirement (hard): the validator requires at least one of:
- `annotation/info/impact_score`, or
- `annotation/info/impact_score_calc`

and it requires:
- `annotation/info/FunctionalAnnotation/VarInfo`
Fields used for plotting/tooltips and filtering: when present, the loader reads the following nodes to build the SNV dataframe used in the plot and variant modal.
| Node (SeqArray path) | Used for | Notes |
|---|---|---|
| `annotation/info/impact_score` | Ranking + y-axis | Variants are ranked by this score and the top \(N\) are visualized |
| `annotation/info/impact_score_calc` | Tooltip | Displayed as the IMPACT scoring method/version |
| `annotation/info/tier` | Filtering + coloring | Tier filtering is applied on-disk via SeqArray filters |
| `annotation/info/FunctionalAnnotation/VarInfo` | Tooltip | Functional consequence string (e.g., VEP consequence terms) |
| `annotation/info/FunctionalAnnotation/genecode_comprehensive_info` | Gene filter + tooltip | Used for gene-based filtering and display |
| `annotation/info/FunctionalAnnotation/clnsig` | ClinVar filter + tooltip | Used for ClinVar-based filtering; stored values may use pipes/underscores |
| `annotation/info/FunctionalAnnotation/clndn` | Tooltip | ClinVar disease name |
| `annotation/info/FunctionalAnnotation/bravo_af` | Frequency filter + tooltip | Used for optional Bravo AF threshold filtering |
| `annotation/info/FunctionalAnnotation/aloft_description` | Tooltip | Note: IMPACT-VIS currently reads aloft_description (not aloft_prediction) |
| `annotation/info/FunctionalAnnotation/aloft_value` | Tooltip | ALoFT numerical score |
| `$dosage` | Tooltip (REF/ALT counts) | Dosage matrices are used to infer genotype counts for the first sample |
| `$dosage_alt` | Genotype filter + point shape | Used for genotype filtering (het vs hom-alt) and point shape (diamond for ALT\(\ge 2\)) |
IMPACT-VIS treats the GDS as effectively single-sample for genotype-derived columns by taking the first sample in $dosage/$dosage_alt.
If you provide a multi-sample GDS, the app will still validate it, but genotype-driven fields (REF/ALT/Shape and genotype filters) will reflect only the first sample.
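To make the first-sample behaviour concrete, here is a small Python sketch of how a point shape could be derived from an ALT dosage matrix. It is not the app's code: the function name is hypothetical, the matrix is assumed to have one row per sample and one column per variant, and the `"circle"` fallback for ALT counts below 2 is an assumption.

```python
def first_sample_shapes(dosage_alt):
    """Return one marker shape per variant from the first sample's ALT dosage.

    dosage_alt is a list of rows, one row per sample and one column per
    variant (the orientation assumed here); None marks a missing genotype.
    """
    first_sample = dosage_alt[0]  # every other sample row is ignored
    shapes = []
    for alt_count in first_sample:
        if alt_count is None:
            shapes.append(None)
        elif alt_count >= 2:
            shapes.append("diamond")  # ALT >= 2, i.e. hom-alt
        else:
            shapes.append("circle")  # assumed default marker
    return shapes


if __name__ == "__main__":
    # Two samples, three variants: only the first row affects the result.
    print(first_sample_shapes([[0, 1, 2], [2, 2, 2]]))
```

In a multi-sample file, a variant that is hom-alt only in the second sample would therefore not get the diamond marker.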
For ClinVar filtering, IMPACT-VIS normalizes both UI selections and clnsig values by lowercasing and collapsing punctuation/underscores/spaces. A record can contain multiple ClinVar terms (e.g., Pathogenic|Likely_pathogenic), and a variant matches if any term matches the selected filter.
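The matching rule can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's implementation: the function names are hypothetical, "collapsing" is taken to mean replacing each run of non-alphanumeric characters with a single space, and a term matches only when it equals a selected filter after normalization.

```python
import re


def normalize_term(term):
    """Lowercase and collapse runs of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def clnsig_matches(clnsig, selected):
    """True if any pipe-separated term in clnsig equals a selected filter."""
    wanted = {normalize_term(s) for s in selected}
    terms = (normalize_term(t) for t in clnsig.split("|"))
    return any(t in wanted for t in terms if t)


if __name__ == "__main__":
    print(clnsig_matches("Pathogenic|Likely_pathogenic", ["Likely pathogenic"]))
```

Under these assumptions, a stored value of `Likely_pathogenic` and a UI selection of "Likely pathogenic" both normalize to the same string, so the difference in punctuation does not block a match.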
5 SV Format (AnnotSV TSV)
5.1 Overview
Structural variant (SV) data are provided as tab-delimited AnnotSV tables generated by the IMPACT-SV preprocessing pipeline. IMPACT-VIS does not accept raw SV VCF directly; it expects the AnnotSV TSV output (with IMPACT-specific QC annotations).
Internally, IMPACT-VIS uses the TSV for:
- Validation (presence of core AnnotSV columns and supported SV types).
- Plotting SVs by type across the genome and filtering by per-variant QC.
- Optional genotype-driven display/filtering using the first sample genotype column.
5.2 Required Columns
The following columns must be present for IMPACT-VIS to validate and load the SV file:
| Column | Type | Example | Description |
|---|---|---|---|
| `AnnotSV_ID` | character | TEST_10_1368865_1368865_INS_010 | Unique variant identifier (auto-generated by AnnotSV) |
| `SV_chrom` | character | 10, 11, X | Chromosome (1-22, X, Y) |
| `SV_start` | integer | 1368865 | Start position (1-based) |
| `SV_end` | integer | 1368865 | End position (1-based) |
| `SV_type` | character | DEL, DUP, INS, INV, TRA, BND | Structural variant type (must be one of: DEL, DUP, INS, INV, TRA, BND, CNV) |
| `Samples_ID` | character | test_sample | Sample identifier |
For validation, IMPACT-VIS requires SV_start < SV_end in the first 100 rows it inspects.
This means that breakend-style rows (e.g., some BND/TRA representations where start == end) may currently fail validation.
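You can pre-check a TSV against these rules before loading it. The Python sketch below is an approximation of the checks described in this section, not the validator itself: the function name and the wording of the messages are hypothetical, and only the required columns, the supported `SV_type` values, and the `SV_start < SV_end` rule for the first 100 rows are covered.

```python
import csv
import io

REQUIRED_COLUMNS = ["AnnotSV_ID", "SV_chrom", "SV_start", "SV_end", "SV_type", "Samples_ID"]
SUPPORTED_TYPES = {"DEL", "DUP", "INS", "INV", "TRA", "BND", "CNV"}


def check_sv_table(handle, max_rows=100):
    """Return a list of problems found in the header and the first max_rows rows."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        if line_no - 2 >= max_rows:
            break
        if row["SV_type"] not in SUPPORTED_TYPES:
            problems.append(f"line {line_no}: unsupported SV_type {row['SV_type']!r}")
        try:
            ordered = int(row["SV_start"]) < int(row["SV_end"])
        except (TypeError, ValueError):
            ordered = False  # missing or non-integer coordinates
        if not ordered:
            problems.append(f"line {line_no}: SV_start must be < SV_end")
    return problems


if __name__ == "__main__":
    demo = (
        "AnnotSV_ID\tSV_chrom\tSV_start\tSV_end\tSV_type\tSamples_ID\n"
        "id1\t10\t100\t200\tDEL\ttest_sample\n"
        "id2\t10\t1368865\t1368865\tINS\ttest_sample\n"
    )
    print(check_sv_table(io.StringIO(demo)))
```

The second demo row has equal start and end coordinates, so the sketch flags it, which is the same situation the breakend note above describes.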
5.3 Columns Used by IMPACT-VIS
In addition to the required columns above, the following columns are used by the UI/plotting layer when present:
| Column | Why it matters in IMPACT-VIS |
|---|---|
| `READ_SUPPORT_FILTERING` | IMPACT-SV QC label. By default, the SV plot shows only rows with PASSED; users can enable “Show variants failing QC” to include the rest. |
| `ACMG_class` | Enables the “ACMG Class” filter for SVs (values are treated as strings; NA is handled explicitly). |
| `Gene_name` | Displayed in the SV modal and tooltips; commonly a semicolon-delimited gene list from AnnotSV. |
5.4 Genotype column (how IMPACT-VIS reads it)
SV genotype in IMPACT-VIS is not sourced from a fixed column name. Instead, the app looks for the VCF-style columns:
- It finds the `FORMAT` column.
- It then treats the next column (typically the sample column, e.g., `test_sample`) as the genotype-bearing field.
- The genotype is parsed as the substring before the first `:` and normalized by converting `|` to `/`. Note that future versions may support phased genotypes.
This supports optional filtering (e.g., heterozygous vs homozygous alt) and display (diamond vs circle marker).
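The lookup steps above can be sketched in Python as shown below. The function names are hypothetical and the `zygosity` helper is an assumption about how a normalized genotype might be mapped to het or hom-alt; the app's own classification logic is not documented here.

```python
def sample_genotype(header, row):
    """Return the normalized genotype from the column right after FORMAT.

    header and row are parallel lists of column names and cell values.
    Returns None when FORMAT is absent or is the last column.
    """
    try:
        fmt_index = header.index("FORMAT")
    except ValueError:
        return None
    if fmt_index + 1 >= len(row):
        return None
    sample_field = row[fmt_index + 1]
    genotype = sample_field.split(":", 1)[0]  # substring before the first ':'
    return genotype.replace("|", "/")  # phased separators become unphased


def zygosity(genotype):
    """Classify a normalized genotype; None if any allele is missing."""
    alleles = genotype.split("/")
    if "." in alleles:
        return None
    if all(a == "0" for a in alleles):
        return "hom_ref"
    return "hom_alt" if len(set(alleles)) == 1 else "het"


if __name__ == "__main__":
    header = ["SV_type", "FORMAT", "test_sample"]
    gt = sample_genotype(header, ["DEL", "GT:DP", "0|1:30"])
    print(gt, zygosity(gt))
```

Because the sample column is found by position rather than by name, a table whose `FORMAT` column was dropped or moved to the end yields no genotype at all.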
5.5 Optional but Useful Columns
These columns contain valuable annotations when present:
| Column | Description |
|---|---|
| `CytoBand` | Cytogenetic location (e.g., p15.3, q21) |
| `Annotation_mode` | full, split, or overlap |
| `Gene_count` | Number of genes overlapped |
| `ACMG_class` | ACMG pathogenicity classification |
| `AnnotSV_ranking_score` | Ranking score for prioritization |
| `AnnotSV_ranking_criteria` | Criteria used for ranking |
| `B_loss_source` | Population loss-of-function evidence |
| `P_loss_phen` | Phenotypes associated with losses |
| `P_loss_hpo` | HPO terms for loss-related phenotypes |
IMPACT-VIS reads the TSV with the original column names (it does not aggressively rename columns for you). Avoid exporting “cleaned” tables that change headers or drop FORMAT/sample columns if you want genotype-aware filtering.
6 CNV Format (TXT)
6.1 Overview
Copy number variant (CNV) data are provided as a tab-delimited text file in a curation format with no header row. Each row represents a curator action on a CNV, including QC status, classification, and evidence notes.
This is NOT raw CNV data with genomic coordinates. Instead, it is a curation and classification record generated by IMPACT-CNV preprocessing and refined through manual review in IMPACT-VIS.
6.2 Column Structure
The CNV TXT file contains exactly 7 tab-separated columns (no header):
| Column # | Field Name | Type | Description |
|---|---|---|---|
| 1 | `Timestamp` | numeric | Unix timestamp of record creation |
| 2 | `Sample_Interval` | character | Variant ID: sample.chr.start.end.type (encoded genomic info) |
| 3 | `Interpretation` | character | QC result: Passed or Failed |
| 4 | `Classification` | character | Curation classification (Ruled Out, Further Review, etc.) |
| 5 | `Evidence` | character | Curated evidence summary or NA |
| 6 | `Date_Time` | character | Last modification timestamp (ISO 8601 format) |
| 7 | `Username` | character | Curator username |
6.3 Classification Categories
Common classifications found in the Classification column:
| Classification | Meaning | Recommended Action |
|---|---|---|
| Ruled Out - Quality Inadequate / Difficult to Assess | Failed QC checks | Exclude from reporting |
| Ruled Out - Incorrect Boundary, Fully Intronic | Breakpoints don’t overlap exons | Exclude from reporting |
| Ruled Out - Population Variation | Known benign CNV (gnomAD, etc.) | Exclude from reporting |
| Further Review - Not Likely Reportable | Likely benign; documented for completeness | Consider context before reporting |
| Further Review - Potentially Reportable | Requires additional review; possible pathogenicity | Prioritize for manual review |
| Not Evaluated | CNV not yet curated | Pending review |
6.4 CNV File Example
Here’s a properly formatted CNV file with real test data (in the actual file, the seven fields on each line are separated by single tab characters):
1752094007.21964 test_sample.12.8863035.8865835.DEL Failed Ruled Out - Quality Inadequate / Difficult to Assess NA 2025-07-09T16:46:47Z nboehler
1752092239.12654 test_sample.12.2051740.2054739.DEL Passed Further Review - Potentially Reportable Het DEL of CACNA1C. Further Review needed 2025-07-09T16:17:19Z nboehler
1752091645.91913 test_sample.5.76790253.76840052.DUP Passed Further Review - Potentially Reportable DUP of F2RL1. Further Review needed. 2025-07-09T16:07:25Z nboehler
1752062920.41666 test_sample.7.33090339.33148281.DUP Passed Further Review - Potentially Reportable DUP overlaps BBS9, RP9. No clear pathogenicity known. 2025-07-09T08:08:40Z nboehler
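A file in this layout can be read with Python's standard `csv` module. The sketch below is illustrative rather than the app's parser: the function name is hypothetical, the field names come from the column table in Section 6.2, and rejecting any row that does not have exactly 7 fields is an assumption about how strictly the file should be checked.

```python
import csv
import io

CNV_FIELDS = ["Timestamp", "Sample_Interval", "Interpretation",
              "Classification", "Evidence", "Date_Time", "Username"]


def read_cnv_records(handle):
    """Yield one dict per row of a headerless, 7-column, tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, fields in enumerate(reader, start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(CNV_FIELDS):
            raise ValueError(f"line {line_no}: expected 7 columns, got {len(fields)}")
        record = dict(zip(CNV_FIELDS, fields))
        record["Timestamp"] = float(record["Timestamp"])
        yield record


if __name__ == "__main__":
    line = ("1752094007.21964\ttest_sample.12.8863035.8865835.DEL\tFailed\t"
            "Ruled Out - Quality Inadequate / Difficult to Assess\tNA\t"
            "2025-07-09T16:46:47Z\tnboehler\n")
    for rec in read_cnv_records(io.StringIO(line)):
        print(rec["Sample_Interval"], rec["Classification"])
```

Splitting on tabs only is what keeps multi-word values such as the classification and evidence text in a single field.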
6.5 Decoding Sample_Interval
The Sample_Interval field (column 2) compactly encodes genomic location and variant type:
            ┌chr               ┌type
test_sample.12.8863035.8865835.DEL
└sample        └start  └end
Breakdown:
- sample: Sample identifier (must match filename prefix)
- chr: Chromosome (1-22, X, Y)
- start: Start position (1-based integer)
- end: End position (1-based integer)
- type: Variant type (DEL = deletion, DUP = duplication)
This encoding allows IMPACT-VIS to extract coordinates without separate numeric columns.
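As an illustration, the field can be decoded with a right-anchored split. This Python sketch is not taken from IMPACT-VIS: the names `CnvInterval` and `decode_sample_interval` are hypothetical, and splitting from the right is a choice made here so that a sample ID containing dots would still decode.

```python
from typing import NamedTuple


class CnvInterval(NamedTuple):
    sample: str
    chrom: str
    start: int
    end: int
    sv_type: str


def decode_sample_interval(value):
    """Split sample.chr.start.end.type into its five parts.

    Splitting from the right keeps sample IDs that contain dots intact
    (an assumption; the source does not say whether such IDs occur).
    """
    parts = value.rsplit(".", 4)
    if len(parts) != 5:
        raise ValueError(f"cannot decode Sample_Interval: {value!r}")
    sample, chrom, start, end, sv_type = parts
    return CnvInterval(sample, chrom, int(start), int(end), sv_type)


if __name__ == "__main__":
    interval = decode_sample_interval("test_sample.12.8863035.8865835.DEL")
    print(interval.chrom, interval.end - interval.start, interval.sv_type)
```

The chromosome is kept as a string so that values such as X and Y decode the same way as numeric chromosomes.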
7 Next Steps
- Ready to load? Start IMPACT-VIS and follow Quick Start
- Curious about algorithms? See Methods
Document Version: 1.0.0
Last Updated: 2025-12-10