GDS File Format Specification

Detailed specification for SNV/Indel GDS files

1 GDS File Format Specification

Note

This page provides detailed field-level documentation for the GDS format. For a quick reference, see Overview.

1.1 Overview

GDS (Genomic Data Structure) files store SNV and Indel data in a hierarchical, compressed format based on SeqArray.

Format: binary GDS container (CoreArray, via the gdsfmt package)
Extension: .gds
Library: SeqArray (Bioconductor)

1.2 Required Structure

1.2.1 Variant-Level Annotations

| Field | Type | Description | Example |
|---|---|---|---|
| variant.id | integer | Unique variant identifier | 1, 2, 3, ... |
| position | integer | Chromosomal position (1-based) | 12345678 |
| chromosome | integer/character | Chromosome identifier (SeqArray) | 1, 2, …, 23 (=X) |
| allele | character | Reference/alternate alleles | A,G |
| annotation/info/impact_score | numeric | Numerical IMPACT severity score (0–100) | 85.5 |
| annotation/info/impact_score_calc | character | Scoring method | 80 + 20 * 0.275 |
| annotation/info/tier | integer | Tier classification (1–4) | 1 |
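
The example values in the table are consistent with each other: 80 + 20 * 0.275 evaluates to 85.5. The sketch below shows one way to cross-check the two fields in base R. It assumes that every impact_score_calc string is a plain arithmetic expression (numbers, parentheses, and + - * /), which is inferred from the single example above and not confirmed by this page. The helper name check_score_calc is hypothetical and is not part of SeqArray or IMPACT.

```r
# Hypothetical helper: recompute a score from its calculation string and
# compare it with the stored score.
# Assumption: the string holds only numbers, spaces, parentheses and + - * /.
check_score_calc <- function(score, calc, tol = 1e-8) {
  stopifnot(length(score) == length(calc))

  # Refuse to evaluate anything that is not plain arithmetic
  safe <- grepl("^[0-9.+*/() -]+$", calc)

  recomputed <- rep(NA_real_, length(calc))
  recomputed[safe] <- vapply(
    calc[safe],
    function(expr) {
      # A string can pass the character check and still be malformed ("80 +")
      tryCatch(
        as.numeric(eval(parse(text = expr), envir = baseenv())),
        error = function(e) NA_real_
      )
    },
    numeric(1)
  )

  # TRUE where the stored score matches, FALSE where it differs,
  # NA where the string could not be evaluated
  abs(recomputed - score) <= tol
}

check_score_calc(85.5, "80 + 20 * 0.275")
```

Strings that contain anything other than plain arithmetic are never evaluated and yield NA, so the check can be run on untrusted files.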

1.2.2 Genotype Data

| Field | Type | Description |
|---|---|---|
| genotype | integer | Genotype array (0/1/2 encoding, SeqArray standard) |
| sample.id | character | Sample identifiers |
| $dosage | integer | Dosage matrix used by IMPACT-VIS for genotype-derived fields |
| $dosage_alt | integer | Alternate allele dosage used for genotype filtering |
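
In SeqArray, names that start with `$` are derived variables that seqGetData() computes from the genotype array on request: "$dosage" counts copies of the reference allele and "$dosage_alt" counts copies of the alternate allele(s). The following is a minimal sketch of retrieving both. It uses the example GDS file bundled with SeqArray, not an IMPACT-annotated file, and the "carried by at least one sample" filter is only an illustrative assumption about what genotype filtering might look like.

```r
library(SeqArray)

# Open the example GDS file bundled with SeqArray (not an IMPACT-annotated file)
gds <- seqOpen(seqExampleFileName("gds"))

# Restrict to the first five variants to keep the matrices small
seqSetFilter(gds, variant.sel = 1:5, verbose = FALSE)

# Both calls return a samples-by-variants matrix; NA marks missing genotypes
dosage_ref <- seqGetData(gds, "$dosage")      # copies of the reference allele
dosage_alt <- seqGetData(gds, "$dosage_alt")  # copies of any alternate allele

# Illustrative genotype filter: variants carried by at least one sample
carried <- colSums(dosage_alt > 0, na.rm = TRUE) > 0

seqClose(gds)
```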

1.2.3 Functional Annotations

| Field | Type | Description | Source |
|---|---|---|---|
| annotation/info/FunctionalAnnotation/VarInfo | character | FAVOR variant identifier string used as a stable per-variant key | FAVOR |
| annotation/info/FunctionalAnnotation/Consequence | character | Variant consequence terms | Ensembl VEP (via favorannotator) |
| annotation/info/FunctionalAnnotation/clnsig | character | ClinVar clinical significance | ClinVar (via favorannotator) |
| annotation/info/FunctionalAnnotation/clndn | character | ClinVar disease name | ClinVar (via favorannotator) |
| annotation/info/FunctionalAnnotation/bravo_af | numeric | BRAVO allele frequency | BRAVO (via favorannotator) |
| annotation/info/FunctionalAnnotation/gnomad_af | numeric | gnomAD allele frequency | gnomAD (via favorannotator) |

1.3 Optional Annotations

| Field | Type | Description |
|---|---|---|
| annotation/info/CADD_score | numeric | CADD deleteriousness |
| annotation/info/REVEL_score | numeric | REVEL pathogenicity |
| annotation/info/SIFT_pred | character | SIFT prediction |
| annotation/info/PolyPhen_pred | character | PolyPhen prediction |

1.4 Validation Rules

Important

A GDS file must pass all of the following checks before it is loaded:

  1. File Structure: The file must be a valid SeqArray GDS file
  2. Required Fields: The file must include the SeqArray core nodes plus the IMPACT annotations (annotation/info/impact_score or annotation/info/impact_score_calc, and annotation/info/FunctionalAnnotation/VarInfo)
  3. Data Types: Each field must have the type listed in the field tables
  4. Coordinate Validity: Positions must fall within chromosome bounds
  5. Allele Format: Alleles must be written as REF,ALT, comma-separated
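
Some of these rules can be checked on plain vectors once the fields have been read from the file. The sketch below is one possible base R implementation, not the validator used by the loader. The helper name validate_variant_fields is hypothetical, the allele alphabet (A, C, G, T, N) is an assumption, and upper chromosome bounds are left out because they depend on the genome build.

```r
# Hypothetical helper: check per-variant values against the rules on this page.
# Inputs are plain vectors, e.g. as returned by SeqArray::seqGetData().
# Returns a character vector of problems; an empty vector means all checks passed.
validate_variant_fields <- function(position, allele,
                                    impact_score = NULL, tier = NULL) {
  problems <- character(0)

  # Positions are 1-based, so each must be a positive whole number.
  # Upper chromosome bounds depend on the genome build and are not checked here.
  if (anyNA(position) || any(position < 1) || any(position != floor(position))) {
    problems <- c(problems, "position: must be positive whole numbers")
  }

  # Allele strings must be REF,ALT with non-empty, comma-separated parts.
  # Assumption: alleles use only the letters A, C, G, T, N; extra ALTs are allowed.
  ok_allele <- grepl("^[ACGTN]+(,[ACGTN]+)+$", allele)
  if (!all(ok_allele)) {
    problems <- c(problems,
                  sprintf("allele: %d malformed value(s)", sum(!ok_allele)))
  }

  # Score range from the field table (0 to 100); missing scores are flagged too
  if (!is.null(impact_score) &&
      !isTRUE(all(impact_score >= 0 & impact_score <= 100))) {
    problems <- c(problems, "impact_score: must lie between 0 and 100")
  }

  # Tier range from the field table (1 to 4)
  if (!is.null(tier) && !isTRUE(all(tier %in% 1:4))) {
    problems <- c(problems, "tier: must be 1, 2, 3 or 4")
  }

  problems
}

# Values taken from the example column of the field table
validate_variant_fields(12345678L, "A,G", impact_score = 85.5, tier = 1L)
```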

1.5 Creating GDS from VCF

library(SeqArray)

# Convert VCF to GDS
seqVCF2GDS(
  vcf.fn = "input.vcf.gz",
  out.fn = "output.gds",
  storage.option = "LZMA_RA",
  verbose = TRUE
)

# Verify structure
gds <- seqOpen("output.gds")
seqSummary(gds)
seqClose(gds)
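
seqSummary() prints the overall structure, but it does not say which of the nodes required by this page are absent. A minimal sketch of such a check follows, using index.gdsn() from gdsfmt with silent = TRUE. The helper name missing_nodes and the contents of the required vector are illustrative assumptions assembled from the tables above. The page also accepts annotation/info/impact_score_calc in place of annotation/info/impact_score, which this simple list does not model.

```r
library(SeqArray)

# Hypothetical helper: report which expected nodes are absent from a GDS file
missing_nodes <- function(gds_path, nodes) {
  gds <- seqOpen(gds_path)
  on.exit(seqClose(gds))

  # index.gdsn(..., silent = TRUE) returns NULL instead of raising an error
  # when the requested path does not exist
  present <- vapply(
    nodes,
    function(path) !is.null(gdsfmt::index.gdsn(gds, path, silent = TRUE)),
    logical(1)
  )
  nodes[!present]
}

# Illustrative list built from the field tables on this page
required <- c(
  "variant.id", "position", "chromosome", "allele", "genotype", "sample.id",
  "annotation/info/impact_score",
  "annotation/info/FunctionalAnnotation/VarInfo"
)

# Demonstrated on the example file bundled with SeqArray;
# pass "output.gds" instead to check a converted file
missing_nodes(seqExampleFileName("gds"), required)
```

A file converted straight from a VCF would be expected to report the IMPACT and FunctionalAnnotation nodes as missing unless the VCF already carried them as INFO fields.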