Data Manager Module

API reference for data loading and management functions

1 Data Manager Module

1.1 Overview

The data_manager.R module provides core data loading functionality for GDS, SV, and CNV inputs with pre-load validation and on-disk filtering (SeqArray). Loaders return NULL on validation failure (with a warning); otherwise they return a data.frame (potentially empty).

Location: app/logic/data_manager.R

1.2 Exported Functions

1.2.1 `load_gds_data()`

Loads SNV/Indel data from GDS files with optional filtering and ranking by IMPACT score. Uses on-disk filtering via SeqArray to minimize memory usage.

Parameters:

gds_path (character): Path to GDS file
num_variants (integer): Maximum number of variants to return (default: 1000)
filters (named list): Optional filters with keys: clinvar, tier, genes, genotype (default: list())
bravo_thresh (numeric): Optional upper threshold for BRAVO allele frequency (variants with AF ≤ threshold, or with missing AF, are retained)

Returns:

data.frame containing the top N variants with annotations, or
empty data.frame() if no variants match the filters, or
NULL (with a warning) if file validation fails

Details: Applies filters on the GDS handle (on-disk), ranks by annotation/info/impact_score (descending; NAs last), then retrieves top N variants and extracts a fixed set of annotations.
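The ranking step described above can be illustrated on an in-memory data.frame. This is a minimal base-R sketch, not the module's implementation (which operates on the GDS handle); the helper name rank_top_n and the toy data are hypothetical.

```r
# Hypothetical helper: rank rows by impact score (descending, NAs last)
# and keep at most `n` rows.
rank_top_n <- function(df, n, score_col = "IMPACT_Score") {
  ord <- order(df[[score_col]], decreasing = TRUE, na.last = TRUE)
  df[head(ord, n), , drop = FALSE]
}

# Toy data for illustration only.
variants <- data.frame(
  Position = c(101L, 202L, 303L, 404L),
  IMPACT_Score = c(2.5, NA, 9.1, 4.0)
)

top3 <- rank_top_n(variants, n = 3)
```

Because NAs sort last, a variant without a score is returned only when fewer than N scored variants are available.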

The returned data.frame includes (at least) the following columns:

Chromosome, Position
IMPACT_Score, IMPACT_Calc
VarInfo, Genes, Tier
ClinVar, ClinVar_Disease
REF, ALT, Shape
Bravo_AF, ALoFT_Prediction, ALoFT_Score

Example:

box::use(app/logic/data_manager[load_gds_data])

# Load all variants, no filtering
snv_data <- load_gds_data(
  gds_path = "app/data/sample_1_SNV_IMPACT.gds",
  num_variants = 1000
)

# Load with Tier filtering
snv_filtered <- load_gds_data(
  gds_path = "app/data/sample_1_SNV_IMPACT.gds",
  num_variants = 500,
  filters = list(tier = c("1", "2")),
  bravo_thresh = 0.01
)

Error Handling: Returns NULL and emits warning if GDS file validation fails.
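Callers may want to distinguish the three possible outcomes (NULL, empty data.frame, populated data.frame). A minimal sketch, assuming a hypothetical helper named describe_load_result that is not part of the module:

```r
# Hypothetical helper: classify the three outcomes a loader can produce.
describe_load_result <- function(result) {
  if (is.null(result)) {
    "validation_failed"   # loader returned NULL (a warning was emitted)
  } else if (nrow(result) == 0) {
    "no_matches"          # file was valid, but no rows survived filtering
  } else {
    "ok"
  }
}

# Stand-ins for real loader results.
failed  <- describe_load_result(NULL)
empty   <- describe_load_result(data.frame())
loaded  <- describe_load_result(data.frame(Chromosome = "1", Position = 12345L))
```

Checking is.null() before nrow() matters: nrow(NULL) is itself NULL, so comparing it to 0 inside if() would raise an error.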

1.2.2 `apply_gds_filters()`

Applies filter criteria to an open GDS file handle in-place. Modifies the GDS filter mask (on-disk filtering, no memory load).

Parameters:

gds_h (SeqArray GDS object): Open GDS file handle
filters (named list): Filter criteria with optional keys:
  clinvar: Character vector of ClinVar terms to match (e.g., "Pathogenic")
  tier: Character vector of tier values (e.g., c("1", "2"))
  genes: Character vector of gene names to filter by
  genotype: One of "alt1" (heterozygous) or "alt_ge2" (homozygous/compound)
bravo_thresh (numeric): BRAVO AF upper threshold

Returns: NULL (invisibly). Modifies GDS filter in-place.

Details: Applies filters sequentially with logical AND. ClinVar terms are normalized for matching (case-insensitive, punctuation-tolerant). This is an internal function typically used by load_gds_data().

Filter semantics:

filters$clinvar: splits each variant’s ClinVar string on common separators (pipe/comma/whitespace), normalizes terms, and matches if any term equals any selected normalized term.
filters$tier: keeps variants where annotation/info/tier is in the provided tier set.
filters$genes: uses regex matching against annotation/info/FunctionalAnnotation/genecode_comprehensive_info.
filters$genotype: supports "alt1" for heterozygous ($dosage_alt == 1) and "alt_ge2" for homozygous/compound ($dosage_alt >= 2) using an efficient subset filter.
bravo_thresh: keeps variants with missing Bravo AF, or bravo_af <= bravo_thresh.
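The ClinVar and BRAVO semantics above can be sketched on plain vectors. This is an illustration in base R, not the module's code: the helper names are hypothetical, and the normalization rule (lower-case, drop every non-alphanumeric character) is an assumption about what "case-insensitive, punctuation-tolerant" means.

```r
# Assumed normalization: lower-case and drop every non-alphanumeric character.
normalize_term <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# TRUE when any term in a variant's ClinVar string equals any selected term.
clinvar_matches <- function(clinvar, selected) {
  wanted <- normalize_term(selected)
  vapply(
    strsplit(clinvar, "[|,[:space:]]+"),  # split on pipe/comma/whitespace
    function(terms) any(normalize_term(terms) %in% wanted),
    logical(1)
  )
}

# TRUE when Bravo AF is missing or at most the threshold.
bravo_keep <- function(bravo_af, thresh) is.na(bravo_af) | bravo_af <= thresh

# Toy annotations for three variants.
clinvar  <- c("Pathogenic|Likely_pathogenic", "Benign", "likely_pathogenic, other")
bravo_af <- c(0.001, NA, 0.2)

# Filters combine with logical AND.
keep <- clinvar_matches(clinvar, "Likely Pathogenic") & bravo_keep(bravo_af, 0.01)
```

Here only the first variant is kept: the second fails the ClinVar match despite its missing AF, and the third matches ClinVar but exceeds the AF threshold.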

Example:

box::use(app/logic/data_manager[apply_gds_filters])
box::use(SeqArray[seqOpen, seqClose])

gds <- seqOpen("sample_1_SNV_IMPACT.gds")
apply_gds_filters(
  gds,
  filters = list(clinvar = c("Pathogenic", "Likely Pathogenic")),
  bravo_thresh = 0.01
)
seqClose(gds)

1.2.3 `load_sv_data()`

Loads structural variant data from AnnotSV TSV files with validation.

Parameters:

sv_path (character): Path to AnnotSV TSV file

Returns:

data.frame containing SV annotations, or
NULL if validation fails

Details: Validates TSV structure before loading. Expects AnnotSV format with standard columns: AnnotSV_ID, SV_chrom, SV_start, SV_end, SV_type, Samples_ID, and 150+ additional annotation columns.

Notes: The loader is permissive beyond validation: it reads the TSV via readr::read_tsv(..., na = c(".", "")) and returns the resulting data.frame.
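The effect of the na argument can be seen on a small inline table. This sketch assumes readr 2.0 or later (for I() literal input) and uses invented rows with only the columns named above; real AnnotSV files carry many more columns.

```r
library(readr)

# Toy AnnotSV-like content; the rows are made up for illustration.
tsv_text <- paste(
  "AnnotSV_ID\tSV_chrom\tSV_start\tSV_end\tSV_type\tSamples_ID",
  "1_1000_2000_DEL_1\t1\t1000\t2000\tDEL\tsample_1",
  "2_5000_9000_DUP_1\t2\t5000\t9000\tDUP\t.",
  sep = "\n"
)

# I() marks the string as literal data rather than a file path.
# Both "." and empty fields are read as NA.
sv <- read_tsv(I(tsv_text), na = c(".", ""), show_col_types = FALSE)

missing_sample <- is.na(sv$Samples_ID)
```

read_tsv() returns a tibble, which is also a data.frame, so downstream code that expects a data.frame continues to work.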

Example:

box::use(app/logic/data_manager[load_sv_data])

sv_data <- load_sv_data("app/data/sample_1_SV_IMPACT.tsv")

Error Handling: Returns NULL and emits warning if file validation fails.

1.2.4 `load_cnv_data()`

Loads copy number variant data from IMPACT-CNV TXT files (headerless, 6 tab-separated columns).

Parameters:

cnv_path (character): Path to CNV TXT file
include_failed (logical): Include CNVs with "Failed" interpretation (default: FALSE)

Returns:

data.frame with columns: CNV_Identifier, Sample_Interval, Interpretation, User_Note, Date_Time, Username, or
NULL if validation fails

Details: Parses headerless IMPACT-CNV format. By default, only “Passed” interpretations are returned. Column structure: (1) cnv_identifier, (2) sample_interval, (3) interpretation, (4) user_note, (5) date_and_time, (6) username.
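The headerless six-column layout and the default "Passed" filter can be sketched in base R. The parser below is hypothetical and works on inline text rather than a file path; it is not the module's implementation, and the sample rows are invented.

```r
# Column names taken from the documented return value.
cnv_cols <- c("CNV_Identifier", "Sample_Interval", "Interpretation",
              "User_Note", "Date_Time", "Username")

# Hypothetical parser for the headerless, tab-separated layout.
parse_cnv_text <- function(text, include_failed = FALSE) {
  cnv <- read.delim(
    text = text, header = FALSE, col.names = cnv_cols,
    stringsAsFactors = FALSE, quote = ""
  )
  if (!include_failed) {
    # Default behaviour: keep only rows interpreted as "Passed".
    cnv <- cnv[cnv$Interpretation == "Passed", , drop = FALSE]
  }
  cnv
}

# Invented rows for illustration.
cnv_text <- paste(
  "cnv_001\tchr1:1000-2000\tPassed\tlooks real\t2024-01-01 10:00\talice",
  "cnv_002\tchr2:3000-4000\tFailed\tnoisy\t2024-01-01 10:05\tbob",
  sep = "\n"
)

passed   <- parse_cnv_text(cnv_text)
all_cnvs <- parse_cnv_text(cnv_text, include_failed = TRUE)
```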

Example:

box::use(app/logic/data_manager[load_cnv_data])

# Load only passed CNVs
cnv_passed <- load_cnv_data("app/data/sample_1_CNV_IMPACT.txt")

# Load all CNVs including failed
cnv_all <- load_cnv_data(
  "app/data/sample_1_CNV_IMPACT.txt",
  include_failed = TRUE
)

Error Handling: Returns NULL and emits warning if file validation fails.

1.3 See Also

Validators: Data validation functions
Error Handler: Error handling utilities
Sample Loader: Sample discovery
Methods: Data Loading Pipeline