tinydenseR Package Architecture

Overview

tinydenseR is a landmark-based framework for differential abundance and differential expression analysis of single-cell data. It is designed for scRNA-seq, flow, mass, and spectral cytometry data.

The core idea is to select a representative set of landmark cells, build a graph over them, map all cells to their nearest landmarks, and then model per-landmark density or expression across samples as biological replicates. This preserves statistical rigor while maintaining the richness of single-cell resolution.

TDRObj: The Core Data Structure

All analysis results are stored in a single S4 class, TDRObj. It contains 12 slots that organize the data and results at each stage of the pipeline:

Slot	Contents
`cells`	Named list of per-sample cell indices (or file paths for the files backend)
`metadata`	`data.frame` of sample-level metadata
`config`	Run parameters: key, sampling, assay.type, markers, n.threads, backend
`integration`	Trained projection models and batch variables (harmony.var, harmony.obj, symphony.obj, umap.model)
`assay`	Landmark expression layers (L × features matrices): raw, expr, scaled
`landmark.embed`	Landmark-space coordinate matrices (pca, le, umap), each with a `$coord` entry
`landmark.annot`	Per-landmark categorical annotations: clustering and celltyping, each with an `$ids` factor
`graphs`	Landmark-landmark connectivity: adj.matrix, snn, fgraph
`density`	Fuzzy density analytics (L × N matrices): raw, norm, log.norm, size.factors, composition
`sample.embed`	Sample-level embeddings (N × k matrices): pca, traj, pepc
`cellmap`	Per-cell, per-sample data: clustering and celltyping, each with an `$ids` factor, nearest.lm, fuzzy.graphs. Each entry is in-memory or an on-disk path string
`results`	All statistical outputs: lm, pb, marker, spec, nmf, pls, clustering, celltyping, features

The $ accessor is overloaded so you can use tdr$results instead of tdr@results:

tdr$metadata              # sample-level metadata
tdr$landmark.embed$pca    # PCA coordinates of landmarks
tdr$results$lm            # linear model results
tdr$config$assay.type     # "RNA" or "cyto"

Analysis Pipeline Stages

The standard analysis pipeline proceeds through a series of stages, each populating specific slots in the TDRObj:

setup.tdr.obj() / RunTDR()
    |
    v
get.landmarks()  --> landmark.embed (pca, le)
    |                 assay (raw, expr, scaled)
    v
get.graph()      --> graphs (adj.matrix, snn)
    |                 landmark.embed (umap)
    |                 landmark.annot (clustering)
    v
get.map()        --> density (raw, norm, log.norm, composition)
    |                 cellmap (nearest.lm, fuzzy.graphs)
    v
get.lm()         --> results$lm (fit, trad)
    |                 sample.embed (pca)
    v
get.pbDE()            --> results$pb
get.pbDE(.mode="marker") --> results$marker
get.plsD()           --> results$pls

A typical workflow looks like:

library(tinydenseR)

# Option 1: step-by-step
tdr <- setup.tdr.obj(...)
tdr <- get.landmarks(tdr)
tdr <- get.graph(tdr)
tdr <- get.map(tdr)
tdr <- get.lm(tdr, .design = design)

# Option 2: RunTDR() runs landmarks + graph + map + embedding in one call
result <- RunTDR(seurat_obj, .sample.var = "sample_id", .assay.type = "RNA")
result <- get.lm(result, .design = design)

S3 Dispatch System

tinydenseR uses S3 dispatch to support multiple container types (Seurat, SingleCellExperiment) through a single API. H5AD inputs (either as a file path or via anndataR) are converted to a bare TDRObj during RunTDR() and do not carry container dispatch methods. The dispatch wrappers in dispatch.R follow one of three patterns:

Tier 1: Compute-Only

Functions that only need the TDRObj for computation. The wrapper extracts the TDRObj, runs the .TDRObj method, and stores it back.

# Example: get.lm.Seurat
get.lm.Seurat <- function(x, ...) {
  tdr <- GetTDR(x)
  tdr <- get.lm.TDRObj(tdr, ...)
  SetTDR(x, tdr)
}

Functions in this tier: get.graph(), get.lm(), get.embedding(), get.plsD(), get.features(), lm.cluster(), celltyping().

Tier 2: Needs Source Data

Functions that need access to the original data container (e.g., to read expression matrices via .get_sample_matrix()). The source object is passed via the .source argument.

# Example: get.landmarks.Seurat
get.landmarks.Seurat <- function(x, ...) {
  tdr <- GetTDR(x)
  tdr <- get.landmarks.TDRObj(tdr, .source = x, ...)
  SetTDR(x, tdr)
}

Functions in this tier: get.landmarks(), get.map(), get.pbDE().

Note: get.markerDE() is soft-deprecated. Use get.pbDE(.mode = "marker") instead. The get.pbDE() function supports two modes:

.mode = "design" (default) — pseudobulk differential expression
.mode = "marker" — marker gene/protein identification between groups

Special Case: `goi.summary()`

goi.summary() is a read-only Tier 2 function: it needs .source for expression access, but it returns a summary data frame rather than updating the container.

# Example: goi.summary.Seurat
goi.summary.Seurat <- function(x, ...) {
  tdr <- GetTDR(x)
  goi.summary.TDRObj(tdr, .source = x, ...)
}

Tier 3: Read-Only Plots

Plot functions only need to read the TDRObj; they return a plot object rather than updating the container.

# Example: plotPCA.Seurat
plotPCA.Seurat <- function(x, ...) plotPCA.TDRObj(GetTDR(x), ...)

All plot*() functions follow this pattern: plotPCA(), plotUMAP(), plotBeeswarm(), plot2Markers(), plotSamplePCA(), plotSampleEmbedding(), plotTradStats(), plotTradPerc(), plotDensity(), plotPbDE(), plotDEA(), plotMarkerDE(), plotHeatmap(), plotPlsD(), plotPlsDHeatmap().

Backend Data Flow

When tinydenseR needs to access per-sample expression matrices during the pipeline, it calls the internal function .get_sample_matrix(). This function dispatches on config$backend to retrieve data from the appropriate source:

Backend value	Source	How data is retrieved
`"files"`	RDS files on disk	`readRDS(tdr@cells[[sample_idx]])`
`"seurat"`	Live Seurat object	`SeuratObject::LayerData(.source, assay, layer)[, col_idx]`
`"sce"`	Live SingleCellExperiment	`SummarizedExperiment::assay(.source, assay)[, col_idx]`
`"matrix"`	In-memory or on-disk matrix	Direct column indexing on the stored matrix reference
`"cyto"`	flowCore flowSet or flowWorkspace cytoset	`flowCore::exprs(cs[[sample_name]])`

The "matrix" backend is used by dgCMatrix, DelayedMatrix, IterableMatrix, and H5AD inputs (both RunTDR.character and RunTDR.HDF5AnnData). In these cases, a reference to the matrix is stored in a locked environment within config$source.env so that .get_sample_matrix() can slice into it without copying the entire matrix.

For the "seurat" and "sce" backends, the live container object is passed as .source to Tier 2 functions, which then forward it to .get_sample_matrix().

Caching System

tinydenseR includes an on-disk caching system for per-cell mapping results (nearest landmark assignments, fuzzy graph memberships). This avoids recomputing expensive cell-to-landmark mappings when re-running downstream analyses.

The cache is managed through three user-facing functions:

# Show cache location and size
tdr_cache_info(tdr)

# Validate cache integrity
tdr_cache_validate(tdr)

# Remove orphaned cache files
tdr_cache_cleanup(tdr)

Internal helpers (.tdr_cache_read(), .tdr_cache_write(), .tdr_cache_sweep_orphans()) handle the low-level read/write operations and are not intended to be called directly by users.

Memory Management on HPC

When running tinydenseR on Linux HPC systems, the glibc memory allocator may retain freed memory in the process address space rather than returning it to the OS. This can cause the reported RSS (Resident Set Size) to remain high even after R has freed its objects, potentially triggering OOM kills by the job scheduler.

get.map() already minimises peak memory by excluding large per-sample objects from its internal accumulator when on-disk caching is enabled (the default). For additional defence-in-depth, add the following to your ~/.Renviron or SLURM/PBS job script before launching R:

MALLOC_MMAP_THRESHOLD_=4096
MALLOC_TRIM_THRESHOLD_=0
MALLOC_TOP_PAD_=0
MALLOC_MMAP_MAX_=1000000

These variables instruct glibc to use mmap() for allocations larger than 4 KB. Unlike brk()-based allocations, mmap() pages are returned to the OS immediately on free(), keeping the process RSS close to actual R memory usage.

No additional R code or packages are required — this is a user-side configuration that applies to any R session on Linux.

Pedro Milanez-Almeida

2026-05-27

Overview

TDRObj: The Core Data Structure

Analysis Pipeline Stages

S3 Dispatch System

Tier 1: Compute-Only

Tier 2: Needs Source Data

Special Case: `goi.summary()`

Tier 3: Read-Only Plots

Backend Data Flow

Caching System

Memory Management on HPC

tinydenseR Package Architecture

Pedro Milanez-Almeida

2026-05-27

Overview

TDRObj: The Core Data Structure

Analysis Pipeline Stages

S3 Dispatch System

Tier 1: Compute-Only

Tier 2: Needs Source Data

Special Case: goi.summary()

Tier 3: Read-Only Plots

Backend Data Flow

Caching System

Memory Management on HPC

Special Case: `goi.summary()`