Leiden clustering with straggler absorption — leiden.cluster • tinydenseR

Performs community detection on a similarity graph using the Leiden algorithm, followed by reassignment of small "straggler" clusters to better-connected neighbors. This improves cluster quality by preventing tiny, isolated groups.

Usage

leiden.cluster(
  .tdr.obj = NULL,
  .sim.matrix,
  .resolution.parameter,
  .small.size = floor((nrow(x = .sim.matrix)/200)),
  .verbose = TRUE,
  .seed = 123
)

Arguments

.tdr.obj: Optional tinydenseR object from get.landmarks() and get.graph(). If provided, uses Laplacian Eigenmap embedding for initialization (if computed successfully), otherwise falls back to PCA embedding. If NULL, computes 3D embedding from .sim.matrix via truncated SVD.
.sim.matrix: Square similarity matrix (typically Jaccard similarity of k-NN graphs). Higher values indicate stronger connections between nodes. Self-loops (diagonal) are automatically removed.
.resolution.parameter: Numeric controlling cluster granularity in Leiden CPM algorithm. Higher values (e.g., 0.0005-0.005) yield more/smaller clusters. Lower values yield fewer/larger clusters. Optimal value depends on data scale.
.small.size: Integer threshold for straggler clusters. Clusters with fewer cells are absorbed into neighbors. Default: 0.5% of total nodes (floor(n/200)).
.verbose: Logical, print progress and cluster counts. Default TRUE.
.seed: Integer for reproducibility (affects k-means and Leiden). Default 123.

Value

Factor vector of cluster assignments with levels ordered by appearance. Labels are formatted as "cluster.01", "cluster.02", etc.

Details

Algorithm steps:

Initialize with k-means (up to 25 centers) on embedding to avoid singletons
Run Leiden community detection with CPM objective function
Identify "straggler" clusters smaller than .small.size
For each straggler, compute mean connectivity to all keeper clusters
Reassign straggler cells to the most connected keeper cluster
Relabel clusters sequentially

Initialization strategy: Uses Laplacian Eigenmap embedding (if computed and available in .tdr.obj$graph$LE$embed), otherwise PCA embedding (up to first 3 components), or computes 3D embedding from similarity matrix via truncated SVD (irlba) when .tdr.obj is NULL. K-means initialization helps Leiden start from reasonable communities rather than singleton nodes, improving stability.

Straggler absorption: Small clusters (< 0.5% of total by default) often represent noise or transitional states. Rather than keeping them as separate clusters, they're merged with the most similar neighboring cluster based on mean edge weights. For sparse similarity matrices, the mean is computed by normalizing the sum of stored values by the full submatrix size to properly account for implicit zeros.

Examples

if (FALSE) { # \dontrun{
# After building similarity matrix
clusters <- leiden.cluster(
  .tdr.obj = lm.obj,
  .sim.matrix = jaccard_sim,
  .resolution.parameter = 0.001,
  .small.size = 10  # Merge clusters < 10 cells
)
table(clusters)
} # }