Graph-Diffused, Density Contrast-Aligned PLS Decomposition (plsD)

Decomposes an expression interaction matrix M.local via NIPALS PLS1 against the density-contrast vector Y. plsD generates candidate features that drive the density contrast — including population markers (DA), differentially expressed genes (DE), and their mixture — by maximizing covariance between graph-smoothed expression and Y.

Usage

get.plsD(x, ...)

# S3 method for class 'TDRObj'
get.plsD(
  x,
  .coef.col,
  .model.name = "default",
  .ncomp = NULL,
  .min.prop = 0.005,
  .store.M = FALSE,
  .degree.reg = FALSE,
  .tau.mult = 1,
  .lazy.alpha = 1,
  .YX.interaction = TRUE,
  .loading.method = c("pearson", "ols", "spearman"),
  .residualize = FALSE,
  .verbose = TRUE,
  ...
)

Arguments

x: A TDRObj, Seurat, SingleCellExperiment, or HDF5AnnData (anndataR) object after get.lm().
...: Additional arguments passed to methods.
.coef.col: Character: column name in model coefficients to use as density contrast Y. Must be a valid column in .tdr.obj$map$lm[[.model.name]]$fit$coefficients.
.model.name: Character: name of the fitted model to use (default "default").
.ncomp: Integer: number of PLS components. Defaults to ncol(.tdr.obj$pca$embed), matching the number of PCs from get.landmarks.
.min.prop: Numeric: for RNA, minimum proportion of landmarks where a gene must be detected (>0) to be included. Default 0.005 (0.5 percent).
.store.M: Logical: if TRUE, store M.local in output. Default FALSE (saves memory; can be large for RNA).
.degree.reg: Logical: if TRUE, apply degree regularization by adding tau * I (self-loops) to SNN before row-normalizing. Default FALSE.
.tau.mult: Numeric: multiplier for tau when .degree.reg = TRUE. tau = .tau.mult * mean(degree). Default 1.
.lazy.alpha: Numeric in (0, 1]: mixing parameter for lazy random walk. P_lazy = (1 - alpha) * I + alpha * P. Default 1 (standard walk). Values < 1 reduce the spatial propagation range of expression scores on the graph, making the decomposition more local. This is particularly useful when the density contrast is dominated by a single biologically extreme population: reducing .lazy.alpha limits how far the structural score counterweight (see Details) propagates to distant manifold regions. Try 0.5 as a first step when strong unexpected balancing is observed. Values < 1 also incidentally damp oscillations on sparse/irregular graphs.
.YX.interaction: Logical: if TRUE (default), construct M.local = P diag(Y) Xc (Y-weighted interaction). Loadings capture candidate features driving the density contrast through both differential abundance (population markers) and differential expression. If FALSE, construct M.local = P Xc (graph-smoothed expression only; Y appears only on the response side). The FALSE mode is less comprehensive — it misses DA markers unless their expression independently covaries with Y. Note: Y is centered before forming diag(Y), so the interaction weights landmarks by deviation from the mean coefficient, not from zero. Landmarks with below-average (but still positive) raw coefficients receive a negative weight, contributing to the structural opposite pole in the decomposition.
.loading.method: Character: method for computing feature loadings. "pearson" computes Pearson correlation between component scores and centered expression (scale-invariant, fast sparse-BLAS path). "ols" computes OLS regression coefficients of centered Xc on component scores (same sparse-BLAS path; preserves magnitude information but is scale-dependent, so high-variance features rank higher). "spearman" computes Spearman rank correlation via a sparse-aware implementation (robust to outliers and nonlinearity; slower, O(nnz * log(m_g) * K)).
.residualize: Logical: if TRUE, project out nuisance covariates from the expression matrix before decomposition. Default FALSE. Nuisance columns are identified automatically: when contrasts are present, design columns that do not participate in any contrast are nuisance; when no contrasts are used, all design columns except .coef.col are nuisance. If get.lm() was called with .block, the blocking variable is additionally included in the nuisance design (dummy-coded). The design is expanded from sample level to landmark level via @config$key. Sparsity of the expression matrix is preserved via implicit operators (the residualized matrix is never materialized). Not supported with .loading.method = "spearman".
.verbose: Logical: print progress messages? Default TRUE.

Value

The modified .tdr.obj with results stored in .tdr.obj$plsD[[.coef.col]]:

scores: Matrix (landmarks x K): PLS scores (oriented so positive = aligned with Y)
feature.weights: Matrix (genes x K): PLS feature weight vectors w (unit-norm)
x.loadings: Matrix (genes x K): PLS deflation loadings p
y.loadings: Numeric vector (K): PLS Y-loadings q (scalar per component)
raw.loadings: Matrix (genes x K): raw feature loadings per component (Pearson r, OLS regression, or Spearman rank correlation, depending on .loading.method); positive = upregulated with Y, negative = downregulated
loadings: Matrix (genes x K): concordance-filtered feature loadings, defined as raw.loadings * concordance.weights for Pearson/OLS. For .loading.method = "spearman", concordance.weights are NA and loadings = raw.loadings. Positive = upregulated with Y, negative = downregulated.
concordance.weights: Matrix (genes/markers x K): per-gene soft concordance weight $c_{jk} \in [0,1]$ for each component. For gene $j$ and component $k$, $c_{jk}$ is the fraction of the OLS loading numerator ($t_{k}^\top X_{c,j}$) attributable to landmarks where the PLS score and the raw density contrast coefficient agree in sign, weighted by the posterior probability that each landmark's coefficient is genuinely nonzero and concordant: $s_{k,i} = \Phi(Y_{0,i} \cdot \operatorname{sign}(t_{k,c,i}) / \hat\sigma_i)$, where $\hat\sigma_i = \sqrt{s^{2,\text{post}}_i} \cdot \text{stdev.unscaled}_{i,c}$ is the limma eBayes posterior standard deviation. $c_{jk} = 1$: loading arises entirely from concordant, statistically confident landmarks; $c_{jk} = 0$: loading arises entirely from discordant or high-uncertainty landmarks (structural counterweight); $c_{jk} = 0.5$: neutral (near-zero loading or near-zero/highly-uncertain coefficient). NA for .loading.method = "spearman" (rank-based numerators are not additively decomposable). Compare raw.loadings to loadings to assess attenuation of structural balancing effects.
Y.alignment: Numeric vector (K): |cor(Y, score_k)| per component
smoothness: Numeric vector (K): Laplacian smoothness per score vector
Y: Numeric vector: the density contrast used, centered to mean zero (Y - mean(Y)). Note: this differs from the raw coefficient by a constant shift. Use the raw coefficient from the model fit for scientific interpretation of absolute density changes.
Y.mean: Numeric: mean of the density contrast used (mean(Y))
M.local: Matrix (optional): the interaction matrix (if .store.M = TRUE)
params: List: parameters used, including: params$residualize (logical: whether residualization was applied) and params$nuisance.cols (character: names of nuisance design columns, NULL if not residualized)

Details

plsD is an interpretive decomposition, not a formal hypothesis test. It answers: "What features — through expression patterns, population identity, or both — explain the density contrast?" Loadings reflect combined signal, like: markers of depleted populations carry negative loadings (high expression where density is low), DE genes carry signed loadings tracking their direction of change, and mixed DA/DE features appear alongside both.

When .YX.interaction = TRUE (default), the interaction term diag(Y) bakes the density contrast into the data matrix: M.local = P \ it captures DA markers, DE genes, and their interplay in a single decomposition. The interaction amplifies expression signal in landmarks with strong density contrast, making the method sensitive even to subtle DE in the absence of DA.

When .YX.interaction = FALSE, M.local = P \ centered expression without Y-weighting). Y appears only on the response side, so loadings reflect only features whose graph-smoothed expression independently covaries with Y. This mode is less comprehensive — it misses DA markers unless their expression happens to correlate with Y through smoothed space. In the presence of strongly bimodal density contrasts and/or, datasets with a small number of extreme landmarks, setting .YX.interaction = FALSE can help distinguish features whose expression genuinely covaries with Y from those that are strong but purely geometric counterweights.

The method:

Extracts the density contrast vector Y (coefficients from get.lm)
Prepares centered expression matrix Xc (size-factor normalized + log2 for RNA; raw for cyto; then centered)
Builds random-walk normalized graph P from SNN (with optional degree regularization and lazy walk)
Constructs M.local: if .YX.interaction = TRUE, M.local = P \ M.local = P \ Optionally, residualize Xc against nuisance covariates before constructing M.local.
Runs NIPALS PLS1: iteratively finds feature weights w maximizing cov(M.local w, Y), with deflation of both M.local and Y
Applies sign convention: positive scores = aligned with Y
Computes feature loadings (Pearson r, OLS regression, or Spearman rank correlation)
Computes Laplacian smoothness (Sk) for each component

Diagnostic metrics (Ak, Sk):

Ak (Y-alignment): Measures how strongly component scores covary with density contrast Y. Ak is high by construction for the leading components: NIPALS PLS1 maximizes covariance with Y at every deflation step. Components are ordered by extraction (plsD1 = highest covariance with Y).

Sk (graph smoothness): Derived from the normalized Laplacian applied to score vectors. High Sk indicates large-scale, graph-smooth structure.

About Y appearing on both sides (when .YX.interaction = TRUE): Y appears in both the data matrix (via diag(Y) in M.local) and the PLS objective (maximize covariance with Y). This is intentional: it gives plsD comprehensive sensitivity to features driving density changes. The design is not circular because the expression matrix Xc sits between: the product is large only when features exist whose Y-weighted, graph-smoothed expression genuinely covaries with Y. With permuted Y, Ak collapses. Setting .YX.interaction = FALSE removes Y from the data matrix entirely, providing a diagnostic comparator.

Nuisance residualization (.residualize = TRUE): When the experimental design includes nuisance covariates (e.g. batch, sex, age), the density contrast Y from get.lm is already adjusted for these. However, the expression matrix X is not. Setting .residualize = TRUE projects X onto the orthogonal complement of the nuisance design, so that both sides of the PLS objective are free of nuisance effects. This improves interpretational symmetry: loadings reflect features whose expression covaries with the adjusted density contrast, with nuisance-driven expression variation removed. The residualization uses the Frisch-Waugh-Lovell (FWL) decomposition: X is replaced by (I - H_Z) X, where H_Z is the hat matrix of the nuisance design expanded to landmark level via @config$key. Sparsity is preserved via implicit operators. When no nuisance covariates exist (e.g. a simple two-group design), .residualize = TRUE is equivalent to .residualize = FALSE.

Blocking variable handling: When get.lm() was called with a .block argument (invoking limma::duplicateCorrelation), the blocking factor does NOT appear in the design matrix — it enters only through the GLS covariance structure for the density contrast Y. If .residualize = TRUE, the blocking variable is automatically detected from fit$block and appended to the nuisance design Z as dummy-coded columns. This ensures that block-level mean expression shifts (e.g. donor-specific expression baselines) are removed from X, preventing them from contaminating PLS directions via spurious covariance with the block-correlated residual structure in Y.

Structural score constraints and the balancing effect: The score vectors computed by plsD are structurally mean-zero across landmarks. This is a mathematical consequence of column-centering M.local: for any weight vector w, the score t = M.local w satisfies sum(t) = 0. As a result, a dominant positive pole (landmarks with large positive scores) must be balanced by compensatory negative scores somewhere on the manifold.

In datasets where a small number of biologically extreme landmarks simultaneously have large |Y_c| and unusual expression profiles (large row-wise expression norm), these landmarks can dominate the first PLS component. The compensatory negative region may then appear coherent and strong, mimicking a genuine opposing biological program.

Distinguishing genuine signal from structural counterweight: the plotPlsD scatter panel colors landmarks by their raw (uncentered) density contrast coefficient. If landmarks with large negative scores show near-zero or positive raw coefficients, they are geometric counterweights rather than genuinely depleted populations. Conversely (depending on contrast direction and component), large-magnitude positive-score landmarks with near-zero raw Y may also reflect structural balance.

Pre-analysis diagnostic: if unexpected strong balancing is observed, compute the row-wise interaction norm distribution before running plsD:


Yc <- coef.mat[, .coef.col]; Yc <- Yc - mean(Yc)
# Xc: size-factor-normalized, log2-transformed, column-centered expression
row_norms <- sqrt(rowSums((Yc * as.matrix(Xc))^2))
quantile(row_norms)

A heavily right-skewed distribution (max >> Q75, or top 10 landmarks holding >20% of Frobenius norm squared) indicates dominant leverage; reduce .lazy.alpha or winsorize Yc before proceeding.

Note

plsD is designed as an exploratory tool to help interpret density changes in terms of the features driving them. It does not provide gene-level p-values or formal multiple testing correction. For rigorous differential expression testing with p-values following field standards, use get.pbDE (pseudobulk DE via edgeR/limma, both design and marker modes). plsD complements these methods by providing a multivariate, graph-aware decomposition that captures joint DA/DE patterns not visible in gene-by-gene tests.

Examples

if (FALSE) { # \dontrun{
# After fitting linear model
lm.obj <- get.lm(lm.obj, .design = design)

# Run plsD for "Infection" coefficient
lm.obj <- get.plsD(lm.obj, .coef.col = "Infection")

# Access results
lm.obj$plsD$Infection$scores[, "plsD1"]      # PLS scores (Y-aligned)
lm.obj$plsD$Infection$loadings[, "plsD1"]      # concordance-filtered gene loadings
lm.obj$plsD$Infection$raw.loadings[, "plsD1"]  # raw gene loadings before concordance filtering

# Diagnostic table
data.frame(
  component = colnames(lm.obj$plsD$Infection$scores),
  Ak = lm.obj$plsD$Infection$Y.alignment,
  Sk = lm.obj$plsD$Infection$smoothness
)
} # }