The adaptation of Large Language Model architectures to computational biology has enabled Single-Cell Foundation Models that learn from single-cell RNA-sequencing data. However, many of these models rely on masking strategies borrowed from natural language processing; unlike words in a sentence, gene expression is governed by highly correlated regulatory networks, making random masking and other structure-naive techniques biologically misaligned. Viewed through an information-theoretic lens, this introduces a key inefficiency: models can reconstruct masked genes from local correlations, limiting their ability to learn representations that capture higher-order biological structure and driving reliance on large datasets, a challenge in data-scarce settings such as rare disease cohorts or privacy-preserving environments.
To address this, we introduce domain-informed masking during pre-training. In this talk, we present CorrMask, a data-driven, dependency-aware masking scheme that leverages gene correlation structure to jointly mask related genes, encouraging the model to learn from global cellular context rather than local co-expression shortcuts. Across tissue-specific datasets, CorrMask matches baseline performance on both cell- and gene-level tasks while using less data, with the strongest gains in underrepresented cell populations.
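To give a rough sense of the idea, the sketch below illustrates one way correlation-aware joint masking could be implemented; it is not the exact CorrMask procedure (those details are covered in the talk), and the function name and parameters (mask_ratio, group_size) are illustrative assumptions. It estimates a gene-gene correlation matrix from the training data and grows each cell's mask from random seed genes plus their most correlated partners, so that related genes are hidden together and cannot be trivially reconstructed from one another.

import numpy as np

def correlation_masks(expr, mask_ratio=0.15, group_size=5, seed=0):
    """Sketch of dependency-aware masking: jointly mask groups of
    correlated genes instead of independent random positions.

    expr       : (n_cells, n_genes) expression matrix
    mask_ratio : target fraction of genes masked per cell
    group_size : number of co-masked neighbours per seed gene
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = expr.shape

    # Gene-gene Pearson correlation estimated from the training data.
    corr = np.corrcoef(expr, rowvar=False)
    np.fill_diagonal(corr, 0.0)

    # For each gene, the indices of its most strongly correlated partners.
    neighbours = np.argsort(-np.abs(corr), axis=1)[:, :group_size]

    n_mask = int(mask_ratio * n_genes)
    masks = np.zeros((n_cells, n_genes), dtype=bool)
    for c in range(n_cells):
        masked = set()
        # Grow the mask from random seed genes plus their correlated neighbours.
        while len(masked) < n_mask:
            g = int(rng.integers(n_genes))
            masked.add(g)
            masked.update(neighbours[g].tolist())
        idx = np.fromiter(masked, dtype=int)[:n_mask]
        masks[c, idx] = True
    return masks

In this sketch the correlation structure is computed once over the full matrix; a practical variant might estimate it per tissue or per batch, but the key design choice is the same: masked positions are chosen as correlated groups rather than independently at random.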
Taken together, these results position CorrMask as an effective “data multiplier” for enabling efficient, biologically grounded foundation models, with broader implications for predictive modeling in our field.