SCimilarity: Extended Model
An extended version of SCimilarity, a metric-learning foundation model for single-cell RNA-seq that maps cells to a unified embedding space. The original model and method are described in:
Heimberg et al., "A cell atlas foundation model for scalable search of similar human cells", Nature, 2024. https://doi.org/10.1038/s41586-024-08411-y
This extended model was developed by Hussen Mohammed Ibrahim at the Single-Cell Analytics Innovation Laboratory (SAIL), Memorial Sloan Kettering Cancer Center, building on the original SCimilarity framework by Graham Heimberg and colleagues at Genentech.
What's different here
The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a corpus roughly five times larger (39.5 million training cells) extracted from CZ CELLxGENE Discover, using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform). Compared with the original, the extended model covers more diverse studies, a broader range of tissues and body parts, and more fine-grained cell type annotations.
| | Original | This model |
|---|---|---|
| Training cells | 7.9 M | 39.5 M |
| Search index cells | 23.4 M | 45.5 M |
Repository contents
├── encoder.ckpt            # encoder weights (use this for embedding)
├── decoder.ckpt            # decoder weights (reconstruction)
├── gene_order.tsv          # 28,231 gene symbols the model expects as input
├── layer_sizes.json        # network architecture
├── hyperparameters.json    # training hyperparameters
├── label_ints.csv          # cell type label-to-integer mappings
├── metadata.json           # dataset metadata
├── reference_labels.tsv    # per-cell metadata for all reference cells
│                           # (cell type, donor, tissue, dataset)
├── annotation/
│   └── labelled_kNN.bin    # kNN index for cell type annotation
└── cellsearch/
    └── full_kNN.bin        # kNN index for similarity search
The index files (annotation/ and cellsearch/) are large (~160 GB combined) but optional. If you only need to embed cells into the latent space (for clustering, visualization, or building your own index), you only need encoder.ckpt, gene_order.tsv, and layer_sizes.json.
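Before loading the model, it can help to confirm that the files needed for encoder-only use are present. The sketch below checks a local directory against the file names listed above. `missing_files` is a hypothetical helper written for illustration and is not part of the scimilarity package.

```python
from pathlib import Path

# Files this README lists as sufficient for encoder-only embedding.
ENCODER_ONLY_FILES = ("encoder.ckpt", "gene_order.tsv", "layer_sizes.json")

# Optional kNN index files, needed only for annotation and similarity search.
INDEX_FILES = ("annotation/labelled_kNN.bin", "cellsearch/full_kNN.bin")


def missing_files(model_dir, required=ENCODER_ONLY_FILES):
    """Return the entries of `required` that are absent from `model_dir`.

    Hypothetical helper for illustration; not a scimilarity function.
    """
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_files("/path/to/model_v0")` returns an empty list when the directory is ready for embedding. Passing `INDEX_FILES` as the second argument checks for the optional indices instead.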
Installation
pip install scimilarity
Or from source:
git clone https://github.com/Genentech/scimilarity
cd scimilarity
pip install -e .
Usage
For full usage examples including cell type annotation and similarity search, see the original SCimilarity notebooks. Simply point model_path to your local copy of this repository instead of the original model directory.
Encoder-only (no index required)
If you want to embed cells without downloading the full index:
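import scanpy as sc
from scimilarity import CellEmbedding
from scimilarity.utils import align_dataset, lognorm_counts

# Load the encoder from the local copy of this repository.
ce = CellEmbedding(model_path="/path/to/model_v0")

adata = sc.read_h5ad("your_data.h5ad")
# Match the dataset's genes to the gene order the model expects.
adata = align_dataset(adata, ce.gene_order)
# Log-normalize the counts.
adata = lognorm_counts(adata)

# Embed each cell and store the result alongside the data.
embeddings = ce.get_embeddings(adata.X)
adata.obsm["X_scimilarity"] = embeddings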
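Once embeddings are available, one simple way to query them without the prebuilt index is a brute-force cosine search. The sketch below is an illustration only. It uses synthetic vectors in place of real embeddings, and `l2_normalize` and `top_k_similar` are hypothetical helpers, not scimilarity functions. Exhaustive search is a stand-in here and will not scale to millions of cells the way a dedicated kNN index does. It relies on the embeddings being 128-dimensional and L2-normalized, as stated in the Model architecture section.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (leaves unit-norm rows unchanged up to rounding)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_similar(query, reference, k=5):
    """Return (indices, cosine similarities) of the k most similar reference rows per query row."""
    q = l2_normalize(np.asarray(query, dtype=np.float32))
    r = l2_normalize(np.asarray(reference, dtype=np.float32))
    sims = q @ r.T  # on unit vectors, the dot product equals cosine similarity
    k = min(k, r.shape[0])
    idx = np.argsort(-sims, axis=1)[:, :k]  # most similar first
    return idx, np.take_along_axis(sims, idx, axis=1)


# Synthetic stand-in for adata.obsm["X_scimilarity"] (assumption: 128-dim vectors).
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 128))
# Queries are slightly perturbed copies of the first three reference rows.
query = reference[:3] + 0.01 * rng.normal(size=(3, 128))

idx, sims = top_k_similar(query, reference, k=5)
```

With real data, `reference` would be the embeddings of your own reference cells and `query` the embeddings of the cells you want to look up.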
Model architecture
| Parameter | Value |
|---|---|
| Input genes | 28,231 |
| Hidden layers | 3 × 1,024 |
| Embedding dimension | 128 |
| Normalization | L2 (unit hypersphere) |
| Loss | Triplet (semi-hard) + MSE reconstruction |
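To make the loss row concrete, here is a minimal NumPy sketch of a triplet margin loss combined with an MSE reconstruction term. It is not the training code. The margin, the weight `alpha`, the weighted-sum form of the combination, and all function names are assumptions chosen for illustration. The mining step is reduced to a mask that flags semi-hard negatives.

```python
import numpy as np


def _distances(anchor, positive, negative):
    """Euclidean anchor-positive and anchor-negative distances, one per triplet."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return d_pos, d_neg


def triplet_loss(anchor, positive, negative, margin=0.05):
    """Mean hinge loss: penalize negatives that are not at least `margin` farther than positives."""
    d_pos, d_neg = _distances(anchor, positive, negative)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def is_semi_hard(anchor, positive, negative, margin=0.05):
    """Mask of semi-hard triplets: negative is farther than the positive but inside the margin."""
    d_pos, d_neg = _distances(anchor, positive, negative)
    return (d_neg > d_pos) & (d_neg < d_pos + margin)


def combined_loss(anchor, positive, negative, x, x_hat, alpha=0.5, margin=0.05):
    """Weighted sum of triplet and reconstruction losses (assumed form; alpha is a placeholder)."""
    mse = float(np.mean((x - x_hat) ** 2))
    return alpha * triplet_loss(anchor, positive, negative, margin) + (1.0 - alpha) * mse
```

Here `anchor`, `positive`, and `negative` are arrays of embeddings with one triplet per row, while `x` and `x_hat` are the input expression profile and its reconstruction by the decoder.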