SCimilarity — Extended Model

An extended version of SCimilarity, a metric-learning foundation model for single-cell RNA-seq that maps cells to a unified embedding space. The original model and method are described in:

Heimberg et al., "A cell atlas foundation model for scalable search of similar human cells", Nature, 2024. https://doi.org/10.1038/s41586-024-08411-y

This extended model was developed by Hussen Mohammed Ibrahim at the Single-Cell Analytics Innovation Laboratory (SAIL), Memorial Sloan Kettering Cancer Center, building on the original SCimilarity framework by Graham Heimberg and colleagues at Genentech.

What's different here

The original SCimilarity was trained on ~7.9 million annotated cells from 56 studies. This model was retrained from scratch on a roughly fivefold larger corpus (39.5 million cells) extracted from CZ CELLxGENE Discover, using the same filtering criteria as the original paper (human cells, non-cancerous tissue, 10x Genomics platform). The extended model covers more studies, a broader range of tissues and anatomical sites, and finer-grained cell type annotations.

Dataset	Original	This model
Training cells	7.9 M	39.5 M
Search index cells	23.4 M	45.5 M

Repository contents

├── encoder.ckpt            # encoder weights (use this for embedding)
├── decoder.ckpt            # decoder weights (reconstruction)
├── gene_order.tsv          # 28,231 gene symbols the model expects as input
├── layer_sizes.json        # network architecture
├── hyperparameters.json    # training hyperparameters
├── label_ints.csv          # cell type label → integer mappings
├── metadata.json           # dataset metadata
├── reference_labels.tsv    # per-cell metadata for all reference cells
│                           # (cell type, donor, tissue, dataset)
├── annotation/
│   └── labelled_kNN.bin    # kNN index for cell type annotation
└── cellsearch/
    └── full_kNN.bin        # kNN index for similarity search

The index files (annotation/ and cellsearch/) are large (~160 GB combined) but optional. If you only need to embed cells into the latent space — for clustering, visualization, or building your own index — you only need encoder.ckpt, gene_order.tsv, and layer_sizes.json.
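
A quick way to confirm that a local download is sufficient for encoder-only use is to check for the three files named above. The sketch below is illustrative only: `missing_encoder_files` is a hypothetical helper, not part of the scimilarity package, and the file names are taken from the repository tree above.

```python
from pathlib import Path

# Files needed for encoder-only use, per the repository layout above.
ENCODER_FILES = ("encoder.ckpt", "gene_order.tsv", "layer_sizes.json")


def missing_encoder_files(model_dir):
    """Return the encoder-only files that are absent from model_dir.

    Hypothetical helper for illustration; not part of scimilarity.
    """
    root = Path(model_dir)
    return [name for name in ENCODER_FILES if not (root / name).is_file()]
```

An empty list from `missing_encoder_files("/path/to/model_v0")` means the directory has everything needed for embedding; the index folders can be added later if annotation or search is required.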

Installation

pip install scimilarity

Or from source:

git clone https://github.com/Genentech/scimilarity
cd scimilarity
pip install -e .

Usage

For full usage examples including cell type annotation and similarity search, see the original SCimilarity notebooks. Simply point model_path to your local copy of this repository instead of the original model directory.

Encoder-only (no index required)

If you want to embed cells without downloading the full index:

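import scanpy as sc
from scimilarity import CellEmbedding
from scimilarity.utils import align_dataset, lognorm_counts

# Load the encoder from a local copy of this repository
ce = CellEmbedding(model_path="/path/to/model_v0")

# Read the data, match its genes to the model's gene order, then normalize
adata = sc.read_h5ad("your_data.h5ad")
adata = align_dataset(adata, ce.gene_order)
adata = lognorm_counts(adata)

# Compute embeddings and store them alongside the data
embeddings = ce.get_embeddings(adata.X)
adata.obsm["X_scimilarity"] = embeddings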
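
Conceptually, the two preprocessing calls above put every cell into the gene order the model expects and then normalize the counts. The toy sketch below illustrates that idea on a single cell in plain Python. It is not the library's implementation: `align_counts` and `lognorm` are hypothetical names, and both the zero-filling of unmeasured genes and the target sum of 10,000 are assumptions about what `align_dataset` and `lognorm_counts` do.

```python
import math


def align_counts(counts_by_gene, gene_order):
    """Reorder one cell's counts to gene_order.

    Genes missing from the cell are filled with 0 (assumed behavior);
    genes not in gene_order are dropped.
    """
    return [float(counts_by_gene.get(gene, 0)) for gene in gene_order]


def lognorm(counts, target_sum=10_000.0):
    """Scale counts to target_sum (assumed value), then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c * target_sum / total) for c in counts]


# Toy example: a three-gene "model" order and a cell measured on other genes.
gene_order = ["CD3E", "CD19", "LYZ"]
cell = {"LYZ": 6, "CD3E": 2, "ACTB": 40}  # ACTB is not a model gene

aligned = align_counts(cell, gene_order)  # [2.0, 0.0, 6.0]
normalized = lognorm(aligned)
```

The point of the sketch is the ordering: column i of the input matrix must always correspond to entry i of gene_order.tsv, whatever genes the source dataset happened to measure.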

Model architecture

Parameter	Value
Input genes	28,231
Hidden layers	3 × 1,024
Embedding dimension	128
Normalization	L2 (unit hypersphere)
Loss	Triplet (semi-hard) + MSE reconstruction
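
Because embeddings are L2-normalized, every cell lies on the unit hypersphere, so cosine similarity reduces to a dot product and squared Euclidean distance equals 2 − 2·cos. The sketch below illustrates that property together with a generic triplet margin loss on toy 3-dimensional vectors. It is a minimal illustration, not the training code: the function names and the margin of 0.05 are hypothetical, and semi-hard triplet mining and the MSE reconstruction term are omitted.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.05):
    """Generic triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is a placeholder, not the model's hyperparameter.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy vectors standing in for 128-dimensional embeddings.
anchor = l2_normalize([1.0, 0.0, 0.0])
positive = l2_normalize([3.0, 4.0, 0.0])   # same cell type, close to anchor
negative = l2_normalize([0.0, 0.0, 2.0])   # different cell type, orthogonal

cosine = dot(anchor, positive)
loss = triplet_loss(anchor, positive, negative)
```

Here the positive is already closer to the anchor than the negative by more than the margin, so the loss is zero; swapping the positive and negative yields a positive loss.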
