SyedA5688 committed · Commit e3561be · verified · Parent(s): 7097bc1

Added initial README file for Gemma-2 2B

---
license: cc0-1.0
language:
- en
base_model: google/gemma-2-2b
library_name: transformers
tags:
- biology
- scRNAseq
- single-cell
- gemma-2
---

# C2S-Scale-2B (Gemma-2)

# Overview
This is the C2S-Scale-Gemma-2B pretrained model, based on the Gemma-2 2B architecture developed by Google. This model was fine-tuned using the Cell2Sentence (C2S) framework on an extensive collection of single-cell RNA sequencing (scRNA-seq) datasets from CellxGene and the Human Cell Atlas. The Cell2Sentence method adapts large language models (LLMs) for single-cell biology by converting scRNA-seq data into "cell sentences," which are ordered sequences of gene names ranked by their expression levels. This model is designed to handle a wide variety of single- and multi-cell tasks, making it a powerful tool for biological discovery.

This work is the result of a collaboration between Yale University, Google Research, and Google DeepMind aimed at scaling up C2S models. The C2S-Scale models were trained on Google's TPU v5 hardware, enabling a significant leap in model size and capabilities. The new capabilities and scaling analysis of C2S-Scale models are detailed in our new [preprint paper](https://www.biorxiv.org/content/10.1101/2025.04.14.648850v1) and in a [blog post](https://research.google/blog/teaching-machines-the-language-of-biology-scaling-large-language-models-for-next-generation-single-cell-analysis/) from Google Research.
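
The cell-sentence conversion described above can be sketched in a few lines of Python. This is a minimal illustration with made-up gene names and counts; the project's actual preprocessing lives in the `cell2sentence` codebase linked below:

```python
def expression_to_cell_sentence(expression, gene_names, max_genes=100):
    """Order gene names by descending expression, drop unexpressed genes,
    and join the result into a space-separated "cell sentence".
    Illustrative sketch only, not the official C2S preprocessing."""
    ranked = sorted(zip(gene_names, expression), key=lambda pair: -pair[1])
    expressed = [gene for gene, count in ranked if count > 0]
    return " ".join(expressed[:max_genes])

# Toy cell: four genes with hypothetical expression counts
genes = ["CD3D", "MALAT1", "GNLY", "ACTB"]
counts = [5.0, 12.0, 0.0, 7.0]
print(expression_to_cell_sentence(counts, genes))  # MALAT1 ACTB CD3D
```

The highest-expressed gene comes first and unexpressed genes are dropped, so the sentence length varies from cell to cell.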

# Training Data
This model was trained on over 57 million human and mouse cells drawn from more than 800 scRNA-seq datasets in CellxGene and the Human Cell Atlas, encompassing a diverse range of cell types, tissues, and conditions. The model was trained with a variable number of genes per cell sentence:
- Cells: For multi-cell samples, each training sample contained between 5 and 20 cells, with the same number of genes for each cell in the sample.
- Genes: For single-cell samples, each cell sentence contained between 100 and 2048 genes. For multi-cell samples, the cell sentence for each cell contained between 100 and 400 genes.
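
The sampling ranges above can be illustrated with a short sketch (a hypothetical helper, not the project's actual data pipeline) that draws a cell count and a shared per-cell gene count, then truncates each cell sentence accordingly:

```python
import random

def make_multicell_sample(cell_sentences, seed=0):
    """Assemble one illustrative multi-cell training sample: 5-20 cells,
    each truncated to the same number of genes (100-400)."""
    rng = random.Random(seed)
    num_cells = rng.randint(5, min(20, len(cell_sentences)))
    genes_per_cell = rng.randint(100, 400)  # shared by every cell in the sample
    chosen = rng.sample(cell_sentences, num_cells)
    return [" ".join(sentence.split()[:genes_per_cell]) for sentence in chosen]

# Toy corpus: 30 fake cells with 500 placeholder gene names each
corpus = [" ".join(f"GENE{i}" for i in range(500)) for _ in range(30)]
sample = make_multicell_sample(corpus)
print(len(sample), len(sample[0].split()))
```

Truncating a ranked cell sentence keeps the most highly expressed genes, which is why a shared gene budget per sample is well defined.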

# Tasks
This model is designed for a comprehensive suite of single-cell and multi-cell tasks:

## Single-Cell Tasks
- Unconditional single-cell generation: Generate single-cell sentences without any specific conditioning.
- Cell type prediction: Predict the cell type of a given single cell.
- Cell type-conditioned generation: Generate a single-cell sentence based on a specific cell type.

## Multi-Cell Tasks
- Unconditional multi-cell generation: Generate multiple cell sentences unconditionally.
- Tissue prediction: Predict the tissue of origin for a group of cells.
- Cell type prediction: Predict the cell types for a group of multiple cells.
- Tissue-conditioned multi-cell generation: Generate multiple cell sentences conditioned on a specific tissue.
- Cell type-conditioned multi-cell generation: Generate multiple cell sentences conditioned on the cell type of each cell.
- Multi-cells to abstract: Generate a research paper abstract from multi-cell sentences.
- Abstract to multi-cells: Generate multi-cell sentences based on a given research paper abstract.

## Gene Set Tasks
- Gene set name to genes: Generate an alphabetical list of genes for a given gene set name.
- Genes to gene set name: Identify the name of a gene set from an alphabetical list of genes.
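
A minimal inference sketch with the Hugging Face `transformers` library, using cell type prediction as the example task. The repository id and prompt wording here are assumptions for illustration; check the files in this repository for the exact prompt templates used in training:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_cell_type(cell_sentence, model_id="vandijklab/C2S-Scale-Gemma-2-2B"):
    """Prompt the model to name the cell type of one cell sentence.
    Note: model_id and the prompt format are illustrative assumptions."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    prompt = f"Predict the cell type of the following cell sentence: {cell_sentence}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=16)
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]  # keep only the completion
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

if __name__ == "__main__":
    print(predict_cell_type("MALAT1 ACTB CD3D GNLY"))
```

The same pattern applies to the other tasks above by swapping in the corresponding prompt (e.g. tissue prediction or conditioned generation).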

# C2S-Scale Links
- Paper: [Scaling Large Language Models for Next-Generation Single-Cell Analysis](https://www.biorxiv.org/content/10.1101/2025.04.14.648850v1)
- Google Research Blog Post: [Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis](https://research.google/blog/teaching-machines-the-language-of-biology-scaling-large-language-models-for-next-generation-single-cell-analysis/)
- GitHub: https://github.com/vandijklab/cell2sentence (Note: the codebase is licensed under CC BY-NC-ND 4.0, while the model weights on Hugging Face are CC0 1.0)

# Gemma-2 Links
- Hugging Face: https://huggingface.co/google/gemma-2-2b
- Blog Post: Gemma explained: What's new in Gemma 2