Added initial README file for Gemma-2 2B
README.md
CHANGED
|
@@ -1,3 +1,70 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: cc0-1.0
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: cc0-1.0
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
base_model: google/gemma-2-2b
|
| 6 |
+
library_name: transformers
|
| 7 |
+
tags:
|
| 8 |
+
- biology
|
| 9 |
+
- scRNAseq
|
| 10 |
+
- single-cell
|
| 11 |
+
- gemma-2
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
+
C2S-Scale-2B (Gemma-2)
|
| 15 |
+
|
| 16 |
+
# Overview
|
| 17 |
+
This is the C2S-Scale-Gemma-2B pretrained model, based on the Gemma-2 2B architecture
|
| 18 |
+
developed by Google. This model was fine-tuned using the Cell2Sentence (C2S) framework
|
| 19 |
+
on an extensive collection of single-cell RNA sequencing (scRNA-seq) datasets from
|
| 20 |
+
CellxGene and the Human Cell Atlas. The Cell2Sentence method adapts large language
|
| 21 |
+
models (LLMs) for single-cell biology by converting scRNA-seq data into "cell sentences,"
|
| 22 |
+
which are ordered sequences of gene names based on their expression levels. This model
|
| 23 |
+
is designed to handle a wide variety of single- and multi-cell tasks, making it a
|
| 24 |
+
powerful tool for biological discovery.
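
As a rough illustration of the "cell sentence" idea described above, the sketch below ranks gene names by expression for a single cell. It is not taken from the Cell2Sentence codebase: the function name `to_cell_sentence`, the descending order, the dropping of zero-expression genes, the tie-breaking rule, and the space separator are all assumptions made for this example.

```python
import numpy as np


def to_cell_sentence(expression, gene_names, max_genes=None):
    """Return gene names joined by spaces, ordered from highest to lowest expression.

    Assumptions for this sketch: genes with zero expression are dropped,
    and ties keep their original input order.
    """
    expression = np.asarray(expression, dtype=float)
    # Stable sort on the negated values puts the highest expression first.
    order = np.argsort(-expression, kind="stable")
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    if max_genes is not None:
        # Optionally keep only the top-ranked genes.
        ranked = ranked[:max_genes]
    return " ".join(ranked)


# Toy counts for one cell (illustrative values only).
counts = [0, 12, 3, 7]
genes = ["GAPDH", "CD3E", "ACTB", "MALAT1"]
sentence = to_cell_sentence(counts, genes)
print(sentence)
```

The `max_genes` argument is a hypothetical convenience for capping sentence length; the actual preprocessing used for this model may differ.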
|
| 25 |
+
|
| 26 |
+
This work is the result of a collaboration between Yale University, Google Research,
|
| 27 |
+
and Google DeepMind aimed at scaling up C2S models. The C2S-Scale models were trained
|
| 28 |
+
on Google's TPU v5s, enabling a significant leap in model size and capabilities. The
|
| 29 |
+
new capabilities and scaling analysis of C2S-Scale models are detailed in our new
|
| 30 |
+
[preprint paper](https://www.biorxiv.org/content/10.1101/2025.04.14.648850v1) and
|
| 31 |
+
[blog post](https://research.google/blog/teaching-machines-the-language-of-biology-scaling-large-language-models-for-next-generation-single-cell-analysis/)
|
| 32 |
+
from Google Research.
|
| 33 |
+
|
| 34 |
+
# Training Data
|
| 35 |
+
This model was trained on over 57 million human and mouse cells from more than 800
|
| 36 |
+
scRNA-seq datasets from CellxGene and the Human Cell Atlas. This large-scale dataset
|
| 37 |
+
encompasses a diverse range of cell types, tissues, and conditions from human and
|
| 38 |
+
mouse. The model was trained with a variable number of genes per cell sentence.
|
| 39 |
+
- Cells: For multi-cell samples, each training sample contained between 5 and 20 cells, with the same number of genes for each cell in the sample.
|
| 40 |
+
- Genes: For single-cell samples, each cell sentence contained between 100 and 2048 genes. For multi-cell samples, the cell sentence for each cell contained between 100 and 400 genes.
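
The ranges above can be turned into a quick sanity check before prompting the model. The sketch below is only an illustration: the helper `within_training_ranges` and the constant names are hypothetical, and nothing here is enforced by the model itself. It simply tests whether a sample falls inside the cell and gene counts stated in this section.

```python
# Ranges copied from the training data description above.
SINGLE_CELL_GENES = (100, 2048)
MULTI_CELL_GENES = (100, 400)
MULTI_CELL_CELLS = (5, 20)


def within_training_ranges(cell_sentences):
    """Check a sample given as a list of cells, each a list of gene names.

    A one-cell list is treated as a single-cell sample; anything longer is
    treated as a multi-cell sample (an assumption of this sketch).
    """
    n_cells = len(cell_sentences)
    if n_cells == 0:
        return False
    if n_cells == 1:
        low, high = SINGLE_CELL_GENES
        return low <= len(cell_sentences[0]) <= high
    low_cells, high_cells = MULTI_CELL_CELLS
    if not low_cells <= n_cells <= high_cells:
        return False
    # Multi-cell samples used the same number of genes for every cell.
    lengths = {len(cell) for cell in cell_sentences}
    if len(lengths) != 1:
        return False
    low, high = MULTI_CELL_GENES
    return low <= lengths.pop() <= high


def fake_cell(n_genes):
    """Build a placeholder cell with made-up gene names G0, G1, ..."""
    return [f"G{i}" for i in range(n_genes)]


print(within_training_ranges([fake_cell(150)]))
print(within_training_ranges([fake_cell(200) for _ in range(5)]))
```

Samples outside these ranges may still work, but they fall outside what the model saw during training.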
|
| 41 |
+
|
| 42 |
+
# Tasks
|
| 43 |
+
This model is designed for a comprehensive suite of single-cell and multi-cell tasks:
|
| 44 |
+
|
| 45 |
+
Single-Cell Tasks
|
| 46 |
+
- Unconditional single-cell generation: Generate single-cell sentences without any specific conditioning.
|
| 47 |
+
- Cell type prediction: Predict the cell type of a given single cell.
|
| 48 |
+
- Cell type-conditioned generation: Generate a single-cell sentence based on a specific cell type.
|
| 49 |
+
|
| 50 |
+
Multi-Cell Tasks
|
| 51 |
+
- Unconditional multi-cell generation: Generate multiple cell sentences unconditionally.
|
| 52 |
+
- Tissue prediction: Predict the tissue of origin for a group of cells.
|
| 53 |
+
- Cell type prediction: Predict the cell types for a group of multiple cells.
|
| 54 |
+
- Tissue-conditioned multi-cell generation: Generate multiple cell sentences conditioned on a specific tissue.
|
| 55 |
+
- Cell type-conditioned multi-cell generation: Generate multiple cell sentences conditioned on the cell type of each cell.
|
| 56 |
+
- Multi-cells to abstract: Generate a research paper abstract from multi-cell sentences.
|
| 57 |
+
- Abstract to multi-cells: Generate multi-cell sentences based on a given research paper abstract.
|
| 58 |
+
|
| 59 |
+
Gene Set Tasks
|
| 60 |
+
- Gene set name to genes: Generate an alphabetical list of genes for a given gene set name.
|
| 61 |
+
- Genes to gene set name: Identify the name of a gene set from an alphabetical list of genes.
|
| 62 |
+
|
| 63 |
+
# C2S-Scale Links
|
| 64 |
+
- Paper: [Scaling Large Language Models for Next-Generation Single-Cell Analysis](https://www.biorxiv.org/content/10.1101/2025.04.14.648850v1)
|
| 65 |
+
- Google Research Blog Post: [Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis](https://research.google/blog/teaching-machines-the-language-of-biology-scaling-large-language-models-for-next-generation-single-cell-analysis/)
|
| 66 |
+
- GitHub: https://github.com/vandijklab/cell2sentence (Note: The codebase is licensed under CC BY-NC-ND 4.0, while the model weights on Hugging Face are CC0 1.0)
|
| 67 |
+
|
| 68 |
+
# Gemma-2 Links
|
| 69 |
+
- Hugging Face: https://huggingface.co/google/gemma-2-2b
|
| 70 |
+
- Blog Post: Gemma explained: What's new in Gemma 2
|