SyedA5688 committed · Commit e3561be · verified · Parent(s): 7097bc1

Added initial README file for Gemma-2 2B

---
license: cc0-1.0
language:
- en
base_model: google/gemma-2-2b
library_name: transformers
tags:
- biology
- scRNAseq
- single-cell
- gemma-2
---

# C2S-Scale-2B (Gemma-2)

# Overview
This is the C2S-Scale-Gemma-2B pretrained model, based on the Gemma-2 2B architecture developed by Google. This model was fine-tuned using the Cell2Sentence (C2S) framework on an extensive collection of single-cell RNA sequencing (scRNA-seq) datasets from CellxGene and the Human Cell Atlas. The Cell2Sentence method adapts large language models (LLMs) for single-cell biology by converting scRNA-seq data into "cell sentences," which are ordered sequences of gene names ranked by their expression levels. This model is designed to handle a wide variety of single- and multi-cell tasks, making it a powerful tool for biological discovery.

This work is the result of a collaboration between Yale University, Google Research, and Google DeepMind aimed at scaling up C2S models. The C2S-Scale models were trained on Google's TPU v5 hardware, enabling a significant leap in model size and capabilities. The new capabilities and scaling analysis of C2S-Scale models are detailed in our new [preprint paper](https://www.biorxiv.org/content/10.1101/2025.04.14.648850v1) and in a [blog post](https://research.google/blog/teaching-machines-the-language-of-biology-scaling-large-language-models-for-next-generation-single-cell-analysis/) from Google Research.
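
The cell-sentence conversion described above can be sketched in a few lines of Python. This is a minimal illustration with made-up gene names and counts; the project's actual preprocessing lives in the `cell2sentence` codebase linked below:

```python
def expression_to_cell_sentence(expression, gene_names, max_genes=100):
    """Order gene names by descending expression, drop unexpressed genes,
    and join the result into a space-separated "cell sentence".
    Illustrative sketch only, not the official C2S preprocessing."""
    ranked = sorted(zip(gene_names, expression), key=lambda pair: -pair[1])
    expressed = [gene for gene, count in ranked if count > 0]
    return " ".join(expressed[:max_genes])

# Toy cell: four genes with hypothetical expression counts
genes = ["CD3D", "MALAT1", "GNLY", "ACTB"]
counts = [5.0, 12.0, 0.0, 7.0]
print(expression_to_cell_sentence(counts, genes))  # MALAT1 ACTB CD3D
```

The highest-expressed gene comes first and unexpressed genes are dropped, so the sentence length varies from cell to cell.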

# Training Data
This model was trained on over 57 million human and mouse cells drawn from more than 800 scRNA-seq datasets in CellxGene and the Human Cell Atlas, encompassing a diverse range of cell types, tissues, and conditions. The model was trained with a variable number of genes per cell sentence:
- Cells: For multi-cell samples, each training sample contained between 5 and 20 cells, with the same number of genes for each cell in the sample.
- Genes: For single-cell samples, each cell sentence contained between 100 and 2048 genes. For multi-cell samples, the cell sentence for each cell contained between 100 and 400 genes.
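
The sampling ranges above can be illustrated with a short sketch (a hypothetical helper, not the project's actual data pipeline) that draws a cell count and a shared per-cell gene count, then truncates each cell sentence accordingly:

```python
import random

def make_multicell_sample(cell_sentences, seed=0):
    """Assemble one illustrative multi-cell training sample: 5-20 cells,
    each truncated to the same number of genes (100-400)."""
    rng = random.Random(seed)
    num_cells = rng.randint(5, min(20, len(cell_sentences)))
    genes_per_cell = rng.randint(100, 400)  # shared by every cell in the sample
    chosen = rng.sample(cell_sentences, num_cells)
    return [" ".join(sentence.split()[:genes_per_cell]) for sentence in chosen]

# Toy corpus: 30 fake cells with 500 placeholder gene names each
corpus = [" ".join(f"GENE{i}" for i in range(500)) for _ in range(30)]
sample = make_multicell_sample(corpus)
print(len(sample), len(sample[0].split()))
```

Truncating a ranked cell sentence keeps the most highly expressed genes, which is why a shared gene budget per sample is well defined.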

# Tasks
This model is designed for a comprehensive suite of single-cell and multi-cell tasks:

## Single-Cell Tasks
- Unconditional single-cell generation: Generate single-cell sentences without any specific conditioning.
- Cell type prediction: Predict the cell type of a given single cell.
- Cell type-conditioned generation: Generate a single-cell sentence based on a specific cell type.

## Multi-Cell Tasks
- Unconditional multi-cell generation: Generate multiple cell sentences unconditionally.
- Tissue prediction: Predict the tissue of origin for a group of cells.
- Cell type prediction: Predict the cell types for a group of multiple cells.
- Tissue-conditioned multi-cell generation: Generate multiple cell sentences conditioned on a specific tissue.
- Cell type-conditioned multi-cell generation: Generate multiple cell sentences conditioned on the cell type of each cell.
- Multi-cells to abstract: Generate a research paper abstract from multi-cell sentences.
- Abstract to multi-cells: Generate multi-cell sentences based on a given research paper abstract.

## Gene Set Tasks
- Gene set name to genes: Generate an alphabetical list of genes for a given gene set name.
- Genes to gene set name: Identify the name of a gene set from an alphabetical list of genes.
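
A minimal inference sketch with the Hugging Face `transformers` library, using cell type prediction as the example task. The repository id and prompt wording here are assumptions for illustration; check the files in this repository for the exact prompt templates used in training:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_cell_type(cell_sentence, model_id="vandijklab/C2S-Scale-Gemma-2-2B"):
    """Prompt the model to name the cell type of one cell sentence.
    Note: model_id and the prompt format are illustrative assumptions."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    prompt = f"Predict the cell type of the following cell sentence: {cell_sentence}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=16)
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]  # keep only the completion
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

if __name__ == "__main__":
    print(predict_cell_type("MALAT1 ACTB CD3D GNLY"))
```

The same pattern applies to the other tasks above by swapping in the corresponding prompt (e.g. tissue prediction or conditioned generation).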

# C2S-Scale Links
- Paper: [Scaling Large Language Models for Next-Generation Single-Cell Analysis](https://www.biorxiv.org/content/10.1101/2025.04.14.648850v1)
- Google Research Blog Post: [Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis](https://research.google/blog/teaching-machines-the-language-of-biology-scaling-large-language-models-for-next-generation-single-cell-analysis/)
- GitHub: https://github.com/vandijklab/cell2sentence (Note: the codebase is licensed under CC BY-NC-ND 4.0, while the model weights on Hugging Face are CC0 1.0)

# Gemma-2 Links
- Hugging Face: https://huggingface.co/google/gemma-2-2b
- Blog Post: Gemma explained: What's new in Gemma 2