AiCoderv2 committed (verified)
Commit 9b96618 · Parent: 6feb547

Upload 8 files

README.md CHANGED
@@ -1,3 +1,176 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: gemma
+ library_name: transformers
+ pipeline_tag: text-generation
+ extra_gated_heading: Access Gemma on Hugging Face
+ extra_gated_prompt: >-
+   To access Gemma on Hugging Face, you’re required to review and agree to
+   Google’s usage license. To do this, please ensure you’re logged in to Hugging
+   Face and click below. Requests are processed immediately.
+ extra_gated_button_content: Acknowledge license
+ tags:
+ - differential_privacy
+ - dp-sgd
+ ---
+
+ # VaultGemma model card
+
+ **Model Page**: [VaultGemma][model-page]
+
+ **Resources and Technical Documentation**:
+
+ * [VaultGemma Technical Report][tech-report]
+ * [Responsible Generative AI Toolkit][rai-toolkit]
+ * [VaultGemma on Kaggle][kaggle-gemma]
+
+ **Terms of Use**: [Terms][terms]
+
+ **Authors**: Google
+
+ ## Model Information
+
+ Summary description and brief definition of inputs and outputs.
+
+ ### Description
+
+ VaultGemma is a variant of the Gemma family of lightweight, state-of-the-art open models from Google. It is pre-trained from the ground up using Differential Privacy (DP). This provides strong, mathematically backed privacy guarantees for its training data, limiting the extent to which the model's outputs can reveal information about any single training example.
+
+ VaultGemma uses an architecture similar to that of Gemma 2. It is a pretrained model that can be instruction tuned for a variety of language understanding and generation tasks. Its relatively small size (< 1B parameters) makes it possible to deploy in environments with limited resources, democratizing access to state-of-the-art AI models that are built with privacy at their core.
+
+ ### Inputs and outputs
+
+ - **Input:**
+     - Text string, such as a question, a prompt, or a document to be summarized.
+     - Total input context of 1,024 tokens.
+
+ - **Output:**
+     - Generated text in response to the input, such as an answer to a question, a summary, or a categorization.
+
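+ As a concrete illustration of these inputs and outputs, the following is a minimal, hedged loading sketch rather than an official usage recipe. It assumes a recent version of the `transformers` library with VaultGemma support (the configuration in this commit was produced with version 4.53.2) and uses a placeholder `model_id`; substitute the actual hub id or a local path to these files.
+
+ ```python
+ # Minimal text-generation sketch (assumptions: VaultGemma-aware transformers,
+ # placeholder model_id pointing at this repository's files).
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "path/to/vaultgemma"  # placeholder: replace with the hub id or local path
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+
+ # Keep prompt plus generated tokens within the 1,024-token context window.
+ prompt = "Summarize the idea of differential privacy in one sentence."
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=128)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+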
+ ## Model Data
+
+ Data used for model training and how the data was processed.
+
+ ### Training Dataset
+
+ The model was trained from scratch with differential privacy on a large-scale dataset of English-language text data from a variety of sources, including:
+
+ - Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary.
+ - Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
+ - Mathematics: Training on mathematical text helps the model learn logical reasoning and symbolic representation to address mathematical queries.
+
+ The defining feature of this model is that the entire pre-training process was conducted using Differentially Private Stochastic Gradient Descent (DP-SGD) with a privacy budget of ε≤2.0, δ≤1.1e-10. DP-SGD provides a formal guarantee that the model's core knowledge base is itself private with respect to the individual examples in the training set.
+
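+ To make the training mechanism concrete, the sketch below shows one toy DP-SGD step on a linear model; it is illustrative only and is not the VaultGemma training code, which ran in JAX at scale as described under Implementation Information. Recall that (ε, δ)-DP requires Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ for any pair of datasets D, D′ differing in one example; each noisy step below contributes to that overall budget through privacy accounting. The clipping norm `C`, noise multiplier `sigma`, learning rate, and toy data are assumptions for illustration.
+
+ ```python
+ # Toy DP-SGD step: per-example gradient clipping + Gaussian noise.
+ # Illustrative only; hyperparameters and model are placeholders.
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+ X, y = rng.normal(size=(32, 8)), rng.normal(size=32)   # toy batch
+ w = np.zeros(8)                                         # toy linear model
+ C, sigma, lr = 1.0, 1.1, 0.1                            # clip norm, noise multiplier, step size
+
+ # Per-example gradients of the squared error, clipped to L2 norm at most C.
+ per_example_grads = 2 * (X @ w - y)[:, None] * X
+ norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
+ clipped = per_example_grads / np.maximum(1.0, norms / C)
+
+ # Sum the clipped gradients, add Gaussian noise scaled to sigma * C, then average and step.
+ noisy_sum = clipped.sum(axis=0) + rng.normal(scale=sigma * C, size=w.shape)
+ w -= lr * noisy_sum / len(X)
+ ```
+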
+ ### Data Preprocessing
+
+ In addition to the inherent privacy protections of differential privacy, the following data cleaning and filtering methods used with Gemma 2 were applied to the training data:
+
+ - CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
+ - Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
+ - Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].
+
+ ## Implementation Information
+
+ Details about the model internals.
+
+ ### Hardware
+
+ VaultGemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv6e). Training large language models with the significant computational overhead of differential privacy requires specialized hardware. TPUs are designed to handle the massive computations involved, offering the performance, memory, and scalability necessary to train models like VaultGemma efficiently and sustainably.
+
+ ### Software
+
+ Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. The core of the training implementation relied on specialized algorithms for privacy-preserving machine learning at scale:
+
+ - [Differentially Private Stochastic Gradient Descent (DP-SGD)][dp-sgd]: The optimization algorithm used to train the model while providing formal privacy guarantees.
+ - [Truncated Poisson Subsampling][poisson-subsampling]: A computationally efficient method used to enable large-scale DP training with fixed-size batches, which is critical for performance on modern accelerators (a toy sketch of the idea follows this list).
+ - [DP Scaling Laws][dp-scaling-laws]: The training configuration (model size, batch size, iterations) was determined by a novel set of scaling laws developed specifically for differentially private training, ensuring the optimal use of the compute and privacy budgets.
+
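+ The sketch below is a simplified, illustrative take on truncated Poisson subsampling, not the exact procedure from the linked paper: each example is included independently with probability `q`, and the resulting batch is truncated or padded to a fixed size `B` so that accelerators always see fixed-shape batches. The sampling rate, batch size, and padding convention are assumptions for illustration.
+
+ ```python
+ # Illustrative fixed-size batch selection via truncated Poisson subsampling.
+ import numpy as np
+
+ def truncated_poisson_batch(num_examples: int, q: float, batch_size: int, rng):
+     """Return exactly `batch_size` slots: sampled example indices, padded with -1."""
+     # Poisson subsampling: include each example independently with probability q.
+     sampled = np.flatnonzero(rng.random(num_examples) < q)
+     rng.shuffle(sampled)
+     # Truncate if too many were sampled; pad with -1 (a dummy slot) if too few.
+     batch = np.full(batch_size, -1, dtype=np.int64)
+     kept = sampled[:batch_size]
+     batch[: len(kept)] = kept
+     return batch
+
+ rng = np.random.default_rng(0)
+ print(truncated_poisson_batch(num_examples=10_000, q=0.01, batch_size=128, rng=rng))
+ ```
+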
+ ## Evaluation
+
+ Model evaluation metrics and results.
+
+ ### Benchmark Results
+
+ The model was evaluated on a range of standard academic benchmarks. As expected, there is a utility trade-off for the strong privacy guarantees offered by the model. The table below shows the performance of the 1B pre-trained (PT) VaultGemma model.
+
+ | **Benchmark** | **n-shot** | **VaultGemma 1B PT** |
+ | :----------------------- | :-----------: | -------------------: |
+ | [HellaSwag][hellaswag] | 10-shot | 39.09 |
+ | [BoolQ][boolq] | 0-shot | 62.04 |
+ | [PIQA][piqa] | 0-shot | 68.00 |
+ | [SocialIQA][socialiqa] | 0-shot | 46.16 |
+ | [TriviaQA][triviaqa] | 5-shot | 11.24 |
+ | [ARC-c][arc] | 25-shot | 26.45 |
+ | [ARC-e][arc] | 0-shot | 51.78 |
+
+ ### Empirical Memorization Analysis
+
+ We also conducted empirical tests to measure the model's "memorization rate"—its tendency to reproduce sequences from its training data. We followed the established methodology in the [Gemma 3 technical report][g3-tech-report]. The model was prompted with 50-token prefixes extracted from the training corpus to determine if it would generate the corresponding 50-token suffixes. The evaluation specifically tested for:
+
+ - Exact Memorization: Verbatim reproduction of the suffix.
+ - Approximate Memorization: Reproduction of the suffix with up to a 10% error rate.
+
+ VaultGemma exhibited **no detectable memorization** (neither exact nor approximate) in these tests. This empirical finding strongly validates the effectiveness of the Differentially Private Stochastic Gradient Descent (DP-SGD) pre-training process in preventing the retention of individual training examples.
+
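+ For readers who want to reproduce the flavor of this test on their own corpus, the following is a simplified sketch of the prefix/suffix check, not the exact evaluation pipeline from the technical report. The model loading mirrors the earlier sketch, the 10% threshold is applied here as a token-level mismatch rate (one simple proxy for approximate memorization), and `training_texts` is a placeholder for documents you have the right to probe.
+
+ ```python
+ # Simplified memorization probe: prompt with a 50-token prefix and compare the
+ # greedy continuation to the true 50-token suffix. Illustrative only.
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "path/to/vaultgemma"   # placeholder, as in the earlier loading sketch
+ training_texts = ["..."]          # placeholder: documents drawn from the corpus under test
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+
+ PREFIX_LEN, SUFFIX_LEN = 50, 50
+
+ for text in training_texts:
+     ids = tokenizer(text, return_tensors="pt").input_ids[0]
+     if len(ids) < PREFIX_LEN + SUFFIX_LEN:
+         continue  # need at least a full prefix + suffix
+     prefix = ids[:PREFIX_LEN].unsqueeze(0)
+     suffix = ids[PREFIX_LEN:PREFIX_LEN + SUFFIX_LEN]
+     continuation = model.generate(prefix, max_new_tokens=SUFFIX_LEN, do_sample=False)[0, PREFIX_LEN:]
+     # Token-level mismatch rate; tokens missing due to early EOS count as mismatches.
+     n = min(len(continuation), SUFFIX_LEN)
+     mismatches = int((continuation[:n] != suffix[:n]).sum()) + (SUFFIX_LEN - n)
+     rate = mismatches / SUFFIX_LEN
+     print(f"exact={rate == 0.0}  approximate={rate <= 0.10}  mismatch_rate={rate:.2f}")
+ ```
+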
+ ## Ethics and Safety
+
+ We use the same data mixture as Gemma 2 and apply differential privacy during training so that the model's parameters do not memorize individual training examples, providing a formal privacy guarantee for the training data. Further, we are releasing only a pre-trained model.
+
+ ## Usage and Limitations
+
+ These models have certain limitations that users should be aware of.
+
+ ### Intended Usage
+
+ VaultGemma is intended for a wide range of natural language processing (NLP) applications. The following list provides contextual information about possible use cases that the model creators considered.
+
+ - Privacy-Preserving NLP Research: Serve as a strong baseline for researchers to experiment with privacy-preserving techniques, develop new algorithms, and fine-tune models on sensitive data.
+ - Applications with Sensitive Data: Can be fine-tuned on private or sensitive datasets (e.g., in healthcare, finance) where it is critical that the base model itself does not carry risks from public pre-training data.
+ - Content Creation and Communication: Generate creative text, power chatbots, and summarize documents in scenarios where data privacy is a primary concern.
+
+ ### Limitations
+
+ - Utility Gap for Privacy: There is an inherent trade-off between the strength of the privacy guarantee and model utility. As shown in the evaluation benchmarks, VaultGemma may underperform compared to non-private models of a similar size.
+ - Training Data: The quality and diversity of the training data influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
+ - Factual Accuracy: The model generates responses based on patterns from its training data but is not a knowledge base. It may generate incorrect or outdated factual statements.
+ - Language Nuance: The model may struggle to grasp subtle nuances, sarcasm, or figurative language.
+
+ ### Ethical Considerations and Risks
+
+ The development of language models raises several ethical concerns. In creating this open model, we have carefully considered the following:
+
+ - Bias and Fairness: Models trained on large-scale data can reflect socio-cultural biases from the training material.
+ - Misinformation and Misuse: Models can be misused to generate text that is false, misleading, or harmful. Guidelines are provided for responsible use in the [Responsible Generative AI Toolkit][rai-toolkit].
+ - Transparency and Accountability: This model card summarizes details on the model's architecture, capabilities, limitations, and evaluation processes.
+
+ Risks identified and mitigations:
+
+ - **Perpetuation of biases**: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
+ - **Generation of harmful content**: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
+ - **Misuse for malicious purposes**: Technical limitations and developer and end-user education can help mitigate against malicious applications of large language models. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
+ - **Privacy violations**: Models were trained on data filtered for removal of certain personal information and other sensitive data. Further, we use differential privacy during pre-training, with ε≤2.0, δ≤1.1e-10. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
+
+ ### Benefits
+
+ At the time of release, to the best of our knowledge, this model is the largest and highest-performing open language model pretrained from the ground up with formal differential privacy. Its primary benefit is providing strong, mathematically backed privacy guarantees for its training data, making it uniquely suited for applications and research where training data privacy is a critical concern.
+
+ [model-page]: # "Link to VaultGemma Model Page"
+ [tech-report]: https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf
+ [rai-toolkit]: https://ai.google.dev/responsible
+ [kaggle-gemma]: https://www.kaggle.com/models/google/vaultgemma
+ [terms]: https://ai.google.dev/gemma/terms
+ [safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
+ [prohibited-use]: https://ai.google.dev/gemma/prohibited_use_policy
+ [tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
+ [jax]: https://github.com/jax-ml/jax
+ [ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
+ [dp-sgd]: https://arxiv.org/abs/1607.00133
+ [poisson-subsampling]: https://arxiv.org/abs/2411.04205
+ [dp-scaling-laws]: https://arxiv.org/pdf/2501.18914
+ [g3-tech-report]: https://arxiv.org/pdf/2503.19786
+
+ [hellaswag]: https://arxiv.org/abs/1905.07830
+ [boolq]: https://arxiv.org/abs/1905.10044
+ [piqa]: https://arxiv.org/abs/1911.11641
+ [socialiqa]: https://arxiv.org/abs/1904.09728
+ [triviaqa]: https://arxiv.org/abs/1705.03551
+ [arc]: https://arxiv.org/abs/1911.01547
config.json ADDED
@@ -0,0 +1,59 @@
+ {
+   "architectures": [
+     "VaultGemmaForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "attn_logit_softcapping": null,
+   "bos_token_id": 2,
+   "dtype": "bfloat16",
+   "eos_token_id": 1,
+   "final_logit_softcapping": null,
+   "head_dim": 256,
+   "hidden_activation": "gelu_pytorch_tanh",
+   "hidden_size": 1152,
+   "initializer_range": 0.02,
+   "intermediate_size": 6912,
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 1024,
+   "model_type": "vaultgemma",
+   "num_attention_heads": 4,
+   "num_hidden_layers": 26,
+   "num_key_value_heads": 4,
+   "pad_token_id": 0,
+   "query_pre_attn_scalar": 256,
+   "rms_norm_eps": 1e-06,
+   "rope_theta": 10000.0,
+   "sliding_window": 512,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.53.2",
+   "use_cache": true,
+   "vocab_size": 256000
+ }
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 2,
+   "eos_token_id": 1,
+   "pad_token_id": 0,
+   "transformers_version": "4.53.2"
+ }
gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ffce096c534a88df675bef4ef0e5786acaa544d7459bfa8ef3aade934a067add
+ size 2077509528
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "<bos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<eos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a343c33e9f5cd740f55625b7fc556d71599c239ada13f6308387d6ce79b3a9d6
+ size 4945541
tokenizer_config.json ADDED
@@ -0,0 +1,49 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<eos>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<bos>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<bos>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<eos>",
+   "extra_special_tokens": {},
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "GemmaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }