Commit f260203 (verified) · Parent(s): 2477751
sebasgar committed: Upload folder using huggingface_hub

Files changed (3):
  1. README.md +310 -0
  2. config.json +26 -0
  3. pytorch_model.bin +3 -0
README.md ADDED
@@ -0,0 +1,310 @@
---
license: other
license_name: health-ai-developer-foundations
license_link: https://developers.google.com/health-ai-developer-foundations/terms
language:
- en
tags:
- medical
- medical-embeddings
- audio
- health-acoustic
extra_gated_heading: Access HeAR on Hugging Face
extra_gated_prompt: >-
  To access HeAR on Hugging Face, you're required to review and agree to [Health
  AI Developer Foundation's terms of
  use](https://developers.google.com/health-ai-developer-foundations/terms). To
  do this, please ensure you're logged in to Hugging Face and click below.
  Requests are processed immediately.
extra_gated_button_content: Acknowledge license
library_name: transformers
---
# HeAR model card

**Model documentation:** [HeAR](https://developers.google.com/health-ai-developer-foundations/hear)

**Resources**:

* Model on Google Cloud Model Garden: [HeAR](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/hear)

* Model on Hugging Face (PyTorch): [google/hear-pytorch](https://huggingface.co/google/hear-pytorch)

* Model on Hugging Face (TensorFlow): [google/hear](https://huggingface.co/google/hear)

* GitHub repository (supporting code, Colab notebooks, discussions, and
  issues): [HeAR](https://github.com/google-health/hear)

* Quick start notebook (PyTorch): [notebooks/quick\_start\_pytorch](https://github.com/google-health/hear/blob/master/notebooks/quick_start_with_hugging_face_pytorch.ipynb)

* Quick start notebook (TensorFlow): [notebooks/quick\_start](https://github.com/google-health/hear/blob/master/notebooks/quick_start_with_hugging_face.ipynb)

* Support: See
  [Contact](https://developers.google.com/health-ai-developer-foundations/hear/get-started.md#contact).

Terms of use: [Health AI Developer Foundations terms of
use](https://developers.google.com/health-ai-developer-foundations/terms)

**Author**: Google

## Model information

This section describes the HeAR model and how to use it. HeAR was originally
released as a TensorFlow SavedModel at https://huggingface.co/google/hear;
this repository provides an equivalent PyTorch implementation.

### Description

Health-related acoustic cues originating from the respiratory system's airflow,
including sounds like coughs and breathing patterns, can be harnessed for health
monitoring purposes. Such health sounds can also be collected via ambient
sensing technologies on ubiquitous devices such as mobile phones, which may
augment screening capabilities and inform clinical decision making. Health
acoustics, specifically non-semantic respiratory sounds, also have potential as
biomarkers to detect and monitor various health conditions, for example,
identifying disease status from cough sounds, or measuring lung function using
exhalation sounds made during spirometry.

Health Acoustic Representations, or HeAR, is a health acoustic foundation model
that is pre-trained to efficiently represent these non-semantic respiratory
sounds and accelerate the research and development of AI models that use them
to make predictions. HeAR is trained without supervision on a large and diverse
unlabelled corpus, which may help it generalize better than non-pretrained
models to unseen distributions and new tasks.

Key Features

* Generates health-optimized embeddings for biological sounds such as coughs
  and breaths.

* Versatility: Exhibits strong performance across diverse health acoustic
  tasks.

* Data Efficiency: Demonstrates high performance even with limited labeled
  training data for downstream tasks.

* Microphone robustness: Downstream models trained using HeAR generalize
  well to sounds recorded from unseen devices.

Potential Applications

HeAR can be a useful tool for AI research geared towards the discovery of novel
acoustic biomarkers in the following areas:

* Aid screening & monitoring for respiratory diseases like COVID-19,
  tuberculosis, and COPD from cough and breath sounds.

* Low-resource settings: Can potentially augment healthcare services in
  settings with limited resources by offering accessible screening and
  monitoring tools.

### How to use

Below are some example code snippets to help you quickly get started running the
model locally. If you want to use the model to run inference on a large amount
of audio, we recommend that you create a production version using [the Vertex
Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/hear).

```python
! git clone https://github.com/Google-Health/hear.git
! pip install --upgrade --quiet transformers==4.50.3

import importlib

import torch
from transformers import AutoModel
from huggingface_hub import notebook_login
from huggingface_hub.utils import HfFolder

# Log in to Hugging Face if no token is cached yet (the model is gated).
if HfFolder.get_token() is None:
    notebook_login()

# Audio preprocessing helpers from the cloned GitHub repository.
audio_utils = importlib.import_module(
    "hear.python.data_processing.audio_utils"
)
preprocess_audio = audio_utils.preprocess_audio

model = AutoModel.from_pretrained("google/hear-pytorch")

# Generate 4 examples of two-second random audio clips (16 kHz x 2 s = 32,000 samples).
raw_audio_batch = torch.rand((4, 32000), dtype=torch.float32)
spectrogram_batch = preprocess_audio(raw_audio_batch)

# Perform inference to obtain HeAR embeddings.
# There are 4 embeddings, each of length 512, corresponding to the 4 inputs.
embedding_batch = model.forward(
    spectrogram_batch, return_dict=True, output_hidden_states=True)
```
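
The quick-start snippet stops at the raw `forward()` call. As a minimal follow-on sketch, and assuming the PyTorch port exposes the pooled 512-dimensional embedding through the standard `pooler_output` field of the returned output object (consistent with `pooler_output_size: 512` in the accompanying config.json), the embeddings can be read out as follows:

```python
# Minimal sketch (assumption: the pooled 512-d embedding is exposed via the
# standard `pooler_output` field; per-patch tokens live in `last_hidden_state`).
embeddings = embedding_batch.pooler_output  # expected shape: (4, 512)
print(embeddings.shape)

# Example downstream use: cosine similarity between the first two clips.
similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0
)
print(float(similarity))
```
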

### Examples

See the following Colab notebooks for examples of how to use HeAR:

* To give the model a quick try, running it locally with weights from Hugging
  Face, see the [quick start notebook in
  Colab](https://colab.research.google.com/github/google-health/hear/blob/master/notebooks/quick_start_with_hugging_face_pytorch.ipynb).

### Model architecture overview

HeAR is a [masked autoencoder](https://arxiv.org/abs/2111.06377), a
[transformer-based](https://arxiv.org/abs/1706.03762) neural network.

* It was trained using masked autoencoding on a large corpus of
  health-related sounds, with a self-supervised learning objective on a
  massive dataset (\~174k hours) of two-second audio clips. At training time,
  it tries to reconstruct masked spectrogram patches from the visible patches.

* After it is trained, its encoder can generate low-dimensional
  representations of two-second audio clips, optimized to capture the most
  salient health-related information in sounds like coughs and breaths.

* These representations, or embeddings, can be used as inputs to other
  models trained for a variety of supervised tasks related to health.

* The HeAR model is based on a [ViT-L architecture](https://arxiv.org/abs/2010.11929);
  see the sketch after this list.

  * Instead of relying on CNNs, the architecture applies a pure transformer
    directly to sequences of image patches, which has yielded strong
    performance in image classification. This Vision Transformer (ViT)
    approach attains excellent results compared to state-of-the-art
    convolutional networks while requiring substantially fewer computational
    resources to train.

* The training process for HeAR comprised three main components:

  * A data curation step (including a health acoustic event detector);
  * A general-purpose training step to develop an audio encoder (embedding
    model); and
  * A task-specific evaluation step that adapts the trained embedding model
    for various downstream tasks.

* The system is designed to encode two-second audio clips and generate audio
  embeddings for use in downstream tasks.

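To make the encoder's view of the input concrete, the short sketch below works out the patch grid that a single two-second clip is split into, using the dimensions published in the accompanying config.json (a 192×128 single-channel spectrogram, 16×16 patches, hidden size 1024, pooled output of length 512). It is illustrative only and not part of the official repository.

```python
# Illustrative only: how the ViT-L encoder "sees" one two-second clip,
# using the dimensions from the accompanying config.json.
spectrogram_shape = (192, 128)  # "image_size": single-channel spectrogram
patch_size = 16                 # "patch_size"
hidden_size = 1024              # "hidden_size" (ViT-L width)
pooled_dim = 512                # "pooler_output_size": embedding length

patches_per_clip = (spectrogram_shape[0] // patch_size) * (
    spectrogram_shape[1] // patch_size
)
print(patches_per_clip)  # 12 * 8 = 96 spectrogram patches per clip

# During masked-autoencoder pre-training, a large fraction of these patches is
# hidden and reconstructed from the visible ones; at inference time the encoder
# turns the patch sequence into a single pooled embedding of length 512.
print(f"tokens: {patches_per_clip}, width: {hidden_size}, embedding: {pooled_dim}")
```
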
### Technical Specifications

* Model type: [ViT (vision transformer)](https://arxiv.org/abs/2010.11929)

* Key publication: [https://arxiv.org/abs/2403.02522](https://arxiv.org/abs/2403.02522)

* Model created: 2023-12-04

* Model version: 1.0.0

### Performance & Validation

HeAR's performance has been validated via linear probing of the frozen
embeddings on a benchmark of 33 health acoustic tasks across 6 datasets.

HeAR was benchmarked on a diverse set of health acoustic tasks spanning 13
health acoustic event detection tasks, 14 cough inference tasks, and 6
spirometry inference tasks, across 6 datasets, and it demonstrated that simple
linear classifiers trained on top of its representations can perform as well as
or better than many similar leading models.

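As an illustration of this linear-probing setup, the sketch below trains a logistic-regression probe on frozen embeddings with scikit-learn. It is not the evaluation code used for the benchmark; the embeddings and labels are random placeholders standing in for a labelled dataset of two-second clips.

```python
# Linear-probe sketch: logistic regression on frozen HeAR embeddings.
# The embeddings and labels below are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512))  # one 512-d HeAR embedding per clip
labels = rng.integers(0, 2, size=200)     # placeholder binary task labels

x_train, x_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.25, random_state=0
)

probe = LogisticRegression(max_iter=1000)  # the HeAR encoder stays frozen
probe.fit(x_train, y_train)
print("AUC:", roc_auc_score(y_test, probe.predict_proba(x_test)[:, 1]))
```
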
### Key performance metrics

* HeAR achieved high performance on **diverse health-relevant tasks**:
  inference of medical conditions (TB, COVID) and medically relevant
  quantities (lung function, smoking status) from recordings of coughs or
  exhalations, including a task on predicting chest X-ray findings (pleural
  effusion, opacities, etc.).

* HeAR had **superior device generalizability** compared to other models
  (MRR=0.745 versus the second-best model, CLAP, at MRR=0.497), which is
  crucially important for real-world applications.

* HeAR is more **data efficient** than baseline models, sometimes reaching
  the same level of performance when trained on as little as 6.25% of the
  amount of training data.

### Inputs and outputs

**Input:** Two-second, 16 kHz mono audio clips. Inputs can be batched, so you
can pass n=10 clips as a (10, 32000) array or a single clip as (1, 32000).

**Output:** An (n, 512) array of floating-point embeddings for n two-second
clips, i.e. one embedding of length 512 for each two-second input clip.

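Because the encoder expects fixed two-second windows, longer recordings need to be split (and the tail padded) into 32,000-sample clips before embedding. Below is a minimal, illustrative framing helper; resampling to 16 kHz and any overlap strategy are left to the caller, and `preprocess_audio` and `model` refer to the objects created in the quick-start snippet above.

```python
# Sketch: frame a longer 16 kHz mono waveform into two-second clips for HeAR.
# Assumes `preprocess_audio` and `model` from the "How to use" snippet above.
import torch

CLIP_SAMPLES = 32000  # 2 seconds at 16 kHz

def frame_waveform(waveform: torch.Tensor) -> torch.Tensor:
    """Split a 1-D waveform into (n, 32000) clips, zero-padding the tail."""
    n_clips = -(-waveform.numel() // CLIP_SAMPLES)  # ceiling division
    padded = torch.zeros(n_clips * CLIP_SAMPLES, dtype=waveform.dtype)
    padded[: waveform.numel()] = waveform
    return padded.reshape(n_clips, CLIP_SAMPLES)

waveform = torch.rand(5 * 16000)   # e.g. a 5-second recording
clips = frame_waveform(waveform)   # shape: (3, 32000)
outputs = model.forward(preprocess_audio(clips), return_dict=True)
```
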
## Dataset details

### Training dataset

For training, a dataset called YT-NS (YouTube Non-Semantic) was curated,
consisting of two-second audio clips extracted from three billion public,
non-copyrighted YouTube videos using a health acoustic event detector,
totalling 313.3 million two-second clips, or roughly 174k hours of audio. We
chose a two-second window since most events we cared about were shorter than
that. The HeAR audio encoder is trained solely on this dataset.

### Evaluation dataset

Six datasets were used for evaluation:

* [FSD50K](https://zenodo.org/records/4060432)
* [Flusense](https://github.com/Forsad/FluSense-data)
* [CoughVID](https://zenodo.org/records/4048312)
* [Coswara](https://zenodo.org/records/7188627)
* [CIDRZ](https://www.kaggle.com/datasets/googlehealthai/google-health-ai)
* [SpiroSmart](https://dl.acm.org/doi/10.1145/2370216.2370261)

## License

The use of HeAR is governed by the [Health AI Developer Foundations terms of
use](https://developers.google.com/health-ai-developer-foundations/terms).

## Implementation information

Details about the model internals.

### Software

Training was done using [JAX](https://github.com/jax-ml/jax).

JAX allows researchers to take advantage of the latest generation of hardware,
including TPUs, for faster and more efficient training of large models.

## Use and limitations

### Intended use

* Research and development of health-related acoustic biomarkers.

* Exploration of novel applications in disease detection and health
  monitoring.

### Benefits

HeAR embeddings can be used for efficient training of AI models for
health acoustics tasks with significantly less data and compute than training
neural networks initialised randomly or from checkpoints trained on generic
datasets. This allows quick prototyping to see if health acoustic signals can
be used by themselves or combined with other signals to make predictions of
interest.

### Limitations

* Limited Sequence Length: Primarily trained on two-second audio clips.

* Model Size: The current model size is too large for on-device deployment.

* Bias Considerations: Potential for biases based on demographics and
  recording device quality, necessitating further investigation and
  mitigation strategies.

* HeAR was trained using two-second audio clips of health-related sounds from
  a public, non-copyrighted subset of YouTube. These clips come from a
  variety of sources but may be noisy or low-quality.

* The model is only used to generate embeddings of the user-owned dataset.
  It does not generate any predictions or diagnoses on its own.

* As with any research, developers should ensure that any downstream
  application is validated to understand performance using data that is
  appropriately representative of the intended use setting for the
  specific application (e.g., age, sex, gender, recording device,
  background noise, etc.).
config.json ADDED
@@ -0,0 +1,26 @@
{
  "architectures": [
    "ViTModel"
  ],
  "image_size": [
    192,
    128
  ],
  "hidden_size": 1024,
  "num_hidden_layers": 24,
  "num_attention_heads": 16,
  "intermediate_size": 4096,
  "hidden_act": "gelu_fast",
  "hidden_dropout_prob": 0.0,
  "attention_probs_dropout_prob": 0.0,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-06,
  "pooled_dim": 512,
  "patch_size": 16,
  "num_channels": 1,
  "qkv_bias": true,
  "encoder_stride": 16,
  "pooler_act": "linear",
  "model_type": "vit",
  "pooler_output_size": 512
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d44d355816ee4315f67d7810da274409e9b1a6570325fc5ba9ae27555fd81723
size 1212947234