vidore/colqwen-omni-v0.1

Visual Document Retrieval

vidore-experimental


manu committed on Jul 15

Commit 1c44a5e (verified) · 1 Parent(s): 3f99f65

Update README.md

Files changed (1)

README.md +4 -1


@@ -32,7 +32,10 @@ Data is the same as the ColPali data described in the paper.
 ## Model Training
-### Dataset (Fully Image)
 Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
 Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in [*ViDoRe*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) and in the train set to prevent evaluation contamination.
 A validation set is created with 2% of the samples to tune hyperparameters.

 ## Model Training
+### Dataset
+The audio retrieval capabilities are acquired zero-shot, as the training data consists purely of image-text matching pairs. The audio and vision towers are frozen during training.
 Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
 Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in [*ViDoRe*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) and in the train set to prevent evaluation contamination.
 A validation set is created with 2% of the samples to tune hyperparameters.
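
For a rough sense of the dataset sizes quoted in the README, the sketch below turns the published percentages into approximate pair counts and holds out 2% of the samples for validation. It is only an illustration: the README gives rounded percentages, so the counts are estimates, and the shuffled hold-out with a fixed seed, the assumption that the validation set is carved out of the 127,460 pairs, and the function names `approximate_counts` and `split_indices` are all hypothetical choices, not the authors' procedure.

```python
import random

# Figures taken from the README text.
TOTAL_PAIRS = 127_460
ACADEMIC_SHARE = 0.63    # openly available academic train sets
VALIDATION_SHARE = 0.02  # held out to tune hyperparameters


def approximate_counts(total=TOTAL_PAIRS, academic_share=ACADEMIC_SHARE):
    """Estimate pair counts per source from the rounded percentages."""
    academic = round(total * academic_share)
    synthetic = total - academic  # the remaining ~37%
    return academic, synthetic


def split_indices(total=TOTAL_PAIRS, val_share=VALIDATION_SHARE, seed=0):
    """Assumed procedure: shuffle indices, hold out val_share for validation."""
    indices = list(range(total))
    random.Random(seed).shuffle(indices)
    n_val = round(total * val_share)
    return indices[n_val:], indices[:n_val]


academic, synthetic = approximate_counts()
train_idx, val_idx = split_indices()
print(f"academic ~{academic}, synthetic ~{synthetic}")
print(f"train {len(train_idx)}, validation {len(val_idx)}")
```

Under these assumptions the validation set holds roughly 2,500 pairs, which leaves about 125,000 for training.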