juletxara/vilt-vsr-zeroshot

juletxara committed on Aug 4, 2022

Commit a8a05cd · 1 Parent(s): 840133d

update readme

Files changed (1)

README.md +72 -0

@@ -1,3 +1,75 @@
 ---
 license: apache-2.0
 ---
+# Vision-and-Language Transformer (ViLT), fine-tuned on the VSR zeroshot split
+Vision-and-Language Transformer (ViLT) model fine-tuned on the zeroshot split of [Visual Spatial Reasoning (VSR)](https://arxiv.org/abs/2205.00363). ViLT was introduced in the paper [ViLT: Vision-and-Language Transformer
+Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).
+## Intended uses & limitations
+You can use the model to determine whether a sentence describing a spatial relation is true or false for a given image.
+### How to use
+Here is how to use the model in PyTorch:
+```python
+from transformers import ViltProcessor, ViltForImagesAndTextClassification
+import requests
+from PIL import Image
+image = Image.open(requests.get("https://camo.githubusercontent.com/ffcbeada14077b8e6d4b16817c91f78ba50aace210a1e4754418f1413d99797f/687474703a2f2f696d616765732e636f636f646174617365742e6f72672f747261696e323031372f3030303030303038303333362e6a7067", stream=True).raw)
+text = "The person is ahead of the cow."
+processor = ViltProcessor.from_pretrained("juletxara/vilt-vsr-zeroshot")
+model = ViltForImagesAndTextClassification.from_pretrained("juletxara/vilt-vsr-zeroshot")
+# prepare inputs
+encoding = processor(image, text, return_tensors="pt")
+# forward pass
+outputs = model(input_ids=encoding.input_ids, pixel_values=encoding.pixel_values.unsqueeze(0))
+logits = outputs.logits
+idx = logits.argmax(-1).item()
+print("Predicted answer:", model.config.id2label[idx])
+```
+## Training data
+(to do)
+## Training procedure
+### Preprocessing
+(to do)
+### Pretraining
+(to do)
+## Evaluation results
+(to do)
+### BibTeX entry and citation info
+```bibtex
+@misc{kim2021vilt,
+      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
+      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
+      year={2021},
+      eprint={2102.03334},
+      archivePrefix={arXiv},
+      primaryClass={stat.ML}
+}
+@article{liu2022visual,
+  title={Visual Spatial Reasoning},
+  author={Liu, Fangyu and Emerson, Guy and Collier, Nigel},
+  journal={arXiv preprint arXiv:2205.00363},
+  year={2022}
+}
+```