A Touch, Vision, and Language Dataset for Multimodal Alignment
by Max (Letian) Fu, Gaurav Datta*, Huang Huang*, William Chung-Ho Panitch*, Jaimyn Drake*, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg at UC Berkeley, Meta AI, TU Dresden, and CeTI (*equal contribution).
[Paper] | [Project Page] | [Checkpoints] | [Dataset] | [Citation]
   
This repo contains the official checkpoints for A Touch, Vision, and Language Dataset for Multimodal Alignment.
The tactile encoders come in three sizes: ViT-Tiny, ViT-Small, and ViT-Base, all of which are stored in
ckpt/tvl_enc
The TVL-LLaMA checkpoints, the generative counterparts of the encoders, are stored in
ckpt/tvl_llama
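As a quick check after downloading, the two directories above can be listed to see which checkpoint files are present. This is a minimal sketch: the helper name `list_checkpoints` is hypothetical, and it makes no assumption about the checkpoint file names or extensions, which are not stated here.

```python
from pathlib import Path

# Directory layout taken from this README; adjust CKPT_ROOT if the
# repo was downloaded somewhere other than the current directory.
CKPT_ROOT = Path("ckpt")
ENCODER_DIR = CKPT_ROOT / "tvl_enc"      # tactile encoders (ViT-Tiny/Small/Base)
LLAMA_DIR = CKPT_ROOT / "tvl_llama"      # TVL-LLaMA checkpoints


def list_checkpoints(directory):
    """Return the files directly inside `directory`, sorted by path.

    Hypothetical helper: returns an empty list if the directory is missing,
    so it can be run before the download has finished.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())


if __name__ == "__main__":
    for d in (ENCODER_DIR, LLAMA_DIR):
        files = list_checkpoints(d)
        print(f"{d}: {len(files)} file(s)")
        for f in files:
            print(f"  {f.name}")
```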
Inference
Zero-shot classification requires OpenCLIP with the following configuration:
CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"
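The configuration above could be passed to OpenCLIP roughly as follows. This is a sketch that assumes the `open_clip` package (installed as `open_clip_torch`) is available; the wrapper name `load_openclip` and its `device` argument are illustrative and not part of this repo. How the tactile encoder is combined with these models is covered in the GitHub repo and is not shown here.

```python
CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"


def load_openclip(device="cpu"):
    """Load the OpenCLIP model, image preprocessing, and tokenizer.

    Hypothetical wrapper. The import is done lazily so that this module
    can be imported even when open_clip is not installed. Calling this
    function downloads the pretrained weights on first use.
    """
    import open_clip

    # create_model_and_transforms returns (model, train_transform, val_transform);
    # the validation transform is the one to use at inference time.
    model, _, preprocess = open_clip.create_model_and_transforms(
        CLIP_VISION_MODEL, pretrained=CLIP_PRETRAIN_DATA
    )
    tokenizer = open_clip.get_tokenizer(CLIP_VISION_MODEL)
    model = model.to(device).eval()
    return model, preprocess, tokenizer
```

Calling `load_openclip()` once and reusing the returned objects avoids reloading the weights for every query.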
For TVL-LLaMA, please request access to the pre-trained LLaMA-2 from this form. In particular, we use llama-2-7b as the base model. For ease of loading, the weights here contain the trained adapter, the tactile encoder, and the vision encoder.
For complete instructions on pretraining, fine-tuning, and evaluation with these models, please see the GitHub repo.
Citation
Please give us a star on GitHub to support us!
Please cite our work if you find it inspiring or use our code in your own work:
@article{fu2024tvl,
    title={A Touch, Vision, and Language Dataset for Multimodal Alignment}, 
    author={Letian Fu and Gaurav Datta and Huang Huang and William Chung-Ho Panitch and Jaimyn Drake and Joseph Ortiz and Mustafa Mukadam and Mike Lambeta and Roberto Calandra and Ken Goldberg},
    journal={arXiv preprint arXiv:2402.13232},
    year={2024}
}