arXiv: A Touch, Vision, and Language Dataset for Multimodal Alignment