arXiv: What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?