arXiv InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks