arXiv VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?