This article details LLaVA-UHD v3, developed by teams from Tsinghua University and the Chinese Academy of Sciences. Through its Progressive Visual Compression (PVC) framework, the model tackles two key challenges that multimodal large language models (MLLMs) face when processing high-resolution images: the heavy computational burden of encoding the full image at native resolution, and the loss of global context in patch-based (sliced) encoding. The PVC framework comprises Refined Patch Embedding (RPE) for fine-grained visual modeling and Windowed Token Compression (WTC) for efficient token reduction, significantly cutting the number of visual tokens while preserving global semantic consistency. Experiments show that LLaVA-UHD v3 achieves a 1.9× speedup over mainstream models such as Qwen2-VL while remaining highly competitive across multiple vision-language benchmarks, demonstrating "efficiency without degradation."
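
To make the idea of windowed token compression concrete, here is a minimal sketch of how merging visual tokens within non-overlapping spatial windows can shrink the token count while keeping a coarse global layout. This is an illustrative assumption of the general technique, not the paper's actual WTC implementation: the class name `WindowedTokenCompression`, the `window` parameter, the choice of mean pooling, and the linear projection are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedTokenCompression(nn.Module):
    """Illustrative sketch: merge each non-overlapping window of visual
    tokens into a single token via mean pooling plus a linear projection.
    Hypothetical names and design; not the paper's actual WTC module."""

    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(dim, dim)  # mix channels after pooling

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim) flattened from an h x w token grid
        b, n, d = tokens.shape
        assert n == h * w and h % self.window == 0 and w % self.window == 0
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        # Average-pool each window x window block into one token,
        # reducing the token count by a factor of window**2.
        pooled = F.avg_pool2d(grid, self.window)
        out = pooled.flatten(2).transpose(1, 2)  # (batch, n / window**2, dim)
        return self.proj(out)

# Example: a 32x32 token grid (1024 tokens) compressed 4x to 256 tokens.
wtc = WindowedTokenCompression(dim=1024, window=2)
x = torch.randn(1, 32 * 32, 1024)
print(wtc(x, 32, 32).shape)  # torch.Size([1, 256, 1024])
```

Pooling within local windows (rather than dropping tokens globally) is what lets this style of compression reduce the sequence length fed to the language model while each surviving token still summarizes a contiguous image region.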



