SOTA in 41 Leaderboards! Zhipu's Latest Open-Source GLM-4.5V Tested: Guessing Addresses from Images, Converting Videos to Code in Seconds
The article details GLM-4.5V, Zhipu's latest open-source multimodal visual reasoning model. Built on the GLM-4.5 base model, it achieves SOTA results on 41 of 42 public leaderboards, making it the strongest open-source multimodal model at the 100B scale. Through a series of real-world case studies, including GeoGuessr-style address guessing from photos, visual grounding on the classical painting Along the River During the Qingming Festival, video-to-frontend-code conversion, spatial-relationship understanding, UI-to-Code, image recognition, and object counting, the article demonstrates GLM-4.5V's strengths in image, video, and document understanding; its "video-to-code" ability, which the model was never explicitly trained for, is highlighted as evidence of strong generalization. On the technical side, the article covers the AIMv2-Huge visual encoder, the MLP adapter, 3D-RoPE positional encoding, and the three-stage training strategy (pre-training, SFT, RL) behind GLM-4.5V. It also notes the model's low-cost API pricing and free resource packs, which aim to lower the barrier for developers and move multimodal AI from "proof of concept" to "large-scale deployment."
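To make the 3D-RoPE idea concrete, here is a minimal NumPy sketch of how rotary position embeddings can be extended to visual tokens by splitting the head dimension across temporal, height, and width coordinates. The even three-way split and the frequency schedule are illustrative assumptions, not GLM-4.5V's actual implementation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply standard 1D rotary embedding to x (..., d) at integer position pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # rotate each (x1, x2) pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w):
    """Sketch of 3D-RoPE: split the head dim into three chunks and rotate each
    chunk by the token's temporal (t), height (h), and width (w) coordinate."""
    d = x.shape[-1]
    assert d % 3 == 0, "head dim assumed divisible by 3 for this sketch"
    c = d // 3
    return np.concatenate([
        rope_rotate(x[..., :c], t),      # temporal axis (video frame index)
        rope_rotate(x[..., c:2 * c], h), # vertical patch position
        rope_rotate(x[..., 2 * c:], w),  # horizontal patch position
    ], axis=-1)

# Example: a query vector for the patch at frame 3, row 5, column 7.
q = np.random.randn(48)
q_rot = rope_3d(q, t=3, h=5, w=7)
print(q_rot.shape)  # (48,)
```

The point of encoding (time, height, width) separately is that attention can then distinguish "same spatial patch in a later frame" from "neighboring patch in the same frame," which plain 1D position indices over a flattened token sequence cannot express.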