
Kuaishou's Klear-Reasoner Achieves Top Performance on the 8B Model Leaderboard; GPPO Algorithm Enhances Stability and Exploration Capabilities
Kuaishou Technology (快手技术)
Today
AI Score: 87
⭐⭐⭐⭐

The article details the Klear-Reasoner model released by Kuaishou's Klear large language model team. Built on Qwen3-8B-Base, it reaches SOTA level among models of the same scale on multiple authoritative math and code benchmarks. Its core innovation is the proposed GPPO (Gradient-Preserving Clipping Policy Optimization) algorithm, which addresses how the clip mechanism in traditional PPO limits the model's exploration ability and delays convergence on negative samples. By decoupling gradient backpropagation from the clip operation, GPPO preserves the exploration signals of high-entropy tokens and the correction signals of negative samples while maintaining training stability. The article also shares several experimental insights from the Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages: prioritizing data quality over quantity in SFT, tolerating errors in high-difficulty samples to promote learning, preferring soft rewards over hard rewards in RL, and filtering test cases for code data. Links to the paper, Hugging Face, and GitHub are provided for community reproduction and application.
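
A minimal sketch of how such a gradient-preserving clip could look in PyTorch. This is an illustrative assumption based on the summary's description (same forward value as PPO's clipped ratio, but gradients still flow through the unclipped ratio), not the team's released implementation; the function name and hyperparameters are hypothetical.

```python
import torch

def gppo_style_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Sketch of a gradient-preserving clipped policy-gradient loss (assumed form)."""
    ratio = torch.exp(logp_new - logp_old)                      # pi_new / pi_old per token
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)

    # Standard PPO: wherever the clamp is active, the gradient through `clipped`
    # is zero, silencing high-entropy exploratory tokens and negative samples.
    # Gradient-preserving variant (assumed): keep the clipped value in the forward
    # pass, but let the backward pass flow through the unclipped `ratio`, scaled
    # to the clip boundary via a detach.
    gp_clipped = (clipped / ratio).detach() * ratio

    # Pessimistic (min) objective as in PPO, averaged over tokens.
    unclipped_obj = ratio * advantages
    clipped_obj = gp_clipped * advantages
    return -torch.min(unclipped_obj, clipped_obj).mean()
```

In this formulation, tokens inside the clip range behave exactly as in PPO, while tokens outside it still contribute a scaled gradient instead of being dropped, which is the stated mechanism for retaining exploration and negative-sample correction signals.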

Programming · Chinese · LLM · Reinforcement Learning · GPPO (Gradient-Preserving Clipping Policy Optimization) Algorithm · Model Training Optimization · Reasoning Ability