
SFT Followed by RL Still Underperforms? Rethinking the Use of Offline Expert Data
通义大模型
08-20

The article examines common failure modes, such as performance degradation and stagnation, that arise when supervised fine-tuning (SFT) and reinforcement learning (RL) are combined to train large language models, arguing that the root cause lies in the improper use of offline expert data and the paradigm flaw of "hard switching" from SFT to RL. To address this, the Trinity-RFT team at Tongyi Lab proposes the CHORD framework, which turns SFT from a standalone stage into a dynamically weighted auxiliary objective inside RL training. CHORD introduces a global balance coefficient µ that enables a soft transition from imitation to exploration, and a token-level weighting function ϕ that learns selectively from expert data, emphasizing informative tokens while down-weighting irrelevant ones. Experiments show that CHORD delivers significant gains across multiple benchmarks while preserving general capabilities. The article links to the open-source code and the paper, offering a new perspective on optimizing large language model training.
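To make the described objective concrete, below is a minimal PyTorch-style sketch of how SFT can be folded into RL training as a dynamically weighted auxiliary term, combining a global coefficient µ with per-token weights ϕ. The function name `chord_style_loss`, the argument names, and the averaging scheme are illustrative assumptions rather than the authors' implementation; refer to the open-source Trinity-RFT code for the actual details.

```python
import torch
import torch.nn.functional as F

def chord_style_loss(policy_logits, expert_tokens, rl_loss, mu, phi_weights=None):
    """
    Hypothetical sketch of a CHORD-style combined objective.

    policy_logits : (batch, seq_len, vocab) logits of the current policy on expert data
    expert_tokens : (batch, seq_len) token ids from the offline expert data
    rl_loss       : scalar RL objective (e.g. a policy-gradient loss) computed elsewhere
    mu            : global balance coefficient in [0, 1]; moved from ~1 (imitation)
                    toward 0 (exploration) over the course of training
    phi_weights   : optional (batch, seq_len) per-token weights ϕ, emphasizing
                    informative expert tokens and down-weighting irrelevant ones
    """
    # Token-level negative log-likelihood on the expert trajectory (the SFT term)
    log_probs = F.log_softmax(policy_logits, dim=-1)
    nll = -log_probs.gather(-1, expert_tokens.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)

    if phi_weights is not None:
        # Selective imitation: weight each expert token by ϕ before averaging
        sft_loss = (phi_weights * nll).sum() / phi_weights.sum().clamp(min=1e-8)
    else:
        sft_loss = nll.mean()

    # SFT becomes a dynamically weighted auxiliary objective inside RL training
    return mu * sft_loss + (1.0 - mu) * rl_loss
```

In practice µ would follow a decay schedule (for example, linear or cosine) so that training shifts smoothly from imitating the offline expert data to optimizing the RL reward, which is the "soft transition" the article describes.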

Artificial Intelligence, Chinese, Large Language Model, LLM, Model Training, SFT, Reinforcement Learning