DeepSeek
- DeepSeek-V3 is a Mixture-of-Experts (MoE) model that combines several innovations to reduce training cost:
- multi-head latent attention
- auxiliary-loss-free load balancing (see the sketch after this list)
- multi-token prediction
- FP8 training
- large memory savings and low-level code optimizations
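The auxiliary-loss-free load balancing idea can be sketched as follows: each expert gets a bias term that is added to its routing score only when selecting the top-k experts (not when computing the gating weights), and that bias is nudged up or down depending on whether the expert is under- or over-loaded, so no auxiliary balancing loss enters the gradient. This is a minimal sketch with made-up dimensions and an assumed update rate `gamma`, not DeepSeek's actual implementation:

```python
import torch

def route_tokens(scores, bias, k, gamma=0.001):
    """Auxiliary-loss-free load balancing (sketch).

    scores: (num_tokens, num_experts) router affinities (e.g. sigmoid outputs).
    bias:   (num_experts,) per-expert bias, updated outside the gradient.
    """
    # Top-k expert selection uses the biased scores ...
    _, topk_idx = (scores + bias).topk(k, dim=-1)          # (num_tokens, k)
    # ... but the gating weights that mix expert outputs use the raw scores.
    gate = torch.gather(scores, -1, topk_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)

    # Count how many tokens each expert received in this batch.
    load = torch.zeros(scores.size(-1))
    load.scatter_add_(0, topk_idx.reshape(-1), torch.ones(topk_idx.numel()))

    # Nudge the bias: overloaded experts become less attractive, underloaded more.
    bias = bias - gamma * torch.sign(load - load.mean())
    return topk_idx, gate, bias
```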
- DeepSeek-R1 further post-trains DeepSeek-V3 with reinforcement learning (GRPO) for inference-time scaling, i.e., learning to produce longer chains of thought (CoT) at inference time.
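GRPO (Group Relative Policy Optimization) drops the learned value function of PPO: for each prompt it samples a group of completions, scores them with a reward, and uses the within-group standardized reward as the advantage. A minimal sketch of that core computation (per-token bookkeeping and the KL penalty are omitted, and the reward/sampling machinery is assumed to exist elsewhere):

```python
import torch

def grpo_advantages(rewards):
    """rewards: (batch, group_size) scalar rewards for G sampled completions per prompt."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Each completion's advantage is its reward standardized within its own group.
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate on top of group-relative advantages."""
    ratio = torch.exp(logprobs - old_logprobs)          # (batch, group_size)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```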
The team first trains DeepSeek-R1-Zero by applying pure RL to the DeepSeek-V3 base model, with no supervised finetuning.
DeepSeek-R1-Zero develops strong reasoning, but its outputs have poor readability and language quality, which is expected given that the RL stage never exposes it to curated human-written responses, only to its own generations.
DeepSeek-R1 thus uses a multi-stage training process (sketched in code after the list):
- DS-V3 is first finetuned on thousands of manually curated long-CoT reasoning examples (cold-start data), giving X
- X is trained with RL (as in DS-R1-Zero), giving Y
- New silver reasoning data is generated from Y (by rejection sampling) and combined with non-reasoning data from the DS-V3 corpus (writing, QA, self-cognition)
- DS-V3 is retrained on this combined data, giving Z
- Z is further trained with RL, giving DeepSeek-R1
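The whole recipe can be written down schematically. All functions below (`sft`, `rl_grpo`, `rejection_sample`) are hypothetical placeholders standing in for the corresponding training stages, not real APIs:

```python
# Hypothetical placeholders for the real training stages.
def sft(model, data): return model            # supervised finetuning
def rl_grpo(model): return model              # GRPO-based reinforcement learning
def rejection_sample(model): return []        # keep only correct / readable generations

def train_deepseek_r1(ds_v3_base, cold_start_data, ds_v3_corpus):
    x = sft(ds_v3_base, cold_start_data)      # step 1: cold-start SFT on curated long CoT
    y = rl_grpo(x)                            # step 2: reasoning RL, as in R1-Zero
    silver = rejection_sample(y)              # step 3a: new silver reasoning data from Y
    sft_data = silver + ds_v3_corpus          # step 3b: mix with writing / QA / self-cognition
    z = sft(ds_v3_base, sft_data)             # step 4: retrain from the base model
    return rl_grpo(z)                         # step 5: final RL stage -> DeepSeek-R1
```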
Finally, DeepSeek-R1 is used as a teacher to distill its reasoning capabilities into smaller Qwen2.5 and Llama 3.1 models.
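Distillation here is plain supervised finetuning of the smaller model on reasoning traces generated by the teacher, not RL. A rough sketch of building such a distillation pair with Hugging Face transformers (the checkpoint names, generation settings, and data format are assumptions, and DeepSeek's actual recipe may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "deepseek-ai/DeepSeek-R1"      # assumption: a loadable/served teacher checkpoint
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

def make_distillation_pair(prompt, max_new_tokens=4096):
    """Generate a long-CoT trace from the teacher; the (prompt, trace) pair becomes SFT data."""
    inputs = tok(prompt, return_tensors="pt")
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens,
                           do_sample=True, temperature=0.6)
    trace = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"prompt": prompt, "completion": trace}

# The student (e.g. a Qwen2.5 or Llama 3.1 model) is then finetuned on these pairs
# with standard cross-entropy SFT; no RL is applied to the distilled models.
```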
- A reproduction of DeepSeek-R1-Zero is available in a Berkeley researcher's GitHub repo
- A simpler approach, called s1, obtained by carefully curating only 1,000 training samples and finetuning Qwen2.5-32B on them (no RL!), is described in https://arxiv.org/pdf/2501.19393. They obtain convincing test-time scaling curves (see the budget-forcing sketch after this list).
- Other proprietary models in this vein: OpenAI o1, Gemini 2.0 Flash Thinking Experimental
- Other open models: QwQ-32B, Sky-T1, Bespoke-32B, Kimi k1.5 (multimodal)
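s1 controls test-time compute with a simple "budget forcing" trick: when the model tries to end its reasoning too early, the end-of-thinking delimiter is suppressed and a token like "Wait" is appended so that it keeps reasoning. A rough sketch of that loop (the delimiter string, model name, and generation settings are assumptions, not s1's exact code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-32B-Instruct"             # assumption: s1's base model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def think_with_budget(prompt, budget=4096, end_think="</think>"):
    """Budget forcing (sketch): if the model stops thinking before the budget is used,
    strip the end-of-thinking marker, append 'Wait' and let it keep reasoning."""
    text, used = prompt, 0
    while used < budget:
        ids = tok(text, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=budget - used)
        used += out.shape[1] - ids.shape[1]
        text = tok.decode(out[0])
        if end_think not in text[len(prompt):] or used >= budget:
            break                                          # ended naturally or budget spent
        text = text.split(end_think)[0] + "\nWait,"        # suppress delimiter, force more thought
    return text
```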