1. RLHF (Reinforcement Learning from Human Feedback)
This is the "classic" method behind ChatGPT. It involves three main steps:
Supervised Fine-Tuning (SFT): Humans write high-quality answers to prompts, and the base model is fine-tuned on those examples.
Reward Model: Humans rank different AI responses from best to worst. A separate "Reward Model" is trained on these rankings to predict what humans prefer.
Optimization: The AI plays a "game" in which it tries to generate text that gets the highest score from the Reward Model. Its weights are updated with a reinforcement learning algorithm, classically PPO (Proximal Policy Optimization).
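The Reward Model and Optimization steps can be sketched in a few lines. This is a minimal illustration, not any lab's actual training code: it assumes the reward model is trained with a Bradley-Terry style pairwise loss (a common choice, not stated above), and it shows only PPO's clipped objective for a single sample. The function names and the `clip_eps=0.2` default are hypothetical choices made for this example.

```python
import math


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst human ranking into (preferred, rejected) pairs."""
    return [
        (ranked_responses[i], ranked_responses[j])
        for i in range(len(ranked_responses))
        for j in range(i + 1, len(ranked_responses))
    ]


def pairwise_preference_loss(score_preferred, score_rejected):
    """Assumed Bradley-Terry style loss: -log(sigmoid(preferred - rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one.
    """
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ppo_clipped_objective(prob_ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate objective for one sample.

    prob_ratio is new_policy_prob / old_policy_prob for the sampled text;
    advantage says how much better than expected its reward was.
    Clipping keeps a single update from moving the policy too far.
    """
    clipped_ratio = max(1.0 - clip_eps, min(prob_ratio, 1.0 + clip_eps))
    return min(prob_ratio * advantage, clipped_ratio * advantage)
```

In a real pipeline the scores come from a neural network and these quantities are averaged over large batches; the sketch only shows the shape of the two objectives.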
2. RLVR (Reinforcement Learning with Verifiable Rewards)
This is now the standard approach for training models on tasks where correctness can be checked automatically, such as coding or math.
The Problem: In RLHF, humans might prefer an answer because it looks smart, even if the math is wrong.
The Solution: RLVR uses "Verifiable Rewards." If the AI writes code, the reward comes from whether the code actually runs and passes its tests. If it solves a math problem, the reward comes from whether the final numerical answer is correct. The reward ignores "style" and depends only on whether the answer checks out.
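A verifiable reward is just a function that checks the output and returns a score. The sketch below is illustrative only: `math_reward` and `code_reward` are hypothetical names, the binary 1.0 / 0.0 scoring is an assumption, and running model-written code with `exec` is unsafe outside a sandbox.

```python
def math_reward(model_answer, reference, tol=1e-6):
    """Return 1.0 if the final numeric answer matches the reference, else 0.0."""
    try:
        value = float(model_answer.strip())
    except ValueError:
        # The answer was not a parseable number, so it cannot be verified.
        return 0.0
    return 1.0 if abs(value - reference) <= tol else 0.0


def code_reward(source, func_name, test_cases):
    """Return 1.0 only if the generated code runs and passes every test.

    test_cases is a list of (args_tuple, expected_result) pairs.
    WARNING: exec on untrusted code is unsafe; real systems use a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and crashes all earn no reward.
        return 0.0
    return 1.0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = [((1, 2), 3), ((0, 0), 0)]
good_score = code_reward(good, "add", tests)
bad_score = code_reward(bad, "add", tests)
```

Here `good_score` is 1.0 and `bad_score` is 0.0, no matter how polished either snippet looks.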
3. MoE (Mixture of Experts)
An MoE model consists of two main parts:
The Experts: Instead of one massive feed-forward network, the model is split into many smaller sub-networks (the "experts").
In recent models like DeepSeek-V3, there are hundreds of these experts in each MoE layer (256 routed experts, of which only 8 are activated for any given token).
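The Gating Network (The Router): This is the "manager." When a token (roughly a word or word piece) comes in, the Router scores every expert and decides which few to activate:
"This token looks like code; send it to Expert #42 and Expert #109."
Only the chosen experts run for that token, so most of the model's parameters stay idle at any given step.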
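The Router's decision can be sketched as top-k routing: score every expert, keep the k best, and mix their outputs. This is a simplified illustration under stated assumptions, not the routing code of any specific model. It applies a softmax and then keeps the top k (one common design; real models vary in the details), and it takes `router_logits` as an input, whereas a real router computes them from the token with a learned layer. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and return their weighted sum."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts never run
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Four toy "experts" that simply scale the token vector by 1, 2, 3, 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
chosen = route_token([0.1, 2.0, -1.0, 1.5], k=2)
mixed = moe_layer([1.0, 2.0], [0.0, 10.0, 0.0, 10.0], experts, k=2)
```

With these toy values, `chosen` selects experts 1 and 3, and `mixed` is an equal blend of the 2x and 4x experts. The cost of the layer depends on k, not on the total number of experts, which is why an MoE model can have many parameters yet use only a fraction of them per token.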