DeepSeek-R1: RL for LLMs Rethought

Anna Alexandra Grigoryan
7 min read · Jan 31, 2025


For years, supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) have been the dominant training methods for large language models (LLMs). The assumption has been clear: human-labeled data is necessary to teach structured reasoning, and reinforcement learning (RL) without human preference modeling is not enough to create a competitive model.

DeepSeek-R1 challenges this paradigm head-on.

Instead of relying on human preference ranking and costly reward models, it shows that pure reinforcement learning, structured self-ranking, and rejection sampling can achieve state-of-the-art reasoning — without the bottlenecks of traditional RLHF.

The key breakthroughs include:

  • DeepSeek-R1-Zero: The first LLM trained with reinforcement learning alone, with no supervised fine-tuning, that spontaneously developed reasoning behaviors.
  • Group Relative Policy Optimization (GRPO): A reinforcement learning method that removes the need for a separate critic (value) model.
  • Cold-start fine-tuning: A minimal dataset used before RL to stabilize training.
  • Rejection sampling: A self-improvement mechanism that replaces human preference tuning.
  • Distillation: Smaller models can inherit the reasoning capabilities of DeepSeek-R1.

This deep dive will explore the full DeepSeek-R1 methodology, how it compares to other models, and the implications of its breakthroughs.


DeepSeek-R1-Zero: The Proof That Reinforcement Learning Alone Can Induce Reasoning

The most radical idea behind DeepSeek-R1 started with an experiment:

What if we removed supervised fine-tuning (SFT) entirely and trained a model using only reinforcement learning?

The result was DeepSeek-R1-Zero, a model that:

  • Developed self-verification, reflection, and structured reasoning — without seeing human-labeled Chain-of-Thought (CoT) data.
  • Learned to reevaluate and correct its own mistakes.
  • Naturally extended the depth of its reasoning over time.
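
How does a model improve with no human preference labels at all? The reward signal in this kind of setup is rule-based rather than learned: an accuracy check against a verifiable final answer, plus a format check on how the reasoning is laid out. The snippet below is a minimal sketch of that idea; the tag convention, regex, and weighting are illustrative assumptions, not the exact recipe used in training.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that keep their reasoning inside <think>...</think> tags.
    The tag convention here is assumed for illustration."""
    has_think = re.search(r"<think>.*?</think>", response, flags=re.DOTALL) is not None
    return 1.0 if has_think else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Reward verifiably correct answers, e.g. math problems with a known result.
    Here we simply compare the text after the reasoning block to the reference."""
    final_answer = response.split("</think>")[-1].strip()
    return 1.0 if final_answer == reference_answer.strip() else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    # The 0.5 weighting is arbitrary for this sketch.
    return accuracy_reward(response, reference_answer) + 0.5 * format_reward(response)
```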

LLMs Do Not Need Human-Labeled Reasoning Steps to Learn Structured Thought

Traditionally, LLMs are fine-tuned on human-labeled reasoning chains before reinforcement learning. The belief has been that without this, models will not generalize multi-step reasoning well.

DeepSeek-R1-Zero proved this assumption wrong.

At one point, the model paused mid-solution and interrupted itself:

“Wait, let’s go over this again. That might not be right.”

This kind of self-correction usually emerges only when a model has been explicitly trained against human preference signals that reward verification behavior. DeepSeek-R1-Zero discovered it naturally.

Aha Moment #1: Reinforcement Learning Alone Can Induce Structured Reasoning — No Human CoT Data Required.

However, while DeepSeek-R1-Zero displayed remarkable reasoning abilities, it was not user-friendly:

  • Readability was poor — responses were long and unstructured.
  • Language mixing — responses contained multiple languages in a single answer.
  • Formatting issues — no markdown, no clear separation between thoughts and conclusions.

This led to the creation of DeepSeek-R1, which introduced a small fine-tuning step before RL to improve usability.

DeepSeek-R1: A More Stable Reinforcement Learning Pipeline

To fix the instability of pure RL, DeepSeek-R1 added a minimal fine-tuning phase before reinforcement learning, known as cold-start fine-tuning.

Step 1: Cold-Start Fine-Tuning (Minimal SFT for Stability)

Instead of training on a massive dataset, DeepSeek-R1 was first fine-tuned on a small, high-quality set of well-formatted reasoning examples before reinforcement learning (a minimal sketch of this phase follows the list below).

This improved:

  • Readability — structured outputs instead of raw reasoning chains.
  • Training stability — RL learning curves were smoother.
  • Language consistency — prevented mixing multiple languages in a single response.
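
Concretely, cold-start fine-tuning is just a short pass of ordinary next-token supervised fine-tuning on a few thousand curated reasoning traces. Here is a minimal sketch of what that pass looks like in plain PyTorch with Hugging Face Transformers; the checkpoint name and the example texts are placeholders, not the actual DeepSeek training stack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint and data: a handful of well-formatted reasoning traces
# stands in for the real cold-start dataset, which is small but carefully curated.
model_name = "your-base-model"  # hypothetical checkpoint name
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

cold_start_texts = [
    "Problem: ...\n<think>step-by-step reasoning</think>\nAnswer: ...",
    # thousands of curated examples in practice, not millions
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

for epoch in range(1):  # a short pass is the whole point: minimal SFT, not full-scale
    for text in cold_start_texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        # Standard causal-LM objective: the labels are the input ids themselves.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```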

Aha Moment #2: RL Doesn’t Need Full SFT — Just a Strong Enough Starting Point.

Step 2: Pure Reinforcement Learning with GRPO

After minimal fine-tuning, the model was trained using Group Relative Policy Optimization (GRPO), a reinforcement learning strategy that removes the need for a separate critic (value) model.

Instead of using a learned critic to estimate a baseline for each response, GRPO (sketched below):

  • Generates a group of responses for the same prompt.
  • Scores each response relative to the others in its group.
  • Optimizes the model to prefer responses that score above the group average, with no learned value function and no absolute reward scale.
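
In code, the heart of GRPO is the group-relative advantage: sample a group of responses for one prompt, score each, and normalize that score against the group's mean and standard deviation. Those advantages then drive a PPO-style clipped update, with no value network in sight. The sketch below is a simplification under those assumptions; the full objective in the paper also adds a KL penalty toward a reference policy, which is omitted here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO's key trick: each response's reward is normalized against its own group,
    so no learned critic is needed to provide a baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(log_probs: torch.Tensor, old_log_probs: torch.Tensor,
              rewards: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Simplified GRPO loss for one prompt's group of sampled responses.
    log_probs / old_log_probs: summed token log-probabilities of each response
    under the current policy and the policy that sampled it (one scalar per response)."""
    advantages = group_relative_advantages(rewards)
    ratio = torch.exp(log_probs - old_log_probs)              # importance ratio per response
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)  # PPO-style clipping
    # Clipped surrogate objective, negated so it can be minimized as a loss.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Example: rewards for 8 sampled answers to the same math prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))  # correct answers get positive advantages
```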

Aha Moment #3: An LLM Can Learn Without an External Critic — Self-Ranking Is Enough.

Step 3: Post-RL Fine-Tuning with Rejection Sampling

Once RL training was complete, DeepSeek-R1 added a rejection sampling stage, a way for the model to refine itself without human preference tuning (see the sketch after the list below).

  • The model generates multiple responses per prompt.
  • The best responses are selected and used for an additional fine-tuning pass.
  • This acts as a self-improving feedback loop, replacing human preference tuning.
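
Stripped down, the rejection sampling stage looks like the sketch below: sample several candidates per prompt, keep only the best one, and only if it clears a quality bar, then fine-tune on the survivors. The `generate` and `score` callables are illustrative assumptions standing in for the model's sampler and whatever correctness or quality check is applied.

```python
from typing import Callable, List, Tuple

def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # returns n sampled responses for a prompt
    score: Callable[[str, str], float],         # higher is better, e.g. a correctness check
    samples_per_prompt: int = 16,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Build a self-generated fine-tuning set: keep the best response per prompt,
    and only if it clears the quality threshold."""
    sft_pairs: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)
        best = max(candidates, key=lambda response: score(prompt, response))
        if score(prompt, best) >= threshold:
            sft_pairs.append((prompt, best))
    return sft_pairs
```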

Aha Moment #4: An LLM Can Self-Generate Fine-Tuning Data — No Human Preferences Needed.

Distillation in DeepSeek-R1: Scaling Down Without Losing Reasoning Power

One of the most strategic moves in DeepSeek-R1’s design was the introduction of distillation, allowing smaller models to inherit the reasoning capabilities of the larger model without requiring full retraining.

Unlike conventional instruction tuning, where smaller models are fine-tuned on datasets independently curated from various sources, DeepSeek-R1’s distillation directly leverages the reasoning capability of its larger counterpart.

How the Distillation Process Works

  • DeepSeek-R1 first generates 800K high-quality training samples curated from its reasoning outputs.
  • These samples are then used to fine-tune smaller models like Qwen and Llama to replicate the larger model’s reasoning.
  • No reinforcement learning is applied to the distilled models, even though incorporating RL could further enhance performance.

This approach enables smaller models to retain advanced reasoning skills without the high cost of training from scratch.
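
Mechanically, this distillation is plain supervised fine-tuning on teacher-generated text rather than any logit-matching scheme: the large model writes out full reasoning traces, and those traces become the training set for the smaller model. The sketch below shows the data-generation half of that loop; the checkpoint name and sampling settings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder names: a strong reasoning "teacher" checkpoint and a prompt set to distill from.
teacher_name = "strong-reasoning-teacher"  # hypothetical; stands in for DeepSeek-R1
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)
tokenizer = AutoTokenizer.from_pretrained(teacher_name)

def make_distillation_sample(prompt: str) -> dict:
    """The teacher writes out its full reasoning trace; the (prompt, trace) pair
    becomes one supervised example for the smaller student model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = teacher.generate(**inputs, max_new_tokens=1024,
                                  do_sample=True, temperature=0.7)
    trace = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"prompt": prompt, "response": trace}

# In the paper, roughly 800K such samples are curated, then smaller Qwen- and
# Llama-based students are fine-tuned on them with the usual next-token loss
# (no RL on the students themselves).
```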

These distilled models outperformed larger open-source models like Qwen-32B and even OpenAI’s o1-mini in reasoning-heavy tasks.

Aha Moment #5: Distilling Reasoning is More Effective Than Training a Small Model from Scratch

One of the key findings of DeepSeek-R1’s distillation process was that:

  • A 14B distilled model outperformed a 32B open-source model that had never been trained with RL.
  • A distilled 70B model set new records on reasoning benchmarks among dense models.

This suggests that pre-trained smaller models may not be inherently worse at reasoning — they just lack the right training.

Instead of spending massive compute on RL training for each small model, distillation allows smaller models to “inherit” reasoning skills from larger ones.

Distillation vs. Reinforcement Learning: Why Not Just Train Small Models with RL?

A key question arises: Why distill instead of applying RL directly to the smaller models?

To test this, the DeepSeek-R1 team applied full-scale RL training to Qwen-32B-Base using math, code, and STEM data. The results showed that:

  • Distillation from DeepSeek-R1 outperformed direct RL training on a smaller model.
  • The reasoning patterns discovered by the larger model were crucial for improving smaller models.
  • Applying RL to a distilled model still provided additional benefits, but the bulk of reasoning capability was already transferred through distillation.

Aha Moment #6: Distillation Can Be More Effective Than RL for Small Models — And Requires Less Compute.

This means that instead of investing in RL training for every model size, researchers can train a single strong RL model and distill its reasoning capabilities into smaller models.

Final Thoughts: Rethinking the LLM Training Paradigm

DeepSeek-R1 is not just another incremental step; it is a fundamental rethinking of how large language models can be trained. By removing the reliance on learned human preference models, sidestepping the reward hacking they invite, and showing that reinforcement learning alone can induce reasoning, DeepSeek-R1 challenges some of the most entrenched assumptions in AI development.

Key Takeaways from DeepSeek-R1’s Innovations:

  • Reinforcement learning alone can induce structured reasoning. DeepSeek-R1-Zero proved that an LLM can develop self-verification, reflection, and multi-step problem-solving without human-annotated Chain-of-Thought (CoT) examples.
  • Reward models are not necessary for reinforcement learning. Group Relative Policy Optimization (GRPO) eliminates the need for an external critic model, making reinforcement learning more stable and cost-efficient than traditional RLHF.
  • Fine-tuning before RL doesn’t need to be large-scale. A small cold-start fine-tuning phase is sufficient to stabilize training — removing the need for full-scale supervised fine-tuning.
  • Self-ranking can replace human preference tuning. Instead of training a reward model based on human feedback, DeepSeek-R1 ranks its own responses and refines itself using rejection sampling.
  • Distillation makes advanced reasoning models scalable. DeepSeek-R1 distilled its knowledge into smaller Qwen and LLaMA-based models, proving that high-level reasoning can be transferred to compact architectures without full RL training.

Have We Been Overusing RLHF?

For years, RLHF has been considered the gold standard for aligning LLMs with human preferences, but DeepSeek-R1 suggests that this approach may be overused. It shows that:

  • Pure reinforcement learning can match or exceed RLHF-trained models in reasoning-heavy tasks like math and coding.
  • Dropping the learned reward model removes a common target for reward hacking and makes training more stable.
  • Distillation can scale high-quality reasoning down to smaller, more efficient models.

This raises an important question for the AI research community:

Do we really need human feedback at every stage, or can self-optimizing reinforcement learning be enough?

What This Means for the Future of LLM Training

DeepSeek-R1 introduces a new blueprint for LLM training — one that is cheaper, more scalable, and more stable than traditional RLHF approaches. This shift could reshape how future LLMs are trained, making high-performance models:

  • Less reliant on human annotation bottlenecks.
  • More adaptable to self-improving feedback loops.
  • Easier to scale through distillation into efficient, smaller models.

If DeepSeek-R1’s approach proves generalizable beyond math and coding, it could signal the beginning of a new training era — one where LLMs learn to optimize themselves without needing human intervention at every step.
