Step-Level Reward Models: A Framework for Structured Mathematical Reasoning
Mathematical reasoning requires precision and logical consistency, especially in multi-step problems where every intermediate step must align with the final solution. Large language models (LLMs) have shown promising results in automating such reasoning, but their limitations in evaluating and guiding intermediate steps present a significant challenge. Without robust mechanisms to assess logical coherence at each step, errors often propagate unnoticed, undermining the final results.
The paper on Step-Level Reward Models for Mathematical Reasoning by Ma et al. examines the role of Step-Level Reward Models (SRMs), leveraging Monte Carlo Tree Search (MCTS) to evaluate intermediate reasoning steps. It delivers a counterintuitive insight: natural language is not essential for effective reasoning guidance in mathematical tasks.
This blog post explores the study’s findings, methodology, and implications, providing a structured analysis of how SRMs enhance logical reasoning.
The Problem: Evaluating Multi-Step Reasoning
Why Evaluate Step by Step?
Mathematical problem-solving often involves multiple reasoning steps:
1. Intermediate Logic: Each step must align with the overarching goal of solving the problem.
2. Cumulative Consistency: Mistakes in one step can propagate, making early detection critical.
Traditional approaches focus on end-to-end outputs, ignoring the logical consistency of intermediate steps. This leaves reasoning models blind to errors until the final result, making debugging and improvement difficult.
Step-Level Reward Models (SRMs)
What Are SRMs?
Step-Level Reward Models (SRMs) are designed to evaluate the quality of intermediate reasoning steps, assigning scores based on their contribution to solving the problem.
SRM Configurations
The paper evaluates four types of SRMs:
1. Full-Context SRM (FC-SRM): Evaluates both the natural language thoughts and the mathematical expressions of all steps so far.
2. Math-Only SRM (MO-SRM): Evaluates only the mathematical expressions, ignoring natural language thoughts.
3. Single-Step Math-Only SRM (SSMO-SRM): Evaluates only the most recent mathematical expression, ignoring all prior context.
4. Next-Thought SRM (NT-SRM): Evaluates the natural language description of the next proposed reasoning step.
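The four configurations differ only in how they assemble the model's input from the reasoning trace. A minimal sketch, assuming each reasoning step is a (natural-language thought, math expression) pair; the function and variable names here are illustrative, not taken from the paper:

```python
# Sketch of how the four SRM variants might assemble their inputs.
# Each reasoning step is modeled as a (thought, expression) pair.

def build_srm_input(steps, variant):
    """steps: list of (thought, expression) tuples for the partial solution."""
    if variant == "FC-SRM":    # full context: thoughts + expressions
        return " ".join(f"{t} {e}" for t, e in steps)
    if variant == "MO-SRM":    # math only: all expressions, no thoughts
        return " ".join(e for _, e in steps)
    if variant == "SSMO-SRM":  # single-step math only: latest expression
        return steps[-1][1]
    if variant == "NT-SRM":    # next thought: latest natural-language thought
        return steps[-1][0]
    raise ValueError(f"unknown variant: {variant}")

steps = [("Add the two terms.", "2+3=5"), ("Double the result.", "5*2=10")]
```

The key contrast in the study is between FC-SRM and MO-SRM: the latter throws away every natural-language thought and keeps only the equations.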
Monte Carlo Tree Search (MCTS): The Key to Training SRMs
Training SRMs requires data on which reasoning steps are better. This is where Monte Carlo Tree Search (MCTS) comes into play. MCTS simulates different reasoning paths for a given problem, scoring each step based on its likelihood of leading to the correct solution. It generates step-level preferences by comparing these paths.
How MCTS Works: A Step-by-Step Explanation
Imagine solving a multi-step math problem. There are multiple ways to approach it, but not all paths are equally effective. MCTS helps evaluate these paths systematically.
1. Representing the Problem as a Tree
- The root node represents the initial problem state.
- Each node in the tree represents a partial solution (e.g., a series of steps completed so far).
- Edges between nodes represent actions or decisions (e.g., choosing the next step).
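The tree structure above can be captured with a minimal node type. This is an illustrative sketch, not the paper's implementation: each node tracks its visit count N(s) and accumulated rollout reward, from which the average reward Q(s) follows.

```python
from dataclasses import dataclass, field

# Minimal tree node for MCTS over reasoning states (illustrative sketch).
@dataclass
class Node:
    state: str                      # partial solution represented by this node
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0                 # N(s): how often this node was visited
    total_reward: float = 0.0       # sum of rollout rewards through this node

    def q(self) -> float:
        # Average reward Q(s); zero if the node has never been visited.
        return self.total_reward / self.visits if self.visits else 0.0

root = Node(state="initial problem statement")
```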
2. The Four Phases of MCTS
MCTS explores this tree using a systematic process:
A. Selection
- Starting at the root, MCTS selects a promising node to expand.
- The selection balances exploration (trying new paths) with exploitation (refining paths that already seem promising).
- This balance is controlled using the Upper Confidence Bound for Trees (UCT) formula:

UCT(s_i, a_j) = Q(s_i, a_j) + c * sqrt( ln N(s_i) / N(s_i, a_j) )

where:
- Q(s_i, a_j): Average reward of taking action a_j from state s_i.
- N(s_i): Number of times state s_i has been visited.
- N(s_i, a_j): Number of times action a_j has been taken from s_i.
- c: A parameter balancing exploration and exploitation.
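The UCT score is straightforward to compute from the quantities defined above. A small sketch (the unvisited-action convention of returning infinity is a common implementation choice, not something the paper specifies):

```python
import math

# UCT score for taking action a_j from state s_i.
def uct(q, n_state, n_action, c=1.41):
    """q: average reward Q(s_i, a_j); n_state: N(s_i); n_action: N(s_i, a_j)."""
    if n_action == 0:
        return float("inf")  # unvisited actions are tried first
    return q + c * math.sqrt(math.log(n_state) / n_action)
```

A larger c pushes the search toward rarely tried actions; a smaller c keeps it close to actions that already score well.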
B. Expansion
- When MCTS reaches a node that hasn’t been fully explored, it generates new child nodes by considering possible next steps.
For example, from a partially solved state, expansion applies a candidate next operation and adds the resulting expression to the tree as a new child node.
C. Rollout
- From the newly expanded node, MCTS simulates a complete reasoning path to the final solution.
- For each path, the reasoning model predicts outcomes for intermediate steps, assigning a binary reward: R = 1 if the path leads to a correct solution and R = 0 otherwise.
D. Backpropagation
- The results of the rollout are propagated back through the tree, updating the scores of all nodes along the path.
- This ensures that earlier decisions reflect the success or failure of the reasoning path.
Illustrative Example
1. Initial Problem: The root node holds the unsolved problem.
2. Expansion: Candidate next steps are generated, for example:
- Node 1: a correctly computed next expression.
- Node 2: an invalid expansion (model error).
3. Rollout: Reasoning paths are simulated from each node:
- Path 1 (through Node 1) → reaches the correct answer.
- Path 2 (through Node 2) → does not.
4. Backpropagation: Assign higher scores to correct paths and prune invalid expansions.
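The four phases can be sketched end to end on a toy search problem. The sketch below is illustrative only: instead of a real reasoning model, each "step" appends +1 or *2 to a running value, and a path earns the binary reward R = 1 if it reaches a target value. The target, depth limit, and exploration constant are assumptions for the demo, not values from the paper.

```python
import math, random

# Toy MCTS: each step applies +1 or *2; a path is "correct" if it hits TARGET.
ACTIONS = [("+1", lambda v: v + 1), ("*2", lambda v: v * 2)]
TARGET, MAX_DEPTH, C = 6, 4, 1.41

class Node:
    def __init__(self, value, depth, parent=None):
        self.value, self.depth, self.parent = value, depth, parent
        self.children, self.visits, self.reward = [], 0, 0.0

    def expand(self):
        # Expansion: add one child per possible next step.
        for _, f in ACTIONS:
            self.children.append(Node(f(self.value), self.depth + 1, self))

    def uct(self):
        if self.visits == 0:
            return float("inf")
        return (self.reward / self.visits
                + C * math.sqrt(math.log(self.parent.visits) / self.visits))

def rollout(node):
    # Rollout: simulate random steps to the end; binary reward R in {0, 1}.
    v, d = node.value, node.depth
    while d < MAX_DEPTH and v < TARGET:
        v = random.choice(ACTIONS)[1](v)
        d += 1
    return 1.0 if v == TARGET else 0.0

def search(root_value=1, iterations=200):
    root = Node(root_value, 0)
    for _ in range(iterations):
        node = root
        # Selection: descend by UCT until reaching a leaf.
        while node.children:
            node = max(node.children, key=Node.uct)
        if node.depth < MAX_DEPTH and node.value < TARGET:
            node.expand()
            node = random.choice(node.children)
        r = rollout(node)
        # Backpropagation: push the rollout result back up to the root.
        while node is not None:
            node.visits += 1
            node.reward += r
            node = node.parent
    return root

root = search()
```

After enough iterations, nodes on paths that reach the target accumulate higher average rewards, which is exactly the step-level signal the SRM training data is distilled from.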
Training SRMs with Preferences
The preferences generated by MCTS are used to train SRMs. For each pair of reasoning states (s_i, s_j), where s_i is the preferred state, the SRM is trained with a pairwise contrastive loss:

L = -log σ( V(s_i) - V(s_j) )

where σ is the sigmoid function and:
- V(s_i) : Predicted score for the preferred state.
- V(s_j) : Predicted score for the less preferred state.
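This is the standard Bradley-Terry style pairwise loss; a minimal sketch using the scores V(s_i) and V(s_j) defined above (illustrative, not the paper's code):

```python
import math

# Pairwise preference loss: small when the preferred state scores higher.
def preference_loss(v_i, v_j):
    """-log sigma(V(s_i) - V(s_j)), with s_i the preferred state."""
    return -math.log(1.0 / (1.0 + math.exp(-(v_i - v_j))))

preference_loss(2.0, 0.0)  # preferred step scored higher: small loss
preference_loss(0.0, 2.0)  # preference violated: large loss
```

Minimizing this loss drives the margin V(s_i) - V(s_j) up, so the trained SRM ranks MCTS-preferred steps above their rejected siblings.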
Key Results and Learnings
Datasets Used
The study evaluated Step-Level Reward Models (SRMs) on two datasets:
1. GSM8K: A dataset containing grade-school-level math problems. These problems involve basic reasoning steps and simple numerical operations, making it a good testbed for evaluating logical coherence in elementary reasoning.
2. MATH: A more challenging dataset with high-school-level math problems. These problems demand advanced logical reasoning and multi-step calculations, providing a robust test for SRMs.
Performance of SRMs
The experiments compared different configurations of SRMs in terms of their ability to guide reasoning systems effectively:
1. Full-Context SRM (FC-SRM):
- This model considered both natural language explanations of reasoning and mathematical expressions.
- It achieved significant improvements in accuracy, showing that combining natural language and structured math can enhance reasoning.
- However, the computational cost of processing natural language inputs made it less efficient compared to other configurations.
2. Math-Only SRM (MO-SRM):
- Focused solely on structured mathematical expressions.
- Surprisingly, it performed almost as well as the Full-Context SRM, with only a marginal drop in accuracy.
- On the MATH dataset, which features complex problems, MO-SRM slightly outperformed FC-SRM. This indicates that mathematical structure alone is often sufficient for logical reasoning, reducing reliance on language inputs.
3. Single-Step Math-Only SRM (SSMO-SRM):
- This model evaluated only the most recent mathematical expression without considering prior steps.
- It performed worse than MO-SRM and FC-SRM, demonstrating the importance of context in multi-step reasoning tasks.
4. Next-Thought SRM (NT-SRM):
- This model relied solely on the natural language descriptions of the next step in the reasoning process.
- It struggled to evaluate logical coherence effectively, particularly for tasks requiring structured reasoning, underscoring the limitations of relying solely on language.
Quantitative Gains
- For grade-school problems (GSM8K), both FC-SRM and MO-SRM achieved significant improvements over the baseline performance, with gains of approximately 7%.
- For more complex problems (MATH), MO-SRM achieved the highest improvement, with an 8.48% gain in accuracy. This demonstrates the strength of structured mathematical reasoning, even for high-difficulty tasks.
Main Takeaways
1. Natural Language Is Not Essential
One of the most counterintuitive findings is that natural language explanations are not critical for guiding reasoning models in mathematical tasks. MO-SRM, which excludes language inputs, performed comparably to, and in many cases better than, FC-SRM. This challenges the conventional belief that language is necessary for contextualizing reasoning.
2. Structured Mathematical Representations Are Sufficient
The success of MO-SRM highlights the power of structured inputs like equations and symbolic representations. Mathematical reasoning inherently relies on logical structures, making language an optional rather than essential component for reasoning tasks.
3. Context Matters
The poor performance of Single-Step Math-Only SRM (SSMO-SRM) demonstrates that reasoning models need context. Evaluating each step in isolation, without considering prior steps, leads to incomplete and inconsistent reasoning.
4. Complex Reasoning Tasks Benefit from Focused Models
On the challenging MATH dataset, MO-SRM excelled, proving that a focus on structured, domain-specific reasoning can outperform hybrid approaches. This is particularly relevant for tasks like theorem proving or symbolic computation, where clarity and precision are paramount.
5. Efficiency Gains Without Sacrificing Accuracy
By excluding natural language inputs, MO-SRM reduces computational overhead while maintaining accuracy. This makes it a cost-effective alternative for reasoning systems, especially in environments with limited resources.
Learnings from the Study
Implications for Model Design
1. Domain-Specific Optimization:
- For fields like mathematics or symbolic reasoning, where structured data dominates, models can focus exclusively on such inputs without relying on language. This improves efficiency and simplifies model architecture.
2. Hybrid Models for Mixed Domains:
- While natural language is not essential for math reasoning, domains that combine unstructured and structured inputs (e.g., legal reasoning or scientific discovery) may benefit from hybrid SRMs that can process both.
Challenges and Future Directions
1. Path-Length Bias in SRMs:
- SRMs tend to favor shorter reasoning paths, potentially overlooking longer, optimal solutions. Future work could explore techniques to balance this bias.
2. Expanding to Mixed Reasoning Tasks:
- The current focus on mathematical tasks leaves open questions about SRMs’ effectiveness in mixed reasoning contexts, such as combining logical reasoning with creative problem-solving.
3. Enhancing Logical Coherence in Language:
- The limited performance of NT-SRM highlights the challenges of capturing and evaluating implicit logic in natural language. This is an area ripe for further exploration.
Takeaway for Practitioners
For tasks involving structured, domain-specific reasoning:
- Use domain-optimized models like MO-SRM, which focus exclusively on structured inputs.
- Reserve natural language processing capabilities for tasks where unstructured inputs provide additional context or insights.
For broader applications:
- Leverage hybrid SRMs that can adapt to mixed reasoning domains.
- Explore methods for generating step-level preferences in domains with less structured data.
Wrapping up
Step-Level Reward Models provide a systematic way to evaluate and guide multi-step reasoning tasks. The study’s counterintuitive finding — that natural language is not essential for mathematical reasoning — highlights the power of structured inputs like equations and symbolic representations. By leveraging techniques like Monte Carlo Tree Search, SRMs can assign step-level rewards that guide models toward logical and coherent solutions.
As reasoning systems continue to evolve, the key challenge will be extending SRMs to handle mixed-domain tasks while maintaining their efficiency and logical rigor. This work lays the foundation for building reasoning models that are not only accurate but also transparent and computationally efficient.