SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models

A new study systematically evaluates the long-horizon planning abilities of state-of-the-art Large Reasoning Models (LRMs) using a benchmark based on simplified Sokoban puzzles. Results indicate significant performance drops when more than 25 moves are needed, highlighting limitations in their forward planning capacity. Enhancements through Planning Domain Definition Language (PDDL) tools show only modest improvements, suggesting fundamental architectural constraints remain unaddressed by scaling methods.
A new benchmark, SokoBench, assesses the long-horizon planning capabilities of state-of-the-art Large Reasoning Models (LRMs) through a series of simplified Sokoban puzzles designed to isolate long-term planning from other sources of difficulty.
The evaluation shows a marked decline in LRM performance on problems requiring more than 25 moves, indicating a fundamental limitation in the models' capacity for forward planning.
Benchmark Findings
SokoBench was developed to focus on planning challenges, allowing researchers to examine the inherent capabilities of LRMs in long-horizon reasoning tasks. Results indicate a consistent degradation in performance as the length of the required solution path increases.
- Models demonstrated significant difficulty with puzzles necessitating over 25 moves.
- The performance decline suggests limitations in how LRMs can project future states.
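To make the "forward planning" demand concrete, the kind of simplified Sokoban task described above can be solved exactly by breadth-first search over (player, boxes) states; the length of the returned move sequence is the solution-depth quantity the benchmark varies. This is an illustrative sketch, not the paper's evaluation harness; the grid encoding and function name are assumptions.

```python
from collections import deque

def solve_sokoban(walls, player, boxes, goals):
    """Shortest Sokoban solution via BFS over (player, boxes) states.

    walls/boxes/goals are sets of (row, col) cells; returns a move
    string over 'U/D/L/R', or None if unsolvable.
    """
    moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    goals = frozenset(goals)
    start = (player, frozenset(boxes))
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        (pos, bxs), path = queue.popleft()
        if bxs == goals:          # every box on a goal cell
            return path
        for m, (dr, dc) in moves.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            if nxt in walls:
                continue
            new_bxs = bxs
            if nxt in bxs:        # pushing a box: its target must be free
                push = (nxt[0] + dr, nxt[1] + dc)
                if push in walls or push in bxs:
                    continue
                new_bxs = frozenset((bxs - {nxt}) | {push})
            state = (nxt, new_bxs)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + m))
    return None
```

Because the reachable state space grows roughly exponentially with solution depth, puzzles needing 25+ moves force exactly the kind of sustained look-ahead that the benchmark reports LRMs failing at.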
The research team incorporated Planning Domain Definition Language (PDDL) parsing and solving tools into the LRMs, which yielded modest performance improvements but did not fully compensate for the observed deficiencies.
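The paper does not reproduce its PDDL encoding, but the tool-use setup it describes can be sketched as translating a puzzle instance into a PDDL problem string that an external classical planner consumes. The predicate and domain names below (`sokoban`, `at-player`, `at-box`, `adjacent`) are assumptions for illustration; a real pipeline would pair this with a matching domain file and a planner such as Fast Downward.

```python
def grid_to_pddl(free_cells, player, boxes, goals):
    """Render a Sokoban-like instance as a PDDL problem string.

    free_cells/boxes/goals are sets of (row, col) cells. Predicate
    names are illustrative, not the paper's actual encoding.
    """
    name = lambda c: f"c{c[0]}_{c[1]}"
    objs = " ".join(name(c) for c in sorted(free_cells))
    facts = [f"(at-player {name(player)})"]
    facts += [f"(at-box {name(b)})" for b in sorted(boxes)]
    # Directional adjacency facts let the domain's move/push actions
    # reason about the grid topology.
    dirs = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    for c in sorted(free_cells):
        for d, (dr, dc) in dirs.items():
            n = (c[0] + dr, c[1] + dc)
            if n in free_cells:
                facts.append(f"(adjacent {name(c)} {name(n)} {d})")
    goal = " ".join(f"(at-box {name(g)})" for g in sorted(goals))
    return (
        "(define (problem soko)\n"
        "  (:domain sokoban)\n"
        f"  (:objects {objs} - cell)\n"
        f"  (:init {' '.join(facts)})\n"
        f"  (:goal (and {goal})))"
    )
```

Offloading search to a symbolic planner in this way sidesteps the model's own look-ahead limits for the solving step, which is consistent with the finding that such tooling helps only modestly: the LRM must still produce a faithful formalization and interpret the plan.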
📰 Original Source: https://arxiv.org/abs/2601.20856v1