An essay by GPT:
This maze is interesting for the wrong reason. The real issue is not whether a system can patiently trace corridors from entrance to exit. It is whether it can recognize, early and cheaply, that no valid route exists at all. In the image, the exit sits in the upper-right region, and that region appears visually cut off from the start-side space in the lower-left. A human often does not need to “solve” such a maze in the ordinary sense. The human sees that the exit is stranded and stops there. The problem is over before route construction begins.
That difference matters because it separates two very different tasks. One task is path construction: given that a path exists, find one. The other is reachability testing: determine whether any path can exist in the first place. In graph theory, the second task is the simpler and more basic one. If the start and goal lie in different connected components of the walkable region, no search for a route is needed. Standard connected-components checks are built on breadth-first search and run in time linear in the number of vertices and edges, O(V + E). In other words, once the maze is represented correctly, this is not a deep puzzle. It is a cheap structural test. (NetworkX)
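That structural test can be sketched in a few lines with NetworkX. The grid below is a made-up toy maze, not the one from the image; it just reproduces the relevant situation, an exit pocket in the top-right that is walled off from the start's region:

```python
import networkx as nx

# Toy maze: 1 = open cell, 0 = wall. The exit pocket (top-right)
# is sealed off from the start region; this encoding is illustrative.
maze = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
]
start, exit_ = (4, 0), (0, 4)

# Build a graph whose nodes are open cells and whose edges join
# 4-adjacent open cells; reachability becomes a connectivity query.
G = nx.Graph()
rows, cols = len(maze), len(maze[0])
for r in range(rows):
    for c in range(cols):
        if maze[r][c]:
            G.add_node((r, c))
            for nr, nc in ((r + 1, c), (r, c + 1)):
                if nr < rows and nc < cols and maze[nr][nc]:
                    G.add_edge((r, c), (nr, nc))

# The cheap structural test: if this is False, route construction
# never needs to begin.
print(nx.has_path(G, start, exit_))  # → False
```

Note that nothing here constructs a route. The call answers only the prior question, whether any route can exist.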
Humans often do well on cases like this because human vision is not only a serial step-by-step tracer. Navon's classic work on global precedence found that people often register overall structure before local detail. Related work by Palmer and Rock on uniform connectedness argues that connected regions tend to be perceived as entry-level units of visual organization. Put simply, people are often good at seeing “this open area belongs together” before they enumerate exact moves. That is why the exit-side region in a maze can pop out as a disconnected pocket before any deliberate search begins. The human advantage here is not mystical intuition. It is rapid perceptual grouping. (Wexler)
The model’s mistake, then, is not just that it failed to find the right route. It failed to ask the right first question. Instead of checking connectivity, it moved straight to path narration, as though a maze prompt automatically implies that a path must exist. That is a revealing error because it shows a bias toward constructive output over structural verification. The system acted as though producing something route-shaped was the task, when the real task was to reject the premise of solvability.
Recent spatial-reasoning research suggests that this is not an isolated anecdote. The VSP benchmark found that current vision-language models fail even simple spatial planning tasks, and the authors traced those failures to a combination of weak visual perception and reasoning bottlenecks. SPaRC, a pathfinding benchmark with 1,000 grid puzzles, reports near-ceiling human performance while strong reasoning models remain far behind and often generate invalid paths. These are not just “wrong answers.” They are failures to preserve hard spatial constraints while producing fluent-looking outputs. (arXiv)
That is why impossible cases like this one are unusually diagnostic. In an ordinary solvable maze, a model can bluff its way part of the distance with plausible local moves. In an impossible maze, bluffing is fatal. The only correct answer is a negative one, and negative answers require a different cognitive discipline: not continuation, but verification. TopoBench, a recent benchmark for hard topological reasoning, makes this point at a broader level. It shows that even frontier models struggle badly on problems defined by global invariants such as connectivity, loop closure, and region structure. Its error analysis identifies patterns like premature commitment and constraint forgetting, and the paper’s broader conclusion is especially relevant here: the real bottleneck often lies in extracting and maintaining the spatial constraints, not in reasoning over them once they are made explicit. (arXiv)
This is also why the most natural human move is to start from the exit. Formally, starting from the entrance or the exit is symmetric for a reachability test. Perceptually, it is not. A small, suspicious-looking goal pocket is often the cheapest place to inspect first. Humans gravitate to the visually diagnostic region because it minimizes work. The model, by contrast, tends to behave as though all maze problems want the same generic template: begin at the start, narrate a route, and keep going until something answer-shaped appears. That is not search. It is format completion.
The deeper lesson is that this is not fundamentally a language problem. It is a topological problem. Language can describe it, but description is not the same thing as solving it. A graph algorithm solves it by testing connectedness. Human perception often solves it by grouping regions of space into coherent wholes. A model trained primarily to continue text has no built-in reason to privilege either of those procedures unless the prompt, architecture, or external tools force it to do so. That is why the mismatch matters: next-token generation is local, but connectedness is global. A system can be very good at extending a sequence and still be poor at protecting a global invariant. (NetworkX)
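The contrast between local continuation and a global test is concrete enough to write down without any library at all. Below is a dependency-free flood fill (breadth-first search) over an ASCII maze; the maze string is an illustrative stand-in for the image, chosen so that the exit sits in a sealed pocket:

```python
from collections import deque

# Illustrative ASCII maze: '#' = wall, '.' = open, 'S' = start, 'E' = exit.
# The exit's pocket is sealed off from the start's connected region.
GRID = [
    "....#.E",
    ".##.#.#",
    ".#..###",
    "S#.....",
]

def reachable(grid, start):
    """Flood-fill (BFS) the open region containing `start`.

    Each cell is enqueued at most once, so the whole check is linear in
    the number of cells. The answer is global: it depends on the entire
    region structure, not on any locally plausible next move.
    """
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen

start = next((r, row.index("S")) for r, row in enumerate(GRID) if "S" in row)
exit_ = next((r, row.index("E")) for r, row in enumerate(GRID) if "E" in row)

# Reject the premise of solvability before any route construction:
print(exit_ in reachable(GRID, start))  # → False
```

The point of the sketch is the shape of the computation: it protects the global invariant (connectedness) by exhausting a region, which is exactly what locally greedy continuation does not do.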
There is already evidence for what this means in practice. A 2025 Scientific Reports paper on path planning describes a directly related failure mode: the model “imagined a non-existent path” through obstacles and planned greedily toward the destination anyway. The authors call this spatial hallucination. That phrase is useful because it captures what goes wrong in examples like this one. The error is not only incorrect reasoning. It is the invention of spatial structure that is not there. Once that happens, the model can produce a very confident answer to a problem it has silently rewritten. (Nature)
So the strongest conclusion is not that language models are useless at mazes. That would be too broad. The stronger and more defensible conclusion is narrower: current LLMs and VLMs are often unreliable at explicit global spatial verification, especially in trick cases where the correct answer is “no solution.” Humans often solve those cases cheaply by seeing disconnectedness at a glance. Classical graph methods solve them by checking components. Models, left in their default generative mode, too often skip that step and begin constructing paths anyway. That is what this maze exposes so clearly. It is not a failure to search longer. It is a failure to test whether search should begin at all. (arXiv)