UC Merced
UC Merced Electronic Theses and Dissertations

Title: Spatial Sequence Reasoning in Large Language Models: An MDL-Based Evaluation with Dehaene’s Geometric Language
Permalink: https://escholarship.org/uc/item/462545ss
ISBN: 9798297612938
Author: Ingale, Ajinkya
Publication Date: 2025-08-18
Copyright Information: This work is made available under the terms of a Creative Commons Attribution-NonCommercial License, available at https://creativecommons.org/licenses/by-nc/4.0/
Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA, MERCED

Spatial Sequence Reasoning in Large Language Models: An MDL-Based Evaluation with Dehaene’s Geometric Language

A Thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Cognitive and Information Sciences

by Ajinkya Ashok Ingale

Committee in charge:
Professor Chris Kello, Chair
Professor Heather Bortfeld
Professor Kristina Backer

Copyright © Ajinkya Ashok Ingale, 2025
All rights reserved

The Thesis of Ajinkya Ashok Ingale is approved, and it is acceptable in quality and form for publication on microfilm and electronically:
Heather Bortfeld
Kristina Backer
Chris Kello, Chair
University of California, Merced, 2025

Table of Contents
Abstract
1. Introduction
2. Methods: Computational Assessment of Spatial Reasoning Performance
2.1 Experimental Framework and Design
2.2 Stimuli
2.2.1 Spatial Template Creation
2.2.2 Test Image Design
2.2.3 Metadata Specification
2.3 Computational Architecture
2.4 Experimental Procedure
2.5 Analytical Approach
2.6 Implementation Specifications
3. Analysis
3.1 Data Integration and Preprocessing
3.2 Normalization of Accuracy Metrics
3.3 Visualization Strategy
3.4 Regression Analysis (OLS; full, cutoff, no-fit)
3.5 Confidence Intervals and Model Diagnostics
3.6 Analytical Outputs
4. Results
4.1 Relationship Between MDL and Model Performance
4.2 Human Benchmark and Task Comparability
5. Discussion
References

List of Figures
Figure 1. Sample Template (circular arrangement with numbered elements)
Figure 2. Order Images: Clockwise (top), Zigzag (center), 2 Arcs (bottom)
Figure 3. Geometrical Rules and Sequence Examples (rotations, symmetries)
Figure 4. Scenario 1: No Fit Line: Normalized Edit Distance vs MDL
Figure 5. Symbol Chart: Pattern–Symbol Mapping
Figure 6. Scenario 2: OLS Fit (All MDL) with 95% CIs
Figure 7. Scenario 3: OLS Fit (MDL ≤ 10) with 95% CIs
Figure 8. Human Benchmark (MDL vs Anticipatory Eye Movements)

Abstract: Artificial Intelligence (AI) language models, such as GPT-mini (ChatGPT) or OpenAI o1, have demonstrated exceptional capabilities in language processing, logical reasoning, and symbolic tasks[1, 2]. However, spatial reasoning, which encompasses the ability to mentally manipulate objects, understand spatial relationships, and predict structured spatial sequences, remains largely uncharted in the context of large language models (LLMs). Spatial reasoning is fundamental to human cognition and relies on specialized neural structures such as cortical maps, as suggested by Dehaene et al.[16]. Unlike humans, LLMs lack explicit spatial architectures, raising important questions about their ability to perform tasks that require spatial transformations, recursion, and geometric compression. This study investigates the extent to which OpenAI GPT-4o or GPT-mini can perform spatial reasoning tasks by adapting Dehaene’s geometric language framework into a purely verbal input-output paradigm. Tasks include spatial sequence prediction and intruder detection. Human benchmarks from prior studies and newly collected performance data serve as comparative baselines[16, 17, 18]. By analyzing the performance of GPT-mini and GPT-4o across accuracy, consistency, and error patterns, we aim to identify the limitations of current language models in approximating spatial cognition. Recent research reveals significant performance gaps: GPT-mini and OpenAI o1 underperform humans across all spatial reasoning tasks, particularly those requiring recursive compression and symmetry generalization[13, 15]. Error analysis indicates a reliance on linear stepwise logic rather than the compact, recursive strategies observed in human reasoning. These results underscore the limitations of symbolic reasoning models in replicating spatial cognition and highlight the need for architectural enhancements. Inspired by Dehaene’s Cultural Recycling Hypothesis, we discuss opportunities to integrate spatial
processing modules into LLMs, advancing their ability to approximate human-like spatial reasoning.

1. Introduction

AI language models like ChatGPT have demonstrated impressive skills in language processing and symbolic reasoning, but they struggle with tasks requiring spatial reasoning, the human ability to mentally manipulate objects and understand spatial relationships. This gap underscores a fundamental distinction between AI and human intelligence, linked to biological differences in brain architecture. While humans rely on specialized cortical maps[16] for spatial tasks, AI lacks these structures. This research asks how well, if at all, language models like ChatGPT can replicate human spatial reasoning, and whether the lack of specialized neural architecture limits their ability to approximate human spatial cognition.
Language models excel in symbolic processing, handling complex rules and patterns, but struggle with spatial tasks due to their lack of visual-spatial architecture. AI research shows that these models, although strong in pattern recognition, often fail when spatial relationships or visual information come into play[3, 5, 9, 12]. In contrast, human cognition integrates symbolic and spatial reasoning seamlessly. Studies show that our brains use specialized cortical regions, like the Visual Word Form Area (VWFA), to support both abstract and spatial tasks. Research on the Cultural Recycling Hypothesis suggests that neural circuits are adapted for culture-specific tasks, building on innate spatial-processing structures. For instance, the VWFA, originally devoted to visual recognition, is now also used for reading in literate individuals, showing how human cognition can co-opt spatial structures for symbolic tasks. Humans’ unique ability to blend symbolic thinking, recursion, and spatial reasoning has evolved through both biology and culture. This “cognitive singularity” enables tasks that AI cannot replicate, reflecting a fundamental architectural limitation[7, 16, 18].

Although language models are skilled in language processing, there is limited research on their capacity for spatial reasoning. This gap constrains AI’s potential in fields requiring embodied cognition, like robotics and navigation. Addressing it could improve AI’s theoretical and practical capabilities, advancing models that can handle more than abstract language. This study seeks to answer the following central question: How do LLMs compare to humans in performing spatial reasoning tasks, specifically in spatial sequence and intruder detection tasks? Through this research, we aim to uncover fundamental differences in the reasoning mechanisms of LLMs and human cognition, elucidating the limitations of current AI models.

Over the past decade, LLMs have achieved significant breakthroughs in processing
and generating language. Innovations like “chain-of-thought” prompting have improved LLM performance on reasoning tasks by encouraging intermediate reasoning steps[11]. Studies reveal, however, that LLMs’ reasoning abilities are heavily dependent on the structure of their training data and the contextual prompts provided during inference. In contrast, human reasoning is characterized by efficiency, adaptability, and robustness. Cognitive tasks such as arithmetic, spatial navigation, and abstract planning demonstrate human capacity for flexible problem-solving and adaptation. Spatial reasoning, in particular, is deeply embedded in human neural architecture, as seen in tasks like the Corsi block test, where participants memorize and replicate spatial sequences. Human performance on such tasks reflects an interplay of working memory, spatial mapping, and predictive modeling, a synthesis that LLMs currently struggle to replicate. While some researchers propose that the
mechanisms of human cognition and LLMs might converge[14, 19], empirical evidence suggests otherwise[13, 20]. Studies comparing neural activation patterns between humans and LLMs highlight key differences, particularly in tasks involving reasoning[14]. For example, while LLMs rely on token prediction and statistical associations, humans employ a dynamic integration of sensory inputs, memory, and learned heuristics. Despite advancements in LLMs, critical gaps remain in our understanding of their reasoning capabilities. First, the extent to which LLMs can generalize spatial reasoning across modalities remains unclear. Existing studies focus primarily on linguistic reasoning, leaving spatial reasoning largely underexplored. Second, there is limited understanding of how LLMs integrate spatial reasoning with linguistic tasks and whether deficits in one modality affect performance in the other. Finally, the robustness of “chain-of-thought” reasoning in spatial contexts has not been
rigorously tested, particularly in comparison to human benchmarks. These gaps hinder our ability to evaluate LLMs comprehensively and limit their application in real-world scenarios requiring spatial reasoning, such as robotics, navigation, and interactive AI systems.

To address these gaps, this study will evaluate LLM performance on two spatial reasoning tasks: spatial sequence and intruder detection tasks. Spatial sequence tasks require participants to predict the next step in a spatial pattern, testing their ability to understand and extrapolate geometric transformations. Intruder detection tasks involve identifying an irregular shape among a set of regular shapes, challenging participants to discern subtle spatial anomalies. By comparing LLM outputs to human benchmarks, we aim to isolate performance gaps and identify underlying mechanisms driving these differences. We hypothesize that humans will outperform LLMs in both tasks due to innate cognitive advantages in spatial
reasoning. LLM performance is expected to degrade significantly as task complexity increases, reflecting limitations in their training and reasoning frameworks. Alternatively, if LLMs perform comparably to humans, this could indicate that certain architectural features enable emergent reasoning capabilities. Findings from this study will provide critical insights into the strengths and weaknesses of LLMs in spatial contexts, informing the design of next-generation models and advancing interdisciplinary research in AI and cognitive science.

2. Methods: Computational Assessment of Spatial Reasoning Performance

This study aims to evaluate ChatGPT’s reasoning capabilities using geometric sequences and to compare its performance across increasingly complex geometric sequence tasks. Each aim is linked to experiments designed to measure performance, analyze mechanisms, and, if applicable, benchmark against human cognition. We developed a script to call the OpenAI API with a
prompt consisting of a template image and an order image.

2.1 Experimental Framework and Design

This study employed a computational paradigm to evaluate spatial reasoning capabilities through sequence prediction tasks. The core investigation centered on assessing how effectively an artificial intelligence system could interpret and reproduce spatial patterns when presented with complex visual stimuli. The experimental design incorporated five distinct spatial templates as reference frameworks, with each template[Fig 1] featuring a unique circular arrangement of numbered elements.

Fig 1: Sample template

For each template configuration, the system was presented with test images containing sequences of filled circles requiring left-to-right ordering[Fig 2]. We tested a total of 12 geometric sequences[Fig 3]. The independent variable was the template configuration, while the dependent variable was sequence prediction accuracy, measured through Hamming distance. To establish performance
baselines, the experiment incorporated chance-level comparisons through random sequence permutations. The experimental protocol executed 10 trials per stimulus to account for inherent stochasticity in AI-generated responses. This replication strategy provided robust performance metrics while accommodating the probabilistic nature of generative AI systems. The comprehensive design yielded 50 experimental conditions (5 templates × 10 trials), generating sufficient data for meaningful statistical analysis of spatial reasoning capabilities.

Fig 2: Order images, top to bottom: Clockwise (top), Zigzag (center), 2 Arcs (bottom)

2.2 Stimuli

2.2.1 Spatial Template Creation

Five distinct spatial templates served as foundational reference frameworks for the experiment. Each template consisted of a circular arrangement of eight open circles, with individual circles containing unique numeric identifiers (1-8) at fixed positions. These templates were stored as PNG image files
(template.png) and represented different spatial configurations requiring mental transformation and positional reasoning. The circular arrangement was deliberately selected to introduce rotational complexity, preventing simple linear mapping strategies.

Fig 3: (A) Geometrical rules for sequences: rotations (+1, +2, -1, -2), axial symmetries (H, V, A, B), and rotational symmetry (P). Each octagon location is reachable via these primitives. (B) Experimental Paradigm: Participants predict the next location of an orange dot on the octagon. (C) Sequence examples[17]

2.2.2 Test Image Design

For each template[Fig 1], a series of test images was generated featuring eight horizontally arranged rings[Fig 2]. Each ring contained exactly one filled circle among seven open circles, creating sequences that required identification and ordering. The spatial correspondence between test images and their parent templates was maintained through shared positional logic. The filled
circles' positions in test images directly corresponded to the numbered positions in the template ring, requiring cross-configuration mental mapping.

2.2.3 Metadata Specification

Each template folder included an order.txt file containing ground-truth sequences that specified the correct left-to-right ordering of filled circles for every test image. These reference sequences were formatted as comma-separated numeric strings, enabling automated accuracy verification during analysis. The metadata structure ensured precise validation of spatial reasoning outputs against predetermined correct solutions.

2.3 Computational Architecture

The experimental pipeline was implemented in Python 3.10, leveraging specialized libraries for image processing, distance metrics, and AI model interaction. The core architecture comprised three integrated modules:

Stimulus Processing Module: This component managed image encoding and preparation for AI input. Local template images were converted
to base64-encoded data URIs with appropriate MIME type identification (image/jpeg or image/png). The encoding process preserved visual fidelity while ensuring compatibility with the AI model's input requirements. Test images were dynamically referenced through URL construction, enabling efficient access to remote stimulus repositories.

AI Inference Module: The system employed OpenAI's GPT-4o multimodal model through its official API. The model was configured with a temperature setting of 1.0 to balance creativity and consistency across trials. A strict response protocol was enforced via system prompts that mandated comma-separated numeric outputs without explanatory text. The instructional design emphasized spatial transformation requirements: "Using the indices in the template, what is the order of filled circles from left to right?"

Evaluation Module: This component implemented the following distance metrics for performance assessment: 1.
Normalized Hamming Distance: Computed position-wise mismatches between equal-length sequences using SciPy's hamming distance implementation, with result scaling to absolute counts.

2.4 Experimental Procedure

The execution workflow followed a rigorous multi-stage process:
1. Initialization Phase: System paths were configured for template storage, test image access, and results output. Directory structures were validated to ensure all required assets (images, metadata files) were accessible. A CSV output file was initialized with comprehensive headers tracking template IDs, stimulus identifiers, correct sequences, and distance metrics.
2. Stimulus Presentation Loop: For each template configuration (1-5), the system iterated through associated test images. Each stimulus pair (template + test image) was processed through ten independent trials to capture performance variability. The template image was base64-encoded and paired with its corresponding test image URL to form a complete input package.
3. AI Query Execution: For each trial, the GPT-4o model received a structured prompt containing:
○ A textual description emphasizing spatial relationships
○ The encoded template image[Fig 1]
○ The test image URL[Fig 2]
Model responses were captured as raw strings and parsed into ordered sequences.
4. Performance Assessment: Predicted sequences were compared against ground-truth references using both distance metrics. Concurrently, chance-level performance was established by generating random permutations of correct sequences and computing identical distance metrics. All results were appended to the growing dataset with trial-level granularity.
5. Persistence and Output: Upon completing all templates and trials, comprehensive results were written to a CSV file with appropriate error handling for failed trials. The output structure enabled subsequent statistical analysis with preservation of trial-level details.

2.5 Analytical Approach

The analytical framework employed quantitative assessment of spatial reasoning performance through several dimensions:

Primary Accuracy Metrics: Hamming distance served as a measure of sequence prediction fidelity. Lower values indicated superior spatial reasoning performance, with perfect matches yielding zero distances. The Hamming distance provided additional positional accuracy insights when sequence lengths matched.

Comparative Analysis: Model performance was systematically compared against chance-level baselines generated through random permutations. This comparison established whether the AI system's spatial reasoning capabilities exceeded random guessing, with statistical significance determined through paired difference testing.

Error Pattern Examination: Systematic analysis of common error types (transpositions, omissions, insertions) provided insights into characteristic failure modes in spatial reasoning. Error distributions across
template configurations revealed potential interactions between spatial arrangement complexity and reasoning performance.

Performance Aggregation: Results were aggregated across multiple dimensions:
○ Per-template performance profiles
○ Cross-template comparative analysis
○ Trial-to-trial variability assessment
○ Distance metric correlations

The comprehensive dataset enabled both qualitative and quantitative evaluation of spatial reasoning capabilities, with particular attention to consistency, accuracy, and error characteristics across diverse spatial configurations.

2.6 Implementation Specifications

The experimental system operated on a Linux-based computational platform with the following technical specifications:
○ Core Libraries: openai (0.27.0), scipy (1.13.0)
○ AI Model: GPT-4o (2024-05-13 snapshot)
○ Image Encoding: Base64 with MIME type detection
○ Data Persistence: CSV output with trial-level granularity
○ Randomization: Python random module with
system seeding

All software components were containerized to ensure computational reproducibility, with dependency versions explicitly fixed to prevent behavioral drift. The complete implementation and dataset have been archived for verification and replication purposes.

3. Analysis

The analysis focused on examining the relationship between spatial reasoning performance, expressed as normalized Hamming distance, and the complexity of spatial patterns as quantified by their Minimum Description Length (MDL). The analysis was conducted using Python 3.10 with pandas, matplotlib, NumPy, and statsmodels as the primary computational tools. All analyses adhered to reproducible, containerized workflows consistent with the implementation specifications described previously.

3.1 Data Integration and Preprocessing

The experimental output file containing trial-level Hamming distance metrics was first imported into a pandas DataFrame. A secondary dataset containing symbolic
representations for each spatial pattern was also imported and merged with the primary dataset via the common Image identifier. This integration ensured that each MDL value and accuracy metric could be directly associated with its corresponding symbolic pattern for subsequent visualization. Unicode compatibility was preserved by dynamically detecting available system fonts and selecting the first available from a predefined list optimized for symbol rendering (e.g., Apple Symbols, Symbola, Noto Sans Symbols 2).

3.2 Data Normalization

To enable direct comparison across patterns, the Avg GPT (Hamming) values were normalized to a 0–1 scale using the global minimum and maximum values across the full dataset. This normalization was performed prior to all visualization and regression procedures, allowing consistent y-axis scaling across multiple plots and facilitating comparative interpretation.

3.3 Visualization Strategy

Data were visualized as scatter plots of MDL (x-axis) versus
normalized average Hamming distance (y-axis), with each point annotated by its corresponding pattern symbol. Symbol-only plots were first generated for two dataset configurations:
1. Full Set: All 12 spatial patterns
2. Dehaene Subset: A reduced set of 9 patterns, excluding those flagged for omission via an Ignore column.
Axes were fixed across all visualizations (MDL range: 5–16; normalized Hamming range: 0–1) to maintain comparability. Grid lines, bold axis labels, and consistent font sizing were applied for clarity in presentation.

3.4 Regression Analysis

To examine potential linear trends between MDL and normalized performance, Ordinary Least Squares (OLS) regression models were fit using statsmodels. Regression analyses were conducted under three conditions for both the full set and the Dehaene subset:
1. Full Fit: Including all available data points
2. Cutoff Fit: Restricting the model to MDL ≤ 10, with higher-MDL points visually shaded and excluded
from fitting.
3. No-Fit Display: Displaying the scatter plot without a regression line for descriptive visualization only.
For fitted models, regression outputs included slope, intercept, standard error, 95% confidence intervals, standardized beta coefficients, p-values, and coefficient of determination (R²). Standardized beta values were computed to facilitate comparability of effect sizes across different model configurations.

3.5 Confidence Intervals and Model Diagnostics

For each fitted regression, model predictions were computed across a continuous MDL range using the fitted parameters. Associated 95% confidence intervals were calculated and plotted as shaded bands, providing a visual representation of model uncertainty. These intervals allowed for the qualitative assessment of model robustness and the potential influence of extreme MDL values.

3.6 Analytical Outputs

The analysis produced six primary plots each for GPT-4o and GPT-o4 mini:
1. Full 12-Pattern Scatter (Symbol-Only)
2. Dehaene 9-Pattern Scatter (Symbol-Only)
3. Full 12-Pattern with OLS Fit
4. Dehaene 9-Pattern with OLS Fit
5. Full 12-Pattern with MDL ≤ 10 OLS Fit
6. Dehaene 9-Pattern with MDL ≤ 10 OLS Fit
These plots serve both as visual summaries of spatial reasoning performance trends and as empirical checks for hypothesized relationships between task complexity (MDL) and model accuracy. Statistical outputs from the OLS regressions will be further interpreted in the Results section to determine whether performance patterns align with theoretical expectations.

4. Results

Fig 4: Scenario 1: No Fit Line. Left: GPT-4o; Right: GPT-o4 mini. Top: 12 symbolic patterns; Bottom: 9 patterns from Dehaene et al. (human-relevant set). Shows normalized edit distance vs MDL without regression fits.

Humans (Fig 8) outperformed both LLMs overall. Note that human participants memorized patterns without mapping random templates, while LLMs solved a
different, context-window–based reasoning task. Each red symbol in Figures 4, 6, and 7 represents one spatial sequence pattern (mapped to a unique shape[Fig 5]), plotted by its Minimum Description Length (MDL) on the x-axis vs. GPT-4o’s normalized average Hamming error on the y-axis.

Fig 5: Symbols chart

LLM performance on the spatial sequence prediction tasks was below human levels and showed wide variability across different patterns. GPT-4o (the multimodal GPT-4 model) did achieve better-than-chance accuracy on some sequences, but it struggled with many others. Its mean sequence prediction error (measured by Hamming distance) across all 12 patterns was ≈5.6 (out of 8), compared to a chance-level error of ~6.6 [Fig 6]. In other words, GPT-4o on average placed only about 2–3 out of 8 positions correctly, modestly surpassing random guessing. For instance, GPT-4o performed relatively well on the “2Arcs” pattern (average Hamming ≈ 3.8 vs chance ~7.3), indicating it could partially follow that spatial rule. However, it struggled even on some intuitively simple patterns: on the “Clockwise” rotation sequence (the lowest-complexity pattern, MDL = 5), GPT-4o’s error was about 6.0 (nearly as high as chance ~7.3), suggesting it failed to detect the circular progression[Fig 4]. Other patterns that require symmetry or alternating logic, such as “Alternate” and “4points”, also saw high error rates (around 6.1–6.2, approaching chance levels). These high errors indicate the model often outputs almost entirely incorrect orderings for those sequences[Fig 4].

Fig 6: Scenario 2: OLS Fit (All MDL). Same layout as Figure 4, now with OLS regression lines and 95% CIs. Across the full MDL range, slopes are flat and non-significant for both models, indicating MDL does not predict error in this setting. For context, human data from Dehaene [Fig 8 C] show a strong linear MDL–performance relationship, but that trend is not observed here
under the LLM paradigm.

Interestingly, the smaller GPT-mini model (ChatGPT) performed similarly, and occasionally even outperformed GPT-4o on certain patterns. GPT-mini’s overall mean error was ~5.0, slightly lower (better) than GPT-4o’s ~5.6, though still far above human performance. For example, GPT-mini handled the “Clockwise” sequence much better than GPT-4o (average Hamming ~2.5, indicating it mostly got the order correct), whereas GPT-4o nearly failed that pattern. GPT-mini also achieved lower errors on patterns like “ZigZag” (∼4.0 vs GPT-4o’s ~5.5) and “Irregular” (∼4.5 vs GPT-4o’s ~5.2). This result was unexpected, as GPT-4o is the more advanced model; it suggests that having a more powerful language model did not guarantee better spatial reasoning. In some cases GPT-4o may have been overfitting to spurious visual cues or simply did not interpret the stimuli as effectively as the smaller model. Both models, however, remained well below human-level accuracy on all sequences. Human participants (from prior studies) are able to predict these spatial sequences with near-perfect accuracy in most cases, even for complex patterns[16, 17]. By contrast, the LLMs frequently produced only partially correct sequences with several mistakes, indicating a fundamental gap in spatial sequence learning.

Fig 7: Scenario 3: OLS Fit (MDL ≤ 10). Same layout as Figure 4, fits restricted to simpler patterns (MDL ≤ 10). GPT-4o remains flat; GPT-o4 mini shows a suggestive (non-significant) increase in error with MDL in the 9-pattern subset, hinting at a human-like direction of effect within this restricted range. Humans still perform better overall, and comparisons should note the task difference (human canonical task vs. LLM template mapping without video).

4.1 Relationship Between Sequence Complexity (MDL) and GPT-4o Performance

The blue dashed line indicates the best-fit linear regression, with the shaded region showing the 95%
confidence interval for the fit. No significant correlation was found: the regression slope is essentially zero (≈ –0.015, 95% CI [–0.056, 0.027]), with p ≈ 0.46 and R² = 0.056 [Fig 6]. In other words, GPT-4o’s errors did not systematically increase with pattern complexity. Some low-complexity patterns (left side of the plot) yielded high error, while a high-complexity pattern (“Irregular”, MDL 16) had only moderate error, leading to a flat trendline. For comparison, human performance exhibits a strong linear relationship with complexity: in prior studies, human anticipation accuracy declines nearly linearly as MDL rises, with R² ≈ 0.86 (indicating much higher error on complex sequences and minimal error on simple ones)[16, 17].

Fig 8: Left: Behavioral evidence: the human error rate in storing the sequence in memory, anticipating the following items, and detecting outliers is monotonically related to MDL; here the graph indicates the percentage of anticipatory eye movements.[16, 17] Right: Scenario 3, OLS Fit (MDL ≤ 10): OLS fits restricted to simpler patterns (MDL ≤ 10) for 9 patterns on GPT-o4 mini.

We further examined a subset of nine “canonical” patterns that were originally studied by Dehaene et al.[17], excluding a few additional sequences we introduced. This subset included sequences like Repeat, Alternate, 2Arcs, 2Squares, 4Diagonals, 4points (segments), 2Rectangles, 2Crosses, and Irregular. Even within this subset, no meaningful performance trend emerged. A linear fit on these nine points yielded R² ≈ 0.05 (p ≈ 0.54), essentially a flat line[Fig 6]. This contrasts with the human benchmark on the same patterns[17, 18], where complexity accounted for a large portion of variance in performance. In summary, GPT-4o and GPT-mini did not show the expected gradation of performance with sequence complexity. Instead, their success or failure seemed idiosyncratic to each pattern, hinting that they might be using superficial or
ad-hoc strategies rather than an integrated spatial reasoning approach.

Across Figures 4, 6, and 7, both models show high, variable errors with no clear link between complexity (MDL) and accuracy. Linear fits over all 12 patterns are essentially flat and non-significant for both GPT-4o and GPT-o4 mini. When restricting to MDL ≤ 10, GPT-4o remains flat, while GPT-o4 mini shows a suggestive increase in error with complexity in the 9 canonical patterns (not statistically significant)[Fig 7]. Aside from a single near-zero-error case at low MDL (~5) for GPT-o4 mini, performance is idiosyncratic, indicating that a simple linear MDL metric does not capture the models’ difficulty landscape.

4.2 Human Benchmark and Task Comparability

To make the human–LLM contrast explicit, Fig. 8 (left) reproduces the Dehaene human benchmark, which shows a clear, near-linear relationship between MDL and performance, whereas Figs. 6 and 7 show flat or only suggestive trends for the LLMs.
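
The two fitting scenarios contrasted in Figs. 6 and 7 (a fit over the full MDL range versus a fit restricted to MDL ≤ 10) can be illustrated with a minimal sketch. This is not the thesis pipeline, which used pandas and statsmodels; it is a standard-library approximation that reports only slope, intercept, and R². The function names and the MDL and Hamming values below are hypothetical and chosen purely for illustration. Following Section 3.2, the errors are min-max normalized over the full set before any cutoff is applied.

```python
from typing import List, Optional, Tuple


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to 0-1 using the global minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def ols_fit(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Simple one-predictor OLS; returns (slope, intercept, r_squared)."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired observations")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("predictor has zero variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def fit_with_cutoff(
    mdl: List[float], error: List[float], cutoff: Optional[float] = None
) -> Tuple[float, float, float]:
    """Fit error ~ MDL, optionally keeping only points with MDL <= cutoff."""
    kept = [(m, e) for m, e in zip(mdl, error) if cutoff is None or m <= cutoff]
    return ols_fit([m for m, _ in kept], [e for _, e in kept])


# Hypothetical per-pattern values (NOT the thesis data).
mdl_values = [5, 6, 8, 10, 12, 16]
avg_hamming = [6.0, 3.0, 5.0, 4.0, 6.0, 5.0]

normalized = min_max_normalize(avg_hamming)       # normalize first, as in 3.2
full_fit = fit_with_cutoff(mdl_values, normalized)              # all MDL
cutoff_fit = fit_with_cutoff(mdl_values, normalized, cutoff=10)  # MDL <= 10

print("full fit   (slope, intercept, R^2):", full_fit)
print("cutoff fit (slope, intercept, R^2):", cutoff_fit)
```

Comparing the two returned tuples shows how much a few high-MDL points can move the slope, which is the sensitivity the cutoff scenario is meant to expose. Confidence intervals and p-values, as reported in the text, would require a statistics library such as statsmodels.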
Importantly, the tasks are not identical: human participants learned canonical sequences by memorization without having to map a random visual template, while the models had to align an arbitrary template from static images within a limited context window, as current APIs do not accept video. This mismatch likely contributes to the observed gap: humans benefited from stable canonical frames and temporal information, which the models lacked. A strength of the static-image template paradigm is that it is fully reproducible, tightly controlled, and isolates symbolic rule induction from extraneous cues. Its limitation is the absence of temporal information, which may mask competence that would emerge with video-like input. Consequently, the absence of a strong MDL–performance slope in the LLMs should be interpreted as reflecting both architectural differences and the current I/O constraints of these systems.

5. Discussion

Across 12 spatial sequence patterns, GPT-4o and GPT-o4 mini
did not exhibit the human-like complexity gradient: model error did not reliably increase with Minimum Description Length (MDL), whereas human anticipation shows a strong complexity effect. Instead, model accuracy was pattern-specific and often inverted relative to MDL. For example, both models performed poorly on the simple two-point alternation (low MDL) yet comparatively well on the highest-MDL “Irregular” sequence; GPT-o4 mini was strong on clockwise and zigzag, while GPT-4o lagged on those but led on 2-arcs. Overall, GPT-o4 mini outperformed GPT-4o on most patterns, indicating that scaling alone did not yield better spatial reasoning.

The error profiles suggest the models did not abstract the generating rules (rotations, alternations, symmetries) as compact programs. The two-point failure (despite its simplicity) and success on “Irregular” (despite high MDL) imply reliance on surface cues or ad-hoc memorization instead of representing the underlying geometric computation.

In
Dehaene’s human data [Fig. 8 (left)], anticipation accuracy declines almost linearly as MDL increases, with MDL accounting for most of the variance. The closest LLM analogue is our Scenario 3 analysis [Fig. 7], the OLS fit on the nine canonical patterns within MDL ≤ 10. GPT-o4 mini shows a modest, human-like rise in error with MDL, whereas GPT-4o remains essentially flat. The mini model’s trend is suggestive rather than significant and weaker than the human slope, indicating that current LLMs do not robustly internalize the human-style complexity gradient. Task differences likely contribute: humans memorized canonical sequences, while our models had to align a random template from static images (no video input), which may depress LLM performance and blunt any human-like MDL trend.

Transformers process inputs as token sequences and lack explicit spatial state (e.g., a map of positions, axes, or rotations). The mixed results of GPT-4o vs. GPT-o4 mini underscore that more parameters and broader pretraining
are insufficient when the inductive bias for geometry is missing. Progress likely requires hybrid designs (e.g., spatial modules or graphs coupled to LLMs, or algorithmic components that support variables, recursion, and transformations) together with multimodal or embodied training that forces alignment between vision and spatial action.

In light of Dehaene’s Cultural Recycling Hypothesis, humans may repurpose specialized visuospatial circuits for abstract geometry, yielding the robust MDL sensitivity seen in behavioral data. LLMs lack such domain-specific structure, which helps explain both the absent complexity gradient and their idiosyncratic, cue-driven performance. Embedding analogous, reusable spatial systems in AI, so that models can think spatially and not only in tokens, appears necessary to approach human-like abstraction and generalization.

References:

1. Shen, H., Li, T., Li, T. J. J., Park, J. S., & Yang, D. (2023, October). Shaping the emerging norms of
using large language models in social computing research. In Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing (pp. 569-571).
2. Zeng, Z., Yu, J., Gao, T., Meng, Y., Goyal, T., & Chen, D. (2023). Evaluating large language models at evaluating instruction following. arXiv preprint arXiv:2310.07641.
3. Amirizaniani, M., Martin, E., Sivachenko, M., Mashhadi, A., & Shah, C. (2024). Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses. arXiv preprint arXiv:2406.05659.
4. Rostam, Z. R. K., Szénási, S., & Kertész, G. (2024). Achieving Peak Performance for Large Language Models: A Systematic Review. IEEE Access.
5. Wu, Z., Qiu, L., Ross, A., Akyürek, E., Chen, B., Wang, B., & Kim, Y. (2023). Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477.
6. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35, 22199-22213.
7. Dehaene, S. (2021). How we learn: Why brains learn better than any machine... for now. Penguin.
8. Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., & Xie, X. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3), 1-45.
9. Plaat, A., Wong, A., Verberne, S., Broekens, J., van Stein, N., & Back, T. (2024). Reasoning with large language models, a survey. arXiv preprint arXiv:2407.11511.
10. OpenAI. (2024). Learning to Reason with LLMs. OpenAI.com. https://openai.com/index/learning-to-reason-with-llms/
11. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs].
12. Dave, N., Kifer, D., Giles, C. L., & Mali, A. (2024). Investigating Symbolic Capabilities of Large Language Models. arXiv preprint arXiv:2405.13209.
13. Mahowald, K., Ivanova, A. A., Blank, I. A., Kanwisher, N., Tenenbaum, J. B., & Fedorenko, E. (2024). Dissociating language and thought in large language models. Trends in Cognitive Sciences.
14. AlKhamissi, B., Tuckute, G., Bosselut, A., & Schrimpf, M. (2024). The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units. arXiv preprint arXiv:2411.02280.
15. Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S., & Kambhampati, S. (n.d.). PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. https://arxiv.org/pdf/2206.10498
16. Dehaene, S., Al Roumi, F., Lakretz, Y., Planton, S., & Sablé-Meyer, M. (2022). Symbols and mental programs: a hypothesis about human singularity. Trends in Cognitive Sciences, 26(9), 751-766.
17. Amalric, M., Wang, L., Pica, P., Figueira, S., Sigman, M., & Dehaene, S. (2017). The language of geometry: Fast comprehension of geometrical primitives and rules in human adults and preschoolers. PLoS Computational Biology, 13(1), e1005273.
18. Dehaene, S., Izard, V., Pica, P., & Spelke, E. (2006). Core knowledge of geometry in an Amazonian indigene group. Science, 311(5759), 381-384. doi: 10.1126/science.1121739. PMID: 16424341.
19. Caucheteux, C., & King, J. R. (2022). Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1), 134. doi: 10.1038/s42003-022-03036-1. Erratum in: Communications Biology, 2023, 6(1), 396. doi: 10.1038/s42003-023-04776-4. PMID: 35173264; PMCID: PMC8850612.
20. Moro, A., Greco, M., & Cappa, S. F. (2023). Large languages, impossible languages and human brains. Cortex, 167, 82-85.