MathGLance

Evaluating MLLMs on Abstract Visual and Symbolic Understanding in Mathematical Diagrams

Email: shan.zhang@adelaide.edu.au; yanpeng_sun@nus.edu.sg; anton.vandenhengel@adelaide.edu.au

Although both natural images and symbolic diagrams can be represented as grids of pixels, they constitute very different forms of information. Images represent samples of the intensity of the real world, while diagrams are human-constructed and convey geometric concepts through structured symbols and their interrelationships. Figure (a) illustrates that diagrams pose unique challenges for current Multimodal Large Language Models (MLLMs), particularly in fine-grained grounding tasks. Figure (b) shows a positive correlation between low-level perception and high-level reasoning performance, emphasizing that clear diagram perception leads to substantial improvements in mathematical reasoning.

Abstract

Diagrams are a form of visual language, representing complex concepts and their interrelationships through structured symbols, shapes, and spatial arrangements. Unlike natural images, they are inherently symbolic and abstract, and thus pose significant challenges for Multimodal Large Language Models (MLLMs). Current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether MLLMs genuinely understand mathematical diagrams beyond superficial pattern recognition and textual memorization. To address this gap, we introduce MathGLance, a benchmark specifically designed to isolate and evaluate diagram perception in MLLMs. MathGLance comprises 1.2K diagrams and 1.6K carefully curated questions spanning four perception tasks: shape classification, object counting, relationship identification, and object grounding, covering diverse domains including plane geometry, solid geometry, and graphical representations. Our evaluation of MLLMs reveals that their ability to understand diagrams is limited, particularly on fine-grained grounding tasks. In response, we construct GeoPeP, a perception-oriented dataset of 200K samples that represents geometric diagrams as structured graphs capturing primitives, their spatial relationships, and fine-grained bounding boxes. Training an MLLM on GeoPeP leads to significant gains in perceptual accuracy, which in turn substantially improves mathematical reasoning. Our benchmark and dataset establish critical standards for multimodal mathematical understanding and offer valuable resources to advance MLLM research.
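For concreteness, the sketch below shows what one GeoPeP-style structured-graph record could look like. The field names (image, primitives, relations, bbox) and the relation vocabulary are illustrative assumptions, not the released GeoPeP schema.

```python
# Hypothetical sketch of a GeoPeP-style record; field names and relation
# labels are assumptions for illustration, not the released schema.
# Each diagram becomes a graph: primitives are nodes (with pixel-space
# bounding boxes for grounding), relations are labeled edges.
example_record = {
    "image": "diagrams/plane_00042.png",
    "primitives": [
        {"id": "c1", "type": "circle", "bbox": [40, 35, 210, 205]},
        {"id": "l1", "type": "line",   "bbox": [40, 118, 210, 124]},
        {"id": "p1", "type": "point",  "bbox": [122, 118, 128, 124]},
    ],
    "relations": [
        {"subject": "l1", "predicate": "diameter_of", "object": "c1"},
        {"subject": "p1", "predicate": "center_of",   "object": "c1"},
    ],
}
```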

Leaderboard on MathGLance

Accuracy scores on the Plane Geometry (PG), Solid Geometry (SG), and Graphs (G) subsets of MathGLance.

| # | Model | Source | Avg. | PG_ALL | PG_cls | PG_cnt | PG_grd | PG_rlat | SG_ALL | SG_cls | SG_cnt | SG_grd | SG_rlat | G_ALL | G_cls | G_cnt | G_grd | G_rlat |
|---|-------|--------|------|--------|--------|--------|--------|---------|--------|--------|--------|--------|---------|-------|-------|-------|-------|--------|
| 1 | Qwen2.5-VL+-32B (ours) 🥇 | Link | 74.2 | 77.9 | 70.7 | 79.6 | 84.0 | 79.5 | 73.8 | 98.8 | 86.4 | 15.0 | 85.0 | 71.1 | 98.6 | 98.2 | 2.7 | 99.0 |
| 2 | Qwen2.5-VL+-7B (ours) 🥈 | Link | 72.9 | 78.5 | 70.7 | 79.2 | 82.6 | 85.0 | 71.9 | 97.9 | 86.2 | 12.9 | 70.0 | 68.2 | 94.2 | 96.3 | 4.9 | 89.4 |
| 3 | SVE-Math-DeepSeek+-7B (ours) 🥉 | Link | 68.4 | 84.6 | 75.8 | 88.4 | 82.9 | 97.5 | 54.1 | 85.3 | 65.8 | 20.3 | 45.0 | 60.7 | 85.1 | 78.4 | 1.6 | 75.7 |
| 4 | InternVL2.5-38B | Link | 63.1 | 44.0 | 59.9 | 52.0 | 2.5 | 66.0 | 78.8 | 98.8 | 92.8 | 38.1 | 72.5 | 66.5 | 98.6 | 96.3 | 3.2 | 69.7 |
| 5 | Qwen2.5-VL-32B | Link | 62.2 | 43.3 | 56.9 | 54.8 | 0.0 | 67.0 | 72.5 | 98.8 | 89.7 | 1.6 | 87.5 | 68.8 | 91.3 | 100.0 | 1.6 | 97.0 |
| 6 | Qwen2-VL-72B | Link | 59.9 | 42.4 | 51.2 | 50.8 | 17.4 | 52.0 | 71.2 | 97.7 | 84.5 | 6.4 | 77.5 | 66.1 | 76.8 | 98.2 | 16.1 | 84.9 |
| 7 | Qwen2.5-VL-7B | Link | 59.2 | 44.0 | 56.2 | 51.3 | 18.5 | 52.0 | 68.0 | 98.8 | 88.7 | 0.0 | 65.0 | 65.7 | 89.9 | 100.0 | 3.2 | 78.8 |
| 8 | InternLM-XComposer2-7B | Link | 55.6 | 35.8 | 49.4 | 48.8 | 0.0 | 47.0 | 62.9 | 90.7 | 86.6 | 0.0 | 53.8 | 54.6 | 60.9 | 94.4 | 0.0 | 78.8 |
| 9 | GPT-4o | Link | 53.3 | 42.8 | 58.4 | 53.2 | 1.1 | 62.5 | 60.7 | 72.1 | 84.5 | 1.6 | 66.3 | 56.4 | 92.8 | 72.2 | 1.6 | 57.6 |
| 10 | DeepSeek-VL2-Small (16B) | Link | 51.5 | 37.6 | 47.6 | 43.6 | 12.5 | 48.5 | 63.8 | 98.8 | 70.1 | 11.1 | 60.0 | 53.2 | 76.8 | 53.7 | 11.3 | 81.8 |
| 11 | Qwen2-VL-7B | Link | 51.4 | 37.9 | 47.6 | 41.2 | 12.8 | 53.0 | 64.1 | 93.0 | 78.4 | 14.3 | 55.0 | 52.3 | 84.1 | 88.9 | 3.2 | 18.2 |
| 12 | InternVL2.5-8B | Link | 50.7 | 35.0 | 48.8 | 36.0 | 0.0 | 60.0 | 65.6 | 98.8 | 72.2 | 4.8 | 70.0 | 51.4 | 68.1 | 77.8 | 0.0 | 69.7 |
| 13 | mPLUG-owl3-7B | Link | 50.0 | 36.4 | 46.7 | 41.6 | 3.9 | 58.5 | 65.3 | 95.4 | 83.5 | 0.0 | 62.5 | 48.2 | 59.4 | 77.8 | 0.0 | 66.7 |
| 14 | InternVL2-8B | Link | 48.4 | 31.9 | 44.3 | 38.0 | 0.0 | 48.5 | 62.9 | 98.8 | 62.9 | 4.8 | 70.0 | 50.5 | 68.1 | 75.9 | 0.0 | 66.7 |
| 15 | SVE-Math-DeepSeek-7B | Link | 46.6 | 35.4 | 52.4 | 36.0 | 3.56 | 51.0 | 49.4 | 77.9 | 62.9 | 0.0 | 41.3 | 55.1 | 81.2 | 75.9 | 0.0 | 69.7 |
| 16 | MultiMath-7B | Link | 41.8 | 31.2 | 44.0 | 30.4 | 1.07 | 53.0 | 45.7 | 81.4 | 53.6 | 0.0 | 33.8 | 48.6 | 79.7 | 57.4 | 0.0 | 33.8 |
| 17 | Math-LLaVA-13B | Link | 40.0 | 27.9 | 34.4 | 32.4 | 0.0 | 50.5 | 44.8 | 81.4 | 55.7 | 0.0 | 27.5 | 47.3 | 78.3 | 59.3 | 0.0 | 51.5 |
| 18 | GPT-o1 | Link | 36.5 | 15.8 | 33.2 | 11.6 | 0.0 | 14.0 | 41.4 | 75.6 | 52.6 | 0.0 | 23.8 | 52.3 | 82.6 | 81.5 | 0.0 | 39.4 |
| 19 | LLaVA-v1.5-13B | Link | 35.4 | 32.8 | 29.3 | 40.4 | 23.5 | 42.0 | 35.9 | 60.5 | 38.1 | 0.0 | 35.0 | 37.6 | 63.8 | 42.6 | 0.0 | 45.5 |
| 20 | LLaVA-v1.5-7B | Link | 33.3 | 29.2 | 29.0 | 39.6 | 14.2 | 37.5 | 31.6 | 43.0 | 42.3 | 0.0 | 31.3 | 39.0 | 76.8 | 35.2 | 0.0 | 39.4 |
| 21 | DeepSeek-VL2-Tiny | Link | 32.6 | 29.5 | 45.2 | 34.4 | 4.6 | 32.0 | 39.0 | 76.7 | 32.0 | 0.0 | 37.5 | 29.4 | 39.1 | 57.4 | 0.0 | 18.2 |
| 22 | G-LLaVA-7B | Link | 30.3 | 25.6 | 27.8 | 41.2 | 0.4 | 38.0 | 31.3 | 45.4 | 38.1 | 0.0 | 32.5 | 33.9 | 58.0 | 37.0 | 0.0 | 42.4 |

MathGLance Dataset

Overview

The MathGLance benchmark is a novel evaluation framework designed to assess the mathematical perception abilities of Multimodal Large Language Models (MLLMs). Unlike existing benchmarks, which often conflate perception with high-level reasoning, MathGLance isolates perceptual skills by posing questions about mathematical diagrams that require minimal cognitive load, and it provides both quantitative and qualitative assessments at different levels of granularity. The benchmark covers a diverse range of mathematical contexts, including Plane Geometry (66%), Solid Geometry (20%), and Graphical data representations (14%) such as line plots, bar charts, and pie charts. It comprises 1,609 questions over 1,198 unique images, formulated mainly as multiple-choice or true/false questions to streamline evaluation. MathGLance features four key task categories (an illustrative item sketch follows this list):

1) Shape Classification: identifying object classes from visual attributes (e.g., vertices, material, color, size) across 16 plane geometry categories, 3 CLEVR-defined solid objects, and 5 graphical types.
2) Object Counting: counting either the total number of objects or the instances of a specific geometric shape within an image.
3) Relationship Identification: recognizing spatial and mathematical relationships between geometric primitives, covering 4 spatial and more than 10 mathematical relationships.
4) Object Grounding: fine-grained localization, predicting object coordinates (x1, y1, x2, y2) from a textual description.

Overall, MathGLance challenges MLLMs' mathematical perception while minimizing high-level reasoning demands, offering a comprehensive and fine-grained evaluation of their diagram perception abilities.
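The sketch below illustrates, in hypothetical form, how items from the four task categories could be encoded; the field names, option layouts, and answer formats are assumptions for illustration rather than the released annotation files.

```python
# Illustrative (assumed) MathGLance-style items, one per task category.
# Grounding answers are (x1, y1, x2, y2) boxes, as described above.
sample_items = [
    {"task": "classification", "question": "What shape is outlined in red?",
     "options": ["triangle", "rectangle", "circle", "trapezoid"], "answer": "circle"},
    {"task": "counting", "question": "How many triangles appear in the diagram?",
     "options": ["1", "2", "3", "4"], "answer": "3"},
    {"task": "relationship", "question": "Line AB is tangent to circle O. True or False?",
     "options": ["True", "False"], "answer": "True"},
    {"task": "grounding", "question": "Return the bounding box of point C.",
     "answer": [312, 148, 326, 162]},
]
```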

Key statistics and subject-task distribution of MathGLance.

The synthetic construction process for plane geometry. We synthesize geometric figures by randomly sampling elements from the geometric shape pool and relationship pool, ensuring consistency through a verifier that enforces logical constraints based on manually designed rules, fundamental mathematical principles, and prerequisite points. All visual elements are structured and saved in JSON format. Images are rendered using the Matplotlib package, and corresponding Q&A pairs are generated using a template-based pipeline.
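A minimal sketch of this generation loop, assuming toy shape/relation pools and a single hand-written verifier rule, is given below. It mirrors the described flow (sample, verify, serialize to JSON, render with Matplotlib, derive a template Q&A pair) but is not the released generation code.

```python
import json
import random

import matplotlib
matplotlib.use("Agg")              # headless rendering
import matplotlib.pyplot as plt

# Hypothetical pools; the real generator uses much richer vocabularies
# (16 plane-geometry categories, spatial and mathematical relations).
SHAPE_POOL = ["triangle", "square", "circle"]
RELATION_POOL = ["inscribed_in", "tangent_to", "adjacent_to"]

def verify(shape_a, shape_b, relation):
    """Toy rule-based verifier standing in for the logical-constraint checker."""
    if relation == "tangent_to" and "circle" not in (shape_a, shape_b):
        return False               # tangency requires at least one circle
    return True

def sample_diagram(rng):
    """Rejection-sample a (shape, shape, relation) spec until the verifier accepts."""
    while True:
        a, b = rng.sample(SHAPE_POOL, 2)
        rel = rng.choice(RELATION_POOL)
        if verify(a, b, rel):
            return {"shapes": [a, b], "relation": rel}

def render(spec, path):
    """Placeholder rendering step; the real pipeline draws the actual primitives."""
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.set_axis_off()
    ax.text(0.5, 0.5, f"{spec['shapes'][0]} {spec['relation']} {spec['shapes'][1]}",
            ha="center", va="center")
    fig.savefig(path, dpi=150)
    plt.close(fig)

rng = random.Random(0)
spec = sample_diagram(rng)
with open("diagram_0000.json", "w") as f:
    json.dump(spec, f, indent=2)   # structured spec saved as JSON
render(spec, "diagram_0000.png")   # image rendered with Matplotlib
# Template-based Q&A pair derived from the spec:
qa = {"question": "What is the relationship between the two shapes?",
      "answer": spec["relation"].replace("_", " ")}
```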

Experiment Results

Main Results on MathGLance


Performance comparison of different MLLMs on MathGLance across Plane Geometry, Solid Geometry, and Graphs. cls, cnt, grd, and rlat denote the question categories: shape classification, object counting, object grounding, and relationship identification, respectively. all indicates the overall accuracy within each subject, i.e., the ratio of correctly answered questions to the total number of questions in that subject, while Avg. denotes the average all score across the three subjects.
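The scoring just described can be reproduced in a few lines. The snippet below is a sketch that assumes per-question records tagged with a subject and a correctness flag; it is not the official evaluation script.

```python
from collections import defaultdict

def score(records):
    """records: iterable of dicts like {"subject": "PG", "correct": True}."""
    per_subject = defaultdict(lambda: [0, 0])   # subject -> [num_correct, num_total]
    for r in records:
        per_subject[r["subject"]][0] += int(r["correct"])
        per_subject[r["subject"]][1] += 1
    # "all" per subject: correct / total within that subject, as a percentage
    all_scores = {s: 100.0 * c / t for s, (c, t) in per_subject.items()}
    # "Avg.": unweighted mean of the per-subject "all" scores
    avg = sum(all_scores.values()) / len(all_scores)
    return all_scores, avg

# Toy usage with three dummy records
all_scores, avg = score([
    {"subject": "PG", "correct": True},
    {"subject": "SG", "correct": False},
    {"subject": "G",  "correct": True},
])
```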

More Results


Model Responses

BibTeX

@article{sun2025mathglance,
  author  = {Yanpeng Sun and Shan Zhang and Wei Tang and Aotian Chen and Piotr Koniusz and Kai Zou and Yuan Xue and Anton van den Hengel},
  title   = {MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams},
  journal = {arXiv preprint arXiv:2503.20745},
  year    = {2025}
}
@inproceedings{zhang2025primitive,
  author    = {Shan Zhang and Aotian Chen and Yanpeng Sun and Jindong Gu and Yi-Yu Zheng and Piotr Koniusz and Kai Zou and Anton van den Hengel and Yuan Xue},
  title     = {Primitive Vision: Improving Diagram Understanding in MLLMs},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025}
}