Predicting future states is a critical mission in computer vision research – not least in robotics, where real-world situations must be considered. Machine learning systems entrusted with mission-critical tasks therefore need adequate understanding of the physical world.
However, in some cases, an apparently impressive knowledge of temporal reality could be deceptive: a new paper from the United Arab Emirates has found that state-of-the-art Multimodal Large Language Models (MLLMs), including sector leaders GPT-4o and Google Gemini, fall short when it comes to interpreting how time is represented in images.
Example sequential pairs (see image below), which would be unchallenging for humans even when put in the wrong order, can confound advanced MLLMs when presented in unexpected contexts or configurations (such as second-image-first, pairs concatenated into a single image, or sequences of multiple images which may or may not represent the correct temporal order, and so on).
The researchers tasked the models with basic temporal reasoning challenges, such as determining event order or estimating time gaps, and found that the seven MLLMs tested performed notably below human accuracy:
‘Overall, the [results] reveal that all current MLLMs, including GPT-4o – the most advanced model in our evaluation – struggle with the proposed benchmark. Despite GPT-4o’s superior performance relative to other models, it fails to consistently demonstrate accurate temporal reasoning across different settings.
‘The consistent accuracy scores are notably low for all models, indicating significant limitations in their ability to comprehend and interpret temporal sequences from visual inputs. These deficiencies are evident even when models are provided with multi-image inputs or optimized prompts, suggesting that current architectures and training methodologies are insufficient for robust temporal order understanding.’
Machine learning systems are designed to optimize not only for the most accurate results, but also for the most efficient and people-pleasing ones*. Since they do not reveal their reasoning explicitly, it can be difficult to tell when they are cheating, or using ‘shortcuts’.
In such a case, the MLLM may arrive at the right answer by the wrong method. The fact that such an answer can be correct may inspire false confidence in the model, which could produce incorrect results by the same method in later tasks presented to it.
Worse yet, this misdirection can become even more deeply embedded in the development chain if humans are impressed by it and give positive feedback in trials and annotation sessions, which may in turn shape the direction that the data and/or the model takes.
In this case, the suggestion is that MLLMs are ‘faking’ a true understanding of chronology and temporal phenomena by anchoring on secondary indicators, such as time-stamps in video data, the order of images in a layout, or even (potentially) sequentially-numbered file-names.
It further indicates that MLLMs currently fail to satisfy any real definition of having generalized a concept of temporal phenomena – at least, to the extent that humans can.
The new paper is titled Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!, and comes from three researchers at the Mohamed bin Zayed University of Artificial Intelligence and Alibaba International Digital Commerce.
Data and Tests
The authors note that prior benchmarks and studies, such as MMMU and TemporalBench, concentrate on single-image inputs or else formulate questions for the MLLMs that may be rather too easy to answer, and may not uncover a tendency towards shortcut behavior.
Therefore the authors offer two updated approaches: Temporal Order Understanding (TOU) and Time-lapse Estimation (TLE). The TOU approach tests the models on their ability to determine the correct sequence of events from pairs of video frames; the TLE method evaluates the MLLM’s ability to estimate the time difference between two images, ranging from seconds to years.
The researchers curated image pairs for the TOU benchmark using open-source videos from Pixabay and Pexels, so that it would be possible to make the dataset available via a GUI.
The videos covered a range of subjects, from people in everyday activities to non-human content such as animals and plants. From these, pairs of frames were selected to depict a sequence of events with sufficient variation to make the starting frame ‘obvious’.
Human selection was used to ensure that the frames could be definitively ordered. For example, one of the curated pairs shows a partially-filled teacup in one frame, and the same cup fully filled with tea in the next, making the sequence logic easy to identify.
In this way, 360 image pairs were obtained.
For the TLE approach, copyright-free images were chosen from Google and Flickr, as well as select frames from copyright-free videos on YouTube. This material featured scenes or objects whose interval of change ranged from seconds to days to whole seasons: for example, ripening fruit, or a landscape passing through the seasons.
Thus 125 image pairs were curated for the TLE method.
Not all of the MLLMs tested were able to process multiple images; therefore tests differed to accommodate each model’s capabilities.
Multiple versions of the curated datasets were generated, in which some of the pairs were concatenated vertically, and others horizontally. Further variations reversed the true temporal sequence of the pairs.
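To make the concatenation and order-swapping concrete, the following is a minimal sketch of how such variants could be generated with the Pillow library. It is not the authors' actual tooling; the make_variants function name and the layout keys are illustrative assumptions.

```python
from PIL import Image

def make_variants(first_path: str, second_path: str):
    """Build concatenated variants of a frame pair: horizontal and vertical
    layouts, each in true and reversed temporal order.
    (Illustrative sketch only; not the authors' actual pipeline.)"""
    first, second = Image.open(first_path), Image.open(second_path)

    def concat(a: Image.Image, b: Image.Image, axis: str) -> Image.Image:
        if axis == "horizontal":
            canvas = Image.new("RGB", (a.width + b.width, max(a.height, b.height)))
            canvas.paste(a, (0, 0))
            canvas.paste(b, (a.width, 0))
        else:  # vertical
            canvas = Image.new("RGB", (max(a.width, b.width), a.height + b.height))
            canvas.paste(a, (0, 0))
            canvas.paste(b, (0, a.height))
        return canvas

    return {
        ("horizontal", True):  concat(first, second, "horizontal"),   # true order
        ("horizontal", False): concat(second, first, "horizontal"),   # reversed
        ("vertical", True):    concat(first, second, "vertical"),
        ("vertical", False):   concat(second, first, "vertical"),
    }
```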
Two prompt-types were developed. The first followed this template:
Did the event in the (left / top / first) image happen before the event in the (right / bottom / second) image? State true or false with reasoning.
The second followed this schema:
Between these two images, which one depicts the event that happened first? State (left or right / top or bottom / first or second) with reasoning.
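A hedged sketch of how the two templates above might be instantiated for each layout follows; build_tou_prompt is an illustrative helper, not code from the paper.

```python
def build_tou_prompt(layout: str, prompt_type: int) -> str:
    """Fill the layout-dependent wording into the two TOU prompt templates
    quoted above. `layout` selects the phrasing for concatenated
    (left/right, top/bottom) or multi-image (first/second) inputs."""
    terms = {
        "horizontal": ("left", "right"),
        "vertical": ("top", "bottom"),
        "multi-image": ("first", "second"),
    }[layout]
    if prompt_type == 1:
        return (f"Did the event in the {terms[0]} image happen before the event "
                f"in the {terms[1]} image? State true or false with reasoning.")
    return ("Between these two images, which one depicts the event that happened "
            f"first? State {terms[0]} or {terms[1]} with reasoning.")
```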
For TLE, the questions were multiple-choice, asking the models to estimate the time-lapse between the two presented images, with seconds, minutes, hours, days, months and years available as the time-units. In this configuration, the most recent image was presented on the right.
The prompt used here was:
In the given image, estimate the time that has passed between the first image (left) and the second image (right).
Choose one of the following options:
A. Less than 15 seconds
B. Between 2 minutes to 15 minutes
C. Between 1 hour to 12 hours
D. Between 2 days to 30 days
E. Between 4 months to 12 months
F. More than 3 years
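As a rough illustration, the prompt above could be assembled programmatically, and a model's free-text reply mapped back to an option letter, along the following lines; the extract_choice heuristic is an assumption, not the authors' parser.

```python
import re

# Option text taken from the TLE prompt quoted above.
TLE_OPTIONS = {
    "A": "Less than 15 seconds",
    "B": "Between 2 minutes to 15 minutes",
    "C": "Between 1 hour to 12 hours",
    "D": "Between 2 days to 30 days",
    "E": "Between 4 months to 12 months",
    "F": "More than 3 years",
}

def build_tle_prompt() -> str:
    """Assemble the multiple-choice time-lapse prompt quoted above."""
    header = ("In the given image, estimate the time that has passed between "
              "the first image (left) and the second image (right).\n"
              "Choose one of the following options:\n")
    return header + "\n".join(f"{key}. {text}" for key, text in TLE_OPTIONS.items())

def extract_choice(model_reply: str) -> str | None:
    """Pull the first option letter (A-F) out of a free-text reply;
    a simple heuristic, not taken from the paper."""
    match = re.search(r"\b([A-F])\b", model_reply)
    return match.group(1) if match else None
```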
The MLLMs tested were GPT-4o; Gemini 1.5 Pro; LLaVA-NeXT; InternVL; Qwen-VL; Llama-3-vision; and LLaVA-CoT.
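Tying the TOU pieces together, a minimal evaluation loop might look like the sketch below, which reuses the make_variants and build_tou_prompt sketches above. The query_model callable stands in for whichever model API is being tested, and the substring check on the reply is a deliberately crude stand-in for proper answer parsing.

```python
def evaluate_tou(frame_pairs, query_model, layout="horizontal", prompt_type=2):
    """Score the TOU task over concatenated pairs in both orderings.
    `query_model(image, prompt) -> str` is a hypothetical wrapper around
    whichever MLLM API is in use (not a real library call)."""
    labels = {"horizontal": ("left", "right"), "vertical": ("top", "bottom")}[layout]
    prompt = build_tou_prompt(layout, prompt_type)
    correct, total = 0, 0
    for first_path, second_path in frame_pairs:
        for (axis, true_order), image in make_variants(first_path, second_path).items():
            if axis != layout:
                continue
            # The earlier event sits on the left/top only when the pair is in true order.
            expected = labels[0] if true_order else labels[1]
            correct += expected in query_model(image, prompt).lower()
            total += 1
    return correct / total if total else 0.0
```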
Temporal Order Understanding: Results
Regarding the results shown above, the authors found that all tested MLLMs, including GPT-4o (which showed the best overall performance), struggled significantly with the TemporalVQA benchmark – and even GPT-4o failed to consistently exhibit reliable temporal reasoning across different configurations.
The authors contend that the consistently low accuracy across the MLLMs highlights significant shortcomings in the models’ ability to interpret and reason about temporal sequences from visual data. The researchers note that these challenges persist even with the use of multi-image inputs and optimized prompts, pointing to fundamental limitations in current model architectures and training methods.
The tests showed significant variations in performance across prompting strategies. While GPT-4o improved with optimized prompts (reaching 46% in single-image and 65.3% in multi-image settings), performance remained below acceptable levels.
Models such as LLaVA-NeXT and Qwen-VL were even more sensitive, with performance declining when alternate prompts were used, suggesting that prompt engineering alone cannot overcome the MLLMs’ fundamental limitations in regard to temporal reasoning.
Tests also indicated that image layout (i.e., vertical vs. horizontal) significantly impacted model performance. GPT-4o improved its consistency with vertical arrangements, rising from 39.2% to 52.8%; however, other models, including the LLaVA strains, showed strong directional biases, excelling in one orientation but failing in another.
The paper indicates that these inconsistencies suggest reliance on spatial cues, rather than true temporal reasoning, with the MLLMs not genuinely analyzing the sequence of events or understanding the progression over time. Instead, they appear to have relied on patterns or visual features related to the layout of images, such as their position or alignment, in order to make decisions.
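As an aside, the ‘consistency’ figures reported above plausibly require a model to get the same pair right under both orderings. The sketch below shows one way such a score could be computed; this definition is an assumption rather than a detail confirmed by the paper.

```python
def consistency_score(results: dict[str, bool]) -> float:
    """One plausible consistency measure (an assumption, not the paper's
    stated definition): a pair counts as consistent only if the model
    answers correctly for both the true-order and reversed-order
    presentations of that pair.

    `results` maps keys such as 'pair042/true' and 'pair042/reversed'
    to whether the model's answer was correct for that presentation."""
    pair_ids = {key.rsplit("/", 1)[0] for key in results}
    consistent = sum(
        results.get(f"{pid}/true", False) and results.get(f"{pid}/reversed", False)
        for pid in pair_ids
    )
    return consistent / len(pair_ids) if pair_ids else 0.0
```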
Comparison tests between single-image and multi-image inputs demonstrated limited overall improvement, with GPT-4o performing slightly better on multi-image input, rising from 31.0% to 43.6% (with P1) and 46.0% to 65.3% (with P2).
Other models, such as InternVL, demonstrated stable but low accuracy, while Qwen-VL saw minor gains. The authors conclude that these results indicate that additional visual context does not substantially enhance temporal reasoning capabilities, since models struggle to integrate temporal information effectively.
Human Study
In a human study, three surveys were conducted to assess how closely the best-performing MLLM performed against human estimation.
Humans achieved 90.3% accuracy, outperforming GPT-4o’s 65.3% by 25 percentage points. The dataset proved reliable, with minimal human errors and consistent agreement on correct answers.
Time-lapse Estimation: Results
In these tests, the MLLMs performed, at best, only adequately on time-lapse estimation: GPT-4o achieved 70% accuracy, but the other models performed significantly worse (see table above), and performance also varied notably across the various time scales.
The authors comment:
‘The task of time-lapse estimation tests the ability of MLLMs to infer temporal intervals between image pairs. [All] MLLMs, including top performers like GPT-4o and Gemini1.5-Pro, struggle with this task, achieving only moderate accuracy levels of 60-70%. GPT-4o shows inconsistent performance, with strong performance in Seconds and Years but underperforming in Hours.
‘Similarly, LLaVA-CoT demonstrates exceptional performance in the time spans of Seconds and Days, while showing notably poor performance in the other time intervals.’
Human Study
In the human study for TLE, average human performance exceeded that of GPT-4o (again the best-performing model in this category) by 12.3%.
The authors note that some of the challenges were particularly demanding, and that in one case all of the human participants returned a wrong answer, as did all of the AI models.
The authors conclude that GPT-4o exhibits ‘reasonably robust reasoning capabilities, notwithstanding the order of images presented to it’.
Conclusion
If MLLMs eventually amass and absorb enough ‘shortcut’ data to cover even the trickiest challenges of the type presented by the authors in this study, whether or not they can be said to have developed human-style generalization capabilities in this domain could become a moot point.
Neither is it known exactly by what route we obtain our own abilities in temporal reasoning: do we likewise ‘cheat’ until the sheer quantity of learned experience reveals a pattern that performs as ‘instinct’ with regard to this kind of test?
* In the sense that models are increasingly optimized with loss functions to which human feedback has contributed, and are effectively shaped by human trials and subsequent triage.
First published Monday, January 27, 2025