ECCV 2026 Accepted
Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs
RS-Neg evaluates whether remote sensing MLLMs truly understand negation, and NeFo improves this ability through lightweight test-time learning.
Peng Cheng Laboratory, Tsinghua University, Central South University
* Equal contribution. † Corresponding authors.
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance on remote sensing (RS) tasks, but their ability to understand negation remains underexplored. Negation understanding matters in practical scenarios where models must identify what is absent or false, such as finding non-flooded routes during emergency response.
We introduce RS-Neg, the first benchmark for evaluating negation understanding in remote sensing MLLMs across region-level and scene-level tasks. We also propose NeFo, a test-time learning method that explicitly models the logical role of negation and improves robustness using only a small amount of unlabeled test data.
Contributions
- RS-Neg benchmark. 22,464 negation-aware samples spanning VQA, MCQ, visual grounding, and scene classification.
- Verified data pipeline. LLM-driven negation synthesis with dynamic visual focus to verify absent concepts in fine-grained RS images.
- NeFo adaptation. A lightweight test-time learning method that improves negation comprehension without extra annotations.
RS-Neg Benchmark
RS-Neg is built from 7 widely used remote sensing datasets. It includes object-, attribute-, and state-level negation for visual conversation tasks, together with scene-level classification samples containing negated label distractors.
| Task | Object | Attribute | State | Total |
|---|---|---|---|---|
| VQA | 5,675 | 2,379 | 1,151 | 9,205 |
| MCQ | 4,502 | 791 | 382 | 5,675 |
| Grounding | 1,494 | 671 | 319 | 2,484 |
| Classification | - | - | - | 5,100 |
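As a quick consistency check, the per-task totals and the overall benchmark size can be recomputed from the counts in the table. The snippet below is only an illustrative tally; the dictionary layout and function names are assumptions made for this example, not part of any released RS-Neg tooling.

```python
# Per-task sample counts copied from the RS-Neg statistics table.
# Classification has no object/attribute/state breakdown, only a total.
CONVERSATION_TASKS = {
    "VQA": {"object": 5675, "attribute": 2379, "state": 1151},
    "MCQ": {"object": 4502, "attribute": 791, "state": 382},
    "Grounding": {"object": 1494, "attribute": 671, "state": 319},
}
CLASSIFICATION_TOTAL = 5100


def task_totals(tasks):
    """Sum the negation-type counts for each visual conversation task."""
    return {name: sum(counts.values()) for name, counts in tasks.items()}


def benchmark_total(tasks, classification_total):
    """Total number of samples across all four tasks."""
    return sum(task_totals(tasks).values()) + classification_total


totals = task_totals(CONVERSATION_TASKS)
print(totals)  # {'VQA': 9205, 'MCQ': 5675, 'Grounding': 2484}
print(benchmark_total(CONVERSATION_TASKS, CLASSIFICATION_TOTAL))  # 22464
```

The recomputed totals match the Total column, and their sum matches the 22,464 samples stated in the contributions.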
Evaluation on RS-Neg
Current MLLMs consistently underperform on negation queries. The gap appears across general-purpose models, RS-specific models, and reasoning-augmented models.
NeFo: Negation-Focused Test-Time Learning
NeFo constructs a negation-masked counterpart for each query and optimizes the model to distinguish the negated query from the masked version. A knowledge retaining loss regularizes the adapted model using the original model's predictions on affirmative inputs.
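To make the two ingredients above concrete, here is a minimal sketch of how a negation-masked query and a combined objective might look. Everything specific in it is an assumption made for illustration: the cue list `NEGATION_CUES`, the `[MASK]` placeholder, the hinge-on-KL form of the discrimination term, and the `margin` and `weight` parameters are hypothetical and are not taken from the paper. The probability lists stand in for a model's answer distributions.

```python
import math
import re

# Hypothetical negation cues; the source does not list which cues NeFo masks.
NEGATION_CUES = ("not", "no", "without", "never", "none")
MASK_TOKEN = "[MASK]"  # placeholder token, also an assumption

_CUE_PATTERN = re.compile(
    r"\b(" + "|".join(NEGATION_CUES) + r")\b", re.IGNORECASE
)


def mask_negation(query):
    """Build a negation-masked counterpart by replacing each cue word."""
    return _CUE_PATTERN.sub(MASK_TOKEN, query)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def nefo_style_objective(p_negated, p_masked,
                         p_adapted_affirm, p_original_affirm,
                         margin=1.0, weight=1.0):
    """Toy objective: discrimination term plus knowledge-retaining term.

    Assumed form, not the paper's loss:
    - discrimination: hinge that pushes predictions on the negated query
      at least `margin` (in KL) away from predictions on the masked query;
    - retention: KL from the original model's predictions on affirmative
      inputs to the adapted model's predictions on the same inputs.
    """
    separation = kl_divergence(p_negated, p_masked)
    discrimination = max(0.0, margin - separation)
    retention = kl_divergence(p_original_affirm, p_adapted_affirm)
    return discrimination + weight * retention


print(mask_negation("Which road is not flooded?"))
# Which road is [MASK] flooded?
```

In this toy form, the objective is largest when the model answers the negated and masked queries identically, and it drops to zero once the two predictions are well separated while affirmative predictions stay unchanged. No labels are needed, which is consistent with the unlabeled test-time setting described above.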
Main Results
NeFo improves every evaluated base MLLM on RS-Neg VQA and MCQ. The table reports total accuracy (%) for each base model with and without NeFo.
| Base MLLM | VQA | VQA + NeFo | MCQ | MCQ + NeFo |
|---|---|---|---|---|
| Qwen2.5-VL | 74.96 | 79.52 | 54.29 | 65.46 |
| Qwen3-VL | 71.47 | 75.42 | 57.30 | 64.55 |
| RS-LLaVA | 73.15 | 74.75 | 23.72 | 24.46 |
| GeoReason | 67.30 | 69.76 | 45.59 | 53.39 |
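The absolute gains implied by the table can be computed directly from the reported numbers. The snippet below is a small illustrative calculation, assuming the accuracies are percentages so that differences are in percentage points; the variable and function names are chosen for this example only.

```python
# (before, after) accuracy pairs copied from the main results table.
RESULTS = {
    "Qwen2.5-VL": {"VQA": (74.96, 79.52), "MCQ": (54.29, 65.46)},
    "Qwen3-VL": {"VQA": (71.47, 75.42), "MCQ": (57.30, 64.55)},
    "RS-LLaVA": {"VQA": (73.15, 74.75), "MCQ": (23.72, 24.46)},
    "GeoReason": {"VQA": (67.30, 69.76), "MCQ": (45.59, 53.39)},
}


def absolute_gains(results):
    """Return the accuracy gain (after minus before) per model and task."""
    return {
        model: {
            task: round(after - before, 2)
            for task, (before, after) in tasks.items()
        }
        for model, tasks in results.items()
    }


gains = absolute_gains(RESULTS)
for model, per_task in gains.items():
    print(f"{model}: VQA +{per_task['VQA']:.2f}, MCQ +{per_task['MCQ']:.2f}")
```

On these numbers the largest gain is +11.17 points for Qwen2.5-VL on MCQ, and the smallest is +0.74 points for RS-LLaVA on MCQ.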
NeFo also transfers to unseen negation tasks, including RS-Neg classification, RS-Neg grounding, and real-world FloodNet VQA.
BibTeX
@article{han2026evaluating,
  title={Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs},
  author={Han, Haochen and Wang, Jue and Wang, Alex Jinpeng and Liu, Fangming},
  journal={arXiv preprint arXiv:2606.20177},
  year={2026}
}