ECCV 2026 Accepted

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

RS-Neg evaluates whether remote sensing MLLMs truly understand negation, and NeFo improves this ability through lightweight test-time learning.

Haochen Han*, Jue Wang*, Alex Jinpeng Wang, Fangming Liu

Peng Cheng Laboratory    Tsinghua University    Central South University

* Equal contribution. † Corresponding authors.

[Figure: Examples of RS-Neg tasks across scene classification, visual grounding, multiple choice, and visual QA. RS-Neg covers negation-conditioned remote sensing tasks from region-level grounding to scene-level reasoning.]

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on remote sensing tasks, but their ability to understand negation remains underexplored. This limitation matters in practical scenarios where a model must identify what is absent or false, such as finding non-flooded routes during emergency response.

We introduce RS-Neg, the first benchmark for evaluating negation understanding in remote sensing MLLMs across region-level and scene-level tasks. We also propose NeFo, a test-time learning method that explicitly models the logical role of negation and improves robustness using only a small amount of unlabeled test data.

Contributions

RS-Neg Benchmark

RS-Neg is built from seven widely used remote sensing datasets. It covers object-, attribute-, and state-level negation for the visual conversation tasks (VQA, MCQ, and grounding), together with scene-level classification samples that contain negated label distractors.

Task            Object  Attribute  State  Total
VQA              5,675      2,379  1,151  9,205
MCQ              4,502        791    382  5,675
Grounding        1,494        671    319  2,484
Classification       -          -      -  5,100
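
As a quick consistency check, the per-type counts in the table can be summed and compared with the reported totals. The snippet below only transcribes the table; the variable and function names are illustrative and not part of any released RS-Neg tooling.

```python
# Per-type sample counts transcribed from the RS-Neg statistics table.
# Each entry: task -> (object, attribute, state, reported total).
COUNTS = {
    "VQA": (5675, 2379, 1151, 9205),
    "MCQ": (4502, 791, 382, 5675),
    "Grounding": (1494, 671, 319, 2484),
}
CLASSIFICATION_TOTAL = 5100  # no per-type breakdown is reported


def check_totals(counts):
    """Map each task to True if its per-type counts add up to its total."""
    return {task: sum(row[:3]) == row[3] for task, row in counts.items()}


# Sum of the three conversation-task totals, then add classification.
conversation_total = sum(row[3] for row in COUNTS.values())
benchmark_total = conversation_total + CLASSIFICATION_TOTAL
```

All three conversation-task rows are internally consistent, and adding the four task totals gives 22,464 samples.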
[Figure: Pipeline for constructing RS-Neg, using MCQ generation as an example.]

Evaluation on RS-Neg

Current MLLMs consistently underperform on negation queries. The gap appears across general-purpose models, RS-specific models, and reasoning-augmented models.

[Figure: VQA performance drops under negation queries.]
[Figure: Scene classification shows severe negation sensitivity.]
[Figure: Visual grounding performance under negation. Region-level visual grounding remains challenging.]
[Figure: MCQ performance by negation type. Several RS-specific MLLMs fall below random selection on MCQ.]

NeFo: Negation-Focused Test-Time Learning

NeFo constructs a negation-masked counterpart for each query and optimizes the model to distinguish the negated query from its masked version. A knowledge-retaining loss regularizes the adapted model toward the original model's predictions on affirmative inputs, so that adaptation does not erase what the model already answers correctly.
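
The two objectives can be illustrated with a small sketch. This is not the authors' implementation: the cue list, the binary [yes, no] answer distributions, the exact loss forms, and every name (`NEGATION_CUES`, `mask_negation`, `inversion_loss`, `retention_loss`, `nefo_style_loss`, `weight`) are assumptions made for illustration. It assumes that removing the negation cue should flip the truth value of a yes/no query, and that the original model's predictions act as a reference on affirmative inputs.

```python
import math
import re

# Hypothetical cue list; the source does not say how negation is detected.
NEGATION_CUES = ("not", "no", "without", "never")


def mask_negation(query: str) -> str:
    """Remove negation cue words to form a negation-masked counterpart."""
    pattern = r"\b(?:" + "|".join(NEGATION_CUES) + r")\b\s*"
    return re.sub(pattern, "", query, flags=re.IGNORECASE).strip()


def inversion_loss(p_negated, p_masked, eps=1e-12):
    """Cross-entropy between the negated-query prediction and the flipped
    masked-query prediction (assumes binary [yes, no] distributions)."""
    target = list(reversed(p_masked))  # swap yes/no probabilities
    return -sum(t * math.log(p + eps) for t, p in zip(target, p_negated))


def retention_loss(p_original, p_adapted, eps=1e-12):
    """KL(original || adapted) on an affirmative input; zero when the
    adapted model reproduces the original model's prediction."""
    return sum(
        po * math.log((po + eps) / (pa + eps))
        for po, pa in zip(p_original, p_adapted)
    )


def nefo_style_loss(p_negated, p_masked, p_original, p_adapted, weight=1.0):
    """Combined objective; `weight` is an assumed trade-off hyperparameter."""
    return inversion_loss(p_negated, p_masked) + weight * retention_loss(
        p_original, p_adapted
    )


masked = mask_negation("Is the road not flooded?")
# A model that ignores negation answers both queries the same way and is
# penalized more than one whose answer flips.
ignores_negation = inversion_loss([0.9, 0.1], [0.9, 0.1])
respects_negation = inversion_loss([0.1, 0.9], [0.9, 0.1])
```

In an actual test-time learning setup these probabilities would come from the MLLM's answer-token distribution, and the loss would be minimized by gradient descent over a small set of unlabeled test queries; the sketch only shows how the two terms pull in different directions.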

[Figure: Overview of NeFo with truth-value inversion and knowledge retaining objectives.]

Main Results

NeFo improves all evaluated base MLLMs on RS-Neg VQA and MCQ. The table reports total accuracy.

Base MLLM     VQA  VQA + NeFo    MCQ  MCQ + NeFo
Qwen2.5-VL  74.96       79.52  54.29       65.46
Qwen3-VL    71.47       75.42  57.30       64.55
RS-LLaVA    73.15       74.75  23.72       24.46
GeoReason   67.30       69.76  45.59       53.39
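
To make the size of the improvements explicit, the absolute gains can be computed directly from the table. The snippet transcribes the reported accuracies; the names `RESULTS` and `absolute_gains` are illustrative.

```python
# (base accuracy, accuracy with NeFo) transcribed from the results table.
RESULTS = {
    "Qwen2.5-VL": {"VQA": (74.96, 79.52), "MCQ": (54.29, 65.46)},
    "Qwen3-VL": {"VQA": (71.47, 75.42), "MCQ": (57.30, 64.55)},
    "RS-LLaVA": {"VQA": (73.15, 74.75), "MCQ": (23.72, 24.46)},
    "GeoReason": {"VQA": (67.30, 69.76), "MCQ": (45.59, 53.39)},
}


def absolute_gains(results):
    """Return the accuracy gain from NeFo per model and task, in points."""
    return {
        model: {
            task: round(with_nefo - base, 2)
            for task, (base, with_nefo) in tasks.items()
        }
        for model, tasks in results.items()
    }


gains = absolute_gains(RESULTS)
for model, tasks in gains.items():
    print(f"{model:<11} VQA +{tasks['VQA']:.2f}  MCQ +{tasks['MCQ']:.2f}")
```

The gains range from 0.74 points (RS-LLaVA on MCQ) to 11.17 points (Qwen2.5-VL on MCQ), and the MCQ gain exceeds the VQA gain for every model except RS-LLaVA.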

NeFo also transfers to unseen negation tasks, including RS-Neg classification, RS-Neg grounding, and real-world FloodNet VQA.

[Figure: Scaling study showing the effect of varying the number of test-time adaptation samples.]

BibTeX

@article{han2026evaluating,
  title={Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs},
  author={Han, Haochen and Wang, Jue and Wang, Alex Jinpeng and Liu, Fangming},
  journal={arXiv preprint arXiv:2606.20177},
  year={2026}
}