ECCV 2026 Accepted

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

RS-Neg evaluates whether remote sensing MLLMs truly understand negation, and NeFo improves this ability through lightweight test-time learning.

Haochen Han*, Jue Wang*, Alex Jinpeng Wang†, Fangming Liu†

Peng Cheng Laboratory, Tsinghua University, Central South University

* Equal contribution. † Corresponding authors.


RS-Neg covers negation-conditioned remote sensing tasks from region-level grounding to scene-level reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on remote sensing tasks, but their ability to understand negation remains underexplored. This limitation matters in practical scenarios where models must identify what is absent or false, such as finding non-flooded routes during emergency response.

We introduce RS-Neg, the first benchmark for evaluating negation understanding in remote sensing MLLMs across region-level and scene-level tasks. We also propose NeFo, a test-time learning method that explicitly models the logical role of negation and improves robustness using only a small amount of unlabeled test data.

Contributions

RS-Neg benchmark. 22,464 negation-aware samples spanning VQA, MCQ, visual grounding, and scene classification.
Verified data pipeline. LLM-driven negation synthesis with dynamic visual focus to verify absent concepts in fine-grained remote sensing (RS) images.
NeFo adaptation. A lightweight test-time learning method that improves negation comprehension without extra annotations.

RS-Neg Benchmark

RS-Neg is built from 7 widely used remote sensing datasets. It includes object-, attribute-, and state-level negation for visual conversation tasks, together with scene-level classification samples containing negated label distractors.

Task	Object	Attribute	State	Total
VQA	5,675	2,379	1,151	9,205
MCQ	4,502	791	382	5,675
Grounding	1,494	671	319	2,484
Classification	-	-	-	5,100
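
The counts in the table can be cross-checked against the stated benchmark size of 22,464 samples. The snippet below is a minimal sanity check with the numbers copied from the table; the dictionary layout and the helper name `row_is_consistent` are illustrative choices, not part of any released RS-Neg tooling.

```python
# Per-task sample counts copied from the RS-Neg table above.
# Classification has no per-type breakdown, so only its total is listed.
counts = {
    "VQA": {"object": 5675, "attribute": 2379, "state": 1151, "total": 9205},
    "MCQ": {"object": 4502, "attribute": 791, "state": 382, "total": 5675},
    "Grounding": {"object": 1494, "attribute": 671, "state": 319, "total": 2484},
    "Classification": {"total": 5100},
}


def row_is_consistent(row):
    """Return True if the per-type counts (when present) sum to the row total."""
    parts = [value for key, value in row.items() if key != "total"]
    return not parts or sum(parts) == row["total"]


# Every row with a breakdown should add up to its own total.
all_rows_consistent = all(row_is_consistent(row) for row in counts.values())

# The four task totals should add up to the benchmark size.
benchmark_size = sum(row["total"] for row in counts.values())

print(all_rows_consistent)  # True
print(benchmark_size)       # 22464
```

The row totals and the overall total agree with the figure quoted in the Contributions list.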

Figure: Pipeline for constructing RS-Neg, using MCQ generation as an example.

Evaluation on RS-Neg

Current MLLMs consistently underperform on negation queries. The gap appears across general-purpose models, RS-specific models, and reasoning-augmented models.

Figure: VQA performance drops under negation queries.

Figure: Scene classification performance under negation. Scene classification shows severe negation sensitivity.

Figure: Visual grounding performance under negation. Region-level visual grounding remains challenging.

Figure: MCQ performance by negation type. Several RS-specific MLLMs fall below random selection on MCQ.

NeFo: Negation-Focused Test-Time Learning

NeFo constructs a negation-masked counterpart for each query and optimizes the model to distinguish the negated query from the masked version. A knowledge retaining loss regularizes the adapted model using the original model's predictions on affirmative inputs.

Figure: Overview of NeFo with truth-value inversion and knowledge retaining objectives.
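
To make the two ingredients described above more concrete, here is a toy sketch in plain Python. It is not the NeFo implementation: the cue list `NEGATION_CUES`, the `[MASK]` placeholder, the use of total variation as a separation term, the KL form of the retaining term, and the weight `lam` are all assumptions chosen for illustration. The functions operate on answer distributions given as plain lists, standing in for model outputs.

```python
import math
import re

# Hypothetical negation cues; the cue set actually used by NeFo is not given here.
NEGATION_CUES = ("not", "no", "without", "never", "none")
MASK_TOKEN = "[MASK]"  # assumed placeholder, not taken from the paper


def mask_negation(query: str) -> str:
    """Build a negation-masked counterpart by replacing whole-word cues."""
    pattern = r"\b(" + "|".join(NEGATION_CUES) + r")\b"
    return re.sub(pattern, MASK_TOKEN, query, flags=re.IGNORECASE)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def separation_loss(p_negated, p_masked):
    """Assumed stand-in for the 'distinguish negated from masked' objective:
    the negative total variation distance, so a larger gap gives a lower loss."""
    return -0.5 * sum(abs(a - b) for a, b in zip(p_negated, p_masked))


def retaining_loss(p_original, p_adapted):
    """Assumed knowledge-retaining term: KL from the original model's prediction
    to the adapted model's prediction on an affirmative input."""
    return kl_divergence(p_original, p_adapted)


def toy_negation_objective(p_negated, p_masked, p_original, p_adapted, lam=1.0):
    """Weighted sum of the two toy terms; lam is an assumed trade-off weight."""
    return separation_loss(p_negated, p_masked) + lam * retaining_loss(p_original, p_adapted)


print(mask_negation("Which road is not flooded?"))
```

In this sketch, the objective decreases when the answer distributions for the negated and masked queries move apart, and increases when the adapted model drifts from the original model on affirmative inputs. A real test-time learner would compute these distributions from the MLLM and backpropagate through them.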

Main Results

NeFo improves all evaluated base MLLMs on RS-Neg VQA and MCQ. The table reports total accuracy for each base model with and without NeFo.

Base MLLM	VQA	VQA + NeFo	MCQ	MCQ + NeFo
Qwen2.5-VL	74.96	79.52	54.29	65.46
Qwen3-VL	71.47	75.42	57.30	64.55
RS-LLaVA	73.15	74.75	23.72	24.46
GeoReason	67.30	69.76	45.59	53.39
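
The size of each improvement follows directly from the table. The short script below computes the per-model gains; the numbers are copied from the table, while the variable and function names are illustrative.

```python
# Total accuracy copied from the results table:
# (VQA, VQA + NeFo, MCQ, MCQ + NeFo) for each base MLLM.
results = {
    "Qwen2.5-VL": (74.96, 79.52, 54.29, 65.46),
    "Qwen3-VL": (71.47, 75.42, 57.30, 64.55),
    "RS-LLaVA": (73.15, 74.75, 23.72, 24.46),
    "GeoReason": (67.30, 69.76, 45.59, 53.39),
}


def gains(row):
    """Return (VQA gain, MCQ gain) from adding NeFo, rounded to two decimals."""
    vqa, vqa_nefo, mcq, mcq_nefo = row
    return round(vqa_nefo - vqa, 2), round(mcq_nefo - mcq, 2)


for model, row in results.items():
    vqa_gain, mcq_gain = gains(row)
    print(f"{model}: VQA +{vqa_gain:.2f}, MCQ +{mcq_gain:.2f}")
```

On these numbers the gains range from 0.74 (RS-LLaVA, MCQ) to 11.17 (Qwen2.5-VL, MCQ), and MCQ gains exceed VQA gains for every model except RS-LLaVA.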

NeFo also transfers to unseen negation tasks, including RS-Neg classification, RS-Neg grounding, and real-world FloodNet VQA.

Figure: Scaling study showing the effect of varying the number of test-time adaptation samples.

BibTeX

@article{han2026evaluating,
  title={Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs},
  author={Han, Haochen and Wang, Jue and Wang, Alex Jinpeng and Liu, Fangming},
  journal={arXiv preprint arXiv:2606.20177},
  year={2026}
}