Difference Medical Visual Question Answering (Diff-VQA), a specialized subfield of Medical VQA, tackles the critical task of identifying and describing differences between pairs of medical images. This study introduces a novel Vision Encoder-Decoder (VED) architecture tailored for this task, focusing on the comparison of chest X-ray images to detect and explain changes. The proposed model incorporates two key innovations: (1) a lightweight Transformer text decoder capable of generating precise and contextually relevant answers to complex medical questions, and (2) an enhanced fusion mechanism that improves the model’s ability to distinguish between the two input images, enabling more accurate comparison of radiological findings. Our approach excels at identifying significant changes, such as pneumonia and lung opacity, demonstrating its utility in automating preliminary radiological assessments. By leveraging large-scale, domain-specific datasets and employing advanced training strategies, our VED architecture achieves state-of-the-art performance on standard VQA metrics, setting a new benchmark in diagnostic accuracy. These advancements highlight the potential of Diff-VQA to enhance clinical workflows and support radiologists in making more precise, informed decisions.
The proposed architecture focuses on a Vision Encoder-Decoder scheme specifically designed to compare pairs of chest X-ray images and generate answers about changes between them. The visual component uses a Transformer-based backbone pre-trained and fine-tuned on radiograph data to extract representations from both images. These representations are augmented with a learnable indicator that signals which image is the reference and which is the current one, so that the fusion process can explicitly distinguish between the two inputs.
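As a rough illustration of this idea, the sketch below (PyTorch, with placeholder module names and dimensions that are assumptions rather than the paper's settings) adds a learnable role embedding to each image's patch tokens before concatenating them into a single sequence for the decoder to attend over:

```python
import torch
import torch.nn as nn

class RoleAwareFusion(nn.Module):
    """Minimal sketch: tag each image's patch features with a learnable
    'reference' or 'current' indicator before fusing them. Names and
    dimensions are illustrative, not taken from the authors' code."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # One learnable vector per role, broadcast over all patch tokens.
        self.role_embed = nn.Parameter(torch.zeros(2, dim))  # 0: reference, 1: current

    def forward(self, ref_feats: torch.Tensor, cur_feats: torch.Tensor) -> torch.Tensor:
        # ref_feats, cur_feats: (batch, num_patches, dim) from a shared vision encoder.
        ref = ref_feats + self.role_embed[0]  # mark tokens as "reference"
        cur = cur_feats + self.role_embed[1]  # mark tokens as "current"
        # Concatenate along the token axis so the decoder can attend to both
        # images while still knowing which tokens came from which study.
        return torch.cat([ref, cur], dim=1)   # (batch, 2 * num_patches, dim)
```

Concatenation keeps every patch token available to cross-attention, while the additive role embedding is what allows the attention weights to treat reference and current tokens differently.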
The text decoder is a lightweight Transformer that processes the tokenized question and integrates the differentiated visual information via cross-attention. Given the relatively limited vocabulary and structure of questions in this domain, a smaller decoder is chosen to generate precise answers without requiring large-scale pre-trained language models. At each generation step, the decoder combines the textual context (the question and previously generated tokens) with the fused visual features, producing the answer autoregressively and ensuring coherence in comparing the reference and current images.
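A minimal sketch of such a decoder is shown below, assuming the fused visual tokens produced by the previous step; vocabulary size, depth, width, and maximum length are placeholder choices, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LightweightAnswerDecoder(nn.Module):
    """Minimal sketch of a small Transformer decoder that generates the answer
    autoregressively while cross-attending to the fused visual tokens."""

    def __init__(self, vocab_size: int = 8000, dim: int = 512, visual_dim: int = 768,
                 num_layers: int = 4, num_heads: int = 8, max_len: int = 128):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        self.visual_proj = nn.Linear(visual_dim, dim)  # match encoder width to decoder width
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) question plus previously generated answer tokens.
        # visual_tokens: (batch, num_visual_tokens, visual_dim) output of the fusion step.
        seq_len = token_ids.size(1)
        x = self.token_embed(token_ids) + self.pos_embed[:, :seq_len]
        memory = self.visual_proj(visual_tokens)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=token_ids.device),
            diagonal=1)
        # Self-attention over the text, cross-attention over the fused visual tokens.
        x = self.decoder(tgt=x, memory=memory, tgt_mask=causal_mask)
        return self.lm_head(x)  # next-token logits at every position
```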
The training strategy is organized in staged phases: first, the visual encoder is fine-tuned exclusively on radiographs to learn domain-specific features; next, the text decoder is integrated and initially trained alone with the encoder frozen, optimizing visual-text fusion in isolation; finally, a joint fine-tuning of both encoder and decoder is performed to adjust the entire system. During this process, techniques such as hard example selection and image augmentations are applied to strengthen the model’s robustness against real variations in radiographs.
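The staged schedule can be pictured roughly as follows; the training loops themselves are omitted, and the learning rates are placeholder values rather than the paper's hyperparameters:

```python
import torch
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    """Toggle gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def build_joint_optimizer(encoder: nn.Module, decoder: nn.Module) -> torch.optim.Optimizer:
    """Stage 3: joint fine-tuning, with a smaller learning rate for the
    already-adapted encoder (values are illustrative only)."""
    return torch.optim.AdamW([
        {"params": encoder.parameters(), "lr": 1e-5},
        {"params": decoder.parameters(), "lr": 5e-5},
    ])

def staged_schedule(encoder: nn.Module, decoder: nn.Module) -> torch.optim.Optimizer:
    # Stage 1: fine-tune the vision encoder alone on radiograph data.
    set_requires_grad(encoder, True)

    # Stage 2: freeze the encoder and train only the decoder and fusion.
    set_requires_grad(encoder, False)
    set_requires_grad(decoder, True)

    # Stage 3: unfreeze everything for joint fine-tuning of the full system.
    set_requires_grad(encoder, True)
    return build_joint_optimizer(encoder, decoder)
```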
Our results confirm that the proposed method clearly outperforms the state of the art in Medical VQA. In our overall evaluations, our model produces answers that are more coherent, fluent, and clinically relevant than those of previous approaches, thanks to an architecture specifically designed to interpret radiological images and generate precise diagnoses. This improvement is reflected across all quality metrics, demonstrating that our training strategy and network modifications foster a deeper understanding of the medical context and yield text that more faithfully mirrors clinical reasoning.
Beyond the aggregate numbers, a qualitative analysis of two extreme examples, both of which achieve maximum metric scores, reveals a crucial insight. In the first case, the model's response matches the true diagnosis exactly, identifying every anomaly and adequately justifying each finding. In the second case, however, despite registering the same high quantitative scores, the model makes a diagnostic error: it assumes that certain abnormalities have resolved when they in fact remain present.
This contrast highlights a decisive limitation of current metrics: they measure linguistic and structural similarity but fail to capture the clinical validity of the content. In a setting where automated diagnoses can directly influence medical decisions, this gap is unacceptable. Thus, our results not only establish a new benchmark in quantitative performance but also underscore the urgent need to develop evaluation metrics tailored to Medical VQA—metrics capable of verifying diagnostic accuracy and ensuring patient safety.
@inproceedings{obrador-reina2025unveiling,
  title     = {Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering},
  author    = {Luis-Jesus Marhuenda and Miquel Obrador-Reina and Mohamed Aas-Alas and Alberto Albiol and Roberto Paredes},
  booktitle = {Medical Imaging with Deep Learning},
  year      = {2025},
  url       = {https://openreview.net/forum?id=8CNssOg7fk}
}