Image and Video Processing 9
♻ ☆ Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation
Objective: Latent diffusion models (LDM) could alleviate data scarcity
challenges affecting machine learning development for medical imaging. However,
medical LDM strategies typically rely on short-prompt text encoders,
non-medical LDMs, or large data volumes. These strategies can limit performance
and scientific accessibility. We propose a novel LDM conditioning approach to
address these limitations. Methods: We propose Class-Conditioned Efficient
Large Language model Adapter (CCELLA), a novel dual-head conditioning approach
that simultaneously conditions the LDM U-Net with free-text clinical reports
and radiology classification. We also propose a data-efficient LDM framework
centered around CCELLA and a proposed joint loss function. We first evaluate
our method on 3D prostate MRI against the state of the art. We then augment a
downstream classifier model training dataset with synthetic images from our
method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited
3D prostate MRI dataset, significantly outperforming a recent foundation model
with FID 0.071. When training a classifier for prostate cancer prediction,
adding synthetic images generated by our method during training improves
classifier accuracy from 69% to 74%. Training a classifier solely on our
method's synthetic images achieves performance comparable to training on real
images alone. Conclusion: We show that our method improved both synthetic image
quality and downstream classifier performance using limited data and minimal
human annotation. Significance: The proposed CCELLA-centric framework enables
radiology report and class-conditioned LDM training for high-quality medical
image synthesis given limited data volume and human data annotation, improving
LDM performance and scientific accessibility. Code from this study will be
available at https://github.com/grabkeem/CCELLA
comment: MAH and BT are co-senior authors on the work. This work has been
submitted to the IEEE for possible publication
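As a rough illustration of the dual-head conditioning idea described above, the sketch below (PyTorch) combines a free-text report embedding with a learned radiology class embedding into a single conditioning sequence that a diffusion U-Net could attend to via cross-attention. The module names, dimensions, and concatenation strategy are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: combining a report-text embedding head and a class
# embedding head into one conditioning sequence for a diffusion U-Net.
# Dimensions and the concatenation scheme are assumptions, not CCELLA's code.
import torch
import torch.nn as nn

class DualHeadConditioner(nn.Module):
    def __init__(self, text_dim=768, num_classes=2, cond_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)           # free-text head
        self.class_embed = nn.Embedding(num_classes, cond_dim)   # class head

    def forward(self, text_tokens, class_labels):
        # text_tokens: (B, L, text_dim) from a frozen report/text encoder
        # class_labels: (B,) integer radiology classification labels
        text_cond = self.text_proj(text_tokens)                  # (B, L, cond_dim)
        cls_cond = self.class_embed(class_labels).unsqueeze(1)   # (B, 1, cond_dim)
        # Concatenate along the token axis so cross-attention sees both signals.
        return torch.cat([cls_cond, text_cond], dim=1)           # (B, L+1, cond_dim)

cond = DualHeadConditioner()(torch.randn(2, 16, 768), torch.tensor([0, 1]))
print(cond.shape)  # torch.Size([2, 17, 512])
```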
♻ ☆ Realism in Action: Anomaly-Aware Diagnosis of Brain Tumors from Medical Images Using YOLOv8 and DeiT
Reliable diagnosis of brain tumors remains challenging due to the low clinical
incidence of such cases. However, this low rate is neglected by most proposed
methods. We propose a clinically inspired framework for
anomaly-resilient tumor detection and classification. Detection leverages
YOLOv8n fine-tuned on a realistically imbalanced dataset (1:9 tumor-to-normal
ratio; 30,000 MRI slices from 81 patients). In addition, we propose a novel
Patient-to-Patient (PTP) metric that evaluates diagnostic reliability at the
patient level. Classification employs knowledge distillation: a Data Efficient
Image Transformer (DeiT) student model is distilled from a ResNet152 teacher.
The distilled ViT achieves an F1-score of 0.92 within 20 epochs, approaching
the teacher's performance (F1=0.97) with significantly reduced computational
resources. This end-to-end framework demonstrates high robustness on clinically
representative, anomaly-distributed data, offering a viable tool suited to
realistic clinical settings.
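The distillation step above follows the usual teacher-student pattern; a minimal sketch of a standard soft-label distillation loss (PyTorch) is given below. The temperature, weighting, and class count are placeholder assumptions, not the paper's exact training recipe.

```python
# Minimal sketch of a standard knowledge distillation loss: KL divergence on
# temperature-softened teacher/student logits plus ordinary cross-entropy.
# T, alpha, and the class count are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                     # soft-target term
    hard = F.cross_entropy(student_logits, labels)  # hard-target term
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 2), torch.randn(8, 2),
                         torch.randint(0, 2, (8,)))
```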
♻ ☆ SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting MICCAI 2025
Yiming Huang, Long Bai, Beilei Cui, Kun Yuan, Guankun Wang, Mobarak I. Hoque, Nicolas Padoy, Nassir Navab, Hongliang Ren
In contemporary surgical research and practice, accurately comprehending 3D
surgical scenes with text-promptable capabilities is particularly crucial for
surgical planning and real-time intra-operative guidance, where precisely
identifying and interacting with surgical tools and anatomical structures is
paramount. However, existing works address surgical vision-language models
(VLMs), 3D reconstruction, and segmentation separately, lacking support for
real-time text-promptable 3D queries. In this paper, we present SurgTPGS, a
novel text-promptable Gaussian Splatting method to fill this gap. We introduce
a 3D semantics feature learning strategy incorporating the Segment Anything
model and state-of-the-art vision-language models. We extract the segmented
language features for 3D surgical scene reconstruction, enabling a more
in-depth understanding of the complex surgical environment. We also propose
semantic-aware deformation tracking to capture the seamless deformation of
semantic features, providing a more precise reconstruction for both texture and
semantic features. Furthermore, we present semantic region-aware optimization,
which utilizes region-based semantic information to supervise training, further
improving reconstruction quality and semantic smoothness. We
conduct comprehensive experiments on two real-world surgical datasets to
demonstrate the superiority of SurgTPGS over state-of-the-art methods,
highlighting its potential to revolutionize surgical practices. SurgTPGS paves
the way for developing next-generation intelligent surgical systems by
enhancing surgical precision and safety. Our code is available at:
https://github.com/lastbasket/SurgTPGS.
comment: MICCAI 2025. Project Page:
https://lastbasket.github.io/MICCAI-2025-SurgTPGS/
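As a generic illustration of what a text-promptable 3D query can look like, the sketch below scores per-Gaussian language features against a text-prompt embedding by cosine similarity and thresholds the result. It is not the SurgTPGS pipeline; the feature dimensions, threshold, and scoring rule are assumptions.

```python
# Generic illustration of a text-promptable query over per-Gaussian semantic
# features: cosine similarity to a text embedding, then a threshold.
# Not the SurgTPGS implementation; all values are placeholders.
import numpy as np

def text_query(gaussian_feats, text_embed, threshold=0.3):
    # gaussian_feats: (N, D) language features attached to N Gaussians
    # text_embed: (D,) embedding of a prompt such as "surgical tool"
    g = gaussian_feats / np.linalg.norm(gaussian_feats, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    scores = g @ t                      # per-Gaussian cosine similarity
    return scores > threshold           # boolean mask of prompted Gaussians

mask = text_query(np.random.randn(1000, 512), np.random.randn(512))
```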
♻ ☆ Volumetric segmentation of muscle compartments using in vivo imaging and architectural validation in human finger flexors
Segmenting muscle compartments and measuring their architecture can
facilitate movement function assessment, accurate musculoskeletal modeling, and
synergy-based electromyogram simulation. Here, we present a novel method for
volumetric segmentation of muscle compartments using in vivo imaging, focusing
on the independent compartments for finger control of flexor digitorum
superficialis (FDS). In addition, we measured the architectural properties of FDS
compartments and validated the segmentation. Specifically, ultrasound and
magnetic resonance imaging (MRI) from 10 healthy subjects were used for
segmentation and measurement, while electromyography was utilized for
validation. A two-step piecewise segmentation was proposed, first annotating
compartment regions in the cross-sectional ultrasound image based on
compartment movement, and then performing minimum energy matching to register
the ultrasound data to the three-dimensional MRI coordinate system.
Additionally, the architectural properties were measured in the compartment
masks from the segmentation using MRI tractography. Anatomical correctness was
verified by comparing known anatomy with reconstructed fiber tracts and
measured properties, while segmentation accuracy was quantified as the
percentage of finger electromyogram centers falling within their corresponding
compartments. Results demonstrated agreement in fiber orientation between
the tractography and cadaveric photographs. Significant differences in
architectural properties (P < 0.001) were observed between compartments. The
properties of FDS and its compartments were within the physiological ranges (P
< 0.01). 95% (38/40) of the electromyogram centers were located within
their respective compartments, with the two errors occurring in the index and little
fingers. The validated segmentation method and derived architectural properties
may advance biomedical applications.
comment: 19 pages, 13 figures
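The validation metric described above, the percentage of electromyogram centers falling inside their corresponding compartment masks, can be computed with a few lines of NumPy. The sketch below assumes the centers have already been converted to voxel indices in the MRI volume.

```python
# Sketch of the center-in-compartment hit rate, assuming EMG centers are given
# as voxel indices and each compartment is a binary 3D mask in the same space.
import numpy as np

def center_hit_rate(centers, masks):
    # centers: list of (z, y, x) voxel coordinates, one per EMG channel
    # masks: list of binary 3D arrays; masks[i] is the compartment for centers[i]
    hits = sum(int(masks[i][tuple(c)] > 0) for i, c in enumerate(centers))
    return hits / len(centers)

masks = [np.zeros((4, 4, 4), dtype=np.uint8) for _ in range(2)]
masks[0][1, 1, 1] = 1
print(center_hit_rate([(1, 1, 1), (0, 0, 0)], masks))  # 0.5
```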
♻ ☆ Dehazing Light Microscopy Images with Guided Conditional Flow Matching: finding a sweet spot between fidelity and realism
Fluorescence microscopy is a major driver of scientific progress in the life
sciences. Although high-end confocal microscopes are capable of filtering
out-of-focus light, cheaper and more accessible microscopy modalities, such as
widefield microscopy, cannot, which leads to hazy image data. Computational
dehazing aims to combine the best of both worlds: affordable microscopy with
crisp-looking images. The perception-distortion trade-off
tells us that we can optimize either for data fidelity, e.g. low MSE or high
PSNR, or for data realism, measured by perceptual metrics such as LPIPS or FID.
Existing methods either prioritize fidelity at the expense of realism, or
produce perceptually convincing results that lack quantitative accuracy. In
this work, we propose HazeMatching, a novel iterative method for dehazing light
microscopy images, which effectively balances these objectives. Our goal was to
find a balanced trade-off between the fidelity of the dehazing results and the
realism of individual predictions (samples). We achieve this by adapting the
conditional flow matching framework, guiding the generative process with a
hazy observation in the conditional velocity field. We evaluate HazeMatching on
5 datasets, covering both synthetic and real data, assessing both distortion
and perceptual quality. Our method is compared against 7 baselines, achieving a
consistent balance between fidelity and realism on average. Additionally, with
calibration analysis, we show that HazeMatching produces well-calibrated
predictions. Note that our method does not require an explicit degradation
operator, making it readily applicable to real microscopy data. All
data used for training and evaluation and our code will be publicly available
under a permissive license.
comment: 4 figures, 10 pages + refs, 40 pages total (including supplement), 24
supplementary figures
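For readers unfamiliar with conditional flow matching, the sketch below shows one training step in which the velocity network additionally receives the hazy observation as conditioning, in the spirit of the guidance described above. The network interface, linear probability path, and loss form are standard CFM choices used here as assumptions, not HazeMatching's exact formulation.

```python
# Sketch of one conditional flow matching training step with the hazy image as
# extra conditioning. `v_net` is a placeholder velocity network, not the paper's.
import torch

def cfm_step(v_net, clean, hazy):
    # clean, hazy: (B, C, H, W) paired crisp / hazy images
    noise = torch.randn_like(clean)
    t = torch.rand(clean.shape[0], 1, 1, 1)      # time sampled uniformly in [0, 1]
    x_t = (1 - t) * noise + t * clean            # linear interpolation path
    target_v = clean - noise                     # target velocity along the path
    pred_v = v_net(x_t, t, hazy)                 # velocity conditioned on the haze
    return ((pred_v - target_v) ** 2).mean()     # flow matching regression loss
```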
♻ ☆ Downscaling Neural Network for Coastal Simulations
Learning the fine-scale details of a coastal ocean simulation from a coarse
representation is a challenging task. For real-world applications,
high-resolution simulations are necessary to advance understanding of many
coastal processes, specifically, to predict flooding resulting from tsunamis
and storm surges. We propose a Downscaling Neural Network for Coastal
Simulation (DNNCS) for spatiotemporal enhancement to efficiently learn the
high-resolution numerical solution. Given images of coastal simulations
produced on low-resolution computational meshes using low polynomial order
discontinuous Galerkin discretizations and a coarse temporal resolution, the
proposed DNNCS learns to produce high-resolution free surface elevation and
velocity visualizations in both time and space. To efficiently model the
dynamic changes over time and space, we propose grid-aware spatiotemporal
attention to project the temporal features to the spatial domain for non-local
feature matching. The coordinate information is also utilized via positional
encoding. For the final reconstruction, we use the spatiotemporal bilinear
operation to interpolate the missing frames and then expand the feature maps to
the frequency domain for residual mapping. In addition to data-driven losses,
the proposed physics-informed loss enforces gradient consistency and momentum
changes. Their combination contributes to an overall 24% improvement in Root
Mean Square Error (RMSE). To train the proposed model, we introduce a novel
coastal simulation dataset and use it for model optimization and evaluation.
Our method shows superior downscaling quality and fast computation compared to
the state-of-the-art methods.
comment: 13 pages, 12 figures
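The combination of data-driven and physics-informed terms mentioned above might, in its simplest form, look like the sketch below: an MSE data term plus a finite-difference gradient-consistency term on the predicted fields. The weighting and exact physics-informed formulation in the paper are not reproduced here; this is an assumption-laden illustration.

```python
# Illustrative combined loss: MSE data term plus a gradient-consistency term
# computed with finite differences. Weighting and formulation are assumptions.
import torch

def spatial_gradients(x):
    # Finite differences along the two spatial axes of a (B, C, H, W) field.
    dy = x[..., 1:, :] - x[..., :-1, :]
    dx = x[..., :, 1:] - x[..., :, :-1]
    return dx, dy

def downscaling_loss(pred, target, lam=0.1):
    data = ((pred - target) ** 2).mean()                 # data-driven term
    pdx, pdy = spatial_gradients(pred)
    tdx, tdy = spatial_gradients(target)
    grad = ((pdx - tdx) ** 2).mean() + ((pdy - tdy) ** 2).mean()
    return data + lam * grad                             # combined objective
```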
♻ ☆ ICME 2025 Grand Challenge on Video Super-Resolution for Video Conferencing
Super-Resolution (SR) is a critical task in computer vision, focusing on
reconstructing high-resolution (HR) images from low-resolution (LR) inputs. The
field has seen significant progress through various challenges, particularly in
single-image SR. Video Super-Resolution (VSR) extends this to the temporal
domain, aiming to enhance video quality using methods such as local, uni-, or
bi-directional propagation, or traditional upscaling followed by restoration.
This challenge addresses VSR for conferencing, where LR videos are encoded with
H.265 at fixed QPs. The goal is to upscale videos by a specific factor,
providing HR outputs with enhanced perceptual quality under a low-delay
scenario using causal models. The challenge included three tracks:
general-purpose videos, talking head videos, and screen content videos, with
separate datasets provided by the organizers for training, validation, and
testing. We open-sourced a new screen content dataset for the SR task in this
challenge. Submissions were evaluated through subjective tests using a
crowdsourced implementation of the ITU-T Rec P.910.
♻ ☆ De-LightSAM: Modality-Decoupled Lightweight SAM for Generalizable Medical Segmentation
Qing Xu, Jiaxuan Li, Xiangjian He, Chenxin Li, Fiseha B. Tesem, Wenting Duan, Zhen Chen, Rong Qu, Jonathan M. Garibaldi, Chang Wen Chen
The universality of deep neural networks across different modalities and
their generalization capabilities to unseen domains play an essential role in
medical image segmentation. The recent segment anything model (SAM) has
demonstrated strong adaptability across diverse natural scenarios. However, the
huge computational costs, the demand for manual annotations as prompts, and the
conflict-prone decoding process of SAM degrade its generalization capabilities
in medical scenarios. To address these limitations, we propose a
modality-decoupled lightweight SAM for domain-generalized medical image
segmentation, named De-LightSAM. Specifically, we first devise a lightweight
domain-controllable image encoder (DC-Encoder) that produces discriminative
visual features for diverse modalities. Further, we introduce the self-patch
prompt generator (SP-Generator) to automatically generate high-quality dense
prompt embeddings for guiding segmentation decoding. Finally, we design the
query-decoupled modality decoder (QM-Decoder) that leverages a one-to-one
strategy to provide an independent decoding channel for every modality,
preventing mutual knowledge interference of different modalities. Moreover, we
design a multi-modal decoupled knowledge distillation (MDKD) strategy to
leverage robust common knowledge to complement domain-specific medical feature
representations. Extensive experiments indicate that De-LightSAM outperforms
state-of-the-art methods in diverse medical image segmentation tasks, displaying
superior modality universality and generalization capabilities. Notably,
De-LightSAM uses only 2.0% of the parameters of SAM-H. The source code is
available at https://github.com/xq141839/De-LightSAM.
comment: Under Review
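The one-to-one decoding idea (an independent decoding channel per modality) can be pictured with the small sketch below, where a modality key selects a dedicated head. The layer shapes, modality names, and heads themselves are placeholders, not the QM-Decoder architecture.

```python
# Sketch of one-to-one modality routing: each modality gets its own decoding
# head so knowledge from different modalities does not interfere at decoding.
# Shapes, modality names, and the heads are illustrative placeholders.
import torch
import torch.nn as nn

class QueryDecoupledDecoder(nn.Module):
    def __init__(self, modalities=("ct", "mri", "ultrasound"), feat_dim=256):
        super().__init__()
        self.heads = nn.ModuleDict(
            {m: nn.Conv2d(feat_dim, 1, kernel_size=1) for m in modalities}
        )

    def forward(self, feats, modality):
        # feats: (B, feat_dim, H, W) encoder features; modality: string key
        return self.heads[modality](feats)

mask_logits = QueryDecoupledDecoder()(torch.randn(1, 256, 64, 64), "mri")
```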
♻ ☆ Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification MICCAI 2025
Multimodal large language models (MLLMs) have enormous potential to perform
few-shot in-context learning in the context of medical image analysis. However,
safe deployment of these models into real-world clinical practice requires an
in-depth analysis of the accuracies of their predictions, and their associated
calibration errors, particularly across different demographic subgroups. In
this work, we present the first investigation into the calibration biases and
demographic unfairness of MLLMs' predictions and confidence scores in few-shot
in-context learning for medical image classification. We introduce CALIN, an
inference-time calibration method designed to mitigate the associated biases.
Specifically, CALIN estimates the amount of calibration needed, represented by
calibration matrices, using a bi-level procedure: progressing from the
population level to the subgroup level prior to inference. It then applies this
estimation to calibrate the predicted confidence scores during inference.
Experimental results on three medical imaging datasets (PAPILA for fundus image
classification, HAM10000 for skin cancer classification, and MIMIC-CXR for
chest X-ray classification) demonstrate CALIN's effectiveness at ensuring fair
confidence calibration in its predictions, while improving overall prediction
accuracy and exhibiting a minimal fairness-utility trade-off.
comment: Preprint version. The peer-reviewed version of this paper has been
accepted to MICCAI 2025 main conference
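One way to read the inference-time calibration described above is as multiplying the predicted class-confidence vector by estimated calibration matrices (population-level, then subgroup-level) and renormalizing; the sketch below illustrates that reading. The matrices and their composition order are assumptions, not CALIN's published procedure.

```python
# Illustrative sketch: applying a population-level and a subgroup-level
# calibration matrix to a predicted confidence vector, then renormalizing.
# The matrices and their composition are assumptions, not CALIN's definition.
import numpy as np

def calibrate(confidences, pop_matrix, subgroup_matrix):
    # confidences: (K,) predicted class probabilities from the MLLM
    # pop_matrix, subgroup_matrix: (K, K) estimated calibration matrices
    adjusted = subgroup_matrix @ (pop_matrix @ confidences)
    adjusted = np.clip(adjusted, 1e-8, None)
    return adjusted / adjusted.sum()     # renormalize to a probability vector

p = calibrate(np.array([0.7, 0.3]), np.eye(2),
              np.array([[0.8, 0.2], [0.2, 0.8]]))
```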