Image and Video Processing 16
☆ RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration
We present the RAW domain diffusion model (RDDM), an end-to-end diffusion
model that restores photo-realistic images directly from the sensor RAW data.
While recent sRGB-domain diffusion methods achieve impressive results, they are
caught in a dilemma between high fidelity and realistic generation: they operate
on lossy sRGB inputs and ignore the sensor RAW images that are readily available
in many scenarios, e.g., image and video capture on edge devices, resulting in
sub-optimal performance. RDDM bypasses this limitation by
directly restoring images in the RAW domain, replacing the conventional
two-stage image signal processing (ISP) + image restoration (IR) pipeline.
However, a straightforward adaptation of pre-trained diffusion models to the RAW
domain runs into out-of-distribution (OOD) issues. To this end, we propose: (1) a
RAW-domain VAE (RVAE) that learns optimal latent representations, and (2) a
differentiable Post Tone Processing (PTP) module that enables joint optimization
in the RAW and sRGB spaces. To compensate for the shortage of paired RAW data, we
develop a scalable degradation pipeline synthesizing RAW LQ-HQ pairs from
existing sRGB datasets for large-scale training. Furthermore, we devise a
configurable multi-Bayer (CMB) LoRA module handling diverse RAW patterns such as
RGGB and BGGR. Extensive
experiments demonstrate RDDM's superiority over state-of-the-art sRGB diffusion
methods, yielding higher fidelity results with fewer artifacts.
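As background for the pattern handling mentioned above, the sketch below shows a common preprocessing convention that a pattern-configurable module presumably builds on: packing any 2x2 Bayer mosaic into a fixed-order, half-resolution 4-channel array. This is an illustrative assumption, not the paper's CMB LoRA module, and the helper names are hypothetical.

```python
import numpy as np

# (row, col) offsets of the R, G1, G2, B samples inside each 2x2 tile, for the
# RGGB and BGGR layouts mentioned in the abstract plus two other common ones.
_PATTERN_OFFSETS = {
    "RGGB": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "BGGR": [(1, 1), (0, 1), (1, 0), (0, 0)],
    "GRBG": [(0, 1), (0, 0), (1, 1), (1, 0)],
    "GBRG": [(1, 0), (0, 0), (1, 1), (0, 1)],
}

def pack_bayer(raw, pattern="RGGB"):
    """Pack an (H, W) Bayer mosaic into a (4, H/2, W/2) array in R, G1, G2, B order,
    so downstream layers see the same channel semantics regardless of the CFA layout."""
    offsets = _PATTERN_OFFSETS[pattern]
    return np.stack([raw[r::2, c::2] for r, c in offsets])

raw = np.arange(16, dtype=np.float32).reshape(4, 4)
print(pack_bayer(raw, "BGGR").shape)  # (4, 2, 2)
```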
☆ Random forest-based out-of-distribution detection for robust lung cancer segmentation
Accurate detection and segmentation of cancerous lesions from computed
tomography (CT) scans is essential for automated treatment planning and cancer
treatment response assessment. Transformer-based models with self-supervised
pretraining can produce reliably accurate segmentation from in-distribution
(ID) data but degrade when applied to out-of-distribution (OOD) datasets. We
address this challenge with RF-Deep, a random forest classifier that utilizes
deep features from a pretrained transformer encoder of the segmentation model
to detect OOD scans and enhance segmentation reliability. The segmentation
model comprises a Swin Transformer encoder, pretrained with masked image
modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and
non-cancerous conditions, with a convolution decoder, trained to segment lung
cancers in 317 3D scans. Independent testing was performed on 603 3D CT scans
from public datasets, comprising one ID dataset and four OOD datasets: chest
CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney
cancers and healthy volunteers. RF-Deep detected OOD cases with an FPR95 of
18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, respectively,
consistently outperforming established OOD approaches. The RF-Deep classifier
provides a simple and effective approach to enhancing the reliability of cancer
segmentation in ID and OOD scenarios.
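As a rough illustration of the general recipe described here (not the authors' exact pipeline), the sketch below fits a random forest on encoder features to separate ID from reference OOD scans and reports FPR95, the false-positive rate at the threshold that retains 95% of ID cases. The synthetic features stand in for the frozen Swin encoder outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fpr_at_95_tpr(scores_id, scores_ood):
    # Threshold chosen so that 95% of ID scores are accepted as in-distribution;
    # FPR95 is the fraction of OOD scores that still pass that threshold.
    thresh = np.percentile(scores_id, 5)
    return float(np.mean(scores_ood >= thresh))

rng = np.random.default_rng(0)
# Stand-ins for deep features; in the paper these come from the pretrained encoder.
feat_id, feat_ood = rng.normal(0, 1, (500, 64)), rng.normal(1.5, 1, (500, 64))

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(np.vstack([feat_id[:250], feat_ood[:250]]),
       np.r_[np.ones(250), np.zeros(250)])          # 1 = in-distribution
scores_id = rf.predict_proba(feat_id[250:])[:, 1]
scores_ood = rf.predict_proba(feat_ood[250:])[:, 1]
print(f"FPR95: {fpr_at_95_tpr(scores_id, scores_ood):.3f}")
```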
☆ Composition and Alignment of Diffusion Models using Constrained Learning
Diffusion models have become prevalent in generative modeling due to their
ability to sample from complex distributions. To improve the quality of
generated samples and their compliance with user requirements, two commonly
used methods are: (i) Alignment, which involves fine-tuning a diffusion model
to align it with a reward; and (ii) Composition, which combines several
pre-trained diffusion models, each emphasizing a desirable attribute in the
generated outputs. However, trade-offs arise when optimizing for multiple
rewards or combining multiple models, as these often represent competing
properties. Existing methods cannot guarantee that the resulting model
faithfully generates samples with all the desired properties. To address this
gap, we propose a constrained optimization framework that unifies alignment and
composition of diffusion models by enforcing that the aligned model satisfies
reward constraints and/or remains close to (potentially multiple) pre-trained
models. We provide a theoretical characterization of the solutions to the
constrained alignment and composition problems and develop a Lagrangian-based
primal-dual training algorithm to approximate these solutions. Empirically, we
demonstrate the effectiveness of our approach on image generation, applying it
to both alignment and composition, and show that the aligned or composed model
satisfies the constraints and improves on the equally-weighted baseline. Our
implementation can be found at
https://github.com/shervinkhalafi/constrained_comp_align.
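The Lagrangian primal-dual scheme mentioned above has a generic skeleton, illustrated by the toy sketch below on a scalar problem; the diffusion-specific objective and divergence constraints from the paper are not reproduced, and the function names are placeholders.

```python
import torch

# Schematic primal-dual loop for a generic constrained problem
#   min_theta f(theta)  s.t.  g_i(theta) <= 0,
# standing in for "optimize reward subject to closeness/reward constraints".
def primal_dual(f, constraints, theta, lr_primal=1e-2, lr_dual=1e-1, steps=2000):
    lam = torch.zeros(len(constraints))
    opt = torch.optim.Adam([theta], lr=lr_primal)
    for _ in range(steps):
        g = torch.stack([c(theta) for c in constraints])
        lagrangian = f(theta) + (lam * g).sum()
        opt.zero_grad()
        lagrangian.backward()
        opt.step()                                            # primal descent
        with torch.no_grad():
            lam = torch.clamp(lam + lr_dual * g, min=0.0)     # dual ascent
    return theta, lam

# Toy usage: minimize (x - 3)^2 subject to x <= 1; x should approach ~1
# with an active (positive) multiplier.
theta = torch.tensor([0.0], requires_grad=True)
theta, lam = primal_dual(lambda t: ((t - 3) ** 2).sum(),
                         [lambda t: (t - 1.0).sum()], theta)
print(theta.detach(), lam)
```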
☆ Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data
MR imaging is a valuable diagnostic tool that non-invasively visualizes
patient anatomy and pathology with high soft-tissue contrast. However, MRI
acquisition is typically time-consuming, leading to patient discomfort and
increased costs to the healthcare system. Recent years have seen substantial
research effort into the development of methods that allow for accelerated MRI
acquisition while still obtaining a reconstruction that appears similar to the
fully-sampled MR image. However, for many applications a perfectly
reconstructed MR image may not be necessary, particularly when the primary
goal is a downstream task such as segmentation. This has led to growing
interest in methods that aim to perform segmentation directly on accelerated
MRI data. Despite recent advances, existing methods have largely been developed
in isolation, without direct comparison to one another, often using separate or
private datasets, and lacking unified evaluation standards. To date, no
high-quality, comprehensive comparison of these methods exists, and the optimal
strategy for segmenting accelerated MR data remains unknown. This paper
provides the first unified benchmark for the segmentation of undersampled MRI
data, comparing 7 approaches. A particular focus is placed on comparing
one-stage approaches, which combine reconstruction and segmentation into a
unified model, with two-stage approaches, which apply established MRI
reconstruction methods followed by a segmentation network. We
test these methods on two MRI datasets that include multi-coil k-space data as
well as human-annotated segmentation ground truth. We find that simple
two-stage methods that enforce data consistency achieve the best segmentation
scores, surpassing complex specialized methods developed specifically for this
task.
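To make concrete what a data-consistency step means in this context, the snippet below shows a minimal single-coil version: the reconstruction's k-space values at sampled locations are overwritten with the acquired measurements (multi-coil pipelines additionally involve coil sensitivities). This is an illustrative sketch, not any specific benchmarked method.

```python
import numpy as np

def data_consistency(recon, kspace_measured, mask):
    """Enforce agreement with acquired k-space samples: keep the measured values
    where mask is True, keep the reconstruction's k-space elsewhere."""
    k = np.fft.fft2(recon)
    k = np.where(mask, kspace_measured, k)
    return np.fft.ifft2(k).real

# Toy usage with a random undersampling mask
rng = np.random.default_rng(0)
img = rng.normal(size=(128, 128))
mask = rng.random((128, 128)) < 0.3
kspace = np.fft.fft2(img) * mask
print(data_consistency(np.zeros_like(img), kspace, mask).shape)
```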
☆ Lossless 4:2:0 Screen Content Coding Using Luma-Guided Soft Context Formation
The soft context formation coder is a pixel-wise state-of-the-art lossless
screen content coder using pattern matching and color palette coding in
combination with arithmetic coding. It achieves excellent compression
performance on screen content images in RGB 4:4:4 format with few distinct
colors. In contrast to many other lossless compression methods, it codes entire
color pixels at once, i.e., all color components of one pixel are coded
together. Consequently, it does not natively support image formats with
downsampled chroma, such as YCbCr 4:2:0, a chroma format commonly used in
video compression. In this paper, we extend the soft context formation coder
to 4:2:0 image compression by successively coding the Y and
CbCr planes based on an analysis of normalized mutual information between image
planes. Additionally, we propose an enhancement to the chroma prediction based
on the luminance plane. Furthermore, we propose to transmit side-information
about occurring luma-chroma combinations to improve chroma probability
distribution modelling. Averaged over a large screen content image dataset, our
proposed method outperforms HEVC-SCC, which needs 5.66% more bitrate than our
method.
comment: 5 pages, 4 figures, 3 tables, accepted to EUSIPCO 2025
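The normalized-mutual-information analysis referenced above can be illustrated with a small synthetic example: treat co-located luma and chroma samples as discrete symbols and compute their NMI with scikit-learn. The planes and quantization here are stand-ins, not the paper's actual data.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
y = rng.integers(0, 8, size=(64, 64))                       # toy luma plane
cb = (y // 2 + rng.integers(0, 2, size=(64, 64))) % 8        # chroma correlated with luma

# For 4:2:0, chroma is subsampled 2x in both directions, so compare against the
# co-located top-left luma sample of each 2x2 block.
nmi = normalized_mutual_info_score(y[::2, ::2].ravel(), cb[::2, ::2].ravel())
print(f"NMI(Y, Cb) = {nmi:.3f}")
```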
☆ HOTSPOT-YOLO: A Lightweight Deep Learning Attention-Driven Model for Detecting Thermal Anomalies in Drone-Based Solar Photovoltaic Inspections
Thermal anomaly detection in solar photovoltaic (PV) systems is essential for
ensuring operational efficiency and reducing maintenance costs. In this study,
we developed HOTSPOT-YOLO, a lightweight artificial intelligence (AI)
model that integrates an efficient convolutional neural network backbone and
attention mechanisms to improve object detection. This model is specifically
designed for drone-based thermal inspections of PV systems, addressing the
unique challenges of detecting small and subtle thermal anomalies, such as
hotspots and defective modules, while maintaining real-time performance.
Experimental results demonstrate a mean average precision of 90.8%, reflecting
a significant improvement over baseline object detection models. With a reduced
computational load and robustness under diverse environmental conditions,
HOTSPOT-YOLO offers a scalable and reliable solution for large-scale PV
inspections. This work highlights the integration of advanced AI techniques
with practical engineering applications, advancing automated fault detection
in renewable energy systems.
☆ A Closer Look at Edema Area Segmentation in SD-OCT Images Using Adversarial Framework
The development of artificial intelligence models for macular edema (ME)
analysis always relies on expert-annotated pixel-level image datasets which are
expensive to collect prospectively. While anomaly-detection-based
weakly-supervised methods have shown promise in the edema area (EA) segmentation
task, their performance still lags behind fully-supervised approaches. In this
paper, we leverage the strong correlation between EA and retinal layers in
spectral-domain optical coherence tomography (SD-OCT) images, along with the
update characteristics of weakly-supervised learning, to enhance an
off-the-shelf adversarial framework for EA segmentation with a novel
layer-structure-guided post-processing step and a test-time-adaptation (TTA)
strategy. By incorporating additional retinal layer information, our framework
reframes the dense EA prediction task as one of confirming intersection points
between the EA contour and retinal layers, resulting in predictions that
better align with the shape prior of EA. In addition, the TTA framework further
helps address discrepancies in the manifestations and presentations of EA
between training and test sets. Extensive experiments on two publicly
available datasets demonstrate that these two proposed ingredients can improve
the accuracy and robustness of EA segmentation, bridging the gap between
weakly-supervised and fully-supervised models.
☆ ModAn-MulSupCon: Modality- and Anatomy-Aware Multi-Label Supervised Contrastive Pretraining for Medical Imaging
Background and objective: Expert annotations limit large-scale supervised
pretraining in medical imaging, while ubiquitous metadata (modality, anatomical
region) remain underused. We introduce ModAn-MulSupCon, a modality- and
anatomy-aware multi-label supervised contrastive pretraining method that
leverages such metadata to learn transferable representations.
Method: Each image's modality and anatomy are encoded as a multi-hot vector.
A ResNet-18 encoder is pretrained on a mini subset of RadImageNet (miniRIN,
16,222 images) with a Jaccard-weighted multi-label supervised contrastive loss,
and then evaluated by fine-tuning and linear probing on three binary
classification tasks--ACL tear (knee MRI), lesion malignancy (breast
ultrasound), and nodule malignancy (thyroid ultrasound).
Result: With fine-tuning, ModAn-MulSupCon achieved the best AUC on MRNet-ACL
(0.964) and Thyroid (0.763), surpassing all baselines ($p<0.05$), and ranked
second on Breast (0.926) behind SimCLR (0.940; not significant). With the
encoder frozen, SimCLR/ImageNet were superior, indicating that ModAn-MulSupCon
representations benefit most from task adaptation rather than linear
separability.
Conclusion: Encoding readily available modality/anatomy metadata as
multi-label targets provides a practical, scalable pretraining signal that
improves downstream accuracy when fine-tuning is feasible. ModAn-MulSupCon is a
strong initialization for label-scarce clinical settings, whereas
SimCLR/ImageNet remain preferable for frozen-encoder deployments.
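One plausible reading of the Jaccard-weighted multi-label supervised contrastive loss is sketched below: pairwise weights are the Jaccard similarity between multi-hot modality/anatomy vectors, and each anchor's contrastive log-likelihood terms are weighted accordingly. Tensor shapes and the exact weighting may differ from the paper's formulation.

```python
import torch

def jaccard_weights(labels):
    """Pairwise Jaccard similarity between multi-hot label vectors, shape (N, N)."""
    lf = labels.float()
    inter = lf @ lf.T
    union = lf.sum(1, keepdim=True) + lf.sum(1) - inter
    return inter / union.clamp(min=1)

def jaccard_supcon_loss(feats, labels, tau=0.1):
    """Supervised contrastive loss with Jaccard-similarity pair weights.
    feats: (N, D) L2-normalized embeddings; labels: (N, L) multi-hot metadata."""
    n = feats.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    logits = (feats @ feats.T / tau).masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)      # self-pairs contribute nothing
    w = jaccard_weights(labels).masked_fill(self_mask, 0.0)
    per_anchor = -(w * log_prob).sum(1) / w.sum(1).clamp(min=1e-8)
    return per_anchor.mean()

# Toy usage: 4 images, 3 metadata flags (e.g., modality/anatomy bits)
feats = torch.nn.functional.normalize(torch.randn(4, 16), dim=1)
labels = torch.tensor([[1, 0, 1], [1, 0, 1], [0, 1, 0], [1, 1, 0]])
print(jaccard_supcon_loss(feats, labels))
```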
☆ Stress-testing cross-cancer generalizability of 3D nnU-Net for PET-CT tumor segmentation: multi-cohort evaluation with novel oesophageal and lung cancer datasets
Soumen Ghosh, Christine Jestin Hannan, Rajat Vashistha, Parveen Kundu, Sandra Brosda, Lauren G. Aoude, James Lonie, Andrew Nathanson, Jessica Ng, Andrew P. Barbour, Viktor Vegh
Robust generalization is essential for deploying deep learning based tumor
segmentation in clinical PET-CT workflows, where anatomical sites, scanners,
and patient populations vary widely. This study presents the first cross-cancer
evaluation of nnU-Net on PET-CT, introducing two novel, expert-annotated
whole-body datasets: 279 patients with oesophageal cancer (Australian cohort)
and 54 with lung cancer (Indian cohort). These cohorts complement the public
AutoPET dataset and enable systematic stress-testing of cross-domain
performance. We trained and tested 3D nnU-Net models under three paradigms:
target-only (oesophageal), public-only (AutoPET), and combined training. On the
test sets, the oesophageal-only model achieved the best in-domain accuracy
(mean DSC, 57.8) but failed on the external Indian lung cohort (mean DSC less
than 3.4), indicating severe overfitting. The public-only model generalized
more broadly (mean DSC, 63.5 on AutoPET, 51.6 on the Indian lung cohort) but
underperformed on the Australian oesophageal cohort (mean DSC, 26.7). The
combined approach provided the most balanced results (mean DSC: 52.9 on lung,
40.7 on oesophageal, 60.9 on AutoPET), reducing boundary errors and improving
robustness across all cohorts. These findings demonstrate that dataset
diversity, particularly multi-demographic, multi-center and multi-cancer
integration, outweighs architectural novelty as the key driver of robust
generalization. This work presents a demography-based cross-cancer evaluation
of deep learning segmentation and highlights dataset diversity, rather than
model complexity, as the foundation for clinically robust segmentation.
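For reference, the mean DSC values quoted above are volumetric Dice similarity coefficients; the short helper below computes one for a binary mask pair on the 0-100 scale used in the abstract.

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice similarity coefficient between binary masks, on a 0-100 scale."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 100.0 * 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

pred = np.zeros((8, 8, 8), dtype=bool); pred[2:6, 2:6, 2:6] = True
gt = np.zeros((8, 8, 8), dtype=bool); gt[3:7, 3:7, 3:7] = True
print(f"DSC = {dice_score(pred, gt):.1f}")
```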
♻ ☆ Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization
Many images and videos are primarily processed by computer vision algorithms,
involving only occasional human inspection. When this content requires
compression before processing, e.g., in distributed applications, coding
methods must optimize for both visual quality and downstream task performance.
We first show theoretically that an approach to reduce the effect of
compression for a given task loss is to perform rate-distortion optimization
(RDO) using the distance between features, obtained from the original and the
decoded images, as a distortion metric. However, directly optimizing such a
rate-distortion objective is computationally impractical because it requires
iteratively encoding and decoding the entire image, plus evaluating the
features, for each possible coding configuration. We address this problem by
simplifying the
RDO formulation to make the distortion term computable using block-based
encoders. We first apply Taylor's expansion to the feature extractor, recasting
the feature distance as a quadratic metric involving the Jacobian matrix of the
neural network. Then, we replace the linearized metric with a block-wise
approximation, which we call input-dependent squared error (IDSE). To make the
metric computable, we approximate IDSE using sketches of the Jacobian. The
resulting loss can be evaluated block-wise in the transform domain and combined
with the sum of squared errors (SSE) to address both visual quality and
computer vision performance. Simulations with AVC and HEVC across multiple
feature extractors and downstream networks show up to 17% bit-rate savings for
the same task accuracy compared to RDO based on SSE, with no decoder complexity
overhead and a small (7.86%) encoder complexity increase.
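The sketched-Jacobian idea can be illustrated outside any codec: linearize the feature extractor around the original image, so the feature distance becomes ||J Δ||², and estimate it with a few random Rademacher probes computed via vector-Jacobian products. The snippet below is a whole-image toy; the paper's block-wise, transform-domain formulation for block-based encoders is not reproduced.

```python
import torch
from torch.func import vjp

def sketched_idse(f, x, x_hat, num_sketches=8):
    """Monte-Carlo estimate of ||J_f(x) (x - x_hat)||^2, a stand-in for the
    input-dependent squared error (IDSE) described in the abstract."""
    y, vjp_fn = vjp(f, x)                              # vjp_fn(v) returns v^T J_f(x)
    delta = (x - x_hat).flatten()
    estimate = x.new_zeros(())
    for _ in range(num_sketches):
        s = (torch.randint(0, 2, y.shape) * 2 - 1).to(x)   # Rademacher probe
        (jt_s,) = vjp_fn(s)
        estimate = estimate + (jt_s.flatten() @ delta) ** 2
    return estimate / num_sketches

# Toy feature extractor standing in for a vision backbone
feat = torch.nn.Sequential(torch.nn.Flatten(start_dim=0),
                           torch.nn.Linear(64, 16), torch.nn.Tanh())
x = torch.randn(8, 8)
x_hat = x + 0.01 * torch.randn(8, 8)
print(sketched_idse(lambda t: feat(t), x, x_hat))
```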
♻ ☆ TimeFlow: Temporal Conditioning for Longitudinal Brain MRI Registration and Aging Analysis
Bailiang Jian, Jiazhen Pan, Yitong Li, Fabian Bongratz, Ruochen Li, Daniel Rueckert, Benedikt Wiestler, Christian Wachinger
Longitudinal brain analysis is essential for understanding healthy aging and
identifying pathological deviations. Longitudinal registration of sequential
brain MRI underpins such analyses. However, existing methods are limited by
reliance on densely sampled time series, a trade-off between accuracy and
temporal smoothness, and an inability to prospectively forecast future brain
states. To overcome these challenges, we introduce TimeFlow, a
learning-based framework for longitudinal brain MRI registration. TimeFlow uses
a U-Net backbone with temporal conditioning to model neuroanatomy as a
continuous function of age. Given only two scans from an individual, TimeFlow
estimates accurate and temporally coherent deformation fields, enabling
non-linear extrapolation to predict future brain states. This is achieved by
our proposed inter-/extra-polation consistency constraints applied to both the
deformation fields and deformed images. Remarkably, these constraints preserve
temporal consistency and continuity without requiring explicit smoothness
regularizers or densely sampled sequential data. Extensive experiments
demonstrate that TimeFlow outperforms state-of-the-art methods in terms of both
future timepoint forecasting and registration accuracy. Moreover, TimeFlow
supports novel biological brain aging analyses by differentiating
neurodegenerative trajectories from normal aging without requiring
segmentation, thereby eliminating the need for labor-intensive annotations and
mitigating segmentation inconsistency. TimeFlow offers an accurate,
data-efficient, and annotation-free framework for longitudinal analysis of
brain aging and chronic diseases, capable of forecasting brain changes beyond
the observed study period.
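The inter-/extrapolation consistency idea can be sketched in a generic form: a registration network conditioned on source and target ages should produce deformations that compose consistently, i.e., warping to an intermediate age and then onward should agree with warping directly. The code below is a speculative illustration with a hypothetical reg_net interface, not the paper's actual constraint.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp img (B,1,H,W) with a displacement field flow (B,2,H,W) in normalized coords."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).to(img)          # (1, H, W, 2)
    return F.grid_sample(img, base + flow.permute(0, 2, 3, 1), align_corners=True)

def composition_consistency(reg_net, img_src, t_src, t_tgt):
    """Penalize disagreement between a direct warp t_src->t_tgt and a two-step warp
    through the midpoint age; reg_net(img, t_from, t_to) -> flow is hypothetical."""
    t_mid = 0.5 * (t_src + t_tgt)
    img_direct = warp(img_src, reg_net(img_src, t_src, t_tgt))
    img_mid = warp(img_src, reg_net(img_src, t_src, t_mid))
    img_two_step = warp(img_mid, reg_net(img_mid, t_mid, t_tgt))
    return F.l1_loss(img_two_step, img_direct)

# Toy usage with a zero-flow stand-in network (loss should be ~0)
reg_net = lambda img, t0, t1: torch.zeros(img.size(0), 2, *img.shape[-2:])
print(composition_consistency(reg_net, torch.rand(1, 1, 32, 32),
                              torch.tensor(60.0), torch.tensor(70.0)))
```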
♻ ☆ MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation
Dingwei Fan, Junyong Zhao, Chunlin Li, Mingliang Wang, Qi Zhu, Haipeng Si, Daoqiang Zhang, Liang Sun
Spine image segmentation is crucial for clinical diagnosis and treatment of
spine diseases. The complex structure of the spine and the high morphological
similarity between individual vertebrae and adjacent intervertebral discs make
accurate spine segmentation a challenging task. Although the Segment Anything
Model (SAM) has been proposed, it still struggles to effectively capture and
utilize morphological information, limiting its ability to enhance spine image
segmentation performance. To address these challenges, in this paper, we
propose a MorphSAM that explicitly learns morphological information from
atlases, thereby strengthening the spine image segmentation performance of SAM.
Specifically, the MorphSAM includes two fully automatic prompt learning
networks, 1) an anatomical prompt learning network that directly learns
morphological information from anatomical atlases, and 2) a semantic prompt
learning network that derives morphological information from text descriptions
converted from the atlases. Then, the two learned morphological prompts are fed
into the SAM model to boost the segmentation performance. We validate our
MorphSAM on two spine image segmentation tasks, including a spine anatomical
structure segmentation task with CT images and a lumbosacral plexus
segmentation task with MR images. Experimental results demonstrate that our
MorphSAM achieves superior segmentation performance when compared to the
state-of-the-art methods.
comment: The manuscript has been withdrawn by the authors due to substantial
revisions. A thoroughly revised version will be submitted in the future
♻ ☆ Uni-AIMS: AI-Powered Microscopy Image Analysis
Yanhui Hong, Nan Wang, Zhiyi Xia, Haoyi Tao, Xi Fang, Yiming Li, Jiankun Wang, Peng Jin, Xiaochen Cai, Shengyu Li, Ziqi Chen, Zezhong Zhang, Guolin Ke, Linfeng Zhang
This paper presents a systematic solution for the intelligent recognition and
automatic analysis of microscopy images. We developed a data engine that
generates high-quality annotated datasets through a combination of the
collection of diverse microscopy images from experiments, synthetic data
generation and a human-in-the-loop annotation process. To address the unique
challenges of microscopy images, we propose a segmentation model capable of
robustly detecting both small and large objects. The model effectively
identifies and separates thousands of closely situated targets, even in
cluttered visual environments. Furthermore, our solution supports the precise
automatic recognition of image scale bars, an essential feature in quantitative
microscopic analysis. Building upon these components, we have constructed a
comprehensive intelligent analysis platform and validated its effectiveness and
practicality in real-world applications. This study not only advances automatic
recognition in microscopy imaging but also ensures scalability and
generalizability across multiple application domains, offering a powerful tool
for automated microscopic analysis in interdisciplinary research. An online
application is made available for researchers to access and evaluate the
proposed automated analysis service.
♻ ☆ Enhancing Underwater Images via Deep Learning: A Comparative Study of VGG19 and ResNet50-Based Approaches ICIP
This paper addresses the challenging problem of image enhancement in complex
underwater scenes by proposing a solution based on deep learning. The proposed
method integrates two deep convolutional neural network models,
VGG19 and ResNet50, leveraging their powerful feature extraction capabilities
to perform multi-scale and multi-level deep feature analysis of underwater
images. By constructing a unified model, the complementary advantages of the
two models are effectively integrated, achieving a more comprehensive and
accurate image enhancement effect. To objectively evaluate the enhancement
effect, this paper introduces image quality assessment metrics such as PSNR,
UCIQE, and UIQM to quantitatively compare images before and after enhancement
and analyzes the performance of different models in different scenarios.
Furthermore, to improve the practicality and stability of the
underwater visual enhancement system, this paper also provides practical
suggestions from aspects such as model optimization, multi-model fusion, and
hardware selection, aiming to provide strong technical support for visual
enhancement tasks in complex underwater environments.
comment: 7 pages, 6 figures,2025 IEEE 3rd International Conference on Image
Processing and Computer Applications (ICIPCA 2025)
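Of the metrics listed, PSNR is the only full-reference one and has a closed form; a minimal implementation is shown below. UCIQE and UIQM are no-reference underwater quality metrics with more involved definitions that are not reproduced here.

```python
import numpy as np

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64, 3))
noisy = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255)
print(f"PSNR = {psnr(ref, noisy):.2f} dB")
```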
♻ ☆ Deshadow-Anything: When Segment Anything Model Meets Zero-shot shadow removal
Segment Anything (SAM), an advanced universal image segmentation model
trained on an expansive visual dataset, has set a new benchmark in image
segmentation and computer vision. However, it struggles to distinguish shadows
from their backgrounds. To address this, we developed Deshadow-Anything: taking
into account the generalization afforded by large-scale datasets, we fine-tuned
the model on such datasets to achieve image shadow removal. The diffusion model
can diffuse along the edges and textures of
an image, helping to remove shadows while preserving the details of the image.
Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input
perturbation (DDPM-AIP) to accelerate the iterative training speed of
diffusion. Experiments on shadow removal tasks demonstrate that these methods
can effectively improve image restoration performance.
comment: We need to make major changes and re-upload
♻ ☆ MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation
De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zhi-Chao Lai, Zeng-Guang Hou
Medical image segmentation plays an important role in various clinical
applications. 2.5D-based segmentation models bridge the computational
efficiency of 2D-based models with the spatial perception capabilities of
3D-based models. However, existing 2.5D-based models primarily adopt a single
encoder to extract features of the target and neighboring slices, failing to
effectively fuse inter-slice information and resulting in suboptimal
segmentation performance. In this study, a novel momentum encoder-based
inter-slice fusion
transformer (MOSformer) is proposed to overcome this issue by leveraging
inter-slice information from multi-scale feature maps extracted by different
encoders. Specifically, dual encoders are employed to enhance feature
distinguishability among different slices. One of the encoders is
moving-averaged to maintain consistent slice representations. Moreover, an
inter-slice fusion transformer (IF-Trans) module is developed to fuse
inter-slice multi-scale features. MOSformer is evaluated on three benchmark
datasets (Synapse, ACDC, and AMOS), achieving a new state-of-the-art with
85.63%, 92.19%, and 85.43% DSC, respectively. These results demonstrate
MOSformer's competitiveness in medical image segmentation.
comment: 13 pages, 9 figures, 8 tables. Under Review
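The "moving-averaged" encoder mentioned above is the standard exponential-moving-average (momentum) update familiar from MoCo-style training; a minimal sketch, with a toy usage, is shown below.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(online, momentum, m=0.999):
    """EMA update: the momentum encoder slowly tracks the online encoder's weights."""
    for p_online, p_momentum in zip(online.parameters(), momentum.parameters()):
        p_momentum.mul_(m).add_(p_online, alpha=1 - m)

# Toy usage: start the momentum encoder as a copy, then update it each training step
online, momentum = nn.Linear(8, 8), nn.Linear(8, 8)
momentum.load_state_dict(online.state_dict())
momentum_update(online, momentum)
```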