Image and Video Processing 14
☆ MOIS-SAM2: Exemplar-based Segment Anything Model 2 for multi-lesion interactive segmentation of neurofibromas in whole-body MRI
Georgii Kolokolnikov, Marie-Lena Schmalhofer, Sophie Götz, Lennart Well, Said Farschtschi, Victor-Felix Mautner, Inka Ristow, Rene Werner
Background and Objectives: Neurofibromatosis type 1 is a genetic disorder
characterized by the development of numerous neurofibromas (NFs) throughout the
body. Whole-body MRI (WB-MRI) is the clinical standard for detection and
longitudinal surveillance of NF tumor growth. Existing interactive segmentation
methods fail to combine high lesion-wise precision with scalability to hundreds
of lesions. This study proposes a novel interactive segmentation model tailored
to this challenge.
Methods: We introduce MOIS-SAM2, a multi-object interactive segmentation
model that extends the state-of-the-art, transformer-based, promptable Segment
Anything Model 2 (SAM2) with exemplar-based semantic propagation. MOIS-SAM2 was
trained and evaluated on 119 WB-MRI scans from 84 NF1 patients acquired using
T2-weighted fat-suppressed sequences. The dataset was split at the patient
level into a training set and four test sets (one in-domain and three
reflecting different domain shift scenarios, e.g., MRI field strength
variation, low tumor burden, differences in clinical site and scanner vendor).
Results: On the in-domain test set, MOIS-SAM2 achieved a scan-wise DSC of
0.60 against expert manual annotations, outperforming baseline 3D nnU-Net (DSC:
0.54) and SAM2 (DSC: 0.35). Performance of the proposed model was maintained
under MRI field strength shift (DSC: 0.53) and scanner vendor variation (DSC:
0.50), and improved in low tumor burden cases (DSC: 0.61). Lesion detection F1
scores ranged from 0.62 to 0.78 across test sets. Preliminary inter-reader
variability analysis showed model-to-expert agreement (DSC: 0.62-0.68),
comparable to inter-expert agreement (DSC: 0.57-0.69).
Conclusions: The proposed MOIS-SAM2 enables efficient and scalable
interactive segmentation of NFs in WB-MRI with minimal user input and strong
generalization, supporting integration into clinical workflows.
☆ An on-chip Pixel Processing Approach with 2.4μs latency for Asynchronous Read-out of SPAD-based dToF Flash LiDARs
Yiyang Liu, Rongxuan Zhang, Istvan Gyongy, Alistair Gorman, Sarrah M. Patanwala, Filip Taneski, Robert K. Henderson
We propose a fully asynchronous peak detection approach for SPAD-based direct
time-of-flight (dToF) flash LiDAR, enabling pixel-wise event-driven depth
acquisition without global synchronization. By allowing pixels to independently
report depth once a sufficient signal-to-noise ratio is achieved, the method
reduces latency, mitigates motion blur, and increases effective frame rate
compared to frame-based systems. The framework is validated under two hardware
implementations: an offline 256$\times$128 SPAD array with PC based processing
and a real-time FPGA proof-of-concept prototype with 2.4$\upmu$s latency for
on-chip integration. Experiments demonstrate robust depth estimation,
reflectivity reconstruction, and dynamic event-based representation under both
static and dynamic conditions. The results confirm that asynchronous operation
reduces redundant background data and computational load, while remaining
tunable via simple hyperparameters. These findings establish a foundation for
compact, low-latency, event-driven LiDAR architectures suited to robotics,
autonomous driving, and consumer applications. In addition, we derive a
semi-closed-form solution for the detection probability of raw-peak-finding
based LiDAR systems, which could benefit both conventional frame-based and the
proposed asynchronous LiDAR systems.
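The pixel-wise event-driven read-out can be illustrated with a minimal sketch: photons accumulate into a per-pixel ToF histogram, and the pixel reports its depth bin as soon as the peak clears an SNR threshold. The threshold rule and `min_count` guard below are illustrative assumptions, not the paper's on-chip circuit.

```python
import numpy as np

def async_peak_detect(timestamps, n_bins=64, snr_thresh=5.0, min_count=5):
    """Accumulate ToF timestamps into a histogram and report a depth bin as
    soon as the peak rises snr_thresh standard deviations above the background
    level (illustrative rule; the paper's FPGA logic may differ)."""
    hist = np.zeros(n_bins, dtype=int)
    for i, t in enumerate(timestamps):
        hist[t] += 1
        peak_bin = int(np.argmax(hist))
        if hist[peak_bin] < min_count:
            continue
        bg = np.delete(hist, peak_bin)           # non-peak bins = background
        if hist[peak_bin] > bg.mean() + snr_thresh * bg.std():
            return peak_bin, i + 1               # depth bin, photons consumed
    return None, len(timestamps)                 # SNR never reached
```

Because each pixel stops as soon as its own SNR condition is met, strongly reflecting pixels report early while weak ones keep integrating, which is the source of the latency and motion-blur benefits described above.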
☆ WaveletGaussian: Wavelet-domain Diffusion for Sparse-view 3D Gaussian Object Reconstruction
3D Gaussian Splatting (3DGS) has become a powerful representation for
image-based object reconstruction, yet its performance drops sharply in
sparse-view settings. Prior works address this limitation by employing
diffusion models to repair corrupted renders, subsequently using them as pseudo
ground truths for later optimization. While effective, such approaches incur
heavy computation from the diffusion fine-tuning and repair steps. We present
WaveletGaussian, a framework for more efficient sparse-view 3D Gaussian object
reconstruction. Our key idea is to shift diffusion into the wavelet domain:
diffusion is applied only to the low-resolution LL subband, while
high-frequency subbands are refined with a lightweight network. We further
propose an efficient online random masking strategy to curate training pairs
for diffusion fine-tuning, replacing the commonly used, but inefficient,
leave-one-out strategy. Experiments across two benchmark datasets, Mip-NeRF 360
and OmniObject3D, show WaveletGaussian achieves competitive rendering quality
while substantially reducing training time.
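The wavelet-domain split can be sketched as follows: a one-level Haar DWT separates the low-resolution LL subband (where diffusion would run) from the high-frequency subbands (left to a lightweight network), and a block-wise random mask fabricates corrupted/clean training pairs. The masking parameters are assumptions for illustration, not the paper's exact strategy.

```python
import numpy as np

def haar_dwt(img):
    """One-level 2D Haar DWT: returns the half-resolution LL subband and
    the three high-frequency subbands (LH, HL, HH)."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, (lh, hl, hh)

def random_mask(img, block=8, p=0.3, seed=0):
    """Online random masking (hypothetical stand-in for the paper's scheme):
    zero out random blocks to pair a corrupted render with its clean image
    for diffusion fine-tuning, avoiding leave-one-out retraining."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    for y in range(0, img.shape[0], block):
        for x in range(0, img.shape[1], block):
            if rng.random() < p:
                out[y:y+block, x:x+block] = 0.0
    return out
```

Running diffusion only on the LL subband quarters the pixel count the expensive model sees, which is where the training-time savings come from.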
☆ FlashGMM: Fast Gaussian Mixture Entropy Model for Learned Image Compression
High-performance learned image compression codecs require flexible
probability models to fit latent representations. Gaussian Mixture Models
(GMMs) were proposed to satisfy this demand, but suffer from a significant
runtime performance bottleneck due to the large Cumulative Distribution
Function (CDF) tables that must be built for rANS coding. This paper introduces
a fast coding algorithm that entirely eliminates this bottleneck. By leveraging
the CDF's monotonic property, our decoder performs a dynamic binary search to
find the correct symbol, eliminating the need for costly table construction and
lookup. Aided by SIMD optimizations and numerical approximations, our approach
accelerates the GMM entropy coding process by up to approximately 90x without
compromising rate-distortion performance, significantly improving the
practicality of GMM-based codecs. The implementation will be made publicly
available at https://github.com/tokkiwa/FlashGMM.
comment: Accepted by IEEE VCIP 2025
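The core decoding idea, finding a symbol by binary search over the monotonic mixture CDF instead of building a table, can be sketched in a few lines. This is a simplified scalar version; rANS state handling, SIMD, and the paper's numerical approximations are omitted.

```python
import math

def gmm_cdf(x, weights, means, stds):
    """Mixture CDF evaluated on the fly -- no CDF table is ever built."""
    return sum(w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
               for w, m, s in zip(weights, means, stds))

def decode_symbol(target, weights, means, stds, lo=-256, hi=256):
    """Return the symbol s with CDF(s-0.5) <= target < CDF(s+0.5) via binary
    search over the alphabet, exploiting the CDF's monotonicity."""
    while lo < hi:
        mid = (lo + hi) // 2
        if gmm_cdf(mid + 0.5, weights, means, stds) <= target:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Each decode costs O(log A) CDF evaluations for an alphabet of size A, versus O(A) work to build a full table per latent, which is where the reported speedup originates.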
☆ RFI Removal from SAR Imagery via Sparse Parametric Estimation of LFM Interferences
One of the challenges in spaceborne synthetic aperture radar (SAR) is
modeling and mitigating radio frequency interference (RFI) artifacts in SAR
imagery. Linear frequency modulated (LFM) signals have been commonly used for
characterizing the radar interferences in SAR. In this letter, we propose a new
signal model that approximates RFI as a mixture of multiple LFM components in
the focused SAR image domain. The azimuth and range frequency modulation (FM)
rates for each LFM component are estimated effectively using a sparse
parametric representation of LFM interferences with a discretized LFM
dictionary. This approach is then tested within the recently developed RFI
suppression framework using a 2-D SPECtral ANalysis (2-D SPECAN) algorithm
through LFM focusing and notch filtering in the spectral domain [1].
Experimental studies on Sentinel-1 single-look complex images demonstrate that
the proposed LFM model and sparse parametric estimation scheme outperform
existing RFI removal methods.
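The dictionary-based FM-rate estimation can be sketched in 1-D: build discretized unit-norm LFM (chirp) atoms over candidate FM rates and pick the rate whose atom correlates most strongly with the signal, i.e. one matching-pursuit step. This is a simplified stand-in for the paper's 2-D azimuth/range estimation.

```python
import numpy as np

def lfm_atom(fm_rate, n=256, fs=1.0):
    """Unit-norm discretized LFM (chirp) atom exp(j*pi*k*t^2)."""
    t = np.arange(n) / fs
    a = np.exp(1j * np.pi * fm_rate * t**2)
    return a / np.linalg.norm(a)

def estimate_fm_rate(signal, candidate_rates):
    """One matching-pursuit step over a discretized LFM dictionary:
    return the candidate FM rate with the largest correlation magnitude."""
    scores = [abs(np.vdot(lfm_atom(k, n=len(signal)), signal))
              for k in candidate_rates]
    return candidate_rates[int(np.argmax(scores))]
```

In the full method this would run jointly over azimuth and range FM rates, after which the estimated LFM components are focused and notch-filtered out in the spectral domain.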
☆ HyperCool: Reducing Encoding Cost in Overfitted Codecs with Hypernetworks
Overfitted image codecs like Cool-chic achieve strong compression by
tailoring lightweight models to individual images, but their encoding is slow
and computationally expensive. To accelerate encoding, Non-Overfitted (N-O)
Cool-chic replaces the per-image optimization with a learned inference model,
trading compression performance for encoding speed. We introduce HyperCool, a
hypernetwork architecture that mitigates this trade-off. Building upon the N-O
Cool-chic framework, HyperCool generates content-adaptive parameters for a
Cool-chic decoder in a single forward pass, tailoring the decoder to the input
image without requiring per-image fine-tuning. Our method achieves a 4.9% rate
reduction over N-O Cool-chic with minimal computational overhead. Furthermore,
the output of our hypernetwork provides a strong initialization for further
optimization, reducing the number of steps needed to approach fully overfitted
model performance. With fine-tuning, HEVC-level compression is achieved with
60.4% of the encoding cost of the fully overfitted Cool-chic. This work
proposes a practical method to accelerate encoding in overfitted image codecs,
improving their viability in scenarios with tight compute budgets.
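The hypernetwork idea can be shown with a toy linear example: a single forward pass maps image features to the flattened weights of a per-image decoder, so no per-image optimization is needed. All sizes and the linear layers are hypothetical; the real HyperCool architecture targets a Cool-chic decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: an 8-D image feature, a 4-D latent, a 16-pixel output.
FEAT, LATENT, PIX = 8, 4, 16

# The hypernetwork is itself an ordinary layer whose *output* is the
# flattened weight matrix of the per-image decoder.
W_hyper = rng.normal(0, 0.1, size=(LATENT * PIX, FEAT))

def hypernetwork(image_feat):
    """Generate content-adaptive decoder weights in one forward pass."""
    return (W_hyper @ image_feat).reshape(PIX, LATENT)

def decode(latent, W_dec):
    """The generated decoder: one linear map from latent to pixels."""
    return W_dec @ latent

feat = rng.normal(size=FEAT)
W_dec = hypernetwork(feat)        # adapted to this image, no fine-tuning
pixels = decode(rng.normal(size=LATENT), W_dec)
```

The generated `W_dec` can also serve as a warm start: fine-tuning from it, rather than from random initialization, is what reduces the encoding cost toward the fully overfitted codec.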
☆ Event-guided 3D Gaussian Splatting for Dynamic Human and Scene Reconstruction
Reconstructing dynamic humans together with static scenes from monocular
videos remains difficult, especially under fast motion, where RGB frames suffer
from motion blur. Event cameras exhibit distinct advantages, e.g., microsecond
temporal resolution, making them a superior sensing choice for dynamic human
reconstruction. Accordingly, we present a novel event-guided human-scene
reconstruction framework that jointly models human and scene from a single
monocular event camera via 3D Gaussian Splatting. Specifically, a unified set
of 3D Gaussians carries a learnable semantic attribute; only Gaussians
classified as human undergo deformation for animation, while scene Gaussians
stay static. To combat blur, we propose an event-guided loss that matches
simulated brightness changes between consecutive renderings with the event
stream, improving local fidelity in fast-moving regions. Our approach removes
the need for external human masks and simplifies managing separate Gaussian
sets. On two benchmark datasets, ZJU-MoCap-Blur and MMHPSD-Blur, it delivers
state-of-the-art human-scene reconstruction, with notable gains over strong
baselines in PSNR/SSIM and reduced LPIPS, especially for high-speed subjects.
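The event-guided loss can be sketched per pixel: an event camera fires one event each time log intensity changes by a contrast threshold, so the log-intensity difference between consecutive renderings predicts a signed event count that can be compared with the observed event stream. Threshold value and the plain L2 form are assumptions for illustration.

```python
import numpy as np

def simulate_events(render_prev, render_curr, contrast=0.2, eps=1e-6):
    """Predicted per-pixel signed event count between two renderings."""
    dlog = np.log(render_curr + eps) - np.log(render_prev + eps)
    return dlog / contrast

def event_guided_loss(render_prev, render_curr, event_map, contrast=0.2):
    """L2 discrepancy between events simulated from consecutive renderings
    and the observed event map (a simplified version of the paper's loss)."""
    pred = simulate_events(render_prev, render_curr, contrast)
    return float(np.mean((pred - event_map) ** 2))
```

Because events have microsecond resolution, this supervision stays informative exactly where RGB frames are blurred, which is why it improves fidelity in fast-moving regions.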
☆ Efficient Breast and Ovarian Cancer Classification via ViT-Based Preprocessing and Transfer Learning
Cancer, particularly breast and ovarian cancer, is one of the leading health
challenges for women. Early detection can improve survival rates through
timely intervention and treatment. Traditional detection methods involve
manually examining mammograms, CT scans, ultrasounds, and other imaging types.
This process is labor-intensive and requires the expertise of trained
pathologists, making it both time-consuming and resource-intensive. In this
paper, we introduce a novel vision transformer
(ViT)-based method for detecting and classifying breast and ovarian cancer. We
use a pre-trained ViT-Base-Patch16-224 model, which is fine-tuned for both
binary and multi-class classification tasks using publicly available
histopathological image datasets. Further, we use a preprocessing pipeline that
converts raw histopathological images into standardized PyTorch tensors, which
are compatible with the ViT architecture and also help improve the model
performance. We evaluated the performance of our model on two benchmark
datasets: the BreakHis dataset for binary classification and the UBC-OCEAN
dataset for five-class classification without any data augmentation. Our model
surpasses existing CNN, ViT, and topological data analysis-based approaches in
binary classification. For multi-class classification, it is evaluated against
recent topological methods and demonstrates superior performance. Our study
highlights the effectiveness of Vision Transformer-based transfer learning
combined with efficient preprocessing in oncological diagnostics.
comment: 10 pages, 3 figures
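The preprocessing step can be sketched with a dependency-free version of the standard ViT input pipeline: resize to 224x224, scale to [0, 1], normalize with ImageNet statistics, and reorder HWC to CHW. The nearest-neighbor resize and the ImageNet constants are standard-practice assumptions, not necessarily the paper's exact pipeline.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD  = np.array([0.229, 0.224, 0.225])

def preprocess(image_u8, size=224):
    """Turn a raw H x W x 3 uint8 histopathology image into the CHW float
    layout ViT-Base-Patch16-224 expects (nearest-neighbor resize stands in
    for the usual bilinear one to stay dependency-free)."""
    h, w, _ = image_u8.shape
    ys = np.arange(size) * h // size          # source row per output row
    xs = np.arange(size) * w // size          # source col per output col
    resized = image_u8[ys][:, xs].astype(np.float32) / 255.0
    normed = (resized - IMAGENET_MEAN) / IMAGENET_STD
    return normed.transpose(2, 0, 1)          # HWC -> CHW
```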
♻ ☆ Saturation-Aware Snapshot Compressive Imaging: Theory and Algorithm
Snapshot Compressive Imaging (SCI) uses coded masks to compress a 3D data
cube into a single 2D snapshot. In practice, multiplexing can push intensities
beyond the sensor's dynamic range, producing saturation that violates the
linear SCI model and degrades reconstruction. This paper provides the first
theoretical characterization of SCI recovery under saturation. We model
clipping as an element-wise nonlinearity and derive a finite-sample recovery
bound for compression-based SCI that links reconstruction error to mask density
and the extent of saturation. The analysis yields a clear design rule: optimal
Bernoulli masks use densities below one-half, decreasing further as saturation
strengthens. Guided by this principle, we optimize mask patterns and introduce
a novel reconstruction framework, Saturation-Aware PnP Net (SAPnet), which
explicitly enforces consistency with saturated measurements. Experiments on
standard video-SCI benchmarks confirm our theory and demonstrate that SAPnet
significantly outperforms existing PnP-based methods.
comment: 13 pages
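The saturated forward model is compact enough to sketch directly: the 3D cube is modulated by per-frame Bernoulli masks, summed into one 2D snapshot, and clipped at the sensor's saturation level, which is the element-wise nonlinearity the paper analyzes. The specific density below follows the stated design rule (below one-half); the data here is synthetic.

```python
import numpy as np

def sci_snapshot(cube, masks, sat_level=1.0):
    """Saturated SCI forward model: mask, sum over frames, then clip."""
    y = np.sum(masks * cube, axis=0)
    return np.minimum(y, sat_level)

rng = np.random.default_rng(0)
T, H, W = 8, 32, 32
density = 0.4                      # Bernoulli mask density below one-half
masks = (rng.random((T, H, W)) < density).astype(float)
cube = rng.random((T, H, W))       # synthetic video cube in [0, 1]
y = sci_snapshot(cube, masks, sat_level=1.0)
```

Raising the density increases the expected summed intensity per pixel and hence the fraction of clipped measurements, which is the trade-off behind the density rule.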
♻ ☆ DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting
Sparse-view 3D Gaussian Splatting (3DGS) presents significant challenges in
reconstructing high-quality novel views, as it often overfits to the
widely-varying high-frequency (HF) details of the sparse training views. While
frequency regularization can be a promising approach, its typical reliance on
Fourier transforms causes difficult parameter tuning and biases towards
detrimental HF learning. We propose DWTGS, a framework that rethinks frequency
regularization by leveraging wavelet-space losses that provide additional
spatial supervision. Specifically, we supervise only the low-frequency (LF) LL
subbands at multiple DWT levels, while enforcing sparsity on the HF HH subband
in a self-supervised manner. Experiments across benchmarks show that DWTGS
consistently outperforms Fourier-based counterparts, as this LF-centric
strategy improves generalization and reduces HF hallucinations.
comment: Accepted to VCIP 2025
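The wavelet-space loss can be sketched with a Haar DWT: supervise only the LL subband at each level against the target, and penalize the L1 norm of the render's HH subband with no target at all. The two-level depth and the HH weight are assumptions for illustration.

```python
import numpy as np

def haar2d(img):
    """One-level 2D Haar DWT returning the four subbands."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    return {"LL": (a+b+c+d)/2, "LH": (a-b+c-d)/2,
            "HL": (a+b-c-d)/2, "HH": (a-b-c+d)/2}

def dwtgs_loss(render, target, levels=2, hh_weight=0.1):
    """LF-centric wavelet loss in the spirit of DWTGS: L2 on LL subbands
    across DWT levels plus self-supervised L1 sparsity on the render's HH."""
    loss = 0.0
    r, t = render, target
    for _ in range(levels):
        r_sub, t_sub = haar2d(r), haar2d(t)
        loss += np.mean((r_sub["LL"] - t_sub["LL"]) ** 2)   # LF supervision
        loss += hh_weight * np.mean(np.abs(r_sub["HH"]))    # HF sparsity
        r, t = r_sub["LL"], t_sub["LL"]                     # recurse on LL
    return float(loss)
```

Unlike a Fourier low-pass loss, each wavelet coefficient is spatially localized, which is the extra spatial supervision the abstract refers to.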
♻ ☆ Fix your downsampling ASAP! Be natively more robust via Aliasing and Spectral Artifact free Pooling
Convolutional Neural Networks (CNNs) are successful in various computer
vision tasks. From an image and signal processing point of view, this success
is counter-intuitive, as the inherent spatial pyramid design of most CNNs
apparently violates basic signal processing laws, i.e., the Sampling Theorem,
in their downsampling operations. This issue was broadly neglected until
recent work in the context of adversarial attacks and distribution shifts
showed that there is a strong correlation between the vulnerability of CNNs and
aliasing artifacts induced by bandlimit-violating downsampling. As a remedy, we
propose an alias-free downsampling operation in the frequency domain, denoted
Frequency Low Cut Pooling (FLC Pooling) which we further extend to Aliasing and
Sinc Artifact-free Pooling (ASAP). ASAP is alias-free and removes further
artifacts from sinc-interpolation. Our experimental evaluation on ImageNet-1k,
ImageNet-C and CIFAR datasets on various CNN architectures demonstrates that
networks using FLC Pooling and ASAP as downsampling methods learn more stable
features as measured by their robustness against common corruptions and
adversarial attacks, while maintaining a clean accuracy similar to the
respective baseline models.
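FLC Pooling itself is simple enough to sketch: transform to the frequency domain, keep only the low-frequency half of the spectrum, and transform back, so no frequency above the new Nyquist limit survives and aliasing is impossible by construction. This is a minimal single-channel version; the paper's ASAP variant additionally suppresses sinc-interpolation artifacts.

```python
import numpy as np

def flc_pool(x):
    """Frequency Low Cut Pooling: downsample a 2D map by 2 by cropping the
    centered spectrum to its low-frequency quadrant, then inverting."""
    h, w = x.shape
    F = np.fft.fftshift(np.fft.fft2(x))
    ch, cw = h // 2, w // 2
    low = F[ch - h//4 : ch + h//4, cw - w//4 : cw + w//4]
    # rescale so a constant image keeps its mean after pooling
    return np.real(np.fft.ifft2(np.fft.ifftshift(low))) / 4.0
```

A strided convolution would fold a Nyquist-rate pattern (e.g. a checkerboard) down into low frequencies as aliasing; here the crop discards it entirely, which is the mechanism behind the robustness gains.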
♻ ☆ L2M-Reg: Building-level Uncertainty-aware Registration of Outdoor LiDAR Point Clouds and Semantic 3D City Models
Accurate registration between LiDAR (Light Detection and Ranging) point
clouds and semantic 3D city models is a fundamental topic in urban digital
twinning and a prerequisite for downstream tasks, such as digital construction,
change detection and model refinement. However, achieving accurate
LiDAR-to-Model registration at individual building level remains challenging,
particularly due to the generalization uncertainty in semantic 3D city models
at the Level of Detail 2 (LoD2). This paper addresses this gap by proposing
L2M-Reg, a plane-based fine registration method that explicitly accounts for
model uncertainty. L2M-Reg consists of three key steps: establishing reliable
plane correspondence, building a pseudo-plane-constrained Gauss-Helmert model,
and adaptively estimating vertical translation. Experiments on three real-world
datasets demonstrate that L2M-Reg is both more accurate and computationally
efficient than existing ICP-based and plane-based methods. Overall, L2M-Reg
provides a novel building-level solution regarding LiDAR-to-Model registration
when model uncertainty is present.
comment: Submitted to the ISPRS Journal of Photogrammetry and Remote Sensing
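The plane-based fine registration can be illustrated with a stripped-down, translation-only least squares: given scan points lying on corresponding model planes with unit normals n_i and offsets d_i, find the translation t minimizing the point-to-plane residuals n_i . (p_i + t) - d_i. The full method uses a pseudo-plane-constrained Gauss-Helmert model and adaptive vertical-translation handling, which this sketch omits.

```python
import numpy as np

def plane_translation(points, normals, offsets):
    """Least-squares translation t minimizing point-to-plane residuals
    n_i . (p_i + t) - d_i (translation-only toy version of L2M-Reg)."""
    A = normals                                    # (N, 3) unit plane normals
    b = offsets - np.einsum("ij,ij->i", normals, points)
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t
```

At least three planes with linearly independent normals are needed for a unique solution, which is why roof and facade planes are both valuable at the building level.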
♻ ☆ Efficient Sub-pixel Motion Compensation in Learned Video Codecs
Motion compensation is a key component of video codecs. Conventional codecs
(HEVC and VVC) have carefully refined this coding step, with an important focus
on sub-pixel motion compensation. On the other hand, learned codecs achieve
sub-pixel motion compensation through simple bilinear filtering. This paper
proposes to improve learned codec motion compensation by drawing inspiration
from conventional codecs. It is shown that the use of more advanced
interpolation filters, block-based motion information, and finite motion
accuracy leads to better compression performance and lower decoding
complexity. Experimental
results are provided on the Cool-chic video codec, where we demonstrate a rate
decrease of more than 10% and a lowering of motion-related decoding complexity
from 391 MAC per pixel to 214 MAC per pixel. All contributions are made
open-source at https://github.com/Orange-OpenSource/Cool-Chic
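The filter-quality gap can be illustrated in 1-D: conventional codecs interpolate half-pel positions with longer filters, such as the H.264 6-tap half-pel filter used below, while learned codecs typically use the 2-tap bilinear kernel. This is a generic sketch of sub-pixel interpolation, not the filters adopted in the Cool-chic experiments.

```python
import numpy as np

# H.264-style 6-tap half-pel filter vs. the 2-tap bilinear kernel used by
# most learned codecs (both sets of taps sum to 1).
SIXTAP = np.array([1, -5, 20, 20, -5, 1]) / 32.0
bilinear = np.array([0.5, 0.5])

def half_pel(row, taps):
    """Interpolate values at positions i + 0.5 with a symmetric FIR filter,
    edge-padding the signal so the output keeps the input length."""
    pad = len(taps) // 2 - 1
    padded = np.pad(row, (pad, pad + 1), mode="edge")
    return np.convolve(padded, taps[::-1], mode="valid")
```

Both filters reproduce a linear ramp exactly at interior samples; the 6-tap filter's advantage appears on signals with high-frequency content, where bilinear interpolation acts as a strong low-pass and blurs the prediction.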
♻ ☆ A Chain-of-thought Reasoning Breast Ultrasound Dataset Covering All Histopathology Categories
Haojun Yu, Youcheng Li, Zihan Niu, Nan Zhang, Xuantong Gong, Huan Li, Zhiying Zou, Haifeng Qi, Zhenxiao Cao, Zijie Lan, Xingjian Yuan, Jiating He, Haokai Zhang, Shengtao Zhang, Zicheng Wang, Dong Wang, Ziwei Zhao, Congying Chen, Yong Wang, Wangyan Qin, Qingli Zhu, Liwei Wang
Breast ultrasound (BUS) is an essential tool for diagnosing breast lesions,
with millions of examinations per year. However, publicly available
high-quality BUS benchmarks for AI development are limited in data scale and
annotation richness. In this work, we present BUS-CoT, a BUS dataset for
chain-of-thought (CoT) reasoning analysis, which contains 11,439 images of
10,019 lesions from 4,838 patients and covers all 99 histopathology types. To
facilitate research on incentivizing CoT reasoning, we construct the reasoning
processes based on observation, feature, diagnosis and pathology labels,
annotated and verified by experienced experts. Moreover, by covering lesions of
all histopathology types, we aim to support AI systems that remain robust on
rare cases, which can be error-prone in clinical practice.