Image and Video Processing 14
☆ A High-Level Feature Model to Predict the Encoding Energy of a Hardware Video Encoder
In today's society, live video streaming and user generated content streamed
from battery powered devices are ubiquitous. Live streaming requires real-time
video encoding, and hardware video encoders are well suited for such an
encoding task. In this paper, we introduce a high-level feature model using
Gaussian process regression that can predict the encoding energy of a hardware
video encoder. In an evaluation setup restricted to only P-frames and a single
keyframe, the model can predict the encoding energy with a mean absolute
percentage error of approximately 9%. Further, we demonstrate with an ablation
study that spatial resolution is a key high-level feature for encoding energy
prediction of a hardware encoder. A practical application of our model is that
it can be used to perform a prior estimation of the energy required to encode a
video at various spatial resolutions, with different coding standards and codec
presets.
comment: Accepted for Picture Coding Symposium (PCS) 2025
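A minimal sketch of the kind of pipeline the abstract describes, not the authors' model: Gaussian process regression on hypothetical high-level features (pixels per frame, frame rate, frame count, preset index), evaluated with the mean absolute percentage error. All data below is synthetic placeholder data.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical high-level features: [pixels per frame, fps, frame count, preset index]
X = rng.uniform([0.1e6, 24, 100, 0], [8.3e6, 60, 600, 7], size=(200, 4))
# Placeholder "encoding energy" target in joules (synthetic, not measured data)
y = 1e-5 * X[:, 0] * X[:, 2] / X[:, 1] * (1 + 0.1 * X[:, 3]) + rng.normal(0, 5, 200)

X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

model = make_pipeline(
    StandardScaler(),
    GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=np.ones(4)),
                             normalize_y=True),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"MAPE: {100 * mean_absolute_percentage_error(y_test, y_pred):.1f}%")
```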
☆ MH-LVC: Multi-Hypothesis Temporal Prediction for Learned Conditional Residual Video Coding
Huu-Tai Phung, Zong-Lin Gao, Yi-Chen Yao, Kuan-Wei Ho, Yi-Hsin Chen, Yu-Hsiang Lin, Alessandro Gnutti, Wen-Hsiao Peng
This work, termed MH-LVC, presents a multi-hypothesis temporal prediction
scheme that employs long- and short-term reference frames in a conditional
residual video coding framework. Recent temporal context mining approaches to
conditional video coding offer superior coding performance. However, the need
to store and access a large amount of implicit contextual information extracted
from past decoded frames in decoding a video frame poses a challenge due to
excessive memory access. Our MH-LVC overcomes this issue by storing multiple
long- and short-term reference frames but limiting the number of reference
frames used at a time for temporal prediction to two. Our decoded frame buffer
management allows the encoder to flexibly utilize the long-term key frames to
mitigate temporal cascading errors and the short-term reference frames to
minimize prediction errors. Moreover, our buffering scheme enables the temporal
prediction structure to be adapted to individual input videos. While this
flexibility is common in traditional video codecs, it has not been fully
explored for learned video codecs. Extensive experiments show that the proposed
method outperforms VTM-17.0 under the low-delay B configuration in terms of
PSNR-RGB across commonly used test datasets, and performs comparably to
state-of-the-art learned codecs (e.g., DCVC-FM) while requiring a smaller decoded
frame buffer and offering similar decoding time.
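A hedged sketch of the buffering logic implied by the abstract (the key-frame interval and buffer sizes are assumptions, not the authors' implementation): the decoded-frame buffer stores several long- and short-term references but hands at most two of them to the temporal predictor per coded frame.
```python
from collections import deque

class DecodedFrameBuffer:
    def __init__(self, max_short=2, max_long=2, key_interval=32):
        self.short = deque(maxlen=max_short)   # most recent decoded frames
        self.long = deque(maxlen=max_long)     # periodic long-term key frames
        self.key_interval = key_interval       # assumed key-frame spacing

    def references_for(self, frame_idx):
        """Return at most two references: one long-term, one short-term."""
        long_ref = self.long[-1] if self.long else None
        short_ref = self.short[-1] if self.short else None
        return [r for r in (long_ref, short_ref) if r is not None]

    def insert(self, frame_idx, decoded_frame):
        if frame_idx % self.key_interval == 0:
            self.long.append(decoded_frame)    # mitigates temporal cascading errors
        self.short.append(decoded_frame)       # minimizes short-range prediction error

# Usage: buf.references_for(i) feeds the conditional residual coder for frame i,
# then buf.insert(i, reconstruction) is called after decoding frame i.
```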
☆ Targeted Pooled Latent-Space Steganalysis Applied to Generative Steganography, with a Fix
Steganographic schemes dedicated to generated images modify the seed vector
in the latent space to embed a message, whereas most steganalysis methods
attempt to detect the embedding in the image space. This paper proposes to
perform steganalysis in the latent space by modeling the statistical
distribution of the norm of the latent vector. Specifically, we analyze the
practical security of a scheme proposed by Hu et al. for latent diffusion
models, which is both robust and practically undetectable when steganalysis is
performed on generated images. We show that after embedding, the Stego (latent)
vector is distributed on a hypersphere while the Cover vector is i.i.d.
Gaussian. By going from the image space to the latent space, we show that it is
possible to model the norm of the vector in the latent space under the Cover or
Stego hypothesis as Gaussian distributions with different variances. A
Likelihood Ratio Test is then derived to perform pooled steganalysis. The
impact of potential knowledge of the prompt and of the number of diffusion
steps is also studied. Additionally, we show how, by randomly sampling
the norm of the latent vector before generation, the initial Stego scheme
becomes undetectable in the latent space.
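A sketch of the pooled likelihood ratio test implied by the abstract; the Gaussian parameters $(\mu_0,\sigma_0)$ and $(\mu_1,\sigma_1)$ for the latent-norm distributions under the two hypotheses, and the threshold $\tau$, are assumed to be estimated offline.
```latex
% Pooled LRT on the norms x_i = ||z_i|| of N inspected latent vectors.
\[
\Lambda(x_1,\dots,x_N) \;=\; \sum_{i=1}^{N}
\log \frac{\mathcal{N}(x_i;\,\mu_1,\sigma_1^2)}{\mathcal{N}(x_i;\,\mu_0,\sigma_0^2)}
\;=\; \sum_{i=1}^{N}\left[
\log\frac{\sigma_0}{\sigma_1}
+\frac{(x_i-\mu_0)^2}{2\sigma_0^2}
-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}
\right]
\;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\;\tau,
\]
% where H0 is the Cover hypothesis, H1 the Stego hypothesis, and tau is chosen
% for a target false-alarm rate.
```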
☆ An Empirical Study of Reducing AV1 Decoder Complexity and Energy Consumption via Encoder Parameter Tuning
The widespread adoption of advanced video codecs such as AV1 is often
hindered by their high decoding complexity, posing a challenge for
battery-constrained devices. While encoders can be configured to produce
bitstreams that are decoder-friendly, estimating the decoding complexity and
energy overhead for a given video is non-trivial. In this study, we
systematically analyse the impact of disabling various coding tools and
adjusting coding parameters in two AV1 encoders, libaom-av1 and SVT-AV1. Using
system-level energy measurement tools such as RAPL (Running Average Power Limit)
and Intel SoC Watch (integrated with the VTune profiler), we quantify the resulting
trade-offs between decoding complexity, energy consumption, and compression
efficiency for decoding a bitstream. Our results demonstrate that specific
encoder configurations can substantially reduce decoding complexity with
minimal perceptual quality degradation. For libaom-av1, disabling CDEF, an
in-loop filter, yields a mean reduction in decoding cycles of 10%. For
SVT-AV1, using the built-in fast-decode=2 preset achieves a more substantial
24% reduction in decoding cycles. These findings provide strategies for content
providers to lower the energy footprint of AV1 video streaming.
comment: Accepted Camera-Ready paper for PCS 2025, 5 Pages
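A hedged sketch of how the two encoder configurations named in the abstract could be invoked; the exact flag spellings (aomenc's --enable-cdef and SvtAv1EncApp's --fast-decode) are assumptions and may differ across encoder versions.
```python
import subprocess

def encode_libaom_no_cdef(src_y4m, out_ivf, cq=32):
    # Disable the CDEF in-loop filter, trading a little quality for roughly
    # 10% fewer decoding cycles (per the abstract's libaom-av1 result).
    cmd = ["aomenc", "--end-usage=q", f"--cq-level={cq}",
           "--enable-cdef=0", "-o", out_ivf, src_y4m]
    subprocess.run(cmd, check=True)

def encode_svt_fast_decode(src_y4m, out_ivf, crf=32):
    # fast-decode=2 biases SVT-AV1's mode decisions toward lighter decoding
    # (the abstract reports ~24% fewer decoding cycles).
    cmd = ["SvtAv1EncApp", "-i", src_y4m, "--crf", str(crf),
           "--fast-decode", "2", "-b", out_ivf]
    subprocess.run(cmd, check=True)
```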
☆ LiteVPNet: A Lightweight Network for Video Encoding Control in Quality-Critical Applications
In the last decade, video workflows in the cinema production ecosystem have
presented new use cases for video streaming technology. These new workflows,
e.g. in On-set Virtual Production, present the challenge of requiring precise
quality control and energy efficiency. Existing approaches to transcoding often
fall short of these requirements, either due to a lack of quality control or
computational overhead. To fill this gap, we present a lightweight neural
network (LiteVPNet) for accurately predicting Quantisation Parameters for NVENC
AV1 encoders that achieve a specified VMAF score. We use low-complexity
features, including bitstream characteristics, video complexity measures, and
CLIP-based semantic embeddings. Our results demonstrate that LiteVPNet achieves
mean VMAF errors below 1.2 points across a wide range of quality targets.
Notably, LiteVPNet achieves VMAF errors within 2 points for over 87% of our
test corpus, compared with approximately 61% for state-of-the-art methods. LiteVPNet's
performance across various quality regions highlights its applicability for
enhancing high-value content transport and streaming for more energy-efficient,
high-quality media experiences.
comment: Accepted PCS 2025 Camera-Ready Version, 5 Pages
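A minimal sketch of the kind of lightweight predictor the abstract describes (the feature dimensions and architecture below are assumptions, not LiteVPNet itself): a small MLP maps concatenated bitstream statistics, video-complexity measures, a CLIP embedding, and the target VMAF to a predicted QP.
```python
import torch
import torch.nn as nn

class QPPredictor(nn.Module):
    def __init__(self, n_bitstream=16, n_complexity=8, n_clip=512, hidden=128):
        super().__init__()
        in_dim = n_bitstream + n_complexity + n_clip + 1  # +1 for the target VMAF
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, bitstream_feats, complexity_feats, clip_embed, target_vmaf):
        x = torch.cat([bitstream_feats, complexity_feats, clip_embed,
                       target_vmaf.unsqueeze(-1)], dim=-1)
        return self.net(x).squeeze(-1)  # predicted quantisation parameter

# Usage with dummy tensors:
model = QPPredictor()
qp = model(torch.randn(4, 16), torch.randn(4, 8), torch.randn(4, 512),
           torch.tensor([85.0, 90.0, 92.0, 95.0]))
```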
☆ AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion
Visible-infrared image fusion is crucial in key applications such as
autonomous driving and nighttime surveillance. Its main goal is to integrate
multimodal information to produce enhanced images that are better suited for
downstream tasks. Although deep learning based fusion methods have made
significant progress, mainstream unsupervised approaches still face serious
challenges in practical applications. Existing methods mostly rely on manually
designed loss functions to guide the fusion process. However, these loss
functions have obvious limitations. On one hand, the reference images
constructed by existing methods often lack details and have uneven brightness.
On the other hand, the widely used gradient losses focus only on gradient
magnitude. To address these challenges, this paper proposes an angle-based
perception framework for spatial-sensitive image fusion (AngularFuse). First,
we design a cross-modal complementary mask module to force the network
to learn complementary information between modalities. Then, a fine-grained
reference image synthesis strategy is introduced. By combining Laplacian edge
enhancement with adaptive histogram equalization, reference images with richer
details and more balanced brightness are generated. Last but not least, we
introduce an angle-aware loss, which for the first time constrains both
gradient magnitude and direction simultaneously in the gradient domain.
AngularFuse ensures that the fused images preserve both texture intensity and
correct edge orientation. Comprehensive experiments on the MSRS, RoadScene, and
M3FD public datasets show that AngularFuse outperforms existing mainstream
methods by a clear margin. Visual comparisons further confirm that our method
produces sharper and more detailed results in challenging scenes, demonstrating
superior fusion capability.
comment: For the first time, angle-based perception was introduced into the
multi-modality image fusion task
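A hedged sketch of an angle-aware gradient loss in the spirit of the abstract (not the authors' exact formulation): it penalizes both the gradient-magnitude difference and the angular deviation between the gradients of the fused image and a reference.
```python
import torch
import torch.nn.functional as F

def sobel_grad(img):
    # img: (B, 1, H, W) grayscale
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return gx, gy

def angle_aware_loss(fused, reference, eps=1e-6):
    fx, fy = sobel_grad(fused)
    rx, ry = sobel_grad(reference)
    mag_f = torch.sqrt(fx**2 + fy**2 + eps)
    mag_r = torch.sqrt(rx**2 + ry**2 + eps)
    l_mag = F.l1_loss(mag_f, mag_r)              # match gradient strength
    cos = (fx * rx + fy * ry) / (mag_f * mag_r)  # cosine of gradient-direction angle
    l_ang = (1.0 - cos).mean()                   # penalize wrong edge orientation
    return l_mag + l_ang
```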
☆ Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection
In the multimedia domain, Infrared Small Target Detection (ISTD) plays an
important role in drone-based multi-modality sensing. To address the dual
challenges of cross-domain shift and heteroscedastic noise perturbations in
ISTD, we propose a doubly wavelet-guided invariance learning framework
(Ivan-ISTD). In the first stage, we generate training samples aligned
with the target domain using Wavelet-guided Cross-domain Synthesis. This
wavelet-guided alignment mechanism accurately separates the target from the
background through multi-frequency wavelet filtering. In the second stage, we introduce
Real-domain Noise Invariance Learning, which extracts real noise
characteristics from the target domain to build a dynamic noise library. The
model learns noise invariance through self-supervised loss, thereby overcoming
the limitations of distribution bias in traditional artificial noise modeling.
Finally, we create the Dynamic-ISTD Benchmark, a cross-domain dynamic
degradation dataset that simulates the distribution shifts encountered in
real-world applications. Additionally, we validate the versatility of our
method using other real-world datasets. Experimental results demonstrate that
our approach outperforms existing state-of-the-art methods in terms of many
quantitative metrics. In particular, Ivan-ISTD demonstrates excellent
robustness in cross-domain scenarios. The code for this work can be found at:
https://github.com/nanjin1/Ivan-ISTD.
comment: In infrared small target detection, noise from different sensors can
significantly degrade performance. We propose a new dataset and a wavelet-guided
invariance learning framework (Ivan-ISTD) to address this issue
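A hedged sketch of a self-supervised noise-invariance objective consistent with the abstract (the sampling and loss details are assumptions): predictions on a clean patch and on the same patch perturbed with noise drawn from a library of real sensor noise should agree.
```python
import random
import torch
import torch.nn.functional as F

def noise_invariance_loss(model, clean_patch, noise_library):
    # noise_library: list of residual tensors extracted from the target domain
    noise = random.choice(noise_library)
    noisy_patch = clean_patch + noise
    pred_clean = model(clean_patch)
    pred_noisy = model(noisy_patch)
    # encourage detections to be invariant to the injected real-domain noise
    return F.l1_loss(pred_noisy, pred_clean.detach())
```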
♻ ☆ DarkIR: Robust Low-Light Image Restoration CVPR 2025
Photography at night or in dark conditions typically suffers from noise,
low light and blurring issues due to the dim environment and the common use of
long exposure. Although Deblurring and Low-light Image Enhancement (LLIE) are
related under these conditions, most approaches in image restoration solve
these tasks separately. In this paper, we present an efficient and robust
neural network for multi-task low-light image restoration. Instead of following
the current tendency of Transformer-based models, we propose new attention
mechanisms to enhance the receptive field of efficient CNNs. Our method reduces
the computational costs in terms of parameters and MAC operations compared to
previous methods. Our model, DarkIR, achieves new state-of-the-art results on
the popular LOLBlur, LOLv2 and Real-LOLBlur datasets, being able to generalize
on real-world night and dark images. Code and models at
https://github.com/cidautai/DarkIR
comment: CVPR 2025
♻ ☆ Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification (MIDOG 2025 Task 2 Winner)
Guillaume Balezo, Hana Feki, Raphaël Bourgade, Lily Monnier, Matthieu Blons, Alice Blondel, Etienne Decencière, Albert Pla Planas, Thomas Walter
Atypical mitotic figures (AMFs) represent abnormal cell division associated
with poor prognosis. Yet their detection remains difficult due to low
prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025
challenge introduces a benchmark for AMF classification across multiple
domains. In this work, we fine-tuned the recently published DINOv3-H+ vision
transformer, pretrained on natural images, using low-rank adaptation (LoRA),
training only ~1.3M parameters in combination with extensive augmentation and a
domain-weighted Focal Loss to handle domain heterogeneity. Despite the domain
gap, our fine-tuned DINOv3 transfers effectively to histopathology, reaching
first place on the final test set. These results highlight the advantages of
DINOv3 pretraining and underline the efficiency and robustness of our
fine-tuning strategy, yielding state-of-the-art results for the atypical
mitosis classification challenge in MIDOG 2025.
comment: 4 pages. Challenge report for MIDOG 2025 (Task 2: Atypical Mitotic
Figure Classification)
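A hedged sketch of a domain-weighted focal loss for the binary atypical-vs-normal classification described in the abstract (the per-domain weights and gamma are assumptions; the LoRA fine-tuning of the backbone is a separate, standard low-rank-adapter step).
```python
import torch
import torch.nn.functional as F

def domain_weighted_focal_loss(logits, targets, domain_ids, domain_weights,
                               gamma=2.0):
    # logits, targets: (B,); domain_ids: (B,) integer domain index
    # domain_weights: (num_domains,) tensor, e.g. inverse domain frequency
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(),
                                             reduction="none")
    p_t = torch.exp(-bce)                  # probability assigned to the true class
    focal = (1.0 - p_t) ** gamma * bce     # down-weight easy examples
    w = domain_weights[domain_ids]         # up-weight under-represented domains
    return (w * focal).mean()
```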
♻ ☆ OmniLens: Towards Universal Lens Aberration Correction via LensLib-to-Specific Domain Adaptation
Qi Jiang, Yao Gao, Shaohua Gao, Zhonghua Yi, Xiaolong Qian, Hao Shi, Kailun Yang, Lei Sun, Kaiwei Wang
Emerging universal Computational Aberration Correction (CAC) paradigms
provide an inspiring solution to lightweight and high-quality imaging with a
universal model trained on a lens library (LensLib) to address arbitrary lens
aberrations blindly. However, the limited coverage of existing LensLibs leads
to poor generalization of the trained models to unseen lenses, and their
fine-tuning pipelines are confined to the case where lens descriptions are known. In
this work, we introduce OmniLens, a flexible solution to universal CAC via (i)
establishing a convincing LensLib with comprehensive coverage for pre-training
a robust base model, and (ii) adapting the model to any specific lens designs
with unknown lens descriptions via fast LensLib-to-specific domain adaptation.
To achieve these, an Evolution-based Automatic Optical Design (EAOD) pipeline
is proposed to generate a rich variety of lens samples with realistic
aberration behaviors. Then, we design an unsupervised regularization term for
efficient domain adaptation on a few easily accessible real-captured images
based on the statistical observation of dark channel priors in degradation
induced by lens aberrations. Extensive experiments demonstrate that the LensLib
generated by EAOD effectively develops a universal CAC model with strong
generalization capabilities, which can also improve the non-blind lens-specific
methods by 0.35-1.81 dB in PSNR. Additionally, the proposed domain adaptation
method significantly improves the base model, especially in severe aberration
cases (by up to 2.59 dB in PSNR). The code and data will be available at
https://github.com/zju-jiangqi/OmniLens.
comment: The code and data will be available at
https://github.com/zju-jiangqi/OmniLens
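A hedged sketch of a dark-channel-based regularizer in the spirit of the abstract (the patch size and loss form are assumptions): aberration-induced blur tends to raise the dark channel, so the adapted model is encouraged to keep the dark channel of its restored output small.
```python
import torch
import torch.nn.functional as F

def dark_channel(img, patch=15):
    # img: (B, 3, H, W) in [0, 1]; min over channels, then min over a local patch
    min_c = img.min(dim=1, keepdim=True).values
    return -F.max_pool2d(-min_c, kernel_size=patch, stride=1, padding=patch // 2)

def dark_channel_regularizer(restored):
    # unsupervised term computed on a few real-captured, restored images
    return dark_channel(restored).mean()
```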
♻ ☆ BAAF: A benchmark attention adaptive framework for medical ultrasound image segmentation tasks
AI-based assisted diagnosis programs have been widely investigated for
medical ultrasound images. The complex scenarios of ultrasound imaging, in which
the coupled interference of internal and external factors is severe, pose a
unique challenge for localizing object regions automatically and precisely in
ultrasound images. In this study, we propose a more general and robust
Benchmark Attention Adaptive Framework (BAAF) to help doctors segment or
diagnose lesions and tissues in ultrasound images more quickly and accurately.
Different from existing attention schemes, the BAAF consists of a parallel
hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM).
Specifically, BAAF first coarsely calibrates the input features from the
channel and spatial dimensions, and then adaptively selects more robust lesion
or tissue characterizations from the coarse-calibrated feature maps. The design
of BAAF further optimizes the "what" and "where" focus and selection problems
in CNNs and seeks to improve the segmentation accuracy of lesions or tissues in
medical ultrasound images. The method is evaluated on four medical ultrasound
segmentation tasks, and the experimental results demonstrate a remarkable
performance improvement over existing state-of-the-art methods. In
addition, the comparison with existing attention mechanisms also demonstrates
the superiority of BAAF. This work provides the possibility for automated
medical ultrasound assisted diagnosis and reduces reliance on human accuracy
and precision.
comment: 10 pages, 11 figures
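A generic sketch of a parallel channel/spatial ("what"/"where") attention block of the kind the abstract alludes to; the actual PHAM and ACM designs are not specified in the abstract, so this is an illustrative hybrid, not the authors' module.
```python
import torch
import torch.nn as nn

class ParallelHybridAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        # channel ("what") and spatial ("where") calibrations applied in parallel
        return x * self.channel_att(x) + x * self.spatial_att(x)
```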
♻ ☆ Logarithmic Mathematical Morphology: theory and applications
In Mathematical Morphology for grey-level functions, an image is analysed with
another function, named the structuring function. This structuring function is
translated over the image domain and added to the image. However, in an image
presenting lighting variations, the amplitude of the structuring function
should vary according to the image intensity. Such a property is not satisfied
in Mathematical Morphology for grey-level functions, when the structuring
function is added to the image with the usual additive law. To
address this issue, a new framework is defined with an additive law for which
the amplitude of the structuring function varies according to the image
amplitude. This additive law is chosen within the Logarithmic Image Processing
framework and models the lighting variations with a physical cause such as a
change of light intensity. The new framework is named Logarithmic Mathematical
Morphology (LMM) and allows the definition of operators which are robust to
such lighting variations.
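For concreteness, a sketch consistent with the abstract, under the assumption that the chosen law is the classical Logarithmic Image Processing addition (with M the upper bound of the grey-level range); the LMM dilation then replaces the usual sum with this law.
```latex
\[
f \oplus_{\mathrm{LIP}} g \;=\; f + g - \frac{f\,g}{M},
\qquad
(f \oplus_{\mathrm{LMM}} b)(x) \;=\; \sup_{h}\,\bigl( f(x-h) \oplus_{\mathrm{LIP}} b(h) \bigr),
\]
% so that the effective amplitude of the structuring function b shrinks where
% the image f is bright, as required under lighting variations.
```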
♻ ☆ Robust Real-Time Endoscopic Stereo Matching under Fuzzy Tissue Boundaries
Real-time acquisition of accurate scene depth is essential for automated
robotic minimally invasive surgery. Stereo matching with binocular endoscopy
can provide this depth information. However, existing stereo matching methods,
designed primarily for natural images, often struggle with endoscopic images
due to fuzzy tissue boundaries and typically fail to meet real-time
requirements for high-resolution endoscopic image inputs. To address these
challenges, we propose RRESM, a real-time stereo matching method
tailored for endoscopic images. Our approach integrates a 3D Mamba Coordinate
Attention module that enhances cost aggregation through position-sensitive
attention maps and long-range spatial dependency modeling via the Mamba block,
generating a robust cost volume without substantial computational overhead.
Additionally, we introduce a High-Frequency Disparity Optimization module that
refines disparity predictions near tissue boundaries by amplifying
high-frequency details in the wavelet domain. Evaluations on the SCARED and
SERV-CT datasets demonstrate state-of-the-art matching accuracy with a
real-time inference speed of 42 FPS. The code is available at
https://github.com/Sonne-Ding/RRESM.
♻ ☆ Unsupervised patch-based dynamic MRI reconstruction using learnable tensor function with implicit neural representation
Yuanyuan Liu, Yuanbiao Yang, Jing Cheng, Zhuo-Xu Cui, Qingyong Zhu, Congcong Liu, Yuliang Zhu, Jingran Xu, Hairong Zheng, Dong Liang, Yanjie Zhu
Dynamic MRI suffers from limited spatiotemporal resolution due to long
acquisition times. Undersampling k-space accelerates imaging but makes accurate
reconstruction challenging. Supervised deep learning methods achieve impressive
results but rely on large fully sampled datasets, which are difficult to
obtain. Recently, implicit neural representations (INR) have emerged as a
powerful unsupervised paradigm that reconstructs images from a single
undersampled dataset without external training data. However, existing
INR-based methods still face challenges when applied to highly undersampled
dynamic MRI, mainly due to their inefficient representation capacity and high
computational cost. To address these issues, we propose TenF-INR, a novel
unsupervised framework that integrates low-rank tensor modeling with INR, where
each factor matrix in the tensor decomposition is modeled as a learnable factor
function. Specifically, we employ INR to model learnable tensor functions within
a low-rank decomposition, reducing the parameter space and computational
burden. A patch-based nonlocal tensor modeling strategy further exploits
temporal correlations and inter-patch similarities, enhancing the recovery of
fine spatiotemporal details. Experiments on dynamic cardiac and abdominal
datasets demonstrate that TenF-INR achieves up to 21-fold acceleration,
outperforming both supervised and unsupervised state-of-the-art methods in
image quality, temporal fidelity, and quantitative accuracy.
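A hedged sketch of the core idea as described in the abstract (the rank, network widths, and coordinate handling are assumptions, not the TenF-INR implementation): each factor of a low-rank CP decomposition of the dynamic image X(x, y, t) is a small coordinate MLP, so the full tensor never has to be stored explicitly.
```python
import torch
import torch.nn as nn

class FactorINR(nn.Module):
    """Maps a 1-D coordinate in [0, 1] to R rank components."""
    def __init__(self, rank=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, rank),
        )

    def forward(self, coords):          # coords: (N, 1)
        return self.net(coords)         # (N, R)

class LowRankTensorINR(nn.Module):
    def __init__(self, rank=32):
        super().__init__()
        self.fx, self.fy, self.ft = FactorINR(rank), FactorINR(rank), FactorINR(rank)

    def forward(self, x, y, t):
        # x: (H, 1), y: (W, 1), t: (T, 1) normalized coordinates
        Ux, Uy, Ut = self.fx(x), self.fy(y), self.ft(t)   # (H,R), (W,R), (T,R)
        # CP reconstruction: X[h, w, k] = sum_r Ux[h, r] * Uy[w, r] * Ut[k, r]
        return torch.einsum("hr,wr,kr->hwk", Ux, Uy, Ut)

# The reconstructed X(x, y, t) would then be pushed through the undersampled
# k-space forward model and fitted to the acquired data with a data-consistency loss.
```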