ECCV 2026 Project Page

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

CMDS-AD tackles extreme few-shot multi-modal anomaly detection by decoupling low-frequency structural priors from high-frequency defect cues. A diffusion normal estimator anchors a stable estimated stream, a real stream preserves fine anomalies, and a coordinate-aware cross-modal mapper aligns RGB and 3D features while preserving positional precision.

Junhao Cai¹, Junyu Chen², Deyu Zeng¹,²*, Junhao Pang¹, Qiwei Liang¹,³, Xiaopin Zhong¹, Zongze Wu¹

* Corresponding author: Deyu Zeng

¹ Shenzhen University, Shenzhen, Guangdong 518060, China

² Guangzhou Maritime University, Guangzhou, Guangdong 510725, China

³ Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong 511453, China

Inst. 1 Emails

caijunhao27@gmail.com, 2500092013@mails.szu.edu.cn, liangqiwei2022@email.szu.edu.cn, xzhong@szu.edu.cn, zzwu@szu.edu.cn

Inst. 2 Emails

zengdeyu@gzmtu.edu.cn, chenjunyu@gzmtu.edu.cn


Few-Shot Learning
Multi-Modal Anomaly Detection
Diffusion Models
Dual-Stream Optimization

Radar chart comparing CMDS-AD against prior methods on MVTec 3D-AD and EyeCandies. — Few-shot benchmark overview. CMDS-AD consistently improves both I-AUROC and AUPRO across 1-shot, 2-shot, and 4-shot settings on MVTec 3D-AD and EyeCandies.

Benchmark Highlights

New state of the art from 1-shot to 4-shot

MVTec 3D-AD (1-shot): 79.6 / 94.2 I-AUROC / AUPRO@30% (+5.7 / +2.0 vs. prior best)

EyeCandies (1-shot): 77.2 / 85.5 I-AUROC / AUPRO@30% (+7.7 / +5.6 vs. prior best)

MVTec 3D-AD (4-shot): 87.1 / 95.8 I-AUROC / AUPRO@30%. High-precision localization.

EyeCandies (4-shot): 82.7 / 87.7 I-AUROC / AUPRO@30%. Robust under severe scarcity.

Evaluation follows the standard few-shot protocol with k in {1, 2, 4} normal samples per category.
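As a rough illustration of the protocol above, the sketch below draws k normal samples per category for k in {1, 2, 4}. The helper `sample_support`, the seed, and the file names are hypothetical placeholders; the page does not say how the support samples are actually chosen.

```python
import random

def sample_support(normals_by_category, k, seed=0):
    """Pick k normal training samples per category (hypothetical helper)."""
    rng = random.Random(seed)
    support = {}
    for category, samples in sorted(normals_by_category.items()):
        if len(samples) < k:
            raise ValueError(f"{category} has fewer than {k} normal samples")
        support[category] = rng.sample(samples, k)
    return support

# Placeholder file lists; real datasets provide many more normal images.
normals = {
    "bagel": [f"bagel_{i:03d}.png" for i in range(10)],
    "rope": [f"rope_{i:03d}.png" for i in range(10)],
}

# One support set per few-shot setting.
splits = {k: sample_support(normals, k) for k in (1, 2, 4)}
for k, support in splits.items():
    print(k, {category: len(files) for category, files in support.items()})
```

Fixing the seed keeps the drawn support set reproducible across runs, which matters when only a handful of samples define normality.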

Overview

Separating stable structure from defect-sensitive detail

Few-shot anomaly detection remains difficult because existing multi-modal methods tend to process all spatial content uniformly, mixing stable macroscopic structure with high-frequency localized defect signals. Under severe data scarcity, this creates cross-modal misalignment and inflated false-positive responses.

CMDS-AD reframes the problem with a dual-stream design. A LoRA-guided diffusion model augments scarce RGB data, while a diffusion-based normal estimator provides a structurally stable estimated stream that behaves like a non-linear low-pass filter. This auxiliary stream anchors normality, allowing the real stream to focus on micro-defects without losing cross-modal consistency.

A Coordinate-Aware Hierarchical Feature Mapper aligns RGB and 3D semantics across multiple scales, and a multiplicative anomaly fusion rule keeps only the spatial regions that are supported by both modalities, sharply reducing isolated modality-specific noise.

Diffusion as more than augmentation

The normal estimator is repurposed as a low-frequency prior, not just a generator, giving the method a stable structural template even in the 1-shot regime.

Dual-stream decoupling

The estimated stream captures low-frequency normal structure, while the real stream preserves coupled high- and low-frequency content to localize subtle anomalies.
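The decoupling idea can be pictured with a toy analogy. In the sketch below a plain box blur stands in for the estimated stream; this is an assumption for illustration only, since the method uses a diffusion-based normal estimator that is described as a non-linear low-pass filter. The point is simply that a smooth estimate keeps the overall structure, so a small defect stands out in the difference to the real signal.

```python
import numpy as np

def box_blur(img, radius=2):
    """Linear low-pass filter: mean over a (2r+1) x (2r+1) window, edge-padded."""
    size = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / size**2

# Smooth surface (low frequency) with a single-pixel defect (high frequency).
yy, xx = np.mgrid[0:32, 0:32]
surface = np.sin(xx / 10.0) + np.cos(yy / 12.0)
real = surface.copy()
real[16, 16] += 1.0

estimated = box_blur(real)           # stand-in for the estimated stream
residual = np.abs(real - estimated)  # what the smooth estimate cannot explain

peak = tuple(int(i) for i in np.unravel_index(residual.argmax(), residual.shape))
print(peak)
```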

Cross-modal precision scoring

Element-wise multiplication of 2D and 3D anomaly maps acts as a strict spatial gate, suppressing false alarms and improving boundary sharpness.

Method

CMDS-AD pipeline

Pipeline diagram of the CMDS-AD framework. — The framework combines diffusion-driven multimodal augmentation, real and estimated streams, coordinate-aware feature mapping, and multiplicative cross-modal anomaly scoring.

Step 1

Diffusion-driven augmentation

Stable Diffusion v2.1 is LoRA-finetuned to synthesize diverse RGB samples from very limited normal data. Marigold then predicts estimated surface normals for both real and generated images.
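For readers unfamiliar with LoRA, the following NumPy sketch shows the low-rank update on a single linear layer: the pretrained weight stays frozen and only two small matrices are trained. The layer size, rank, and scaling value are illustrative assumptions; the page does not state which Stable Diffusion layers are adapted or which hyperparameters are used.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 4, 8.0  # illustrative values only

W0 = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero init

def lora_forward(x, W0, A, B, alpha, rank):
    """Compute x @ (W0 + (alpha / rank) * B @ A).T without forming the full update."""
    return x @ W0.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
y_init = lora_forward(x, W0, A, B, alpha, rank)

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
starts_unchanged = bool(np.allclose(y_init, x @ W0.T))

trainable = A.size + B.size
full = W0.size
print(starts_unchanged, trainable, full)
```

The small trainable parameter count relative to the full weight is what makes this kind of finetuning plausible with very few normal images.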

Step 2

Real and estimated streams

The estimated stream acts as a purely low-frequency anchor, while the real stream retains fine geometric detail coupled with the underlying structure. Their complementary anomaly responses are kept independent until final fusion.

Step 3

Coordinate-aware mapper

Multi-scale ViT features from layers 4, 7, and 11 are aligned through coordinate attention and spatial selection, which preserves positional precision while closing the 2D-3D semantic gap.
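To make the idea of coordinate-aware gating concrete, here is a generic sketch that pools a feature map along each spatial axis and uses the two pooled descriptors as row-wise and column-wise gates. It is not the mapper from the paper: the random weights, the channel count, the 14 x 14 grid, and the function `coordinate_gate` are all assumptions, and the random arrays merely stand in for patch tokens from layers 4, 7, and 11 reshaped into a grid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_gate(feat, w_h, w_w):
    """Gate a (C, H, W) feature map with direction-wise pooled descriptors."""
    pooled_h = feat.mean(axis=2)      # (C, H): pooled along width, keeps row position
    pooled_w = feat.mean(axis=1)      # (C, W): pooled along height, keeps column position
    gate_h = sigmoid(w_h @ pooled_h)  # (C, H), values in (0, 1)
    gate_w = sigmoid(w_w @ pooled_w)  # (C, W), values in (0, 1)
    return feat * gate_h[:, :, None] * gate_w[:, None, :]

rng = np.random.default_rng(0)
C, H, W = 8, 14, 14

# Random stand-ins for multi-scale features from three transformer layers.
features = {layer: rng.normal(size=(C, H, W)) for layer in (4, 7, 11)}

# Untrained channel-mixing weights, shared across layers for brevity.
w_h = rng.normal(scale=0.1, size=(C, C))
w_w = rng.normal(scale=0.1, size=(C, C))

gated = {layer: coordinate_gate(f, w_h, w_w) for layer, f in features.items()}
print({layer: g.shape for layer, g in gated.items()})
```

Because the gates are indexed by row and by column rather than pooled globally, each spatial location keeps its own weighting, which is the property the text refers to as positional precision.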

Step 4

Multiplicative anomaly fusion

Weighted real and estimated anomaly maps are merged within each modality, then multiplied across 2D and 3D to keep only jointly supported defect regions.
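The two-stage fusion described in this step can be sketched as follows. The equal weights, the min-max normalisation, and the function names are assumptions made for the example; the page does not give the actual weights or any normalisation details.

```python
import numpy as np

def normalize(m):
    """Min-max scale a map to [0, 1] (an assumed preprocessing step)."""
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)

def fuse(real_2d, est_2d, real_3d, est_3d, w_real=0.5, w_est=0.5):
    """Weighted merge within each modality, then multiplication across modalities."""
    map_2d = w_real * normalize(real_2d) + w_est * normalize(est_2d)
    map_3d = w_real * normalize(real_3d) + w_est * normalize(est_3d)
    return map_2d * map_3d

# Toy 4x4 maps: a true defect at (1, 1), RGB-only noise at (3, 0), 3D-only noise at (0, 3).
real_2d = np.zeros((4, 4)); real_2d[1, 1] = 1.0; real_2d[3, 0] = 0.8
est_2d = np.zeros((4, 4)); est_2d[1, 1] = 0.6
real_3d = np.zeros((4, 4)); real_3d[1, 1] = 0.9; real_3d[0, 3] = 0.7
est_3d = np.zeros((4, 4)); est_3d[1, 1] = 0.5

fused = fuse(real_2d, est_2d, real_3d, est_3d)
print(fused.round(2))
```

In this toy case only the location supported by both modalities survives the product, while the two single-modality responses are driven to zero.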

Results

Quantitative gains and cleaner localization

MVTec 3D-AD, 1-shot: 79.6% I-AUROC

MVTec 3D-AD, 1-shot: 94.2% AUPRO@30%

EyeCandies, 1-shot: 77.2% I-AUROC

EyeCandies, 1-shot: 85.5% AUPRO@30%

Benchmark summary

On MVTec 3D-AD, CMDS-AD reaches 79.6% I-AUROC and 94.2% AUPRO@30% in the 1-shot setting, then scales to 87.1% and 95.8% at 4-shot. The method is especially strong on geometrically complex categories such as Bagel and Rope.

On EyeCandies, CMDS-AD establishes a new state of the art across all few-shot settings, delivering 77.2% I-AUROC and 85.5% AUPRO@30% in the 1-shot setting and reaching 82.7% I-AUROC at 4-shot.
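For context on the I-AUROC figures, image-level AUROC is the probability that a randomly chosen anomalous image receives a higher score than a randomly chosen normal one. The sketch below computes it by direct pairwise comparison on made-up scores; how CMDS-AD reduces an anomaly map to a single image score is not specified on this page, so the scores here are placeholders.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC via pairwise comparison of anomalous vs normal scores (ties count half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Placeholder data: 1 marks an anomalous image, 0 a normal one.
labels = [0, 0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.38]

value = auroc(labels, scores)
print(round(float(value), 4))  # 5 of the 6 anomalous/normal pairs are ranked correctly
```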

Why the maps look better

The estimated stream provides a stable structural guide, which helps the real stream avoid confusing background texture or sensor noise with true defects. The final multiplicative fusion then removes isolated modality-specific responses, resulting in sharper boundaries and fewer false alarms in normal regions.

This behavior is visible both on texture-driven defects and on subtle geometric anomalies that are easy to miss from RGB alone.

Qualitative examples showing RGB, real normals, estimated normals, and anomaly maps. — Modality complementarity. CMDS-AD captures color-only, shape-only, and combined anomalies by leveraging both RGB and 3D cues.

Comparison of anomaly overlays between CMDS-AD and prior baselines. — Qualitative comparison with prior methods. CMDS-AD yields tighter masks and fewer false positives on both MVTec 3D-AD and EyeCandies.

Analysis

Ablations confirm the design choices

Ablation takeaways

Full model wins on localization

Under the 4-shot setting, the full system reaches 0.958 AUPRO@30% (the 95.8% reported in the benchmark summary) and 0.410 AUPRO@1%, confirming that the real stream, estimated stream, and feature mapper work best together.

Feature mapper matters

Replacing the proposed mapper with a standard MLP degrades localization metrics, showing the importance of hierarchical gating and coordinate-aware alignment.

Multiplication beats simpler fusion

Element-wise multiplication achieves 0.989 P-AUROC and 0.958 AUPRO@30%, outperforming addition for fine-grained localization under strict false-positive constraints.
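A small numeric example shows why multiplication is the stricter rule. The values and the threshold below are invented for illustration and are unrelated to the reported metrics: averaging lets a strong response in one modality pass on its own, whereas the product requires support from both.

```python
import numpy as np

# Toy 2x2 anomaly maps: only the top-left pixel is flagged by both modalities.
map_2d = np.array([[0.9, 0.0],
                   [0.8, 0.1]])
map_3d = np.array([[0.9, 0.7],
                   [0.0, 0.1]])

added = (map_2d + map_3d) / 2  # additive fusion (mean of the two maps)
multiplied = map_2d * map_3d   # multiplicative fusion

threshold = 0.3  # illustrative decision threshold
added_hits = added > threshold
multiplied_hits = multiplied > threshold

print(int(added_hits.sum()), int(multiplied_hits.sum()))
```

Here the additive rule flags three pixels, two of them backed by a single modality, while the multiplicative rule keeps only the jointly supported one.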

Citation

Paper and acknowledgements

BibTeX

@inproceedings{cai2026cmds,
  title     = {CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection},
  author    = {Cai, Junhao and Chen, Junyu and Zeng, Deyu and Pang, Junhao and Liang, Qiwei and Zhong, Xiaopin and Wu, Zongze},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Acknowledgements

This research work was financially supported in part by the Guangdong Major Project of Basic Research under Grant 2023B0303000009, the NSFC Youth Fund Project under Grant 62403326, the Shenzhen Fundamental Research Fund under Grant JCYJ20230808105212023, the Research Team Cultivation Program of Shenzhen University under Grant 2023JCT004, and the Shenzhen University 2035 Program for Excellent Research under Grant 00000224.

Project materials on this page are adapted from the manuscript figures and content provided in the ECCV 2026 paper directory.