CMDS-AD

ECCV 2026 Project Page

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

CMDS-AD tackles extreme few-shot multi-modal anomaly detection by decoupling low-frequency structural priors from high-frequency defect cues. A diffusion-based normal estimator anchors a stable estimated stream, a real stream preserves fine anomalies, and a coordinate-aware cross-modal mapper aligns RGB and 3D features across scales while preserving positional precision.

Junhao Cai¹, Junyu Chen², Deyu Zeng¹˒²*, Junhao Pang¹, Qiwei Liang¹˒³, Xiaopin Zhong¹, Zongze Wu¹

* Corresponding author: Deyu Zeng

¹ Shenzhen University, Shenzhen, Guangdong 518060, China

² Guangzhou Maritime University, Guangzhou, Guangdong 510725, China

³ Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong 511453, China

  • Few-Shot Learning
  • Multi-Modal Anomaly Detection
  • Diffusion Models
  • Dual-Stream Optimization
Radar chart comparing CMDS-AD against prior methods on MVTec 3D-AD and EyeCandies.
Few-shot benchmark overview. CMDS-AD consistently improves both I-AUROC and AUPRO across 1-shot, 2-shot, and 4-shot settings on MVTec 3D-AD and EyeCandies.

Overview

Separating stable structure from defect-sensitive detail

Few-shot anomaly detection remains difficult because existing multi-modal methods tend to process all spatial content uniformly, mixing stable macroscopic structure with high-frequency localized defect signals. Under severe data scarcity, this creates cross-modal misalignment and inflated false-positive responses.

CMDS-AD reframes the problem with a dual-stream design. A LoRA-guided diffusion model augments scarce RGB data, while a diffusion-based normal estimator provides a structurally stable estimated stream that behaves like a non-linear low-pass filter. This auxiliary stream anchors normality, allowing the real stream to focus on micro-defects without losing cross-modal consistency.
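The low-frequency versus high-frequency split described above can be illustrated with a much simpler stand-in. The following is a minimal sketch, assuming a linear box blur in place of the diffusion-based normal estimator (which the page describes as non-linear); `box_blur` and `split_frequencies` are hypothetical names, and the toy surface is invented for illustration.

```python
import numpy as np

def box_blur(img: np.ndarray, k: int = 5) -> np.ndarray:
    """Linear low-pass stand-in: mean over a k x k window with edge padding."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def split_frequencies(img: np.ndarray, k: int = 5):
    """Return (low, high) so that low + high reconstructs the input."""
    low = box_blur(img, k)
    high = img - low
    return low, high

# Toy surface: a smooth ramp (stable structure) plus a one-pixel bump (defect).
h, w = 32, 32
ramp = np.tile(np.linspace(0.0, 1.0, w), (h, 1))
img = ramp.copy()
img[16, 16] += 1.0

low, high = split_frequencies(img)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(np.abs(high)), high.shape))
print("strongest high-frequency response at", peak)
```

In this toy case the smooth ramp ends up almost entirely in `low`, while the bump dominates `high`. That is the intuition behind letting one stream anchor normal structure and the other localize small defects.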

A Coordinate-Aware Hierarchical Feature Mapper aligns RGB and 3D semantics across multiple scales, and a multiplicative anomaly fusion rule keeps only the spatial regions that are supported by both modalities, sharply reducing isolated modality-specific noise.

01

Diffusion as more than augmentation

The normal estimator is repurposed as a low-frequency prior, not just a generator, giving the method a stable structural template even in the 1-shot regime.

02

Dual-stream decoupling

The estimated stream captures low-frequency normal structure, while the real stream preserves coupled high- and low-frequency content to localize subtle anomalies.

03

Cross-modal precision scoring

Element-wise multiplication of 2D and 3D anomaly maps acts as a strict spatial gate, suppressing false alarms and improving boundary sharpness.

Method

CMDS-AD pipeline

Pipeline diagram of the CMDS-AD framework.
The framework combines diffusion-driven multimodal augmentation, real and estimated streams, coordinate-aware feature mapping, and multiplicative cross-modal anomaly scoring.
Step 1

Diffusion-driven augmentation

Stable Diffusion v2.1 is fine-tuned with LoRA to synthesize diverse RGB samples from very limited normal data. Marigold then estimates surface normals for both real and generated images.
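For readers unfamiliar with LoRA, the general idea is to freeze a pretrained weight matrix and train only a low-rank update. The sketch below shows that idea on a single linear layer in NumPy. It is not the Stable Diffusion fine-tuning code used for CMDS-AD, and the layer size, rank, and scaling factor are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; not taken from the paper.
d_out, d_in, rank, alpha = 64, 64, 4, 8.0

W = rng.standard_normal((d_out, d_in))          # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha, rank):
    """y = x W^T + (alpha / rank) * x A^T B^T; only A and B would be trained."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
y0 = lora_forward(x, W, A, B, alpha, rank)

trainable = A.size + B.size
full = W.size
print(f"trainable parameters: {trainable} of {full}")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly. The small number of trainable parameters is what makes adaptation from a handful of images practical.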

Step 2

Real and estimated streams

The estimated stream acts as a purely low-frequency anchor, while the real stream retains the coupled high- and low-frequency geometric detail. Their complementary anomaly responses are kept independent until final fusion.

Step 3

Coordinate-aware mapper

Multi-scale ViT features from layers 4, 7, and 11 are aligned through coordinate attention and spatial selection, which preserves positional precision while closing the 2D-3D semantic gap.
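The page does not spell out the mapper's internals, so the following is only a simplified sketch of direction-wise pooling and gating, the general idea behind coordinate attention. The function `coordinate_gate`, the random mixing matrices standing in for learned 1x1 convolutions, and the feature shapes are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_gate(feat, w_h, w_w):
    """Gate a (C, H, W) feature map with row-wise and column-wise attention.

    w_h and w_w are (C, C) channel-mixing matrices standing in for learned
    1x1 convolutions (hypothetical parameters).
    """
    pooled_h = feat.mean(axis=2)        # (C, H): average over width, keeps row position
    pooled_w = feat.mean(axis=1)        # (C, W): average over height, keeps column position
    gate_h = sigmoid(w_h @ pooled_h)    # (C, H), values in (0, 1)
    gate_w = sigmoid(w_w @ pooled_w)    # (C, W), values in (0, 1)
    return feat * gate_h[:, :, None] * gate_w[:, None, :]

rng = np.random.default_rng(0)
channels, grid = 8, 14                  # hypothetical channel count and patch grid

# Dummy feature maps standing in for ViT layers 4, 7, and 11.
features = {layer: rng.standard_normal((channels, grid, grid)) for layer in (4, 7, 11)}
w_h = rng.standard_normal((channels, channels)) * 0.1
w_w = rng.standard_normal((channels, channels)) * 0.1

gated = {layer: coordinate_gate(f, w_h, w_w) for layer, f in features.items()}
```

Pooling along one axis at a time keeps the position along the other axis. This is why such gating can reweight features without discarding where they occurred.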

Step 4

Multiplicative anomaly fusion

Weighted real and estimated anomaly maps are merged within each modality, then multiplied across 2D and 3D to keep only jointly supported defect regions.
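The two-stage fusion in Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the within-modality rule is taken to be `real + est_weight * est` (the page only says the maps are weighted), the default weight of 0.1 comes from the sensitivity analysis on this page, and the function names and toy maps are hypothetical.

```python
import numpy as np

def fuse_within_modality(real_map, est_map, est_weight=0.1):
    """Merge real- and estimated-stream maps for one modality (assumed weighted sum)."""
    return real_map + est_weight * est_map

def fuse_cross_modal(map_2d, map_3d):
    """Element-wise product: a pixel survives only if both modalities respond."""
    return map_2d * map_3d

shape = (8, 8)
real_2d, est_2d = np.zeros(shape), np.zeros(shape)
real_3d, est_3d = np.zeros(shape), np.zeros(shape)

# True defect at (2, 2): every stream responds.
real_2d[2, 2], est_2d[2, 2] = 1.0, 0.5
real_3d[2, 2], est_3d[2, 2] = 0.9, 0.5

# Spurious response at (6, 6): RGB only, for example a texture artifact.
real_2d[6, 6] = 0.8

map_2d = fuse_within_modality(real_2d, est_2d)
map_3d = fuse_within_modality(real_3d, est_3d)
final = fuse_cross_modal(map_2d, map_3d)

peak = tuple(int(i) for i in np.unravel_index(np.argmax(final), final.shape))
print("peak at", peak, "| score at RGB-only response:", final[6, 6])
```

The RGB-only response at (6, 6) is multiplied by a zero in the 3D map and disappears, while the jointly supported defect at (2, 2) remains the peak.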

Results

Quantitative gains and cleaner localization

MVTec 3D-AD, 1-shot: 79.6% I-AUROC, 94.2% AUPRO@30%
EyeCandies, 1-shot: 77.2% I-AUROC, 85.5% AUPRO@30%

Benchmark summary

On MVTec 3D-AD, CMDS-AD reaches 79.6% I-AUROC and 94.2% AUPRO@30% in the 1-shot setting, rising to 87.1% I-AUROC and 95.8% AUPRO@30% at 4-shot. The method is especially strong on geometrically complex categories such as Bagel and Rope.

On EyeCandies, CMDS-AD establishes a new state of the art across all few-shot settings, delivering 77.2% I-AUROC and 85.5% AUPRO@30% in the 1-shot setting and reaching 82.7% I-AUROC at 4-shot.
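I-AUROC is the area under the ROC curve computed from one anomaly score per image. As a reminder of what that number measures, the sketch below computes AUROC in its pairwise form: the fraction of (anomalous, normal) pairs in which the anomalous sample receives the higher score, with ties counted as one half. The scores and labels are made up for illustration and are not results from the paper. How CMDS-AD reduces an anomaly map to a single image-level score is not stated on this page.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the probability that an anomalous sample outscores a normal one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one normal and one anomalous sample")
    diff = pos[:, None] - neg[None, :]           # all anomalous-vs-normal pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Made-up image-level scores: label 1 marks an anomalous image.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_labels)
print(toy_auroc)
```

Here three of the four anomalous-versus-normal pairs are ordered correctly, so the toy AUROC is 0.75. The same function applied to flattened per-pixel scores and masks gives a pixel-level AUROC.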

Why the maps look better

The estimated stream provides a stable structural guide, which helps the real stream avoid confusing background texture or sensor noise with true defects. The final multiplicative fusion then removes isolated modality-specific responses, resulting in sharper boundaries and fewer false alarms in normal regions.

This behavior is visible both on texture-driven defects and on subtle geometric anomalies that are easy to miss from RGB alone.

Analysis

Ablations confirm the design choices

Sensitivity analysis over fusion weights on MVTec 3D-AD and EyeCandies.
Sensitivity analysis over estimated-stream fusion weights. A moderate weight of 0.1 consistently provides the best balance across datasets and shot settings.

Ablation takeaways

Full model wins on localization

Under the 4-shot setting, the full system reaches 95.8% AUPRO@30% and 41.0% AUPRO@1%, confirming that the real stream, estimated stream, and feature mapper work best together.

Feature mapper matters

Replacing the proposed mapper with a standard MLP degrades localization metrics, showing the importance of hierarchical gating and coordinate-aware alignment.

Multiplication beats simpler fusion

Element-wise multiplication achieves 98.9% P-AUROC and 95.8% AUPRO@30%, outperforming addition for fine-grained localization under strict false-positive constraints.
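A toy comparison makes the difference between the two fusion rules concrete. The example below is deliberately constructed to show the gating effect and says nothing about the magnitude of the reported gains. The maps, the averaging form of addition, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import numpy as np

h, w = 16, 16
truth = np.zeros((h, w), dtype=bool)
truth[4:7, 4:7] = True                 # 3 x 3 ground-truth defect

# Both modalities respond on the true defect.
map_2d = truth.astype(float)
map_3d = truth.astype(float)

# Each modality also has one isolated response of its own.
map_2d[12, 3] = 1.0                    # RGB-only response
map_3d[2, 13] = 1.0                    # 3D-only response

added = 0.5 * (map_2d + map_3d)        # additive fusion (averaged)
multiplied = map_2d * map_3d           # multiplicative fusion

def false_positives(score, truth, thr):
    """Count pixels flagged at threshold thr that lie outside the defect."""
    return int(((score >= thr) & ~truth).sum())

def true_positives(score, truth, thr):
    """Count defect pixels flagged at threshold thr."""
    return int(((score >= thr) & truth).sum())

thr = 0.5
fp_add, fp_mul = false_positives(added, truth, thr), false_positives(multiplied, truth, thr)
tp_add, tp_mul = true_positives(added, truth, thr), true_positives(multiplied, truth, thr)
print(f"addition: TP={tp_add}, FP={fp_add} | multiplication: TP={tp_mul}, FP={fp_mul}")
```

Both rules recover all nine defect pixels at this threshold, but the averaged map keeps the two single-modality responses while the product removes them.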

Citation

Paper and acknowledgements

BibTeX

@inproceedings{cai2026cmds,
  title     = {CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection},
  author    = {Cai, Junhao and Chen, Junyu and Zeng, Deyu and Pang, Junhao and Liang, Qiwei and Zhong, Xiaopin and Wu, Zongze},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Acknowledgements

This research work was financially supported in part by the Guangdong Major Project of Basic Research under Grant 2023B0303000009, the NSFC Youth Fund Project under Grant 62403326, the Shenzhen Fundamental Research Fund under Grant JCYJ20230808105212023, the Research Team Cultivation Program of Shenzhen University under Grant 2023JCT004, and the Shenzhen University 2035 Program for Excellent Research under Grant 00000224.

Project materials on this page are adapted from the figures and content of the ECCV 2026 manuscript.