M3DM-NR: RGB-3D Noisy-Resistant Industrial Anomaly Detection via Multimodal Denoising

Chengjie Wang, Haokun Zhu, Jinlong Peng, Yue Wang, Ran Yi, Yunsheng Wu, Lizhuang Ma, Jiangning Zhang

C. Wang is with Shanghai Jiao Tong University and Youtu Lab, Shanghai, China. H. Zhu, Y. Wang, R. Yi, and L. Ma are with Shanghai Jiao Tong University, Shanghai, China. J. Peng, Y. Wu, and J. Zhang are with Youtu Lab, Tencent, China. Corresponding author: Ran Yi.

Abstract

Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet both RGB and 3D data are crucial for anomaly detection, and datasets are seldom completely clean in practical scenarios. To address these challenges, this paper first delves into RGB-3D multi-modal noisy anomaly detection and proposes a novel noise-resistant M3DM-NR framework that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages. Stage-I introduces the Suspected References Selection module, which filters a few normal samples from the training dataset using the multimodal features extracted by the Initial Feature Extraction, and a Suspected Anomaly Map Computation module, which generates a suspected anomaly map that focuses on abnormal regions as reference. Stage-II uses the suspected anomaly maps of the reference samples as reference and takes image, point cloud, and text information as input to denoise the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection.

Index Terms:

Anomaly Detection, Multi-modal Learning, Noisy Learning, Unsupervised Learning

1 Introduction

Industrial anomaly detection aims to find the abnormal regions of products and plays an important role in industrial quality inspection. Most existing industrial anomaly detection methods [1, 2] primarily focus on RGB images [3, 4] and use a vast number of normal examples for training. Consequently, current industrial anomaly detection methods predominantly rely on unsupervised approaches, meaning they train exclusively on normal RGB examples and encounter defect examples only at inference. These two factors contribute to two significant issues (Fig. 1-Top-Left). First, during the quality inspection of industrial products, human inspectors rely on both 3D shape and color characteristics to assess product quality. The 3D shape information is particularly crucial for accurate defect detection, and identifying defects using only RGB images proves difficult. With advancements in 3D sensor technology, the recent MVTec 3D-AD dataset, which includes both 2D images and 3D point cloud data, was proposed to alleviate this problem and has bolstered research in multi-modal industrial anomaly detection (Fig. 1-Top-Middle). Second, the presence of noise in the normal dataset is an unavoidable issue in real-world applications, particularly in industrial manufacturing where products are mass-produced daily. Most existing unsupervised AD methods [5, 6, 7] are vulnerable to noisy data because they model the training set exhaustively. Noisy samples can easily mislead these overconfident AD algorithms, causing them to misclassify similar anomalous samples in the test set and to produce incorrect localizations. SoftPatch [8] is the first to introduce the setting of noisy industrial detection, but it explores this setting only on RGB data.

Figure 1: Top: Intuitive diagram of different task settings. Middle: Representative PatchCore [5] for RGB images, our M3DM [9] (conference version) for multi-modal RGB+3D data, and the new M3DM-NR for the more challenging and practical noisy setting. Bottom: Visualization results on the MVTec 3D-AD dataset [10]. Our M3DM-NR predicts noticeably more precise anomaly regions than PatchCore+FPFH [11] and M3DM [9].

For the first issue, the core idea of existing unsupervised anomaly detection is to find the difference between normal representations and anomalies. Current 2D industrial anomaly detection methods fall mainly into two categories: (1) Reconstruction-based methods. Image reconstruction tasks are widely used in anomaly detection methods [3, 12, 13, 14, 15, 16] to learn normal representations. Reconstruction-based methods are easy to implement for a single-modal input (2D image or 3D point cloud), but for multi-modal inputs it is hard to find a reconstruction target. (2) Pretrained feature extractor-based methods. An intuitive way to utilize the feature extractor is to map the extracted features to a normal distribution and treat out-of-distribution features as anomalies. Normalizing flow-based methods [6, 17, 18] use an invertible transformation to directly construct the normal distribution, and memory bank-based methods [19, 5] store some representative features to implicitly construct the feature distribution. Compared with reconstruction-based methods, directly using a pretrained feature extractor does not involve the design of a multi-modal reconstruction target and is a better choice for the multi-modal task. Besides that, current multi-modal industrial anomaly detection methods [20, 18] directly concatenate the features of the two modalities. However, when the feature dimension is high, the disturbance between multi-modal features becomes severe and degrades performance.

Regarding the second issue of noisy anomaly detection, existing methods in noisy industrial detection have primarily focused on single-modality noisy anomaly detection using RGB images, with a lack of research on RGB-3D multi-modal noisy data. However, in practical industrial detection, noise often contaminates 3D data, and RGB-3D multi-modal data serve as an important reference for determining whether a sample is anomalous. The absence of exploration in RGB-3D multi-modal noisy data means that current methods are vulnerable to the multi-modal noisy data in real-world production environments. Furthermore, existing approaches employ a simplistic and naive strategy of patch-level denoising and sample re-weighting based on outlier-detection weights, leading to unsatisfying denoising effects and the persistence of noise in the dataset.

To solve the problems mentioned above, in this paper, we first delve into the RGB-3D multi-modal noisy industrial detection problem (Fig. 1-Top-Right). To address the challenges of RGB-3D multi-modal noisy data, we propose a novel three-stage multi-modal noise-resistant framework termed M3DM-NR, which performs denoising at both the sample level and the patch level, as shown in Fig. 2. This framework utilizes pretrained CLIP [21] and Point-BIND [22] models to extract aligned text, RGB, and 3D point cloud features and denoises multi-modal data through both cross-modal and intra-modal comparison. To the best of our knowledge, we are the first to employ a multi-modal learning approach based on pretrained CLIP and Point-BIND to solve the RGB-3D multi-modal noisy industrial anomaly detection problem. In this framework, Stage I, the proposed Intra-Modal Reference Selection, selects a few normal samples from the training dataset as intra-modal reference samples and computes the suspected anomaly map to focus on abnormal regions. In Stage II, recognizing that anomalies in industrial anomaly detection often constitute only a small fraction of the entire sample, we propose a novel Enhanced Multi-modal Denoising module to rank the anomalies of each training sample by performing multi-scale feature comparison and weighting with a suspected reference, enabling the filtering of anomalous samples. In Stage III, to address the above problems concerning multi-modal anomaly detection, we propose a novel Multimodal Anomaly Detection via Hybrid Fusion scheme to learn the pattern of the training dataset and conduct anomaly detection and segmentation while filtering out noise at the patch level. Different from existing methods that directly concatenate the features of the two modalities, we propose a hybrid fusion scheme to reduce the disturbance between multi-modal features and encourage feature interaction.
We propose Unsupervised Feature Fusion (UFF) to fuse multi-modal features, which is trained using a patch-wise contrastive loss to learn the inherent relation between multi-modal feature patches at the same position. To encourage the anomaly detection model to keep the single domain inference ability, we construct three memory banks separately for RGB, 3D and fused features. For the final decision, we construct Decision Layer Fusion (DLF) to consider all memory banks for anomaly detection and segmentation. Besides, we further propose a Point Feature Alignment (PFA) operation to better align 3D and 2D features and Noise Discriminative Coreset Selection to filter out noise at patch-level.

To evaluate our method, we conduct extensive experiments on the MVTec 3D-AD [10] and Eyecandies [23] datasets, comparing our method with existing RGB, 3D, and RGB-3D based industrial detection methods. Moreover, to further highlight the robustness of our method, we follow the experiment setting in SoftPatch [8] and conduct experiments under Non-Overlap and, more challenging, Overlap settings. The extensive experimental results and metrics (I-AUROC, P-AUROC, AUPRO) demonstrate that our method surpasses existing state-of-the-art approaches. Additionally, we performed a comprehensive ablation study, thoroughly validating the effectiveness of all novel modules proposed.

This is an extension of the previous conference version (M3DM [9] in CVPR’23). In the conference paper, we mainly proposed M3DM, a novel multi-modal industrial anomaly detection method with hybrid feature fusion, which outperforms the state of the art in detection and segmentation precision on MVTec 3D-AD [10]. In this extended journal version, we make the following four contributions:

• We study a new RGB-3D multi-modal noisy industrial anomaly detection task and substantially broaden our research to this practical setting, proposing a novel three-stage multi-modal noise-resistant framework termed M3DM-NR. It addresses reference selection, denoising, and final anomaly detection and segmentation, ensuring systematic and hierarchical processing.

• We design three novel modules in Stage I (Initial Feature Extraction, Suspected References Selection, and Suspected Anomaly Map Computation) to select a few normal samples from the training dataset as intra-modal reference samples and to generate suspected anomaly maps that focus on abnormal regions as the reference for the next stage.

• To obtain cleaner training data, we propose an extra Stage II, termed Enhanced Multi-modal Denoising, which introduces multi-scale feature comparison and weighting methods to finely rank and denoise training samples.

• We employ M3DM as Stage III to achieve final anomaly detection and segmentation. Extensive quantitative experiments across various settings demonstrate that our approach outperforms existing state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection. We also conduct extensive ablation studies to illustrate the effectiveness of each designed component.

2 Related Work

2.1 2D Industrial Anomaly Detection

Current 2D anomaly detection methods can be mainly categorized into the following three groups: 1) Data augmentation-based methods [24, 25, 14, 26, 27, 28] introduce pseudo anomalies into normal samples during training, with the aim of improving the system’s ability to identify such anomalies. 2) Reconstruction-based methods [29, 12, 13, 16, 15, 30, 31, 32, 33, 34] leverage auto-encoders and generative adversarial networks. Because these reconstruction methods do not accurately recover anomalous regions, comparing the reconstructed image with the original can pinpoint anomalies and facilitate decision-making. 3) Feature embedding-based methods [35, 36, 6, 17, 5, 37, 38, 39] depend on pre-trained feature extractors, with additional detection modules that learn to identify abnormal areas using the extracted features or representations. Drawing parallels between 2D and 3D anomaly detection, our work expands the application of the memory bank approach to 3D and multi-modal contexts.

2.2 3D Industrial Anomaly Detection

The first public 3D industrial anomaly detection dataset is the MVTec 3D-AD dataset [10], which includes both RGB information and point position data for each instance. Current 3D anomaly detection methods can be mainly categorized into the following four groups: 1) Data augmentation-based methods [40, 41] draw inspiration from 2D anomaly detection strategies to generate pseudo RGB and 3D anomaly samples, enhancing the model’s capacity to identify anomalies. 2) Reconstruction-based methods [42, 40] utilize auto-encoders and generative adversarial networks trained to generate normal samples for both RGB and 3D data, irrespective of whether the input is normal or anomalous. Such models fail to reconstruct anomalous regions faithfully, so comparing the reconstructed samples with the originals reveals anomalies and aids decision-making. 3) Feature embedding-based methods [11, 9, 43, 44, 45, 46] rely on pre-trained feature extractors, supplemented with additional fusion modules that align and integrate multi-modal information. Detection modules then utilize these fused features or representations to identify abnormal areas. 4) Knowledge distillation-based methods [47, 18, 48] train a student network to reconstruct samples or extract features, where the disparity between the teacher and student networks serves as an indicator of anomalies. In our research, we adopt the feature embedding-based approach but diverge with a novel pipeline.

2.3 Learning with Noisy Data

Recognizing noisy labels is increasingly gaining attention in the realm of supervised learning. Yet, this concept has scarcely been ventured into within unsupervised anomaly detection, largely due to the absence of clear labels. In classification tasks, certain studies have suggested filtering pseudo-labeled data that carry a high confidence threshold to mitigate noise [49, 50]. Li et al. [51] employ a mixture model to identify noisy-labeled data, adopting a semi-supervised approach for training. In the field of object detection, strategies such as multi-augmentation [52], a teacher-student model [53], or contrastive learning [54] have been leveraged, drawing on the expertise of expert models to reduce noise. However, the prevailing methods for recognizing noisy labels depend heavily on labeled data for correcting inaccuracies. Our research diverges by aiming to enhance a model’s resistance to noise in an unsupervised manner, thereby eliminating the need for manual annotations. A recent review [55] examines the robustness of 30 AD algorithms, yet overlooks unsupervised approaches in the context of annotation errors. Pang et al. [56] address anomalies in video without relying on manually labeled data, exploiting information across consecutive frames, contrasting our focus on detecting anomalies in single images. Other studies [57, 58, 59] tackle the elimination of noisy and corrupted data in semantic anomaly detection. SoftPatch [8] proposed to filter out noise at patch-level using outlier detection, but the employed outlier detection method is rather naive and doesn’t produce very good results. In this paper, we introduce a method that utilizes a pretrained CLIP-based model to extract and align multi-modal information, enabling the effective filtration of noise at sample-level.

2.4 Multi-modal Learning

Among the recent successes of large pre-trained vision-language models (VLMs) [60, 61, 21], CLIP [21] stands out as the first to employ pre-training on web-scale image-text data, demonstrating unprecedented generality. Notable features include its language-driven zero-shot inference capabilities, which have significantly enhanced both effective robustness [62] and perceptual alignment [63]. Other studies [64, 65, 66] have also utilized the pre-trained CLIP model for downstream tasks, such as language-guided detection and segmentation, achieving promising results. Beyond aligning vision and language, Point-BIND [22] extends this alignment to include the 3D modality. Recently, several works have attempted to apply the multimodal CLIP model to the AD domain [67, 68, 69, 70, 71]. Specifically, WinCLIP [67] leverages the robust multi-modal capabilities of the pre-trained CLIP model for effective zero-shot 2D anomaly detection.

In this paper, we utilize Point-BIND’s aligned embedding space of image, language, and 3D modalities to effectively filter out noise at the sample level in the training set.

3 Methodology

As shown in Fig. 3, our proposed M3DM-NR framework takes RGB images and 3D point clouds as input to perform RGB-3D based multi-modal noisy anomaly detection and segmentation. Specifically, M3DM-NR consists of three stages to achieve this goal: 1) Intra-modal Reference Selection (Stage I in Sec. 3.1) selects a few normal samples from the training dataset as intra-modal reference samples, and the suspected anomaly map is computed to focus on abnormal regions. 2) Enhanced Multi-modal Denoising (Stage II in Sec. 3.2) ranks the anomalies of each training sample by performing multi-scale feature comparison and weighting with a suspected reference, enabling the filtering of anomalous samples. 3) Multimodal Anomaly Detection via Hybrid Fusion (Stage III in Sec. 3.3) learns the pattern of the training dataset to conduct anomaly detection and segmentation while filtering out noise at the patch level.

3.1 Stage I: Intra-modal Reference Selection

3.1.1 Initial Feature Extraction

Given $M$ image and point cloud pairs $\left\{I_{m}\right\}_{m=1}^{M}$ and $\left\{P_{m}\right\}_{m=1}^{M}$ , RGB-3D anomaly detection requires three modalities of input information (text, image, and point cloud), so the feature pre-extraction consists of three parts:

Text prompt ensemble. The effectiveness of text descriptions is crucial for multimodal anomaly detection. Following APRIL-GAN [68], we employ a text prompt ensemble strategy $\varphi_{T}$ to fully explore the textual representation of defects. Specifically, the proposed strategy $\varphi_{T}$ includes several templates, each in the format “A photo of a state class”, where ‘state’ denotes predefined normal and abnormal state descriptions, and ‘class’ represents the class name. The output features are averaged using pooling to obtain the final descriptive features $f_{T}^{Nor}\in\mathbb{R}^{d}$ and $f_{T}^{Ano}\in\mathbb{R}^{d}$ .
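As an illustration only, the prompt ensemble can be sketched as below. The template string follows the format quoted above, while the state words, the class name `cookie`, and the `toy_encode_text` function are hypothetical placeholders: a real pipeline would call the text encoder of the pretrained model instead of the random stand-in used here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_prompts(class_name, states, templates):
    """Fill every template with every state description for one class."""
    return [t.format(state=s, cls=class_name) for t in templates for s in states]

def ensemble_text_feature(prompts, encode_text):
    """Average-pool the normalized prompt embeddings into one d-dim feature."""
    feats = l2_normalize(np.asarray(encode_text(prompts), dtype=float))
    return l2_normalize(feats.mean(axis=0))

def toy_encode_text(prompts, d=8):
    """Hypothetical stand-in for a text encoder: deterministic random vectors."""
    rows = []
    for p in prompts:
        rng = np.random.default_rng(sum(ord(c) for c in p))
        rows.append(rng.standard_normal(d))
    return np.stack(rows)

templates = ["A photo of a {state} {cls}."]   # template format quoted in the text
normal_states = ["flawless", "perfect"]       # hypothetical state descriptions
abnormal_states = ["damaged", "defective"]    # hypothetical state descriptions

f_T_nor = ensemble_text_feature(
    build_prompts("cookie", normal_states, templates), toy_encode_text)
f_T_ano = ensemble_text_feature(
    build_prompts("cookie", abnormal_states, templates), toy_encode_text)
```

Normalizing before and after pooling is a design choice of this sketch; it keeps later cosine similarities independent of the number of prompts.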

Multi-scale image feature representation. For each image $I_{m}$ in the training dataset, we first use the pretrained image encoder $E_{I}$ of the CLIP model to extract the corresponding feature $F_{I_{m}}$ :

$$F_{I_{m}}=E_{I}(I_{m}). \tag{1}$$

Then, a multi-scale segmentation operation $\mathcal{H}_{I}$ is used to segment $F_{I_{m}}$ into three scales $F_{I_{m}}^{\sigma},\sigma\in\{l,m,s\}$ , denoted as:

$$f_{I_{m}}^{l},F_{I_{m}}^{l},F_{I_{m}}^{m},F_{I_{m}}^{s}=\mathcal{H}_{I}\left(F_{I_{m}}\right), \tag{2}$$

where $f_{I_{m}}^{l}$ is the class token and $F_{I_{m}}^{\sigma}$ is obtained by the following equation:

$$F_{I_{m}}^{\sigma}=\left\{f_{uv}^{\sigma}\right\}_{I_{m}}=F_{I_{m}}\odot\left\{M_{uv}^{\sigma}\right\},\quad\textit{s.t.}~\sigma\in\{l,m,s\}. \tag{3}$$

$M=\left\{M_{uv}^{\sigma}\right\}$ is the multi-scale mask, where each $M_{uv}^{\sigma}\in\{0,1\}^{h\times w}$ is a binary mask that selects a window of kernel size $k\times k$ centered at $(u,v)$ , with ${M_{uv}^{l}}$ specifically selecting the entire image. $F_{I_{m}}^{\sigma}$ is the set of image patches at the large, middle, or small scale, $(u,v)$ indicates the coordinate of a patch in the original image, and $\odot$ denotes element-wise multiplication.
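A minimal sketch of how such multi-scale binary masks could be generated is given below. It is an illustration under stated assumptions rather than the exact implementation: windows are enumerated by their top-left corner instead of their center, the stride is 1, and the kernel sizes `k_mid=3` and `k_small=2` are hypothetical values.

```python
import numpy as np

def window_masks(h, w, k, stride=1):
    """Return all binary masks that select a k x k window on an h x w grid.

    Assumes k <= h and k <= w so that at least one window exists.
    """
    masks = []
    for top in range(0, h - k + 1, stride):
        for left in range(0, w - k + 1, stride):
            m = np.zeros((h, w), dtype=np.uint8)
            m[top:top + k, left:left + k] = 1
            masks.append(m)
    return np.stack(masks)

def multi_scale_masks(h, w, k_mid=3, k_small=2):
    """Masks for the three scales; the large scale selects the whole grid."""
    return {
        "l": np.ones((1, h, w), dtype=np.uint8),
        "m": window_masks(h, w, k_mid),    # hypothetical middle kernel size
        "s": window_masks(h, w, k_small),  # hypothetical small kernel size
    }

masks = multi_scale_masks(4, 4)
```

Applying a mask to a feature map or to an organized point cloud then amounts to an element-wise product with the mask broadcast over the channel dimension.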

Aligned multi-scale point cloud feature extraction. As shown in previous work [9], many anomalies in the MVTec 3D-AD [10] dataset cannot be detected through RGB images alone. For example, in the ‘potato’ category, an anomaly type named ‘cut’ can only be identified using 3D point cloud data. Incorporating 3D point cloud data into the noise-filtering process is therefore crucial, and we propose to use the 3D point cloud modality in noise detection.

However, we find in our experiments that relying solely on the whole point cloud is insufficient. In the MVTec 3D-AD dataset, defects often occupy only a small portion of the entire sample’s point cloud data, meaning that most areas of a sample are normal. Furthermore, existing works [72, 73, 22, 74] that align point cloud encoders with CLIP focus on object classification tasks, which prioritize the global information of the object’s 3D point cloud data and overlook local details. Traditional multi-scale point cloud segmentation based on farthest point sampling (FPS) (Fig. 4-Left) presents a full point cloud perspective with varying levels of sparsity but fails to specifically highlight local details. Yet, focusing on these details is crucial for detecting noisy samples.

To address this problem, we propose a novel Aligned Multi-Scale Point Cloud Feature Extraction module, as shown in the right part of Fig. 4. This approach enhances localized noise detection by extracting local point cloud features aligned with the granularity of image patching. Specifically, for each point cloud $P_{m}\in\mathbb{R}^{h\times w\times 3}$ in the training dataset, we segment $P_{m}$ into three scales, mirroring the approach used for image segmentation. We generate three sets of masks $\{M_{uv}^{l}\}$ , $\{M_{uv}^{m}\}$ , and $\{M_{uv}^{s}\}$ in the same way as for images. By applying these three sets of masks to the entire point cloud, we obtain three distinct sets of point clouds at different scales:

$$\{P_{uv}^{\sigma}\}_{m}=P_{m}\odot\{M_{uv}^{\sigma}\},\;\sigma\in\{l,m,s\}. \tag{4}$$

Unlike images, in point cloud modality, only the points that do not fall on the backplane are meaningful. Consequently, some smaller patches of the point cloud may contain only a few meaningful points or none at all, making them insignificant or even obstructive for anomaly detection. To enhance efficiency, we identify and discard these non-contributory patches during the segmentation. This process results in filtered sets of point clouds:

$$\{\hat{P}_{uv}^{\sigma}\}_{m}=\{P_{uv}^{\sigma}\,|\,Num(P_{uv}^{\sigma})>\theta\}_{m},\;\sigma\in\{l,m,s\}, \tag{5}$$

where $\theta$ is a hyper-parameter representing the threshold for the minimum number of points required in a point cloud patch to be considered meaningful.
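The filtering rule of Eq. 5 can be sketched as follows. This is a hedged illustration: it assumes an organized point cloud of shape $h\times w\times 3$ in which non-meaningful pixels (for example, removed backplane points) are stored as all-zero coordinates, which is an assumption of this sketch and not stated in the text. The function name `filter_point_patches` is hypothetical.

```python
import numpy as np

def filter_point_patches(P, masks, theta):
    """Keep only patches that contain more than `theta` meaningful points.

    P:     (h, w, 3) organized point cloud.
    masks: (n, h, w) binary patch masks.
    Returns a list of (patch_index, points) with points of shape (k, 3).
    """
    # Assumption: a pixel is meaningful iff its xyz coordinates are not all zero.
    valid = np.any(P != 0, axis=-1)
    kept = []
    for idx, m in enumerate(masks):
        selected = valid & (m > 0)
        if selected.sum() > theta:          # Num(P_uv) > theta, as in Eq. 5
            kept.append((idx, P[selected]))
    return kept
```

Returning the patch index together with the points preserves the $(u,v)$ correspondence with the image patches, which the later intra-modal comparison relies on.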

These sets of point clouds constitute three distinct scales of point cloud representation. The granularity of these patches is aligned with that of image patches, enhancing the efficacy of subsequent multi-modal anomaly detection. We extract features from these multi-scale point cloud patches:

$$\begin{aligned}f_{P_{m}}^{l},F_{P_{m}}^{l}&=\mathcal{H}_{P}\left(E_{P}(\{\hat{P}_{uv}^{l}\}_{m})\right),\\F_{P_{m}}^{m}&=\mathcal{H}_{P}\left(E_{P}(\{\hat{P}_{uv}^{m}\}_{m})\right),\\F_{P_{m}}^{s}&=\mathcal{H}_{P}\left(E_{P}(\{\hat{P}_{uv}^{s}\}_{m})\right),\end{aligned} \tag{6}$$

where $f_{P_{m}}^{l}$ is the class token and $F_{P_{m}}^{\sigma}$ is the feature map of $\sigma$ -scale point cloud.

3.1.2 Suspected References Selection

We first try to identify noisy samples in the training dataset solely by comparing the class tokens of text and RGB images. However, we observe that certain samples in the MVTec 3D-AD [10] dataset cannot be straightforwardly classified using only cross-modal comparison, i.e., text and image class tokens. For example, the ‘Foam’ category in MVTec 3D-AD includes a defect type labeled ‘color’, which defies classification with our text templates and necessitates comparison with an RGB reference image of a normal sample. Consequently, a language-guided zero-shot approach falls short of comprehensive anomaly classification, as some defects are only identifiable through intra-modal references, not merely by cross-modal comparison. Since noisy data constitute a relatively small fraction of the training set and the majority of samples are normal, we propose to select the $N$ samples that are most representative of normality from the training set in Stage I. These samples then serve as intra-modal references in Stage II to compensate for the shortcomings of cross-modal comparison. Specifically, $f_{I_{m}}^{l}$ is used to obtain the suspected anomaly score by computing its similarity with $f_{T}^{Nor}$ and $f_{T}^{Ano}$ as follows:

$$s_{I_{m}}=\frac{<f_{I_{m}}^{l},f_{T}^{Ano}>}{<f_{I_{m}}^{l},f_{T}^{Ano}>+<f_{I_{m}}^{l},f_{T}^{Nor}>}, \tag{7}$$

where $<\cdot,\cdot>$ denotes the cosine similarity. $s_{P_{m}}$ is calculated from $f_{P_{m}}^{l}$ , $f_{T}^{Nor}$ , and $f_{T}^{Ano}$ in the same way:

$$s_{P_{m}}=\frac{<f_{P_{m}}^{l},f_{T}^{Ano}>}{<f_{P_{m}}^{l},f_{T}^{Ano}>+<f_{P_{m}}^{l},f_{T}^{Nor}>}. \tag{8}$$

The final suspected score $s_{ref}$ combines $s_{I_{m}}$ and $s_{P_{m}}$ :

$$s_{ref}=s_{I_{m}}+s_{P_{m}}. \tag{9}$$

We select the $N$ samples with the smallest $s_{ref}$ as intra-modal references for Stage II; they are denoted as $\left\{R_{I_{n}}\right\}_{n=1}^{N}$ and $\left\{R_{P_{n}}\right\}_{n=1}^{N}$ in Fig. 3.
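The selection rule of Eqs. 7 to 9 can be sketched in a few lines. The sketch below is illustrative: the function names are hypothetical, and it assumes that the cosine similarities are positive so that the ratios in Eqs. 7 and 8 are well defined.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between each row of a (M, d) and a vector b (d,)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b

def suspected_score(cls_tokens, f_nor, f_ano):
    """Eqs. 7 and 8: share of the 'anomalous' similarity in the total."""
    sim_ano = cosine(cls_tokens, f_ano)
    sim_nor = cosine(cls_tokens, f_nor)
    return sim_ano / (sim_ano + sim_nor)   # assumes positive similarities

def select_references(f_img, f_pcd, f_nor, f_ano, n_ref):
    """Eq. 9 and the selection step: indices of the n_ref smallest s_ref."""
    s_ref = suspected_score(f_img, f_nor, f_ano) + suspected_score(f_pcd, f_nor, f_ano)
    return np.argsort(s_ref)[:n_ref], s_ref
```

For example, with `f_nor = [1, 0]`, `f_ano = [0, 1]`, and class tokens `[1, 0.1]`, `[0.1, 1]`, `[1, 0.2]` for both modalities, the first and third samples receive the smallest scores and are chosen when `n_ref = 2`.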

3.1.3 Suspected Anomaly Map Computation

Furthermore, we have observed that in industrial anomaly detection tasks, anomalies typically constitute only a small fraction of the entire sample. This means that attending to all small local patches uniformly does not lead to optimal detection of noisy samples. Consequently, we propose using the preliminary suspected anomaly map obtained from Stage I as the attention map in Noise-Focused Aggregation within Stage II. This strategy allows for differentiated attention across all local patches, enabling our model to focus more precisely on the local patches that may contain noise. To generate the preliminary suspected anomaly map, we follow WinCLIP [67], using harmonic aggregation of windows and multi-scale aggregation to get the suspected anomaly map $W_{n}\in\mathbb{R}^{h\times w}$ ( $n=1,\cdots,N$ ). These suspected anomaly maps $\left\{W_{n}\right\}_{n=1}^{N}$ serve as attention maps to enhance the denoising process in Stage II.
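One possible reading of the harmonic aggregation of windows is sketched below: each window carries a scalar score, and every pixel receives the harmonic mean of the scores of all windows that cover it. This is an assumption-laden illustration rather than the WinCLIP implementation; the function name and the convention that uncovered pixels receive zero are choices of this sketch.

```python
import numpy as np

def harmonic_aggregate(masks, window_scores, eps=1e-12):
    """Per-pixel harmonic mean of the scores of all covering windows.

    masks:         (n, h, w) binary window masks.
    window_scores: (n,) positive scalar score per window.
    Returns an (h, w) map; pixels covered by no window are set to 0.
    """
    masks = masks.astype(float)
    inv_sum = (masks / (window_scores[:, None, None] + eps)).sum(axis=0)
    cover = masks.sum(axis=0)               # number of windows per pixel
    out = np.zeros_like(cover)
    covered = cover > 0
    out[covered] = cover[covered] / inv_sum[covered]
    return out
```

Compared with an arithmetic mean, the harmonic mean is dominated by the smallest scores among the covering windows, so a pixel is flagged strongly only when most of its windows agree.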

3.2 Stage II: Enhanced Multi-modal Denoising

In industrial anomaly detection tasks, anomalies often occupy only a small portion of the entire sample. Therefore, after segmenting the sample into multi-scale patches, some patches will contain anomalies while others will not. Naturally, we aim to focus more on those patches containing anomalies and less on those without when computing the suspected anomaly score through intra-modality comparison, to enhance the accuracy of anomaly detection. This is achieved by assigning a weight to each patch based on the suspected anomaly map computed in Sec. 3.1.3, thereby allowing differential attention to patches based on their likelihood of containing anomalies. Specifically, this process is divided into four steps:

Intra-modal comparison. With $N$ intra-modal references selected during Stage I, we employ these image features $\left\{R_{I_{n}}\right\}_{n=1}^{N}$ and point cloud features $\left\{R_{P_{n}}\right\}_{n=1}^{N}$ for reference:

$$\begin{aligned}r_{I_{n}}^{l},R_{I_{n}}^{l},R_{I_{n}}^{m},R_{I_{n}}^{s}&=R_{I_{n}},\\r_{P_{n}}^{l},R_{P_{n}}^{l},R_{P_{n}}^{m},R_{P_{n}}^{s}&=R_{P_{n}},\end{aligned} \tag{10}$$

where $r_{I_{n}}^{l}$ and $r_{P_{n}}^{l}$ are class tokens, while $R_{I_{n}}^{\sigma}=\left\{r_{uv}^{\sigma}\right\}_{I_{n}}$ and $R_{P_{n}}^{\sigma}=\left\{r_{uv}^{\sigma}\right\}_{P_{n}}$ are $\sigma$ -scale feature maps. The intra-modality suspected anomaly score is determined by the cosine similarity between the feature vectors of the original query samples and those of intra-modal references:

$$\begin{aligned}\{\bar{s}_{uv}^{\sigma}\}_{I_{m}}&=\Big\{1-\max_{n\in[1,N]}<f_{uv}^{\sigma}|_{I_{m}},\,r_{uv}^{\sigma}|_{I_{n}}>\Big\},\\\{\bar{s}_{uv}^{\sigma}\}_{P_{m}}&=\Big\{1-\max_{n\in[1,N]}<f_{uv}^{\sigma}|_{P_{m}},\,r_{uv}^{\sigma}|_{P_{n}}>\Big\},\end{aligned} \tag{11}$$

where $\bar{s}_{I_{m}}=\{\bar{s}_{uv}^{\sigma}\}_{I_{m}}$ , $\bar{s}_{P_{m}}=\{\bar{s}_{uv}^{\sigma}\}_{P_{m}}$ , and $\sigma\in\{l,m,s\}$ .
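For a single scale and a single modality, Eq. 11 can be sketched as follows. The sketch assumes that each query patch is compared only with the reference patches at the same position $(u,v)$, which is how the indices in Eq. 11 read; the array layout and function name are hypothetical.

```python
import numpy as np

def intra_modal_scores(F_query, R_refs, eps=1e-12):
    """Eq. 11 for one scale: 1 - max over references of the cosine similarity.

    F_query: (n_patches, d) patch features of the query sample.
    R_refs:  (N, n_patches, d) patch features of the N reference samples.
    Returns (n_patches,) suspected anomaly scores.
    """
    q = F_query / (np.linalg.norm(F_query, axis=-1, keepdims=True) + eps)
    r = R_refs / (np.linalg.norm(R_refs, axis=-1, keepdims=True) + eps)
    sims = np.einsum("pd,npd->np", q, r)    # (N, n_patches), same-position pairs
    return 1.0 - sims.max(axis=0)
```

A patch that closely matches any one of the references obtains a score near zero, so a single good reference is enough to clear a patch.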

Compute weights for local patches. We first compute a weight for every local patch. Given the suspected anomaly map $W_{n}\in\mathbb{R}^{h\times w}$ , we obtain individual suspected anomaly maps for distinct patches by applying the masks generated in Sec. 3.1.1 to the whole suspected anomaly map:

$$\{W_{uv}^{\sigma}\}_{n}=\{W_{n}\odot M_{uv}^{\sigma}\},\;\sigma\in\{l,m,s\}. \tag{12}$$

In this way, we can determine the weight for each local patch at both middle and small scales. For large scale, the entire suspected anomaly map can be directly used as the weight.

Multi-scale anomaly score aggregation. For each local patch, the suspected anomaly score $\bar{s}^{\sigma}_{uv}$ is first distributed to every pixel of the local patch. Then, at each pixel of the whole sample, we aggregate the scores from all overlapping local patches to improve anomaly classification. To focus more on the patches that contain anomalies, we re-weight the score $\bar{s}^{\sigma}_{uv}$ using $W^{\sigma}_{uv}$ while aggregating multi-scale information. In this way, regions receive attention according to their likelihood of containing anomalies (Fig. 5-Left):

$$\begin{aligned}\{\bar{\bar{s}}_{uv}^{\sigma}\}_{I_{m}}&=\left\{\frac{\sum_{p,q}(W_{pq}^{\sigma}\odot\bar{s}_{pq}^{\sigma})_{uv}}{\sum_{p,q}(M_{pq}^{\sigma})_{uv}}\right\}_{I_{m}},\\\{\bar{\bar{s}}_{uv}^{\sigma}\}_{P_{m}}&=\left\{\frac{\sum_{p,q}(W_{pq}^{\sigma}\odot\bar{s}_{pq}^{\sigma})_{uv}}{\sum_{p,q}(M_{pq}^{\sigma})_{uv}}\right\}_{P_{m}}.\end{aligned} \tag{13}$$
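A minimal sketch of the weighted aggregation in Eq. 13 for one scale is given below. It assumes that the per-patch weight map is the suspected anomaly map restricted to the patch (Eq. 12) and that the patch score is a scalar broadcast over the patch; the function name and the handling of uncovered pixels are choices of this sketch.

```python
import numpy as np

def aggregate_scale(W, masks, patch_scores, eps=1e-12):
    """Eq. 13 for one scale.

    W:            (h, w) suspected anomaly map used as attention.
    masks:        (n, h, w) binary patch masks of this scale.
    patch_scores: (n,) intra-modal score of each patch (Eq. 11).
    Returns an (h, w) map of re-weighted, overlap-averaged scores.
    """
    masks = masks.astype(float)
    weighted = (W[None] * masks) * patch_scores[:, None, None]  # W_pq * s_pq
    numerator = weighted.sum(axis=0)
    coverage = masks.sum(axis=0)            # how many patches cover each pixel
    return numerator / np.maximum(coverage, eps)
```

Pixels covered by no patch have a zero numerator, so the guarded division leaves them at zero.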

Final suspected anomaly score computation. The final suspected image anomaly score $\tilde{s}_{I_{m}}$ is computed using both the cross-modal score $s_{I_{m}}$ calculated in Eq. 7 and the intra-modality score $\{\bar{\bar{s}}_{uv}^{\sigma}\}_{I_{m}}=\{\{\bar{\bar{s}}_{uv}^{l}\}_{I_{m}},\{\bar{\bar{s}}_{uv}^{m}\}_{I_{m}},\{\bar{\bar{s}}_{uv}^{s}\}_{I_{m}}\}$ calculated in Eq. 13:

$$\tilde{s}_{I_{m}}=\frac{1}{3}\Big(s_{I_{m}}+\max_{uv}\{\{\bar{\bar{s}}_{uv}^{m}\}_{I_{m}}+\{\bar{\bar{s}}_{uv}^{s}\}_{I_{m}}\}+\max_{uv}\{\bar{\bar{s}}_{uv}^{l}\}_{I_{m}}\Big). \tag{14}$$

A detailed explanation is given in the right part of Fig. 5. The final suspected point cloud anomaly score $\tilde{s}_{P_{m}}$ is computed in the same way:

$$\tilde{s}_{P_{m}}=\frac{1}{3}\Big(s_{P_{m}}+\max_{uv}\{\{\bar{\bar{s}}_{uv}^{m}\}_{P_{m}}+\{\bar{\bar{s}}_{uv}^{s}\}_{P_{m}}\}+\max_{uv}\{\bar{\bar{s}}_{uv}^{l}\}_{P_{m}}\Big). \tag{15}$$

The overall suspected anomaly score $\tilde{s_{I}}$ of a sample is then calculated as a weighted combination of $\tilde{s}_{I_{m}}$ and $\tilde{s}_{P_{m}}$ :

$$\tilde{s_{I}}=\lambda_{I}\tilde{s}_{I_{m}}+\lambda_{P}\tilde{s}_{P_{m}}, \tag{16}$$

where $\lambda_{I}$ and $\lambda_{P}$ are hyper-parameters controlling the extent to which the RGB and point cloud modalities are integrated. Finally, we remove the samples whose scores fall in the top $\tau$ percent.
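The scoring and filtering steps of Eqs. 14 to 16 can be sketched as below. This is an illustrative sketch: the default values of `lam_i` and `lam_p` are hypothetical, and rounding the number of removed samples up with `ceil` is an assumption, since the text does not specify how a fractional count is handled.

```python
import numpy as np

def final_modal_score(s_cross, agg_l, agg_m, agg_s):
    """Eqs. 14 and 15: average of the cross-modal score and two maxima."""
    return (s_cross + (agg_m + agg_s).max() + agg_l.max()) / 3.0

def combine_modalities(s_img, s_pcd, lam_i=0.5, lam_p=0.5):
    """Eq. 16 with hypothetical default weights."""
    return lam_i * s_img + lam_p * s_pcd

def drop_top_tau(scores, tau_percent):
    """Return the sorted indices of samples kept after removing the top tau percent."""
    scores = np.asarray(scores, dtype=float)
    n_drop = int(np.ceil(len(scores) * tau_percent / 100.0))  # assumed rounding
    order = np.argsort(-scores)             # most suspicious samples first
    return np.sort(order[n_drop:])
```

With ten samples and `tau_percent=20`, the two highest-scoring samples are removed and the remaining eight indices are returned in their original order.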

3.3 Stage III: Multimodal Anomaly Detection via Hybrid Fusion

As shown in Fig. 3, Stage III takes in the dataset filtered by Stage I&II as input and learns its pattern to conduct anomaly detection and segmentation. Besides, Stage III also filters out noise at patch-level in case some hard noise samples still exist in the training dataset.

3.3.1 Point Feature Alignment

Point Feature Interpolation. After farthest point sampling (FPS) within the Point Transformer ( $E^{\prime}_{P}$ ), the center points are unevenly distributed over the point cloud, which leads to an imbalance in the density of point features. To address this, we interpolate the features back to the original point cloud. With $K$ point features $g_{i}$ corresponding to $K$ center points $c_{i}$ , we employ inverse distance weighting to interpolate the feature for each point $p_{j}$ in the input point cloud. The interpolation is given by:

p^{\prime}_{j}=\sum_{i=1}^{K}\alpha_{i}g_{i},\quad\alpha_{i}=\frac{\frac{1}{\|c_{i}-p_{j}\|_{2}+\epsilon}}{\sum_{k=1}^{K}\sum_{t=1}^{T}\frac{1}{\|c_{k}-p_{t}\|_{2}+\epsilon}},

(17)

where $\epsilon$ is a small constant to prevent division by zero.
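A minimal sketch of inverse distance weighting is given below. It is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, and the weights are normalized per query point so that they sum to one, which is the conventional form of inverse distance weighting, whereas the denominator of Eq. 17 as printed sums over all points.

```python
import math


def idw_interpolate(centers, center_feats, points, eps=1e-8):
    """Interpolate K center features onto every input point.

    centers:      list of K 3D coordinates c_i
    center_feats: list of K feature vectors g_i
    points:       list of 3D coordinates p_j of the full point cloud
    Assumption: weights are normalized per query point p_j.
    """
    dim = len(center_feats[0])
    interpolated = []
    for p in points:
        # Inverse distance from this point to every center.
        inv = [1.0 / (math.dist(c, p) + eps) for c in centers]
        total = sum(inv)
        interpolated.append([
            sum(w * g[d] for w, g in zip(inv, center_feats)) / total
            for d in range(dim)
        ])
    return interpolated
```

A point that coincides with a center receives almost exactly that center's feature, while a point midway between two centers receives their average.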

Point Feature Projection. After interpolation, we project the interpolated point features $p^{\prime}_{j}$ onto a 2D plane as $\hat{p}$ using the point coordinates and camera parameters. Noting the sparsity of point clouds, we assign a value of 0 to any 2D plane position lacking a corresponding point. The resulting projected feature map matches the size of the RGB image.
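The projection step can be sketched as below, assuming a pinhole camera model. The function name and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical stand-ins for the camera parameters mentioned above, and when several points fall on the same pixel this sketch simply keeps the last one.

```python
def project_features(points, feats, fx, fy, cx, cy, height, width):
    """Scatter per-point features onto an H x W map; empty pixels stay 0.

    Assumption: a pinhole model with focal lengths (fx, fy) and
    principal point (cx, cy); points with z <= 0 are skipped.
    """
    dim = len(feats[0])
    fmap = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for (x, y, z), f in zip(points, feats):
        if z <= 0:
            continue
        u = int(round(fx * x / z + cx))  # column index
        v = int(round(fy * y / z + cy))  # row index
        if 0 <= u < width and 0 <= v < height:
            fmap[v][u] = list(f)
    return fmap
```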

3.3.2 Unsupervised Feature Fusion

The interaction between multi-modal features can yield new information beneficial for industrial anomaly detection. For instance, as shown in Fig. 1, detecting a hole in a cookie necessitates the integration of both its black color and the shape depression. To decipher the intrinsic relationship between these modalities in the training data, we developed the Unsupervised Feature Fusion (UFF) module.

We introduce a patch-wise contrastive loss to train this module. Given RGB features $f_{I}$ and point cloud features $f_{P}$ , our goal is to promote a higher correlation of information between features from different modalities at identical spatial positions while minimizing this correlation for features at distinct positions.

The features of a sample are represented as $\{\{f_{uv}\}_{I_{i}},\{f_{uv}\}_{P_{i}}\}$ , where $i$ denotes the index of the training sample and $(u,v)$ denotes the patch position. We employ MLP layers $\{\chi_{I},\chi_{P}\}$ to derive interaction information between the two modalities and fully connected layers $\{\sigma_{I},\sigma_{P}\}$ to transform the processed features into query or key vectors, denoted as $\{\{h_{uv}\}_{I_{i}},\{h_{uv}\}_{P_{i}}\}$ . For contrastive learning, we apply the InfoNCE loss:

\mathcal{L}_{con}=\frac{\{h_{uv}\}_{I_{i}}\cdot\{h_{uv}\}_{P_{i}}}{\sum_{t=1}^{N_{b}}\sum_{uv}\{h_{uv}\}^{t}_{I}\cdot\{h_{uv}\}^{t}_{P}},

(18)

where $N_{b}$ is the batch size. The UFF module, trained with collective training data from all categories in MVTec 3D-AD, is depicted in Fig. 6.

During inference, outputs of the MLP layers are concatenated to form a fused patch feature, denoted as $\{f_{uv}\}_{F_{i}}$ .
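The patch-wise contrastive objective can be illustrated with the sketch below. It uses the common softmax form of InfoNCE with a temperature, which is an assumption of this example: Eq. 18 is printed as a plain ratio of dot products, and the function name and the temperature value are hypothetical.

```python
import math


def info_nce(h_rgb, h_pc, temperature=0.1):
    """Mean InfoNCE loss over N paired patch embeddings.

    h_rgb[n] and h_pc[n] come from the same patch position (positive pair);
    every other pairing in the batch is treated as a negative.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(h_rgb)
    loss = 0.0
    for i in range(n):
        logits = [dot(h_rgb[i], h_pc[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n
```

The loss is small when embeddings at identical positions are the most similar pair and grows when a feature matches a patch at a different position better than its own.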

3.3.3 Noise Discriminative Coreset Selection

In our experiments, we found that, despite pre-processing the training data to remove noise at the sample level, some noise samples that closely resemble normal samples could not be eliminated. To address this, we conduct a second round of denoising at the patch level. Following SoftPatch [8], we discard noise patches in the coreset selection process. We first calculate outlier scores for all patches, then aggregate these scores to identify the noise patches, and finally remove the patches with the top $\tau$ percent scores. We implement this using the Local Outlier Factor (LOF) method.

LOF is a local-density-based outlier detector. Inspired by SoftPatch, we use LOF in M3DM in two ways. First, we use LOF to rule out noise patches, with the aim of making the training dataset contain only normal samples. Second, we use the LOF score as a soft weight for patches to achieve more accurate anomaly detection.

The k-distance-based absolute local reachability density ${lrd}_{{uv}_{i}}$ is first calculated as:

\begin{gathered}{lrd}_{{uv}_{i}}=(\frac{\sum_{b\in\mathcal{N}_{k}(f_{{uv}_{i}})}dist_{k}^{reach}(f_{{uv}_{i}},f^{b}_{uv})}{|\mathcal{N}_{k}(f_{{uv}_{i}})|})^{-1},\\ {dist}_{k}^{reach}(f_{{uv}_{i}},f^{b}_{uv})=\max(dist_{k}(f^{b}_{uv}),d(f_{{uv}_{i}},f^{b}_{uv})),\end{gathered}

(19)

where $d(f_{{uv}_{i}},f^{b}_{uv})$ is the L2 distance, $dist_{k}(\cdot)$ is the distance to the $k$-th nearest neighbor, $\mathcal{N}_{k}(f_{{uv}_{i}})$ is the set of $k$-nearest neighbors of $f_{{uv}_{i}}$ , and $|\mathcal{N}_{k}(f_{{uv}_{i}})|$ is the size of that set, which usually equals $k$ when there are no repeated neighbors. Using the local reachability density of each patch largely reduces the overwhelming effect of large clusters. To normalize local density into a relative density that treats all clusters equally, the relative density $\eta_{{uv}_{i}}$ of the patch at position $(u,v)$ of image $i$ is defined as:

\eta_{{uv}_{i}}=\frac{\sum_{b\in\mathcal{N}_{k}(f_{{uv}_{i}})}{lrd}^{b}_{uv}}{|\mathcal{N}_{k}(f_{{uv}_{i}})|\cdot{lrd}_{{uv}_{i}}}.

(20)

$\eta_{{uv}_{i}}$ is the ratio of the neighbors' mean local density to the patch's own, so a larger value indicates a patch lying in a sparser region than its neighbors; it serves as the patch's outlier score. Patches with the top $\tau$ percent scores are removed before coreset selection.
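Eqs. 19 and 20 can be written out directly, as in the brute-force sketch below. The function names are hypothetical, neighbors are found by exhaustive search (practical only for small inputs), and a tiny constant guards against division by zero for duplicated features; none of these choices are claimed to match the authors' implementation.

```python
import math


def lof_scores(feats, k):
    """Return eta (Eq. 20) for every patch feature; requires len(feats) > k."""
    n = len(feats)
    neighbours, k_dist = [], []
    for i in range(n):
        ranked = sorted((math.dist(feats[i], feats[j]), j) for j in range(n) if j != i)
        neighbours.append([j for _, j in ranked[:k]])
        k_dist.append(ranked[k - 1][0])  # distance to the k-th neighbour
    # Local reachability density (Eq. 19).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[b], math.dist(feats[i], feats[b])) for b in neighbours[i]]
        lrd.append(len(reach) / (sum(reach) + 1e-12))
    # Relative density of the neighbours over the patch's own (Eq. 20).
    return [
        sum(lrd[b] for b in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


def remove_top_tau(feats, k, tau_percent):
    """Indices of patches kept after dropping the top tau percent of eta."""
    eta = lof_scores(feats, k)
    n_drop = int(len(feats) * tau_percent / 100.0)
    ranked = sorted(range(len(feats)), key=lambda i: eta[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(feats)) if i not in dropped]
```

For a tight cluster plus one distant feature, the distant feature receives by far the largest $\eta$ and is the first to be removed.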

3.3.4 Decision Layer Fusion

As depicted in Fig. 1, certain industrial anomalies, such as the protruding part of a potato, manifest exclusively in a single domain, making the correlation between multi-modal features less evident. Additionally, despite the advantages of Feature Fusion in enhancing multi-modal feature interaction, we observed some loss of information during the fusion process. Furthermore, we observed that, despite undergoing denoising at both the image and patch levels, some hard noise patches remain within the dataset. These hard noise elements can adversely affect the precision of anomaly scores during the final inference stage.

To address these issues, we propose utilizing multiple memory banks to preserve the original color feature ( $f_{I}$ ), point cloud feature ( $f_{P}$ ), and fusion feature ( $f_{F}$ ). These are denoted as $\mathcal{M}_{I}$ , $\mathcal{M}_{P}$ , and $\mathcal{M}_{F}$ respectively. Besides, we propose to use $\eta_{{uv}_{i}}$ obtained in Sec. 3.3.3 to re-weight the anomaly score during inference, which can down-weight noisy samples according to outlier scores. During inference, each bank contributes to predicting an anomaly score and a segmentation map. Two learnable One-Class Support Vector Machines (OCSVMs), $\mathcal{D}_{image}$ and $\mathcal{D}_{pixel}$ , are employed to finalize the anomaly score $S_{image}$ and the segmentation map $S_{pixel}$ . This procedure is referred to as Decision Layer Fusion (DLF) and can be mathematically represented as follows:

\begin{gathered}S_{image}=\mathcal{D}_{image}(\phi(\mathcal{M}_{I},f_{I}),\phi(\mathcal{M}_{P},f_{P}),\phi(\mathcal{M}_{F},f_{F})),\\ S_{pixel}=\mathcal{D}_{pixel}(\psi(\mathcal{M}_{I},f_{I}),\psi(\mathcal{M}_{P},f_{P}),\psi(\mathcal{M}_{F},f_{F})),\end{gathered}

(21)

where $\phi$ and $\psi$ are scoring functions, defined as follows:

\begin{gathered}\phi(\mathcal{M},f)=\eta_{{uv}_{i}}\|f^{*}_{{uv}_{i}}-m^{*}\|_{2},\\ \psi(\mathcal{M},f)=\{\min_{m\in\mathcal{M}}\|f_{{uv}_{i}}-m\|_{2}\Big|f_{{uv}_{i}}\in f\},\\ f^{*}_{{uv}_{i}},m^{*}=\arg\max_{f_{{uv}_{i}}\in f}\arg\min_{m\in\mathcal{M}}\|f_{{uv}_{i}}-m\|_{2},\end{gathered}

(22)

where $\mathcal{M}\in\{\mathcal{M}_{I},\mathcal{M}_{P},\mathcal{M}_{F}\}$ , $f\in\{f_{I},f_{P},f_{F}\}$ and $\eta_{{uv}_{i}}$ is the weight parameter obtained in Sec. 3.3.3.
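The two scoring functions of Eq. 22 can be sketched for a single memory bank as follows. The function names are hypothetical, and attaching the weight $\eta$ to the nearest memory entry is an assumption of this sketch, since the index of $\eta$ in Eq. 22 leaves that choice open.

```python
import math


def pixel_scores(memory, feats):
    """psi: distance from each patch feature to its nearest memory entry."""
    return [min(math.dist(f, m) for m in memory) for f in feats]


def image_score(memory, weights, feats):
    """phi: eta-weighted distance of the most anomalous patch.

    weights[j] is the eta value stored alongside memory[j]
    (assumption: the weight of the nearest memory entry is used).
    """
    best_distance, best_index = -1.0, -1
    for f in feats:
        j = min(range(len(memory)), key=lambda idx: math.dist(f, memory[idx]))
        d = math.dist(f, memory[j])
        if d > best_distance:
            best_distance, best_index = d, j
    return weights[best_index] * best_distance
```

In the full pipeline these functions would be evaluated once per memory bank ( $\mathcal{M}_{I}$ , $\mathcal{M}_{P}$ , $\mathcal{M}_{F}$ ), and the three outputs would then be passed to $\mathcal{D}_{image}$ and $\mathcal{D}_{pixel}$ .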

4 Experiment

TABLE I: I-AUROC score for regular anomaly detection of all categories of MVTec 3D-AD. Our method maintains the regular anomaly detection ability. Baseline results are taken from [10, 20, 18, 75]. Optimal and sub-optimal results are in bold and underlined, respectively.

Method	Bagel	Cable Gland	Carrot	Cookie	Dowel	Foam	Peach	Potato	Rope	Tire	Mean
3D
3D-ST[47]	86.2	48.4	83.2	89.4	84.8	66.3	76.3	68.7	95.8	48.6	74.8
FPFH[20]	82.5	55.1	95.2	79.7	88.3	58.2	75.8	88.9	92.9	65.3	78.2
AST[18]	88.1	57.6	96.5	95.7	67.9	79.7	99.0	91.5	95.6	61.1	83.3
M3DM[9]	94.1	65.1	96.5	96.9	90.5	76.0	88.0	97.4	92.6	76.5	87.4
Ours	94.2	66.1	95.5	97.2	90.4	77.2	88.1	96.4	91.6	78.5	87.4
RGB
PADiM[19]	97.5	77.5	69.8	58.2	95.9	66.3	85.8	53.5	83.2	76.0	76.4
PatchCore[5]	87.6	88.0	79.1	68.2	91.2	70.1	69.5	61.8	84.1	70.2	77.0
STFPM[76]	93.0	84.7	89.0	57.5	94.7	76.6	71.0	59.8	96.5	70.1	79.3
CS-Flow[6]	94.1	93.0	82.7	79.5	99.0	88.6	73.1	47.1	98.6	74.5	83.0
AST[18]	94.7	92.8	85.1	82.5	98.1	95.1	89.5	61.3	99.2	82.1	88.0
M3DM[9]	94.4	91.8	89.6	74.9	95.9	76.7	91.9	64.8	93.8	76.7	85.0
Ours	94.2	91.7	89.4	73.9	96.1	77.8	93.3	64.9	92.8	77.7	85.1
RGB + 3D
Voxel GAN[10]	68.0	32.4	56.5	39.9	49.7	48.2	56.6	57.9	60.1	48.2	51.7
PatchCore + FPFH[20]	91.8	74.8	96.7	88.3	93.2	58.2	89.6	91.2	92.1	88.6	86.5
AST[18]	98.3	87.3	97.6	97.1	93.2	88.5	97.4	98.1	100.0	79.7	93.7
M3DM[9]	99.4	90.9	97.2	97.6	96.0	94.2	97.3	89.9	97.2	85.0	94.5
Ours	99.3	91.1	97.7	97.6	96.0	92.2	97.3	89.9	95.5	88.2	94.5

TABLE II: AUPRO score for regular anomaly segmentation of all categories of MVTec 3D-AD. Our method maintains the regular anomaly segmentation ability. Baseline results are taken from [10, 20, 75]. Optimal and sub-optimal results are in bold and underlined, respectively.

Method	Bagel	Cable Gland	Carrot	Cookie	Dowel	Foam	Peach	Potato	Rope	Tire	Mean
3D
3D-ST[47]	95.0	48.3	98.6	92.1	90.5	63.2	94.5	98.8	97.6	54.2	83.3
FPFH[20]	97.3	87.9	98.2	90.6	89.2	73.5	97.7	98.2	95.6	96.1	92.4
M3DM[9]	94.3	81.8	97.7	88.2	88.1	74.3	95.8	97.4	95.0	92.9	90.6
Ours	94.2	81.8	97.8	88.3	88.0	74.3	95.8	97.4	95.0	92.9	90.6
RGB
CFlow[6]	85.5	91.9	95.8	86.7	96.9	50.0	88.9	93.5	90.4	91.9	87.1
PatchCore[5]	90.1	94.9	92.8	87.7	89.2	56.3	90.4	93.2	90.8	90.6	87.6
PADiM[19]	98.0	94.4	94.5	92.5	96.1	79.2	96.6	94.0	93.7	91.2	93.0
M3DM[9]	95.2	97.2	97.3	89.1	93.2	84.3	97.0	95.6	96.8	96.6	94.2
Ours	95.4	97.0	97.3	89.1	93.4	84.3	97.0	95.6	96.8	96.6	94.2
RGB+3D
Voxel GAN[10]	66.4	62.0	76.6	74.0	78.3	33.2	58.2	79.0	63.3	48.3	63.9
PatchCore + FPFH[20]	97.6	96.9	97.9	97.3	93.3	88.8	97.5	98.1	95.0	97.1	95.9
M3DM[9]	97.0	97.1	97.9	95.0	94.1	93.2	97.7	97.1	–	97.5	96.4
Ours	97.4	97.1	97.8	94.5	93.8	94.7	97.8	97.1	97.2	97.4	96.5

4.1 Experimental Setup

Dataset. 3D industrial anomaly detection is still at an early stage. MVTec 3D-AD [10] is the first 3D industrial anomaly detection dataset, and all our experiments are performed on it. The dataset consists of 10 categories, with a total of 2,656 training samples and 1,137 testing samples. The 3D scans were acquired by an industrial sensor using structured light, and position information is stored in 3-channel tensors representing the $x$ , $y$ and $z$ coordinates. These 3-channel tensors map one-to-one to the corresponding point clouds. Additionally, RGB information is recorded for each point. Because all samples in the dataset are viewed from the same angle, the RGB information of each sample can be stored in a single image. In total, each sample of the MVTec 3D-AD dataset contains a colored point cloud.

We conduct both regular anomaly detection (Sec. 4.2) and noisy anomaly detection (Sec. 4.3). For noisy anomaly detection, in order to generate a noisy training set, we randomly select 10% anomalous samples from the test set and add them to the existing training samples. Additionally, we establish two distinct settings, Overlap and Non-Overlap, to assess the robustness of our model. In the Overlap setting, the anomalous samples added to the training dataset are also kept in the test dataset, to demonstrate the risk that defects with similar appearance will severely degrade the performance of an anomaly detector trained with noisy data. Conversely, in the Non-Overlap setting, these samples are not included in the test set.
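The construction of the two noisy settings can be sketched as below. The function name is hypothetical, and the number of injected samples is assumed to equal 10% of the size of the clean training set; this reading is consistent with the 9.09% noise level reported without denoising in Tab. VII, since $0.1/1.1\approx 9.09\%$ .

```python
import random


def build_noisy_split(train_normal, test_anomalous, ratio=0.10, overlap=True, seed=0):
    """Inject anomalous test samples into the training set.

    Returns (noisy training set, anomalous test samples, noise level).
    Assumption: the injected count is ratio * len(train_normal).
    Samples are expected to be hashable identifiers such as file paths.
    """
    rng = random.Random(seed)
    n_noise = min(len(test_anomalous), round(ratio * len(train_normal)))
    noise = rng.sample(test_anomalous, n_noise)
    train = list(train_normal) + noise
    if overlap:
        # Overlap: injected samples are also evaluated at test time.
        test = list(test_anomalous)
    else:
        # Non-Overlap: injected samples are excluded from the test set.
        picked = set(noise)
        test = [s for s in test_anomalous if s not in picked]
    return train, test, n_noise / len(train)
```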

Data Pre-processing. Unlike 2D data, 3D data allow background information to be removed easily. Following [20], we estimate the background plane with RANSAC[77] and remove any point within a distance of 0.005 of it. At the same time, we set the pixels of the RGB image that correspond to removed points to 0. This operation not only accelerates 3D feature processing during training and inference but also reduces the background disturbance for anomaly detection. Finally, we resize both the position tensor and the RGB image to $224\times 224$ , which matches the input size of the feature extractor.
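Background removal with a RANSAC plane fit can be sketched as follows. This is a plain illustrative version: the function names, the iteration count and the seed are assumptions, and only the 0.005 distance threshold is taken from the text.

```python
import random


def fit_plane(p1, p2, p3):
    """Plane through three points as (unit normal n, offset d), n.x + d = 0."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    n = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-12:
        return None  # the three points are collinear
    n = tuple(c / norm for c in n)
    return n, -sum(n[i] * p1[i] for i in range(3))


def remove_background(points, threshold=0.005, iters=200, seed=0):
    """Return indices of points that are NOT on the dominant plane."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [
            i for i, p in enumerate(points)
            if abs(sum(n[k] * p[k] for k in range(3)) + d) < threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    background = set(best_inliers)
    return [i for i in range(len(points)) if i not in background]
```

The returned indices identify foreground points; the RGB pixels of all other points would then be set to 0 as described above.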

Feature Extractors. In Stage I&II, we use the text and image encoders of the LAION-2B based CLIP with ViT-H/14 and the point cloud encoder of Point-BIND. In Stage III, we use ViT-B/8 pretrained on ImageNet[78] with DINO[79] as the RGB image encoder and a Point Transformer[80, 81] pretrained on the ShapeNet[82] dataset as the 3D point cloud encoder; the outputs of layers $\{3,7,11\}$ serve as our 3D point cloud feature.

Learnable Module Details. Stage I&II are training-free, and Stage III has 2 learnable modules: the Unsupervised Feature Fusion module and the Decision Layer Fusion module. 1) For UFF, $\chi_{I}$ and $\chi_{P}$ are two-layer MLPs whose hidden dimension is $4\times$ the input feature dimension. We use the AdamW optimizer with a learning rate of 0.003 and cosine warm-up over 250 steps. The batch size is 16, and we report the best anomaly detection results within 750 UFF training steps. 2) For DLF, we use two linear OCSVMs [83] with SGD [84] optimizers; the learning rate is set to $1\times 10^{-4}$ and each class is trained for 1000 steps.

Evaluation Metrics. All evaluation metrics are exactly the same as in [10]. We evaluate image-level anomaly detection with the area under the receiver operating characteristic curve (I-AUROC); a higher I-AUROC means better image-level anomaly detection. For segmentation, we use the area under the per-region overlap curve (AUPRO), where per-region overlap is defined as the average relative overlap of the binary prediction with each connected component of the ground truth. Similar to I-AUROC, the receiver operating characteristic curve of pixel-level predictions can be used to compute P-AUROC for evaluating segmentation performance.
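As a small illustration of the image-level metric, AUROC can be computed from anomaly scores and binary labels with the rank statistic below. The function name is hypothetical; the same routine applied to per-pixel scores and labels gives P-AUROC. AUPRO requires connected-component analysis and is not covered by this sketch.

```python
def auroc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    scores: anomaly scores, higher means more anomalous
    labels: 1 for anomalous samples, 0 for normal samples
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one anomalous and one normal sample")
    # Count correctly ordered pairs; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

The value equals the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one.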

TABLE III: I-AUROC score for anomaly detection under Overlap setting of all categories in MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

Method	Bagel	Cable Gland	Carrot	Cookie	Dowel	Foam	Peach	Potato	Rope	Tire	Mean
3D
SIFT	50.0 $\pm$ 0.8	48.5 $\pm$ 1.9	67.8 $\pm$ 0.2	58.1 $\pm$ 0.4	58.2 $\pm$ 3.8	49.2 $\pm$ 2.8	40.5 $\pm$ 0.6	47.0 $\pm$ 1.3	43.3 $\pm$ 1.1	45.0 $\pm$ 2.7	50.8 $\pm$ 0.5
FPFH	53.4 $\pm$ 2.8	40.9 $\pm$ 3.2	71.4 $\pm$ 1.2	62.7 $\pm$ 0.8	64.5 $\pm$ 2.4	38.5 $\pm$ 0.3	46.8 $\pm$ 2.6	45.3 $\pm$ 1.5	52.2 $\pm$ 1.5	51.5 $\pm$ 4.2	52.7 $\pm$ 0.3
AST	61.0 $\pm$ 0.6	38.4 $\pm$ 0.6	72.9 $\pm$ 0.6	75.2 $\pm$ 0.6	47.8 $\pm$ 0.6	55.7 $\pm$ 0.6	66.9 $\pm$ 0.6	60.6 $\pm$ 0.6	55.5 $\pm$ 1.0	49.2 $\pm$ 0.6	58.3 $\pm$ 0.2
Shape-Guided	66.1 $\pm$ 5.1	58.7 $\pm$ 10.4	71.4 $\pm$ 6.0	76.4 $\pm$ 1.4	71.6 $\pm$ 0.7	54.1 $\pm$ 3.1	61.0 $\pm$ 4.5	59.3 $\pm$ 5.7	60.7 $\pm$ 4.5	64.3 $\pm$ 7.4	64.4 $\pm$ 1.9
M3DM	74.0 $\pm$ 0.7	56.7 $\pm$ 1.8	72.2 $\pm$ 1.7	74.5 $\pm$ 0.6	77.4 $\pm$ 0.7	62.3 $\pm$ 0.6	56.2 $\pm$ 1.9	64.1 $\pm$ 0.5	72.5 $\pm$ 0.5	74.3 $\pm$ 1.8	68.4 $\pm$ 0.7
Ours	93.5 $\pm$ 1.6	71.8 $\pm$ 1.3	93.8 $\pm$ 0.7	91.1 $\pm$ 2.3	78.0 $\pm$ 2.7	67.2 $\pm$ 3.2	79.9 $\pm$ 1.4	79.9 $\pm$ 2.2	87.9 $\pm$ 0.4	79.8 $\pm$ 3.5	82.3 $\pm$ 0.4
RGB
PaDim	70.8 $\pm$ 0.7	57.3 $\pm$ 2.6	54.7 $\pm$ 0.5	43.2 $\pm$ 1.6	72.1 $\pm$ 0.3	55.4 $\pm$ 2.2	61.7 $\pm$ 0.3	36.8 $\pm$ 1.3	74.8 $\pm$ 2.5	55.2 $\pm$ 1.5	58.2 $\pm$ 0.4
PatchCore	64.9 $\pm$ 0.7	71.4 $\pm$ 0.9	71.5 $\pm$ 1.5	52.5 $\pm$ 2.2	73.3 $\pm$ 1.2	56.5 $\pm$ 2.9	46.6 $\pm$ 1.1	36.8 $\pm$ 0.4	54.2 $\pm$ 1.3	57.2 $\pm$ 1.3	58.5 $\pm$ 0.4
AST	57.6 $\pm$ 0.6	62.2 $\pm$ 0.0	50.7 $\pm$ 0.0	47.5 $\pm$ 0.6	58.8 $\pm$ 0.0	56.0 $\pm$ 0.0	54.6 $\pm$ 0.0	43.7 $\pm$ 0.6	42.8 $\pm$ 0.0	44.6 $\pm$ 0.6	51.8 $\pm$ 0.2
Shape-Guided	62.7 $\pm$ 4.4	64.3 $\pm$ 9.3	66.9 $\pm$ 7.3	57.3 $\pm$ 16.4	72.1 $\pm$ 0.9	51.5 $\pm$ 3.2	52.9 $\pm$ 10.0	50.3 $\pm$ 11.1	50.5 $\pm$ 9.4	58.2 $\pm$ 9.3	58.7 $\pm$ 5.8
SoftPatch	88.8 $\pm$ 1.1	87.3 $\pm$ 2.2	84.9 $\pm$ 1.3	63.3 $\pm$ 1.2	96.5 $\pm$ 0.8	75.0 $\pm$ 1.6	62.3 $\pm$ 0.7	43.6 $\pm$ 2.1	89.3 $\pm$ 1.4	71.0 $\pm$ 0.9	76.2 $\pm$ 0.3
M3DM	64.1 $\pm$ 1.4	62.1 $\pm$ 2.1	65.5 $\pm$ 0.9	53.6 $\pm$ 2.1	70.7 $\pm$ 0.9	57.0 $\pm$ 1.2	54.7 $\pm$ 2.0	42.1 $\pm$ 2.3	53.8 $\pm$ 1.1	58.3 $\pm$ 0.9	58.2 $\pm$ 0.5
Ours	90.3 $\pm$ 0.4	87.5 $\pm$ 3.4	86.5 $\pm$ 1.8	67.1 $\pm$ 4.6	86.1 $\pm$ 0.6	79.2 $\pm$ 2.8	84.4 $\pm$ 2.3	54.6 $\pm$ 6.2	90.0 $\pm$ 2.2	73.1 $\pm$ 1.1	79.9 $\pm$ 0.4
3D+RGB
PatchCore+FPFH	61.3 $\pm$ 2.7	58.3 $\pm$ 0.9	72.3 $\pm$ 0.4	69.0 $\pm$ 1.1	67.2 $\pm$ 1.0	47.1 $\pm$ 1.9	53.0 $\pm$ 2.0	52.1 $\pm$ 1.3	52.7 $\pm$ 1.0	68.2 $\pm$ 0.8	60.1 $\pm$ 0.4
AST	65.3 $\pm$ 0.6	69.5 $\pm$ 0.6	73.8 $\pm$ 0.6	83.1 $\pm$ 0.0	68.1 $\pm$ 0.6	64.4 $\pm$ 0.6	64.7 $\pm$ 0.6	64.1 $\pm$ 0.6	49.7 $\pm$ 0.6	55.8 $\pm$ 0.0	65.8 $\pm$ 0.0
Shape-Guided	69.1 $\pm$ 0.7	67.2 $\pm$ 1.4	76.3 $\pm$ 0.5	71.3 $\pm$ 0.8	71.8 $\pm$ 0.3	58.0 $\pm$ 0.3	62.0 $\pm$ 0.3	60.4 $\pm$ 0.7	55.3 $\pm$ 0.3	67.8 $\pm$ 0.6	65.9 $\pm$ 0.2
M3DM	72.5 $\pm$ 2.2	62.4 $\pm$ 0.8	69.6 $\pm$ 1.4	72.4 $\pm$ 2.1	73.9 $\pm$ 0.9	64.3 $\pm$ 2.0	60.1 $\pm$ 0.3	54.0 $\pm$ 2.0	62.1 $\pm$ 1.8	71.4 $\pm$ 2.1	66.3 $\pm$ 0.5
Ours	96.7 $\pm$ 2.1	86.2 $\pm$ 3.0	95.5 $\pm$ 1.3	90.3 $\pm$ 3.4	86.0 $\pm$ 3.0	79.1 $\pm$ 3.7	86.6 $\pm$ 3.7	72.2 $\pm$ 3.3	92.0 $\pm$ 0.5	81.3 $\pm$ 1.6	86.6 $\pm$ 1.3

TABLE IV: AUPRO score for anomaly segmentation under Overlap setting of all categories in MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly segmentation ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

Method	Bagel	Cable Gland	Carrot	Cookie	Dowel	Foam	Peach	Potato	Rope	Tire	Mean
3D
SIFT	69.1 $\pm$ 1.6	68.2 $\pm$ 0.8	85.3 $\pm$ 0.4	72.3 $\pm$ 0.8	67.1 $\pm$ 1.4	55.7 $\pm$ 1.5	64.3 $\pm$ 1.4	66.6 $\pm$ 1.7	69.9 $\pm$ 0.8	72.6 $\pm$ 1.2	69.1 $\pm$ 0.4
FPFH	70.5 $\pm$ 1.6	73.7 $\pm$ 0.6	88.5 $\pm$ 0.2	72.6 $\pm$ 0.8	72.6 $\pm$ 2.7	56.7 $\pm$ 2.4	66.7 $\pm$ 1.6	75.0 $\pm$ 2.2	65.5 $\pm$ 1.8	77.2 $\pm$ 1.3	71.9 $\pm$ 0.4
Shape-Guided	74.6 $\pm$ 0.6	83.7 $\pm$ 2.2	98.1 $\pm$ 0.1	81.9 $\pm$ 5.4	88.6 $\pm$ 0.1	80.4 $\pm$ 6.7	88.9 $\pm$ 7.3	88.2 $\pm$ 0.0	88.7 $\pm$ 3.6	93.7 $\pm$ 5.5	86.7 $\pm$ 1.7
M3DM	84.0 $\pm$ 1.0	79.7 $\pm$ 1.1	95.8 $\pm$ 0.4	79.6 $\pm$ 1.3	85.5 $\pm$ 0.6	68.3 $\pm$ 1.6	86.4 $\pm$ 0.9	91.3 $\pm$ 0.8	90.3 $\pm$ 1.5	88.7 $\pm$ 0.4	85.0 $\pm$ 0.4
Ours	95.0 $\pm$ 1.3	78.8 $\pm$ 0.8	97.2 $\pm$ 0.1	84.5 $\pm$ 1.4	83.9 $\pm$ 3.0	66.6 $\pm$ 2.4	91.2 $\pm$ 1.6	89.9 $\pm$ 0.6	92.7 $\pm$ 0.5	89.9 $\pm$ 0.7	87.0 $\pm$ 0.2
RGB
PaDim	77.9 $\pm$ 2.7	79.9 $\pm$ 3.8	91.8 $\pm$ 0.2	72.2 $\pm$ 1.3	90.0 $\pm$ 0.7	92.4 $\pm$ 1.9	91.4 $\pm$ 1.2	92.6 $\pm$ 1.2	91.3 $\pm$ 1.3	92.2 $\pm$ 0.8	87.2 $\pm$ 0.7
PatchCore	67.1 $\pm$ 1.7	73.3 $\pm$ 0.0	77.0 $\pm$ 0.3	72.1 $\pm$ 0.8	69.9 $\pm$ 1.2	59.1 $\pm$ 2.4	61.7 $\pm$ 1.2	64.3 $\pm$ 1.1	56.1 $\pm$ 1.6	73.1 $\pm$ 1.2	67.4 $\pm$ 0.8
Shape-Guided	67.5 $\pm$ 0.6	73.9 $\pm$ 0.7	81.2 $\pm$ 0.1	72.1 $\pm$ 0.1	76.1 $\pm$ 0.6	56.0 $\pm$ 0.0	62.5 $\pm$ 0.2	71.6 $\pm$ 1.0	64.7 $\pm$ 0.5	73.8 $\pm$ 0.1	69.9 $\pm$ 0.1
SoftPatch	83.9 $\pm$ 2.0	89.3 $\pm$ 2.7	91.4 $\pm$ 0.5	79.2 $\pm$ 0.7	91.8 $\pm$ 1.8	72.4 $\pm$ 2.8	76.5 $\pm$ 2.4	72.9 $\pm$ 2.7	89.8 $\pm$ 2.6	90.1 $\pm$ 1.7	83.7 $\pm$ 0.3
M3DM	68.6 $\pm$ 1.7	72.7 $\pm$ 0.8	77.4 $\pm$ 0.3	70.5 $\pm$ 0.6	68.6 $\pm$ 1.3	59.8 $\pm$ 1.4	64.9 $\pm$ 1.4	65.0 $\pm$ 1.4	57.0 $\pm$ 0.8	75.1 $\pm$ 1.2	68.0 $\pm$ 0.7
Ours	93.1 $\pm$ 1.6	91.9 $\pm$ 1.3	96.1 $\pm$ 0.4	82.1 $\pm$ 1.8	81.5 $\pm$ 5.6	73.9 $\pm$ 1.0	90.4 $\pm$ 2.1	84.3 $\pm$ 1.4	94.2 $\pm$ 1.0	90.2 $\pm$ 0.6	87.8 $\pm$ 0.5
3D+RGB
PatchCore+FPFH	70.4 $\pm$ 1.5	72.8 $\pm$ 0.6	77.9 $\pm$ 0.3	77.5 $\pm$ 1.0	68.8 $\pm$ 1.5	64.9 $\pm$ 1.0	65.0 $\pm$ 1.7	65.9 $\pm$ 1.3	56.4 $\pm$ 0.8	75.3 $\pm$ 1.3	69.5 $\pm$ 0.6
Shape-Guided	74.6 $\pm$ 0.6	80.9 $\pm$ 0.5	93.6 $\pm$ 0.3	79.3 $\pm$ 0.9	89.3 $\pm$ 0.9	76.6 $\pm$ 0.2	82.4 $\pm$ 0.2	94.0 $\pm$ 0.3	86.6 $\pm$ 0.1	93.7 $\pm$ 0.8	85.1 $\pm$ 0.0
M3DM	69.0 $\pm$ 1.4	72.5 $\pm$ 0.8	77.8 $\pm$ 0.4	72.8 $\pm$ 1.0	68.0 $\pm$ 1.5	61.3 $\pm$ 0.7	65.2 $\pm$ 1.5	65.3 $\pm$ 1.4	57.2 $\pm$ 0.8	75.3 $\pm$ 1.2	68.4 $\pm$ 0.6
Ours	95.9 $\pm$ 1.3	92.0 $\pm$ 1.2	96.7 $\pm$ 0.4	90.4 $\pm$ 1.1	84.6 $\pm$ 2.3	83.4 $\pm$ 1.7	91.9 $\pm$ 2.7	85.8 $\pm$ 1.7	94.5 $\pm$ 0.3	91.4 $\pm$ 0.5	90.7 $\pm$ 0.2

4.2 Regular Anomaly Detection on MVTec 3D-AD

In the regular anomaly detection setting, we compare our method with several 3D-based, RGB-based, and hybrid multi-modal 3D/RGB methods on MVTec 3D-AD. Tabs. I and II show the anomaly detection results measured by I-AUROC and the segmentation results measured by AUPRO, respectively. P-AUROC is reported in the section "P-AUROC for regular anomaly segmentation on MVTec 3D-AD". From Tabs. I and II, we conclude that M3DM-NR also maintains the regular anomaly detection ability.

4.3 Noisy Anomaly Detection on MVTec 3D-AD

TABLE V: I-AUROC score for anomaly detection under Non-Overlap setting of all categories in MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

Method	Bagel	Cable Gland	Carrot	Cookie	Dowel	Foam	Peach	Potato	Rope	Tire	Mean
3D
SIFT	68.8 $\pm$ 1.1	65.0 $\pm$ 2.6	86.1 $\pm$ 0.3	72.9 $\pm$ 0.6	79.7 $\pm$ 5.2	69.1 $\pm$ 3.9	61.3 $\pm$ 1.0	69.7 $\pm$ 1.9	74.6 $\pm$ 1.9	59.3 $\pm$ 3.6	70.7 $\pm$ 0.6
FPFH	73.4 $\pm$ 3.8	54.8 $\pm$ 4.3	90.7 $\pm$ 1.5	78.5 $\pm$ 0.9	88.3 $\pm$ 3.3	54.0 $\pm$ 0.4	70.9 $\pm$ 3.8	67.2 $\pm$ 2.3	90.0 $\pm$ 2.6	67.9 $\pm$ 5.6	73.6 $\pm$ 0.4
AST	82.8 $\pm$ 0.6	51.9 $\pm$ 0.6	91.3 $\pm$ 0.6	92.3 $\pm$ 1.2	64.3 $\pm$ 1.2	78.5 $\pm$ 0.2	98.3 $\pm$ 2.9	90.3 $\pm$ 0.3	94.7 $\pm$ 1.7	63.3 $\pm$ 1.2	80.8 $\pm$ 0.9
Shape-Guided	90.2 $\pm$ 0.9	67.5 $\pm$ 0.1	91.4 $\pm$ 0.3	92.1 $\pm$ 1.2	80.8 $\pm$ 10.1	67.7 $\pm$ 4.1	86.5 $\pm$ 7.4	87.1 $\pm$ 1.0	89.6 $\pm$ 1.3	83.3 $\pm$ 6.7	83.6 $\pm$ 2.0
M3DM	87.1 $\pm$ 0.8	68.2 $\pm$ 1.2	79.4 $\pm$ 3.1	87.8 $\pm$ 1.3	83.8 $\pm$ 2.8	73.0 $\pm$ 2.5	76.6 $\pm$ 2.6	82.6 $\pm$ 0.7	92.9 $\pm$ 2.0	80.0 $\pm$ 1.6	81.1 $\pm$ 0.8
Ours	94.5 $\pm$ 0.6	74.4 $\pm$ 2.4	94.8 $\pm$ 0.9	93.7 $\pm$ 0.8	83.8 $\pm$ 1.1	72.8 $\pm$ 3.5	84.0 $\pm$ 0.2	87.3 $\pm$ 0.4	89.8 $\pm$ 1.3	82.2 $\pm$ 1.2	85.7 $\pm$ 0.7
RGB
PaDim	93.0 $\pm$ 1.0	73.3 $\pm$ 3.3	66.3 $\pm$ 0.7	52.4 $\pm$ 2.0	88.3 $\pm$ 1.0	72.2 $\pm$ 3.2	84.3 $\pm$ 1.3	50.7 $\pm$ 2.2	91.9 $\pm$ 2.7	68.6 $\pm$ 2.2	74.1 $\pm$ 0.6
PatchCore	89.2 $\pm$ 0.9	95.2 $\pm$ 1.4	90.8 $\pm$ 1.9	65.9 $\pm$ 2.8	97.5 $\pm$ 1.0	77.4 $\pm$ 4.7	70.6 $\pm$ 1.7	54.6 $\pm$ 0.6	93.5 $\pm$ 2.2	75.4 $\pm$ 1.7	81.0 $\pm$ 0.7
AST	79.5 $\pm$ 0.1	83.1 $\pm$ 0.1	63.2 $\pm$ 0.8	60.2 $\pm$ 0.1	80.7 $\pm$ 0.6	77.5 $\pm$ 1.8	81.1 $\pm$ 1.0	63.4 $\pm$ 0.1	74.3 $\pm$ 0.8	59.2 $\pm$ 0.0	72.2 $\pm$ 0.1
Shape-Guided	79.3 $\pm$ 1.0	89.6 $\pm$ 2.4	77.4 $\pm$ 0.3	58.6 $\pm$ 2.0	94.3 $\pm$ 0.2	71.4 $\pm$ 3.6	67.7 $\pm$ 0.7	62.1 $\pm$ 0.0	72.0 $\pm$ 1.6	66.5 $\pm$ 0.3	73.9 $\pm$ 0.8
SoftPatch	90.6 $\pm$ 0.2	91.8 $\pm$ 1.7	87.6 $\pm$ 0.4	67.8 $\pm$ 0.8	98.0 $\pm$ 0.6	78.0 $\pm$ 4.8	70.6 $\pm$ 0.7	55.3 $\pm$ 1.5	93.4 $\pm$ 2.7	75.6 $\pm$ 1.2	80.9 $\pm$ 0.4
M3DM	87.7 $\pm$ 2.3	83.0 $\pm$ 2.7	83.1 $\pm$ 1.1	66.4 $\pm$ 1.7	96.7 $\pm$ 1.4	77.7 $\pm$ 1.7	82.7 $\pm$ 3.1	62.5 $\pm$ 3.4	92.9 $\pm$ 1.8	76.7 $\pm$ 1.2	80.9 $\pm$ 0.8
Ours	90.8 $\pm$ 1.3	90.2 $\pm$ 4.0	86.9 $\pm$ 1.8	68.0 $\pm$ 3.6	91.0 $\pm$ 3.6	83.2 $\pm$ 1.8	88.7 $\pm$ 2.1	57.7 $\pm$ 6.7	93.3 $\pm$ 1.1	75.9 $\pm$ 1.6	82.6 $\pm$ 0.5
3D+RGB
PatchCore+FPFH	81.1 $\pm$ 4.0	77.8 $\pm$ 1.4	91.7 $\pm$ 0.5	84.5 $\pm$ 1.6	91.8 $\pm$ 1.3	64.8 $\pm$ 2.6	79.5 $\pm$ 3.1	77.3 $\pm$ 1.9	90.9 $\pm$ 1.6	89.8 $\pm$ 1.1	82.9 $\pm$ 0.8
AST	85.4 $\pm$ 0.6	88.9 $\pm$ 0.6	91.3 $\pm$ 0.6	95.6 $\pm$ 0.6	89.2 $\pm$ 1.0	85.9 $\pm$ 0.6	92.8 $\pm$ 0.6	91.6 $\pm$ 0.6	79.6 $\pm$ 0.6	70.0 $\pm$ 0.6	87.0 $\pm$ 0.3
Shape-Guided	91.0 $\pm$ 0.5	86.3 $\pm$ 2.0	94.2 $\pm$ 0.5	86.4 $\pm$ 1.0	94.2 $\pm$ 0.1	77.1 $\pm$ 0.5	88.6 $\pm$ 0.1	85.8 $\pm$ 1.0	88.3 $\pm$ 0.1	85.1 $\pm$ 0.2	87.7 $\pm$ 0.3
M3DM	96.6 $\pm$ 2.2	85.7 $\pm$ 1.9	88.4 $\pm$ 2.5	86.4 $\pm$ 3.1	96.1 $\pm$ 1.3	86.3 $\pm$ 5.4	85.1 $\pm$ 0.6	76.5 $\pm$ 2.3	94.8 $\pm$ 1.3	79.3 $\pm$ 2.4	87.5 $\pm$ 0.5
Ours	98.1 $\pm$ 0.8	91.0 $\pm$ 2.6	96.8 $\pm$ 0.8	94.2 $\pm$ 2.0	93.7 $\pm$ 0.8	90.6 $\pm$ 2.0	92.9 $\pm$ 1.6	81.9 $\pm$ 2.0	95.3 $\pm$ 1.4	84.7 $\pm$ 2.4	91.9 $\pm$ 1.0

TABLE VI: AUPRO score for anomaly segmentation under Non-Overlap setting of all categories in MVTec 3D-AD. Our method clearly outperforms other methods in 3D and 3D + RGB settings, indicating the superior anomaly segmentation ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

Method	Bagel	Cable Gland	Carrot	Cookie	Dowel	Foam	Peach	Potato	Rope	Tire	Mean
3D
SIFT	86.4 $\pm$ 0.0	70.2 $\pm$ 0.0	90.3 $\pm$ 0.0	86.1 $\pm$ 0.0	90.6 $\pm$ 0.0	60.3 $\pm$ 0.0	85.0 $\pm$ 0.0	95.3 $\pm$ 0.0	93.8 $\pm$ 0.0	86.3 $\pm$ 0.0	84.4 $\pm$ 0.0
FPFH	92.6 $\pm$ 0.0	78.3 $\pm$ 0.0	92.1 $\pm$ 0.0	85.5 $\pm$ 0.0	88.2 $\pm$ 0.0	68.3 $\pm$ 0.0	90.5 $\pm$ 0.0	94.3 $\pm$ 0.0	92.1 $\pm$ 0.0	90.3 $\pm$ 0.0	87.2 $\pm$ 0.0
Shape-Guided	95.6 $\pm$ 0.0	80.3 $\pm$ 0.0	98.1 $\pm$ 0.0	89.5 $\pm$ 0.0	88.2 $\pm$ 0.0	70.3 $\pm$ 0.0	95.2 $\pm$ 0.6	96.3 $\pm$ 0.0	93.1 $\pm$ 0.0	93.7 $\pm$ 0.0	90.0 $\pm$ 0.1
M3DM	93.7 $\pm$ 0.5	81.1 $\pm$ 0.3	97.6 $\pm$ 0.2	86.3 $\pm$ 0.4	87.9 $\pm$ 1.3	75.3 $\pm$ 4.6	95.4 $\pm$ 0.2	96.9 $\pm$ 0.4	94.6 $\pm$ 0.4	92.7 $\pm$ 0.3	90.1 $\pm$ 0.6
Ours	95.8 $\pm$ 0.3	81.2 $\pm$ 0.4	97.6 $\pm$ 0.1	86.6 $\pm$ 0.7	88.0 $\pm$ 1.1	73.0 $\pm$ 4.0	95.5 $\pm$ 0.4	96.5 $\pm$ 0.1	94.2 $\pm$ 0.6	93.5 $\pm$ 0.8	90.2 $\pm$ 0.5
RGB
PaDim	93.0 $\pm$ 2.4	87.5 $\pm$ 2.6	93.7 $\pm$ 0.4	86.8 $\pm$ 0.9	92.7 $\pm$ 1.3	93.3 $\pm$ 7.0	94.9 $\pm$ 0.5	95.0 $\pm$ 1.0	92.4 $\pm$ 0.6	94.9 $\pm$ 0.6	92.4 $\pm$ 0.5
PatchCore	90.9 $\pm$ 0.6	97.0 $\pm$ 0.1	96.2 $\pm$ 0.5	88.4 $\pm$ 0.5	95.7 $\pm$ 0.4	79.1 $\pm$ 2.5	89.2 $\pm$ 0.5	93.4 $\pm$ 0.9	96.5 $\pm$ 0.7	95.1 $\pm$ 0.2	92.2 $\pm$ 0.2
Shape-Guided	90.2 $\pm$ 1.9	94.5 $\pm$ 2.2	94.9 $\pm$ 1.3	86.5 $\pm$ 1.2	93.6 $\pm$ 0.5	74.8 $\pm$ 6.5	90.7 $\pm$ 4.0	92.4 $\pm$ 1.7	91.8 $\pm$ 4.3	93.3 $\pm$ 2.2	90.3 $\pm$ 2.2
SoftPatch	93.2 $\pm$ 0.3	96.1 $\pm$ 0.1	96.4 $\pm$ 0.1	89.7 $\pm$ 0.7	95.3 $\pm$ 0.5	78.4 $\pm$ 1.7	90.0 $\pm$ 0.3	93.5 $\pm$ 0.7	96.2 $\pm$ 0.7	94.7 $\pm$ 0.5	92.3 $\pm$ 0.2
M3DM	93.5 $\pm$ 0.3	96.8 $\pm$ 0.3	96.9 $\pm$ 0.5	86.0 $\pm$ 0.6	93.8 $\pm$ 0.8	79.2 $\pm$ 1.6	96.2 $\pm$ 0.4	94.8 $\pm$ 0.6	96.8 $\pm$ 0.4	96.9 $\pm$ 0.1	93.1 $\pm$ 0.1
Ours	93.7 $\pm$ 0.9	96.0 $\pm$ 0.6	96.8 $\pm$ 0.3	84.0 $\pm$ 1.5	92.4 $\pm$ 1.0	79.5 $\pm$ 2.4	95.6 $\pm$ 0.1	94.8 $\pm$ 0.6	96.8 $\pm$ 0.6	95.3 $\pm$ 0.3	92.5 $\pm$ 0.2
3D+RGB
PatchCore+FPFH	96.6 $\pm$ 0.4	96.1 $\pm$ 1.2	97.7 $\pm$ 0.5	92.6 $\pm$ 3.2	92.5 $\pm$ 1.4	89.1 $\pm$ 0.5	96.5 $\pm$ 0.2	96.7 $\pm$ 0.2	95.3 $\pm$ 1.1	97.2 $\pm$ 0.1	95.0 $\pm$ 0.4
Shape-Guided	93.5 $\pm$ 0.1	94.0 $\pm$ 0.2	97.5 $\pm$ 0.3	93.0 $\pm$ 0.3	95.5 $\pm$ 0.1	93.1 $\pm$ 0.8	95.3 $\pm$ 0.1	97.9 $\pm$ 0.1	95.6 $\pm$ 0.1	97.2 $\pm$ 0.2	95.2 $\pm$ 0.1
M3DM	94.3 $\pm$ 0.8	96.5 $\pm$ 0.3	97.4 $\pm$ 0.5	89.2 $\pm$ 0.2	92.7 $\pm$ 0.9	82.8 $\pm$ 1.0	96.4 $\pm$ 0.3	95.4 $\pm$ 0.6	97.2 $\pm$ 0.4	96.7 $\pm$ 0.3	93.9 $\pm$ 0.2
Ours	96.9 $\pm$ 0.3	96.3 $\pm$ 0.2	97.6 $\pm$ 0.0	92.7 $\pm$ 0.5	93.9 $\pm$ 0.4	91.8 $\pm$ 1.3	97.0 $\pm$ 0.5	96.4 $\pm$ 0.1	97.0 $\pm$ 0.2	96.5 $\pm$ 0.1	95.6 $\pm$ 0.1

In the noisy anomaly detection setting, we compare our method with several 3D-based, RGB-based, and hybrid multi-modal 3D/RGB methods on MVTec 3D-AD. Tabs. III and V show the anomaly detection results measured by I-AUROC under the Overlap and Non-Overlap settings, respectively. Tabs. IV and VI show the segmentation results measured by AUPRO under the Overlap and Non-Overlap settings, respectively. P-AUROC is reported in the section "P-AUROC for noisy anomaly segmentation on MVTec 3D-AD".

Overlap and Non-Overlap Analysis. Our margin over the baseline methods is considerably larger in the Overlap setting than in the Non-Overlap setting, especially for anomaly detection (I-AUROC). Specifically, in the Overlap setting our approach exceeds the second-best method by 13.9%, 3.7%, and 20.3% in I-AUROC for the 3D, RGB, and 3D+RGB settings, respectively. This indicates the effectiveness of the sample-level denoising in Stage I & II, as most baseline methods struggle when anomalies exist in both the training and test datasets. This includes approaches like SoftPatch [8], which performs denoising only at the patch level, whereas our method remains largely unaffected. This demonstrates the enhanced robustness provided by Stage I & II, especially when defects with similar appearances exist in both the training and test datasets, a common scenario in real-world industrial settings.

3D-Based. On pure 3D anomaly detection, we obtain the highest I-AUROC, outperforming M3DM [9] by 13.9% in Overlap and Shape-Guided [44] by 2.1% in Non-Overlap. For segmentation, we obtain the best AUPRO, outperforming Shape-Guided by 0.3% in Overlap and M3DM by 0.1% in Non-Overlap. This shows that our method has better detection and segmentation performance than previous methods and that, with our PFA, the Point Transformer is a better 3D feature extractor for this task.

RGB-Based. Our I-AUROC in the RGB domain is 3.7% higher than SoftPatch in Overlap and 1.7% higher than SoftPatch and M3DM in Non-Overlap. For segmentation, we obtain the highest AUPRO in Overlap, 0.6% higher than PaDim, and the second-best score in Non-Overlap.

Hybrid 3D/RGB. On multi-modal 3D/RGB anomaly detection, we obtain the highest I-AUROC, outperforming M3DM by 20.3% in Overlap and Shape-Guided by 4.2% in Non-Overlap. For segmentation, we obtain the best AUPRO, outperforming Shape-Guided by 5.6% in Overlap (90.7 vs. 85.1 in Tab. IV) and by 0.4% in Non-Overlap. These results are attributable to our fusion strategy and the high-performance 3D anomaly detection results.

TABLE VII: Main ablation study of M3DM-NR. A ✗ under Stage I&II indicates removing Stage I&II, under $R$ removing Intra-modality Reference Selection, under $\mathcal{H}_{P}$ removing Aligned Multi-scale Point Cloud Extraction, and under $W$ removing Noise-focused Aggregation. Noise-level refers to the percentage of noise data in the entire training set after denoising in Stage I&II. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

Stage I&II	$R$	$\mathcal{H}_{P}$	$W$	Overlap			Non-Overlap			Noise-level $\downarrow$
Stage I&II	$R$	$\mathcal{H}_{P}$	$W$	I-AUROC $\uparrow$	P-AUROC $\uparrow$	AUPRO $\uparrow$	I-AUROC $\uparrow$	P-AUROC $\uparrow$	AUPRO $\uparrow$	Noise-level $\downarrow$
✗	✗	✗	✗	66.4 $\pm$ 0.4	72.9 $\pm$ 0.9	66.5 $\pm$ 3.4	87.7 $\pm$ 0.5	98.7 $\pm$ 0.1	94.5 $\pm$ 0.2	9.09 $\pm$ 0.00
✓	✗	✗	✗	79.7 $\pm$ 1.1	89.2 $\pm$ 1.1	84.5 $\pm$ 0.6	88.6 $\pm$ 0.6	98.8 $\pm$ 0.1	94.9 $\pm$ 0.3	5.13 $\pm$ 0.13
✓	✓	✗	✗	82.6 $\pm$ 0.7	92.7 $\pm$ 0.5	87.8 $\pm$ 0.3	89.2 $\pm$ 0.8	98.7 $\pm$ 0.0	94.9 $\pm$ 0.1	3.87 $\pm$ 0.08
✓	✓	✓	✗	86.2 $\pm$ 0.5	94.3 $\pm$ 0.4	90.3 $\pm$ 0.5	91.3 $\pm$ 0.2	98.9 $\pm$ 0.1	95.4 $\pm$ 0.0	2.79 $\pm$ 0.18
✓	✓	✓	✓	86.6 $\pm$ 1.3	94.6 $\pm$ 0.3	90.7 $\pm$ 0.2	91.9 $\pm$ 1.0	98.9 $\pm$ 0.0	95.6 $\pm$ 0.1	2.73 $\pm$ 0.05

4.4 Visualization Results

In this section, we visualize anomaly segmentation results for all categories of the MVTec 3D-AD dataset under the Overlap setting. As shown in Fig. 7, we visualize the heatmaps of our method and of PatchCore + FPFH [20], M3DM [9] and Shape-Guided [44] with multi-modal inputs. Our method produces more accurate segmentation maps and exhibits greater resilience to dataset noise than the previous methods, which were often confounded by noisy samples within the dataset. This is particularly noticeable in the Cable Gland, Dowel, Foam, and Peach results for PatchCore + FPFH, as well as the Foam and Rope results for Shape-Guided. More visualization results under the Non-Overlap setting are shown in the section Visualization results of Non-Overlap setting.

4.5 Ablation Study

We conduct an ablation study on the main components introduced in Sec. 3, namely Stage I & II two-stage sample-level denoising, intra-modality reference, Aligned Multi-Scale Point Cloud Feature Extraction and Noise-Focused Aggregation. The results are displayed in Tab. VII. It was observed that the incremental inclusion of each component led to improvements in I-AUROC, P-AUROC, and AUPRO under both Overlap and Non-Overlap settings, particularly under the more challenging Overlap setting. Besides these metrics, the Noise-level metric also clearly demonstrates that the model’s capability for sample-level denoising progressively increased with the addition of each module.
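As a small worked illustration of the Noise-level metric in Tab. VII, the sketch below computes the percentage of noisy samples among the training samples that remain. The function name `noise_level` and all sample counts are hypothetical, and the metric is implemented under our reading of the caption (noisy samples divided by all kept samples). The split of 100 clean and 10 noisy samples is only an assumption chosen because it reproduces the 9.09 entry of the first row; it is not a figure stated in this section.

```python
def noise_level(num_noisy_kept: int, num_total_kept: int) -> float:
    """Percentage of noisy samples among the training samples that remain."""
    if num_total_kept <= 0:
        raise ValueError("num_total_kept must be positive")
    return 100.0 * num_noisy_kept / num_total_kept


# Hypothetical counts: 100 clean samples plus 10 injected noisy samples,
# with no denoising applied (nothing is removed).
before = noise_level(num_noisy_kept=10, num_total_kept=110)
print(f"{before:.2f}")  # 9.09

# Hypothetical outcome of sample-level denoising: 7 noisy samples and
# 1 clean sample are discarded, leaving 3 noisy among 102 kept samples.
after = noise_level(num_noisy_kept=3, num_total_kept=102)
print(f"{after:.2f}")  # 2.94
```

Under this reading, a lower value means that fewer anomalous samples survive the sample-level denoising stages.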

Different Scales. We also conduct an ablation study on the feature scales extracted in the Aligned Multi-Scale Point Cloud Feature Extraction, with results presented in Tab. VIII. The model performance varies across different scale configurations. Notably, when incorporating all scales, all performance metrics peaked, demonstrating that multi-scale consideration can enhance model performance. When the small scale is excluded, our model performs nearly as well as the full configuration, indicating that omitting small-scale processing has a relatively minor impact. This could be attributed to small-scale patches often containing too few point cloud points, many of which might be deemed insignificant and discarded during segmentation.

TABLE VIII: Ablation Study on Aligned Multi-scale Point Cloud Extraction. w/o multi-scale represents removing all big, mid and small scales.

Methods		w/o multi-scale	w/o big-scale	w/o mid-scale	w/o small-scale	Full
Over	I-AUROC $\uparrow$	82.6 $\pm$ 0.7	84.6 $\pm$ 1.0	83.7 $\pm$ 1.2	85.3 $\pm$ 0.4	86.6 $\pm$ 1.3
	P-AUROC $\uparrow$	92.7 $\pm$ 0.5	94.0 $\pm$ 0.3	93.6 $\pm$ 0.2	94.2 $\pm$ 0.4	94.6 $\pm$ 0.3
	AUPRO $\uparrow$	87.8 $\pm$ 0.3	89.8 $\pm$ 0.3	89.4 $\pm$ 0.4	90.2 $\pm$ 0.2	90.7 $\pm$ 0.2
N-Over	I-AUROC $\uparrow$	89.2 $\pm$ 0.8	89.6 $\pm$ 0.6	89.1 $\pm$ 0.9	89.8 $\pm$ 0.1	91.9 $\pm$ 1.0
	P-AUROC $\uparrow$	98.7 $\pm$ 0.0	98.7 $\pm$ 0.1	98.8 $\pm$ 0.1	98.8 $\pm$ 0.1	98.9 $\pm$ 0.0
	AUPRO $\uparrow$	94.9 $\pm$ 0.1	95.0 $\pm$ 0.2	95.4 $\pm$ 0.2	95.4 $\pm$ 0.2	95.6 $\pm$ 0.1
Noise-level $\downarrow$		3.87 $\pm$ 0.08	2.77 $\pm$ 0.18	3.18 $\pm$ 0.20	2.76 $\pm$ 0.07	2.73 $\pm$ 0.05

TABLE IX: Exploring Aligned Multi-scale Point Cloud Extraction Setting. $\theta$ represents the threshold for the minimum number of points required in a point cloud patch to be considered meaningful.

$\theta$		128	256	512	1024
Over	I-AUROC $\uparrow$	86.6 $\pm$ 1.3	86.2 $\pm$ 0.6	85.9 $\pm$ 0.4	84.0 $\pm$ 0.5
	P-AUROC $\uparrow$	94.6 $\pm$ 0.3	94.3 $\pm$ 0.7	94.4 $\pm$ 0.1	93.4 $\pm$ 0.4
	AUPRO $\uparrow$	90.7 $\pm$ 0.2	90.3 $\pm$ 0.4	90.2 $\pm$ 0.4	89.0 $\pm$ 0.5
N-Over	I-AUROC $\uparrow$	91.9 $\pm$ 1.0	91.4 $\pm$ 0.9	91.0 $\pm$ 0.2	89.4 $\pm$ 0.5
	P-AUROC $\uparrow$	98.9 $\pm$ 0.0	98.9 $\pm$ 0.0	98.9 $\pm$ 0.1	98.7 $\pm$ 0.3
	AUPRO $\uparrow$	95.5 $\pm$ 0.1	95.5 $\pm$ 0.1	95.4 $\pm$ 0.1	95.0 $\pm$ 0.3
Noise-level $\downarrow$		2.73 $\pm$ 0.05	2.73 $\pm$ 0.05	2.75 $\pm$ 0.13	3.46 $\pm$ 0.10

Point Cloud Threshold. We also perform an ablation study on the hyper-parameter $\theta$, the minimum number of points a point cloud patch must contain to be considered meaningful. The experimental results are shown in Tab. IX. Since the point cloud encoder used in our experiments has a minimum group size of 128, we start testing from this threshold. The results indicate that a threshold of 128 points is the most appropriate for most metrics. This aligns with expectations, as a lower threshold means that more patches are considered when computing the anomaly score, potentially leading to better accuracy. Therefore, after balancing computational complexity against the accuracy of RGB-3D multi-modal anomaly detection, we adopt a threshold $\theta$ of 128 in this paper.
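The role of $\theta$ can be sketched as a simple filter over point cloud patches. This is a minimal illustration rather than the paper's implementation: the function name `filter_patches` and the representation of a patch as a plain list of 3D points are hypothetical, and only the idea of discarding patches with fewer than $\theta$ points is taken from the text.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def filter_patches(patches: Sequence[Sequence[Point]], theta: int = 128) -> List[int]:
    """Return the indices of patches that contain at least `theta` points."""
    return [i for i, patch in enumerate(patches) if len(patch) >= theta]


# Toy patches with 300, 128, and 40 points respectively (hypothetical data).
patches = [[(0.0, 0.0, 0.0)] * n for n in (300, 128, 40)]
print(filter_patches(patches, theta=128))  # [0, 1]
print(filter_patches(patches, theta=256))  # [0]
```

Raising the threshold from 128 to 256 drops the second patch in this toy example, which mirrors the observation that a lower threshold lets more patches contribute to the anomaly score.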

$\lambda_{I}$ and $\lambda_{P}$.

TABLE X: Exploring RGB and Point Cloud Integration Setting. $\lambda_{rgb}$ and $\lambda_{pc}$ are hyper-parameters controlling the extent to which RGB and point cloud modalities are integrated.

$\lambda_{rgb}\quad\lambda_{pc}$		1.0 1.3	1.0 1.4	1.0 1.5	1.0 1.6	1.0 1.7
Over	I-AUROC $\uparrow$	86.1 $\pm$ 0.7	85.6 $\pm$ 0.5	86.6 $\pm$ 1.3	86.1 $\pm$ 1.0	86.1 $\pm$ 1.0
	P-AUROC $\uparrow$	94.3 $\pm$ 0.7	94.2 $\pm$ 0.7	94.6 $\pm$ 0.3	94.2 $\pm$ 0.7	94.2 $\pm$ 0.0
	AUPRO $\uparrow$	90.3 $\pm$ 0.4	90.2 $\pm$ 0.3	90.7 $\pm$ 0.3	90.3 $\pm$ 0.4	90.3 $\pm$ 0.3
N-Over	I-AUROC $\uparrow$	91.3 $\pm$ 0.5	90.7 $\pm$ 0.8	91.9 $\pm$ 1.0	91.2 $\pm$ 1.1	91.1 $\pm$ 0.8
	P-AUROC $\uparrow$	98.9 $\pm$ 0.1	98.9 $\pm$ 0.1	98.9 $\pm$ 0.0	98.9 $\pm$ 0.0	98.9 $\pm$ 0.1
	AUPRO $\uparrow$	95.4 $\pm$ 0.2	95.4 $\pm$ 0.1	95.5 $\pm$ 0.1	95.4 $\pm$ 0.2	95.5 $\pm$ 0.2
Noise-level $\downarrow$		2.74 $\pm$ 0.09	2.75 $\pm$ 0.07	2.71 $\pm$ 0.19	2.72 $\pm$ 0.04	2.75 $\pm$ 0.06

TABLE XI: Exploring the Number of Intra-modal Reference Samples. Ref Num represents the number of intra-modal reference samples selected.

Ref Num		0	1	2	3	4
Over	I-AUROC $\uparrow$	80.7 $\pm$ 0.9	84.8 $\pm$ 0.7	85.6 $\pm$ 1.5	86.1 $\pm$ 0.5	86.6 $\pm$ 1.3
	P-AUROC $\uparrow$	89.4 $\pm$ 1.3	93.5 $\pm$ 0.4	93.8 $\pm$ 0.3	93.9 $\pm$ 0.2	94.6 $\pm$ 0.3
	AUPRO $\uparrow$	85.5 $\pm$ 0.7	89.3 $\pm$ 0.1	89.8 $\pm$ 0.3	90.0 $\pm$ 0.4	90.7 $\pm$ 0.3
N-Over	I-AUROC $\uparrow$	88.7 $\pm$ 0.9	90.6 $\pm$ 0.4	91.0 $\pm$ 0.9	91.5 $\pm$ 0.4	91.9 $\pm$ 1.0
	P-AUROC $\uparrow$	98.8 $\pm$ 0.1	98.8 $\pm$ 0.1	98.9 $\pm$ 0.0	98.8 $\pm$ 0.1	98.9 $\pm$ 0.0
	AUPRO $\uparrow$	94.9 $\pm$ 0.4	95.5 $\pm$ 0.2	95.5 $\pm$ 0.1	95.4 $\pm$ 0.1	95.5 $\pm$ 0.1
Noise-level $\downarrow$		5.07 $\pm$ 0.13	3.20 $\pm$ 0.04	2.88 $\pm$ 0.20	2.82 $\pm$ 0.19	2.71 $\pm$ 0.19

To assess the extent to which the RGB and point cloud modalities should be integrated, we conducted experiments with the hyper-parameters $\lambda_{I}$ and $\lambda_{P}$ (listed as $\lambda_{rgb}$ and $\lambda_{pc}$ in Tab. X), which control the level of integration. The results of these experiments are presented in Tab. X. We observed that the model achieves the best or tied-best result on every metric for both anomaly detection and segmentation with $\lambda_{I}=1.0$ and $\lambda_{P}=1.5$. This indicates that enhancing the integration of the 3D point cloud modality can further improve performance. This outcome aligns with the findings reported in Secs. 4.2 and 4.3, where most methods performed better using purely 3D data than using solely RGB data. This suggests that the 3D point cloud data in the MVTec 3D-AD dataset [10] contains richer information and facilitates more effective anomaly detection than the RGB data within the same dataset.
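To make the effect of the two weights concrete, the sketch below combines per-patch RGB and point cloud anomaly scores as a weighted sum. This form is an assumption made purely for illustration: this section only states that $\lambda_{I}$ and $\lambda_{P}$ control the level of integration, and the actual formulation in the method may differ. The function name `fuse_scores` and the toy scores are hypothetical.

```python
from typing import List


def fuse_scores(rgb: List[float], pc: List[float],
                lambda_i: float = 1.0, lambda_p: float = 1.5) -> List[float]:
    """Weighted sum of per-patch RGB and point cloud anomaly scores.

    Assumed form (not confirmed by the paper):
    fused = lambda_i * rgb + lambda_p * pc.
    """
    if len(rgb) != len(pc):
        raise ValueError("rgb and pc must have the same length")
    return [lambda_i * r + lambda_p * p for r, p in zip(rgb, pc)]


# Hypothetical per-patch scores for three patches.
rgb_scores = [0.25, 0.5, 0.0]
pc_scores = [0.5, 1.0, 2.0]
print(fuse_scores(rgb_scores, pc_scores))  # [1.0, 2.0, 3.0]
```

With $\lambda_{P} > \lambda_{I}$, a patch that looks anomalous only in the point cloud (the third one here) receives the highest fused score.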

Number of Intra-Modal Reference Samples. To determine the appropriate number of intra-modal reference samples in Stage I, we conducted an ablation study on the quantity of these samples. The results are shown in Tab. XI. We conclude that increasing the number of intra-modal reference samples enhances the model’s performance. This improvement is logical, as more reference samples mean more normal cases for the model to learn from, naturally boosting performance. However, selecting too many intra-modal reference samples can lead to the inclusion of noise samples and increase computational complexity. Therefore, in practical implementation, we opted for 4 intra-modal reference samples, striking a balance between model performance and computational efficiency.
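The trade-off around the number of reference samples can be illustrated with a minimal selection sketch. It assumes that each training sample already carries a scalar suspicion score and that the references are simply the samples with the lowest scores. Both the scoring and the selection rule are simplifications introduced here, and the name `select_references` and the sample ids are hypothetical.

```python
import heapq
from typing import Dict, List


def select_references(scores: Dict[str, float], ref_num: int = 4) -> List[str]:
    """Return the ids of the `ref_num` samples with the lowest suspicion scores."""
    if ref_num < 0:
        raise ValueError("ref_num must be non-negative")
    return heapq.nsmallest(ref_num, scores, key=lambda sample_id: scores[sample_id])


# Hypothetical suspicion scores; a higher score means more likely anomalous.
scores = {"a": 0.9, "b": 0.1, "c": 0.4, "d": 0.2, "e": 0.7, "f": 0.3}
print(select_references(scores, ref_num=4))  # ['b', 'd', 'f', 'c']
print(select_references(scores, ref_num=6))  # ['b', 'd', 'f', 'c', 'e', 'a']
```

In this toy example, raising `ref_num` from 4 to 6 pulls in the two highest-scoring samples, which is the risk of including noisy references described above.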

5 Conclusion

In this paper, we take a first step into the RGB-3D multi-modal noisy anomaly detection problem and introduce a novel framework, M3DM-NR, to address the challenging task of RGB-3D multi-modal noisy industrial anomaly detection. Our approach systematically tackles the issues of reference selection, denoising, and final anomaly detection and segmentation through a three-stage process. In Stage I, we developed the Initial Feature Extraction, Suspected References Selection, and Suspected Anomaly Map Computation modules to filter normal samples and generate suspected anomaly maps, providing a robust foundation for subsequent stages. Stage II, termed Enhanced Multi-modal Denoising, leverages multi-scale feature comparison and weighting methods to refine and denoise the training samples, ensuring cleaner data for model training. Finally, Stage III integrates Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion to achieve precise anomaly detection and segmentation while effectively filtering out noise at the patch level. Extensive experiments demonstrate that our M3DM-NR framework significantly outperforms existing state-of-the-art methods in both detection and segmentation precision for 3D-RGB multi-modal noisy anomaly detection. The ablation studies further validate the effectiveness of each component within our framework, highlighting the importance of our systematic and hierarchical approach.

Future Works. Our work not only advances the field of industrial anomaly detection but also sets a new benchmark for handling noisy multi-modal data. Future research can build upon our framework to explore additional modalities and further enhance the robustness and accuracy of anomaly detection systems in practical industrial settings. Future work could consider more realistic methods of injecting noise into the training set. Currently, the approach of using anomalous samples from the test set as noise in the training set is rather naive. Future research could explore how noise naturally occurs in normal samples within real industrial production environments and attempt to construct new multi-modal noisy industrial detection datasets. Additionally, future efforts could look into fine-tuning the CLIP model to better handle the task of multi-modal noisy industrial anomaly detection. The current method employs a training-free approach. The pre-trained CLIP model used in M3DM-NR is trained on a large-scale image dataset containing all categories of images. Subsequent work could consider fine-tuning the CLIP model on specific industrial detection datasets before using it for multi-modal noisy industrial anomaly detection.

References

[1] Y. Cao, X. Xu, J. Zhang, Y. Cheng, X. Huang, G. Pang, and W. Shen, “A survey on visual anomaly detection: Challenge, approach, and prospect,” arXiv preprint arXiv:2401.16402, 2024.
[2] J. Liu, G. Xie, J. Wang, S. Li, C. Wang, F. Zheng, and Y. Jin, “Deep industrial image anomaly detection: A survey,” Machine Intelligence Research, vol. 21, no. 1, pp. 104–135, 2024.
[3] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9592–9600.
[4] C. Wang, W. Zhu, B.-B. Gao, Z. Gan, J. Zhang, Z. Gu, S. Qian, M. Chen, and L. Ma, “Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,” in CVPR, 2024.
[5] K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler, “Towards total recall in industrial anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 318–14 328.
[6] D. Gudovskiy, S. Ishizaka, and K. Kozuka, “Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 98–107.
[7] Y. Zheng, X. Wang, R. Deng, T. Bao, R. Zhao, and L. Wu, “Focus your distribution: Coarse-to-fine non-contrastive learning for anomaly detection and localization,” in 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2022, pp. 1–6.
[8] X. Jiang, J. Liu, J. Wang, Q. Nie, K. Wu, Y. Liu, C. Wang, and F. Zheng, “Softpatch: Unsupervised anomaly detection with noisy data,” Advances in Neural Information Processing Systems, vol. 35, pp. 15 433–15 445, 2022.
[9] Y. Wang, J. Peng, J. Zhang, R. Yi, Y. Wang, and C. Wang, “Multimodal industrial anomaly detection via hybrid fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8032–8041.
[10] P. Bergmann, X. Jin, D. Sattlegger, and C. Steger, “The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization,” in Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2022, Volume 5: VISAPP, Online Streaming, February 6-8, 2022, G. M. Farinella, P. Radeva, and K. Bouatouch, Eds. SCITEPRESS, 2022, pp. 202–213. [Online]. Available: https://doi.org/10.5220/0010865000003124
[11] E. Horwitz and Y. Hoshen, “Back to the feature: classical 3d features are (almost) all you need for 3d anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2967–2976.
[12] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1705–1714.
[13] V. Zavrtanik, M. Kristan, and D. Skočaj, “Reconstruction by inpainting for visual anomaly detection,” Pattern Recognition, vol. 112, p. 107706, 2021.
[14] ——, “Draem-a discriminatively trained reconstruction embedding for surface anomaly detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8330–8339.
[15] H. Deng and X. Li, “Anomaly detection via reverse distillation from one-class embedding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9737–9746.
[16] P. Perera, R. Nallapati, and B. Xiang, “Ocgan: One-class novelty detection using gans with constrained latent representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906.
[17] J. Yu, Y. Zheng, X. Wang, W. Li, Y. Wu, R. Zhao, and L. Wu, “Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows,” arXiv preprint arXiv:2111.07677, 2021.
[18] M. Rudolph, T. Wehrbein, B. Rosenhahn, and B. Wandt, “Asymmetric student-teacher networks for industrial anomaly detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2592–2602.
[19] T. Defard, A. Setkov, A. Loesch, and R. Audigier, “Padim: a patch distribution modeling framework for anomaly detection and localization,” in International Conference on Pattern Recognition. Springer, 2021, pp. 475–489.
[20] E. Horwitz and Y. Hoshen, “An empirical investigation of 3d anomaly detection and segmentation,” arXiv preprint arXiv:2203.05550, 2022.
[21] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[22] Z. Guo, R. Zhang, X. Zhu, Y. Tang, X. Ma, J. Han, K. Chen, P. Gao, X. Li, H. Li et al., “Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,” arXiv preprint arXiv:2309.00615, 2023.
[23] L. Bonfiglioli, M. Toschi, D. Silvestri, N. Fioraio, and D. De Gregorio, “The eyecandies dataset for unsupervised multimodal anomaly detection and localization,” in Proceedings of the Asian Conference on Computer Vision, 2022, pp. 3586–3602.
[24] C.-L. Li, K. Sohn, J. Yoon, and T. Pfister, “Cutpaste: Self-supervised learning for anomaly detection and localization,” in CVPR, 2021.
[25] G. Zhang, K. Cui, T.-Y. Hung, and S. Lu, “Defect-gan: High-fidelity defect synthesis for automated defect inspection,” in WACV, 2021.
[26] Z. Liu, Y. Zhou, Y. Xu, and Z. Wang, “Simplenet: A simple network for image anomaly detection and localization,” in CVPR, 2023.
[27] M. Yang, P. Wu, and H. Feng, “Memseg: A semi-supervised method for image surface defect detection using differences and commonalities,” Engineering Applications of Artificial Intelligence, 2023.
[28] T. D. Tien, A. T. Nguyen, N. H. Tran, T. D. Huy, S. Duong, C. D. T. Nguyen, and S. Q. Truong, “Revisiting reverse distillation for anomaly detection,” in CVPR, 2023.
[29] L. Chen, Z. You, N. Zhang, J. Xi, and X. Le, “Utrad: Anomaly detection and localization with u-transformer,” Neural Networks, 2022.
[30] Y. Liang, J. Zhang, S. Zhao, R. Wu, Y. Liu, and S. Pan, “Omni-frequency channel-selection representations for unsupervised anomaly detection,” TIP, 2023.
[31] J. Zhang, X. Chen, Y. Wang, C. Wang, Y. Liu, X. Li, M.-H. Yang, and D. Tao, “Exploring plain vit reconstruction for multi-class unsupervised anomaly detection,” arXiv preprint arXiv:2312.07495, 2023.
[32] H. He, Y. Bai, J. Zhang, Q. He, H. Chen, Z. Gan, C. Wang, X. Li, G. Tian, and L. Xie, “Mambaad: Exploring state space models for multi-class unsupervised anomaly detection,” arXiv, 2024.
[33] J. Zhang, X. Li, G. Tian, Z. Xue, Y. Liu, G. Pang, and D. Tao, “Learning feature inversion for multi-class unsupervised anomaly detection under general-purpose coco-ad benchmark,” arXiv, 2024.
[34] H. He, J. Zhang, H. Chen, X. Chen, Z. Li, X. Chen, Y. Wang, C. Wang, and L. Xie, “Diad: A diffusion-based framework for multi-class anomaly detection,” arXiv preprint arXiv:2312.06607, 2023.
[35] Q. Wan, L. Gao, X. Li, and L. Wen, “Unsupervised image anomaly detection and segmentation based on pretrained feature mapping,” TII, 2022.
[36] Y. Cao, X. Xu, Z. Liu, and W. Shen, “Collaborative discrepancy optimization for reliable image anomaly localization,” TII, 2023.
[37] J. Lei, X. Hu, Y. Wang, and D. Liu, “Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow,” in CVPR, 2023.
[38] M. Salehi, N. Sadjadi, S. Baselizadeh, M. H. Rohban, and H. R. Rabiee, “Multiresolution knowledge distillation for anomaly detection,” in CVPR, 2021.
[39] Y. Cao, Q. Wan, W. Shen, and L. Gao, “Informative knowledge distillation for image anomaly segmentation,” KBS, 2022.
[40] R. Chen, G. Xie, J. Liu, J. Wang, Z. Luo, J. Wang, and F. Zheng, “Easynet: An easy network for 3d industrial anomaly detection,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 7038–7046.
[41] V. Zavrtanik, M. Kristan, and D. Skočaj, “Keep dræming: Discriminative 3d anomaly detection through anomaly simulation,” Pattern Recognition Letters, 2024.
[42] W. Li and X. Xu, “Towards scalable 3d anomaly detection and localization: A benchmark via 3d anomaly synthesis and a self-supervised learning network,” arXiv preprint arXiv:2311.14897, 2023.
[43] Y. Cao, X. Xu, and W. Shen, “Complementary pseudo multimodal feature for point cloud anomaly detection,” arXiv preprint arXiv:2303.13194, 2023.
[44] Y.-M. Chu, L. Chieh, T.-I. Hsieh, H.-T. Chen, and T.-L. Liu, “Shape-guided dual-memory learning for 3d anomaly detection,” 2023.
[45] Y. Tu, B. Zhang, L. Liu, Y. Li, C. Xu, J. Zhang, Y. Wang, C. Wang, and C. R. Zhao, “Self-supervised feature adaptation for 3d industrial anomaly detection,” arXiv preprint arXiv:2401.03145, 2024.
[46] B. Zhao, Q. Xiong, X. Zhang, J. Guo, Q. Liu, X. Xing, and X. Xu, “Pointcore: Efficient unsupervised point cloud anomaly detector using local-global features,” arXiv preprint arXiv:2403.01804, 2024.
[47] P. Bergmann and D. Sattlegger, “Anomaly detection in 3d point clouds using deep geometric descriptors,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2613–2623.
[48] Z. Gu, J. Zhang, L. Liu, X. Chen, J. Peng, Z. Gan, G. Jiang, A. Shu, Y. Wang, and L. Ma, “Rethinking reverse distillation for multi-modal anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 8, 2024, pp. 8445–8453.
[49] Z. Hu, Z. Yang, X. Hu, and R. Nevatia, “Simple: Similar pseudo label exploitation for semi-supervised classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 099–15 108.
[50] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in neural information processing systems, vol. 33, pp. 596–608, 2020.
[51] J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” arXiv preprint arXiv:2002.07394, 2020.
[52] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, “End-to-end semi-supervised object detection with soft teacher,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3060–3069.
[53] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda, “Unbiased teacher for semi-supervised object detection,” arXiv preprint arXiv:2102.09480, 2021.
[54] F. Yang, K. Wu, S. Zhang, G. Jiang, Y. Liu, F. Zheng, W. Zhang, C. Wang, and L. Zeng, “Class-aware contrastive semi-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 421–14 430.
[55] S. Han, X. Hu, H. Huang, M. Jiang, and Y. Zhao, “Adbench: Anomaly detection benchmark,” Advances in Neural Information Processing Systems, vol. 35, pp. 32 142–32 159, 2022.
[56] G. Pang, C. Yan, C. Shen, A. v. d. Hengel, and X. Bai, “Self-trained deep ordinal regression for end-to-end video anomaly detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 173–12 182.
[57] B. Liu, D. Wang, K. Lin, P.-N. Tan, and J. Zhou, “Rca: A deep collaborative autoencoder approach for anomaly detection,” in IJCAI: proceedings of the conference, vol. 2021. NIH Public Access, 2021, p. 1505.
[58] C. Zhou and R. C. Paffenroth, “Anomaly detection with robust deep autoencoders,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 665–674.
[59] S. Wu, J. Zhao, and G. Tian, “Understanding and mitigating data contamination in deep anomaly detection: A kernel-based approach.” in IJCAI, 2022, pp. 2319–2325.
[60] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022.
[61] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning. PMLR, 2021, pp. 4904–4916.
[62] R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt, “Measuring robustness to natural distribution shifts in image classification,” Advances in Neural Information Processing Systems, vol. 33, pp. 18 583–18 599, 2020.
[63] G. Goh, N. Cammarata, C. Voss, S. Carter, M. Petrov, L. Schubert, A. Radford, and C. Olah, “Multimodal neurons in artificial neural networks,” Distill, vol. 6, no. 3, p. e30, 2021.
[64] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “Denseclip: Language-guided dense prediction with context-aware prompting,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 082–18 091.
[65] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li et al., “Regionclip: Region-based language-image pretraining,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 793–16 803.
[66] C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in European Conference on Computer Vision. Springer, 2022, pp. 696–712.
[67] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “Winclip: Zero-/few-shot anomaly classification and segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 606–19 616.
[68] X. Chen, Y. Han, and J. Zhang, “A zero-/few-shot anomaly classification and segmentation method for cvpr 2023 vand workshop challenge tracks 1&2: 1st place on zero-shot ad and 4th place on few-shot ad,” arXiv preprint arXiv:2305.17382, 2023.
[69] Y. Cao, X. Xu, C. Sun, Y. Cheng, Z. Du, L. Gao, and W. Shen, “Segment any anomaly without training via hybrid prompt regularization,” arXiv preprint arXiv:2305.10724, 2023.
[70] X. Chen, J. Zhang, G. Tian, H. He, W. Zhang, Y. Wang, C. Wang, Y. Wu, and Y. Liu, “Clip-ad: A language-guided staged dual-path model for zero-shot anomaly detection,” arXiv preprint arXiv:2311.00453, 2023.
[71] J. Zhang, X. Chen, Z. Xue, Y. Wang, C. Wang, and Y. Liu, “Exploring grounding potential of vqa-oriented gpt-4v for zero-shot anomaly detection,” arXiv preprint arXiv:2311.02612, 2023.
[72] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, “Pointclip: Point cloud understanding by clip,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8552–8562.
[73] X. Zhu, R. Zhang, B. He, Z. Guo, Z. Zeng, Z. Qin, S. Zhang, and P. Gao, “Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2639–2650.
[74] L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1179–1189.
[75] Y. Zheng, X. Wang, Y. Qi, W. Li, and L. Wu, “Benchmarking unsupervised anomaly detection and localization,” 2022.
[76] G. Wang, S. Han, E. Ding, and D. Huang, “Student-teacher feature pyramid matching for anomaly detection,” in 32nd British Machine Vision Conference 2021, BMVC 2021, Online, November 22-25, 2021. BMVA Press, 2021, p. 306. [Online]. Available: https://www.bmvc2021-virtualconference.com/assets/papers/1273.pdf
[77] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[78] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
[79] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
[80] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 259–16 268.
[81] Y. Pang, W. Wang, F. E. H. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked autoencoders for point cloud self-supervised learning,” 2022.
[82] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
[83] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[84] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.

Overview

The appendix provides additional sections below to enhance the main manuscript:

• We report the P-AUROC for regular anomaly segmentation on MVTec 3D-AD in Tab. A1.
• We report the P-AUROC for noisy anomaly segmentation on MVTec 3D-AD in Tabs. A2 and A3.
• We show the visualization results of noisy anomaly segmentation under the Non-Overlap setting in Fig. A1.
• We report the experiment results on the Eyecandies [23] dataset in Tabs. A4, A5, A6, A7, A8 and A9.
• We report experiment results when injecting different percentages of noise into the training set in Tabs. A10, A11, A12, A13, A14 and A15.

P-AUROC for regular anomaly segmentation on MVTec 3D-AD

TABLE A1: P-AUROC score for regular anomaly segmentation of all categories of the MVTec 3D-AD [10] dataset. Our method maintains the regular anomaly segmentation ability. The baseline results are taken from [10, 20, 75]. Optimal and sub-optimal results are in bold and underlined, respectively.

	Method	Bagel	Cable Gland	Carrot	Cookie	Dowel	Foam	Peach	Potato	Rope	Tire	Mean
3D	FPFH [20]	99.4	96.6	99.9	94.6	96.6	92.7	99.6	99.9	99.6	99.0	97.8
	M3DM [9]	98.1	94.9	99.7	93.2	95.9	92.5	98.9	99.5	99.4	98.1	97.0
	Ours	98.1	95.0	99.6	93.2	95.9	92.4	98.9	99.6	99.4	98.1	97.0
RGB	PatchCore [5]	98.3	98.4	98.0	97.4	97.2	84.9	97.6	98.3	98.7	97.7	96.7
	M3DM [9]	99.2	99.0	99.4	97.7	98.3	95.5	99.4	99.0	99.5	99.4	98.7
	Ours	99.1	99.0	99.4	97.7	98.4	95.5	99.3	99.0	99.5		98.7
RGB+3D	AST [18]											97.6
	PatchCore + FPFH [20]	99.6	99.2	99.7	99.4	98.1	97.4	99.6	99.8	99.4	99.5	99.2
	M3DM [9]	99.5	99.3	99.7		98.5	98.4	99.6	99.4	99.7	99.6	99.2
	Ours	99.6	99.3	99.7	97.9	98.5	98.9	99.6	99.5	99.7	99.6	99.2

In the regular anomaly segmentation setting, we compare our method with several 3D-based, RGB-based, and hybrid multi-modal 3D/RGB methods on MVTec 3D-AD. Tab. A1 reports the segmentation results measured by P-AUROC, from which we conclude that our M3DM-NR also maintains the regular anomaly segmentation ability.

P-AUROC for noisy anomaly segmentation on MVTec 3D-AD

In the main paper, we report the AUPRO score for anomaly segmentation. In this section, we report the P-AUROC score under Overlap and Non-Overlap settings to further verify the segmentation performance of our method, as shown in Tab. A2 and Tab. A3.
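For reference, P-AUROC can be read as the AUROC computed over individual pixels pooled across all test images. The following is a minimal NumPy sketch under that assumption; the function name `pixel_auroc` and the rank-based (Mann-Whitney) formulation are illustrative choices, not the evaluation code used to produce the tables.

```python
import numpy as np

def pixel_auroc(score_maps, gt_masks):
    """Pixel-level AUROC over all pixels of all images, returned in [0, 1].

    score_maps: iterable of 2D arrays with per-pixel anomaly scores.
    gt_masks:   iterable of 2D arrays, non-zero where a pixel is anomalous.
    """
    scores = np.concatenate(
        [np.asarray(s, dtype=np.float64).ravel() for s in score_maps])
    labels = np.concatenate([np.asarray(m).ravel() > 0 for m in gt_masks])
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both anomalous and normal pixels")

    # Sort scores and assign average 1-based ranks to groups of tied scores,
    # so that a tie between a normal and an anomalous pixel counts as 0.5.
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    starts = np.flatnonzero(np.r_[True, sorted_scores[1:] != sorted_scores[:-1]])
    ends = np.r_[starts[1:], scores.size]
    avg_rank = (starts + ends + 1) / 2.0
    ranks = np.empty(scores.size, dtype=np.float64)
    ranks[order] = np.repeat(avg_rank, ends - starts)

    # Mann-Whitney U statistic of the anomalous pixels, normalised to [0, 1].
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

The tables report this quantity multiplied by 100. Pooling pixels over the whole test set (rather than averaging per-image AUROCs) is an assumption of this sketch.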

3D. On pure 3D anomaly segmentation, our method achieves the highest P-AUROC, outperforming Shape-Guided [44] by 0.8% in the Overlap setting and M3DM [9] by 0.1% in the Non-Overlap setting. This shows that our method segments anomalies better than previous methods and is more resistant to noise in the training dataset, and that, with our PFA, the Point Transformer is a better 3D feature extractor for this task.

RGB. In the RGB domain, our P-AUROC matches that of SoftPatch [8] in the Overlap setting and that of M3DM in the Non-Overlap setting. However, our method has a lower standard deviation, which indicates that it is more robust.

3D+RGB. On 3D + RGB multi-modal anomaly segmentation, our method achieves the best P-AUROC, outperforming Shape-Guided by 0.6% in the Overlap setting and PatchCore+FPFH [20] by 0.1% in the Non-Overlap setting. We attribute these results to our novel three-stage multi-modal noise-resistant framework.

TABLE A2: P-AUROC score for anomaly segmentation under Overlap setting of all categories of MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Setting | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D | SIFT | 69.8 ± 4.6 | 80.6 ± 1.4 | 95.4 ± 0.5 | 78.2 ± 0.9 | 70.6 ± 2.0 | 77.1 ± 1.6 | 66.6 ± 1.5 | 76.4 ± 10.0 | 91.1 ± 0.3 | 75.9 ± 1.7 | 78.2 ± 1.6 |
| 3D | FPFH | 84.5 ± 2.7 | 92.6 ± 0.2 | 96.5 ± 0.4 | 85.8 ± 0.6 | 86.3 ± 2.2 | 84.5 ± 1.4 | 87.8 ± 1.2 | 87.4 ± 2.0 | 83.3 ± 0.7 | 91.8 ± 0.7 | 88.0 ± 0.3 |
| 3D | AST | 89.5 ± 0.6 | 90.2 ± 0.0 | 96.9 ± 0.0 | 85.7 ± 0.6 | 86.8 ± 0.0 | 86.4 ± 0.0 | 93.5 ± 0.0 | 97.0 ± 0.6 | 89.6 ± 0.6 | 89.9 ± 0.6 | 90.6 ± 0.2 |
| 3D | Shape-Guided | 93.5 ± 1.7 | 94.2 ± 1.5 | 99.4 ± 0.6 | 92.4 ± 1.2 | 88.1 ± 6.5 | 91.0 ± 3.1 | 94.6 ± 0.8 | 92.5 ± 3.8 | 97.1 ± 1.9 | 91.2 ± 1.2 | 93.4 ± 0.6 |
| 3D | M3DM | 94.3 ± 1.1 | 94.2 ± 0.9 | 98.9 ± 0.2 | 90.6 ± 0.9 | 89.8 ± 6.7 | 87.3 ± 2.8 | 95.1 ± 1.0 | 91.9 ± 5.1 | 98.0 ± 0.5 | 92.6 ± 3.8 | 93.3 ± 0.9 |
| 3D | Ours | 96.6 ± 1.7 | 94.3 ± 0.3 | 99.3 ± 0.3 | 91.8 ± 0.4 | 90.2 ± 4.9 | 88.8 ± 1.8 | 95.7 ± 1.2 | 92.6 ± 3.2 | 98.7 ± 0.7 | 94.3 ± 2.6 | 94.2 ± 0.7 |
| RGB | PaDim | 93.4 ± 0.9 | 93.9 ± 0.9 | 97.3 ± 0.4 | 90.6 ± 1.3 | 93.5 ± 6.1 | 88.4 ± 0.5 | 91.8 ± 4.5 | 89.3 ± 1.2 | 98.5 ± 0.2 | 93.8 ± 3.8 | 93.1 ± 0.1 |
| RGB | PatchCore | 75.2 ± 3.2 | 73.6 ± 6.2 | 80.0 ± 4.0 | 80.2 ± 3.4 | 71.1 ± 5.5 | 75.4 ± 9.5 | 68.9 ± 7.8 | 72.3 ± 9.3 | 64.9 ± 17.3 | 75.3 ± 6.8 | 73.7 ± 1.4 |
| RGB | AST | 67.8 ± 0.0 | 74.2 ± 0.0 | 54.2 ± 0.0 | 65.8 ± 0.6 | 68.9 ± 0.0 | 63.4 ± 0.6 | 57.5 ± 0.6 | 61.1 ± 0.6 | 57.2 ± 0.0 | 69.3 ± 0.6 | 63.9 ± 0.1 |
| RGB | Shape-Guided | 78.0 ± 3.5 | 91.2 ± 1.4 | 93.1 ± 1.1 | 84.7 ± 0.3 | 90.1 ± 0.4 | 73.8 ± 1.6 | 82.8 ± 1.1 | 89.3 ± 0.8 | 88.6 ± 0.2 | 88.8 ± 0.3 | 86.0 ± 0.6 |
| RGB | SoftPatch | 90.4 ± 1.7 | 91.9 ± 4.1 | 96.9 ± 1.1 | 87.7 ± 2.2 | 94.8 ± 4.6 | 96.5 ± 4.9 | 94.4 ± 0.5 | 90.9 ± 0.7 | 96.7 ± 1.6 | 97.3 ± 0.8 | 93.8 ± 0.5 |
| RGB | M3DM | 68.8 ± 5.0 | 77.0 ± 1.8 | 77.2 ± 2.6 | 77.1 ± 0.4 | 71.8 ± 2.0 | 68.9 ± 2.3 | 65.8 ± 1.7 | 65.8 ± 3.8 | 60.5 ± 2.3 | 75.2 ± 1.4 | 70.8 ± 1.1 |
| RGB | Ours | 98.5 ± 0.5 | 95.8 ± 1.6 | 98.7 ± 0.4 | 95.0 ± 1.1 | 88.5 ± 5.9 | 85.9 ± 1.7 | 93.4 ± 2.6 | 89.5 ± 1.0 | 98.6 ± 0.3 | 94.6 ± 0.4 | 93.8 ± 0.7 |
| 3D+RGB | PatchCore+FPFH | 69.1 ± 4.8 | 77.0 ± 1.8 | 77.4 ± 2.6 | 78.4 ± 0.4 | 71.5 ± 2.1 | 69.3 ± 1.5 | 66.0 ± 1.7 | 65.8 ± 3.8 | 60.5 ± 2.3 | 75.2 ± 1.4 | 71.0 ± 0.9 |
| 3D+RGB | AST | 90.7 ± 0.6 | 94.3 ± 0.6 | 97.5 ± 0.0 | 89.4 ± 0.0 | 90.6 ± 0.6 | 89.4 ± 0.0 | 93.3 ± 0.6 | 96.9 ± 0.6 | 90.6 ± 0.6 | 93.6 ± 0.0 | 92.6 ± 0.2 |
| 3D+RGB | Shape-Guided | 91.0 ± 1.7 | 94.7 ± 0.4 | 98.1 ± 0.2 | 90.9 ± 0.1 | 91.6 ± 5.3 | 90.8 ± 1.6 | 95.3 ± 0.3 | 95.8 ± 4.6 | 96.0 ± 0.3 | 95.5 ± 2.7 | 94.0 ± 1.0 |
| 3D+RGB | M3DM | 69.8 ± 4.7 | 77.0 ± 2.0 | 77.4 ± 2.6 | 79.2 ± 0.5 | 71.9 ± 3.1 | 74.0 ± 2.4 | 66.2 ± 1.8 | 66.2 ± 3.8 | 61.8 ± 2.5 | 75.6 ± 1.3 | 71.9 ± 1.2 |
| 3D+RGB | Ours | 99.1 ± 0.5 | 95.8 ± 1.7 | 99.0 ± 0.5 | 95.8 ± 1.0 | 90.7 ± 2.8 | 88.1 ± 2.5 | 93.8 ± 2.8 | 89.8 ± 1.1 | 98.8 ± 0.2 | 94.9 ± 0.5 | 94.6 ± 0.3 |

TABLE A3: P-AUROC score for anomaly segmentation under Non-Overlap setting of all categories of MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Setting | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D | SIFT | 94.0 ± 3.1 | 94.2 ± 3.0 | 93.9 ± 4.9 | 93.0 ± 1.9 | 95.7 ± 1.3 | 92.3 ± 2.9 | 96.0 ± 2.8 | 98.1 ± 2.9 | 99.2 ± 0.7 | 98.6 ± 0.7 | 95.5 ± 0.6 |
| 3D | FPFH | 97.7 ± 0.5 | 93.8 ± 2.4 | 95.2 ± 4.5 | 94.4 ± 0.4 | 96.5 ± 0.5 | 92.6 ± 1.4 | 96.1 ± 1.1 | 99.1 ± 1.2 | 98.9 ± 1.2 | 99.1 ± 0.1 | 96.3 ± 0.5 |
| 3D | AST | 96.4 ± 0.6 | 91.3 ± 0.6 | 98.3 ± 0.6 | 91.9 ± 0.6 | 86.4 ± 0.6 | 94.0 ± 0.6 | 98.9 ± 0.6 | 99.3 ± 0.6 | 92.9 ± 0.0 | 93.8 ± 0.0 | 94.3 ± 0.3 |
| 3D | Shape-Guided | 98.4 ± 0.5 | 94.4 ± 1.5 | 98.8 ± 1.0 | 93.0 ± 1.7 | 95.5 ± 0.6 | 90.9 ± 4.0 | 98.7 ± 1.2 | 97.9 ± 2.0 | 98.0 ± 0.6 | 97.7 ± 0.1 | 96.3 ± 0.6 |
| 3D | M3DM | 97.9 ± 0.3 | 94.8 ± 0.3 | 99.6 ± 0.1 | 91.9 ± 0.9 | 94.8 ± 2.0 | 91.5 ± 3.1 | 97.5 ± 2.2 | 99.1 ± 0.1 | 99.3 ± 0.1 | 97.5 ± 1.0 | 96.4 ± 0.7 |
| 3D | Ours | 98.6 ± 0.2 | 94.6 ± 0.2 | 99.6 ± 0.1 | 92.4 ± 0.6 | 95.4 ± 0.9 | 90.8 ± 2.9 | 98.1 ± 1.1 | 98.2 ± 1.6 | 99.2 ± 0.3 | 97.7 ± 0.6 | 96.5 ± 0.7 |
| RGB | PaDim | 97.5 ± 1.2 | 96.1 ± 0.9 | 97.9 ± 0.2 | 95.1 ± 0.2 | 97.8 ± 0.4 | 99.6 ± 0.3 | 99.1 ± 0.2 | 98.6 ± 0.3 | 98.8 ± 0.4 | 99.2 ± 0.2 | 98.0 ± 0.2 |
| RGB | PatchCore | 96.0 ± 0.2 | 98.9 ± 0.0 | 98.1 ± 1.9 | 96.7 ± 0.4 | 98.9 ± 0.1 | 99.9 ± 0.0 | 98.1 ± 0.1 | 96.3 ± 2.3 | 98.8 ± 0.8 | 99.2 ± 0.6 | 98.1 ± 0.5 |
| RGB | AST | 88.5 ± 0.6 | 92.7 ± 0.6 | 65.8 ± 0.6 | 79.4 ± 1.0 | 96.0 ± 0.6 | 80.6 ± 1.0 | 84.4 ± 0.6 | 80.0 ± 0.0 | 89.1 ± 0.6 | 85.6 ± 0.6 | 84.2 ± 0.2 |
| RGB | Shape-Guided | 94.5 ± 0.4 | 97.2 ± 0.4 | 98.3 ± 0.2 | 95.0 ± 0.6 | 98.1 ± 0.1 | 87.8 ± 0.8 | 95.1 ± 0.2 | 96.1 ± 0.3 | 97.3 ± 1.0 | 97.5 ± 0.5 | 95.7 ± 0.1 |
| RGB | SoftPatch | 96.3 ± 0.5 | 98.5 ± 0.3 | 99.2 ± 0.1 | 96.8 ± 0.4 | 98.9 ± 0.1 | 98.9 ± 1.0 | 98.3 ± 0.3 | 97.1 ± 1.3 | 98.2 ± 0.4 | 98.5 ± 1.0 | 98.1 ± 0.1 |
| RGB | M3DM | 98.8 ± 0.3 | 98.9 ± 0.6 | 99.0 ± 0.6 | 96.6 ± 0.3 | 98.4 ± 0.4 | 93.9 ± 0.8 | 99.1 ± 0.1 | 98.7 ± 0.3 | 99.5 ± 0.1 | 99.4 ± 0.1 | 98.2 ± 0.2 |
| RGB | Ours | 99.0 ± 0.2 | 98.9 ± 0.2 | 99.2 ± 0.1 | 96.4 ± 0.3 | 97.7 ± 0.8 | 94.6 ± 0.4 | 98.9 ± 0.1 | 98.4 ± 0.5 | 99.4 ± 0.2 | 98.9 ± 0.1 | 98.2 ± 0.0 |
| 3D+RGB | PatchCore+FPFH | 99.4 ± 0.1 | 98.8 ± 0.5 | 99.3 ± 0.6 | 98.1 ± 1.6 | 98.1 ± 0.5 | 97.5 ± 0.2 | 99.3 ± 0.1 | 98.6 ± 0.1 | 99.5 ± 0.1 | 99.1 ± 0.6 | 98.8 ± 0.1 |
| 3D+RGB | AST | 97.4 ± 0.6 | 97.1 ± 0.6 | 99.5 ± 0.6 | 94.0 ± 0.0 | 91.3 ± 0.6 | 97.1 ± 0.6 | 98.7 ± 0.0 | 98.7 ± 0.6 | 93.2 ± 0.6 | 96.9 ± 0.0 | 96.4 ± 0.1 |
| 3D+RGB | Shape-Guided | 97.6 ± 0.1 | 98.2 ± 0.3 | 99.5 ± 0.1 | 97.0 ± 0.3 | 98.9 ± 0.1 | 97.2 ± 0.2 | 98.6 ± 0.1 | 99.1 ± 1.0 | 98.9 ± 0.5 | 99.6 ± 0.2 | 98.5 ± 0.2 |
| 3D+RGB | M3DM | 98.9 ± 0.2 | 99.1 ± 0.1 | 99.3 ± 0.6 | 96.8 ± 0.3 | 97.5 ± 0.9 | 96.0 ± 0.3 | 99.2 ± 0.1 | 99.0 ± 0.3 | 99.7 ± 0.1 | 99.3 ± 0.1 | 98.5 ± 0.1 |
| 3D+RGB | Ours | 99.4 ± 0.1 | 99.0 ± 0.1 | 99.5 ± 0.1 | 97.2 ± 0.2 | 98.2 ± 0.4 | 98.1 ± 0.4 | 99.3 ± 0.1 | 99.2 ± 0.0 | 99.6 ± 0.1 | 99.2 ± 0.1 | 98.9 ± 0.0 |

Visualization results of Non-Overlap setting

In this section, we visualize anomaly segmentation results for all categories of the MVTec 3D-AD dataset under the Non-Overlap setting. As shown in Fig. A1, we visualize the heatmaps of our method, PatchCore + FPFH [20], M3DM [9], and Shape-Guided [44] with multi-modal inputs. Compared with previous methods, our method produces better segmentation maps.

Eyecandies

We note that a recent dataset, Eyecandies [23], provides multimodal information for 10 categories of candies; each category contains 1000 samples for training, 50 labeled samples for public testing, and 400 unlabeled samples for private testing. For each sample, the source dataset provides 6 RGB images captured under different lighting conditions, a depth map, and a normal map. In this section, we convert the Eyecandies dataset to the format supported by M3DM-NR. In detail, we use the environment-light image as our input RGB data. For the 3D data, we first convert the depth image to a point cloud using the camera intrinsic parameters, and then remove the background points based on their coordinates. For computational efficiency, we use fewer than 400 samples from each category for training.

Because the public test dataset contains only 25 normal and 25 anomalous samples, which is fewer than 10% of the size of the training dataset, we implement the Overlap and Non-Overlap settings differently. For the Overlap setting, we only conduct experiments with 5% noise, selecting 400 images from the training dataset and 20 images from the public test dataset to form the noisy training dataset. For the Non-Overlap setting, as the private test dataset contains 200 normal samples and 200 anomalous samples mixed together, we randomly select 80 samples from the private test dataset and regard them as 40 normal samples and 40 anomalous samples. These 80 samples, along with 320 normal samples selected from the training dataset, make up the noisy training dataset. We report the mean and standard deviation over 3 random seeds for each measurement.
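The depth-to-point-cloud conversion described above can be sketched as a standard pinhole back-projection. This is a minimal illustration, not the conversion script used for the experiments: the intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the background rule (a hypothetical depth threshold `z_max`) are all assumptions introduced here.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H, W, 3) organized point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal point
    (cx, cy) in pixels, and depth measured along the optical axis.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def foreground_mask(points, z_max):
    """Boolean (H, W) mask of points kept as foreground.

    Hypothetical rule: keep points with valid (positive) depth that lie in
    front of a background plane at distance z_max from the camera.
    """
    z = points[..., 2]
    return (z > 0) & (z < z_max)
```

Keeping the cloud organized (one 3D point per pixel) preserves the pixel correspondence between the RGB image and the 3D data, which is convenient when the two modalities are later compared position by position.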

As illustrated in Tabs. A4, A5, A6, A7, A8 and A9, our method achieves the best mean I-AUROC, AUPRO, and P-AUROC scores under both the Overlap and Non-Overlap settings.

TABLE A4: I-AUROC score for anomaly detection under Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Setting | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D+RGB | PatchCore+FPFH | 11.4 ± 2.8 | 19.2 ± 3.6 | 20.9 ± 1.6 | 19.7 ± 0.9 | 25.1 ± 5.9 | 20.8 ± 4.7 | 17.6 ± 1.2 | 24.5 ± 3.6 | 24.8 ± 1.4 | 19.1 ± 1.3 | 20.3 ± 0.3 |
| 3D+RGB | AST | 8.0 ± 0.6 | 13.8 ± 0.6 | 6.7 ± 0.6 | 10.9 ± 0.6 | 16.7 ± 0.6 | 10.9 ± 0.6 | 18.4 ± 0.6 | 24.0 ± 1.0 | 9.4 ± 0.0 | 13.7 ± 0.0 | 13.4 ± 0.2 |
| 3D+RGB | Shape-Guided | 9.1 ± 4.5 | 18.5 ± 1.0 | 15.3 ± 2.5 | 24.7 ± 2.2 | 15.5 ± 3.0 | 11.8 ± 2.4 | 15.8 ± 0.6 | 25.7 ± 1.2 | 25.9 ± 1.3 | 23.6 ± 3.1 | 18.6 ± 0.8 |
| 3D+RGB | M3DM | 17.0 ± 3.6 | 30.5 ± 4.2 | 39.6 ± 2.7 | 41.9 ± 1.6 | 39.4 ± 3.4 | 20.7 ± 3.8 | 28.2 ± 2.3 | 33.1 ± 3.4 | 54.6 ± 0.4 | 50.9 ± 0.9 | 35.6 ± 0.9 |
| 3D+RGB | Ours | 33.5 ± 3.4 | 74.9 ± 4.5 | 76.9 ± 5.5 | 89.3 ± 3.0 | 55.8 ± 6.1 | 48.0 ± 5.7 | 79.4 ± 5.2 | 65.0 ± 4.9 | 98.9 ± 1.0 | 70.5 ± 2.4 | 69.2 ± 1.9 |

TABLE A5: I-AUROC score for anomaly detection under Non-Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Setting | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D+RGB | PatchCore+FPFH | 55.4 ± 0.8 | 86.4 ± 2.3 | 72.2 ± 2.1 | 94.3 ± 1.9 | 71.5 ± 3.5 | 49.2 ± 5.3 | 80.9 ± 1.0 | 82.0 ± 1.2 | 99.1 ± 0.8 | 85.8 ± 4.7 | 77.7 ± 0.6 |
| 3D+RGB | AST | 47.7 ± 0.6 | 93.4 ± 1.0 | 78.3 ± 0.6 | 93.9 ± 0.0 | 74.7 ± 0.6 | 66.2 ± 1.0 | 83.1 ± 0.6 | 87.3 ± 0.0 | 99.4 ± 0.6 | 92.9 ± 0.6 | 81.7 ± 0.2 |
| 3D+RGB | Shape-Guided | 49.4 ± 0.5 | 94.8 ± 1.3 | 77.5 ± 2.2 | 93.9 ± 1.1 | 74.8 ± 0.9 | 64.9 ± 2.0 | 83.3 ± 0.4 | 86.0 ± 1.6 | 99.6 ± 0.1 | 92.6 ± 1.3 | 81.7 ± 0.7 |
| 3D+RGB | M3DM | 53.9 ± 5.0 | 90.1 ± 0.6 | 89.4 ± 0.8 | 98.4 ± 0.4 | 81.5 ± 1.0 | 52.3 ± 1.8 | 78.4 ± 1.1 | 83.3 ± 1.7 | 99.5 ± 0.2 | 99.4 ± 0.2 | 82.6 ± 0.5 |
| 3D+RGB | Ours | 54.5 ± 7.7 | 85.6 ± 0.5 | 88.9 ± 2.1 | 97.2 ± 0.7 | 82.2 ± 6.1 | 54.3 ± 2.5 | 86.8 ± 0.2 | 85.6 ± 1.2 | 99.8 ± 0.1 | 98.6 ± 0.7 | 83.3 ± 0.6 |

TABLE A6: AUPRO score for anomaly segmentation under Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Setting | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D+RGB | PatchCore+FPFH | 16.7 ± 1.9 | 20.5 ± 2.1 | 15.6 ± 1.4 | 18.7 ± 3.2 | 22.2 ± 4.8 | 18.3 ± 2.6 | 17.3 ± 1.9 | 25.8 ± 6.2 | 19.0 ± 1.1 | 19.6 ± 0.5 | 19.4 ± 0.6 |
| 3D+RGB | Shape-Guided | 65.6 ± 0.6 | 44.1 ± 0.9 | 21.1 ± 0.9 | 57.8 ± 4.2 | 52.8 ± 2.2 | 20.7 ± 1.7 | 34.3 ± 2.0 | 84.0 ± 3.2 | 59.1 ± 3.0 | 57.6 ± 2.2 | 49.7 ± 1.1 |
| 3D+RGB | M3DM | 21.7 ± 3.2 | 21.0 ± 2.3 | 18.3 ± 0.2 | 18.8 ± 3.2 | 23.3 ± 5.1 | 21.5 ± 2.1 | 17.6 ± 2.3 | 26.7 ± 4.7 | 19.1 ± 1.2 | 20.2 ± 0.0 | 20.8 ± 0.7 |
| 3D+RGB | Ours | 50.5 ± 2.5 | 82.1 ± 2.9 | 66.8 ± 2.2 | 89.7 ± 2.7 | 60.7 ± 4.0 | 59.3 ± 2.2 | 80.8 ± 1.6 | 70.3 ± 2.9 | 94.1 ± 2.6 | 55.9 ± 3.4 | 71.0 ± 0.8 |

TABLE A7: AUPRO score for anomaly segmentation under Non-Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Setting | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D+RGB | PatchCore+FPFH | 83.5 ± 1.5 | 89.9 ± 0.7 | 67.0 ± 0.8 | 96.4 ± 0.0 | 81.9 ± 0.8 | 51.6 ± 1.2 | 86.7 ± 0.6 | 89.9 ± 0.3 | 94.6 ± 0.6 | 88.6 ± 0.7 | 83.0 ± 0.3 |
| 3D+RGB | Shape-Guided | 84.9 ± 0.5 | 91.0 ± 0.1 | 69.8 ± 0.4 | 95.5 ± 0.3 | 84.6 ± 0.7 | 61.1 ± 0.9 | 90.5 ± 0.8 | 95.1 ± 0.2 | 96.4 ± 0.2 | 93.8 ± 0.3 | 86.3 ± 0.2 |
| 3D+RGB | M3DM | 88.0 ± 1.1 | 90.4 ± 1.2 | 80.6 ± 0.2 | 96.1 ± 3.6 | 87.4 ± 1.2 | 65.7 ± 1.3 | 86.4 ± 1.4 | 91.2 ± 0.2 | 96.2 ± 0.6 | 96.2 ± 0.8 | 87.8 ± 0.3 |
| 3D+RGB | Ours | 89.8 ± 0.6 | 91.6 ± 0.3 | 77.6 ± 1.8 | 98.1 ± 0.1 | 86.6 ± 2.0 | 65.2 ± 1.1 | 85.8 ± 1.4 | 90.8 ± 0.6 | 96.9 ± 0.3 | 96.1 ± 0.8 | 87.8 ± 0.2 |

TABLE A8: P-AUROC score for anomaly segmentation under Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Setting | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D+RGB | PatchCore+FPFH | 21.7 ± 2.3 | 21.4 ± 2.5 | 28.9 ± 3.0 | 25.0 ± 2.0 | 34.6 ± 5.5 | 35.5 ± 3.7 | 20.6 ± 2.0 | 25.6 ± 21.0 | 22.3 ± 3.3 | 26.8 ± 7.9 | 26.2 ± 1.1 |
| 3D+RGB | AST | 48.3 ± 0.6 | 49.3 ± 0.6 | 48.3 ± 0.6 | 48.6 ± 0.6 | 78.1 ± 1.0 | 49.0 ± 1.0 | 76.1 ± 1.0 | 48.7 ± 1.0 | 77.0 ± 0.6 | 49.0 ± 0.0 | 57.2 ± 0.5 |
| 3D+RGB | Shape-Guided | 89.7 ± 0.4 | 82.4 ± 0.8 | 71.6 ± 1.2 | 86.0 ± 1.5 | 78.1 ± 1.5 | 67.6 ± 2.4 | 78.4 ± 0.7 | 94.1 ± 2.0 | 81.0 ± 0.6 | 65.5 ± 3.1 | 79.5 ± 1.0 |
| 3D+RGB | M3DM | 37.5 ± 2.6 | 24.2 ± 1.8 | 30.2 ± 3.9 | 22.7 ± 2.1 | 34.8 ± 4.9 | 39.7 ± 3.0 | 21.6 ± 2.6 | 26.5 ± 21.1 | 19.6 ± 3.6 | 19.0 ± 1.3 | 27.6 ± 1.2 |
| 3D+RGB | Ours | 57.0 ± 2.5 | 87.4 ± 6.0 | 78.0 ± 2.4 | 91.6 ± 3.9 | 70.7 ± 3.3 | 82.0 ± 4.0 | 90.2 ± 2.3 | 81.8 ± 6.4 | 98.5 ± 1.2 | 60.3 ± 8.3 | 79.8 ± 0.7 |

TABLE A9: P-AUROC score for anomaly segmentation under Non-Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Setting | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D+RGB | PatchCore+FPFH | 95.7 ± 0.2 | 97.4 ± 0.1 | 91.7 ± 0.3 | 99.4 ± 0.0 | 92.9 ± 0.2 | 87.4 ± 0.5 | 96.9 ± 0.2 | 98.1 ± 0.2 | 99.2 ± 0.1 | 97.3 ± 0.2 | 95.6 ± 0.1 |
| 3D+RGB | AST | 95.1 ± 0.6 | 98.3 ± 1.0 | 91.4 ± 0.6 | 99.3 ± 0.6 | 92.0 ± 0.6 | 88.2 ± 0.6 | 96.0 ± 0.6 | 95.9 ± 0.6 | 98.8 ± 0.6 | 97.0 ± 0.6 | 95.2 ± 0.2 |
| 3D+RGB | Shape-Guided | 95.8 ± 0.1 | 98.3 ± 0.0 | 92.7 ± 0.0 | 99.0 ± 0.1 | 91.9 ± 0.3 | 89.0 ± 0.2 | 97.9 ± 0.2 | 98.5 ± 0.1 | 99.5 ± 0.1 | 98.4 ± 0.1 | 96.1 ± 0.1 |
| 3D+RGB | M3DM | 96.4 ± 0.3 | 98.3 ± 0.3 | 95.2 ± 1.9 | 99.8 ± 0.0 | 97.5 ± 0.3 | 93.3 ± 0.2 | 95.5 ± 3.1 | 98.9 ± 0.0 | 99.6 ± 0.1 | 99.4 ± 0.1 | 97.4 ± 0.5 |
| 3D+RGB | Ours | 96.9 ± 0.3 | 98.4 ± 0.0 | 95.5 ± 0.7 | 99.8 ± 0.1 | 96.7 ± 0.5 | 92.8 ± 0.7 | 97.1 ± 0.2 | 98.7 ± 0.1 | 99.7 ± 0.0 | 99.3 ± 0.3 | 97.5 ± 0.0 |

Experiments on different noise levels

To further validate the robustness of our method against noise in the training dataset, we conducted experiments by injecting different percentages of noise into the training set. Specifically, we performed experiments with 20% and 30% noisy data injected into the training dataset. The results of these experiments are presented in Tabs. A10, A11, A12, A13, A14 and A15 below. Comparing the results of injecting 10%, 20%, and 30% noise, we conclude that our method is much more robust to noise in the training dataset than previous methods.
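The noise-injection protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data pipeline: the function name `build_noisy_train_set` is hypothetical, and the noise ratio is assumed to be measured relative to the number of normal training samples (so 20% noise on 100 normal samples adds 20 anomalous ones).

```python
import random

def build_noisy_train_set(normal_ids, anomalous_ids, noise_ratio, seed=0):
    """Mix anomalous samples into a normal training set.

    Assumption: noise_ratio is relative to the number of normal samples.
    Returns the shuffled training list and the set of injected sample ids.
    """
    rng = random.Random(seed)  # fixed seed so each run is reproducible
    n_noise = round(noise_ratio * len(normal_ids))
    if n_noise > len(anomalous_ids):
        raise ValueError("not enough anomalous samples for this noise ratio")
    injected = rng.sample(list(anomalous_ids), n_noise)
    train = list(normal_ids) + injected
    rng.shuffle(train)  # the injected samples carry no label during training
    return train, set(injected)

# Example: 20% and 30% noise on a hypothetical category with 100 normal samples.
normals = [f"good_{i:03d}" for i in range(100)]
anomalies = [f"defect_{i:03d}" for i in range(40)]
for ratio in (0.2, 0.3):
    train, injected = build_noisy_train_set(normals, anomalies, ratio, seed=0)
    print(ratio, len(train), len(injected))
```

Repeating the construction with several seeds and aggregating the resulting scores gives the mean and standard deviation format used in the tables.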

TABLE A10: I-AUROC score for anomaly detection under Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 42.0 ± 1.6 | 40.8 ± 2.8 | 49.5 ± 0.3 | 53.0 ± 0.6 | 44.1 ± 1.3 | 28.2 ± 2.1 | 27.3 ± 1.2 | 25.9 ± 1.5 | 13.2 ± 1.3 | 45.1 ± 2.0 | 36.9 ± 0.3 |
| 20% | AST | 37.3 ± 1.0 | 44.8 ± 0.6 | 50.3 ± 0.6 | 59.5 ± 0.0 | 43.2 ± 0.6 | 33.2 ± 0.6 | 29.4 ± 1.0 | 31.5 ± 1.0 | 12.4 ± 0.6 | 38.1 ± 0.6 | 38.0 ± 0.1 |
| 20% | Shape-Guided | 42.3 ± 1.1 | 45.1 ± 1.6 | 53.2 ± 0.3 | 50.6 ± 0.5 | 44.6 ± 1.3 | 32.8 ± 0.7 | 29.4 ± 0.1 | 30.1 ± 0.5 | 14.0 ± 0.7 | 45.9 ± 1.3 | 38.8 ± 0.3 |
| 20% | M3DM | 45.0 ± 1.1 | 47.3 ± 1.0 | 47.6 ± 1.0 | 56.8 ± 1.9 | 51.4 ± 1.0 | 41.3 ± 0.5 | 32.7 ± 0.7 | 27.9 ± 1.5 | 25.5 ± 1.4 | 53.8 ± 1.2 | 42.9 ± 0.5 |
| 20% | Ours | 92.8 ± 1.5 | 76.4 ± 1.8 | 93.0 ± 0.5 | 85.7 ± 0.9 | 82.4 ± 0.7 | 71.4 ± 5.2 | 67.7 ± 5.0 | 60.2 ± 2.9 | 90.2 ± 1.5 | 73.3 ± 2.3 | 79.3 ± 1.0 |
| 30% | PatchCore+FPFH | 18.6 ± 1.5 | 22.2 ± 1.8 | 30.8 ± 0.8 | 39.7 ± 3.4 | 18.2 ± 1.2 | 13.4 ± 2.0 | 4.2 ± 0.4 | 4.1 ± 0.4 | 7.0 ± 0.3 | 24.9 ± 1.3 | 18.3 ± 0.7 |
| 30% | AST | 14.6 ± 0.6 | 21.4 ± 1.0 | 28.7 ± 0.6 | 38.4 ± 0.0 | 16.4 ± 0.0 | 9.3 ± 1.0 | 4.3 ± 0.6 | 5.6 ± 0.6 | 6.8 ± 0.0 | 20.2 ± 1.0 | 16.6 ± 0.1 |
| 30% | Shape-Guided | 15.7 ± 0.6 | 22.3 ± 1.2 | 32.8 ± 1.0 | 31.3 ± 0.2 | 18.3 ± 0.3 | 9.7 ± 0.9 | 4.2 ± 0.1 | 4.7 ± 0.8 | 7.2 ± 0.1 | 24.7 ± 1.5 | 17.1 ± 0.3 |
| 30% | M3DM | 30.4 ± 1.6 | 27.4 ± 1.9 | 32.5 ± 0.8 | 40.7 ± 1.4 | 36.7 ± 2.4 | 25.5 ± 3.1 | 16.0 ± 1.4 | 12.2 ± 1.2 | 19.9 ± 2.0 | 37.9 ± 1.3 | 27.9 ± 0.8 |
| 30% | Ours | 89.7 ± 1.3 | 69.1 ± 1.8 | 93.7 ± 0.7 | 83.7 ± 2.0 | 78.8 ± 2.1 | 69.9 ± 4.9 | 67.1 ± 3.4 | 55.3 ± 2.0 | 90.5 ± 0.9 | 70.0 ± 2.1 | 76.8 ± 0.6 |

TABLE A11: I-AUROC score for anomaly detection under Non-Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 84.0 ± 1.4 | 84.0 ± 0.8 | 87.5 ± 0.1 | 79.5 ± 2.5 | 93.0 ± 0.5 | 56.9 ± 3.6 | 82.6 ± 3.7 | 73.0 ± 4.8 | 90.3 ± 8.1 | 84.8 ± 3.3 | 81.6 ± 0.1 |
| 20% | AST | 82.1 ± 1.0 | 91.6 ± 0.6 | 87.6 ± 0.6 | 92.8 ± 1.0 | 93.4 ± 0.6 | 79.7 ± 1.0 | 91.1 ± 0.6 | 90.1 ± 0.6 | 88.1 ± 0.6 | 72.3 ± 0.6 | 86.9 ± 0.3 |
| 20% | Shape-Guided | 82.8 ± 2.3 | 81.8 ± 2.9 | 86.6 ± 0.5 | 79.0 ± 0.9 | 86.2 ± 1.3 | 69.1 ± 1.5 | 74.1 ± 0.3 | 72.8 ± 1.2 | 60.3 ± 3.0 | 79.8 ± 2.3 | 77.3 ± 0.7 |
| 20% | M3DM | 92.6 ± 3.4 | 76.8 ± 2.1 | 82.6 ± 1.2 | 82.4 ± 3.1 | 95.2 ± 0.8 | 75.3 ± 0.6 | 83.0 ± 4.1 | 74.1 ± 4.2 | 98.0 ± 2.4 | 84.3 ± 2.1 | 84.4 ± 1.0 |
| 20% | Ours | 97.4 ± 0.3 | 85.0 ± 4.2 | 95.1 ± 0.3 | 90.6 ± 0.9 | 94.0 ± 1.9 | 88.1 ± 1.9 | 87.4 ± 1.4 | 79.8 ± 2.4 | 98.1 ± 1.0 | 85.5 ± 0.9 | 90.1 ± 0.7 |
| 30% | PatchCore+FPFH | 78.2 ± 2.3 | 81.5 ± 2.9 | 86.5 ± 2.4 | 80.7 ± 3.6 | 95.4 ± 2.7 | 62.0 ± 5.8 | 74.1 ± 3.6 | 74.6 ± 6.8 | 96.7 ± 3.2 | 88.5 ± 4.5 | 81.8 ± 1.5 |
| 30% | AST | 73.4 ± 0.6 | 88.8 ± 0.6 | 81.8 ± 0.6 | 96.6 ± 0.6 | 94.4 ± 1.0 | 74.0 ± 0.0 | 96.6 ± 0.6 | 94.4 ± 1.0 | 73.7 ± 0.6 | 85.3 ± 0.6 | 85.7 ± 0.8 |
| 30% | Shape-Guided | 60.2 ± 2.2 | 69.2 ± 3.7 | 77.3 ± 2.3 | 68.5 ± 0.5 | 65.9 ± 1.1 | 45.5 ± 4.1 | 28.0 ± 0.4 | 31.0 ± 5.1 | 41.4 ± 0.5 | 69.2 ± 4.0 | 55.6 ± 0.9 |
| 30% | M3DM | 90.6 ± 3.5 | 85.7 ± 7.6 | 78.5 ± 2.1 | 82.4 ± 0.9 | 93.2 ± 0.9 | 84.8 ± 3.8 | 87.2 ± 2.3 | 71.5 ± 21.3 | 95.8 ± 4.1 | 85.1 ± 5.3 | 85.5 ± 3.1 |
| 30% | Ours | 97.9 ± 0.9 | 80.7 ± 6.2 | 95.6 ± 1.0 | 89.7 ± 1.4 | 94.1 ± 2.0 | 83.8 ± 1.5 | 90.2 ± 3.5 | 78.5 ± 4.7 | 98.6 ± 1.0 | 83.8 ± 6.8 | 89.3 ± 0.9 |

TABLE A12: AUPRO score for anomaly segmentation under Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 46.3 ± 1.6 | 48.9 ± 1.1 | 56.0 ± 0.3 | 58.5 ± 1.3 | 43.3 ± 1.0 | 35.3 ± 1.4 | 31.9 ± 0.5 | 35.3 ± 2.8 | 13.7 ± 0.7 | 48.2 ± 0.6 | 41.7 ± 0.2 |
| 20% | Shape-Guided | 68.5 ± 0.6 | 69.2 ± 1.3 | 90.0 ± 1.2 | 64.4 ± 1.3 | 85.1 ± 0.9 | 60.5 ± 1.6 | 82.7 ± 1.1 | 92.4 ± 0.5 | 82.4 ± 0.4 | 90.4 ± 1.0 | 78.6 ± 0.1 |
| 20% | M3DM | 45.7 ± 1.1 | 48.8 ± 1.4 | 55.9 ± 0.4 | 56.1 ± 2.3 | 43.0 ± 0.7 | 36.3 ± 1.3 | 32.3 ± 0.2 | 35.7 ± 2.9 | 13.7 ± 0.8 | 48.2 ± 0.6 | 41.6 ± 0.3 |
| 20% | Ours | 93.0 ± 0.8 | 85.5 ± 1.6 | 95.2 ± 0.6 | 86.3 ± 0.5 | 78.3 ± 2.1 | 76.8 ± 2.7 | 76.0 ± 5.0 | 74.6 ± 3.1 | 90.3 ± 0.6 | 81.3 ± 2.9 | 83.7 ± 0.5 |
| 30% | PatchCore+FPFH | 18.1 ± 1.0 | 23.6 ± 1.3 | 35.2 ± 0.6 | 38.3 ± 0.9 | 17.2 ± 0.1 | 11.7 ± 2.7 | 5.3 ± 1.3 | 6.2 ± 1.0 | 7.0 ± 0.8 | 25.0 ± 0.1 | 18.8 ± 0.3 |
| 30% | Shape-Guided | 70.9 ± 0.3 | 64.9 ± 1.9 | 89.1 ± 0.3 | 55.3 ± 1.4 | 83.2 ± 0.1 | 56.6 ± 2.2 | 85.6 ± 0.5 | 93.7 ± 0.3 | 82.6 ± 0.4 | 89.7 ± 1.3 | 77.2 ± 0.1 |
| 30% | M3DM | 18.7 ± 1.0 | 24.0 ± 1.0 | 35.3 ± 0.6 | 39.2 ± 0.6 | 17.7 ± 0.2 | 18.2 ± 1.7 | 5.7 ± 1.4 | 7.1 ± 0.7 | 7.6 ± 0.6 | 25.1 ± 0.2 | 19.9 ± 0.2 |
| 30% | Ours | 90.7 ± 1.2 | 81.5 ± 1.4 | 94.8 ± 0.3 | 84.5 ± 1.5 | 75.4 ± 2.0 | 76.5 ± 3.4 | 75.2 ± 1.8 | 71.4 ± 1.8 | 90.4 ± 0.6 | 80.6 ± 2.8 | 82.1 ± 0.4 |

TABLE A13: AUPRO score for anomaly segmentation under Non-Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 97.0 ± 0.2 | 96.8 ± 0.6 | 96.8 ± 0.0 | 94.8 ± 1.6 | 91.6 ± 0.9 | 89.7 ± 0.4 | 96.6 ± 0.5 | 95.6 ± 0.2 | 96.6 ± 1.6 | 95.5 ± 1.1 | 95.1 ± 0.3 |
| 20% | Shape-Guided | 91.6 ± 1.2 | 89.4 ± 0.7 | 96.0 ± 0.5 | 88.2 ± 0.8 | 93.1 ± 0.8 | 84.9 ± 6.0 | 90.1 ± 1.5 | 95.1 ± 1.0 | 84.4 ± 4.7 | 96.0 ± 1.2 | 90.9 ± 1.2 |
| 20% | M3DM | 93.8 ± 1.3 | 95.6 ± 0.8 | 96.5 ± 0.1 | 88.1 ± 1.3 | 92.6 ± 2.1 | 80.0 ± 0.9 | 97.1 ± 0.2 | 95.3 ± 0.7 | 97.9 ± 0.5 | 97.0 ± 0.5 | 93.4 ± 0.3 |
| 20% | Ours | 96.5 ± 0.6 | 95.6 ± 0.2 | 97.7 ± 0.1 | 92.2 ± 0.5 | 92.6 ± 1.7 | 90.1 ± 0.7 | 97.3 ± 0.1 | 96.0 ± 0.2 | 97.6 ± 1.0 | 96.6 ± 0.7 | 95.2 ± 0.3 |
| 30% | PatchCore+FPFH | 96.6 ± 0.9 | 96.3 ± 1.9 | 96.8 ± 1.0 | 94.6 ± 0.9 | 93.1 ± 1.2 | 87.9 ± 4.0 | 97.0 ± 0.6 | 92.3 ± 7.5 | 97.5 ± 1.4 | 97.6 ± 0.3 | 95.0 ± 0.3 |
| 30% | Shape-Guided | 73.7 ± 3.3 | 79.4 ± 1.9 | 93.6 ± 0.3 | 82.4 ± 2.1 | 88.4 ± 2.6 | 69.3 ± 0.2 | 72.6 ± 3.4 | 88.7 ± 3.3 | 81.0 ± 5.7 | 93.7 ± 1.9 | 82.3 ± 0.7 |
| 30% | M3DM | 94.3 ± 2.7 | 97.2 ± 0.7 | 96.4 ± 0.9 | 87.5 ± 0.4 | 92.5 ± 1.6 | 83.6 ± 6.5 | 97.4 ± 0.1 | 93.3 ± 5.5 | 97.6 ± 1.2 | 96.9 ± 1.1 | 93.7 ± 0.6 |
| 30% | Ours | 96.6 ± 0.7 | 95.0 ± 0.4 | 97.7 ± 0.1 | 92.3 ± 0.8 | 93.9 ± 0.5 | 89.5 ± 2.6 | 97.7 ± 0.4 | 95.4 ± 0.4 | 97.5 ± 1.3 | 96.1 ± 1.2 | 95.2 ± 0.2 |

TABLE A14: P-AUROC score for anomaly segmentation under Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 50.2 ± 2.1 | 52.4 ± 2.6 | 55.8 ± 3.0 | 62.3 ± 2.0 | 46.3 ± 0.6 | 39.8 ± 1.0 | 32.9 ± 2.1 | 36.3 ± 2.1 | 18.3 ± 6.3 | 49.4 ± 0.8 | 44.4 ± 0.8 |
| 20% | AST | 83.1 ± 0.0 | 91.9 ± 0.6 | 95.8 ± 1.0 | 83.6 ± 1.0 | 89.1 ± 0.6 | 84.6 ± 0.6 | 88.8 ± 0.6 | 88.0 ± 0.6 | 89.0 ± 0.6 | 88.9 ± 1.0 | 88.8 ± 0.2 |
| 20% | Shape-Guided | 89.9 ± 0.3 | 91.0 ± 0.5 | 97.0 ± 0.1 | 86.5 ± 0.2 | 85.1 ± 0.5 | 86.4 ± 0.6 | 84.9 ± 0.3 | 81.3 ± 5.8 | 94.7 ± 0.4 | 91.1 ± 5.6 | 88.8 ± 0.5 |
| 20% | M3DM | 52.4 ± 2.3 | 53.0 ± 3.1 | 56.1 ± 2.7 | 65.8 ± 1.9 | 47.3 ± 1.3 | 51.3 ± 0.7 | 34.4 ± 2.3 | 37.0 ± 2.1 | 18.7 ± 6.0 | 50.7 ± 0.4 | 46.7 ± 0.8 |
| 20% | Ours | 97.8 ± 0.4 | 91.4 ± 2.1 | 96.4 ± 0.6 | 94.3 ± 0.1 | 85.5 ± 2.4 | 80.8 ± 2.8 | 81.4 ± 2.1 | 78.4 ± 3.0 | 97.8 ± 0.4 | 86.7 ± 1.9 | 89.0 ± 0.4 |
| 30% | PatchCore+FPFH | 24.0 ± 4.5 | 26.8 ± 1.3 | 34.5 ± 2.8 | 40.6 ± 3.4 | 21.3 ± 2.6 | 17.4 ± 0.9 | 8.8 ± 1.6 | 8.0 ± 2.1 | 8.2 ± 3.2 | 25.8 ± 2.7 | 21.6 ± 1.4 |
| 30% | AST | 15.3 ± 0.0 | 21.4 ± 0.0 | 29.3 ± 0.6 | 37.8 ± 0.6 | 16.4 ± 1.0 | 8.9 ± 0.6 | 3.6 ± 1.0 | 5.3 ± 0.0 | 6.8 ± 0.0 | 19.9 ± 0.6 | 16.5 ± 0.1 |
| 30% | Shape-Guided | 90.7 ± 0.7 | 89.4 ± 0.5 | 96.6 ± 0.3 | 83.0 ± 1.0 | 80.9 ± 5.8 | 80.8 ± 4.8 | 90.0 ± 5.1 | 81.6 ± 5.7 | 94.5 ± 0.2 | 87.8 ± 0.5 | 87.5 ± 0.9 |
| 30% | M3DM | 26.3 ± 4.5 | 27.3 ± 1.7 | 35.0 ± 2.3 | 48.3 ± 5.0 | 22.6 ± 3.1 | 35.4 ± 1.6 | 9.9 ± 1.4 | 9.0 ± 1.9 | 7.9 ± 3.5 | 28.4 ± 2.4 | 25.0 ± 1.4 |
| 30% | Ours | 96.6 ± 0.5 | 89.3 ± 1.5 | 96.5 ± 0.3 | 92.7 ± 1.5 | 81.9 ± 2.8 | 79.9 ± 2.7 | 81.9 ± 0.7 | 74.8 ± 1.3 | 97.8 ± 0.4 | 86.8 ± 1.9 | 87.8 ± 0.4 |

TABLE A15: P-AUROC score for anomaly segmentation under Non-Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 99.5 ± 0.0 | 99.1 ± 0.3 | 98.2 ± 0.1 | 98.7 ± 0.1 | 98.0 ± 0.8 | 97.6 ± 0.1 | 99.3 ± 0.1 | 98.0 ± 0.3 | 99.7 ± 0.2 | 99.5 ± 0.1 | 98.8 ± 0.0 |
| 20% | AST | 97.9 ± 0.6 | 97.8 ± 0.0 | 99.4 ± 1.0 | 92.2 ± 0.6 | 93.1 ± 0.0 | 99.2 ± 0.6 | 99.5 ± 1.0 | 99.8 ± 0.6 | 97.5 ± 0.0 | 98.6 ± 1.0 | 97.5 ± 0.1 |
| 20% | Shape-Guided | 97.5 ± 1.7 | 97.5 ± 0.5 | 98.9 ± 0.3 | 95.2 ± 0.3 | 97.8 ± 0.4 | 93.3 ± 3.3 | 97.1 ± 0.3 | 98.9 ± 0.2 | 95.6 ± 1.4 | 99.2 ± 0.7 | 97.1 ± 0.4 |
| 20% | M3DM | 99.0 ± 0.2 | 98.8 ± 0.3 | 98.1 ± 0.1 | 96.8 ± 0.1 | 97.8 ± 1.0 | 95.4 ± 0.1 | 99.4 ± 0.1 | 99.0 ± 0.2 | 99.8 ± 0.1 | 99.5 ± 0.2 | 98.4 ± 0.0 |
| 20% | Ours | 99.5 ± 0.1 | 99.0 ± 0.1 | 99.6 ± 0.1 | 97.3 ± 0.3 | 97.9 ± 0.6 | 97.7 ± 0.2 | 99.5 ± 0.1 | 99.1 ± 0.1 | 99.8 ± 0.2 | 99.3 ± 0.2 | 98.9 ± 0.1 |
| 30% | PatchCore+FPFH | 99.6 ± 0.1 | 99.3 ± 0.1 | 98.2 ± 1.3 | 98.5 ± 0.2 | 98.1 ± 1.3 | 98.3 ± 0.5 | 99.5 ± 0.2 | 95.3 ± 7.8 | 99.6 ± 0.6 | 99.6 ± 0.1 | 98.6 ± 0.7 |
| 30% | AST | 91.0 ± 1.0 | 96.3 ± 0.6 | 99.1 ± 1.0 | 92.4 ± 0.6 | 95.6 ± 0.0 | 97.4 ± 0.6 | 99.7 ± 0.6 | 100.2 ± 0.6 | 97.7 ± 0.6 | 98.6 ± 1.0 | 96.8 ± 0.2 |
| 30% | Shape-Guided | 93.3 ± 1.3 | 93.8 ± 1.0 | 98.2 ± 0.3 | 90.5 ± 1.2 | 97.0 ± 1.1 | 85.6 ± 2.8 | 92.7 ± 0.0 | 96.7 ± 0.5 | 92.8 ± 1.9 | 98.6 ± 0.7 | 93.9 ± 0.2 |
| 30% | M3DM | 99.1 ± 0.2 | 99.3 ± 0.4 | 98.0 ± 1.4 | 96.0 ± 0.8 | 97.6 ± 1.1 | 96.6 ± 1.9 | 99.6 ± 0.1 | 98.2 ± 2.0 | 99.7 ± 0.5 | 99.5 ± 0.3 | 98.4 ± 0.2 |
| 30% | Ours | 99.5 ± 0.2 | 98.7 ± 0.3 | 99.6 ± 0.1 | 97.1 ± 0.6 | 98.5 ± 0.3 | 97.4 ± 1.3 | 99.6 ± 0.1 | 98.8 ± 0.3 | 99.6 ± 0.6 | 99.2 ± 0.4 | 98.8 ± 0.0 |