Out-of-Distribution Detection with Deep Nearest Neighbors

Yiyou Sun    Yifei Ming    Xiaojin Zhu    Yixuan Li

Out-of-distribution (OOD) detection is a critical task for deploying machine learning models in the open world. Distance-based methods have demonstrated promise, where testing samples are detected as OOD if they are relatively far away from in-distribution (ID) data. However, prior methods impose a strong distributional assumption on the underlying feature space, which may not always hold. In this paper, we explore the efficacy of non-parametric nearest-neighbor distance for OOD detection, which has been largely overlooked in the literature. Unlike prior works, our method does not impose any distributional assumption, hence providing stronger flexibility and generality. We demonstrate the effectiveness of nearest-neighbor-based OOD detection on several benchmarks and establish superior performance. Under the same model trained on ImageNet-1k, our method substantially reduces the false positive rate (FPR@TPR95) by 24.77% compared to a strong baseline SSD+, which uses the parametric Mahalanobis distance for detection. Code is available at https://github.com/deeplearning-wisc/knn-ood.


1 Introduction

Modern machine learning models deployed in the open world often struggle with out-of-distribution (OOD) inputs—samples from a different distribution that the network has not been exposed to during training, and therefore should not be predicted at test time. A reliable classifier should not only accurately classify known in-distribution (ID) samples, but also identify as “unknown” any OOD input. This gives rise to the importance of OOD detection, which determines whether an input is ID or OOD and enables the model to take precautions.

A rich line of OOD detection algorithms has been developed recently, among which distance-based methods demonstrated promise (Lee et al., 2018; Tack et al., 2020; Sehwag et al., 2021). Distance-based methods leverage feature embeddings extracted from a model, and operate under the assumption that the test OOD samples are relatively far away from the ID data. For example, Lee et al. modeled the feature embedding space as a mixture of multivariate Gaussian distributions, and used the maximum Mahalanobis distance (Mahalanobis, 1936) to all class centroids for OOD detection. However, all these approaches make a strong distributional assumption of the underlying feature space being class-conditional Gaussian. As we verify, the learned embeddings can fail the Henze-Zirkler multivariate normality test (Henze & Zirkler, 1990). This limitation leads to the open question:

Can we leverage the non-parametric nearest neighbor approach for OOD detection?

Unlike prior works, the non-parametric approach does not impose any distributional assumption about the underlying feature space, hence providing stronger flexibility and generality. Despite its simplicity, the nearest neighbor approach has received scant attention. Looking at the literature on OOD detection in the past several years, there has not been any work that demonstrated the efficacy of a non-parametric nearest neighbor approach for this problem. This suggests that making the seemingly simple idea work is non-trivial. Indeed, we found that simply using the nearest neighbor distance derived from the feature embedding of a standard classification model is not performant.

Figure 1: Illustration of our framework using nearest neighbors for OOD detection. KNN performs non-parametric level set estimation, partitioning the data into two sets (ID vs. OOD) based on the k-th nearest neighbor distance. The distances are estimated from the penultimate feature embeddings, visualized via UMAP (McInnes et al., 2018). Models use a ResNet-18 backbone (He et al., 2016) trained with cross-entropy loss (left) vs. contrastive loss (right). The in-distribution data is CIFAR-10 (shown in non-gray colors) and the OOD data is LSUN (shown in gray). The shaded gray area in the density distribution plot indicates OOD samples that are misidentified as ID data.

In this paper, we challenge the status quo by presenting the first study exploring and demonstrating the efficacy of the non-parametric nearest-neighbor distance for OOD detection. To detect OOD samples, we compute the k-th nearest neighbor (KNN) distance between the embedding of test input and the embeddings of the training set and use a threshold-based criterion to determine if the input is OOD or not. In a nutshell, we perform non-parametric level set estimation, partitioning the data into two sets (ID vs. OOD) based on the deep k-nearest neighbor distance. KNN offers compelling advantages of being: (1) distributional assumption free, (2) OOD-agnostic (i.e., the distance threshold is estimated on the ID data only, and does not rely on information of unknown data), (3) easy-to-use (i.e., no need to calculate the inverse of the covariance matrix which can be numerically unstable), and (4) model-agnostic (i.e., the testing procedure is applicable to different model architectures and training losses).

Our exploration leads to both empirical effectiveness (Section 4 & 5) and theoretical justification (Section 6). By studying the role of representation space, we show that a compact and normalized feature space is the key to the success of the nearest neighbor approach for OOD detection. Extensive experiments show that KNN outperforms the parametric approach, and scales well to the large-scale dataset. Computationally, modern implementations of approximate nearest neighbor search allow us to do this in milliseconds even when the database contains billions of images (Johnson et al., 2019). On a challenging ImageNet OOD detection benchmark (Huang & Li, 2021), our KNN-based approach achieves superior performance under a similar inference speed as the baseline methods. The overall simplicity and effectiveness of KNN make it appealing for real-world applications. We summarize our contributions below:

  1. We present the first study exploring and demonstrating the efficacy of non-parametric density estimation with nearest neighbors for OOD detection, a simple and flexible yet overlooked approach in the literature. We hope our work draws attention to the strong promise of the non-parametric approach, which obviates data assumptions on the feature space.

  2. We demonstrate the superior performance of the KNN-based method on several OOD detection benchmarks, different model architectures (including CNNs and ViTs), and different training losses. Under the same model trained on ImageNet-1k, our method substantially reduces the false positive rate (FPR@TPR95) by 24.77% compared to a strong baseline SSD+ (Sehwag et al., 2021), which uses a parametric approach (i.e., Mahalanobis distance (Lee et al., 2018)) for detection.

  3. We offer new insights on the key components to make KNN effective in practice, including feature normalization and a compact representation space. Our findings are supported by extensive ablations and experiments. We believe these insights are valuable to the community in carrying out future research.

  4. We provide theoretical analysis, showing that KNN-based OOD detection can reject inputs equivalent to the Bayes optimal estimator. By modeling the nearest neighbor distance in the feature space, our theory (1) directly connects to our method which also operates in the feature space, and (2) complements our experiments by considering the universality of OOD data.

2 Preliminaries

We consider supervised multi-class classification, where $\mathcal{X}$ denotes the input space and $\mathcal{Y}=\{1,2,\ldots,C\}$ denotes the label space. The training set $\mathbb{D}_{in}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{n}$ is drawn i.i.d. from the joint data distribution $P_{\mathcal{X}\mathcal{Y}}$. Let $\mathcal{P}_{in}$ denote the marginal distribution on $\mathcal{X}$. Let $f:\mathcal{X}\mapsto\mathbb{R}^{|\mathcal{Y}|}$ be a neural network trained on samples drawn from $P_{\mathcal{X}\mathcal{Y}}$ to output a logit vector, which is used to predict the label of an input sample.

Out-of-distribution detection. When deploying a machine learning model in the real world, a reliable classifier should not only accurately classify known in-distribution (ID) samples, but also identify as “unknown” any OOD input. This can be achieved by having an OOD detector in tandem with the classification model $f$.

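OOD detection can be formulated as a binary classification problem. At test time, the goal of OOD detection is to decide whether a sample $\mathbf{x}\in\mathcal{X}$ is from $\mathcal{P}_{in}$ (ID) or not (OOD). The decision can be made via a level set estimation:

$$G_{\lambda}(\mathbf{x})=\begin{cases}\text{ID} & S(\mathbf{x})\ge\lambda\\ \text{OOD} & S(\mathbf{x})<\lambda\end{cases}$$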

where samples with higher scores S(𝐱) are classified as ID and vice versa, and λ is the threshold. In practice, OOD is often defined by a distribution that simulates unknowns encountered during deployment time, such as samples from an irrelevant distribution whose label set has no intersection with 𝒴 and therefore should not be predicted by the model.

3 Deep Nearest Neighbor for OOD detection

In this section, we describe our approach using the deep k-Nearest Neighbor (KNN) for OOD detection. We illustrate our approach in Figure 1; at a high level, it can be categorized as a distance-based method. Distance-based methods leverage feature embeddings extracted from a model and operate under the assumption that the test OOD samples are relatively far away from the ID data. Previous distance-based OOD detection methods employed parametric density estimation and modeled the feature embedding space as a mixture of multivariate Gaussian distributions (Lee et al., 2018). However, such an approach makes a strong distributional assumption about the learned feature space, which may not necessarily hold. (We verified this by performing the Henze-Zirkler multivariate normality test (Henze & Zirkler, 1990) on the embeddings. The testing results show that the feature vectors for each class are not normally distributed at the significance level of 0.05.)

In this paper, we instead explore the efficacy of non-parametric density estimation using nearest neighbors for OOD detection. Despite its simplicity, the KNN approach has not been systematically explored or compared in most current OOD detection papers. Specifically, we compute the k-th nearest neighbor distance between the embedding of each test image and the training set, and use a simple threshold-based criterion to determine whether an input is OOD. Importantly, we use the normalized penultimate feature $\mathbf{z}=\phi(\mathbf{x})/\|\phi(\mathbf{x})\|_2$ for OOD detection, where $\phi:\mathcal{X}\mapsto\mathbb{R}^{m}$ is a feature encoder. Denote the embedding set of training data as $\mathbb{Z}_n=(\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_n)$. During testing, we derive the normalized feature vector $\mathbf{z}^*$ for a test sample $\mathbf{x}^*$, and calculate the Euclidean distances $\|\mathbf{z}_i-\mathbf{z}^*\|_2$ with respect to the embedding vectors $\mathbf{z}_i\in\mathbb{Z}_n$. We reorder $\mathbb{Z}_n$ according to the increasing distance $\|\mathbf{z}_i-\mathbf{z}^*\|_2$. Denote the reordered data sequence as $\mathbb{Z}_n'=(\mathbf{z}_{(1)},\mathbf{z}_{(2)},\ldots,\mathbf{z}_{(n)})$. The decision function for OOD detection is given by:

$$G(\mathbf{z}^*;k)=\mathbf{1}\{-r_k(\mathbf{z}^*)\ge\lambda\},$$

where $r_k(\mathbf{z}^*)=\|\mathbf{z}^*-\mathbf{z}_{(k)}\|_2$ is the distance to the k-th nearest neighbor (k-NN) and $\mathbf{1}\{\cdot\}$ is the indicator function. The threshold $\lambda$ is typically chosen so that a high fraction of ID data (e.g., 95%) is correctly classified. The threshold does not depend on OOD data.

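Algorithm 1: OOD Detection with Deep Nearest Neighbors
Input: Training dataset $\mathbb{D}_{in}$, pre-trained neural network encoder $\phi$, test sample $\mathbf{x}^*$, threshold $\lambda$
For $\mathbf{x}_i$ in the training data $\mathbb{D}_{in}$, collect feature vectors $\mathbb{Z}_n=(\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_n)$
Testing Stage:
  Given a test sample, calculate the feature vector $\mathbf{z}^*=\phi(\mathbf{x}^*)/\|\phi(\mathbf{x}^*)\|_2$
  Reorder $\mathbb{Z}_n$ according to the increasing value of $\|\mathbf{z}_i-\mathbf{z}^*\|_2$ as $\mathbb{Z}_n'=(\mathbf{z}_{(1)},\mathbf{z}_{(2)},\ldots,\mathbf{z}_{(n)})$
Output: OOD detection decision $\mathbf{1}\{-\|\mathbf{z}^*-\mathbf{z}_{(k)}\|_2\ge\lambda\}$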
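The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the function names are hypothetical, the features are assumed to be already extracted by the encoder, the search is brute force rather than an indexed search, and the threshold is taken as an empirical quantile of held-out ID scores.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale every row to unit L2 norm (z = phi(x) / ||phi(x)||_2)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

def kth_nn_distance(queries, bank, k):
    """Euclidean distance from each query row to its k-th nearest row in bank."""
    # Pairwise squared distances: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (
        (queries ** 2).sum(axis=1)[:, None]
        + (bank ** 2).sum(axis=1)[None, :]
        - 2.0 * queries @ bank.T
    )
    sq = np.maximum(sq, 0.0)  # guard against tiny negative rounding errors
    kth_sq = np.partition(sq, k - 1, axis=1)[:, k - 1]
    return np.sqrt(kth_sq)

def knn_scores(test_features, train_features, k):
    """Score S = -r_k(z*); higher means more ID-like."""
    bank = l2_normalize(train_features)
    z = l2_normalize(test_features)
    return -kth_nn_distance(z, bank, k)

def fit_threshold(id_scores, tpr=0.95):
    """Assumed rule: lambda is the (1 - tpr) quantile of held-out ID scores."""
    return float(np.quantile(id_scores, 1.0 - tpr))

def decide(scores, lam):
    """Return 1 for ID (score >= lambda) and 0 for OOD."""
    return (scores >= lam).astype(int)

# Synthetic demo: ID features cluster in one direction, OOD in another.
rng = np.random.default_rng(0)
train = rng.normal(loc=[5.0, 0.0, 0.0], scale=0.3, size=(500, 3))
id_val = rng.normal(loc=[5.0, 0.0, 0.0], scale=0.3, size=(200, 3))
ood = rng.normal(loc=[0.0, 5.0, 0.0], scale=0.3, size=(200, 3))

lam = fit_threshold(knn_scores(id_val, train, k=10))
id_decisions = decide(knn_scores(id_val, train, k=10), lam)
ood_decisions = decide(knn_scores(ood, train, k=10), lam)
print("ID accepted:", id_decisions.mean(), "OOD accepted:", ood_decisions.mean())
```

Only ID data enters `fit_threshold`, which mirrors the OOD-agnostic property described in the text. The brute-force distance matrix is adequate for small feature banks only.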
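Table 1: Results on CIFAR-10. Comparison with competitive OOD detection methods. All methods are based on a discriminative model trained on ID data only, without using outlier data. ↑ indicates larger values are better and ↓ indicates smaller values are better. FPR denotes FPR95.

| Method | SVHN FPR↓ | SVHN AUROC↑ | LSUN FPR↓ | LSUN AUROC↑ | iSUN FPR↓ | iSUN AUROC↑ | Texture FPR↓ | Texture AUROC↑ | Places365 FPR↓ | Places365 AUROC↑ | Average FPR↓ | Average AUROC↑ | ID ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Without Contrastive Learning | | | | | | | | | | | | | |
| MSP | 59.66 | 91.25 | 45.21 | 93.80 | 54.57 | 92.12 | 66.45 | 88.50 | 62.46 | 88.64 | 57.67 | 90.86 | 94.21 |
| ODIN | 53.78 | 91.30 | 10.93 | 97.93 | 28.44 | 95.51 | 55.59 | 89.47 | 43.40 | 90.98 | 38.43 | 93.04 | 94.21 |
| Energy | 54.41 | 91.22 | 10.19 | 98.05 | 27.52 | 95.59 | 55.23 | 89.37 | 42.77 | 91.02 | 38.02 | 93.05 | 94.21 |
| GODIN | 18.72 | 96.10 | 11.52 | 97.12 | 30.02 | 94.02 | 33.58 | 92.20 | 55.25 | 85.50 | 29.82 | 92.97 | 93.64 |
| Mahalanobis | 9.24 | 97.80 | 67.73 | 73.61 | 6.02 | 98.63 | 23.21 | 92.91 | 83.50 | 69.56 | 37.94 | 86.50 | 94.21 |
| KNN (ours) | 27.97 | 95.48 | 18.50 | 96.84 | 24.68 | 95.52 | 26.74 | 94.96 | 47.84 | 89.93 | 29.15 | 94.55 | 94.21 |
| With Contrastive Learning | | | | | | | | | | | | | |
| CSI | 37.38 | 94.69 | 5.88 | 98.86 | 10.36 | 98.01 | 28.85 | 94.87 | 38.31 | 93.04 | 24.16 | 95.89 | 94.38 |
| SSD+ | 1.51 | 99.68 | 6.09 | 98.48 | 33.60 | 95.16 | 12.98 | 97.70 | 28.41 | 94.72 | 16.52 | 97.15 | 95.07 |
| KNN+ (ours) | 2.42 | 99.52 | 1.78 | 99.48 | 20.06 | 96.74 | 8.09 | 98.56 | 23.02 | 95.36 | 11.07 | 97.93 | 95.07 |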

We summarize our approach in Algorithm 1. Notably, KNN-based OOD detection offers several compelling advantages:

  1. Distributional assumption free: The non-parametric nearest neighbor approach does not impose distributional assumptions about the underlying feature space. KNN therefore provides stronger flexibility and generality, and is applicable even when the feature space does not conform to a mixture of Gaussians.

  2. OOD-agnostic: The testing procedure does not rely on information about unknown data. The distance threshold is estimated on the ID data only.

  3. Easy-to-use: Modern implementations of approximate nearest neighbor search allow us to do this in milliseconds even when the database contains billions of images (Johnson et al., 2019). In contrast, Mahalanobis distance requires calculating the inverse of the covariance matrix, which can be numerically unstable.

  4. Model-agnostic: The testing procedure applies to a variety of model architectures, including CNNs and more recent Transformer-based ViT models (Dosovitskiy et al., 2021). Moreover, we will show that KNN is agnostic to the training procedure as well, and is compatible with models trained under different loss functions (e.g., cross-entropy loss and contrastive loss).

We proceed to show the efficacy of the KNN-based OOD detection approach in Section 4.

4 Experiments

The goal of our experimental evaluation is to answer the following questions: (1) How does KNN fare against the parametric counterpart such as Mahalanobis distance for OOD detection? (2) Can KNN scale to a more challenging task when the training data is large-scale (e.g., ImageNet)? (3) Is KNN-based OOD detection effective under different model architectures and training objectives? (4) How do various design choices affect the performance?

Evaluation metrics

We report the following metrics: (1) the false positive rate (FPR95) of OOD samples when the true positive rate of ID samples is at 95%, (2) the area under the receiver operating characteristic curve (AUROC), (3) ID classification accuracy (ID ACC), and (4) per-image inference time (in milliseconds, averaged across test images).
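As a rough illustration of the first two metrics, the sketch below computes FPR95 and AUROC from two arrays of detection scores, where higher scores mean "more ID-like". The function names are hypothetical, and the threshold convention (an empirical quantile with NumPy's default interpolation) is an assumption; evaluation code in practice may differ in such details.

```python
import numpy as np

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID when `tpr` of ID samples are accepted."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold below which only (1 - tpr) of the ID scores fall.
    lam = np.quantile(id_scores, 1.0 - tpr)
    return float(np.mean(ood_scores >= lam))

def auroc(id_scores, ood_scores):
    """Probability that a random ID score exceeds a random OOD score (ties count 1/2)."""
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    greater = (id_col > ood_row).mean()
    ties = (id_col == ood_row).mean()
    return float(greater + 0.5 * ties)

# Toy usage: ID scores 0..99, four OOD scores of which two overlap the ID range.
print(fpr_at_tpr(np.arange(100.0), [-1.0, -2.0, 50.0, 60.0]))
print(auroc([3.0, 4.0], [1.0, 2.0]))
```

The pairwise formulation of AUROC is quadratic in memory, so it suits small examples; a rank-based computation would be preferable for full test sets.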

Training losses

In our experiments, we aim to show that KNN-based OOD detection is agnostic to the training procedure, and is compatible with models trained under different losses. We consider two types of loss functions, with and without contrastive learning respectively. We employ (1) cross-entropy loss, which is the most commonly used training objective in classification, and (2) supervised contrastive learning (SupCon) (Khosla et al., 2020), a recent development in representation learning that leverages the label information by aligning samples belonging to the same class in the embedding space.
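For reference, the SupCon objective of Khosla et al. (2020) can be written as below. The notation is introduced here for illustration and is not taken from this paper: $I$ indexes the augmented samples in a batch, $A(i)=I\setminus\{i\}$, $P(i)\subseteq A(i)$ is the set of samples sharing the label of sample $i$, $\mathbf{h}_i$ is the L2-normalized projection-head output, and $\tau$ is the temperature reported in the experiment details.

```latex
\mathcal{L}_{\text{SupCon}}
  = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
    \log \frac{\exp(\mathbf{h}_i \cdot \mathbf{h}_p / \tau)}
              {\sum_{a \in A(i)} \exp(\mathbf{h}_i \cdot \mathbf{h}_a / \tau)}
```

Minimizing this loss pulls same-class embeddings together and pushes other samples apart, which is the alignment effect described above.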

Remark on the implementation

All of the experiments are based on PyTorch (Paszke et al., 2019). Code is made publicly available online. We use Faiss (Johnson et al., 2019), a library for efficient nearest neighbor search. Specifically, we use faiss.IndexFlatL2 as the indexing method with Euclidean distance. In practice, we pre-compute the embeddings for all images and store them in a key-value map to make KNN search efficient. The embedding vectors for ID data only need to be extracted once after the training is completed.

4.1 Evaluation on Common Benchmarks


We begin with the CIFAR benchmarks that are routinely used in literature. We use the standard split with 50,000 training images and 10,000 test images. We evaluate the methods on common OOD datasets: Textures (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), Places365 (Zhou et al., 2017), LSUN-C (Yu et al., 2015), and iSUN (Xu et al., 2015). All images are of size 32×32.

Experiment details

We use ResNet-18 as the backbone for CIFAR-10. Following the original settings in Khosla et al., models with SupCon loss are trained for 500 epochs, with a batch size of 1024. The temperature $\tau$ is 0.1. The dimension of the penultimate feature where we perform the nearest neighbor search is 512. The dimension of the projection head is 128. We use the cosine annealing learning rate (Loshchilov & Hutter, 2016) starting at 0.5. We use k=50 for CIFAR-10 and k=200 for CIFAR-100, which is selected from k={1,10,20,50,100,200,500,1000,3000,5000} using the validation method in Hendrycks et al. (2019). We train the models using stochastic gradient descent with momentum 0.9 and weight decay $10^{-4}$. The model without contrastive learning is trained for 100 epochs. The initial learning rate is 0.1 and decays by a factor of 10 at epochs 50, 75, and 90.

Table 2: Evaluation (FPR95) on hard OOD detection tasks. Model is trained on CIFAR-10 with SupCon loss.

| Method | LSUN-FIX | ImageNet-FIX | ImageNet-R | CIFAR-100 |
|---|---|---|---|---|
| SSD+ | 29.86 | 32.26 | 45.62 | 45.50 |
| KNN+ (ours) | 21.52 | 25.92 | 29.92 | 38.83 |

Nearest neighbor distance achieves superior performance

We present results in Table 1, where non-parametric KNN approach shows favorable performance. Our comparison covers an extensive collection of competitive methods in the literature. For clarity, we divide the baseline methods into two categories: trained with and without contrastive losses. Several baselines derive OOD scores from a model trained with common softmax cross-entropy (CE) loss, including MSP (Hendrycks & Gimpel, 2017), ODIN (Liang et al., 2018), Mahalanobis (Lee et al., 2018), and Energy (Liu et al., 2020). GODIN (Hsu et al., 2020) is trained using a DeConf-C loss, which does not involve contrastive loss either. For methods involving contrastive losses, we use the same network backbone architecture and embedding dimension, while only varying the training objective. These methods include CSI (Tack et al., 2020) and SSD+ (Sehwag et al., 2021). For terminology clarity, KNN refers to our method trained with CE loss, and KNN+ refers to the variant trained with SupCon loss. We highlight two groups of comparisons:

  • KNN vs. Mahalanobis (without contrastive learning): Under the same model trained with cross-entropy (CE) loss, our method achieves an average FPR95 of 29.15%, compared to 37.94% for Mahalanobis distance. The performance gain demonstrates the advantage of KNN over the parametric Mahalanobis distance.

  • KNN+ vs. SSD+ (with contrastive loss): KNN+ and SSD+ are fundamentally different in their OOD detection mechanisms, although both benefit from the contrastively learned representations. SSD+ models the feature embedding space as a multivariate Gaussian distribution for each class, and uses Mahalanobis distance (Lee et al., 2018) for OOD detection. Under the same model trained with Supervised Contrastive Learning (SupCon) loss, our method with the nearest neighbor distance reduces the average FPR95 by 5.45%, which is a relative reduction in error of 32.99%. This further suggests the advantage of using nearest neighbors without making any distributional assumptions on the feature embedding space.
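The absolute and relative reductions quoted for KNN+ vs. SSD+ follow directly from the average FPR95 values in Table 1 (16.52 for SSD+ and 11.07 for KNN+):

```latex
16.52 - 11.07 = 5.45,
\qquad
\frac{16.52 - 11.07}{16.52} \approx 0.3299 = 32.99\%
```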

The above comparison suggests that the nearest neighbor approach is compatible with models trained both with and without contrastive learning. In addition, KNN is also simpler to use and implement than CSI, which relies on sophisticated data augmentations and ensembling in testing. Lastly, as a result of the improved embedding quality, the ID accuracy of the model trained with SupCon loss is improved by 0.86% on CIFAR-10 and 2.45% on ImageNet compared to training with the CE loss. Due to space constraints, we provide results on DenseNet (Huang et al., 2017) in Appendix C.

Contrastively learned representation helps

While contrastive learning has been extensively studied in recent literature, its role remains untapped when coupled with a non-parametric approach (such as nearest neighbors) for OOD detection. We examine the effect of using supervised contrastive loss for KNN-based OOD detection. We provide both qualitative and quantitative evidence, highlighting advantages over the standard softmax cross-entropy (CE) loss. (1) We visualize the learned feature embeddings in Figure 1 using UMAP (McInnes et al., 2018), where the colors encode different class labels. A salient observation is that the representation with SupCon is more distinguishable and compact than the representation obtained from the CE loss. The high-quality embedding space indeed confers benefits for KNN-based OOD detection. (2) Beyond visualization, we also quantitatively compare the performance of KNN-based OOD detection using embeddings trained with SupCon vs CE. As shown in Table 1, KNN+ with contrastively learned representations reduces the FPR95 on all test OOD datasets compared to using embeddings from the model trained with CE loss.

Comparison with other non-parametric methods

In Table 3, we compare the nearest neighbor approach with other non-parametric methods. For a fair comparison, we use the same embeddings trained with SupCon loss. Our comparison covers an extensive collection of outlier detection methods in literature including: IForest (Liu et al., 2008), OCSVM (Schölkopf et al., 2001), LODA (Pevnỳ, 2016), PCA (Shyu et al., 2003), and LOF (Breunig et al., 2000). The parameter setting for these methods is available in Appendix B. We show that KNN+ outperforms alternative non-parametric methods by a large margin.

Table 3: Comparison with other non-parametric methods. Results are averaged across all test OOD datasets. Model is trained on CIFAR-10.

| Method | FPR95↓ | AUROC↑ |
|---|---|---|
| IForest (Liu et al., 2008) | 65.49 | 76.98 |
| OCSVM (Schölkopf et al., 2001) | 52.27 | 65.16 |
| LODA (Pevnỳ, 2016) | 76.38 | 62.59 |
| PCA (Shyu et al., 2003) | 37.26 | 83.13 |
| LOF (Breunig et al., 2000) | 40.06 | 93.47 |
| KNN+ (ours) | 11.07 | 97.93 |
Table 4: Results on ImageNet. All methods are based on a model trained on ID data only (ImageNet-1k (Deng et al., 2009)). We report the OOD detection performance, along with the per-image inference time. FPR denotes FPR95.

| Method | Inference time (ms) | iNaturalist FPR↓ | iNaturalist AUROC↑ | SUN FPR↓ | SUN AUROC↑ | Places FPR↓ | Places AUROC↑ | Textures FPR↓ | Textures AUROC↑ | Average FPR↓ | Average AUROC↑ | ID ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Without Contrastive Learning | | | | | | | | | | | | |
| MSP | 7.04 | 54.99 | 87.74 | 70.83 | 80.86 | 73.99 | 79.76 | 68.00 | 79.61 | 66.95 | 81.99 | 75.08 |
| ODIN | 7.05 | 47.66 | 89.66 | 60.15 | 84.59 | 67.89 | 81.78 | 50.23 | 85.62 | 56.48 | 85.41 | 75.08 |
| Energy | 7.04 | 55.72 | 89.95 | 59.26 | 85.89 | 64.92 | 82.86 | 53.72 | 85.99 | 58.41 | 86.17 | 75.08 |
| GODIN | 7.04 | 61.91 | 85.40 | 60.83 | 85.60 | 63.70 | 83.81 | 77.85 | 73.27 | 66.07 | 82.02 | 70.43 |
| Mahalanobis | 35.83 | 97.00 | 52.65 | 98.50 | 42.41 | 98.40 | 41.79 | 55.80 | 85.01 | 87.43 | 55.47 | 75.08 |
| KNN (α=100%) | 10.31 | 59.77 | 85.89 | 68.88 | 80.08 | 78.15 | 74.10 | 10.90 | 97.42 | 54.68 | 84.37 | 75.08 |
| KNN (α=1%) | 7.04 | 59.08 | 86.20 | 69.53 | 80.10 | 77.09 | 74.87 | 11.56 | 97.18 | 54.32 | 84.59 | 75.08 |
| With Contrastive Learning | | | | | | | | | | | | |
| SSD+ | 28.31 | 57.16 | 87.77 | 78.23 | 73.10 | 81.19 | 70.97 | 36.37 | 88.52 | 63.24 | 80.09 | 79.10 |
| KNN+ (α=100%) | 10.47 | 30.18 | 94.89 | 48.99 | 88.63 | 59.15 | 84.71 | 15.55 | 95.40 | 38.47 | 90.91 | 79.10 |
| KNN+ (α=1%) | 7.04 | 30.83 | 94.72 | 48.91 | 88.40 | 60.02 | 84.62 | 16.97 | 94.45 | 39.18 | 90.55 | 79.10 |
Figure 2: Effect of different k and sampling ratio α. We report the average FPR95 over four test OOD datasets. The variances are estimated across 5 different random seeds. The solid blue line represents the averaged value across all runs and the shaded blue area represents the standard deviation. Note that the full ImageNet dataset (α=100%) has 1000 images per class.

Evaluations on hard OOD tasks

Hard OOD samples are particularly challenging to detect. To test the limit of the non-parametric KNN approach, we follow CSI (Tack et al., 2020) and evaluate on several hard OOD datasets: LSUN-FIX, ImageNet-FIX, ImageNet-R, and CIFAR-100. The results are summarized in Table 2. Under the same model, KNN+ consistently outperforms SSD+.

Figure 3: Ablation results. In (a), we compare the per-image inference speed using different k and sampling ratio α. For (b), (c), and (d), the FPR95 value is reported over all test OOD datasets. Specifically, (b) compares using normalization on the penultimate layer feature vs. no normalization, (c) compares using features from the penultimate layer vs. the projection head, and (d) compares the OOD detection performance using the k-th and the averaged k (k-avg) nearest neighbor distance.

4.2 Evaluation on Large-scale ImageNet Task

We evaluate on a large-scale OOD detection task based on ImageNet (Deng et al., 2009). Compared to the CIFAR benchmarks above, the ImageNet task is more challenging due to a large amount of training data. Our goal is to verify KNN’s performance benefits and whether it scales computationally with millions of samples.


We use a ResNet-50 backbone (He et al., 2016) and train on ImageNet-1k (Deng et al., 2009) with resolution 224×224. Following the experiments in Khosla et al., models with SupCon loss are trained for 700 epochs, with a batch size of 1024. The temperature $\tau$ is 0.1. The dimension of the penultimate feature where we perform the nearest neighbor search is 2048. The dimension of the projection head is 128. We use the cosine learning rate (Loshchilov & Hutter, 2016) starting at 0.5. We train the models using stochastic gradient descent with momentum 0.9 and weight decay $10^{-4}$. We use k=1000, which follows the same validation procedure as before. When randomly sampling α% of the training data for nearest neighbor search, k is scaled accordingly to $1000\cdot\alpha\%$ (e.g., k=10 when α=1%).
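The subsampling step and the proportional scaling of k can be sketched as follows. This is an illustrative helper under stated assumptions: the function name, the rounding rule, and the fixed seed are hypothetical choices, and `bank` stands for the matrix of precomputed training embeddings.

```python
import numpy as np

def subsample_bank(bank, alpha, k_full, seed=0):
    """Keep a random fraction `alpha` of the embedding rows and scale k to match.

    alpha is a fraction in (0, 1], e.g. 0.01 for a 1% sampling ratio.
    """
    rng = np.random.default_rng(seed)
    n = bank.shape[0]
    n_keep = max(1, int(round(alpha * n)))
    idx = rng.choice(n, size=n_keep, replace=False)  # sample without replacement
    k = max(1, int(round(alpha * k_full)))           # e.g. 1000 -> 10 at alpha = 1%
    return bank[idx], k

# Toy usage with a placeholder bank of 1000 four-dimensional embeddings.
bank = np.zeros((1000, 4))
sub_bank, k = subsample_bank(bank, alpha=0.01, k_full=1000)
print(sub_bank.shape, k)
```

Scaling k with α keeps the k-th neighbor at roughly the same rank relative to the bank size, which is consistent with the optimal k values reported for α=100% and α=1%.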

Following the ImageNet-based OOD detection benchmark in MOS (Huang & Li, 2021), we evaluate on four test OOD datasets that are subsets of: Places365 (Zhou et al., 2017), Textures (Cimpoi et al., 2014), iNaturalist (Van Horn et al., 2018), and SUN (Xiao et al., 2010) with non-overlapping categories w.r.t. ImageNet. The evaluations span a diverse range of domains including fine-grained images, scene images, and textural images.

Nearest neighbor approach achieves superior performance without compromising the inference speed

In Table 4, we compare our approach with OOD detection methods that are competitive in the literature. The baselines are the same as those described in Section 4.1 except for CSI (the training procedure of CSI is computationally prohibitive on ImageNet, taking three months on 8 Nvidia 2080Ti GPUs). We report both OOD detection performance and the inference time (measured in milliseconds). We highlight three trends: (1) KNN+ outperforms the best baseline by 18.01% in FPR95. (2) Compared to SSD+, KNN+ substantially reduces the FPR95 by 24.77% averaged across all test sets. The limited performance of SSD+ is due to the increased size of the label space and data complexity, which makes the class-conditional Gaussian assumption less viable. In contrast, our non-parametric method does not suffer from this issue, and can better estimate the density of the complex distribution for OOD detection. (3) KNN+ achieves strong performance with an inference speed comparable to the baselines. In particular, we show that performing nearest neighbor distance estimation with only 1% randomly sampled training data can yield performance similar to using the full dataset.

Nearest neighbor approach is competitive on ViT

Going beyond convolutional neural networks, we show in Table 5 that the nearest neighbor approach is effective for the transformer-based ViT model (Dosovitskiy et al., 2021). We adopt the ViT-B/16 architecture fine-tuned on the ImageNet-1k dataset using cross-entropy loss. Under the same ViT model, our non-parametric KNN method consistently outperforms Mahalanobis.

Table 5: Performance comparison (FPR95) on ViT-B/16 model fine-tuned on ImageNet-1k.

| Method | iNaturalist | SUN | Places | Textures |
|---|---|---|---|---|
| Mahalanobis (parametric) | 17.56 | 80.51 | 84.12 | 70.51 |
| KNN (non-parametric) | 7.30 | 48.40 | 56.46 | 39.91 |

5 A Closer Look at KNN-based OOD Detection

We provide further analysis and ablations to understand the behavior of KNN-based OOD detection. All the ablations are based on the ImageNet model trained with SupCon loss (same as in Section 4.2).

Effect of k and sampling ratio

In Figure 2 and Figure 3 (a), we systematically analyze the effect of k and the dataset sampling ratios α. We vary the number of neighbors k={1,10,20,50,100,200,500,1000,3000,5000} and random sampling ratio α={1%,10%,50%,100%}. We note several interesting observations: (1) The optimal OOD detection (measured by FPR95) remains similar under different random sampling ratios α. (2) The optimal k is consistent with the one chosen by our validation strategy. For example, the optimal k is 1,000 when α=100%; and the optimal k becomes 10 when α=1%. (3) Varying k does not significantly affect the inference speed when k is relatively small (e.g., k<1000) as shown in Figure 3 (a).

Feature normalization is critical

In this ablation, we contrast the performance of KNN-based OOD detection with and without feature normalization. The k-th NN distance can be derived by $r_k(\phi(\mathbf{x})/\|\phi(\mathbf{x})\|_2)$ and $r_k(\phi(\mathbf{x}))$, respectively. As shown in Figure 3 (b), using feature normalization reduces the FPR95 drastically, by 61.05%, compared to the counterpart without normalization. To better understand this, we look into the Euclidean distance $r=\|u-v\|_2$ between two vectors $u$ and $v$. The norms of the feature vectors $u$ and $v$ can notably affect the value of the Euclidean distance. Interestingly, recent studies share the observation in Figure 4 (a) that ID data has a larger L2 feature norm than OOD data (Tack et al., 2020; Huang et al., 2021). Therefore, the Euclidean distance between ID features can be large (Figure 4 (b)). This contradicts the hope that ID data has a smaller k-NN distance than OOD data. Indeed, normalization effectively mitigates this problem, as evidenced in Figure 4 (c). Empirically, normalization plays a key role in making the nearest neighbor approach successful in OOD detection, as shown in Figure 3 (b).
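The dependence on the feature norms can be made explicit with a standard identity. Here $\theta$, the angle between $u$ and $v$, is notation introduced for illustration:

```latex
\|u - v\|_2^2 = \|u\|_2^2 + \|v\|_2^2 - 2\,\|u\|_2\,\|v\|_2 \cos\theta,
\qquad
\|u\|_2 = \|v\|_2 = 1 \;\Rightarrow\; \|u - v\|_2^2 = 2 - 2\cos\theta
```

Without normalization, two features pointing in similar directions can still be far apart when their norms are large. After normalization the distance depends only on the angle and lies in $[0, 2]$, so the larger norms of ID features no longer inflate their k-NN distances.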

Figure 4: Distribution of (a) the L2-norm of feature embeddings, (b) the k-NN distance with the unnormalized feature embeddings, and (c) the k-NN distance with the normalized features.

Using the penultimate layer’s feature is better than using the projection head

In this paper, we follow the convention in SSD+, which uses features from the penultimate layer instead of the projection head. We also verify in Figure 3 (c) that using the penultimate layer’s feature is better than using the projection head on all test OOD datasets. This is likely because the penultimate layer preserves more information than the projection head, which has a much smaller dimension.

KNN can be further boosted by activation rectification

We show that KNN+ can be made stronger with a recent method of activation rectification (Sun et al., 2021). It was shown that the OOD data can have overly high activations on some feature dimensions, and this rectification is effective in suppressing the values. Empirically, we compare the results in Table 6 by using the activation rectification and achieve improved OOD detection performance.

Table 6: Comparison of the KNN-based method with and without activation rectification. The ID data is ImageNet-1k. The values are averaged over all test OOD datasets.

| Method | FPR95 | AUROC |
|---|---|---|
| KNN+ | 38.47 | 90.91 |
| KNN+ (w. ReAct (Sun et al., 2021)) | 26.45 | 93.76 |
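A minimal sketch of activation rectification in the spirit of ReAct (Sun et al., 2021) is given below: penultimate activations are clipped from above at a threshold c estimated from ID activations. The function names are hypothetical, the 90th-percentile default is an assumption borrowed from common ReAct usage, and applying the clipping before L2 normalization is an assumed ordering rather than something this section specifies.

```python
import numpy as np

def fit_clip_threshold(id_activations, percentile=90.0):
    """Estimate the clipping threshold c from ID activations (assumed percentile)."""
    return float(np.percentile(id_activations, percentile))

def rectify(activations, c):
    """Clip every activation from above at c, i.e. min(h, c)."""
    return np.minimum(activations, c)

def rectified_normalized_features(activations, c, eps=1e-12):
    """Clip, then L2-normalize each row so it can be used for k-NN search."""
    clipped = rectify(activations, c)
    norms = np.linalg.norm(clipped, axis=1, keepdims=True)
    return clipped / np.maximum(norms, eps)

# Toy usage: one abnormally large activation is suppressed before normalization.
id_acts = np.arange(101.0).reshape(1, -1)
c = fit_clip_threshold(id_acts)
feats = rectified_normalized_features(np.array([[0.5, 300.0, 2.0]]), c)
print(c, feats)
```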

Using k-th and averaged k nearest neighbors’ distance has similar performance

We compare two variants for OOD detection: k-th nearest neighbor distance vs. averaged k (k-avg) nearest neighbor distance. The comparison is shown in Figure 3 (d), where the average performance (on four datasets) is on par. The reported results are based on the full ID dataset (α=100%) with the optimal k chosen for k-th NN and k-avg NN respectively. Despite the similar performance, using k-th NN distance has a stronger theoretical interpretation, as we show in the next section.
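The two variants differ only in how the k smallest distances are summarized. The sketch below illustrates this for a single query; the function names are hypothetical and the features are assumed to be normalized beforehand.

```python
import numpy as np

def k_smallest_distances(query, bank, k):
    """Sorted Euclidean distances from `query` to its k nearest rows in `bank`."""
    distances = np.linalg.norm(bank - query, axis=1)
    return np.sort(distances)[:k]

def kth_score(query, bank, k):
    """Negative distance to the k-th nearest neighbor (higher means more ID-like)."""
    return -float(k_smallest_distances(query, bank, k)[-1])

def kavg_score(query, bank, k):
    """Negative mean distance to the k nearest neighbors."""
    return -float(k_smallest_distances(query, bank, k).mean())

# Toy usage: neighbors at distances 1, 2 and 4 from the origin.
bank = np.array([[1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
query = np.array([0.0, 0.0])
print(kth_score(query, bank, k=2), kavg_score(query, bank, k=2))
```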

6 Theoretical Justification

In this section, we provide a theoretical analysis of using KNN for OOD detection. By modeling the KNN in the feature space, our theory (1) directly connects to our method which also operates in the feature space, and (2) complements our experiments by considering the universality of OOD data. Our goal here is to analyze the average performance of our algorithm while being OOD-agnostic and training-agnostic.


We consider the OOD detection task as a special binary classification task, where the negative samples (OOD) are only available in the testing stage. We assume the input is from the feature embedding space $\mathcal{Z}$ and the labeling set is $\mathcal{G} = \{0\ (\text{OOD}), 1\ (\text{ID})\}$. In the inference stage, the testing set $\{(\mathbf{z}_i, g_i)\}$ is drawn i.i.d. from $P_{\mathcal{Z}\mathcal{G}}$.

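Denote the marginal distribution on $\mathcal{Z}$ as $\mathcal{P}$. We adopt the Huber contamination model (Huber, 1964) to model the fact that we may encounter both ID and OOD data in test time:

$$\mathcal{P} = \varepsilon \mathcal{P}_{out} + (1 - \varepsilon) \mathcal{P}_{in},$$

where $\mathcal{P}_{in}$ and $\mathcal{P}_{out}$ are the underlying distributions of feature embeddings for ID and OOD data, respectively, and $\varepsilon$ is a constant controlling the fraction of OOD samples in testing. We use lower case $p_{in}(\mathbf{z}_i)$ and $p_{out}(\mathbf{z}_i)$ to denote the probability density functions, where $p_{in}(\mathbf{z}_i) = p(\mathbf{z}_i \mid g_i = 1)$ and $p_{out}(\mathbf{z}_i) = p(\mathbf{z}_i \mid g_i = 0)$.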
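The contamination model says that each test embedding is OOD with probability ε and ID otherwise. A small sampling sketch makes this concrete; the two sampling functions are hypothetical stand-ins for the ID and OOD feature distributions, and the function name `sample_contaminated` is an assumption.

```python
import numpy as np

def sample_contaminated(n, eps, sample_in, sample_out, rng):
    """Draw n test embeddings; each is OOD (g=0) with probability eps, else ID (g=1)."""
    g = (rng.random(n) >= eps).astype(int)
    z = np.where(g[:, None] == 1, sample_in(n, rng), sample_out(n, rng))
    return z, g

# Hypothetical stand-ins for the ID and OOD distributions in a 2-D feature space.
sample_in = lambda n, rng: rng.normal(loc=0.0, scale=0.1, size=(n, 2))
sample_out = lambda n, rng: rng.uniform(low=-3.0, high=3.0, size=(n, 2))

rng = np.random.default_rng(0)
z, g = sample_contaminated(100_000, eps=0.2, sample_in=sample_in,
                           sample_out=sample_out, rng=rng)
ood_fraction = 1.0 - g.mean()  # should be close to eps
```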

A key challenge in OOD detection (and theoretical analysis) is the lack of knowledge on OOD distribution, which can arise universally outside ID data. We thus try to keep our analysis general and reflect the fact that we do not have any strong prior information about OOD. For this reason, we model OOD data with an equal chance to appear outside of the high-density region of ID data, $p_{out}(\mathbf{z}) = c_0 \mathbf{1}\{p_{in}(\mathbf{z}) < c_1\}$³. The Bayesian classifier is known as the optimal binary classifier defined by $h_{Bay}(\mathbf{z}_i) = \mathbf{1}\{p(g_i = 1 \mid \mathbf{z}_i) \ge \beta\}$⁴, assuming the underlying density function is given.

³ In experiments, as it is difficult to simulate the universal OOD, we approximate it by using a diverse yet finite collection of datasets. Our theory is thus complementary to our experiments and captures the universality of OOD data.
⁴ Note that $\beta$ does not have to be $\frac{1}{2}$ for the Bayesian classifier to be optimal. $\beta$ can be any value larger than $\frac{(1-\varepsilon)c_1}{(1-\varepsilon)c_1 + \varepsilon c_0}$ when $\varepsilon c_0 \ge (1-\varepsilon)c_1$.

Without such oracle information, our method uses the k-NN distance, which acts as a probability density estimate, and derives the decision boundary from it. Specifically, KNN’s hypothesis class is given by $\{h : h_{\lambda,k,n}(\mathbf{z}_i) = \mathbf{1}\{-r_k(\mathbf{z}_i) \ge \lambda\}\}$, where $r_k(\mathbf{z}_i)$ is the distance to the k-th nearest neighbor (cf. Section 3).

Main result

We show that the rejection rule of our KNN-based OOD detector is equivalent to the estimated Bayesian binary decision function. A small KNN distance $r_k(\mathbf{z}_i)$ directly translates into a high probability of being ID, and vice versa. We formalize this in the following theorem.

Theorem 6.1.

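With the setup specified above, if $\hat{p}_{out}(\mathbf{z}_i) = \hat{c}_0 \mathbf{1}\left\{\hat{p}_{in}(\mathbf{z}_i; k, n) < \frac{\beta \varepsilon \hat{c}_0}{(1-\beta)(1-\varepsilon)}\right\}$ and $\lambda = -\sqrt[m-1]{\frac{(1-\beta)(1-\varepsilon)k}{\beta \varepsilon c_b n \hat{c}_0}}$, we have

$$\mathbf{1}\{-r_k(\mathbf{z}_i) \ge \lambda\} = \mathbf{1}\{\hat{p}(g_i = 1 \mid \mathbf{z}_i) \ge \beta\},$$

where $\hat{p}(\cdot)$ denotes the empirical estimation. The proof is in Appendix A.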
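The equivalence stated in the theorem can be checked numerically on a grid of k-NN distances. The sketch below assumes the k-NN density estimate from Appendix A takes the form k / (c_b n r_k^(m-1)), that a sample is accepted as ID when −r_k ≥ λ, and uses hypothetical values for every constant (k, n, m, c_b, ĉ_0, ε, β).

```python
import numpy as np

def knn_decision(r_k, lam):
    """KNN rule: label as ID (1) when -r_k >= lambda."""
    return (-r_k >= lam).astype(int)

def bayes_decision(r_k, k, n, m, c_b, c0_hat, eps, beta):
    """Estimated Bayesian rule built from the k-NN density estimate."""
    p_in = k / (c_b * n * r_k ** (m - 1))             # assumed estimate of p_in
    thresh = beta * eps * c0_hat / ((1 - beta) * (1 - eps))
    p_out = c0_hat * (p_in < thresh)                  # uniform outside the dense region
    post = (1 - eps) * p_in / ((1 - eps) * p_in + eps * p_out)
    return (post >= beta).astype(int)

# Hypothetical constants, chosen only to exercise both sides of the threshold.
k, n, m = 10, 1000, 4
c_b, c0_hat, eps, beta = 2.0, 0.5, 0.1, 0.5
lam = -(((1 - beta) * (1 - eps) * k) / (beta * eps * c_b * n * c0_hat)) ** (1.0 / (m - 1))

r = np.linspace(0.05, 2.0, 200)
knn = knn_decision(r, lam)
bayes = bayes_decision(r, k, n, m, c_b, c0_hat, eps, beta)
```

On this grid the two decision vectors coincide, with small distances labeled ID and large distances labeled OOD.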

7 Related Work

OOD detection

The phenomenon of neural networks’ overconfidence on out-of-distribution data was first revealed by Nguyen et al. (2015), and has since attracted growing research attention in several thriving directions:

(1) One line of work attempted to perform OOD detection by devising scoring functions, including OpenMax score (Bendale & Boult, 2015), maximum softmax probability (Hendrycks & Gimpel, 2017), ODIN score (Liang et al., 2018), deep ensembles (Lakshminarayanan et al., 2017), Mahalanobis distance-based score (Lee et al., 2018), energy score (Liu et al., 2020; Lin et al., 2021; Wang et al., 2021; Morteza & Li, 2022), activation rectification (ReAct) (Sun et al., 2021), gradient-based score (Huang et al., 2021) and ViM score (Wang et al., 2022). Huang & Li (2021) revealed that approaches developed for CIFAR datasets might not translate effectively to a large-scale ImageNet benchmark, and highlighted the need to evaluate OOD detection methods in a real-world setting. To date, none of the prior works has investigated the non-parametric nearest neighbor approach for OOD detection. Our work bridges the gap by presenting the first study exploring the efficacy of using nearest neighbor distance for OOD detection. We demonstrate superior performance on several OOD detection benchmarks, and we hope our work draws attention to the strong promise of the non-parametric approach.

(2) Another promising line of work addressed OOD detection by training-time regularization (Lee et al., 2017; Bevandić et al., 2018; Malinin & Gales, 2018; Hendrycks et al., 2019; Geifman & El-Yaniv, 2019; Hein et al., 2019; Meinke & Hein, 2019; Mohseni et al., 2020; Liu et al., 2020; Jeong & Kim, 2020; Van Amersfoort et al., 2020; Yang et al., 2021; Chen et al., 2021; Wei et al., 2022; Ming et al., 2022a; Katz-Samuels et al., 2022). For example, models are encouraged to give predictions with uniform distribution (Lee et al., 2017; Hendrycks et al., 2019) or higher energies (Liu et al., 2020; Ming et al., 2022a; Du et al., 2022a; Katz-Samuels et al., 2022) for outlier data. Most regularization methods require the availability of auxiliary OOD data. Recently, VOS (Du et al., 2022b) alleviates the need by automatically synthesizing virtual outliers that can meaningfully regularize the model’s decision boundary during training.

(3) More recently, several works explored the role of representation learning for OOD detection. In particular, CSI (Tack et al., 2020) investigates the types of data augmentation that are particularly beneficial for OOD detection. Other works (Winkens et al., 2020; Sehwag et al., 2021) verify the effectiveness of applying off-the-shelf multi-view contrastive losses such as SimCLR (Chen et al., 2020) and SupCon (Khosla et al., 2020) for OOD detection. These two works both use Mahalanobis distance as the OOD score, and make strong distributional assumptions by modeling the class-conditional feature space as a multivariate Gaussian distribution. Ming et al. (2022b) propose a prototype-based contrastive learning framework for OOD detection, which promotes stronger ID-OOD separability than the SupCon loss. Our method differs fundamentally from these previous works in the OOD detection method, although all benefit from high-quality representations. In particular, KNN is a non-parametric method that does not impose a prior on the ID distribution. Performance-wise, our method outperforms SSD by a substantial margin, and is easy to use in practice.

KNN for anomaly detection

KNN has been explored for anomaly detection (Jing et al., 2014; Zhao & Lai, 2020; Bergman et al., 2020), which aims to detect abnormal input samples from one class. We focus on OOD detection, which additionally requires performing multi-class classification for ID data. Some other recent works (Dang et al., 2015; Gu et al., 2019; Pires et al., 2020) explore the effectiveness of KNN-based anomaly detection for tabular data. The potential of using KNN for OOD detection in deep neural networks is currently underexplored. Our work provides both new empirical insights and theoretical analysis of using the KNN-based approach for OOD detection.

8 Conclusion

This paper presents the first study exploring and demonstrating the efficacy of the non-parametric nearest-neighbor distance for OOD detection. Unlike prior works, the non-parametric approach does not impose any distributional assumption about the underlying feature space, hence providing stronger flexibility and generality. We provide the important insight that a high-quality feature embedding and a suitable distance measure are two indispensable components for the OOD detection task. Extensive experiments show that the KNN-based method can notably improve the performance on several OOD detection benchmarks, establishing superior results. We hope our work inspires future research on using the non-parametric approach to OOD detection.


Appendix A Theoretical Analysis

Proof of Theorem 6.1

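We now provide a proof sketch to convey the key idea, which revolves around the empirical estimation of the probability $\hat{p}(g_i = 1 \mid \mathbf{z}_i)$. By Bayes’ rule, the probability of $\mathbf{z}_i$ being ID data is:

$$p(g_i = 1 \mid \mathbf{z}_i) = \frac{p(\mathbf{z}_i \mid g_i = 1)\, p(g_i = 1)}{p(\mathbf{z}_i)},$$
$$\hat{p}(g_i = 1 \mid \mathbf{z}_i) = \frac{(1-\varepsilon)\,\hat{p}_{in}(\mathbf{z}_i)}{(1-\varepsilon)\,\hat{p}_{in}(\mathbf{z}_i) + \varepsilon\,\hat{p}_{out}(\mathbf{z}_i)}.$$

Hence, estimating $\hat{p}(g_i = 1 \mid \mathbf{z}_i)$ boils down to deriving the empirical estimations $\hat{p}_{in}(\mathbf{z}_i)$ and $\hat{p}_{out}(\mathbf{z}_i)$, which we show below in turn.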

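Estimation for $\hat{p}_{in}(\mathbf{z}_i)$

Recall that $\mathbf{z}$ is a normalized feature vector in $\mathbb{R}^m$. Therefore $\mathbf{z}$ lies on the surface of an m-dimensional unit sphere. We denote $B(\mathbf{z}, r) = \{\mathbf{z}' : \|\mathbf{z}' - \mathbf{z}\|_2 \le r\} \cap \{\mathbf{z}' : \|\mathbf{z}'\|_2 = 1\}$, which is the set of points on the unit hyper-sphere that are at most Euclidean distance $r$ away from the center $\mathbf{z}$. Note that the local dimension of $B(\mathbf{z}, r)$ is $m-1$.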

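Assuming the density satisfies Lebesgue’s differentiation theorem, the probability density function can be attained by:

$$p_{in}(\mathbf{z}_i) = \lim_{r \to 0} \frac{p(\mathbf{z} \in B(\mathbf{z}_i, r) \mid g_i = 1)}{|B(\mathbf{z}_i, r)|}.$$

In training time, we empirically observe $n$ in-distribution samples $\mathbb{Z}_n = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\}$. We assume each sample $\mathbf{z}_j$ is i.i.d. with a probability mass $\frac{1}{n}$. The empirical point-density for the ID data can be estimated by the k-NN distance:

$$\hat{p}_{in}(\mathbf{z}_i; k, n) = \frac{p(\mathbf{z}_j \in B(\mathbf{z}_i, r_k(\mathbf{z}_i)) \mid \mathbf{z}_j \in \mathbb{Z}_n)}{|B(\mathbf{z}_i, r_k(\mathbf{z}_i))|} = \frac{k}{c_b\, n\, (r_k(\mathbf{z}_i))^{m-1}},$$

where $c_b$ is a constant. The following Lemma A.1 establishes the convergence rate of the estimator.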

Lemma A.1.



The proof is given in (Zhao & Lai, 2020).
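As an illustration of the k-NN density estimate, the sketch below applies it to samples drawn uniformly on the unit circle (m = 2), where the true density with respect to arc length is 1/(2π). It assumes the estimator has the form k / (c_b n r_k^(m-1)); for m = 2 the set of points within a small chord distance r has arc length of approximately 2r, so c_b = 2 is used. The function name and the sample sizes are hypothetical.

```python
import numpy as np

def knn_density_on_circle(query, samples, k):
    """k-NN density estimate k / (c_b * n * r_k^(m-1)) on the unit circle (m = 2)."""
    n = len(samples)
    # Euclidean (chord) distance to the k-th nearest neighbor.
    r_k = np.sort(np.linalg.norm(samples - query, axis=1))[k - 1]
    c_b = 2.0  # arc length within chord distance r is about 2r for small r
    return k / (c_b * n * r_k)

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=40000)
samples = np.stack([np.cos(theta), np.sin(theta)], axis=1)

estimate = knn_density_on_circle(np.array([1.0, 0.0]), samples, k=400)
true_density = 1.0 / (2.0 * np.pi)
```

With k much smaller than n, the estimate should land close to the true density, up to sampling noise.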

Estimation for $\hat{p}_{out}(\mathbf{z}_i)$

A key challenge in OOD detection is the lack of knowledge on OOD distribution, which can arise universally outside ID data. We thus try to keep our analysis general and reflect the fact that we do not have any strong prior information about OOD. For this reason, we model OOD data with an equal chance to appear outside of the high-density region of ID data. Our theory is thus complementary to our experiments and captures the universality of OOD data. Specifically, we denote

$$\hat{p}_{out}(\mathbf{z}_i) = \hat{c}_0 \mathbf{1}\left\{\hat{p}_{in}(\mathbf{z}_i; k, n) < \frac{\beta \varepsilon \hat{c}_0}{(1-\beta)(1-\varepsilon)}\right\},$$

where the threshold is chosen to satisfy the theorem.

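Lastly, our theorem holds by plugging in the empirical estimations $\hat{p}_{in}(\mathbf{z}_i)$ and $\hat{p}_{out}(\mathbf{z}_i)$:

$$\begin{aligned}
\mathbf{1}\{-r_k(\mathbf{z}_i) \ge \lambda\} &= \mathbf{1}\left\{\varepsilon c_b n \hat{c}_0 (r_k(\mathbf{z}_i))^{m-1} \le \frac{1-\beta}{\beta}(1-\varepsilon)k\right\} \\
&= \mathbf{1}\left\{\varepsilon\, \hat{p}_{out}(\mathbf{z}_i) \le \frac{1-\beta}{\beta}(1-\varepsilon)\, \hat{p}_{in}(\mathbf{z}_i)\right\} \\
&= \mathbf{1}\{\hat{p}(g_i = 1 \mid \mathbf{z}_i) \ge \beta\}.
\end{aligned}$$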

Table 7: Results with DenseNet-101, compared with competitive out-of-distribution detection methods. All methods are based on a model trained on ID data only. All values are percentages and are averaged over all OOD test datasets.
Method         CIFAR-10                     CIFAR-100
               FPR95   AUROC   ID ACC       FPR95   AUROC   ID ACC
MSP            49.95   92.05   94.38        79.10   75.39   75.08
Energy         30.16   92.44   94.38        68.03   81.40   75.08
ODIN           30.02   93.86   94.38        55.96   85.16   75.08
Mahalanobis    35.88   87.56   94.38        74.57   66.03   75.08
GODIN          28.98   92.48   94.22        55.38   83.76   74.50
CSI            70.97   78.42   93.49        79.13   60.41   68.48
SSD+           16.21   96.96   94.45        43.44   88.97   75.21
KNN+ (ours)    12.16   97.58   94.45        37.27   89.63   75.21

Appendix B Configurations

Non-parametric methods for anomaly detection

We provide implementation details of the non-parametric methods in this section. Specifically,

IForest (Liu et al., 2008) generates a random forest assuming the test anomaly can be isolated in fewer steps. We use 100 base estimators in the ensemble and each estimator draws 256 samples randomly for training. The number of features to train each base estimator is set to 512.

LOF (Breunig et al., 2000) defines an outlier score based on the sample’s k-NN distances. We set k=50.

LODA (Pevnỳ, 2016) is an ensemble solution combining multiple weaker binary classifiers. The number of bins for the histogram is set to 10.

PCA (Shyu et al., 2003) detects anomaly samples with large values when mapping to the directions with small eigenvalues. We use 50 components for calculating the outlier scores.

OCSVM (Schölkopf et al., 2001) learns a decision boundary that corresponds to the desired density level set using a kernel function. We use the RBF kernel with $\gamma = \frac{1}{512}$. The upper bound on the fraction of training errors is set to 0.5.

Some of these methods (Schölkopf et al., 2001; Shyu et al., 2003) are specifically designed for anomaly detection scenarios that assume ID data is from one class. We show that k-NN distance with the class-aware embeddings can achieve both OOD detection and multi-class classification tasks.
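For reference, three of the configurations above can be written down with scikit-learn as a rough sketch. The text does not state which library was used, so the mapping of each setting to a scikit-learn parameter is an assumption, and the random 512-dimensional features are placeholders for real embeddings. LODA and the PCA-based detector are left out because scikit-learn provides no direct counterpart.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Assumed mapping of the described settings to scikit-learn parameters.
detectors = {
    # 100 base estimators, 256 samples and 512 features per estimator.
    "IForest": IsolationForest(n_estimators=100, max_samples=256,
                               max_features=512, random_state=0),
    # k = 50 neighbors; novelty=True allows scoring unseen test points.
    "LOF": LocalOutlierFactor(n_neighbors=50, novelty=True),
    # RBF kernel with gamma = 1/512; nu bounds the fraction of training errors.
    "OCSVM": OneClassSVM(kernel="rbf", gamma=1.0 / 512, nu=0.5),
}

# Placeholder features standing in for 512-dimensional ID and test embeddings.
rng = np.random.default_rng(0)
id_feats = rng.normal(size=(600, 512))
test_feats = rng.normal(size=(20, 512))

scores = {}
for name, det in detectors.items():
    det.fit(id_feats)
    # In scikit-learn, lower score_samples values indicate more abnormal points.
    scores[name] = det.score_samples(test_feats)
```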

Appendix C Results on Different Architecture

In the main paper, we have shown that the nearest neighbor approach is competitive on ResNet. In this section, we show in Table 7 that KNN’s strong performance also holds on a different network architecture, DenseNet-101 (Huang et al., 2017). All the numbers reported are averaged over the OOD test datasets described in Section 4.1.