Out-of-Distribution Detection in Dermatology using Input Perturbation and Subset Scanning

Hannah Kim
Duke University
Durham, NC, USA
Girmaw Abebe Tadesse
IBM Research
Nairobi, Kenya
Celia Cintas
IBM Research
Nairobi, Kenya
Skyler Speakman
IBM Research
Nairobi, Kenya
Kush Varshney
IBM Research
Yorktown Heights, NY, USA

Abstract

Recent advances in deep learning have led to breakthroughs in the development of automated skin disease classification. As we observe an increasing interest in these models in the dermatology space, it is crucial to address aspects such as the robustness towards input data distribution shifts. Current skin disease models could make incorrect inferences for test samples from different hardware devices and clinical settings or unknown disease samples, which are out-of-distribution (OOD) from the training samples. To this end, we propose a simple yet effective approach that detects these OOD samples prior to making any decision. The detection is performed via scanning in the latent space representation (e.g., activations of the inner layers of any pre-trained skin disease classifier). The input samples can also be perturbed to maximise the divergence of OOD samples. We validate our OOD detection approach in two use cases: 1) identifying samples collected from different protocols, and 2) detecting samples from unknown disease classes. Additionally, we evaluate the performance of the proposed approach and compare it with other state-of-the-art methods. Furthermore, data-driven dermatology applications may deepen the disparity in clinical care across racial and ethnic groups since most datasets are reported to suffer from bias in skin tone distribution. Therefore, we also evaluate the fairness of these OOD detection methods across different skin tones. Our experiments resulted in competitive performance across multiple datasets in detecting OOD samples, which could be used (in the future) to design more effective transfer learning techniques prior to inferring on these samples.

Keywords Skin disease classification $\cdot$ Out-of-distribution sample detection $\cdot$ Algorithmic Fairness

1 Introduction

Skin disease remains a global health challenge, with skin cancer being the most common cancer worldwide (Codella et al., 2017). Following the recent success of deep learning (DL) in various computer vision problems (partly due to its automated feature encoding capability), convolutional neural networks (CNNs) (Huang et al., 2016) have been employed for skin disease classification tasks. As we observe increasing interest in applying DL to dermatology (Esteva et al., 2017; Gomolin et al., 2020), it is imperative to address the transparency, robustness, and fairness of these solutions (Adamson and Smith, 2018; Qayyum et al., 2020). Many existing deep learning techniques (Mahbod et al., 2020; Gessert et al., 2018; Ahmed et al., 2019) achieve high performance on publicly available datasets (Codella et al., 2017; Tschandl et al., 2018; Combalia et al., 2019; Sun et al., 2016). However, they utilize ensembles of multiple models aimed at maximising performance, with limited consideration of shifts in the input data (Ahmed et al., 2019; Gessert et al., 2019; Zhang et al., 2019). As a result, new samples might be incorrectly classified as one of the training classes (with high confidence) even though they come from previously unknown or new classes.

Thus, it is necessary to detect out-of-distribution (OOD) samples prior to making decisions in order to achieve principled transfer of knowledge from in-distribution training samples to OOD test samples, thereby extending the usability of the models to previously unseen scenarios. Furthermore, OOD detectors and other DL solutions need to guarantee equivalent detection capability across sub-populations. Particularly in dermatology, bias in the representation of skin tones in academic materials [15] and clinical care [16] is becoming a primary concern. For instance, the New York Times reports major disparities in dermatology when treating skin of color [16], as common conditions often manifest differently on dark skin, and physicians are trained mostly to diagnose them on light skin. STAT [15] also reported that the lack of darker skin tones in dermatology academic materials adversely affects the quality of care for patients of color. Alarmingly, the growing practice of using artificial intelligence to aid the diagnosis of skin diseases will further deepen the divide in patient care, because the machine learning algorithms are trained with such imbalanced datasets (Codella et al., 2019, 2017; Tschandl et al., 2018; Combalia et al., 2019; Sun et al., 2016) (with an overwhelming majority of samples with light skin tones). This is supported by the work of Kinyanjui et al. (Kinyanjui et al., 2019), which uses Individual Typology Angle (ITA) to approximate skin tones in various publicly available skin disease datasets (Codella et al., 2017; Tschandl et al., 2018; Combalia et al., 2019; Sun et al., 2016) and shows that these datasets heavily under-represent darker skin tones. As a result, we also validate the performance of OOD detectors across skin tones.

To this end, we propose a simple yet effective approach that scans over the activations of the inner layers of any pre-trained skin disease classifier to detect OOD samples. We additionally perturb the input data beforehand with our proposed ODIN_low, a modification of ODIN (Liang et al., 2017), which improves OOD detection performance in earlier layers of the network. In our framework, we define two different OOD use cases: protocol variations (e.g., different hardware devices, lighting settings, and non-compliance with clinical protocol); and unknown disease types (e.g., samples from a new disease type that was not observed during training). Without requiring any prior knowledge of the OOD samples, the proposed approach outperforms existing OOD detectors, Softmax Score (Hendrycks and Gimpel, 2016) and ODIN (Liang et al., 2017), for OOD samples with different validation protocols, and achieves competitive performance in detecting samples with unknown disease types. We further explore how our proposed and existing OOD detectors perform across skin tones to evaluate fairness. We show that current OOD detectors detect OOD samples of darker skin tones with higher performance than those of lighter skin tones, which is likely influenced by the imbalanced training datasets that heavily lack samples of dark skin tones.

Generally, our main contributions are as follows: 1) we propose a weakly-supervised approach based on subset scanning over the activations of the inner layers of a pre-trained skin disease classifier to detect OOD samples across two use cases: detection of OOD samples from different collection protocols and those from unknown disease classes; 2) we propose perturbing input images with ODIN_low noise for improved OOD detection performance; 3) we evaluate our methods against existing OOD detectors: Softmax Score (Hendrycks and Gimpel, 2016) and ODIN (Liang et al., 2017); and 4) we evaluate the fairness of the proposed approach and existing methods in their detection performance across skin tones.

Columns: Ensemble, Test Data Augmentation, OOD Detection Post-Training, New Protocol Detection, New Disease Detection, Algorithmic Fairness
(Ahmed et al., 2019)	✓ ✗ ✓ ✗
(Zhang et al., 2019)	✓ ✗ ✓ ✗
(Gessert et al., 2019)	✓ ✗ ✓ ✗
(Bagchi et al., 2020)	✗ ✓ ✗
(Pacheco et al., 2019)	✓ ✗ ✓ ✗ ✓ ✗
(Combalia et al., In Press)	✗ ✓ ✗ ✓ ✗
(Pacheco et al., 2020)	✗ ✓ ✗
Ours	✗ ✓

Table 1: Summary of the state-of-the-art OOD sample detection in skin disease classification task, and the differentiation of our proposed approach.

2 Related Work

Our review of existing OOD detection methods is grouped into pre-training (Ahmed et al., 2019; Bagchi et al., 2020; Gessert et al., 2019; Zhang et al., 2019) and post-training (Combalia et al., In Press; Pacheco et al., 2019, 2020), based on where the detection step is applied.

Pre-training OOD detection approaches have prior knowledge of the OOD samples and incorporate it during their training phases. Many of these approaches utilize ensembles of existing CNNs (and their variants) to detect OOD samples (Ahmed et al., 2019; Gessert et al., 2019; Zhang et al., 2019). Ahmed et al. (Ahmed et al., 2019) applied one-class learning using deep neural network features where one-class samples were iteratively discarded as OOD samples in a one-vs-all cross-validation strategy, and the OOD samples were detected by taking the prediction average of all the models. Gessert et al. (Gessert et al., 2019) utilized an additional dataset of skin lesions as OOD samples to train their ensemble of CNNs to detect OOD samples. Zhang et al. (Zhang et al., 2019) employed an ensemble of DenseNet-based CNNs consisting of both multi-class and binary classifiers to detect OOD samples. Bagchi et al. (Bagchi et al., 2020) proposed Class Specific - Known vs. Simulated Unknown to detect OOD samples.

Post-training OOD detection approaches do not require any prior knowledge of the OOD samples during training (Combalia et al., In Press; Pacheco et al., 2019, 2020). Pacheco et al. (Pacheco et al., 2019) detected OOD samples using Shannon entropy (Shannon, 1948) and cosine similarity metrics on their CNN’s probability outputs. Instead, Combalia et al. (Combalia et al., In Press) detected OOD samples using Monte-Carlo Dropout (Gal and Ghahramani, 2016) and test data augmentation to estimate uncertainty such as entropy and variance in their network predictions. Pacheco et al. (Pacheco et al., 2020) extended Gram-OOD (Sastry and Oore, 2019) with layer-specific normalization of Gram Matrix values to detect OOD samples.

Table 1 summarizes notable OOD detection studies in dermatology. The majority of these studies employ pre-training approaches using ensembles of CNNs, which increase model complexity and are impractical because they require prior knowledge of the OOD samples. Test data augmentation is also less plausible to domain experts as it might partially re-synthesize the samples. In this work, we propose a simple, post-training OOD detector that can be applied to any single pre-trained network without any test data augmentation or prior knowledge of the OOD samples.

Figure 1: Block diagram of the proposed approach. $C$: a trained model for skin disease classification over the mentioned datasets ($\mathcal{D}_{1}$, $\mathcal{D}_{2}$); $T$: a skin tone extractor.

3 Proposed Framework

We propose a weakly-supervised OOD detection method to identify skin images collected in different validation protocols and derived from unknown skin disease types, based on subset scanning (Cintas et al., 2020) and ODIN (Liang et al., 2017). Subset scanning treats the OOD detection problem as a search for the most anomalous subset of observations in the activation space of any pre-trained classifier. This exponentially large search space is efficiently explored by exploiting mathematical properties of our measure of anomalousness (Neill, 2012). Our solution can be applied to any off-the-shelf skin disease classifier. Additionally, we evaluate the algorithmic fairness of the proposed and existing OOD detectors across skin tones. The overview of the proposed approach is shown in Fig. 1. Given a set of skin datasets $D$ and a pre-trained skin disease classifier $C$ as input, we first stratify each dataset through a skin tone distribution extractor $T$ for evaluation purposes. Then, we apply subset scanning across each layer of the classifier $C$ and compute the subset score for the unknown disease use case. To detect protocol variations, we additionally perturb the input data before scanning, as this yields the best results. In the following sections, we describe the details of the proposed approach.

3.1 Subset scanning for out-of-distribution sample detection

Given a pre-trained network $C$ for skin disease classification, we apply subset scanning (Cintas et al., 2020) on the activations in the intermediate layers of the network $C$ to detect a subset ($S$) of OOD samples (see Algorithm 1). Subset scanning searches for the most anomalous subset $S^{*}=\arg\max_{S}F(S)$ in each layer, where the anomalousness is quantified by a scoring function $F(\cdot)$, such as a log-likelihood ratio statistic. When searching for this subset, an exhaustive search across all possible subsets is computationally infeasible as the number of subsets ($2^{N}$) increases exponentially with the number of nodes ($N$) in a layer. Instead, we utilize a scoring function that satisfies the Linear Time Subset Scanning (LTSS) (Neill, 2012) property, which enables efficient maximization over all subsets of data. The LTSS property guarantees that the highest-scoring subset of nodes in a layer is identified within $N$ searches instead of $2^{N}$ searches. Following the literature on pattern detection (McFowland et al., 2013), we utilize non-parametric scan statistics (NPSS) (McFowland et al., 2013) as our scoring function, as it satisfies the LTSS property and makes minimal assumptions on the underlying distribution of node activations.
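The LTSS guarantee can be checked numerically on a small example. The sketch below is illustrative and not the authors' implementation: it assumes the Berk-Jones statistic as the NPSS scoring function, restricts candidate thresholds to the observed p-values below a hypothetical `alpha_max`, and uses made-up p-values. The helper names are hypothetical. `scan_exhaustive` enumerates all $2^{N}-1$ non-empty subsets, while `scan_linear` evaluates only the prefixes of the sorted p-values.

```python
import itertools
import math


def berk_jones(alpha, n_alpha, n):
    """One-sided Berk-Jones score: n * KL(n_alpha / n, alpha)."""
    if n == 0 or alpha <= 0.0 or alpha >= 1.0:
        return 0.0
    ratio = n_alpha / n
    if ratio <= alpha:  # no excess of significant p-values
        return 0.0
    if ratio == 1.0:    # the (1 - ratio) term vanishes
        return n * math.log(1.0 / alpha)
    return n * (ratio * math.log(ratio / alpha)
                + (1.0 - ratio) * math.log((1.0 - ratio) / (1.0 - alpha)))


def scan_exhaustive(pvalues, alpha_max=0.5):
    """Best score over every non-empty subset and every threshold."""
    thresholds = sorted({p for p in pvalues if p < alpha_max})
    best = 0.0
    for size in range(1, len(pvalues) + 1):
        for subset in itertools.combinations(pvalues, size):
            for alpha in thresholds:
                n_alpha = sum(p <= alpha for p in subset)
                best = max(best, berk_jones(alpha, n_alpha, size))
    return best


def scan_linear(pvalues, alpha_max=0.5):
    """Best score over the sorted prefixes only (at most N evaluations)."""
    kept = sorted(p for p in pvalues if p < alpha_max)
    best = 0.0
    for k, alpha_k in enumerate(kept, start=1):
        # The k smallest p-values are all <= alpha_k, so n_alpha = n = k.
        best = max(best, berk_jones(alpha_k, k, k))
    return best


pvalues = [0.02, 0.40, 0.07, 0.91, 0.15, 0.33]
print(scan_exhaustive(pvalues), scan_linear(pvalues))
```

On this toy input both searches return the same maximum, reached by the three smallest p-values. The prefix evaluation mirrors the call NPSS($\alpha_{k}$, $k$, $k$) in Algorithm 1.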

We apply subset scanning on a set of layers $C_{Y}$ of our pre-trained network $C$. For each layer $C_{y}\in C_{Y}$, we form a distribution of expected activations at each node using the known in-distribution (ID) samples $X_{z}$, which were used during training and can also be referred to as background images. Comparing this expected distribution to the node activations of each test or evaluation sample $X_{i}$, we obtain p-values $p_{ij}$ for the $i^{th}$ test sample and the $j^{th}$ node of layer $C_{y}$. We then quantify the anomalousness of the p-values by finding the subset of nodes that maximizes the divergence of the test sample activations from the expected activations. This yields $|C_{Y}|$ anomalous scores $S^{\ast}_{(C_{y})}$ for each test sample. We expect OOD samples to yield higher anomalous scores than ID samples, and we detect OOD samples with simple thresholding. Note that the OOD detection is performed in an unsupervised fashion without any prior knowledge of the OOD samples.
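The per-layer procedure described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: activations are taken as already extracted into arrays, the Berk-Jones form of NPSS is assumed (which reduces to $k\log(1/\alpha_{k})$ on sorted prefixes), and the function names, `alpha_max` default, decision threshold, and toy activations are all hypothetical.

```python
import numpy as np


def empirical_pvalues(background_acts, test_acts):
    """Right-tailed empirical p-values, one per node.

    background_acts: (M, J) activations of M background images at J nodes.
    test_acts: (J,) activations of one test image at the same nodes.
    """
    m = background_acts.shape[0]
    n_greater_equal = (background_acts >= test_acts[np.newaxis, :]).sum(axis=0)
    return (n_greater_equal + 1.0) / (m + 1.0)


def layer_score(pvalues, alpha_max=0.5):
    """Highest score over the sorted-prefix subsets of one layer."""
    kept = np.sort(pvalues[pvalues < alpha_max])
    if kept.size == 0:
        return 0.0
    k = np.arange(1, kept.size + 1)
    # With all k selected p-values <= alpha_k, Berk-Jones is k * log(1 / alpha_k).
    return float(np.max(k * np.log(1.0 / kept)))


def is_ood(layer_scores, threshold):
    """Flag a sample when its summed layer scores exceed a chosen threshold."""
    return float(np.sum(layer_scores)) > threshold


# Toy layer: 99 background images, 4 nodes, activations 1..99 at every node.
background = np.tile(np.arange(1.0, 100.0)[:, np.newaxis], (1, 4))
typical = np.full(4, 50.0)    # sits in the middle of the background range
extreme = np.full(4, 100.0)   # exceeds every background activation

score_typical = layer_score(empirical_pvalues(background, typical))
score_extreme = layer_score(empirical_pvalues(background, extreme))
print(score_typical, score_extreme)
```

In this toy layer the typical sample has p-values of 0.51 at every node and scores zero, while the extreme sample has p-values of 0.01 and receives a large score, so any threshold between the two separates them.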

input : Background Image: $X_{z}\in D^{H_{0}}$, Evaluation Image: $X_{i}$, training dataset: $D_{train}$, $\alpha_{\text{max}}$
output : $AUROC$, $F_{1}$, $AUROC^{t}$, and $F_{1}^{t}$ for $X_{i}$
$C\leftarrow$ TrainSkinDiseaseClassifier($D_{train}$);
$C_{Y}\leftarrow$ Set of layers in $C$;
$X_{i}^{t}\leftarrow$ PredictITASkinTone($X_{i}$);
$\hat{X}_{z}\leftarrow$ AddODINNoise($X_{z}$);
$\hat{X}_{i}\leftarrow$ AddODINNoise($X_{i}$);
for $C_{y}$ in $C_{Y}$ do
    for $j\leftarrow 0$ to $|C_{y}|$ do
        $A^{H_{0}}_{zj}\leftarrow$ ExtractActivation($C_{y}$, $\hat{X}_{z}$);
        $A_{ij}\leftarrow$ ExtractActivation($C_{y}$, $\hat{X}_{i}$);
        $p_{ij}=\frac{\sum_{X_{z}\in D^{H_{0}}}I(A^{H_{0}}_{zj}\geq A_{ij})+1}{M+1}$;
    $p^{\ast}_{ij}=\{y<\alpha_{\text{max}}\>\forall\>y\subseteq p_{ij}\}$;
    $p^{s}_{ij}\leftarrow$ SortAscending($p^{\ast}_{ij}$);
    for $k\leftarrow 1$ to $|C_{y}|$ do
        $S_{(k)}=\{p_{y}\subseteq p^{s}_{ij}\>\forall\>y\in\{1,\ldots,k\}\}$;
        $\alpha_{k}=\max(S_{(k)})$;
        $F(S_{(k)})\leftarrow$ NPSS($\alpha_{k}$, $k$, $k$);
    $k^{\ast}_{(C_{y})}\leftarrow\arg\max F(S_{(k)})$;
    $\alpha^{\ast}_{(C_{y})}=\alpha_{k^{\ast}_{(C_{y})}}$;
    $S^{\ast}_{(C_{y})}=S_{(k^{\ast}_{(C_{y})})}$;
$AUROC$, $F_{1}$ = ComputeDetectionPerformance($\sum_{C_{y}}{S^{\ast}_{(C_{y})}}$);
$AUROC^{t}$, $F_{1}^{t}$ = StratifyPerSkinTone($X_{i}^{t}$, $AUROC$, $F_{1}$);
return $AUROC$, $F_{1}$, $AUROC^{t}$, and $F_{1}^{t}$

Algorithm 1: Pseudo-code for the proposed new protocol (OOD) detection.

3.2 ODIN and ODIN_low Perturbations

We have also evaluated the impact of adding small perturbations, prior to subset scanning, to each test sample following ODIN (Liang et al., 2017) for enhanced OOD detection. ODIN involves two steps, input pre-processing and temperature scaling. In the first step, $X_{i}$ is perturbed by adding a small perturbation computed by back-propagating the gradient of the training loss with respect to $X_{i}$ and weighted by parameter $\epsilon$. This pre-processed $X_{i}$ is then fed into the neural network, and temperature scaling with parameter $\tau$ is applied in the final softmax layer $C_{s}$. The two hyperparameters, $\epsilon$ and $\tau$, are chosen so that the OOD detection performance of the softmax score (Hendrycks and Gimpel, 2016), the maximum value of the softmax layer output, is optimized. We further modify ODIN and propose ODIN_low with parameters $\tau_{low}$ and $\epsilon_{low}$ that lead to the lowest softmax score performance. As subset scanning is applied not only on the softmax layer but also on the inner layers of the network, we show that ODIN_low helps improve OOD detection in the earlier layers of the network.
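The two ODIN steps can be sketched on a toy model. The code below is a simplified illustration, not the setup used in this work: a hypothetical linear softmax classifier stands in for the DenseNet so that the gradient can be written in closed form instead of back-propagated, and the weights, input, and parameter values are made up. ODIN and ODIN_low would differ only in the values passed for `epsilon` and `temperature`.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax."""
    z = logits / temperature
    z = z - np.max(z)  # numerical stability
    e = np.exp(z)
    return e / e.sum()


def odin_perturb(x, weights, bias, epsilon, temperature):
    """Step 1: move x along the sign of the gradient of the log max softmax."""
    probs = softmax(weights @ x + bias, temperature)
    predicted = int(np.argmax(probs))
    # Closed-form gradient of log softmax[predicted] w.r.t. x for a linear model.
    grad = (weights[predicted] - probs @ weights) / temperature
    return x + epsilon * np.sign(grad)


def odin_score(x, weights, bias, epsilon, temperature):
    """Step 2: max temperature-scaled softmax value of the perturbed input."""
    x_hat = odin_perturb(x, weights, bias, epsilon, temperature)
    return float(np.max(softmax(weights @ x_hat + bias, temperature)))


# Toy two-class model with identity weights (illustrative values only).
W = np.eye(2)
b = np.zeros(2)
x = np.array([1.0, 0.0])

baseline = odin_score(x, W, b, epsilon=0.0, temperature=10.0)
perturbed = odin_score(x, W, b, epsilon=0.05, temperature=10.0)
print(baseline, perturbed)
```

With `epsilon=0` the input is unchanged and only temperature scaling applies, which matches the $\epsilon=0$ setting reported in Section 5.2 for SD-198 samples. For this toy model a positive `epsilon` widens the margin between the two logits, so the score of the perturbed input is higher.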

3.3 Algorithmic Fairness of OOD detectors across skin tone

We further evaluate the algorithmic fairness of our proposed OOD detector across skin tones, estimated by adopting an existing framework (Kinyanjui et al., 2019). To this end, the non-diseased regions of a given skin image are segmented using Mask R-CNN (He et al., 2017), and individual typology angle (ITA) values are computed as $ITA=\arctan\left(\frac{L_{\mu}-50}{b_{\mu}}\right)\times\frac{180^{\circ}}{\pi}$, where $L_{\mu}$ and $b_{\mu}$ are the average luminance and yellow values of non-diseased pixels in CIELab space. ITA values are used to stratify the samples into three Fitzpatrick skin tone categories, Light, Intermediate, and Dark, as shown in Table 2.

ITA Range	Skin Tone Category
$ITA>41^{\circ}$	Light
$28^{\circ}<ITA\leq 41^{\circ}$	Intermediate
$ITA\leq 28^{\circ}$	Dark

Table 2: Summary of Fitzpatrick skin tone categorization of computed $ITA$ values.
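The ITA formula and the thresholds of Table 2 translate directly into code. The sketch below assumes the mean $L$ and $b$ values of the non-diseased pixels have already been computed (segmentation and colour conversion are not shown); the function names and the example inputs are hypothetical.

```python
import math


def ita_degrees(l_mean, b_mean):
    """Individual typology angle in degrees from mean CIELab L and b values."""
    if b_mean == 0:
        raise ValueError("b_mean must be non-zero")
    return math.atan((l_mean - 50.0) / b_mean) * 180.0 / math.pi


def skin_tone_category(ita):
    """Map an ITA value to the three categories of Table 2."""
    if ita > 41.0:
        return "Light"
    if ita > 28.0:
        return "Intermediate"
    return "Dark"


for l_mean, b_mean in [(70.0, 15.0), (60.0, 15.0), (55.0, 15.0)]:
    ita = ita_degrees(l_mean, b_mean)
    print(round(ita, 2), skin_tone_category(ita))
```

For these made-up inputs the angles are roughly 53.13, 33.69, and 18.43 degrees, which fall into the Light, Intermediate, and Dark categories respectively.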

4 Datasets

We validate the proposed framework using two datasets: ISIC 2019 (Codella et al., 2017; Tschandl et al., 2018; Combalia et al., 2019) for samples of unknown diseases; and SD-198 (Sun et al., 2016) for samples from unknown collection protocols. We stratify the samples from both datasets based on skin tones to observe the impact of various OOD methods across the population spectrum (see Figure 2).

4.1 ISIC 2019

The ISIC 2019 (Codella et al., 2017; Tschandl et al., 2018; Combalia et al., 2019) dataset is an extension of ISIC 2018 and merges the HAM10000 (Tschandl et al., 2018), BCN20000 (Combalia et al., 2019), and MSK (Codella et al., 2017) datasets. It consists of $25,331$ dermoscopic images among eight diagnostic categories: Melanoma, Melanocytic nevus, Basal cell carcinoma, Actinic keratosis, Benign keratosis, Dermatofibroma, Vascular lesion, and Squamous cell carcinoma. As its test set is not publicly available, we set aside Dermatofibroma (DF) and Vascular lesion (VASC) samples during training, and utilize them at test time as OOD samples of unknown diseases. These two classes are chosen as they contain the fewest samples in the dataset. The first row of Figure 2 shows example images of this dataset for each of the three skin tone categories we consider in this work.

4.2 SD-198

The SD-198 (Sun et al., 2016) dataset contains $198$ different diseases, including different types of eczema, acne, and various cancerous conditions, totalling $6,584$ images. The images are collected via various devices, mostly digital cameras and mobile phones, with higher levels of noise and varying illumination. We use this dataset for OOD samples that are collected from unknown protocols. The second row of Figure 2 shows example images of the dataset, stratified into the three skin tone categories, Light, Intermediate, and Dark.

5 Experimental setup

5.1 Skin disease model setup

We adopt DenseNet-121 (Huang et al., 2016) pre-trained on ImageNet (Deng et al., 2009) for the skin disease classification task and fine-tune it on ISIC 2019 (Codella et al., 2017). To accommodate the change in the number of classes for the skin disease classification task, we resize the last four fully connected layers of DenseNet to $512$, $256$, $128$, and $7$ nodes, followed by a softmax with $7$ nodes for the seven skin disease classes. We use Adam (Kingma and Ba, 2015) optimization with a learning rate of $10^{-4}$ and a batch size of $40$. To address the class imbalance problem, we employ a weighted cross-entropy loss. The implementation is done with Python 3.6 (Harris et al., 2020) and TensorFlow 1.14 (Abadi et al., 2016). To validate detection of unknown disease samples, we use the DF and VASC classes from ISIC 2019, consisting of $253$ and $225$ samples, respectively. Similarly, for samples with different collection protocols, we extract $10$ sets of $260$ samples from SD-198 and report their aggregate performance.
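The weighted cross-entropy loss mentioned above can be sketched as follows. The weighting scheme is not specified in the text, so inverse class frequency is used here purely as an assumption, as is the normalisation by the sum of sample weights; the function names and the toy counts are hypothetical.

```python
import numpy as np


def inverse_frequency_weights(class_counts):
    """Assumed scheme: weight_c = total / (num_classes * count_c)."""
    counts = np.asarray(class_counts, dtype=float)
    return counts.sum() / (counts.size * counts)


def weighted_cross_entropy(probs, labels, class_weights):
    """Cross-entropy where each sample is weighted by the weight of its class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    sample_weights = np.asarray(class_weights, dtype=float)[labels]
    picked = probs[np.arange(labels.size), labels]  # probability of the true class
    losses = -sample_weights * np.log(picked)
    return float(losses.sum() / sample_weights.sum())


# Toy two-class problem with a 1:3 class imbalance.
weights = inverse_frequency_weights([1, 3])
loss = weighted_cross_entropy([[0.5, 0.5], [0.5, 0.5]], [0, 1], weights)
print(weights, loss)
```

The rare class receives the larger weight, so errors on it contribute more to the loss than errors on the frequent class.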

5.2 Subset scanning setup

We apply subset scanning across eight layers $C_{Y}$ consisting of six convolutional layers $(C_{conv_{1}},...,C_{conv_{6}})$, the global pooling layer $(C_{gp})$, and the softmax layer $(C_{s})$. For ODIN (Liang et al., 2017), we use temperature scaling parameter $\tau=10$ and perturbation magnitude $\epsilon=0$ (optimized on ISIC 2019) for SD-198 samples, and $\tau=5$ and $\epsilon=0.0002$ (optimized on SD-198) for ISIC 2019 samples. For ODIN_low, we use $\tau_{low}=2$ and $\epsilon_{low}=0.2$, which lead to an AUROC of 0.5 for Softmax Score in both OOD use cases. We employ the Area Under the Receiver Operating Characteristic Curve (AUROC) and the maximum $F_{1}$-score ($F_{1}$) as our metrics to evaluate the OOD detection performance.
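Both metrics can be computed directly from the anomaly scores of ID and OOD samples. The sketch below is one possible implementation under stated assumptions: OOD is treated as the positive class, AUROC is computed from pairwise comparisons (ties counted as one half), and the maximum $F_{1}$ is found by sweeping the decision threshold over all observed scores. The function names and the toy scores are hypothetical.

```python
import numpy as np


def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random ID sample."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    diff = ood_scores[:, np.newaxis] - id_scores[np.newaxis, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def max_f1(id_scores, ood_scores):
    """Largest F1 over all thresholds, flagging a sample as OOD when score >= threshold."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    best = 0.0
    for threshold in np.unique(np.concatenate([id_scores, ood_scores])):
        tp = np.sum(ood_scores >= threshold)
        fp = np.sum(id_scores >= threshold)
        fn = ood_scores.size - tp
        if tp > 0:
            best = max(best, 2.0 * tp / (2.0 * tp + fp + fn))
    return float(best)


id_scores = [0.1, 0.2, 0.3]
ood_scores = [0.25, 0.8, 0.9]
print(auroc(id_scores, ood_scores), max_f1(id_scores, ood_scores))
```

On the toy scores, eight of the nine OOD/ID pairs are ordered correctly, giving an AUROC of 8/9, and the best threshold (0.25) gives an $F_{1}$ of 6/7. The tables in Section 6 report both metrics as percentages.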

6 Results

In this section, we show the results of the proposed OOD detector with subset scanning and ODIN as detailed in Section 3. We first compare our OOD detection results to Softmax Score (Hendrycks and Gimpel, 2016) and ODIN (Liang et al., 2017) in Table 3 for OOD samples with different collection protocols and in Table 4 for OOD samples with unknown disease types. We further stratify OOD samples based on skin tone for these approaches and report their performance in Table 5. We show in Figure 3 the detection performance of our proposed method on individual layers across our network and further stratify these performances across skin tone in Figure 4.

Methods	AUROC	$F_{1}$
Softmax Score (Hendrycks and Gimpel, 2016)	$74.4\pm 1.7$	${71.0\pm 1.1}$
ODIN (Liang et al., 2017)	${74.5\pm 1.6}$	$70.8\pm 1.1$
SS ( $C_{s}$ )	$68.2\pm 1.4$	$71.3\pm 0.5$
SS ( $C_{gp}$ )	$62.7\pm 1.2$	${72.5\pm 0.6}$
SS ( $C_{conv_{1}}$ )	$41.6\pm 1.8$	$68.1\pm 0.2$
SS ( $C_{s}$ )+ODIN	$51.2\pm 1.9$	$67.9\pm 0.3$
SS ( $C_{conv_{1}}$ )+ODIN_low	$85.4\pm 0.6$	$81.9\pm 0.6$
SS (Sum All Layers)+ODIN_low	$\mathbf{{91.0\pm 0.8}}$	$\mathbf{{86.9\pm 1.1}}$

Table 3: Detection performance for OOD samples of unknown collection protocols validated with SD-198 (Sun et al., 2016). Bold values are the best performers in each column.

6.1 OOD samples from a different protocol or equipment

We first show the results of detecting OOD samples that are collected with different protocols or equipment. Table 3 summarizes the results of the proposed approach, subset scanning (SS) with and without noise, compared with the existing baselines (Hendrycks and Gimpel, 2016; Liang et al., 2017). In the top panel, we see that ODIN (Liang et al., 2017) increases the AUROC of Softmax Score by only around 0.1 percentage points on average ($74.4$ to $74.5$). For samples with ODIN noise, we show the performance of subset scanning on the softmax layer $C_{s}$, as ODIN is optimized on Softmax Score, and for samples with ODIN_low noise, we show the result of subset scanning on the first convolutional layer ($C_{conv_{1}}$). We achieve the best performance, with an AUROC of $91.0\pm 0.8$ and a maximum $F_{1}$-score of $86.9\pm 1.1$, using the sum of subset scores $S^{*}_{(C_{y})}$ across all eight layers with ODIN_low (bottom row in Table 3).

Methods	AUROC		$F_{1}$	
	DF	VASC	DF	VASC
Softmax Score (Hendrycks and Gimpel, 2016)	80.9	73.2	76.5	70.5
ODIN (Liang et al., 2017)	72.3	65.3	70.3	67.4
SS ( $C_{s}$ )	80.8	70.8	75.7	72.3
SS ( $C_{gp}$ )	37.4	57.9	65.9	69.2
SS ( $C_{conv_{1}}$ )	50.9	62.5	65.8	68.7
SS ( $C_{s}$ )+ODIN	71.8	63.3	70.4	67.4
SS ( $C_{conv_{1}}$ )+ODIN_low	47.6	39.8	65.9	67.1
SS (Sum All Layers)+ODIN_low	47.6	40.4	65.9	67.2

Table 4: Performances of detecting OOD samples of unknown disease types, DF and VASC. Bold values are the best performers in each column.

6.2 OOD samples of unknown diseases

Table 4 shows the performance of detecting OOD samples of unknown diseases (DF and VASC) that are unseen during training. While Softmax Score (Hendrycks and Gimpel, 2016) yields the best performance, subset scanning on the softmax layer $C_{s}$ shows comparable performance. We see worse performances with ODIN as these OOD samples are from the same dataset as ID samples and adding noise likely blurs the unique features present in each skin disease class.

Methods	Skin Tone	Unknown diseases				Collection protocol
		DF		VASC		SD-198
		R	AUROC	R	AUROC	R	AUROC
Softmax Score (Hendrycks and Gimpel, 2016)	Light	171	81.0	185	72.1	986	75.8
	Intermediate	52	80.7	58	75.8	1278	73.7
	Dark	10	74.9	9	77.0	326	73.2
ODIN (Liang et al., 2017)	Light	171	71.6	185	64.0	986	76.2
	Intermediate	52	69.9	58	64.9	1278	73.8
	Dark	10	86.3	9	89.4	326	72.1
SS ( $C_{s}$ )	Light	171	78.6	185	70.7	986	68.3
	Intermediate	52	87.0	58	71.3	1278	68.0
	Dark	10	87.6	9	69.5	326	68.6
SS ( $C_{s}$ )+ODIN	Light	171	69.7	185	62.7	986	52.1
	Intermediate	52	73.8	58	63.1	1278	50.6
	Dark	10	88.2	9	74.5	326	50.9
SS ( $C_{conv_{1}}$ ) + ODIN_low	Light	171	45.1	185	38.8	986	83.1
	Intermediate	52	49.9	58	37.8	1278	86.7
	Dark	10	63.6	9	68.4	326	87.2
SS (Sum All Layers) + ODIN_low	Light	171	45.1	185	38.4	986	89.3
	Intermediate	52	51.8	58	40.0	1278	92.0
	Dark	10	56.2	9	78.7	326	92.3

Table 5: Performance of methods in Tables 3 and 4 stratified into three different skin tone categories. $R$ represents the number of OOD samples in each category.

6.3 Performance stratified by skin-tone

We further stratify the OOD samples into three skin tone categories and show the results in Table 5. In each set of columns, we include the number of test samples $R$ for each skin tone category and its corresponding AUROC performance. Samples of Dark skin tones constitute only around $3.9\%$ of DF and VASC samples and around $13\%$ of SD-198 samples. The majority of the method and dataset combinations (13 out of 18) show the highest detection performance on Dark OOD samples. This could be partially because the network is trained on the ISIC 2019 dataset, which heavily lacks samples of dark skin tones, and thus easily detects samples of dark skin tone as out of distribution. Overall, further investigation is required to clearly understand whether such performance reveals the lack of Dark samples in these datasets or variant manifestations of skin diseases in Dark skin.
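The quoted shares of Dark samples follow from the $R$ counts in Table 5. The short check below recomputes them; the counts are copied from the table, while the dictionary layout and the helper name are only illustrative.

```python
# OOD sample counts R per skin tone category, copied from Table 5.
counts = {
    "DF": {"Light": 171, "Intermediate": 52, "Dark": 10},
    "VASC": {"Light": 185, "Intermediate": 58, "Dark": 9},
    "SD-198": {"Light": 986, "Intermediate": 1278, "Dark": 326},
}


def dark_share_percent(dataset_names):
    """Percentage of Dark samples pooled over the given datasets."""
    dark = sum(counts[name]["Dark"] for name in dataset_names)
    total = sum(sum(counts[name].values()) for name in dataset_names)
    return 100.0 * dark / total


unknown_disease_share = dark_share_percent(["DF", "VASC"])
protocol_share = dark_share_percent(["SD-198"])
print(round(unknown_disease_share, 1), round(protocol_share, 1))
```

The pooled DF and VASC share is 19 of 485 samples, and the SD-198 share is 326 of 2590 samples, consistent with the rounded figures in the text.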

6.4 OOD detection across individual layers

Figure 3 shows the OOD detection performance, in terms of AUROC, of our proposed approach on the eight layers $C_{Y}$ of our pre-trained CNN that we consider. The first column shows the result of subset scanning without any added noise, and the other columns show the result of applying ODIN (Liang et al., 2017) and ODIN_low perturbations, respectively, to our test images before applying subset scanning. In each sub-plot, we show results of both use cases, i.e., detection of samples of unknown diseases (DF (yellow), VASC (green)) and samples from different protocols (SD-198 (red)). Overall, DF and VASC samples from the ISIC 2019 dataset have similar performance across the eight layers we consider, while samples from the SD-198 dataset lead to varying performance depending on the layer and ODIN parameters. This is partly because DF and VASC samples come from the same ISIC 2019 dataset as the training set, while SD-198 has a different distribution from the ISIC 2019 training set due to its different collection protocol. Comparing the last two plots, we see that standard ODIN leads to better performance near the end of the network, while ODIN_low leads to better performance in earlier layers of the network. This is expected, as the ODIN parameters ($\tau$ and $\epsilon$) are optimized on the Softmax Scores while the ODIN_low parameters, $\tau_{low}$ and $\epsilon_{low}$, are not.

We further stratify the performance of individual layers based on the skin tone represented in the samples and show the change in AUROC with the stratification in Figure 4. While the samples of Light (blue) and Intermediate (magenta) skin tones show consistent performance throughout the layers, we see varying performance for samples of Dark (cyan) skin tones. This instability for samples of Dark skin tones may be partially because the network is trained on the ISIC 2019 dataset, which heavily lacks samples of Dark skin tones.

7 Conclusion

We propose a weakly-supervised method to detect out-of-distribution (OOD) skin images (collected in different protocols or from unknown disease types) using input perturbation and scanning of the activations in the intermediate layers of a pre-trained off-the-shelf classifier. The scanning of activations is optimised as a search problem to identify the nodes in a layer that result in maximum divergence of the activations of a subset of test samples from the expected activations derived from the training (in-distribution) samples. We exploit the Linear Time Subset Scanning (LTSS) (Neill, 2012) property of subset scanning to achieve an efficient search that scales linearly with the number of nodes in a layer. Our proposed method improves on the state-of-the-art detection for OOD samples that are collected from a different protocol or equipment than the in-distribution samples used to train the classifier, and it achieves competitive performance with the state-of-the-art in detecting samples of unknown diseases. We further stratify these OOD samples based on three skin tone categories, Light, Intermediate, and Dark. From our results, we observe imbalanced detection performance across skin tones, where the Dark samples are detected as OOD with higher performance. Thus, future work aims to understand the reasons for such detection disparity across skin tones, e.g., lack of training representation or different manifestation of skin diseases.

References

Codella et al. [2017] Noel C. F. Codella, David Gutman, M. Emre Celebi, Brian Helba, Michael A. Marchetti, Stephen W. Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin K. Mishra, Harald Kittler, and Allan Halpern. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (ISIC). CoRR, abs/1710.05006, 2017. URL http://arxiv.org/abs/1710.05006.
Huang et al. [2016] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016. URL http://arxiv.org/abs/1608.06993.
Esteva et al. [2017] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
Gomolin et al. [2020] Arieh Gomolin, Elena Netchiporouk, Robert Gniadecki, and Ivan V Litvinov. Artificial intelligence applications in dermatology: where do we stand? Frontiers in medicine, 7, 2020.
Adamson and Smith [2018] Adewole S Adamson and Avery Smith. Machine learning and health care disparities in dermatology. JAMA dermatology, 154(11):1247–1248, 2018.
Qayyum et al. [2020] Adnan Qayyum, Junaid Qadir, Muhammad Bilal, and Ala Al-Fuqaha. Secure and robust machine learning for healthcare: A survey. arXiv preprint arXiv:2001.08103, 2020.
Mahbod et al. [2020] Amirreza Mahbod, Gerald Schaefer, Chunliang Wang, Georg Dorffner, Rupert Ecker, and Isabella Ellinger. Transfer learning using a multi-scale and multi-network ensemble for skin lesion classification. Computer Methods and Programs in Biomedicine, 193:105475, 03 2020.
Gessert et al. [2018] Nils Gessert, Thilo Sentker, Frederic Madesta, Rüdiger Schmitz, Helge Kniep, Ivo M. Baltruschat, René Werner, and Alexander Schlaefer. Skin lesion diagnosis using ensembles, unscaled multi-crop evaluation and loss weighting. CoRR, abs/1808.01694, 2018. URL http://arxiv.org/abs/1808.01694.
Ahmed et al. [2019] Sara Atito Ali Ahmed, Berrin Yanikoglu, Erchan Aptoula, and Ozgu Goksu. Skin lesion classification with deep learning ensembles in ISIC 2019. 2019.
Tschandl et al. [2018] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset: A large collection of multi-source dermatoscopic images of common pigmented skin lesions. CoRR, abs/1803.10417, 2018. URL http://arxiv.org/abs/1803.10417.
Combalia et al. [2019] Marc Combalia, Noel C. F. Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer Reiter, Cristina Carrera, Alicia Barreiro, Allan C. Halpern, Susana Puig, and Josep Malvehy. BCN20000: Dermoscopic lesions in the wild. 2019.
Sun et al. [2016] Xiaoxiao Sun, Jufeng Yang, Ming Sun, and Kai Wang. A benchmark for automatic visual classification of clinical skin disease images. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 206–222, Cham, 2016. Springer International Publishing.
Gessert et al. [2019] Nils Gessert, Maximilian Nielsen, Mohsin Shaikh, René Werner, and Alexander Schlaefer. Skin lesion classification using loss balancing and ensembles of multi-resolution efficientnets. 2019.
Zhang et al. [2019] Pengyi Zhang, Yunxin Zhong, and Xiaoqiong Li. Melanet: A deep dense attention network for melanoma detection in dermoscopy images. 2019.
[15] Dermatology faces a reckoning: Lack of darker skin in textbooks and journals harms care for patients of color.
[16] Dermatology has a problem with skin color.
Codella et al. [2019] Noel C. F. Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen W. Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael A. Marchetti, Harald Kittler, and Allan Halpern. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC). CoRR, abs/1902.03368, 2019. URL http://arxiv.org/abs/1902.03368.
Kinyanjui et al. [2019] Newton M. Kinyanjui, Timothy Odonga, Celia Cintas, Noel C. F. Codella, Rameswar Panda, Prasanna Sattigeri, and Kush R. Varshney. Estimating skin tone and effects on classification performance in dermatology datasets. 2019.
Liang et al. [2017] Shiyu Liang, Yixuan Li, and R. Srikant. Principled detection of out-of-distribution examples in neural networks. CoRR, abs/1706.02690, 2017. URL http://arxiv.org/abs/1706.02690.
Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. CoRR, abs/1610.02136, 2016. URL http://arxiv.org/abs/1610.02136.
Bagchi et al. [2020] Subhranil Bagchi, Anurag Banerjee, and Deepti R. Bathula. Learning a meta-ensemble technique for skin lesion classification and novel class detection. In CVPR Workshops, June 2020.
Pacheco et al. [2019] Andre G. C. Pacheco, Abder-Rahman Ali, and Thomas Trappenberg. Skin cancer detection based on deep learning and entropy to detect outlier samples, 2019.
Combalia et al. [In Press] Marc Combalia, Ferran Hueto, Susana Puig, Josep Malvehy, and Verónica Vilaplana. Uncertainty estimation in deep neural networks for dermoscopic image classification. In CVPR 2020, ISIC Skin Image Analysis Workshop, 2020 In Press.
Pacheco et al. [2020] Andre G. C. Pacheco, Chandramouli S. Sastry, Thomas Trappenberg, Sageev Oore, and Renato A. Krohling. On out-of-distribution detection algorithms with deep neural skin cancer classifiers. In CVPR Workshops, June 2020.
Shannon [1948] Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423, 1948. URL http://dblp.uni-trier.de/db/journals/bstj/bstj27.html#Shannon48.
Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pages 1050–1059. JMLR.org, 2016.
Sastry and Oore [2019] Chandramouli Shama Sastry and Sageev Oore. Detecting out-of-distribution examples with in-distribution examples and gram matrices. 2019.
Cintas et al. [2020] Celia Cintas, Skyler Speakman, Victor Akinwande, William Ogallo, Komminist Weldemariam, Srihari Sridharan, and Edward McFowland. Detecting adversarial attacks via subset scanning of autoencoder activations and reconstruction error. In IJCAI 2020, 2020.
Neill [2012] Daniel B. Neill. Fast subset scan for spatial pattern detection, 2012.
McFowland et al. [2013] Edward McFowland, Skyler Speakman, and Daniel B. Neill. Fast generalized subset scan for anomalous pattern detection. J. Mach. Learn. Res., 14(1):1533–1561, January 2013. ISSN 1532-4435.
He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017. URL http://arxiv.org/abs/1703.06870.
Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
Harris et al. [2020] Charles R Harris, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
Abadi et al. [2016] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283, 2016. URL https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
