Homophone Reveals the Truth: A Reality Check for Speech2Vec

Guangyu Chen
Renmin University of China
hcs@ruc.edu.cn
Abstract

Generating spoken word embeddings that possess semantic information is a fascinating topic. Compared with text-based embeddings, they cover both phonetic and semantic characteristics, which can provide richer information and are potentially helpful for improving ASR and speech translation systems. In this paper, we review and examine the authenticity of a seminal work in this field: Speech2Vec. First, a homophone-based inspection method is proposed to check the speech embeddings released by the author of Speech2Vec. There is no indication that these embeddings are generated by the Speech2Vec model. Moreover, through further analysis of the vocabulary composition, we suspect that these embeddings were fabricated with a text-based model. Finally, we reproduce the Speech2Vec model, referring to the official code and optimal settings in the original paper. Experiments show that this model fails to learn effective semantic embeddings. On word similarity benchmarks, it obtains a correlation score of 0.08 on MEN and 0.15 on WS-353-SIM, which is over 0.5 lower than the scores reported in the original paper. Our data and code are available at https://github.com/my-yy/s2v_rc.

1 Introduction

Word embeddings are fixed-length representations of words that carry syntactic and semantic information and have become a building block of natural language processing tasks, such as named entity recognition [24] and question answering [23]. Speech, as another carrier of information, also reflects semantic meaning, which implies the possibility of directly using it to learn semantic word representations. To this end, some attempts have been made to generate semantic speech embeddings [7, 18, 5, 6]. Compared with learning text-based embeddings, this is a much more difficult task, because there are intrinsic interferences between phonetics and semantics [5]. For example, the words ‘ate’ and ‘eight’ have almost the same pronunciation but are semantically different. If the model cannot distinguish these sound-alike inputs, it is impossible to obtain effective semantic representations.

Figure 1: The Speech2Vec model (skip-gram style).

In this paper, we focus on Speech2Vec [6, 7], one of the field’s earliest and most influential studies. It can be viewed as a speech version of Word2Vec [17], which leverages the skip-gram and CBOW strategies to learn semantic speech embeddings. Despite being an imitator of Word2Vec, it seems to have magically resolved the phonetic-semantic interference mentioned above and has been reported to exceed Word2Vec on 13 word similarity benchmarks [10]. This phenomenon aroused our interest in exploring the validity of this study. Specifically, we analyze the authenticity of Speech2Vec from two perspectives: the officially released speech embeddings and our reproduction results. The conclusions can be summarized as follows:

1) The official speech embeddings fail to reflect speech characteristics, implying that they are not generated by the Speech2Vec model.

2) Analysis of the vocabulary composition reveals that these released embeddings may come from a text-based model.

3) Our reproduction results show that the Speech2Vec model failed to generate valid semantic speech embeddings. Even after 500 epochs of training, its performance is still negligible in most benchmark metrics.

2 Recap Speech2Vec

Speech2Vec can be regarded as a speech version of Word2Vec, which relies on the skip-gram or CBOW strategy to learn semantic speech embeddings. Herein, we focused on the skip-gram style of Speech2Vec, which is reported to have better performance. Fig. 1 shows its structure.

During training, each input speech sentence is segmented into a sequence of spoken words in advance. The encoder network then converts the current spoken word into an embedding vector, and this vector is used for decoding the context words. In the corpus, the audio segments corresponding to the same text word appear multiple times (with different speakers, environments, emotions, or talking speeds). To obtain a deterministic word representation, the average of all embeddings corresponding to the same text word is taken as the final embedding.
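The averaging step can be illustrated with a minimal sketch. It is not the official implementation: the function name `average_by_word` and the toy vectors are hypothetical, and the per-segment embeddings are assumed to have already been produced by an encoder.

```python
from collections import defaultdict


def average_by_word(segments):
    """Average per-segment embeddings that share the same text word.

    `segments` is an iterable of (word, vector) pairs, one per audio segment.
    Returns a dict mapping each word to the mean of its segment vectors.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in segments:
        if word not in sums:
            sums[word] = [0.0] * len(vec)
        sums[word] = [s + v for s, v in zip(sums[word], vec)]
        counts[word] += 1
    return {w: [s / counts[w] for s in total] for w, total in sums.items()}


# Toy data (made up): two utterances of "sea" and one of "deer".
toy_segments = [
    ("sea", [1.0, 0.0]),
    ("sea", [0.0, 1.0]),
    ("deer", [2.0, 2.0]),
]
word_embeddings = average_by_word(toy_segments)
```

Because the final vector is a mean over all utterances of a word, speaker and recording variation is smoothed out, while anything shared by every utterance (such as pronunciation) is preserved.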

The official code of Speech2Vec is not publicly available, but the pre-trained speech embeddings are released on GitHub (https://github.com/iamyuanchung/speech2vec-pretrained-vectors). These embeddings are described as coming from a skip-gram style Speech2Vec model trained on 500 hours of the LibriSpeech [19] dataset:

… In this repository, we release the speech embeddings of different dimensionalities learned by Speech2Vec using skip-grams as the training methodology. The model is trained on a corpus consisting of about 500 hours of speech from LibriSpeech (the clean-360 + clean-100 subsets) …

We further provide a fork of this repository (https://github.com/my-yy/speech2vec-pretrained-vectors).

3 Reality Check

This section aims to check the authenticity of the released speech embeddings. We analyze them from two perspectives: phonetic characteristics and vocabulary composition.

3.1 Homophone Inspection

Homophones are words that have the same pronunciation but different meanings (e.g., ‘ate, eight’, ‘dear, deer’, and ‘whine, wine’). Since these spoken words are identical in pronunciation, a speech recognition system usually needs the speech context and a language model [3] to distinguish them. For the skip-gram style Speech2Vec, the input is a single spoken word that carries no context information. Therefore, this model should not be able to distinguish homophones effectively, and the embedding similarity of a homophone pair should be higher than that of an arbitrary pair. We used 307 homophone pairs (sourced from http://www.singularis.ltd.uk/bifroest/misc/homophones-list.html) and calculated their similarities. The results are shown in Tab. 1.

| Embedding | Dim | Pair Similarity: Homophone | Pair Similarity: Random |
|---|---|---|---|
| Official | 50 | 0.33±0.15 | 0.39±0.17 |
| Official | 100 | 0.28±0.14 | 0.35±0.15 |
| Official | 200 | 0.23±0.12 | 0.31±0.14 |
| Official | 300 | 0.21±0.11 | 0.29±0.14 |
| Ours | 50 | 0.95±0.05 | 0.77±0.10 |

Table 1: Cosine similarities of word pairs (mean ± std). Random: Similarities between 307 randomly selected pairs. Official: Embeddings released by the author of Speech2Vec. Ours: Embeddings generated by our reproduced model (described in detail in Sec. 4).

As we can see, the homophone similarities of the official embeddings are counterintuitively lower than those of random pairs. This suggests that the official model is not affected by phonetic similarity and can successfully distinguish these highly similar inputs. However, our reproduced model did not ‘show the magic’. Although it had a similarity of 0.77 for random pairs (which implies the embedding space is ill-distributed), it still obtained a higher similarity of 0.95 for the homophone pairs.
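The inspection itself is simple to express in code. The sketch below is illustrative rather than our released implementation: the function names and toy vectors are hypothetical, pairs with a word outside the vocabulary are skipped, and the population standard deviation is an assumption.

```python
import math
import random
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pair_similarity_stats(embeddings, pairs):
    """Return (mean, std, n_used) over the pairs whose words are both in vocab."""
    sims = [cosine(embeddings[a], embeddings[b])
            for a, b in pairs if a in embeddings and b in embeddings]
    return statistics.mean(sims), statistics.pstdev(sims), len(sims)


def random_pairs(vocab, n, seed=0):
    """Draw n pairs of distinct words uniformly from the vocabulary."""
    rng = random.Random(seed)
    words = sorted(vocab)
    return [tuple(rng.sample(words, 2)) for _ in range(n)]


# Toy embeddings (made up): homophones point in the same direction.
toy = {"ate": [1.0, 0.0], "eight": [2.0, 0.0],
       "dear": [0.0, 1.0], "deer": [0.0, 3.0]}
homophones = [("ate", "eight"), ("dear", "deer"), ("whine", "wine")]
homo_mean, homo_std, n_used = pair_similarity_stats(toy, homophones)
baseline = random_pairs(toy, 5)
```

Comparing the homophone statistics with those of the random baseline is the whole test: a speech-driven encoder should place homophones closer together than arbitrary word pairs.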

| Homophone Pair | Raw | Official | Ours |
|---|---|---|---|
| ate, eight | 1.00 | 0.22 | 1.00 |
| hail, hale | 0.99 | 0.34 | 0.97 |
| know, no | 1.00 | 0.63 | 0.97 |
| made, maid | 1.00 | 0.25 | 0.98 |
| meat, meet | 1.00 | 0.13 | 0.99 |
| peace, piece | 1.00 | 0.15 | 0.99 |
| sea, see | 1.00 | 0.31 | 0.98 |
| steal, steel | 1.00 | 0.19 | 0.99 |
| tale, tail | 1.00 | 0.23 | 1.00 |
| whine, wine | 1.00 | 0.13 | 0.97 |
| MEAN | 0.99 | 0.32 | 0.95 |

Table 2: Examples of homophone similarities. Official stands for the official 50-dimensional embeddings. The appendix gives the full 307 results.

| Word | Official: K Nearest Neighbors | Official: HR | Ours: K Nearest Neighbors | Ours: HR |
|---|---|---|---|---|
| ate | eating, drank, eat | 25,099 | eight, agent, aid | 1 |
| hail | thunder, blast, thunders | 15,253 | hell, handle, held | 9 |
| know | tell, think, why | 2,305 | narrow, natural, no | 3 |
| made | making, gave, make | 35,582 | me, mid, necessity | 15 |
| meat | roasting, roasted, venison | 33,108 | meet, mate, mute | 1 |
| peace | tranquility, liberty, deliverer | 33,609 | plates, place, patience | 7 |
| sea | ocean, shore, waters | 18,610 | city, safety, succeed | 100 |
| steal | flay, stealing, kill | 29,506 | still, steel, special | 2 |
| tale | story, adventure, tales | 28,617 | tail, title, tackle | 1 |
| whine | snarling, growls, growl | 34,575 | watering, washing, wide | 4 |
| MEAN | | 25,626 | | 14 |

Table 3: The k nearest neighbor results (k=3). HR: Homophone ranking.

Tab. 2 further gives some pair examples. In this table, we introduce a ‘Raw’ metric to measure the similarity of audio input. This metric is the cosine similarity of extracted features from a pre-trained wav2vec [21] model.

Most homophone pairs have a Raw score close to 1.0, which means that the audio of these homophone words is highly similar and that the wav2vec model fails to distinguish them. Our reproduced model behaves similarly, with an average similarity of 0.95. The official embeddings, however, give much lower similarities. Taking the ‘ate, eight’ pair as an example, the Raw and Ours similarities are both 1.00, while the official embeddings give a similarity of 0.22.

The k-nearest neighbor results (Tab. 3) further reveal the anomaly of the official embeddings. Although the official nearest neighbors reflect similar semantics, they differ considerably in pronunciation. For example, the nearest neighbors of ‘hail’ are ‘thunder, blast, thunders’, while its homophone counterpart (‘hale’) is ranked as low as 15,253rd. In contrast, the homophone rank in our reproduction is 9.
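The homophone ranking (HR) can be computed as in the following sketch. It assumes that HR is the 1-based position of the homophone in the list of all other vocabulary words sorted by decreasing cosine similarity; the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def neighbors_ranked(embeddings, query):
    """All other words, sorted by decreasing cosine similarity to `query`."""
    q = embeddings[query]
    others = [w for w in embeddings if w != query]
    return sorted(others, key=lambda w: cosine(q, embeddings[w]), reverse=True)


def homophone_rank(embeddings, word, homophone):
    """1-based position of `homophone` among the neighbors of `word`."""
    return neighbors_ranked(embeddings, word).index(homophone) + 1


# Toy spaces (made up): one organised by sound, one organised by meaning.
speech_like = {"hail": [1.0, 0.0], "hale": [0.9, 0.1],
               "thunder": [0.0, 1.0], "blast": [0.1, 1.0]}
text_like = {"hail": [1.0, 0.0], "thunder": [0.9, 0.1],
             "blast": [0.8, 0.3], "hale": [0.0, 1.0]}

rank_speech = homophone_rank(speech_like, "hail", "hale")
rank_text = homophone_rank(text_like, "hail", "hale")
top3_text = neighbors_ranked(text_like, "hail")[:3]
```

In the sound-organised toy space the homophone is the nearest neighbor, whereas in the meaning-organised one it falls to the bottom of the list, which mirrors the contrast between the Ours and Official columns.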

The anomalous behavior of the official embeddings can be explained by a bold hypothesis: the official embeddings come from a text-based model. In a text-based model (like Word2Vec), words with different spellings are assigned different token IDs. During learning, the model does not rely on pronunciation to identify words and is therefore not disturbed by it. Moreover, because the words in a homophone pair have different meanings, their similarity may be even lower than that of random pairs. This explains why the official results in Tab. 1 show lower similarities for homophone pairs than for random pairs.

3.2 Vocabulary Analysis

3.2.1 Checking the Released Vocabulary

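| Corpus | Division | Hours | Hours (total) | Speakers | Speakers (total) |
|---|---|---|---|---|---|
| Clean | train-clean-100 | 100.6 | 475 | 251 | 1,252 |
| Clean | train-clean-360 | 363.6 | | 921 | |
| Clean | dev-clean | 5.4 | | 40 | |
| Clean | test-clean | 5.4 | | 40 | |
| Other | train-other-500 | 496.7 | 507 | 1,166 | 1,232 |
| Other | dev-other | 5.3 | | 33 | |
| Other | test-other | 5.1 | | 33 | |

Table 4: The statistics of the LibriSpeech corpus.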

The Speech2Vec paper claimed that training is based on 500 hours of LibriSpeech [19], which contains 1,252 speakers. We refer to this dataset as Libri-Clean. Its statistics are shown in Tab. 4. We counted the distinct words in Libri-Clean’s transcripts and compared them with the official vocabulary. The results are shown in Tab. 5.

| | Official | Libri-Clean (all words) | Libri-Clean (count ≥ 5) | Libri-All (count ≥ 5) |
|---|---|---|---|---|
| Vocab | 37,622 | 66,721 | 27,454 | 37,622 |
| Missing | - | 1,685 | 10,168 | 0 |

Table 5: The vocabulary composition.

We found that about 1.7k official words (1,685) are not in Libri-Clean. If we only consider frequent words with an occurrence count ≥ 5, this number rises to about 10k. However, if we use the entire LibriSpeech corpus and filter out infrequent words, the vocabulary is exactly the same as the official one. This shows that the released embeddings are derived from the whole LibriSpeech corpus, not the claimed ‘500 hours’ of data.
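The comparison amounts to building a vocabulary from the transcripts and taking a set difference. The sketch below is a hedged illustration: `build_vocab`, `missing_words`, and the `min_count` parameter are hypothetical names, the transcripts are assumed to be plain lines of whitespace-separated words with utterance IDs already stripped, and lowercasing is an assumption.

```python
from collections import Counter


def build_vocab(transcripts, min_count=1):
    """Distinct words that occur at least `min_count` times in the transcripts."""
    counts = Counter(w for line in transcripts for w in line.lower().split())
    return {w for w, c in counts.items() if c >= min_count}


def missing_words(official_vocab, corpus_vocab):
    """Words of the official vocabulary that the corpus vocabulary lacks."""
    return set(official_vocab) - set(corpus_vocab)


# Toy data (made up) standing in for transcripts and the released word list.
toy_transcripts = ["THE SEA IS CALM", "THE SEA THE SHORE"]
toy_official = {"the", "sea", "ocean"}

full_vocab = build_vocab(toy_transcripts)
frequent_vocab = build_vocab(toy_transcripts, min_count=2)
missing = missing_words(toy_official, frequent_vocab)
```

A released vocabulary that is a subset of the training corpus gives an empty `missing` set; any non-empty result means some released words cannot have been seen during the claimed training run.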

In addition, we believe that these embeddings are generated by a text-based model that directly uses the text transcripts for training. The reason is that the LibriSpeech dataset needs to be pre-processed before a speech model can be trained on it: each speech sentence must be aligned with its transcript to locate word boundaries. This alignment process is not perfect and may cause data loss.

In our analysis, we used alignment results produced with the widely used Montreal Forced Aligner [16] (https://github.com/CorentinJ/librispeech-alignments). It aligned the entire corpus of 292,367 speech sentences and encountered 127 failures (a 0.04% failure rate). Removing these unaligned utterances changes the corpus and causes 17 additional words to go missing from the vocabulary. Therefore, unless the authors of Speech2Vec achieved a perfect alignment, their vocabulary size would not equal exactly 37,622. Since the released vocabulary matches exactly, either their alignment was perfect or their embeddings come from a text-based model.

| Benchmark | Not Found Pairs: Paper Data | Not Found Pairs: Libri-Clean |
|---|---|---|
| MEN [4] | 122 | 231 |
| MTurk-287 [20] | 13 | 29 |
| MTurk-771 [12] | 22 | 43 |
| Rare-Word [15] | 783 | 1,027 |
| SimLex-999 [13] | 0 | 11 |
| SimVerb-3500 [11] | 126 | 118 |
| Verb-143 [2] | 0 | 5 |
| WS-353 [26] | 21 | 34 |
| WS-353-REL [1] | 12 | 24 |
| WS-353-SIM [1] | 7 | 16 |
| YP-130 [26] | 0 | 9 |

Table 6: Not found pairs in word similarity benchmarks.

3.2.2 Checking The Paper Data

Although the vocabulary size is not mentioned in the Speech2Vec paper, we can still find some clues in the word similarity benchmark results. In the original paper, these benchmarks are used to evaluate the semantic information carried by the embeddings. Each benchmark contains a series of predefined word pairs, each with a reference (human-rated) similarity score. In an early version of Speech2Vec [6], the author reported the number of ‘not found’ pairs in these benchmarks (Tab. 6).

Even if we use all 66,721 words in Libri-Clean to construct the vocabulary, we still obtain more not found pairs than reported on these tests (except for SimVerb-3500). For example, the original paper reported 122 not found pairs on the MEN test, but we found 231. This indicates that the authors of Speech2Vec used a corpus larger than the claimed 500-hour Libri-Clean.
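Counting not found pairs is a simple vocabulary lookup. The sketch below assumes that a pair counts as not found when at least one of its two words is absent from the vocabulary; the function name and toy data are hypothetical.

```python
def count_not_found(pairs, vocab):
    """Number of benchmark pairs with at least one out-of-vocabulary word."""
    return sum(1 for a, b in pairs if a not in vocab or b not in vocab)


# Toy benchmark pairs and two toy vocabularies (made up).
toy_pairs = [("sea", "ocean"), ("car", "automobile"), ("tale", "story")]
small_vocab = {"sea", "ocean", "tale", "car"}
large_vocab = small_vocab | {"story", "automobile"}

not_found_small = count_not_found(toy_pairs, small_vocab)
not_found_large = count_not_found(toy_pairs, large_vocab)
```

Because the count can only decrease as the vocabulary grows, a reported count below the one obtained from the full Libri-Clean word list points to a larger training vocabulary.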

4 Our Reproduction

4.1 Reproduction Details

4.1.1 Code Implementation

Our reproduction is based on the incomplete code provided by the author of Speech2Vec. This code covers the model implementation and training logic but lacks the data pre-processing. We completed the missing parts and trained Speech2Vec with the optimal settings reported in the original paper. Specifically, the embedding dimension is set to 50 and the window size to 3. The encoder is a single-layer bidirectional LSTM, and the decoder is a single-layer unidirectional LSTM. During decoding, an attention mechanism [14] is used to reference the encoder’s outputs (Fig. 5 in the appendix gives a detailed overview of the structure).

4.1.2 Training Details

In accordance with the Speech2Vec paper, the Libri-Clean dataset is used for training. Each speech sentence is segmented into spoken words in advance. We then filter out low-frequency words that appear fewer than four times, which gives a vocabulary of 30,788 words. Each spoken word is converted into a sequence of 13-dimensional MFCC frames.

We set the batch size to 4096 and trained the model for 500 epochs using an SGD optimizer with a learning rate of 0.001. Training took 8.4 days on a machine with an AMD 3900XT CPU and an RTX 3090 GPU. The model is saved after every epoch, and the model from the 500th epoch is used for evaluation.

Figure 2: Comparisons on word similarity benchmarks. Paper Data: The results claimed in the paper of Speech2Vec [7]. Rand Init: The randomly initialized model without training. 500-Epoch: Train the model for 500 epochs.

Figure 3: The learning curve. X-axis: training epochs. Y-axis: Spearman coefficient (except for Loss).

4.2 Results & Analysis

As in the original paper, we tested our reproduced model on 13 similarity benchmarks [10]. The evaluation score is Spearman’s rank correlation coefficient ρ. It ranges over [-1,1], and a larger value represents a better result. The results are shown in Fig. 2.
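For reference, Spearman’s ρ is the Pearson correlation of the rank-transformed scores. The following dependency-free sketch illustrates the computation with tie-averaged ranks; it is not the evaluation code of [10], and the function names and toy scores are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores (made up): human ratings vs. model cosine similarities.
human = [9.0, 7.5, 3.0, 1.0]
model_good = [0.9, 0.6, 0.4, 0.1]
model_reversed = [0.1, 0.4, 0.6, 0.9]
rho_good = spearman(human, model_good)
rho_reversed = spearman(human, model_reversed)
```

Only the ordering of the model’s similarities matters, so a model scores well whenever it ranks the benchmark pairs in roughly the same order as the human raters.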

The reproduced model gives trivial results on these benchmarks. Compared with random initialization, the 500-epoch model shows only minor improvements, and it even gets worse on SimLex-999 and SimVerb-3500. Moreover, compared with the results claimed in the paper, there is a substantial performance drop, exceeding 0.5 on some benchmarks (MC-30, MEN, RG-65, WS-353-SIM).

We further depict the training process in Fig. 3. As the loss decreases, most indicators show a rising trend. However, the increase is slow and is ultimately confined to a small range (e.g., from 0 to 0.022 for MTurk-287). More training epochs may give better results on these metrics, but the number of epochs would need to be much larger than the claimed ‘500 epochs’. Moreover, two indicators (WS-353-REL and MEN) seem to have passed their peak, with performance turning from rising to falling, and their peak performance is still lower than that reported in the original paper. Two other indicators (SimLex-999 and SimVerb-3500) decline consistently, which means that the required semantics have not been learned successfully.

As mentioned in Sec. 2, each text word corresponds to multiple speech segments. In Fig. 4, we visualize the embedding distribution of specific words using multi-dimensional scaling (MDS) [25]. Compared with the popular t-SNE, MDS preserves distances and global structure.

As we can see, the homophone words are mixed, while words with different pronunciations tend to be separated. This result confirms that Speech2Vec’s encoder cannot successfully distinguish homophones. Moreover, we noticed some mixed points for the ‘need-thus’ pair. This shows that the model discriminates poorly even between pairs that do not sound alike, which may be caused by the interference between phonetics and semantics.

Figure 4: MDS visualization of speech word embeddings.

5 Related Works

5.1 Works on Learning Semantic Speech Embeddings

In the five years since Speech2Vec was proposed, little work has appeared in the field of learning semantic speech embeddings. Speech2Vec has received 158 citations so far, but no article has directly compared against it, and no subsequent work has proved its validity.

5.1.1 Chen et al.’s Work

Chen et al. [5] pointed out that when learning semantic speech embeddings, the phonetics and semantics inevitably disturb each other. Therefore, they designed a two-stage learning scheme to reduce the learning difficulty. In the first stage, an encoder-decoder structure generates phonetic embeddings in which the speaker characteristics are disentangled. In the second stage, these embeddings possess semantic meaning through a context prediction task. Experiments have verified their embeddings’ phonetic and semantic characteristics. However, this work is not compared with Speech2Vec on the similarity benchmarks.

5.1.2 CAWE

CAWE [18] is an encoder-decoder-based speech recognition model, which relies on an attention mechanism to automatically segment speech sentences into spoken words and generate speech embeddings. This model is tested on 16 sentence evaluation tasks and shows results competitive with Word2Vec. Unlike the skip-gram style Speech2Vec, CAWE can leverage context information. Thus its effectiveness cannot indirectly prove Speech2Vec’s validity. Moreover, we have not found any successful third-party replications of CAWE.

5.1.3 Audio2Vec

Audio2Vec [22] is a CNN-based model which uses skip-gram, CBOW, and temporal-gap strategies to generate fixed-length audio representations. Although it has a similar training style to Speech2Vec, the goal of Audio2Vec is to produce general-purpose audio representations. In the training process, the speech sentence is segmented into fixed-length slices without reference to word boundaries. Thus this model was not tested on the word similarity benchmarks and not compared with Speech2Vec.

5.2 Works Directly Based on Speech2Vec Embedding

If the experimental data is questionable, the validity of the conclusions cannot be guaranteed. We note that some works [8, 9, 27] are directly based on the embeddings of Speech2Vec. Among them, two works come from the authors of Speech2Vec [8, 9]; they align the speech space and the text space to realize cross-language speech-to-text translation.

Yi et al. [27] reported that they successfully improved punctuation prediction performance by involving the released Speech2Vec embeddings in their input. But they did not explore how these speech embeddings work. We think their results are authentic but unintentionally provide a wrong interpretation. As we have analyzed, these Speech2Vec embeddings cannot reflect phonetic information. Thus their improvements come from involving more input features. And it has nothing to do with whether these features come from a text-based or speech-based model. Therefore, we believe that the above works cannot provide evidence for the authenticity of Speech2Vec.

5.3 Replication Attempts by Other Researchers

We investigated the replication attempts by other researchers, including Jang’s (https://github.com/yjang43/Speech2Vec) and Wang’s (https://github.com/ZhanpengWang96/pytorch-speech2vec). Like us, they failed to achieve the results claimed in Speech2Vec’s paper. Jang reported that the model gets trivial results. Wang’s reproduced model is also weak, obtaining only ρ=0.08 on WS-353-REL and ρ=0.15 on YP-130.

6 Discussion

“If I have seen further, it is by standing on the shoulders of Giants.” (Isaac Newton, 1675)

Scientific progress builds on the innovations of predecessors. If an influential innovation is falsified, it is not ‘the shoulder of a giant’. Instead, it becomes a stumbling block and brings huge obstacles to developing the field. We think science is not a kind of religion, especially when it comes to computer science. The authenticity of innovation should depend on reproducibility rather than the reputation of the author, institution, or paper grade.

In 2018, a master’s student was assigned to begin his academic research in an emerging field. Unluckily, he put his faith in a bogus study and went down the wrong road. He was passionate about it, striving for years to achieve an impossible goal, and ended up with nothing. Four years after setting out, he reflected on the failures of his life and wrote this article. The value of this study is to prevent others from repeating his past.

References

  • [1] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. A study on similarity and relatedness using distributional and wordnet-based approaches. 2009.
  • [2] Simon Baker, Roi Reichart, and Anna Korhonen. An unsupervised model for instance level subcategorization acquisition. In EMNLP, pages 278–289, 2014.
  • [3] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. Advances in neural information processing systems, 13, 2000.
  • [4] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, 2012.
  • [5] Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 941–948. IEEE, 2018.
  • [6] Yu-An Chung and James Glass. Learning word embeddings from speech. arXiv preprint arXiv:1711.01515, 2017.
  • [7] Yu-An Chung and James Glass. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. In INTERSPEECH, 2018.
  • [8] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. Unsupervised cross-modal alignment of speech and text embedding spaces. Advances in neural information processing systems, 31, 2018.
  • [9] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. Towards unsupervised speech-to-text translation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7170–7174. IEEE, 2019.
  • [10] Manaal Faruqui and Chris Dyer. Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 19–24, 2014.
  • [11] Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869, 2016.
  • [12] Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1406–1414, 2012.
  • [13] Felix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, 2015.
  • [14] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
  • [15] Minh-Thang Luong, Richard Socher, and Christopher D Manning. Better word representations with recursive neural networks for morphology. In Proceedings of the seventeenth conference on computational natural language learning, pages 104–113, 2013.
  • [16] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Interspeech, volume 2017, pages 498–502, 2017.
  • [17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013.
  • [18] Shruti Palaskar, Vikas Raunak, and Florian Metze. Learned in speech recognition: Contextual acoustic word embeddings. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6530–6534. IEEE, 2019.
  • [19] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.
  • [20] Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, pages 337–346, 2011.
  • [21] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862, 2019.
  • [22] Marco Tagliasacchi, Beat Gfeller, Félix de Chaumont Quitry, and Dominik Roblek. Self-supervised audio representation learning for mobile devices. arXiv preprint arXiv:1905.11796, 2019.
  • [23] Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 41–47, 2003.
  • [24] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 384–394, 2010.
  • [25] Florian Wickelmaier. An introduction to mds. Sound Quality Research Unit, Aalborg University, Denmark, 46(5):1–26, 2003.
  • [26] Dongqiang Yang and David MW Powers. Verb similarity on the taxonomy of WordNet. Masaryk University, 2006.
  • [27] Jiangyan Yi and Jianhua Tao. Self-attention based model for punctuation prediction using word and speech embeddings. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7270–7274. IEEE, 2019.

Appendix A Encoding-Decoding Details

Fig. 5 depicts the encoding-decoding process. For clarity, this illustration takes a single spoken word as input, while the model is trained in batch mode. Since the input words vary in duration, the official code enforces a uniform input length. However, the official code lacks the data processing part, so we do not know the exact length it used. In our dataset, a word spans 10 MFCC frames on average, and 94.6% of words are shorter than 21 frames. We therefore padded (or truncated) each input word to a fixed 20 MFCC frames.
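A minimal sketch of this length normalization is shown below. The function name is hypothetical, and padding with all-zero frames is an assumption; the fixed length of 20 and the 13 MFCC coefficients follow the settings described in this paper.

```python
def pad_or_truncate(frames, target_len=20, n_mfcc=13):
    """Return exactly `target_len` frames and the number of real frames kept.

    `frames` is a list of MFCC frames (each a list of `n_mfcc` floats).
    Longer inputs are cut; shorter ones are padded with zero frames.
    """
    real_len = min(len(frames), target_len)
    fixed = [list(f) for f in frames[:target_len]]
    fixed += [[0.0] * n_mfcc for _ in range(target_len - real_len)]
    return fixed, real_len


# Toy inputs (made up): a short word of 3 frames and a long one of 25 frames.
short_word = [[1.0] * 13 for _ in range(3)]
long_word = [[2.0] * 13 for _ in range(25)]

short_fixed, short_len = pad_or_truncate(short_word)
long_fixed, long_len = pad_or_truncate(long_word)
```

Returning the real length alongside the padded frames lets later stages (packing in the encoder, masking in the loss) ignore the padding.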

In the encoding stage, the official code used the pack_padded_sequence of PyTorch to eliminate the padding effects. However, in the decoding stage, the reconstruction loss corresponding to the padded frames is still considered. We fixed this issue in our reproduction. More details can be found in our released code.
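The effect of this fix can be illustrated with a masked reconstruction loss. The sketch assumes a mean-squared-error reconstruction loss and uses plain Python lists in place of PyTorch tensors; `masked_mse` and the toy values are hypothetical and do not reproduce our released code.

```python
def masked_mse(pred, target, real_len):
    """Mean squared error over the first `real_len` frames only.

    Frames at index >= real_len are padding and do not contribute.
    """
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred[:real_len], target[:real_len]):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy decoder output vs. target (made up): frame 0 is real, frame 1 is padding.
pred = [[1.0, 1.0], [5.0, 5.0]]
target = [[0.0, 0.0], [0.0, 0.0]]

loss_masked = masked_mse(pred, target, real_len=1)
loss_unmasked = masked_mse(pred, target, real_len=2)
```

In this toy case the unmasked loss is dominated by the padded frame, which shows how counting padding can distort the training signal for short words.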

(a) Encoding process
(b) One decoding step
Figure 5: Illustration of the encoding-decoding steps. Best viewed in color.

Appendix B Comparisons on Similarity Benchmarks

We note that the official code uses the Adam optimizer, not SGD as described in the official paper. To exclude the optimizer’s impact, we also tried the Adam setting. However, we did not observe essential differences in the results, as shown in Tab. 7.

| Benchmark | Paper Data | SGD: Rand Init | SGD: 500-Epoch | Adam: Rand Init | Adam: 500-Epoch |
|---|---|---|---|---|---|
| MC-30 | 0.85 | 0.18 | 0.24 (↓0.60) | 0.37 | -0.28 (↓1.13) |
| MEN | 0.62 | 0.08 | 0.08 (↓0.54) | 0.06 | 0.04 (↓0.58) |
| MTurk-287 | 0.47 | 0.00 | 0.02 (↓0.45) | 0.04 | 0.08 (↓0.39) |
| MTurk-771 | 0.52 | -0.00 | 0.03 (↓0.49) | -0.00 | 0.04 (↓0.48) |
| RG-65 | 0.79 | 0.11 | 0.18 (↓0.61) | 0.25 | 0.05 (↓0.74) |
| Rare-Word | 0.32 | 0.10 | 0.12 (↓0.21) | 0.10 | 0.04 (↓0.28) |
| SimLex-999 | 0.29 | 0.02 | -0.05 (↓0.34) | -0.06 | -0.04 (↓0.33) |
| SimVerb-3500 | 0.16 | 0.00 | -0.02 (↓0.17) | -0.03 | -0.03 (↓0.18) |
| Verb-143 | 0.32 | 0.24 | 0.29 (↓0.02) | 0.34 | 0.24 (↓0.07) |
| WS-353 | 0.51 | 0.11 | 0.14 (↓0.36) | 0.18 | 0.12 (↓0.39) |
| WS-353-REL | 0.35 | 0.10 | 0.11 (↓0.24) | 0.12 | 0.11 (↓0.24) |
| WS-353-SIM | 0.66 | 0.10 | 0.15 (↓0.51) | 0.22 | 0.16 (↓0.51) |
| YP-130 | 0.32 | 0.01 | 0.07 (↓0.25) | 0.11 | -0.01 (↓0.33) |

Table 7: Results on semantic benchmarks. The ↓ represents the gap with the Paper Data. Note that the optimizer type does not affect the ‘Rand Init’ results. We used different seeds for these two runs, which is why their Rand Init results differ.

Appendix C The Homophone List

The full 307 homophone pairs are listed in Tab. 8. Since our reproduction is based on Libri-Clean, which has a smaller vocabulary than the released embeddings, 33 pairs are missing in our reproduction (marked by ‘-’).

Table 8: The homophone list.
Homophone Pairs Raw Official Ours Homophone Pairs Raw Official Ours
ad add 0.99 0.09 0.92 lieu loo 0.98 0.43 0.91
ail ale 0.98 0.49 0.75 links lynx 1.00 0.24 0.96
air heir 1.00 0.07 0.97 lo low 0.99 0.22 0.98
all awl - 0.40 - loan lone 1.00 0.09 0.94
allowed aloud 1.00 0.21 0.96 loot lute 0.99 0.17 0.92
alms arms 0.95 0.26 0.95 made maid 1.00 0.25 0.98
altar alter 0.99 0.13 0.95 mail male 1.00 0.23 0.98
arc ark 0.98 0.19 0.91 main mane 0.99 0.17 0.97
aren’t aunt 0.96 0.44 0.95 maize maze 0.99 0.25 0.93
ate eight 1.00 0.22 1.00 manna manner 0.97 0.33 0.89
auger augur - 0.51 - mantel mantle 1.00 0.43 0.95
awe oar 0.92 0.26 0.78 mare mayor 0.99 0.38 0.96
axel axle 0.99 0.41 0.86 marshal martial 1.00 0.60 0.98
aye eye 0.99 0.29 0.89 marten martin - 0.36 -
bail bale 0.99 0.36 0.98 mask masque - 0.51 -
bait bate 0.98 0.41 0.95 mean mien 0.98 0.24 0.90
baize bays 0.99 0.50 0.91 meat meet 1.00 0.13 0.99
bald bawled 0.99 0.21 0.96 medal meddle 0.99 0.11 0.90
bard barred 0.98 0.01 0.96 metal mettle 1.00 0.31 0.92
bare bear 1.00 0.51 1.00 meter metre 0.99 0.59 0.95
bark barque 0.96 0.33 0.86 might mite 0.99 0.38 0.96
baron barren 1.00 0.07 0.97 mind mined - 0.21 -
base bass 0.99 0.28 0.98 miner minor 1.00 0.17 0.95
bazaar bizarre 0.98 0.12 0.92 missed mist 0.99 0.20 0.98
be bee 0.99 0.35 0.98 moat mote 0.99 0.46 0.93
beach beech 1.00 0.47 0.97 mode mowed 0.99 0.21 0.94
bean been 0.92 0.29 0.96 moor more 0.99 0.41 0.94
beat beet - 0.41 - morning mourning 1.00 0.46 0.99
beer bier 0.99 0.24 0.96 muscle mussel 0.99 0.48 0.95
bel bell 0.99 0.29 0.98 naval navel - 0.34 -
berry bury 0.99 0.16 0.97 nay neigh - 0.34 -
berth birth 1.00 0.08 1.00 none nun 0.99 0.35 0.84
bitten bittern - 0.63 - ode owed 0.99 0.30 0.95
blew blue 1.00 0.48 1.00 oh owe 0.97 0.55 0.82
boar bore 0.99 0.39 0.96 one won 1.00 0.39 1.00
board bored 1.00 0.13 0.99 pail pale 1.00 0.27 0.99
boarder border 0.99 0.34 0.92 pain pane 1.00 0.38 0.98
bold bowled 0.99 0.50 0.92 pair pare 0.99 0.35 0.95
born borne 1.00 0.29 0.99 palate palette 0.98 0.50 0.88
bough bow 0.99 0.48 0.97 pause paws 1.00 0.34 0.99
boy buoy 0.96 0.35 0.88 pea pee 0.93 0.54 0.87
braid brayed - 0.60 - peace piece 1.00 0.15 0.99
brake break 1.00 0.38 0.99 peak peek - 0.30 -
bread bred 1.00 0.24 0.99 peal peel 0.99 0.21 0.96
bridal bridle 1.00 0.36 0.97 peer pier 0.99 0.33 0.96
broach brooch 0.98 0.52 0.89 plain plane 1.00 0.40 0.99
but butt 0.97 0.37 0.92 pleas please - 0.42 -
buy by 0.99 0.25 0.98 plum plumb 0.99 0.57 0.96
calendar calender 0.98 0.39 0.95 pole poll 0.98 0.19 0.96
canvas canvass 0.99 0.41 0.98 practice practise 1.00 0.74 0.99
cast caste 0.99 0.33 0.98 praise prays 0.99 0.51 0.99
caught court 0.95 0.11 0.91 principal principle 1.00 0.48 1.00
ceiling sealing 1.00 0.63 0.95 profit prophet 1.00 0.30 0.99
cell sell 0.99 0.06 0.95 quarts quartz 0.99 0.49 0.95
censer censor 0.98 0.35 0.78 rain reign 1.00 0.18 0.99
cent scent 0.99 0.00 0.98 raise rays 1.00 0.10 0.99
check cheque 0.99 0.61 0.98 rap wrap 0.99 0.40 0.96
chord cord 1.00 0.42 0.98 raw roar 0.96 0.18 0.88
cite sight 0.99 0.24 0.97 read reed 0.97 0.28 0.97
clew clue 0.99 0.83 0.96 real reel 1.00 0.27 0.97
climb clime 0.99 0.29 0.93 reek wreak 0.98 0.33 0.92
coarse course 1.00 0.20 0.99 rest wrest 0.99 0.54 0.95
coign coin - 0.37 - right rite 1.00 0.19 0.96
colonel kernel 0.98 0.02 0.94 ring wring 0.99 0.48 0.97
complacent complaisant - 0.54 - road rode 1.00 0.66 0.96
complement compliment 1.00 0.27 0.96 roe row 0.95 0.23 0.80
coo coup 0.98 0.35 0.87 role roll 1.00 0.18 0.99
cops copse 0.98 0.39 0.96 rood rude - 0.32 -
council counsel 1.00 0.52 0.99 root route 0.98 0.06 0.98
creak creek 0.99 0.37 0.98 rose rows 1.00 0.37 0.99
crews cruise 1.00 0.67 0.95 rote wrote 0.98 0.46 0.91
currant current 0.99 0.11 0.97 rough ruff 0.99 0.51 0.94
dam damn 0.99 0.21 0.95 rung wrung 1.00 0.33 0.97
days daze 0.99 0.47 0.94 rye wry 0.99 0.35 0.92
dear deer 0.99 0.08 0.98 sale sail 1.00 0.32 0.99
descent dissent 1.00 0.42 0.98 sane seine - 0.09 -
desert dessert 0.99 0.14 0.97 sauce source 0.97 0.22 0.97
dew due 1.00 0.25 0.98 saw soar 0.96 0.33 0.92
die dye 0.99 0.26 0.89 scene seen 1.00 0.41 1.00
done dun 0.99 0.21 0.97 sea see 1.00 0.31 0.98
draft draught 0.99 0.30 0.96 seam seem 0.98 0.30 0.93
dual duel 0.98 0.48 0.88 sear seer 0.98 0.57 0.86
earn urn 0.99 0.04 0.89 seas sees 0.99 0.18 0.96
ewe yew 0.99 0.45 0.90 sew so 0.99 0.55 0.98
faint feint 0.99 0.55 0.95 shear sheer 0.99 0.25 0.94
fair fare 1.00 0.42 0.99 side sighed 1.00 0.31 0.99
farther father 0.98 0.29 0.99 sign sine - 0.39 -
feat feet 1.00 0.37 0.99 slay sleigh 0.99 0.30 0.97
few phew - 0.38 - sole soul 1.00 0.43 0.98
find fined 0.99 0.24 0.97 some sum 0.99 0.34 0.99
fir fur 1.00 0.32 0.99 son sun 1.00 0.26 1.00
flaw floor 0.97 0.30 0.87 sort sought 0.96 0.21 0.96
flea flee 0.99 0.36 0.98 staid stayed 1.00 0.55 0.99
floe flow 0.99 0.42 0.89 stair stare 1.00 0.27 0.99
flour flower 1.00 0.30 0.99 stake steak 1.00 0.19 0.98
for fore 0.95 0.26 0.99 stalk stork 0.96 0.55 0.92
fort fought 0.95 0.42 0.93 stationary stationery 0.99 0.27 0.85
forth fourth 1.00 0.25 1.00 steal steel 1.00 0.19 0.99
foul fowl 1.00 0.36 0.97 stile style 0.99 0.21 0.89
franc frank 0.98 0.31 0.95 storey story 0.99 0.37 0.95
freeze frieze - 0.38 - straight strait 1.00 0.39 0.97
furs furze - 0.56 - sweet suite 1.00 0.03 0.99
gait gate 1.00 0.31 0.99 tacks tax 0.98 0.25 0.96
gamble gambol - 0.41 - tale tail 1.00 0.23 1.00
gild guild - 0.41 - taught taut 0.99 0.12 0.93
gilt guilt 1.00 0.21 0.99 team teem - 0.28 -
gnaw nor 0.95 0.38 0.84 tear tier 0.98 0.24 0.92
grate great 1.00 0.15 0.97 teas tease 0.99 0.35 0.97
greys graze 0.98 0.65 0.90 tern turn - 0.32 -
grisly grizzly 0.99 0.64 0.88 there their 0.99 0.34 0.98
groan grown 0.99 0.25 0.95 threw through 0.99 0.39 0.99
guessed guest 1.00 0.33 0.99 throes throws 0.99 0.31 0.95
hail hale 0.99 0.34 0.97 throne thrown 1.00 0.30 0.96
hair hare 1.00 0.32 0.98 thyme time 0.99 0.25 0.97
hall haul 1.00 0.13 0.93 tide tied 1.00 0.10 0.99
hart heart 0.99 0.23 0.94 tire tyre 0.99 0.21 0.90
haw hoar 0.91 0.12 0.71 to too 0.97 0.63 0.97
hay hey 0.98 0.37 0.92 toad toed 0.98 0.55 0.96
he’d heed 0.99 0.39 0.98 told tolled 0.98 0.33 0.96
heal heel 1.00 0.37 0.95 ton tun - 0.55 -
hear here 1.00 0.58 0.98 vain vane 0.99 0.19 0.96
heard herd 0.99 0.32 0.97 vale veil 1.00 0.65 0.98
hew hue 0.99 0.34 0.94 vial vile 0.99 0.33 0.94
hi high 0.98 0.18 0.98 wain wane 0.98 0.73 0.86
higher hire 1.00 0.27 0.98 waist waste 1.00 0.13 0.99
him hymn 0.99 0.28 0.98 wait weight 1.00 0.21 0.99
ho hoe 0.98 0.40 0.96 waive wave 0.99 0.22 0.95
hoard horde 0.99 0.47 0.89 war wore 0.99 0.23 0.96
hoarse horse 1.00 0.21 0.99 ware wear 0.99 0.32 0.97
hour our 0.98 0.28 0.94 warn worn 1.00 0.02 0.98
idle idol 1.00 0.46 0.97 watt what - 0.52 -
in inn 0.98 0.38 0.95 wax whacks - 0.48 -
it’s its 0.99 0.20 0.95 way weigh 0.99 0.36 0.96
key quay 0.97 0.33 0.95 we wee 0.97 0.21 0.96
knave nave 0.99 0.06 0.90 we’d weed 0.96 0.26 0.94
knead need 0.99 0.51 0.98 weak week 1.00 0.10 1.00
knew new 1.00 0.31 0.99 weal we’ll 0.94 0.15 0.90
knight night 1.00 0.26 0.99 wean ween - 0.57 -
knob nob - 0.59 - weather whether 0.99 0.21 0.95
knot not 1.00 0.26 0.98 weir we’re - 0.08 -
know no 1.00 0.63 0.97 were whirr 0.96 0.27 0.83
knows nose 1.00 0.18 0.99 which witch 0.99 0.28 0.94
lac lack - 0.10 - whig wig 0.99 0.16 0.96
lain lane 0.99 0.30 0.96 whine wine 1.00 0.13 0.97
laps lapse - 0.32 - whirl whorl 0.97 0.37 0.78
larva lava - 0.45 - whirled world 1.00 0.26 0.96
law lore 0.96 0.33 0.89 whit wit 0.99 0.50 0.96
lea lee 0.98 0.53 0.89 white wight 0.99 0.32 0.95
lead led 0.94 0.59 0.97 who’s whose 0.99 0.26 0.99
lessen lesson 1.00 0.26 0.92 woe whoa 0.97 0.23 0.90
levee levy 0.98 0.45 0.82 wood would 0.98 0.26 0.96
liar lyre 0.99 0.31 0.93 yoke yolk 0.99 0.32 0.88
licence license 0.99 0.72 0.94 you’ll yule 0.95 0.48 0.91
lie lye 0.97 0.47 0.71
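
Each line of Table 8 holds one or two groups of five fields: the two words of a homophone pair, followed by the Raw, Official, and Ours scores. The sketch below shows one way to read rows in this layout and average each column over the entries that are present. It assumes that "-" marks a pair with no score available. The names `parse_row` and `column_means` are illustrative and do not come from the released code, and only three rows copied from the table are used as input.

```python
from statistics import mean

COLUMNS = ("raw", "official", "ours")


def parse_row(line):
    """Split one table line into (word_a, word_b, scores) records.

    A line carries one or two five-field groups; "-" is read as a
    missing score (an assumption about the table's notation).
    """
    tokens = line.split()
    if not tokens or len(tokens) % 5 != 0:
        raise ValueError(f"expected groups of 5 fields, got {len(tokens)} tokens")
    records = []
    for i in range(0, len(tokens), 5):
        word_a, word_b, *raw_scores = tokens[i:i + 5]
        scores = {
            column: (None if value == "-" else float(value))
            for column, value in zip(COLUMNS, raw_scores)
        }
        records.append((word_a, word_b, scores))
    return records


def column_means(records):
    """Average each score column, skipping missing entries."""
    result = {}
    for column in COLUMNS:
        present = [s[column] for _, _, s in records if s[column] is not None]
        result[column] = mean(present) if present else None
    return result


# Three rows copied verbatim from Table 8 (not the full table).
SAMPLE_ROWS = [
    "ad add 0.99 0.09 0.92 lieu loo 0.98 0.43 0.91",
    "all awl - 0.40 - loan lone 1.00 0.09 0.94",
    "lie lye 0.97 0.47 0.71",
]

records = [record for line in SAMPLE_ROWS for record in parse_row(line)]
means = column_means(records)
print(means)
```

Running the same functions over every row of the table would give per-column averages for the full homophone list. The sample above covers only five pairs, so its averages should not be read as summary statistics for Table 8.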