Homophone Reveals the Truth: A Reality Check for Speech2Vec

Guangyu Chen
Renmin University of China
hcs@ruc.edu.cn

Abstract

Generating spoken word embeddings that possess semantic information is a fascinating topic. Compared with text-based embeddings, they cover both phonetic and semantic characteristics, which can provide richer information and are potentially helpful for improving ASR and speech translation systems. In this paper, we review and examine the authenticity of a seminal work in this field: Speech2Vec. First, a homophone-based inspection method is proposed to check the speech embeddings released by the author of Speech2Vec. There is no indication that these embeddings are generated by the Speech2Vec model. Moreover, through further analysis of the vocabulary composition, we suspect that a text-based model fabricates these embeddings. Finally, we reproduce the Speech2Vec model, referring to the official code and optimal settings in the original paper. Experiments showed that this model failed to learn effective semantic embeddings. In word similarity benchmarks, it gets a correlation score of 0.08 in MEN and 0.15 in WS-353-SIM tests, which is over 0.5 lower than those described in the original paper. Our data and code are available¹¹1https://github.com/my-yy/s2v_rc.

1 Introduction

Word embeddings are fixed-length representations of words, which carry syntactic and semantic information and become the building block in natural language processing tasks, such as named entity recognition [24] and question answering [23]. Speech, as another carrier of information, also reflects semantic meaning, which implies the possibility of directly using it to learn semantic word representations. To this end, some attempts have been made to generate semantic speech embeddings [7, 18, 5, 6]. Compared with learning text-based embeddings, this is a much more difficult task. Because there are intrinsic interferences between phonetics and semantics [5]. For example, the words ‘ate’ and ‘eight’ have almost the same pronunciation but are semantically different. If the model can not distinguish these sound-alike inputs, it is impossible to obtain effective semantic representations.

Refer to caption — Figure 1: The Speech2Vec model (skip-gram style).

In this paper, we focus on Speech2Vec [6, 7], one of the field’s earliest and most influential studies. It can be viewed as a speech version of Word2Vec [17], which leverages the skip-gram and CBOW strategy to learn semantic speech embeddings. Despite being an imitator of Word2Vec, it seems to have magically resolved the mentioned phonetic-semantic interference and has been reported to exceed the Word2Vec on 13 word similarity benchmarks [10]. This phenomenon aroused our interest in exploring the validity of this study. Specifically, we analyze the authenticity of Speech2Vec from two perspectives of the official-released speech embeddings and our reproduction results. The conclusions can be summarized as follows:

1) The official speech embeddings fail to reflect speech characteristics, implying that they are not generated by the Speech2Vec model.

2) Analysis of the vocabulary composition reveals that these released embeddings may come from a text-based model.

3) Our reproduction results show that the Speech2Vec model failed to generate valid semantic speech embeddings. Even after 500 epochs of training, its performance is still negligible in most benchmark metrics.

2 Recap Speech2Vec

Speech2Vec can be regarded as a speech version of Word2Vec, which relies on the skip-gram or CBOW strategy to learn semantic speech embeddings. Herein, we focused on the skip-gram style of Speech2Vec, which is reported to have better performance. Fig. 1 shows its structure.

When training, the input speech sentence is segmented into spoken word sequences in advance. Then the encoder network converts the current spoken word into an embedding vector. And this vector is used for decoding context words. In the corpus, the audio segments corresponding to the same text word appear multiple times (with different speakers, environment, emotions, or talking speed). To obtain a deterministic word representation, the average of all embeddings corresponding to the same text word is taken as the final embedding.

The official code of Speech2Vec is not publicly available, but the pre-trained speech embeddings are released on GitHub ²²2https://github.com/iamyuanchung/speech2vec-pretrained-vectors. These embeddings are described from a skip-gram style Speech2Vec model trained on 500 hours of LibriSpeech [19] dataset:

… In this repository, we release the speech embeddings of different dimensionalities learned by Speech2Vec using skip-grams as the training methodology. The model is trained on a corpus consisting of about 500 hours of speech from LibriSpeech (the clean-360 + clean-100 subsets) …

We further provide a fork to this repository ³³3https://github.com/my-yy/speech2vec-pretrained-vectors.

3 Reality Check

This section aims to check the authenticity of the released speech embeddings. We analyze them from two perspectives: phonetic characteristics and vocabulary composition.

3.1 Homophone Inspection

Homophones refer to words that have the same pronunciation but reflect different meanings (e.g., ‘ate, eight’,‘ dear, deer’, and ‘whine, wine’). Since these spoken words are consistent in pronunciation, in a speech recognition system, the speech context and language model [3] are usually required to distinguish them. For the skip-gram style Speech2Vec, its input is a single spoken word that does not contain context information. Therefore, this model should not be able to distinguish these homophones effectively. And the embedding similarity of the homophone pair should have a higher value. We used 307 homophone pairs⁴⁴4 Sourced from: http://www.singularis.ltd.uk/bifroest/misc/homophones-list.html and calculated their similarities. The results are shown in Tab. 1.

Embedding	Dim	Pair Similarity
Embedding	Dim	Homophone	Random
Official	50	0.33±0.15	0.39±0.17
	100	0.28±0.14	0.35±0.15
	200	0.23±0.12	0.31±0.14
	300	0.21±0.11	0.29±0.14
Ours	50	0.95±0.05	0.77±0.10

Table 1: Cosine similarities of word pairs (mean ± std). Random: Similarities between 307 randomly selected pairs. Official: Embeddings released by the author of Speech2Vec. Ours: Embeddings generated by our reproduced model (detailly described in Sec. 4. )

As we can see, for the official embeddings, their homophone similarities are counterintuitively lower than the random results. It suggested that the official model may not be affected by phonetic similarity and can successfully distinguish these highly similar inputs. However, our reproduced model did not ‘show the magic’. Although it had a 0.77 similarity for random pairs (which implies the embedding space is ill-distributed), it still obtained a higher similarity of 0.95 for the homophone pairs.

Homophone Pair		Similarity
Homophone Pair		Raw	Official	Ours
ate	eight	1.00	0.22	1.00
hail	hale	0.99	0.34	0.97
know	no	1.00	0.63	0.97
made	maid	1.00	0.25	0.98
meat	meet	1.00	0.13	0.99
peace	piece	1.00	0.15	0.99
sea	see	1.00	0.31	0.98
steal	steel	1.00	0.19	0.99
tale	tail	1.00	0.23	1.00
whine	wine	1.00	0.13	0.97
MEAN		0.99	0.32	0.95

Table 2: Examples of homophone similarities. The Official stands for the official 50-dimensional embeddings. The appendix gives the full 307 results.

Word	Official				Ours
Word	K Nearest Neighbor			HR	K Nearest Neighbor			HR
ate	eating	drank	eat	25,099	eight	agent	aid	1
hail	thunder	blast	thunders	15,253	hell	handle	held	9
know	tell	think	why	2,305	narrow	natural	no	3
made	making	gave	make	35,582	me	mid	necessity	15
meat	roasting	roasted	venison	33,108	meet	mate	mute	1
peace	tranquility	liberty	deliverer	33,609	plates	place	patience	7
sea	ocean	shore	waters	18,610	city	safety	succeed	100
steal	flay	stealing	kill	29,506	still	steel	special	2
tale	story	adventure	tales	28,617	tail	title	tackle	1
whine	snarling	growls	growl	34,575	watering	washing	wide	4
MEAN				25,626				14

Table 3: The k nearest neighbor results (k=3). HR: Homophone ranking.

Tab. 2 further gives some pair examples. In this table, we introduce a ‘Raw’ metric to measure the similarity of audio input. This metric is the cosine similarity of extracted features from a pre-trained wav2vec [21] model.

It is clear to see that most homophone pairs have a Raw score close to 1.0, which means there is a high degree of similarity between these homophone words. And the wav2vec model failed to distinguish them. In addition, our reproduced model has similar behavior, which gets an average similarity of 0.95. However, the official embeddings get relatively lower similarities. Taking the ‘ate, eight’ pair as an example, the similarities of Raw and ours are both 1.00, while the official gives a 0.22 similarity score.

The k-nearest neighbor results (Tab. 3) further reveals the anomaly of the official embedding. Although the official’s nearest neighbors reflect similar semantics, there are considerable differences in pronunciation. For example, the nearest neighbors of ‘hail’ are ‘thunder, blast, thunders’, while its homophone counterpart (‘hale’) is ranked as high as 15,253. In contrast, the homophone rank in our reproduction is 9.

The above anomaly behavior of the official embeddings can be explained by a bold hypothesis: the official embeddings come from a text-based model. As in a text-based model (like Word2Vec), words of different spellings are assigned with different coding values. In the learning process, the model does not rely on pronunciation to identify words; therefore, it will not be disturbed by them. Moreover, because homophone pairs have different meanings, their similarities may be smaller than the random pairs. This explains why the official results have smaller similarities than the random in Tab. 1.

3.2 Vocabulary Analysis

3.2.1 Checking the Released Vocabulary

Corpus Division		Hours		Speakers
Clean	train-clean-100	100.6	475	251	1,252
	train-clean-360	363.6		921
	dev-clean	5.4		40
	test-clean	5.4		40
Other	train-other-500	496.7	507	1,166	1,232
	dev-other	5.3		33
	test-other	5.1		33

Table 4: The statistics of the LibrSpeech corpus.

The Seepch2Vec paper claimed that the training is based on 500 hours of LibriSpeech [19], which contains 1,252 speakers. We refer to this dataset as Libri-Clean. Its statistics are shown in Tab. 4. We counted distinct words in Libri-Clean’s transcripts and compared them with the official. The results are shown in Tab. 5.

	Official	Libri-Clean		Libri-All
	Official	-	count $\geq$ 5	count $\geq$ 5
Vocab	37,622	66,721	27,454	37,622
Missing	-	1,685	10,168	0

Table 5: The vocabulary composition.

We found that 1.6k official words are not in the Libri-Clean. If we only consider frequent words of occurrence count $\geq$ 5, this number rises to 10k. However, if we use the entire LibriSpeech corpus and filter out infrequent words, the vocabulary composition is exactly the same as the official. This proves that the released embeddings are derived from all LibriSpeech corpus, not the claimed ‘500 hours’ data.

In addition, we believe that these embeddings are generated by a text-based model, which directly uses text transcripts for training. Because the LibriSpeech dataset needs to be pre-processed for training, each speech sentence should be aligned with the transcript to locate word boundaries. However, this alignment process is not ideal, which may cause data loss.

In our analysis, we used the alignment results based on the widely used Montreal Forced Aligner [16] ⁵⁵5https://github.com/CorentinJ/librispeech-alignments. It aligned the entire corpus of 292,367 speech sentences and encountered 127 failures (0.04% fail rate). After removing these unaligned speeches, the corpus changes, which causes additional 17 missing words in the vocabulary. Therefore, if the authors of Speech2Vec do not realize an ideal alignment, their vocab size will not equal 37,622 exactly. Otherwise, their embeddings come from a text-based model.

Benchmark	Not Found Pairs
Benchmark	Paper Data	Libri-Clean
MEN [4]	122	231
MTurk-287 [20]	13	29
MTurk-771 [12]	22	43
Rare-Word [15]	783	1027
SimLex-999 [13]	0	11
SimVerb-3500 [11]	126	118
Verb-143 [2]	0	5
WS-353 [26]	21	34
WS-353-REL [1]	12	24
WS-353-SIM [1]	7	16
YP-130 [26]	0	9

Table 6: Not found pairs in word similarity benchmarks.

3.2.2 Checking The Paper Data

Although the vocabulary size is not mentioned in the paper of Speech2Vec, we can still find some clues from the word similarity benchmark results. In the original paper, these benchmarks are used to evaluate the semantic information carried by embeddings. Each benchmark contains a series of predefined word pairs with the corresponding oracle correlation. In an early version of Speech2vec [6], the author reported the ‘not found’ pairs in these benchmarks (Tab. 6).

If we use all the 66,721 words in Libri-Clean to construct the vocabulary, there are still more not found pairs in these tests (except for SimVerb-3500). For example, the original paper reported 122 not found pairs in the MEN test, but we found 231. This illustrates that the authors of Speech2Vec used a larger corpus other than the claimed 500h Libri-Clean.

4 Our Reproduction

4.1 Reproduction Details

4.1.1 Code Implementation

Our reproduction is based on the incomplete code provided by the author of Speech2Vec. This code covers the model implementation and training logic while lacking the data pre-processing. We complement the missing parts and train the Speech2Vec, referring to the optimal setting in the original paper. Specifically, the embedding dimension is set as 50. The window size is taken as 3. The encoder is a single-layer bidirectional LSTM. And the decoder is a single-layer unidirectional LSTM. During the decoding process, the attention mechanism [14] is used to reference the encoder’s outputs (Fig. 5 in the appendix gives a detailed overview of its structure).

4.1.2 Training Details

In accordance with Speech2Vec’s paper, the Libri-Clean dataset is used for training. Each speech sentence is segmented into spoken words in advance. Then we filter out the low-frequency words that appear less than four times and finally get a vocabulary of size 30,788. For each spoken word, we convert it into 13-dim MFCC frames.

During the mini-batch construction, we set batch size as 4096 and trained the model for 500 epochs with SGD optimizers of 0.001 learning rate. This training process took 8.4 days on an AMD 3900XT + RTX3090 device. The output models of every epoch are saved, and the 500th epoch’s model is used for evaluation.

4.2 Results & Analysis

Same with the official, we tested our reproduced model on 13 similarity benchmarks [10]. The evaluation score is Spearman’s rank correlation coefficient $\rho$ . It has a range of [-1,1], and a larger value represents a better result. The results are shown in Fig. 2.

Obviously, the reproduced model gives trivial results on these benchmarks. Compared with the random initialization, the 500-epoch model only has minor improvements or even decreases in indicators of SimLex-999 and SimVerb-3500. Moreover, compared with the claimed paper data, there is a substantial performance drop, even over 0.5 in some metrics (MC-30, MEN, RG-65, WS-353-SIM).

We further depicted the training process in Fig. 3. With the loss decreasing, most indicators showed a rising trend. But the increasing speed is relatively slow and is finally limited to a small range (e.g., 0 $\sim$ 0.022 for MTurk-287). More training epochs may provide better results on these metrics, but it should be much larger than the claimed ‘500 epochs’. Moreover, two indicators seem to have passed their peak period, with performance turning from rising to falling (WS-353-REL and MEN). However, their peak performance is still relatively lower than the original paper. There are two indicators whose performance is consistently declining, which means that the required semantics have not been learned successfully (SimLex-999 and SimVerb-3500).

As mentioned in Sec. 2, each text word corresponds to multiple speech segments. In Fig. 4, we visualize the embedding distribution of specific words. Specifically, the multi-dimensional scaling (MDS) [25] is used for visualization. Compared with the popular T-SNE, the MDS preserves distance and global structure.

As we can see, the homophone words are mixed, while words with different pronunciations tend to be separated. This result confirms that Speech2Vec’s encoder can not successfully distinguish homophones. Moreover, we noticed some mixed points for the ‘need-thus’ pair. This shows that the model did not have a good discernibility even for not sound-alike pairs, which may be caused by the interference between phonetics and semantics.

5 Related Works

5.1 Works on Learning Semantic Speech Embeddings

Five years since the Speech2Vec was proposed, there is rarely work in the field of learning semantic speech embeddings. The Speech2Vec has received 158 citations so far, but no article has directly compared with it, and no subsequent work has proved its validity.

5.1.1 Chen et al.’s Work

Chen et al. [5] pointed out that when learning semantic speech embeddings, the phonetics and semantics inevitably disturb each other. Therefore, they designed a two-stage learning scheme to reduce the learning difficulty. In the first stage, an encoder-decoder structure generates phonetic embeddings in which the speaker characteristics are disentangled. In the second stage, these embeddings possess semantic meaning through a context prediction task. Experiments have verified their embeddings’ phonetic and semantic characteristics. However, this work is not compared with Speech2Vec on the similarity benchmarks.

5.1.2 CAWE

CAWE [18] is an encoder-decoder-based speech recognition model, which relies on an attention mechanism to automatically segment speech sentences into spoken words and generate speech embeddings. This model is tested on 16 sentence evaluation tasks and shows competitive results to Word2Vec. Compared with skip-gram style Speech2Vec, the CAWE can leverage the context information. Thus their effectiveness can not laterally prove Speech2Vec’s validity. Moreover, we have not found any successful third-party replications of CAWE.

5.1.3 Audio2Vec

Audio2Vec [22] is a CNN-based model which uses skip-gram, CBOW, and temporal-gap strategies to generate fixed-length audio representations. Although it has a similar training style to Speech2Vec, the goal of Audio2Vec is to produce general-purpose audio representations. In the training process, the speech sentence is segmented into fixed-length slices without reference to word boundaries. Thus this model was not tested on the word similarity benchmarks and not compared with Speech2Vec.

5.2 Works Directly Based on Speech2Vec Embedding

Ideally, if the experimental data is questionable, the conclusion’s validity cannot be guaranteed. We note that there are some works [8, 9, 27] directly based on the embeddings of Speech2Vec. Among them, two works come from the authors of Speech2vec [8, 9], which align the speech space and text space to realize cross-language speech-to-text translation.

Yi et al. [27] reported that they successfully improved punctuation prediction performance by involving the released Speech2Vec embeddings in their input. But they did not explore how these speech embeddings work. We think their results are authentic but unintentionally provide a wrong interpretation. As we have analyzed, these Speech2Vec embeddings cannot reflect phonetic information. Thus their improvements come from involving more input features. And it has nothing to do with whether these features come from a text-based or speech-based model. Therefore, we believe that the above works cannot provide evidence for the authenticity of Speech2Vec.

5.3 Replication Attempts by Other Researchers

We investigated the replication attempts by other researchers, including Jang’s⁶⁶6https://github.com/yjang43/Speech2Vec and Wang’s⁷⁷7https://github.com/ZhanpengWang96/pytorch-speech2vec. Like us, they fail to achieve the results claimed in Speech2Vec’s paper. Jang reported that the model gets trivial results. As for Wang’s, the reproduced model is also vulnerable, which only gets $\rho=0.08$ in WS-353-REL and $\rho=-0.15$ in YP-130.

6 Disscussion

“If I have seen further, it is by standing on the shoulders of Giants. ——Isaac Newton,1675.”

Scientific progress builds on the innovations of predecessors. If an influential innovation is falsified, it is not ‘the shoulder of a giant’. Instead, it becomes a stumbling block and brings huge obstacles to developing the field. We think science is not a kind of religion, especially when it comes to computer science. The authenticity of innovation should depend on reproducibility rather than the reputation of the author, institution, or paper grade.

In 2018, a master’s student was assigned to begin his academic research in an emerging field. Unluckily, he took faith in a bogus study and get off the wrong road. He was passionate about it, striving for years to achieve an impossible goal, and ended up with nothing. Four years after setting out, he reflected on the failures of his life and wrote this article. To prevent others from repeating his past, this is the value of this study.

References

[1] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. A study on similarity and relatedness using distributional and wordnet-based approaches. 2009.
[2] Simon Baker, Roi Reichart, and Anna Korhonen. An unsupervised model for instance level subcategorization acquisition. In EMNLP, pages 278–289, 2014.
[3] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. Advances in neural information processing systems, 13, 2000.
[4] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145, 2012.
[5] Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 941–948. IEEE, 2018.
[6] Yu-An Chung and James Glass. Learning word embeddings from speech. arXiv preprint arXiv:1711.01515, 2017.
[7] Yu-An Chung and James Glass. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. In INTERSPEECH, 2018.
[8] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. Unsupervised cross-modal alignment of speech and text embedding spaces. Advances in neural information processing systems, 31, 2018.
[9] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. Towards unsupervised speech-to-text translation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7170–7174. IEEE, 2019.
[10] Manaal Faruqui and Chris Dyer. Community evaluation and exchange of word vectors at wordvectors. org. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 19–24, 2014.
[11] Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869, 2016.
[12] Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1406–1414, 2012.
[13] Felix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, 2015.
[14] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[15] Minh-Thang Luong, Richard Socher, and Christopher D Manning. Better word representations with recursive neural networks for morphology. In Proceedings of the seventeenth conference on computational natural language learning, pages 104–113, 2013.
[16] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Interspeech, volume 2017, pages 498–502, 2017.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013.
[18] Shruti Palaskar, Vikas Raunak, and Florian Metze. Learned in speech recognition: Contextual acoustic word embeddings. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6530–6534. IEEE, 2019.
[19] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.
[20] Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, pages 337–346, 2011.
[21] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862, 2019.
[22] Marco Tagliasacchi, Beat Gfeller, Félix de Chaumont Quitry, and Dominik Roblek. Self-supervised audio representation learning for mobile devices. arXiv preprint arXiv:1905.11796, 2019.
[23] Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 41–47, 2003.
[24] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 384–394, 2010.
[25] Florian Wickelmaier. An introduction to mds. Sound Quality Research Unit, Aalborg University, Denmark, 46(5):1–26, 2003.
[26] Dongqiang Yang and David MW Powers. Verb similarity on the taxonomy of WordNet. Masaryk University, 2006.
[27] Jiangyan Yi and Jianhua Tao. Self-attention based model for punctuation prediction using word and speech embeddings. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7270–7274. IEEE, 2019.

Appendix A Encoding-Decoding Details

Fig. 5 depicts the encoding-decoding process. For clarity, this illustration takes a single spoken word as input, while the model is trained in batch mode. Since the input words have various time lengths, the length uniformity is implemented in the official code. The official code lacks the data processing part, and we don’t know the exact time length. According to statistics, our dataset has an average MFCC frame of 10, and there are 94.6% of words less than 21 frames. Thus we padded (or truncated) the input word to fixed 20 MFCC frames.

In the encoding stage, the official code used the pack_padded_sequence of PyTorch to eliminate the padding effects. However, in the decoding stage, the reconstruction loss corresponding to the padded frames is still considered. We fixed this issue in our reproduction. More details can be found in our released code.

Appendix B Comparisons on Similarity Benchmarks

We note that the optimizer in the official code is the Adam optimizer (not the described SGD in the official paper). To exclude the optimizer’s impact, we tried the Adam setting. However, we did not observe essential differences in the results, as shown in Tab. 7.

Benchmark	Paper Data	SGD		Adam
Benchmark	Paper Data	Rand Init	500-Epoch	Rand Init	500-Epoch
MC-30	0.85	0.18	0.24↓0.60	0.37	-0.28↓1.13
MEN	0.62	0.08	0.08↓0.54	0.06	0.04↓0.58
MTurk-287	0.47	0.00	0.02↓0.45	0.04	0.08↓0.39
MTurk-771	0.52	-0.00	0.03↓0.49	-0.00	0.04↓0.48
RG-65	0.79	0.11	0.18↓0.61	0.25	0.05↓0.74
Rare-Word	0.32	0.10	0.12↓0.21	0.10	0.04↓0.28
SimLex-999	0.29	0.02	-0.05↓0.34	-0.06	-0.04↓0.33
SimVerb-3500	0.16	0.00	-0.02↓0.17	-0.03	-0.03↓0.18
Verb-143	0.32	0.24	0.29↓0.02	0.34	0.24↓0.07
WS-353	0.51	0.11	0.14↓0.36	0.18	0.12↓0.39
WS-353-REL	0.35	0.10	0.11↓0.24	0.12	0.11↓0.24
WS-353-SIM	0.66	0.10	0.15↓0.51	0.22	0.16↓0.51
YP-130	0.32	0.01	0.07↓0.25	0.11	-0.01↓0.33

Table 7: Results on semantic benchmarks. The ↓ represents the gap with the Paper Data. Note that the optimizer type does not affect the ‘Rand Init’ results. We used different seeds for these two runs; that’s why their Rand Init results are different.

Appendix C The Homophone List

The full 307 homophone pairs are listed in LABEL:tab:homo_full_list. Since our reproduction is based on Libri-Clean, which has a smaller vocab than the released embedding. Thus there are 33 missing pairs in our reproduction (marked by ‘-’).

Table 8: The homophone list.

Homophone Pairs		Raw	Official	Ours	Homophone Pairs		Raw	Official	Ours
ad	add	0.99	0.09	0.92	lieu	loo	0.98	0.43	0.91
ail	ale	0.98	0.49	0.75	links	lynx	1.00	0.24	0.96
air	heir	1.00	0.07	0.97	lo	low	0.99	0.22	0.98
all	awl	-	0.40	-	loan	lone	1.00	0.09	0.94
allowed	aloud	1.00	0.21	0.96	loot	lute	0.99	0.17	0.92
alms	arms	0.95	0.26	0.95	made	maid	1.00	0.25	0.98
altar	alter	0.99	0.13	0.95	mail	male	1.00	0.23	0.98
arc	ark	0.98	0.19	0.91	main	mane	0.99	0.17	0.97
aren’t	aunt	0.96	0.44	0.95	maize	maze	0.99	0.25	0.93
ate	eight	1.00	0.22	1.00	manna	manner	0.97	0.33	0.89
auger	augur	-	0.51	-	mantel	mantle	1.00	0.43	0.95
awe	oar	0.92	0.26	0.78	mare	mayor	0.99	0.38	0.96
axel	axle	0.99	0.41	0.86	marshal	martial	1.00	0.60	0.98
aye	eye	0.99	0.29	0.89	marten	martin	-	0.36	-
bail	bale	0.99	0.36	0.98	mask	masque	-	0.51	-
bait	bate	0.98	0.41	0.95	mean	mien	0.98	0.24	0.90
baize	bays	0.99	0.50	0.91	meat	meet	1.00	0.13	0.99
bald	bawled	0.99	0.21	0.96	medal	meddle	0.99	0.11	0.90
bard	barred	0.98	0.01	0.96	metal	mettle	1.00	0.31	0.92
bare	bear	1.00	0.51	1.00	meter	metre	0.99	0.59	0.95
bark	barque	0.96	0.33	0.86	might	mite	0.99	0.38	0.96
baron	barren	1.00	0.07	0.97	mind	mined	-	0.21	-
base	bass	0.99	0.28	0.98	miner	minor	1.00	0.17	0.95
bazaar	bizarre	0.98	0.12	0.92	missed	mist	0.99	0.20	0.98
be	bee	0.99	0.35	0.98	moat	mote	0.99	0.46	0.93
beach	beech	1.00	0.47	0.97	mode	mowed	0.99	0.21	0.94
bean	been	0.92	0.29	0.96	moor	more	0.99	0.41	0.94
beat	beet	-	0.41	-	morning	mourning	1.00	0.46	0.99
beer	bier	0.99	0.24	0.96	muscle	mussel	0.99	0.48	0.95
bel	bell	0.99	0.29	0.98	naval	navel	-	0.34	-
berry	bury	0.99	0.16	0.97	nay	neigh	-	0.34	-
berth	birth	1.00	0.08	1.00	none	nun	0.99	0.35	0.84
bitten	bittern	-	0.63	-	ode	owed	0.99	0.30	0.95
blew	blue	1.00	0.48	1.00	oh	owe	0.97	0.55	0.82
boar	bore	0.99	0.39	0.96	one	won	1.00	0.39	1.00
board	bored	1.00	0.13	0.99	pail	pale	1.00	0.27	0.99
boarder	border	0.99	0.34	0.92	pain	pane	1.00	0.38	0.98
bold	bowled	0.99	0.50	0.92	pair	pare	0.99	0.35	0.95
born	borne	1.00	0.29	0.99	palate	palette	0.98	0.50	0.88
bough	bow	0.99	0.48	0.97	pause	paws	1.00	0.34	0.99
boy	buoy	0.96	0.35	0.88	pea	pee	0.93	0.54	0.87
braid	brayed	-	0.60	-	peace	piece	1.00	0.15	0.99
brake	break	1.00	0.38	0.99	peak	peek	-	0.30	-
bread	bred	1.00	0.24	0.99	peal	peel	0.99	0.21	0.96
bridal	bridle	1.00	0.36	0.97	peer	pier	0.99	0.33	0.96
broach	brooch	0.98	0.52	0.89	plain	plane	1.00	0.40	0.99
but	butt	0.97	0.37	0.92	pleas	please	-	0.42	-
buy	by	0.99	0.25	0.98	plum	plumb	0.99	0.57	0.96
calendar	calender	0.98	0.39	0.95	pole	poll	0.98	0.19	0.96
canvas	canvass	0.99	0.41	0.98	practice	practise	1.00	0.74	0.99
cast	caste	0.99	0.33	0.98	praise	prays	0.99	0.51	0.99
caught	court	0.95	0.11	0.91	principal	principle	1.00	0.48	1.00
ceiling	sealing	1.00	0.63	0.95	profit	prophet	1.00	0.30	0.99
cell	sell	0.99	0.06	0.95	quarts	quartz	0.99	0.49	0.95
censer	censor	0.98	0.35	0.78	rain	reign	1.00	0.18	0.99
cent	scent	0.99	0.00	0.98	raise	rays	1.00	0.10	0.99
check	cheque	0.99	0.61	0.98	rap	wrap	0.99	0.40	0.96
chord	cord	1.00	0.42	0.98	raw	roar	0.96	0.18	0.88
cite	sight	0.99	0.24	0.97	read	reed	0.97	0.28	0.97
clew	clue	0.99	0.83	0.96	real	reel	1.00	0.27	0.97
climb	clime	0.99	0.29	0.93	reek	wreak	0.98	0.33	0.92
coarse	course	1.00	0.20	0.99	rest	wrest	0.99	0.54	0.95
coign	coin	-	0.37	-	right	rite	1.00	0.19	0.96
colonel	kernel	0.98	0.02	0.94	ring	wring	0.99	0.48	0.97
complacent	complaisant	-	0.54	-	road	rode	1.00	0.66	0.96
complement	compliment	1.00	0.27	0.96	roe	row	0.95	0.23	0.80
coo	coup	0.98	0.35	0.87	role	roll	1.00	0.18	0.99
cops	copse	0.98	0.39	0.96	rood	rude	-	0.32	-
council	counsel	1.00	0.52	0.99	root	route	0.98	0.06	0.98
creak	creek	0.99	0.37	0.98	rose	rows	1.00	0.37	0.99
crews	cruise	1.00	0.67	0.95	rote	wrote	0.98	0.46	0.91
currant	current	0.99	0.11	0.97	rough	ruff	0.99	0.51	0.94
dam	damn	0.99	0.21	0.95	rung	wrung	1.00	0.33	0.97
days	daze	0.99	0.47	0.94	rye	wry	0.99	0.35	0.92
dear	deer	0.99	0.08	0.98	sale	sail	1.00	0.32	0.99
descent	dissent	1.00	0.42	0.98	sane	seine	-	0.09	-
desert	dessert	0.99	0.14	0.97	sauce	source	0.97	0.22	0.97
dew	due	1.00	0.25	0.98	saw	soar	0.96	0.33	0.92
die	dye	0.99	0.26	0.89	scene	seen	1.00	0.41	1.00
done	dun	0.99	0.21	0.97	sea	see	1.00	0.31	0.98
draft	draught	0.99	0.30	0.96	seam	seem	0.98	0.30	0.93
dual	duel	0.98	0.48	0.88	sear	seer	0.98	0.57	0.86
earn	urn	0.99	0.04	0.89	seas	sees	0.99	0.18	0.96
ewe	yew	0.99	0.45	0.90	sew	so	0.99	0.55	0.98
faint	feint	0.99	0.55	0.95	shear	sheer	0.99	0.25	0.94
fair	fare	1.00	0.42	0.99	side	sighed	1.00	0.31	0.99
farther	father	0.98	0.29	0.99	sign	sine	-	0.39	-
feat	feet	1.00	0.37	0.99	slay	sleigh	0.99	0.30	0.97
few	phew	-	0.38	-	sole	soul	1.00	0.43	0.98
find	fined	0.99	0.24	0.97	some	sum	0.99	0.34	0.99
fir	fur	1.00	0.32	0.99	son	sun	1.00	0.26	1.00
flaw	floor	0.97	0.30	0.87	sort	sought	0.96	0.21	0.96
flea	flee	0.99	0.36	0.98	staid	stayed	1.00	0.55	0.99
floe	flow	0.99	0.42	0.89	stair	stare	1.00	0.27	0.99
flour	flower	1.00	0.30	0.99	stake	steak	1.00	0.19	0.98
for	fore	0.95	0.26	0.99	stalk	stork	0.96	0.55	0.92
fort	fought	0.95	0.42	0.93	stationary	stationery	0.99	0.27	0.85
forth	fourth	1.00	0.25	1.00	steal	steel	1.00	0.19	0.99
foul	fowl	1.00	0.36	0.97	stile	style	0.99	0.21	0.89
franc	frank	0.98	0.31	0.95	storey	story	0.99	0.37	0.95
freeze	frieze	-	0.38	-	straight	strait	1.00	0.39	0.97
furs	furze	-	0.56	-	sweet	suite	1.00	0.03	0.99
gait	gate	1.00	0.31	0.99	tacks	tax	0.98	0.25	0.96
gamble	gambol	-	0.41	-	tale	tail	1.00	0.23	1.00
gild	guild	-	0.41	-	taught	taut	0.99	0.12	0.93
gilt	guilt	1.00	0.21	0.99	team	teem	-	0.28	-
gnaw	nor	0.95	0.38	0.84	tear	tier	0.98	0.24	0.92
grate	great	1.00	0.15	0.97	teas	tease	0.99	0.35	0.97
greys	graze	0.98	0.65	0.90	tern	turn	-	0.32	-
grisly	grizzly	0.99	0.64	0.88	there	their	0.99	0.34	0.98
groan	grown	0.99	0.25	0.95	threw	through	0.99	0.39	0.99
guessed	guest	1.00	0.33	0.99	throes	throws	0.99	0.31	0.95
hail	hale	0.99	0.34	0.97	throne	thrown	1.00	0.30	0.96
hair	hare	1.00	0.32	0.98	thyme	time	0.99	0.25	0.97
hall	haul	1.00	0.13	0.93	tide	tied	1.00	0.10	0.99
hart	heart	0.99	0.23	0.94	tire	tyre	0.99	0.21	0.90
haw	hoar	0.91	0.12	0.71	to	too	0.97	0.63	0.97
hay	hey	0.98	0.37	0.92	toad	toed	0.98	0.55	0.96
he’d	heed	0.99	0.39	0.98	told	tolled	0.98	0.33	0.96
heal	heel	1.00	0.37	0.95	ton	tun	-	0.55	-
hear	here	1.00	0.58	0.98	vain	vane	0.99	0.19	0.96
heard	herd	0.99	0.32	0.97	vale	veil	1.00	0.65	0.98
hew	hue	0.99	0.34	0.94	vial	vile	0.99	0.33	0.94
hi	high	0.98	0.18	0.98	wain	wane	0.98	0.73	0.86
higher	hire	1.00	0.27	0.98	waist	waste	1.00	0.13	0.99
him	hymn	0.99	0.28	0.98	wait	weight	1.00	0.21	0.99
ho	hoe	0.98	0.40	0.96	waive	wave	0.99	0.22	0.95
hoard	horde	0.99	0.47	0.89	war	wore	0.99	0.23	0.96
hoarse	horse	1.00	0.21	0.99	ware	wear	0.99	0.32	0.97
hour	our	0.98	0.28	0.94	warn	worn	1.00	0.02	0.98
idle	idol	1.00	0.46	0.97	watt	what	-	0.52	-
in	inn	0.98	0.38	0.95	wax	whacks	-	0.48	-
it’s	its	0.99	0.20	0.95	way	weigh	0.99	0.36	0.96
key	quay	0.97	0.33	0.95	we	wee	0.97	0.21	0.96
knave	nave	0.99	0.06	0.90	we’d	weed	0.96	0.26	0.94
knead	need	0.99	0.51	0.98	weak	week	1.00	0.10	1.00
knew	new	1.00	0.31	0.99	weal	we’ll	0.94	0.15	0.90
knight	night	1.00	0.26	0.99	wean	ween	-	0.57	-
knob	nob	-	0.59	-	weather	whether	0.99	0.21	0.95
knot	not	1.00	0.26	0.98	weir	we’re	-	0.08	-
know	no	1.00	0.63	0.97	were	whirr	0.96	0.27	0.83
knows	nose	1.00	0.18	0.99	which	witch	0.99	0.28	0.94
lac	lack	-	0.10	-	whig	wig	0.99	0.16	0.96
lain	lane	0.99	0.30	0.96	whine	wine	1.00	0.13	0.97
laps	lapse	-	0.32	-	whirl	whorl	0.97	0.37	0.78
larva	lava	-	0.45	-	whirled	world	1.00	0.26	0.96
law	lore	0.96	0.33	0.89	whit	wit	0.99	0.50	0.96
lea	lee	0.98	0.53	0.89	white	wight	0.99	0.32	0.95
lead	led	0.94	0.59	0.97	who’s	whose	0.99	0.26	0.99
lessen	lesson	1.00	0.26	0.92	woe	whoa	0.97	0.23	0.90
levee	levy	0.98	0.45	0.82	wood	would	0.98	0.26	0.96
liar	lyre	0.99	0.31	0.93	yoke	yolk	0.99	0.32	0.88
licence	license	0.99	0.72	0.94	you’ll	yule	0.95	0.48	0.91
lie	lye	0.97	0.47	0.71