Robust Semantic Communications with Masked VQ-VAE Enabled Codebook

Qiyu Hu, Student Member, IEEE, Guangyi Zhang, Student Member, IEEE, Zhijin Qin, Senior Member, IEEE,
Yunlong Cai, Senior Member, IEEE, Guanding Yu, Senior Member, IEEE, and Geoffrey Ye Li, Fellow, IEEE
Q. Hu, G. Zhang, Y. Cai, and G. Yu are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: qiyhu@zju.edu.cn; zhangguangyi@zju.edu.cn; ylcai@zju.edu.cn; yuguanding@zju.edu.cn). Z. Qin is with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (e-mail: qinzhijin@tsinghua.edu.cn). Geoffrey Ye Li is with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, UK (e-mail: geoffrey.li@imperial.ac.uk). This work was supported in part by the National Natural Science Foundation of China (NSFC Nos 62293484 and 61925105) and in part by Tsinghua University-China Mobile Communications Group Company Ltd. Joint Institute. This work was supported in part by the Fundamental Research Funds for the Central Universities 226-2022-00195, the Defense Industrial Technology Development Program under Grant JCKY2020210B021. Zhejiang Provincial Key Laboratory of Information Processing, Communication and Networking (IPCAN), Hangzhou 310027, China.
Abstract

Although semantic communications have exhibited satisfactory performance on a large number of tasks, the impact of semantic noise and the robustness of the systems have not been well investigated. Semantic noise refers to the misleading between the intended semantic symbols and received ones, thus causes the failure of tasks. In this paper, we first propose a framework for the robust end-to-end semantic communication systems to combat the semantic noise. In particular, we analyze sample-dependent and sample-independent semantic noise. To combat the semantic noise, the adversarial training with weight perturbation is developed to incorporate the samples with semantic noise in the training dataset. Then, we propose to mask a portion of the input, where the semantic noise appears frequently, and design the masked vector quantized-variational autoencoder (VQ-VAE) with the noise-related masking strategy. We use a discrete codebook shared by the transmitter and the receiver for encoded feature representation. To further improve the system robustness, we develop a feature importance module (FIM) to suppress the noise-related and task-unrelated features. Thus, the transmitter simply needs to transmit the indices of these important task-related features in the codebook. Simulation results show that the proposed method can be applied in many downstream tasks and significantly improve the robustness against semantic noise with remarkable reduction on the transmission overhead.

Index Terms:
Adversarial training, feature importance module (FIM), masked vector quantized-variational autoencoder (VQ-VAE), robust semantic communications, semantic noise.

I Introduction

Traditional communication systems focus on efficient symbol transmission and accurate symbol recovery [1], where symbol-error rate (SER) and bit-error rate (BER) are usually used as the performance metrics. Various new applications generate unprecedented amounts of data for serving different types of tasks while the conventional communication system is facing the bottleneck to support such massive amount of data [2]. Moreover, it causes critical challenges for conventional communications to support massive connectivity over limited spectrum resources but with low latency [3].

I-A Prior Work

With the development of deep neural networks (DNNs) [3] and end-to-end learning [4, 5, 6], semantic communications, which extract and transmit the task-related meanings of data [7, 8, 9, 10], have emerged as a key technology and received great attention. For example, in the image transmission for object detection task, the location and shape of an object are task-related while the background is unrelated to the task and unnecessary to transmit. Moreover, semantic communication is robust to unfriendly channel environments, i.e., low signal-to-noise ratio (SNR), which fits well the applications requiring high reliability [11]. These advantages motivate us to develop communication systems by considering the semantic meaning beyond digital bits to enhance transmission accuracy and efficiency.

The existing works on semantic communications can be divided into two categories: data reconstruction [10, 12, 11, 13, 14, 15, 16] and task execution [17, 18, 19, 20, 21]. For the data reconstruction, the semantic information behind the data is extracted and only the related data is reconstructed based on the received semantic information. The joint semantic-channel coding (JSCC) scheme in [11] extracts the semantic information from text. In [13], reinforcement learning is exploited to recover the text. The semantic communication system in [14] with channel feedback can improve the quality of image reconstruction. The attention-based semantic communication system in [15] focuses on speech signals.

For the task-specific applications, the task-related semantic information encoded at the transmitter is directly applied for task execution at the receiver. In particular, the image classification-oriented semantic communications in [17] can improve the recognition accuracy. The task of image re-identification for a person in [18] can enhance the retrieval accuracy. The semantic communication system in [19] considers edge inference with classification task. The semantic communication system in [20] is designed for the multi-modal data task: the visual question answering.

Though the aforementioned deep learning (DL)-based semantic communication systems have exhibited very impressive performance in certain tasks, the impact of noise and the system robustness still need to be further investigated. There have been studies on analyzing the generation and characteristics of different kinds of image and text noise and a number of denoising algorithms have been proposed [22]. However, there exists a particular type of noise in semantic communications that has not been well studied [8]. More or less similar to the noise in the conventional systems, semantic noise causes misunderstanding of semantic information and decoding errors, which results in misleading between the intended semantic meaning and the reconstructed one at the receiver. The semantic noise can be generated in different stages, including semantic encoding, data transmission, and decoding [9]. In the semantic encoding stage, the semantic noise corresponds to the mismatch between the original signal and the encoded signal after semantic encoding, which is related to the representational capability of the encoder. In the data transmission stage, the signal distortion caused by channel fading and some well-designed signals sent by malicious attackers both introduce semantic noise. In the decoding stage, the misinterpretation, incorrect representation, and confusion of meanings introduce semantic noise to the receiver. For example, in the process of semantic symbol recognition, the same semantic symbol is employed to represent different sets of data with different meanings when the reconstructed symbols are ambiguous.

Semantic noises for different categories of sources, e.g., text and image, are generally different [23]. The semantic noise in text refers to semantic ambiguity, where slight changes to words in the sentence, e.g., synonym replacement or randomly reverse alphabetical order, may make the DL model misunderstand the semantic meaning of the sentence [24]. The semantic noise in the image can be modeled based on the adversarial samples [25], which is different from that in the text. Due to the discrete nature of text, it is impossible to add perturbation to the text without being noticed by humans. However, some subtle modification can be added to the images that is barely noticeable to humans. Adversarial samples mislead DL models and cause significant performance degradation, but they look identical to the original images for humans.

In this paper, we focus on the semantic noise in the image domain and that in the text domain can be handled in a similar way. The methods for generating adversarial samples in the image domain can be classified into two categories: (i) the sample-dependent method, which fools a DNN on a single image [26, 27, 28, 29]; (ii) the sample-independent universal method, which fools a DNN on any image with a high probability [30, 31]. In particular, the methods on the first category include the fast gradient sign method (FGSM) [26], projected gradient descent (PGD) [27], Jacobian-based saliency map attack (JSMA) [28], and deepfool algorithm [29]. The existing DL-enabled systems are vulnerable and particularly unstable to these adversarial samples, where small and imperceptible perturbations of the data samples are sufficient to fool them and would result in incorrect results [25]. To improve the robustness of DL models against adversarial samples, there have been some methods, such as input denoising [32], defensive distillation [33], gradient regularization [34], weight perturbation [35], and adversarial training [36].

I-B Motivation and Contributions

It is foreseen that the robustness of semantic communications in terms of security and reliability is important in future applications, such as self-driving vehicles and medical diagnosis. Although there have been a number of methods for generating adversarial perturbations in the field of image processing, the semantic noise model in wireless communications has not been well investigated. Moreover, the performance of these methods against adversarial perturbations is unsatisfactory and even deteriorates in the clean samples without noise. More importantly, the impacts of wireless channels and transmission overhead in communications are usually ignored. Therefore, in this paper, we model the semantic noise in the communication field and design a robust semantic communication system that can effectively combat semantic noise with lower transmission overhead. We propose a DL-enabled end-to-end robust semantic communication system to combat the semantic noise. Both the transmitter and receiver are represented by DNNs. Since it is a data-driven method without pre-assumed channel models as a prerequisite, it can potentially provide a solution with high generalization capability to various communication scenarios.

We firstly model the semantic noise in practical wireless communication environments. In particular, we employ the iterative FGSM method to generate the sample-dependent semantic noise at the transmitter instantly, which adds different semantic noise at each image. Since it is difficult to acquire each channel and the transmitted signal, we further propose an iterative method to generate the sample-independent semantic noise at the receiver. It adds the same semantic noise to different transmitted images and fools most images without requiring the channel state information (CSI) and the transmitted images. To combat the semantic noise, we propose an adversarial training method with weight perturbation, which incorporates the samples with semantic noise in the training dataset to solve a complicated min-max optimization problem. Then, the masked vector quantized-variational autoencoder (VQ-VAE) with vision Transformer (ViT) blocks [37, 38] is designed as the architecture of the robust semantic communication system. A novel strategy is proposed to mask a portion of the original image, where the semantic noise appears with a high probability. Moreover, a discrete codebook shared by the transmitter and the receiver is designed for encoded feature representation. It focuses on the task-related feature representation and neglects imperceptible noise-related details, which reduces the impact of semantic noise.

To further improve the system robustness, we design a feature importance module (FIM) that dynamically learns and incorporates the feature importance to the masked VQ-VAE. It inherently suppresses the task-unrelated and noise-related features. Thus, the transmitter only needs to send the indices of the important task-related features in the codebook. Furthermore, the SNR is incorporated into the FIM, which ensures that the proposed system can successfully operate in a wide range of SNR levels. Moreover, existing works in semantic communications focus on mapping the source data directly into channel symbols for transmission. It is called full-resolution constellation because the constellation points can appear anywhere in the constellation. However, this is difficult to realize for practical systems due to the finite precision and might be impractical in the current digital communication systems [10, 11, 12]. Our proposed masked VQ-VAE model with FIM can design a discrete codebook-based system, which is more practical and can be easily realized in the current digital communication systems since the indices of the features can be directly mapped into symbols via employing the existing constellation. Simulation results show that our proposed method can be applied in many downstream tasks and can significantly improve the robustness of semantic communication systems against semantic noise with much reduced transmission overhead. The main contributions of this paper are summarized as follows:

  • Based on adversarial perturbations in computer vision [30], taking into account the modulation, channel, and demodulation in communication systems, we model the sample-dependent and sample-independent semantic noise added at the transmitter and receiver, respectively.

  • Based on the basic adversarial training method [27] and optimization theory, we propose an adversarial training method with weight perturbation to combat the semantic noise, with the consideration of the effects caused by channel impairments in wireless communication systems.

  • We develop a masked VQ-VAE model with a masking strategy as the architecture of the robust semantic communication system. A discrete codebook shared by the transmitter and receiver is designed for encoded feature representation, which fits the current digital communication system well.

  • We provide performance analysis and propose a novel loss function to improve system robustness based on semantic similarity.

  • To further improve the system robustness, we design the FIM to suppress the noise-related and task-unrelated features.

I-C Organization and Notations

The rest of paper is structured as follows. Section II models the semantic noise and proposes a general framework of the semantic communication system to combat the semantic noise. Section III designs the masked VQ-VAE with a masking strategy and a discrete codebook for encoded feature representation. Section IV improves the semantic communication systems for stronger robustness by designing the FIM with dynamic SNR. The simulation results are presented in Section V. Finally, the paper is concluded in Section VI.

Notations: Scalars, vectors, and matrices are respectively denoted by lower case, boldface lower case, and boldface upper case letters. Notation 𝐈 represents an identity matrix and 𝟎 denotes an all-zero matrix. For a matrix 𝐀, 𝐀T, 𝐀, 𝐀H, 𝐀1, 𝐀, and 𝐀 are its transpose, conjugate, conjugate transpose, inversion, pseudo-inversion, and Frobenius norm, respectively. For a vector 𝐚, 𝐚 is its Euclidean norm. Finally, m×n(m×n) are the space of m×n complex (real) matrices.

II Framework of Robust Semantic Communications

In this section, we model the semantic noise and propose the framework of robust end-to-end semantic communication systems with adversarial training to combat the semantic noise.

II-A Semantic Communication Systems

Refer to caption
Figure 1: The framework of the robust semantic communication system with semantic noise.

As shown in Fig. 1, the transmitter maps the source, 𝐬, into a symbol stream, 𝐱, and then passes it through the physical channel with transmission impairments. The received symbol stream, 𝐲, is decoded at the receiver to obtain an estimation of the source, 𝐬^. Both the transmitter and receiver are represented by DNNs, which are jointly designed. In particular, the DNNs at the transmitter consist of the semantic encoder and the channel encoder while the DNNs at the receiver consist of the channel decoder and the semantic decoder. The semantic encoder learns to extract the semantic information from the transmitted data and transforms it into an encoded feature vector while the semantic decoder learns to recover the transmitted data from the received signals. Moreover, the channel encoder and channel decoder aim at eliminating the signal distortion caused by wireless channels.

Assuming that the input is an image, we consider a system with Nt transmit antennas and Nr receive antennas. The encoded symbol stream can be represented by

𝐱=f2(f1(𝐬;𝜽1);𝜽2), (1)

where 𝐱Nt×1, 𝜽1 and 𝜽2 denote the trainable parameters of the semantic encoder, f1(), and the channel encoder, f2(), respectively. Subsequently, the received signal, 𝐲Nr×1, is given by

𝐲=𝐇𝐱+𝐧, (2)

where 𝐇Nr×Nt denotes the channel matrix and 𝐧Nr×1𝒞𝒩(𝟎,σ2𝐈) is the additive white Gaussian noise (AWGN). Correspondingly, the decoded signal is given as

𝐬^=g1(g2(𝐲;𝚯2);𝚯1), (3)

where 𝚯1 and 𝚯2 denote the trainable parameters of the semantic decoder, g1(), and the channel decoder, g2(), respectively. For clarity, we denote 𝜽 as the trainable parameters and f𝜽() as the DNNs in the considered semantic communication systems. Thus, we have 𝐬^=f𝜽(𝐬). The goal of this system is to minimize the semantic error while reducing the number of symbols to be transmitted.

II-B Generation of Semantic Noise

We consider the semantic noise generated at the transmitter and the receiver.

II-B1 Semantic Noise at Transmitter

This kind of semantic noise is generated in the encoding stage. Consider the scenario where a malicious attacker downloads the image dataset, adds semantic noise to each image, and then uploads the modified dataset. The semantic noise has serious impact on the encoding process and will mislead the DL models to generate wrong results for tasks. However, since the semantic noise is so subtle that the legitimate users barely notice it, they will use these contaminated images as usual. Semantic noise also exists in the nature. For example, images obtained by taking pictures of the adversarial samples could also cause misclassification [39].

The goal of the semantic communication system is to minimize the loss function for serving a specific task, e.g., the mean square error for image reconstruction and the cross entropy for classification task, etc. In contrast, the semantic noise aims to maximize the loss function. Let 𝒮={𝐬1,𝐬2,,𝐬I} be a set of images sampled from the training dataset. Then, the generation of semantic noise for the i-th image, 𝐬i, can be modeled as the solution of the following optimization problem,

P1: maxΔ𝐬i (f𝜽(𝐬i+Δ𝐬i),𝐳i) (4a)
s.t. Δ𝐬ipϵ, (4b)

where 𝐬i and f𝜽(𝐬i+Δ𝐬i) denote the i-th input image and the output of the DNN, respectively, Δ𝐬i is the semantic noise generated for the i-th image, and () denotes the loss function of DNN for a specific task. In addition, 𝐳i denotes the target associated with 𝐬i, e.g., the true label for classification task and the original image for image reconstruction task, etc. Note that p is the p-norm and constraint (4b) limits the power of semantic noise to avoid being observed by humans. Unless otherwise stated, we select p=, i.e., infinite norm, in this paper.

To solve this problem, we employ the FGSM in [26], which linearizes the loss function as

(f𝜽(𝐬i+Δ𝐬i),𝐳i)(f𝜽(𝐬i),𝐳i)+(Δ𝐬i)T𝐬i(f𝜽(𝐬i),𝐳i). (5)

It is minimized by setting Δ𝐬i=α𝐬i(f𝜽(𝐬i),𝐳i), where α is a scaling factor to constrain the power of semantic noise to ϵ in (4b). Then, we obtain the semantic noise with power ϵ as

Δ𝐬i=ϵsign(𝐬i(f𝜽(𝐬i),𝐳i)), (6)

where sign(x)=1 for x0 and sign(x)=1 for x<0. Then, the contaminated sample with semantic noise becomes 𝐬i=𝐬i+Δ𝐬i. This semantic noise is only generated with one-step iteration of the gradient descent method. To increase the impact of Δ𝐬i on the system, we propose to employ the iterative process

𝐬i(k+1)=Πϵ(𝐬i(k)+αsign(𝐬i(f𝜽(𝐬i(k)),𝐳i))), (7)

where k denotes the iteration index and Π is the projection operator. α is selected to satisfy Kα>ϵ to ensure the full advantage of the noise power ϵ, where K denotes the number of iterations.

II-B2 Semantic Noise at the Receiver

Input : The generated channel samples {𝐇a(1),𝐇a(2),,𝐇a(N)} and a set of collected received signals {𝐱~(1),𝐱~(2),,𝐱~(N)} with their true labels {𝐳(1),𝐳(2),,𝐳(N)}. The number of iterations K, the power constraint ϵ, the scaling factor α, the weighting coefficient ρ, and the model of decoder.
Output : The sample-independent semantic noise Δ𝐱.
Initialize : 𝜹1=𝟎, Δ𝐱(n)=𝟎, and 𝜹norm=𝟎.
1 for n1 to N do
2 % Generate the sample-dependent semantic noise for the n-th sample.
3 for k1 to K do
4 𝜹norm=𝐇a(n)𝐱~k(n)(gd𝜽(𝐱~k(n)),𝐳(n))𝐇a(n)𝐱~k(n)(gd𝜽(𝐱~k(n)),𝐳(n))2;
5 𝐱~k+1(n)=𝐱~k(n)+α𝐇a(n)𝜹norm;
6 𝜹k+1=𝜹k+α𝜹norm;
7
8 end for
9 Weighted average: Δ𝐱(n)=Δ𝐱(n1)+ρ𝜹K𝜹K2;
10 Normalization: Δ𝐱(n)=ϵΔ𝐱(n)Δ𝐱(n)2;
11
12 end for
Output the sample-independent semantic noise Δ𝐱=Δ𝐱(N).
Algorithm 1 Generation of the sample-independent semantic noise at the receiver

It corresponds to the semantic noise generated in the transmission and decoding stages, which leads to the failure of decoding and the misunderstanding caused by the interpretation of the receivers. It may come from the non-ideal characteristics of hardware, the signal distortion caused by channel fading, or malicious attackers [9, 31]. Consider a legitimate transmitter that transmits the encoded signal, 𝐱, to the receiver and a malicious attacker that sends the semantic noise, Δ𝐱, for attacking. Then, the received signal at the receiver is given by

𝐲=𝐇t𝐱+𝐧+𝐇aΔ𝐱𝐱~+𝐇aΔ𝐱, (8)

where 𝐱~𝐇t𝐱+𝐧 denotes the received signal without semantic noise, 𝐇t is the channel between the legitimate transmitter and receiver, 𝐇a denotes the channel between the attacker and receiver, and 𝐧 is the AWGN. Note that the semantic noise model in (8) at the receiver is a general model, where 𝐇a can be removed when studying the effects of non-ideal characteristics of hardware and the signal distortion caused by channel fading.

To generate the sample-dependent semantic noise as in Section II-B1, we need to assume that the attacker knows: (i) the exact channel between the attacker and the receiver, 𝐇a; and (ii) the received signal at the receiver, 𝐱~, in advance, which are not always practical in real wireless communication systems. Therefore, we aim to find a sample-independent semantic noise, Δ𝐱, that fools most images in the dataset, 𝒮, by assuming that the attacker only knows the channel statistics rather than the exact channel. In particular, we first generate N channel realizations {𝐇a(1),𝐇a(2),,𝐇a(N)} based on the channel statistics and collect a set of received signals {𝐱~(1),𝐱~(2),,𝐱~(N)}. Then, we generate the sample-independent semantic noise, Δ𝐱(n), by using 𝐱~(n) and 𝐇a(n) for n=1,2,,N, instead of using the real channel. Specifically, we select the scaling factor, α, satisfying Kα>ϵ to ensure that we can take full advantage of the noise power, ϵ. To maximize the received power of semantic noise and effectively fool the decoder, the attacker has to fully utilize channel 𝐇a. Thus, if semantic noise Δ𝐱 is multiplied by the conjugate of the channel, 𝐇a, then the received power of semantic noise after going through the channel is maximized. The generated semantic noise vectors for all the samples, {Δ𝐱(n)}, are weighted averaged and normalized. The details are presented in Algorithm 1, where 𝜽d denotes the parameters of the decoder g𝜽d() and g𝜽d(𝐱~(n)) is the output of the decoder for the n-th sample. In the following, we design the robust semantic communication systems based on the sample-dependent semantic noise model proposed in Section II-B1 and the sample-independent semantic noise can be handled in a similar way.

Remark 1.

There exists semantic noise in the original images naturally. Here we take the image classification task as an example, where the “misunderstanding” caused by semantic noise refers to the “misclassification”, e.g., the transmitter sends an image with a dog but the receiver classifies it as a cat. The misclassification rate would never become zero, due to the inevitable semantic noise that exists in the original images naturally. Moreover, the semantic noise of different datasets is generally different. In particular, the semantic noise power of some simple datasets is low, e.g., MNIST, which is easy to achieve high classification accuracy, while many complicated datasets are with high-power semantic noise, e.g., ImageNet. Besides, the capabilities of deep learning models in handling semantic noise are different. The powerful model with complicated structures can effectively eliminate the semantic noise to achieve better classification accuracy, e.g., ResNet-101. Our proposed semantic noise model strengthens such kind of misunderstanding and requires higher robustness for system.

II-C Adversarial Training

II-C1 Basic Adversarial Training

The key idea of adversarial training against semantic noise is to add the samples corrupted by the semantic noise into the training dataset [26]. In particular, the trainable parameters, 𝜽, and semantic noise, Δ𝐬i, are updated iteratively to improve the model robustness. It can be formulated as solving the following min-max optimization problem,

P2: min𝜽 1Ii=1ImaxΔ𝐬i(f𝜽(𝐬i+Δ𝐬i),𝐳i) (9a)
s.t. Δ𝐬ipϵ, (9b)

where I denotes the number of training samples.

To solve (P2), the following two steps are executed iteratively: (i) compute 𝐬i based on (7) or Algorithm 1 and obtain the semantic noise by Δ𝐬i=𝐬i𝐬i. Note that 𝜽 is fixed in this step and we add the samples, 𝐬i, into the training dataset; (ii) update 𝜽 by the SGD based on the training samples, 𝐬i, to minimize the loss function.

II-C2 Adversarial Training with Weight Perturbation

To further improve the robustness against the semantic noise, we add the weight perturbation, 𝝂, on the trainable parameters and reformulate the problem as

P3: min𝜽max𝝂 1Ii=1ImaxΔ𝐬i(f𝜽+𝝂(𝐬i+Δ𝐬i),𝐳i) (10a)
s.t. Δ𝐬ipϵ,𝝂pγ𝜽p. (10b)

Intuitively, the semantic noise, Δ𝐬i, and weight perturbation, 𝝂, lead to the increase of loss function () for the i-th sample and all the samples, respectively. Thus, the two “max” operations make solving the inner maximization problem effectively, which results in a better solution of the whole min-max problem [35]. We solve (P3) by Algorithm 2, where the update of weight perturbation 𝝂 is

𝝂Πγ(𝝂+η𝝂1Ii=1I(f𝜽+𝝂(𝐬i+Δ𝐬i),𝐳i)𝝂1Ii=1I(f𝜽+𝝂(𝐬i+Δ𝐬i),𝐳i)𝜽), (11)

which can be derived in a similar way to (7).

Input : The training dataset that consists of the input images, 𝐬i, with their true labels, 𝐳i.
Output : The parameters of the trained model, 𝜽.
1 for m1 to M do
2 Compute 𝐬i based on (7) or Algorithm 1 and obtain the semantic noise by Δ𝐬i=𝐬i𝐬i. Note that 𝜽 and 𝝂 are fixed in this step and the computed samples, 𝐬i, are added into the training dataset.
3 Update 𝝂 to maximize () by one step forward and backward propagation in (11) with fixed 𝜽 and Δ𝐬i.
4 Update 𝜽 to minimize () by the SGD with fixed 𝝂 based on the training samples 𝐬i.
5
6 end for
Algorithm 2 Proposed adversarial training algorithm for solving P3

III Masked VQ-VAE Enabled Discrete Codebook

In this section, we design the robust semantic communication systems with masked VQ-VAE. A novel masking strategy and a discrete codebook are designed to combat semantic noise with reduced transmission overhead. We provide some performance analysis and propose a novel loss function to improve system robustness based on semantic similarity. The proposed codebook is different from that of channel feedback [1] in two aspects: (i) we propose a novel masked VQ-VAE to train the codebook together with the encoder and decoder at the transceiver while that of channel feedback is designed by conventional algorithms, e.g., dictionary learning; (ii) we design the codebook for source compression to combat the semantic noise in semantic communications while that of channel feedback is for channel compression in conventional communication systems.

III-A Masked VQ-VAE

There exists information redundancy in various kinds of sources. The image has heavy spatial redundancy and a missing patch in the image can be recovered from its neighboring patches with the understanding of parts, objects, and scenes. Thus, the strategy of randomly masking partial patches is an efficient approach to create a challenging task that requires the model to build a comprehensive understanding of image statistics and semantic information, which also reduces the information redundancy. Moreover, since the semantic noise is added in the patches of the image, the masking operation can eliminate the effects of semantic noise to some extent.

III-A1 The Architecture of Masked VQ-VAE

Refer to caption
(a)
Refer to caption
(b)
Figure 2: (a) The architecture of the masked VQ-VAE with discrete codebook; (b) The discrete codebook against semantic noise.

As in Fig. 2(a), we employ the masked VQ-VAE with the ViT structure, where we randomly mask patches from the input images and aim to reconstruct the missing patches. The masked VQ-VAE belongs to the autoencoder, but can reconstruct the original image from partial observations. Unlike conventional autoencoders, we adopt an asymmetric encoder-decoder architecture.

In particular, the encoder only needs to process a small portion of the unmasked patches and maps them to the encoded features for transmission, which significantly reduces the training time and memory consumption. It removes the masked patches and embeds the unmasked patches with their positional information in the original image, and then processes them via a series of ViT blocks [37]. In contrast, the input to the decoder is the full set of tokens consisting of (i) the encoded features of the unmasked patches and (ii) the mask tokens, as shown in Fig. 2(a). Each mask token is a shared and learned vector that indicates the presence of a missing patch to be predicted. We add positional embeddings to all tokens in this full set. Without this, mask tokens would have no information about their location in the image. Moreover, the decoder is only used during pre-training to perform the image reconstruction task while the encoder is employed to extract the features of the input images. Thus, the decoder architecture, which is independent of the encoder design, can be flexibly designed. Compared with the conventional autoencoder in communication systems, the masked VQ-VAE has the following advantages:

  • With such an asymmetrical design, the encoder only processes the unmasked patches and the lightweight decoder reconstructs the image from the encoded features and the mask tokens. In this way, the computational complexity and training time can be significantly reduced.

  • The pre-trained masked VQ-VAE can be employed for different downstream tasks, e.g., classification, simply by changing the structure of the lightweight decoder and fine-tuning the masked VQ-VAE within a short time.

  • Transmitting the encoded features of the unmasked patches and the mask tokens to the decoder at the receiver leads to a large reduction in transmission overhead.

  • The masking operation can combat the semantic noise since part of the noise is masked.

III-A2 Masking Strategy Against Semantic Noise

We first divide an image into a number of non-overlapping patches. Then, we sample a subset of patches, mask and remove the remaining patches. A high masking ratio, i.e., the ratio of removed patches, largely eliminates redundancy. The “random sampling” strategy in [38] randomly samples patches following a uniform distribution, i.e., the masking probability of each patch is the same. However, the semantic noise does not appear randomly. It appears in the patches related to the objectives more frequently. Hence, to reduce the impact of semantic noise, we increase the masking probability of the patches effected severely by the semantic noise based on its statistics.

III-B Discrete Codebook for Encoded Feature Representation

We aim to design a discrete codebook for the encoded feature space and represent encoded features by the basis vectors in the codebook. We consider important task-related features and neglect task-unrelated features with noise and imperceptible details. Specifically, we set these basis vectors as trainable parameters and train them together with the parameters of both encoder and decoder. The encoder network outputs continuous encoded features and then maps them into the discrete indices of basis vectors in the trained codebook. This design comes with the following advantages:

  • It is simple to train the codebook with a small variance, which makes the semantic communication system more stable.

  • The discrete feature representation can combat semantic noise.

  • The transmitter simply needs to transmit the indices of the basis vectors, which significantly reduces the transmission overhead.

III-B1 Codebook Design

As in Fig. 2(a), we denote the codebook of the encoded features as [𝐞1,𝐞2,,𝐞J]J×D, which consists of J basis vectors {𝐞jD,j1,2,,J} and D is the dimension of each basis vector, 𝐞j. The model takes an input 𝐬 and it passes through an encoder to produce the encoded feature vector, 𝐳e(𝐬). Then, it is mapped to a basis vector, 𝐳b(𝐬), by the nearest neighbor look-up

𝐳b(𝐬)=argmin𝐞j𝐳e(𝐬)𝐞j2,𝐞j, (12)

where the indices of features are omitted for clarity. Then, 𝐳b(𝐬) is input to the decoder. We can treat this forward computation as a layer of DNN with a particular non-linear function that maps the encoded feature vector, 𝐳e(𝐬), to a basis vector, 𝐳b(𝐬). The basis vectors, {𝐞j,j}, in the codebook, , are trained together with the parameters of encoder and decoder. However, operation (12) is non-differentiable. Thus, in back propagation, we approximate the gradient by straight-through estimator [40] and copy gradients from decoder input, 𝐳b(𝐬), to encoder output, 𝐳e(𝐬). Therefore, the nearest basis vector, 𝐳b(𝐬), is passed to the decoder in forward propagation, and during the back propagation, the gradient, 𝐳b(𝐬)c, is passed unaltered to the encoder. Note that the gradients contain useful information for training the encoder to minimize the loss function and can push the encoder’s output, 𝐳e(𝐬), to be discretized efficiently to achieve better performance.

III-B2 Differentiable Loss Function

Our designed loss function consists of three components representing different parts of parameters:

c(𝐬,𝐳;𝜽,𝐞j) =𝐬^𝐳22+ng[𝐳e(𝐬)]𝐞j22 (13)
+β𝐳e(𝐬)ng[𝐞j]22,

where 𝐬, 𝐬^, and 𝐳 denote the input, output, and true label of the network, respectively, 𝜽 denotes the trainable parameters of the original DNN, and β is the hyper-parameter. Symbol ng[𝐳e(𝐬)] represents that there is no gradient passed to 𝐳e(𝐬) and its gradient is zero, which effectively constrains 𝐳e(𝐬) to be a non-updated constant. The first term is the reconstruction loss that trains the parameters of encoder and decoder. Due to the straight-through gradient estimation of mapping from 𝐳e(𝐬) to 𝐳b(𝐬), the basis vectors, {𝐞j,j}, receive no gradients from the reconstruction loss 𝐬^𝐳22. Therefore, in order to train the basis vectors, we employ the l2 error to move the basis vectors towards the encoded features, 𝐳e(𝐬), as shown in the second term of (13). Since the volume of the encoded feature space is dimensionless, the codebook can grow arbitrarily and cause the training process to diverge if the basis vectors, {𝐞j,j}, are not trained as fast as the encoder parameters. To address this issue, we add the third term in (13). In summary, the decoder is optimized by the first loss term only, the encoder is optimized by the first and the last loss terms, and the basis vectors are optimized by the middle loss term.

III-C Robustness of Codebook

III-C1 Semantic Similarity

The semantic similarity of two basis vectors, 𝐞1 and 𝐞2, in the codebook can be defined as the vector multiplication, 𝐞1T𝐞2, cosine distance, 𝐞1T𝐞2𝐞1𝐞2, or l2-norm, 𝐞1𝐞2, etc. The two basis vectors contain similar semantic information when their semantic similarity is high. We choose the cosine distance and compute all the semantic similarity between two basis vectors in the codebook. Denote the normalized codebook that consists of all the normalized basis vectors as

𝐄[𝐞1𝐞1,𝐞2𝐞2,,𝐞J𝐞J]. (14)

Hence, the (i,j)-th element of matrix 𝐄T𝐄 denotes the semantic similarity of basis vectors 𝐞i and 𝐞j.

III-C2 Improved Robust Codebook

The loss function for decreasing the semantic similarity, i.e., increasing the distance, among the basis vectors can be written as

s=𝐄T𝐄2. (15)

Based on semantic similarity, we add term 𝐄T𝐄2 into loss function (13) and try to make the basis vectors in the codebook, , mutually orthogonal, i.e., the semantic similarity between two basis vectors is small and their distance is large.

III-C3 Analysis of Codebook Robustness

The discrete representation is efficient to reduce the impact of the semantic noise. As shown in Fig. 2(b), the semantic noise causes the extracted feature to move towards some certain directions. Only if it does not leave far away from the basis vector corresponding to the original extracted feature that is not affected by the semantic noise, the impact of the semantic noise can be eliminated by this discrete representation. Thus, increasing the distance between two basis vectors can improve the robustness of the codebook against the semantic noise. Based on the semantic similarity, we propose a novel loss function to increase the distance between two basis vectors and try to make them mutually orthogonal. The orthogonal basis vectors have two advantages: (i) the distance between two orthogonal basis vectors is large; (ii) the required number of basis vectors to represent the encoded feature space is the least when basis vectors are mutually orthogonal.

III-D Efficient Transmission with Codebook

III-D1 Codebook-Based Constellation Diagram

Existing works in semantic communications focus on mapping the source data directly into channel symbols and assume the full-resolution constellation [10, 11, 12], that is, the constellation points can appear anywhere in the constellation. However, the full-resolution constellation is extremely complex for practical systems. Thus, we need to limit the number of constellation points. To address this issue, we propose a discrete codebook-based system, which is more practical for digital communication systems, since the indices of the features can be mapped into existing finite constellation and it can better fit the digital communication scheme. In general, discretization would cause the loss of data transmission accuracy. Fortunately, VQ-VAE is proved to be an efficient vector quantization scheme that achieves satisfactory performance [40]. Compared to the conventional uniform quantization, VQ-VAE achieves better quantization performance since the codebook is jointly trainied with the system.

III-D2 Semantic Communications with Efficient Transmission

We assume that the transmitter and receiver share the codebook, , consisting of the basis vectors, {𝐞j,j}, which are fixed after the training stage. Thus, for each encoded feature output by the encoder, the transmitter simply needs to send the index of the corresponding basis vector, which significantly reduces the transmission overhead. As shown in Fig. 2(a), in the transmission stage, the indices of the encoded features are firstly mapped into the binary bits. Then, these binary bits are mapped into symbols and transmitted via wireless channel, 𝐇. The receiver maps the received symbols into the indices and finds the corresponding basis vectors in the codebook, which are the input into the decoder for further processing. The channel and noise can be treated as a layer of the autoencoder and jointly trained with the parameters of the autoencoder.

IV Feature Importance Module with Training Method

In this section, we make the semantic communication system more robust and efficient by designing the FIM. Furthermore, the SNR is incorporated into the FIM, which ensures that the proposed system can successfully operate with different SNR levels. Moreover, we propose a novel loss function and training method to train the FIM.

Refer to caption
Figure 3: The architecture of the MAE with FIM.

IV-A Noise-Related Feature Suppression

The FIM dynamically learns and incorporates the feature importance into the training phase to train a DNN model that inherently suppresses those noise-related and task-unrelated features.

IV-A1 Noise-Related Features

Different features depict an image from different aspects and there exist strong connections between features and robustness to semantic noise, where such robustness varies with different features. Different from the existing works assuming features are of equal importance, we focus on the relationship among features and assign them different importance. Intuitively, different features contribute differently to the results of tasks and have different levels of robustness to semantic noise. We expect that the semantic communication systems can learn the importance of different features and have better understanding of semantic information behind the input image. Then, the transmitter can send the important task-related and noise-unrelated features, which significantly improves the robustness of system and reduce the transmission overhead.

IV-A2 Feature Activation

We observe two characteristics of semantic noise from the feature activation perspective: (i) the magnitudes of the activated features from the samples with semantic noise are higher than that of natural samples; (ii) some noise-related features are activated more uniformly and frequently by samples with semantic noise. We find that the adversarial training has addressed the first issue of high magnitudes of activated features. In other words, some task-unrelated and low contributing features that are not activated by clean samples without semantic noise, yet are activated by samples with semantic noise. This to some extent explains why adversarial training works but its performance is unsatisfactory. It motivates us to design an FIM that trains a model to assign different importance to the features and inherently suppress these task-unrelated and noise-related high magnitude features from being activated by semantic noise.

IV-B FIM with Dynamic SNR

IV-B1 Architecture of FIM

We denote the raw feature map, 𝐅l[𝐟1l,𝐟2l,,𝐟Nl]M×N, as the output of the l-th layer, where M and N represent the dimension and the number of features, respectively. As shown in Fig. 3, we first apply the global average pooling (GAP) operation on the raw feature map, 𝐅l, to obtain the feature activation map, 𝐟^l[μ,f^1l,f^2l,,f^Nl]N+1, where μ denotes the SNR. For the n-th feature, we have

f^nl=1Mm=1M𝐟nl(m). (16)

Note that the GAP extracts the global feature information by averaging elements of the feature vector [12]. Moreover, different SNR levels result in different importance of features. Thus, to ensure that the proposed semantic communication system can operate in a wide range of SNR levels, μ is designed as a part of the input of the FIM, 𝐟^l.

The feature activation map, 𝐟^l, is then passed into an auxiliary DNN with fully-connected (FC) layers and ReLU function. The output of the semantic communication system is 𝐳C, e.g., class label for classification problems and reconstructed images for reconstruction problems. Then, the trainable parameters of this auxiliary DNN can be written as 𝐖l[𝐰μl,𝐰1l,𝐰2l,,𝐰Nl]C×(N+1), which identifies the importance of each feature corresponding to output 𝐳. Parameters 𝐖l will be applied to reweight the raw feature map 𝐅l. We denote the weight component as 𝐰^l[w^1l,w^2l,,w^Nl]N, where we have

w^nl=1Cc=1C𝐰nl(c), (17)

for the n-th feature. It is employed as the weight of importance for the n-th feature in the l-th layer associated with output 𝐳. We apply a “softmax” layer to scale these weights into range [0,1]. Then, it reweights the features, {𝐟nl,n}, in the raw feature map, 𝐅l, as 𝐟~nl=w^nl𝐟nl. The adjusted feature map, 𝐅~l[𝐟~1l,𝐟~2l,,𝐟~Nl]M×N, will be passed into the next layer via forward propagation. In this way, the feature relationship is captured and different weights are generated for different features to increase or suppress their connection strength to the next layer.

IV-B2 FIM with Label Information

Moreover, to make full use of the label information, we slightly modify our FIM for the image classification task and incorporate the label information to improve the performance. Particularly, at the training phase, the ground-truth label is utilized as the index to determine the channel importance. While in the inference phase, we simply take the weight component that is associated to the predicted class as the feature importance since the ground-truth label is not available. Thus, the feature importance is then applied to reweight the original activation map as 𝐟~nl=𝐰nl(y)𝐟l, where y denotes the true label at the training phase and the predicted class at the inference phase, 𝐅~l[𝐟~1l,𝐟~2l,,𝐟~Nl]M×N. Then, the adjusted 𝐅~l is passed into the next layer via forward propagation. Therefore, in this case, the feature importance is learned by considering the label information.

IV-C Model Training

IV-C1 Loss Function of FIM

We insert the proposed FIM into certain layers of the original DNN, which can be considered as an auxiliary network. It can be trained together with the parameters of the encoder and decoder by using the adversarial training method. We design the loss functions to simultaneously train the original DNN and the FIM, where by taking one inserted FIM after the l-th layer as an example, the loss function is designed as

FIMl(𝐩l(𝐟^l),𝐳;𝜽,𝐖l)CE(𝐩l(𝐟^l),𝐳;𝜽,𝐖l), (18)

where 𝐩l(𝐟^l)ReLU(𝐖l𝐟^l) is the output of the FIM, 𝜽 denotes the trainable parameter of the original DNN, 𝐖l is the trainable parameter of the FIM, and CE(𝐩l(𝐟^l),𝐳;𝜽,𝐖l) is the cross entropy loss of 𝐩l(𝐟^l) and label 𝐳. The loss function of FIM is designed as (18) since the FIM will be trained better, i.e., the noise-related features are suppressed and the system achieves better performance, when 𝐩l(𝐟^l) approaches label 𝐳.

It can be easily extended to multiple FIMs. The overall loss function for adversarial training with the proposed FIM can be written as

f(𝐬,𝐳;𝜽,𝐖l)=CE(𝐬^,𝐳;𝜽)+γLl=1LFIMl(𝐩l(𝐟^l),𝐳;𝜽,𝐖l), (19)

where 𝐬 denotes the sample with semantic noise used for adversarial training, 𝐬^ is the output of the decoder, CE(𝐬^,𝐳;𝜽) denotes the cross entropy loss of 𝐬^ and 𝐳, L denotes the number of DNN layers, and γ is a tunable parameter for controlling the strength of FIM.

IV-C2 Training Method of the Whole Model

We develop a method to jointly train the aforementioned modules in the proposed robust semantic communication system, where the detailed training procedures are summarized in Algorithm 3.

Input : The training dataset that consists of the input images, 𝐬i, with their true labels, 𝐳i, the generated channel samples, 𝐇i, and the number of training epoch, N.
Output : The discrete codebook and the trained robust model with FIM, encoder, and decoder.
1 for n1 to N do
2 Mask a portion of the input images based on the proposed masking strategy.
3 Calculate c based on the loss function (13) for training the discrete codebook , together with the encoder and decoder.
4 Calculate f based on the loss function (19) for training the FIM.
5 Calculate s based on the loss function (15) for fine-tune the encoder and decoder.
6 Execute Algorithm 2 to do adversarial training via employing the sum of c, f, and s.
7
8 end for
Algorithm 3 Training procedures of the robust semantic communication system

V Simulation Results

In this section, we verify the effectiveness of the proposed semantic communication system by numerical results.

TABLE I: The architecture of the proposed model.
Layer Name Dimension Activation
Transmitter 8×Transformer Encoder 768 (12 heads) Linear
Dense 256 Sigmoid
Codebook 128 None
FIM 196 ReLU
Channel Channel Nr None
Receiver FIM 196 ReLU
Codebook 128 None
Dense 768 Sigmoid
4×Transformer Encoder 768 (12 heads) Linear
Dense 10 ReLU

V-A Simulation Setup

We consider the scenario where the transmitter and the receiver are equipped with Nt=4 transmit and Nr=2 receive antennas, respectively. We compare the proposed masked VQ-VAE with the conventional source coding and channel coding approaches under MIMO channels [1]. We adopt CIFAR-10, consisting of 60,000 images of 10 classes as the dataset for image classification, Cars196, with 16,185 images of 196 classes as the dataset for image retrieval, and ImageNet, consisting of 14,197,122 images as the dataset for image reconstruction. The average size of joint photographic experts group (JPEG) images in the dataset is 5,108 bytes and the number of patches of each image is 14×14. The masking ratio of masked VQ-VAE is 0.5 and the codebook size is 256, which requires 8 bits for transmitting each index. We adopt the 16-QAM for modulation with low-density parity-check code (LDPC) of rate 1/2. The number of iterations to generate semantic noise is set as K=5 and its power is ϵ=0.016. The architecture of the proposed model is presented in Table I. In particular, the “Codebook” layer represents that the basis vectors in the codebook are set as trainable parameters and the “Dimension” of “Codebook” layer denotes the number of basis vectors. Moreover, the “Dimension” of other layers represent the output dimension of this layer. For different downstream tasks, e.g., classification, we simply need to change the dimension of the last layer and fine-tune the pre-trained masked VQ-VAE. We compare the performance of the following methods:

  • Masked VQ-VAE+FIM+AT: The proposed masked VQ-VAE with FIM and adversarial training.

  • Masked VQ-VAE+AT: The proposed masked VQ-VAE with adversarial training.

  • Masked VQ-VAE: The proposed masked VQ-VAE.

  • JSCC+AT: The modified JSCC scheme in [10] for different tasks with the ViT architecture and adversarial training.

  • JSCC: The modified JSCC scheme in [10] for different tasks with the ViT architecture.

  • JPEG+LDPC+AT: The conventional scheme that adopts JPEG for the image source coding, LDPC for the channel coding, and the ViT as classifier with the adversarial training.

  • JPEG+LDPC: The conventional scheme with JPEG and LDPC.

Note that we have proposed two kinds of semantic noise models: (i) sample-dependent semantic noise added at the transmitter; (ii) sample-independent semantic noise added at the receiver, where (i) has a more serious impact on semantic communication systems. Hence, unless otherwise stated, we adopt semantic noise model (i). We consider the following tasks and generate the corresponding semantic noise.

  • For image classification, the semantic noise is generated to misclassify the minimally perturbed data that looks visually similar to clean samples.

  • As for image retrieval, the subtle semantic noise leads to incorrect retrieved results.

  • For image reconstruction, the semantic noise results in the failure of reconstruction, e.g., some key objects and information in the reconstructed images are missed or blurred.

V-B Analysis of Transmission Overhead

TABLE II: The number of transmitted symbols for one image in different tasks.
Schemes JPEG+LDPC Masked VQ-VAE (Patch =16) Masked VQ-VAE (Patch =8)
Image classification 53,760 196 784
Image retrieval 249,900 490 1,960
Image reconstruction 143,350 490 1,960

Table II presents the transmission overhead of the conventional JPEG+LDPC and our proposed masked VQ-VAE. Note that “Patch 16” denotes the proposed masked VQ-VAE scheme with image compression ratio 16 in the image pre-processing stage, i.e., a 16×16×3 patch is compressed into a scalar. Hence, a larger value represents a higher compression ratio and lower transmission overhead. The number of transmitted symbols of the JPEG+LDPC can be computed as: LHCnPbRcCrBs, where L and H denote the length and the width of the image, respectively, Cn is the number of channels, Pb denotes the required number of bits for each pixel, Rc denotes the code rate, Cr is the compression ratio, and Bs denotes the required bits for a symbol that depends on the modulation mode. We take the image classification as an example: 224×224×3×8×211.2×4=53,760 symbols/image. Moreover, the number of transmitted symbols for the proposed masked VQ-VAE can be computed as: LHJcMrPa2Bs, where Jc denotes the required number of bits for transmitting an index of basis vector in the codebook, Mr is the masking ratio, Pa is the patch size. We take the masked VQ-VAE (Patch =16) for image classification as an example: 224×224×8×0.5162×4=196 symbols/image. Therefore, the proposed masked VQ-VAE simply requires 196/53760=0.36% transmitted symbols of the conventional JPEG+LDPC.

V-C Accuracy of Image Classification

Refer to caption
(a) The ablation experiments.
Refer to caption
(b) The classification accuracy without semantic noise.
Figure 4: The classification accuracy of the proposed schemes versus SNR.

Fig. 4 shows the classification accuracy versus SNR. We train the model with the SNR range from 3 dB to 12 dB and test it in the SNR range from 9 dB to 18 dB. From Fig. 4(a), the classification accuracy increases with SNR for all schemes. The proposed Masked VQ-VAE+FIM+AT significantly outperforms the JSCC+AT, JSCC, JPEG+LDPC+AT, and JPEG+LDPC, and achieves the best performance. Moreover, the proposed Masked VQ-VAE+FIM+AT outperforms the Masked VQ-VAE+AT and Masked VQ-VAE. It demonstrates the efficiency of each module in our design, including the masked VQ-VAE, FIM, and adversarial training. In addition, the proposed scheme and JSCC significantly outperform the conventional JPEG+LDPC in low SNR scenario because the BER is high in low SNR scenario while the proposed scheme is robust by transmitting the indices of extracted task-related features in the trained codebook.

Fig. 4(b) shows the classification accuracy of the schemes without semantic noise. Note that “clean” denotes the scheme without semantic noise. The upper bound is achieved by the JSCC without the effects of channel noise and semantic noise while the lower bound is achieved by the JSCC with semantic noise. From the figure, the proposed Masked VQ-VAE+FIM (clean) approaches the upper bound and significantly outperforms JSCC (clean) and JPEG+LDPC (clean). Moreover, the proposed Masked VQ-VAE+FIM+AT approaches the performance of the model without semantic noise, which demonstrates that the proposed model can effectively improve the system robustness by reducing the impacts of semantic noise.

Refer to caption
(a) Without semantic noise.
Refer to caption
(b) With semantic noise.
Figure 5: The classification accuracy of different codebook size J versus SNR.

Fig. 5(a) and Fig. 5(b) present the classification accuracy versus SNR for different codebook sizes without semantic noise and with semantic noise, respectively. From the figure, the classification accuracy increases with SNR. A larger codebook size has stronger robustness against semantic noise and stronger representational capability for the whole dataset, which results in a higher classification accuracy. Moreover, the codebook size J16 is enough for accurate classification.

Refer to caption
Figure 6: The classification accuracy of different schemes versus the power of semantic noise ϵ.

Fig. 6 illustrates the classification accuracy versus the power of semantic noise ϵ. The upper bound and lower bound are achieved by JSCC without semantic noise and with maximum power of semantic noise, respectively. From the figure, the classification accuracy achieved by all schemes decreases with ϵ. The proposed Masked VQ-VAE+FIM+AT significantly outperforms the benchmarks and achieves the best performance, especially when ϵ is large, which shows the superiority of the proposed model against semantic noise.

Refer to caption
Figure 7: The classification accuracy of the proposed scheme versus reserving ratio for different ϵ.

Fig. 7 presents the classification accuracy of the proposed Masked VQ-VAE+FIM+AT versus reserving ratio for different ϵ. Note that reserving ratio =1 masking ratio, which denotes the ratio of the reserved parts of an image to that of the whole image. From the figure, the classification accuracy achieved by the proposed Masked VQ-VAE+FIM+AT decreases with ϵ. In addition, when ϵ is small, e.g., ϵ=0.008, the classification accuracy of the proposed scheme increases with the reserving ratio since a higher reserving ratio keeps more semantic information of the image. When ϵ is large, e.g., ϵ=0.016, the classification accuracy of the proposed scheme firstly increases and then decreases with the increase of reserving ratio. It is because that a larger reserving ratio retains more semantic noise although it keeps more semantic information. Thus, reserving ratio 0.5 achieves a good trade-off between the semantic information and semantic noise, which has the highest classification accuracy.

Refer to caption
Figure 8: The classification accuracy versus signal-to-semantic noise ratio at the receiver.

Fig. 8 shows the classification accuracy of different schemes versus signal-to-semantic noise ratio, where the noise denotes the sample-independent semantic noise added at the receiver. From the figure, the classification accuracy achieved by all schemes increases with signal-to-semantic noise ratio. The proposed Masked VQ-VAE+FIM+AT significantly outperforms the benchmarks and achieves the best performance. It shows that the proposed model is more robust to the sample-independent semantic noise than the JSCC. Compared with the sample-dependent semantic noise, the sample-independent semantic noise has less impacts on classification accuracy and the effects of adversarial training to combat it is not obvious.

V-D White-Box Semantic Noise and Black-Box Semantic Noise

We have claimed that the attacker needs to know all the model information to generate the semantic noise. In fact, it is almost the worst case for the proposed semantic communication system, in which the generated semantic noise can significantly influence the system performance. We call it the white-box semantic noise. However, semantic noise could also be generated when the attackers only know part of the model information, e.g., model parameters and architectures. They can also attack the system by adopting the semantic noise generated from other similar models, which is named as the black-box semantic noise.

TABLE III: The classification accuracy of the white-box and black-box semantic noise.
Attacked model Proposed Transformer ResNet
Type of semantic noise P-P T-P R-P T-T R-T R-R T-R
Classification accuracy 68.7% 82.3% 85.3% 9.5% 75.3% 8.2% 67.2%

In Table III, we first generate the semantic noise for the proposed model, the Transformer-based model, and the ResNet-based model. Then, we employ these different kinds of semantic noise to evaluate the robustness of these models. Particularly, the “T-P” in the line of “type of semantic noise” refers to employing the semantic noise generated from the Transformer-based model to attack the proposed model, which is the black-box semantic noise. In comparison, the “P-P” means that the proposed model is attacked with semantic noise generated from the proposed model, which is the white-box semantic noise. We can see that the white-box semantic noise has more severe effects on the performance of the proposed model than the black-box noise. Moreover, the black-box semantic noise generated from the Transformer-based model is more effective than that of the ResNet-based model since the proposed model is designed based on Transformer. It demonstrates that the black-box semantic noise generated from a more similar model tends to be more effective. It is mainly because that the models with similar architecture generally have similar parameters and layer structures. Furthermore, the proposed model is more robust than the Transformer-based and ResNet-based models against both the white-box and the black-box semantic noise. In addition, the target of this paper is to design a robust semantic communication system against semantic noise. From Table III, the proposed model will be able to defense the black-box semantic noise.

V-E Feature Activation

Refer to caption
(a) Without FIM.
Refer to caption
(b) With FIM.
Figure 9: The activation frequency of features.

Fig. 9(a) and Fig. 9(b) present the activation frequency of features without FIM and with FIM, respectively. A feature is determined as activated if its activation value is larger than a threshold. From Fig. 9(a), the samples with semantic noise activate the features more uniformly and they tend to frequently activate the features that are rarely activated by clean samples, i.e., the features 125-200. These low frequency features are noise-related and correspond to the redundant activation that is less important for the task. Furthermore, the semantic noise inhibits the activation of the features that are frequently activated by clean samples, i.e., the features 1-75. Fig. 9(b) illustrates that the FIM increases the activation frequency of some important features, i.e., the features 75-100, and avoids those redundant and less important features from being activated by samples with semantic noise, i.e., the features 125-200. This result demonstrates the effectiveness of our proposed FIM, which can suppress the activation of less important and noise-related features.

V-F Semantic Similarity of Basis Vectors

Refer to caption
(a) Without semantic similarity term in loss function.
Refer to caption
(b) With semantic similarity term in loss function.
Figure 10: Semantic similarity of basis vectors in the codebook.

Fig. 10(a) and Fig. 10(b) show the semantic similarity of basis vectors in the codebook without and with the semantic similarity term in the loss function, respectively. A larger value with brighter color represents the higher semantic similarity between the two vectors. It can be observed that semantic similarity between two basis vectors in Fig. 10(a) is larger than that of Fig. 10(b). It demonstrates that the semantic similarity term of loss function proposed in Section III-C2, 𝐄T𝐄2, can make the basis vectors in the codebook nearly mutually orthogonal, which enhances the robustness and representational capability of the codebook.

V-G Performance of Image Retrieval and Reconstruction

Refer to caption
(a) The recall@1 performance of image retrieval.
Refer to caption
(b) The quality of image reconstruction.
Figure 11: The performance of image retrieval and reconstruction.

Fig. 11(a) shows the recall@1 performance of image retrieval task versus SNR, where recall@1 represents the ratio of successful image retrieval at the first query. The upper bound is achieved by JPEG+LDPC without semantic noise in high SNR scenario, nearly lossless transmission. From the figure, Patch 8 outperforms Patch 16 that has much higher compression ratio and requires lower transmission overhead. Moreover, Patch 8 significantly outperforms the conventional JPEG+LDPC, especially in low SNR scenario. It is because that the image cannot be correctly decoded by employing JPEG+LDPC in low SNR scenario and the proposed scheme is more robust against semantic noise.

Fig. 11(b) shows the quality of image reconstruction under the power of semantic noise ϵ=0.012. The left, middle, and right column represent the original image, the image reconstructed with patch size 16, and the image reconstructed with patch size 8, respectively. From the figure, a lower compression ratio leads to a better construction quality and patch size 8 achieves a satisfactory construction quality against semantic noise.

VI Conclusion

In this paper, we have analyzed semantic noise and proposed the methods to generate the sample-dependent and sample-independent semantic noise. Then, the framework of robust semantic communication system has been proposed to combat the semantic noise, where the adversarial training with weight perturbation has been developed. The masked VQ-VAE with noise-related masking strategy has been proposed as the architecture of the system and a discrete codebook for encoded feature representation has been designed. To improve the system robustness, the FIM that suppresses the noise-related and task-unrelated features has been proposed. Simulation results show that our proposed method can be applied in many downstream tasks and significantly improve the robustness of semantic communication systems against semantic noise with much reduced transmission overhead.

References

  • [1] D. Tse and P. Viswanath, Fundamentals Wireless Communication., Cambridge University Press, 2005.
  • [2] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep learning for IoT big data and streaming analytics: A survey,” IEEE Commun. Surveys Tuts., vol. 20, no. 4, pp. 2923–2960, Jun. 2018.
  • [3] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019.
  • [4] H. Ye, L. Liang, G. Y. Li, and B.-H. Juang, “Deep learning-based end-to-end wireless communication systems with conditional GANs as unknown channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3133–3143, May 2020.
  • [5] Q. Hu, Y. Cai, K. Kang, G. Yu, J. Hoydis, and Y. C. Eldar, “Two-timescale end-to-end learning for channel acquisition and hybrid precoding,” IEEE J. Select. Areas Commun., vol. 40, no. 1, pp. 163–181, Jan. 2022.
  • [6] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
  • [7] M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,” IEEE Commun. Mag., vol. 59, no. 6, pp. 96–102, Jun. 2021.
  • [8] M. Kalfa, M. Gok, A. Atalik, B. Tegin, T. M. Duman, and O. Arikan, “Towards goal-oriented semantic signal processing: Applications and future challenges,” Digit. Signal Process., pp. 103–134, Dec. 2021.
  • [9] G. Shi, Y. Xiao, Y. Li, and X. Xie, “From semantic communication to semantic-aware networking: Model, architecture, and open problems,” IEEE Commun. Mag., vol. 59, no. 8, pp. 44–50, Aug. 2021.
  • [10] E. Bourtsoulatze, D. Burth Kurka, and D. Gunduz, “Deep joint source-channel coding for wireless image transmission,” IEEE Trans. Cognit. Comm. Netw., vol. 5, no. 3, pp. 567–579, Sep. 2019.
  • [11] H. Xie, Z. Qin, G. Y. Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,” IEEE Trans. Signal Process., vol. 69, pp. 2663–2675, Apr. 2021.
  • [12] J. Xu, B. Ai, W. Chen, A. Yang, P. Sun, and M. Rodrigues, “Wireless image transmission using deep source channel coding with attention modules,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 4, pp. 2315–2328, Apr. 2022.
  • [13] K. Lu, R. Li, X. Chen, Z. Zhao, and H. Zhang, “Reinforcement learning-powered semantic communication via semantic similarity,” arXiv preprint arXiv:2108.12121, 2021.
  • [14] D. B. Kurka and D. Gunduz, “DeepJSCC-f: Deep joint source-channel coding of images with feedback,” IEEE J. Select. Areas Inf. Theory, vol. 1, no. 1, pp. 178–193, May 2020.
  • [15] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” IEEE J. Select. Areas Commun., vol. 39, no. 8, pp. 2434–2444, Aug. 2021.
  • [16] P. Jiang, C.-K. Wen, S. Jin, and G. Y. Li, “Deep source-channel coding for sentence semantic transmission with HARQ,” IEEE Trans. Commun., vol. 70, no. 8, pp. 5225–5240, Aug. 2022.
  • [17] C.-H. Lee, J.-W. Lin, P.-H. Chen, and Y.-C. Chang, “Deep learning-constructed joint transmission-recognition for Internet of Things,” IEEE Access, vol. 7, pp. 76 547–76 561, Jun. 2019.
  • [18] M. Jankowski, D. Gunduz, and K. Mikolajczyk, “Wireless image retrieval at the edge,” IEEE J. Select. Areas Commun., vol. 39, no. 1, pp. 89–100, Jan. 2021.
  • [19] J. Shao, Y. Mao, and J. Zhang, “Learning task-oriented communication for edge inference: An information bottleneck approach,” IEEE J. Select. Areas Commun., vol. 40, no. 1, pp. 197–211, Jan. 2022.
  • [20] H. Xie, Z. Qin, X. Tao, and K. B. Letaief, “Task-oriented multi-user semantic communications,” IEEE J. Select. Areas Commun., vol. 40, no. 9, pp. 2584–2597, Sep. 2022.
  • [21] G. Zhang, Q. Hu, Z. Qin, Y. Cai, G. Yu, X. Tao, and G. Y. Li, “A unified multi-task semantic communication system for multimodal data,” arXiv preprint arXiv:2209.07689, 2022.
  • [22] C. Boncelet, “Image noise models,” The essential guide to image processing, pp. 143–167, Academic Press, 2009.
  • [23] Z. Qin, X. Tao, J. Lu, and G. Y. Li, “Semantic communications: Principles and challenges,” arXiv preprint arXiv:2201.01389, 2021.
  • [24] W. Wang, R. Wang, L. Wang, Z. Wang, and A. Ye, “Towards a robust deep neural network against adversarial texts: A survey,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 3, pp. 3159–3179, Mar. 2023.
  • [25] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in Proc. Int’l. Conf. Learn. Represent. (ICLR), Apr. 2014, pp. 1–10.
  • [26] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proc. Int’l. Conf. Learn. Represent. (ICLR), May 2015, pp. 1–11.
  • [27] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proc. Int’l. Conf. Learn. Represent. (ICLR), May May 2018, pp. 1–10.
  • [28] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in IEEE Eur. Symp. Secur. Privacy (EuroS&P), Mar. 2016, pp. 372–387.
  • [29] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “DeepFool: a simple and accurate method to fool deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2016, pp. 2574–2582.
  • [30] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jul. 2017, pp. 1765–1773.
  • [31] B. Kim, Y. E. Sagduyu, K. Davaslioglu, T. Erpek, and S. Ulukus, “Channel-aware adversarial attacks against deep learning-based wireless signal classifiers,” IEEE Tran. Wireless Commun., vol. 21, no. 6, pp. 3868–3880, Jun. 2022.
  • [32] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu, “Defense against adversarial attacks using high-level representation guided denoiser,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2018, pp. 1778–1787.
  • [33] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in IEEE Symp. Secur. Privacy (SP), May 2016, pp. 582–597.
  • [34] S. Gu and L. Rigazio, “Towards deep neural network architectures robust to adversarial examples,” arXiv preprint arXiv:1412.5068, 2014.
  • [35] D. Wu, S.-T. Xia, and Y. Wang, “Adversarial weight perturbation helps robust generalization,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2020, pp. 1–20.
  • [36] Y. Bai, Y. Zeng, Y. Jiang, S.-T. Xia, X. Ma, and Y. Wang, “Improving adversarial robustness via channel-wise activation suppressing,” arXiv preprint arXiv:2103.08307, 2021.
  • [37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2017, pp. 5998–6008.
  • [38] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2022, pp. 15 979–15 988.
  • [39] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Proc. Int’l. Conf. Learn. Represent. (ICLR), May 2017, pp. 1–13.
  • [40] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2017, pp. 6309–6318.