Wireless Deep Video Semantic Transmission

Sixian Wang, Jincheng Dai, Zijian Liang, Kai Niu, Zhongwei Si, Chao Dong, Xiaoqi Qin, and Ping Zhang

This work was supported in part by the National Natural Science Foundation of China under Grant 92067202, Grant 62001049, Grant 62071058, Grant 61971062, in part by the Beijing Natural Science Foundation under Grant 4222012, and in part by the Major Key Project of PCL under Grant PCL2021A15. (Corresponding author: Jincheng Dai.)

S. Wang, J. Dai, Z. Liang, K. Niu, Z. Si, and C. Dong are with the Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China (corresponding author e-mail: daijincheng@bupt.edu.cn).

K. Niu is also with Peng Cheng Laboratory, Shenzhen, China.

X. Qin and P. Zhang are with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China.

Abstract

In this paper, we design a new class of high-efficiency deep joint source-channel coding methods to achieve end-to-end video transmission over wireless channels. The proposed methods exploit nonlinear transform and conditional coding architecture to adaptively extract semantic features across video frames, and transmit semantic feature domain representations over wireless channels via deep joint source-channel coding. We refer to our framework as deep video semantic transmission (DVST). In particular, benefiting from the strong temporal prior provided by the feature domain context, the learned nonlinear transform function becomes temporally adaptive, resulting in a richer and more accurate entropy model guiding the transmission of the current frame. Accordingly, a novel rate adaptive transmission mechanism is developed to customize deep joint source-channel coding for video sources. It learns to allocate the limited channel bandwidth within and among video frames to maximize the overall transmission performance. The whole DVST design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion cost under perceptual quality metrics or machine vision task performance metrics. Across standard video source test sequences and various communication scenarios, experiments show that our DVST can generally surpass traditional wireless video coded transmission schemes. The proposed DVST framework can well support future semantic communications due to its video content-aware and machine vision task integration abilities.

Index Terms:
Semantic communications, video transmission, nonlinear transform, joint source-channel coding, rate-distortion.

I Introduction

The task of video transmission in today’s wireless networks is largely separated into two steps: source coding and channel coding [1]. Source coding compresses the source video into sequences of bits, and channel coding maps these bit sequences to transmitted signals that withstand the impairments of imperfect wireless channels, such as noise, fading, and interference. This separation-based approach has been employed in a large variety of applications, as the binary representations of various source data can be seamlessly transmitted over arbitrary wireless channels by changing the underlying channel code. The paradigm has benefited greatly from the independent optimization of each component.

However, the limits of the separation-based design begin to emerge with growing demands for low-latency wireless video delivery applications such as virtual reality (VR). On the one hand, current wireless video transmission systems suffer from time-varying channel conditions, in which case the mismatch between communication rate and channel capacity leads to an obvious cliff effect, i.e., the performance breaks down when the channel capacity falls below the communication rate. On the other hand, the widely used entropy coding, which converts the source representation into sequences of bits, is quite sensitive to the variational estimate of the marginal distribution of the source latent representation. Small perturbations of this marginal can lead to catastrophic error propagation in entropy decoding [2]. In practice, such small perturbations are often caused by floating point round-off error [3]. Unfortunately, this round-off behavior depends heavily on the hardware or software platform, and in various data compression applications, the transmitter and receiver may well employ different platforms. This non-determinism between transmitter and receiver may ultimately lead to severe performance degradation.

To address these problems, it is time to bridge source coding and channel coding to boost the capabilities of the end-to-end communication system. By this means, the channel transmission process can be aware of the source semantic features [4, 5, 6, 7, 8, 9, 10]. The paradigm aiming at the integrated design of source and channel processing is named joint source-channel coding (JSCC) [11], a classical topic in information theory and coding theory. However, conventional JSCC schemes [11, 12, 13, 14] are based on explicit probabilistic models and handcrafted designs, whose optimization complexity is intractable for complex sources. In addition, they ignore the semantic aspects of source messages. As a modern version, recent deep learning methods for realizing JSCC have stimulated significant interest in both the artificial intelligence (AI) and wireless communication communities [15, 16, 17, 18, 19]. By using artificial neural networks (ANNs), source data can be directly encoded as continuous-valued symbols to be transmitted over wireless channels. Deep JSCC can overcome the catastrophic degradation problem by using analog transmission without entropy coding. Current deep JSCC methods have shown end-to-end image transmission performance surpassing classical separation-based JPEG/JPEG2000/BPG source compression combined with an ideal capacity-achieving channel code family, especially for sources of small dimensions, e.g., the small CIFAR10 image data set [20].

However, one can observe that, in general, as the source dimension increases, e.g., for large-scale images, the performance of deep JSCC degrades rapidly and even becomes inferior to that of classical separation-based coding schemes, as demonstrated in [9]. Moreover, existing deep JSCC schemes cannot provide coding gain comparable to that of classical separated coding schemes, i.e., the slope of the performance curve flattens quickly as the coding rate or the channel signal-to-noise ratio (SNR) increases. This poor coding gain stems from the naive design of codec networks. Current deep JSCC works simply employ one highly integrated ANN as the encoder function to achieve dimension reduction with respect to the raw source data. By adding the wireless channel as one non-trainable layer, the learned codec ANNs can also combat the impacts of imperfect wireless channels. Nevertheless, this light auto-encoder structured deep JSCC cannot provide sufficient model expression capability for large-scale source data, resulting in a prematurely saturated coding gain. Compared to the image source, the video source further introduces the time dimension. The above saturation of the coding gain is more likely to appear on video sources, which need higher dimensional representations. Thus, a naive application of deep JSCC to wireless video transmission cannot provide satisfactory performance.

Inspired by the emerging data compression methods adopted in the computer vision (CV) community, a high-dimensional source is first converted into a latent representation defined by variational latent-variable models. This procedure is referred to as nonlinear transform [21, 22, 23, 24, 25]. The richness of the latent representation preserves almost all the source semantic features, which can be used either for recovering the source data or for directly driving downstream intelligent tasks. With appropriate training, many such nonlinear transform models represent the source data quite compactly, which can itself be regarded as a form of compression. For practical data compression tasks, the latent representation needs to be further compressed into binary sequences through entropy coding. However, because the ANNs employed in nonlinear transform are typically based on floating point math, and the transmission is over time-varying wireless channels, a direct combination of nonlinear transform with traditional source coding (entropy coding such as arithmetic coding [2]) and channel coding (such as low-density parity-check (LDPC) coding [26]) is also vulnerable to catastrophic failures in entropy decoding.

In this paper, to attain high-efficiency and robust end-to-end video transmission, we combine the advantages of nonlinear transform and deep JSCC to formulate a new powerful framework, named deep video semantic transmission (DVST). It specifically targets video transmission over imperfect wireless channels and prevents the catastrophic failures caused by sensitive entropy coding. By integrating the emerging conditional coding paradigm [27] with nonlinear transform and deep JSCC, the proposed DVST framework works on the following principle: considering the strong temporal correlations among video frames, DVST encodes the current frame in an efficient manner to generate channel-input symbols through contextual nonlinear transform and contextual deep JSCC. The contextual semantic information is used as part of the input of both the nonlinear transform and the deep JSCC codec. Benefiting from the temporal prior provided by the semantic feature domain context, the learned nonlinear transform function becomes temporally adaptive, resulting in a richer and more accurate entropy model that indicates how to allocate channel bandwidth resources to transmit the current frame. Moreover, we use the context to carry rich information as a prior for the deep JSCC codec, which helps to reconstruct the semantic feature map for higher video quality or downstream task performance. The whole DVST design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion (RD) cost under perceptual quality metrics or machine vision task performance metrics.

Specifically, the contributions of this paper can be summarized as follows.

  1. DVST Framework: We propose a new end-to-end learnable framework for wireless video transmission, i.e., DVST, which integrates the advantages of nonlinear transform and deep JSCC. To the best of the authors’ knowledge, this is the first work that establishes a temporally adaptive entropy model to customize deep JSCC for video. The proposed DVST framework exploits both nonlinear transform and conditional coding architecture for video semantic feature extraction, which contributes to more efficient and more robust wireless video transmission than traditional video coded transmission schemes.

  2. Context-Driven Semantic Feature Modeling: We exploit a simple yet efficient method that uses temporal context to enhance the entropy model in nonlinear transform as well as the deep JSCC codec. We design ANN architectures to realize each module of DVST, in which the definition, usage, and learning manner of contextual semantic features as conditions are all clearly given.

  3. Rate-Adaptive Semantic Feature Transmission: In light of the temporally adaptive entropy model on semantic features, we develop a method to improve the coding gain of video deep JSCC. In particular, we introduce a variable-length transmission mechanism for each embedding vector in the latent representation. The resulting DVST model learns to allocate the limited channel bandwidth within and among video frames to maximize the overall performance.

  4. Performance Validation: We verify the performance of our DVST system across standard video source sequences. We show that for wireless video transmission, our DVST can achieve much better coding gain and RD performance on various established metrics such as PSNR and MS-SSIM. Equivalently, for identical end-to-end wireless transmission performance, the proposed DVST method can save up to 50% of the channel bandwidth cost compared to classical H.264/H.265 combined with LDPC and digital modulation schemes. For task-oriented machine-type semantic communications, experimental results verify the effectiveness of DVST, which can better support machine vision tasks while maintaining higher perceptual fidelity for human vision.

The remainder of this paper is organized as follows. In Section II, we first review the system model of wireless video transmission and propose the DVST framework. Then, in Section III, we propose ANN architectures for realizing DVST, as well as key methods to guide the optimization of the DVST model. Section IV provides a direct comparison of a number of methods to quantify the performance gain of the proposed method. Finally, Section V concludes this paper.

Figure 1: The framework of DVST. The visualization of the latents with the highest entropy.

Notational Conventions: Throughout this paper, lowercase letters (e.g., $x$) denote scalars, and bold lowercase letters (e.g., $\mathbf{x}$) denote vectors. In some cases, $x_i$ denotes the elements of $\mathbf{x}$, which may also represent a subvector of $\mathbf{x}$ as described in the context. Bold uppercase letters (e.g., $\mathbf{X}$) denote matrices, and $\mathbf{I}_m$ denotes an $m$-dimensional identity matrix. $\ln(\cdot)$ denotes the natural logarithm, and $\log(\cdot)$ denotes the logarithm to base 2. $p_x$ denotes a probability density function (pdf) with respect to the continuous-valued random variable $x$, and $P_{\bar{x}}$ denotes a probability mass function (pmf) with respect to the discrete-valued random variable $\bar{x}$. In addition, $\mathbb{E}[\cdot]$ denotes the statistical expectation operation, and $\mathbb{R}$ denotes the real number set. Finally, $\mathcal{N}(x|\mu,\sigma^2) \triangleq (2\pi\sigma^2)^{-1/2}\exp(-(x-\mu)^2/(2\sigma^2))$ denotes a Gaussian function, $\mathcal{L}(x|\mu,\sigma) \triangleq (2\sigma)^{-1}\exp(-|x-\mu|/\sigma)$ denotes a Laplace function, and $\mathcal{U}(a-u,a+u)$ stands for a uniform distribution centered on $a$ with the range from $a-u$ to $a+u$.

II The Proposed Method

In this section, we first present the system model of wireless video transmission. Then, we describe the whole framework of DVST. After that, we introduce the contextual entropy model for rate-adaptive transmission of the latent representations, followed by the learning methods of the context. Finally, we derive the optimization goal of the DVST system.

II-A System Model

Consider a wireless video transmission problem. We are given a video sequence $\mathcal{X}=\{\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_T\}$, where the frame at time step $t$ is modeled as a vector of pixel intensities $\mathbf{x}_t\in\mathbb{R}^m$. The transmitter encodes the video frame sequence $\mathcal{X}$ as a sequence of variable-length continuous-valued channel input symbols $\mathcal{S}=\{\mathbf{s}_1,\mathbf{s}_2,\dots,\mathbf{s}_T\}$, where $\mathbf{s}_t\in\mathbb{R}^{k_t}$ denotes the $k_t$-dimensional channel input vector at time step $t$. We usually have $k_t<m$, and $R=\frac{1}{T}\sum_{t=1}^{T}\frac{k_t}{m}$ is defined as the channel bandwidth ratio (CBR) [28] denoting the average coding rate of $\mathcal{X}$. Then, the sequence $\{\mathbf{s}_t\}$ is successively sent over the wireless channel. This channel introduces random corruptions denoted as a transfer function $W(\cdot;\boldsymbol{\nu})$, where $\boldsymbol{\nu}$ denotes the channel parameters. The received sequence is $\hat{\mathbf{s}}_t=W(\mathbf{s}_t;\boldsymbol{\nu})$ with the transition probability $p_{\hat{\mathbf{s}}_t|\mathbf{s}_t}(\hat{\mathbf{s}}_t|\mathbf{s}_t)$. In this paper, we mainly consider the widely used additive white Gaussian noise (AWGN) channel such that the transfer function is $\hat{\mathbf{s}}_t=W(\mathbf{s}_t;\sigma^2)=\mathbf{s}_t+\mathbf{n}_t$, where each component of the noise vector $\mathbf{n}_t$ is independently sampled from a time-invariant multidimensional Gaussian distribution, i.e., $\mathbf{n}_t\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_{k_t})$, where $\sigma^2$ is the average noise power. Other channel models can be incorporated similarly by changing the channel transition function. The receiver comprises a series of inverse operations that aim to recover $\hat{\mathbf{x}}_t$ from the corrupted signal $\hat{\mathbf{s}}_t$ or to execute the downstream intelligent task.
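As a minimal sketch of the two quantities defined above, the following Python snippet computes the CBR from per-frame symbol counts and applies a real-valued AWGN channel. The function names, the frame size, and the symbol counts are illustrative assumptions and do not come from the paper.

```python
import math
import random


def channel_bandwidth_ratio(k_list, m):
    """Average CBR: R = (1/T) * sum_t (k_t / m), with T = len(k_list)."""
    return sum(k_list) / (len(k_list) * m)


def awgn(s, sigma2, rng):
    """Return s + n, where each noise component is drawn from N(0, sigma2)."""
    sigma = math.sqrt(sigma2)
    return [x + rng.gauss(0.0, sigma) for x in s]


def snr_db(signal_power, sigma2):
    """SNR in dB for a given average signal power and noise power."""
    return 10.0 * math.log10(signal_power / sigma2)


rng = random.Random(0)
m = 3 * 64 * 64                 # hypothetical frame: 3 x 64 x 64 pixel values
k = [512, 384, 640, 512]        # hypothetical channel symbols per frame
R = channel_bandwidth_ratio(k, m)

s = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # unit-power test signal
s_hat = awgn(s, sigma2=0.1, rng=rng)
noise_power = sum((a - b) ** 2 for a, b in zip(s_hat, s)) / len(s)
```

Swapping `awgn` for another function with the same signature is one way to model a different channel transition function, as the text notes.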

We consider video transmission over the noisy wireless channel in a low-latency manner, i.e., the video sequence is transmitted to and reconstructed at the receiver frame-by-frame. We encapsulate $N$ consecutive frames as one group-of-pictures (GOP). A typical video coding algorithm first divides $\mathcal{X}$ into a stack of GOPs. Each GOP begins with an intra-coded picture (I-frame or keyframe) as a reference, followed by $N-1$ predictive coded frames (P-frames), which contain the motion compensated difference information for bitrate saving. In this paper, we exploit the classical GOP structure for end-to-end transmission. Since the transmission of an I-frame is equivalent to that of an image, which has been well studied in [17, 28, 18, 19], we concentrate on the transmission of P-frames.
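The GOP structure described above can be sketched as follows. The helper names are hypothetical, and the handling of a shorter final GOP is an assumption, since the text does not specify it.

```python
def split_into_gops(frames, n):
    """Split a frame list into consecutive GOPs of at most n frames each."""
    if n < 1:
        raise ValueError("GOP size must be at least 1")
    return [frames[i:i + n] for i in range(0, len(frames), n)]


def frame_types(gop):
    """The first frame of a GOP is the I-frame; the others are P-frames."""
    return ["I"] + ["P"] * (len(gop) - 1)


# Ten frame indices grouped with N = 4; the last GOP is shorter (assumption).
gops = split_into_gops(list(range(10)), n=4)
types = [frame_types(g) for g in gops]
```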

II-B The Framework of DVST

We propose DVST as a new learnable model for end-to-end wireless video transmission, which integrates the advantages of nonlinear transform and deep JSCC. Our DVST framework is illustrated in Fig. 1. To encode the current frame $\mathbf{x}_t$ efficiently, the transmitter adopts a contextual analysis transform and a contextual deep JSCC encoder as two critical modules. The analysis transform converts the source frame in the pixel domain to the latent representation in the semantic feature domain. Guided by the variational entropy modeling on the latent representation, a rate control module is added to achieve variable-length encoding in deep JSCC. Since video sources exhibit temporal correlation, the above two modules also employ the semantic feature domain context and the deep JSCC codeword domain context as the temporal prior. This makes the nonlinear transform and deep JSCC modules temporally adaptive, resulting in a more efficient video transmission framework.

Figure 2: Illustration of a DVST codec architecture and the entropy model of the primary link.

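As illustrated in Fig. 1 and Fig. 2, the $g_a$ and $g_s$ functions in the nonlinear transform of the current frame are conditioned on the contextual semantic features $\check{\mathbf{c}}_t$ and $\hat{\mathbf{c}}_t$, respectively. The $f_e$ and $f_d$ functions in deep JSCC are conditioned on the contextual codewords $\check{\mathbf{d}}_t$ and $\hat{\mathbf{d}}_t$, respectively. The primary link of the DVST system is formulated as

$$\mathbf{x}_t \xrightarrow{g_a(\cdot|\check{\mathbf{c}}_t)} \mathbf{y}_t \xrightarrow{f_e(\cdot|\check{\mathbf{d}}_t)} \mathbf{s}_t \xrightarrow{W(\cdot|\boldsymbol{\nu})} \hat{\mathbf{s}}_t \xrightarrow{f_d(\cdot|\hat{\mathbf{d}}_t)} \hat{\mathbf{y}}_t \xrightarrow{g_s(\cdot|\hat{\mathbf{c}}_t)} \hat{\mathbf{x}}_t. \tag{1}$$

In our DVST design, we use an ANN to realize each function in (1) except for the channel transfer function $W$. In terms of the context information, the transmitter (Tx) contexts $\check{\mathbf{c}}_t$ and $\check{\mathbf{d}}_t$ are obtained from the reference frame $\check{\mathbf{x}}_{t-1}$ and the reference feature map $\check{\mathbf{y}}_{t-1}$, respectively. These two references are generated at the transmitter by simulating the DVST process without passing over the wireless channel, i.e.,

$$\mathbf{s}_{t-1} \xrightarrow{f_d(\cdot|\check{\mathbf{d}}_{t-1})} \check{\mathbf{y}}_{t-1} \xrightarrow{g_s(\cdot|\check{\mathbf{c}}_{t-1})} \check{\mathbf{x}}_{t-1}, \tag{2}$$

where the codeword $\mathbf{s}_{t-1}$ is obtained from (1) by substituting the time step $t-1$. The receiver (Rx) contexts $\hat{\mathbf{c}}_t$ and $\hat{\mathbf{d}}_t$ are obtained from the reference synthesized frame $\hat{\mathbf{x}}_{t-1}$ and the reference decoded feature map $\hat{\mathbf{y}}_{t-1}$, respectively. These two references are directly obtained by taking out records from the receiver buffer at the time step $t-1$. Details about how to use context as conditions to formulate ANN-based functions will be introduced in the next section.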

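Specifically, in the transmitter, for the current frame $\mathbf{x}_t$ at time step $t$, $g_a(\cdot)$ extracts the source semantic features as a lower-dimensional latent representation $\mathbf{y}_t$, and $f_e(\cdot)$ operates on this latent space. Considering the inter-frame correlation in video sources, the analysis transform $g_a(\cdot)$ is formulated as

$$\mathbf{y}_t=g_a(\mathbf{x}_t|\check{\mathbf{c}}_t) \text{ with } \check{\mathbf{c}}_t=\varphi_a(\check{\mathbf{x}}_{t-1}). \tag{3}$$

Here, $\varphi_a(\cdot)$ denotes the function to generate the context $\check{\mathbf{c}}_t$ for the analysis transform; $g_a(\cdot)$ is therefore referred to as the contextual analysis transform. After that, the latent representation $\mathbf{y}_t$ is fed into the contextual deep JSCC encoder $f_e(\cdot)$ to generate the channel-input sequence $\mathbf{s}_t$ as

$$\mathbf{s}_t=f_e(\mathbf{y}_t|\check{\mathbf{d}}_t) \text{ with } \check{\mathbf{d}}_t=\gamma_e(\check{\mathbf{y}}_{t-1}). \tag{4}$$

Here, $\gamma_e(\cdot)$ denotes the function that generates the context for the deep JSCC encoder. To provide rich and more correlated information for encoding $\mathbf{x}_t$, the context $\check{\mathbf{c}}_t$ is in the semantic feature domain with higher dimensions, and the context $\check{\mathbf{d}}_t$ is in the deep JSCC codeword space.

Then, the analog codeword sequence $\mathbf{s}_t$ is directly sent over the wireless communication channel. As aforementioned, we consider the AWGN channel such that the received sequence is $\hat{\mathbf{s}}_t=\mathbf{s}_t+\mathbf{n}_t$ with $\mathbf{n}_t\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_{k_t})$. The receiver comprises a contextual deep JSCC decoder $f_d(\cdot)$ to reconstruct $\hat{\mathbf{y}}_t$ from the corrupted signal $\hat{\mathbf{s}}_t$, i.e.,

$$\hat{\mathbf{y}}_t=f_d(\hat{\mathbf{s}}_t|\hat{\mathbf{d}}_t) \text{ with } \hat{\mathbf{d}}_t=\gamma_d(\hat{\mathbf{y}}_{t-1}). \tag{5}$$

Here, $\gamma_d(\cdot)$ denotes the function that generates the context for the deep JSCC decoder. The contextual synthesis transform function $g_s(\cdot)$ is then performed on $\hat{\mathbf{y}}_t$ to recover the current frame, i.e.,

$$\hat{\mathbf{x}}_t=g_s(\hat{\mathbf{y}}_t|\hat{\mathbf{c}}_t) \text{ with } \hat{\mathbf{c}}_t=\varphi_s(\hat{\mathbf{x}}_{t-1}). \tag{6}$$

Here, $\varphi_s(\cdot)$ denotes the function to generate the context $\hat{\mathbf{c}}_t$ for the synthesis transform.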

For the contextual analysis transform $g_a$, we use a network to automatically learn the correlation between $\mathbf{x}_t$ and $\check{\mathbf{c}}_t$, rather than removing the redundancy through a handcrafted subtraction operation as in traditional residual video coding [29]. Herein, the context $\check{\mathbf{c}}_t$ comes from the reference frame $\check{\mathbf{x}}_{t-1}$. In this way, the contextual analysis transform becomes adaptive: it generates the latent representation by selectively extracting semantic features from $\mathbf{x}_t$ and $\check{\mathbf{x}}_{t-1}$ [27]. Due to the motion in video, for old contents in $\mathbf{x}_t$ that can find a good reference in $\check{\mathbf{x}}_{t-1}$, $g_a$ can still generate the patch embeddings from the residue. For new contents in $\mathbf{x}_t$ that cannot find a good reference in $\check{\mathbf{x}}_{t-1}$, $g_a$ tends to generate the patch embeddings from $\mathbf{x}_t$ itself. The contextual nonlinear transform inherently learns to adaptively utilize the condition for semantic extraction. In addition, the context $\check{\mathbf{c}}_t$ is not only used for generating the latent representation, but also utilized to construct the entropy model, which will be introduced in the subsequent subsection.

For the contextual deep JSCC encoder $f_e$, we use a network to automatically learn the correlation between $\mathbf{y}_t$ and $\check{\mathbf{d}}_t$. Note that the context $\check{\mathbf{d}}_t$ comes from the reconstructed reference feature map $\check{\mathbf{y}}_{t-1}$; thus, the contextual deep JSCC encoder also becomes adaptive when generating the channel-input codewords. If patch embeddings in $\mathbf{y}_t$ can find a good reference in $\check{\mathbf{y}}_{t-1}$, $f_e$ tends to transmit these embeddings with smaller channel bandwidth. In contrast, for patch embeddings in $\mathbf{y}_t$ that cannot find a good reference in $\check{\mathbf{y}}_{t-1}$, $f_e$ tends to allocate more channel bandwidth to transmit them. In this way, the contextual deep JSCC codec learns to adaptively utilize the condition for high-efficiency transmission. Moreover, the context $\check{\mathbf{d}}_t$ is not only used for generating the channel-input codeword, but also utilized to learn a rate-allocation function that controls the scaling rule from entropy value to channel bandwidth cost. Details will be introduced in the subsequent subsection.
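The data flow of this subsection, in particular the difference between the Tx references (simulated without the channel) and the Rx references (taken from the receiver buffer), can be illustrated with a toy loop. Every function below is a deliberately trivial stand-in chosen for this sketch (simple additions and subtractions), not one of the learned networks of DVST; the zero-initialized references for the first frame are likewise an assumption.

```python
import math
import random


# Toy stand-ins (assumptions for illustration, NOT the learned DVST networks).
def phi(x_ref):                 # context from a reference frame
    return list(x_ref)


def gamma(y_ref):               # context from a reference feature map
    return list(y_ref)


def g_a(x, c):                  # contextual analysis transform
    return [xi - ci for xi, ci in zip(x, c)]


def g_s(y, c):                  # contextual synthesis transform
    return [yi + ci for yi, ci in zip(y, c)]


def f_e(y, d):                  # contextual deep JSCC encoder
    return [yi - 0.5 * di for yi, di in zip(y, d)]


def f_d(s, d):                  # contextual deep JSCC decoder
    return [si + 0.5 * di for si, di in zip(s, d)]


def channel(s, sigma2, rng):    # AWGN channel
    sigma = math.sqrt(sigma2)
    return [si + rng.gauss(0.0, sigma) for si in s]


def transmit(frames, sigma2, seed=0):
    """Run the primary link frame by frame; return Rx outputs and Tx references."""
    rng = random.Random(seed)
    n = len(frames[0])
    x_tx, y_tx = [0.0] * n, [0.0] * n   # Tx-simulated references
    x_rx, y_rx = [0.0] * n, [0.0] * n   # Rx buffer
    rx_frames, tx_refs = [], []
    for x in frames:
        # Transmitter: contexts from Tx references, then encode.
        c_tx, d_tx = phi(x_tx), gamma(y_tx)
        s = f_e(g_a(x, c_tx), d_tx)
        # Tx-side simulation of the decoder WITHOUT the channel.
        y_tx_new = f_d(s, d_tx)
        x_tx_new = g_s(y_tx_new, c_tx)
        # Receiver: contexts from its own buffer, decode the noisy signal.
        c_rx, d_rx = phi(x_rx), gamma(y_rx)
        y_hat = f_d(channel(s, sigma2, rng), d_rx)
        x_hat = g_s(y_hat, c_rx)
        x_tx, y_tx = x_tx_new, y_tx_new
        x_rx, y_rx = x_hat, y_hat
        rx_frames.append(x_hat)
        tx_refs.append(x_tx_new)
    return rx_frames, tx_refs


frames = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
clean_rx, clean_tx = transmit(frames, sigma2=0.0)
noisy_rx, noisy_tx = transmit(frames, sigma2=0.01)
```

With a noiseless channel the two sets of references coincide; with noise they differ, which is why the transmitter can only approximate the receiver state.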

II-C Entropy Model for Rate-Adaptive Transmission

Figure 3: The framework of context learning. Dashed lines denote MV data flows simulated at the transmitter, which are used to generate the reference MV without passing over the wireless channel. Solid lines denote data flows at both transmitter and receiver, which will pass over the realistic wireless channel.

In order to improve the coding gain of DVST, a variable-length transmission mechanism should be developed for each embedding $y_{t,i}$ of the semantic feature map $\mathbf{y}_t$. To this end, we estimate the entropy distribution on $\mathbf{y}_t$, and the channel bandwidth cost $k_{t,i}$ for transmitting $y_{t,i}$ can be determined accordingly. Therefore, our target is to design an entropy model which can accurately estimate the probability distribution of the latent representation $\mathbf{y}_t$.

Our entropy model is illustrated in Fig. 2. Following the work of [27], the latent representation $\mathbf{y}_t$ is variationally modeled as the Laplace distribution, where each embedding $y_{t,i}$ has its own distribution parameters. In this paper, the hyperprior entropy model learns both the hierarchical prior [23] and the spatial prior [24]. In addition, our entropy model fuses the temporal prior provided by the context $\check{\mathbf{c}}_t$. Specifically, the entropy of each embedding $y_{t,i}$ is computed as

$$r_{t,i}=-\log P_{\bar{y}_{t,i}|\check{\mathbf{c}}_t,\bar{\mathbf{z}}_t,\bar{\mathbf{y}}_{t,<i}}(\bar{y}_{t,i}|\check{\mathbf{c}}_t,\bar{\mathbf{z}}_t,\bar{\mathbf{y}}_{t,<i}), \tag{7}$$

where $\bar{y}_{t,i}=\lfloor y_{t,i}\rceil$ denotes the quantized version of $y_{t,i}$ obtained with the uniform scalar quantization function (rounding to integers). $\bar{\mathbf{y}}_{t,<i}$ denotes a tensor consisting of the quantized embeddings $\bar{y}_{t,l}$ with $l<i$. The quantized hyperprior $\bar{\mathbf{z}}_t=\lfloor\mathbf{z}_t\rceil$ with $\mathbf{z}_t=h_{hpe}(\mathbf{y}_t)$ is obtained by stacking the hyperprior encoder network $h_{hpe}(\cdot)$ on $\mathbf{y}_t$.

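In order to use gradient descent methods to optimize the entropy model, Ballé et al. proposed a relaxation that addresses the zero gradient problem caused by quantization [21]. A proxy “uniformly-noised” representation $\tilde{\mathbf{y}}_t$ is adopted to replace the quantized representation $\bar{\mathbf{y}}_t=\lfloor\mathbf{y}_t\rceil$ during model training, i.e., $\tilde{\mathbf{y}}_t=\mathbf{y}_t+\mathbf{o}$ with $o_i\sim\mathcal{U}(-\frac{1}{2},\frac{1}{2})$. Each $\tilde{y}_{t,i}$ is variationally modeled as a Laplace distribution with learned parameters $\tilde{\mu}_{t,i}$ and $\tilde{\sigma}_{t,i}$ such that

$$\begin{aligned}
p_{\tilde{\mathbf{y}}_t|\check{\mathbf{c}}_t,\tilde{\mathbf{z}}_t}(\tilde{\mathbf{y}}_t|\check{\mathbf{c}}_t,\tilde{\mathbf{z}}_t) &= \prod_i p_{\tilde{y}_{t,i}|\check{\mathbf{c}}_t,\tilde{\mathbf{z}}_t,\tilde{\mathbf{y}}_{t,<i}}(\tilde{y}_{t,i}|\check{\mathbf{c}}_t,\tilde{\mathbf{z}}_t,\tilde{\mathbf{y}}_{t,<i}) \\
&= \prod_i \left(\mathcal{L}(\tilde{\mu}_{t,i},\tilde{\sigma}_{t,i}) * \mathcal{U}(-\tfrac{1}{2},\tfrac{1}{2})\right)(\tilde{y}_{t,i}) \\
&\text{with } (\tilde{\mu}_{t,i},\tilde{\sigma}_{t,i})=h_{pf}\left(h_{hpd}(\tilde{\mathbf{z}}_t),h_{ar}(\tilde{\mathbf{y}}_{t,<i}),h_{tpe}(\check{\mathbf{c}}_t)\right),
\end{aligned} \tag{8}$$

where “$*$” is the convolutional operation, $i\in\{1,2,\dots,I\}$ denotes the patch embedding index, and the proxy hyperprior $\tilde{\mathbf{z}}_t$ is obtained by applying the hyperprior encoder network $h_{hpe}(\cdot)$ to $\mathbf{y}_t$ and adding the uniformly sampled random offset $\mathbf{o}$, i.e., $\tilde{\mathbf{z}}_t=h_{hpe}(\mathbf{y}_t)+\mathbf{o}$. Since we do not have prior beliefs about the hyperprior $\tilde{\mathbf{z}}_t$, it can be modeled as a non-parametric fully factorized density [23], i.e.,

$$p_{\tilde{\mathbf{z}}_t}(\tilde{\mathbf{z}}_t)=\prod_j\left(p_{z_{t,j}|\boldsymbol{\psi}^{(j)}}(z_{t,j}|\boldsymbol{\psi}^{(j)}) * \mathcal{U}(-\tfrac{1}{2},\tfrac{1}{2})\right)(\tilde{z}_{t,j}), \tag{9}$$

where $\boldsymbol{\psi}^{(j)}$ encapsulates all the parameters of $p_{z_{t,j}|\boldsymbol{\psi}^{(j)}}$. During model testing, the entropy model $P_{\bar{\mathbf{z}}_t}$ is established by taking discrete values from the learned entropy model $p_{\tilde{\mathbf{z}}_t}$, substituting $\bar{\mathbf{z}}_t$ for $\tilde{\mathbf{z}}_t$. $h_{hpd}(\cdot)$ denotes the hyperprior decoder network, which provides the hierarchical prior. $h_{ar}(\cdot)$ represents the auto-regressive network, which provides the spatial prior. $h_{tpe}(\cdot)$ denotes the temporal prior encoder network, which provides additional side information. $h_{pf}(\cdot)$ denotes the prior fusion network operating on the above three types of prior information. During model testing, the entropy model $P_{\bar{y}_{t,i}|\check{\mathbf{c}}_t,\bar{\mathbf{z}}_t,\bar{\mathbf{y}}_{t,<i}}$ in (7) is established by taking discrete values from the learned entropy model $p_{\tilde{y}_{t,i}|\check{\mathbf{c}}_t,\tilde{\mathbf{z}}_t,\tilde{\mathbf{y}}_{t,<i}}$, substituting $\bar{\mathbf{y}}_t$ and $\bar{\mathbf{z}}_t$ for $\tilde{\mathbf{y}}_t$ and $\tilde{\mathbf{z}}_t$, respectively.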
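To make the Laplace-convolved-with-uniform model concrete, the sketch below evaluates it for a scalar embedding. Convolving a density with a unit-width uniform gives the probability mass of the unit-width bin around the evaluation point, which can be written as a difference of two Laplace CDF values. The function names and the numerical floor are assumptions of this sketch; in DVST the parameters come from the prior fusion network, whereas here they are passed in by hand.

```python
import math
import random


def laplace_cdf(x, mu, sigma):
    """CDF of the Laplace density (2*sigma)^-1 * exp(-|x - mu| / sigma)."""
    z = (x - mu) / sigma
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def bin_probability(y, mu, sigma):
    """(Laplace(mu, sigma) * U(-1/2, 1/2))(y): mass of the unit bin centered at y."""
    return laplace_cdf(y + 0.5, mu, sigma) - laplace_cdf(y - 0.5, mu, sigma)


def quantize(y):
    """Test time: round to the nearest integer (ties round up in this sketch)."""
    return math.floor(y + 0.5)


def noisy_proxy(y, rng):
    """Training time: additive uniform noise replaces rounding."""
    return y + rng.uniform(-0.5, 0.5)


def entropy_bits(y_bar, mu, sigma, floor=1e-9):
    """r = -log2 P(y_bar); the floor (an assumption) avoids log(0)."""
    return -math.log2(max(bin_probability(y_bar, mu, sigma), floor))


rng = random.Random(0)
y = 2.3
y_bar = quantize(y)                 # integer used at test time
y_tilde = noisy_proxy(y, rng)       # differentiable proxy used in training
r_near = entropy_bits(y_bar, mu=2.0, sigma=1.0)    # well-predicted embedding
r_far = entropy_bits(y_bar, mu=-3.0, sigma=1.0)    # poorly predicted embedding
```

An embedding that the priors predict well (mean close to the value) yields a small entropy, and a poorly predicted one yields a large entropy; this is the quantity that later drives the bandwidth allocation.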

As stated in NTC [25], the probabilistic model of $\mathbf{y}_{t,i}$ can be conditioned on some other vector $\tilde{\mathbf{z}}_t$ as in [23], or on its preceding dimensions $\mathbf{y}_{t,<i}$ as in [24]. The former corresponds to forward adaptation (FA) of the density model, and the latter to backward adaptation (BA). In this paper, due to the auto-regressive computation in (8), our entropy model is established under the BA mode. One can also use the FA mode, where $h_{pf}(\cdot)$ relies only on the hierarchical prior and the context, i.e., $(\tilde{\mu}_{t,i},\tilde{\sigma}_{t,i})=h_{pf}(h_{hpd}(\tilde{\mathbf{z}}_t),h_{tpe}(\check{\mathbf{c}}_t))$. However, the FA mode in DVST cannot harvest the gain of better codec parallelism, while it still incurs performance degradation. The reason is that the auto-regressive computations of BA in our DVST are only used in entropy modeling to estimate the probability; the subsequent deep JSCC codec runs in parallel for each embedding $y_{t,i}$. In comparison, a traditional codec relying on arithmetic coding also performs the auto-regressive computations of the BA mode during coding, which leads to higher latency. Therefore, we advocate the BA mode in our DVST framework.

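With the learned entropy model $P_{\bar{y}_{t,i}|\check{\mathbf{c}}_t,\bar{\mathbf{z}}_t,\bar{\mathbf{y}}_{t,<i}}$, the allocated channel bandwidth cost $k_{t,i}$ to transmit the embedding $y_{t,i}$ is formulated as

$$k_{t,i}=\eta_t r_{t,i}=-\eta_t\log P_{\bar{y}_{t,i}|\check{\mathbf{c}}_t,\bar{\mathbf{z}}_t,\bar{\mathbf{y}}_{t,<i}}(\bar{y}_{t,i}|\check{\mathbf{c}}_t,\bar{\mathbf{z}}_t,\bar{\mathbf{y}}_{t,<i}), \tag{10}$$

where the scaling factor $\eta_t$ denotes the proportion from the entropy of embedding $y_{t,i}$ to the number of channel symbols. In particular, the physical meaning of $\eta_t$ can be interpreted as $\eta_t=\eta'_t/C$, where $C$ is the channel capacity (bits per channel symbol), and $\eta'_t$ denotes an efficiency factor representing the capability of the deep JSCC codec. Accordingly, $\eta'_t=1$ stands for the ideal JSCC codec that has the same performance as entropy-achieving source coding combined with capacity-achieving channel coding.

Following the aforementioned entropy model, the channel bandwidth cost of the primary link for transmitting the semantic features $\mathbf{y}_t$ is derived as

$$\begin{aligned}
k_t^{pl} &= \sum_i k_{t,i} \\
&= \sum_i -\eta_t\log P_{\bar{y}_{t,i}|\check{\mathbf{c}}_t,\bar{\mathbf{z}}_t,\bar{\mathbf{y}}_{t,<i}}(\bar{y}_{t,i}|\check{\mathbf{c}}_t,\bar{\mathbf{z}}_t,\bar{\mathbf{y}}_{t,<i}).
\end{aligned} \tag{11}$$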
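The allocation rule can be checked with a small numerical sketch. It assumes a single scaling factor for the whole frame, computed as an efficiency factor divided by the channel capacity in bits per channel symbol, and it leaves the costs real-valued; any rounding of the costs to an integer number of symbols is outside this sketch. All names and numbers are illustrative.

```python
def scaling_factor(efficiency, capacity_bits_per_symbol):
    """Scaling factor = efficiency factor / channel capacity C."""
    if capacity_bits_per_symbol <= 0:
        raise ValueError("capacity must be positive")
    return efficiency / capacity_bits_per_symbol


def bandwidth_costs(entropies_bits, eta):
    """Per-embedding costs k_i = eta * r_i and their sum over the frame."""
    costs = [eta * r for r in entropies_bits]
    return costs, sum(costs)


# Ideal codec (efficiency 1) on a channel with C = 2 bits per symbol.
eta = scaling_factor(efficiency=1.0, capacity_bits_per_symbol=2.0)
# Hypothetical entropies (in bits) of three embeddings.
k, k_pl = bandwidth_costs([6.0, 1.0, 3.0], eta)
```

In this example the embedding with 6 bits of entropy receives 3 channel symbols and the one with 1 bit receives half a symbol, for a total of 5 symbols; high-entropy embeddings therefore consume proportionally more bandwidth.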

II-D Motion Transmission and Context Learning

As for the context learning functions $\varphi_a(\cdot)$, $\varphi_s(\cdot)$, $\gamma_e(\cdot)$, and $\gamma_d(\cdot)$, inspired by [27], we also adopt the idea of motion estimation and motion compensation (MEMC) to formulate specific forms of these functions. Different from conventional MEMC applied in the source pixel domain, we perform MEMC in the semantic feature domain to generate the contexts $\check{\mathbf{c}}_t$, $\hat{\mathbf{c}}_t$ and in the deep JSCC codeword domain to generate the contexts $\check{\mathbf{d}}_t$, $\hat{\mathbf{d}}_t$. This paradigm utilizes the rich information density in the feature/codeword domain to facilitate high-efficiency video transmission over wireless channels with limited bandwidth.

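As shown in Fig. 3, the motion vector (MV) $\mathbf{m}_t$ is generated by the flow estimation network [30] applied to the current frame $\mathbf{x}_t$ and the reconstructed reference frame $\check{\mathbf{x}}_{t-1}$. This MV is then transmitted over the wireless channel; this path is referred to as the motion link. The whole process mirrors the primary link but does not use context, i.e.,

$$\begin{aligned}
\mathbf{m}_t &\xrightarrow{g_a^{mv}(\cdot)} \mathbf{y}_t^{mv} \xrightarrow{f_e^{mv}(\cdot)} \mathbf{s}_t^{mv} \xrightarrow{W(\cdot|\boldsymbol{\nu})} \hat{\mathbf{s}}_t^{mv} \\
&\xrightarrow{f_d^{mv}(\cdot)} \hat{\mathbf{y}}_t^{mv} \xrightarrow{g_s^{mv}(\cdot)} \hat{\mathbf{m}}_t.
\end{aligned} \tag{12}$$

In our DVST design, we use an ANN to realize each function in (12) except for the channel transfer function $W$. The reference MV $\check{\mathbf{m}}_t$ used at the transmitter is generated by simulating the motion link without passing over the wireless channel, i.e.,

$$\mathbf{s}_t^{mv} \xrightarrow{f_d^{mv}(\cdot)} \check{\mathbf{y}}_t^{mv} \xrightarrow{g_s^{mv}(\cdot)} \check{\mathbf{m}}_t, \tag{13}$$

where the codeword $\mathbf{s}_t^{mv}$ is obtained from (12).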

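Analogous to the primary link, by using the learned entropy model $P_{\bar{y}_{t,j}^{mv}|\bar{\mathbf{z}}_t^{mv},\bar{\mathbf{y}}_{t,<j}^{mv}}$ on $\mathbf{y}_t^{mv}$, the allocated channel bandwidth cost $k_{t,j}^{mv}$ to transmit the embedding $y_{t,j}^{mv}$ is formulated as

$$k_{t,j}^{mv}=\eta_{t,j}^{mv}r_{t,j}^{mv}=-\eta_{t,j}^{mv}\log P_{\bar{y}_{t,j}^{mv}|\bar{\mathbf{z}}_t^{mv},\bar{\mathbf{y}}_{t,<j}^{mv}}(\bar{y}_{t,j}^{mv}|\bar{\mathbf{z}}_t^{mv},\bar{\mathbf{y}}_{t,<j}^{mv}). \tag{14}$$

The channel bandwidth cost of the motion link for transmitting the MV semantic features $\mathbf{y}_t^{mv}$ is derived as

$$k_t^{ml}=\sum_j -\eta_{t,j}^{mv}\log P_{\bar{y}_{t,j}^{mv}|\bar{\mathbf{z}}_t^{mv},\bar{\mathbf{y}}_{t,<j}^{mv}}(\bar{y}_{t,j}^{mv}|\bar{\mathbf{z}}_t^{mv},\bar{\mathbf{y}}_{t,<j}^{mv}). \tag{15}$$

The hyperprior $\mathbf{z}_t^{mv}$ is obtained as $\mathbf{z}_t^{mv}=h_{hpe}^{mv}(\mathbf{y}_t^{mv})$, and the entropy model $P_{\bar{\mathbf{z}}_t^{mv}}$ of the quantized hyperprior $\bar{\mathbf{z}}_t^{mv}$ can be obtained similarly to (9).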

Figure 4: Network architectures of the primary link. Conv k×C is a convolution with C channels and k×k filters, and the following ↑2/↓2 indicates upscaling or downscaling with a stride of 2. ChannelNorm refers to a normalization layer from [31]. GDN denotes the Generalised Divisive Normalization proposed in [32], and IGDN is the inverse GDN. FC denotes a fully-connected layer.

As illustrated in Fig. 3, at the transmitting end, the feature domain context generation function $\varphi_a$ is formulated as

$$\check{\mathbf{c}}_t=\varphi_a(\check{\mathbf{x}}_{t-1})=\varphi_{ref}\left(\mathrm{warp}\left(\varphi_{fe}(\check{\mathbf{x}}_{t-1}),\check{\mathbf{m}}_t\right)\right), \tag{16}$$

where the reference frame $\check{\mathbf{x}}_{t-1}$ is obtained as in (2). We use the feature extractor network $\varphi_{fe}(\cdot)$ to convert the reference frame $\check{\mathbf{x}}_{t-1}$ to its feature domain representation. The reference MV guides where to extract the feature domain context by using the warping function $\mathrm{warp}(\cdot)$ [33]. The refinement network $\varphi_{ref}(\cdot)$ is used to repair the spatial discontinuities caused by the warping operation.
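The composition "extract, warp, refine" can be sketched on a single-channel 2-D feature map. The warping convention below (backward warping with bilinear interpolation and border clamping, flow stored as per-position `(dx, dy)` offsets) is an assumption of this sketch and may differ from the warping function cited as [33]; the feature extractor and refinement network are replaced by identity placeholders.

```python
import math


def warp(feat, flow):
    """Backward-warp a 2-D map: out[r][c] samples feat at (r + dy, c + dx).

    flow[r][c] = (dx, dy); bilinear interpolation with border clamping.
    """
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            dx, dy = flow[r][c]
            x = min(max(c + dx, 0.0), w - 1.0)
            y = min(max(r + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(x)), int(math.floor(y))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = x - x0, y - y0
            top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
            bot = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
            out[r][c] = (1 - ay) * top + ay * bot
    return out


def make_context(reference, flow, extract=lambda x: x, refine=lambda x: x):
    """Context = refine(warp(extract(reference), flow)); identities are placeholders."""
    return refine(warp(extract(reference), flow))


feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(2)]
shift_flow = [[(1.0, 0.0)] * 3 for _ in range(2)]   # sample one pixel to the right
half_flow = [[(0.5, 0.0)] * 3 for _ in range(2)]    # sub-pixel motion

same = make_context(feat, zero_flow)
shifted = make_context(feat, shift_flow)
blended = make_context(feat, half_flow)
```

The same composition applies to the codeword-domain context, with the feature extractor replaced by a precoding step on the reference feature map.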

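As for the codeword context $\check{\mathbf{d}}_t$, its generation function $\gamma_e$ is formulated as

$$\check{\mathbf{d}}_t=\gamma_e(\check{\mathbf{y}}_{t-1})=\gamma_{ref}\left(\mathrm{warp}\left(\gamma_{pc}(\check{\mathbf{y}}_{t-1}),\check{\mathbf{m}}_t\right)\right). \tag{17}$$

The precoding network $\gamma_{pc}(\cdot)$ operates on the reference semantic feature map $\check{\mathbf{y}}_{t-1}$ and preprocesses the data before warping. $\gamma_{ref}(\cdot)$ denotes the refinement network.

At the receiving end, the context learning procedures are similar to (16) and (17), where the actually received MV $\hat{\mathbf{m}}_t$ is used. In addition, $\check{\mathbf{x}}_{t-1}$ and $\check{\mathbf{y}}_{t-1}$ are replaced by $\hat{\mathbf{x}}_{t-1}$ and $\hat{\mathbf{y}}_{t-1}$, respectively.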

II-E Optimization Goal

The optimization goal of our DVST system is to use the least channel bandwidth cost to get the best video reconstruction quality or downstream task performance. Given the reference frame, the loss function at the current time step t is formulated as a rate-distortion (RD) form, i.e.,

$$L_t = \lambda k_t + D_t = \lambda\left(k_t^{\mathrm{pl}} + k_t^{\mathrm{ml}}\right) + D_t, \tag{18}$$

where λ controls the trade-off between the total channel bandwidth cost $k_t$ and the distortion $D_t$.

Due to the conditional coding architecture, the performance of a previous frame affects its subsequent frames. Therefore, during the training phase of DVST, we take into account the correlations of frames within a GOP, so that the DVST model learns to allocate channel bandwidth resources both within one frame and among frames. The overall training loss function is formulated as

$$L = \frac{1}{N}\sum_{t=1}^{N}\left(\lambda k_t + D_t\right) = \frac{1}{N}\sum_{t=1}^{N}\left(\lambda\left(k_t^{\mathrm{pl}} + k_t^{\mathrm{ml}}\right) + D_t\right). \tag{19}$$

The training procedure details will be introduced in the subsequent subsection.

For machine-type semantic communications, our DVST can directly drive the downstream machine vision tasks while preserving the advantages of signal level reconstruction. Different from the functional transmission mode adopted in [19], this paper aims to transmit videos that are friendly to both human vision and machine analytics [34]. Therefore, we combine the low-level signal distortion with the loss of high-level tasks, and the distortion term is reformulated as $D_t = D_{t,\mathrm{rec}} + \beta D_{t,\mathrm{task}}$, where $D_{t,\mathrm{rec}}$ denotes the reconstruction loss and $D_{t,\mathrm{task}}$ denotes the loss of the downstream task.
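The GOP-level objective in (19), together with the task-aware distortion term, can be sketched as a plain function. This is only an illustrative restatement of the formulas; the name `gop_loss` and the example numbers are assumptions.

```python
def gop_loss(k_pl, k_ml, d_rec, lam, d_task=None, beta=0.0):
    """Average over a GOP of lam * (k_pl + k_ml) + D_rec + beta * D_task.

    k_pl, k_ml: per-frame bandwidth costs of the primary and motion links.
    d_rec:      per-frame reconstruction distortions.
    d_task:     optional per-frame downstream task losses.
    """
    n = len(d_rec)
    if d_task is None:
        d_task = [0.0] * n
    if n == 0 or not (len(k_pl) == len(k_ml) == len(d_task) == n):
        raise ValueError("all per-frame lists must be non-empty and equal in length")
    total = 0.0
    for kp, km, dr, dt in zip(k_pl, k_ml, d_rec, d_task):
        total += lam * (kp + km) + dr + beta * dt
    return total / n


loss = gop_loss([1.0, 2.0], [1.0, 0.0], [0.5, 0.5], lam=0.1)
```

Averaging over all frames of a GOP, rather than optimizing each frame alone, is what lets the gradient of a later frame's distortion influence how much bandwidth earlier frames receive.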

III Architectures and Implementations

In this section, we present details of the adopted network architectures to implement our DVST. Then, we introduce the progressive training strategy to enable a stable model learning.

III-A Network Architectures

We illustrate ANN implementation details of the primary link in Fig. 4, including the contextual nonlinear transform modules and the contextual deep JSCC modules. For brevity, we do not repeat the structure of the motion link since it has almost the same architecture as the primary link except that the MV is of two channels, and the contextual operations are removed.

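$$\ddot{L}_t^{\mathrm{mv}} = \lambda \ddot{R}_t^{\mathrm{mv}} + \ddot{D}_t^{\mathrm{mv}} = \lambda\Big(\sum_j -\log P_{\tilde{y}_{t,j}^{\mathrm{mv}} \mid \tilde{\mathbf{z}}_t^{\mathrm{mv}}, \tilde{\mathbf{y}}_{t,<j}^{\mathrm{mv}}}\big(\tilde{y}_{t,j}^{\mathrm{mv}} \mid \tilde{\mathbf{z}}_t^{\mathrm{mv}}, \tilde{\mathbf{y}}_{t,<j}^{\mathrm{mv}}\big) - \log P_{\tilde{\mathbf{z}}_t^{\mathrm{mv}}}\big(\tilde{\mathbf{z}}_t^{\mathrm{mv}}\big)\Big) + D\big(\mathrm{warp}(\mathbf{x}_{t-1}, \ddot{\mathbf{m}}_t), \mathbf{x}_t\big) \tag{20}$$
with $\tilde{\mathbf{y}}_t^{\mathrm{mv}} = \mathbf{y}_t^{\mathrm{mv}} + \mathbf{o}$, $\tilde{\mathbf{z}}_t^{\mathrm{mv}} = \mathbf{z}_t^{\mathrm{mv}} + \mathbf{o}$, $\mathbf{y}_t^{\mathrm{mv}} = g_a^{\mathrm{mv}}(\mathbf{m}_t)$, $\mathbf{z}_t^{\mathrm{mv}} = h_{\mathrm{hpe}}^{\mathrm{mv}}(\mathbf{y}_t^{\mathrm{mv}})$, $\ddot{\mathbf{m}}_t = g_s^{\mathrm{mv}}(\tilde{\mathbf{y}}_t^{\mathrm{mv}})$.

$$\ddot{L}_t = \lambda \ddot{R}_t + \ddot{D}_t = \lambda\Big(\sum_j -\log P_{\tilde{y}_{t,j} \mid \tilde{\mathbf{z}}_t, \tilde{\mathbf{y}}_{t,<j}}\big(\tilde{y}_{t,j} \mid \tilde{\mathbf{z}}_t, \tilde{\mathbf{y}}_{t,<j}\big) - \log P_{\tilde{\mathbf{z}}_t}\big(\tilde{\mathbf{z}}_t\big)\Big) + D\big(\ddot{\mathbf{x}}_t, \mathbf{x}_t\big) \tag{21}$$
with $\tilde{\mathbf{y}}_t = \mathbf{y}_t + \mathbf{o}$, $\tilde{\mathbf{z}}_t = \mathbf{z}_t + \mathbf{o}$, $\mathbf{y}_t = g_a(\mathbf{x}_t \mid \check{\mathbf{c}}_t)$, $\mathbf{z}_t = h_{\mathrm{hpe}}(\mathbf{y}_t)$, $\ddot{\mathbf{x}}_t = g_s(\tilde{\mathbf{y}}_t \mid \hat{\mathbf{c}}_t)$.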

III-A1 Contextual Nonlinear Transform

Figure 5: Detailed structure of (a) semantic domain feature context generation networks, (b) deep JSCC codeword domain context generation networks.

The proposed DVST codes the current frame $\mathbf{x}_t$ conditioned on the context extracted with the help of the MV, rather than relying on a handcrafted subtraction operation. Fig. 5(a) illustrates the context generation network $\varphi_a$ in the semantic feature domain, which consists of feature extraction, warping, and refinement operations. After that, the analysis transform $g_a$ concatenates the Tx context $\check{\mathbf{c}}_t$ with the current frame $\mathbf{x}_t$ to learn a compact latent representation $\mathbf{y}_t$ in the semantic feature domain. To estimate the spatially varying mean values and standard deviations of $\mathbf{y}_t$, we follow the hyperprior entropy model in [27], which fuses the hierarchical prior, spatial prior, and temporal prior. The contextual synthesis transform $g_s$ has an architecture symmetric to that of $g_a$, except that it uses the Rx context $\hat{\mathbf{c}}_t$ generated by $\varphi_s$, which has the same structure as $\varphi_a$.

III-A2 Contextual Deep JSCC

The contextual deep JSCC module exploits the codeword contexts extracted from $\check{\mathbf{y}}_{t-1}$ and $\hat{\mathbf{y}}_{t-1}$ to collaboratively transmit the current latent representation $\mathbf{y}_t$. Its rate varies in accordance with the estimated entropy. In particular, the structure of the codeword context generator $\gamma_e$ is shown in Fig. 5(b). Before the warping operation [33], we align the MV with the precoded feature by using an average pooling operation with stride 16. By using the learned entropy model, we pre-allocate the channel bandwidth cost $k_{t,i}$ of each spatial position as in (11). Accordingly, the encoder $f_e$ fuses $\mathbf{y}_t$ with the context $\check{\mathbf{d}}_t$ and partitions the fused feature map into a patch embedding sequence $(y_{t,1}, y_{t,2}, \dots, y_{t,I})$, where each embedding is an M-dimensional feature vector. After that, the practical channel bandwidth cost $\bar{k}_{t,i}$ for transmitting $y_{t,i}$ is determined as $\bar{k}_{t,i} = Q(k_{t,i})$. Herein, $Q$ denotes a scalar quantizer whose range includes $2^q$ ($q = 1, 2, \dots$) integers, and the quantization value set $\mathcal{V} = \{v_1, v_2, \dots, v_{2^q}\}$ is related to the scaling factor η and the Lagrange multiplier λ in the RD loss function. In this way, we inform the receiver which rate is allocated to each embedding $y_{t,i}$ by transmitting $q$ bits as extra side information.

Instead of naively training $2^q$ deep JSCC networks, we exploit a dynamic neural network structure to realize variable rate transmission. As shown in Fig. 4, $f_e$ consists of a powerful shared backbone to extract the contextual dependencies among the $y_{t,i}$, and light FC layers to encode each $y_{t,i}$ into the given dimension $\bar{k}_{t,i}$. In particular, we employ a group of $2^q$ FC layers with different output dimensions $\{v_1, v_2, \dots, v_{2^q}\}$, and each FC layer is invoked on demand. As a result, during the model forward pass, some FC layers may not be used while others may be used more than once. Additionally, to enhance the capacity of deep JSCC and extract the global and long-term correlations, we employ Swin Transformer blocks as the network backbone [35]. Inspired by the use of positional embeddings in the vision Transformer, we develop a group of rate tokens $\{r_{v_1}, r_{v_2}, \dots, r_{v_{2^q}}\}$ to indicate the CBR information. The rate tokens can be viewed as learnable parameters within the Transformer. As shown in Fig. 4, each embedding $y_{t,i}$ is added to its corresponding rate token $r_{\bar{k}_{t,i}}$ before being fed into the Transformer blocks. Hence, the output patch embeddings can achieve a better trade-off between fidelity and robustness, and the following FC layer can efficiently rescale the dimensions of the channel-input symbols $s_{t,i}$.
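The on-demand use of rate-specific FC layers and rate tokens can be sketched in plain Python as follows. This is a toy stand-in under stated assumptions: the class name `VariableRateHead`, the random initialization, and the identity `backbone` placeholder are all hypothetical, and a real implementation would use a deep learning framework with a Swin Transformer backbone.

```python
import random


class VariableRateHead:
    """One rate token and one linear (FC) layer per allowed output dimension."""

    def __init__(self, embed_dim, rate_values, seed=0):
        rng = random.Random(seed)
        self.rate_values = list(rate_values)
        # Rate tokens: learnable in practice, random stand-ins here.
        self.tokens = {
            v: [rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
            for v in self.rate_values
        }
        # One weight matrix of shape (v x embed_dim) per allowed rate v.
        self.weights = {
            v: [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)] for _ in range(v)]
            for v in self.rate_values
        }

    def add_rate_token(self, embedding, rate):
        return [a + b for a, b in zip(embedding, self.tokens[rate])]

    def project(self, feature, rate):
        return [sum(w * x for w, x in zip(row, feature)) for row in self.weights[rate]]


def encode(head, embeddings, rates, backbone=None):
    """Add rate tokens, run the shared backbone, then apply the FC layer per rate."""
    tagged = [head.add_rate_token(y, r) for y, r in zip(embeddings, rates)]
    features = backbone(tagged) if backbone is not None else tagged
    return [head.project(f, r) for f, r in zip(features, rates)]


head = VariableRateHead(embed_dim=4, rate_values=[0, 2, 4])
symbols = encode(head, [[1.0] * 4, [0.5] * 4, [0.2] * 4], rates=[0, 4, 4])
```

In this example the layer for rate 2 is never invoked while the layer for rate 4 is used twice, which mirrors the remark that some FC layers may be idle and others reused within one forward pass.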

III-B Progressive Training Strategy

As aforementioned, the goal of DVST is to minimize a compromise between channel bandwidth cost (including primary link and motion link) and end-to-end distortion (reconstruction error or downstream task accuracy). Starting from a pretrained optical flow estimation network, the training procedure consists of the following steps:

  1. Pretrain the nonlinear transform components of the motion link, including the motion estimation network, $g_a^{\mathrm{mv}}$, $g_s^{\mathrm{mv}}$, and the entropy model. In this step, we use the lossless previous frame as the reference. The pretraining loss function $\ddot{L}_t^{\mathrm{mv}}$ of this step is formulated as (20), where $\mathbf{o}$ denotes uniformly sampled random quantization offsets.

  2. Taking the wireless transmission error of the motion link into account, and building on the previous step, the deep JSCC codec $f_e^{\mathrm{mv}}$, $f_d^{\mathrm{mv}}$ and the rate adaptation module are jointly trained with the nonlinear transform components to execute an MV transmission task. We also add the pretraining distortion term $\ddot{D}_t^{\mathrm{mv}}$ to stabilize the training process. The training loss function of this step is formulated as

     $$L_t^{\mathrm{ml}} = \ddot{D}_t^{\mathrm{mv}} + D\big(\mathrm{warp}(\mathbf{x}_{t-1}, \hat{\mathbf{m}}_t), \mathbf{x}_t\big) + \lambda k_t^{\mathrm{ml}}, \tag{22}$$

     where the reconstructed MV $\hat{\mathbf{m}}_t$ is obtained as in (12) after transmission over the wireless channel, and the channel bandwidth cost of the motion link is obtained from (15).

  3. Pretrain the nonlinear transform components of the primary link and the context generation networks $\varphi_a$ and $\varphi_s$ while freezing the parameters of the motion link. Similar to the pretraining process of the motion link, we still employ the ideal lossless previous frame as the reference, i.e., we manually set $\check{\mathbf{x}}_{t-1} = \hat{\mathbf{x}}_{t-1} = \mathbf{x}_{t-1}$. In practice, following [27], we temporarily remove the bit cost terms and add them back after several training epochs. This strategy helps the model to generate useful contexts and exploit them, which accelerates convergence. The training loss of this step is formulated as (21), where the feature domain contexts $\check{\mathbf{c}}_t$ and $\hat{\mathbf{c}}_t$ are generated as in (16).

  4. Based on the previous step, train the whole framework except the frozen motion link. The training loss of this step is formulated as

     $$L_t^{\mathrm{pl}} = \ddot{D}_t + D\big(\hat{\mathbf{x}}_t, \mathbf{x}_t\big) + \lambda k_t^{\mathrm{pl}}, \tag{23}$$

     where the reconstructed frame $\hat{\mathbf{x}}_t$ is derived from (1), and the channel bandwidth cost of the primary link is obtained as in (11). In this step, the model learns to transmit the current frame efficiently with the help of the context $\mathbf{d}_t$ in the JSCC codeword space. For stable training, the pretraining distortion $\ddot{D}_t$ is also added.

  5. Unfreeze the motion link and train the whole DVST model according to subsection II-E. The final training loss for a GOP is formulated as

     $$L = \frac{1}{N}\sum_{t=1}^{N}\Big(\lambda\big(k_t^{\mathrm{pl}} + k_t^{\mathrm{ml}}\big) + D\big(\hat{\mathbf{x}}_t, \mathbf{x}_t\big) + D\big(\ddot{\mathbf{x}}_t, \mathbf{x}_t\big)\Big), \tag{24}$$

     where the pretraining distortion $D(\ddot{\mathbf{x}}_t, \mathbf{x}_t)$ serves as a regularization term to improve the training stability. In this step, the model learns the bandwidth cost trade-off between the primary and motion links. Moreover, due to the integrated training within a whole GOP, the rate allocation among frames is also optimized, as will be shown in the ablation study of the subsequent section.

IV Experimental Results

IV-A Experimental Setup

IV-A1 Datasets

Our DVST model is trained with the Vimeo-90k dataset [36], which consists of 89800 video clips with a large variety of scenes and actions. During model training, the clips are randomly cropped to 256×256 pixels. We use N=7 unrolled frames as a GOP in the last step of the training procedure and block the gradients from passing from the I-frame reconstruction to the P-frames. We evaluate the performance of DVST on the HEVC test dataset [37] and the UVG dataset. As widely used benchmarks for video-related algorithms, they contain sequences of various content, frame rates, and resolutions. In particular, the HEVC dataset includes Class A (2560×1600), Class B (1920×1080), Class C (832×480), Class D (416×240), and Class E (1280×720), and the UVG dataset consists of 7 videos with a resolution of 1920×1080. During model testing, we set the GOP size to N=4, which is identical to that of the end-to-end wireless video transmission scheme in [38]. For I-frame coding, we adopt our previous work on image semantic transmission using nonlinear transform source-channel coding [9].

IV-A2 Implementation Details

In all experiments, the channel dimension C in Fig. 4 is set to 96 for the primary link and 128 for the motion link. In addition, the channel dimension C in Fig. 5 is 96. As mentioned before, we employ the Swin Transformer [35] as the backbone of the contextual deep JSCC codec, which greatly reduces the computational complexity of vision Transformers by conducting multi-head self-attention (MHSA) within local windows or shifted windows. In this paper, the number of Swin Transformer blocks is set to $N_e = N_d = 4$, and we use 8 heads and a 16×16 window size in MHSA. In addition, the quantized channel bandwidth cost value set of the primary link is chosen as $\mathcal{V}^{\mathrm{pl}} = \{0,2,4,6,8,10,15,20,26,32,40,48,56,64,80,96\}$, and that of the motion link as $\mathcal{V}^{\mathrm{ml}} = \{0,1,2,4,8,16,32,48\}$. Hence, extra side information of q=7 bits in total is transmitted to inform the receiver of the CBR for each embedding. Since we adopt a large patch size of 32×32 pixels, the side information cost is small compared to that of the video content. The composition of the total CBR will be discussed in the ablation study.
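The side information figure can be checked with a short calculation: indexing 16 primary-link values takes 4 bits and indexing 8 motion-link values takes 3 bits, which matches the stated total of 7 bits if the total is read as the sum over both links. The sketch below also shows one possible form of the scalar quantizer Q; nearest-value rounding is an assumption, since the text only states that Q maps onto the value set.

```python
import math

V_PL = [0, 2, 4, 6, 8, 10, 15, 20, 26, 32, 40, 48, 56, 64, 80, 96]
V_ML = [0, 1, 2, 4, 8, 16, 32, 48]


def side_info_bits(value_set):
    """Bits needed to index one entry of a rate value set."""
    return math.ceil(math.log2(len(value_set)))


def quantize_rate(k, value_set):
    """Map a real-valued bandwidth cost to the nearest allowed value.

    Nearest-value rounding is an assumption made for this sketch.
    """
    return min(value_set, key=lambda v: abs(v - k))


total_bits = side_info_bits(V_PL) + side_info_bits(V_ML)
k_bar = quantize_rate(12.0, V_PL)
```

The value sets are denser at low rates, so small bandwidth costs are quantized finely while large ones are quantized coarsely.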

For the reconstruction task, we optimize DVST in terms of the mean squared error (MSE) for the peak signal-to-noise ratio (PSNR) metric, or multiscale structural similarity [39] (MS-SSIM) for perceptual quality. Multiple DVST models are trained with λ ∈ {1/4, 1/8, 1/16, 1/32, 1/64} for MS-SSIM and λ ∈ {256, 128, 64, 32, 16} for PSNR, thus achieving different RD trade-offs. A smaller value of λ leads to a larger CBR. We denote these models as “DVST (MS-SSIM)” and “DVST (PSNR)”, respectively. For each model, we use the Adam optimizer [40] with a learning rate of $10^{-4}$. We use a mini-batch size of 8, and it takes about one week to train the whole DVST model on a single RTX 3090 GPU.

IV-A3 Comparison Schemes

Figure 6: PSNR performance versus the average channel bandwidth ratio (CBR) over the AWGN channel at SNR=10dB.

Following [38], we compare our DVST with classical video coded transmission schemes in current mainstream wireless communication systems. In particular, we employ the standard video codecs (H.264 [41] and H.265 [42]) for source coding, combined with practical LDPC codes [26] or an ideal capacity-achieving channel code family for channel coding. For brevity, we use “+” to concatenate the source coding and channel coding schemes, e.g., H.265 combined with a capacity-achieving channel code is denoted as “H.265 + Capacity”. Note that the ideal “H.264 + Capacity” or “H.265 + Capacity” scheme can be viewed as a performance upper bound on traditional separation-based source and channel coding schemes. The above simulations are implemented on top of Sionna [43], an open-source library for the link-level simulation of digital communication systems. In addition, we follow the configurations of H.264 and H.265 in [44], which adopt the typical ffmpeg settings for low-latency and veryfast mode. In the practical implementation, to be aligned with previous works [17], we also convert every two consecutive real symbols in $\mathbf{s}$ into one complex channel-input symbol and add complex Gaussian noise.
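The pairing of real symbols into complex channel inputs and the addition of complex Gaussian noise can be sketched as below. The helper names, the unit average power normalization, and the SNR definition (signal power over total complex noise power) are assumptions for illustration rather than details confirmed by this section.

```python
import math
import random


def to_complex(real_symbols):
    """Pair consecutive real symbols as (real, imag) parts of complex symbols."""
    if len(real_symbols) % 2 != 0:
        raise ValueError("need an even number of real symbols")
    return [complex(real_symbols[i], real_symbols[i + 1])
            for i in range(0, len(real_symbols), 2)]


def normalize_power(symbols):
    """Scale symbols to unit average power (assumed power constraint)."""
    power = sum(abs(s) ** 2 for s in symbols) / len(symbols)
    if power == 0.0:
        return list(symbols)
    scale = math.sqrt(power)
    return [s / scale for s in symbols]


def awgn(symbols, snr_db, rng):
    """Add circularly symmetric complex Gaussian noise to unit-power symbols."""
    noise_power = 10 ** (-snr_db / 10)
    sigma = math.sqrt(noise_power / 2)  # standard deviation per real dimension
    return [s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for s in symbols]


tx = normalize_power(to_complex([1.0, 2.0, 3.0, 4.0]))
rx = awgn(tx, snr_db=10.0, rng=random.Random(0))
```

Under this convention, k real network outputs occupy k/2 complex channel uses, which is how the channel bandwidth cost is counted when comparing against coded modulation schemes.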

IV-B Reconstruction Task Results

Figure 7: MS-SSIM performance versus the average channel bandwidth ratio (CBR) over the AWGN channel at SNR=10dB. All MS-SSIM values are converted to dB ($-10\log_{10}(1-m)$, where $m$ is the MS-SSIM value in the range between 0 and 1).
Figure 8: PSNR and MS-SSIM performance versus the change of channel SNR over the AWGN channel. The CBR constraint for HEVC Class A sequence is set to R=0.015 and R=0.025 for Class D sequence.

Fig. 6 shows the RD results under the PSNR metric among various test sequences over the AWGN channel with channel SNR=10dB. For H.264 + LDPC and H.265 + LDPC, after traversing given combinations of LDPC coded modulation schemes, we exploit a 2/3 rate (4096,6144) LDPC code with 16-ary quadrature amplitude modulation (16QAM) to ensure reliable transmission and the highest efficiency [28]. We use the Gaussian capacity formula [1] to calculate the maximum transmission rate per channel symbol for the ideal H.264 + Capacity and H.265 + Capacity schemes. From Fig. 6, we can find that on most test sequences, the proposed DVST (PSNR) scheme can outperform the practical H.264 + LDPC scheme by a large margin for all CBRs, and the performance gap increases with the CBR which indicates a better coding gain achieved by our DVST method. Furthermore, the proposed DVST shows competitive performance to the H.265 + LDPC scheme and even performs close to H.265 + Capacity in some test sequences.
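For orientation, the two rate references used for the baselines can be computed directly. The sketch assumes the standard AWGN capacity per complex channel use, log2(1 + SNR), and reads (4096,6144) as 4096 information bits in a 6144-bit codeword; the function names are illustrative.

```python
import math


def awgn_capacity(snr_db):
    """Shannon capacity of the complex AWGN channel in bits per channel use."""
    return math.log2(1 + 10 ** (snr_db / 10))


def coded_modulation_rate(code_rate, bits_per_symbol):
    """Information bits per channel use for a code rate and a QAM order."""
    return code_rate * bits_per_symbol


capacity_10db = awgn_capacity(10.0)                      # about 3.46
ldpc_16qam = coded_modulation_rate(4096 / 6144, 4)       # about 2.67
```

At 10dB the rate 2/3 LDPC code with 16QAM carries roughly 2.67 bits per channel use against a capacity of roughly 3.46, which is the gap separating the "+ LDPC" curves from the "+ Capacity" curves.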

As for the coding gain that shows as the RD curve slope, by using the adaptive rate allocation and contextual transmission mechanism, our DVST model shows comparable coding gain as that of H.265/H.264 series in most cases. The coding gain generally increases with the resolution of the video sequence, which demonstrates the potential of DVST on transmitting higher resolution videos over wireless channels. However, we also note that DVST performs slightly worse than H.265 + LDPC on HEVC Class C and E. A possible reason is that many video sequences, e.g., BQMall, in these two Classes consist of complex foreground or various textures, which results in difficulty to the context generation in both semantic feature space and deep JSCC codeword space. As a comparison, our DVST performs better than H.265 + LDPC on HEVC Class A and D, which are of relatively flat foreground and simple textures.

Fig. 7 shows the RD performance in terms of the MS-SSIM perceptual metric over the AWGN channel at channel SNR=10dB. Since MS-SSIM yields values between 0 (worst) and 1 (best), and most values are higher than 0.9, we convert the MS-SSIM values to dB to improve legibility. For semantic communications, this perceptual metric aligns better with human perception. Results indicate that the proposed DVST method outperforms classical schemes by a large margin, and it achieves a greater improvement on high-resolution videos and in high CBR regions. Compared to the PSNR results in Fig. 6, the traditional video coded transmission schemes fall further behind the learning-based DVST under MS-SSIM, because traditional video compression is designed to be optimized for squared error with hand-selected constraints.
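The dB conversion used for the MS-SSIM plots is the mapping −10 log10(1 − m). A minimal sketch, with illustrative function names, is:

```python
import math


def msssim_to_db(m):
    """Convert an MS-SSIM value m in [0, 1) to dB via -10 * log10(1 - m)."""
    if not 0.0 <= m < 1.0:
        raise ValueError("MS-SSIM must lie in [0, 1)")
    return -10.0 * math.log10(1.0 - m)


def db_to_msssim(db):
    """Inverse mapping from dB back to the MS-SSIM value."""
    return 1.0 - 10 ** (-db / 10.0)


db_values = [msssim_to_db(m) for m in (0.9, 0.99)]  # about 10 and 20 dB
```

The mapping stretches the crowded range above 0.9, so that each tenfold reduction of 1 − m appears as a 10dB step.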

Fig. 8 provides the PSNR and MS-SSIM results versus the channel SNR, where the CBR constraint is R=0.015 for the HEVC Class A sequence and R=0.025 for the Class D sequence. Since DVST learns an adaptive bandwidth allocation strategy depending on the video content and channel condition, it is difficult to strictly constrain the CBR to a predetermined value. In practice, based on the 10dB DVST models (λ=64 for PSNR, and λ=1/32 for MS-SSIM), we finetune $\eta_t$ to meet the CBR constraint at different SNRs. For the comparison schemes, we evaluate the performance using all possible combinations of (4096,8192), (4096,6144), and (2048,6144) LDPC codes with 4QAM, 16QAM, and 64QAM modulations. The solid blue line presents the envelope of the best performing configurations of H.265 + LDPC at each SNR. In Fig. 8, we use the solid red line to illustrate the performance of the DVST models trained at channel SNRs of −2dB, 2dB, 6dB, 10dB, and 14dB, where the testing SNR equals the training SNR. We also provide the performance under mismatched training and testing as the red dashed lines, where two models are trained under channel SNRs of 6dB and 10dB, respectively, but tested at various SNRs. We can find that the proposed DVST brings considerable performance gain. Comparing the three red lines, we observe that our DVST model shows a reasonable performance improvement as $\mathrm{SNR}_{\mathrm{test}}$ increases when $\mathrm{SNR}_{\mathrm{test}} > \mathrm{SNR}_{\mathrm{train}}$, and avoids catastrophic degradation when $\mathrm{SNR}_{\mathrm{test}} < \mathrm{SNR}_{\mathrm{train}}$. In contrast, traditional separation-based video coded transmission schemes show a significant cliff effect, as plotted by the PSNR-SNR curves of H.265 + LDPC in the blue dash-dotted lines.

As for the slope of each curve in Fig. 8, our DVST shows better performance than, and a coding gain comparable to, the H.264/H.265 series, especially on the high-resolution videos of HEVC Class A, which contain more high-frequency content. Furthermore, we compare our DVST with the emerging neural video compression scheme DCVC [27] combined with LDPC codes for wireless transmission. For a fair comparison, we use the same I-frame coding and GOP size and only compare the P-frame performance. Compared with DCVC + LDPC, we also achieve a meaningful gain, which indicates that our DVST benefits from the good match between the learned deep JSCC and the nonlinear transform. Moreover, DVST does not rely on explicit entropy coding for compression or on channel codes for error correction, which avoids the cliff effect and reduces the computational complexity, whereas DCVC + LDPC still suffers from the cliff effect due to the use of entropy coding and the residual errors after LDPC decoding.

Figure 9: PSNR performance versus channel SNR over the Rayleigh fading channel. The CBR constraint for HEVC Class D sequence is set to R=0.025.
Figure 10: CBR and PSNR/MS-SSIM comparison between DVST and H.265 + LDPC. We use the last 12 frames of Kimono from HEVC Class B dataset as an example.
Figure 11: CBR savings for each video in all test sequences. We use the widely used Bjontegaard Delta (BD) rate reduction [45] algorithm to estimate the bandwidth savings. The rate saving value represents the percentage of channel bandwidth cost relative to H.264 + LDPC at the same PSNR (smaller is better), and this experiment is conducted at the AWGN channel with SNR=10dB. As the anchor, H.264 + LDPC scheme is always 100% as plotted in the dashed horizontal line. In most cases (21/25), the proposed DVST can save more bandwidth compared to both H.264 + LDPC and H.265 + LDPC.
Panels of Figure 12 (columns: Original, H.264 + LDPC, H.265 + LDPC, DVST).
Top row, (a) RGOP / PSNR (dB): (b) H.264 + LDPC: 0.024 (0%) / 31.35; (c) H.265 + LDPC: 0.022 (-8.3%) / 31.93; (d) DVST: 0.015 (-37.5%) / 32.12.
Bottom row, (e) RGOP / MS-SSIM (dB): (f) H.264 + LDPC: 0.018 (0%) / 12.79; (g) H.265 + LDPC: 0.017 (-5.5%) / 13.02; (h) DVST: 0.016 (-11.1%) / 15.91.
Figure 12: Examples of visual comparison. The first column shows the original frame. The second column shows the cropped patch in original frame. The third to the fifth column show the reconstructed frames by using different transmission schemes over the AWGN channel at SNR=10dB. Note that, the fifth column presents the cropped patch generated by DVST (PSNR) and DVST (MS-SSIM), respectively. RGOP denotes the average CBR of the current GOP. The blue number indicates the percentage of bandwidth cost saving compared to the baseline “H.264 + LDPC” scheme.
Panels of Figure 13 (columns: SNRtest=2dB, SNRtest=6dB, SNRtest=10dB, SNRtest=14dB; (e) Original).
DVST row: (a) DVST (SNRtrain=2dB); (b) DVST (SNRtrain=6dB); (c) DVST (SNRtrain=10dB); (d) DVST (SNRtrain=14dB).
H.265 + LDPC row: (f) H.265 + 1/2 LDPC + 4QAM; (g) H.265 + 2/3 LDPC + 4QAM; (h) H.265 + 2/3 LDPC + 16QAM; (i) H.265 + 1/2 LDPC + 64QAM.
Figure 13: Examples of visual comparison. The first column shows the original frame and its cropped patch. The second to the fifth column shows the frames reconstructed by DVST (PSNR) or H.265 + LDPC over the AWGN channel at various channel SNRs, respectively. For the DVST (PSNR), we consider the condition when SNRtest=SNRtrain. For the H.265 + LDPC scheme, the rate of LDPC and the order of QAM modulation is marked below the reconstructed frame. The average CBR of GOP is limited to 0.012.

Next, we show the PSNR performance under the Rayleigh fading channel with the CBR constraint R=0.025 in Fig. 9. In this case, we assume the Rayleigh fading channel gain vector $\mathbf{h}_t \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_{k_t})$, which is known at the receiver through ideal channel estimation. Thus, the receiver first conducts channel equalization, such that the received signal can be equivalently written as $\hat{\mathbf{s}}_t = \mathbf{s}_t + \mathbf{n}_t / \mathbf{h}_t$ (element-wise division), and then feeds $\hat{\mathbf{s}}_t$ into the DVST decoder. In practice, our DVST models for the Rayleigh fading channel are finetuned from the baseline models learned under the AWGN channel with the same SNR. The classical separation schemes (H.264/H.265 + LDPC + QAM) are still inferior to our DVST, especially in the low SNR region.
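The equalized Rayleigh fading model can be sketched as follows: each symbol is multiplied by an independent complex Gaussian gain, noise is added, and the receiver divides by the known gain. The function name and the SNR convention (unit signal power over total complex noise power) are assumptions for this sketch.

```python
import math
import random


def rayleigh_equalized(symbols, snr_db, rng):
    """Pass symbols through y = h * s + n and return y / h = s + n / h.

    h ~ CN(0, 1) is drawn independently per symbol and assumed perfectly
    known at the receiver (ideal channel estimation).
    """
    sigma_n = math.sqrt(10 ** (-snr_db / 10) / 2)  # noise std per real dimension
    sigma_h = math.sqrt(0.5)                       # gain std per real dimension
    equalized = []
    for s in symbols:
        h = complex(rng.gauss(0.0, sigma_h), rng.gauss(0.0, sigma_h))
        n = complex(rng.gauss(0.0, sigma_n), rng.gauss(0.0, sigma_n))
        equalized.append((h * s + n) / h)
    return equalized


s_hat = rayleigh_equalized([1 + 1j, -1 + 0.5j], snr_db=10.0, rng=random.Random(1))
```

Because the noise is divided by the gain, symbols that experience a deep fade see strongly amplified noise, which is why the decoder needs finetuning for this channel rather than reusing the AWGN models unchanged.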

Fig. 10 shows a detailed frame-by-frame comparison between DVST and H.265 + LDPC over a group of consecutive frames. It can be observed that the reconstruction quality of both schemes degrades as the P-frame index within one GOP increases. In comparison, our DVST consumes less channel bandwidth while achieving much better reconstruction quality.

Furthermore, as shown in Fig. 11, we compute the BD rate reduction [45] relative to H.264 + LDPC for each video under the AWGN channel at SNR = 10dB. Compared with H.264 + LDPC, the channel bandwidth cost of DVST is only 40% to 80% at the same reconstruction quality in terms of PSNR, which means that 20% to 60% of the channel bandwidth can be saved. Compared with H.265 + LDPC, DVST still saves more bandwidth on most videos (21/25).

Fig. 12 and Fig. 13 provide illustrative examples to demonstrate the performance of DVST intuitively. Specifically, in the two rows of Fig. 12, we visualize reconstructed frames corresponding to the results of Fig. 6 and Fig. 7, respectively. From the two groups of examples, we can observe that our DVST model generates high-fidelity reconstructions at lower CBR costs. Fig. 13 presents the reconstruction results versus the channel SNR under a limited CBR budget. It can be seen that the results of the H.265 + LDPC + QAM scheme have artifacts and blocking effects in the low SNR region, while DVST reconstructs the text clearly.

Figure 14: Ablation study results in terms of MS-SSIM over the AWGN channel at SNR = 10dB. The hyperparameter λ and the CBR percentage of motion link are also marked on the DVST curve.
Figure 15: Reconstruction and segmentation performance versus the average CBR over the AWGN channel at SNR=10dB.
Column headers of Figure 16: Ground Truth, H.264 + LDPC, H.265 + LDPC, DVST.
Figure 16: Examples of semantic segmentation results. The first two columns present the original frame and its label. The third to the fifth column shows the segmentation results from reconstructions of H.264 + LDPC or H.265 + LDPC or DVST (PSNR), respectively. The experiment is carried out over the AWGN channel with SNR=10dB and the average CBR of GOP is constrained to 0.009.

As for the ablation study, we verify the gains brought by the proposed algorithms in Fig. 14, including the contextual model enhancement, the rate-adaptive transmission, and the GOP integrated training strategy. The bandwidth cost trade-offs between the primary and motion links are also provided. When λ becomes smaller, the DVST model tends to allocate more channel bandwidth to reduce the distortion in the primary link, while the percentage of channel bandwidth in the motion link decreases. This result means that it is more efficient to focus on the primary link in the high CBR region, since this link can preserve more details.

We first verify the effectiveness of the GOP training strategy, which considers the influence of temporal correlations among adjacent frames. Comparing the results of DVST with those of DVST (without (w/o) GOP training), a reconstruction quality improvement of about 0.2dB can be seen. To verify the advantages of contextual video transmission, we remove the context generation network of DVST and adopt the traditional residual coding structure [44] as an alternative, which exploits the MV to warp the reference frame and transmits the residual between the warped frame and the input frame. The contextual model enhancement in DVST improves the overall performance compared to the traditional residual coding structure.

Finally, we disable the rate adaptation module in $f_e$, $f_e^{\mathrm{mv}}$, $f_d$, and $f_d^{\mathrm{mv}}$ by using a constant channel bandwidth cost for each patch embedding and deleting the additional rate tokens. Since the network can no longer learn a bandwidth cost trade-off between the primary and motion links, the CBR proportion of the motion link is fixed to 1/3. Given the channel bandwidth cost for each patch embedding, the total CBR is then fixed, so the loss function of DVST (w/o rate allocation) only has the distortion term. As shown in Fig. 14, our DVST outperforms DVST (w/o rate allocation) by a large margin, especially in the high CBR region, verifying the coding gain brought by the proposed rate-adaptive transmission mechanism.

Furthermore, we provide an ablation study on the motion link. DVST (w/o motion link) extracts the context semantics solely from previous reconstructions. In this case, the whole system architecture can be vastly simplified. However, without the guidance of optical flow, it is difficult for the context generation module to extract valuable information and exploit the spatio-temporal dependencies, resulting in significant performance degradation.

To compare the computational complexity of different video transmission systems, we measure the average encoding time of DVST on a Linux server with an Intel Xeon Gold 6226R CPU and an RTX 3090 GPU. Following the complexity analysis in [44, 27], we transmit five videos from the 1080P HEVC Class B dataset and measure the encoding speed. Our DVST model spends 280ms to encode a single P-frame into channel-input symbols, which is roughly twice as fast as DCVC + LDPC, mainly due to the savings in arithmetic coding time. It is worth mentioning that the coding speed can be further improved by employing the latest deep model acceleration techniques, which is beyond the scope of this paper. As a comparison, the H.265 + LDPC scheme runs at speeds from 1.5fps (frames per second) to 25fps under different coding settings (trading coding efficiency against encoding speed), and the encoding speed of H.264 + LDPC is 8fps to 150fps. Note that both H.264 and H.265 are implemented using commercial software with highly parallel frameworks and advanced assembly optimization techniques, while their official reference software runs hundreds of times slower [46].
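To put the 280ms figure on the same scale as the frame rates quoted for the baselines, the per-frame encoding time can be converted to frames per second. This is plain arithmetic on the numbers stated above; the function name is illustrative.

```python
def frames_per_second(ms_per_frame):
    """Convert a per-frame processing time in milliseconds to frames per second."""
    if ms_per_frame <= 0:
        raise ValueError("ms_per_frame must be positive")
    return 1000.0 / ms_per_frame


dvst_fps = frames_per_second(280.0)  # about 3.6 frames per second
```

At about 3.6 frames per second, the reported DVST encoding speed lies inside the 1.5fps to 25fps range quoted for H.265 + LDPC and below the range quoted for H.264 + LDPC.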

IV-C Downstream Machine Vision Task Results

To potentially support future machine-type communications [34], we further optimize DVST for driving downstream machine vision tasks while preserving the advantages of signal level reconstruction. Specifically, an analytics model is concatenated after the transmission framework to complete high-level semantics-related tasks based on the reconstructed frames. The transmission framework can be either a JSCC scheme like DVST or a classical separated system like H.265 + LDPC. Herein, we take semantic segmentation as an example visual analytics task, employ HyperSeg [47] as a powerful analytics model to generate the segmentation prediction, and evaluate the capacity of our DVST on the popular CamVid benchmark [48, 49]. CamVid is a road scene understanding dataset, which offers four video clips of driving scenes, some frames of which are densely annotated with semantic labels. Following the training protocol of [47], we use 468 annotated images as the training set and the other 233 as the test set. Each video and the labeled images are resized to 256×192. To achieve a better rate-distortion-accuracy trade-off, we finetune the DVST (PSNR) model on the CamVid training set. The whole training loss extends (19) to achieve joint optimization of both video transmission and analysis. It is formulated as $L = \frac{1}{N}\sum_{t=1}^{N}\left(\lambda k_t + D_t + \beta D_{t,\mathrm{seg}}\right)$, where $D_{t,\mathrm{seg}}$ denotes the bootstrapped cross entropy loss [50] specialized for semantic segmentation, and β is the weighting parameter. We set β=64 to balance the importance of reconstruction and segmentation. During the evaluation on the test set, all frames are transmitted to the receiver to calculate the signal level distortion in terms of PSNR, while only the labeled frame participates in the calculation of segmentation performance (the labeled frame falls, with equal frequency, on the second or the last frame of a GOP). We report the class mean intersection over union (mIoU) results, a standard evaluation metric for semantic image segmentation [47]. A higher mIoU score indicates a better match between the prediction and the ground truth, with a maximum value of 1.
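For readers unfamiliar with the metric, a minimal sketch of class-mean IoU over flattened label maps is given below. Skipping classes that appear in neither the prediction nor the ground truth is a common convention adopted here as an assumption; the exact evaluation protocol of [47] may differ, for instance by accumulating counts over the whole test set.

```python
def mean_iou(pred, target, num_classes):
    """Class-mean intersection over union for two flat label sequences.

    Classes absent from both pred and target are skipped (assumption).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12; a perfect prediction yields 1.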

The quantitative results are shown in Fig. 15. We observe that our DVST achieves better performance for both reconstruction and segmentation at various CBRs. It outperforms the two separated coding schemes, and the gap increases with the CBR. Hence, our DVST method can better support machine vision tasks and maintain higher fidelity for human vision at the same time. In addition, we present a group of segmentation examples in Fig. 16. The reconstruction of DVST preserves more semantic information for machine recognition, which leads to more accurate segmentation results.

V Conclusion

This paper has proposed a new class of high-efficiency deep JSCC methods to achieve end-to-end video transmission over wireless channels, collected under the name “DVST”. The DVST framework exploits nonlinear transform and a conditional coding architecture to adaptively extract semantic features across video frames, and transmits the semantic features over wireless channels via a group of learned variable-length deep JSCC codecs. Benefiting from the strong temporal prior provided by the semantic feature domain context and the deep JSCC codeword domain context, the DVST framework works efficiently and effectively. The whole video transmission system design has been formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion cost under established perceptual quality metrics or downstream task metrics, which matches the goal of end-to-end semantic communications well. Extensive numerical results have shown that the proposed DVST method can generally surpass traditional wireless video coded transmission schemes. In a nutshell, this paper has proposed a promising method to attain a customized design of learning-based source-channel coding for video transmission in future semantic communications.

References

  • [1] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [2] J. Rissanen and G. Langdon, “Universal modeling and coding,” IEEE Transactions on Information Theory, vol. 27, no. 1, pp. 12–23, 1981.
  • [3] J. Ballé, N. Johnston, and D. Minnen, “Integer networks for data compression with latent-variable models,” in Proceedings of the International Conference on Learning Representations, 2018.
  • [4] P. Zhang, W. Xu, H. Gao, K. Niu, X. Xu, X. Qin, C. Yuan, Z. Qin, H. Zhao, J. Wei, et al., “Toward wisdom-evolutionary and primitive-concise 6G: A new paradigm of semantic communication networks,” Engineering, vol. 8, pp. 60–73, 2022.
  • [5] H. Xie, Z. Qin, G. Y. Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,” IEEE Transactions on Signal Processing, vol. 69, pp. 2663–2675, 2021.
  • [6] Z. Qin, X. Tao, J. Lu, and G. Y. Li, “Semantic communications: Principles and challenges,” arXiv preprint arXiv:2201.01389, 2021.
  • [7] H. Seo, J. Park, M. Bennis, and M. Debbah, “Semantics-native communication with contextual reasoning,” arXiv preprint arXiv:2108.05681, 2021.
  • [8] J. Dai, P. Zhang, K. Niu, S. Wang, Z. Si, and X. Qin, “Communication beyond transmitting bits: Semantics-guided source and channel coding,” IEEE Wireless Communications, pp. 1–8, early access, 2022.
  • [9] J. Dai, S. Wang, K. Tan, Z. Si, X. Qin, K. Niu, and P. Zhang, “Nonlinear transform source-channel coding for semantic communications,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 8, pp. 2300–2316, 2022.
  • [10] K. Niu, J. Dai, S. Yao, S. Wang, Z. Si, X. Qin, and P. Zhang, “A paradigm shift towards semantic communications,” IEEE Communications Magazine, pp. 1–7, early access, 2022.
  • [11] M. Fresia, F. Perez-Cruz, H. V. Poor, and S. Verdu, “Joint source and channel coding,” IEEE Signal Processing Magazine, vol. 27, no. 6, pp. 104–113, 2010.
  • [12] A. Guyader, E. Fabre, C. Guillemot, and M. Robert, “Joint source-channel turbo decoding of entropy-coded sources,” IEEE Journal on Selected Areas in Communications, vol. 19, no. 9, pp. 1680–1696, 2001.
  • [13] N. Ramzan, S. Wan, and E. Izquierdo, “Joint source-channel coding for wavelet-based scalable video transmission using an adaptive turbo code,” EURASIP Journal on Image and Video Processing, vol. 2007, pp. 1–12, 2007.
  • [14] C. Chen, L. Wang, and F. C. M. Lau, “Joint optimization of protograph LDPC code pair for joint source and channel coding,” IEEE Transactions on Communications, vol. 66, no. 8, pp. 3255–3267, 2018.
  • [15] N. Farsad, M. Rao, and A. Goldsmith, “Deep learning for joint source-channel coding of text,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018, pp. 2326–2330.
  • [16] K. Choi, K. Tatwawadi, A. Grover, T. Weissman, and S. Ermon, “Neural joint source-channel coding,” in Proceedings of the International Conference on Machine Learning. PMLR, 2019, pp. 1182–1192.
  • [17] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019.
  • [18] D. B. Kurka and D. Gündüz, “Bandwidth-agile image transmission with deep joint source-channel coding,” IEEE Transactions on Wireless Communications, vol. 20, no. 12, pp. 8081–8095, 2021.
  • [19] M. Jankowski, D. Gündüz, and K. Mikolajczyk, “Wireless image retrieval at the edge,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 89–100, 2020.
  • [20] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
  • [21] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proceedings of the International Conference on Learning Representations, 2017.
  • [22] J. Ballé, “Efficient nonlinear transforms for lossy image compression,” in 2018 Picture Coding Symposium. IEEE, 2018, pp. 248–252.
  • [23] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Proceedings of the International Conference on Learning Representations, 2018.
  • [24] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [25] J. Ballé, P. A. Chou, D. Minnen, S. Singh, N. Johnston, E. Agustsson, S. J. Hwang, and G. Toderici, “Nonlinear transform coding,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 2, pp. 339–353, 2020.
  • [26] T. Richardson and S. Kudekar, “Design of low-density parity check codes for 5G new radio,” IEEE Communications Magazine, vol. 56, no. 3, pp. 28–34, 2018.
  • [27] J. Li, B. Li, and Y. Lu, “Deep contextual video compression,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [28] D. B. Kurka and D. Gündüz, “DeepJSCC-f: Deep joint source-channel coding of images with feedback,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 178–193, 2020.
  • [29] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: An end-to-end deep video compression framework,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11006–11015.
  • [30] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4161–4170.
  • [31] F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” Advances in Neural Information Processing Systems, vol. 33, pp. 11913–11924, 2020.
  • [32] J. Ballé, V. Laparra, and E. P. Simoncelli, “Density modeling of images using a generalized normalization transformation,” in Proceedings of the International Conference on Learning Representations, 2016.
  • [33] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  • [34] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Transactions on Image Processing, vol. 29, pp. 8680–8695, 2020.
  • [35] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  • [36] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” International Journal of Computer Vision, vol. 127, no. 8, pp. 1106–1125, 2019.
  • [37] F. Bossen et al., “Common test conditions and software reference configurations,” JCTVC-L1100, vol. 12, no. 7, 2013.
  • [38] T.-Y. Tung and D. Gündüz, “DeepWiVe: Deep-learning-aided wireless video transmission,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 9, pp. 2570–2583, 2022.
  • [39] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers. IEEE, 2003, vol. 2, pp. 1398–1402.
  • [40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [41] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
  • [42] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
  • [43] J. Hoydis, S. Cammerer, F. Ait Aoudia, A. Vem, N. Binder, G. Marcus, and A. Keller, “Sionna: An open-source library for next-generation physical layer research,” arXiv preprint, Mar. 2022.
  • [44] G. Lu, X. Zhang, W. Ouyang, L. Chen, Z. Gao, and D. Xu, “An end-to-end learning framework for video compression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3292–3308, 2021.
  • [45] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” ITU-T VCEG-M33, Apr. 2001.
  • [46] G. Lu, X. Zhang, W. Ouyang, L. Chen, Z. Gao, and D. Xu, “An end-to-end learning framework for video compression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3292–3308, 2021.
  • [47] Y. Nirkin, L. Wolf, and T. Hassner, “HyperSeg: Patch-wise hypernetwork for real-time semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4061–4070.
  • [48] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in European Conference on Computer Vision. Springer, 2008, pp. 44–57.
  • [49] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2009.
  • [50] S. E. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” in Proceedings of the International Conference on Learning Representations (Workshop), 2015.