Efficient algorithms for quantum information bottleneck

Masahito Hayashi hayashi@sustech.edu.cn Shenzhen Institute for Quantum Science and Engineering, Southern University of Science and Technology, Shenzhen,518055, China International Quantum Academy (SIQA), Futian District, Shenzhen 518048, China Guangdong Provincial Key Laboratory of Quantum Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China Graduate School of Mathematics, Nagoya University, Nagoya, 464-8602, Japan    Yuxiang Yang yuxiang@cs.hku.hk QICI Quantum Information and Computation Initiative, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
Abstract

The ability to extract relevant information is critical to learning. An ingenious approach as such is the information bottleneck, an optimisation problem whose solution corresponds to a faithful and memory-efficient representation of relevant information from a large system. The advent of the age of quantum computing calls for efficient methods that work on information regarding quantum systems. Here we address this by proposing a new and general algorithm for the quantum generalisation of information bottleneck. Our algorithm excels in the speed and the definiteness of convergence compared with prior results. It also works for a much broader range of problems, including the quantum extension of deterministic information bottleneck, an important variant of the original information bottleneck problem. Notably, we discover that a quantum system can achieve strictly better performance than a classical system of the same size regarding quantum information bottleneck, providing new vision on justifying the advantage of quantum machine learning.

1 Introduction

Learning is a task of eminent importance to the contemporary world. As such, it has always been of top priority to quest powerful tools for learning information. Information bottleneck [32] stands as an excellent example, with many useful applications including deep learning [33, 28, 8], video processing [16], clustering [29] and polar coding [30]. Concretely, information bottleneck is a method to extract a piece of information T with respect to the system Y from the system X, and is formulated as the minimization problem of the difference I(T:X)βI(T:Y) with a positive parameter β, where I(T:X) is the mutual information between T and X. In particular, we are interested in the case when X is classical. By design, information bottleneck achieves an irreversible compression, by extracting essential information about Y and simultaneously removing unessential information contained in X.

Refer to caption
Figure 1: Visualization of quantum information bottleneck. In a prototypical setting of quantum information bottleneck, the task is to compress a classical system into a smaller system T, which can be either classical or quantum, by extracting its useful information about a quantum system Y and removing the useless information. It is expected that more relevant information Y about Y, instead of the entire X, can be recovered from T.

As we are stepping into the age of quantum information, the demand is growing for a method that efficiently learns information on a quantum system. For this purpose, let us consider the setup of quantum information bottleneck (QIB), demonstrated in Fig. 1. Similar as its classical counterpart, the aim of QIB is to compress X into a smaller system T while preserving the correlation with Y when some of these systems are quantum systems. Prior to this work, QIB has been discussed in several recent works [9, 24, 6, 14, 2] and has been applied to quantum information theory [6, 14] and quantum machine learning [2]. On the other hand, the fundamental properties of QIB such as convergence have not been analysed, which hinders its application in more practical tasks. Quantum information bottleneck is first proposed as a quantum extension of information bottleneck method in [9]. It also derived a necessary condition for the solution of the minimization problem (see [9, Appendix A]) by using Lagrange multiplier method in the same way as [1, 4]. Using the obtained condition, it also proposed an iterative algorithm to find a solution to satisfy the necessary condition [9, Appendix C]. Then, the reference [24] considered QIB in the quantum communication scenario. 111The reference [24, Appendix A] derived a necessary condition for the solution of the minimization problem by using Lagrange multiplier method in the same way as [1, 4]. Using the obtained condition, it also proposed an iterative algorithm to find a solution to satisfy the necessary condition [24, The end of Appendix C]. However, no study discusses the behaviour of the iterative algorithm, i.e., it is not known whether the algorithm monotonically reduces the objective function [32, 31, 9, 24]. It was also claimed in [9, Appendix B] that there is no advantage of using a quantum T if X,Y are both classical.

In this work, we conduct a systematic study on quantum information bottleneck, focusing on the case when the system X is classical. Compared to existing works [9, 24, 6, 14, 2], our work makes significant contributions in several directions:

First, we provide throughout analyses on two critical properties – efficiency and convergence – of QIB. Motivated by a recent generalization [22] of the Arimoto-Blahut algorithm [1, 4], we introduce a new quantum information bottleneck algorithm with an acceleration parameter γ that can make the value of QIB converges much faster than before when chosen properly. We prove rigorous criteria for our algorithm to converge and to achieve a minimum. In particular, we prove that the choice of β plays an important role in convergence.

Second, in contrast to the claim in Refs. [9, 24], we provide concrete examples where using a quantum instead of classical T could reduce the minimal value of QIB. Notably, our result justifies a genuine quantum advantage in quantum machine learning [34, 27, 3], where the employment of quantum circuits has been prevalent [26, 11, 5, 17, 20, 25] but the quantum advantage was rarely justified.

Last but not least, we generalise QIB by considering a general target function (1α)H(T)+αI(T:X)βI(T:Y) with parameters α,β0, which reduces to the standard QIB when α=1. By doing so, the generalised QIB contains QDIB, i.e., the quantum version of deterministic information bottleneck [31], by setting α=0. We show that our analyses and our algorithm hold for this generalised setting and, in particular, to QDIB. Then, we clarify that QDIB can be used to find a good approximate sufficient statistics T for X for Y, which requires a smaller entropy H(T) and larger mutual information I(T:Y). We justify our finding via a numerical example, where QDIB extracts a good approximate sufficient statistics over information about a quantum ensemble.

In summary, our work addresses several critical issues of QIB, including convergence, efficiency, choice of parameters, and the quantum advantage. We also extend QIB to a generalised setting and introduce the notion of QDIB. Our results consist of both rigorous analytical analyses and numerical experiments that justifies the importance of QIB and QDIB in fundamental tasks of learning.

The remaining part of this paper is organized as follows. Section 2 introduces our algorithm for quantum information bottleneck, and discusses its convergence and dependence of the parameter β. Section 3 discusses our algorithm when our memory system T is classical. Section 4 presents examples that realizes a smaller value of the target function by quantum memory T than by classical memory T. Section 5 discusses an application of our QIB algorithm in data classification. Section 6 proposes our algorithm for quantum deterministic information bottleneck, and studies its properties. Section 7 applies it to the extraction of approximate sufficient statistics, and numerically verifies its efficiency in an example. Section 8 makes discussion and conclusion.

2 The quantum information bottleneck (QIB) problem

2.1 Problem definition

Consider a classical-quantum joint system composed of X and Y with the joint state

ρXY:=xPX(x)|xx|ρY|x, (1)

where X is a classical system and Y is a quantum system. Our quantum information bottleneck (QIB) problem aims at constructing an information processor, modelled by a c-q channel σT|X from X to T (which prepares a quantum state σT|x when the classical register is x), that extracts efficient information from X with respect to the quantum system Y. After the action of the information processor, the joint state becomes:

ρXYT:=xPX(x)|xx|ρY|xσT|x. (2)

To this aim, the QIB problem concerns constructing a classical-quantum channel σT|X:XT that minimizes the information bottleneck function, consisting of entropic quantities defined with respect to the joint state ρXYT:

fα(σT|X) :=H(T)αH(T|X)βI(T:Y)
=(1α)H(T)+αI(T:X)βI(T:Y), (3)

where H(T) denotes the entropy of T 222For convenience, the notation H(A) stands for the Shannon entropy when the system A is classical and for the von Neumann entropy when A is quantum., H(T|X) denotes the conditional entropy of T on X, and I(T:Y) stands for the mutual information between T and Y.

That is, our aim is the calculation of the following value:

α,β:=minσT|Xfα(σT|X). (4)

In the information bottleneck (2.1), α and β are positive real variables modelling the objective of the task. In the original proposal of information bottleneck [32] α=1. Another common choice of α is α=0, and the task is called a deterministic QIB (whose classical counterpart was discussed in Ref. [31]). The parameter β controls the tradeoff between faithfulness and compression. For instance, in a deterministic information bottleneck, a larger β would make I(T:Y) more prominent in the objective function, forcing the information processor to preserve more information about Y, whereas a smaller β would signify the role of I(T:X), prompting the information processor to do more compression in X.

Although this section addresses the case with quantum systems Y and T, the case with a classical system Y and a quantum T can be contained as a special case by considering the diagonal densities ρY|x. On the other hand, the case with a classical system T is a different problem from the case with a quantum system T because we need to discuss a different minimization problem, which has a different range for the minimizing variable. Fortunately, our algorithm for a quantum system T, presented in the next subsection, can be applied to the case with a classical system T. Section 3 discusses the case of T being classical. We remark that the case where both T and Y are classical has been widely studied in classical information theory and machine learning; see, e.g., Refs. [32, 33, 31, 28].

2.2 QIB algorithm for α=1

The paper [9] discussed this problem when X,Y,T are quantum systems and α=1, extending the classical information bottleneck [32] to the quantum regime. It derived a necessary condition for σX|T to achieve the minimum (4). The necessary condition with quantum systems T,Y and a classical system X is written as

logσT|x= (1β)logσT[σT|X]
βTrYρY|x(logρYlogσYT[σT|X])Cx, (5)

where Cx is a normalizing constant and

ρY:= xPX(x)ρY|x (6)
σT[σT|X]:= xPX(x)σT|x (7)
σYT[σT|X]:= xPX(x)σT|xρY|x. (8)

Since this condition is self-consistent, using this condition, the paper [9] proposed the following iterative algorithm with the following update rule:

σT|x(n+1):= 1eCxexp((1β)logσT[σT|X(n)]
βTrYρY|x(logρYlogσYT[σT|X(n)])). (9)

2.3 The acceleration parameter γ

Next, we propose an extension of the iterative algorithm in [9]. First, we introduce a new parameter γ>0 and rewrite the condition (5) as:

logσT|x=(11γ)logσT|x+1γlogσT|x
= (11γ)logσT|x+1γ(1β)logσT[σT|X]
1γβTrYρY|x(logρYlogσYT[σT|X])1γCx
= logσT|x1γ1[σT|X](x)1γCx, (10)

where

1[σT|X](x)
:= logσT[σT|X]+logσT|x
+βTrY(ρY|x(log(σT[σT|X]ρY)logσYT[σT|X])). (11)

Using (10), we can derive another iterative algorithm as

σT|x(n+1):=1e1γCxexp(logσT|x(n)1γ1[σT|X(n)](x)). (12)

In this way, we can easily generalize the iterative algorithm (9) by [9]. However, it is not trivial to find the suitable value for 1γ, which, as we show later, is critical to the efficiency of our iterative algorithm. Although many papers [32, 31, 9, 24] discussed the iterative algorithm given by (9) including the classical case, no preceding study showed the convergence of the iterative algorithm by (9). In addition, the discussion above focuses on the case of α=1 and does not include the case of deterministic information bottleneck (α=0). Therefore, to make an efficient algorithm, we need to discuss the choice of the parameter γ for generic α.

2.4 QIB algorithm with general α and convergence

To analyze the convergence of the algorithm (12), we introduce a two-input variable function based on the idea in Ref. [22, Section III-B], whereas the method in Ref. [22, Section III-B] was obtained as a generalization of the Arimoto-Blahut algorithm [1, 4]. The idea is that, instead of directly solving the minimization of fα(σT|X), which is often too difficult, we find a continuous function J(σT|X,σT|X) with two variables σT|X,σT|X. Then we can update these two input variables σT|X,σT|X alternately to decrease J(σT|X,σT|X). Finally, if the function satisfies

fα(σT|X)=J(σT|X,σT|X), (13)

the minimum of J(σT|X,σT|X) will be close to the minimum of the IB function.

The above type of functions can be constructed if we find an operator α[σT|X](x) to satisfy

fα(σT|X)= xPX(x)TrTσT|xα[σT|X](x), (14)

In this paper, we employ the following function:

α[σT|X](x)
:= logσT[σT|X]+αlogσT|x
+βTrY(ρY|x(log(σT[σT|X]ρY)logσYT[σT|X])). (15)

Then, the condition (14) is satisfied.

Using this function, we can define J0(σT|X,σT|X):=TrTxσT|xPX(x)α[σT|X](x), which satisfies the condition (13). However, it is difficult to optimize two input variables alternately in the function J0(σT|X,σT|X). Instead, for γ>0, we introduce the following function

Jγ,α(σT|X,σT|X) (16)
:= γD(σT|XσT|X)+xPX(x)TrTσT|xα[σT|X](x), (17)

where D(σT|XσT|X):=xPX(x)D(σT|xσT|x) and D(σT|xσT|x) denotes the relative entropy.

Next, we need to specify the rules of the alternatively updating σT|X,σT|X. Crucially, we need to ensure that Jγ,α(σT|X,σT|X) is non-increasing under the updating rules. To this purpose, we first introduce the following condition:

(A1)

σT|X and σT|X satisfy the relation

γxPX(x)D(σT|xσT|x)
xPX(x)TrTσT|x(α[σT|X](x)α[σT|X](x)). (18)

In fact, the condition (A1) is rewritten as γγ(σT|X,σT|X) by defining γ(σT|X,σT|X) as

γ(σT|X,σT|X)
:= xPX(x)TrTσT|x(α[σT|X](x)α[σT|X](x))xPX(x)D(σT|xσT|x). (19)

This quantity is evaluated as

γ(σT|X,σT|X)α (20)

because the relation

D(ρYT[σT|X]ρYT[σT|X])D(σT[σT|X]σT[σT|X])
= D(σT[σT|X]ρYσT[σT|X]ρY) (21)

implies the relation

xPX(x)TrTσT|x(α[σT|X](x)α[σT|X](x))
= D(σT[σT|X]σT[σT|X])
+αxPX(x)D(σT|xσT|x)
+βD(σT[σT|X]ρYσT[σT|X]ρY)
βD(ρYT[σT|X]ρYT[σT|X])
D(σT[σT|X]σT[σT|X])
+αxPX(x)D(σT|xσT|x)
αxPX(x)D(σT|xσT|x). (22)

To state our updating rules, we define

σ^γ,α,T[σT|X](x):= exp(logσT|x1γα[σT|X](x)) (23)
η^γ,α|x[σT|X]:= Trσ^γ,α,T[σT|X](x) (24)
σ^γ,α,T|x[σT|X]:= 1η^γ,α[σT|X](x)σ^γ,α,T[σT|X](x). (25)

In particular, when γ=α, the operator σ^γ,α,T[σT|X](x) is simplified as

σ^α,T[σT|X](x)
= exp(1βαlogσT[σT|X]
βαTrYρY|x(logρYlogσYT[σT|X])). (26)
Theorem 1

Under the condition (A1), we have

Jγ,α(σT|X,σT|X) Jγ,α(σT|X,σT|X) (27)
Jγ,α(σT|X,σT|X) Jγ,α(σ^γ,α,T|X[σT|X],σT|X). (28)

Proof of Theorem 1:   The condition (A1) yields

Jγ,α(σT|X,σT|X)
= tTrσT|xPX(x)α[σT|X](x)
xTrσT|xPX(x)α[σT|X](x,t)
+γxPX(x)D(σT|xσT|x)
= Jγ,α(σT|X,σT|X). (29)

Hence, we obtain (27).

Also, we have

Jγ,α(σT|X,σT|X)
=(a) γxPX(x)TrσT|x(logσT|xlogσT|x
+1γα[σT|X](x))
=(b) γxPX(x)TrσT|x(logσT|xσ^γ,α,T[σT|X](x))
=(c) γxPX(x)(TrσT|x(logσT|xlogσ^γ,α,T|x[σT|X])
logη^γ,α[σT|X](x))
= γxPX(x)(D(σT|xσ^γ,α,T|x[σT|X]))
γxPX(x)logη^γ,α|x[σT|X], (30)

where (a), (b), and (c) follow from (17), (23), and (25), respectively. Finally, from Eq. (30) we can see that the minimum of Jγ,α(σT|X,σT|X) is achieved when σT|X=σ^γ,α,T|x[σT|X], since the first term of (30) is non-negative (with equality achieved when σT|X=σ^γ,α,T|x[σT|X]) and the second term is independent of σT|X. Hence, we obtain (28).  

Corollary 2

Assume that γsupσT|X,σT|Xγ(σT|X,σT|X). When σT|X is a local minimizer, we have

σ^γ,α,T|x[σT|X]=σT|X, (31)

which is equivalent to (5) when α=1.

When γγ(σ^γ,α,T|X[σT|X],σT|X), the following chain of inequalities hold: fα(σT|X)=Jγ,α(σT|X,σT|X)Jγ,α(σ^γ,α,T|X[σT|X],σT|X)Jγ,α(σ^γ,α,T|X[σT|X],σ^γ,α,T|X[σT|X])=fα(σ^γ,α,T|X[σT|X]). Hence, the monotonicity of the information bottleneck under the updating rules is also guaranteed, as long as γ is sufficiently large. Finally, we propose the following algorithm with a fixed γ and general α:

Algorithm 1 QIB algorithm
1:  Input: A joint state ρXY [as in Eq. (1)].
2:  Randomly choose an initial c-q channel σT|X(1);
3:  Create a counter n as the number of iterations; initialize n to 1.
4:  repeat
5:     Choose σT|X(n+1) as σ^γ,α,T|X[σT|X(n)] [cf. Eqs. (23) and (25)]; set n as n+1.
6:  until convergence.
7:  Output: A c-q channel σT|X(n+1)

As mentioned, when γ satisfies the condition (A1) in all iteration steps, i.e., when γ is sufficiently large, Theorem 1 guarantees the monotonicity of the information bottleneck function:

fα(σT|X(n+1))Jγ,α(σT|X(n+1),σT|X(n))fα(σT|X(n)). (32)

Since fα consists of bounded entropic quantities (assuming the system to be finite), it is a bounded quantity. Therefore, the sequence {fα(σT|X(n))} in our Algorithm converges. In addition, we can show that the sequence of c-q channels {σT|X(n)} converges as well:

Theorem 3

When γsupσT|X,σT|Xγ(σT|X,σT|X), the sequence {σT|X(n)} converges.

In particular, since αsupσT|X,σT|Xγ(σT|X,σT|X), the sequence {σT|X(n)} converges with γ=α.

Proof: Since {fα(σT|X(n))} is monotonically decreasing for n, we have

limnfα(σT|X(n))fα(σT|X(n+1))=0. (33)

Using (30), we have

fα(σT|X(n))=Jγ,α(σT|X(n),σT|X(n))
= γxPX(x)D(σT|x(n)σT|x(n+1))+Jγ,α(σT|X(n+1),σT|X(n))
γxPX(x)D(σT|x(n)σT|x(n+1))+fα(σT|X(n+1)). (34)

Thus, we have

γxPX(x)D(σT|x(n)σT|x(n+1))fα(σT|X(n))fα(σT|X(n+1)). (35)

Since due to (33) and (35), the sequence {σT|X(n)} is a Cauchy sequence, it converges.   

We remark that it is free to choose the convergence criterion in Algorithm 1.

In Algorithm 1, γ is fixed to be a large enough value. Intuitively (see the next paragraph for more detailed discussion), γ (or, more precisely, 1/γ) is an acceleration parameter that makes the algorithm converge faster if chosen to be a smaller value.

To begin with, we show the role of γ in convergence of the algorithm. Denote by σT|X the convergence point of {σT|X(n)}. The performance of our algorithm can be characterized by the decreasing speed of the average divergence between σT|X and σT|X(n), which is evaluated as

xPX(x)D(σT|xσT|x(n))xPX(x)D(σT|xσT|x(n+1))
= xPX(x)TrσT|x(logσT|xlogσT|x(n))
xPX(x)TrσT|x(logσT|xlogσT|x(n+1))
= xPX(x)TrσT|x(logσT|x(n+1)logσT|x(n))
=(a) xPX(x)TrσT|x(1γα[σT|X(n)](x)logη^γ,α[σT|X(n)](x))
=(b) 1γJγ,α(σT|X(n+1),σT|X(n))1γxPX(x)TrσT|xα[σT|X(n)](x)
=(c) 1γ((Jγ,α(σT|X(n+1),σT|X(n))fα(σT|X))
+xPX(x)TrσT|x(α[σT|X](x)α[σT|X(n)](x))), (36)

where (a), (b), and (c) follow from the combination of (23) and (25), (30), and (27), respectively.

The above discussion manifests that if 1γ((Jγ,α(σT|X(n+1),σT|X(n))fα(σT|X))+xPX(x)TrσT|x(α[σT|X](x)α[σT|X(n)](x)))>0, making γ smaller makes the average divergence between σT|X and σT|X(n) decrease faster. On the other hand, making γ too small leads to a risk of violating the condition (18) (and, consequently, breaking the monotonicity of Jγ,α).

Remark 1

The reference [22, Section III] considered a general setting. If σT|X is a single density matrix, our method can be considered as a special case of their setting. However, since σT|X is classical-quantum channel in our case, our analysis is not a special case of their setting.

Remark 2

The references [9, Appendix A] [24, Appendix A] considered the case when the systems X,Y,T are quantum systems and α=1. They derived a necessary condition for the solution of the minimization problem by using Lagrange multiplier method in the same way as [1, 4]. Using the obtained condition, they [9, Appendix C] [24, Appendix C] also proposed an iterative algorithm to find a solution to satisfy the necessary condition. It seems that their necessary condition is the same as (31) with γ=α=1. However, they did not discuss the convergence to a local minimizer in their algorithm.

2.5 Numerics on the effects of different γ

To see the effect of different γ, let us take a look at a concrete example: Consider a single-qubit quantum system Y and a classical register X with size 28. Then, we assume that PX is the uniform distribution over 𝒳={0,,281}, and the density ρY|x is given as ρY|x=ρ(θx,λx), where

ρ(θ,λ) :=exp(iθσx)(1λ00λ)exp(iθσx), (37)

where σx=(0110) is the Pauli-X matrix. The parameters θx and λx are randomly chosen.

Then, the ensemble we consider admits the following joint density matrix:

ρ^XY =xPX(x)|π(x)π(x)|ρ(θx,λx) (38)

with ρ(θx,λx) given by Eq. (37).

Now, we apply our QIB algorithm (i.e., Algorithm 1) to the ensemble (38). We consider a classical T whose size is the square root of |𝒳| (i.e., |𝒯|=24). We set α=1, and β=10. Our focus will be the effects of different choices of the acceleration parameter γ. As shown in Fig. 2, the choice of γ is crucial for the performance, more specifically, the efficiency and the convergence, of the QIB algorithm.

Two interesting phenomena are manifested by our numerics: For one thing, choosing a smaller γ will accelerate the course of convergence. As shown in Fig. 2, by choosing a suitably smaller value of γ (e.g., 0.8 or 0.5), our QIB algorithm achieves convergence faster than the existing QIB algorithm [9, 24], which corresponds to Algorithm 1 with γ=1. For the other, choosing a too small γ will ruin the convergence property of the QIB algorithm. For instance, when γ is chosen to be 0.4, fα jumps up after a few iterations and ends up in a much larger value than its initial value.

Refer to caption
Figure 2: Performance of Algorithm 1 for different γ. We apply Algorithm 1 (|𝒯|=16, α=1, and β=10) to the joint state (38). The information bottleneck fα is plotted as a function of the number of iterations for different values of γ. The green curve with γ=0.55 converges most quickly. It significantly improves the convergence speed in comparison with the black line with γ=1. The blue curve with γ=0.45 goes down even faster in the beginning but gets overtaken after a few iterations. Finally, it goes up around n=7. It shows that γ=0.45 does not satisfy the condition (A1) for n7.

In conclusion, the numerics has justified our theoretical analysis (see Section 2.4) on the importance of choosing a suitable γ. We emphasize that our contribution in this direction is twofold:

  1. 1.

    We proposed a method of accelerating the QIB algorithm, making it converge within fewer rounds of iteration, by introducing a new parameter γ and setting it to be smaller than one.

  2. 2.

    We showed that the QIB algorithm cannot achieve the desired minimal value of fα if γ is too small.

2.6 Choice of β

The output of our QIB algorithms depend not only on ρXY [cf. (1)] but also on the choice of α and β. Intuitively, a larger β improves the faithfulness (as it makes I(Y:T) more significant in fα), while a smaller β leads to more compression (as it makes I(X:T) more significant in fα). Somehow surprisingly, the choice of β is not completely free: In the following, we show that the QIB algorithm will yield a trivial σT|X if β is too small.

To consider the relation between the choice of β and the resultant information on T, we introduce the following condition for a subset 𝒮𝒮XT, where 𝒮XT is the the set of all c-q channels from X to T, i.e., the set {σT|X=(σT|x)x𝒳}:

(A2)

For any two distinct elements σT|X,σT|X𝒮, xPX(x)TrTσT|x(α[σT|X](x)α[σT|X](x))>0

The condition (A2) is unitarily invariant, i.e., the pair (σT|X,σT|X) satisfies the condition (A2), if and only if the pair (UσT|XU,UσT|XU) satisfies the condition (A2) for any unitary U on T. Hence, we choose 𝒮 as a unitarily invariant subset.

Theorem 4

Assume that a unitarily invariant subset 𝒮 satisfies (A2). Let σT|XM:=argminσT|Xfα(σT|X) be the solution to the QIB problem. When σT|XM belongs to 𝒮, σT|xM is the maximally mixed state on T for any x.

If σT|xM is the maximally mixed state for every x, T is uncorrelated with Y and does not contain any meaningful information. In other words, when the assumption for Theorem 4 holds, the solution of the QIB problem is not useful. Hence, we need to choose the parameters α,β such that condition (A2) does not hold.

Now we discuss how to avoid the condition (A2). The LHS of (A2) is evaluated as

xPX(x)TrTσT|x(α[σT|X](x)α[σT|X](x))
= TrTYxPX(x)(σT|xρY|x)((logσT[σT|X]logσT[σT|X])+α(logσT|xlogσT|x)
+β((log(σT[σT|X]ρY)log(σT[σT|X]ρY))(logσYT[σT|X]logσYT[σT|X])))
= TrTYxPX(x)(σT|xρY|x)((logσT[σT|X]logσT[σT|X])+α(logPX(x)σT|xlogPX(x)σT|x)
+β((log(σT[σT|X]ρY)log(σT[σT|X]ρY))(logσYT[σT|X]logσYT[σT|X])))
= D(σT[σT|X]σT[σT|X])+αD(σXT[σT|X]σXT[σT|X])
β(D(σYT[σT|X]σYT[σT|X])D(σT[σT|X]σT[σT|X])), (39)

where σXT[σT|X]:=xPX(x)σT|x[σT|X]|xx|. Since D(σYT[σT|X]σYT[σT|X])D(σT[σT|X]σT[σT|X]), the coefficient of β is a negative value. Hence, a smaller β has a possibility to satisfy the condition (A2). That is, to obtain a useful solution, we need to choose β to be a sufficiently large value.

Proof of Theorem 4:   Let U be an arbitrary unitary on 𝒯. We define σT|XM by σT|xM=UσT|xMU. Substituting σT|x(n) with σT|xM in (36), we have

0= xPX(x)D(σT|xMσT|xM)
xPX(x)D(σT|xMσ^γ,α,T|x[σT|XM])
= 1γ(fα(σT|XM)fα(σT|XM))
+1γxPX(x)TrσT|xM(α[σT|XM](x)α[σT|XM](x)) (40)
= 1γxPX(x)TrσT|xM(α[σT|XM](x)α[σT|XM](x)). (41)

Thus, the condition (A2) implies σT|XM=σT|XM. σT|xM is the completely mixed state on T for any x.  

3 Classical system T

Next, we consider the case when T is constrained to be a classical system. We stress that this is a different minimization from the previously discussed one with a quantum system T, whose minimum may not be attainable with a classical T. Instead, our objective function now is

α,βc:=minσT|X:diagonalfα(σT|X). (42)

Therefore, we need to re-examine the validity of our previous analyses.

Let us start with the form of QIB algorithm. Fortunately, our algorithm with a quantum system T can be applied to this case, simply with the adaptation that the states σT|x are limited to diagonal density matrices with respect to the basis {|t} of T. Under this condition, the states σ^γ,α,T|x[σT|X] are also diagonal density matrices. Therefore, when we set the initial state as diagonal density matrices, Algorithm 1 works for this case.

The above discussion leads to an interesting observation as follows. The convergent σT|X with initial diagonal σT|X satisfies the condition (10) and it is also diagonal. That is, if the minimum with classical T is strictly larger than the minimum with quantum T, the minimum with classical T is an example for the following statement: A solution of the condition (10) does not necessarily give the minimum of fα with quantum T. This fact shows the possible risk that a solution to (10) might be a saddle point or a local minimum rather than the global minimum for fα with quantum T.

When the states σT|x are limited to diagonal density matrices with respect to the basis {|t} of T, σTY[σT|X] is commutative with σT[σT|X] so that we can define σY|T[σT|X]:=σTY[σT|X]σT[σT|X]1. Then, σ^γ,α,T[σT|X](x) is simplified as follows.

logσ^γ,α,T[σT|X](x)
= (1αγ)logσT|x+1γlogσT[σT|X]
βγTrY(ρY|x(logρYlogσY|T[σT|X])). (43)

The notion of unitary invariance is reduced to invariance under permutations on T, and the condition (A2) is invariant under permutations on T. Then, Theorem 4 can be rewritten as follows.

Theorem 5

Assume that a subset 𝒮 satisfies (A2) and is invariant under any permutation on T. Let σT|X be the minimizer of minσT|X:diagonalfα(σT|X). When σT|X belongs to 𝒮, σT|x is the uniform distribution over T for any x.

Theorem 5 can be shown in the same way as Theorem 4.

In this case, we can make a more precise discussion for the condition (A2). For this purpose, we consider the maximum ratio

κ:=maxQX,QXD(xQX(x)ρY|xxQX(x)ρY|x)D(QXQX). (44)

The inequality κ1 follows from the information processing inequality for the map QXxQX(x)ρY|x. In this condition, σT[σT|X] is written as tQT[σT|X](t)|tt| by using a distribution QT[σT|X]. Then, the LHS of (A2) is simplified as

xPX(x)TrTσT|x(α[σT|X](x)α[σT|X](x))
= (β1)D(σT[σT|X]σT[σT|X])
+αD(σXT[σT|X]σXT[σT|X])
βD(σYT[σT|X]σYT[σT|X])
= (α1)D(σT[σT|X]σT[σT|X])
+tQT[σT|X](t)(αD(σX|T=t[σT|X]σX|T=t[σT|X])
βD(σY|T=t[σT|X]σY|T=t[σT|X]))
(α1)D(σT[σT|X]σT[σT|X])
+(αβκ)tQT[σT|X](t)
D(σX|T=t[σT|X]σX|T=t[σT|X]). (45)

When the condition α1,ακ>β holds, the LHS of (A2) is positive for σT|XσT|X. Hence, to extract useful σT|X, we need to choose β to satisfy the condition β>ακ with α=1. In fact, even when β>ακ, there is a possibility that a permutation-invariant subset 𝒮 satisfies (A2). Due to Theorem 5, when a permutation-invariant subset 𝒮 satisfies (A2), a useful solution does not belong to the subset 𝒮. Hence, to obtain a useful solution, we need to choose β sufficiently large beyond the above condition β>ακ with α=1.

Remark 3

We consider the case with classical Y and γ=α. The operator σ^α,T[σT|X](x) is simplified as follows.

σ^α,T[σT|X](x)
= exp(1αlogσT[σT|X]
βαTrY(ρY|x(logρYlogσY|T[σT|X]))). (46)

In this case, the reference [31, (14) Section 3] proposed the following update rule:

τ^T|x[σT|X]:=1Trτ^T[σT|X](x)τ^T[σT|X](x), (47)

where the operator τ^T[σT|X](x) is defined as

τ^T[σT|X](x)
:= exp(1αlogσT[σT|X]
βαTrY(ρY|x(logρY|xlogσY|T[σT|X]))). (48)

Since

logτ^T[σT|X](x)logσ^T[σT|X](x)=βαD(ρY|xρY), (49)

we have

τ^T|x[σT|X]
= 1TreβαD(ρY|xρY)σ^T[σT|X](x)eβαD(ρY|xρY)σ^T[σT|X](x)
= σ^T|x[σT|X]. (50)

That is, the update rule (47) by [31, (14) Section 3] is the same as ours of this special case. In particular, the update rule (47) with α=1 coincides with the update rule by the reference [32].

Remark 4

When the system Y is classical and α=1, the reference [9, Appendix B] claimed that there is no difference between the optimal value with quantum T and the optimal value with classical T. Since their algorithm works with T of a fixed size, it can be considered that they claimed the above statement when the size of T is fixed. However, their proof (see [9, Appendix B II]) contains a gap: The statement under Eq. (B23) that “the Lagrangian is invariant under a measurement of the memory M in a chosen basis |m” is not backed by a rigorous mathematical proof. It is thus unclear whether this statement and, consequently, the claim that there is no quantum advantage are correct. On the other hand, as we show next, the optimal value with quantum T can be strictly smaller than the optimal value with classical T. That is, the claim in [9, Appendix B] contradicts with our result of the next section.

4 Quantum advantage for T

To see the advantage of quantum system T over classical system T, we discuss several examples with the strict inequality

α,β<α,βc. (51)

We provide an analytical example in this section and a numerical example with application in quantum machine learning in Section 5.2 when the size of the system T is fixed. Generally, to achieve the optimal performance, we need to choose the system T as a sufficiently large dimensional system. However, in this section, to provide analytical examples, we fix the size of the system T to a certain value.

Assume that 𝒴 is a classical system of size d. The size of 𝒳 is k times of the size d of 𝒴. We assume that 𝒳 is given as 𝒳1×𝒳2 with 𝒳1=𝒴 and |𝒳2|=k. The distribution of X is assumed to be uniform. We focus on the quantum system T with the dimension n<d.

Lemma 6

When β1 and βα, we have

α,β=(1β)logn (52)

Proof: First, we show a bound on the QIB for generic (quantum) T. For any σT|x, we have H(T)I(T:X)I(T:Y). Hence, the relation βα0 implies (βα)I(T:Y)(βα)H(T). Hence, we have

fα(σT|x)=(1α)H(T)+αI(T:X)βI(T:Y)
(1α)H(T)(βα)I(T:Y)(1β)H(T). (53)

Since H(T)logn and 1β0, we obtain

α,β(1β)logn. (54)

The above bound is tight. Indeed, we choose σT|x1,x2 as the pure state t=1n1ne2πx1ni|t. Then, we have H(T)=logn. Also, H(T)=I(T:X)=I(T:Y). Therefore, fα(σT|x)=(1β)logn.   

Next, we focus on the case when T is a classical system of dimension n<d.

Lemma 7

Assume that d=mn+l with 0l<n. When β1α, we have

α,βc=(1β)(l(m+1)dlogdm+1+(nl)mdlogdm) (55)

Proof: Any channel σT|x can be written as a probabilistic mixture of deterministic channels σT|xj. That is, we have

σT|x=jpjσT|xj. (56)

Since Y is independent of X2 and the random variable J describing the choice of j, we have

I(T:Y|JX2)= I(T:Y|JX2)+I(Y:JX2)
= I(TJX2:Y)I(T:Y). (57)

Also, we have

H(T)H(T|JX2). (58)

Then, we have

fα(σT|x)(a)(1α)H(T)(βα)I(T:Y)
(b) (1α)H(T|JX2)(βα)I(T:Y|JX2), (59)

where (a) follows from (53), and (b) follows from (57) and (58). The minimization of (1α)H(T|JX2)(βα)I(T:Y|JX2) equals the minimization of the same function under the condition that σT|X is a deterministic channel and σT|x1x2 depends only on x1.

Under this condition, we have I(T:X)=I(T:X1)=I(T:Y), which implies the equality in (a) at (59). Therefore, for the minimization, we can impose this condition, i.e., the variable T is determined only by X1=Y, which implies I(T:Y)=H(T). In this case, we have fα(σT|x)=(1β)H(T). In the classical case, the maximum entropy H(T) among deterministic channels is achieved when the distribution (PT(t))t=1n as close as possible to the uniform distribution, i.e., PT=(m+1d,,m+1dl,md,,mdnl). Hence, the maximum entropy H(T) is l(m+1)dlogdm+1+(nl)mdlogdm. Therefore, we obtain the desired statement.   

When the conditions of Lemma 7 hold, d cannot be divided by n. In this case, since l(m+1)dlogdm+1+(nl)mdlogdm is strictly smaller than logn, when the state ρXY is close to the state x1d|x,xx,x|, the strict inequality (51) holds. There is clearly an advantage of using a quantum T.

5 Quantum feature maps with QIB

5.1 Information bottleneck in supervised learning

Supervised learning is a cornerstone of machine learning. Given a dataset {(x,y)} sampled from an unknown probability distribution PXY, a general supervised learning task is to find a classifier such that, for any testing data (x,y) sampled from the same distribution PXC, it predicts the label y with as high accuracy as possible given x.

Remarkably, recent studies [33, 28, 8] on the information bottleneck theory showed evidences that the training phase of deep learning can be divided into two stages. In the first stage, a representation T of X that faithfully encodes its correlation with Y is found, featured by increasing I(T:Y). In the second stage, the size of T is compressed, featured by decreasing I(T:X). This result suggests that finding an efficient and compressed representation of X facilitates data classification.

Refer to caption
Figure 3: Data classification with quantum feature maps. The flowchart illustrates the training phase and the testing phase of data classification using the technique of quantum feature maps. The part where our QIB algorithm is applied is highlighted.

5.2 Quantum feature maps

Following the above intuition, we propose a classical-quantum hybrid algorithm of data classification, by combining the QIB algorithm with the kernel method. The idea is illustrated in the flowchart in Fig. 3. Given a training dataset 𝒮train, the algorithm first identifies an efficient representation T of X by minimising the information bottleneck fα:=H(T)αH(T|X)βI(T:Y). Then a classifier is constructed that yields a prediction Y^ based on the state in T corresponding to the value of X. For simplicity, we consider for now the case when Y{1,1} is binary. In the first step, we set the representation T to be a quantum state ρ(x) that depends on the data x, and we obtain ρ(x) via Algorithm 1. In the second step, we use a linear classifier

cQIB(ρ(x~))=sgn(Tr[Aρ(x~)]+b) (60)

where A is a Hermitian operator and b. We further consider A that can be expressed as a linear combination A=x:(x,y)𝒮trainaxρ(x), and the classifier has the reduced form

cQIB(ρ(x~))=sgn(x:(x,c)𝒮trainaxK(x,x~)+b), (61)

where K(x,x~) is the kernel function, in our case given by the Hilbert-Schmidt (HS) inner product of quantum states and can be evaluated by performing the SWAP test on a quantum computer:

K(x,y)=Tr{ρ(x)ρ(y)}. (62)

The algorithm is summarised as follows:

Algorithm 2 QIB for data classification
  input: A training data set 𝒮train={(x,y)}; configuration (α,β,γ).
  input: A classifier cQIB:XY^.
  1) Generate an empirical distribution P^(x,y) from 𝒮train.
  2) Run Algorithm 1 with P^(x,y) as input and certain (adjustable) parameters α,β,γ.
  3) Compute the kernel K in Eq. (61) using the output of Step 2).
  4) Train the classifier (61) with 𝒮train and output the trained classifier.

We remark that the quantum kernel method, where a mapping xρ(x) is constructed for better classification, has been a hot topic recently (see, e.g., [26, 11, 5, 17, 20, 25]). The key distinction between existing works and our present method is the following: In existing works, the parameter x is passed to a parametrised (a.k.a. variational) quantum circuit that prepares the state ρ(x). One needs to train the circuit parameters on a quantum computer to obtain a good mapping xρ(x), which is called a feature map. In the near term, this method might be subject to the physical limitations of quantum devices. In contrast, in our present method ρ(x) is directly computed via a simple iterative algorithm. Therefore, there are two possible ways of realizing our present method, i.e., Algorithm 2. In the near term, we can regard Algorithm 2 as a “quantum-inspired” classical algorithm, and evaluate everything on a classical computer. When large-scale quantum computing becomes feasible, Algorithm 2 can be readily “quantised”. Indeed, the evaluation of ρ(x) in each iteration requires subroutines that compute matrix powers and logarithm and solve linear systems, which have already been developed in Refs. [10, 19, 18, 7].

5.3 Numerical experiments

Refer to caption
Refer to caption
Figure 4: Quantum vs classical feature maps. We run Algorithm 1 with α=γ=1, β=15 on the distribution P~XY based on the training data, and compare the converging values of QIB when T is classical (i.e., a probabilistic bit) and when T is quantum (i.e., a single qubit). This numerics shows the advantage of use of quantum T over classical T. The final feature maps with quantum T (plotted in red) and with the classical-T (plotted in blue) are visualised in the Bloch ball.

As a proof-of-principle experiment, we tested the performance of our QIB classifier on a dataset on 2, generated in the following way: First, we define the discrete sets 𝒳=𝒳1×𝒳2 and 𝒴, with 𝒳1=𝒴={0,1,2} and 𝒳2={0,1,,9}. To apply our classification method, we arbitrarily choose permutation π, and generate n=400 independent and identically distributed data (X~1,i,X~2,i,Yi) for i=1,,n as follows. We independently generate (X1,i,X2,i,Yi) according to the following distribution

PXY(x1,x2,y):=PY(y)QX1|Y(x1,y)QX2|X1(x2,x1), (63)

where PY is the uniform distribution over Y, QX1|Y(x1,y)=δ(x1,y), QX2|X1(x2,x1)=δ(x1,x2)+1|𝒳2|+1, and (x1,x2)=π(x1,x2). Next, we generate the random variables X~j,i:=Xj,i+Rj,i, where the random variable Rj,i is subject to the uniform distribution in the interval [0,1.2) unless i=1,Xi=2 nor i=2,Xi=9, it is subject to the uniform distribution in the interval [0,1) otherwise. Then, using the obtained data (X~1,i,X~2,i,Yi) with i=1,,n, we define its empirical distribution P~XY. We apply Algorithm 1 to the distribution P~XY as Fig. 4. In the case with the distribution P~XY, Algorithm 1 with quantum T can realize a smaller fα than Algorithm 1 with classical T, which shows the advantage of quantum T over classical T.

In the classification experiment, 50% of the data are used as the training set and the rest are used as the testing set. The kernel is constructed with Algorithm 2 with α=1,β=15,γ=1, a single-qubit register T, and 10 iterations. We consider both when T is a generic qubit system and when T is restricted to a binary classical system, and we compare their performance. As can be seen from Fig. 4, the case of quantum T has lower IB value than the case of classical T. The final feature map σT|X for the quantum T case suffers from certain degree of dispersion due to the random noise r1,r2, but the quantum features still form 3 clusters. In contrast, the final σT|X in the classical T case maps different values of X into two clusters.

The effect of the above distinction is made apparent in the classification performance. In Fig. 5, the performance of the classifiers constructed from the kernels are illustrated via their decision regions. It can be seen that, since the classical-T feature map groups X into two clusters, its resultant classifier gives a binary prediction on any input data, giving up the least possible label. In stark contrast, the quantum-T feature map utilizes the full Bloch ball to generate 3 clusters, leading to a much higher accuracy of prediction. The advantage of a genuinely quantum feature map is thus manifested by this numerical example.

For reference, in Fig. 5, we also plot the performance of two standard methods of classical feature maps. The referential methods (linear kernel and polynomial kernel) achieve accuracies (defined by the ratio of correct predictions in the testing set) 0.64 and 0.62, which is slightly higher than the classical-T information bottleneck kernel (0.565) but much lower than the QIB kernel (0.92). This further justifies the superior performance of our QIB method in classification.

Refer to caption
Figure 5: Decision regions of the QIB classifier and reference classifiers. The decision regions of the QIB classifier, the classical-T IB classifier, and two reference classifiers are plotted together with the test data. The different dot colors correspond to data with different labels, and the color of each region corresponds to the prediction made by the classifier for data in that region.

6 Quantum deterministic information bottleneck (QDIB)

Considering the limit α+0, the paper [31] proposed deterministic IB, which minimize f0. Now, we consider this minimization with quantum systems T,Y and classical system X. First, we define

σ^0,T|x[σT|X]
:= 1TrσT|xPT|x[σT|X]PT|x[σT|X]σT|xPT|x[σT|X], (64)

where PT|x[σT|X] is the projection to the maximum eigenvalue of the operator (1β)logσT[σT|X]+βTrYρY|x(logσYT[σT|X]logρY).

Given an initial point σT|X(1), we propose the following update rule

σT|X(n+1):=σ^0,T|X[σT|X(n)]. (65)

As shown below, each step of this algorithm improves the value of the target function f0.

The operator σ^0,T|x[σT|X] is characterized as

σ^0,T|x[σT|X]=limα0σ^α,α,T|x[σT|X]. (66)

Since Theorem 1 and (20) guarantee

fα(σ^α,α,T|X[σT|X])
= Jα,α(σ^α,α,T|X[σT|X],σ^α,α,T|X[σT|X])
Jα,α(σ^α,α,T|X[σT|X],σT|X)
Jα,α(σT|X,σT|X)=fα(σT|X), (67)

the limit α0 in (67) implies

fα0(σ^0,T|X[σT|X])fα0(σT|X]), (68)

which shows that each step of this algorithm improves the value of the target function fDIB:=fα0.

Algorithm 3 Quantum deterministic information bottleneck (QDIB) algorithm
1:  Input: A joint state ρXY [cf. (1)].
2:  Create a counter n as the number of iterations, initialized to 1.
3:  repeat
4:     Choose σT|X(n+1) as
σT|x(n+1)=PT|x[σT|X(n)]σT|x(n)PT|x[σT|X(n)]Tr(σT|x(n)PT|x[σT|X]) (69)
where PT|x[σT|X(n)] is the projection on the space spanned by the eigenvectors of α=0[σT|X(n)](x) [cf. (2.4)] corresponding to the minimum eigenvalue.
5:     Set n as n+1.
6:  until convergence.
7:  Output: A c-q channel σT|X(n+1)

7 Approximate sufficient statistics from DIB

7.1 Task formulation

Next, we discuss how DIB can be used for the extraction of useful information under a classical-quantum (c-q) joint system composed of X and Y with the joint state ρXY:=xPX(x)|xx|ρY|x, where X is a classical system and Y is a quantum system. For example, assume that our interest is in the quantum phenomena in the quantum system Y. This quantum system Y is correlated to the classical system X. However, there is a possibility that the classical system X contains redundant information. In this case, it is useful to extract essential information from X to describe the behavior of the quantum phenomena in the quantum system Y. To discuss the essential information, we introduce the concept of ϵ-(approximate) sufficient statistics of the classical system X with respect to the quantum system Y while the papers [36, 12] discussed this concept when system Y is a classical system.

A function f from X to T is called a sufficient statistics of X for the quantum system Y when there exists a conditional distribution PX|T such that

ρXY=tPX|T(x|t)|xx|xf1(t)PX(x)ρY|x. (70)

The above condition is equivalent to the condition

I(X:Y)=I(T:Y) (71)

while in general we have the inequality I(X:Y)I(T:Y).

However, when we use sufficient statistics, we cannot remove a small correlation generated by a noise. As an example, suppose that the classical system X is composed of two classical systems X1 and X2. Assume that we have a c-q state ρX1X2Y=x1x2PX1,X2(x1,x2)|x1,x2x1,x2|ρY|x1 with two classical systems X1 and X2.

We assume that we have already known the distribution PX1X2 but we do not know ρY|x. Also, we assume that we generate this state several times and apply the state estimation to the generated state. As a result, we obtain our estimate

ρ^X1X2Y=x1x2PX1X2(x1,x2)|x1,x2x1,x2|ρ^Y|x1,x2. (72)

Since our estimate always has small error, ρ^Y|x1,x2 is not exactly the same as ρY|x1, but it is close to ρY|x1. In this case, this difference should be considered as a noise. That is, the dependence of X2 is not essential. It is better to consider that the correlation is given as ρ^Y|x1:=x2PX2|X1(x2|x1)ρ^Y|x1,x2 so that our estimate of ρX1X2Y is given as x1x2PX1,X2(x1,x2)|x1,x2x1,x2|ρ^Y|x1.

For ϵ>0, a function f:XT is called an ϵ-sufficient statistics when the inequality

I(X:Y)ϵI(T:Y) (73)

holds. Hence, a sufficient statistics with T of small size and an ϵ-sufficient statistics can be considered as compressed data of X with respect to Y.

In the above example, X1X2 is a sufficient statistics for Y. When δ is sufficiently small for ϵ, I(X1:Y) is close to I(X1X2:Y), i.e., X1 is an ϵ-sufficient statistics. Hence, we can remove non-essential information X2. In fact, if 𝒳=𝒳1×𝒳2 is disturbed by a random permutation π, it will be non-trivial to extract essential information. To cover such a non-trivial case, we need a systematic approach to find such a function with a small-size T. For this aim, we can use the information bottleneck algorithm.

To extract approximate sufficient statistics T, we focus on two requirements. The mutual information I(T:Y) should be larger, and the entropy H(T) should be smaller. To handle these requirements, we simply minimize H(T)βI(T:Y) by using deterministic information bottleneck algorithm with |𝒯|=|𝒳|. Since the algorithm minimizes H(T)βI(T:Y), and the conditional distribution PT|X in the solution is deterministic, the support of PT in the solution is expected to be smaller than the original set 𝒯.

7.2 Numerics

Refer to caption
Figure 6: Bloch representation of the estimated ensemble {ρ(θx1,x2,λx1,x2)}. As can be seen in the figure, the qubit states, especially those with higher purity, form several clusters in the Bloch ball. In each cluster, the states have the same value of x1 and different values of x2. This shows that the correlation between X1 and Y is higher than the correlation between X2 and Y.

To demonstrate the above idea, let us take a look at a concrete example, which is a modification of the example in Section 2.5. Consider a single-qubit quantum system Y and a classical register X that encodes information about Y. The register X is further split into two sub-registers X1 and X2 that take values in the sets 𝒳1={0,1,,4} and 𝒳2={0,1,,19}. Then, we assume that PX is the uniform distribution over 𝒳1×𝒳2, and the density ρY|x1 is given as ρ(θx1,λx1) with (37). The parameters θ and λ depend on x1 as

θx1 :=πx1|𝒳1|λx1:=x14|𝒳1|. (74)

Obviously, the quantum system depends only on X1 and X2 contains no information about the quantum system. An experimentalist who has access to the ensemble, however, does not know this. To extract information about the quantum system, for each pair of (x1,x2), the experimentalist estimates its density matrix by repetitively (for ν< times) making a suitable measurement on ρ(θx1,λx1). According to quantum state estimation theory [15, 13], the estimate has an inaccuracy proportional to 1/ν. Taking this into account, we model the estimated density matrix as ρ(θx1,x2,λx1,x2) when the actual density matrix is ρ(θx1,λx1), where

θx1,x2 :=πx1|𝒳1|(1+rν(x1,x2)) (75)
λx1,x2 :=x14|𝒳1|(1+rν(x1,x2)) (76)

and rν(x1,x2),rν(x1,x2)=O(1/ν) characterise the estimation errors. The estimated ensemble then admits the density matrix given in (72) with ρ^Y|x1,x2=ρ(θx1,x2,λx1,x2), which is given by Eqs. (37), (75), and (76). Notice that now the register X2 is correlated with Y in the estimated joint state ρ^XY, even if the estimation-induced noise follows a distribution that does not depend on the value of X2.

Now, the task is to compress the register X, by constructing a map from X to a smaller classical register T. Here we take T to be the same size as X. One intuitive approach is to discard the X2 register because X1 contains much more information about the qubit state than X2. Nevertheless, such a simple map does not exist in more general cases. For instance, if the values of (x1,x2) in Eq. (72) are permuted, discarding X2 will not result in faithful compression. To see this, we further apply a arbitrary chosen unknown reshuffling π:𝒳𝒳 to the classical register 𝒳=𝒳1×𝒳2 in Eq. (72). The ensemble then admits the following joint density matrix:

ρ^XY= x1,x2(PX(x1,x2)|π(x1,x2)π(x1,x2)|
ρ(θx1,x2,λx1,x2)) (77)

with ρ(θx1,x2,λx1,x2) given by Eqs. (75) and (76). The goal is to extract an approximate sufficient statistics by constructing a map Q:𝒳𝒯.

Refer to caption
Refer to caption
Figure 7: Performance of QDIB algorithm in constructing approximate sufficient statistics. We apply our quantum deterministic information bottleneck (QDIB) algorithm on the state (77) (see also Fig. 6). For the joint state, we choose |𝒳1|=5 and |𝒳2|=20, and PX to be the uniform distribution over 𝒳=𝒳1×𝒳2. The noise rν(x1,x2) and rν(x1,x2) are drawn randomly and uniformly from the interval (1/ν,1/ν) with ν=20 for any x1𝒳1,x2𝒳2. In the QDIB algorithm (Algorithm 3), we choose β=20 and |𝒯|=|𝒳|=100. In the figure above, the information bottleneck fDIB:=fα0 is plotted as a function of the number of iterations. As can be seen from the plot, the QDIB value of our algorithm becomes lower than that of the fictional protocol of “discarding X2 after the inverse permutation π1” after only 3 iterations. In the figure below, the faithfulness I(T:Y) is plotted as a function of the number of iterations, and I(X1:Y) after the inverse permutation π1 (corresponding to the performance of the fictional protocol of “discarding X2 after the inverse permutation π1”) as well as I(X:Y) (corresponding to the upper bound of I(T:Y)) are plotted for reference. Both plots justify that our QDIB algorithm performs well in the task of constructing approximate sufficient statistics.

Our QDIB algorithm works as a more systematic and more efficient method to extract essential information and discard non-essential information, even in the presence of an arbitrary permutation. In the QDIB algorithm (Algorithm 3), we choose β=20 and |𝒯|=|𝒳|=|𝒳1||𝒳2|. First, we consider the case when the ensemble admits the form (72), and the performance is summarised in Fig. 7. As one can see from the numerics, fDIB:=fα0 of applying our QDIB algorithm to ρ^XY drops lower than that of the “discarding X2 after the inverse permutation π1” approach within 5 iterations, and converges to a much lower value, suggesting a better compression performance. This is further justified in the second plot, where the faithfulness I(T:Y) and the residual information I(T:X) are plotted. We can see that since our QDIB algorithm preserves almost as much information about Y as the original variable X, it compresses a considerably larger portion of information about the original register X.

8 Discussion and conclusion

We have proposed a generalized algorithm for QIB with an acceleration parameter γ and an additional parameter α, and have derived a necessary condition for the monotonic decrease of the objective function fα=H(T)αH(T|X)βI(T:Y) with quantum systems Y,T and classical system X when we extract information T with respect to Y from X. We have also showed its convergence under the same condition and that a wisely-chosen parameter γ can accelerate the convergence. Our numerical calculation has further justified the above analysis as follows. In our numerical experiment, making γ smaller accelerates the convergence, but if γ is made smaller than a threshold the algorithm will fail to converge. In addition, we have provided examples that quantum system T have an advantage over classical system T even when Y and X are classical.

Next, taking the limit α+0, we have proposed an iterative algorithm for QDIB that minimizes the objective function fDIB=H(T)βI(T:Y). We have shown that this iterative algorithm always decreases the objective function monotonically. QDIB can be applied to find an approximate sufficient statistics because it realizes a smaller entropy H(T) and a larger mutual information I(T:Y). Then, we have numerically demonstrated that our QDIB algorithm works well as an approximate sufficient statistics.

An important application we show in this work is that our QIB algorithm yields a new approach of constructing quantum feature maps for classification. In our numerical example, quantum system T realizes a smaller value of the objective function than classical system T. This numerical analysis shows the advantage of using quantum memory T for the classification. Despite significant recent progress [34, 27, 3, 26, 11, 5, 17, 20, 25], the advantage of quantum machine learning over its classical counterpart has not been discussed much. Our work provides a new angle of attacking this issue, shedding light on a new proposal to rigorously justify and quantify quantum supremacy in the world of learning.

An open question left for future study is how to extend our result to the case where X is also a quantum system, which covers, for instance,the scenario of compressing a quantum system while keeping its correlation with a classical label [21, 23, 35, 36, 37, 38]. Remarkably, in such a scenario, it has been shown that, if T is classical, some correlation will be lost regardless of its size [37]. Therefore, we anticipate that the advantage of a quantum T might persist or grow even stronger for QIB with a quantum X.

Finally, we remark that currently there is no efficient method to compute the restriction on γ in Theorem 3. Resolving this important issue in a future work will accelerate the convergence of our information bottleneck algorithm.

Acknowledgement

MH is supported in part by the National Natural Science Foundation of China (Grant No. 62171212) and Guangdong Provincial Key Laboratory (Grant No. 2019B121203002). YY is supported by Guangdong Basic and Applied Basic Research Foundation (Project No. 2022A1515010340), and by the Hong Kong Research Grant Council (RGC) through the Early Career Scheme (ECS) grant 27310822.

References

  • Arimoto [1972] S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972. doi: 10.1109/TIT.1972.1054753.
  • Banchi et al. [2021] Leonardo Banchi, Jason Pereira, and Stefano Pirandola. Generalization in quantum machine learning: A quantum information standpoint. PRX Quantum, 2:040321, Nov 2021. doi: 10.1103/PRXQuantum.2.040321.
  • Biamonte et al. [2017] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195–202, 2017. doi: 10.1038/nature23474.
  • Blahut [1972] R. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972. doi: 10.1109/TIT.1972.1054855.
  • Blank et al. [2020] Carsten Blank, Daniel K Park, June-Koo Kevin Rhee, and Francesco Petruccione. Quantum classifier with tailored quantum kernel. npj Quantum Information, 6(1):1–7, 2020. doi: 10.1038/s41534-020-0272-6.
  • Datta et al. [2019] Nilanjana Datta, Christoph Hirche, and Andreas Winter. Convexity and operational interpretation of the quantum information bottleneck function. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1157–1161, 2019. doi: 10.1109/ISIT.2019.8849518.
  • Gilyén et al. [2019] András Gilyén, Yuan Su, Guang Hao Low, and Nathan Wiebe. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 193–204, 2019. doi: 10.1145/3313276.3316366.
  • Goldfeld and Polyanskiy [2020] Ziv Goldfeld and Yury Polyanskiy. The information bottleneck problem and its applications in machine learning. IEEE Journal on Selected Areas in Information Theory, 1(1):19–38, 2020. doi: 10.1109/JSAIT.2020.2991561.
  • Grimsmo and Still [2016] Arne L. Grimsmo and Susanne Still. Quantum predictive filtering. Phys. Rev. A, 94:012338, Jul 2016. doi: 10.1103/PhysRevA.94.012338.
  • Harrow et al. [2009] Aram W Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Physical review letters, 103(15):150502, 2009. doi: 10.1103/PhysRevLett.103.150502.
  • Havlíček et al. [2019] Vojtěch Havlíček, Antonio D Córcoles, Kristan Temme, Aram W Harrow, Abhinav Kandala, Jerry M Chow, and Jay M Gambetta. Supervised learning with quantum-enhanced feature spaces. Nature, 567(7747):209–212, 2019. doi: 10.1038/s41586-019-0980-2.
  • Hayashi and Tan [2018] Masahito Hayashi and Vincent Y. F. Tan. Minimum rates of approximate sufficient statistics. IEEE Transactions on Information Theory, 64(2):875–888, 2018. doi: 10.1109/TIT.2017.2775612.
  • Helstrom [1969] Carl W Helstrom. Quantum detection and estimation theory. Journal of Statistical Physics, 1(2):231–252, 1969. doi: 10.1007/BF01007479.
  • Hirche and Winter [2020] Christoph Hirche and Andreas Winter. An alphabet-size bound for the information bottleneck function. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2383–2388, 2020. doi: 10.1109/ISIT44484.2020.9174416.
  • Holevo [2011] Alexander S Holevo. Probabilistic and statistical aspects of quantum theory, volume 1. Springer Science & Business Media, 2011. doi: 10.1007/978-88-7642-378-9.
  • Hsu et al. [2006] Winston H. Hsu, Lyndon S. Kennedy, and Shih-Fu Chang. Video search reranking via information bottleneck principle. MM ’06, pages 35–44, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595934472. doi: 10.1145/1180639.1180654.
  • Lloyd et al. [2020] Seth Lloyd, Maria Schuld, Aroosa Ijaz, Josh Izaac, and Nathan Killoran. Quantum embeddings for machine learning. arXiv preprint arXiv:2001.03622, 2020. doi: 10.48550/arXiv.2001.03622.
  • Low and Chuang [2017] Guang Hao Low and Isaac L Chuang. Hamiltonian simulation by uniform spectral amplification. arXiv preprint arXiv:1707.05391, 2017. doi: 10.48550/arXiv.1707.05391.
  • Low and Chuang [2019] Guang Hao Low and Isaac L Chuang. Hamiltonian simulation by qubitization. Quantum, 3:163, 2019. doi: 10.22331/q-2019-07-12-163.
  • Pérez-Salinas et al. [2020] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster, and José I Latorre. Data re-uploading for a universal quantum classifier. Quantum, 4:226, 2020. doi: 10.22331/q-2020-02-06-226.
  • Plesch and Bužek [2010] Martin Plesch and Vladimír Bužek. Efficient compression of quantum information. Physical Review A, 81(3):032317, 2010. doi: 10.1103/PhysRevA.81.032317.
  • Ramakrishnan et al. [2021] Navneeth Ramakrishnan, Raban Iten, Volkher B. Scholz, and Mario Berta. Computing quantum channel capacities. IEEE Transactions on Information Theory, 67(2):946–960, 2021. doi: 10.1109/TIT.2020.3034471.
  • Rozema et al. [2014] Lee A Rozema, Dylan H Mahler, Alex Hayat, Peter S Turner, and Aephraim M Steinberg. Quantum data compression of a qubit ensemble. Physical Review Letters, 113(16):160504, 2014. doi: 10.1103/PhysRevLett.113.160504.
  • Salek et al. [2019] Sina Salek, Daniela Cadamuro, Philipp Kammerlander, and Karoline Wiesner. Quantum rate-distortion coding of relevant information. IEEE Transactions on Information Theory, 65(4):2603–2613, 2019. doi: 10.1109/TIT.2018.2878412.
  • Schuld [2021] Maria Schuld. Supervised quantum machine learning models are kernel methods. arXiv preprint arXiv:2101.11020, 2021. doi: 10.48550/arXiv.2101.11020.
  • Schuld and Killoran [2019] Maria Schuld and Nathan Killoran. Quantum machine learning in feature Hilbert spaces. Physical Review Letters, 122(4):040504, 2019. doi: 10.1103/PhysRevLett.122.040504.
  • Schuld et al. [2015] Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. An introduction to quantum machine learning. Contemporary Physics, 56(2):172–185, 2015. doi: 10.1080/00107514.2014.964942.
  • Shwartz-Ziv and Tishby [2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017. doi: 10.48550/arXiv.1703.00810.
  • Slonim and Tishby [2000] Noam Slonim and Naftali Tishby. Document clustering using word clusters via the information bottleneck method. SIGIR ’00, pages 208–215, New York, NY, USA, 2000. Association for Computing Machinery. ISBN 1581132263. doi: 10.1145/345508.345578.
  • Stark et al. [2018] Maximilian Stark, Aizaz Shah, and Gerhard Bauch. Polar code construction using the information bottleneck method. In 2018 IEEE Wireless Communications and Networking Conference Workshops (WCNCW), pages 7–12, 2018. doi: 10.1109/WCNCW.2018.8368978.
  • Strouse and Schwab [2017] DJ Strouse and David J. Schwab. The Deterministic Information Bottleneck. Neural Computation, 29(6):1611–1630, 06 2017. ISSN 0899-7667. doi: 10.1162/NECO_a_00961.
  • Tishby et al. [1999] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In The 37th annual Allerton Conference on Communication, Control, and Computing, pages 368–377. Univ. Illinois Press, 1999. doi: 10.48550/arXiv.physics/0004057.
  • Tishby and Zaslavsky [2015] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE information theory workshop (ITW), pages 1–5. IEEE, 2015. doi: 10.1109/ITW.2015.7133169.
  • Wittek [2014] Peter Wittek. Quantum machine learning: what quantum computing means to data mining. Academic Press, 2014. doi: 10.1016/C2013-0-19170-2.
  • Yang et al. [2016a] Yuxiang Yang, Giulio Chiribella, and Daniel Ebler. Efficient quantum compression for ensembles of identically prepared mixed states. Physical Review Letters, 116(8):080501, 2016a. doi: 10.1103/PhysRevLett.116.080501.
  • Yang et al. [2016b] Yuxiang Yang, Giulio Chiribella, and Masahito Hayashi. Optimal compression for identically prepared qubit states. Phys. Rev. Lett., 117:090502, Aug 2016b. doi: 10.1103/PhysRevLett.117.090502.
  • Yang et al. [2018a] Yuxiang Yang, Ge Bai, Giulio Chiribella, and Masahito Hayashi. Compression for quantum population coding. IEEE Transactions on Information Theory, 64(7):4766–4783, 2018a. doi: 10.1109/TIT.2017.2788407.
  • Yang et al. [2018b] Yuxiang Yang, Giulio Chiribella, and Masahito Hayashi. Quantum stopwatch: how to store time in a quantum memory. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213):20170773, 2018b. doi: 10.1098/rspa.2017.0773.