Talk 6

Noisy-Channel Coding: I (Chapter 8 & 9)

2020-03-20 - Talk 6 - Saurabh B

Image Credit: Wolfram


The title image illustrates how the introduction of noise in a channel can be compensated for by introducing redundancy, for example by sending multiple copies of every bit.


Last few talks . . .

  • We looked at various data compression techniques

  • We mostly considered random input vectors drawn from simple probability distributions

  • In most cases each component was independent of the others

Motivation

  • Better compression for real-world data, which often has interesting correlations

  • A noisy channel with input $x$ and output $y$ can be viewed as a joint ensemble in which $x$ and $y$ are dependent

  • Information content in the context of noisy channels

We need to know how to work with models that include dependences.

If $x$ and $y$ were independent, it would be impossible to communicate over the channel.

  • “Steve is very shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality.
  • A meek and tidy soul, he has a need for order and structure, and a passion for detail.”
  • Is Steve more likely to be a librarian or a farmer?

Solution


Dependent Random Variables

  • Entropy

$$H(X) = \sum_{x \in A_X} P(x) \log_2 \frac{1}{P(x)}$$

  • Joint entropy

$$H(X,Y) = \sum_{xy \in A_X A_Y} P(x,y) \log_2 \frac{1}{P(x,y)}$$

  • For independent random variables

$$H(X,Y) = H(X) + H(Y) \iff P(x,y) = P(x)P(y)$$
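To make these definitions concrete, here is a minimal Python sketch (not from the talk; the joint distribution `P_xy` is an arbitrary made-up example) that computes $H(X)$, $H(Y)$ and $H(X,Y)$:

```python
import math

# Hypothetical joint distribution P(x, y) over A_X = {0, 1}, A_Y = {0, 1}
# (arbitrary numbers chosen so that X and Y are dependent).
P_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(P, axis):
    """Marginalize the joint distribution onto one of its two variables."""
    m = {}
    for pair, p in P.items():
        m[pair[axis]] = m.get(pair[axis], 0.0) + p
    return m

def entropy(P):
    """H = sum_z P(z) log2 1/P(z), skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in P.values() if p > 0)

P_x, P_y = marginal(P_xy, 0), marginal(P_xy, 1)
print("H(X)   =", entropy(P_x))    # 1.0
print("H(Y)   =", entropy(P_y))    # 1.0
print("H(X,Y) =", entropy(P_xy))   # ≈ 1.72 < H(X) + H(Y), since X and Y are dependent
```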

Dependent Random Variables

  • Conditional entropy of $X$ given $y = b_k$

$$H(X \mid y=b_k) = \sum_{x \in A_X} P(x \mid y=b_k) \log_2 \frac{1}{P(x \mid y=b_k)}$$

  • Conditional entropy of $X$ given $Y$

$$H(X \mid Y) = \sum_{y \in A_Y} P(y) \left[ \sum_{x \in A_X} P(x \mid y) \log_2 \frac{1}{P(x \mid y)} \right]$$

Dependent Random Variables

  • Conditional entropy of $X$ given $Y$, written as a single sum over the joint distribution:

$$H(X \mid Y) = \sum_{xy \in A_X A_Y} P(x,y) \log_2 \frac{1}{P(x \mid y)}$$
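A similar sketch for the conditional entropy, computing $H(X \mid Y)$ both per outcome of $y$ and directly from the joint sum (same made-up joint distribution as above); the two forms agree:

```python
import math

# Same hypothetical joint distribution as in the previous sketch.
P_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
P_y = {0: 0.5, 1: 0.5}  # marginal distribution of Y

def entropy(P):
    return sum(p * math.log2(1.0 / p) for p in P.values() if p > 0)

def cond_entropy_per_y(P_xy, P_y):
    """H(X|Y) = sum_y P(y) * H(X | y = b_k)."""
    h = 0.0
    for y, py in P_y.items():
        P_x_given_y = {x: p / py for (x, yy), p in P_xy.items() if yy == y}
        h += py * entropy(P_x_given_y)
    return h

def cond_entropy_joint(P_xy, P_y):
    """H(X|Y) = sum_{x,y} P(x,y) log2 1/P(x|y), with P(x|y) = P(x,y)/P(y)."""
    return sum(p * math.log2(P_y[y] / p) for (x, y), p in P_xy.items() if p > 0)

print(cond_entropy_per_y(P_xy, P_y))   # ≈ 0.72
print(cond_entropy_joint(P_xy, P_y))   # same value: the two forms agree
```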

Dependent Random Variables

  • Chain rule for conditional entropy

$$H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$$

Proof:

$$H(X,Y) = \sum_{xy \in A_X A_Y} P(x,y) \log_2 \frac{1}{P(x,y)}$$

$$H(X,Y) = \sum_{xy \in A_X A_Y} P(x,y) \log_2 \frac{1}{P(x \mid y) P(y)}$$

Proof continued . . .

$$H(X,Y) = -\sum_{xy \in A_X A_Y} P(x,y) \log_2 P(x \mid y) P(y)$$

$$H(X,Y) = -\sum_{xy} P(x,y) \log_2 P(x \mid y) - \sum_{xy} P(x,y) \log_2 P(y)$$

$$H(X,Y) = -\sum_{xy} P(x,y) \log_2 P(x \mid y) - \sum_{y} \sum_{x} P(x,y) \log_2 P(y)$$

$$H(X,Y) = -\sum_{xy} P(x,y) \log_2 P(x \mid y) - \sum_{y} P(y) \log_2 P(y)$$

$$H(X,Y) = H(X \mid Y) + H(Y) \qquad \text{QED}$$

  • Note that in general $H(X \mid Y) \ne H(Y \mid X)$
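As a quick numerical sanity check of the chain rule, here is a sketch using the same toy joint distribution as before (arbitrary numbers, not from the talk):

```python
import math

P_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # same toy example

def entropy(P):
    return sum(p * math.log2(1.0 / p) for p in P.values() if p > 0)

def marginal(P, axis):
    m = {}
    for pair, p in P.items():
        m[pair[axis]] = m.get(pair[axis], 0.0) + p
    return m

P_x, P_y = marginal(P_xy, 0), marginal(P_xy, 1)
H_xy = entropy(P_xy)
H_x_given_y = sum(p * math.log2(P_y[y] / p) for (x, y), p in P_xy.items() if p > 0)
H_y_given_x = sum(p * math.log2(P_x[x] / p) for (x, y), p in P_xy.items() if p > 0)

# H(X,Y) = H(Y) + H(X|Y) = H(X) + H(Y|X), up to floating-point rounding.
assert abs(H_xy - (entropy(P_y) + H_x_given_y)) < 1e-12
assert abs(H_xy - (entropy(P_x) + H_y_given_x)) < 1e-12
print(H_xy, H_x_given_y, H_y_given_x)
```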

Mutual information between X and Y

  • Mutual information measures the information $X$ contains about $Y$, and vice versa

  • Relative entropy (KL divergence) between $P(x)$ and $Q(x)$:

$$D_{KL}(P \| Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}$$

  • So mutual information $I(X;Y)$ can be understood as the relative entropy between the joint distribution and the product of the marginal distributions

$$I(X;Y) = D_{KL}\big(P(x,y) \,\|\, P(x)P(y)\big)$$

Mutual information is the decrease in uncertainty about the outcome of one random variable given the outcome of the other.
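A short sketch of the KL-divergence view of mutual information, again using the made-up toy joint distribution (the resulting value is specific to that example):

```python
import math

P_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # same toy example
P_x = {0: 0.5, 1: 0.5}
P_y = {0: 0.5, 1: 0.5}

def kl(P, Q):
    """D_KL(P || Q) = sum_z P(z) log2 P(z)/Q(z)."""
    return sum(p * math.log2(p / Q[z]) for z, p in P.items() if p > 0)

# Product of marginals, defined over the same (x, y) pairs as the joint.
P_prod = {(x, y): P_x[x] * P_y[y] for (x, y) in P_xy}

I_xy = kl(P_xy, P_prod)   # I(X;Y) = D_KL( P(x,y) || P(x)P(y) )
print(I_xy)               # ≈ 0.28 bits for this made-up example
```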

Mutual information between X and Y

$$I(X;Y) = \sum_{xy \in A_X A_Y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)}$$

$$I(X;Y) = \sum_{xy \in A_X A_Y} P(x,y) \log_2 \frac{P(x \mid y) P(y)}{P(x)P(y)}$$

$$I(X;Y) = \sum_{xy \in A_X A_Y} P(x,y) \log_2 \frac{P(x \mid y)}{P(x)}$$

Proof continued . . .

$$I(X;Y) = \sum_{xy} P(x,y) \log_2 P(x \mid y) - \sum_{xy} P(x,y) \log_2 P(x)$$

$$I(X;Y) = \sum_{xy} P(x,y) \log_2 P(x \mid y) - \sum_{x} P(x) \log_2 P(x)$$

$$I(X;Y) = -\sum_{x} P(x) \log_2 P(x) + \sum_{xy} P(x,y) \log_2 P(x \mid y)$$

$$I(X;Y) = \sum_{x} P(x) \log_2 \frac{1}{P(x)} - \sum_{xy} P(x,y) \log_2 \frac{1}{P(x \mid y)}$$

$$I(X;Y) = H(X) - H(X \mid Y) \qquad \text{QED}$$

$$I(X;Y) = I(Y;X)$$

Mutual information is symmetric.
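A quick numerical check (same toy joint distribution as before) that the two expansions give the same value, i.e. that mutual information is symmetric:

```python
import math

P_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # same toy example

def entropy(P):
    return sum(p * math.log2(1.0 / p) for p in P.values() if p > 0)

P_x = {x: sum(p for (xx, _), p in P_xy.items() if xx == x) for x in (0, 1)}
P_y = {y: sum(p for (_, yy), p in P_xy.items() if yy == y) for y in (0, 1)}

H_x_given_y = sum(p * math.log2(P_y[y] / p) for (x, y), p in P_xy.items() if p > 0)
H_y_given_x = sum(p * math.log2(P_x[x] / p) for (x, y), p in P_xy.items() if p > 0)

I_from_x = entropy(P_x) - H_x_given_y   # H(X) - H(X|Y)
I_from_y = entropy(P_y) - H_y_given_x   # H(Y) - H(Y|X)
print(I_from_x, I_from_y)               # equal values: I(X;Y) = I(Y;X)
```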

Information measure

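This slide presumably shows a diagram relating the quantities defined so far; as a compact summary, the standard identities (consistent with the derivations above) are:

```latex
% Relationships between the information measures (standard identities)
\begin{align*}
H(X,Y) &= H(X) + H(Y \mid X) = H(Y) + H(X \mid Y) \\
I(X;Y) &= H(X) - H(X \mid Y) \\
       &= H(Y) - H(Y \mid X) \\
       &= H(X) + H(Y) - H(X,Y)
\end{align*}
```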

Noisy channels

  • A discrete memoryless channel $Q$ is characterized by an input alphabet $A_X$, an output alphabet $A_Y$, and transition probabilities $P(y \mid x)$

  • $Q$ defines the conditional probability $P(y \mid x)$; if we choose an input distribution $P(x)$ we then have a joint distribution

$$P(x,y) = P(y \mid x) P(x)$$

  • We can use Bayes' theorem to infer the input given the output

$$P(x \mid y) = \frac{P(y \mid x) P(x)}{P(y)}$$
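A minimal sketch of this inference step; the dict-of-dicts representation of the channel is just an assumption for illustration, not anything from the talk:

```python
# A discrete memoryless channel as a table Q[x][y] = P(y|x)
# (this representation is an assumption for illustration).

def posterior(Q, P_x, y):
    """P(x|y) = P(y|x) P(x) / P(y), with P(y) = sum_x P(y|x) P(x)."""
    P_y = sum(Q[x][y] * px for x, px in P_x.items())
    return {x: Q[x][y] * px / P_y for x, px in P_x.items()}

# Example: binary symmetric channel with flip probability f = 0.15.
f = 0.15
Q_bsc = {0: {0: 1 - f, 1: f}, 1: {0: f, 1: 1 - f}}
print(posterior(Q_bsc, {0: 0.9, 1: 0.1}, y=0))   # ≈ {0: 0.981, 1: 0.019}
```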

1. BSC

  • $A_X = \{0, 1\}$
  • $A_Y = \{0, 1\}$

Image Credit: Wikimedia

Example.

  • Consider a BSC with $f = 0.15$, $p_0 = 0.9$, $p_1 = 0.1$; compute $P(x=1 \mid y=0)$

$$P(x=1 \mid y=0) = \frac{P(y=0 \mid x=1) P(x=1)}{P(y=0)}$$

$$P(x=1 \mid y=0) = \frac{0.15 \times 0.1}{0.9 \times 0.85 + 0.1 \times 0.15}$$

$$P(x=1 \mid y=0) = \frac{0.015}{0.78} \approx 0.019$$

  • From the above we know that $P(y=0) = 0.78$ and $P(y=1) = 0.22$; let's look at the mutual information

$$I(X;Y) = H(Y) - H(Y \mid X)$$

$$I(X;Y) = H_2(0.22) - H_2(0.15) = 0.76 - 0.61 = 0.15 \text{ bits}$$

where $H_2(p) = p \log_2 \frac{1}{p} + (1-p) \log_2 \frac{1}{1-p}$
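The slide's numbers can be reproduced with a few lines of Python (a sketch, using the binary entropy function $H_2$ defined above):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

f, p0, p1 = 0.15, 0.9, 0.1

# Posterior P(x=1 | y=0) via Bayes' theorem.
P_y0 = p0 * (1 - f) + p1 * f          # = 0.78
print(p1 * f / P_y0)                  # ≈ 0.019

# Mutual information I(X;Y) = H(Y) - H(Y|X) for this input distribution.
P_y1 = 1 - P_y0                       # = 0.22
print(H2(P_y1) - H2(f))               # ≈ 0.76 - 0.61 = 0.15 bits
```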

Capacity of a channel Q

$$C(Q) = \max_{P_X} I(X;Y)$$

  • The distribution $P_X$ that achieves the maximum is called the optimal input distribution, denoted by $P_X^*$

  • Eventually we will learn that $C(Q)$ does indeed measure the maximum amount of error-free information that can be transmitted over the channel per unit time

  • If we look at the BSC again with $f = 0.15$, $p_0 = 0.9$, $p_1 = 0.1$, the mutual information was

$$I(X;Y) = H_2(0.22) - H_2(0.15) = 0.76 - 0.61 = 0.15 \text{ bits}$$

  • The optimal input distribution is $p_0 = 0.5$, $p_1 = 0.5$; keeping the same noise level we get

$$C(Q_{BSC}) = H_2(0.5) - H_2(0.15) = 1 - 0.61 = 0.39 \text{ bits}$$

There may be multiple optimal input distributions achieving the same value of I(X;Y)
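A sketch that recovers this capacity numerically by sweeping the input distribution (the grid search is only for illustration; for the BSC the optimum at $p_1 = 0.5$ also follows from symmetry):

```python
import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def I_bsc(p1, f):
    """I(X;Y) = H(Y) - H(Y|X) for a BSC with input distribution (1 - p1, p1)."""
    P_y1 = p1 * (1 - f) + (1 - p1) * f
    return H2(P_y1) - H2(f)

f = 0.15
best_p1 = max((i / 1000 for i in range(1001)), key=lambda p1: I_bsc(p1, f))
print(best_p1, I_bsc(best_p1, f))   # ≈ 0.5, 0.39 bits (= 1 - H2(0.15))
```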

2. BEC

  • $A_X = \{0, 1\}$
  • $A_Y = \{0, ?, 1\}$

Image Credit: Wikimedia

Example.

  • Let's look at the capacity of a BEC with general erasure probability $p_e = f$

$$I(X;Y) = H(X) - H(X \mid Y)$$

  • We can again use a symmetry argument and take the $(0.5, 0.5)$ input distribution, so $H(X) = H_2(0.5) = 1$

  • As for $H(X \mid Y)$, we are uncertain about $x$ given $y$ only when $y$ is $?$

  • The above has probability $f/2 + f/2 = f$, so $H(X \mid Y) = f H_2(0.5) = f$

  • So $C = I(X;Y) = H_2(0.5) - f H_2(0.5) = 1 - f$

Note that here the mutual information is written in terms of $X$ rather than $Y$; for the BSC we used $I(X;Y) = H(Y) - H(Y \mid X)$.

  • The binary erasure channel fails a fraction $f$ of the time

  • Its capacity is precisely $1 - f$, which is the fraction of the time that the channel is reliable

  • This can also be done without assuming symmetry first

  • Consider an input distribution $(p_0, p_1)$

  • The probability that $y$ is $?$ is $p_0 f + p_1 f = f$, so $H(X \mid Y) = f H_2(p_1)$

  • $I(X;Y) = H_2(p_1) - f H_2(p_1) = (1-f) H_2(p_1)$

  • $I(X;Y)$ achieves its maximum when $p_1 = 0.5$ (see the sketch below)
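As referenced above, a sketch that computes $I(X;Y)$ for the BEC directly from the joint distribution and compares it with the $(1-f) H_2(p_1)$ formula (the '?' string is just a label for the erasure output):

```python
import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def mi_from_joint(P_xy):
    """I(X;Y) = sum_{x,y} P(x,y) log2 [ P(x,y) / (P(x)P(y)) ]."""
    P_x, P_y = {}, {}
    for (x, y), p in P_xy.items():
        P_x[x] = P_x.get(x, 0.0) + p
        P_y[y] = P_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (P_x[x] * P_y[y])) for (x, y), p in P_xy.items() if p > 0)

f, p1 = 0.15, 0.5
p0 = 1 - p1
# Joint distribution for the BEC: y ranges over {0, '?', 1}.
P_xy = {(0, 0): p0 * (1 - f), (0, '?'): p0 * f, (1, '?'): p1 * f, (1, 1): p1 * (1 - f)}
print(mi_from_joint(P_xy))      # ≈ 0.85
print((1 - f) * H2(p1))         # matches; equals 1 - f at p1 = 0.5
```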

3. Z Channel

  • $A_X = \{0, 1\}$
  • $A_Y = \{0, 1\}$

(Transition diagram: $0 \to 0$ with probability $1$; $1 \to 0$ with probability $p_e$; $1 \to 1$ with probability $1 - p_e$.)

Example.

  • Consider the Z channel with $p_e = f = 0.15$ and input distribution $(p_0, p_1)$

$$P(y=1) = p_1 (1-f)$$

$$I(X;Y) = H(Y) - H(Y \mid X)$$

$$I(X;Y) = H_2\big(p_1(1-f)\big) - \big(p_0 H_2(0) + p_1 H_2(f)\big)$$

$$I(X;Y) = H_2\big(p_1(1-f)\big) - p_1 H_2(f)$$

$$p_1^* = \frac{1/(1-f)}{1 + 2^{H_2(f)/(1-f)}}$$

  • For $f = 0.15$ we get $p_1^* = 0.445$ to maximize the mutual information

  • With this input distribution, $C(Q_Z) = 0.685$ bits

This means we can communicate slightly more information by using input symbol 0 more frequently than 1.
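A sketch reproducing the Z-channel numbers, using the closed-form optimal $p_1$ from the slide and cross-checking with a simple grid search:

```python
import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def I_z(p1, f):
    """I(X;Y) = H2(p1*(1-f)) - p1*H2(f) for the Z channel."""
    return H2(p1 * (1 - f)) - p1 * H2(f)

f = 0.15
p1_star = (1 / (1 - f)) / (1 + 2 ** (H2(f) / (1 - f)))   # closed-form optimum from the slide
print(p1_star)              # ≈ 0.445
print(I_z(p1_star, f))      # ≈ 0.685 bits = C(Q_Z)

# Cross-check with a simple grid search over p1.
best = max((i / 1000 for i in range(1001)), key=lambda p: I_z(p, f))
print(best, I_z(best, f))   # ≈ 0.445, 0.685
```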

4. Noisy typewriter

  • $A_X = A_Y = \{A, B, \ldots, Z, -\}$ (27 symbols)
  • The letters are arranged circularly; each input is received as itself or one of its two neighbours, with probability $1/3$ each

Example

  • From the video lecture we all (should) know that the capacity of this channel is

$$C = \max_{P_X} I(X;Y)$$

$$C = H(Y) - H(Y \mid X) = \log_2(27) - \log_2(3) = \log_2(9) \text{ bits}$$

  • To make sure we have actually understood it, let's have a quick and easy quiz

  • What happens if we only have 26 letters, i.e. A, B, ..., Z, and exclude the last symbol, -?

  • What probability distribution can we use? What's the capacity of this new channel?

Answer

$$C = H(Y) - H(Y \mid X) = \log_2(26) - \log_2(2) = \log_2(13) \text{ bits}$$
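The two capacities, as a one-line check of the arithmetic on these slides:

```python
import math

# 27-symbol noisy typewriter: H(Y|X) = log2(3); a uniform input gives H(Y) = log2(27).
C_27 = math.log2(27) - math.log2(3)    # = log2(9)  ≈ 3.17 bits
# 26-letter variant from the quiz: the answer slide uses H(Y|X) = log2(2).
C_26 = math.log2(26) - math.log2(2)    # = log2(13) ≈ 3.70 bits
print(C_27, C_26)
```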

Intuition for noisy channel coding theorem

  • For a channel $Q$, there is a non-negative number $C$ (the channel capacity) with the following property

  • For any $\epsilon > 0$ and $R < C$, for large enough $N$, there exists a block code of length $N$ and rate $\ge R$ and a decoding algorithm, such that

  • the maximal probability of block error is $< \epsilon$

Example

  • For the noisy typewriter we can define an input distribution such that only every third letter is used

  • These letters form a non-confusable subset of the input alphabet; this is a block code of length N = 1

  • The error-free information rate of this system is $\log_2(9)$ bits

Intuition for noisy channel coding theorem (continued)

  • If we use any channel a large number of times (say $N$) in sequence, the resulting extended channel behaves much like the noisy typewriter channel

  • What applies to the noisy typewriter channel applies to noisy channels in general

  • Consider the extended BSC: each block code has length $N$ and the total input space $X$ has size $2^N$

  • For the BSC the output alphabet is the same as the input alphabet, and each input can map to any number of outputs $y$

  • But for large $N$, there will be a typical set of outputs of size about $2^{N H(Y)}$

  • For each typical input sequence $x$, the set of typical outputs has size about $2^{N H(Y \mid X)}$
  • We can choose an input distribution $P_X$ so that the typical output sets for different inputs are non-confusable

  • The number of almost certainly non-confusable inputs is at most $\dfrac{2^{N H(Y)}}{2^{N H(Y \mid X)}}$

$$2^{N(H(Y) - H(Y \mid X))} = 2^{N I(X;Y)} \le 2^{NC}$$

  • We can communicate $C$ bits per cycle, error free! (using one of the non-confusable inputs; see the sketch below)
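As referenced above, a small sketch of the counting argument for a BSC with $f = 0.15$ and an illustrative block length (the value $N = 1000$ is arbitrary):

```python
import math

def H2(p):
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

f, N = 0.15, 1000                 # BSC noise level and an illustrative block length
H_Y = H2(0.5)                     # uniform input => uniform output, H(Y) = 1 bit
H_Y_given_X = H2(f)               # ≈ 0.61 bits
C = H_Y - H_Y_given_X             # ≈ 0.39 bits per channel use

# About 2^(N*H(Y)) typical outputs, 2^(N*H(Y|X)) typical outputs per input,
# so at most 2^(N*(H(Y) - H(Y|X))) = 2^(N*C) non-confusable inputs.
print("non-confusable inputs ~ 2^%.0f out of 2^%d possible inputs" % (N * C, N))
```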

Interesting Problem for discussion: Exercise 9.19


Thank You!


Talk notes are available on GitHub: talk-06

Credit & References

  1. Monty Hall Simulator: http://chrisarasin.com/monty-hall-problem/
  2. Bayes theorem animation: https://github.com/nskobelevs/nskobelevs.github.io
