class: theme
layout: true

---

class: center, middle

# Talk 6
## Noisy-Channel Coding: I (Chapters 8 & 9)

![:scale 50%](intro.gif)

2020-03-20 - Talk 6 - Saurabh B

.footnote[.left[[Image Credit: Wolfram](https://demonstrations.wolfram.com/ShannonsNoisyChannelCodingTheorem)]]

???

Illustrates how the introduction of noise in a channel can be compensated for by introducing redundancy, e.g. by sending multiple copies of every bit.

---

class: middle

### Last few talks . . .

* We looked at various data compression techniques

--

* Mostly considered random input vectors drawn from simple probability distributions

--

* Mostly each component was independent of the others

---

class: middle

### Motivation

* Better compression for real-world data, which often has interesting correlations

???

We need to know how to work with models that include dependences

--

* A noisy channel with input $x$ and output $y$ can be viewed as a joint ensemble in which $x$ and $y$ are dependent

???

If they were independent, it would be impossible to communicate over the channel

--

* Information content in the context of noisy channels

---

class: center
???

http://chrisarasin.com/monty-hall-problem/

---

class: center
???

* Steve is very shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality.
* A meek and tidy soul, he has a need for order and structure, and a passion for detail.
* Is Steve more likely to be a librarian or a farmer?

---

class: middle, center

### Solution

![:scale 80%](monty_hall_sol.png)

---

class: middle

### Dependent Random Variables

* Entropy

$$ H(X) = \sum_{x \in A_x} P(x) \log_2 \frac{1}{P(x)} $$

--

* Joint entropy

$$ H(X,Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{1}{P(x,y)} $$

--

* For independent random variables

$$ H(X,Y) = H(X) + H(Y) \iff P(x,y) = P(x)P(y) $$

---

class: middle

### Dependent Random Variables

* Conditional entropy of $X$ given $y = b_k$

$$ H(X | y = b_k) = \sum_{x \in A_x} P(x|y=b_k) \log_2 \frac{1}{P(x | y=b_k)} $$

--

* Conditional entropy of $X$ given $Y$

$$ H(X | Y) = \sum_{y \in A_y} P(y) \left[ \sum_x P(x|y) \log_2 \frac{1}{P(x|y)} \right] $$

---

class: middle

### Dependent Random Variables

* Conditional entropy of $X$ given $Y$

$$ H(X | Y) = \sum_{xy \in A_xA_y} P(x, y) \log_2 \frac{1}{P(x | y)} $$

---

class: middle

### Dependent Random Variables

* Chain rule of conditional entropy

$$ H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) $$

--

Proof:

$$ H(X,Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{1}{P(x,y)} $$

--

$$ H(X,Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{1}{P(x|y)P(y)} $$

---

#### Proof continued . . .

$$ H(X,Y) = - \sum_{xy \in A_xA_y} P(x,y) \log_2 {P(x|y)P(y)} $$

--

$$ H(X,Y) = - \sum_x \sum_y P(x,y) \log_2 {P(x|y)} - \sum_x \sum_y P(x,y) \log_2 {P(y)} $$

--

$$ H(X,Y) = - \sum_x \sum_y P(x,y) \log_2 {P(x|y)} - \sum_y \sum_x P(x,y) \log_2 {P(y)} $$

--

$$ H(X,Y) = - \sum_x \sum_y P(x,y) \log_2 {P(x|y)} - \sum_y P(y) \log_2 {P(y)} $$

--

$$ H(X,Y) = H(X|Y) + H(Y) $$

.right[QED]

--

* Note that in general: $H(X|Y) \neq H(Y|X)$

---

class: middle

### Mutual information between $X$ and $Y$

* Mutual information measures the information $X$ contains about $Y$, or vice versa

???

The decrease in uncertainty about the outcome of one random variable given the outcome of another

--

* Relative entropy or KL divergence between $P(x)$ and $Q(x)$

$$ D_{KL}(P \parallel Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)} $$

--

* So mutual information $I(X;Y)$ can be understood as the relative entropy between the joint distribution and the product of the marginal distributions

--

$$ I(X;Y) = D_{KL}(P(x,y) \parallel P(x)P(y)) $$

---

class: middle

### Mutual information between $X$ and $Y$

$$ I(X;Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)} $$

--

$$ I(X;Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{P(x|y)P(y)}{P(x)P(y)} $$

--

$$ I(X;Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{P(x|y)}{P(x)} $$

---

#### Proof continued . . .

$$ I(X;Y) = \sum P(x,y) \log_2 {P(x|y)} - \sum P(x,y) \log_2 {P(x)} $$

--

$$ I(X;Y) = \sum P(x,y) \log_2 {P(x|y)} - \sum P(x) \log_2 {P(x)} $$

--

$$ I(X;Y) = - \sum P(x) \log_2 {P(x)} + \sum P(x,y) \log_2 {P(x|y)} $$

--

$$ I(X;Y) = \sum P(x) \log_2 \frac{1}{P(x)} - \sum P(x,y) \log_2 \frac{1}{P(x|y)} $$

--

$$ I(X;Y) = H(X) - H(X|Y) $$

.right[QED]

--

$$ I(X;Y) = I(Y;X) $$

???

Mutual information is symmetric

---

class: middle, center

### Information measure

![:scale 80%](information_measure.jpg)

---
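class: middle

#### Quick numerical check (sketch)

A minimal sketch, not from the book: the 2x2 joint distribution below is made up, purely to check the chain rule and $I(X;Y) = H(X) - H(X|Y)$ numerically with `numpy`.

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array (0 log 0 taken as 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Arbitrary example joint distribution P(x, y): rows index x, columns index y.
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
P_x = P_xy.sum(axis=1)   # marginal P(x)
P_y = P_xy.sum(axis=0)   # marginal P(y)

# H(X|Y) from its definition: sum over y of P(y) * H(X | y).
H_X_given_Y = sum(py * H(P_xy[:, j] / py) for j, py in enumerate(P_y))

assert abs(H(P_xy) - (H(P_y) + H_X_given_Y)) < 1e-12   # chain rule
I_XY = H(P_x) - H_X_given_Y                             # mutual information

# I(X;Y) as D_KL(P(x,y) || P(x)P(y)) should agree.
I_KL = float((P_xy * np.log2(P_xy / np.outer(P_x, P_y))).sum())
print(f"I(X;Y) = {I_XY:.3f} bits (= {I_KL:.3f} bits via KL)")
```

---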
class: middle

### Noisy channels

* A discrete memoryless channel $Q$ is characterized by its input alphabet $A_x$, its output alphabet $A_y$, and the transition probabilities $P(y|x)$

--

* $Q$ defines the conditional probabilities $P(y|x)$; if we choose an input distribution $P(x)$, we then have a joint distribution

$$P(x, y) = P(y|x)P(x)$$

--

* We can use Bayes' theorem to infer the input given the output

$$P(x | y) = \frac{P(y | x) P(x)}{P(y)}$$

---

class: middle

#### 1. BSC

* $A_x = \{0, 1\}$
* $A_y = \{0, 1\}$

.center[![:scale 30%](bsc.png)]

.footnote[.left[[Image Credit: Wikimedia](https://commons.wikimedia.org/)]]

---

class: middle

#### Example.

* Consider a BSC with $f = 0.15, p_0 = 0.9, p_1 = 0.1$, and compute $P(x=1|y=0)$

--

$$P(x=1|y=0) = \frac{P(y=0|x=1)P(x=1)}{P(y=0)}$$

--

$$P(x=1|y=0) = \frac{0.15 \times 0.1}{0.9 \times 0.85 + 0.1 \times 0.15}$$

--

$$P(x=1|y=0) = \frac{0.015}{0.78} = 0.019$$

--

* From the above we know that $P(y=0) = 0.78, P(y=1) = 0.22$; let's look at the mutual information

--

$$ I(X;Y) = H(Y) - H(Y|X) $$

--

$$ I(X;Y) = H_2(0.22) - H_2(0.15) = 0.76 - 0.61 = 0.15 \text{ bits} $$

$$ H_2(p) = p \log_2 \frac{1}{p} + (1-p) \log_2 \frac{1}{(1-p)} $$

---

class: middle

### Capacity of a channel $Q$

$$C(Q) = \max_{P_x} I(X;Y) $$

- The distribution $P_x$ that achieves the maximum is called the optimal input distribution, denoted by $P^*_x$

???

There may be multiple optimal input distributions achieving the same value of $I(X;Y)$

--

- Eventually we will learn that $C(Q)$ does indeed measure the maximum amount of error-free information that can be transmitted over the channel per unit time.

--

- If we look at the BSC again with $f = 0.15, p_0 = 0.9, p_1 = 0.1$, the mutual information was:

$$ I(X;Y) = H_2(0.22) - H_2(0.15) = 0.76 - 0.61 = 0.15 \text{ bits} $$

--

- The optimal input distribution is $p_0 = 0.5, p_1 = 0.5$; keeping the same noise level we get

$$ C(Q_{BSC}) = H_2(0.5) - H_2(0.15) = 1 - 0.61 = 0.39 \text{ bits} $$

---

class: middle

#### 2. BEC

* $A_x = \{0, 1\}$
* $A_y = \{0, ?, 1\}$

.center[![:scale 30%](bec.png)]

.footnote[.left[[Image Credit: Wikimedia](https://commons.wikimedia.org/)]]

---

class: middle

#### Example.

* Let's look at the capacity of the BEC with general $p_e = f$

--

$$ I(X;Y) = H(X) - H(X|Y) $$

???

Note that here we write it in terms of $X$ rather than $Y$; equivalently

$$ I(X;Y) = H(Y) - H(Y|X) $$

--

* We can again use a symmetry argument and use the $(0.5, 0.5)$ distribution, so $H(X) = H_2(0.5)$
* As for $H(X|Y)$, we are uncertain about $x$ given $y$ only when $y$ is `?`
* The above has probability $f/2 + f/2 = f$, so $H(X|Y) = f H_2(0.5)$
* So $C = I(X;Y) = H_2(0.5) - f H_2(0.5) = 1 - f$

???

* The binary erasure channel fails a fraction $f$ of the time.
* Its capacity is precisely $1 - f$, which is the fraction of the time that the channel is reliable

---

class: middle

* This can also be done without assuming symmetry first

--

* Consider the input distribution $\{p_0, p_1\}$

--

* With the probability of $y = ?$ being $p_0 f + p_1 f = f$, we get $H(X|Y) = f H_2(p_1)$

--

* $I(X;Y) = H_2(p_1) - f H_2(p_1) = (1 - f)H_2(p_1)$

--

* $I(X;Y)$ achieves its maximum when $p_1 = 0.5$

---

class: middle

#### 3. Z Channel

* $A_x = \{0, 1\}$
* $A_y = \{0, 1\}$

$$
\begin{array}{ccc}
0 & \xrightarrow{} & 0 \\\
  & \nearrow{p_e} & \\\
1 & \xrightarrow{} & 1
\end{array}
$$

---

class: middle

#### Example.

* Consider the Z channel with $p_e = f = 0.15$ and input distribution $\{p_0, p_1\}$

--

$$P(y=1) = p_1(1-f)$$

--

$$ I(X;Y) = H(Y) - H(Y|X) $$

--

$$ I(X;Y) = H_2(p_1(1-f)) - (p_0 H_2(0) + p_1 H_2(f)) $$

--

$$ I(X;Y) = H_2(p_1(1-f)) - p_1 H_2(f) $$

--

$$ p^*_1 = \frac{1 / (1-f)}{1+2^{(H_2(f) / (1-f))}} $$

--

* For $f = 0.15$ we get $p^*_1 = 0.445$ to maximize the mutual information

--

* With this input distribution, $C(Q_Z) = 0.685$ bits

???

This means we can communicate slightly more information by using input symbol 0 more frequently than 1

---
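class: middle

#### Checking the Z-channel numbers (sketch)

A small numerical sketch, not from the book: grid-search $I(X;Y) = H_2(p_1(1-f)) - p_1 H_2(f)$ over $p_1$ and compare with the closed-form optimum quoted on the previous slide.

```python
import numpy as np

def H2(p):
    """Binary entropy in bits, with H2(0) = H2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

f = 0.15                                 # Z-channel noise level
p1_grid = np.linspace(0.0, 1.0, 10001)
I = np.array([H2(p1 * (1 - f)) - p1 * H2(f) for p1 in p1_grid])

best = int(np.argmax(I))
print(f"grid search : p1* ~ {p1_grid[best]:.3f}, C ~ {I[best]:.3f} bits")

# Closed form from the slide: p1* = (1/(1-f)) / (1 + 2^(H2(f)/(1-f)))
p1_star = (1 / (1 - f)) / (1 + 2 ** (H2(f) / (1 - f)))
C = H2(p1_star * (1 - f)) - p1_star * H2(f)
print(f"closed form : p1* = {p1_star:.3f}, C = {C:.3f} bits")
# Both should give p1* ~ 0.445 and C(Q_Z) ~ 0.685 bits.
```

---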
class: middle

#### 4. Noisy typewriter

* $A_x = A_y = \{A, B, \ldots, Z, -\}$
* Circular arrangement of the letters

.center[![:scale 80%](noisy_typewriter.jpg)]

---

#### Example

* From the video lecture we all (should) know that the capacity of this channel is

$$ C = \max_{P_x} I(X;Y) $$

$$ C = H(Y) - H(Y|X) = \log_2(27) - \log_2(3) = \log_2(9) \text{ bits} $$

--

* To make sure we have actually understood it, let's have a quick and easy quiz

--

* What happens if we only have 26 letters, i.e. A, B, ..., Z, and exclude the last symbol `-`?
* What probability distribution can we use? What is the capacity of this new channel?

---

#### Answer

.center[![:scale 30%](noisy_typewriter_2.jpg)]

--

$$ C = H(Y) - H(Y|X) = \log_2(26) - \log_2(2) = \log_2(13) \text{ bits} $$

---

class: middle

### Intuition for noisy channel coding theorem

* For a channel $Q$, there is a non-negative number $C$ (channel capacity), with the following property

--

* For any $\epsilon > 0$ and $R < C$, for large enough $N$, there exists a block code of length $N$ with rate $\geq R$ and a decoding algorithm, such that

--

* the maximal probability of block error is $\leq \epsilon$

---

class: middle

#### Example

* For the noisy typewriter, we can define an input distribution that uses only every third letter

--

* These letters form a non-confusable subset of the input alphabet, giving a block code of length $N = 1$

--

* the error-free information rate of this system is $\log_2(9)$ bits

---

class: middle

### Contd.: Intuition for noisy channel coding theorem

* If we use any channel many times (say $N$) in sequence, then that collection behaves much like the noisy typewriter channel

--

* What applies to the noisy typewriter channel applies to noisy channels in general

--

* Consider the extended BSC: block codes of length $N$ and a total input space $X$ of size $2^N$

--

* For the BSC the output alphabet is the same as the input alphabet, and an input can map to any number of outputs $Y$

---

class: middle

.center[![:scale 80%](extended_bsc.jpg)]

---

class: middle

* But for large $N$, there is a typical set of outputs of size $2^{NH(Y)}$

--

* The size of the typical output set for each input is $2^{NH(Y|X)}$

--

.center[![:scale 80%](typical_set.jpg)]

---

class: middle

* We can choose the input distribution $P_x$ so that we get non-confusable typical output sets for the inputs

--

* Number of almost certainly non-confusable inputs $= \frac{2^{NH(Y)}}{2^{NH(Y|X)}}$

--

$$2^{N(H(Y)-H(Y|X))} = 2^{NI(X;Y)} \leq 2^{NC} $$

--

* We can communicate $C$ bits per cycle error-free! (using one of the non-confusable inputs)

---

class: middle

### Interesting Problem for discussion: Exercise 9.19

---

# Thank You!
# Thank You!
# Thank You!

.footnote[.left[Talk notes are available on [github talk-06](https://github.com/bakkenbaeck/ml-seminar-information-theory/blob/master/talk06-noisy-channel-coding-1.html)]]

---

# Credit & References

1. Monty Hall Simulator: http://chrisarasin.com/monty-hall-problem/
2. Bayes theorem animation: https://github.com/nskobelevs/nskobelevs.github.io

---
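class: middle

#### Appendix: reproducing the channel numbers (sketch)

A small sketch, not from the book: a generic $I(X;Y)$ routine for a discrete memoryless channel, given a transition matrix `Q[y, x] = P(y|x)` and an input distribution, used to reproduce a few of the numbers quoted in these slides.

```python
import numpy as np

def mutual_information(Q, px):
    """I(X;Y) in bits for channel matrix Q (Q[y, x] = P(y|x)) and input px."""
    Q, px = np.asarray(Q, float), np.asarray(px, float)
    pxy = Q * px                       # joint P(x, y), arranged as [y, x]
    py = pxy.sum(axis=1)               # output marginal P(y)
    m = pxy > 0                        # skip zero entries (0 log 0 = 0)
    return float((pxy[m] * np.log2(pxy[m] / np.outer(py, px)[m])).sum())

f = 0.15

# BSC with uniform inputs: expect ~0.39 bits.
Q_bsc = [[1 - f, f],
         [f, 1 - f]]
print("BSC       :", round(mutual_information(Q_bsc, [0.5, 0.5]), 2), "bits")

# BEC with uniform inputs: expect 1 - f = 0.85 bits.  Rows are y = 0, ?, 1.
Q_bec = [[1 - f, 0.0],
         [f, f],
         [0.0, 1 - f]]
print("BEC       :", round(mutual_information(Q_bec, [0.5, 0.5]), 2), "bits")

# Noisy typewriter: 27 inputs, each smeared uniformly over 3 outputs;
# uniform inputs give C = log2(9), about 3.17 bits.
A = 27
Q_tw = np.zeros((A, A))
for x in range(A):
    for dy in (-1, 0, 1):
        Q_tw[(x + dy) % A, x] = 1.0 / 3.0
print("Typewriter:", round(mutual_information(Q_tw, np.full(A, 1.0 / A)), 2), "bits")
```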