class: theme
layout: true

---

class: center, middle

# Talk 6
## Noisy-Channel Coding: I (Chapters 8 & 9)

![:scale 50%](intro.gif)

2020-03-20 - Talk 6 - Saurabh B

.footnote[.left[[Image Credit: Wolfram](https://demonstrations.wolfram.com/ShannonsNoisyChannelCodingTheorem)]]

???

Illustrates how the introduction of noise in a channel can be compensated for by introducing redundancy, e.g. by sending multiple copies of every bit.

---

class: middle

### Last few talks . . .

* We looked at various data compression techniques

--

* Mostly considered random input vectors drawn from simple probability distributions

--

* Mostly each component was independent of the others

---

class: middle

### Motivation

* Better compression for real-world data, which often has interesting correlations

???

We need to know how to work with models that include dependences

--

* A noisy channel with input $x$ and output $y$ can be viewed as a joint ensemble in which $x$ and $y$ are dependent

???

If they were independent, it would be impossible to communicate over the channel

--

* Information content in the context of noisy channels

---

class: center
???

http://chrisarasin.com/monty-hall-problem/

---

class: center
???

* Steve is very shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality.
* A meek and tidy soul, he has a need for order and structure, and a passion for detail.
* Is Steve more likely to be a librarian or a farmer?

---

class: middle, center

### Solution

![:scale 80%](monty_hall_sol.png)

---

class: middle

### Dependent Random Variables

* Entropy

$$ H(X) = \sum_{x \in A_x} P(x) \log_2 \frac{1}{P(x)} $$

--

* Joint entropy

$$ H(X,Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{1}{P(x,y)} $$

--

* For independent random variables

$$ H(X,Y) = H(X) + H(Y) \iff P(x,y) = P(x)P(y) $$

---

class: middle

### Dependent Random Variables

* Conditional entropy of $X$ given $y = b_k$

$$ H(X | y = b_k) = \sum_{x \in A_x} P(x|y=b_k) \log_2 \frac{1}{P(x | y=b_k)} $$

--

* Conditional entropy of $X$ given $Y$

$$ H(X | Y) = \sum_{y \in A_y} P(y) \left[ \sum_x P(x|y) \log_2 \frac{1}{P(x|y)} \right] $$

---

class: middle

### Dependent Random Variables

* Conditional entropy of $X$ given $Y$

$$ H(X | Y) = \sum_{xy \in A_xA_y} P(x, y) \log_2 \frac{1}{P(x | y)} $$

---

class: middle

### Dependent Random Variables

* Chain rule of conditional entropy

$$ H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) $$

--

Proof:

$$ H(X,Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{1}{P(x,y)} $$

--

$$ H(X,Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{1}{P(x|y)P(y)} $$

---

#### Proof continued . . .

$$ H(X,Y) = - \sum_{xy \in A_xA_y} P(x,y) \log_2 {P(x|y)P(y)} $$

--

$$ H(X,Y) = - \sum_x \sum_y P(x,y) \log_2 {P(x|y)} - \sum_x \sum_y P(x,y) \log_2 {P(y)} $$

--

$$ H(X,Y) = - \sum_x \sum_y P(x,y) \log_2 {P(x|y)} - \sum_y \sum_x P(x,y) \log_2 {P(y)} $$

--

$$ H(X,Y) = - \sum_x \sum_y P(x,y) \log_2 {P(x|y)} - \sum_y P(y) \log_2 {P(y)} $$

--

$$ H(X,Y) = H(X|Y) + H(Y) $$

.right[QED]

--

* Note that in general: $H(X|Y) \neq H(Y|X)$

---

class: middle

### Mutual information between $X$ and $Y$

* Mutual information measures the information $X$ contains about $Y$, or vice versa

???

The decrease in uncertainty about the outcome of one random variable given the outcome of another

--

* Relative entropy or KL divergence between $P(x)$ and $Q(x)$

$$ D_{KL}(P \parallel Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)} $$

--

* So mutual information $I(X;Y)$ can be understood as the relative entropy between the joint distribution and the product of the marginal distributions

--

$$ I(X;Y) = D_{KL}(P(x,y) \parallel P(x)P(y)) $$

---

class: middle

### Mutual information between $X$ and $Y$

$$ I(X;Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)} $$

--

$$ I(X;Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{P(x|y)P(y)}{P(x)P(y)} $$

--

$$ I(X;Y) = \sum_{xy \in A_xA_y} P(x,y) \log_2 \frac{P(x|y)}{P(x)} $$

---

#### Proof continued . . .

$$ I(X;Y) = \sum P(x,y) \log_2 {P(x|y)} - \sum P(x,y) \log_2 {P(x)} $$

--

$$ I(X;Y) = \sum P(x,y) \log_2 {P(x|y)} - \sum P(x) \log_2 {P(x)} $$

--

$$ I(X;Y) = - \sum P(x) \log_2 {P(x)} + \sum P(x,y) \log_2 {P(x|y)} $$

--

$$ I(X;Y) = \sum P(x) \log_2 \frac{1}{P(x)} - \sum P(x,y) \log_2 \frac{1}{P(x|y)} $$

--

$$ I(X;Y) = H(X) - H(X|Y) $$

.right[QED]

--

$$ I(X;Y) = I(Y;X) $$

???

Mutual information is symmetric

---

class: middle, center

### Information measure

![:scale 80%](information_measure.jpg)

---
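class: middle

#### Quick numerical check (sketch)

A minimal sketch, not from the book: the 2x2 joint distribution below is made up, purely to check the chain rule and $I(X;Y) = H(X) - H(X|Y)$ numerically with `numpy`.

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array (0 log 0 taken as 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Arbitrary example joint distribution P(x, y): rows index x, columns index y.
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
P_x = P_xy.sum(axis=1)   # marginal P(x)
P_y = P_xy.sum(axis=0)   # marginal P(y)

# H(X|Y) from its definition: sum over y of P(y) * H(X | y).
H_X_given_Y = sum(py * H(P_xy[:, j] / py) for j, py in enumerate(P_y))

assert abs(H(P_xy) - (H(P_y) + H_X_given_Y)) < 1e-12   # chain rule
I_XY = H(P_x) - H_X_given_Y                             # mutual information

# I(X;Y) as D_KL(P(x,y) || P(x)P(y)) should agree.
I_KL = float((P_xy * np.log2(P_xy / np.outer(P_x, P_y))).sum())
print(f"I(X;Y) = {I_XY:.3f} bits (= {I_KL:.3f} bits via KL)")
```

---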
class: middle

### Noisy channels

* A discrete memoryless channel $Q$ is characterized by its input alphabet $A_x$, its output alphabet $A_y$, and the transition probabilities $P(y|x)$

--

* $Q$ defines the conditional probabilities $P(y|x)$; if we choose an input distribution $P(x)$, we then have a joint distribution

$$P(x, y) = P(y|x)P(x)$$

--

* We can use Bayes' theorem to infer the input given the output

$$P(x | y) = \frac{P(y | x) P(x)}{P(y)}$$

---

class: middle

#### 1. BSC

* $A_x = \{0, 1\}$
* $A_y = \{0, 1\}$

.center[![:scale 30%](bsc.png)]

.footnote[.left[[Image Credit: Wikimedia](https://commons.wikimedia.org/)]]

---

class: middle

#### Example.

* Consider a BSC with $f = 0.15, p_0 = 0.9, p_1 = 0.1$, and compute $P(x=1|y=0)$

--

$$P(x=1|y=0) = \frac{P(y=0|x=1)P(x=1)}{P(y=0)}$$

--

$$P(x=1|y=0) = \frac{0.15 \times 0.1}{0.9 \times 0.85 + 0.1 \times 0.15}$$

--

$$P(x=1|y=0) = \frac{0.015}{0.78} = 0.019$$

--

* From the above we know that $P(y=0) = 0.78, P(y=1) = 0.22$; let's look at the mutual information

--

$$ I(X;Y) = H(Y) - H(Y|X) $$

--

$$ I(X;Y) = H_2(0.22) - H_2(0.15) = 0.76 - 0.61 = 0.15 \text{ bits} $$

$$ H_2(p) = p \log_2 \frac{1}{p} + (1-p) \log_2 \frac{1}{(1-p)} $$

---

class: middle

### Capacity of a channel $Q$

$$C(Q) = \max_{P_x} I(X;Y) $$

- The distribution $P_x$ that achieves the maximum is called the optimal input distribution, denoted by $P^*_x$

???

There may be multiple optimal input distributions achieving the same value of $I(X;Y)$

--

- Eventually we will learn that $C(Q)$ does indeed measure the maximum amount of error-free information that can be transmitted over the channel per unit time.

--

- If we look at the BSC again with $f = 0.15, p_0 = 0.9, p_1 = 0.1$, the mutual information was:

$$ I(X;Y) = H_2(0.22) - H_2(0.15) = 0.76 - 0.61 = 0.15 \text{ bits} $$

--

- The optimal input distribution is $p_0 = 0.5, p_1 = 0.5$; keeping the same noise level we get

$$ C(Q_{BSC}) = H_2(0.5) - H_2(0.15) = 1 - 0.61 = 0.39 \text{ bits} $$

---

class: middle

#### 2. BEC

* $A_x = \{0, 1\}$
* $A_y = \{0, ?, 1\}$

.center[![:scale 30%](bec.png)]

.footnote[.left[[Image Credit: Wikimedia](https://commons.wikimedia.org/)]]

---

class: middle

#### Example.

* Let's look at the capacity of the BEC with general $p_e = f$

--

$$ I(X;Y) = H(X) - H(X|Y) $$

???

Note that here we write it in terms of $X$ rather than $Y$; equivalently

$$ I(X;Y) = H(Y) - H(Y|X) $$

--

* We can again use a symmetry argument and use the $(0.5, 0.5)$ distribution, so $H(X) = H_2(0.5)$
* As for $H(X|Y)$, we are uncertain about $x$ given $y$ only when $y$ is `?`
* The above has probability $f/2 + f/2 = f$, so $H(X|Y) = f H_2(0.5)$
* So $C = I(X;Y) = H_2(0.5) - f H_2(0.5) = 1 - f$

???

* The binary erasure channel fails a fraction $f$ of the time.
* Its capacity is precisely $1 - f$, which is the fraction of the time that the channel is reliable

---

class: middle

* This can also be done without assuming symmetry first

--

* Consider the input distribution $\{p_0, p_1\}$

--

* With the probability of $y = ?$ being $p_0 f + p_1 f = f$, we get $H(X|Y) = f H_2(p_1)$

--

* $I(X;Y) = H_2(p_1) - f H_2(p_1) = (1 - f)H_2(p_1)$

--

* $I(X;Y)$ achieves its maximum when $p_1 = 0.5$

---

class: middle

#### 3. Z Channel

* $A_x = \{0, 1\}$
* $A_y = \{0, 1\}$

$$
\begin{array}{ccc}
0 & \xrightarrow{} & 0 \\\
  & \nearrow{p_e} & \\\
1 & \xrightarrow{} & 1
\end{array}
$$

---

class: middle

#### Example.

* Consider the Z channel with $p_e = f = 0.15$ and input distribution $\{p_0, p_1\}$

--

$$P(y=1) = p_1(1-f)$$

--

$$ I(X;Y) = H(Y) - H(Y|X) $$

--

$$ I(X;Y) = H_2(p_1(1-f)) - (p_0 H_2(0) + p_1 H_2(f)) $$

--

$$ I(X;Y) = H_2(p_1(1-f)) - p_1 H_2(f) $$

--

$$ p^*_1 = \frac{1 / (1-f)}{1+2^{(H_2(f) / (1-f))}} $$

--

* For $f = 0.15$ we get $p^*_1 = 0.445$ to maximize the mutual information

--

* With this input distribution, $C(Q_Z) = 0.685$ bits

???

This means we can communicate slightly more information by using input symbol 0 more frequently than 1

---
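class: middle

#### Checking the Z-channel numbers (sketch)

A small numerical sketch, not from the book: grid-search $I(X;Y) = H_2(p_1(1-f)) - p_1 H_2(f)$ over $p_1$ and compare with the closed-form optimum quoted on the previous slide.

```python
import numpy as np

def H2(p):
    """Binary entropy in bits, with H2(0) = H2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

f = 0.15                                 # Z-channel noise level
p1_grid = np.linspace(0.0, 1.0, 10001)
I = np.array([H2(p1 * (1 - f)) - p1 * H2(f) for p1 in p1_grid])

best = int(np.argmax(I))
print(f"grid search : p1* ~ {p1_grid[best]:.3f}, C ~ {I[best]:.3f} bits")

# Closed form from the slide: p1* = (1/(1-f)) / (1 + 2^(H2(f)/(1-f)))
p1_star = (1 / (1 - f)) / (1 + 2 ** (H2(f) / (1 - f)))
C = H2(p1_star * (1 - f)) - p1_star * H2(f)
print(f"closed form : p1* = {p1_star:.3f}, C = {C:.3f} bits")
# Both should give p1* ~ 0.445 and C(Q_Z) ~ 0.685 bits.
```

---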
class: middle

#### 4. Noisy typewriter

* $A_x = A_y = \{A, B, \ldots, Z, -\}$
* Circular arrangement of the letters

.center[![:scale 80%](noisy_typewriter.jpg)]

---

#### Example

* From the video lecture we all (should) know that the capacity of this channel is

$$ C = \max_{P_x} I(X;Y) $$

$$ C = H(Y) - H(Y|X) = \log_2(27) - \log_2(3) = \log_2(9) \text{ bits} $$

--

* To make sure we have actually understood it, let's have a quick and easy quiz

--

* What happens if we only have 26 letters, i.e. A, B, ..., Z, and exclude the last symbol `-`?
* What probability distribution can we use? What is the capacity of this new channel?

---

#### Answer

.center[![:scale 30%](noisy_typewriter_2.jpg)]

--

$$ C = H(Y) - H(Y|X) = \log_2(26) - \log_2(2) = \log_2(13) \text{ bits} $$

---

class: middle

### Intuition for noisy channel coding theorem

* For a channel $Q$, there is a non-negative number $C$ (channel capacity), with the following property

--

* For any $\epsilon > 0$ and $R < C$, for large enough $N$, there exists a block code of length $N$ with rate $\geq R$ and a decoding algorithm, such that

--

* the maximal probability of block error is $\leq \epsilon$

---

class: middle

#### Example

* For the noisy typewriter, we can define an input distribution that uses only every third letter

--

* These letters form a non-confusable subset of the input alphabet, giving a block code of length $N = 1$

--

* the error-free information rate of this system is $\log_2(9)$ bits

---

class: middle

### Contd.: Intuition for noisy channel coding theorem

* If we use any channel many times (say $N$) in sequence, then that collection behaves much like the noisy typewriter channel

--

* What applies to the noisy typewriter channel applies to noisy channels in general

--

* Consider the extended BSC: block codes of length $N$ and a total input space $X$ of size $2^N$

--

* For the BSC the output alphabet is the same as the input alphabet, and an input can map to any number of outputs $Y$

---

class: middle

.center[![:scale 80%](extended_bsc.jpg)]

---

class: middle

* But for large $N$, there is a typical set of outputs of size $2^{NH(Y)}$

--

* The size of the typical output set for each input is $2^{NH(Y|X)}$

--

.center[![:scale 80%](typical_set.jpg)]

---

class: middle

* We can choose the input distribution $P_x$ so that we get non-confusable typical output sets for the inputs

--

* Number of almost certainly non-confusable inputs $= \frac{2^{NH(Y)}}{2^{NH(Y|X)}}$

--

$$2^{N(H(Y)-H(Y|X))} = 2^{NI(X;Y)} \leq 2^{NC} $$

--

* We can communicate $C$ bits per cycle error-free! (using one of the non-confusable inputs)

---

class: middle

### Interesting Problem for discussion: Exercise 9.19

---

# Thank You!
# Thank You!
# Thank You!

.footnote[.left[Talk notes are available on [github talk-06](https://github.com/bakkenbaeck/ml-seminar-information-theory/blob/master/talk06-noisy-channel-coding-1.html)]]

---

# Credit & References

1. Monty Hall Simulator: http://chrisarasin.com/monty-hall-problem/
2. Bayes theorem animation: https://github.com/nskobelevs/nskobelevs.github.io

---
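class: middle

#### Appendix: reproducing the channel numbers (sketch)

A small sketch, not from the book: a generic $I(X;Y)$ routine for a discrete memoryless channel, given a transition matrix `Q[y, x] = P(y|x)` and an input distribution, used to reproduce a few of the numbers quoted in these slides.

```python
import numpy as np

def mutual_information(Q, px):
    """I(X;Y) in bits for channel matrix Q (Q[y, x] = P(y|x)) and input px."""
    Q, px = np.asarray(Q, float), np.asarray(px, float)
    pxy = Q * px                       # joint P(x, y), arranged as [y, x]
    py = pxy.sum(axis=1)               # output marginal P(y)
    m = pxy > 0                        # skip zero entries (0 log 0 = 0)
    return float((pxy[m] * np.log2(pxy[m] / np.outer(py, px)[m])).sum())

f = 0.15

# BSC with uniform inputs: expect ~0.39 bits.
Q_bsc = [[1 - f, f],
         [f, 1 - f]]
print("BSC       :", round(mutual_information(Q_bsc, [0.5, 0.5]), 2), "bits")

# BEC with uniform inputs: expect 1 - f = 0.85 bits.  Rows are y = 0, ?, 1.
Q_bec = [[1 - f, 0.0],
         [f, f],
         [0.0, 1 - f]]
print("BEC       :", round(mutual_information(Q_bec, [0.5, 0.5]), 2), "bits")

# Noisy typewriter: 27 inputs, each smeared uniformly over 3 outputs;
# uniform inputs give C = log2(9), about 3.17 bits.
A = 27
Q_tw = np.zeros((A, A))
for x in range(A):
    for dy in (-1, 0, 1):
        Q_tw[(x + dy) % A, x] = 1.0 / 3.0
print("Typewriter:", round(mutual_information(Q_tw, np.full(A, 1.0 / A)), 2), "bits")
```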