This document is (c) David J.C. MacKay, 2001

It originates from http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html

It contains the text of David MacKay's book,
** Information theory, inference, and learning algorithms. **
(latex source)

Copying and distribution of this file are NOT PERMITTED.

The file is provided for convenience of anyone wishing to make a web-based search of the text of the book.

% This document is (c) David J.C. MacKay, 2001 % % It originates from http://www.inference.phy.cam.ac.uk/mackay/itprnn/ % http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html % % It contains the text of David MacKay's book, % Information theory, inference, and learning algorithms. % (latex source) % % Copying and distribution of this file are NOT PERMITTED. % % The file is provided for convenience of anyone wishing to % make a web-based search of the text of the book. % was book2e.tex is now book.tex (and still latex2e) % make book.ind make book.gv \documentclass[11pt]{book}% % last minute additions \usepackage{DJCMamssymb}% needed for blacktriangleright Mon 10/11/03 (put in symbols instead) \usepackage{ragged2e}% provides \justifying % end last minute additions \usepackage{floatflt} %\usepackage{hangingsecnum}% makes sec numbers sit in the left margin (tried cutting out on Thu 6/11/03) \usepackage{hangingsecnum2}% makes sec numbers sit in the left margin (modified Thu 6/11/03) %\usepackage{mparhack} \usepackage{mparhackright-209}% makes all margin pars go in right margin \usepackage{marginfig}% Defines many macros for making various styles of figure with captions %\usepackage{symbols}% Provides a few math symbols (replaced with DJCMamssymb) %\usepackage{twoside} \usepackage{myalgorith}% defines the Algorithm environment as a float % Also forces fig,tab, and alg all to use a single counter \usepackage{aside}% defines the {aside} environment \usepackage{chapsummary}% helps me compile index-like objects (NOT USED) \usepackage{chapternotes}% lots of assorted stuff \usepackage{lsalike}% defines citation commands \usepackage{booktabs}% makes nice quality tables \usepackage{prechapter}% defines a chapter-like object \usepackage{mycaption}% defines ``\indented''and \@makecaption; and the notindented style used in figure captions % additions post-Sat 5/10/02 \usepackage{latexsym}% needed in order to make use of the \Box command \usepackage{tocloft}% implements my 
look of table of contents \usepackage{tocloftcomp2}% implements my look of table of contents (was tocloftcomp until Thu 6/11/03) \usepackage{mychapter}% defines chapter command, including the look of the new chapter page % also defines the look of the section and subsection commands \usepackage{mycenter}% modifies center to reduce vertical space waste - useful for figures, etc. \usepackage{mypart}% modifies part to not cleardoublepage (no longer Sat 5/4/03) \usepackage{myheadings}% redefines the pagestyle ``headings'' % \usepackage{headingmods}% redefines the pagestyle ``headings'' (similar to myheadings) % \usepackage{myindents}% defines parindent and leftmargin \usepackage{graphics}% enables rotating of boxes % \usepackage{boldmathgk}% provides bold alpha etc. (doesn't work) % \usepackage{fixmath}% provides bold alpha etc. Also (I think) provides numerous sloping greeks that I don't like \usepackage{fixmathDJCM}% provides bold alpha etc. Has Gamma definition cut out. and Omega % suggested by DAG: %\usepackage{amsmath} %\usepackage{mathptmx} \usepackage{DAGmathspacing}% provides smallfrac \usepackage{boxedminipage} \usepackage{fancybox}% Provides ability to put verbatim text inside boxes \usepackage{bbold}% CTAN blackboard.ps was helpful for choosing this PROVIDES ``holey 1'' as \textbb{1} \usepackage{epsf}% to allow use of metapost figures %\usepackage{hyperref} % incompatible with something % \usepackage{multicol}% why does CTAN refer to multicols? %\usepackage{myindex2}% overrides book definition of index \usepackage{myindex}% overrides book definition of index \usepackage{makeidx} \usepackage{mybibliog} \usepackage{mygaps}% defines \eq and \puncgap and \colonspace and \puncspace \usepackage{mytoc}% suppresses the CONTENTS headings \makeindex % \newcommand{\thedraft}{7.2}% 6.6 was 2nd printing. 
6.8 was when I fixed errs Tue 24/2/04 % 6.9 = Mon 28/6/04 % 6.10 = Mon 2/8/04 % 6.11 Sun 22/8/04 % 7.0 final for 3rd printing % 7.2 is 4th printing \renewcommand{\textfraction}{0.10} \pagestyle{headings} \begin{document} \bibliographystyle{lsalikedjcmsc}%.bst %\newcommand{\bf}{\textbf} %\newcommand{\sf}{\textsf} %%\newcommand{\em}{\textem} %\newcommand{\rm}{\textrm} %\newcommand{\tt}{\texttt} %\newcommand{\sl}{\textsl} %\newcommand{\sc}{\textsc} % % chapter.tex % % this contains a few common definitions for all chapters % of the itprnn book % for _l1.tex: \hyphenation{left-multi-pli-ca-tion} \hyphenation{multi-pli-ca-tion} % \newcommand{\partnoun}{Part} \newcommand{\partone}{\partnoun\ I} \newcommand{\datapart}{I} \newcommand{\noisypart}{II} \newcommand{\finfopart}{III} \newcommand{\probpart}{IV} \newcommand{\netpart}{V} \newcommand{\sgcpart}{VI} \newcommand{\hybrid}{Hamiltonian} \newcommand{\Hybrid}{Hamiltonian} % % If sending book to readers - \newcommand{\begincuttable}{} \newcommand{\ENDcuttable}{} % If sending to editor - %\newcommand{\begincuttable}{\marginpar{\raisebox{-0.5in}[0in][0in]{$\downarrow$}CUTTABLE?}} %\newcommand{\ENDcuttable}{\marginpar{\raisebox{0.5in}[0in][0in]{$\uparrow$}CUTTABLE?}} % \newcommand{\adhoc}{ad hoc} \newcommand{\busstop}{bus-stop} \newcommand{\mynewpage}{\newpage}% switch this off later Sun 3/2/02 % see also tex/inputs/itchapter.sty % chapternotes.sty is where there is an index \newcommand{\fN}{f\!N} \newcommand{\exercisetitlestyle}{\sf} % % used in sumproduct.tex and gallager.tex \newcommand{\Mn}{{\cal M}(n)} \newcommand{\Nm}{{\cal N}(m)} %\newcommand{\N}{{\cal N}} % % the delta function that is 1 if true (defined in notation.tex) \newcommand{\truth}{\mbox{\textbb{1}}} % requires: % \usepackage{bbold}% CTAN blackboard.ps was helpful for choosing this % % used in gene.tex \newcommand{\deltaf}{\delta\! 
f} \newcommand{\tI}{\tilde{I}} \newcommand{\Kp}{K_{\rm{p}}} \newcommand{\Ks}{K_{\rm{s}}} % % end % lang4.tex - distributions.tex \newcommand{\lI}{I} % % clust.tex \newcommand{\rnk}{r^{(n)}_k} \newcommand{\hkn}{\hat{k}^{(n)}} % good sizes: % -0.45: 1.25 % -0.25: 0.65 % -0.4 0.8 \newcommand{\softfig}[1]{\hspace{-0.4in}\psfig{figure=octave/kmeansoft/ps1/#1.ps,width=0.8in,angle=-90}} \newcommand{\softtfa}[3]{\begin{tabular}{c}{$t=#2$}\\ \hspace*{-0.4in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.2in,angle=-90}\hspace*{-0.2in}}\\ \end{tabular}} \newcommand{\softtfabig}[3]{\begin{tabular}{c}{$t=#2$}\\ \hspace*{-0.6in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.5in,angle=-90}\hspace*{-0.2in}}\\ \end{tabular}} \newcommand{\softtfabigb}[3]{\begin{tabular}{c}{$t=#2$}\\ \hspace*{-0.45in}\mbox{\psfig{figure=octave/kmeansoft/#3/#1.ps,width=1.625in,angle=-90}\hspace*{-0.2in}}\\ \end{tabular}} \newcommand{\softtf}[2]{\softtfa{#1}{#2}{ps1}} \newcommand{\softtfbig}[2]{\softtfabig{#1}{#2}{ps1}} \newcommand{\softtfbigb}[2]{\softtfabigb{#1}{#2}{ps1}} \newcommand{\softtfb}[2]{\softtfa{#1}{#2}{ps3}} \newcommand{\softtfbbig}[2]{\softtfabigb{#1}{#2}{ps3}} \newcommand{\softfc}[1]{\begin{tabular}{c}% \hspace*{-0.2in}\mbox{\psfig{figure=octave/kmeansoft/ps5/#1.ps,width=1.32in,angle=-90}\hspace*{-0.2in}}\\ \end{tabular}} % end % % used in _p1 and _l2 \newcommand{\hpheight}{26mm} \newcommand{\wow}{\marginpar{{\Huge{$*$}}}} %\newcommand{\wow}{\marginpar{\raisebox{-12pt}{\psfig{figure=figs/wow.eps,width=1in}}}} % % used in _l1.tex::::::: \renewcommand{\q}{{f}} \newcommand{\obr}[3]{\overbrace{{#1}\,{#2}\,{#3}}} \newcommand{\ubr}[3]{\underbrace{{#1}\,{#2}\,{#3}}} \newcommand{\nbr}[3]{{{#1}\,{#2}\,{#3}}} % % for \mid and gaps puncgap etc see mygaps.sty \newcommand{\EM}{EM} \newcommand{\ENDsolution}{\hfill \ensuremath{\epfsymbol}\par} \newcommand{\ENDproof}{\hfill \ensuremath{\epfsymbol}\par} \newcommand{\Hint}{{\sf{Hint}}} \newcommand{\viceversa}{{\itshape{vice versa}}} 
\newcommand{\analyze}{analyze} \newcommand{\analyse}{analyze} \newcommand{\fitpath}{/home/mackay/octave/fit/ps}% used in fit.tex (gaussian fitting, octave) % CUP style: \renewcommand{\cf}{cf.} \renewcommand{\ie}{i.e.} \renewcommand{\eg}{e.g.} \renewcommand{\NB}{N.B.} % % symbols i e and d in maths (operators) \newcommand{\im}{{\rm i}} \newcommand{\e}{{\rm e}} % \d is already defined % % needs % \usepackage{boxedminipage} \newenvironment{conclusionboxplain}% {\begin{Sbox}\begin{minipage}{\textwidth}}% {\end{minipage}\end{Sbox}\fbox{\TheSbox}} \newenvironment{conclusionbox}% %{\begin{Sbox}\begin{minipage}{\textwidth}}% %{\end{minipage}\end{Sbox}\fbox{\TheSbox}} {% see also marginfig.sty for conflicting use of this enironment and its params - and for defn of fatfboxsep \fatfboxsep% \setlength{\mylength}{\textwidth}% \addtolength{\mylength}{-2\fboxsep}% \addtolength{\mylength}{-2\fboxrule}% \vskip8pt\noindent\begin{Sbox}\begin{minipage}{\mylength}\hspace*{-\fboxsep}\hspace*{-\fboxrule}% \hspace*{\leftmargini}\begin{minipage}{\textwidthlessindents}}% {\end{minipage}\end{minipage}\end{Sbox}\shadowbox{\TheSbox}\resetfboxsep\vskip 1pt} \newenvironment{oldconclusionbox}% {\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -18pt\begin{quote}\vskip -8pt}% {\end{quote}\vskip -14pt \noindent\rule{\textwidth}{0.1pt}\vskip 6pt} % {\vskip 0.1pt \noindent\rule{\textwidth}{0.1pt}\vskip -12pt\begin{quote}}% % {\end{quote}\vskip -12pt \noindent\rule{\textwidth}{0.1pt}} \newcommand{\dy}{\d y} \newcommand{\plus}{+} \newcommand{\Wenglish}{Wenglish}% winglish \newcommand{\wenglish}{\Wenglish}% winglish \newcommand{\percent}{{per cent}}% in USA only: percent % %\newcommand{\nonexaminable}{$^{*}$} \newcommand{\nonexaminable}{} % % for exact sampling chapter \newcommand{\envelope}{summary state} % \def\unit#1{\,{\rm #1}} \def\cm{\unit{cm}} \def\grams{\unit{g}} % this is a 209 versus 2e problem: (huffman.latex edited instead) %\def\tenrm{\rm} %\def\tenit{\it} % % other problems: \pem 
\renewcommand{\textfraction}{0.1} % % for use in free text: \newcommand{\bits}{{\rm bits}} \newcommand{\bita}{{\rm bit}} % for use in equations or in '1 bit' \newcommand{\ubits}{\,{\bits}} \newcommand{\ubit}{\,{\bita}} % % % % ch 2: \newcommand{\sixtythree}{{\tt sixty-three}} \newcommand{\aep}{`asymptotic equipartition' principle} % % used in alpha: \newcommand{\sla}{\sqrt{\lambda_a}} \newcommand{\kga}{\kappa\gamma} \newcommand{\kkgg}{\kappa^2\gamma^2} \newcommand{\skg}{\sqrt{\kappa\gamma}} \newcommand{\TYP}{{\rm \scriptscriptstyle TYP}} % \newcommand{\bb}{{\bf b}} % % used in ising.tex and _s4.tex % J=+1 are in states1, J=-1 are in states %\newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1.82in}} \newcommand{\risingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=1in}}% was 1.75 \newcommand{\smallrisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=0.6in}}% was 1.2 was 0.9 \newcommand{\Hisingsample}[1]{\psfig{figure=isingfigs/states1/#1.ps,width=2.6in}} \newcommand{\hisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=2.6in}} \newcommand{\bighisingsample}[1]{\psfig{figure=isingfigs/states/#1.ps,width=3.86in}} % % used in _noiseless.tex \newcommand{\Connectionmatrix}{Connection matrix} \newcommand{\connectionmatrix}{connection matrix} \newcommand{\connectionmatrices}{connection matrices} %\newcommand{\cwM}{M}% codeword number %\newcommand{\cwm}{m}% codeword number \newcommand{\cwM}{S}% codeword number \newcommand{\cwm}{s}% codeword number \newcommand{\sa}{\alpha}% signal amplitude in gaussian channel % \newcommand{\cmA}{A}% connection matrix symbol \newcommand{\bcmA}{{\bf \cmA}}% connection matrix symbol \newcommand{\bAcm}{{\bcmA}} \newtheorem{ctheorem}{Theorem}[chapter] \newtheorem{definc}{Definition}[chapter] \newcommand{\appendixref}[1]{Appendix \ref{#1}} \newcommand{\appref}[1]{Appendix \ref{#1}} \newcommand{\Appendixref}[1]{Appendix \ref{#1}} \newcommand{\sectionref}[1]{section \ref{#1}} 
\newcommand{\Sectionref}[1]{Section \ref{#1}} \newcommand{\secref}[1]{section \ref{#1}} \newcommand{\Secref}[1]{Section \ref{#1}} \newcommand{\chapterref}[1]{Chapter \ref{#1}} \newcommand{\Chapterref}[1]{Chapter \ref{#1}} \newcommand{\chref}[1]{Chapter \ref{#1}} \newcommand{\Chref}[1]{Chapter \ref{#1}} \newcommand{\chone}{\ref{ch.one}} \newcommand{\chtwo}{\ref{ch.two}} \newcommand{\chthree}{\ref{ch.three}} \newcommand{\chfour}{\ref{ch.four}} \newcommand{\chfive}{\ref{ch.five}} \newcommand{\chsix}{\ref{ch.six}} \newcommand{\chseven}{\ref{ch.ecc}} \newcommand{\cheight}{\ref{ch.bayes}} \newcommand{\chthirteen}{\ref{ch.single.neuron.class}}% single neuron \newcommand{\chfourteen}{\ref{ch.single.neuron.bayes}}% single neuron bayes? \newcommand{\chtwelve}{\ref{ch.nn.intro}}% intro to nn \newcommand{\chcover}{\ref{ch.cover}} \newcommand{\chbayes}{\ref{ch.bayes}} \newcommand{\secpulse}{\ref{sec.pulse}}% 7.2.1?} \newcommand{\secthirteenthree}{13.3?} \newcommand{\secmetrop}{\ref{sec.metrop}}% 11.3?} \newcommand{\figooo}{?1.11?} \newcommand{\eqgamma}{8.27?} \newcommand{\TSP}{travelling salesman problem} \newcommand{\Bayes}{Bayes'} \newcommand{\vfe}{variational free energy} \newcommand{\vfem}{variational free energy minimization} % could make this \ch6 = \ref{ch6} % author, title etc is in here.... 
% {headerinfo.tex}% uses special commands \setcounter{secnumdepth}{2}% \newcommand{\indep}{\bot}% upside down pi desired \newcommand{\dbf}{\slshape}% boldface in definitions \newcommand{\dem}{\slshape}% emphasized definitions in text \newcommand{\solutionb}[2]{\setcounter{solution_number}{#1} \solutiona{#2}} \newcommand{\lsolution}[2]{\section{Solution to exercise {#1}}{#2}} % % \newcommand{\FIGS}{/home/mackay/book/FIGS} \newcommand{\bookfigs}{/home/mackay/book/figs} \newcommand{\figsinter}{/home/mackay/handbook/figs/inter} \newcommand{\exburglar}{\exerciseref{ex.burglar}} \newcommand{\exnine}{\exerciseref{ex.invP}}%10} \newcommand{\exseven}{\exerciseonlyref{ex.weigh}}% use deprecated! % was \exseven .... \exerciseref{ex.expectn}}%9} \newcommand{\exaseven}{\exerciseref{ex.R9}}%{7} \newcommand{\exten}{\exerciseref{ex.expectng}}%{11} \newcommand{\exfourteen}{\exerciseref{ex.Hadditive}}%{15} \newcommand{\exfifteen}{\exerciseref{ex.Hcondnal}}%{16} \newcommand{\exeighteen}{\exerciseref{ex.Hmutualineq}}%{19} \newcommand{\extwenty}{\exerciseref{ex.rel.ent}}%{21} \newcommand{\extwentyone}{\exerciseref{ex.joint}}%{22}% the joint ensemble \newcommand{\extwentytwo}{\exerciseref{ex.dataprocineq}}%{23} \newcommand{\extwentythree}{\exerciseref{ex.zxymod2}}%{24} \newcommand{\extwentyfour}{\exerciseref{ex.waithead}}%{25} \newcommand{\extwentyfive}{\exerciseref{ex.sumdice}}%{26} \newcommand{\extwentysix}{\exerciseref{ex.RN}}%{27} \newcommand{\extwentyseven}{\exerciseref{ex.RNGaussian}}%{28} \newcommand{\exthirtyone}{\exerciseref{ex.logit}}%{32}% logistic \newcommand{\exthirtysix}{\exerciseref{ex.exponential}}%{37}% \newcommand{\exthirtyseven}{\exerciseref{ex.blood}}%{38}% forensic \newcommand{\exfiftythree}{\exerciseref{ex.}}%{53}% integers \newcommand{\eqsixteenfive}{16.5} \newcommand{\Kraft}{Kraft}% Kraft--McMillan \newcommand{\exrelent}{\exerciseref{ex.rel.ent}}%{20} %% \ref{ex.rel.ent} \newcommand{\eqKL}{1.24} %% \eqref{eq.KL} \newcommand{\bSigma}{{\mathbf{\Sigma}}} 
\newcommand{\sumproduct}{sum--product} % % for cpi material % \newcommand{\sigbias}{\sigma_{\rm bias}} \newcommand{\sigin}{\sigma_{\rm in}} \newcommand{\sigout}{\sigma_{\rm out}} \newcommand{\abias}{\alpha_{\rm bias}} \newcommand{\ain}{\alpha_{\rm in}} \newcommand{\aout}{\alpha_{\rm out}} %\newcommand{\bff}{\bf} \newcommand{\handfigs}{/home/mackay/handbook/figs} \newcommand{\mjofigs}{/home/mackay/figs/mjo} \newcommand{\FIGSlearning}{/home/mackay/book/FIGS/learning} \newcommand{\codefigs}{/home/mackay/_doc/code/ps/ps} % % mncEL stuff % \newcommand{\ebnowide}[1]{\mbox{\psfig{figure=../../code/#1.ps,width=2.8in,angle=-90}}} \newcommand{\fem}{m} \newcommand{\feM}{M} \newcommand{\fel}{n} \newcommand{\feL}{N} \renewcommand{\L}{N} \newcommand{\feLm}{{\cal N}(m)} \newcommand{\feMl}{{\cal M}(n)} \newcommand{\feK}{N} \newcommand{\fek}{n} \newcommand{\feKn}{{\cal N}(m)} \newcommand{\feNk}{{\cal M}(n)} \newcommand{\feN}{M} \newcommand{\fen}{m} \newcommand{\fer}{r} \newcommand{\GL}{GL} \newcommand{\SMN}{GL} \newcommand{\NMN}{MN} \newcommand{\MN}{MN} \renewcommand{\check}{check}% was relationship \newcommand{\checks}{checks}% was relationship \newcommand{\fs}{f_{\rm s}} \newcommand{\fn}{f_{\rm n}} \newcommand{\llncspunc}{.} \newcommand{\query}{\mbox{{\tt{?}}}} \newcommand{\lcA}{{H}} \newcommand{\rmncNall}{/home/mackay/_doc/code/rmncNall} \newcommand{\oneA}{1A} \newcommand{\twoA}{2A} \newcommand{\thrA}{2A} \newcommand{\oneB}{1B} \newcommand{\twoB}{2B} \newcommand{\thrB}{2B} \newcommand{\bndips}{/home/mackay/_doc/code/bndips} \newcommand{\codeps}{/home/mackay/_doc/code/ps} \newcommand{\equalnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/equal.eps,width=8pt}\hspace{0mm}}} \newcommand{\plusnode}{\raisebox{-1pt}[0in][0in]{\psfig{figure=figs/gallager/plus.eps,width=8pt}\hspace{0mm}}} % % Mon 26/5/03 modified this to try to centre the left heading \newcommand{\fourfourtable}[9]{\begin{tabular}[b]{lcc@{\hspace{4pt}}c} \multicolumn{1}{l}{#1:} & & \multicolumn{2}{c}{#2} 
\\[-0.1in]% \cline{1-1} & & {#3} & {#4} \\ \cline{3-4} \raisebox{-6.5pt}[0pt][0pt]{{#5}} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt] &\multicolumn{1}{l|}{#4} & {#8} & {#9} \\ \end{tabular}} % Mon 26/5/03 extra version with heading right aligned and space reduced between col 1 and 2 \newcommand{\fourfourtabler}[9]{\begin{tabular}[b]{r@{}cc@{\hspace{4pt}}c} \multicolumn{1}{l}{#1:} & & \multicolumn{2}{c}{#2} \\[-0.1in]% \cline{1-1} & & {#3} & {#4} \\ \cline{3-4} \raisebox{-6.5pt}[0pt][0pt]{{#5}} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt] &\multicolumn{1}{l|}{#4} & {#8} & {#9} \\ \end{tabular}} \newcommand{\fourfourtablebeforemaythree}[9]{\begin{tabular}[b]{lcc@{\hspace{4pt}}c} \multicolumn{1}{l}{#1:} & & \multicolumn{2}{c}{#2} \\[-0.1in]% \cline{1-1} & & {#3} & {#4} \\ \cline{3-4} {#5} &\multicolumn{1}{l|}{#3} & {#6} & {#7} \\[-7pt] &\multicolumn{1}{l|}{#4} & {#8} & {#9} \\ \end{tabular}} \newcommand{\fourfourtableb}[9]{\begin{tabular}[b]{l|c@{\hspace{1pt}}c@{\hspace{3pt}}c} {#1} & {#2} & {#3} & {#4} \\ \cline{1-1}\cline{3-4} \multicolumn{2}{l}{#5} & & \\ \multicolumn{1}{l|}{#3} & & {#6} & {#7} \\[-5pt] \multicolumn{1}{l|}{#4} & & {#8} & {#9} \\ \end{tabular}} \newcommand{\fourfourtableold}[9]{\begin{tabular}[b]{l|c|c|c|} {#1} & {#2} & {#3} & {#4} \\ \cline{1-1} \multicolumn{2}{l|}{#5} & & \\ \hline \multicolumn{2}{l|}{#3} & {#6} & {#7} \\ \hline \multicolumn{2}{l|}{#4} & {#8} & {#9} \\ \hline \end{tabular}} \newcommand{\mathsstrut}{\rule[-3mm]{0pt}{8mm}} % % for ra.tex % \newcommand{\halfw}{0.35in} \newcommand{\onew}{0.9in}%{0.7in}% used in Gallager/MN figures in ra.tex% increased Wed 9/4/03 \newcommand{\onehalfw}{1.05in} \newcommand{\twow}{1.4in} \newcommand{\twohalfw}{1.75in} \newcommand{\GHfig}[1]{\psfig{figure=GHps/#1,width=\onehalfw}}% for rate 1/3 \newcommand{\GHfigone}[1]{\psfig{figure=GHps/#1,width=\onew}}% \newcommand{\GHfigthird}[1]{\psfig{figure=GHps/#1,width=\halfw}} \newcommand{\GHfigquarter}[1]{\psfig{figure=GHps/#1,width=\twohalfw}} 
\newcommand{\GHfigtwo}[1]{\psfig{figure=GHps/#1,width=\twow}}% for rate 1/2 \newcommand{\GHfigdouble}[1]{\psfig{figure=GHps/#1,width=\twohalfw}}% for five wide % extra wide fitting::::::::::: (for turbo) %\newcommand{\GHfigdoubleE}[1]{\psfig{figure=GHps/#1,width=2in}}% for five wide %\newcommand{\GHfigE}[1]{\psfig{figure=GHps/#1,width=1.2in}}% for rate 1/3 \newcommand{\GHfigdoubleE}[1]{\psfig{figure=GHps/#1,width=2.666666in}}% for five wide \newcommand{\GHfigE}[1]{\psfig{figure=GHps/#1,width=1.6in}}% for rate 1/3 % \newcommand{\GHdrawfig}[1]{\psfig{figure=GHps/#1,width=1.5in}}% was 1.8 \newcommand{\standardfig}[1]{\psfig{figure=rirreg/#1,width=1.8in,angle=-90}} \newcommand{\loopsfig}[1]{\psfig{figure=rirreg/loops.#1,height=1.85in,width=1.8in,angle=-90}} \newcommand{\titledfig}[2]{\begin{tabular}{c}% {#1}\\% \standardfig{#2}\\% \end{tabular}% } % % for the single neuron chapters % \newcounter{funcfignum} \setcounter{funcfignum}{1} \newcommand{\funcfig}[2]{ \put(#1,#2){\makebox(0,0)[b]{ \begin{tabular}{@{}c@{}} \psfig{figure=\FIGSlearning/f.#1.#2.ps,height=1.3in,width=1.3in,angle=-90} \\[-0.15in] $\bw = (#1,#2)$ \\ \end{tabular} } } } \newcommand{\wflatfig}[1]{ \begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.3)(0.30,0.40) \psfig{figure=\FIGSlearning/#1,height=2.43in,width=2.064in,angle=-90} % was 1.3,1.3 \end{picture}\\\end{tabular} } \newcommand{\wsurfig}[1]{ \begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1.5,1.5)(0,0) \psfig{figure=\FIGSlearning/#1,height=1.8in,width=1.8in,angle=-90} % was 1.5,1.5 \end{picture}\end{tabular} } \newcommand{\datfig}[1]{ \begin{tabular}{@{}c@{}}\setlength{\unitlength}{1in}\begin{picture}(1,1)(0.30,0.1) \psfig{figure=\FIGSlearning/#1,height=1.2in,width=1.412in,angle=-90} % was 1,1 \end{picture}\end{tabular} } \newcommand{\optens}{optimal input distribution}% used in l5.tex, l6.tex, s5.tex \newcommand{\dilbertcopy}{{[Dilbert image Copyright\copyright{1997} United Feature Syndicate, Inc., 
used with permission.]}} \newcommand{\Rnine}{\mbox{R}_9} \newcommand{\Rthree}{\mbox{R}_3} \newcommand{\eof}{{\Box}} \newcommand{\teof}{\mbox{$\Box$}}% for use in text \newcommand{\ta}{{\tt{a}}} \newcommand{\tb}{{\tt{b}}} %\newcommand{\dits}{dits} %\newcommand{\dit}{dit} \newcommand{\disc}{disk} \newcommand{\dits}{bans} \newcommand{\dit}{ban} % % used in l5 % \newcommand{\BSC}{binary symmetric channel} \newcommand{\BEC}{binary erasure channel} \newcommand{\subsubpunc}{}% change to . if subsubsections are given in-line headings % % convolutional code definitions % \newcommand{\cta}{t^{(a)}} \newcommand{\ctb}{t^{(b)}} \newcommand{\z}{z} \newcommand{\lfsr}{linear-feedback shift-register} % % definitions for including hinton diagrams from extended directory % \newcommand{\ecfig}[1]{\psfig{figure=extended/ps/#1.ps,silent=}} % extra argument \newcommand{\ecfigb}[2]{\psfig{figure=extended/ps/#1.ps,#2,silent=}} % % used in _s1 and in _linear maybe %%%%%%%%%%% see /home/mackay/code/bucky \newcommand{\buckypsfig}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in}}} \newcommand{\buckypsfigw}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in}}} \newcommand{\buckypsgraph}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.2in,angle=-90}}} \newcommand{\buckypsgraphb}[1]{\mbox{\psfig{figure=buckyps/#1,width=1.75in,angle=-90}}} \newcommand{\buckypsgraphB}[1]{\mbox{\psfig{figure=buckyps/#1,width=2.2in,angle=-90}}} %%%%%%%%%%%%%% %%%%%%%%%%%%%%%55 % for l1a %%%%%%%%%%%%%%%%%% % example % \bigrampicture{3.538mm}{hd_conbigram.ps} % \bigrampicture{3.538mm}{hd_conbigram.ps,width=278pt}%%%%%%% 278 is the original size % This used to work fine in latex209 then needed rejigging in 2e. 
% (alignment of g,j,p,q,y wrong at the bottom) (saved to graveyard.tex \newcommand{\bigrampicture}[3]%args are unitlength,picturename-and-picturesize,font-request {%%%%%%%%% \setlength{\unitlength}{#1} \begin{picture}(30,30)(0,-30)% was 28,28 0,-28 \put(0.15,-27.8){\makebox(0,0)[bl]{\psfig{figure=bigrams/#2,angle=-90}}} \put(1,-29){\makebox(0,0)[b]{{#3\tt a}}} \put(2,-29){\makebox(0,0)[b]{{#3\tt b}}} \put(3,-29){\makebox(0,0)[b]{{#3\tt c}}} \put(4,-29){\makebox(0,0)[b]{{#3\tt d}}} \put(5,-29){\makebox(0,0)[b]{{#3\tt e}}} \put(6,-29){\makebox(0,0)[b]{{#3\tt f}}} \put(7,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt g}}} \put(8,-29){\makebox(0,0)[b]{{#3\tt h}}} \put(9,-29){\makebox(0,0)[b]{{#3\tt i}}} \put(10,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt j}}} \put(11,-29){\makebox(0,0)[b]{{#3\tt k}}} \put(12,-29){\makebox(0,0)[b]{{#3\tt l}}} \put(13,-29){\makebox(0,0)[b]{{#3\tt m}}} \put(14,-29){\makebox(0,0)[b]{{#3\tt n}}} \put(15,-29){\makebox(0,0)[b]{{#3\tt o}}} \put(16,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt p}}} \put(17,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt q}}} \put(18,-29){\makebox(0,0)[b]{{#3\tt r}}} \put(19,-29){\makebox(0,0)[b]{{#3\tt s}}} \put(20,-29){\makebox(0,0)[b]{{#3\tt t}}} \put(21,-29){\makebox(0,0)[b]{{#3\tt u}}} \put(22,-29){\makebox(0,0)[b]{{#3\tt v}}} \put(23,-29){\makebox(0,0)[b]{{#3\tt w}}} \put(24,-29){\makebox(0,0)[b]{{#3\tt x}}} \put(25,-29){\makebox(0,0)[b]{\raisebox{0mm}[0mm][0mm]{#3\tt y}}} \put(26,-29){\makebox(0,0)[b]{{#3\tt z}}} \put(27,-29){\makebox(0,0)[b]{{#3--}}} % they used to be at height -29 and were aligned bottom %\put(27,-29){\makebox(0,0)[b]{{#3\verb+-+}}} % \put(29,-29){\makebox(0,0)[r]{#3$y$}} % \put(-0.2,-1){\makebox(0,0)[r]{{#3\tt a}}} \put(-0.2,-2){\makebox(0,0)[r]{{#3\tt b}}} \put(-0.2,-3){\makebox(0,0)[r]{{#3\tt c}}} \put(-0.2,-4){\makebox(0,0)[r]{{#3\tt d}}} \put(-0.2,-5){\makebox(0,0)[r]{{#3\tt e}}} \put(-0.2,-6){\makebox(0,0)[r]{{#3\tt f}}} 
\put(-0.2,-7){\makebox(0,0)[r]{{#3\tt g}}} \put(-0.2,-8){\makebox(0,0)[r]{{#3\tt h}}} \put(-0.2,-9){\makebox(0,0)[r]{{#3\tt i}}} \put(-0.2,-10){\makebox(0,0)[r]{{#3\tt j}}} \put(-0.2,-11){\makebox(0,0)[r]{{#3\tt k}}} \put(-0.2,-12){\makebox(0,0)[r]{{#3\tt l}}} \put(-0.2,-13){\makebox(0,0)[r]{{#3\tt m}}} \put(-0.2,-14){\makebox(0,0)[r]{{#3\tt n}}} \put(-0.2,-15){\makebox(0,0)[r]{{#3\tt o}}} \put(-0.2,-16){\makebox(0,0)[r]{{#3\tt p}}} \put(-0.2,-17){\makebox(0,0)[r]{{#3\tt q}}} \put(-0.2,-18){\makebox(0,0)[r]{{#3\tt r}}} \put(-0.2,-19){\makebox(0,0)[r]{{#3\tt s}}} \put(-0.2,-20){\makebox(0,0)[r]{{#3\tt t}}} \put(-0.2,-21){\makebox(0,0)[r]{{#3\tt u}}} \put(-0.2,-22){\makebox(0,0)[r]{{#3\tt v}}} \put(-0.2,-23){\makebox(0,0)[r]{{#3\tt w}}} \put(-0.2,-24){\makebox(0,0)[r]{{#3\tt x}}} \put(-0.2,-25){\makebox(0,0)[r]{{#3\tt y}}} \put(-0.2,-26){\makebox(0,0)[r]{{#3\tt z}}} \put(-0.2,-27){\makebox(0,0)[r]{{#3--}}} %\put(-0.2,-27){\makebox(0,0)[r]{{#3\verb+-+}}} \put(-0.2,1){\makebox(0,0)[r]{#3$x$}} \end{picture} } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % used in ch 1: \newcommand{\pB}{p_{\rm B}} \newcommand{\pb}{p_{\rm b}} % from theorems.tex for exact.tex \newcommand{\PGB}{p^{\rm G}_{\rm B}} \newcommand{\PGb}{p^{\rm G}_{\rm b}} \newcommand{\PB}{p_{\rm B}} \newcommand{\Pb}{p_{\rm b}} % % used in occam.tex (from nn_occam.tex) \newlength{\minch} \setlength{\minch}{0.82in} \newcommand{\ostruta}{\rule[-0.07\minch]{0cm}{0.18\minch}} \newcommand{\ostrutb}{\rule[-0.17\minch]{0cm}{0.14\minch}} % % sumproduct.tex \newcommand{\gP}{P^*} \newcommand{\xmwon}{\ensuremath{\bx_m \wo n}} \newcommand{\xmwonb}{\ensuremath{\bx_{m \wo n}}} % southeast.tex %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newcommand{\gridlet}[1]{\thinlines \multiput(#1)(0,-2){4}{\line(1,0){7.22}}% \multiput(#1)(2,0){4}{\line(0,-1){7.22}}} % \newcommand{\gridletfive}[1]{\thinlines \multiput(#1)(0,-2){5}{\line(1,0){9.22}}% \multiput(#1)(2,0){5}{\line(0,-1){9.22}}} % \newcommand{\piece}[1]{\put(#1){\circle*{0.872}}} 
\newcommand{\opiece}[1]{\put(#1){\circle{0.872}}} \newcommand{\movingpiece}[1]{% \thinlines \put(#1){\circle*{0.872}} \put(#1){\vector(0,-1){2}} \put(#1){\vector(1,0){2}} }%end movingpiece \newcommand{\lhnextposition}[2]{\hnextposition{#1} \put(#1){\makebox(0,0)[bl]{\raisebox{2mm}{#2}}}}% labelled horizontal arrow \newcommand{\ldnextposition}[2]{\dnextposition{#1} \put(#1){\makebox(0,0)[tl]{\raisebox{0mm}{#2}}}}% labelled horizontal arrow \newcommand{\hnextposition}[1]{\put(#1){\vector(1, 0){2}}} \newcommand{\vnextposition}[1]{\put(#1){\vector(0, -1){2}}} \newcommand{\dnextposition}[1]{\put(#1){\vector(-2,-1){4}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % dfountain \newcommand{\Ripple}{\ensuremath{S}} % deconvoln.tex \newcommand{\noisenu}{n} % for _s13.tex and one_neuron \newcommand{\hammingsymbol}[7]{\setlength{\unitlength}{1.4mm}% \begin{picture}(1,2)(0,0)% \ifnum #1=1 \put(0,2){\line(1,0){1}} \fi% \ifnum #2=1 \put(0,2){\line(0,-1){1}} \fi% \ifnum #3=1 \put(1,2){\line(0,-1){1}} \fi% \ifnum #4=1 \put(0,1){\line(1,0){1}} \fi% \ifnum #5=1 \put(0,1){\line(0,-1){1}} \fi% \ifnum #6=1 \put(1,1){\line(0,-1){1}} \fi% \ifnum #7=1 \put(0,0){\line(1,0){1}} \fi% \end{picture}% } \newcommand{\hammingdigit}[1]{% \ifnum #1=6 \hammingsymbol{0}{0}{0}{1}{0}{1}{1}\fi% \ifnum #1=14 \hammingsymbol{0}{0}{1}{0}{1}{1}{1}\fi% \ifnum #1=2 \hammingsymbol{0}{0}{1}{1}{1}{0}{0}\fi% \ifnum #1=1 \hammingsymbol{0}{1}{0}{0}{1}{1}{0}\fi% \ifnum #1=10 \hammingsymbol{0}{1}{0}{1}{1}{0}{1}\fi% \ifnum #1=12 \hammingsymbol{0}{1}{1}{0}{0}{0}{1}\fi% \ifnum #1=4 \hammingsymbol{0}{1}{1}{1}{0}{1}{0}\fi% \ifnum #1=11 \hammingsymbol{1}{0}{0}{0}{1}{0}{1}\fi% \ifnum #1=0 \hammingsymbol{1}{0}{0}{1}{1}{1}{0}\fi% \ifnum #1=7 \hammingsymbol{1}{0}{1}{0}{0}{1}{0}\fi% \ifnum #1=13 \hammingsymbol{1}{0}{1}{1}{0}{0}{1}\fi% \ifnum #1=5 \hammingsymbol{1}{1}{0}{0}{0}{1}{1}\fi% \ifnum #1=9 \hammingsymbol{1}{1}{0}{1}{0}{0}{0}\fi% \ifnum #1=3 \hammingsymbol{1}{1}{1}{0}{1}{0}{0}\fi% \ifnum #1=8 
\hammingsymbol{1}{1}{1}{1}{1}{1}{1}\fi% } % here in binary order. %6 &\hammingsymbol{0}{0}{0}{1}{0}{1}{1} \\ %14&\hammingsymbol{0}{0}{1}{0}{1}{1}{1} \\ %2 &\hammingsymbol{0}{0}{1}{1}{1}{0}{0} \\ %1 &\hammingsymbol{0}{1}{0}{0}{1}{1}{0} \\ %10&\hammingsymbol{0}{1}{0}{1}{1}{0}{1} \\ %12&\hammingsymbol{0}{1}{1}{0}{0}{0}{1} \\ %4 &\hammingsymbol{0}{1}{1}{1}{0}{1}{0} \\ %11&\hammingsymbol{1}{0}{0}{0}{1}{0}{1} \\ %0 &\hammingsymbol{1}{0}{0}{1}{1}{1}{0} \\ %7 &\hammingsymbol{1}{0}{1}{0}{0}{1}{0} \\ %13&\hammingsymbol{1}{0}{1}{1}{0}{0}{1} \\ %5 &\hammingsymbol{1}{1}{0}{0}{0}{1}{1} \\ %9 &\hammingsymbol{1}{1}{0}{1}{0}{0}{0} \\ %3 &\hammingsymbol{1}{1}{1}{0}{1}{0}{0} \\ %8 &\hammingsymbol{1}{1}{1}{1}{1}{1}{1} \\ \newcommand{\ldpcc}{low-density parity-check code} %\newcommand{\Ldpc}{Low-density parity-check}% defined elsewhere % included by l2.tex % definitions for weighings.tex and for text % shows weighing trees, ternary % % decisions of what to weigh are shown in square boxes with 126 over 345 (l:r) % state of valid hypotheses are listed in double boxes % or maybe dashboxes? 
% three arrows, up means left heavy, straight means right heavy, down is balance
%
\newcommand{\mysbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\mydbox}[3]{\put(#1){\framebox(#2){\begin{tabular}{c}#3\end{tabular}}}}
\newcommand{\myuvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mydvector}[3]{\put(#1){\vector(#2){#3}}}
\newcommand{\mysvector}[2]{\put(#1){\vector(1,0){#2}}}
\newcommand{\mythreevector}[4]{\myuvector{#1}{#2,#3}{#4}\mydvector{#1}{#2,-#3}{#4}\mysvector{#1}{#4}}
%
%\newcommand{\h1}{\mbox{$1^+$}}
%\newcommand{\l1}{\mbox{$1^-$}}
%\newcommand{\h2}{\mbox{$2^+$}}
%\newcommand{\l2}{\mbox{$2^-$}}
%\newcommand{\h3}{\mbox{$3^+$}}
%\newcommand{\l3}{\mbox{$3^-$}}
%\newcommand{\h4}{\mbox{$4^+$}}
%\newcommand{\l4}{\mbox{$4^-$}}
%\newcommand{\h5}{\mbox{$5^+$}}
%\newcommand{\l5}{\mbox{$5^-$}}
%\newcommand{\h6}{\mbox{$6^+$}}
%\newcommand{\l6}{\mbox{$6^-$}}
%\newcommand{\h7}{\mbox{$7^+$}}
%\newcommand{\l7}{\mbox{$7^-$}}
%\newcommand{\h8}{\mbox{$8^+$}}
%\newcommand{\l8}{\mbox{$8^-$}}
%\newcommand{\h9}{\mbox{$9^+$}}
%\newcommand{\l9}{\mbox{$9^-$}}
%\newcommand{\h10}{\mbox{$10^+$}}
%\newcommand{\l10}{\mbox{$10^-$}}
%\newcommand{\h11}{\mbox{$11^+$}}
%\newcommand{\l11}{\mbox{$11^-$}}
%\newcommand{\h12}{\mbox{$12^+$}}
%\newcommand{\l12}{\mbox{$12^-$}}
%\setlength{\parindent}{0mm}
\title{Information Theory, Inference, \& Learning Algorithms}
\shortlecturetitle{}
\shortauthor{David J.C. MacKay}
% the book - called by book.tex
%
% aiming for 696 pages total
%
% thebook.tex
% should run
% make book.ind
% by hand?
% Mon 7/10/02
\setcounter{exercise_number}{1} % set to imminent value
%
\setcounter{secnumdepth}{1} % sets the level at which subsection numbering stops
\setcounter{tocdepth}{0}
\newcommand{\mysetcounter}[2]{}%was {\setcounter{#1}{#2}}
% useful for forcing pagenumbers in drafts
%\setcounter{tocdepth}{1}
\renewcommand{\bs}{{\bf s}}
\newcommand{\figs}{/home/mackay/handbook/figs} % while in bayes chapter
% \addtocounter{page}{-1}
\pagenumbering{roman}
\setcounter{page}{2} % set to current value
\setcounter{frompage}{2}% this is used by newcommands1.tex dvips operator that helps make
\setcounter{page}{1} % set to current value
\setcounter{frompage}{1}% this is used by newcommands1.tex dvips operator that helps make
% individual chapters.
%
% PAGE ii
%
% \chapter*{Dedication}
%\input{tex/dedicationa.tex}
%\newpage
%
% TITLE PAGE iii
%
\thispagestyle{empty}
\begin{narrow}{0in}{-\margindistancefudge}%
\begin{raggedleft}
~\\[1.15in]
{\Large \bf Information Theory, Inference, and Learning Algorithms\\[1in]
}
{\Large\sf David J.C. MacKay }\\
\end{raggedleft}
\vfill
\mbox{}\epsfxsize=160pt\epsfbox{cuplogo.eps}% increased x size to compensate for 0.9 shrinkage later and another 10%
% \mbox{}\epsfxsize=128pt\epsfbox{cuplogo.eps}
\vspace*{-6pt}
\end{narrow}
\newpage
\thispagestyle{empty}
\begin{center}
~\\[1.5in]
{\Huge \bf Information Theory, \\[0.2in]
Inference,\\[0.2in]
and Learning Algorithms\\[1in]
}
{\Large\sf David J.C. MacKay }\\
{\tt{mackay@mrao.cam.ac.uk}}\\[0.3in]
\copyright 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005\\[0.1in]
\copyright Cambridge University Press 2003\\[1.3in]
Version \thedraft\ (fourth printing) \today\\
\medskip
\medskip
\medskip
\medskip
\medskip

Please send feedback on this book via
{\tt{http://www.inference.phy.cam.ac.uk/mackay/itila/}}

\medskip
\medskip
\medskip

Version 6.0 of this book was published by C.U.P.\ in September 2003.
It will remain viewable on-screen on the above website, in postscript, djvu, and pdf formats.

\medskip \medskip In the second printing (version 6.6) minor typos were corrected, and the book design was slightly altered to modify the placement of section numbers. \medskip \medskip In the third printing (version 7.0) minor typos were corrected, and chapter 8 was renamed `Dependent random variables' (instead of `Correlated'). \medskip \medskip In the fourth printing (version 7.2) minor typos were corrected. \medskip \medskip \medskip {\em (C.U.P. replace this page with their own page ii.)} \end{center} %\dvipsb{frontpage} \newpage % choose one of these: % \input{cambridgefrontstuff.tex} % \newpage % {\em Page vi intentionally left blank.} % \newpage % pages v and vi pages vii and viii \mytableofcontents \dvipsb{table of contents} % alternate %\fakesection{Roadmap} %\input{roadmap.tex} % \subchaptercontents{Preface}%{How to Use This Book}% use subchapter because this % marks the chapter name in the header, unlike chapter*{} % \section*{How to use this book} %{\em [This front matter is still being written. The remainder of the book is essentially finished, % except for typographical corrections, April 18th 2003.]} % % a longer version of this is in % longabout.tex % \section*{How to use this book} % \section{How to use this book} % The first question we must address is: This book is aimed at senior undergraduates and graduate students in Engineering, Science, Mathematics, and Computing. It expects familiarity with calculus, probability theory, and linear algebra as taught in a first- or second-year undergraduate course on mathematics for scientists and engineers. Conventional courses on information theory cover not only the beautiful {\em theoretical\/} ideas of Shannon, but also {\em practical\/} solutions to \ind{communication} problems. This book goes further, bringing in Bayesian data modelling, Monte Carlo methods, variational methods, clustering algorithms, and neural networks. Why unify information theory and machine learning? 
% Well, Because they % Information theory and % machine learning are two sides of the same coin. % , so it makes sense to unify them. % These two fields were once unified: % It was once so: In the 1960s, a single field, cybernetics, was populated by information theorists, computer scientists, and neuroscientists, all studying common problems. Information theory and machine learning still belong together. Brains are the ultimate compression and \ind{communication} systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning. % Our brains are surely the ultimate in robust % error-correcting information storage and recall systems. \section*{How to use this book} The essential dependencies between chapters are indicated in the figure on the next page. An arrow from one chapter to another indicates that the second chapter requires some of the first. %\section*{General points} % The pinnacles of the book, the key chapters with the really exciting bits, % are first \chref{chone} (in which we meet Shannon's noisy-channel coding theorem); % \chref{ch.six} (in which we prove it); \chref{ch.hopfield} (in which % we meet a neural network that performs robust error-correcting % content-addressable memory); and Chapters \ref{ch.ldpcc} and \ref{chdfountain} % (in which we meet beautifully simple sparse-graph codes that solve % Shannon's communication problem). %% honorable mention - \chref{ch.ac}, ch.ra /////\ exact sampling - not central. % Do not feel daunted by this book. % You don't need to read all of this book. Within {\partnoun}s \datapart, \noisypart, \probpart, and \netpart\ of this book, chapters on advanced or optional topics are towards the end. % For example, \chref{ch.codesforintegers} (Codes for Integers), \chref{ch.xword} (Crosswords and Codebreaking) % and \chref{ch.sex} (Why have Sex? Information Acquisition and Evolution) % are provided for fun. 
All chapters of {\partnoun} \finfopart\ are optional on a first reading, except perhaps for \chref{ch.message} (Message Passing). The same system sometimes applies within a chapter: the final sections often deal with advanced topics that can be skipped on a first reading. For example, in two key chapters -- \chref{chtwo} ({The Source Coding Theorem}) and \chref{ch.six} ({The Noisy-Channel Coding Theorem}) -- the first-time reader should detour at \secref{sec.chtwoproof} and \secref{sec.ch6stop}, respectively. % \subsection*{Roadmaps} Pages \pageref{map1}--\pageref{map4} show a few ways to use this book. First, I give the roadmap for a course that I teach in Cambridge: % which embraces both information theory and machine learning. `Information theory, pattern recognition, and neural networks'. % The book is also intended as a textbook for traditional courses in information theory. The second roadmap shows the chapters for an introductory information theory course and the third for a course aimed at an understanding of state-of-the-art error-correcting codes. % The fourth roadmap shows how to use the text in a conventional course on machine learning. % The diagrams on the following pages will indicate % the dependences between chapters and % a few possible routes through the book. \newpage \begin{center}\hspace*{-0.2cm}\raisebox{2cm}{\epsfbox{metapost/roadmap.2}}\end{center} \newpage % \input{tex/cambroadmap.tex} % \newpage \begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.3}}\end{center} \label{map1} \newpage \begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.4}}\end{center} \newpage \begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.5}}\end{center} \newpage \begin{center}\raisebox{2cm}{\epsfbox{metapost/roadmap.6}}\end{center} \label{map4} \newpage \section*{About the exercises} % I firmly believe that You can understand a subject only by creating it for yourself. 
% To this end, you should % I think it is essential to The exercises play an essential role in this book. % on each topic. For guidance, each % exercise has a rating (similar to that used by \citeasnoun{KnuthAll}) from 1 to 5 to indicate its difficulty. \noindent\ratfull\hspace*{\parindent}In addition, exercises that are especially recommended are marked by a marginal encouraging rat. Some exercises that require the use of a computer are marked with a {\sl C}. % will have % a rating such as A1, A5, C1 or C5. % The letter indicates how important I think the exercise is: % A = very important $\ldots$ C = not essential to the flow of the % book. The number indicates the difficulty of the problem: % 1 = easy, 5 = research project. % I'll circulate detailed recommendations on exercises % as the course progresses. Answers to many exercises are provided. Use them wisely. Where a solution is provided, this is indicated by including its page number % of the solution with alongside the difficulty rating. Solutions to many of the other exercises will be supplied to instructors using this book in their teaching; please email {\tt{solutions@cambridge.org}}. 
%\begin{table}[htbp] %\caption[a] \begin{realcenter} \fbox{ \begin{tabular}{ll} %\begin{minipage}{3in} {\sf Summary of codes for exercises}\\[0.2in] % \hspace{0.2in} \begin{tabular}[b]{cl} \dorat & Especially recommended \\[0.2in] {\ensuremath{\triangleright}} & Recommended \\ {\sl C} & Parts require a computer \\ {\rm [p.$\,$42]}& Solution provided on page 42 \\ \end{tabular} %\end{minipage} & \begin{tabular}[b]{cl} \pdifficulty{1} & Simple (one minute) \\ \pdifficulty{2} & Medium (quarter hour) \\ \pdifficulty{3} & Moderately hard \\ \pdifficulty{4} & Hard \\ \pdifficulty{5} & Research project \\[0.2in] \end{tabular} \\ \end{tabular} } \end{realcenter} %\end{table} \section*{Internet resources} The website \begin{realcenter} {\tt{http://www.inference.phy.cam.ac.uk/mackay/itila}} \end{realcenter} contains several resources: \ben \item {\em Software}. Teaching software that I use in lectures,\index{software} interactive software, and research software, written in {\tt{perl}}, {\tt{octave}}, {{\tt{tcl}}}, {\tt{C}}, and {\tt{gnuplot}}. Also some animations. \item {\em Corrections to the book}. Thank you in advance for emailing these! \item {\em This book}. The book is provided in {\tt{postscript}}, {\tt{pdf}}, and {\tt{djvu}} formats for on-screen viewing. The same copyright restrictions apply as to a normal book. % \item % {\em Further worked solutions to some exercises}. % If you would like to send in your own solutions for inclusion, % please do. \een % {\em (I aim to add a table of software resources here.)} \section*{About this edition} This is the fourth printing of the first edition. In the second printing, % a small number of typographical errors were corrected, % and the design of the book was altered slightly. % to allow a slightly larger font size. 
Page-numbering generally remained unchanged, % consistent between the two printings, except in chapters 1, 6, and 28, where % with the exception of pages 7 to 13, where % among which a few paragraphs, figures, and equations moved around. % on which text, figures, and equations have all been slightly rearranged. All equation, section, and exercise numbers were unchanged. In the third printing, chapter 8 was renamed `Dependent Random Variables', instead of `Correlated', which was sloppy. % BEWARE, _RNGaussian.tex had to be changed for the asides. %\input{tex/thirdprint.tex}% about %\input{tex/secondprint.tex}% about the second printing \section*{Acknowledgments} %\chapter*{Acknowledgments} I am most grateful to the organizations who have supported me while this book gestated: the Royal Society and Darwin College who gave me a fantastic research fellowship in the early years; the University of Cambridge; the Keck Centre at the University of California in San Francisco, where I spent a productive sabbatical; % (and failed to finish the book); and the Gatsby Charitable Foundation, whose support gave me the freedom to break out of the Escher staircase that book-writing had become. My work has depended on the generosity of free software authors.\index{software!free}\index{Knuth, Donald} I wrote the book in \LaTeXe. Three cheers for Donald Knuth and Leslie Lamport! %\nocite{latex} Our computers run the GNU/Linux operating system. I use {\tt{emacs}}, {\tt{perl}}, and {\tt{gnuplot}} every day. Thank you Richard Stallman, thank you Linus Torvalds, thank you everyone. % I thank David Tranah of Cambridge University Press for his editorial support. % ``cut, it's my job'' Many readers, too numerous to name here, have given feedback on the book, and to them all I extend my sincere acknowledgments. % I especially wish to thank all the students and colleagues at Cambridge University who have attended my lectures on information theory and machine learning over the last nine years. 
% Without their enthusiasm and criticism, this book would surely The members of the Inference research group have given immense support, and I thank them all for their generosity and patience over the last ten years: Mark Gibbs, Michelle Povinelli, Simon Wilson, Coryn Bailer-Jones, Matthew Davey, Katriona Macphee, James Miskin, David Ward, Edward Ratzer, Seb Wills, John Barry, John Winn, Phil Cowans, Hanna Wallach, Matthew Garrett, and especially Sanjoy Mahajan. Thank you too to Graeme Mitchison, Mike Cates, and Davin Yap. Finally I would like to express my debt to my personal heroes, the mentors from whom I have learned so much: Yaser Abu-Mostafa, Andrew Blake, John Bridle, Peter Cheeseman, Steve Gull, Geoff Hinton, John Hopfield, Steve Luttrell, Robert MacKay, Bob McEliece, Radford Neal, Roger Sewell, and John Skilling. %%%%%%%%%%%%%% %\chapter*{Dedication} %\vspace*{80pt} \vfill \begin{center} \rule{\textwidth}{1pt} \par \vskip 18pt { \huge \sl {Dedication} } \par %\end{center} \nobreak \vskip 40pt %\begin{center} This book is dedicated to the campaign against the arms trade.\\[0.3in] % % Their web page is % , as overburdened with animated images as the world is with weapons, is here: %\verb+http://www.caat.demon.co.uk/+\\[0.6in] \verb+www.caat.org.uk+\\[0.6in] \end{center} \begin{quote} \begin{raggedleft} Peace cannot be kept by force.\\ It can only be achieved % by understanding. % Peace cannot be achieved through violence, it can only be attained through understanding.\\ \hfill -- {\em Albert Einstein}\\ \end{raggedleft} \end{quote} \vspace*{2pt} \rule{\textwidth}{1pt} \par % Two things are infinite: the universe and human stupidity; and I'm not sure % about the the universe. %The important thing is not to stop questioning. Curiosity has its own reason for % existing. %Any intelligent fool can make things bigger, more complex, and more violent. It % takes a touch of genius -- and a lot of courage -- to move in the opposite % direction. 
% \input{extrafrontstuff.tex}% aims dedication, about the author, etc % see also tex/oldaims.tex % for some good stuff. % and tex/typicalreaders.tex % %% \input{tex/overview2001.tex} %\dvipsb{preface} \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\setcounter{page}{0} % set to current value %Fake page % added to get draft.dps to look right %\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \pagenumbering{arabic} \prechapter{About Chapter} \setcounter{page}{1} % set to current value \label{pch.one} % % pre-chapter 1 % \fakesection{Before ch 1} In the first chapter, you will need to be familiar with the \ind{binomial distribution}. % , reviewed below. And to solve the exercises in the text -- which I urge you to do -- you will need to know {\dem\ind{Stirling's approximation}\/}\index{approximation!Stirling} for the factorial function, $%\beq x! \simeq x^{x} \, e^{-x} $, and be able to apply it to ${{N}\choose{r}} = \smallfrac{N!}{(N-r)!\,r!}$.\marginpar{\small\raggedright\reducedlead{Unfamiliar notation?\\ See \appref{app.notation}, \pref{app.notation}.}} % $x!$ These topics are reviewed below. \subsection*{The binomial distribution} \label{sec.first.binomial} \exampl{ex.binomial}{ A \ind{bent coin}\index{coin} has probability $f$ of coming up heads. The coin is tossed $N$ times. What is the probability distribution of the number of heads, $r$? What are the \ind{mean} and \ind{variance} of $r$? 
} \amarginfig{t}{% \begin{tabular}{r} % $P(r\given f,N)$\\ \mbox{\psfig{figure=bigrams/urn.f.g.ps,angle=-90,width=1.51in}}% %\\ %\mbox{\psfig{figure=bigrams/urn.f.l.ps,angle=-90,width=1.64in}}% \\[-0.1in] \multicolumn{1}{c}{\small$r$} \\ \end{tabular} %}{% \caption[a]{The binomial distribution $P(r \given f\eq 0.3,\,N \eq 10)$.} % , on a linear scale (top) and a logarithmic scale (bottom).} \label{fig.binomial} } % see bigrams/README \noindent %\begin{Sexample}{ex.binomial} {\sf Solution\colonspace} \label{sec.first.binomial.sol} The number of heads has a binomial distribution. \beq P(r \given f,N) = {N \choose r} f^{r} (1-f)^{N-r} . \eeq The mean, $\Exp [ r ]$, and variance, $\var[r]$, of this distribution are defined by \beq \Exp [ r ] \equiv \sum_{r=0}^{N} P(r\given f,N) \, r \label{eq.mean.def} \eeq \beqan \var[r] & \equiv & \Exp \left[ \left( r - \Exp [ r ] \right)^2 \right] \\ & = & \Exp [ r^2 ] - \left( \Exp [ r ] \right)^2 = \sum_{r=0}^{N} P(r\given f,N) r^2 - \left( \Exp [ r ] \right)^2 . \label{eq.var.sum} \eeqan % Rather than evaluating the sums over $r$ in (\ref{eq.mean.def}) and (\ref{eq.var.sum}) directly, it is easiest to obtain the mean and variance by noting that $r$ is the sum of $N$ {\em independent\/} % , identically distributed random variables, namely, the number of heads in the first toss (which is either zero or one), the number of heads in the second toss, and so forth. In general, \beq \begin{array}{rcll} \Exp [ x + y ] &=& \Exp [ x ] + \Exp [ y ] & \mbox{for any random variables $x$ and $y$}; \\ \var [ x + y ] &=& \var [ x ] + \var [ y ] & \mbox{if $x$ and $y$ are independent}. \end{array} \eeq So the mean of $r$ is the sum of the means of those random variables, and the variance of $r$ is the sum of their variances.\index{variances add} % its mean and variance are given by adding the means and variances % of those random variables, respectively. 
The mean number of heads in a single toss is $f\times 1 + (1-f)\times 0 = f$, and the variance of the number of heads in a single toss is \beq \left[ f\times 1^2 + (1-f)\times 0^2 \right] - f^2 = f - f^2 = f(1-f), \eeq so the mean and variance of $r$ are: \beq \Exp [ r ] = N f %\eeq\beq \hspace{0.35in} \mbox{and} \hspace{0.35in} \var[r] = N f (1-f) . \hspace{0.35in}\epfsymbol\hspace{-0.35in} \eeq %\end{Sexample} % ADD END PROOF SYMBOL HERE !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! \subsection*{Approximating $x!$ and ${{N}\choose{r}}$} \amarginfig{t}{% \begin{tabular}{r} \mbox{\psfig{figure=bigrams/poisson.g.ps,angle=-90,width=1.5in}}% %\\ %\mbox{\psfig{figure=bigrams/poisson.l.ps,angle=-90,width=1.64in}}% \\[-0.1in] \multicolumn{1}{c}{\small$r$} \\ \end{tabular} %}{% \caption[a]{The Poisson distribution $P(r\,|\,\l\eq 15)$.} % , on a linear scale (top) and a logarithmic scale (bottom).} \label{fig.poisson} } % see bigrams/README \label{sec.poisson} % FAVOURITE BIT \noindent Let's derive Stirling's approximation by an unconventional route. We start from the \ind{Poisson distribution} with mean $\l$, \beq P( r \given \l ) = e^{-\l} \frac{\l^r}{r!} \:\:\:\: \:\: r\in \{ 0,1,2,\ldots\} . \label{eq.poisson} \eeq % % \noindent For large $\l$, this distribution is well approximated -- at least\index{approximation!by Gaussian} in the vicinity of $r \simeq \l$ -- by a \ind{Gaussian distribution} with mean $\l$ and variance $\l$: % So, \beq e^{-\l} \frac{\l^r}{r!} \,\simeq\, \frac{1}{\sqrt{2\pi \l}} \, e^{{ -\smallfrac{(r-\l)^2}{2\l}}} . \eeq Let's plug $r=\l$ into this formula, then rearrange it.\label{sec.stirling} \beqan e^{-\l} \frac{\l^{\l}}{\l!} &\simeq& \frac{1}{\sqrt{2\pi \l}} \\ \Rightarrow \ \ \l! &\simeq& \l^{\l} \, e^{-\l} \sqrt{2\pi \l} . \eeqan This is {Stirling's approximation} for the \ind{factorial} function. \beq x! \,\simeq\, x^{x} \, e^{-x} \sqrt{2\pi x} \:\:\:\Leftrightarrow\:\:\: \ln x! \,\simeq\, x \ln x - x + {\textstyle\frac{1}{2}} \ln {2\pi x} . 
\label{eq.stirling} \eeq We have derived not only the leading order behaviour, $x! \simeq x^{x} \, e^{-x}$, but also, at no cost, the next-order correction term $\sqrt{2\pi x}$. % We now apply Stirling's approximation % the approximation %$%\beq % x! \simeq x^{x} \, e^{-x} $ to\index{combination} $%\beq \ln {{N}\choose{r}} $:%\eeq \beqan \ln {{N}\choose{r}} \,\equiv\, \ln \frac{N!}{(N-r)!\,r!} % & \simeq & % N [ \ln N - 1 ] - (N-r) [ \ln (N-r) - 1 ] - r [ \ln r - 1 ] %\\ & \simeq & (N-r) \ln\frac{N}{N-r} + r \ln\frac{N}{r} . \label{eq.choose.approx} \eeqan Since all the terms in this equation are logarithms, this result can be rewritten in any base.\marginpar{\small Recall that $\displaystyle{ \log_2 x = \frac{ \log_e x }{ \log_e 2} }$.\\[0.03in] Note that $\displaystyle\frac{\partial \log_2 x }{\partial x} = \frac{1}{\log_e 2}\,\frac{1}{x}$. } %\fakesubsection*{My rule about log and ln}\index{conventions!logarithms} We will denote\index{notation!logarithms} natural logarithms ($\log_e$) by `ln', and \ind{logarithms} to base 2 ($\log_2$) by `$\log$'. If we introduce the {\dbf\ind{binary entropy function}}, \beq H_2(x) \equiv x \log \frac{1}{x} + (1\! -\! x) \log \frac{1}{(1\! -\! x)} , \eeq then we can rewrite the approximation (\ref{eq.choose.approx}) %\beq %$ \log {{N}\choose{r}} % \simeq (N-r) \log \frac{N}{N-r} + r \log \frac{N}{r} %$ %\eeq as \amarginfig{t}{\small% \begin{center} \mbox{ \hspace{-6mm} % \hspace{6.2mm} \raisebox{\hpheight}{$H_2(x)$} % to put H at left: \hspace{-7.5mm} % \hspace{-20mm} \mbox{\psfig{figure=figs/H2.ps,% width=42mm,angle=-90}}$x$ } % see also H2p.tex \end{center} \caption[a]{The binary \ind{entropy} function.} % $H_2(x)$.} \label{fig.h2x} } \beq \log {{N}\choose{r}} \, \simeq \, N H_2(r/N) , \label{eq.stirling.choose.l} \eeq or, equivalently, % \:\:\:\Leftrightarrow\:\:\: \beq {{N}\choose{r}} \, \simeq \, 2^{N H_2(r/N)} . 
\label{eq.stirling.choose} \eeq If we need a more accurate approximation, we can include terms of the next order from Stirling's approximation (\ref{eq.stirling}): \beq \log {{N}\choose{r}} \,\simeq\, N H_2(r/N) - {\textstyle\frac{1}{2}} \log \left[ {2\pi N \, \frac{N\!-\!r}{N} \, \frac{r}{N}} \right] . \label{eq.H2approxaccurate} \eeq % % - {\textstyle\frac{1}{2}} \ln {2\pi N} % + {\textstyle\frac{1}{2}} \ln {2\pi N-r} % + {\textstyle\frac{1}{2}} \ln {2\pi r} % % ln += {\textstyle\frac{1}{2}} \ln {2\pi (N-r)(r)/N} % log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi (N-r)(r)/N} % or % log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi N} % + {\textstyle\frac{1}{2}} \log_2 {\frac{(N-r)}{N}\frac{r}{N}} % log_2 += {\textstyle\frac{1}{2}} \log_2 {2\pi \frac{(N-r)}{N}\frac{r}{N} N} \ENDprechapter \chapter{Introduction to Information Theory} \label{ch.one} \label{chone} % % \part{Information Theory} % \chapter{Introduction to Information Theory} \label{ch1} %\section{Communication over noisy channels} % One of the principal questions addressed by information theory is % Shannon's ground-breaking paper on `The Mathematical Theory of % Communication' opens thus: \begin{quotation} \noindent The fundamental problem of \ind{communication} is that of reproducing at one point either exactly or approximately a message selected at another point. \\ \mbox{~} \hfill {\em (Claude Shannon, 1948)}\index{Shannon, Claude} \\ % \end{quotation} \noindent In the first half of this book we %are going to study how to measure information content; we % are going to % learn by how much data from a given source % can be compressed; we % are going to learn how % , practically, to % achieve data compression; to compress data; and we % are going to learn how to communicate perfectly over imperfect communication channels. We start by getting a feeling for this last problem. 
\section[How can we achieve perfect communication?]{How can we achieve perfect communication over an imperfect, noisy communication channel?} Some examples of noisy communication channels are: \bit \item an analogue telephone line,\marginpar{\footnotesize \setlength{\unitlength}{1mm}% \begin{picture}(45,10)(0,5) \put(0,10){\makebox(0,0)[l]{\shortstack{modem}}} \put(21,10){\makebox(0,0)[l]{\shortstack{phone\\line}}} \put(39,10){\makebox(0,0)[l]{\shortstack{modem}}} \put(15,10){\vector(1,0){3}} \put(32,10){\vector(1,0){3}} \end{picture} } over which two modems communicate digital information; \item the radio communication link from Galileo,\marginpar{\footnotesize \setlength{\unitlength}{1mm}% \begin{picture}(45,10)(0,5) \put(0,10){\makebox(0,0)[l]{\shortstack{Galileo}}} \put(21,10){\makebox(0,0)[l]{\shortstack{radio\\waves}}} \put(39,10){\makebox(0,0)[l]{\shortstack{Earth}}} \put(15,10){\vector(1,0){3}} \put(32,10){\vector(1,0){3}} \end{picture} } the Jupiter-orbiting spacecraft, to earth; \item \marginpar[c]{\footnotesize \setlength{\unitlength}{1mm}% \begin{picture}(30,20)(0,0) \put(0,10){\makebox(0,0)[l]{\shortstack{parent\\cell}}} \put(16,2){\makebox(0,0)[l]{\shortstack{daughter\\cell}}} \put(16,16){\makebox(0,0)[l]{\shortstack{daughter\\cell}}} \put(10,10){\vector(1,1){5}} \put(10,10){\vector(1,-1){5}} \end{picture} }reproducing cells, in which the daughter cells' \ind{DNA} contains information from the parent % cell or cells; \item \marginpar{\footnotesize \setlength{\unitlength}{1mm}% \begin{picture}(45,10)(0,5) \put(0,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}} \put(20,10){\makebox(0,0)[l]{\shortstack{\disc\\drive}}} \put(33,10){\makebox(0,0)[l]{\shortstack{computer\\ memory}}} \put(15,10){\vector(1,0){3}} \put(29,10){\vector(1,0){3}} \end{picture} }a \disc{} drive. \eit The last example shows that \ind{communication} doesn't have to involve information going from one {\em place\/} to another. 
When we write a file on a \disc{} drive, we'll % typically read it off % again in the same location -- but at a later {\em time}. These channels are noisy.\index{noise}\index{channel!noisy} A telephone line suffers from cross-talk with other lines; the hardware in the line distorts and adds noise to the transmitted signal. The deep space network that listens to Galileo's puny transmitter % fairy-bulb power receives background radiation from terrestrial and cosmic sources. DNA is subject to mutations and damage. A \ind{disk drive}, which writes a binary digit (a one or zero, also known as a {\dbf\ind{bit}}) by aligning a patch of magnetic material in one of two orientations, may later % , with some probability, fail to read out the stored binary digit: % that was stored the patch of material might spontaneously flip magnetization, or a glitch of background noise might cause the reading circuit to report the wrong value for the binary digit, or the writing head might not induce the magnetization in the first place because of interference from neighbouring bits. In all these cases, if we transmit data, \eg, a string of bits, over the channel, there is some probability that the received message will not be identical to the transmitted message. % And in all cases, We would prefer to have a communication channel for which this probability was zero -- or so close to zero that for practical purposes it is indistinguishable from zero. Let's consider % the example of a noisy \disc{} drive % having the property that transmits each bit correctly % transmitted with probability $(1\!-\!f)$ and incorrectly with probability $f$. This model % favourite communication channel\index{channel!binary symmetric} is known as the {\dbf{\ind{binary symmetric channel}}} (\figref{fig.bsc1}). 
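To get a feeling for what a per-bit error probability means for a whole message, here is a rough worked example; the block length $N = 8$ and the noise level $f = 0.1$ are values assumed purely for illustration, and the flips are assumed to be independent from bit to bit. The probability that all $N$ bits of a block are received correctly is
\beq
 P(\mbox{no flips}) = (1-f)^{N} = 0.9^{8} \simeq 0.43 ,
\eeq
so under these assumptions the received block differs from the transmitted block with probability about $0.57$.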
\begin{figure}[htbp] \figuremargin{% \[ \begin{array}{c} \setlength{\unitlength}{0.46mm} \begin{picture}(30,20)(-5,0) \put(-4,9){{\makebox(0,0)[r]{$x$}}} \put(5,2){\vector(1,0){10}} \put(5,16){\vector(1,0){10}} \put(5,4){\vector(1,1){10}} \put(5,14){\vector(1,-1){10}} \put(4,2){\makebox(0,0)[r]{1}} \put(4,16){\makebox(0,0)[r]{0}} \put(16,2){\makebox(0,0)[l]{1}} \put(16,16){\makebox(0,0)[l]{0}} \put(24,9){{\makebox(0,0)[l]{$y$}}} \end{picture} \end{array} \:\:\: \begin{array}{ccl}%%%%% {c@{}c@{}l} %%%%% (for twocolumn style) P(y\eq 0 \given x\eq 0) &= & 1 - \q ; \\ P(y\eq 1 \given x\eq 0) &= & \q ; \end{array} \begin{array}{ccl} P(y\eq 0 \given x\eq 1) &= & \q ; \\ P(y\eq 1 \given x\eq 1) &= & 1 - \q . \end{array} \] }{% \caption[a]{The binary symmetric channel. The transmitted symbol is $x$ and the received symbol $y$. The noise level, the probability % of a bit's being that a bit is flipped, is $f$.} %% (Unfamiliar notation? See %% \appref{app.notation}, \pref{app.notation}, %% and \pref{sec.conditional.def}.)} \label{fig.bsc1} }% \end{figure} \begin{figure}[htbp] \figuremargin{% \begin{mycenter} \begin{tabular}{rcl} \psfig{figure=bitmaps/dilbert.ps,width=1.2in} &\hspace{0.1in}% \raisebox{0.22in}{% \setlength{\unitlength}{1.2mm}% \begin{picture}(20,20)(0,0)% \put(10,1){\makebox(0,0)[t]{$(1-f)$}} \put(10,17){\makebox(0,0)[b]{$(1-f)$}} \put(12,9.5){\makebox(0,0)[l]{$f$}} % \put(10,16.5){\makebox(0,0)[b]{$(1-f)$}} \put(5,2){\vector(1,0){10}} \put(5,16){\vector(1,0){10}} \put(5,4){\vector(1,1){10}} \put(5,14){\vector(1,-1){10}} \put(4,2){\makebox(0,0)[r]{{1}}} \put(4,16){\makebox(0,0)[r]{{0}}} \put(16,2){\makebox(0,0)[l]{{1}}} \put(16,16){\makebox(0,0)[l]{{0}}} \end{picture}% }% \hspace{0.385in}& \psfig{figure=_is/10000.10.ps,width=1.2in} \\ % & & \makebox[0in][l]{\large 10\% of bits are flipped} \\ \end{tabular} \end{mycenter} }{% \caption[a]{A binary data sequence of length $10\,000$ transmitted over a binary symmetric channel with noise level $f=0.1$. 
\dilbertcopy} \label{fig.bsc.dil} }% \end{figure} \noindent As an example, % For the sake of argument, let's imagine that $f=0.1$, that is, ten \percent\ of the bits are flipped (figure \ref{fig.bsc.dil}). % For a \disc{} drive to be useful, we would prefer that it should % flip no bits at all in its entire lifetime. A useful \disc{} drive would flip no bits at all in its entire lifetime. % If we expect to read and write a gigabyte per day for ten years, we require a bit error probability of the order of $10^{-15}$, or smaller. There are two approaches to this goal. \subsection{The physical solution} The physical solution is to improve the physical characteristics of the communication channel to reduce its error probability. We could improve our \disc{} drive by % , for example, \ben \item using more reliable components in its circuitry; \item evacuating the air from the \disc{} enclosure so as to eliminate the turbulence that perturbs the reading head from the track; \item using a larger magnetic patch to represent each bit; or \item using higher-power signals or cooling the circuitry in order to reduce thermal noise. \een These physical modifications typically increase the cost of the communication channel. 
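
To see how demanding this goal is, here is a rough check on the figure of $10^{-15}$ quoted above. It assumes, for illustration only, that a gigabyte is $8 \times 10^{9}$ bits, that ten years is about $3650$ days, and that reading and writing each count as one pass through the channel. The number of bits handled in the drive's lifetime is then
\beq
 2 \times \left( 8 \times 10^{9} \right) \times 3650 \simeq 6 \times 10^{13} .
\eeq
With a bit error probability of $10^{-15}$, the expected number of errors in the lifetime is about $6 \times 10^{13} \times 10^{-15} = 0.06$, comfortably less than one; with $10^{-13}$ it would be about six.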
% unit of area making the \disc{} spin at a slower rate % % the system solution % \begin{figure}%[htbp] \figuremargin{% \setlength{\unitlength}{1.25mm} \begin{mycenter} \begin{picture}(50,40)(-10,5) \put(0,5){\framebox(25,10){\begin{tabular}{c}Noisy\\ channel\end{tabular}}} \put(-20,20){\framebox(25,10){\begin{tabular}{c}Encoder\end{tabular}}} \put(20,20){\framebox(25,10){\begin{tabular}{c}Decoder\end{tabular}}} %\put(-20,40){\framebox(25,10){\begin{tabular}{c}Compressor\end{tabular}}} %\put(20,40){\framebox(25,10){\begin{tabular}{c}Decompressor\end{tabular}}} %\put(-50,20){\makebox(25,10){\begin{tabular}{c}{\sc Source}\\{\sc coding}\end{tabular}}} % \put(-50,40){\makebox(25,10){\begin{tabular}{c}{\sc Channel}\\{\sc coding}\end{tabular}}} \put(-20,37){\makebox(25,12){Source}} % \put(-10,14){\makebox(0,0){$\bt$}} \put(-10,34){\makebox(0,0){$\bs$}} \put(35,14){\makebox(0,0){$\br$}} \put(35,34){\makebox(0,0){$\hat{\bs}$}} \put(-7.5,18){\line(0,-1){8}} \put(-7.5,10){\vector(1,0){6}} \put(32.5,10){\vector(0,1){8}} \put(32.5,10){\line(-1,0){6}} % \put(32.5,31){\vector(0,1){8}} %\put(32.5,51){\vector(0,1){5}} \put(-7.5,39){\vector(0,-1){8}} %\put(-7.5,55){\vector(0,-1){5}} \end{picture} \end{mycenter} }{% \caption[a]{The `system' solution for achieving % almost perfect reliable communication over a noisy channel. The encoding system introduces systematic redundancy % in a systematic way into the transmitted vector $\bt$. The decoding system uses this known redundancy to deduce from the received vector $\br$ {\em both\/} the original source vector {\em and\/} the noise introduced by the channel. 
} \label{system.solution} }% \end{figure} \subsection{The `system' solution} Information theory\index{information theory} and \ind{coding theory} offer % \index{system} an alternative (and much more exciting) approach: we accept the given noisy channel as it is and add communication {\dem systems\/} to it so that we can {detect\/} and {correct\/} the errors introduced by the % noise. channel. As shown in \figref{system.solution}, we add an {\dem\ind{encoder}\/} before the channel and a {\dem\ind{decoder}\/} after it. The encoder encodes the source message $\bs$ into a {\dem transmitted\/} message $\bt$, % the idea is that the encoder adds adding {\dem\ind{redundancy}\/} to the original message in some way. The channel adds noise to the transmitted message, yielding a received message $\br$. The decoder uses the known redundancy introduced by the encoding system to infer both the original signal $\bs$ and the added noise. % added by the channel was. Whereas physical solutions give incremental channel improvements only at an ever-increasing cost, % we hope to find % there exist system solutions can turn noisy channels into reliable communication channels with the only cost being a {\em computational\/} requirement at the encoder and decoder. % (and the delay associated with those computations. % % suggested addition: % So, as the cost of computation falls, the cost of reliability will fall as well. {\dbf Information theory} is concerned with the theoretical limitations and % theoretical potentials of such systems. `What is the best error-correcting performance we could achieve?' {\dbf Coding theory} is concerned with the creation of practical encoding and decoding systems. % Some \section{Error-correcting codes for the binary symmetric channel} We now consider examples of encoding and decoding systems. What is the simplest way to add useful redundancy to a transmission? 
[To make the rules of the game clear: we want to be able to detect {\em and\/} correct errors; and retransmission is not an option. We get only one chance to encode, transmit, and decode.] \subsection{Repetition codes} \label{sec.r3} A straightforward idea is to repeat every bit of the message a prearranged number of times -- for example, three times, as shown in \tabref{fig.r3}. We call this {\dem \ind{repetition code}\/} `$\Rthree$'. %\begin{figure}[htbp] %\figuremargin{% \amargintab{c}{ \begin{mycenter} \begin{tabular}{c@{\hspace{0.3in}}c} \toprule % \hline % Source sequence $\bs$ & Transmitted sequence $\bt$ \\ \hline Source & Transmitted \\[-0.02in] % was -0.1, which was to much sequence & sequence \\ $\bs$ & $\bt$ \\ \midrule % \hline \tt 0 &\tt 000 \\ \tt 1 &\tt 111 \\ \bottomrule % \hline \end{tabular} \end{mycenter} %}{% \caption[a]{The repetition code {$\Rthree$}.} \label{fig.r3} }% %\end{figure} % \noindent % Imagine that % what might happen if we transmit the source message \[ \bs = \mbox{\tt 0 0 1 0 1 1 0} \] over a binary symmetric channel with noise level $f=0.1$ using this repetition code. We can describe the channel as `adding' a sparse noise vector $\bn$ to the transmitted vector -- adding in modulo 2 arithmetic, \ie, the binary algebra in which {\tt 1}+{\tt 1}={\tt 0}. A possible noise vector $\bn$ and received vector $\br = \bt + \bn$ are shown in \figref{fig.r3.transmission}. 
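The encoding step and the modulo-2 addition of noise can be sketched in a few lines of Python. This is only an illustration: the function names {\tt r3\_encode} and {\tt add\_noise} are invented for this sketch, and the noise vector is the particular one used in \figref{fig.r3.transmission}, not a random draw from the channel.

```python
def r3_encode(source_bits):
    """Repeat every source bit three times (the repetition code R3)."""
    return [bit for bit in source_bits for _ in range(3)]


def add_noise(transmitted, noise):
    """Add the noise vector to the transmitted vector, bit by bit, modulo 2."""
    return [(t_bit + n_bit) % 2 for t_bit, n_bit in zip(transmitted, noise)]


s = [0, 0, 1, 0, 1, 1, 0]
t = r3_encode(s)

# Noise vector of the example transmission: one flip in the second
# block and two flips in the fifth block.
n = [0, 0, 0,  0, 0, 1,  0, 0, 0,  0, 0, 0,  1, 0, 1,  0, 0, 0,  0, 0, 0]
r = add_noise(t, n)

print("t =", "".join(map(str, t)))
print("r =", "".join(map(str, r)))
```

Because {\tt 1}+{\tt 1}={\tt 0} in this arithmetic, adding $\bn$ flips exactly those transmitted bits at which $\bn$ is {\tt 1}.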
\begin{figure}[htbp] % % here i should switch the \[ \] for a display that oes not introduce % white space at the top (about 0.1in) % \figuremargin{% \[ \begin{array}{rccccccc} \bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\ \bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8} \br & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 1}}{{\tt 0}}& \nbr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \end{array} \] }{% \caption{An example transmission using $\mbox{R}_3$.} \label{fig.r3.transmission} } \end{figure} %\noindent How should we decode this received vector? % % optimality not clear - should justify? % % Perhaps you can see that The optimal algorithm looks at the received bits three at a time and takes a \ind{majority vote} (\algref{alg.r3}). \begin{algorithm}[htbp] \algorithmmargin{% \begin{mycenter} \begin{tabular}{ccc} % \toprule % \hline Received sequence $\br$ & Likelihood ratio $\frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}$ & Decoded sequence $\hat{\bs}$ \\ \midrule \tt 000 & $\gamma^{-3}$ &\tt 0 \\ \tt 001 & $\gamma^{-1}$ &\tt 0 \\ \tt 010 & $\gamma^{-1}$ &\tt 0 \\ \tt 100 & $\gamma^{-1}$ &\tt 0 \\ \tt 101 & $\gamma^{1}$ &\tt 1 \\ \tt 110 & $\gamma^{1}$ &\tt 1 \\ \tt 011 & $\gamma^{1}$ &\tt 1 \\ \tt 111 & $\gamma^{3}$ &\tt 1 \\ % \bottomrule \end{tabular} \end{mycenter} }{% \caption[a]{Majority-vote decoding algorithm for {$\Rthree$}. 
Also shown are the likelihood ratios (\ref{eq.likelihood.bsc}), assuming % This is the optimal decoder if the channel is a binary symmetric channel; $\gamma \equiv (1-f)/f$.} % \label{fig.r3d} \label{alg.r3} }% \end{algorithm} % \begin{aside} % At the risk of explaining the obvious, let's prove this result. The optimal decoding decision (optimal in the sense of having the smallest probability of being wrong) is to find which value of $\bs$ is most probable, given $\br$.\index{maximum {\em a posteriori}} % to make clear the assumptions. Consider the decoding of a single bit $s$, which was encoded as % after encoding as $\bt(s)$ and gave rise to three received bits $\br = r_1r_2r_3$. By \ind{Bayes' theorem},\label{sec.bayes.used} the {\dem posterior probability\/} of $s$ is \beq P(s \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s ) P( s ) } { P( r_1r_2r_3 ) } . \label{eq.bayestheorem} \eeq We can spell out the posterior probability of the two alternatives thus: \beq P(s\eq {\tt 1} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\eq {\tt 1} ) P( s\eq {\tt 1} ) } { P( r_1r_2r_3 ) } ; \label{eq.post1} \eeq \beq P(s\eq {\tt 0} \,|\, r_1r_2r_3 ) = \frac{ P( r_1r_2r_3 \,|\, s\eq {\tt 0} ) P( s\eq {\tt 0} ) } { P( r_1r_2r_3 ) } . \label{eq.post0} \eeq % This \ind{posterior probability} is determined by two factors: the {\dem{\ind{prior} probability\/}} $P(s)$, and the data-dependent term $P( r_1r_2r_3 \,|\, s )$, which is called the {\dem{\ind{likelihood}\/}} of $s$. The normalizing constant $P( r_1r_2r_3 )$ % is irrelevant to needn't be computed when finding the optimal decoding decision, which is to guess $\hat{s}\eq {\tt 0}$ if $P(s\eq {\tt 0} \,|\, \br ) > P(s\eq {\tt 1} \,|\, \br )$, and $\hat{s}\eq {\tt 1}$ otherwise. 
To find $P(s\eq {\tt 0} \,|\, \br )$ and $P(s\eq {\tt 1} \,|\, \br )$,
% the optimal decoding decision,
we must make an assumption about the prior probabilities of the
two hypotheses ${s}\eq {\tt 0}$ and ${s}\eq {\tt 1}$, and we must make an
assumption about the probability of $\br$ given $s$.
% $\bt(s)$.
We assume that the prior probabilities are equal:
$P( {s}\eq {\tt 0}) = P( {s}\eq {\tt 1}) = 0.5$;
then maximizing the posterior probability $P(s\,|\,\br)$ is
equivalent to maximizing the likelihood $P(\br\,|\,s)$.\index{maximum likelihood}
And we assume that the channel is a binary symmetric channel with
noise level $f<0.5$, so that the likelihood is
\beq
 P( \br \,|\, s ) = P(\br \,|\, \bt(s) ) = \prod_{n=1}^N P(r_n \,|\, t_n(s) ) ,
\eeq
where $N=3$ is the number of transmitted bits in the block
we are considering, and
\beq
 P(r_n\,|\,t_n) = \left\{ \begin{array}{lll}
 (1\!-\!f) & \mbox{if} & r_n=t_n \\
 f & \mbox{if} & r_n \neq t_n.
 \end{array} \right.
\eeq
Thus the likelihood ratio for the two hypotheses is
% if we define $
\beq
 \frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}
% = \left( \frac{ (1-f) }{f} \right)^{
 = \prod_{n=1}^N \frac{P(r_n \,|\, t_n({\tt 1}) )}{P(r_n \,|\, t_n({\tt 0}) )} ;
\label{eq.likelihood.bsc}
\eeq
each factor
% $P(r_n \,|\, t_n(s) )$
$\frac{P(r_n \,|\, t_n({\tt 1}) )}{P(r_n \,|\, t_n({\tt 0}) )}$
equals $\frac{ (1-f) }{f}$ if $r_n={\tt 1}$ and $\frac{f}{ (1-f) }$ if $r_n={\tt 0}$.
The ratio $\gamma \equiv \frac{ (1-f) }{f}$ is greater than 1, since $f<0.5$,
so the winning hypothesis is the one with the most `votes', each vote
counting for a factor of $\gamma$ in the
% posterior probability.
likelihood ratio.

Thus the majority-vote decoder shown in \algref{alg.r3} is the optimal decoder
if we assume that the channel is a binary symmetric channel and that the two
possible source messages {\tt 0} and {\tt 1} have equal prior probability.
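As a concrete check (a worked instance added for illustration, using the noise level $f=0.1$ of the earlier example), suppose the received triplet is $\br = {\tt 011}$. The three factors in (\ref{eq.likelihood.bsc}) are $\gamma^{-1}$, $\gamma$, and $\gamma$, so
\beq
 \frac{P(\br\,|\, s\eq {\tt 1})}{P(\br\,|\, s\eq {\tt 0})}
 = \frac{ f (1-f)^2 }{ (1-f) f^2 }
 = \frac{ 1-f }{ f }
 = \gamma = 9 .
\eeq
With equal prior probabilities, the posterior probability of $s\eq {\tt 1}$ is then $\gamma/(1+\gamma) = 0.9$, so the decoder guesses $\hat{s}\eq {\tt 1}$, in agreement with the majority vote.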
\end{aside} %\noindent We now apply the majority vote decoder to the received vector of \figref{fig.r3.transmission}. The first three received bits are all ${\tt 0}$, so we decode this triplet as a ${\tt 0}$. In the second triplet of \figref{fig.r3.transmission}, there are two {\tt 0}s and one {\tt 1}, so we decode this triplet as a ${\tt 0}$ -- which in this case corrects the error. Not all errors are corrected, however. If we are unlucky and two errors fall in a single block, as in the fifth triplet of \figref{fig.r3.transmission}, then the decoding rule gets the wrong answer, as shown in \figref{fig.decoding.R3}. % \Figref{fig.decoding.R3} % shows the result of decoding the received vector % from \figref{fig.r3.transmission}. \begin{figure}[htbp] \figuremargin{% \[ \begin{array}{rccccccc} \bs & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 1}&{\tt 1}&{\tt 0} \\ \bt & \obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \obr{{\tt 1}}{{\tt 1}}{{\tt 1}}&\obr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \obr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \bn & \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 1}}{{\tt 0}}{{\tt 1}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \nbr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \cline{2-8} \br & \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 1}}& \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}}& \ubr{{\tt 0}}{{\tt 1}}{{\tt 0}}& \ubr{{\tt 1}}{{\tt 1}}{{\tt 1}}& \ubr{{\tt 0}}{{\tt 0}}{{\tt 0}} \\ \hat{\bs} & {\tt 0}&{\tt 0}&{\tt 1}&{\tt 0}&{\tt 0}&{\tt 1}&{\tt 0} \\ \mbox{corrected errors} & &\star & & & & & \\ \mbox{undetected errors} & & & & &\star & & \end{array} \] }{% \caption{Decoding % Applying the maximum likelihood decoder for $\mbox{R}_3$ to the received vector from \protect\figref{fig.r3.transmission}.} \label{fig.decoding.R3} }% \end{figure} \noindent % Thus the error probability is 
reduced by the use of this code. % It is easy to compute the error probability. % Exercise 1.1. Could this be made an Example, i.e. worked through in % the text? -- for a beginner, there is a lot in it, and it seems to % be important. % % see exercise.sty \exercissx{2}{ex.R3ep}{%%%%%%%% keep this as A2, but cut it from the ITPRNN list Show\marginpar{\small\raggedright\reducedlead The exercise's rating, \eg % `{\em{A}}2' `[{\em2\/}]', indicates its difficulty: `1' exercises are the easiest. % An exercise rated {\em{A}}2 is important and should not prove too difficult. Exercises that are accompanied by a marginal rat are especially recommended. If a solution or partial solution is provided, the page is indicated after the difficulty rating; for example, this exercise's solution is on page \pageref{ex.R3ep.sol}. } that the error probability is reduced by the use of {$\Rthree$} by computing the error probability of this code for a binary symmetric channel with noise level $f$. %Do so. } % % This fig is 0.1 inch too wide, 9801 % The error probability is dominated by the probability that two bits in a block of three are flipped, which scales as $f^2$. % % JARGON?????? % In the case of the binary symmetric channel with $f=0.1$, the {$\Rthree$} code has a probability of error, after decoding, of $\pb \simeq 0.03$ per bit. \Figref{fig.r3.dilbert} shows the result of transmitting a binary image over a binary symmetric channel using the repetition code. \begin{figure}[hbtp] %\fullwidthfigure{% %\figuredangle{% this hung off the bottom of the page \figuremarginb{% I think this may make a collision? \begin{center} \setlength{\unitlength}{0.8in}% was 0.75 98.12. 
changed to 0.8 99.01 \begin{picture}(7,4.3)(0,1.4) \put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}} \put(0.625,5.4){\makebox(0,0){\Large$\bs$}} \thicklines \put(1.35,4.75){\vector(1,0){0.4}} \put(1.55,5.4){\makebox(0,0){{\sc encoder}}} \put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.ps,width=1in}}} \put(2.625,5.4){\makebox(0,0){\Large$\bt$}} \put(3.6,5.4){\makebox(0,0){{\sc channel}}} \put(3.6,5.15){\makebox(0,0){$f={10\%}$}} \put(3.4,4.75){\vector(1,0){0.4}} \put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.ps,width=1in}}} \put(4.625,5.4){\makebox(0,0){\Large$\br$}} \put(5.6,5.4){\makebox(0,0){{\sc decoder}}} %\put(5.6,3.4){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder takes the majority vote of the three signals.}}}} \put(5.4,4.75){\vector(1,0){0.4}} \put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.r3.0.10.d.ps,width=1in}}} \put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}} \end{picture} \end{center} }{% \caption[a]{Transmitting $10\,000$ source bits over a binary symmetric channel with $f=10\%$ % 0.1$ using a repetition code and the majority vote decoding algorithm. The probability of decoded bit error has fallen to about 3\%; the rate has fallen to 1/3.} % \dilbertcopy \label{fig.r3.dilbert} }% \end{figure} % Should `rate' be explicitly defined? \newpage\indent The repetition code $\Rthree$ has therefore reduced the probability of error, as desired. Yet we have lost something: our {\em rate\/} of information transfer has fallen by a factor of three. So if we use a repetition code to communicate data over a telephone line, it will reduce the error frequency, but it will also reduce our communication rate. We will have to pay three times as much for each phone call. % there will also be a delay Similarly, %As for our \disc{} drive, we would need three of the original noisy gigabyte \disc{} drives in order to create a one-gigabyte \disc{} drive with $\pb=0.03$. 
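The figure $\pb \simeq 0.03$ can be checked with a short calculation, sketched here under the assumptions already made (a binary symmetric channel with noise level $f$, and majority-vote decoding). The decoder gets a block wrong when two or three of its three bits are flipped, so
\beq
 \pb = 3 f^2 (1-f) + f^3 ,
\eeq
which for $f=0.1$ gives $0.027 + 0.001 = 0.028$. The first term, the probability of exactly two flips, dominates, and it scales as $f^2$.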
Can we % What happens as we try to push the error probability lower, to the values required for a % quality sellable \disc{} drive -- $10^{-15}$? We could achieve lower error probabilities by using repetition codes with more repetitions. \exercissx{3}{ex.R60}{ \ben \item Show that the probability of error of $\RN$, the repetition code with $N$ repetitions, is \beq p_{\rm b} = \sum_{n=(N+1)/2}^{N} {{N}\choose{n}} f^n (1-f)^{N-n} , \eeq for odd $N$. \item Assuming $f = 0.1$, which of the terms in this sum is the biggest? How much bigger is it than the second-biggest term? \item Use \ind{Stirling's approximation} (\pref{sec.stirling}) to approximate % get rid of the ${{N}\choose{n}}$ in the largest term, and find, approximately, the probability of error of the repetition code with $N$ repetitions. \item Assuming $f = 0.1$, find how many repetitions are required % show that it takes a repetition % code with rate about $1/60$ to get the probability of error down to $10^{-15}$. [Answer: about 60.] \een } So to build a {\em single\/} gigabyte \disc{} drive with the required reliability from noisy gigabyte drives with $f=0.1$, we would need {\em sixty\/} of the noisy \disc{} drives. The tradeoff between error probability and rate for repetition codes is shown in \figref{fig.pbR.R}. % % see end of l1.tex for method, also see poster1.gnu % \newcommand{\pbobject}{\hspace{-0.15in}\raisebox{1.62in}{$\pb$}% \hspace{-0.05in}} \begin{figure} \figuremargin{% \begin{center} \begin{tabular}{cc} \hspace{-0.2in}\psfig{figure=\codefigs/rep.1.ps,angle=-90,width=2.6in} & \pbobject\psfig{figure=\codefigs/rep.1.l.ps,angle=-90,width=2.6in} \\ \end{tabular} \end{center} }{% \caption[a]{Error probability $\pb$ versus rate for repetition codes over a binary symmetric channel with $f=0.1$. The right-hand figure shows $\pb$ on a logarithmic scale. We would like the rate to be large and $\pb$ to be small. 
} \label{fig.pbR.R} }% \end{figure} % see end of this file for method \subsection{Block codes -- the $(7,4)$ Hamming code} \label{sec.ham74} We would like to communicate with\indexs{Hamming code} tiny probability of error {\em and\/} at a substantial rate. Can we improve on repetition codes? What if we add redundancy to {\dem blocks\/} of data instead of % redundantly encoding one bit at a time? % You may already have heard of the idea of `parity check bits'. We now study a simple {\dem{block code}}. A {\dem \ind{block code}\/} is a rule\index{error-correcting code!block code} for converting a sequence of source bits $\bs$, of length $K$, say, into a transmitted sequence $\bt$ of length $N$ bits. To add redundancy, we make $N$ greater than $K$. In a {\dem linear\/} block code, the extra $N-K$ bits are linear functions of the original $K$ bits; these extra bits are called {\dem\ind{parity-check bits}}. An example of a \ind{linear block code} is the \mbox{\dem$(7,4)$ {Hamming code}}, which transmits $N=7$ bits for every $K=4$ source bits. % \index{error-correcting code!linear} \begin{figure}[htbp] \figuremargin{\small% \begin{center} \begin{tabular}{cc} (a)\psfig{figure=hamming/encode.eps,angle=-90,width=1.3in} & (b)\psfig{figure=hamming/correct.eps,angle=-90,width=1.3in} \\ \end{tabular} \end{center} }{ \caption[a]{Pictorial representation of encoding for the $(7,4)$ Hamming code. % a and b are not explained in the caption. Does this matter? % % The parity check bits $t_5,t_6,t_7$ are set so that the parity within %% each circle is even. } \label{fig.74h.pictorial} \label{fig.hamming.pictorial} } \end{figure} % The encoding operation for the code is shown pictorially in \figref{fig.74h.pictorial}. % % \subsubsection{Encoding} We arrange the seven transmitted bits in three intersecting circles. % as shown in \figref{fig.hamming.encode}. The first four transmitted bits, $t_1 t_2 t_3 t_4$, are set equal to the four source bits, $s_1 s_2 s_3 s_4$. 
The parity-check bits\index{parity-check bits} $t_5 t_6 t_7$ are set so that
the {\dem\ind{parity}\/} within each circle is even: the first parity-check
bit is the parity of the first three source bits (that is, it is
%zero
{\tt 0} if the sum of those bits is even, and
% one
{\tt 1} if the sum is odd); the second is the parity of the last three;
and the third parity bit is the parity of source bits one, three and four.
As an example, \figref{fig.74h.pictorial}b shows the transmitted codeword
for the case $\bs = {\tt 1000}$.
% idea for rewriting this: go straight to pictorial story, leave out the
% matrix description for another time.
%
%
%\noindent
%
Table \ref{tab.74h} shows the codewords generated by each of
the $2^4=$ sixteen settings of the four source bits.
% Notice that the first four transmitted bits are
% identical to the four source bits, and the remaining three bits
% are parity bits:
% The special property of these codewords is that
These codewords have the special property that any pair
differ from each other in at least three bits.
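The parity rules and the distance property can both be checked by brute force. The following Python sketch is an illustration only; the names {\tt hamming74\_encode} and {\tt hamming\_distance} are invented here. It applies the three parity rules to all sixteen source vectors and measures the smallest number of bits in which any two codewords differ.

```python
from itertools import combinations, product


def hamming74_encode(s):
    """Encode four source bits as seven transmitted bits.

    The parity rules follow the text: t5 checks s1, s2, s3;
    t6 checks s2, s3, s4; t7 checks s1, s3, s4.
    """
    s1, s2, s3, s4 = s
    t5 = (s1 + s2 + s3) % 2
    t6 = (s2 + s3 + s4) % 2
    t7 = (s1 + s3 + s4) % 2
    return (s1, s2, s3, s4, t5, t6, t7)


def hamming_distance(a, b):
    """Number of positions in which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


codewords = [hamming74_encode(s) for s in product((0, 1), repeat=4)]
min_distance = min(hamming_distance(a, b)
                   for a, b in combinations(codewords, 2))

print(len(codewords), min_distance)  # prints: 16 3
```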
\begin{table}[htbp] \figuremargin{% \begin{center} \mbox{\small \begin{tabular}{cc} \toprule % Source sequence $\bs$ & % Transmitted sequence $\bt$ \\ \midrule \tt 0000 &\tt 0000000 \\ \tt 0001 &\tt 0001011 \\ \tt 0010 &\tt 0010111 \\ \tt 0011 &\tt 0011100 \\ \bottomrule \end{tabular} \hspace{0.02in} \begin{tabular}{cc} \toprule $\bs$ & $\bt$ \\ \midrule \tt 0100 &\tt 0100110 \\ \tt 0101 &\tt 0101101 \\ \tt 0110 &\tt 0110001 \\ \tt 0111 &\tt 0111010 \\ \bottomrule \end{tabular} \hspace{0.02in} \begin{tabular}{cc} \toprule $\bs$ & $\bt$ \\ \midrule \tt 1000 &\tt 1000101 \\ \tt 1001 &\tt 1001110 \\ \tt 1010 &\tt 1010010 \\ \tt 1011 &\tt 1011001 \\ \bottomrule \end{tabular} \hspace{0.02in} \begin{tabular}{cc} \toprule $\bs$ & $\bt$ \\ \midrule \tt 1100 &\tt 1100011 \\ \tt 1101 &\tt 1101000 \\ \tt 1110 &\tt 1110100 \\ \tt 1111 &\tt 1111111 \\ \bottomrule \end{tabular} }%%%%%%%%% end of row of four tables \end{center} }{% \caption[a]{The sixteen codewords $\{ \bt \}$ of the $(7,4)$ Hamming code. 
Any pair of codewords % have the % beautiful % elegant property that they differ from each other in at least three bits.} %\label{fig.hamming.encode} \label{tab.74h} \label{tab.h74} \label{fig.h74} \label{fig.74h} } \end{table} % \begin{aside} Because the Hamming code is a {linear\/} code, it can\indexs{error-correcting code!linear} be written compactly in terms of matrices as follows.\index{linear block code} % It is a % {\em linear\/} code; that is, t The transmitted codeword $\bt$ is % can be obtained from the source sequence $\bs$ by a linear operation, \beq \bt = \bG^{\T} \bs, \label{eq.encode} \eeq where $\bG$ is the {\dem\ind{generator matrix}} of the code, \beq \bG^{\T} = {\left[ \begin{array}{cccc} \tt 1 &\tt 0 &\tt 0 &\tt 0 \\ \tt 0 &\tt 1 &\tt 0 &\tt 0 \\ \tt 0 &\tt 0 &\tt 1 &\tt 0 \\ \tt 0 &\tt 0 &\tt 0 &\tt 1 \\ \tt 1 &\tt 1 &\tt 1 &\tt 0 \\ \tt 0 &\tt 1 &\tt 1 &\tt 1 \\ \tt 1 &\tt 0 &\tt 1 &\tt 1 \end{array} \right] } , \label{eq.h74.gen} \eeq and the encoding operation (\ref{eq.encode}) uses modulo-2 arithmetic (${\tt 1}+{\tt 1}={\tt{0}}$, ${\tt 0}+{\tt 1}={\tt 1}$, etc.). %\footnote{My notational % convention is that all vectors -- $\bs$, $\bt$, etc.\ -- % are column vectors, except that in the figures where many % vectors are listed, they are displayed as row vectors. The % generator matrix $\bG$ is written ..... as to retain % consistency with established notation in coding texts.} % \begin{aside} In the encoding operation (\ref{eq.encode}) I have assumed that $\bs$ and $\bt$ are column vectors. If instead they are row vectors, then this equation is replaced by \beq \bt = \bs \bG, \label{eq.encodeT} \eeq where \beq \bG = \left[ \begin{array}{ccccccc} \tt 1& \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1 \\ \tt 0& \tt 1& \tt 0& \tt 0& \tt 1& \tt 1& \tt 0 \\ \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1& \tt 1 \\ \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1 \\ \end{array} \right] . 
\label{eq.Generator} \eeq % f you are like me, you may I find it easier to relate to the right-multiplication (\ref{eq.encode}) % hyphenation specified in itprnnchapter.tex did not work so I do it manually than the left-multiplica-{\breakhere}tion (\ref{eq.encodeT}). % -- I like my matrices to act to the right. Many coding theory texts use the left-multiplying conventions (\ref{eq.encodeT}--\ref{eq.Generator}), however. The rows of the generator matrix (\ref{eq.Generator}) can be viewed as defining four basis vectors lying in a seven-dimensional binary space. The sixteen codewords are obtained by making all possible linear combinations % binary sums of these vectors. \end{aside} % % should I add a cast of characters here? % s,t,r,s^ \subsubsection{Decoding the $(7,4)$ Hamming code} When we invent a more complex encoder $\bs \rightarrow \bt$, the task of decoding the received vector $\br$ becomes less straightforward. Remember that {\em any\/} of the bits may have been flipped, including the parity bits. % We can't assume that the three extra parity bits %(The reader who % is eager to see the denouement of the plot may skip ahead to section % \ref{sec.code.perf}.) % General defn of optimal decoder If we assume that the channel is a binary symmetric channel and that all source vectors are equiprobable, % {\em a priori}, then the optimal decoder % is one that identifies the source vector $\bs$ whose encoding $\bt(\bs)$ differs from the received vector $\br$ in the fewest bits. [{Refer to the likelihood function % equation % {eq.bayestheorem}--\ref{eq.likelihood.bsc}} \bref{eq.likelihood.bsc}} to see why this is so.] We could solve the decoding problem by measuring how far $\br$ is from each of the sixteen codewords in \tabref{tab.74h}, then picking the closest. Is there a more efficient way of finding the most probable source vector? 
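For reference, the exhaustive approach just described might look as follows in Python. This is a sketch, not a recommended decoder: the function names are invented for illustration, and the search simply tries all sixteen source vectors.

```python
from itertools import product


def hamming74_encode(s):
    """Systematic (7,4) Hamming encoder using the three parity rules."""
    s1, s2, s3, s4 = s
    return (s1, s2, s3, s4,
            (s1 + s2 + s3) % 2, (s2 + s3 + s4) % 2, (s1 + s3 + s4) % 2)


def nearest_codeword_decode(r):
    """Return the source vector whose codeword differs from r in the fewest bits."""
    def distance_to_r(s):
        return sum(x != y for x, y in zip(hamming74_encode(s), r))
    return min(product((0, 1), repeat=4), key=distance_to_r)


# A received vector that is one flip away from the codeword 1000101.
print(nearest_codeword_decode((1, 1, 0, 0, 1, 0, 1)))  # prints: (1, 0, 0, 0)
```

The cost of this search grows as $2^K$, which is harmless for $K=4$ but hopeless for long block codes.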
\subsubsection{Syndrome decoding for the Hamming code} \label{sec.syndromedecoding} For the $(7,4)$ Hamming code there is a pictorial solution to the % syndrome decoding problem, based on the encoding picture, \figref{fig.74h.pictorial}. % % \subsubsection{Decoding} % % sanjoy says this is CONFUSING - tried to improve it Sat 22/12/01 % also romke did not like it As a first example, let's assume the transmission was $\bt = {\tt 1000101}$ and the noise flips the second bit, so the received vector is $\br = {\tt 1000101}\oplus{\tt{0100000}} = {\tt{1100101}}$. % \ie, $\bn=({\tt 0},{\tt 1},{\tt 0},{\tt 0},{\tt 0}, {\tt 0},{\tt 0})$, % and the received vector We write the received vector into the three circles as shown in \figref{fig.hamming.decode}a, and look at each of the three circles to see whether its parity is even. The circles whose parity is {\em{not}\/} even are shown by dashed lines in \figref{fig.hamming.decode}b. % The fact that all codewords differ from each other in at least % three bits means that if the noise has flipped any one or two bits, % the received vector will no longer be a valid codeword, and some of % the parity checks will be broken. % The decoding task is %We want to find the smallest set of flipped bits that can account for these violations of the parity rules. % violated. [The pattern of violations of the parity checks is called the {\dem\ind{syndrome}}, and can be written as a binary vector -- for example, in \figref{fig.hamming.decode}b, the syndrome is $\bz = ({\tt1},{\tt1},{\tt0})$, because the first two circles are `unhappy' (parity {\tt1}) and the third circle is `happy' (parity {\tt0}).] % RESTORE ME: %, and the task of syndrome decoding % syndrome (just as a % \ind{doctor} might seek the most probable underlying \ind{disease} to account for % the symptoms shown by a \ind{patient}). 
\begin{figure}% [htbp] \figuremargin{\small% \begin{center} \begin{tabular}{ccc} (a)\psfig{figure=hamming/decode.eps,angle=0,width=1.3in} \\ (b)\psfig{figure=hamming/s2.eps,angle=-90,width=1.3in} & (c)\psfig{figure=hamming/t5.eps,angle=-90,width=1.3in} & (d)\psfig{figure=hamming/s3.eps,angle=-90,width=1.3in} \\[0.3in] \multicolumn{3}{c}{% (e)\psfig{figure=hamming/s3.t7.eps,angle=0,width=1.3in} \setlength{\unitlength}{1in} \begin{picture}(0.4,0.6)(0,0) \put(0,0.6){\vector(1,0){0.6}} \end{picture} % \raisebox{0.6in}{$\rightarrow$} (${\rm e}'$)\psfig{figure=hamming/s3.t7.d.eps,angle=0,width=1.3in} }\\ \end{tabular} \end{center} }{% \caption[a]{Pictorial representation of decoding of the Hamming $(7,4)$ code. The received vector is written into the diagram as shown in (a). In (b,c,d,e), the received vector is shown, assuming that the transmitted vector was as in % The bits that are flipped relative to \protect \figref{fig.hamming.pictorial}b and the bits labelled by $\star$ were flipped. The violated parity checks are highlighted by dashed circles. One of the seven bits is the most probable suspect to account for each `\ind{syndrome}', \ie, each pattern of violated and satisfied parity checks. In examples (b), (c), and (d), the most probable suspect is the one bit that was flipped. In example (e), two bits have been flipped, $s_3$ and $t_7$. The most probable suspect is $r_2$, marked by a circle in (${\rm e}'$), which shows the output of the decoding algorithm. % each circle is even. }\label{fig.hamming.decode} \label{fig.hamming.s2}% these labels were in the wrong place feb 2000 \label{fig.hamming.s3} \label{fig.hamming.correct} } \end{figure} % % ACTION: sanjoy still thinks this part is hard to follow - fixed Sat 22/12/01? To solve the decoding task, % problem, we ask the question: can we find a unique bit that lies {\em inside\/} all the `unhappy' circles and {\em outside\/} all the `happy' circles? 
If so, the flipping of that bit would account for the observed syndrome. In the case shown in \figref{fig.hamming.s2}b, the bit $r_2$ % that was flipped lies inside the two unhappy circles and outside the happy circle; no other single bit has this property, so $r_2$ is the only single bit capable of explaining the syndrome. Let's work through a couple more examples. \Figref{fig.hamming.s2}c shows what happens if one of the parity bits, $t_5$, is flipped by the noise. Just one of the checks is violated. Only $r_5$ lies inside this unhappy circle and outside the other two happy circles, so $r_5$ is identified as the only single bit capable of explaining the syndrome. If the central bit $r_3$ is received flipped, \figref{fig.hamming.s3}d shows that all three checks are violated; only $r_3$ lies inside all three circles, so $r_3$ is identified as the suspect bit. If you try flipping any one of the seven bits, you'll find that a different syndrome is obtained in each case -- seven non-zero syndromes, one for each bit. There is only one other syndrome, the all-zero syndrome. So if the channel is a binary symmetric channel with a small noise level $f$, the optimal decoder unflips at most one bit, depending on the syndrome, as shown in \algref{tab.hamming.decode}. Each syndrome could have been caused by other noise patterns too, but any other noise pattern that has the same syndrome must be less probable because it involves a larger number of noise events. 
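The search for the suspect bit can be written out as a short Python sketch. It is an illustration under the assumptions of this section (at most one bit is unflipped); the names {\tt CIRCLES}, {\tt SUSPECT}, {\tt syndrome} and {\tt syndrome\_decode} are invented here. Each circle is represented by the indices of the received bits that lie inside it.

```python
# The three circles, written as the 1-based indices of the received bits
# inside each one; they are the checks involving r5, r6 and r7 in turn.
CIRCLES = [(1, 2, 3, 5), (2, 3, 4, 6), (1, 3, 4, 7)]

# For each bit, the syndrome that flipping that bit alone would produce.
SUSPECT = {
    tuple(int(i in circle) for circle in CIRCLES): i for i in range(1, 8)
}


def syndrome(r):
    """Return the parity of each circle: 1 if violated, 0 if satisfied."""
    return tuple(sum(r[i - 1] for i in circle) % 2 for circle in CIRCLES)


def syndrome_decode(r):
    """Unflip at most one bit, then read out the first four bits."""
    r = list(r)
    z = syndrome(r)
    if z in SUSPECT:  # the all-zero syndrome is not a key, so nothing is flipped
        r[SUSPECT[z] - 1] ^= 1
    return tuple(r[:4])


r = (1, 1, 0, 0, 1, 0, 1)  # 1000101 with its second bit flipped
print(syndrome(r), syndrome_decode(r))  # prints: (1, 1, 0) (1, 0, 0, 0)
```

The dictionary {\tt SUSPECT} has seven entries, one per non-zero syndrome, which is the statement that each single-bit flip produces a different syndrome.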
%\begin{figure} %\figuremargin{% \begin{algorithm} \algorithmmargin{% \begin{center} \begin{tabular}{c*{8}{c}} % Fri 4/1/02 removed toprule and bottomrule because algorithm has its own frame %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % \toprule Syndrome $\bz$ & {\tt 000} & {\tt 001} & {\tt 010} & {\tt 011} & {\tt 100} & {\tt 101} & {\tt 110} & {\tt 111} \\ \midrule Unflip this bit & {\small{\em none}} & $r_7$ & $r_6$ & $r_4$ & $r_5$ & $r_1$ & $r_2$ & $r_3$ \\ % \bottomrule % Unflip this bit & {\small{\em none}} & 7 & 6 & 4 & 5 & 1 & 2 & 3 \\ % \bottomrule % this is appropriate only if z =z3,z2,z1: % Unflip this bit & {\small{\em none}} & 5 & 6 & 2 & 7 & 1 & 4 & 3 \\ \hline \end{tabular} \end{center} %\begin{center} %\begin{tabular}{cc} \hline %Syndrome $\bz$ & % 3 2 1 !!!!!!!!!!!!!!!!!!! %Flip this bit \\ \hline % 000 &{\small{\em none}} \\ % 001 &5\\ % 010 &6\\ % 011 &2\\ % 100 &7\\ % 101 &1\\ % 110 &4\\ % 111 &3 \\ \hline %\end{tabular} %\end{center} }{% \caption[a]{Actions taken by the optimal decoder for the $(7,4)$ Hamming code, assuming a binary symmetric channel with small noise level $f$. The syndrome vector $\bz$ lists whether each parity check is violated ({\tt 1}) or satisfied ({\tt 0}), going through the checks in the order of the bits $r_5$, $r_6$, and $r_7$. } \label{tab.hamming.decode} }% \end{algorithm} What happens if the noise actually flips more than one bit? \Figref{fig.hamming.s3}e shows the situation when two bits, $r_3$ and $r_7$, are received flipped. The syndrome, {\tt 110}, makes us suspect the single bit $r_2$; so our optimal decoding algorithm flips this bit, giving a decoded pattern with three errors as shown in \figref{fig.hamming.s3}${\rm e}'$. If we use the optimal decoding algorithm, any two-bit error pattern will lead to a decoded seven-bit vector that contains three errors. 
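To make the two-flip case concrete, here it is written out for the transmitted codeword $\bt = {\tt 1000101}$ used above (the bit strings below are worked out for illustration). Flipping $r_3$ and $r_7$ gives $\br = {\tt 1010100}$, and the three parity checks, evaluated modulo 2, are
\beq
\begin{array}{rcl}
 z_1 &=& r_1 + r_2 + r_3 + r_5 \:=\: {\tt 1}+{\tt 0}+{\tt 1}+{\tt 1} \:=\: {\tt 1} \\
 z_2 &=& r_2 + r_3 + r_4 + r_6 \:=\: {\tt 0}+{\tt 1}+{\tt 0}+{\tt 0} \:=\: {\tt 1} \\
 z_3 &=& r_1 + r_3 + r_4 + r_7 \:=\: {\tt 1}+{\tt 1}+{\tt 0}+{\tt 0} \:=\: {\tt 0} ,
\end{array}
\eeq
so $\bz = ({\tt 1},{\tt 1},{\tt 0})$. Unflipping $r_2$ gives ${\tt 1110100}$, which is the codeword for $\bs = {\tt 1110}$; it differs from the transmitted $\bt$ in bits 2, 3, and 7.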
\subsection{General view of decoding for linear codes: syndrome decoding} \label{sec.syndromedecoding2} \begin{aside} % {\em (Does some of this stuff belong earlier in the pictorial area?)} We can also describe the decoding problem for a linear code in terms of matrices.\index{syndrome decoding}\index{error-correcting code!syndrome decoding}\index{linear block code} % In the case of a linear code and a symmetric channel, % the decoding task can be re-expressed as {\bf syndrome decoding}. % Let's assume that the noise level $f$ is less than $1/2$. The first four received bits, $r_1r_2r_3r_4$, purport to be the four source bits; and the received bits $r_5r_6r_7$ purport to be the parities of the source bits, as defined by the generator matrix $\bG$. We evaluate the three parity-check bits for the received bits, $r_1 r_2r_3 r_4$, and see whether they match the three received bits, $r_5r_6r_7$. The differences (modulo 2) between these two triplets are called the {\dbf\ind{syndrome}} of the received vector. If the syndrome is zero -- if all three parity checks are happy % agree with the corresponding received bits -- then the received vector is a codeword, and the most probable decoding is given by reading out its first four bits. If the syndrome is non-zero, then % we are certain that the noise sequence for this block was non-zero, and the syndrome is our pointer to the most probable error pattern. The computation of the syndrome vector is a linear operation. 
If we define the $3 \times 4$ matrix $\bP$ such that the matrix of equation (\ref{eq.h74.gen}) is \beq \bG^{\T} = \left[ \begin{array}{c}{\bI_4}\\ \bP\end{array} \right], \eeq where $\bI_4$ is the $4\times 4$ identity matrix, then the syndrome vector is $\bz = \bH \br$, where the {\dbf\ind{parity-check matrix}} $\bH$ is given by $\bH = \left[ \begin{array}{cc} -\bP & \bI_3 \end{array} \right]$; in modulo 2 arithmetic, $-1 \equiv 1$, so \beq \bH = \left[ \begin{array}{cc} \bP & \bI_3 \end{array} \right] = \left[ \begin{array}{ccccccc} \tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0&\tt 0 \\ \tt 0&\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0 \\ \tt 1&\tt 0&\tt 1&\tt 1&\tt 0&\tt 0&\tt 1 \end{array} \right] . \label{eq.pcmatrix} \eeq All the codewords $\bt = \bG^{\T} \bs$ of the code satisfy \beq \bH \bt = \left[ {\tt \begin{array}{c} \tt0\\ \tt0\\ \tt0 \end{array} } \right] . % (0,0,0) . \eeq \exercisaxB{1}{ex.GHis0}{ Prove that this is so by evaluating the $3\times4$ matrix $\bH \bG^{\T}$. } Since the received vector $\br$ is given by $\br = \bG^{\T}\bs + \bn$, % and $\bH \bG^{\T}$=0, the syndrome-decoding problem is to find the most probable noise vector $\bn$ satisfying the equation \beq \bH \bn = \bz . \eeq A decoding algorithm that solves this problem is called a {\dem {maximum-likelihood} decoder}. We will discuss decoding problems like this in later chapters. %\footnote{Somewhere in this book % I need to spell out \Bayes\ theorem for decoding. Here would be % a good spot; but on the other hand, people can understand decoding % intuitively, they don't need Bayes theorem and they might find it % a hindrance if they were not only being hit by % Shannon's theorem but also by likelihoods and priors.} % % ACTION NEEDED ???????????????????????????????????????? 
% \end{aside} \begin{figure} %\fullwidthfigure{% \figuredanglenudge{% \begin{center} \setlength{\unitlength}{0.8in}% was 1in, with figures 1.25 wide % then was 0.8 with 1in \begin{picture}(7,2.7)(0,2.8) \put(0,5){\makebox(0,0)[tl]{\psfig{figure=bitmaps/dilbert.ps,width=1in}}} \put(0.625,5.4){\makebox(0,0){\Large$\bs$}} \thicklines \put(1.35,4.75){\vector(1,0){0.4}} \put(1.55,5.4){\makebox(0,0){{\sc encoder}}} \put(2,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.ps,width=1in}}} \put(1.982,3.75){\makebox(0,0)[tr]{{parity bits} $\left.\rule[-0.342in]{0pt}{0.342in} \right\{$}} \put(2.625,5.4){\makebox(0,0){\Large$\bt$}} \put(3.6,5.4){\makebox(0,0){{\sc channel}}} \put(3.6,5.15){\makebox(0,0){$f={10\%}$}} \put(3.4,4.75){\vector(1,0){0.4}} \put(4,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.ps,width=1in}}} \put(4.625,5.4){\makebox(0,0){\Large$\br$}} \put(5.6,5.4){\makebox(0,0){{\sc decoder}}} %\put(5.6,3.5){\makebox(0,0)[tl]{\parbox[t]{1.75in}{{\em The decoder picks the $\hat{\bs}$ with maximum likelihood.}}}} \put(5.4,4.75){\vector(1,0){0.4}} \put(6,5){\makebox(0,0)[tl]{\psfig{figure=poster/10000.h74.0.10.d.ps,width=1in}}} \put(6.625,5.4){\makebox(0,0){\Large$\hat{\bs}$}} \end{picture} \end{center} }{% \caption[a]{Transmitting $10\,000$ source bits over a binary symmetric channel with $f=10\%$ %0.1$ using a $(7,4)$ Hamming code. The probability of decoded bit error is about 7\%.} % \dilbertcopy} \label{fig.h74.dilbert} }{0.7in}% third argument is the upward nudge of the caption \end{figure} \subsection{Summary of the $(7,4)$ Hamming code's properties} Every possible received vector of length 7 bits is either a codeword, or it's one flip away from a codeword.% \index{Hamming code} Since there are three parity constraints, each of which might or might not be violated, there are $2\times 2\times 2= 8$ % eight distinct syndromes. 
They can be divided into seven non-zero syndromes -- one for each of the one-bit error patterns -- and the all-zero syndrome, corresponding to the zero-noise case. The optimal decoder takes no action if the syndrome is zero, otherwise it uses this mapping of non-zero syndromes onto one-bit error patterns to unflip the suspect bit. There is a {\dbf decoding error} if the four decoded bits $\hat{s}_1, \hat{s}_2, \hat{s}_3, \hat{s}_4$ do not all match the source bits ${s}_1, {s}_2, {s}_3, {s}_4$. The {\dbf probability of block error} $\pB$ is the probability that one or more of the decoded bits in one block fail to match the corresponding source bits, \beq \pB = P( \hat{\bs} \neq \bs ) . \eeq The {\dbf probability of bit error} $\pb$ is the average probability % per decoded bit that a decoded bit fails to match the corresponding source bit, \beq \pb = \frac{1}{K} \sum_{k=1}^K P( \hat{s}_k \neq s_k ) . \eeq In the case of the Hamming code, a decoding error will occur whenever the noise has flipped more than one bit in a block of seven. % Any noise pattern that flips more than one bit will give rise to one of % these syndromes, and our decoder will make an erroneous decision. % The probability of block error is thus the probability that two or more bits are flipped in a block. This probability scales as $O(f^2)$, as did the probability of error for the repetition code $\Rthree$. But notice that the Hamming code communicates at a greater rate, $R=4/7$. \Figref{fig.h74.dilbert} shows a binary image transmitted over a binary symmetric channel using the $(7,4)$ Hamming code. About 7\% of the decoded bits are in error. Notice that the errors are correlated: % with each other: often two or three successive decoded bits are flipped. \exercisaxA{1}{ex.Hdecode}{ This exercise and the next three refer to the $(7,4)$ {Hamming code}. 
Decode the received strings: \ben \item $\br = {\tt 1101011}$ % 10 \item $\br = {\tt 0110110}$ % 4 \item $\br = {\tt 0100111}$ % 4 \item $\br = {\tt 1111111}$. % 15 \een } \exercissxA{2}{ex.H74p}{ \ben \item Calculate the probability of block error $p_{\rm B}$ of the $(7,4)$ Hamming code as a function of the noise level $f$ and show that to leading order % \footnote{Do I need to explain what this means? Or use a different % terminology? Maybe only physicists are familiar?} % % ACTION!!! % it goes as $21 f^2$. \item % } % \exercis{}{ \difficulty{3} % $^{B3}$ Show that to leading order the probability of bit error $\pb$ goes as $9 f^2$. \een} \exercissxA{2}{ex.H74zero}{ % Hamming $(7,4)$ code. Find some noise vectors that give the all-zero syndrome (that is, noise vectors that leave all the parity checks unviolated). How many such noise vectors are there? } % they are the codewords. \exercisaxB{2}{ex.H74detail}{ % Hamming $(7,4)$ code. I asserted above that a block decoding error will result whenever two or more bits are flipped in a single block. Show that this is indeed so. [In principle, there might be error patterns that, after decoding, led only to the corruption of the parity bits, with no source bits incorrectly decoded.] } \subsection{Summary of codes' performances} \label{sec.code.perf} Figure \ref{fig.pbR.RH} shows the performance of \ind{repetition code}s and the {Hamming code}. It also shows the performance of a family of linear block codes that are generalizations of Hamming codes, called \ind{BCH codes}. 
% Reed-Muller codes, and % see end of this file for method % \begin{figure}[htbp] \figuremargin{% \begin{center} \begin{tabular}{cc} \hspace{-0.2in}\psfig{figure=\codefigs/rephambch.1.ps,angle=-90,width=2.6in} & \pbobject\psfig{figure=\codefigs/rephambch.1.l.ps,angle=-90,width=2.6in} \\ \end{tabular} \end{center} }{% \caption[a]{Error probability $\pb$ versus rate $R$ for repetition codes, the $(7,4)$ Hamming code and BCH codes with blocklengths up to 1023 over a binary symmetric channel with $f=0.1$. The righthand figure shows $\pb$ on a logarithmic scale.} \label{fig.pbR.RH} } \end{figure} % %\noindent % use this noindent if the ``h'' (here) works, otherwise new para. This figure shows that we can, using linear block codes, achieve better performance than repetition codes; but the asymptotic situation still looks grim. \exercissxA{4}{ex.makecode}{ % invent your own code Design an error-correcting code and a decoding algorithm for it, estimate its probability of error, and add it to figure \ref{fig.pbR.RH}. [Don't worry if you find it difficult to make a code better than the Hamming code, or if you find it difficult to find a good decoder for your code; that's the point of this exercise.] } \exercissxA{3}{ex.makecode2error}{ A $(7,4)$ Hamming code can correct any {\em one\/} error; might there be a % (10,4) $(14,8)$ code that can correct any two errors? % What about a (9,4) code? {\sf Optional extra:} Does the answer to this question depend on whether the code is linear or nonlinear? } \exercissxA{4}{ex.makecode2}{ Design an error-correcting code, other than a repetition code, that can correct any {\em two\/} errors in a block of size $N$. } \section{What performance can the best codes achieve?} There seems to be a trade-off between the decoded bit-error probability $\pb$ (which we would like to reduce) and the rate $R$ (which we would like to keep large). How can this trade-off be characterized? % Can we do better than repetition codes? 
What points in the $(R,\pb)$ plane are achievable? This question was addressed by Claude Shannon\index{Shannon, Claude} in his pioneering paper of 1948, in which he both created the field of information theory and solved most of its fundamental problems. % in the same paper. At that time there was a widespread belief that the boundary between achievable and nonachievable points in the $(R,\pb)$ plane was a curve passing through the origin $(R,\pb) = (0,0)$; if this were so, then, in order to achieve a vanishingly small error probability $\pb$, one would have to reduce the rate correspondingly close to zero. % (figure ref here). % This would seem a reasonable guess, % in accordance with the general rule that the better something works % the more you have to pay for it. % % ACTION: sanjoy doesn't like This % `No pain, no gain.' However, Shannon proved the remarkable result that\wow\ % , for any given channel, the boundary between achievable and nonachievable points meets the $R$ axis at a {\em non-zero\/} value $R=C$, as shown in \figref{fig.pbR.RHS}. \begin{figure}[htbp] \figuremargin{% \begin{center} \begin{tabular}{cc} \hspace{-0.2in}\psfig{figure=\codefigs/repshan.1.ps,angle=-90,width=2.6in} & \pbobject\psfig{figure=\codefigs/repshan.1.l.ps,angle=-90,width=2.6in} \\ \end{tabular} \end{center} }{% \caption[a]{Shannon's noisy-channel coding theorem.\indexs{noisy-channel coding theorem}\index{Shannon, Claude} The solid curve shows the Shannon limit on achievable values of $(R,\pb)$ for the binary symmetric channel with $f=0.1$. Rates up to $R=C$ are achievable with arbitrarily small $\pb$. The points show the performance of some textbook codes, as in \protect\figref{fig.pbR.RH}. %\indent MANUAL INDENT \hspace{1.5em}The equation defining the Shannon limit (the solid curve) is %\[ $R = \linefrac{C}{(1-H_2(\pb))},$ %\] where $C$ and $H_2$ are defined in \protect \eqref{eq.capacity}. 
} \label{fig.pbR.RHS} } \end{figure} % see end of this file for method % For any channel, there exist codes that make it possible to communicate with {\em arbitrarily small\/} probability of error $\pb$ at non-zero rates. The first half of this book ({\partnoun}s I--III) will be devoted to understanding this remarkable result, which is called the {\dbf{noisy-channel coding theorem}}. \subsection{Example: $f=0.1$}% A few details} The maximum rate at which communication is possible with arbitrarily small $\pb$ is called the {\dbf\ind{capacity}} of the channel.\index{channel!capacity} The formula for the capacity of a binary symmetric channel with noise level $f$ is\index{binary entropy function} \beq C(f) = 1 - H_2(f) = 1 - \left[ f \log_2 \frac{1}{f} + (1-f) \log_2 \frac{1}{1-f} \right] ; \label{eq.capacity} \eeq the channel we were discussing earlier with noise level $f=0.1$ has capacity $C \simeq 0.53$. Let us consider what this means in terms of noisy \disc{} drives. The \ind{repetition code} $\Rthree$ could communicate over this channel with $\pb=0.03$ at a rate $R = 1/3$. Thus we know how to build a single gigabyte \disc{} drive with $\pb = 0.03$ from three noisy gigabyte \disc{} drives. We also know how to make a single gigabyte \disc{} drive with $\pb \simeq 10^{-15}$ from sixty noisy one-gigabyte drives \exercisebref{ex.R60}. And now Shannon\index{Shannon, Claude} passes by, notices us \ind{juggling} % tinkering with \disc{} drives and codes and says: \begin{quotation} \noindent `What performance are you trying to achieve? $10^{-15}$? You don't need {\em sixty\/} \disc{} drives -- you can get that performance with just {\em two\/} \disc{} drives (since 1/2 is less than $0.53$). % (The capacity is 0.53, so the number of \disc{} drives needed at % capacity is 1/0.53.) % ` And if you want $\pb = 10^{-18}$ % , or $10^{-21}$, or $10^{-24}$ or anything, you can get there with two \disc{} drives too!' 
\end{quotation}
%\begin{aside}
[Strictly, the above statements might not be quite right, since, as we shall see, Shannon proved his noisy-channel coding theorem
%proves the achievability of ever smaller % error probabilities at a given rate $R < C$
\ldots]

\begin{aside}
\ldots\ The probability that a continuous random variable $v$ lies between values $a$ and $b$ (where $b > a$) is defined to be $\int_{a}^{b} \! \d v \: P(v)$. $P(v)\d v$ is dimensionless. The density $P(v)$ is a dimensional quantity, having dimensions inverse to the dimensions of $v$ -- in contrast to discrete probabilities, which are dimensionless. Don't be surprised to see probability densities % with a numerical value greater than 1. This is normal, and nothing is wrong, as long as $\int_{a}^{b} \! \d v \: P(v) \leq 1$ for any interval $(a,b)$. Conditional and joint probability densities are defined in just the same way as conditional and joint probabilities. % , which is why I choose not to use different notation for them. \end{aside} % More equations here. % % bring from chapter 4? % % at present ch 4 refers to this page as the first occurrence of % Laplace's rule. % % Sort out this mess::::::::::::::: % p30 Ex 2.8 : There claims to be a solution to this on p121 but this is %actually a solution to Ex 6.2 %Generally would be helpful if notation in Chapters 2 and 6 was the same % % !!!!!!!!!!!!!!!!!!!! Idea: move this exe to the end of this subsection? % THIS EX seems to have no solution \exercisaxB{2}{ex.postpa}{% solution added Mon 10/11/03 Assuming a uniform prior on $f_H$, $P(f_H) = 1$, solve the problem posed in \exampleref{exa.bentcoin}. Sketch the posterior distribution of $f_H$ and compute the probability that the $N\!+\!1$th outcome will be a head, for \ben \item $N=3$ and $n_H=0$; \item $N=3$ and $n_H=2$; \item $N=10$ and $n_H=3$; \item $N=300$ and $n_H=29$. \een You will find the \ind{beta integral} useful: \beq \int_0^1 \! \d p_a \: p_a^{F_a} (1-p_a)^{F_b} = \frac{\Gamma(F_a+1)\Gamma(F_b+1)}{ \Gamma(F_a+F_b+2) } = \frac{ F_a! F_b! }{ (F_a + F_b + 1)! } . 
\eeq You may also find it instructive to look back at \exampleref{ex.ip.urns} and \eqref{eq.laplace.succession.first}. } People sometimes confuse assigning a prior distribution to an unknown parameter such as $f_H$ with making an initial guess of the {\em{value}\/} of the parameter. % But priors are not values, they are distributions. But the prior over $f_H$, $P(f_H)$, is not a simple statement like `initially, I would guess $f_H = \dhalf$'. The prior is a probability density over $f_H$ which specifies the prior degree of belief that $f_H$ lies in any interval $(f,f+\delta f)$. It may well be the case that our prior for $f_H$ is symmetric about $\dhalf$, so that the {\em mean\/} of $f_H$ under the prior is $\dhalf$. %under our prior for $f_H$, the {\em mean\/} of $f_H$ is $\dhalf$ % -- on symmetry grounds for example. In this case, the predictive distribution {\em for the first toss\/} $x_1$ would indeed be \beq P(x_1 \eq \mbox{head}) = \int \! \d f_H \: P(f_H) P(x_1 \eq \mbox{head} \given f_H) = \int \! \d f_H \: P(f_H) f_H = \dhalf . \eeq But the prediction for subsequent tosses will depend on the whole prior distribution, not just its mean. \subsubsection{Data compression and inverse probability} Consider the following task. \exampl{ex.compressme}{ Write a computer program capable of compressing binary files like this one:\par \begin{center}{\footnotesize%was tiny {\tt 0000000000000000000010010001000000100000010000000000000000000000000000000000001010000000000000110000}\\ {\tt 1000000000010000100000000010000000000000000000000100000000000000000100000000011000001000000011000100}\\ {\tt 0000000001001000000000010001000000000000000011000000000000000000000000000010000000000000000100000000}\\[0.1in]% added this space Sat 21/12/02 } \end{center} % This file contains N=300 and n_1 = 29 The string shown contains $n_1=29$ {\tt 1}s and $n_0=271$ {\tt 0}s. % What is the probability that the next character in this file % is a {\tt 1}? 
} Intuitively, compression works by taking advantage of the predictability of a file. In this case, the source of the file appears more likely to emit {\tt 0}s than {\tt 1}s. A data compression program that compresses this file must, implicitly or explicitly, be addressing the question `What is the probability that the next character in this file is a {\tt 1}?' Do you think this problem is similar in character to \exampleref{exa.bentcoin}? I do. One of the themes of this book is that data compression and data modelling are one and the same, and that they should both be addressed, like the urn of example \ref{ex.ip.urns}, using inverse probability. \Exampleonlyref{ex.compressme} is solved in \chref{ch4}. % % SOLVE IT HERE??? % \subsection{The likelihood principle} \label{sec.lp} Please solve the following two exercises. \exampl{ex.lp1}{ Urn\amarginfig{c}{\begin{center}\psfig{figure=figs/urnsA.ps,width=1.6in}\end{center} \caption[a]{Urns for \protect\exampleonlyref{ex.lp1}.}} A contains three balls: one black, and two white; \ind{urn} B contains three balls: two black, and one white. One of the urns is selected at random and one ball is drawn. The ball is black. What is the probability that the selected urn is urn A? } % \exampl{ex.lp2}{ Urn\amarginfig{c}{\begin{center}\psfig{figure=figs/urns.ps,width=1.6in}\end{center}% \caption[a]{Urns for \protect\exampleonlyref{ex.lp2}.}} A contains five balls: one black, two white, one green and one pink; urn B contains five hundred balls: two hundred black, one hundred white, 50 yellow, 40 cyan, 30 sienna, 25 green, 25 silver, 20 gold, and 10 purple. [One fifth of A's balls are black; two-fifths of B's are black.] One of the urns is selected at random and one ball is drawn. The ball is black. What is the probability that the urn is urn A? } % What do you notice about your solutions? Does each answer depend on the detailed contents of each urn? The details of the other possible outcomes and their probabilities are irrelevant. 
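
As a check on both examples (a sketch, assuming that `selected at random' means $P({\rm A}) = P({\rm B}) = \dhalf$), Bayes' theorem gives
\beq
 P({\rm A} \given {\rm black}) =
 \frac{ P({\rm black} \given {\rm A}) \, P({\rm A}) }
 { P({\rm black} \given {\rm A}) \, P({\rm A}) + P({\rm black} \given {\rm B}) \, P({\rm B}) } .
\eeq
In \exampleonlyref{ex.lp1} the two likelihoods are $1/3$ and $2/3$, so the posterior probability of urn A is $(1/3)/(1/3 + 2/3) = 1/3$; in \exampleonlyref{ex.lp2} they are $1/5$ and $2/5$, so it is $(1/5)/(1/5 + 2/5) = 1/3$ again.
The green, pink, yellow and other balls never enter the calculation.
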
All that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses. We need only to know the {\em likelihood}, \ie, how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the {\dbf\ind{likelihood principle}}.\label{sec.likelihoodprinciple} % % NOTE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % % { \em (connect back to this point when discussing % early stopping and inference in problems where the stopping rule is not known.)} % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% README NOTE!!!!!!!!!! \begin{conclusionbox} {\sf The likelihood principle:} given a generative model for data $d$ given parameters $\btheta$, $P(d \given \btheta)$, and having observed a particular outcome $d_1$, all inferences\index{key points!likelihood principle} and predictions should depend only on the function $P(d_1 \given \btheta)$. \end{conclusionbox} \noindent In spite of the simplicity of this principle, many classical statistical methods violate it.\index{classical statistics!criticisms}\index{sampling theory!criticisms} % \newpage \section{Definition of entropy and related functions} \begin{description} \item[The Shannon \ind{information content} of an outcome $x$] is defined to be % We define for each $x \in \A_X$, $ \beq h(x) = \log_2 \frac{1}{P(x)} . \eeq % We can interpret $h(a_i)$ as the information content of the event % $x \eq a_i$. It is measured in bits. [The word `bit' is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.] \noindent In the next few chapters, we will establish that the Shannon information content $h(a_i)$ is indeed a natural measure of the information content of the event $x \normaleq a_i$. At that point, we will shorten the name of this quantity to `the information content'. 
\margintab{% \begin{center}\small%footnotesize % % vertical table of a-z with probabilities, and information contents too; % four decimal place \begin{tabular}[t]{cccr} \toprule $i$ & $a_i$ & $p_i$ & \multicolumn{1}{c}{$h(p_i)$} \\ \midrule % $i$ & $a_i$ & $p_i$ & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} \\ \midrule % 1 & {\tt a} &.0575 & 4.1 \\ 2 & {\tt b} &.0128 & 6.3 \\ 3 & {\tt c} &.0263 & 5.2 \\ 4 & {\tt d} &.0285 & 5.1 \\ 5 & {\tt e} &.0913 & 3.5 \\ 6 & {\tt f} &.0173 & 5.9 \\ 7 & {\tt g} &.0133 & 6.2 \\ 8 & {\tt h} &.0313 & 5.0 \\ 9 & {\tt i} &.0599 & 4.1 \\ 10 &{\tt j} &.0006 & 10.7 \\ 11 &{\tt k} &.0084 & 6.9 \\ 12 &{\tt l} &.0335 & 4.9 \\ 13 &{\tt m} &.0235 & 5.4 \\ 14 &{\tt n} &.0596 & 4.1 \\ 15 &{\tt o} &.0689 & 3.9 \\ 16 &{\tt p} &.0192 & 5.7 \\ 17 &{\tt q} &.0008 & 10.3 \\ 18 &{\tt r} &.0508 & 4.3 \\ 19 &{\tt s} &.0567 & 4.1 \\ 20 &{\tt t} &.0706 & 3.8 \\ 21 &{\tt u} &.0334 & 4.9 \\ 22 &{\tt v} &.0069 & 7.2 \\ 23 &{\tt w} &.0119 & 6.4 \\ 24 &{\tt x} &.0073 & 7.1 \\ 25 &{\tt y} &.0164 & 5.9 \\ 26 &{\tt z} &.0007 & 10.4 \\ 27 &{\tt{-}}&.1928 & 2.4 \\ \midrule %27 &\verb+-+&.1928 & 2.4 \\ \midrule & & & \\[-0.1in] \multicolumn{3}{r}{ $\displaystyle \sum_i p_i \log_2 \frac{1}{p_i}$ } & 4.1 \\ \bottomrule % 4.11 \end{tabular}\\ \end{center} % vertical table of a-z with probabilities, and information contents too; \caption[a]{Shannon information contents of the outcomes {\tt a}--{\tt z}.} \label{fig.monogram.log} } % The fourth column in \tabref{fig.monogram.log} shows the Shannon information content of the 27 possible outcomes when a random character is picked from an English document. The outcome % character $x={\tt z}$ has a Shannon information content of 10.4 bits, and $x={\tt e}$ has an information content of 3.5 bits. 
\item[The entropy of an ensemble $X$] is defined to be the average Shannon information content of an outcome:\index{entropy} % from that ensemble: \beq H(X) \equiv \sum_{x \in \A_X} P(x) \log \frac{1}{P(x)}, \eeq %\beq % H(X) = \sum_i p_i \log \frac{1}{p_i}, %\eeq with the convention for \mbox{$P(x) \normaleq 0$} that \mbox{$0 \times \log 1/0 \equiv 0$}, since \mbox{$\lim_{\theta\rightarrow 0^{+}} \theta \log 1/\theta \normaleq 0 $}. Like the information content, entropy is measured in bits. When it is convenient, we may also write $H(X)$ as $H(\bp)$, where $\bp$ is the vector $(p_1,p_2,\ldots,p_I)$. Another name for the entropy of $X$ is the uncertainty of $X$. \end{description} \noindent % The entropy is a measure of the information content or % `uncertainty' of $x$. The question of why entropy is a % fundamental measure of information content will be discussed in the % forthcoming chapters. Here w % was continued example \exampl{eg.mono}{ The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in \tabref{fig.monogram.log}. %, p.\ \pageref{fig.monogram}. % \tabref{tab.mono}. We obtain this number by averaging $\log 1/p_i$ (shown in the fourth column) under the probability distribution $p_i$ (shown in the third column). } We now note some properties of the entropy function. \bit \item $H(X) \geq 0$ with equality iff $p_i \normaleq 1$ for one $i$. [{`iff' means `if and only if'.}] \item Entropy is maximized if $\bp$ is uniform: \beq H(X) \leq \log(|\A_X|) \:\: \mbox{ with equality iff $p_i \normaleq 1/|\A_X|$ for all $i$. } \eeq % \footnote{Exercise: Prove this assertion.} {\sf Notation:}\index{notation!absolute value}\index{notation!set size} the vertical bars `$|\cdot|$' have two meanings. % If $X$ is an ensemble, then If $\A_X$ is a set, $|\A_X|$ denotes the number of elements in $\A_X$; if $x$ is a number, % for example, the value of a random variable, then $|x|$ is the absolute value of $x$. 
\eit % % Mon 22/1/01 The {\dem\ind{redundancy}} measures the fractional difference between $H(X)$ and its maximum possible value, $\log(|\A_X|)$. \begin{description}% \item[The redundancy of $X$] is: \beq 1 - \frac{H(X)}{\log |\A_X|} . \eeq We won't make use of `redundancy' % need this definition in this book, so I have not assigned a symbol to it. % -- it would be redundant. \end{description} % ha ha % funny but true. % example: X is select a codeword from a code - H(X) = K, but |X| = 2^N % % Redundancy = 1 - R % of code \begin{description}% duplicated in _l1a and _p5A \item[The joint entropy of $X,Y$] is: \beq H(X,Y) = \sum_{xy \in \A_X\A_Y} P(x,y) \log \frac{1}{P(x,y)}. \eeq Entropy is additive for independent random variables: \beq H(X,Y) = H(X) +H(Y) \:\mbox{ iff }\: P(x,y)=P(x)P(y). \label{eq.ent.indep}% also appears in p5a (.again) \eeq \end{description} \label{sec.entropy.end.parta} Our definitions for information content so far apply only to discrete probability distributions over finite sets $\A_X$. The definitions can be extended to infinite sets, though the entropy may then be infinite. The case of a probability {\em density\/} over a continuous set is addressed in section \ref{sec.entropy.continuous}.\index{probability!density} Further important definitions and exercises to do with entropy will come along in section \ref{sec.entropy.contd}. \section{Decomposability of the entropy} The entropy function satisfies a recursive property that can be very useful when computing entropies. For convenience, we'll stretch our notation\index{notation!entropy} so that we can write $H(X)$ as $H(\bp)$, where $\bp$ is the probability vector associated with the ensemble $X$. Let's illustrate the property by an example first. Imagine that a random variable $x \in \{ 0,1,2 \}$ is created by first flipping a fair coin to determine whether $x = 0$; then, if $x$ is not 0, flipping a fair coin a second time to determine whether $x$ is 1 or 2. 
The probability distribution of $x$ is \beq P( x\! =\! 0 ) = \frac{1}{2} ; \:\: P( x\! =\! 1 ) = \frac{1}{4} ; \:\: P( x\! =\! 2 ) = \frac{1}{4} . \eeq What is the entropy of $X$? We can either compute it by brute force: \beq H(X) = \dfrac{1}{2} \log 2 + \dfrac{1}{4} \log 4 + \dfrac{1}{4} \log 4 = 1.5 ; \eeq or we can use the following decomposition, in which the value of $x$ is revealed gradually. Imagine first learning whether $x\! =\! 0$, and then, if $x$ is not $0$, learning which non-zero value is the case. The revelation of whether $x\! =\! 0$ or not entails revealing a binary variable whose probability distribution is $\{\dhalf,\dhalf \}$. This revelation has an entropy $H(\dhalf,\dhalf) = \frac{1}{2} \log 2 +\frac{1}{2} \log 2 = 1\ubit$. If $x$ is not $0$, we learn the value of the second coin flip. This too is a binary variable whose probability distribution is $\{\dhalf,\dhalf\}$, and whose entropy is $1\ubit$. We only get to experience the second revelation half the time, however, so the entropy can be written: \beq H(X) = H( \dhalf , \dhalf ) + \dhalf \, H( \dhalf , \dhalf ) . \eeq Generalizing, the observation we are making about the entropy of any probability distribution $\bp = \{ p_1, p_2, \ldots , p_I \}$ is that \beq H(\bp) = H( p_1 , 1\!-\!p_1 ) + (1\!-\!p_1) H \! \left( \frac{p_2}{1\!-\!p_1} , \frac{p_3}{1\!-\!p_1} , \ldots , \frac{p_I}{1\!-\!p_1} \right) . \label{eq.entropydecompose} \eeq When it's written as a formula, this property looks regrettably ugly; nevertheless it is a simple property and one that you should make use of. Generalizing further, the entropy has the property for any $m$ that \beqan H(\bp) &=& H\left[ ( p_1+p_2+\cdots+p_m ) , ( p_{m+1}+p_{m+2}+\cdots+p_I ) \right] \nonumber \\ &&+ ( p_1+ % p_2+ \cdots+p_m ) H\! \left( \frac{p_1}{ ( p_1+\cdots+p_m ) } , % \frac{p_2}{ ( p_1+\cdots+p_m ) } , \ldots , \frac{p_m}{ ( p_1+\cdots+p_m ) } \right) \nonumber \\ && + ( p_{m+1}+ %p_{m+2}+ \cdots+p_I ) H \! 
\left( \frac{p_{m+1}}{ ( p_{m+1}+\cdots+p_I ) } , % \frac{p_{m+2}}{ ( p_{m+1}+\cdots+p_I ) } , \ldots , \frac{p_I}{ ( p_{m+1}+\cdots+p_I ) } \right) . \nonumber \\ \label{eq.entdecompose2} \eeqan \exampl{example.entropy}{ A source produces a character $x$ from the alphabet $\A = \{ {\tt 0}, {\tt 1}, \ldots, {\tt 9}, {\tt a}, {\tt b}, \ldots, {\tt z} \}$; with probability $\dthird$, $x$ is a numeral (${\tt 0}, \ldots, {\tt 9}$); with probability $\dthird$, $x$ is a vowel (${\tt a}, {\tt e}, {\tt i}, {\tt o}, {\tt u}$); and with probability $\dthird$ it's one of the 21 consonants. All numerals are equiprobable, and the same goes for vowels and consonants. Estimate the entropy of $X$. } \solution\ \ $\log 3 + \frac{1}{3} ( \log 10 + \log 5 + \log 21 )= \log 3 + \frac{1}{3} \log 1050 \simeq \log 30\ubits$.\ENDsolution %> pr log(36)/log(2) %5.16992500144231 %> pr log(30)/log(2) %4.90689059560852 %> pr (log(3) +log(1050)/3.0 )/log(2) %4.93035370490565 % This may be compared with the maximum entropy for an alphabet % of 36 characters, $\log 36\ubits$. \section{Gibbs' inequality} % We will also find useful the following: \begin{description} % SPACE PROBLEM HERE ... \item[The \ind{relative entropy} {\em or\/} \ind{Kullback--Leibler divergence}] \marginpar[t]{\small\raggedright\reducedlead{The `ei' in L{\bf{ei}}bler is pronounced\index{pronunciation} the same as in h{\bf{ei}}st.}}between two probability distributions $P(x)$ and $Q(x)$ that are defined over the same alphabet $\A_X$ is\index{entropy!relative}\index{divergence} \beq D_{\rm KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} . \label{eq.KL} \label{eq.DKL} \eeq The relative entropy satisfies {\dem\ind{Gibbs' inequality}} \beq D_{\rm KL}(P||Q) \geq 0 \eeq with equality only if $P \normaleq Q$. 
Note that in general the relative entropy is not symmetric under interchange of the distributions $P$ and $Q$: in general $D_{\rm KL}(P||Q) \neq D_{\rm KL}(Q||P)$, so $D_{\rm KL}$, although it is sometimes called the `\ind{KL distance}', is not strictly a distance\index{distance!$D_{\rm KL}$}.\index{distance!relative entropy} % `distance\index{distance!$D_{\rm KL}$}'. % It is also known as the `discrimination' or `divergence', The \ind{relative entropy} is important in pattern recognition and neural networks, as well as in information theory. % % could include that aston guy's stuff here on (pq)^1/2? % % see also ../notation.tex % \end{description} Gibbs' inequality is probably the most important inequality in this book. It, and many other inequalities, can be proved using the concept of convexity. \section{Jensen's inequality for convex functions} \begin{aside} The words `\ind{\convexsmile}' and `\ind{\concavefrown}' may be pronounced `convex-smile' and `concave-frown'. This terminology has useful redundancy: while one may forget which way up `convex' and `concave' are, it is harder to confuse a smile with a frown.\index{notation!convex/concave} \end{aside} \begin{description} % \item[{\Convexsmile\ functions}\puncspace] A function $f(x)$ is {\dem \ind{\convexsmile}\/} over $(a,b)$ if \amarginfig{c}{% \footnotesize \setlength{\unitlength}{0.75mm} \begin{tabular}{c} \begin{picture}(60,60)(0,0) \put(0,0){\makebox(60,65){\psfig{figure=figs/convex.eps,angle=-90,width=45mm}}} \put(10,8){\makebox(0,0){$x_1$}} \put(48,8){\makebox(0,0){$x_2$}} \put(17,2){\makebox(0,0)[l]{$x^* = \lambda x_1 + (1-\lambda)x_2$}} \put(31,23){\makebox(0,0){$f(x^*)$}} \put(35,39){\makebox(0,0){$\lambda f(x_1) + (1-\lambda)f(x_2)$}} \end{picture} \end{tabular} \caption[a]{Definition of convexity.} \label{fig.convex} }\ every chord of the function lies above the function, as shown in \figref{fig.convex}; that is, for all $x_1,x_2 \in (a,b)$ and $0\leq \lambda \leq 1$, \beq f( \lambda x_1 + 
(1-\lambda)x_2 ) \:\:\leq \:\:\ \lambda f(x_1) + (1-\lambda) f(x_2 ) . \eeq A function $f$ is {\dem strictly \convexsmile\/} if, for all $x_1,x_2 \in (a,b)$, the equality holds only for $\lambda \normaleq 0$ and $\lambda\normaleq 1$. Similar definitions apply to \concavefrown\ and strictly \concavefrown\ functions. \end{description} \newcommand{\tinyfunction}[2]{ \begin{tabular}{@{}c@{}} {\small{#1}} \\[-0.25in] \psfig{figure=figs/#2.ps,width=1.06in,angle=-90} \\ \end{tabular} } Some strictly \convexsmile\ functions are \bit \item $x^2$, $e^x$ and $e^{-x}$ for all $x$; \item $\log (1/x)$ and $x \log x$ for $x>0$. \eit \begin{figure}[htbp] \figuremargin{% \begin{center} \raisebox{0.4in}{% \begin{tabular}[c]{c@{}c@{}c@{}c} \tinyfunction{$x^2$}{convex_xx} & \tinyfunction{$e^{-x}$}{convex_exp-x} & \tinyfunction{$\log \frac{1}{x}$}{convex_logix} & \tinyfunction{$x \log x$}{convex_xlogx} \\[0.2in] %\tinyfunction{$x^2$}{convex_xx} & %\tinyfunction{$e^{-x}$}{convex_exp-x} \\[0.42in] %\tinyfunction{$\log \frac{1}{x}$}{convex_logix} & %\tinyfunction{$x \log x$}{convex_xlogx} \\[0.2in] \end{tabular} } \end{center} }{% \caption[a]{\Convexsmile\ functions.} \label{fig.convexf} }% \end{figure} \begin{description} \item[Jensen's inequality\puncspace] If $f$ is a \convexsmile\ function and $x$ is a random variable then: \beq \Exp\left[ f(x) \right] \geq f\!\left( \Exp[x] \right) , \label{eq.jensen} \eeq where $\Exp$ denotes \ind{expectation}. If $f$ is strictly \convexsmile\ and $\Exp\left[ f(x) \right] \normaleq f\!\left( \Exp[x] \right)$, then the random variable $x$ is a constant. % (with probability 1). % |!!!!!!!!!!!!!!!!! removed pedantry \ind{Jensen's inequality} can also be rewritten for a \concavefrown\ function, with the direction of the inequality reversed. \end{description} A physical version of Jensen's \ind{inequality} runs as follows. 
\amarginfignocaption{b}{\mbox{\psfig{figure=figs/jensenmass.ps,width=1.75in,angle=-90}}}
\begin{quote}
 If a collection of masses $p_i$ are placed on a \convexsmile\ curve $f(x)$
 at locations $(x_i, f(x_i))$, then the \ind{centre of gravity} of those masses,
 which is at $\left( \Exp[x], \Exp\left[ f(x) \right] \right)$,
 lies above the curve.
\end{quote}
If this fails to convince you, then feel free to do the following exercise.
\exercissxC{2}{ex.jensenpf}{
 Prove \ind{Jensen's inequality}.
}
\exampl{ex.jensen}{
 Three squares have average area $\bar{A} = 100\,{\rm m}^2$.
 The average of the lengths of their sides is $\bar{l} = 10\,{\rm m}$.
 What can be said about the size of the largest of the three squares?
 [Use Jensen's inequality.]
}
\solution\ \
 Let $x$ be the length of the side of a square, and let the probability of $x$ be
 $\dthird,\dthird,\dthird$ over the three lengths $l_1,l_2,l_3$.
 Then the information that we have is that $\Exp\left[ x \right]=10$ and
 $\Exp\left[ f(x) \right]=100$, where $f(x) = x^2$ is the function mapping lengths to areas.
 This is a strictly \convexsmile\ function.
 We notice that the equality
 $\Exp\left[ f(x) \right] \normaleq f\!\left( \Exp[x] \right)$ holds, therefore $x$ is a constant,
 and the three lengths must all be equal.
 The area of the largest square is 100$\,{\rm m}^2$.\ENDsolution
\subsection{Convexity and concavity also relate to maximization}
 If $f(\bx)$ is \concavefrown\ and there exists a point at which
\beq
 \frac{\partial f}{\partial x_k} = 0 \:\: \mbox{for all $k$}, % \forall k
\eeq
 then $f(\bx)$ has its maximum value at that point.
 The converse does not hold: if a \concavefrown\ $f(\bx)$ is maximized at some $\bx$
 it is not necessarily true that the gradient $\grad f(\bx)$ is equal to zero there.
 For example, $f(x) = -|x|$ is maximized at $x=0$ where its derivative is undefined;
 and $f(p) = \log(p),$ for a probability $p \in (0,1]$, is maximized on the boundary of the range,
 at $p=1$, where the gradient $\d f(p)/\d p =1$.
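 To see why a stationary point of a \concavefrown\ function must be a maximum,
 here is a short sketch (added for illustration; it assumes that $f$ is differentiable
 at the stationary point $\bx$, and the comparison point $\bx'$ and the weight $\lambda$
 are notation introduced only for this argument).
 For any $\bx'$ and $0 < \lambda \leq 1$, the definition of a \concavefrown\ function gives
\beq
 f( \bx + \lambda (\bx' - \bx) ) \:\geq\: (1-\lambda) f(\bx) + \lambda f(\bx') ,
\eeq
 which rearranges to
\beq
 f(\bx') - f(\bx) \:\leq\: \frac{ f( \bx + \lambda (\bx' - \bx) ) - f(\bx) }{\lambda} .
\eeq
 As $\lambda \rightarrow 0$ the right-hand side tends to
 $\sum_k \frac{\partial f}{\partial x_k} (x'_k - x_k)$, which is zero at the stationary point,
 so $f(\bx') \leq f(\bx)$ for every $\bx'$.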
%, since $f$ might for example % be an increasing function with no maximum such as $\log x$, % or its maximum might be located at a point $\bx$ % on the boundary of the range of $\bx$. % %{\em (is this use of range correct?)} % exercises from that. % % exercises that belong between old chapters 1 and 2. % % see also _p5a.tex for moved exercises. % \section{Exercises} \subsection*{Sums of random variables} % sums of random variables. % dice questions \exercissxA{3}{ex.sumdice}{ \ben \item Two ordinary dice with faces labelled $1,\ldots,6$ are thrown. What is the probability distribution of the sum\index{law of large numbers} of the values? What is the probability distribution of the absolute difference between the values? \item One\marginpar[c]{\small\raggedright\reducedlead{This exercise is intended to help you think about the \ind{central-limit theorem}, which says that if independent random variables $x_1, x_2, \ldots, x_N$ have means $\mu_n$ and finite variances $\sigma_n^2$, then, in the limit of large $N$, the sum $\sum_n x_n$ has a distribution that tends to a normal (\index{Gaussian distribution}Gaussian) distribution with mean $\sum_n \mu_n$ and variance $\sum_n \sigma_n^2$.}} hundred ordinary dice are thrown. What, roughly, is the probability distribution of the sum of the values? Sketch the probability distribution and estimate its mean and standard deviation. \item How can two cubical dice be labelled using the numbers $\{0,1,2,3,4,5,6\}$ so that when the two dice are thrown the sum has a uniform probability distribution over the integers 1--12? % Can you prove your solution is unique? \item Is there any way that one hundred dice could be labelled with integers such that the probability distribution of the sum is uniform? 
\een } % answer, one normal, one 060606 % uniqueness proved by noting that every outcome 1-12 has % to be made from 3 microoutcomes, and 12 can only be made % from 6,6, so there must be a six on each die, indeed 3 on 1, and % 1 on the other. 1 can only be mae from 1,0, and don't want 0,0, % so there must be three 0s. (M Gardner) % \subsection*{Inference problems} \exercissxA{2}{ex.logit}{ If $q=1-p$ and $a = \ln \linefrac{p}{q}$, show that \beq p = \frac{1}{1+\exp(-a)} . \label{eq.sigmoid} \label{eq.logistic} \eeq Sketch this function and find its relationship to the hyperbolic tangent function $\tanh(u)=\frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$. It will be useful to be fluent in base-2 logarithms also. If $b = \log_2 \linefrac{p}{q}$, what is $p$ as a function of $b$? } % % is this exercise inappropriate now because we have not defined % joint ensembles yet? % \exercissxB{2}{ex.BTadditive}{ Let $x$ and $y$ be dependent % correlated random variables with $x$ a binary variable taking values in $\A_X = \{ 0,1 \}$. Use \Bayes\ theorem to show that the log posterior probability ratio for $x$ given $y$ is \beq \log \frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} = \log \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)} + \log \frac{P(x\eq 1)}{P(x\eq 0)} . \eeq } % define ODDS ? \exercissxB{2}{ex.d1d2}{ Let $x$, $d_1$ and $d_2$ be random variables such that $d_1$ and $d_2$ are conditionally independent given a binary variable $x$. % (That is, $P(x,d_1,d_2) % = P(x)P(d_1 \given x)P(d_2 \given x)$.) % % somewhere I need to introduce graphical repns and define % % TO DO!!! TODO % % (\ind{conditional independence} is discussed further in section XXX.) % % and give examples. A and C children of B. and A->B->C % Jensen defn is % A is cond indep of B given C if % A|B,C = A|C % which is symmetric, implying by BT % B|A,C = B|C % pf % B|A,C = A|B,C B|C / A|C = B|C % my defn here is % A,B,C = C A|C B|C % proof: % A,B,C = C A|C B|C,A = . 
% NB graphical model and decomposition are not 1-1 related. The two % graphs A and C children of B. and A->B->C both have a joint prob % that can be factorized in either way. % % $x$ is a binary variable taking values in $\A_X = \{ 0,1 \}$. Use \Bayes\ theorem to show that the posterior probability ratio for $x$ given $\{d_i \}$ is \beq \frac{P(x\eq 1 \given \{d_i \} )}{P(x\eq 0 \given \{d_i \})} = \frac{P(d_1 \given x\eq 1)}{P(d_1 \given x\eq 0)} \frac{P(d_2 \given x\eq 1)}{P(d_2 \given x\eq 0)} \frac{P(x\eq 1)}{P(x\eq 0)} . \eeq } \subsection*{Life in high-dimensional spaces} %{Life in $\R^N$} \index{life in high dimensions} \index{high dimensions, life in} Probability distributions and volumes have some unexpected properties in high-dimensional spaces. % The real line is denoted by $\R$. An $N$--dimensional real space % is denoted by $\R^N$. \exercissxA{2}{ex.RN}{ Consider a sphere of radius $r$ in an $N$-dimensional real space. % dimensions. Show that the fraction of the volume of the sphere that is in the surface shell lying at values of the radius between $r- \epsilon$ and $r$, where $0 < \epsilon < r$, is: \beq f = 1 - \left( 1 - \frac{\epsilon}{r} \right)^{\!N} . \eeq % from Bishop p.29 Evaluate $f$ for the cases $N\eq 2$, $N\eq 10$ and $N\eq 1000$, with (a) $\epsilon/r \eq 0.01$; (b) $\epsilon/r \eq 0.5$. {\sf Implication:} points that are uniformly distributed in a sphere in $N$ dimensions, where $N$ is large, are very likely to be in a \ind{thin shell} near the surface. % (From Bishop (1995).) } % \label{sec.exercise.block1} \subsection*{Expectations and entropies} You are probably familiar with the idea of computing the \ind{expectation}\index{notation!expectation} of a function of $x$, \beq \Exp\left[ f(x) \right] = \left< f(x) \right> = \sum_{x} P(x) f(x) . \eeq Maybe you are not so comfortable with computing this expectation in cases where the function $f(x)$ depends on the probability $P(x)$. The next few examples address this concern. 
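 As a small warm-up (an illustration added here; the three-outcome distribution
 is chosen purely for convenience and is not one of the exercises), let $x$ take three values
 with probabilities $\frac{1}{2}$, $\frac{1}{4}$ and $\frac{1}{4}$, and let the function be
 $\log_2 (1/P(x))$. The recipe is unchanged: evaluate the function at each outcome,
 weight by $P(x)$, and sum:
\beq
 \Exp\left[ \log_2 \frac{1}{P(x)} \right]
 = \frac{1}{2} \log_2 2 + \frac{1}{4} \log_2 4 + \frac{1}{4} \log_2 4
 = \frac{1}{2} + \frac{1}{2} + \frac{1}{2} = 1.5 .
\eeq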
\exercissxA{1}{ex.expectn}{ Let $p_a \eq 0.1$, $p_b \eq 0.2$, and $p_c \eq 0.7$. Let $f(a) \eq 10$, $f(b) \eq 5$, and $f(c) \eq 10/7$. What is $\Exp\left[ f(x) \right]$? What is $\Exp\left[ 1/P(x) \right]$? } \exercissxA{2}{ex.invP}{ For an arbitrary ensemble, what is $\Exp\left[ 1/P(x) \right]$? } \exercissxB{1}{ex.expectng}{ Let $p_a \eq 0.1$, $p_b \eq 0.2$, and $p_c \eq 0.7$. Let $g(a) \eq 0$, $g(b) \eq 1$, and $g(c) \eq 0$. What is $\Exp\left[ g(x) \right]$? } \exercissxB{1}{ex.expectng2}{ Let $p_a \eq 0.1$, $p_b \eq 0.2$, and $p_c \eq 0.7$. What is the probability that $P(x) \in [0.15,0.5]$? What is \[ P\left( \left| \log \frac{P(x)}{ 0.2} \right| > 0.05 \right) ? \] } \exercissxA{3}{ex.Hineq}{ Prove the assertion that $H(X) \leq \log(|\A_X|)$ with equality iff $p_i \normaleq 1/|\A_X|$ for all $i$. ($|\A_X|$ denotes the number of elements in the set $\A_X$.) [Hint: use Jensen's inequality (\ref{eq.jensen}); if your first attempt to use Jensen does not succeed, remember that Jensen involves both a random variable and a function, and you have quite a lot of freedom in choosing these; think about whether your chosen function $f$ should be convex or concave.] % further hint: try $u\eq 1/p_i$ as the random variable.] } \exercissxB{3}{ex.rel.ent}{ Prove that the relative entropy (\eqref{eq.KL}) satisfies $D_{\rm KL}(P||Q) \geq 0$ (\ind{Gibbs' inequality}) with equality only if $P \normaleq Q$. % You may find this result % helps with the previous two exercises. Note (moved to _p5a.tex) % % refer to this in mean field theory chapter {ch.mft} % } % % Decomposability of the entropy \exercisaxB{2}{ex.entropydecompose}{ Prove that the entropy is indeed decomposable as described in \eqsref{eq.entropydecompose}{eq.entdecompose2}. 
} \exercissxB{2}{ex.decomposeexample}{ A random variable $x \in \{0,1,2,3\}$ is selected by flipping a bent coin with bias $f$ to determine whether the outcome is in $\{0,1\}$ or $\{ 2,3\}$; \amarginfignocaption{t}{% \begin{center}\small%footnotesize \setlength{\unitlength}{0.6mm} \begin{picture}(30,50)(-10,-15) \put(-6,25){{\makebox(0,0)[r]{$f$}}} \put(-6,5){{\makebox(0,0)[r]{$1\!-\!f$}}} \put(-10,15){\vector(1,1){17}} \put(-10,15){\vector(1,-1){17}} \put(10,35){\vector(1,1){10}} \put(10,35){\vector(1,-1){10}} \put(16,45){{\makebox(0,0)[r]{$g$}}} \put(16,25){{\makebox(0,0)[r]{$1\!-\!g$}}} \put(16,5){{\makebox(0,0)[r]{$h$}}} \put(16,-15){{\makebox(0,0)[r]{$1\!-\!h$}}} \put(10,-5){\vector(1,1){10}} \put(10,-5){\vector(1,-1){10}} \put(24,45){{\makebox(0,0)[l]{\tt 0}}} \put(24,25){{\makebox(0,0)[l]{\tt 1}}} \put(24,5){{\makebox(0,0)[l]{\tt 2}}} \put(24,-15){{\makebox(0,0)[l]{\tt 3}}} \end{picture} \end{center} } then either flipping a second \ind{bent coin} with bias $g$ or a third bent coin with bias $h$ respectively. Write down the probability distribution of $x$. Use the decomposability of the entropy (\ref{eq.entdecompose2}) to find the entropy of $X$. [Notice how compact an expression is obtained if you make use of the binary entropy function $H_2(x)$, compared with writing out the four-term entropy explicitly.] Find the derivative of $H(X)$ with respect to $f$. [Hint: $\d H_2(x)/\d x = \log((1-x)/x)$.] } \exercissxB{2}{ex.waithead0}{ An unbiased coin is flipped until one head is thrown. What is the entropy of the random variable $x \in \{1,2,3,\ldots\}$, the number of flips? Repeat the calculation for the case of a biased coin with probability $f$ of coming up heads. [Hint: solve the problem both directly and by using the decomposability of the entropy (\ref{eq.entropydecompose}).] % } % % removed joint entropy questions. 
\section{Further exercises} % \subsection*{Forward probability}% problems} \exercisaxB{1}{ex.balls}{ An urn contains $w$ white balls and $b$ black balls. Two balls are drawn, one after the other, without replacement. Prove that the probability that the first ball is white is equal to the probability that the second is white. } % \exercisaxB{2}{ex.buffon}{ A circular \ind{coin} of diameter $a$ is thrown onto a \ind{square} grid whose squares are $b \times b$. ($a B$ given that $F>A$?) } \exercisaxB{2}{ex.liars}{ The inhabitants of an island tell the truth one third of the time. They lie with probability 2/3. On an occasion, after one of them made a statement, you ask another `was that statement true?' and he says `yes'. What is the probability that the statement was indeed true? % [Ans: 1/5]. } % \exercissxB{2}{ex.R3error}{ Compare two ways of computing the probability of error of the repetition code $\Rthree$, assuming a binary symmetric channel (you did this once for \exerciseref{ex.R3ep}) and confirm that they give the same answer. \begin{description} \item[Binomial distribution method\puncspace] Add the probability that all three bits are flipped to the probability that exactly two bits are flipped. % Add the probability of all three bits' % being flipped to the probability of exactly two bits' being flipped. \item[Sum rule method\puncspace] % Using the different possible inferences] Using the \ind{sum rule}, compute the marginal probability that $\br$ takes on each of the eight possible values, $P(\br)$. [$P(\br) = \sum_s P(s)P(\br \given s)$.] Then compute the posterior probability of $s$ for each of the eight values of $\br$. [In fact, by symmetry, only two example cases $\br = ({\tt0}{\tt0}{\tt0})$ and $\br = ({\tt0}{\tt0}{\tt1})$ need be considered.] \marginpar{\small\raggedright\reducedlead{\Eqref{eq.bayestheorem} gives the posterior probability of the input $s$, given the received vector $\br$. 
}} % $\br = ({\tt1},{\tt1},{\tt0})$, % $\br = ({\tt1},{\tt1},{\tt1})$, Notice that some of the inferred bits are better determined than others. From the posterior probability $P(s \given \br)$ you can read out the case-by-case error probability, the probability that the more probable hypothesis is not correct, $P(\mbox{error} \given \br)$. Find the average error probability using the sum rule, \beq P(\mbox{error}) = \sum_{\br} P(\br) P(\mbox{error} \given \br) . \eeq \end{description} } % \exercissxB{3C}{ex.Hwords}{ The frequency % probability $p_n$ of the $n$th most frequent word in English is roughly approximated by \beq p_n \simeq \left\{ \begin{array}{ll} \frac{0.1}{n} & \mbox{for $n \in 1, \ldots, 12\,367$} % 8727$.} \\ 0 & n > 12\,367 . \end{array} \right. \eeq [This remarkable $1/n$ law is known as \ind{Zipf's law}, and applies to the word frequencies of many languages % cite Shannon collection p.197 - except he has the number 8727, wrong! % could also cite Gell-Mann \cite{zipf}.] If we assume that English is generated by picking words at random according to this distribution, what is the entropy of English (per word)? [This calculation can be found in `Prediction and entropy of printed English', C.E.\ Shannon, {\em Bell Syst.\ Tech.\ J.}\ {\bf 30}, p\pdot50--64 (1950), but, inexplicably, the great man made numerical errors in it.] % , in bits per word? } %%% Local Variables: %%% TeX-master: ../book.tex %%% End: % \input{tex/_e1A.tex}%%%%%%%%%%%%%%%%%%%%% inference probs to do with logit and dice and decay moved into _p8.tex \dvips % include urn.tex here for another forward probability exercise. % \section{Solutions}% to Chapter \protect\ref{ch.prob.ent}'s exercises} \fakesection{_s1aa solutions} %================================= \soln{ex.independence.bigram}{ No, they are not independent. If they were then all the conditional distributions $P(y \given x)$ would be identical functions of $y$, regardless of $x$ (\cf\ \figref{fig.conbigrams}). 
} \soln{ex.fp.toss}{ We define the fraction $f_B \equiv B/K$. \ben \item The number of black balls has a binomial distribution. \beq P(n_B\,|\,f_B,N) = {N \choose n_B} f_B^{n_B} (1-f_B)^{N-n_B} . \eeq \item The mean and variance of this distribution are: \beq \Exp [ n_B ] = N f_B \eeq \beq \var[n_B] = N f_B (1-f_B) . \label{eq.variance.binomial} \eeq These results were derived in \exampleref{ex.binomial}. The standard deviation of $n_B$ is $\sqrt{\var[n_B]} = \sqrt{N f_B (1-f_B)}$. % on page \pageref{sec.first.binomial.sol}. When $B/K = 1/5$ and $N=5$, the expectation and variance of $n_B$ are 1 and 4/5. The standard deviation is 0.89. When $B/K = 1/5$ and $N=400$, the expectation and variance of $n_B$ are 80 and 64. The standard deviation is 8. \een } \soln{ex.fp.chi}{ The numerator of the quantity \[%beq z = \frac{(n_B - f_B N)^2}{ {N f_B (1-f_B)} } %\label{eq.chisquared} \]%eeq can be recognized as\index{chi-squared}\index{$\chi^2$} $\left( n_B - \Exp [ n_B ] \right)^2$; the denominator is equal to the variance of $n_B$ (\ref{eq.variance.binomial}), which is by definition the expectation of the numerator. So the expectation of $z$ is 1. [A random variable like $z$, which measures the deviation of data from the expected % average value, is sometimes called $\chi^2$ (chi-squared).] In the case $N=5$ and $f_B = 1/5$, $N f_B$ is 1, and $\var[n_B]$ is 4/5. The numerator has five possible values, only one of which is smaller than 1: $(n_B - f_B N)^2 = 0$ has probability $P(n_B \eq 1)= 0.4096$; % $(n_B - f_B N)^2 = 1$ has probability $P(n_B = 0)+P(n_B = 2)= $ ; % $(n_B - f_B N)^2 = 4$ has probability $P(n_B = 3)= $ ; % $(n_B - f_B N)^2 = 9$ has probability $P(n_B = 4)= $ ; % $(n_B - f_B N)^2 = 16$ has probability $P(n_B = 5)= $ ; so the probability that $z < 1$ is 0.4096. 
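 As a check on these numbers (a worked evaluation added for illustration, using only
 the binomial distribution above), the probability of the single outcome with $z<1$ is
\beq
 P(n_B \eq 1 \given f_B \eq 1/5, N \eq 5)
 = {5 \choose 1} \left( \frac{1}{5} \right) \left( \frac{4}{5} \right)^{\!4}
 = 5 \times 0.2 \times 0.4096 = 0.4096 ,
\eeq
 and the remaining probabilities are
 $P(n_B \eq 0) = 0.32768$, $P(n_B \eq 2) = 0.2048$, $P(n_B \eq 3) = 0.0512$,
 $P(n_B \eq 4) = 0.0064$ and $P(n_B \eq 5) = 0.00032$.
 The expectation of the numerator is then
\beq
 0 \times 0.4096 + 1 \times ( 0.32768 + 0.2048 ) + 4 \times 0.0512 + 9 \times 0.0064 + 16 \times 0.00032 = 0.8 ,
\eeq
 which equals $\var[n_B] = 4/5$, confirming that the expectation of $z$ is 1.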
% } % % stole solution from here % %%%%%%%%%%%%%%%%%%%%%%%%%% added 99 9 14 \soln{ex.jensenpf}{ We wish to prove, given the property \beq f( \lambda x_1 + (1-\lambda)x_2 ) \:\: \leq \:\: \lambda f(x_1) + (1-\lambda) f(x_2 ) , \label{eq.convexdefn} \eeq that, if $\sum p_i = 1$ and $p_i \geq 0$, \beq% % \Exp\left[ f(x) \right] \geq f\left( \Exp[x] \right) , \sum_{i=1}^I p_i f(x_i) \geq f\left( \sum_{i=1}^I p_i x_i \right) . \eeq We proceed by recursion, working from the right-hand side. (This proof does not % needs further work to handle % awkward cases where some $p_i=0$; such details are left to the pedantic reader.) At the first line we use the definition of convexity (\ref{eq.convexdefn}) with $\lambda = \frac{p_1}{\sum_{i=1}^I p_i } = p_1$; at the second line, $\lambda = \frac{p_2}{\sum_{i=2}^I p_i }$. % , and so forth. \fakesection{temporary solution} \begin{eqnarray} \lefteqn{ f\left( \sum_{i=1}^I p_i x_i \right) = % &=& f\left( p_1 x_1 + \sum_{i=2}^I p_i x_i \right) } \nonumber \\ &\leq& p_1 f(x_1) + \left[ \sum_{i=2}^I p_i \right] \left[ f\left( \sum_{i=2}^I p_i x_i \left/ \sum_{i=2}^I p_i \right. \right) \right] \\ &\leq& p_1 f(x_1) + \left[ \sum_{i=2}^I p_i \right] \left[ \frac{p_2} {\sum_{i=2}^I p_i } f\left( x_2 \right) + \frac{\sum_{i=3}^I p_i} {\sum_{i=2}^I p_i } f\left( \sum_{i=3}^I p_i x_i \left/ \sum_{i=3}^I p_i \right. \right) \right] , \nonumber % probably cut this last line, just show one itn of recursion % \end{eqnarray} and so forth. % % this works if I want to restore it. Indeed I have restored it \hfill $\epfsymbol$% $\Box$%\epfs% end proof symbol } %%%%%%%%%%%%%%%%%%%% % main post-chapter exercise solution area: % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \soln{ex.sumdice}{ \ben \item For the outcomes $\{2,3,4,5,6,7,8,9,10,11,12\}$, the probabilities are $\P = \{ \frac{1}{36}, \frac{2}{36}, \frac{3}{36}, \frac{4}{36}, \frac{5}{36}, \frac{6}{36}, \frac{5}{36}, \frac{4}{36}, \frac{3}{36}, \frac{2}{36}, \frac{1}{36}\}% $. 
\item The value of one die has mean $3.5$ and variance $35/12$. So the sum of one hundred has mean $350$ and variance $3500/12 \simeq 292$, and by the \ind{central-limit theorem} the probability distribution is roughly Gaussian (but confined to the integers), with this mean and variance. \item In order to obtain a sum that has a uniform distribution we have to start from random variables some of which have a spiky distribution with the probability mass concentrated at the extremes. The unique solution is to have one ordinary die and one with faces 6, 6, 6, 0, 0, 0. % That this solution is unique can be proved with an argument % that starts by noting % that each of the 12 outcomes has to be realized % by 3 distinct microstates (a microstate % being one of the 36 particular orientations % of the two dice). To create outcome `12' % in three ways there must be one six on % one dice and three sixes on the other; % similarly to create outcome `1' three ways, there % must be one die with three zeroes on it % and one with one one. \item Yes, a uniform distribution can be created in several ways,\marginpar[t]{\small\raggedright\reducedlead{To think about: does this uniform distribution contradict the \ind{central-limit theorem}?}} for example by labelling the $r$th die with the numbers $\{0,1,2,3,4,5\}\times 6^r$. \een } % \subsection*{Inference problems} % \soln{ex.logit}{ \beqan a = \ln \frac{p}{q} \hspace{0.2in} & \Rightarrow & \hspace{0.2in} \frac{p}{q} = e^a \label{logit.step1} \eeqan and $q=1-p$ gives \beqan \frac{p}{1-p} & =& e^a \\ \Rightarrow \hspace{0.52in} p & = & \frac{e^a}{e^a+1} = \frac{1}{1+\exp(-a)} . \label{logit.step2} \eeqan The hyperbolic tangent is \beq \tanh(a) = \frac{e^a -e^{-a}}{e^a + e^{-a}} \eeq so \beqan f(a)& \equiv& \frac{1}{1+\exp(-a)} = \frac{1}{2} \left( \frac{1-e^{-a}}{1+e^{-a}} + 1 \right) \nonumber \\ &=& \frac{1}{2}\left( \frac{ e^{a/2} - e^{-a/2} }{ e^{a/2} + e^{-a/2}} +1 \right) = \frac{1}{2} ( \tanh(a/2) + 1 ) . 
\eeqan In the case $b = \log_2 \linefrac{p}{q}$, we can repeat steps (\ref{logit.step1}--\ref{logit.step2}), replacing $e$ by $2$, to obtain \beq p = \frac{1}{1+2^{-b}} . \label{eq.sigmoid2} \label{eq.logistic2} \eeq } \soln{ex.BTadditive}{ \beqan P(x \given y) &=& \frac{P(y \given x)P(x) }{P(y)} \\%\eeq\beq \Rightarrow\:\: \frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} &=& \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)} \frac{P(x\eq 1)}{P(x\eq 0)} \\%\eeq\beq \Rightarrow\:\: \log \frac{P(x\eq 1 \given y)}{P(x\eq 0 \given y)} &=& \log \frac{P(y \given x\eq 1)}{P(y \given x\eq 0)} + \log \frac{P(x\eq 1)}{P(x\eq 0)} . \eeqan } \soln{ex.d1d2}{ The conditional independence of $d_1$ and $d_2$ given $x$ means \beq P(x,d_1,d_2) = P(x)P(d_1 \given x)P(d_2 \given x) . \eeq This gives a separation of the posterior probability ratio into a series of factors, one for each data point, times the prior probability ratio. \beqan \frac{P(x\eq 1 \given \{d_i \} )}{P(x\eq 0 \given \{d_i \})} &=& \frac{P(\{d_i\} \given x\eq 1)}{P(\{d_i\} \given x\eq 0)} \frac{P(x\eq 1)}{P(x\eq 0)} \\ &=& \frac{P(d_1 \given x\eq 1)}{P(d_1 \given x\eq 0)} \frac{P(d_2 \given x\eq 1)}{P(d_2 \given x\eq 0)} \frac{P(x\eq 1)}{P(x\eq 0)} . \eeqan } % % \subsection*{Life in high-dimensional spaces} \soln{ex.RN}{ The \ind{volume} of a \ind{hypersphere} of radius $r$ in $N$ dimensions is in fact \beq V(r,N) = \frac{\pi^{N/2}}{(N/2)!} r^{N} , \eeq but you don't need to know this. For this question all that we need is the $r$-dependence, $V(r,N) \propto r^{N} .$ So the fractional volume in $(r-\epsilon,r)$ is \beq \frac{ r^{N} - (r-\epsilon)^N }{ r^N} = 1 -\left( 1 -\frac{\epsilon}{r}\right)^N . 
\eeq The fractional volumes in the shells for the required cases are: \begin{center} \begin{tabular}[t]{cccc} \toprule $N$ & 2 & 10 & 1000 \\ \midrule $\epsilon/r = 0.01$ & 0.02 & 0.096 & 0.99996 \\ $\epsilon/r = 0.5\phantom{0}$ & 0.75 & 0.999 & $1 - 2^{-1000}$ \\ \bottomrule \end{tabular}\\ \end{center} \noindent Notice that no matter how small $\epsilon$ is, for large enough $N$ essentially all the probability mass is in the surface shell of thickness $\epsilon$. } %\soln{ex.weigh}{ % See chapter \chtwo. %} % \soln{ex.expectn}{ $p_a \eq 0.1$, $p_b \eq 0.2$, $p_c \eq 0.7$. $f(a) \eq 10$, $f(b) \eq 5$, and $f(c) \eq 10/7$. \beq \Exp\left[ f(x) \right] = 0.1 \times 10 + 0.2 \times 5 + 0.7 \times 10/7 = 3. \eeq For each $x$, $f(x) = 1/P(x)$, so \beq \Exp\left[ 1/P(x) \right] = \Exp\left[ f(x) \right] = 3. \eeq } % \soln{ex.invP}{ For general $X$, \beq \Exp\left[ 1/P(x) \right] = \sum_{x\in \A_X} P(x) 1/P(x) = \sum_{x\in \A_X} 1 = | \A_X | . \eeq } % \soln{ex.expectng}{ $p_a \eq 0.1$, $p_b \eq 0.2$, $p_c \eq 0.7$. $g(a) \eq 0$, $g(b) \eq 1$, and $g(c) \eq 0$. \beq \Exp\left[ g(x) \right]=p_b = 0.2. \eeq } \soln{ex.expectng2}{ \beq P\left( P(x) \! \in \! [0.15,0.5] \right) = p_b = 0.2 . \eeq \beq P\left( \left| \log \frac{P(x)}{ 0.2} \right| > 0.05 \right) = p_a + p_c = 0.8 . \eeq } % \soln{ex.Hineq}{ This type of question can be approached in two ways: either by differentiating the function to be maximized, finding the maximum, and proving it is a global maximum; this strategy is somewhat risky since it is possible for the maximum of a function to be at the boundary of the space, at a place where the derivative is not zero. Alternatively, a carefully chosen inequality can establish the answer. The second method is much neater. 
\begin{Prooflike}{Proof by differentiation (not the recommended method)}
 Since it is slightly easier to differentiate $\ln 1/p$ than $\log_2 1/p$,
 we temporarily define $H(X)$ to be measured using natural logarithms,
 thus scaling it down by a factor of $\log_2 e$.
\beqan
 H(X) &=& \sum_i p_i \ln \frac{1}{p_i} \\
 \frac{\partial H(X)}{\partial p_i} &=& \ln \frac{1}{p_i} - 1 .
\eeqan
 We maximize subject to the constraint $\sum_i p_i = 1$, which can be enforced with a Lagrange multiplier:
\beqan
 G(\bp) & \equiv & H(X) + \lambda \left( \sum_i p_i - 1 \right) \\
 \frac{\partial G(\bp)}{\partial p_i} &=& \ln \frac{1}{p_i} - 1 + \lambda .
\eeqan
 At a maximum,
\beqan
 \ln \frac{1}{p_i} - 1 + \lambda &=& 0 \\
 \Rightarrow \ln \frac{1}{p_i} &=& 1 - \lambda ,
\eeqan
 so all the $p_i$ are equal. That this extremum is indeed a maximum is established by finding the curvature:
\beq
 \frac{\partial^2 G(\bp)}{\partial p_i \partial p_j} = -\frac{1}{p_i} \delta_{ij} ,
\eeq
 which is negative definite. \hfill
\end{Prooflike}
\begin{Prooflike}{Proof using Jensen's inequality (recommended method)}
 First a reminder of the inequality.
\begin{quotation}
\noindent
 If $f$ is a \convexsmile\ function and $x$ is a random variable then:
\[%beq
 \Exp\left[ f(x) \right] \geq f\left( \Exp[x] \right) .
\]%eeq
 If $f$ is strictly \convexsmile\ and $\Exp\left[ f(x) \right] \eq f\left( \Exp[x] \right)$,
 then the random variable $x$ is a constant (with probability 1).
\end{quotation}
 The secret of a proof using Jensen's inequality is to choose the right function
 and the right random variable. We could define
% $f(u) = \log \frac{1}{u}$ and
\beq
 f(u) = \log \frac{1}{u} = - \log u
\eeq
 (which is a convex function) and think of $H(X) = \sum p_i \log \frac{1}{p_i}$
 as the mean of $f(u)$ where $u=P(x)$, but this would not get us there --
 it would give us an inequality in the wrong direction.
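 Explicitly (a worked step added for illustration), taking $u=P(x)$ in Jensen's inequality gives
\beq
 H(X) = \Exp\left[ f( P(x) ) \right] \geq f\left( \Exp[ P(x) ] \right) = \log \frac{1}{\sum_i p_i^2} ,
\eeq
 which bounds $H(X)$ from below, whereas the assertion to be proved is an upper bound.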
If instead we define \beq u = 1/P(x) \eeq then we find: % this introduces an extra minus sign: \beq H(X) = - \Exp\left[ f( 1/P(x) ) \right] \leq - f\left( \Exp[ 1/P(x) ] \right) ; \eeq now we know from \exerciseref{ex.invP}\ that $\Exp[ 1/P(x) ] = |\A_X|$, so \beq H(X) \leq - f\left( |\A_X| \right) = \log |\A_X| . \eeq Equality holds only if the random variable $u = 1/P(x)$ is a constant, which means $P(x)$ is a constant for all $x$. \end{Prooflike} } % \soln{ex.rel.ent}{ \beq D_{\rm KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} . % \label{eq.KL} \eeq \label{sec.gibbs.proof}% cross ref problem? Tue 12/12/00 We prove \ind{Gibbs' inequality} using \ind{Jensen's inequality}. Let $f(u) = \log 1/u$ and $u=\smallfrac{Q(x)}{P(x)}$. Then \beqan D_{\rm KL}(P||Q) & =& \Exp[ f( Q(x)/P(x) ) ] \\ &\geq& f\left( \sum_x P(x) \frac{Q(x)}{P(x)} \right) = \log \left( \frac{1}{\sum_x Q(x)} \right) = 0, \eeqan with equality only if $u=\frac{Q(x)}{P(x)}$ is a constant, that is, if $Q(x) = P(x)$.\hfill$\epfsymbol$\\ \begin{Prooflike}{Second solution} In the above proof the expectations were with respect to the probability distribution $P(x)$. A second solution method uses Jensen's inequality with $Q(x)$ instead. We define $f(u) = u \log u$ and let $u = \frac{P(x)}{Q(x)}$. Then \beqan D_{\rm KL}(P||Q)& =& \sum_x Q(x) \frac{P(x)}{Q(x)} \log \frac{P(x)}{Q(x)} = \sum_x Q(x) f\left( \frac{P(x)}{Q(x)} \right) \\ &\geq& f\left( \sum_x Q(x) \frac{P(x)}{Q(x)} \right) = f(1) = 0, \eeqan with equality only if $u=\frac{P(x)}{Q(x)}$ is a constant, that is, if $Q(x) = P(x)$. \end{Prooflike} } % % solns moved to _s5A.tex % \soln{ex.decomposeexample}{ \beq H(X)= H_2(f) + f H_2(g) + (1-f) H_2(h) . \eeq } % \soln{ex.waithead0}{ The probability that there are $x-1$ tails and then one head (so we get the first head on the $x$th toss) is \beq P(x) = (1-f)^{x-1} f . \eeq If the first toss is a tail, the probability distribution for the future looks just like it did before we made the first toss. 
Thus we have a recursive expression for the entropy: \beq H(X) = H_2( f ) + (1-f) H(X) . \eeq Rearranging, \beq H(X) = H_2( f ) / f . \eeq } % % \fakesection{waithead solution} \soln{ex.waithead}{ The probability of the number of tails $t$ is \beq P(t) = \left(\frac{1}{2}\right)^{\!t} \frac{1}{2} \:\mbox{ for $t\geq 0$}. \eeq The expected number of heads is 1, by definition of the problem. The expected number of tails is \beq \Exp[t] = \sum_{t=0}^{\infty} t \left(\frac{1}{2}\right)^{\!t} \frac{1}{2} , \eeq which may be shown to be 1 in a variety of ways. For example, since the situation after one tail is thrown is equivalent to the opening situation, we can write down the recurrence relation \beq \Exp[t] = \frac{1}{2} ( 1 + \Exp[t] ) + \frac{1}{2}0 \:\: \Rightarrow \:\: \Exp[t] = 1. \eeq % if we define $S=\Exp[t]$ then we can subtract $S/2$ from $S$ to obtain % a geometric series: %\beq % (1-1/2)S = \sum_{t=0}^{\infty} \left(\frac{1}{2}\right)^{t+1} % = \frac{1/2}{1-1/2} = 1 %\eeq % which gives $S=2$ --- what? %%%%%%%%%%%%%%%% %, for example, introducing % $Z(\beta) \equiv \sum_t \left(\frac{1}{2}\right)^{\beta t} \frac{1}{2} % = \frac{1}{2}/\left(1 - (\linefrac{1}{2})^{\beta}\right)$: %\beq % \sum_{t=0}^{\infty} t \left(\frac{1}{2}\right)^{t} \frac{1}{2} % = \frac{\d}{\d\beta} \log Z %\eeq The probability distribution of the `estimator' $\hat{f} = 1/(1+t)$, given that $f=1/2$, is plotted in \figref{fig.f.estimator}. The probability of $\hat{f}$ is simply the probability of the corresponding value of $t$. 
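 For concreteness (values worked out here for illustration from the distribution $P(t)$ above),
 the first few outcomes are
\beq
 P(\hat{f} \eq 1) = \frac{1}{2} , \:\:\:
 P(\hat{f} \eq 1/2) = \frac{1}{4} , \:\:\:
 P(\hat{f} \eq 1/3) = \frac{1}{8} , \:\:\:
 P(\hat{f} \eq 1/4) = \frac{1}{16} ,
\eeq
 and the mean of the estimator is
\beq
 \Exp[ \hat{f} ] = \sum_{t=0}^{\infty} \frac{1}{1+t} \left(\frac{1}{2}\right)^{\!t+1}
 = \sum_{k=1}^{\infty} \frac{(1/2)^k}{k} = \ln 2 \simeq 0.69 ,
\eeq
 using the series $\sum_{k \geq 1} y^k/k = -\ln(1-y)$ for $|y|<1$;
 so the mean of $\hat{f}$ does not coincide with the true value $f=1/2$.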
% % gnuplot % load 'figs/festimator.gnu' %\begin{figure} %\figuremargin{% \marginfig{% \begin{center} \begin{tabular}{c} $P(\hat{f})$\\[-0.3in] \mbox{\psfig{figure=figs/festimator.ps,angle=-90,width=2in}}\\ \hspace{1.82in}$\hat{f}$ \end{tabular} \end{center} %}{% \caption[a]{The probability distribution of the estimator $\hat{f} = 1/(1+t)$, given that $f=1/2$.} % , so that $P(t) = 1/2^{t+1}$.} \label{fig.f.estimator} %} %\end{figure} } } \soln{ex.waitbus}{ \ben \item The mean number of rolls from one six to the next six is six (assuming we % don't count the first of the two sixes). start counting rolls after the first of the two sixes). The probability that the next six occurs on the $r$th roll is the probability of {\em not\/} getting a six for $r-1$ rolls multiplied by the probability of then getting a six: \beq P(r_1 \eq r) = \left( \frac{5}{6} \right)^{\! r-1} \frac{1}{6}, \:\: \mbox{for $r\in \{1,2,3,\ldots \}$.} \eeq This probability distribution of the number of rolls, $r$, may be called an \ind{exponential distribution}, since \beq P(r_1 \eq r) = e^{-\alpha r} / Z, \eeq where $\alpha = \ln({6}/5)$, and $Z$ is a normalizing constant. \item The mean number of rolls from the clock until the next six is six. \item The mean number of rolls, going back in time, until the most recent six is six. \item The mean number of rolls from the six before the clock struck to the six after the clock struck is the sum of the answers to (b) and (c), less one, % (assuming we don't count the first of the two sixes), that is, eleven. \item Rather than explaining the difference between (a) % six and and (d), let me give another hint.\index{bus-stop paradox}\index{waiting for a bus} % see gnu/waitbus.gnu Imagine that the buses in Poissonville arrive independently at random (a \ind{Poisson process}), with, on average, one bus every six minutes. 
Imagine that passengers turn up at {\busstop}s at a uniform rate, % random also, and are scooped up by the bus without delay, so the interval between two buses remains constant. Buses that follow gaps bigger than six minutes become overcrowded. The passengers' representative complains that two-thirds of all passengers found themselves on overcrowded buses. The bus operator claims, `no, no -- only one third of our buses are overcrowded'. Can both these claims be true? \een \amarginfig{b}{% \begin{center} \mbox{\hspace{-0.3in}\psfig{figure=figs/waitbus.ps,angle=-90,width=2.05in}}\\[-0.2in] \end{center} \caption[a]{The probability distribution of the number of rolls $r_1$ from one 6 to the next (falling solid line), \[%\beq P(r_1 \eq r) = \left( \frac{5}{6} \right)^{\! r-1} \frac{1}{6} , \]%\eeq and the probability distribution (dashed line) of % the quantity $r_{\rm tot}=r_1+r_2-1$, the number of rolls from the 6 before 1pm to the next 6, % where $r_1$ and $r_2$ are the numbers of rolls before % and after the clock strikes, $r_{\rm tot}$, \[%\beq P(r_{\rm tot} \eq r) = r \, \left( \frac{5}{6} \right)^{\! r-1} \left( \frac{1}{6} \right)^{\! 2 } . \]%\eeq The probability $P(r_1>6)$ is about 1/3; the probability $P(r_{\rm tot} > 6 )$ is about 2/3. The mean of $r_1$ is 6, and the mean of $r_{\rm tot}$ is 11. 
} % other elegant ways of saying it: % P( number rolls from one 6 to the next) % P( number of rolls from the 6 before 1pm to the next) }% end figure }% end solbn % % \subsection{Move this solution} % % \subsection*{Conditional probability} % \soln{ex.R3error}{ % \fakesection{r3 error soln} \soln{ex.R3error}{ \begin{description} \item[Binomial distribution method\puncspace] From the solution to \exerciseonlyref{ex.R3ep}, $p_B = 3 f^2 (1-f) + f^3$.\index{repetition code} \item[Sum rule method\puncspace] The marginal probabilities of the eight values of $\br$ are\index{sum rule} illustrated by: \beq P(\br \eq {\tt0}{\tt0}{\tt0} ) = \dhalf (1-f)^3 + \dhalf f^3 , \eeq \beq P(\br \eq {\tt0}{\tt0}{\tt1} ) = \dhalf f(1-f)^2 + \dhalf f^2(1-f) = \dhalf f(1-f) . \eeq The posterior probabilities are represented by \beq P( s\eq{\tt1} \given \br \eq {\tt0}{\tt0}{\tt0} ) = \frac{ f^3 } { (1-f)^3 + f^3 } \eeq and \beq P( s\eq{\tt1} \given \br \eq {\tt0}{\tt0}{\tt1} ) = \frac{ (1-f)f^2 } { f(1-f)^2 + f^2(1-f) } = f . \eeq The probabilities of error in these representative cases are thus \beq P(\mbox{error} \given \br \eq {\tt0}{\tt0}{\tt0} ) = \frac{ f^3 } { (1-f)^3 + f^3 } \eeq and \beq P(\mbox{error} \given \br \eq {\tt0}{\tt0}{\tt1} ) = f . \eeq Notice that while the average probability of error of $\Rthree$ is about $3 f^2$, the probability (given $\br$) that any {\em{particular}\/} bit is wrong is either about $f^3$ or $f$. The average error probability, using the sum rule, is \beqa P(\mbox{error}) &=& \sum_{\br} P(\br) P(\mbox{error} \given \br) \\ &=& 2 [\dhalf (1-f)^3 + \dhalf f^3] \frac{ f^3 } { (1-f)^3 + f^3 } + 6 [\dhalf f(1-f)] f . \eeqa \marginpar{\vspace{-0.8in}\par\small\raggedright\reducedlead{The first two terms are for the cases $\br = \tt000$ and $\tt111$; the remaining 6 are for the other outcomes, which share the same probability of occurring and identical error probability, $f$.}}% So \beqa P(\mbox{error}) &=& f^3 + 3 f^2(1-f) . 
\eeqa \end{description} } % % % see also _s1A.tex \soln{ex.Hwords}{ The entropy is 9.7 % 11.8 bits per word. % , which is 2.6 bits per letter WRONG - shannon (p197) is in error } %\soln{ex.Hwords}{ % % z := 1.000004301 % %sum( 0.1/n * log(1.0/(0.1/n))/log(2.0) , n=1..12367) ; % 9.716258456 % 9.716 bits. %} %\input{tex/_s1a.tex} nothing there any more \fakesection{_s1A solutions} %================================= % quake % % \subsection*{Solutions to further inference problems} %\soln{ex.exponential}{ % See chapter \chbayes. %} %\soln{ex.blood}{ % See chapter \chbayes. %} % % The other exercises are discussed in the next chapter. %%%%%%%%%%%%%%%%%%%%%%%%%% \dvipsb{solutions 1a} % now another inference chapter ! \prechapter{About Chapter} \fakesection{About the first Bayes chapter} If you are eager to get on to % with data compression, information content and entropy, information theory, data compression, and noisy channels, you can skip to \chapterref{ch2}. Data compression and data modelling are intimately connected, however, so you'll probably want to come back to this chapter by the time you get to \chapterref{ch4}. % % move this later % % The exercises in this chapter are not a prerequisite for % chapters \ref{ch2}--\ref{ch7}. \fakesection{prerequisites for chapter 8} Before reading \chapterref{ch.bayes}, it might be good to look at the following exercises. % you % should have worked on % finished % all the exercises in chapter \chone, in particular, % \exerciserefrange{ex.logit}{ex.exponential}. % % \exthirtyone--\exthirtysix. % uvw to HXY>0 \exercissxB{2}{ex.dieexponential}{ A die is selected at random from two twenty-faced dice on which the symbols 1--10 are written with nonuniform frequency as follows. 
\begin{center} \begin{tabular}{l@{\hspace{0.2in}}*{10}{l}} \toprule Symbol & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \midrule Number of faces of die A & 6 & 4 & 3 & 2 & 1 &1 &1 &1 &1 & 0 \\ Number of faces of die B & 3 & 3 & 2 & 2 & 2 &2 &2 &2 &1 & 1 \\ \bottomrule \end{tabular} \end{center} The randomly chosen die is rolled 7 times, with the following outcomes: \begin{center} 5, 3, 9, 3, 8, 4, 7. % Sat 21/12/02 tried cutting this \\ \end{center} What is the probability that the die is die A? } \exercissxB{2}{ex.dieexponentialb}{ Assume that there is a third twenty-faced die, die C, on which the symbols 1--20 are written once each. As above, one of the three dice is selected at random and rolled 7 times, giving the outcomes: % \begin{center} 3, 5, 4, 8, 3, 9, 7. \\ % \end{center} What is the probability that the die is (a) die A, (b) die B, (c) die C? } % no normal solution pointer \exercissxA{3}{ex.exponential}{ {\exercisetitlestyle Inferring a decay constant}\\ %\begin{quotation} Unstable particles are emitted from a source and decay at a distance $x$, a real number that has an exponential probability distribution with characteristic length $\lambda$. Decay events can be observed only if they occur in a window extending from $x=1\cm$ to $x=20\cm$. $N$ decays are observed at locations $\{x_1 , \ldots , x_N\}$. % ($x_n$ is a real number.) What is $\lambda$? 
%\end{quotation} \begin{center} \mbox{\psfig{figure=\FIGS/decay.ps,width=3in,angle=90,% bbllx=154mm,bblly=147mm,bbury=257mm,bburx=175mm}}\\ \end{center} } % no normal solution pointer % \subsection*{Genetic test evidence} % \begin{quotation} \exercissxB{3}{ex.blood}{ {\exercisetitlestyle Forensic evidence} \\ \index{forensic}\input{tex/ex.blood.tex} } % \end{quotation} %%%%%%%%%% (many are repeated from _s1aa) %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % \prechapter{About Chapter} \mysetcounter{page}{54} \ENDprechapter \chapter{More about Inference} \label{ch.bayes}\label{ch1b} % contains the decay problem, the bent coin, and blood. % % % solutions to exercises are in _s8.tex % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \fakesection{Inference intro} It is not a controversial statement that \Bayes\ theorem\index{Bayes' theorem} provides the correct language for describing the inference of a message communicated over a noisy channel, as we used it in \chref{ch1} (\pref{sec.bayes.used}). But strangely, when it comes to other inference problems, the use of % approaches based on \Bayes\ theorem is not so widespread. %let's take a little tour of other applications of % probabilistic inference. % Coherent inference can always be mapped onto probabilities (Cox, 1946). %% \cite{cox}. % Many % textbooks on statistics do not mention this fact, so maybe it is worth % using an example to emphasize the contrast between Bayesian inference % and the orthodox methods of statistical inference. %% involving %% estimators, confidence intervals, hypothesis testing, etc. % If this topic interests you, excellent further reading is % to be found in the works of Jaynes, for example, % \citeasnoun{Jaynes.intervals}. \section{A first inference problem} \label{sec.decay}\label{ex.exponential.sol}% special label by hand When I was an undergraduate in Cambridge, I was privileged to receive supervisions from \index{Gull, Steve}{Steve Gull}. 
Sitting at his desk in a dishevelled office in St.\ John's College, I asked him how one ought to answer an old Tripos question (\exerciseonlyref{ex.exponential}): \begin{quotation} Unstable particles are emitted from a source and decay at a distance $x$, a real number that has an exponential probability distribution with characteristic length $\lambda$. Decay events can be observed only if they occur in a window extending from $x=1\cm$ to $x=20\cm$. $N$ decays are observed at locations $\{x_1 , \ldots , x_N\}$. % ($x_n$ is a real number.) What is $\lambda$? \end{quotation} \begin{center} \mbox{\psfig{figure=\FIGS/decay.ps,width=3in,angle=90,% bbllx=154mm,bblly=147mm,bbury=257mm,bburx=175mm}}\\ \end{center} I had scratched my head over this for some time. My education had provided me with a couple of approaches to solving such inference problems: constructing `\ind{estimator}s' of the unknown parameters; or `fitting' the model to the data, or to a processed version of the data. Since the mean of an unconstrained exponential distribution is $\l$, it seemed reasonable to examine the sample mean $\bar{x} = \sum_n x_n / N$ and see if an estimator $\hat{\l}$ could be obtained from it. It was evident that the {estimator} $\hat{\l}=\bar{x}-1$ would be appropriate for $\lambda \ll 20\,$cm, but not for cases where the truncation of the distribution at the right-hand side is significant; with a little ingenuity and the introduction of ad hoc bins, promising estimators for $\lambda \gg 20$ cm could be constructed. But there was no obvious estimator that would work under all conditions. Nor could I find a satisfactory approach based on fitting the density $P(x\given \lambda)$ to a histogram derived from the data. I was stuck. What is the general solution to this problem and others like it? Is it always necessary, when confronted by a new inference problem, to grope in the dark for appropriate `estimators' and worry about finding the `best' estimator (whatever that means)? 
%% I hope you have already stopped and thought about this question. % problem. % \\ \mbox{~}\dotfill\ \mbox{~} \\ % \newpage Steve % Gull wrote down the probability of one data point, given $\l$: \beq P(x\given \lambda) =\left\{ \begin{array}{ll} {\textstyle \smallfrac{1}{\l}} \, e^{-x/\lambda } / Z(\lambda) & 1 < x < 20 \\ 0 & {\rm otherwise } \end{array} \right. \label{basic.likelihood} \eeq where \beq Z(\l) = \int_1^{20} \d x \: \smallfrac{1}{\l} \, e^{-x/\lambda } = \left(e^{-1/\l} - e^{-20 /\l} \right). \label{basic.likelihood.Z} \eeq This seemed obvious enough. Then he wrote {\dem{\ind{\Bayes\ theorem}}}: \beqan \label{bayes.theorem} % \begin{array}{l} P(\l\given \{x_1, \ldots, x_N\}) &=& \frac{P(\{x\}\given \lambda) P(\l)}{P(\{x\}) } \\ %&& \hspace{0.5in} &\propto& \frac{1}{\left( \l Z(\l) \right)^N} \exp \left( \textstyle - \sum_1^N x_n / \l \right) P(\l) . % \end{array} \label{basic.posterior} \eeqan Suddenly, the straightforward distribution $P(\{x_1 ,\ldots, x_N \}\given \l)$, defining the probability of the data given the hypothesis $\l$, was being turned on its head so as to define the probability of a hypothesis given the data. A simple figure showed the probability of a single data point $P(x\given \l)$ as a familiar function of $x$, for different values of $\l$ (figure \ref{decay.like.1}). Each curve was an innocent exponential, normalized to have area 1. Plotting the same function as a function of $\l$ for a fixed value of $x$, something remarkable happens: a peak emerges (figure \ref{decay.like.2}). To help understand these two points of view of the one function, \figref{decay.probandlike} shows a surface plot of $P(x\given \l)$ as a function of $x$ and $\l$. 
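
As a rough numerical check on the peak (a sketch only, assuming the single data point $x=3$ and rounding to three decimal places), equations (\ref{basic.likelihood}) and (\ref{basic.likelihood.Z}) can be evaluated at the three values $\l = 2, 5, 10$:
\beqa
 P(x \eq 3 \given \l \eq 2) &=& \frac{ \smallfrac{1}{2} \, e^{-3/2} }{ e^{-1/2} - e^{-10} } \simeq 0.184 , \\
 P(x \eq 3 \given \l \eq 5) &=& \frac{ \smallfrac{1}{5} \, e^{-3/5} }{ e^{-1/5} - e^{-4} } \simeq 0.137 , \\
 P(x \eq 3 \given \l \eq 10) &=& \frac{ \smallfrac{1}{10} \, e^{-3/10} }{ e^{-1/10} - e^{-2} } \simeq 0.096 .
\eeqa
Repeating the calculation at $\l = 1.5$ gives roughly $0.176$, so, viewed as a function of $\l$, these numbers rise and then fall: the peak for $x=3$ sits near $\l \simeq 2$.
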
\begin{figure} \figuremargin{% \begin{center} \mbox{\psfig{figure=\FIGS/decay.like.1.ps,% width=2 in,angle=-90}\ \ \ \raisebox{-3mm}[0in][0in]{$x$}} \end{center} }{% \caption{{The probability density $P(x\given \l)$ as a function of $x$.}} \label{decay.like.1} }% \end{figure} \begin{figure} \figuremargin{% \begin{center} \mbox{\psfig{figure=\FIGS/decay.like.2.ps,% width=2 in,angle=-90}\ \ \ \raisebox{-3mm}[0in][0in]{$\lambda$}} \end{center} }{% \caption[a]{{The probability density $P(x\given \l)$ as a function of $\l$, for three different values of $x$.} \small When plotted this way round, the function is known as the {\dem\ind{likelihood}\/} of $\l$. The marks indicate the three values of $\l$, $\l=2,5,10$, that were used in the preceding figure. } \label{decay.like.2} } \end{figure} %\begin{figure} %\figuremargin{% \marginfig{ \begin{center} \begin{tabular}{c} \makebox[0pt][l]{\hspace*{0.21in}\raisebox{0.435in}{$x$}}% \mbox{\psfig{figure=\FIGS/probandlike.ps,% width=2in,angle=-90}% \makebox[0pt][l]{\hspace*{-0.352in}\raisebox{0.435in}{$\l$}}}\\[-0.3in]% was -0.6 Sat 5/10/02 \end{tabular}\end{center} %}{% \caption[a]{{The probability density $P(x\given \l)$ as a function of $x$ and $\l$. Figures \ref{decay.like.1} and \ref{decay.like.2} are vertical sections through this surface.} } \label{decay.probandlike} } %\end{figure} \begin{figure} \figuremargin{% \begin{center} \mbox{\psfig{figure=\FIGS/decay.like.xxx.ps,% width=2in,angle=-90}} \end{center} }{% \caption[a]{{The likelihood function in the case of a six-point dataset, $P(\{x\} = \{1.5,2,3,4,5,12\}\given \lambda)$, as a function of $\l$.} } \label{decay.like.xxx} } \end{figure} For a dataset consisting of several points, \eg, the six points $\{x\}_{n=1}^{N} = \{1.5,2,3,4,5,12\}$, the likelihood function $P(\{x\}\given \lambda)$ is the product of the $N$ functions of $\l$, $P(x_n\given \l)$ (\figref{decay.like.xxx}). 
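
To put rough numbers on this product (a sketch only; working with natural logarithms and the particular values of $\l$ chosen are assumptions made here for illustration), note that for the six points $N=6$ and $\sum_n x_n = 27.5$, so that, using equations (\ref{basic.likelihood}) and (\ref{basic.likelihood.Z}),
\beq
 \ln P(\{x\}\given \l) = - N \ln \left( \l Z(\l) \right) - \frac{1}{\l} \sum_{n=1}^{N} x_n .
\eeq
Evaluating this expression gives approximately
\beq
 \ln P(\{x\}\given \l) \simeq -14.91, \: -13.64, \: -13.82, \: -14.99 \:\:\: \mbox{for $\l = 2, 4, 5, 10$,}
\eeq
so the likelihood is largest near $\l \simeq 4$ and is about three times smaller at $\l=2$ and at $\l=10$ than at $\l=5$.
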
% Steve summarized \Bayes\ theorem % (equation \ref{bayes.theorem}) as embodying the fact that \begin{conclusionbox} what you know about $\lambda$ after the data arrive is what you knew before [$P(\lambda)$], and what the data told you [$P(\{x\}\given \lambda)$]. \end{conclusionbox} Probabilities are used here to quantify degrees of belief. % The probability % of $\lambda$ is a quantification of what you know about $\lambda$. To nip possible confusion in the bud, it must be emphasized that the hypothesis $\lambda$ that correctly describes the situation is {\em not\/} a {\em stochastic\/} variable, and the fact that the Bayesian uses a probability\index{probability!Bayesian} distribution $P$ does {\em not\/} mean that he thinks of the world as stochastically changing its nature between the states described by the different hypotheses. He uses the notation of probabilities to represent his {\em beliefs\/} about the mutually exclusive micro-hypotheses (here, values of $\l$), of which only one is actually true. That probabilities can denote degrees of belief, given assumptions, seemed reasonable to me. % , and is proved by Cox (1946). % \citeasnoun{cox}. % . Anyone who does not find it reasonable to use % probabilities to quantify degrees of belief can read % paper, where it is proved to be % valid. \label{sec.decayb} The posterior probability distribution % of equation (\ref{basic.posterior}) represents the unique and complete solution to the problem. There is no need to invent\index{classical statistics!criticisms} `estimators'; nor do we need to invent criteria for comparing alternative estimators with each other. Whereas orthodox statisticians offer twenty ways of solving a problem, and another twenty different criteria for deciding which of these solutions is the best, Bayesian statistics only offers one answer to a well-posed problem. 
% Added Mon 4/2/02 \marginpar{\small\raggedright\reducedlead{If you have any difficulty understanding this chapter I recommend ensuring you are happy with exercises \ref{ex.dieexponential} and \ref{ex.dieexponentialb} (\pref{ex.dieexponentialb}) then noting their similarity to \exerciseonlyref{ex.exponential}.}} \subsection{Assumptions in inference} Our inference is conditional on our \ind{assumptions} [for example, the prior $P(\lambda)$]. Critics view such priors as a difficulty because they are `subjective', but I don't see how it could be otherwise. How can one perform inference without making assumptions? I believe that it is of great value that Bayesian methods force one to make these tacit assumptions explicit. First, once assumptions are made, the inferences are objective and unique, reproducible with complete agreement by anyone who has the same information and makes the same assumptions. For example, given the assumptions listed above, $\H$, and the data $D$, % from an experiment % measuring decay lengths, everyone will agree about the posterior probability of the decay length $\l$: \beq P(\l\given D,\H) = \frac{ P(D\given \l,\H) P(\l\given \H) }{ P(D\given \H) } . \eeq Second, when the assumptions are explicit, they are easier to criticize, and easier to modify -- indeed, we can quantify the sensitivity of our inferences to the details of the assumptions. For example, we can note from the likelihood curves in figure \ref{decay.like.2} that in the case of a single data point at $x=5$, the likelihood function is less strongly peaked than in the case $x=3$; the details of the prior $P(\lambda)$ become increasingly important as the sample mean $\bar{x}$ gets closer to the middle of the window, 10.5. In the case $x=12$, the likelihood function doesn't have a peak at all -- such data merely rule out small values of $\lambda$, and don't give any information about the relative probabilities of large values of $\lambda$. 
So in this case, the details of the prior at the small--$\lambda$ end of things are not important, but at the large--$\lambda$ end, the prior is important. % is whatever we knew before % the experiment, \ie, our prior. Third, when we are not sure which of various alternative assumptions is the most appropriate for a problem, we can treat this question as another inference task. Thus, given data $D$, we can\index{Bayes' theorem} % learn from the data compare alternative assumptions $\H$ using \Bayes\ theorem: \beq P(\H\given D,\I) = \frac{ P(D\given \H,\I) P(\H\given \I) }{ P(D\given \I) } , \label{basic.ev} \eeq where $\I$ denotes the highest assumptions, which we are not questioning. Fourth, we can take into account our uncertainty regarding such assumptions when we make subsequent predictions. Rather than choosing one particular assumption $\H^{*}$, and working out our predictions about some quantity $\bt$, $P(\bt\given D,\H^{*},\I)$, we obtain predictions that take into account our uncertainty about $\H$ by using the sum rule: \beq P(\bt \given D, \I) = \sum_{\H} P(\bt \given D, \H , \I ) P(\H\given D,\I) . \label{basic.marg} \eeq This is another contrast with orthodox statistics, in which it is conventional to `test' a default model, and then, if the test\index{test!statistical}\index{statistical test} `accepts the model' at some `\ind{significance level}', to use exclusively that model to make predictions. Steve thus persuaded me that \begin{conclusionbox} probability theory reaches parts that ad hoc methods cannot reach. \end{conclusionbox} % However, that is a topic for another lecture. Let's look at a few more examples of simple inference problems. \section{The bent coin} \label{sec.bentcoin} A \ind{bent coin} %\index{inference problems!bent coin} is tossed $F$ times; we observe a sequence $\bs$ of heads and tails (which we'll denote by the symbols $\ta$ and $\tb$). 
We wish to know the bias of the coin, and predict the probability that the next toss will result in a head. We first encountered this task in \exampleref{exa.bentcoin}, and we will encounter it again in \chref{ch.four}, when we discuss adaptive data compression. % the adaptive encoder for $a$s and $b$s. It is also the original inference problem studied by % Rev.\ {Thomas Bayes} in his essay published in 1763.\index{Bayes, Rev.\ Thomas} % cite{Bayes} As in % \chref{ch.prob.ent} \exerciseref{ex.postpa}, we will assume % In chapter \chfour\ we assumed a uniform prior distribution and obtain a posterior distribution by multiplying by the likelihood. A critic might object, `where did this prior come from?' I will not claim that the uniform prior is in any way fundamental; indeed we'll give examples of nonuniform priors later. The prior is % It is simply a subjective assumption. One of the themes of this book is: % % put this back somewhere? % % One way to justify the need for a prior is % to assume, as in chapter \chfour, % that our task is simply to make a code to encode the % outcome $\bs$ as efficiently as possible. We have to compress the % data from the source somehow, and any choice of a compression scheme % must correspond to a prior distribution over coin biases. I see no % way round this. The choice of code implies an assumed probability % distribution over outcomes. %\begin{quotation} \begin{conclusionbox} \noindent you can't do inference -- or data compression -- without making assumptions. % You can't do data compression -- or inference -- without % making assumptions. \end{conclusionbox} %\end{quotation} % % change notation? f_H????????????????????????????????? % %\subsubsection*{Likelihood function} We give the name $\H_1$ to our assumptions. [We'll be introducing an alternative set of assumptions in a moment.] 
The probability, given $p_{\ta}$, that $F$ tosses result in a sequence $\bs$ that contains $\{F_{\ta},F_{\tb}\}$ counts of the two outcomes % $\{ a , b \}$ is \beq P( \bs \given p_{\ta} , F,\H_1 ) = p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} . \label{eq.pa.likeb} \eeq [{For example, $P(\bs\eq {\tt{aaba}} \given p_{\ta},F \eq 4,\H_1) = p_{\ta}p_{\ta}(1-p_{\ta})p_{\ta}.$}] % This function of $p_{\ta}$ (\ref{eq.pa.likeb}) defines the likelihood function. % Model 1 Our first model assumes a uniform prior distribution for $p_{\ta}$, \beq P(p_{\ta}\given\H_1) = 1 , \: \: \: \: \: \: p_{\ta} \in [0,1] \label{eq.pa.priorb} \eeq and $p_{\tb} \equiv 1-p_{\ta}$. \subsubsection{Inferring unknown parameters} Given a string of length $F$ of which $F_{\ta}$ are $\ta$s and $F_{\tb}$ are $\tb$s, we are interested in (a) inferring what $p_{\ta}$ might be; (b) predicting whether the next character is an $\ta$ or a $\tb$. [Predictions\index{prediction} are always expressed as probabilities. So `predicting whether the next character is an $\ta$' is the same as computing the probability that the next character is an $\ta$.] Assuming $\H_1$ to be true, the posterior probability of $p_{\ta}$, given a string $\bs$ of length $F$ that has counts $\{F_{\ta},F_{\tb}\}$, is, by \Bayes\ theorem, \beqan P( p_{\ta} \given \bs ,F,\H_1) &=& \frac{ P( \bs \given p_{\ta} , F,\H_1 ) P(p_{\ta}\given\H_1) }{ P( \bs \given F,\H_1 ) } . \label{eq.pa.post} \label{eq.pa.post.again} \eeqan The factor $P( \bs \given p_{\ta} , F,\H_1 )$, which, as a function of $p_{\ta}$, is known as the likelihood function, was given in \eqref{eq.pa.likeb}; the prior $P(p_{\ta}\given\H_1)$ was given in \eqref{eq.pa.priorb}. Our inference of $p_{\ta}$ is thus: % The posterior \beqan P( p_{\ta} \given \bs ,F,\H_1) &=& \frac{ p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} }{ P( \bs \given F,\H_1 ) } . 
\label{eq.pa.postb.again} \eeqan The normalizing constant is given by the beta integral \beq P( \bs \given F,\H_1 ) = \int_0^1 \d p_{\ta} \: p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} = \frac{\Gamma(F_{\ta}+1)\Gamma(F_{\tb}+1)}{ \Gamma(F_{\ta}+F_{\tb}+2) } = \frac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } . \label{eq.evidenceZ} \eeq % Our inference of $p_{\ta}$, assuming $\H_1$ to be true, % is thus given by \eqref{eq.pa.postb.again}. %%%%%%%%%%%%% \exercissxA{2}{ex.postpaII}{ Sketch the posterior probability $P( p_{\ta} \given \bs\eq {\tt aba} ,F\eq 3)$. What is the most probable value of $p_{\ta}$ (\ie, the value that maximizes the posterior probability density)? What is the mean value of $p_{\ta}$ under this distribution? Answer the same questions for the posterior probability $P( p_{\ta} \given \bs\eq {\tt bbb} ,F\eq 3)$. } \subsubsection{From inferences to predictions} Our prediction about the next toss, the probability that the next toss is an $\ta$, is obtained by integrating over $p_{\ta}$. This has the effect of taking into account our uncertainty about $p_{\ta}$ when making predictions. By the sum rule, \beqan P(\ta \given \bs ,F)& =& \int \d p_{\ta} \: P(\ta \given p_{\ta} ) P(p_{\ta} \given \bs,F ) . \eeqan The probability of an $\ta$ given $p_{\ta}$ is simply $p_{\ta}$, so \beqan \lefteqn{ P(\ta \given \bs ,F) = \int \d p_{\ta} \: p_{\ta} \frac{p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}}} {P( \bs \given F ) } } \\ &=& \int \d p_{\ta} \: \frac{p_{\ta}^{F_{\ta}+1} (1-p_{\ta})^{F_{\tb}}} {P( \bs \given F ) } \\ &=& \left. % \frac { \left[ \frac{ (F_{\ta}+1)! \, F_{\tb}! }{ (F_{\ta} + F_{\tb} + 2)! } \right] } \right/ { \left[ \frac{ F_{\ta}! \, F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } \right] } \:\: = \:\: \frac{ F_{\ta}+1 }{ F_{\ta} + F_{\tb} + 2 } , \label{eq.laplacederived} \eeqan which is known as {\dem{\ind{Laplace's rule}}}. 
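
As a small worked instance (a sketch, using the string $\bs = {\tt aaba}$ from the example above, so $F_{\ta}=3$ and $F_{\tb}=1$), the evidence (\ref{eq.evidenceZ}) and the prediction (\ref{eq.laplacederived}) are
\beq
 P( \bs \given F,\H_1 ) = \frac{3! \, 1!}{5!} = \frac{1}{20} , \hspace{0.3in} P(\ta \given \bs ,F) = \frac{3+1}{3+1+2} = \frac{2}{3} .
\eeq
The prediction is pulled from the observed frequency $3/4$ towards $1/2$, as one would expect from a uniform prior and only four tosses.
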
\section{The bent coin and model comparison} \label{sec.bentcoin2} Imagine that a scientist introduces another theory for our data. He asserts that the source is not really a {bent coin} but is really a perfectly formed die with one face painted heads (`$\ta$') and the other five painted tails (`$\tb$'). Thus the parameter $p_{\ta}$, which in the original model, $\H_1$, could take any value between 0 and 1, is according to the new hypothesis, $\H_0$, not a free parameter at all; rather, it is equal to % p_{\ta} = $1/6$. [This hypothesis is termed $\H_0$ so that the suffix of each model indicates its number of free parameters.] How can we compare these two models in the light of data? We wish to infer how probable $\H_1$ is relative to $\H_0$. % , so we can use \Bayes\ theorem again. % Let us write down the first model's probabilities again. % {\em Here we repeat some material from the arithmetic coding % chapter, chapter \ref{ch4}.} \subsubsection*{Model comparison as inference} In order to perform model comparison, we write down \Bayes\ theorem again, but this time with a different\index{Bayes' theorem} argument on the left-hand side. We wish to know how probable $\H_1$ is given the data. By \Bayes\ theorem, \beq P( \H_1 \given \bs ,F ) = \frac{ P( \bs \given F,\H_1 ) P( \H_1 ) }{ P( \bs \given F) } . \eeq Similarly, the posterior probability of $\H_0$ is \beq P( \H_0 \given \bs ,F ) = \frac{ P( \bs \given F,\H_0 ) P( \H_0 ) }{ P( \bs \given F) }. \eeq The normalizing constant in both cases is $P(\bs\given F)$, which is the total probability of getting the observed data. % regardless of which model is true. If $\H_1$ and $\H_0$ are the only models under consideration, this probability is given by the sum rule: \beq P( \bs \given F) = P( \bs \given F,\H_1 ) P( \H_1 ) + P( \bs \given F,\H_0 ) P( \H_0 ) . 
\eeq To evaluate the posterior probabilities of the hypotheses we need to assign values to the prior probabilities $P( \H_1 )$ and $P( \H_0 )$; in this case, we might set these to 1/2 each. And we need to evaluate the data-dependent terms $P( \bs \given F,\H_1 )$ and $P( \bs \given F,\H_0 )$. We can give names to these quantities. The quantity $P( \bs \given F,\H_1 )$ is a measure of how much the data favour $\H_1$, and we call it the {\dbf\ind{evidence}} for model $\H_1$. We already encountered this quantity in equation (\ref{eq.pa.post.again}) where it appeared as the normalizing constant of the first inference we made -- the inference of $p_{\ta}$ given the data. \medskip \begin{conclusionbox} %\begin{description} %\item[How model comparison works:] {\bf How model comparison works:} The evidence for a model is usually\index{key points!model comparison} the normalizing constant of an earlier Bayesian inference. %\end{description} \end{conclusionbox} \medskip We evaluated the normalizing constant for model $\H_1$ in (\ref{eq.evidenceZ}). The evidence for model $\H_0$ is very simple because this model has no parameters to infer. Defining $p_0$ to be $1/6$, we have \beq P( \bs \given F,\H_0 ) = p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} . \eeq Thus the posterior probability ratio of model $\H_1$ to model $\H_0$ is \beqan \frac{ P( \H_1 \given \bs ,F )} {P( \H_0 \given \bs ,F )} & =& \frac{ P( \bs \given F,\H_1 ) P( \H_1 ) } { P( \bs \given F,\H_0 ) P( \H_0 ) } \\ &=& \left. { \frac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } } \right/ { p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} } . % \frac{ \smallfrac{ F_{\ta}! F_{\tb}! }{ (F_{\ta} + F_{\tb} + 1)! } }{ p_0^{F_{\ta}} (1-p_0)^{F_{\tb}} } . % SECOND EDN - sanjoy says use linefrac \label{eq.compare.final} \eeqan Some values of this posterior probability ratio are illustrated in table \ref{tab.mod.comp}. The first five lines illustrate that some outcomes favour one model, and some favour the other. 
No outcome is completely incompatible with either model. \begin{table} \figuremargin{% \begin{center} \begin{tabular}{cccl} \toprule $F$ & Data $(F_{\ta},F_{\tb})$ & $\displaystyle \frac{ P( \H_1 \given \bs ,F )} {P( \H_0 \given \bs ,F )}$ \\ \midrule 6 & $(5,1)$ & 222.2 & \\ 6 & $(3,3)$ & 2.67 &\\ 6 & $(2,4)$ & 0.71 & = 1/1.4 \\ 6 & $(1,5)$ & 0.356 & = 1/2.8 \\ 6 & $(0,6)$ & 0.427 & = 1/2.3 \\ \midrule 20 & $(10,10)$ & 96.5 & \\ 20 & $(3,17)$ & 0.2 & = 1/5 \\ 20 & $(0,20)$ & 1.83 & \\ \bottomrule \end{tabular} \end{center} }{% \caption{Outcome of model comparison between models $\H_1$ and $\H_0$ for the `bent coin'. Model $\H_0$ states that $p_{\ta}=1/6$, $p_{\tb}=5/6$.} \label{tab.mod.comp} } \end{table} With small amounts of data (six tosses, say) it is typically not the case that one of the two models is overwhelmingly more probable than the other. But with more data, the evidence against $\H_0$ given by any data set with the ratio $F_{\ta} \colon F_{\tb}$ differing from $1 \colon 5$ mounts up. % % add figure showing some typical histories % You can't predict in advance how much data are needed to be pretty sure which theory is true.\index{key points!how much data needed} It depends what $p_{\ta}$ is. % % THIS IS A VERY GENERAL % message for machine learning. % corrected Wed 28/11/01 The simpler model, $\H_0$, since it has no adjustable parameters, is able to lose out by the biggest margin. The odds may be hundreds to one against it. The more complex model can never lose out by a large margin; there's no data set that is actually {\em unlikely\/} given model $\H_1$. \exercisaxB{2}{ex.evidencebounds}{ Show that after $F$ tosses have taken place, the biggest value that the log evidence ratio \beq \log \frac{ P( \bs \given F,\H_1 ) } { P( \bs \given F,\H_0 ) } \eeq can have scales {\em linearly\/} with $F$ if $\H_1$ is more probable, but the log evidence in favour of $\H_0$ can grow at most as $\log F$. 
}
\exercissxB{3}{ex.evidenceest}{
 Putting your sampling theory hat on, assuming $F_{\ta}$ has not yet been measured, compute a plausible range that
% the mean and variance -- or some sort of most probable value, and indication of spread -- of the
 the log evidence ratio might lie in, as a function of $F$ and the true value of $p_{\ta}$, and sketch it as a function of $F$ for $p_{\ta}=p_0=1/6$, $p_{\ta}=0.25$, and $p_{\ta}=1/2$. [Hint: sketch the log evidence as a function of the random variable $F_{\ta}$ and work out the mean and standard deviation of $F_{\ta}$.]
% [Hint: Taylor-expand the log evidence as a function
% of $F_{\ta}$.]
}
%
% This page comes out rotated bizarrely by 90 degrees in pdf
%
\subsection{Typical behaviour of the evidence}
% see figs/sixtoone
% and bin/sixtoone.p
\Figref{fig.evidencetyp} shows the log evidence ratio as a function of the number of tosses, $F$, in a number of simulated experiments. In the left-hand experiments, $\H_0$ was true. In the right-hand ones, $\H_1$ was true, and the value of $p_{\ta}$ was either 0.25 or 0.5.
% \newcommand{\sixtoone}[2]{% in newcommands1.tex
\begin{figure}
\figuremargin{%
\small%
\begin{center}
\begin{tabular}{cccc}
$\H_0$ is true && \multicolumn{2}{c}{$\H_1$ is true} \\ \cmidrule{1-1}\cmidrule{3-4}
\sixtoone{$p_{\ta}=1/6$}{h09}&& \sixtoone{$p_{\ta}=0.25$}{h69}& \sixtoone{$p_{\ta}=0.5$}{h29}\\
\sixtoone{}{h08}&& \sixtoone{}{h68}& \sixtoone{}{h28}\\
\sixtoone{}{h07}&& \sixtoone{}{h67}& \sixtoone{}{h27}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Typical behaviour of the evidence in favour of $\H_1$ as bent coin tosses accumulate\index{typical evidence}\index{evidence!typical behaviour of}\index{model comparison!typical evidence} under three different conditions (columns 1, 2, 3). Horizontal axis is the number of tosses, $F$.
The vertical axis on the left is $\ln \smallfrac{ P( \bs \given F,\, \H_1 ) } { P( \bs \given F,\, \H_0 ) }$; the right-hand vertical axis shows the values of $\smallfrac{ P( \bs \given F,\, \H_1 ) } { P( \bs \given F,\, \H_0 ) }$.
% added jan 2005:
 The three rows show
% Each row shows an
 independent simulated experiments. (See also \protect\figref{fig.evidenceMSD}, \pref{fig.evidenceMSD}.)
}
\label{fig.evidencetyp}
}%
\end{figure}

 We will discuss model comparison more in a later chapter.

\section{An example of legal evidence}
\label{ex.blood.sol}% special label by hand
 The following example
% (\exerciseonlyref{ex.blood})
 illustrates that there is more to Bayesian inference than the priors.\index{blood group}
\begin{quote}
% Two people have left traces of their own blood at the scene of a
% crime. Their blood groups can be reliably identified from these
% traces and are found
% to be of type `O' (a common type in the local population, having
% frequency 60\%) and of type `AB' (a rare type, with frequency 1\%).
% A suspect is tested and found to have type `O' blood.
% A careless lawyer might claim that the fact that the suspect's
% blood type was found at the scene is positive evidence for the theory
% that he was present. But do these data
% $D=$ \{type `O' and `AB' blood were found at scene\} make it more
% probable that this suspect was one of the two people present at the
% crime?
 Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type `O' blood. The blood groups of the two traces are found to be of type `O' (a common type in the local population, having frequency 60\%) and of type `AB' (a rare type, with frequency 1\%). Do these data (type `O' and `AB' blood were found at scene) give evidence in favour of the proposition that Oliver was one of the two people present at the crime?
\end{quote} A careless \ind{lawyer} might claim that the fact that the suspect's blood type was found at the scene is positive evidence for the theory that he was present. But this is not so. Denote the proposition `the suspect and one unknown person were present' by $S$. The alternative, $\bar{S}$, states `two unknown people from the population were present'. The prior in this problem is the prior probability ratio between the propositions $S$ and $\bar{S}$. This quantity is important to the final verdict and would be based on all other available information in the case. Our task here is just to evaluate the contribution made by the data $D$, that is, the likelihood ratio, $P(D\given S,\H)/P(D\given \bar{S},\H)$. In my view, a jury's task should generally be to multiply together carefully evaluated likelihood ratios from each independent piece of admissible evidence with an equally carefully reasoned prior probability. [This view is shared by many statisticians but learned British appeal judges\index{judge} recently disagreed and actually overturned the verdict of a trial because the \index{jury}{jurors} {\em had\/} been taught to use \Bayes\ theorem to handle complicated \ind{DNA} evidence.] % The probability of the data given $S$ is the probability that one unknown person drawn from the population has blood type AB: \beq P(D\given S,\H) = p_{\rm{AB}} \eeq (since given $S$, we already know that one trace will be of type O). The probability of the data given $\bar{S}$ is the probability that two unknown people drawn from the population have types O and AB: \beq P(D\given \bar{S},\H) = 2 \, p_{\rm{O}} \, p_{\rm{AB}} . \eeq In these equations $\H$ denotes the assumptions that two people were present and left blood there, and that the probability distribution of the blood groups of unknown people in an explanation is the same as the population frequencies. 
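[The factor of 2 may deserve a word. As a sketch, label the two unknown people `first' and `second' (labels introduced here purely for illustration): under $\bar{S}$ the data arise either if the first is of type O and the second of type AB, or the other way round. With the frequencies quoted above,
\beq
 P(D\given \bar{S},\H) = p_{\rm{O}} \, p_{\rm{AB}} + p_{\rm{AB}} \, p_{\rm{O}}
 = 2 \times 0.6 \times 0.01 = 0.012 ,
\eeq
whereas $P(D\given S,\H) = p_{\rm{AB}} = 0.01$.]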
% Our posterior probability ratio for % $S$ relative to $\bar{S}$ is obtained by multiplying the probability % ratio based on all other independent information by the ratio of % these likelihoods. The most straightforward way to summarize the % contribution of any piece of evidence is in terms of a likelihood % ratio. Dividing, we obtain the likelihood ratio: \beq \frac{P(D\given S,\H)}{P(D\given \bar{S},\H)} = \frac{1}{2 p_{\rm O}} = \frac{1}{2 \times 0.6} = 0.83 . \eeq Thus the data in fact provide weak evidence {\em against\/} the supposition that Oliver was present. This result may be found surprising, so let us examine it from various points of view. First consider the case of another suspect, Alberto, who has type AB. Intuitively, the data do provide evidence in favour of the theory $S'$ that this suspect was present, relative to the null hypothesis $\bar{S}$. And indeed the likelihood ratio in this case is: \beq \frac{P(D\given S',\H)}{P(D\given \bar{S},\H)} = \frac{1}{2\, p_{\rm{AB}}} = 50. \eeq Now let us change the situation slightly; imagine that 99\% of people are of blood type O, and the rest are of type AB. Only these two blood types exist in the population. The data at the scene are the same as before. Consider again how these data influence our beliefs about Oliver, a suspect of type O, and Alberto, a suspect of type AB. Intuitively, we still believe that the presence of the rare AB blood provides positive evidence that \ind{Alberto} was there. But does % we still have the feeling that the fact that type O blood was detected at the scene favour the hypothesis that Oliver was present? If this were the case, that would mean that regardless of who the suspect is, the data make it more probable they were present; everyone in the population would be under greater suspicion, which would be absurd. 
The data may be {\em compatible\/} with any suspect of either blood type being present, but if they provide evidence {\em for\/} some theories, they must also provide evidence {\em against\/} other theories. Here is another way of thinking about this: imagine that instead of two people's blood stains there are ten, and that in the entire local population of one hundred, there are ninety type O suspects and ten type AB suspects. % Initially all 100 people are suspects. Consider a particular type O suspect, \ind{Oliver}: without any other information, and before the blood test results come in, there is a one in 10 chance that he was at the scene, since we know that 10 out of the 100 suspects were present. We now get the results of blood tests, and find that {\em nine\/} of the ten stains are of type AB, and {\em one\/} of the stains is of type O. Does this make it more likely that Oliver was there? No, % although he could have been, there is now only a one in ninety chance that he was there, since we know that only one person present was of type O. Maybe the intuition is aided finally by writing down the formulae for the general case where $n_{\rm{O}}$ blood stains of individuals of type O are found, and $n_{\rm{AB}}$ of type $\rm{AB}$, a total of $N$ individuals in all, and unknown people come from a large population with fractions $p_{\rm{O}}, p_{\rm{AB}}$. (There may be other blood types too.) The task is to evaluate the likelihood ratio for the two hypotheses: $S$, `the type O suspect (Oliver) and $N\!-\!1$ unknown others left $N$ stains'; and $\bar{S}$, `$N$ unknowns left $N$ stains'. The probability of the data under hypothesis $\bar{S}$ is just the probability of getting $n_{\rm{O}}, n_{\rm{AB}}$ individuals of the two types when $N$ individuals are drawn at random from the population: \beq P(n_{\rm{O}},n_{\rm{AB}}\given \bar{S}) = \frac{ N! }{ n_{\rm{O}} ! \, n_{\rm{AB}}! } p_{\rm{O}}^{n_{\rm{O}}} p_{\rm{AB}}^{n_{\rm{AB}}} . 
\eeq In the case of hypothesis $S$, we need the distribution of the $N\!-\!1$ other individuals: \beq P(n_{\rm{O}},n_{\rm{AB}}\given S) = \frac{ (N-1)! }{ (n_{\rm{O}}-1)! \, n_{\rm{AB}}! } p_{\rm{O}}^{n_{\rm{O}}-1} p_{\rm{AB}}^{n_{\rm{AB}}} . \eeq The likelihood ratio is: \beq \frac{ P(n_{\rm{O}},n_{\rm{AB}}\given S) }{ P(n_{\rm{O}},n_{\rm{AB}}\given \bar{S}) } = \frac{n_{\rm{O}}/N}{p_{\rm{O}}} . \eeq This is an instructive result. The likelihood ratio, \ie\ the contribution of these data to the question of whether Oliver was present, depends simply on a comparison of the frequency of his blood type % type O blood in the observed data with the background frequency % of type O blood in the population. There is no dependence on the counts of the other types found at the scene, or their frequencies in the population. If there are more type O stains than the average number expected under hypothesis $\bar{S}$, then the data give evidence in favour of the presence of Oliver. Conversely, if there are fewer type O stains than the expected number under $\bar{S}$, then the data reduce the probability of the hypothesis that he was there. In the special case $n_{\rm{O}}/N = p_{\rm{O}}$, the data contribute no evidence either way, regardless of the fact that the data are compatible with the hypothesis $S$. \section{Exercises} % \subsection*{The game show} %\subsubsection*{The normal rules} %\subsubsection*{The earthquake scenario} \exercissxA{2}{ex.3doors}{ {\sf The \ind{three doors},\index{Monty Hall problem} normal rules.} % "Let's Make A Deal," hosted by Monty Hall On a \ind{game show},\index{doors, on game show}\index{game!three doors} a contestant is told the rules as follows: \begin{quote} There are three doors, labelled 1, 2, 3. A single prize has been hidden behind one of them. You get to select one door. Initially your chosen door will {\em not\/} be opened. 
Instead, the gameshow host will open one of the other two doors, and {\em he will do so in such a way as not to reveal the prize.} For example, if you first choose door 1, he will then open {\em one\/} of doors 2 and 3, and it is guaranteed that he will choose which one to open so that the prize will not be revealed. At this point, you will be given a fresh choice of door: you can either stick with your first choice, or you can switch to the other closed door. All the doors will then be opened and you will receive whatever is behind your final choice of door. \end{quote} Imagine that the contestant chooses door 1 first; then the gameshow host opens door 3, revealing nothing behind the door, as promised. Should the contestant (a) stick with door 1, or (b) switch to door 2, or (c) does it make no difference? } \exercissxA{2}{ex.3doorsb}{ {\sf The three doors, earthquake scenario.} Imagine that the game happens again and just as the gameshow host is about to open one of the doors a violent earthquake\index{earthquake, during game show} rattles the building and one of the three doors flies open. It happens to be door 3, and it happens not to have the prize behind it. The contestant had initially chosen door 1. Repositioning his toup\'ee, the host suggests, `OK, since you chose door 1 initially, door 3 is a valid door for me to open, according to the rules of the game; I'll let door 3 stay open. Let's carry on as if nothing happened.' Should the contestant stick with door 1, or switch to door 2, or does it make no difference? Assume that the prize was placed randomly, that the gameshow host does not know where it is, and that the door flew open because its latch was broken by the earthquake. [A similar alternative scenario is a gameshow whose {\em confused host\/}\index{confused gameshow host} forgets the rules, and where the prize is, and opens one of the unchosen doors at random. He opens door 3, and the prize is not revealed. 
Should the contestant choose what's behind door 1 or door 2? Does the optimal decision for the contestant depend on the contestant's \ind{belief}s about whether the gameshow host is confused or not?]\index{game show}\index{three doors}\index{doors, on game show}\index{prize, on game show}\index{Monty Hall problem} } \exercisaxB{2}{ex.girlboy}{ %\subsection {\sf Another example in which the emphasis is not on priors.} %\begin{quote} You visit a family whose three children are all at the local school. You don't know anything about the sexes of the children. While walking clumsily round the home, you stumble through one of the three unlabelled bedroom doors that you know belong, one each, to the three children, and find that the bedroom contains \ind{girlie stuff} in sufficient quantities to convince you that the child who lives in that bedroom is a girl. Later, you sneak a look at a letter addressed to the parents, which reads `From the Headmaster: we are sending this letter to all parents who have male children at the school to inform them about the following \ind{boyish matters}\ldots'. These two sources of evidence establish that at least one of the three children is a girl, and that at least one of the children is a boy. What are the probabilities that there are (a) two girls and one boy; (b) two boys and one girl? %\end{quote} } % Another example of legal evidence} \exercissxB{2}{ex.simpsons}{ Mrs\ S is found stabbed in her family garden. % \index{Simpson, O.J., similar case to} Mr\ S behaves strangely after her death and is considered as a suspect. On investigation of police and social records it is found that Mr\ S had beaten up his wife on at least nine previous occasions. The prosecution advances these data as evidence in favour of the hypothesis that Mr\ S is guilty of the murder. 
`Ah no,' says % Mr.\ Merd-Kopf, Mr\ S's highly paid lawyer,\index{lawyer}\index{wife-beater}\index{murder} `{\em statistically}, only one in a thousand wife-beaters actually goes on to murder his wife.\footnote{In the U.S.A., it is estimated that % http://www.umn.edu/mincava/papers/factoid.htm 2 million women are abused each year by their partners. In 1994, $4739$ women were victims of homicide; of those, % 28 \percent, $1326$ women (28\%) were slain by husbands and boyfriends.\\ (Sources: {\tt http://www.umn.edu/mincava/papers/factoid.htm,\\ http://www.gunfree.inter.net/vpc/womenfs.htm}) % http://www.gunfree.inter.net/vpc/womenfs.htm % In keeping % with the fictitious nature of this story, the $1/100\,000$ % figure was made up by me. }\label{footnote.murder} So the wife-beating % , which is not denied by Mr\ S, is not strong evidence at all. In fact, given the wife-beating evidence alone, it's extremely {\em{unlikely}\/} that he would be the murderer of his wife -- only a $1/1000$ chance. You should therefore find him innocent.' Is the lawyer % Mr\ Merd-Kopf right to imply that the history of wife-beating does not point to Mr\ S's being the murderer? Or is the lawyer a slimy trickster? If the latter, what is wrong with his argument? [Having received an indignant letter from a lawyer about the preceding paragraph, I'd like to add an extra inference exercise at this point: {\em Does my suggestion that Mr.\ S.'s lawyer may have been a slimy trickster imply that I believe {\em all} lawyers are slimy tricksters?} (Answer: No.)] } % Lewis Carroll's Pillow Problem \exercisaxB{2}{ex.bagcounter}{ A bag contains one counter, known to be either white or black. A white counter is put in, the bag is shaken, and a counter is drawn out, which proves to be white. What is now the chance of drawing a white counter? [Notice that the state of the bag, after the operations, is exactly identical to its state before.] } \exercissxB{2}{ex.phonetest}{% ????????????????? 
needs solution adding (was phonecheck!) You move into a new house; the phone is connected, and % you are unsure of your phone number -- you're pretty sure that the \ind{phone number} is % \index{telephone number} % it's {\tt 740511}, but not as sure as you would like to be. % As an experiment, you pick up the phone and dial {\tt 740511}; you obtain a `busy' signal. Are you now more sure of your phone number? If so, how much? } % \exercisaxB{1}{ex.othercoin}{ In a game, two coins are tossed. If either of the coins comes up heads, you have won a prize. To claim the prize, you must point to one of your coins that is a head and say `look, that coin's a head, I've won'. You watch Fred play the game. He tosses the two coins, and he points to a coin and says `look, that coin's a head, I've won'. What is the probability that the {\em other\/} coin is a head? } %\subsection*{Another quasi-legal story} % \exercis{ex.}{ % During a radio chat show on the health consequences of % secondary smoking, it is reported by an expert that % twelve recent studies have investigated whether % there was a link between secondary smoking and cancer. % Of these, eleven studies failed to establish a link % and one study found significant evidence of a causal % link -- secondary smoking increasing the risk of getting % cancer. The expert said that the net evidence from these % twelve results was that there was significant evidence of a causal % link. % % Shortly thereafter, a Mr.\ N.T.\ Social called in in support % of smokers' ``rights'' to pollute public air. `If eleven % of the studies didn't find a link, and only one found a link, % then it's eleven to one that there isn't a link, isn't it?' % % `Well, you clearly don't understand statistics, do you?' responded % the condescending host. % % Can you suggest a more helpful explanation of the expert's statement? 
%} % euro.tex \exercissxB{2}{ex.eurotoss}{ A statistical statement appeared in % \footnote{Quoted by Charlotte Denny and Sarah Dennis {\em The Guardian} on Friday January 4, 2002: \begin{quote} When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. `It looks very suspicious to me', said Barry Blight, a statistics lecturer at the London School of Economics. `If the coin were unbiased the chance of getting a result as extreme as that would be less than 7\%'. \end{quote} But {\em do\/} these data give evidence that the coin is biased rather than fair? [Hint: see \eqref{eq.compare.final}.] } % \input{tex/bayes_occam.tex} \dvips \section{Solutions}% to Chapter \protect\ref{ch.bayes}'s exercises} % \soln{ex.dieexponential}{ Let the data be $D$. Assuming equal prior probabilities, \beqan \frac{P(A \given D)}{P(B \given D)} = \frac{1}{2}\frac{3}{2}\frac{1}{1}\frac{3}{2} \frac{1}{2}\frac{2}{2}\frac{1}{2} = \frac{9}{32} \eeqan and $P(A \given D) = 9/41.$ % (check me). } \soln{ex.dieexponentialb}{ The probability of the data given each hypothesis is: \beq P(D \given A) = \frac{3}{20}\frac{1}{20}\frac{2}{20}\frac{1}{20} \frac{3}{20}\frac{1}{20} \frac{1}{20} = \frac{18}{20^7} ; \eeq \beq P(D \given B) = \frac{2}{20}\frac{2}{20}\frac{2}{20}\frac{2}{20} \frac{2}{20}\frac{1}{20} \frac{2}{20} = \frac{64}{20^7} ; \eeq \beq P(D \given C) = \frac{1}{20}\frac{1}{20}\frac{1}{20}\frac{1}{20} \frac{1}{20}\frac{1}{20} \frac{1}{20} = \frac{1}{20^7}. \eeq So \beq % \hspace*{-0.1in} P(A \given D) = \frac{18}{18+64+1} = \frac{18}{83} ; \hspace{0.3in} P(B \given D) = \frac{64}{83} ;\hspace{0.3in} P(C \given D) = \frac{1}{83} . 
\eeq } \fakesection{Bent coin exercise solns} \begin{figure}[htbp] \figuremargin{% \footnotesize \begin{center} \begin{tabular}{cc} (a) \psfig{figure=figs/aba.ps,width=2in,angle=-90}& (b) \psfig{figure=figs/bbb.ps,width=2in,angle=-90}\\ $P( p_{\tt{a}} \given \bs\eq {\tt{aba}} ,F\eq 3) \propto p_{\tt{a}}^2 (1-p_{\tt{a}})$ & $P( p_{\tt{a}} \given \bs\eq {\tt{bbb}} ,F\eq 3) \propto (1-p_{\tt{a}})^3$ \\ \end{tabular} \end{center} }{% \caption[a]{Posterior probability for the bias $p_a$ of a bent coin given two different data sets.} \label{fig.aba.bbb} }% \end{figure} \soln{ex.postpaII}{% relabelled from postpa Sun 6/4/03 - beware incorrect refs likely \ben \item $P( p_{\tt{a}} \given \bs\eq {\tt{aba}} ,F\eq 3) \propto p_{\tt{a}}^2 (1-p_{\tt{a}})$. The most probable value of $p_{\tt{a}}$ (\ie, the value that maximizes the posterior probability density) is $2/3$. The mean value of $p_{\tt{a}}$ is $3/5$. See \figref{fig.aba.bbb}a. \item $P( p_{\tt{a}} \given \bs\eq {\tt{bbb}} ,F\eq 3) \propto (1-p_{\tt{a}})^3$. The most probable value of $p_{\tt{a}}$ (\ie, the value that maximizes the posterior probability density) is $0$. The mean value of $p_{\tt{a}}$ is $1/5$. See \figref{fig.aba.bbb}b. 
\een } %/home/mackay/_courses/itprnn/figs %gnuplot> plot x**2*(1-x) %gnuplot> set xrange [0:1] %gnuplot> replot %gnuplot> set nokey %gnuplot> set size 0.4,0.4 %gnuplot> replot %gnuplot> set noytics %gnuplot> replot %gnuplot> set yrange [0:0.4] %gnuplot> replot %gnuplot> set yrange [0:0.17] %gnuplot> replot %gnuplot> set term post %Terminal type set to 'postscript' %Options are 'landscape monochrome dashed "Helvetica" 14' %gnuplot> set output "aba.ps" %gnuplot> replot %gnuplot> set term X %Terminal type set to 'X11' %gnuplot> set yrange [0:1] %gnuplot> plot (1-x)**3 %gnuplot> set term post %Terminal type set to 'postscript' %Options are 'landscape monochrome dashed "Helvetica" 14' %gnuplot> set output "bbb.ps" %gnuplot> replot \fakesection{evidence est} \begin{figure}[htbp] \figuremargin{% \small% \begin{center} \begin{tabular}{cccc} $\H_0$ is true && \multicolumn{2}{c}{$\H_1$ is true} \\ \cmidrule{1-1}\cmidrule{3-4} \sixtoone{$p_a=1/6$}{h0MSD}&& \sixtoone{$p_a=0.25$}{h6MSD}& \sixtoone{$p_a=0.5$}{h2MSD}\\ \end{tabular} \end{center} }{% \caption[a]{Range of plausible values of the log evidence in favour of $\H_1$ as a function of $F$. The vertical axis on the left is $\log \smallfrac{ P( \bs \given F,\H_1 ) } { P( \bs \given F,\H_0 ) }$; the right-hand vertical axis shows the values of $\smallfrac{ P( \bs \given F,\H_1 ) } { P( \bs \given F,\H_0 ) }$. \index{typical evidence}\index{evidence!typical behaviour of}\index{model comparison!typical evidence}% The solid line shows the log evidence if the random variable $F_a$ takes on its mean value, $F_a = p_aF$. The dotted lines show (approximately) the log evidence if $F_a$ is at its 2.5th or 97.5th percentile. (See also \protect\figref{fig.evidencetyp}, \pref{fig.evidencetyp}.) 
} \label{fig.evidenceMSD} }% \end{figure} \soln{ex.evidenceest}{ The curves in \figref{fig.evidenceMSD} were found by finding the mean and standard deviation of $F_a$, then setting $F_a$ to the mean $\pm$ two standard deviations to get a 95\% plausible range for $F_a$, and computing the three corresponding values of the log evidence ratio. }% \soln{ex.3doors}{ Let $\H_i$ denote the hypothesis that the prize is behind door $i$. We make the following assumptions: the three hypotheses $\H_1$, $\H_2$ and $\H_3$ are equiprobable {\em a priori}, \ie, \beq P(\H_1) = P(\H_2) = P(\H_3) = \frac{1}{3} . \eeq The datum we receive, after choosing door 1, is one of $D \eq 3$ and $D \eq 2$ (meaning door 3 or 2 is opened, respectively). We assume that these two possible outcomes have the following probabilities. If the prize is behind door 1 then the host has a free choice; in this case we assume that the host selects at random between $D\eq 2$ and $D\eq 3$. Otherwise the choice of the host is forced and the probabilities are 0 and 1. \beq \begin{array}{|r@{\,}c@{\,}l|r@{\,}c@{\,}l|r@{\,}c@{\,}l|} P( D\eq 2 \given \H_1) &=& \dfrac{1}{2} & P( D\eq 2 \given \H_2) &=& 0 & P( D\eq 2 \given \H_3) &=& {1} \\ P( D\eq 3 \given \H_1) &=& \dfrac{1}{2} & P( D\eq 3 \given \H_2) &=& {1} & P( D\eq 3 \given \H_3) &=& 0 \end{array} \eeq Now, using \Bayes\ theorem, we evaluate the posterior probabilities of the hypotheses: \beq P( \H_i \given D\eq3 ) = \frac{P( D\eq3 \given \H_i) P(\H_i) }{P(D\eq3) } \eeq \beq \begin{array}{|r@{\,}c@{\,}l|r@{\,}c@{\,}l|r@{\,}c@{\,}l|} P(\H_1 \given D\eq 3) &=& \frac{ (1/2) (1/3) }{P(D\normaleq 3) } & P(\H_2 \given D\eq 3) &=& \frac{ ({1}) (1/3) }{P(D\normaleq 3) } & P(\H_3 \given D\eq 3) &=& \frac{ ({0}) (1/3) }{P(D\normaleq 3) } \end{array} \eeq The denominator $P(D\eq 3)$ is $(1/2)$ because it is the normalizing constant for this posterior distribution. 
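To spell out this normalization (a worked check that uses only the probabilities tabulated above), sum the three numerators:
\beq
 P(D\eq 3) = \sum_{i=1}^{3} P(D\eq 3 \given \H_i) P(\H_i)
 = \frac{1}{2} \times \frac{1}{3} + 1 \times \frac{1}{3} + 0 \times \frac{1}{3}
 = \frac{1}{2} .
\eeq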
So \beq \begin{array}{|rcl|rcl|rcl|} P( \H_1 \given D\eq3 ) &=& \dfrac{ 1}{3} & P(\H_2 \given D\eq3) &=& \dfrac{ 2}{3} & P(\H_3 \given D\eq3) &=& 0 . \end{array} \eeq So the contestant should switch to door 2 in order to have the biggest chance of getting the prize. Many people find this outcome surprising. There are two ways to make it more intuitive. One is to play the game\index{game!three doors} thirty times with a friend and keep track of the frequency with which switching gets the prize. Alternatively, you can perform a thought experiment in which the game is played with a million doors. The rules are now that the contestant chooses one door, then the game show host opens 999,998 doors in such a way as not to reveal the prize, leaving the {\em contestant's\/} selected door and {\em one other door\/} closed. The contestant may now stick or switch. Imagine the contestant confronted by a million doors, of which doors 1 and 234,598 have not been opened, door 1 having been the contestant's initial guess. Where do you think the prize is? } % \soln{ex.3doorsb}{ % earthquake rules. If door 3 is opened by an earthquake, the inference comes out differently -- even though visually the scene looks the same. The nature of the data, and the probability of the data, are both now different. The possible data outcomes are, firstly, that any number of the doors might have opened. We could label the eight possible outcomes $\bd = (0,0,0), (0,0,1), (0,1,0), (1,0,0), (0,1,1), \ldots, (1,1,1)$. Secondly, it might be that the prize is visible after the earthquake has opened one or more doors. So the data $D$ consists of the value of $\bd$, and a statement of whether the prize was revealed. 
It is hard to say what the probabilities of these outcomes are, since they depend on our beliefs about the reliability of the door latches and the properties of earthquakes, but it is possible to extract the desired posterior probability without naming the values of $P(\bd \given \H_i)$ for each $\bd$. All that matters are the relative values of the quantities $P(D \given \H_1)$, $P(D \given \H_2)$, $P(D \given \H_3)$, for the value of $D$ that actually occurred. [This is the {\dem\ind{likelihood principle}}, which we met in \sectionref{sec.lp}.] % !!!!!!!!! add page ref? The value of $D$ that actually occurred is `$\bd \eq (0,0,1)$, and no prize visible'. First, it is clear that $P(D \given \H_3)=0$, since the datum that no prize is visible is incompatible with $\H_3$. Now, assuming that the contestant selected door 1, how does the probability $P(D \given \H_1)$ compare with $P(D \given \H_2)$? Assuming that earthquakes are not sensitive to decisions of game show contestants, these two quantities have to be equal, by symmetry. We don't know how likely it is that door 3 falls off its hinges, but however likely it is, it's just as likely to do so whether the prize is behind door 1 or door 2. So, if $P(D \given \H_1)$ and $P(D \given \H_2)$ are equal, we obtain: \beq \begin{array}{|r@{\,\,=\,\,}l|r@{\,\,=\,\,}l|r@{\,\,=\,\,}l|} P(\H_1 | D) & \smallfrac{ P(D | \H_1) (\smalldfrac{1}{3}) }{P(D) } & P(\H_2 | D) & \smallfrac{ P(D | \H_2) (\smalldfrac{1}{3}) }{P(D) } & P(\H_3 | D) & \smallfrac{ P(D | \H_3) (\smalldfrac{1}{3}) }{P(D) } \\ & \dfrac{ 1}{2} & & \dfrac{ 1}{2} & & 0 . \end{array} \eeq The two possible hypotheses are now equally likely. If we assume that the host knows where the prize is and might be acting deceptively, then the answer might be further modified, because we have to view the host's words as part of the data. Confused? It's well worth making sure you understand these two gameshow problems. 
Don't worry, I slipped up on the second problem, the first time I met it. There is a general rule which helps immensely when you have a confusing probability problem:\index{key points!solving probability problems} \begin{conclusionbox} Always\index{Gull, Steve} write down the probability of everything.\\ { \hfill {\em ({Steve Gull})} \par } \end{conclusionbox} From this joint probability, any desired inference can be mechanically obtained (\figref{fig.everything}). \amarginfig{b}{ \begin{center} \newcommand{\tabwidth}{30} \newcommand{\tabheight}{80} \setlength{\unitlength}{1mm}{ \begin{picture}(43,92)(-13,0) \put(15,90){\makebox(0,0){\small\sf{Where the prize is}}} \put( 5,85){\makebox(0,0){\small{door}}} \put(15,85){\makebox(0,0){\small{door}}} \put(25,85){\makebox(0,0){\small{door}}} \put( 5,82){\makebox(0,0){\small{1}}} \put(15,82){\makebox(0,0){\small{2}}} \put(25,82){\makebox(0,0){\small{3}}} \put(-1, 5){\makebox(0,0)[r]{\footnotesize{1,2,3}}} \put(-1,15){\makebox(0,0)[r]{\footnotesize{2,3}}} \put(-1,25){\makebox(0,0)[r]{\footnotesize{1,3}}} \put(-1,35){\makebox(0,0)[r]{\footnotesize{1,2}}} \put(-1,45){\makebox(0,0)[r]{\footnotesize{3}}} \put( 5,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}} \put(15,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}} \put(25,75){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{\rm none}}{3}$}}} \put( 5,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}} \put(15,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}} \put(25,45){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{3}}{3}$}}} \put( 5, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}} \put(15, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}} \put(25, 5){\makebox(0,0){\footnotesize{$\displaystyle\frac{p_{1,2,3}}{3}$}}} \put(-1,55){\makebox(0,0)[r]{\footnotesize{2}}} \put(-1,65){\makebox(0,0)[r]{\footnotesize{1}}} 
\put(-1,75){\makebox(0,0)[r]{\footnotesize{none}}} \put(-12,40){\makebox(0,0){\rotatebox{90}{\small\sf{Which doors opened by earthquake}}}} \multiput(0,0)(0,10){9}{\line(1,0){\tabwidth}} \multiput(0,0)(10,0){4}{\line(0,1){\tabheight}} \end{picture}} \end{center} \caption[a]{The probability of everything, for the second three-door problem, assuming an earthquake has just occurred. Here, $p_3$ is the probability that door 3 alone is opened by an earthquake.} \label{fig.everything} } } \fakesection{simpsons} \soln{ex.simpsons}{ The statistic quoted by the lawyer indicates the % {prior\/} probability % \index{Simpson, O.J., similar case to}% %\index{Simpson, O.J., allusion to} \index{lawyer}\index{wife-beater}\index{murder} that a randomly selected wife-beater will also murder his wife. The probability that the husband was the murderer, {\em given that the wife has been murdered}, is a completely different quantity. To deduce the latter, we need to make further assumptions about the probability that the wife is murdered by someone else. If she lives in a neighbourhood with frequent random murders, then this probability is large and the posterior probability that the husband did it (in the absence of other evidence) may not be very large. But in more peaceful regions, it may well be that the most likely person to have murdered you, if you are found murdered, is one of your closest relatives. %{\em Numbers here.} Let's work out some illustrative numbers with the help of the statistics on page \pageref{footnote.murder}. Let $m\eq 1$ denote the proposition that a woman has been murdered; $h\eq 1$, the proposition that the husband did it; and $b\eq 1$, the proposition that he beat her in the year preceding the murder. The statement `someone else did it' is denoted by $h\eq 0$. We need to define $P(h \given m\eq 1)$, $P(b \given h\eq 1,m\eq 1)$, and $P(b\eq 1 \given h\eq 0,m\eq 1)$ in order to compute the posterior probability $P(h\eq 1 \given b\eq 1,m\eq 1)$. 
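Written out in these symbols, and conditioning on $m\eq 1$ throughout, \Bayes\ theorem for the two alternatives $h\eq 1$ and $h\eq 0$ reads as follows (this is just the standard form, set down here as a guide to which three numbers are required):
\beq
 P(h\eq 1 \given b\eq 1,m\eq 1) =
 \frac{ P(b\eq 1 \given h\eq 1,m\eq 1) \, P(h\eq 1 \given m\eq 1) }
 { P(b\eq 1 \given h\eq 1,m\eq 1) \, P(h\eq 1 \given m\eq 1)
 + P(b\eq 1 \given h\eq 0,m\eq 1) \, P(h\eq 0 \given m\eq 1) } ,
\eeq
where $P(h\eq 0 \given m\eq 1) = 1 - P(h\eq 1 \given m\eq 1)$.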
From the statistics, we can read out $P(h\eq 1 \given m\eq 1)=0.28$. And if two million women out of 100 million are beaten, then $P(b\eq 1 \given h\eq 0,m\eq 1)=0.02$. Finally, we need a value for $P(b \given h\eq 1,m\eq 1)$: if a man murders his wife, how likely is it that this is the first time he laid a finger on her? I expect it's pretty unlikely; so maybe $P(b\eq 1 \given h\eq 1,m\eq 1)$ is 0.9 or larger. By \Bayes\ theorem, then, \beq P(h\eq 1 \given b\eq 1,m\eq 1) = \frac{ 0.9 \times 0.28 }{ 0.9 \times 0.28 + 0.02 \times 0.72 } \simeq 95\% . \eeq One way to make obvious the sliminess of the lawyer on \pref{ex.simpsons} is to construct arguments, with the same logical structure as his, that are clearly wrong. For example, the lawyer could say `Not only was Mrs\ S murdered, she was murdered between 4.02pm and 4.03pm. {\em Statistically}, only one in a {\em million\/} wife-beaters actually goes on to murder his wife between 4.02pm and 4.03pm. So the wife-beating % , which is not denied by Mr.\ S, is not strong evidence at all. In fact, given the wife-beating evidence alone, it's extremely unlikely that he would murder his wife in this way -- only a 1/1,000,000 chance.' } % arrived here Sun 6/4/03 \soln{ex.phonetest}{% was phonecheck There are two hypotheses. $\H_0$: your number is {\tt 740511}; $\H_1$: it is another number. The data, $D$, are `when I dialled {\tt 740511}, I got a busy signal'. What is the probability of $D$, given each hypothesis? If your number is {\tt 740511}, then we expect a busy signal with certainty: \[ P(D \given \H_0) = 1 . \] On the other hand, if $\H_1$ is true, then the probability that the number dialled returns a busy signal is smaller than 1, since various other outcomes were also possible (a ringing tone, or a number-unobtainable signal, for example). 
The value of this probability $P(D \given \H_1)$ will depend on the probability $\alpha$ that a random phone number similar to your own phone number would be a valid phone number, and on the probability $\beta$ that you get a busy signal when you dial a valid phone number. % 37 per col, 4 cols per page, 250 pages. % 20 per col, 3 cols per page, 270 pages. % 50,000. maybe another 50% ex-directory? I estimate from the size of my phone book that Cambridge has about $75\,000$ valid phone numbers, all of length six digits. The probability that a random six-digit number is valid is therefore about $75\,000/10^6 = 0.075$. If we exclude numbers beginning with 0, 1, and 9 from the random choice, the probability $\a$ is about $75\,000/700\,000 \simeq 0.1$. If we assume that telephone numbers are clustered then a misremembered number might be more likely to be valid than a randomly chosen number; so the probability, $\alpha$, that our guessed number would be valid, assuming $\H_1$ is true, might be bigger than 0.1. Anyway, $\alpha$ must be somewhere between 0.1 and 1. We can carry forward this uncertainty in the probability and see how much it matters at the end. The probability $\beta$ that you get a busy signal when you dial a valid phone number is equal to the fraction of phones you think are in use or off-the-hook when you make your tentative call. This fraction varies from town to town and with the time of day. In Cambridge, during the day, I would guess that about 1\% of phones are in use. At 4am, % four in the morning, maybe 0.1\%, or fewer. The probability $P(D \given \H_1)$ is the product of $\alpha$ and $\beta$, that is, about $0.1 \times 0.01 = 10^{-3}$. According to our estimates, there's about a one-in-a-thousand chance of getting a busy signal when you dial a random number; or one-in-a-hundred, if valid numbers are strongly clustered; or one-in-$10^4$, if you dial in the wee hours. How do the data affect your beliefs about your phone number? 
The posterior probability ratio is the likelihood ratio times the prior probability ratio: \beq \frac{ P(\H_0 \given D) }{ P(\H_1 \given D) } = \frac{ P(D \given \H_0) }{ P(D \given \H_1) } \frac{ P(\H_0) }{ P(\H_1) } . \eeq The likelihood ratio is about 100-to-1 or 1000-to-1, so the posterior probability ratio is swung by a factor of 100 or 1000 in favour of $\H_0$. If the prior probability of $\H_0$ was 0.5 then the posterior probability is \beq P(\H_0 \given D) = \frac{1}{1 + \smallfrac{ P(\H_1 \given D) }{ P(\H_0 \given D) } } \simeq 0.99 \: \mbox{or} \: 0.999 . \eeq } \soln{ex.eurotoss}{ % see also % http://www.dartmouth.edu/~chance/chance_news/recent_news/chance_news_11.02.html % for lots of practical info on coin biases. %%%%%%%%%%%%%%%%%%%%%%%%%%% included by _s8.tex % First, could confirm his sampling theory %Sampling theory: number of heads $\sim 125 \pm 8$ %$ \sqrt{62.5}$ %so two-tail probability is % pr 2*(1-myerf(14.5/7.9)) ans = 0.066440 % if the data were 141 out of 250 then we get % 2*(1-myerf(15.5/7.9)) ans = 0.049760 \index{euro}We compare the models $\H_0$ -- the coin is fair -- and $\H_1$ -- the \ind{coin} is biased, with the prior on its bias set to the uniform distribution $P(p|\H_1)=1$. % ent, as defined in this chapter. \amarginfig{t}{ \begin{center} \mbox{\psfig{figure=gnu/euro.ps,width=1.62in,angle=-90}} \end{center} \caption[a]{The probability distribution of the number of heads given the two hypotheses, that the coin is fair, and that it is biased, with the prior distribution of the bias being uniform. The outcome ($D = 140$ heads) gives weak evidence in favour of $\H_0$, the hypothesis that the coin is fair.} \label{fig.euro} } [The use of a uniform prior seems reasonable to me, since I know that some coins, such as American pennies, have severe biases when spun on edge; so the situations $p=0.01$ or $p=0.1$ or $p=0.95$ would not surprise me.] 
\begin{aside} When I mention $\H_0$ -- the coin is fair -- a pedant would say, `how absurd to even consider that the coin is fair -- any coin is surely biased to some extent'. And of course I would agree. So will pedants kindly understand $\H_0$ as meaning `the coin is fair to within one part in a thousand, \ie, $p \in 0.5\pm 0.001$'. \end{aside} The likelihood ratio is: % given in \eqref{eq.compare.final}. \beq % Bayesian approach: Model comparison: \frac{ P( D|\H_1 )} {P( D|\H_0 )} = \frac{ \smallfrac{ 140! 110! }{ 251! } }{ 1/2^{250} } = 0.48 . \eeq Thus the data give scarcely any evidence either way; in fact they give weak evidence (two to one) in favour of $\H_0$! % load 'gnu/euro.gnu' `No, no', objects the believer in bias, `your silly uniform prior doesn't represent {\em my\/} prior beliefs about the bias of biased coins -- I was {\em expecting\/} only a small bias'. To be as generous as possible to the $\H_1$, let's see how well it could fare if the prior were presciently set. Let us allow a prior of the form \beq P(p|\H_1,\a) = \frac{1}{Z(\a)} p^{\a-1}(1-p)^{\a-1}, \:\:\:\: \mbox{where $Z(\a)=\Gamma(\alpha)^2/\Gamma(2 \alpha)$} \eeq (a Beta % Dirichlet (or Beta) distribution, with the original uniform prior reproduced by setting $\a=1$). By tweaking $\alpha$, the likelihood ratio for $\H_1$ over $\H_0$, \beq \frac{ P( D|\H_1,\a )} {P( D|\H_0 )} = \frac{\Gamma(140 \!+\! \alpha) \, \Gamma(110 \!+\! \alpha) \, \Gamma(2 \alpha) 2^{250}} { \Gamma(250 \!+\! 2 \alpha) \, \Gamma(\alpha)^2 }, \eeq can be increased a little. It is shown for several values of $\a$ in \figref{fig.eurot}.% % % fig.eurot WAS here but has been moved away to avoid a crunch % This figure belongs earlier. 
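As a consistency check (a sketch that uses only the identity $\Gamma(n\!+\!1) = n!$ for integer $n$), setting $\a = 1$ in this expression should recover the uniform-prior result:
\beq
 \frac{\Gamma(141) \, \Gamma(111) \, \Gamma(2) \, 2^{250}}{ \Gamma(252) \, \Gamma(1)^2 }
 = \frac{ 140! \, 110! }{ 251! } \, 2^{250} = 0.48 ,
\eeq
in agreement with the likelihood ratio found above and with the $\a = 1.0$ entry of \figref{fig.eurot}.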
\amarginfig{t}{ {\footnotesize \begin{tabular}{r@{}l@{$\:\:\:$}r@{\hspace*{0.3in}}r@{}l} \toprule \multicolumn{2}{c}{$\alpha$}& \multicolumn{3}{c}{$\displaystyle \frac{ P( D|\H_1,\a )} {P( D|\H_0 )}$}\\ \midrule &.37 & & &.25\\ 1&.0 & & &.48\\ 2&.7 & & &.82\\ 7&.4 & &1&.3\\ 20& & &1&.8\\ 55& & &1&.9\\ 148& & &1&.7\\ 403& & &1&.3\\ 1096& & &1&.1\\ % from euro.dat \bottomrule \end{tabular} } \caption[a]{Likelihood ratio for various choices of the prior distribution's hyperparameter $\alpha$. } \label{fig.eurot} } % Even the most favourable choice of $\alpha$ ($\a \simeq 50$) can yield a likelihood ratio of only two to one in favour of $\H_1$. In conclusion, the data are not `very suspicious'. They can be construed as giving at most two-to-one evidence in favour of one or other of the two hypotheses. \begin{aside} Are these wimpy likelihood ratios the fault of over-restrictive priors? Is there any way of producing a `very suspicious' conclusion? The prior that is best-matched to the data, in terms of likelihood, % and one that surely has to be viewed as unreasonable, is the prior that sets $p$ to $f \equiv 140/250$ with probability one. Let's call this model $\H_*$. % , since it is a parameterless model like $\H_0$. The likelihood ratio is $P(D|\H_*)/P(D|\H_0) = 2^{250} f^{140} (1-f)^{110} =6.1$. So the strongest evidence that these data can possibly muster against the hypothesis that there is no bias is six-to-one. 
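To see where the figure of 6.1 comes from (a rough check, working in natural logarithms with $f = 0.56$ and rounding each term to three decimal places):
\beq
 \ln \frac{ P(D|\H_*) }{ P(D|\H_0) } = 250 \ln 2 + 140 \ln 0.56 + 110 \ln 0.44
 \simeq 173.287 - 81.175 - 90.308 = 1.804 ,
\eeq
and $e^{1.804} \simeq 6.1$.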
\end{aside} % b.blight@lse.ac.uk %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % alternate answers for the case of 141 heads where % the P value is 0.05 (0.04976) % %The outcomes of the computations for this case (141 from 250) % are % alpha , likelihood ratio % %.3678794412, .3166098681 %1., .6110726692 %2.718281828, 1.049115229 %7.389056099, 1.627382387 %20.08553692, 2.181864309 %54.59815003, 2.303276774 %148.4131591, 1.882663014 %403.4287935, 1.419011740 %1096.633158, 1.168433218 %2980.957987, 1.063851106 %8103.083928, 1.023737702 %22026.46579, 1.008765749 % % and H_BF achieves 7.796 While we are noticing the absurdly misleading\index{sampling theory!criticisms}\index{sermon!sampling theory}\index{p-value} answers that `sampling theory' statistics produces, such as the \index{p-value}$p$-value of 7\% in the exercise we just solved, let's stick the boot in.\label{sec.sampling5percent} If we make a tiny change to the data set, increasing the number of heads in 250 tosses from 140 to 141, we find that the $p$-value goes below the mystical value of 0.05 (the $p$-value is 0.0497). The sampling theory statistician would happily squeak `the probability of getting a result as extreme as 141 heads is smaller than 0.05 -- we thus reject the null hypothesis at a significance level of 5\%'. The correct answer is shown for several values of $\a$ in \figref{fig.eurot141}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % alternate answers for the case of 141 heads where % the P value is 0.05 (0.04976) % Radford: Using R, I get that the true p-value (with genuine binomial %probabilities) for 141 out of 250 is 0.04970679, close to your value. 
The values worth highlighting from this table are, first, the likelihood ratio when $\H_1$ uses the standard uniform prior, which is 1:0.61 in favour of the {\em null hypothesis\/} $\H_0$. Second, the most favourable choice of $\a$, from the point of view of $\H_1$, can only yield a likelihood ratio of about 2.3:1 in favour of $\H_1$.\label{sec.pvalue05}
\amarginfig{c}{
{\footnotesize
\begin{tabular}{r@{}l@{$\:\:\:$}r@{\hspace*{0.3in}}r@{}l} \toprule
\multicolumn{2}{c}{$\alpha$}& \multicolumn{3}{c}{$\displaystyle \frac{ P( D'|\H_1,\a )} {P( D'|\H_0 )}$ }\\ \midrule
&.37 & & &.32\\
1&.0 & & &.61\\
2&.7 & &1&.0\\
7&.4 & &1&.6\\
20& & &2&.2\\
55& & &2&.3\\
148& & &1&.9\\
403& & &1&.4\\
1096& & &1&.2\\
% from euro.dat
\bottomrule
\end{tabular}
}
\caption[a]{Likelihood ratio for various choices of the prior distribution's \ind{hyperparameter} $\alpha$, when the data are $D'=141$ heads in 250 trials. }
\label{fig.eurot141}
}
%
Be warned! A $p$-value of 0.05 is often interpreted
% gives the impression to many
as implying that the odds are stacked about twenty-to-one {\em against\/} the null hypothesis. But the truth in this case is that the evidence either slightly {\em favours\/} the null hypothesis, or disfavours it by at most 2.3 to one, depending on the choice of prior.
% $p$-values
The $p$-values and `\ind{significance level}s' of \ind{classical statistics}\index{sermon!classical statistics} should be treated with {\em extreme caution}.\index{caution!sampling theory}
% This is the last we will see of them in this book.
Shun them!
Here ends the sermon.\index{sermon!sampling theory} % Classical statistics and Microsoft Windows 95 -- % two of the greatest evils to come out of the twentieth century. } \dvipsb{solutions bayes} % \input{tex/_l1b.tex} % % message passing was here % \renewcommand{\partfigure}{\poincare{8.0}} \part{Data Compression} \prechapter{About Chapter} \fakesection{prerequisites for chapter 2} % In this chapter we discuss how to measure the information content of the outcome of a random experiment. This chapter has some tough bits. If you find the mathematical details hard, % to follow, skim through them and keep going -- you'll be able to enjoy Chapters \ref{ch3} and \ref{ch4} without this chapter's tools. % of typicality. \amarginfignocaption{t}{%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Cast of characters} \footnotesize \begin{tabular}{@{}lp{1.14in}} \multicolumn{2}{c}{ {\sf Notation} }\\ \midrule $x \in \A$ & $x$ is a {\dem{member}\/} of the \ind{set} $\A$ \\ $\S \subset \A$ & $\S$ is a {\dem\ind{subset}\/} of the set $\A$ \\ $\S \subseteq \A$ & $\S$ is a {\ind{subset}} of, or equal to, the set $\A$ \\ % \union $\V = \B \cup \A$ & $\V$ is the {\dem\ind{union}\/} of the sets $\B$ and $\A$ \\ $\V = \B \cap \A$ & $\V$ is the {\dem\ind{intersection}\/} of the sets $\B$ and $\A$ \\ $|\A|$ & number of elements in set $\A$\\ \bottomrule \end{tabular} \medskip % end marginstuff }%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Before reading \chref{ch2}, you should have read % section \ref{ch1.secprob} \chref{ch1.secprob} and worked on % \exerciseref{ex.expectn}. % It will also help if you have worked on % % do I need to ensure that {ex.Hadditive} occurs earlier? % \exerciseonlyrange{ex.expectn}{ex.Hineq} and \ref{ex.sumdice} % \exerciseonlyrangeshort{ex.sumdice}{ex.RNGaussian} \pagerange{ex.sumdice}{ex.invP}, % {ex.RNGaussian}. % exercises \exnine-\exfourteen\ and \extwentyfive-\extwentyseven. and \exerciseonlyref{ex.weigh} below. 
The following exercise is intended to help you think about how to measure information content. % Please work on this exercise now. % weighing % ITPRNN Problem 1 % % weighing problem % \fakesection{the weighing problem} \exercissxA{2}{ex.weigh}{ -- {\em Please work on this problem before reading \chref{ch.two}.} \index{weighing problem}You are given 12 balls, all equal in weight except for one that is either heavier or lighter. You are also given a two-pan \ind{balance} to use. % , which you are to use as few times as possible. In each use of the balance you may put {any\/} number of the 12 balls on the left pan, and the same number on the right pan, and push a button to initiate the weighing; there are three possible outcomes: either the weights are equal, or the balls on the left are heavier, or the balls on the left are lighter. Your task is to design a strategy to determine which is the odd ball {\em and\/} whether it is heavier or lighter than the others {\em in as few uses of the balance as possible}. % There will be a prize for the best answer. While thinking about this problem, you % should may find it helpful to consider the following questions: \ben \item How can one measure {\dem{information}}? \item When you have identified the odd ball and whether it is heavy or light, how much information have you gained? \item Once you have designed a strategy, draw a tree showing, for each of the possible outcomes of a weighing, what weighing you perform next. At each node in the tree, how much information have the outcomes so far given you, and how much information remains to be gained? % What is the probability of each of the possible outcomes of the first % weighing? %\item % What is the most information you can get from a single weighing? % How much information do you get from a single weighing % if the three outcomes are equally probable? 
%\item What is the smallest number of weighings that might conceivably %be sufficient always to identify the odd ball and whether it is heavy %or light? \item How much information is gained when you learn (i) the state of a flipped coin; (ii) the states of two flipped coins; (iii) the outcome when a four-sided die is rolled? \item How much information is gained on the first step of the weighing problem if 6 balls are weighed against the other 6? How much is gained if 4 are weighed against 4 on the first step, leaving out 4 balls? % the other 4 aside? \een } % % How many possible outcomes of an e weighing process are there? To put it another way, imagine that you report the outcome by sending a postcard which says, for example, "ball number 5 is heavy", how many prepare a postcard % % how many outcomes are there? % How many possible states of the world are y % if you tell someone ball number x is heavy, how much info have you given % them? how much information can be conveyed by $k$ uses of the balance? % % % make clear that you can put any objects on the scales, % don't have to weigh 6 vs 6. % no cheating by gradually adding weights % % katriona's problem: 4 bits, randomly rotated every time you ask them % to be flipped. % % hhhh llll gggg % hhll lhgg lh % if left is h then % hh or l % so do h vs h % % else gggg gggg ???? % -> ?? ?g % -> hh l or ggg -> wegh last dude (1 bit) % do h vs h % % if 13 and good avail, - hhhhh llll* gggg % hhll lhgg hhl % \mysetcounter{page}{76} \ENDprechapter \chapter{The Source Coding Theorem} \label{ch.two}\label{ch2}\label{chtwo} % _l2.tex % \part{Data Compression} % \chapter{The Source Coding Theorem} % % I introduce the idea of a "name" (or label?) 
% here, and should clarify
% (example 2.1)
%
% E = 13%, Q,Z = 0.1%
% TH = 3.7%
%
% New plan for this chapter:
% \section{Key concept}
% Rather than $H(\bp)$ being the measure of information content of
% an ensemble,
% I want the central idea of this chapter to be that
% $\log 1/P(\bx)$ is the information content of a particular
% outcome $\bx$. $H$ is then of interest because it is the average
% information content.
%
% An example to illustrate this is `hunt the professor'. Or crack
% the combination. Guess the PIN.
% An absent-minded professor wishes to remember an
% integer between 1 and 256, that is, eight bits of information.
% He takes 256 large numbered cardboard boxes, and climbs
% in the box whose number is the integer to be remembered.
% The only way to find him
% is to open the lid of a box. A single experiment involves
% opening a particular box. The outcome is either $x={\tt n}$ -- no
% professor -- or $x={\tt y}$ -- the professor is in there.
% The probabilities are
% \beq
% P(x\eq {\tt n}) = 255/256; P(x\eq {\tt y}) = 1/256.
% \eeq
% We open box $n$.
% If the professor is revealed, we have learned the integer,
% and thus recovered 8 bits of information. If he is not revealed,
% we have learned very little -- simply that the
% integer is not $n$. The information contents are:
% \beq
% h(x\eq {\tt n}) = \log_2( 256/255) = 0.0056 ; h(x\eq {\tt y}) = \log_2 256 = 8 .
% \eeq
% The average information content is
% \beq
% H(X) = 0.037 \bits .
% \eeq
% This example shows that in the event of an improbable outcome's occurring,
% a large amount of information really is conveyed.
%
% \section{Weighing problem}
% The weighing problem remains useful, let's keep it.
%
% \section{Source coding theorem}
% Relate `information content' $\log 1/P$ to message length
% in two steps. First, establish the AEP, that
% the outcome from an ensemble $X^N$
% is very likely to lie in a typical set having `information
% content' close to NH.
% % Second, show that we can count the number of elements in the % typical set, give them all names, and the number of % names will be about $2^{NH}$. % % At what point should $H_{\delta}$ be introduced? \section{How to measure the information content of a random variable?} In the next few chapters, we'll be talking about probability distributions and random variables. Most of the time we can get by with sloppy notation, but occasionally, we will need precise notation. Here is the %definition and notation that we established in \chapterref{ch.prob.ent}.\indexs{ensemble} % \sloppy \begin{description} \item[An ensemble] $X$ is a triple $(x,\A_X, \P_X)$, where the {\dem outcome\/} $x$ is the value of a random variable, % whose value $x$ can take on a which takes on one of a set of possible values, % the alphabet % {\em outcomes}, $\A_X = \{a_1,a_2,\ldots,a_i,\ldots, a_I\}$, % \ie, possible values for a random variable $x$ % and a probability distribution over them, having probabilities $\P_X = \{p_1,p_2,\ldots, p_I\}$, with $P(x\eq a_i) = p_i$, $p_i \geq 0$ and $\sum_{a_i \in \A_X} P(x \eq a_i) = 1$. \end{description} %\begin{description} %\item[An ensemble] $X$ is a random variable $x$ taking on a value % from a set of possible {\em outcomes}, % $$\A_X \eq \{a_1,\ldots,a_I\},$$ % having probabilities % $$\P_X = \{p_1,\ldots, p_I\},$$ with $P(x\eq a_i) = p_i$, % $p_i \geq 0$ and $\sum_{x \in \A_X} P(x) = 1$. %\end{description} % An ensemble is a set of possible values for a random variable % and a probability distribution over them. 
{How can we measure the information content of an outcome $x = a_i$ from such an ensemble?} In this chapter we examine the assertions \ben \item that the % It is claimed that the {\dem{{Shannon information content}}},\index{information content!Shannon}\index{information content!how to measure} \beq h(x\eq a_i) \equiv \log_2 \frac{1}{p_i}, \eeq is a sensible measure of the information content of the outcome $x = a_i$, and \item that the {\dem{\ind{entropy}}} of the ensemble, \beq H(X) = \sum_i p_i \log_2 \frac{1}{p_i}, \eeq is a sensible measure of the ensemble's average information content. \een \begin{figure}[htbp] \figuremargin{%1 {\small% \begin{center} \mbox{ \mbox{ \hspace{-9mm} \mbox{\psfig{figure=figs/h.ps,% width=42mm,angle=-90}}$p$ \hspace{-35mm} \makebox[0in][l]{\raisebox{\hpheight}{$h(p)= \log_2 \displaystyle \frac{1}{p}$ }} \hspace{35mm} } \hspace{0.9mm} \begin{tabular}[b]{ccc}\toprule $p$ & $h(p)$ & $H_2(p)$ \\ \midrule 0.001 & 10.0 & 0.011 \\ % 9.96578 & 0.0114078 0.01\phantom{0} & \phantom{1}6.6 & 0.081 \\ 0.1\phantom{01} & \phantom{1}3.3 & 0.47\phantom{1} \\ 0.2\phantom{01} & \phantom{1}2.3 & 0.72\phantom{1} \\ 0.5\phantom{01} & \phantom{1}1.0 & 1.0\phantom{01} \\ \bottomrule \end{tabular} \mbox{ % to put H at left: \hspace{1.2mm} \hspace{6.2mm} \raisebox{\hpheight}{$H_2(p)$} % to put H at left: \hspace{-7.5mm} \hspace{-20mm} \mbox{\psfig{figure=figs/H2.ps,% width=42mm,angle=-90}}$p$ } % see also H2x.tex } \end{center} }% end small }{% \caption[a]{The {Shannon information content} $h(p) = \log_2 \frac{1}{p}$ and the binary entropy function $H_2(p)=H(p,1\!-\!p)=p \log_2 \frac{1}{p} + (1-p)\log_2 \frac{1}{(1-p)}$ as a function of $p$.} \label{fig.h2} }% \end{figure} % gnuplot % load 'figs/l2.gnu' \noindent \Figref{fig.h2} shows the Shannon information content of an outcome with probability $p$, as a function of $p$. The less probable an outcome is, the greater its Shannon information content. 
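For example, evaluating the definition for two of the rows in the table (values rounded to one decimal place),
\beq
 h(0.1) = \log_2 10 \simeq 3.3 \mbox{ bits}, \:\:\:\:
 h(0.5) = \log_2 2 = 1 \mbox{ bit},
\eeq
so an outcome that is five times less probable carries about $\log_2 5 \simeq 2.3$ bits more information.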
\Figref{fig.h2} also shows % $h(p) = \log_2 \frac{1}{p}$, the binary entropy function, \beq H_2(p)=H(p,1\!-\!p)=p \log_2 \frac{1}{p} + (1-p)\log_2 \frac{1}{(1-p)} , \eeq which is the entropy of the ensemble $X$ whose alphabet and probability distribution are $\A_X = \{ a , b \}, \P_X = \{ p , (1-p) \}$. % \subsection{Information content of independent random variables} Why should $\log 1/p_i$ have anything to do with the information content? Why not some other function of $p_i$? We'll explore this question in detail shortly, but first, notice a nice property of this particular function $h(x)=\log 1/p(x)$. Imagine learning the value of two {\em independent\/} random variables, $x$ and $y$. The definition of independence is that the probability distribution is separable into a {\em product}: \beq P(x,y) = P(x) P(y) . \eeq Intuitively, we might want any measure of the `amount of information gained' to have the property of {\em additivity} -- that is, for independent random variables $x$ and $y$, the information gained when we learn $x$ and $y$ should equal the sum of the information gained if $x$ alone were learned and the information gained if $y$ alone were learned. The Shannon information content of the outcome $x,y$ is \beq h(x,y) = \log \frac{1}{P(x,y)} = \log \frac{1}{P(x)P(y)} = \log \frac{1}{P(x)} + \log \frac{1}{P(y)} \eeq so it does indeed satisfy \beq h(x,y) = h(x) + h(y), \:\:\mbox{if $x$ and $y$ are independent.} \eeq \exercissxA{1}{ex.Hadditive}{ Show that, if $x$ and $y$ are independent, the entropy of the outcome $x,y$ satisfies \beq H(X,Y) = H(X) + H(Y) . \eeq In words, entropy is additive for independent variables. } We now explore these ideas with some examples; then, in section \ref{sec.aep} and in Chapters \ref{ch3} and \ref{ch4}, we prove that the Shannon information content and the entropy are related to the number of bits needed to describe the outcome of an experiment. 
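
As a first small example (the coin and the four-sided die are illustrative choices, assumed here to be fair and independent), let $x$ be the state of a flipped coin and $y$ the outcome when a four-sided die is rolled, so that $P(x,y) = \frac{1}{2} \times \frac{1}{4} = \frac{1}{8}$. Then
\beq
 h(x,y) = \log_2 8 = 3 \mbox{ bits} = \log_2 2 + \log_2 4 = h(x) + h(y) .
\eeq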
% \section{Thinking about information content} % \subsection{Ensembles with maximum average information content} % The first property of the entropy that we will % consider is the property that you proved when you solved % \exerciseref{ex.Hineq}: the entropy of an ensemble % $X$ is biggest if all the outcomes % have equal probability $p_i \eq 1/|X|$. % % If entropy measures the average information content % of an ensemble, then this idea of equiprobable outcomes % should have relevance for the design of efficient experiments. \subsection{The weighing problem: designing informative experiments} Have you solved the \ind{weighing problem}\index{puzzle!weighing 12 balls} \exercisebref{ex.weigh}\ yet? Are you sure? Notice that in three uses of the balance -- which reads either `left heavier', `right heavier', or `balanced' -- the number of conceivable outcomes is $3^3=27$, whereas the number of possible states of the world is 24: the odd ball could be any of twelve balls, and it could be heavy or light. So in principle, the problem might be solvable in three weighings -- but not in two, since $3^2 < 24$. If you know how you {can} determine the odd weight {\em and\/} whether it is heavy or light in {\em three\/} weighings, then you may read on. If you haven't found a strategy that always gets there in three weighings, I encourage you to think about \exerciseonlyref{ex.weigh} some more. % {ex.weigh} % \subsection{Information from experiments} Why is your strategy optimal? What is it about your series of weighings that allows useful information to be gained as quickly as possible? \begin{figure}%[htbp] \fullwidthfigureright{% % included by l2.tex % % shows weighing trees, ternary % % decisions of what to weigh are shown in square boxes with 126 over 345 (l:r) % state of valid hypotheses are listed in double boxes % three arrows, up means left heavy, straight means right heavy, down is balance % actually s and d boxes end up having the same defn. 
% \setlength{\unitlength}{0.56mm}% page width is 160mm % was 6mm \begin{center} \small \begin{picture}(260,260)(-50,-130) % % initial state % % all 24 hypotheses \mydbox{-50,-100}{15,200}{$1^+$\\$2^+$\\$3^+$\\$4^+$\\$5^+$\\$6^+$\\$7^+$\\ $8^+$\\$9^+$\\$10^+$\\$11^+$\\$12^+$\\$1^-$\\$2^-$\\$3^-$\\$4^-$\\ $5^-$\\$6^-$\\$7^-$\\$8^-$\\$9^-$\\$10^-$\\$11^-$\\$12^-$} \mysbox{-30,-8}{25,16}{$\displaystyle\frac{1\,2\,3\,4}{5\,6\,7\,8}$} \put(-30,10){\makebox(25,8){weigh}} % % 1st arrows % \mythreevector{0,0}{1}{3}{30} % % first three boxes of hypotheses % boxes of actions % #1 is bottom left corner, so has to be offset by height of box % #2 is dimensions of box % % each digit is about 10 high % \mydbox{40,55}{15,70}{$1^+$\\$2^+$\\$3^+$\\$4^+$\\$5^-$\\$6^-$\\$7^-$\\$8^-$} \mysbox{65,82}{25,16}{$\displaystyle\frac{1\,2\,6}{3\,4\,5}$} \put(65,100){\makebox(25,8){weigh}} \mydbox{40,-35}{15,70}{$1^-$\\$2^-$\\$3^-$\\$4^-$\\$5^+$\\$6^+$\\$7^+$\\$8^+$} \mysbox{65,-8}{25,16}{$\displaystyle\frac{1\,2\,6}{3\,4\,5}$} \put(65,10){\makebox(25,8){weigh}} \mydbox{40,-125}{15,70}{$9^+$\\$10^+$\\$11^+$\\$12^+$\\$9^-$\\$10^-$\\$11^-$\\$12^-$} \mysbox{65,-98}{25,16}{$\displaystyle\frac{9\,10\,11}{1\,2\,3}$} \put(65,-80){\makebox(25,8){weigh}} % % 2nd arrows % \mythreevector{95,90}{1}{2}{15} \mythreevector{95,0}{1}{2}{15} \mythreevector{95,-90}{1}{2}{15} % nine intermediate states. top ones \mydbox{115,113}{35,14}{$1^+2^+5^-$} \mysbox{155,112}{25,16}{$\displaystyle\frac{1}{2}$} \mydbox{115,83}{35,14}{$3^+4^+6^-$} \mysbox{155,82}{25,16}{$\displaystyle\frac{3}{4}$} \mydbox{115,53}{35,14}{$7^-8^-$} \mysbox{155,52}{25,16}{$\displaystyle\frac{1}{7}$} % nine intermediate states. mid ones \mydbox{115,23}{35,14}{$6^+3^-4^-$} \mysbox{155,22}{25,16}{$\displaystyle\frac{3}{4}$} \mydbox{115,-7}{35,14}{$1^-2^-5^+$} \mysbox{155,-8}{25,16}{$\displaystyle\frac{1}{2}$} \mydbox{115,-37}{35,14}{$7^+8^+$} \mysbox{155,-38}{25,16}{$\displaystyle\frac{7}{1}$} % nine intermediate states. 
% bot ones
\mydbox{115,-67}{35,14}{$9^+10^+11^+$} \mysbox{155,-68}{25,16}{$\displaystyle\frac{9}{10}$}
\mydbox{115,-97}{35,14}{$9^-10^-11^-$} \mysbox{155,-98}{25,16}{$\displaystyle\frac{9}{10}$}
\mydbox{115,-127}{35,14}{$12^+12^-$} \mysbox{155,-128}{25,16}{$\displaystyle\frac{12}{1}$}
% 3rd arrows mainline
\mythreevector{185,60}{1}{1}{10}
\mythreevector{185,0}{1}{1}{10}
\mythreevector{185,-60}{1}{1}{10}
% other branch lines
\mythreevector{185,120}{1}{1}{10}
\mythreevector{185,90}{1}{1}{10}
\mythreevector{185,30}{1}{1}{10}
\mythreevector{185,-30}{1}{1}{10}
\mythreevector{185,-90}{1}{1}{10}
\mythreevector{185,-120}{1}{1}{10}
% final answers aligned at 200,x*10
\mydbox{200,126}{10,8}{$1^+$}
\mydbox{200,116}{10,8}{$2^+$}
\mydbox{200,106}{10,8}{$5^-$}
\mydbox{200,96}{10,8}{$3^+$}
\mydbox{200,86}{10,8}{$4^+$}
\mydbox{200,76}{10,8}{$6^-$}
\mydbox{200,66}{10,8}{$7^-$}
\mydbox{200,56}{10,8}{$\star$}% ---------- impossible outcome
\mydbox{200,46}{10,8}{$8^-$}
\mydbox{200,36}{10,8}{$4^-$}
\mydbox{200,26}{10,8}{$3^-$}
\mydbox{200,16}{10,8}{$6^+$}
\mydbox{200,6}{10,8}{$2^-$}
\mydbox{200,-4}{10,8}{$1^-$}% the middle, 0
\mydbox{200,-14}{10,8}{$5^+$}
\mydbox{200,-24}{10,8}{$7^+$}
\mydbox{200,-34}{10,8}{$\star$}
\mydbox{200,-44}{10,8}{$8^+$}
\mydbox{200,-54}{10,8}{$9^+$}
\mydbox{200,-64}{10,8}{$10^+$}
\mydbox{200,-74}{10,8}{$11^+$}
\mydbox{200,-84}{10,8}{$10^-$}
\mydbox{200,-94}{10,8}{$9^-$}
\mydbox{200,-104}{10,8}{$11^-$}
\mydbox{200,-114}{10,8}{$12^+$}
\mydbox{200,-124}{10,8}{$12^-$}
\mydbox{200,-134}{10,8}{$\star$}
\end{picture}
\end{center}
}{%
\caption[a]{An optimal solution to the weighing problem.
%
At each step there are two boxes: the left box shows which hypotheses are still possible; the right box shows the balls involved in the next weighing. The 24 hypotheses are written $1^+, % 2^+,\ldots,1^-,
\ldots, 12^-$, with, \eg, $1^+$ denoting that 1 is the odd ball and it is heavy.
Weighings are written by listing the names of the balls on the two pans, separated by a line; for example, in the first weighing, % $\displaystyle\frac{1\,2\,3\,4}{5\,6\,7\,8}$ denotes that balls 1, 2, 3, and 4 are put on the left-hand side and 5, 6, 7, and 8 on the right. In each triplet of arrows the upper arrow leads to the situation when the left side is heavier, the middle arrow to the situation when the right side is heavier, and the lower arrow to the situation when the outcome is balanced. The three points labelled $\star$ % arrows without subsequent boxes at the right-hand side correspond to impossible outcomes. %The total number of outcomes % of the weighing process is 24, which equals $3^3 - 3$, so we would expect % this ternary tree of depth three to have three spare branches. } \label{fig.weighing}\label{ex.weigh.sol} }% \end{figure} The answer is that at each step of an optimal procedure, the three outcomes (`left heavier', `right heavier', and `balance') are {\em as close as possible to equiprobable}. An optimal solution is shown in \figref{fig.weighing}. Suboptimal strategies, such as weighing balls 1--6 against 7--12 on the first step, do not achieve all outcomes with equal probability: these two sets of balls can never balance, so the only possible outcomes are `left heavy' and `right heavy'. % Similarly, strategies % that after an unbalanced initial result % do not mix together balls that might be heavy with balls that % might be light are incapable of giving one of the three outcomes. Such a binary outcome rules out only half of the possible hypotheses, so a strategy that uses such outcomes must sometimes take longer to find the right answer. % Some suboptimal strategies produce binary trees rather than ternary trees like % the one in \figref{fig.weighing}, and binary trees % are necessarily deeper than balanced ternary trees % with the same number of leaves. 
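
To make this comparison quantitative (a sketch, assuming that the 24 hypotheses are equally probable and measuring the information from a weighing by the entropy of its outcome; the subscripts below are ad hoc labels introduced only for this comparison): weighing four balls against four gives each of the three outcomes probability $8/24$, whereas weighing six against six gives `left heavy' and `right heavy' probability $12/24$ each and `balance' probability zero, so
\beq
 H_{4 \, {\rm v} \, 4} = \log_2 3 \simeq 1.58 \mbox{ bits}, \:\:\:\:
 H_{6 \, {\rm v} \, 6} = \log_2 2 = 1 \mbox{ bit}.
\eeq
Identifying one of 24 hypotheses requires $\log_2 24 \simeq 4.58$ bits, and three ternary outcomes can supply at most $3 \log_2 3 \simeq 4.75$ bits, so very little can be wasted.
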
The insight that the outcomes should be as near as possible to equiprobable makes it easier to search for an optimal strategy. The first weighing must divide the 24 possible hypotheses into three groups of eight. Then the second weighing must be chosen so that there is a 3:3:2 split of the hypotheses. Thus we might conclude: \begin{conclusionbox} {the outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform.} \end{conclusionbox} This conclusion agrees with the property of the entropy that you proved when you solved \exerciseref{ex.Hineq}: the entropy of an ensemble $X$ is biggest if all the outcomes have equal probability $p_i \eq 1/|\A_X|$. % for anyone who wants to play it against a machine: % http://y.20q.net:8095/btest % http://www.smalltime.com/dictator.html % http://www.guessmaster.com/ \subsection{Guessing games} In the game of \ind{twenty questions},\index{game!twenty questions} one player thinks of an object, and the other player attempts to guess what the object is by asking questions that have yes/no answers, for example, `is it alive?', or `is it human?' The aim is to identify the object with as few questions as possible. What is the best strategy for playing this game? For simplicity, imagine that we are playing the rather dull version of twenty questions called `sixty-three'. % % two hundred and fifty five'. % In this game, the permitted objects are the $2^6$ integers % $\A_X = \{ 0 , 1 , 2 , \dots 63 \}$. % One player selects an $x \in \A_X$, and we ask % questions that have yes/no answers in order to identify $x$. \exampl{example.sixtythree}{ {\sf The game `sixty-three'}. What's the smallest number of yes/no questions needed\index{game!sixty-three} to identify an integer $x$ between 0 and 63?\index{twenty questions} } Intuitively, the best questions successively divide the 64 possibilities into equal sized sets. Six questions suffice. 
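To see that six are also necessary (a short counting argument, assuming each answer is a plain yes or no): $k$ answers can distinguish at most $2^k$ possibilities, and
\beq
 2^5 = 32 < 64 = 2^6 ,
\eeq
so five questions cannot always identify $x$, whereas six questions that halve the set of candidates each time always can.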
One reasonable strategy asks the following questions:
%
% want a computer program environment here.
%
\begin{quote}
\begin{tabbing}
{\sf 1:} is $x \geq 32$? \\
{\sf 2:} is $x \mod 32 \geq 16$? \\
{\sf 3:} is $x \mod 16 \geq 8$? \\
{\sf 4:} is $x \mod 8 \geq 4$? \\
{\sf 5:} is $x \mod 4 \geq 2$? \\
{\sf 6:} is $x \mod 2 = 1$?
\end{tabbing}
\end{quote}
%
% I'd like to put this in a comment column on the right beside the 'code':
%
[The notation $x \mod 32$, pronounced `$x$ modulo 32', denotes the remainder when $x$ is divided by 32; for example, $35 \mod 32 = 3$ and $32 \mod 32 = 0$.]
The answers to these questions, if translated from $\{\mbox{yes},\mbox{no}\}$ to $\{{\tt{1}},{\tt{0}}\}$, give the binary expansion of $x$, for example $35 \Rightarrow {\tt{100011}}$.\ENDsolution\smallskip

What are the Shannon information contents of the outcomes in this example? If we assume that all values of $x$ are equally likely, then the answers to the questions are independent and each has
% entropy $H_2(0.5) = 1 \ubit$. The
Shannon information content
% of each answer is
$\log_2 (1/0.5) = 1 \ubit$; the total Shannon information gained is always six bits. Furthermore, the number $x$ that we learn from these questions is a six-bit binary number. Our questioning strategy defines a way of encoding the random variable $x$ as a binary file.

So far, the Shannon information content makes sense: it measures the length of a binary file that encodes $x$.
%
However, we have not yet studied ensembles where the outcomes have unequal probabilities. Does the Shannon information content make sense there too?
\fakesection{Submarine figure} % \newcommand{\subgrid}{\multiput(0,0)(0,10){9}{\line(1,0){80}}\multiput(0,0)(10,0){9}{\line(0,1){80}}} \newcommand{\sublabels}{ \put(-5,75){\makebox(0,0){\sf\tiny{A}}} \put(-5,65){\makebox(0,0){\sf\tiny{B}}} \put(-5,55){\makebox(0,0){\sf\tiny{C}}} \put(-5,45){\makebox(0,0){\sf\tiny{D}}} \put(-5,35){\makebox(0,0){\sf\tiny{E}}} \put(-5,25){\makebox(0,0){\sf\tiny{F}}} \put(-5,15){\makebox(0,0){\sf\tiny{G}}} \put(-5, 5){\makebox(0,0){\sf\tiny{H}}} % \put(75,-5){\makebox(0,0){\tiny{8}}} \put(65,-5){\makebox(0,0){\tiny{7}}} \put(55,-5){\makebox(0,0){\tiny{6}}} \put(45,-5){\makebox(0,0){\tiny{5}}} \put(35,-5){\makebox(0,0){\tiny{4}}} \put(25,-5){\makebox(0,0){\tiny{3}}} \put(15,-5){\makebox(0,0){\tiny{2}}} \put( 5,-5){\makebox(0,0){\tiny{1}}} } \newcommand{\misssixteen}{ \put(45,65){\makebox(0,0){$\times$}} \put(45,45){\makebox(0,0){$\times$}} \put(35,75){\makebox(0,0){$\times$}} \put(35,65){\makebox(0,0){$\times$}} \put(35,55){\makebox(0,0){$\times$}} \put(35,45){\makebox(0,0){$\times$}} \put(35,35){\makebox(0,0){$\times$}} \put(35,25){\makebox(0,0){$\times$}} \put(35,15){\makebox(0,0){$\times$}} \put(35, 5){\makebox(0,0){$\times$}} \put(25,75){\makebox(0,0){$\times$}} \put(25,65){\makebox(0,0){$\times$}} \put(25,55){\makebox(0,0){$\times$}} \put(25,45){\makebox(0,0){$\times$}} \put(25,35){\makebox(0,0){$\times$}} \put(25,25){\makebox(0,0){$\times$}} \put(25,15){\makebox(0,0){$\times$}} } \newcommand{\missthirtytwo}{ \put(75,75){\makebox(0,0){$\times$}} \put(75,65){\makebox(0,0){$\times$}} \put(75,55){\makebox(0,0){$\times$}} \put(75,45){\makebox(0,0){$\times$}} \put(75,35){\makebox(0,0){$\times$}} \put(75,25){\makebox(0,0){$\times$}} \put(75,15){\makebox(0,0){$\times$}} \put(75, 5){\makebox(0,0){$\times$}} \put(65,75){\makebox(0,0){$\times$}} \put(65,65){\makebox(0,0){$\times$}} \put(65,55){\makebox(0,0){$\times$}} \put(65,45){\makebox(0,0){$\times$}} \put(65,35){\makebox(0,0){$\times$}} \put(65,25){\makebox(0,0){$\times$}} 
\put(65,15){\makebox(0,0){$\times$}} \put(65, 5){\makebox(0,0){$\times$}} \put(55,75){\makebox(0,0){$\times$}} \put(55,65){\makebox(0,0){$\times$}} \put(55,55){\makebox(0,0){$\times$}} \put(55,45){\makebox(0,0){$\times$}} \put(55,35){\makebox(0,0){$\times$}} \put(55,25){\makebox(0,0){$\times$}} \put(55,15){\makebox(0,0){$\times$}} \put(55, 5){\makebox(0,0){$\times$}} \put(45,75){\makebox(0,0){$\times$}} %%\put(45,65){\makebox(0,0){$\times$}} \put(45,55){\makebox(0,0){$\times$}} %% \put(45,45){\makebox(0,0){$\times$}} \put(45,35){\makebox(0,0){$\times$}} \put(45,25){\makebox(0,0){$\times$}} \put(45,15){\makebox(0,0){$\times$}} \put(45, 5){\makebox(0,0){$\times$}} \put(5,65){\makebox(0,0){$\times$}} } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%% submarine figure %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{figure} \figuredangle{% \begin{center} %\begin{tabular}{l@{\hspace{-1mm}}*{5}{@{\hspace{2pt}}c}} \toprule \begin{tabular}{l@{\hspace{0mm}}*{5}{@{\hspace{8.5mm}}c}} \toprule % moves made & 1 & 2 & 32 & 48 & 49 \\ & % % 1 miss % % this fig actually needs extra width on left, but there is nothing there. 
\setlength{\unitlength}{0.26mm} \begin{picture}(80,95)(0,-10)\subgrid\sublabels \put(25,15){\makebox(0,0){$\times$}} \put(25,15){\circle{15}} \end{picture} & % % 2 miss % \setlength{\unitlength}{0.26mm} \begin{picture}(80,95)(0,-10)\subgrid \put(25,15){\makebox(0,0){$\times$}} \put(5,65){\makebox(0,0){$\times$}} \put(5,65){\circle{15}} \end{picture} & % % 32 miss % \setlength{\unitlength}{0.26mm} \begin{picture}(80,95)(0,-10)\subgrid \put(25,15){\makebox(0,0){$\times$}} \put(45,35){\circle{15}} \missthirtytwo \end{picture} & % % 49 miss % \setlength{\unitlength}{0.26mm} \begin{picture}(80,95)(0,-10)\subgrid \put(25,15){\makebox(0,0){$\times$}} \put(5,65){\makebox(0,0){$\times$}} \missthirtytwo \misssixteen \put(25,25){\circle{15}} \end{picture} & \setlength{\unitlength}{0.26mm} \begin{picture}(80,95)(0,-10)\subgrid \put(25,15){\makebox(0,0){$\times$}} \put(5,65){\makebox(0,0){$\times$}} \missthirtytwo \misssixteen %%%%%%%%%%%%%%%%%%%%%%% hit the submarine: \put(25,5){\circle{15}} \put(25,5){\makebox(0,0){\tiny\bf S}} \end{picture} \\ move \# & 1 & 2 & 32 & 48 & 49 \\ question & G3 & B1 & E5 & F3 & H3 \\ outcome & $x = {\tt n}$ % $(\times)$ & $x = {\tt n}$ %$(\times)$ & $x = {\tt n}$ %$(\times)$ & $x = {\tt n}$ %$(\times)$ & $x = {\tt y}$ %({\small\bf S}) \\[0.1in] $P(x)$ & $\displaystyle\frac{63}{64}$ & $\displaystyle\frac{62}{63}$ & $\displaystyle\frac{32}{33}$ & $\displaystyle\frac{16}{17}$ & $\displaystyle\frac{1}{16}$ \\[0.15in] $h(x)$ & 0.0227 & 0.0230 & 0.0443 % & 0.0430 -------- 0.9556 , just before 32 are pasted & 0.0874 & 4.0 \\[0.05in] Total info. & 0.0227 & 0.0458 & 1.0 & 2.0 & 6.0 \\ \bottomrule \end{tabular} \end{center} }{% \caption[a]{A game of {\tt submarine}. The submarine is hit on the 49th attempt.} \label{fig.sub} }% \end{figure} \subsection{The game of {\ind{submarine}}: how many bits can one bit convey?} In the game of {\ind{battleships}}, each player hides a fleet of ships in a sea represented by a square grid. 
On each\index{game!submarine} turn, one player attempts to hit the other's ships by firing at one square in the opponent's sea. The response to a selected square such as `G3' is either `miss', `hit', or `hit and destroyed'. In a % rather boring version of battleships called {\tt submarine}, each player hides just one submarine in one square of an eight-by-eight grid. \Figref{fig.sub} shows a few pictures of this game in progress: the circle represents the square that is being fired at, and the $\times$s show squares in which the outcome was a miss, $x={\tt{n}}$; the submarine is hit (outcome $x={\tt{y}}$ shown by the symbol $\bs$) on the 49th attempt. Each shot made by a player defines an ensemble. The two possible outcomes are $\{ {\tt{y}} ,{\tt{n}}\}$, corresponding to a hit and a miss, and their probabilities depend on the state of the board. At the beginning, $P({\tt{y}}) = \linefrac{1}{64}$ and $P({\tt{n}}) = \linefrac{63}{64}$. At the second shot, if the first shot missed, % enemy sub has not yet been hit, $P({\tt{y}}) = \linefrac{1}{63}$ and $P({\tt{n}}) = \linefrac{62}{63}$. At the third shot, if the first two shots missed, % enemy submarine has not yet been hit, $P({\tt{y}}) = \linefrac{1}{62}$ and $P({\tt{n}}) = \linefrac{61}{62}$. % According to the Shannon information content, t The Shannon information gained from an outcome $x$ is $h(x) = \log (1/P(x))$. % Let's investigate this assertion. If we are lucky, and hit the submarine on the first shot, then \beq h(x) = h_{(1)}({\tt y}) = \log_2 64 = 6 \ubits . \eeq Now, it might seem a little strange that one binary outcome can convey six bits. % , but it does make sense. W But we have learnt the hiding place, % where the submarine was, which could have been any of 64 squares; so we have, by one lucky binary question, indeed learnt six bits. What if the first shot misses? The Shannon information that we gain from this outcome is \beq h(x) = h_{(1)}({\tt n}) = \log_2 \frac{64}{63} = 0.0227 \ubits . 
\eeq
Does this make sense? It is not so obvious. Let's keep going. If our second shot also misses, the Shannon information content of the second outcome is
\beq
 h_{(2)}({\tt n}) = \log_2 \frac{63}{62} = 0.0230 \ubits .
\eeq
If we miss thirty-two times (firing at a new square each time), the total Shannon information gained is
\beqan
%\hspace*{-0.2in}
\lefteqn{ \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots + \log_2 \frac{33}{32} } \nonumber \\
 & \!\!\!=\!\!\! & 0.0227 + 0.0230 + \cdots + 0.0443 \:\:=\:\: 1.0 \ubits .
\eeqan
Why this round number? Well, what have we learnt? We now know that the submarine is not in any of the 32 squares we fired at; learning that fact is just like playing a game of \sixtythree\ (\pref{example.sixtythree}), asking as our first question `is $x$ one of the thirty-two numbers corresponding to these squares I fired at?', and receiving the answer `no'. This answer rules out half of the hypotheses, so it gives us one bit.
%It doesn't matter what the
% outcome might have been; all that matters is the probability
% of what actually happened.
After 48 unsuccessful shots, the information gained is 2 bits: the unknown location has been narrowed down to one quarter of the original hypothesis space.
What if we hit the submarine on the 49th shot, when there were 16 squares left? The Shannon information content of this outcome is
\beq
 h_{(49)}({\tt y}) = \log_2 16 = 4.0 \ubits .
\eeq
The total Shannon information content of all the outcomes is
\beqan
\lefteqn{ \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots +
% \log_2 \frac{33}{32} + \cdots +
 \log_2 \frac{17}{16} + \log_2 \frac{16}{1} } \nonumber \\
 &=& 0.0227 + 0.0230 + \cdots
% + 0.0443 + \cdots
 + 0.0874 + 4.0 \:\: =\:\: 6.0 \ubits .
\label{eq.sum.me}
\eeqan
So once we know where the submarine is, the total Shannon information content gained is 6 bits. This result holds regardless of when we hit the submarine.
If we hit it when there are $n$ squares left to choose from -- $n$ was 16 in \eqref{eq.sum.me} -- then the total information gained is:
\beqan
\lefteqn{ \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots + \log_2 \frac{n+1}{n} + \log_2 \frac{n}{1} } \nonumber \\
 &=& \log_2 \left[ \frac{64}{63} \times \frac{63}{62} \times \cdots \times \frac{n+1}{n} \times \frac{n}{1} \right]
%\times 63 \times \cdots \times (n+1) \times n}
% {63 \times 62 \times \cdots \times n \times 1}
 \:\:=\:\: \log_2 \frac{64}{1}\:\: =\:\: 6 \,\bits.
\eeqan
%
% add winglish here?
%
% follows in lecture 2, after submarine
%
% aim: introduce the language of Wenglish
% and demonstrate Shannon info content.
What have we learned from the examples so far? I think the {\tt submarine} example makes quite a convincing case for the claim that the Shannon information content is a sensible measure of information content. And the game of {\tt sixty-three} shows that the Shannon information content can be intimately connected to the size of a file that encodes the outcomes of a random experiment, thus suggesting a possible connection to data compression. In case you're not convinced, let's look at one more example.
\subsection{The \Wenglish\ language}
\label{sec.wenglish}
% [this section under construction]}
{\dem{\ind{\Wenglish}}} is a language similar to \ind{English}. \Wenglish\ sentences consist of words drawn at random from the \Wenglish\ dictionary, which contains $2^{15}=32$,768 words, all of length 5 characters. Each word in the \Wenglish\ dictionary was constructed
% by the \Wenglish\ language committee, who created each of those $32\,768$ words
at random by picking five letters from the probability distribution over {\tt a$\ldots$z} depicted in \figref{fig.monogram}.
% Since all words are five characters long %\begin{figure} %\figuremargin{ \marginfig{\small \begin{center} \begin{tabular}{rc} \toprule % & Word \\ \midrule 1 & {\tt{aaail}} \\ 2 & {\tt{aaaiu}} \\ 3 & {\tt{aaald}} \\ & $\vdots$ \\ 129 & {\tt{abati}} \\ & $\vdots$ \\ $2047$ & {\tt{azpan}} \\ $2048$ & {\tt{aztdn}} \\ & $\vdots$ \\ & $\vdots$ \\ $16\,384$ & {\tt{odrcr}} \\ & $\vdots$ \\ & $\vdots$ \\ $32\,737$ & {\tt{zatnt}} \\ & $\vdots$ \\ $32\,768$ & {\tt{zxast}} \\ \bottomrule \end{tabular} \end{center} %}{ \caption[a]{The \Wenglish\ dictionary.} \label{fig.wenglish} } %\end{figure} % 5366+1219+2602+2718+8377+1785+1280+3058+5903+70+800+3431+2319+5470+6526+1896+539+4660+5453+6767+3108+652+1388+765+1564+78 % 77794 Some entries from the dictionary are shown in alphabetical order in \figref{fig.wenglish}. Notice that the number of words in the \ind{dictionary} (32,768) is much smaller than the total number of possible words of length 5 letters, $26^5 \simeq 12$,000,000. Because the probability of the letter {{\tt{z}}} is about $1/1000$, only 32 of the words in the dictionary begin with the letter {\tt z}. In contrast, the probability of the letter {{\tt{a}}} is about $0.0625$, and 2048 of the words begin with the letter {\tt a}. Of those 2048 words, two start {\tt az}, and 128 start {\tt aa}. Let's imagine that we are reading a \Wenglish\ document, and let's discuss the Shannon \ind{information content} of the characters as we acquire them. If we are given the text one word at a time, the Shannon information content of each five-character word is $\log \mbox{32,768} = 15$ bits, since \Wenglish\ uses all its words with equal probability. The average information content per character is therefore 3 bits. Now let's look at the information content if we read the document one character at a time. If, say, the first letter of a word is {\tt a}, the Shannon information content is $\log 1/ 0.0625 \simeq 4$ bits. 
If the first letter is {\tt z}, the Shannon information content is $\log 1/0.001 \simeq 10$ bits. The information content is thus highly variable at the first character. The total information content of the 5 characters in a word, however, is exactly 15 bits; so the letters that follow an initial {\tt{z}} have lower average information content per character than the letters that follow an initial {\tt{a}}. A rare initial letter such as {\tt{z}} indeed conveys more information about what the word is than a common initial letter. Similarly, in English, if rare characters occur at the start of the word (\eg\ {\tt{xyl}\ldots}), then often we can identify the whole word immediately; whereas words that start with common characters (\eg\ {\tt{pro}\ldots}) require more characters before we can identify them. % Does this make sense? Well, in English, % the first few characters of a word do very often fully identify the whole word. % % {\em MORE HERE........} \section{Data compression} \index{data compression}\index{source code}The preceding examples justify the idea that the Shannon \ind{information content} of an outcome is a natural measure of its \ind{information content}. Improbable outcomes do convey more information than probable outcomes. We now discuss the information content of a source by considering how many bits are needed to describe the outcome of an experiment. % , that is, by studying {data compression}. If we can show that we can compress data from a particular source into a file of $L$ bits per source symbol and recover the data reliably, then we will say that the average information content of that source is at most % less than or equal to $L$ bits per symbol. % % cut Sat 13/1/01 % % We will show that, for any source, the information content of the source % is intimately related to its entropy. \subsection{Example: compression of text files} A file is composed of a sequence of bytes. 
A byte is composed of 8 bits\marginpar{\small\raggedright\reducedlead{Here we use the word `\ind{bit}' with its meaning, `a symbol with two values', not to be confused with the unit of information content.}} and can have a decimal value between 0 and 255. A typical text file is composed of the ASCII character set (decimal values 0 to 127). This character set uses only seven of the eight bits in a byte. \exercissxB{1}{ex.ascii}{ By how much could the size of a file be reduced given that it is an ASCII file? How would you achieve this reduction? } Intuitively, it seems reasonable to assert that an ASCII file contains $7/8$ as much information as an arbitrary file of the same size, since we already know one out of every eight bits before we even look at the file. This is a % very simple example of redundancy. Most sources of data have further redundancy: English text files use the ASCII characters with non-equal frequency; certain pairs of letters are more probable than others; and entire words can be predicted given the context and a semantic understanding of the text. % this par is repeated in l4. % compressibility. \subsection{Some simple data compression methods that define measures of information content} % % IDEA: connect back to opening % One way of measuring the information content of a random variable is simply to count the number of {\em possible\/} outcomes, $|\A_X|$. (The number of elements in a set $\A$ is denoted by $|\A|$.) If we gave a binary name to each outcome, the length of each name would be $\log_2 |\A_X|$ bits, if $|\A_X|$ happened to be a power of 2. We thus make the following definition. \begin{description}%%%% was: [Perfect information content] Raw bit content %%%%%%%%%%%%%%%%%%%%%%% see newcommands1.tex \item[The \perfectic] of $X$ is \beq H_0(X) = \log_2 |\A_X| . \eeq \end{description} $H_0(X)$ is a lower bound for the number of binary questions that are always guaranteed to identify an outcome from the ensemble $X$. 
It is an additive quantity: the \perfectic\ of an ordered pair $x,y$, having $|\A_X||\A_Y|$ possible outcomes, satisfies \beq H_0(X,Y)= H_0(X) + H_0(Y). \eeq This measure of information content does not include any probabilistic element, and the encoding rule it corresponds to does not `compress' the source data, it simply maps each outcome % source character to a constant-length binary string. \exercissxA{2}{ex.compress.possible}{ Could there be a compressor that maps an outcome $x$ to a binary code $c(x)$, and a decompressor that maps $c$ back to $x$, such that {\em every possible outcome\/} is compressed into a binary code of length {\em shorter\/} than $H_0(X)$ bits? } Even though a simple counting argument\index{compression!of {\em any\/} file} shows that it is impossible to make a reversible compression program that reduces the size of {\em all\/} files, amateur compression enthusiasts frequently announce that they have invented a program that can do this -- indeed that they can further compress compressed files by putting them through their compressor several\index{compression!of already-compressed files}\index{myth!compression} times. Stranger yet, patents have been granted to these modern-day \ind{alchemists}. See the {\tt{comp.compression}} frequently asked questions % \verb+http://www.faqs.org/faqs/compression-faq/part1/+ for further reading.\footnote{\tt{http://sunsite.org.uk/public/usenet/news-faqs/comp.compression/}} %\footnote{\verb+http://www.lib.ox.ac.uk/internet/news/faq/+} % ............by_category.compression-faq.html+} % http://www.faqs.org/faqs/compression-faq/part1/preamble.html There are only two ways in which a `compressor' can actually compress files: \ben \item A {\dem lossy\/} compressor compresses some\index{compression!lossy} files, but maps some files % {\em distinct\/} files are mapped to the {\em same\/} encoding. 
We'll assume that the user requires perfect recovery of the source file, so the occurrence of one of these confusable files leads to a failure (though in applications such as \ind{image compression}, lossy compression is viewed as satisfactory). We'll denote by $\delta$ the probability that the source string is one of the confusable files, so a lossy compressor\index{error probability!in compression} has a probability $\delta$ of failure. If $\delta$ can be made very small then a lossy compressor may be practically useful. \item A {\dem lossless} compressor maps all files to different encodings; if it % f a lossless compressor shortens some files,\index{compression!lossless} it necessarily {\em makes others longer}. We try to design the compressor so that the probability that a file is lengthened is very small, and the probability that it is shortened is large. \een In this chapter we discuss a simple lossy compressor. In subsequent chapters we discuss lossless compression methods. % \section{Information content defined in terms of lossy compression} % Whichever type of compressor we construct, we need somehow to take into account the {\em probabilities\/} of the different outcomes. Imagine comparing the information contents of two text files -- one in which all 128 ASCII characters are used with equal probability, and one in which the characters are used with their frequencies in English text. %: $P(x={\tt e})=$, % $P(x={\tt e})=$, $P(x={\tt e})=$,$P(x={\tt e})=$,$P(x={\tt e})=$, \ldots % $P(x={\tt e})=$, \ldots. % only the characters {\tt 0} and {\tt 1} are used. Can we define a measure of information content that distinguishes between these two files? Intuitively, the latter file contains less information per character because it is more predictable. %And a file of {\tt 0}s % and {\tt 1}s in which nearly all the characters are {\tt 0}s % conveys even less information. % Maybe introducing 0 and 1 is nto a good idea. 
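As a concrete check that the \perfectic\ cannot tell these two files apart (a short illustration, taking the alphabet of both files to be the 128 ASCII characters): the \perfectic\ counts only the possible outcomes, so for either file
\beq
 H_0(X) = \log_2 128 = 7 \ubits
\eeq
per character, whatever the character frequencies are. Any measure that distinguishes the two files must therefore make use of the probabilities.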
% At this point I start talking in terms of compression.
% How can we include a probabilistic element?
One simple way to use our knowledge that some symbols have a smaller probability is to imagine recoding the observations into a smaller alphabet -- thus losing the ability to encode some of the more improbable symbols -- and then measuring the \perfectic\ of the new alphabet.
% choice here - could either map multiple symbols onto
% one, so the compression is lossy,
% or could define no entry at all for some symbols, so compression
% fails.
% The general mapping situation is not ideal since I really want all
% the losers to be mapped to one symbol. Student might imagine mapping
% Z and z to Z, Y and y to Y.. and claim they are losing little info.
% But this messes up the defn of delta.
For example, we might take a risk when compressing English text, guessing that the most infrequent characters won't occur, and make a reduced ASCII code that omits the characters
% for example,
% `\verb+!+', `\verb+@+', `\verb+#+',
% `\verb+$+', `\verb+%+', `\verb+^+', `\verb+*+', `\verb+~+',
% `\verb+<+', `\verb+>+', `\verb+/+', `\verb+\+', `\verb+_+',
% `\verb+{+', `\verb+}+', `\verb+[+', `\verb+]+',
% and `\verb+|+',
$\{$ \verb+!+, \verb+@+, \verb+#+,
% \verb+$+, $
\verb+%+, \verb+^+, \verb+*+, \verb+~+, \verb+<+, \verb+>+, \verb+/+, \verb+\+, \verb+_+, \verb+{+, \verb+}+, \verb+[+, \verb+]+, \verb+|+ $\}$, thereby reducing the size of the alphabet
% the total number of characters
by seventeen.
%
% cut this dec 2000
% Thus we can give new %%%% a (not necessarily unique)
% names to a {\em subset\/} of the possible outcomes and count how many names we
% use.
The larger the risk we are willing to take, the smaller our final alphabet becomes.
% ] the number of names we need.
% We thus relax the exhaustive requirement of the definition of
%
% aside
%
% We could imagine doing this to the numbers coming out of the guessing
% game with which this chapter started, for example.
% It seems
% quite unlikely that the subject would have to guess 25, 26 or 27 times
% to get the next letter; these outcomes %%`27' is
% are very improbable,
% and we might be willing to record the sequence of numbers using
% 24 symbols only, taking the gamble that in fact more guesses might
% be needed.
We introduce a parameter $\delta$ that describes the risk we are taking when using this compression method: $\delta$ is the probability that there will be no name for an outcome $x$.
\exampl{exHdelta}{
Let
\beq
\begin{array}{l*{14}{@{\,}c}}
 & \A_X & = & \{ & {\tt a},& {\tt b},&{\tt c},&{\tt d},&{\tt e},&{\tt f},&{\tt g},&{\tt h} & \}, \\
\mbox{and }\:\: & \P_X & = & \bigl\{ & \frac{1}{4} ,& \frac{1}{4} ,& \frac{1}{4} ,& \frac{3}{16} ,& \frac{1}{64} ,& \frac{1}{64} ,& \frac{1}{64} ,& \frac{1}{64} & \bigr\} .
\end{array}
\eeq
The \perfectic\ of this ensemble is 3 bits, corresponding to 8 binary names. But notice that $P( x \in \{ {\tt a}, {\tt b}, {\tt c}, {\tt d} \} ) = 15/16$. So if we are willing to run a risk of $\delta=1/16$ of not having a name for $x$, then we can get by with four names -- half as many names as are needed if every $x \in \A_X$ has a name.
Table \ref{fig.delta.examples} shows binary names that could be given to the different outcomes in the cases $\delta = 0$ and $\delta = 1/16$. When $\delta=0$ we need 3 bits to encode the outcome; when $\delta=1/16$ we need only 2 bits.
% %\begin{figure}[htbp] %\figuremargin{% \amargintab{b}{ \begin{center} \begin{tabular}{cc} \toprule \multicolumn{2}{c}{$\delta = 0$} \\ \midrule $x$ & $c(x)$ \\ \midrule {\tt a} & {\tt{000}} \\ {\tt b} & {\tt{001}} \\ {\tt c} & {\tt{010}} \\ {\tt d} & {\tt{011}} \\ {\tt e} & {\tt{100}} \\ {\tt f} & {\tt{101}} \\ {\tt g} & {\tt{110}} \\ {\tt h} & {\tt{111}} \\ \bottomrule \end{tabular} % \hspace{0.61in} \hspace{0.1in} \begin{tabular}{cc} \toprule \multicolumn{2}{c}{$\delta = 1/16$} \\ \midrule $x$ & $c(x)$ \\ \midrule {\tt a} & {\tt{00}} \\ {\tt b} & {\tt{01}} \\ {\tt c} & {\tt{10}} \\ {\tt d} & {\tt{11}} \\ {\tt e} & $-$ \\ {\tt f} & $-$ \\ {\tt g} & $-$ \\ {\tt h} & $-$ \\ \bottomrule \end{tabular} \end{center} %}{% \caption[a]{Binary names for the outcomes, for two failure probabilities $\delta$.} \label{fig.delta.examples} \label{tab.twosillycodes} }% %\end{figure} } %\noindent Let us now formalize this idea. %%\index{source code} % To make a compression strategy with risk $\delta$, % we consider all subsets $T$ of the alphabet $\A_X$ and % seek out we make the smallest possible subset $S_{\delta}$ such that the probability that $x$ is not in $S_{\delta}$ is less than or equal to $\delta$, \ie, $P(x \not\in S_{\delta} ) \leq \delta$. For each value of $\delta$ we can then define a new measure of information content -- the log of the size of this smallest subset $S_{\delta}$. [In ensembles in which several elements have the same probability, there may be several smallest subsets that contain different elements, but all that matters is their sizes (which are equal), so we will not dwell on this ambiguity.] % worry about this possibility. \begin{description} \item[The smallest $\delta$-sufficient subset] $S_{\delta}$ is the smallest subset of $\A_X$ satisfying \beq P(x \in S_{\delta} ) \geq 1 - \delta. 
\eeq %\beq % S_{\delta} = \argmin %\eeq \end{description} The subset $S_{\delta}$ can be constructed by ranking the elements of $\A_X$ in order of decreasing probability and adding successive elements starting from the most probable elements % front of the list until the total probability is $\geq (1\!-\!\delta)$. We can make a data compression code by assigning a binary name to each element of the smallest sufficient subset. % (\tabref{tab.twosillycodes}). This compression scheme motivates the following measure of information content: \begin{description} \item[The \essentialic] of $X$ is: %%%%% was ESSENTIAL information content % consider risk-delta bit content? \beq H_{\delta}(X) = \log_2 |S_{\delta}| . % = \log_2 \min \left\{ |S| : S\subseteq \A_X, %% P(S)\geq 1-\delta \right\}. % P(x \in S)\geq 1-\delta \right\}. \eeq \end{description} Note that $H_0(X)$ is the special case of $H_{\delta}(X)$ with $\delta = 0$ (if $P(x) > 0$ for all $x \in \A_X$). % [{\sf Caution:} do not confuse $H_0(X)$ and $H_{\delta}(X)$ with the function $H_2(p)$ displayed in \figref{fig.h2}.] %%%%%%%(Should I change notation to avoid confusion?) % \newcommand{\gapline}{\cline{1-4}\cline{6-9}} \begin{figure} \figuremargin{% \begin{center} \footnotesize% \begin{tabular}{rc} (a)& \hspace*{-0.2in}\input{Hdelta/Sdelta/X.tex}\\ (b)& \mbox{\makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X)$}}\hspace{-5mm}% \psfig{figure=Hdelta/byhand/X.ps,% width=70mm,angle=-90}$\delta$}% \\ \end{tabular} \end{center} }{% \caption[a]{(a) The outcomes of $X$ (from \protect\exampleref{exHdelta}), ranked by their probability. (b) The \essentialic\ $H_{\delta}(X)$. The labels on the graph show the smallest sufficient set as a function of $\delta$. Note $H_0(X) = 3$ bits and $H_{1/16}(X) = 2$ bits. } \label{fig.hd.1} } \end{figure} %\noindent {\Figref{fig.hd.1} shows $H_{\delta}(X)$ for the ensemble of \exampleonlyref{exHdelta} as a function of $\delta$. 
} \subsection{Extended ensembles} % The compression method we're studying in which a subset of % outcomes are given binary names is not giving us a % measure of information content for a single symbol. % % sanjoy wants a motivation here. % Is this compression method any more useful if we compress {\em blocks\/} of symbols from a source?\index{source code!block code}\index{ensemble!extended}\index{extended ensemble} % We now turn to examples where the outcome $\bx = (x_1,x_2,\ldots, x_N)$ is a string of $N$ independent identically distributed random variables from a single ensemble $X$. We will denote by % $\bX$ or $X^N$ the ensemble $( X_1, X_2, \ldots, X_N )$. % for which $\bx$ is the random variable. Remember that entropy is additive for independent variables (\exerciseref{ex.Hadditive}), % \footnote{There should have been an exercise on this by now.} so % $H(\bX) = N H(X)$. $H(X^N) = N H(X)$. \exampl{ex.Nfrom.1}{ % {\sf Example 2:} Consider a string of $N$ flips of a \ind{bent coin}\index{coin}, $\bx = (x_1,x_2,\ldots, x_N)$, where $x_n \in \{{\tt{0}},{\tt{1}}\}$, with probabilities $p_0 \eq 0.9,$ $p_1 \eq 0.1$. The most probable strings $\bx$ are those with most {\tt{0}}s. If $r(\bx)$ is the number of {\tt{1}}s in $\bx$ then \beq % |p_0,p_1 P(\bx) = p_0^{N-r(\bx)} p_1^{r(\bx)} . \eeq To evaluate $H_{\delta}(X^N)$ we must find the smallest sufficient subset $S_{\delta}$. This subset will contain all $\bx$ with $r(\bx) = 0, 1, 2, \ldots,$ up to some $r_{\max}(\delta)-1$, and some of the $\bx$ with $r(\bx) = r_{\max}(\delta)$. % Working backwards, we can evaluate the cumulative probability % $P(r(\bx) \leq r)$ and evaluate the size of the subset $T(r): \{ \bx: % r(\bx) \leq r \}$. %\beq % |T(r)| = \sum_{r=0}^{r} \frac{N!}{(N-r)!r!} %\label{l2.T} %\eeq %\beq % P(r(\bx) \leq r) = \sum_{r=0}^{r} \frac{N!}{(N-r)!r!} p_0^{N-r} p_1^{r} %\label{l2.Pr} %\eeq % We can then plot $\log |T(r)|$ versus $P(r(\bx) \leq r)$. 
This defines % a graph of $H_{\delta}(\bX)$ against $\delta$. Figures \ref{fig.hd.4} and \ref{fig.hd.10} % Figure \ref{fig.hd.4} show graphs of $H_{\delta}(X^N)$ against $\delta$ for the cases $N=4$ and $N=10$. The steps are the values of $\delta$ at which $|S_{\delta}|$ changes by~1, and the cusps where the slope of the staircase changes are the points where $r_{\max}$ changes by 1. } \exercissxC{2}{ex.cusps}{ What are the mathematical shapes of the curves between the cusps? } % , both with $p_1 = % 0.1$. The points defined by equations (\ref{l2.T}) and (\ref{l2.Pr}) % are the cusps in the curve. % % I think this figure may be sick. CHECK IT. % \renewcommand{\gapline}{\cline{1-3}\cline{5-8}} \begin{figure} \figuremargin{% % % this table done by hand with help of (above hd.p command) /home/mackay/itp/Hdelta> more figs/4.tex % \begin{center} \footnotesize% \begin{tabular}{r@{\hspace*{-0.3in}}c} (a)& %%%%%%%% written by hand see also X.tex % % picture of Sdelta for X^4 % \newcommand{\axislevel}{24} \newcommand{\axislevelp}{29.5} \newcommand{\axislevelm}{21} \newcommand{\axislevelmm}{18} \newcommand{\forestgap}{-0.7} \newcommand{\forest}[3]{\multiput(#1)(\forestgap,0){#2}{\line(0,1){#3}}} % % % \setlength{\unitlength}{2.2pt}% \begin{picture}(155,50)(-143,-20)% adjusted vertical height from 50 to 60 Sat 5/10/02. 
% And put back again Sun 22/12/02 was (-143,-22) Sun 22/12/02
% - log P = 2.0 , 2.4 and 6.0
\forest{-6.1,0}{1}{16}% heights fictitious
\forest{-37.3,0}{4}{12.5}%
\forest{-68.5,0}{6}{9.4}% 69.5
\forest{-100.8,0}{4}{6.3}%
\forest{-132.9,0}{1}{4.2}%
% axis:
\put(-143,\axislevelm){\vector(1,0){151.0}}
%
% axis labels
\put(5,\axislevelp){\makebox(0,0)[b]{\small$\log_2 P(x)$}}
\put(0,\axislevel){\makebox(0,0)[b]{\small$0$}}
\put(-20,\axislevel){\makebox(0,0)[b]{\small$-2$}}
\put(-40,\axislevel){\makebox(0,0)[b]{\small$-4$}}
\put(-60,\axislevel){\makebox(0,0)[b]{\small$-6$}}
\put(-80,\axislevel){\makebox(0,0)[b]{\small$-8$}}
\put(-100,\axislevel){\makebox(0,0)[b]{\small$-10$}}
\put(-120,\axislevel){\makebox(0,0)[b]{\small$-12$}}
\put(-140,\axislevel){\makebox(0,0)[b]{\small$-14$}}
%
% this box is right size for the whole set
%\put(0,-2.5){\framebox(140,\axislevelm){}}
%\put(142,13){\makebox(0,0)[l]{\small$S_0$}}
% this box is round 3 clumps
\put(-83.5,-2.5){\framebox(83.5,\axislevelm){}}
\put(-84.5,13){\makebox(0,0)[r]{\small$S_{0.01}$}}
% a smaller box round 3 clumps
%\put(2.5,-1){\framebox(81,\axislevelmm){}}
%
\put(-53.5,-1){\framebox(51,\axislevelmm){}}
\put(-54.5,13){\makebox(0,0)[r]{\small$S_{0.1}$}}
%
% object labels
\put(-6.1,-12){\makebox(0,0)[t]{\footnotesize{\tt 0000}}}
\put(-37.7,-12){\makebox(0,0)[t]{\footnotesize${\tt 0010},{\tt 0001},\ldots$}}
\put(-69.5,-12){\makebox(0,0)[t]{\footnotesize${\tt 0110},{\tt 1010},\ldots$}}
\put(-101.2,-12){\makebox(0,0)[t]{\footnotesize${\tt 1101},{\tt 1011},\ldots$}}
\put(-132.9,-12){\makebox(0,0)[t]{\footnotesize{\tt 1111}}}
\multiput(-6.1,-10)(-31.6,0){5}{\vector(0,1){5}}
\end{picture}
%
%
(b)& \makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X^4)$}}\hspace{-5mm}%
\psfig{figure=Hdelta/figs/hd/4.ps,%
width=65mm,angle=-90}$\delta$%%
%
% useful for making table:
% hd.p mmin=4 mmax=4 mstep=6 scale_by_n=0 plot_sub_graphs=1 latex=1
%
\end{tabular}
\end{center}
}{%
%
% I think this figure may be sick. CHECK IT.
% \caption[a]{(a) The sixteen outcomes of the ensemble $X^4$ with $p_1=0.1$, ranked by probability. (b) The \essentialic\ $H_{\delta}(X^4)$. The upper schematic diagram indicates the strings' probabilities by the vertical lines' lengths (not to scale).} \label{fig.hd.4} }% \end{figure} % % % \begin{figure}%[htbp] \figuremargin{% \begin{center} \mbox{%%%%%%%%%%%%% (twocol) %}\\ \mbox{ \makebox[0in][r]{\raisebox{1.3in}{$H_{\delta}(X^{10})$}}\hspace{-5mm}% \psfig{figure=Hdelta/figs/hd/10.ps,% width=65mm,angle=-90}$\delta$} % command, in Hdelta: % hd.p mmin=4 mmax=10 mstep=6 scale_by_n=0 plot_sub_graphs=1 | gnuplot \end{center} }{% \caption[a]{$H_{\delta}(X^N)$ for $N=10$ binary variables with $p_1=0.1$.} \label{fig.hd.10} }% \end{figure} For the examples shown in figures \ref{fig.hd.1}--\ref{fig.hd.10}, $H_{\delta}(X^N)$ depends strongly on the value of $\delta$, so it might not seem a fundamental or useful definition of information content. But we will consider what happens as $N$, the number of independent variables in $X^N$, increases. We will find the remarkable result that $H_{\delta}(X^N)$ becomes almost independent of $\delta$ -- and for all $\delta$ it is very close to $N H(X)$, where $H(X)$ is the entropy of one of the random variables. % sketch? \begin{figure} \figuremargin{% \begin{center} \mbox{\makebox[0in][r]{\raisebox{1.3in}{$\frac{1}{N}H_{\delta}(X^{N})$}}\hspace{-5mm}% \psfig{figure=Hdelta/figs/hd/all.10.1010.ps,% width=65mm,angle=-90}$\delta$} \end{center} }{% \caption[a]{$\frac{1}{N} H_{\delta}(X^{N})$ for $N=10, 210, \dots,1010$ binary variables with $p_1=0.1$.} \label{fig.hd.10.1010} } \end{figure} \Figref{fig.hd.10.1010} illustrates this asymptotic tendency for the binary ensemble of example \ref{ex.Nfrom.1}. % discussed earlier with $N$ binary variables with $p_1 = 0.1$. As $N$ increases, $\frac{1}{N} H_{\delta}(X^N)$ becomes an increasingly flat function, except for tails close to $\delta=0$ and $1$. 
% The limiting value of the plateau is $H(X) = 0.47$.
% We will explain and prove this result in the remainder of
% this chapter. Let's first note the implications of this result.
% The limiting value of the plateau, which for $N$ binary variables with $p_1 = 0.1$
% appears to be about 0.5, defines how much compression is possible:
% $N$ binary variables with $p_1 = 0.1$ can be compressed into
% about $N/2$ bits, with a probability of error $\delta$ which
% can be any value between 0 and 1.
% We will show that the plateau value to which $\frac{1}{N} H_{\delta}(X^N)$
% tends, for large $N$, is the entropy, $H(X)$.
%
% IDEA: Box this next sentence?
%
As long as we are allowed a tiny probability of error $\delta$,
compression down to $NH$ bits is possible.
Even if we are allowed a large probability of error,
we still can compress only down to $NH$ bits.
%
% IDEA: Box above?
%
This is the \ind{source coding theorem}.
% \subsection{The theorem}
\begin{ctheorem}
\label{thm.sct}
{\sf Shannon's source coding theorem.}
% HOW TO NAME THIS?????????????????
% this name is taken later
Let $X$ be an ensemble with entropy $H(X) = H$ bits.
Given $\epsilon>0$ and $0<\delta<1$, there exists a positive integer $N_0$
such that for $N>N_0$,
\beq
 \left| \frac{1}{N} H_{\delta}(X^N) - H \right| < \epsilon.
\eeq
\end{ctheorem}
%
% sanjoy wants explan here
%
% The reason that increasing $N$ helps is that, if $N$ is large,
% the outcome $\bx$

\section{Typicality}
Why does increasing $N$ help?\indexs{typicality}
Let's examine long strings from $X^N$.
Table \ref{tab.typical.tcl} shows fifteen samples from $X^N$
for $N=100$ and $p_1=0.1$.
\begin{figure}
\figuremargin{%
\begin{center}
\begin{tabular}{lr} \toprule
$\bx$ &
% \multicolumn{1}{c}{$\log_2(P(\bx))$}
\hspace{-0.3in}{$\log_2(P(\bx))$}
% {\rule[-3mm]{0pt}{8mm}}%strut
\\ \midrule
% REQUIRE MONOSPACED FONT!!!
{\tinytt{%VERB ...1...................1.....1....1.1.......1........1...........1.....................1.......11...%END }} & $-$50.1 \\ {\tinytt{%VERB ......................1.....1.....1.......1....1.........1.....................................1....%END }} & $-$37.3 \\ {\tinytt{%VERB ........1....1..1...1....11..1.1.........11.........................1...1.1..1...1................1.%END }} & $-$65.9 \\ {\tinytt{%VERB 1.1...1................1.......................11.1..1............................1.....1..1.11.....%END }} & $-$56.4 \\ {\tinytt{%VERB ...11...........1...1.....1.1......1..........1....1...1.....1............1.........................%END }} & $-$53.2 \\ {\tinytt{%VERB ..............1......1.........1.1.......1..........1............1...1......................1.......%END }} & $-$43.7 \\ {\tinytt{%VERB .....1........1.......1...1............1............1...........1......1..11........................%END }} & $-$46.8 \\ {\tinytt{%VERB .....1..1..1...............111...................1...............1.........1.1...1...1.............1%END }} & $-$56.4 \\ {\tinytt{%VERB .........1..........1.....1......1..........1....1..............................................1...%END }} & $-$37.3 \\ {\tinytt{%VERB ......1........................1..............1.....1..1.1.1..1...................................1.%END }} & $-$43.7 \\ {\tinytt{%VERB 1.......................1..........1...1...................1....1....1........1..11..1.1...1........%END }} & $-$56.4 \\ {\tinytt{%VERB ...........11.1.........1................1......1.....................1.............................%END }} & $-$37.3 \\ {\tinytt{%VERB .1..........1...1.1.............1.......11...........1.1...1..............1.............11..........%END }} & $-$56.4 \\ {\tinytt{%VERB ......1...1..1.....1..11.1.1.1...1.....................1............1.............1..1..............%END }} & $-$59.5 \\ {\tinytt{%VERB 
............11.1......1....1..1............................1.......1..............1.......1.........%END }} & $-$46.8 \\ \midrule % [0.2in] % {\tinytt{%VERB ....................................................................................................%END }} & $-$15.2 \\ {\tinytt{%VERB 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111%END }} & $-$332.1\\ % \bottomrule \end{tabular} \end{center} }{% \caption[a]{The top 15 strings are samples from $X^{100}$, where $p_1 = 0.1$ and $p_0 = 0.9$. The bottom two are the most and least probable strings in this ensemble. The final column shows the % Compare the log-probabilities of the random strings, which may be compared with the entropy % with % the \aep: $H(X) = 0.469$, so $H(X^{100}) = 46.9$ bits.} \label{tab.typical.tcl} } \end{figure} % 1000 Typical set size +/- 28.46 has log_2(P(x)) within +/- 90.22 % i.e. 1/N (logp) is within 0.090 % 100 Typical set size +/- 9 has log_2(P(x)) within +/- 28.53 % i.e. 
1/N(logp) is within 0.285 % 200 Typical set size +/- 12.73 has log_2(P(x)) within +/- 40.35 % % N=100 alternative (see hd.p for the commands) % \begin{figure} \fullwidthfigureright{ %\figuremargin{% \begin{center} \begin{tabular}{r@{\hspace*{-0in}}c@{\hspace*{-0.1in}}c} \toprule & $N=100$ & $N=1000$ \\ \midrule \raisebox{0.71in}{\small$n(r) = {N \choose r}$} & \mbox{\psfig{figure=Hdelta/figs/num/100.ps,% width=50mm,angle=-90}} & \mbox{\psfig{figure=Hdelta/figs/num/1000.ps,% width=50mm,angle=-90}} \\ \raisebox{0.71in}{\small$P(\bx) = p_1^r (1-p_1)^{N-r}$} & \mbox{\psfig{figure=Hdelta/figs/per/100.ps,% width=50mm,angle=-90}}% \makebox[0in][r]{\raisebox{0.4in}{% \psfig{figure=Hdelta/figs/perdet/100.ps,% width=30mm,angle=-90}}\hspace{0.2in}} & \\ \raisebox{0.71in}{\small$\log_2 P(\bx)$} & \mbox{\psfig{figure=Hdelta/figs/logper/100.ps,% width=50mm,angle=-90}} & \mbox{\psfig{figure=Hdelta/figs/logper/1000.ps,% width=50mm,angle=-90}} \\ \raisebox{0.71in}{\small$n(r)P(\bx)= {N \choose r} p_1^r (1-p_1)^{N-r}$} & \mbox{\psfig{figure=Hdelta/figs/tot/100.ps,% width=50mm,angle=-90}} & \mbox{\psfig{figure=Hdelta/figs/tot/1000.ps,% width=50mm,angle=-90}} % \makebox[0in][l]{$r$} \\ & $r$ & $r$ \\ \bottomrule \end{tabular} \end{center} }{% \caption[a]{Anatomy of the typical set $T$. For $p_1=0.1$ and $N=100$ and $N=1000$, these graphs show $n(r)$, the number of strings containing $r$ {\tt{1}}s; the probability $P(\bx)$ of a single string that contains $r$ {\tt{1}}s; the same probability on a log scale; and the total probability $n(r)P(\bx)$ of all strings that contain $r$ {\tt{1}}s. The number $r$ is on the horizontal axis. The plot of $\log_2 P(\bx)$ also shows by a dotted line the mean value of $\log_2 P(\bx) = -N H_2(p_1)$, which equals $-46.9$ when $N=100$ and $-469$ when $N=1000$. The typical set includes only the strings that have $\log_2 P(\bx)$ close to this value. 
The range marked {\sf T} shows the set $T_{N \beta}$
(as defined in \protect\sectionref{sec.ts}) for $N=100$ and $\beta = 0.29$ (left)
and $N=1000$, $\beta = 0.09$ (right).
}
\label{fig.num.per.tot}
}%
\end{figure}

The probability of a string $\bx$ that contains $r$ {\tt{1}}s and $N\!-\!r$ {\tt{0}}s is
\beq
 P(\bx) = p_1^r (1-p_1)^{N-r} .
\eeq
The number of strings that contain $r$ {\tt{1}}s is
\beq
 n(r) = {N \choose r} .
\eeq
So the number of {\tt{1}}s, $r$, has a binomial distribution:
\beq
 P(r) = {N \choose r} p_1^r (1-p_1)^{N-r} .
\eeq
These functions are shown in \figref{fig.num.per.tot}.
The mean of $r$ is $N p_1$, and its standard deviation is
$\sqrt{N p_1 (1-p_1)}$ (\pref{sec.first.binomial}).
If $N$ is 100 then
\beq
 r \sim N p_1 \pm \sqrt{N p_1 (1-p_1)} \simeq 10 \pm 3 .
\eeq
If $N=1000$ then
\beq
 r \sim 100 \pm 10 .
\eeq
Notice that as $N$ gets bigger, the probability distribution of $r$ becomes
more concentrated, in the sense that while the range of possible values of $r$
grows as $N$, the standard deviation of $r$ grows only as $\sqrt{N}$.
That $r$ is most likely to fall in a small range of values implies
that the outcome $\bx$ is also most likely to fall in a corresponding
small subset of outcomes that we will call the {{\dbf\inds{typical set}}}.

\subsection{Definition of the typical set}
\label{sec.ts}
% Let us generalize our discussion to an arbitrary ensemble $X$
% with alphabet $\A_X$
% and define typicality.
Let us define \ind{typicality}\index{typical set!for compression}
for an arbitrary ensemble $X$ with alphabet $\A_X$.
Our definition of a typical string will involve the string's probability.
A long string
% message
of $N$ symbols will usually contain
% with high probability
about $p_1N$ occurrences of the first symbol,
$p_2N$ occurrences of the second, etc.
Hence the probability of this string
% long message
is roughly
\beq
 P(\bx)_{\rm typ} = P(x_1)P(x_2)P(x_3) \ldots P(x_N)
 \simeq p_1^{(p_1N)} p_2^{(p_2N)} \ldots p_I^{(p_IN)}
\eeq
% p_i^{p_iN}
so that the information content of a typical string is
\beq
 \log_2 \frac{1}{P(\bx)} \, \simeq \, N \sum_i p_i \log_2 \frac{1}{p_i}
% \simeq \,
 = \, N H .
\eeq
So the random variable $\log_2 \!\dfrac{1}{P(\bx)}$,
% So the random variable $\frac{1}{N} \log_2 \frac{1}{P(\bx)}$,
% which is the average information content per symbol, is
which is the information content of $\bx$, is
very likely to be close in value to $N H$.
We build our definition of typicality on this observation.

We define the typical elements of $\A_X^N$ to be those elements that have
probability close to $2^{-NH}$.
(Note that the typical set, unlike the
% best subset for compression
smallest sufficient subset, does {\em not\/} include the most probable
elements of $\A_X^N$, but we will show that these most probable elements
contribute negligible probability.)

We introduce a parameter $\beta$ that defines how close the probability
has to be to $2^{-NH}$ for an element to be `typical'.
% $\beta$-
We call the set of typical elements the typical set,
% $T$, or, to be more precise,
$T_{N \beta}$:
% , where the parameter $\beta$
%% controls the breadth of the typical set by defining
% defines what we mean by a probability `close' to $2^{-NH}$:
\beq
 T_{N\b} \equiv \left\{ \bx\in\A_X^N :
 \left| \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right| < \b \right\} .
\label{eq.TNb}
\eeq
%
% check whether < has propagated to all necessary places
%
We will show that whatever value of $\beta$ we choose, the typical set
contains almost all the probability as $N$ increases.
This important result is sometimes called the
{\dem `asymptotic equipartition' principle}.\index{asymptotic equipartition}
% \newpage
%\section{`Asymptotic Equipartition' and Source Coding}
\label{sec.aep}
% We will prove the following result:
\begin{description}
\item[`Asymptotic equipartition' principle\puncspace]
% (AEP).]
 For an ensemble of $N$ independent identically distributed (\ind{i.i.d.})
 random variables $X^N \equiv ( X_1, X_2, \ldots, X_N )$,
 with $N$ sufficiently large, the outcome $\bx = (x_1,x_2,\ldots, x_N)$
 is almost certain to belong to a subset of $\A_X^N$ having only
 $2^{N H(X)}$ members, each having probability `close to' $2^{-N H(X)}$.
\end{description}
Notice that if $H(X) < H_0(X)$ then $2^{N H(X)}$ is a {\em tiny\/} fraction
of the number of possible outcomes $|\A_X^N|=|\A_X|^N=2^{N H_0(X)}.$

\begin{aside}
 The term \ind{equipartition} is chosen to describe the idea that the
 members of the typical set have {\em roughly equal\/} probability.
 [This should not be taken too literally, hence my use of quotes around
 `asymptotic equipartition';
% in the phrase \aep;
 see page \pageref{sec.aep.caveat}.]

 A second meaning for equipartition, in thermal \ind{physics}, is the idea
 that each degree of freedom of a classical system has {equal\/} average
 energy, $\half kT$. This second meaning is not intended here.
\end{aside}
%
The \aep\ is equivalent to:
\begin{description}
\item[Shannon's source coding theorem (verbal statement)\puncspace]
 $N$ i.i.d.\ random variables each with entropy $H(X)$ can be compressed
 into more than $NH(X)$ bits with negligible risk of information loss,
 as $N\rightarrow \infty$; conversely if they are compressed into fewer than
 $NH(X)$ bits it is virtually certain that information will be lost.
\end{description}
These two theorems are equivalent because we can define a compression
algorithm that gives a distinct name of length $N H(X)$ bits to each $\bx$
in the typical set.
% probable subset.
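
As a rough worked illustration of these sizes (a sketch only: the choice
$N=1000$ is made here purely for illustration, and the count $2^{NH(X)}$ is
meaningful only to within a factor of order $2^{\pm N\beta}$), take the
binary ensemble with $p_1 = 0.1$ used in this chapter's figures.
Its entropy is
\beq
 H(X) = 0.1 \log_2 \frac{1}{0.1} + 0.9 \log_2 \frac{1}{0.9} \simeq 0.469 \mbox{ bits} ,
\eeq
so for $N=1000$ the typical set has roughly $2^{NH(X)} \simeq 2^{469}$ members,
each of which can be given a name of about 469 bits, compared with the
$N H_0(X) = 1000$ bits needed to name an arbitrary string.
As a fraction of all $2^{1000}$ strings, the typical set is tiny:
\beq
 \frac{2^{N H(X)}}{2^{N H_0(X)}} \simeq 2^{469 - 1000} = 2^{-531} .
\eeq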
% as follows: % enumerate the $\bx$ belonging to % the subset of $2^{N H(X)}$ equiprobable outcomes as 000\ldots000, % 000\ldots001, etc. \begin{figure} \figuredangle{% \begin{center} %%%%%%%% written by hand see also X.tex % % picture of Sdelta for X^100 % \newcommand{\axislevel}{27} \newcommand{\axislevelp}{32.5} \newcommand{\axislevelm}{24} \newcommand{\axislevelmm}{21} \newcommand{\forestgap}{-0.4} \newcommand{\forestgab}{-0.6} \newcommand{\forestgac}{-0.56} \newcommand{\forestgad}{-0.52} \newcommand{\forestgae}{-0.48} \newcommand{\forestgaf}{-0.44} % \newcommand{\forestgag}{0.48} %\newcommand{\forestgap}{0.35} was .35 when I went up to 14. \newcommand{\forest}[3]{\multiput(#1)(\forestgap,0){#2}{\line(0,1){#3}}} \newcommand{\foresb}[4]{\multiput(#1)(#4,0){#2}{\line(0,1){#3}}} % % picture % %\setlength{\unitlength}{2.45pt}% \setlength{\unitlength}{2.87pt}% \begin{picture}(170,81)(-170,-42) \forest{0,0}{1}{16.5}% \foresb{-5,0}{2}{16}{\forestgab} \foresb{-10,0}{3}{15.5}{\forestgab} \foresb{-15,0}{4}{15}{\forestgac} \foresb{-20,0}{5}{14.5}{\forestgad} \foresb{-25,0}{6}{14}{\forestgae} \foresb{-30,0}{7}{13.5}{\forestgaf} \foresb{-35,0}{8}{13}{\forestgap} \foresb{-40,0}{9}{12.5}{\forestgap} \forest{-45,0}{10}{12}% \forest{-50,0}{11}{11.5}% \forest{-55,0}{12}{11}% \forest{-60,0}{12}{10.5}% \forest{-65,0}{12}{10}% \forest{-70,0}{12}{9.5}% \forest{-75,0}{12}{9}% \forest{-80,0}{12}{8.5}% \forest{-85,0}{12}{8}% \forest{-90,0}{12}{7.5}% \forest{-95,0}{12}{7}% \forest{-100,0}{12}{6.5}% \forest{-105,0}{12}{6}% \forest{-110,0}{12}{5.5}% \forest{-115,0}{11}{5}% \forest{-120,0}{10}{4.5}% \foresb{-125,0}{9}{4.2}{\forestgap} \foresb{-130,0}{8}{3.9}{\forestgap} \foresb{-135,0}{7}{3.6}{\forestgaf} \foresb{-140,0}{6}{3.3}{\forestgae} \foresb{-145,0}{5}{3.0}{\forestgad} \foresb{-150,0}{4}{2.7}{\forestgac} \foresb{-155,0}{3}{2.4}{\forestgab} \foresb{-160,0}{2}{2.1}{\forestgab} \forest{-165,0}{1}{1.8}% % % axis: \put(-168,\axislevelm){\vector(1,0){171.0}} % % axis labels 
\put(0,\axislevelp){\makebox(0,0)[br]{\small$\log_2 P(x)$}} \put(-42.4,\axislevel){\makebox(0,0)[b]{\small$-NH(X)$}} % tic mark (was at -40 until Tue 8/1/02) \put(-42.4,\axislevelm){\line(0,1){2}} % the S0 box %\put(-3,-2.5){\framebox(172,\axislevelm){}} %\put(142,16){\makebox(0,0)[l]{$S_0$}} % % % typical set box \put(-49.5,-1){\framebox(15,\axislevelmm){}} \put(-51,16){\makebox(0,0)[r]{$T_{N\b}$}} % % object labels \put(0,-40){\vector(0,1){35}} \put(-15,-35){\vector(0,1){30}} %\put(26,-30){\vector(0,1){25}} \put(-36,-25){\vector(0,1){20}} \put(-46,-20){\vector(0,1){15}} %\put(56,-15){\vector(0,1){10}} \put(-155,-10){\vector(0,1){5}} \put( 0,-40){\makebox(0,0)[tr]{\footnotesize{{\tt 0000000000000}\ldots{\tt{00000000000}}}}} \put(-15,-35){\makebox(0,0)[tr]{\footnotesize{{\tt 0001000000000}\ldots{\tt{00000000000}}}}} %\put(26,-30){\makebox(0,0)[tl]{\footnotesize{{\tt 0000001000000}\ldots{\tt{00000010000}}}}} \put(-36,-25){\makebox(0,0)[tr]{\footnotesize{{\tt 0100000001000}\ldots{\tt{00010000000}}}}} \put(-46,-20){\makebox(0,0)[tr]{\footnotesize{{\tt 0000100000010}\ldots{\tt{00001000010}}}}} %\put(56,-15){\makebox(0,0)[tl]{\footnotesize{{\tt 0100001000100}\ldots{\tt{00010100100}}}}} \put(-155,-10){\makebox(0,0)[tl]{\footnotesize{{\tt 1111111111110}\ldots{\tt{11111110111}}}}} \end{picture} % % % % \end{center} }{% \caption[a]{Schematic diagram showing all strings in the ensemble $X^{N}$ % with $p_0 = 0.9, p_1=0.1$ % of large length $N$ ranked by their probability, and the typical set $T_{N\b}$.} \label{fig.typical.set.explain} }% \end{figure} \section{Proofs} \label{sec.chtwoproof} This section may be skipped if found tough going. \subsection{The law of large numbers} Our proof of the source coding theorem uses the \ind{law of large numbers}. 
\begin{description}
% \item[A random variable $u$] is any real function of $x$,
\item[Mean and variance] of a real random variable $u$
%\footnote
 are $\Exp[u] = \bar{u} = \sum_u P(u) u$ and
 $\var(u) = \sigma^2_u = \Exp[(u-\bar{u})^2] = \sum_u P(u) (u - \bar{u})^2.$
\begin{aside}
 Technical note: strictly I am assuming here that $u$ is a function $u(x)$
 of a sample $x$ from a finite discrete ensemble $X$. Then the summations
 $\sum_u P(u) f(u)$ should be written $\sum_x P(x) f(u(x))$.
 This means that $P(u)$ is a finite sum of delta functions.
 This restriction guarantees that the mean and variance of $u$ do exist,
 which is not necessarily the case for general $P(u)$.
\end{aside}
\item[Chebyshev's inequality 1\puncspace]
 Let $t$ be a non-negative real random variable, and\index{Chebyshev inequality}
 let $\a$ be a positive real number. Then\index{inequality}
\beq
 P(t \geq \a) \:\leq\: \frac{\bar{t}}{\a}.
\label{eq.cheb.1}
\eeq
 {\sf Proof:} $P(t \geq \a) = \sum_{t \geq \a} P(t)$.
 We multiply each term by $t/\a \geq 1$ and obtain:
 $P(t \geq \a) \leq \sum_{t \geq \a} P(t) t/\a.$
 We add the (non-negative) missing terms and obtain:
 $P(t \geq \a) \leq \sum_{t} P(t) t/\a = \bar{t}/\a$.
\hfill$\epfsymbol$\par
\item[Chebyshev's inequality 2\puncspace]
 Let $x$ be a random variable, and let $\a$ be a positive real number. Then
\beq
 P\left( (x-\bar{x})^2 \geq \a \right) \:\leq\: \sigma^2_x / \a.
\eeq
 {\sf Proof:} Take $t = (x-\bar{x})^2$ and apply the previous proposition.
\hfill$\epfsymbol$\par
\item[Weak \ind{law of large numbers}\puncspace]
 Take $x$ to be the average of $N$ independent random variables
 $h_1, \ldots , h_N$, having common mean $\bar{h}$ and common variance
 $\sigma^2_h$: $x = \frac{1}{N} \sum_{n=1}^N h_n$. Then
\beq
 P( (x-\bar{h})^2 \geq \a ) \leq \sigma^2_h/(\a N).
\eeq
 {\sf Proof:} obtained by showing that $\bar{x}=\bar{h}$ and that
 $\sigma^2_x = \sigma^2_h/ N$.
\hfill$\epfsymbol$\par
\end{description}
We are interested in $x$ being very close to the mean ($\a$ very small).
No matter how large $\sigma^2_h$ is, and no matter how small the required
$\a$ is, and no matter how small the desired probability that
$(x-\bar{h})^2 \geq \a$, we can always achieve it by taking $N$ large enough.

\subsection{Proof of theorem \protect\ref{thm.sct} (\pref{thm.sct})}
% the source coding theorem}
% or could say theorem 1
We apply the law of large numbers to the random variable
$\frac{1}{N} \log_2 \frac{1}{P(\bx)}$ defined for $\bx$ drawn from the
ensemble $X^N$. This random variable can be written as the average of $N$
information contents $h_n = \log_2 ( 1 / P(x_n))$, each of which is a random
variable with mean $H = H(X)$ and variance
$\sigma^2 \equiv \var[ \log_2 ( 1 / P(x_n)) ]$.
(Each term $h_n$ is the Shannon information content of the $n$th outcome.)

We again define the typical set with parameters $N$ and $\beta$ thus:
\beq
 T_{N\b} = \left\{ \bx\in\A_X^N :
 \left[ \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right]^2 < \b^2 \right\} .
\label{eq.TNb.2}
\eeq
For all $\bx \in T_{N\b}$, the probability of $\bx$ satisfies
\beq
 2^{-N(H+\b)} < P(\bx) < 2^{-N(H-\b)}.
\eeq
And by the law of large numbers (applied with $\a = \b^2$),
\beq
 P(\bx \in T_{N\b}) \geq 1 - \frac{\sigma^2}{\b^2 N} .
\eeq
We have thus proved the \aep. As $N$ increases, the probability that $\bx$
falls in $T_{N\b}$ approaches 1, for any $\beta$.
How does this result relate to source coding?
% We will prove the \aep\ first; then w

We must relate $T_{N\b}$ to $H_{\delta}(X^N)$. We will show that for any
given $\delta$ there is a sufficiently big $N$ such that
$H_{\delta}(X^N) \simeq N H$.

\subsubsection{Part 1: $\frac{1}{N} H_{\delta}(X^N) < H + \epsilon$.}
% of the source coding theorem.
%
% More words here reminding what H_delta is
%
The set $T_{N\b}$ is not the best subset for compression.
So the size of $T_{N\b}$ gives an upper bound on $H_{\delta}$.
We show how {\em small} $H_{\delta}(X^N)$ must be by calculating
% the largest cardinality that $T_{N\b}$ could have.
how big $T_{N\b}$ could possibly be.
We are free to set $\beta$ to any convenient value.
The smallest possible probability that a member of $T_{N\b}$ can have is
$2^{-N(H+\b)}$, and the total probability
% that $T_{N\b}$ contains
contained by $T_{N\b}$ can't be any bigger than 1. So
\beq
 |T_{N\b}| \, 2^{-N(H+\b)} < 1 ,
\eeq
that is, the size of the typical set is bounded by
% so we can bound
\beq
 |T_{N\b}| < 2^{N(H+\b)} .
\eeq
% BEWARE bad page break here
If we set $\b = \epsilon$ and $N_0$ such that
$\frac{\sigma^2}{\epsilon^2 N_0} \leq \delta$, then,
for all $N \geq N_0$,
$P(T_{N\b}) \geq 1 - \delta$, and the set $T_{N\b}$ becomes a witness to
the fact that
$H_{\delta}(X^N) \leq \log_2 | T_{N\b} | < N ( H + \epsilon)$.
%
\amarginfig{b}{
{\footnotesize
\setlength{\unitlength}{1.2mm}
\begin{picture}(40,40)(-5,0)
\put(5,5){\makebox(0,0)[bl]{\psfig{figure=figs/gallager/Hdeltaconcept.eps,width=36mm}}}
\put(5,35){\makebox(0,0){$\smallfrac{1}{N} H_{\delta}(X^N)$}}
\put(5,27){\makebox(0,0)[r]{$H_0(X)$}}
\put(5,4){\makebox(0,0)[t]{$0$}}
\put(30,4){\makebox(0,0)[t]{$1$}}
\put(35,4){\makebox(0,0)[t]{$\delta$}}
\put(33,11){\makebox(0,0)[l]{$H-\epsilon$}}
\put(33,15){\makebox(0,0)[l]{$H$}}
\put(33,19){\makebox(0,0)[l]{$H+\epsilon$}}
\end{picture}
}
\caption[a]{Schematic illustration of the two parts of the theorem.
Given any $\delta$ and $\epsilon$, we show that for large enough $N$, $\frac{1}{N} H_{\delta}(X^N)$ lies (1) below the line $H+\epsilon$ and (2) above the line $H-\epsilon$.} \label{fig.Hd.schem} } \subsubsection{Part 2: $\frac{1}{N} H_{\delta}(X^N) > H - \epsilon$.} % of the source coding theorem.} % % needs work ,sanjoy says: % % (jan 99)_ % Imagine that someone claims this second part is not so -- that, for any $N$, the smallest $\delta$-sufficient subset $S_{\delta}$ is smaller than the above inequality would allow. % They claim that % $|S_{}| \leq 2^{N(H-\epsilon)}$ and $P(\bx \in S_{}) % \geq 1 - \delta$. We can make use of our typical set to show that they must be mistaken. Remember that we are free to set $\beta$ to any value we choose. We will set $\beta = \epsilon/2$, so that our task is to prove that a % that an alternative {\em smaller\/} subset $S'_{}$ having $|S'_{}| \leq 2^{N(H-2\beta)}$ and achieving $P(\bx \in S'_{}) \geq 1 - \delta$ cannot exist (for $N$ greater than an $N_0$ that we will specify). %(We attach the % prime to $S$ to denote the fact that this is a conjectured smallest subset.) So, let us consider the probability of falling in this rival smaller subset $S'_{}$. The probability of the subset $S'_{}$ is\marginpar[t]{% \begin{center} \raisebox{-0.5in}[0in][0in]{ %%%%%%%% written by hand Sun 22/12/02 % % Venn picture % % \setlength{\unitlength}{0.321pt}% {\begin{picture}(452,215)(-173,-132)% % axis labels \put(-100,39){\makebox(0,0)[r]{\small$T_{N\b}$}} \put(100,39){\makebox(0,0)[l]{\small$S'$}} \thinlines \put(-33,-1){\circle{126}} \thicklines \put(33,-1){\circle{126}} \thinlines \put(18,-85){\vector(-1,4){18}} \put(33,-90){\makebox(0,0)[t]{\small$ S'_{} \cap T_{N\b} $}} \put(105,-51){\vector(-1,1){40}} \put(112,-39){\makebox(0,0)[tl]{\small$ S'_{} \cap \overline{T_{N\b}} $}} \end{picture}} % % % % \end{center}} \beq P(\bx \in S'_{}) \,=\, P(\bx \in S' \! \cap \! T_{N\b}) + P(\bx \in S'_{} \!\cap\! 
 \overline{T_{N\b}}),
\eeq
where $\overline{T_{N\b}}$ denotes the complement
$\{ \bx \not \in T_{N\b}\}$.
The maximum value of the first term is found if $S'_{} \cap T_{N\b} $
contains $2^{N(H-2\beta)}$ outcomes all with the maximum probability,
$2^{-N(H-\beta)}$.
The maximum value the second term can have is $P( \bx \not \in T_{N\b})$. So:
\beq
 P(\bx \in S'_{}) \, \leq \, 2^{N(H-2\beta)} \, 2^{-N(H-\beta)}
 + \frac{\sigma^2}{\b^2 N}
 = 2^{-N \b} + \frac{\sigma^2}{\b^2 N} .
\eeq
We can now set $\b = \epsilon/2$ and choose $N_0$ large enough that,
for all $N > N_0$, $2^{-N \b} + \frac{\sigma^2}{\b^2 N} < 1 - \delta$.
Then $P(\bx \in S'_{}) < 1- \delta$, which shows that $S'$ cannot satisfy
the definition of a sufficient subset $S_{\delta}$.
Thus {\em any\/} subset $S'$ with size $|S'| \leq 2^{N(H-\epsilon)}$
has probability less than $1-\delta$, so by the definition of $H_\delta$,
$H_{\delta}(X^N) > N ( H - \epsilon)$.

% this sentence used to be below at
% hereherehere
Thus for large enough $N$, the function $\frac{1}{N} H_{\delta}(X^N)$ is
essentially a constant function of $\delta$, for $0 < \delta < 1$,
as illustrated in figures \ref{fig.hd.10.1010} and \ref{fig.Hd.schem}.
\hfill $\Box$
%% NOTE
% oleg suggested the part 2 should say
% ``|T_Nb| >= S_d because S_delta is minimal
% (not because S \in T)''

\section{Comments}
The source coding theorem (\pref{thm.sct}) has two parts,
$\frac{1}{N} H_{\delta}(X^N) < H + \epsilon$, and
$\frac{1}{N} H_{\delta}(X^N) > H - \epsilon$.
% $H -\frac{1}{N} H_{\delta}(X^N)< \epsilon$.
Both results are interesting.

The first part tells us that even if the probability of error $\delta$ is
extremely small, the
% average
number of bits per symbol $\frac{1}{N} H_{\delta}(X^N)$ needed to specify
a long $N$-symbol string $\bx$ with vanishingly small error probability
does not have to exceed $H+ \epsilon$ bits.
We need to have only a tiny tolerance for error, and the number of bits
required drops significantly from $H_0(X)$ to $(H + \epsilon)$.

What happens if we are yet more tolerant to compression errors?
Part 2 tells us that even if $\delta$ is very close to 1, so that errors
are made most of the time, the average number of bits per symbol needed to
specify $\bx$ must still be at least $H - \epsilon$ bits.
These two extremes tell us that regardless of our specific allowance for
error, the number of bits per symbol needed to specify $\bx$ is
% boils down to
$H$ bits; no more and no less.

\medskip
% hereherehere
%In section 2.4.2 `$\epsilon$ can decrease with increasing $N$'. I'd prefer
%something like $N$ increases with decreasing $\epsilon$', since $N$
%depends on $\epsilon$ and not vice versa -- if I got it right.
% caution warning

\subsection{Caveat regarding `asymptotic equipartition'}
\label{sec.aep.caveat}
\index{caution!equipartition}I put the words `asymptotic equipartition' in
quotes because it is important not
to\index{asymptotic equipartition!why it is a misleading term}
% be misled into
think that the elements of the typical set $T_{N\beta}$ really do have
roughly the same probability as each other.
They are similar in probability only in the sense that their values of
$\log_2 \frac{1}{P(\bx)}$ are within $2 N \beta$ of each other.
Now, as $\beta$ is decreased, how does $N$ have to increase, if we are to
keep our bound on the mass of the typical set,
$P(\bx \in T_{N\beta}) \geq 1 - \frac{\sigma^2}{\beta^2 N}$, constant?
% CHANGED 9802:
% Since $\beta$ can decrease %scales
% with increasing
$N$ must grow as $1/ \beta^2$, so, if we write $\beta$ in terms of $N$ as
$\alpha/\sqrt{N}$, for some constant $\alpha$, then the most probable string
in the typical set will be of order $2^{\alpha \sqrt{N}}$ times greater than
the least probable string in the typical set.
As $\beta$ decreases, $N$ increases, and this ratio $2^{\alpha \sqrt{N}}$
grows exponentially.
Thus we have `equipartition' only in a weak sense!
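
As a rough numerical illustration of this caveat (a sketch under stated
assumptions: we take the binary ensemble with $p_1=0.1$ and the setting
$N=1000$, $\beta=0.09$ shown in \figref{fig.num.per.tot}; the variance
$\sigma^2$ below is worked out here for this example and is not quoted from
elsewhere in the text), note that the information content
$\log_2 ( 1 / P(x_n))$ takes just two values, so
\beq
 \sigma^2 = p_1 (1-p_1) \left[ \log_2 \frac{1-p_1}{p_1} \right]^2
 \simeq 0.09 \times (3.17)^2 \simeq 0.90 .
\eeq
The bound on the mass of the typical set is then
\beq
 P(\bx \in T_{N\beta}) \geq 1 - \frac{\sigma^2}{\beta^2 N}
 \simeq 1 - \frac{0.90}{(0.09)^2 \times 1000} \simeq 0.89 ,
\eeq
while the probabilities of two strings in this typical set may differ by
a factor of nearly
\beq
 2^{2 N \beta} = 2^{180} \simeq 1.5 \times 10^{54} .
\eeq
So a set that captures most of the probability can still contain strings
whose probabilities are very far from equal.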
% relative

\subsection{Why did we introduce the typical set?}
The best choice of subset for block compression is (by definition)
$S_{\delta}$, not a typical set. So why did we bother introducing the
typical set? The answer is, {\em we can count the typical set}.
We know that all its elements have `almost identical' probability
($2^{-NH}$), and we know the whole set has probability almost 1, so the
typical set must have roughly $2^{NH}$ elements.
Without the help of the typical set (which is very similar to $S_{\delta}$)
it would have been hard to count how many elements there are in $S_{\delta}$.

%\section{Summary and overview}
%\section{Where next}
% We have established that the entropy $H(X)$ measures
% the average information content of an ensemble.
%%
% In this chapter we discussed a lossy {block}-compression scheme that
% used large blocks of fixed size.
% In the next chapter we discuss variable length compression schemes that are
% practical for small block sizes and that are not lossy.
%%
%
\section{Exercises}
% weighing problems in here
% ITPRNN Problem 1a
%
\subsection*{Weighing problems}
%
\exercisaxB{1}{ex.weighexplain}{
 While some people, when they first encounter the weighing problem with
 12 balls and the three-outcome balance (\exerciseref{ex.weigh}), think that
 weighing six balls against six balls is a good first weighing, others say
 `no, weighing six against six conveys {\em no\/} information at all'.
 Explain to the second group why they are both right and wrong.
 Compute the information gained about {\em which is the odd ball\/}, and
 the information gained about
 {\em which is the odd ball and whether it is heavy or light}.
}
\exercisaxB{2}{ex.weighthirtynine}{
 Solve the weighing problem for the case where there are 39 balls of which
 one is known to be odd.
}
\exercisaxB{2}{ex.binaryweigh}{
 You are given 16 balls, all of which are equal in weight except for one
 that is either heavier or lighter.
 You are also given a bizarre two-pan balance that can report only two
 outcomes: `the two sides balance' or `the two sides do not balance'.
 Design a strategy to determine which is the odd ball
 {in as few uses of the balance as possible}.
}
\exercisaxB{2}{ex.flourforty}{
 You have a two-pan balance; your job is to weigh out bags of flour with
 integer weights 1 to 40 pounds inclusive. How many weights do you need?
 [You are allowed to put weights on either pan. You're only allowed to put
 one flour bag on the balance at a time.]
}
\exercissxC{4}{ex.twelve.generalize.weigh}{
\ben
\item% {ex.weigh}
 Is it possible to solve \exerciseref{ex.weigh} (the weighing problem with
 12 balls and the three-outcome balance) using a sequence of three
 {\em fixed\/} weighings, such that the balls chosen for the second weighing
 do not depend on the outcome of the first, and the third weighing does not
 depend on the first or second?
\item
 Find a solution to the general $N$-ball weighing problem in which exactly
 one of $N$ balls is odd. Show that in $W$ weighings, an odd ball can be
 identified from among $N = (3^W - 3 )/2$ balls.
%How large can $N$ be if you are allowed $W$ weighings?
% How are the weighings arranged in the case of the largest $N$?
\een
}
\exercisaxC{3}{ex.twelve.two.weigh}{
 You are given 12 balls and the three-outcome balance of
 \exerciseonlyref{ex.weigh}; this time, {\em two} of the balls are odd;
 each odd ball may be heavy or light, and we don't know which.
 We want to identify the odd balls and in which direction they are odd.
\ben
\item
 {\em Estimate\/} how many weighings are required by the optimal strategy.
 And what if there are three odd balls?
%\item
% How do your answers change if it is known in advance that
% the odd balls will all have the same bias (all heavy, or all light)?
\item
 How do your answers change if it is known that all the regular balls weigh
 100\grams, that light balls weigh 99\grams, and heavy ones weigh 110\grams?
\een } % end weighing \subsection*{Source coding with a lossy compressor, with loss $\delta$} \exercissxB{2}{ex.Hd46}{ % Let ${\cal P}_X = \{ 0.4,0.6 \}$. Sketch $\frac{1}{N} H_{\delta}(X^N)$ % as a function of $\delta$ for $N=1,2$ and 100. Let ${\cal P}_X = \{ 0.2,0.8 \}$. Sketch $\frac{1}{N} H_{\delta}(X^N)$ as a function of $\delta$ for $N=1,2$ and 1000. } \exercisaxB{2}{ex.Hd55}{ Let ${\cal P}_Y = \{ 0.5,0.5 \}$. Sketch $\frac{1}{N} H_{\delta}(Y^N)$ as a function of $\delta$ for $N=1,2,3$ and 100. } \exercissxB{2}{ex.HdSB}{ (For \ind{physics} students.) Discuss the relationship % similarities between the proof of the \aep\ and the equivalence\index{entropy!Gibbs}\index{entropy!Boltzmann} (for large systems) of the \ind{Boltzmann entropy} and the \ind{Gibbs entropy}.} \subsection*{Distributions that don't obey the law of large numbers} % % Cauchy distbn here? The \ind{law of large numbers}, which we used in this chapter, shows that the mean of a set of $N$ i.i.d.\ random variables has a probability distribution that becomes % more concentrated narrower, with width $\propto 1/\sqrt{N}$, as $N$ increases. However, we have proved this property only for discrete random variables, that is, for real numbers taking on a {\em finite\/} set of possible values. While many random variables with continuous probability distributions also satisfy the law of large numbers, there are important distributions that do not. Some continuous distributions do not have a mean or variance. \exercissxB{3}{ex.cauchy}{ Sketch the \ind{Cauchy distribution}\index{distribution!Cauchy} \beq P(x) = \frac{1}{Z} \frac{1}{x^2 + 1} , \:\:\:\: x \in (-\infty,\infty). \eeq What is its normalizing constant $Z$? Can you evaluate its mean or variance? Consider the sum $z=x_1 + x_2$, where $x_1$ and $x_2$ are independent random variables from a Cauchy distribution. What is $P(z)$? What is the probability distribution of the mean of $x_1$ and $x_2$, $\bar{x}=(x_1+x_2)/2$? 
What is the probability distribution of the mean of $N$ samples from this {Cauchy distribution}? } % \subsection{Other asymptotic properties} % Levy flights too? \exercisaxC{3}{ex.chernoff}{ {\sf\ind{Chernoff bound}.} We derived the weak law of large numbers from Chebyshev's inequality\index{Chebyshev inequality} (\ref{eq.cheb.1}) by letting the random variable $t$ in the inequality $%\beq P(t \geq \a) \:\leq\: \bar{t}/\a %\label{eq.cheb.1a} $ be a function, $t = (x-\bar{x})^2$, of the random variable $x$ we were interested in. Other useful inequalities can be obtained by using other functions. The \ind{Chernoff bound}, which is useful\index{bound} for bounding the \ind{tail}s of a distribution, is obtained by letting $t = \exp( s x)$. Show that \beq P( x \geq a ) \leq e^{-sa} g(s) , \:\:\:\mbox{ for any $s>0$ } \eeq and \beq P( x \leq a ) \leq e^{-sa} g(s) , \:\:\:\mbox{ for any $s<0$ } \eeq where $g(s)$ is the moment-generating function of $x$, \beq g(s) = \sum_x P(x) \, e^{sx} . \eeq % % Hence show that if $z$ is a sum of $N$ random variables $x$, %\beq % P( z \geq a ) \leq %\eeq } % end % \subsection*{Curious functions related to $p \log 1/p$} % SOLN - BORDERLINE \exercissxE{4}{ex.fxxxxx}{ This exercise has {no purpose at all}; it's included for the enjoyment of those who like mathematical curiosities. Sketch the function \beq f(x) = x^{x^{x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}}} % f(x) = x^{x^{x^{x^{x^{\ddots}}}}} \eeq for $x \geq 0$. % To be explicit about the order in which the powers are evaluated, % here's another definition of $f$: %\beq % f(x) = x^{\left(x^{\left(x^{\cdot^{\cdot^{\cdot}}}\right)}\right)} %\eeq {\sf Hint:} Work out the inverse function to $f$ -- that is, the function $g(y)$ such that if $x=g(y)$ then $y=f(x)$ -- it's closely related to $p \log 1/p$. % {\sf Hints:} %\ben %\item Consider $f(\sqrt{2})$: % you might be able to persuade yourself % that $f(\sqrt{2})=2$. You might also be able % to persuade yourself that $f(\sqrt{2})=4$. 
What's going on? % [Yes, a two-valued function.] %\item % For a given $x$, if $f(x)=y$, then we have $y = x^{y}$, so % $y$ is found at the intersection of the curves $u_1(y)=x^y$ and $u_2(y)=y$. %\item % Work out the inverse function to $f$ -- that is, the function $g(y)$ % such that if $x=g(y)$ then $y=f(x)$ -- hint: it's closely related to % $p \log 1/p$. %\een } \dvips %\chapter{The Source Coding Theorem (old version of this Chapter)} %\label{ch.two.old} %\input{tex/_l2old.tex} %\dvips \section{Solutions}% to Chapter \protect\ref{ch.two}'s exercises} \fakesection{_s2} % chapter 2 % ex 39... % \soln{ex.Hadditive}{ Let $P(x,y)=P(x)P(y)$. Then \beqan H(X,Y) &=& \sum_{xy} P(x)P(y) \log \frac{1}{P(x)P(y)} \\ & = & \sum_{xy} P(x)P(y) \log \frac{1}{P(x)} + \sum_{xy} P(x)P(y) \log \frac{1}{ P(y)} \\ &=& \sum_{x} P(x) \log \frac{1}{P(x)} + \sum_{y} P(y) \log \frac{1}{ P(y)} \\ &=& H(X) + H(Y) . \eeqan } % \soln{ex.ascii}{ An ASCII file can be reduced in size by a factor of 7/8. This reduction could be achieved by a block code that maps 8-byte blocks into 7-byte blocks by copying the % . The mapping would copy 56 information-carrying bits into 7 bytes, and ignoring the last bit of every character. } \soln{ex.compress.possible}{ % Theorem: % No program can compress without loss *all* files of size >= N bits, for % any given integer N >= 0. % %Proof: % Assume that the program can compress without loss all files of size >= N % bits. Compress with this program all the 2^N files which have exactly N % bits. All compressed files have at most N-1 bits, so there are at most % (2^N)-1 different compressed files [2^(N-1) files of size N-1, 2^(N-2) of % size N-2, and so on, down to 1 file of size 0]. So at least two different % input files must compress to the same output file. Hence the compression % program cannot be lossless. % %The proof is called the "counting argument". 
% It uses the so-called
The \ind{pigeon-hole principle} states: you can't put 16 pigeons into 15 holes without using one of the holes twice. Similarly, you can't give $\A_X$ outcomes unique binary names of some length $l$ shorter than $\log_2 |\A_X|$ bits, because there are only $2^l$ such binary names, and $l < \log_2 |\A_X|$ implies $2^l < |\A_X|$, so at least two different inputs to the compressor would compress to the same output file. } \soln{ex.cusps}{ Between the cusps, all the changes in probability are equal, and the number of elements in $T$ changes by one at each step. So $H_{\delta}$ varies logarithmically with $(-\delta)$. % NEEDS WORK! } % % Another solution from Conway: % Label them % F AM NOT LICKED % then use these divisions % MA DO LIKE % ME TO FIND % FAKE COIN % %\soln{ex.twelve.generalize.weigh}{ % Thu, 28 Jan 1999 19:19:30 -0500 (EST) % From: %\begin{Sexercise}{ex.twelve.generalize.weigh} This solution was found by Dyson and Lyness in 1946 and presented in the following elegant form by {John Conway}\index{Conway, John H.} in 1999. % \footnote{Posting to {\tt{geometry-puzzles@forum.swarthmore.edu}} % Thu, 28 Jan 1999. %} % Be warned: the symbols A, B, and C are used to name the balls, to name the pans of the balance, to name the outcomes, and to name the possible states of the odd ball! \ben%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% enumerate 1
\item Label the 12 balls by the sequences
% verbatim not allowed in the argument of a command
{\small
\begin{verbatim}
AAB ABA ABB ABC BBC BCA BCB BCC CAA CAB CAC CCA
\end{verbatim}
}
and in the
{\small
\begin{verbatim}
1st               AAB ABA ABB ABC           BBC BCA BCB BCC
2nd weighings put AAB CAA CAB CAC in pan A, ABA ABB ABC BBC in pan B.
3rd               ABA BCA CAA CCA           AAB ABB BCB CAB
\end{verbatim}
}
Now in a given weighing, a pan will either end up in the \bit \item {\tt C}anonical position ({\tt C}) that it assumes when the pans are balanced, or \item {\tt A}bove that position ({\tt A}), or \item {\tt B}elow it ({\tt B}), \eit so the three weighings determine for each pan a sequence of three of these letters. If both sequences are {\tt CCC}, then there's no odd ball. Otherwise, for {\em just one\/} of the two pans, the sequence is among the 12 above, and names the odd ball, whose weight is below or above the proper one according as the pan is {\tt A} or {\tt B} (a pan holding a light ball rises to {\tt A}; a pan holding a heavy ball sinks to {\tt B}). \item In $W$ weighings the odd ball can be identified from among \beq N = (3^W - 3 )/2 \eeq balls in the same way, by labelling them with all the non-constant sequences of $W$ letters from {\tt A}, {\tt B}, {\tt C} whose first change is A-to-B or B-to-C or C-to-A, and at the $w$th weighing putting those whose $w$th letter is {\tt A} in pan {\tt A} and those whose $w$th letter is {\tt B} in pan {\tt B}. \een %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %} \end{Sexercise} % {ex.twelve.two.weigh}{ % removed old solution to graveyard Tue 4/3/03 \soln{ex.Hd46}{% ex 42 % hd.p p=0.2 mmin=1 mmax=2 mstep=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot % hd.p p=0.2 mmin=2 mmax=2 mstep=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot % hd.p p=0.2 mmin=100 mmax=100 mstep=1 suppress_early_detail=1 scale_by_n=1 plot_sub_graphs=1 | gnuplot % hd.p p=0.2 mmin=1000 mmax=1000 mstep=1 suppress_early_detail=1 scale_by_n=1 plot_sub_graphs=1 hd=figs/hd0.2 | gnuplot %# gnuplot < gnu/Hd0.2.gnu %#45:coll:/home/mackay/itp/Hdelta> gv figs/hd0.2/all.1.100.ps
The curves $\frac{1}{N} H_{\delta}(X^N)$ as a function of $\delta$ for $N=1,2$ and 1000 are shown in \figref{fig.hd.1.100}. % and table \ref{tab.Hdelta.0.4}.
Note that $H_2(0.2) = 0.72$ bits.
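As a check on this value, the binary entropy can be evaluated directly from its definition (the decimal figures below are rounded to two places):
\beq
 H_2(0.2) = 0.2 \log_2 \frac{1}{0.2} + 0.8 \log_2 \frac{1}{0.8}
 \simeq 0.2 \times 2.32 + 0.8 \times 0.32 \simeq 0.72 \mbox{ bits}.
\eeq
By the source coding theorem, this is the level towards which $\frac{1}{N} H_{\delta}(X^N)$ flattens as $N$ grows, for any fixed $\delta$ strictly between 0 and 1.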
\begin{figure}[htbp] %\figuremargin{% \figuredanglenudge{% \begin{center} \begin{tabular}[t]{rl} \begin{tabular}[t]{l}\vspace{0in}\\% alignment hack \mbox{\psfig{figure=Hdelta/figs/hd0.2/all.1.100.ps,% width=60mm,angle=-90}} \end{tabular} % \hspace{0in} & %%%%%%%%%%%%%%%%%%%%%%%%% \begin{tabular}[t]{r@{--}lcc} \toprule \multicolumn{4}{c}{$N=1$} \\ \midrule % delta 1/N Hdelta 2^{Hdelta} \multicolumn{2}{c}{$\delta$} & $\frac{1}{N} H_{\delta}(\bX)$ & $2^{H_{\delta}(\bX)}$ % raise the roof! % {\rule[-3mm]{0pt}{8mm}} \\ \midrule 0 & 0.2 & 1 & 2 \\ 0.2 & 1 & 0 & 1 \\ \bottomrule \end{tabular} \hspace{0.1in} \begin{tabular}[t]{r@{--}lcc} \toprule% {r@{--}lcc} \multicolumn{4}{c}{$N=2$} \\ \midrule % delta 1/N Hdelta 2^{Hdelta} \multicolumn{2}{c}{$\delta$} & $\frac{1}{N} H_{\delta}(\bX)$ & $2^{H_{\delta}(\bX)}$ % raise the roof! % {\rule[-3mm]{0pt}{8mm}} \\ \midrule 0 & 0.04 & 1 & 4 \\ 0.04 & 0.2 & 0.79 & 3 \\ % was 0.792\,48 0.2 & 0.36 & 0.5 & 2 \\ 0.36 & 1 & 0 & 1 \\ \bottomrule \end{tabular}\\ \end{tabular} %%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{center} }{% \caption[a]{$\frac{1}{N} H_{\delta}(\bX)$ (vertical axis) against $\delta$ (horizontal), for $N=1, 2, 100$ binary variables with $p_1=0.2$.} \label{fig.hd.1.100} \label{tab.Hdelta.0.4} }{0.25in} \end{figure} %\begin{table}[htbp] %\figuremargin{% %\begin{center} %\end{center} %}{% %\caption[a]{Values of $\frac{1}{N} H_{\delta}(\bX)$ against $\delta$.} %% add 0.4 to this caption %\label{tab.Hdelta.0.4} %} %\end{table} % } \soln{ex.HdSB}{ The Gibbs entropy is $\kB \sum_i p_i \ln \frac{1}{p_i}$, where $i$ runs over all states of the system. This entropy is equivalent (apart from the factor of $\kB$) to the Shannon entropy of the ensemble. Whereas the Gibbs entropy can be defined for any ensemble, the Boltzmann entropy is only defined for {\dem microcanonical\/} ensembles, which have a probability distribution that is uniform over a set of accessible states.
The Boltzmann entropy is defined to be $S_{\rm B} = \kB \ln \Omega$ where $\Omega$ is the number of accessible states of the \ind{microcanonical} ensemble. This is equivalent (apart from the factor of $\kB$) to the perfect information content $H_0$ of that constrained ensemble. The Gibbs entropy of a microcanonical ensemble is trivially equal to the Boltzmann entropy. We now consider a \ind{thermal distribution} (the {\dem\ind{canonical}\/} ensemble), where the probability of a state $\bx$ is \beq % P(\bx) =\frac{1}{Z} \exp( - \beta E(\bx) )? P(\bx) =\frac{1}{Z} \exp\left( - \frac{ E(\bx) }{\kB T} \right) . \eeq With this canonical ensemble we can associate a corresponding microcanonical ensemble, % typically % usually an ensemble with total energy fixed to the mean energy of the canonical ensemble (fixed to within some precision $\epsilon$). % Recalling that under the % thermal distribution (the canonical ensemble) we see that
Now, since $\ln \dfrac{1}{P(\bx)} = \dfrac{E(\bx)}{\kB T} + \ln Z$, fixing the total energy to a precision $\epsilon$ is equivalent to fixing the value of $\ln \dfrac{1}{P(\bx)}$ to within $\epsilon / (\kB T)$. Our definition of the typical set $T_{N \beta}$ was precisely that it consisted of all elements that have a value of $\log P(\bx)$ very close to the mean value of $\log P(\bx)$ under the canonical ensemble, $- N H(X)$. Thus the microcanonical ensemble is equivalent to a uniform distribution over % constraining the state $\bx$ to be in the typical set of the canonical ensemble. Our proof of the \aep\ thus proves -- for the case of a system whose energy is separable into a sum of independent terms -- that the Boltzmann entropy of the microcanonical ensemble is very close (for large $N$) to the Gibbs entropy of the canonical ensemble, if the energy of the microcanonical ensemble is constrained to equal the mean energy of the canonical ensemble.
} \soln{ex.cauchy}{ The normalizing constant of the \ind{Cauchy distribution} \[ P(x) = \frac{1}{Z} \frac{1}{x^2 + 1} \] is \beq Z = \int^{\infty}_{-\infty} \d x \: \frac{1}{x^2 + 1} = \left[ {\tan}^{-1} x \right]^{\infty}_{-\infty} = \frac{\pi}{2} - \frac{-\pi}{2} = \pi . \eeq The mean and variance of this distribution are both undefined. (The distribution is symmetrical about zero, but this does not imply that its mean is zero. The mean is the value of a divergent integral.) % ; depending what limiting procedure we % define to evaluate this integral we The sum $z=x_1 + x_2$, where $x_1$ and $x_2$ both have Cauchy distributions, has probability density given by the convolution \beq P(z) = \frac{1}{\pi^2} \int^{\infty}_{-\infty} \d x_1 \: \frac{1}{x_1^2 + 1} \frac{1}{(z-x_1)^2 + 1} % P(x1,x2) delta [z=x1+x2] .. -> x2 = z-x1 , \eeq % Introducing $\Delta \equiv x_1-x_2$ this can be written more symmetrically % as % \beq % P(z) = \frac{1}{\pi^2} \int^{\infty}_{-\infty} \d \Delta \: % \eeq which after a considerable labour using standard methods %\footnote{Can anyone % give me an elegant solution?} gives \beq P(z) = \frac{1}{\pi^2} 2 \frac{\pi}{z^2+4} = \frac{2}{\pi} \frac{1}{z^2+2^2} , \label{eq.cauchysum} \eeq which we recognize as a Cauchy distribution with width parameter 2 (where the original distribution has width parameter 1). This implies that the mean of the two points, $\bar{x} = (x_1+x_2)/2 = z/2$, has a Cauchy distribution with width parameter 1. Generalizing, the mean of $N$ samples from a Cauchy distribution is Cauchy-distributed with the {\em same parameters\/} as the individual samples. The probability distribution of the mean does {\em not\/} become narrower as $1/\sqrt{N}$. 
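A sketch of why this generalization holds, assuming the standard result that the Fourier transform of the unit-width Cauchy density is $e^{-|\omega|}$: the sum of $N$ independent samples has transform $( e^{-|\omega|} )^N$, and dividing the sum by $N$ rescales the frequency variable by $1/N$, so the transform of the mean $\bar{x}$ is
\beq
 \left( e^{-|\omega/N|} \right)^{N} = e^{-|\omega|} ,
\eeq
which is the transform of the original density, whatever the value of $N$.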
{\em The \ind{central-limit theorem} does not apply to the {Cauchy distribution}, because it does not have a finite \ind{variance}.} An alternative neat method for getting to \eqref{eq.cauchysum} makes use of the \ind{Fourier transform}\index{generating function} of the Cauchy distribution, which is a \index{biexponential}{biexponential} $e^{-|\omega|}$. Convolution in real space corresponds to multiplication in Fourier space, so the \ind{Fourier transform} of $z$ is simply $e^{-|2 \omega|}$. Reversing the transform, we obtain \eqref{eq.cauchysum}. } %\begincuttable \soln{ex.fxxxxx}{ \amarginfig{t}{ \begin{center} \begin{tabular}{c} \psfig{figure=gnu/fxxxxx50.ps,width=1.7in,angle=-90}\\ \psfig{figure=gnu/fxxxxx5.ps,width=1.7in,angle=-90}\\ \psfig{figure=gnu/fxxxxx.5.ps,width=1.7in,angle=-90}\\ \end{tabular} \end{center} %}{% gnu: load 'fxxxxx.gnu' \caption[a]{ % The function $\displaystyle f(x) = x_{\:,}^{x^{x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}}} $ shown at three different scales.} \label{fig.xxxxx} }% The function $f(x)$ %\beq % f(x) = x^{x^{x^{x^{x^{\ddots}}}}} %\eeq has inverse function % to $f$ is \beq g(y) = y^{1/y}. \eeq Note \beq \log g(y) = 1/y \log y . \eeq I obtained a tentative graph of $f(x)$ by plotting $g(y)$ with $y$ along the vertical axis and $g(y)$ along the horizontal axis. The resulting graph suggests that $f(x)$ is single valued for $x \in (0,1)$, and looks surprisingly well-behaved and ordinary; for $x \in (1, e^{1/e})$, $f(x)$ is two-valued. $f(\sqrt{2})$ is equal both to 2 and 4. For $x > e^{1/e}$ (which is about 1.44), $f(x)$ is infinite. % undefined. 
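These statements can be checked directly from $g(y) = y^{1/y}$ (a short verification using only elementary calculus). First,
\beq
 g(2) = 2^{1/2} = \sqrt{2} , \:\:\:\: g(4) = 4^{1/4} = \sqrt{2} ,
\eeq
so both $y=2$ and $y=4$ satisfy $y = (\sqrt{2})^y$. Second,
\beq
 \frac{\d}{\d y} \ln g(y) = \frac{1 - \ln y}{y^2} ,
\eeq
which is positive for $y<e$ and negative for $y>e$, so $g$ attains its maximum value $g(e) = e^{1/e} \simeq 1.44$ at $y=e$, and no $y$ satisfies $y = x^y$ for larger $x$.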
However, it might be argued that this approach to sketching $f(x)$ is only partly valid, if we define $f$ as the limit of the sequence of functions $x$, $x^x$, $x^{x^x}, \ldots$; this sequence does not have a limit for % , below % pr (1.0/exp(1.0))**exp(1.0) % 0.0659880358453126 $0 \leq x \leq (1/e)^e \simeq 0.07$ on account of a pitchfork \ind{bifurcation} at $x=(1/e)^e$; and for $x \in (1,e^{1/e})$, the sequence's limit is single-valued -- the lower of the two values sketched in the figure. % load 'fxxxxx.gnu2' % } %\endcuttable \dvipsb{solutions source coding} \prechapter{About Chapter} \fakesection{intro for chapter 3} In the last chapter, we saw a proof of the fundamental status of the entropy as a measure of average information content. We defined a data compression scheme using {\em fixed length block codes}, and proved that as $N$ increases, it is possible to encode $N$ i.i.d.\ variables $\bx = (x_1,\ldots,x_N)$ into a block of $N(H(X)+\epsilon)$ bits with vanishing probability of error, whereas if we attempt to encode $X^N$ into $N(H(X)-\epsilon)$ bits, the probability of error is virtually 1. We thus verified the {\em possibility\/} of data compression, but the block coding defined in the proof did not give a practical algorithm. In this chapter and the next, we study practical data compression algorithms. Whereas the last chapter's compression scheme used large blocks of {\em fixed\/} size and was {\em lossy}, in the next chapter we discuss {\em variable-length\/} compression schemes that are practical for small block sizes and that are {\em not lossy}. Imagine a rubber glove filled with water. If we compress two fingers of the glove, some other part of the glove has to expand, because the total volume of water is constant. (Water is essentially incompressible.) Similarly, when we shorten the codewords for some outcomes, there must be other codewords that get longer, if the scheme is not lossy. 
In this chapter we will discover the information-theoretic equivalent of water \ind{volume}. % the constant volume of water in the glove. %% \medskip \fakesection{prerequisites for chapter 3} Before reading \chref{ch.three}, you should have worked on \extwenty. \medskip We will use the\index{notation!intervals} following notation for intervals:\medskip % the statement \begin{center} \begin{tabular}{ll} $x \in [1 ,2)$ & means that $x \geq 1$ and $x < 2$; \\ % the statement $x \in (1 ,2]$ & means that $x > 1$ and $x \leq 2$.\\ \end{tabular} \end{center} % {All these definitions of source % codes, Huffman codes, etc., can be generalized to codes over % other $q$-ary alphabets, but little is lost by concentrating on % the binary case.} %\chapter{Data Compression II: Symbol Codes} \mysetcounter{page}{102} \ENDprechapter \chapter{Symbol Codes} \label{ch.three} % %.tex % \documentstyle[twoside,11pt,chapternotes,lsalike]{itchapter} % \begin{document} % \bibliographystyle{lsalike} % \input{psfig.tex} % \include{/home/mackay/tex/newcommands1} % \include{/home/mackay/tex/newcommands2} % \input{itprnnchapter.tex} % \setcounter{chapter}{2}% set to previous value % \setcounter{page}{34} % set to current value % \setcounter{exercise_number}{45} % set to imminent value % % % \renewcommand{\bs}{{\bf s}} % \newcommand{\eq}{\mbox{$=$}} % \chapter{Data Compression II: Symbol Codes} % % \section*{Source Coding: Lossless data compression with symbol codes} % % Practical source coding \label{ch3} %\section{Symbol codes} In this chapter, we discuss {\dem variable-length symbol codes\/}\indexs{symbol code},\index{source code!symbol code} % , variable-length}, which encode one source symbol at a time, instead of encoding huge strings of $N$ source symbols. 
These codes are {\dem lossless:} unlike the last chapter's block codes, they are guaranteed to compress and decompress without any errors; but there is a chance that the codes may sometimes produce encoded strings longer than the original source string. The idea is that we can achieve compression, on average, by assigning {\em shorter\/} encodings to the more probable outcomes and {\em longer\/} encodings to the less probable. The key issues are: \begin{description} \item[What are the implications if a symbol code is {\em lossless\/}?] If some codewords are shortened, by how much do other codewords have to be lengthened? \item[Making compression practical\puncspace] How can we ensure that a symbol code is easy to decode? \item[Optimal symbol codes\puncspace] How should we assign codelengths to achieve the best compression, and what is the best achievable compression? \end{description} We again verify the fundamental status of the Shannon \ind{information content} and the entropy, proving:\index{source coding theorem} % % \begin{description} \item[Source coding theorem (symbol codes)\puncspace] There exists a variable-length encoding $C$ of an ensemble $X$ such that the average length of an encoded symbol, $L(C,X)$, satisfies $L(C,X) \in \left[ H(X) , H(X) + 1 \right)$. The average length is equal to the entropy $H(X)$ only if the codelength for each outcome is equal to its {Shannon information content}. \end{description} % We will also define a constructive procedure, the \index{Huffman code}Huffman\nocite{Huffman1952} coding algorithm, that produces optimal symbol codes.\index{symbol code!optimal} \begin{description} \item[Notation for alphabets\puncspace] $\A^N$ denotes the set of ordered $N$-tuples of elements from the set $\A$, \ie, all strings of length $N$. The symbol $\A^+$ will denote the set of all strings of finite length composed of elements from the set $\A$. 
\end{description} \exampla{ $\{{\tt{0}},{\tt{1}}\}^3 = \{{\tt{0}}{\tt{0}}{\tt{0}},{\tt{0}}{\tt{0}}{\tt{1}},{\tt{0}}{\tt{1}}{\tt{0}},{\tt{0}}{\tt{1}}{\tt{1}},{\tt{1}}{\tt{0}}{\tt{0}},{\tt{1}}{\tt{0}}{\tt{1}},{\tt{1}}{\tt{1}}{\tt{0}},{\tt{1}}{\tt{1}}{\tt{1}}\}$. } \exampla{ $\{{\tt{0}},{\tt{1}}\}^+ = \{ {\tt{0}} , {\tt{1}} , {\tt{0}}{\tt{0}} , {\tt{0}}{\tt{1}} , {\tt{1}}{\tt{0}} , {\tt{1}}{\tt{1}} , {\tt{0}}{\tt{0}}{\tt{0}} , {\tt{0}}{\tt{0}}{\tt{1}} , \ldots \}$. } % This notation is borrowed from the standard notation for expressions % in computer science \section{Symbol codes} \label{sec.symbol.code.intro} \begin{description} \item[A (binary) symbol code] $C$ for an ensemble $X$ is a mapping from the range of $x$, $\A_X \eq \{a_1,\ldots, $ $a_I\}$, to $\{{\tt{0}},{\tt{1}}\}^+$. % a set of finite length strings of symbols % from an alphabet (NAME?). $c(x)$ will denote the {\dem{codeword}\/}\indexs{symbol code!codeword} corresponding to $x$, and $l(x)$ will denote its length, with $l_i = l(a_i)$. The {\dem \inds{extended code}\/} $C^+$ is a mapping from $\A_X^+$ to $\{{\tt{0}},{\tt{1}}\}^+$ obtained by concatenation, without punctuation, of the corresponding codewords:\index{concatenation!in compression} \beq c^+(x_1 x_2 \ldots x_N) = c(x_1)c(x_2)\ldots c(x_N) . \eeq [The term `\ind{mapping}' here is a synonym for `function'.] \end{description} \exampla{ A symbol code for the ensemble $X$ defined by \beq \begin{array}{*{4}{c}*{5}{@{\,}c}} & \A_X & = & \{ & {\tt a}, & {\tt b}, & {\tt c}, & {\tt d} & \} , \\ & \P_X & = & \{ & \dhalf, & \dquarter, & \deighth, & \deighth & \}, \end{array} \eeq % : \A_X = \{{\tt{a}},{\tt{b}},{\tt{c}},{\tt{d}}\},$ $\P_X = \{ \dhalf,\dquarter,\deighth,\deighth \}$ is $C_0$, shown in the margin. % = \{ {\tt{1}}{\tt{0}}{\tt{0}}{\tt{0}}, {\tt{0}}{\tt{1}}{\tt{0}}{\tt{0}}, {\tt{0}}{\tt{0}}{\tt{1}}{\tt{0}}, {\tt{0}}{\tt{0}}{\tt{0}}{\tt{1}}\}$. 
\marginpar{ \begin{center} $C_0$: \begin{tabular}{clc} \toprule $a_i$ & $c(a_i)$ & $l_i$ % {\rule[-3mm]{0pt}{8mm}}%strut \\ \midrule {\tt a} & {\tt 1000} & 4 \\ {\tt b} & {\tt 0100} & 4 \\ {\tt c} & {\tt 0010} & 4 \\ {\tt d} & {\tt 0001} & 4 \\ \bottomrule \end{tabular} \end{center} } Using the extended code, we may encode ${\tt{acdbac}}$ as \beq c^{+}({\tt{acdbac}}) = {\tt{1000}} {\tt{0010}} {\tt{0001}} {\tt{0100}} {\tt{1000}} {\tt{0010}} . \eeq } There are basic requirements for a useful symbol code. First, any encoded string must have a unique decoding. Second, the symbol code must be easy to decode. And third, the code should achieve as much compression as possible. \subsection{Any encoded string must have a unique decoding} \begin{description} \item[A code $C(X)$ is uniquely decodeable] if, under the extended code $C^+$, no two distinct strings have the same encoding, % every element of $\A_X^+$ maps into a different string, \ie, \beq \forall \, \bx,\by \in \A_X^+, \:\: \bx \not = \by \:\: \Rightarrow \:\: c^+(\bx) \not = c^+(\by). \label{eq.UD} \eeq %cnp22@maths.cam.ac.uk: % I'm missing the word `injectivity'. This would explain, why % (3.2) is necessary for an inverse function. % % {\em I believe mathematicians would put it this way: % a code is uniquely decodeable if the extended code is an injective % mapping.} \end{description} The code $C_0$ defined above is an example of a uniquely decodeable code. \subsection{The symbol code must be easy to decode} A symbol code is easiest to decode if it is possible to identify the end of a codeword as soon as it arrives, which means that no codeword can be a {\dem{prefix}\/} of another codeword. % % {\em (Need a defn of a prefix here.)} %\marginpar{\footnotesize % [A word $c$ %% \in \A^{+}$ % is a {\dem prefix\/} of another word $d$ %% \in \A^{+}$ % if there exists a tail string $t$ %% \in \A^{*} % such that the concatenation $ct$ is % identical to $d$. 
For example, {\tt 1} is a prefix of {\tt 101}, % and so is {\tt 10}.] %} [A word $c$ % \in \A^{+}$ is a {\dem prefix\/} of another word $d$ % \in \A^{+}$ if there exists a tail string $t$ % \in \A^{*} such that the concatenation $ct$ is identical to $d$. For example, {\tt 1} is a prefix of {\tt 101}, and so is {\tt 10}.] % We will show later that we don't lose any performance if we constrain our symbol code to be a prefix code. \begin{description} \item[A symbol code is called a \inds{prefix code}] if no codeword is a prefix of any other codeword. A prefix code is also known as an {\dem\ind{instantaneous}\/} or {\dem\ind{self-punctuating}\/} code, because an encoded string can be decoded from left to right without looking ahead to subsequent codewords. The end of a codeword is immediately recognizable. A prefix code is uniquely decodeable. \end{description} \begin{aside} {Prefix codes are also % is more accurately called known as `prefix-free codes' or `prefix condition codes'.} \end{aside} \noindent Prefix codes correspond to trees, as illustrated in the margin of the next page. \exampla{ \amarginfignocaption{t}{\mbox{\small$C_1$ \psfig{figure=figs/C1.ps,angle=-90,width=1in}}} The code $C_1 = \{ {\tt{0}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$ is a prefix code because ${\tt{0}}$ is not a prefix of {\tt{1}}{\tt{0}}{\tt{1}}, nor is {\tt{1}}{\tt{0}}{\tt{1}} a prefix of {\tt{0}}. } \exampla{ Let $C_2 = \{ {\tt{1}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$. This code is not a prefix code because ${\tt{1}}$ is a prefix of {\tt{1}}{\tt{0}}{\tt{1}}. } \exampla{ % \marginpar[t]{\mbox{\small\raisebox{0.4in}[0in][0in]{$C_3$} \psfig{figure=figs/C3.ps,angle=-90,width=1in}}} The code $C_3 = \{ {\tt 0} , {\tt 10} , {\tt 110} , {\tt 111} \}$ is a prefix code. 
% } %%%%%%%%%%%%%%% \exampla{ \amarginfignocaption{t}{\mbox{\small\raisebox{0.4in}[0in][0in]{$C_3$} \psfig{figure=figs/C3.ps,angle=-90,width=1in}}\\[0.21in] \mbox{\small% \raisebox{0.2in}[0in][0in]{$C_4$} \psfig{figure=figs/C4.ps,angle=-90,width=0.681in}% }\\[0.125in] \small\raggedright\reducedlead Prefix codes can be represented on binary trees. {\dem Complete\/} prefix codes correspond to binary trees with no unused branches. $C_1$ is an incomplete code.} The code $C_4 = \{ {\tt 00} , {\tt 01} , {\tt 10} , {\tt 11} \}$ is a prefix code. % } %%%%%%%%%%%%%%% \exercissxA{1}{ex.C1101}{ Is $C_2$ uniquely decodeable? } % % example % % morse code with spaces stripped out. Is it a prefix code? Is it UD? % (no,no) % \exampla{ % ref corrected 9802 Consider \exerciseref{ex.weigh} and \figref{fig.weighing} (\pref{fig.weighing}). Any weighing strategy that identifies the odd ball and whether it is heavy or light can be viewed as assigning a {\em ternary\/} code to each of the 24 possible states. This code is a prefix code. } \subsection{The code should achieve as much compression as possible} \begin{description} \item[The expected length $L(C,X)$] of a symbol code $C$ for ensemble $X$ is \beq L(C,X) = \sum_{x \in \A_X} P(x) \, l(x). \eeq We may also write this quantity as \beq L(C,X) = \sum_{i=1}^{I} p_i l_i \eeq where $I = |\A_X|$. 
\end{description} % \exampla{ % {\sf Example 1:} \marginpar[b]{ \begin{center} $C_3$:\\[0.1in] \begin{tabular}{cllcc} \toprule $a_i$ & $c(a_i)$ & $p_i$ & % \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} $h(p_i)$ & $l_i$ % {\rule[-3mm]{0pt}{8mm}}%strut \\ \midrule {\tt a} & {\tt 0} & \dhalf & 1.0 & 1 \\ {\tt b} & {\tt 10} & \dquarter & 2.0 & 2 \\ {\tt c} & {\tt 110} & \deighth & 3.0 & 3 \\ {\tt d} & {\tt 111} & \deighth & 3.0 & 3 \\ \bottomrule \end{tabular} \end{center} } Let \beq \begin{array}{*{4}{c}*{5}{@{\,}c}} & \A_X & = & \{ & {\tt a}, & {\tt b}, & {\tt c}, & {\tt d} & \} , \\ \mbox{and} \:\:& \P_X & = & \{ & \dhalf, & \dquarter, & \deighth, & \deighth & \}, \end{array} \eeq and consider the code $C_3$. % $c(a)\eq {\tt{0}}$, $ c(b)\eq {\tt{1}}{\tt{0}}$, % $c(c)\eq {\tt{1}}{\tt{1}}{\tt{0}}$, $ c(d)\eq {\tt{1}}{\tt{1}}{\tt{1}}$. % The entropy of $X$ is 1.75 bits, and the expected length $L(C_3,X)$ of this code is also 1.75 bits. The sequence of symbols $\bx\eq ({\tt acdbac})$ is % 134213 encoded as $c^+(\bx)={\tt{0110111100110}}$. % You can confirm that no other sequence of % symbols $\bx$ has the same encoding. % In fact, $C_3$ is a {prefix code\/} and is therefore \inds{uniquely decodeable}. Notice that the codeword lengths satisfy $l_i \eq \log_2 (1/p_i)$, or equivalently, $p_i \eq 2^{-l_i}$. } %\medskip % %\noindent {\sf Example 2:} \exampla{ Consider the fixed length code for the same ensemble $X$, $C_4$. % $ c(1)\eq {\tt{00}}$, $ c(2)\eq {\tt{01}}$, $ c(3)\eq {\tt{10}}$, $ c(4)\eq {\tt{11}}$. 
% % C4 by itself in a table, moved to graveyard \marginpar[b]{ \begin{center} \begin{tabular}{cll} \toprule % $a_i$ & $C_4$& $C_5$ %&$C_6$ % \\ % $c(a_i)$ & $p_i$ & % \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} % $h(p_i)$ & $l_i$ % {\rule[-3mm]{0pt}{8mm}}%strut \\ \midrule {\tt a} & {\tt 00} & {\tt 0} \\ {\tt b} & {\tt 01} & {\tt 1} \\ {\tt c} & {\tt 10} & {\tt 00} \\ {\tt d} & {\tt 11} & {\tt 11} \\ \bottomrule \end{tabular} \end{center} } The expected length $L(C_4,X)$ is 2 bits. } % edskip % % \noindent {\sf Example 3:} \exampla{ Consider $C_5$. %$ c(1)\eq {\tt{0}}$, $ c(2)\eq {\tt{1}}$, $ c(3)\eq {\tt{00}}$, $c(4)\eq {\tt{11}}$. The expected length $L(C_5,X)$ is 1.25 bits, which is less than $H(X)$. But the code is not uniquely decodeable. The sequence $\bx\eq ({\tt acdbac})$ % 134213)$ encodes as {\tt{000111000}}, which can also be decoded as $({\tt cabdca})$. } % \medskip % % \noindent {\sf Example 4:} \exampla{ Consider the code $C_6$. \amargintabnocaption{c}{ \begin{center} $C_6$:\\[0.1in] \begin{tabular}{cllcc} \toprule $a_i$ & $c(a_i)$ & $p_i$ & % {$\log_2 \frac{1}{p_i}$} $h(p_i)$ & $l_i$ % {\rule[-3mm]{0pt}{8mm}}%strut \\ \midrule {\tt a} & {\tt 0} & \dhalf & 1.0 & 1 \\ {\tt b} & {\tt 01} & \dquarter & 2.0 & 2 \\ {\tt c} & {\tt 011} & \deighth & 3.0 & 3 \\ {\tt d} & {\tt 111} & \deighth & 3.0 & 3 \\ \bottomrule \end{tabular} \end{center} } %$ c(1)\eq {\tt{0}}$, $ c(2)\eq {\tt{01}}$, $ c(3)\eq {\tt{011}}$, $c(4)\eq {\tt{111}}$. The expected length $L(C_6,X)$ of this code is 1.75 bits. The sequence of symbols $\bx\eq ({\tt acdbac})$ is encoded as $c^+(\bx)={\tt{0011111010011}}$. Is $C_6$ a {prefix code}? It is not, because $c({\tt a}) = {\tt 0}$ is a prefix of both $c({\tt b})$ and $c({\tt c})$. Is $C_6$ {uniquely decodeable}? This is not so obvious. If you think that it might {\em not\/} be {uniquely decodeable}, try to prove it so by finding a pair of strings $\bx$ and $\by$ that have the same encoding. 
[The definition of unique decodeability is given in \eqref{eq.UD}.] $C_6$ certainly isn't {\em easy\/} to decode. When we receive `{\tt{00}}', it is possible that $\bx$ could start `{\tt{aa}}', `{\tt{ab}}' or `{\tt{ac}}'. Once we have received `{\tt{001111}}', the second symbol is still ambiguous, as $\bx$ could be `{\tt{abd}}\ldots' or `{\tt{acd}}\ldots'. But eventually a unique decoding crystallizes, once the next {\tt{0}} appears in the encoded stream. $C_6$ {\em is\/} in fact {uniquely decodeable}. Comparing with the prefix code $C_3$, we see that the codewords of $C_6$ are the reverse of $C_3$'s. That $C_3$ is uniquely decodeable proves that $C_6$ is too, since any string from $C_6$ is identical to a string from $C_3$ read backwards. } % \medskip % something I recall reading in cover was a contrary statement that said that % with a nonprefix code it will take an arb long time to figure things out. % maybe that was just a w.c. result. % What is it that distinguishes a uniquely \section{What limit is imposed by unique decodeability?} We now ask, given a list of positive integers $\{ l_i \}$, does there exist a uniquely decodeable\index{uniquely decodeable}\index{source code!uniquely decodeable} code with those integers as its codeword lengths? At this stage, we ignore the probabilities of the different symbols; once we understand unique decodeability better, we'll reintroduce the probabilities and discuss how to make an {\dem optimal\/} uniquely decodeable symbol code. In the examples above, we have observed that if we take a code such as $\{{\tt{00}},{\tt{01}},{\tt{10}},{\tt{11}}\}$, and shorten one of its codewords, for example ${\tt{00}} \rightarrow {\tt{0}}$, then we can retain unique decodeability only if we lengthen other codewords. Thus there seems to be a constrained budget\index{symbol code!budget} that we can spend on codewords, with shorter codewords being more expensive. Let us explore the nature of this \ind{budget}. 
If we build a code purely from codewords of length $l$ equal to three, how many codewords can we have and retain unique decodeability? The answer is $2^l = 8$. Once we have chosen all eight of these codewords, is there any way we could add to the code another codeword of some {\em other\/} length and retain unique decodeability? It would seem not. What if we make a code that includes a length-one codeword, `{\tt{0}}', with the other codewords being of length three? How many length-three codewords can we have? If we restrict attention to prefix codes, then % it is clear that we can have only four codewords of length three, namely $\{ {\tt{100}},{\tt{101}},{\tt{110}},{\tt{111}} \}$. What about other codes? Is there any other way of choosing codewords of length 3 that can give more codewords? Intuitively, we think this unlikely. A codeword of length $3$ appears to have a cost that is $2^{2}$ times smaller than a codeword of length 1. % "... cost ... times smaller ..."; I suspect some % readers may have difficulty with this sentence. Let's define a total budget of size 1, which we can spend on codewords. If we set the cost of a codeword whose length is $l$ to $2^{-l}$, then we have a pricing system that fits the examples discussed above. Codewords of length 3 cost $\deighth$ each; codewords of length 1 cost $1/2$ each. We can spend our budget on any codewords. If we go over our budget then the code will certainly not be uniquely decodeable. If, on the other hand, \beq \sum_i 2^{-l_i} \leq 1, \label{eq.kraft} \eeq then the code may be uniquely decodeable. This inequality is the \inds{Kraft inequality}.\label{sec.kraft} \begin{description} \item[\Kraft\ inequality\puncspace] For any uniquely decodeable code $C(X)$ over the binary alphabet $\{0,1\}$, the codeword lengths must satisfy: \beq \sum_{i=1}^I 2^{-l_i} \leq 1 , \eeq where $I = |\A_X|$. 
\end{description} \begin{description} \item[Completeness\puncspace] If a uniquely decodeable code satisfies the \Kraft\ inequality with equality then it is called a {\dbf complete} code. \end{description} % It is less obvious that t We want codes that are uniquely decodeable; prefix codes are uniquely decodeable, and are easy to decode. % ; and it is easy to assess whether a code is a prefix code. % codes that are not prefix codes are less straightforward to decode than % prefix codes. So life would be simpler for us if we could restrict attention to prefix codes.\index{prefix code} Fortunately, % we can prove that for any source there {\em is\/} an optimal symbol code that is also a prefix code. % We wi, and we will discuss an % algorithm we can restrict attention to prefix % codes. % The following % result is also true: \begin{description} \item[\Kraft\ inequality and prefix codes\puncspace] Given a set of codeword lengths that satisfy the Kraft inequality, % this inequality, there exists a uniquely decodeable prefix code\index{source code!prefix code}\index{prefix code} with these codeword lengths. \end{description} \begin{aside} %\subsection*{The small print} The Kraft inequality % , which appears on page \pageref{sec.kraft}, might be more accurately referred to as the Kraft--McMillan inequality:\index{Kraft, L.G.}\index{McMillan, B.}\nocite{mcmillan1956} Kraft % (1949) proved that if the inequality is satisfied, then a prefix code exists with the given lengths. % McMillan % (1956) \citeasnoun{mcmillan1956} proved the converse, that unique decodeability implies that the inequality holds. \end{aside} \begin{prooflike}{Proof of the \Kraft\ inequality} % Define $S = \sum_i 2^{-l_i}$. Consider the quantity \beq S^N = \left[ \sum_i 2^{-l_i} \right]^N = \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} \cdots \sum_{i_N=1}^{I} 2^{-\displaystyle \left(l_{i_1} + l_{i_2} + \cdots + l_{i_N} \right) } . 
\eeq The quantity in the exponent, $\left(l_{i_1} + l_{i_2} + \cdots + l_{i_N} \right)$, is the length of the encoding of the string $\bx = a_{i_1} a_{i_2} \ldots a_{i_N}$. For every string $\bx$ of length $N$, there is one term in the above sum. Introduce an array $A_l$ that counts how many strings $\bx$ have encoded length $l$. Then, defining $l_{\min} = \min_i l_i$ and $l_{\max} = \max_i l_i$: \beq S^N = \sum_{l = N l_{\min} }^{N l_{\max}} 2^{-l} A_l . \eeq Now assume $C$ is uniquely decodeable, so that for all $\bx \not = \by$, $c^+(\bx) \not = c^+(\by)$. Concentrate on the $\bx$ that have encoded length $l$. There are a total of $2^l$ distinct bit strings of length $l$, so it must be the case that $A_l \leq 2^l$. % So \beq S^N = \sum_{l = N l_{\min} }^{N l_{\max}} 2^{-l} A_l \leq \sum_{l = N l_{\min} }^{N l_{\max}} 1 \:\: \leq \:\: N l_{\max}. \label{eq.kraft.climax} \eeq Thus $S^N \leq l_{\max} N$ for all $N$. Now if $S$ were greater than 1, then as $N$ increases, $S^N$ would be an exponentially growing function, and for large enough $N$, an exponential always exceeds a polynomial such as $l_{\max} N$. But our result $(S^N \leq l_{\max} N)$ % \ref{eq.kraft.climax} is true for {\em any\/} $N$. Therefore $S \leq 1$. \hfill % Q.E.D. % % to have % enabled me to understand it the first time round, it would have been % sufficient to have said 'for the inequality to be true for all N, % regardless of how large, S has to be <= 1.' % \end{prooflike} \exercissxB{3}{ex.KIconverse}{ % (optional) Prove the result stated above, that for any set of codeword lengths $\{ l_i \}$ satisfying the \Kraft\ inequality, there is a prefix code having those lengths. 
} % % Symbol Coding Budget % \begin{figure} \figuremargin{% \begin{center} \mbox{\psfig{figure=figs/budget1.eps,height=3in}\ \psfig{figure=figs/budgetmax.eps,height=3in}} \end{center} }{% \caption[a]{The symbol coding \ind{budget}.\index{source code!supermarket}\indexs{symbol code!budget} The `cost' $2^{-l}$ of each codeword (with length $l$) is indicated by the size of the box it is written in. The total budget available when making a uniquely decodeable code is 1. You can think of this diagram as showing a {\dem{codeword supermarket}\/}\index{supermarket (codewords)}, with the codewords arranged in aisles by their length, and the cost of each codeword indicated by the size of its box on the shelf. If the cost of the codewords that you take exceeds the budget then your code will not be uniquely decodeable. } \label{fig.budget1} }% \end{figure} \begin{figure} \figuredangle{% \begin{center} \mbox{ %\begin{tabular}{cc} % $C_0$ & $C_3$ \\ %\psfig{figure=figs/budget0.eps,height=1.48in}& %\psfig{figure=figs/budget3.eps,height=1.48in} \\[0.2in] % $C_4$ & $C_6$ \\ %\psfig{figure=figs/budget4.eps,height=1.48in}& %\psfig{figure=figs/budget6.eps,height=1.48in}\\ %\end{tabular}} \begin{tabular}{cccc} $C_0$ & $C_3$ & $C_4$ & $C_6$ \\ \psfig{figure=figs/budget0.eps,height=1.66in}& \psfig{figure=figs/budget3.eps,height=1.66in}& \psfig{figure=figs/budget4.eps,height=1.66in}& \psfig{figure=figs/budget6.eps,height=1.66in}\\ \end{tabular}} \end{center} }{% \caption[a]{Selections of codewords % from the codeword supermarket made by codes $C_0,C_3,C_4$ and $C_6$ from section \protect\ref{sec.symbol.code.intro}.} \label{fig.budget0} \label{fig.budget6} }% \end{figure} A pictorial view of the \Kraft\ inequality may help you solve this exercise. Imagine that we are choosing the codewords to make a symbol code. 
We can draw the set of all candidate codewords % that we might include in a code in a supermarket that displays the `cost' of the codeword by the area of a box (\figref{fig.budget1}). The total budget available -- the `1' on the right-hand side of the \Kraft\ inequality -- is shown at one side. Some of the codes discussed in section \ref{sec.symbol.code.intro} are illustrated in figure \ref{fig.budget0}. Notice that the codes that are prefix codes, $C_0$, $C_3$, and $C_4$, have the property that to the right of any selected codeword, there are no other selected codewords -- because prefix codes correspond to trees. % The {\em complete\/} prefix codes $C_0$, $C_3$, % and $C_4$ have the property that % the codewords abut % Notice also that the % `incomplete' code % -\ref{fig.budget6}. Notice that a {\em complete\/} prefix code corresponds to a {\em complete\/} tree having no unused branches. \medskip We are now ready to put back the symbols' probabilities $\{ p_i \}$. Given a set of symbol probabilities (the English language probabilities of \figref{fig.monogram}, for example), how do we make the best symbol code -- one with the smallest possible expected length $L(C,X)$? And what is that smallest possible expected length? It's not obvious how to assign the codeword lengths. If we give short codewords to the more probable symbols then the expected length might be reduced; on the other hand, shortening some codewords necessarily causes others to lengthen, by the Kraft inequality. \section{What's the most compression that we can hope for?} % there must be a compromise. % of s % Of the four codes displayed in figure \ref{fig.budget0}, % $C_3$ and $C_6$ We wish to minimize the expected length of a code, \beqan L(C,X) &=& \sum_i p_i l_i . \eeqan As you might have guessed, the entropy appears as the % It is easy to show that there is a lower bound on the expected length of a code. 
\begin{description} \item[Lower bound on expected length\puncspace] The expected length $L(C,X)$ of a uniquely decodeable code is bounded below by $H(X)$. \item[{\sf Proof.}] % Introduce the optimum codelengths $l^*_i \equiv \log (1/p_i)$, We define the {\dem\inds{implicit probabilities}\/} $q_i \equiv 2^{-l_i}/z$, where $z\eq \sum_{i'} 2^{-l_{i'}}$, so that $l_i \eq \log 1/q_i - \log z$. We then use Gibbs' inequality, $\sum_i p_i \log 1/q_i \geq \sum_i p_i \log 1/p_i$, with equality if $q_i \eq p_i$, and the \Kraft\ inequality $z\leq 1$: \beqan L(C,X) &=& \sum_i p_i l_i = \sum_i p_i \log 1/q_i - \log z \label{eq.expected.length} \\ & \geq & \sum_i p_i \log 1/p_i - \log z \\ & \geq & H(X) . \eeqan The equality $L(C,X) \eq H(X)$ is achieved only if the \Kraft\ equality $z % \sum_i 2^{-l_i} \eq 1$ is satisfied, and if the codelengths satisfy $l_i \eq \log (1/p_i)$. \hfill $\Box$ \end{description} This is an important result so let's say it again: \begin{description} \item[Optimal source codelengths\puncspace] The\index{source code!optimal lengths} expected length is minimized and is equal to $H(X)$ only if the codelengths are equal to the {\dem Shannon information contents}:\index{information content} \beq l_i = \log_2 (1/p_i) . \eeq \item[Implicit probabilities defined by codelengths\puncspace] Conversely, any choice of codelengths $\{l_i\}$ {\em implicitly\/} defines a probability distribution $\{q_i\}$, \beq q_i \equiv 2^{-l_i}/z , \eeq for which those codelengths would be the optimal codelengths. If the code is complete then $z=1$ and the implicit probabilities are given by $q_i = 2^{-l_i}$. \end{description} % This is one of the central themes of this course. % % % \section{How much can we compress?} So, we can't compress below the entropy. % using a symbol code. How close can we expect to get to the entropy? % if we are using a symbol code? 
% \section{Existence of good symbol codes} \begin{ctheorem} {\sf Source coding theorem for symbol codes.} For an ensemble $X$ there exists a prefix code $C$ with expected length satisfying\indexs{extra bit} \beq H(X) \leq L(C,X) < H(X) + 1. \label{eq.source.coding.symbol} \eeq \label{th.source.coding.symbol} \end{ctheorem} \begin{prooflike}{Proof} We set the codelengths to integers slightly larger than the optimum lengths: \beq l_i = \lceil \log_2 (1/p_i) \rceil \eeq where $\lceil l^* \rceil$ denotes the smallest integer greater than or equal to $l^*$. [We are not asserting that the {\em optimal\/} code necessarily uses these lengths, we are simply choosing these lengths because we can use them to prove the theorem.] We check that there {\em is\/} a prefix code with these lengths by confirming that the \Kraft\ inequality is satisfied. \beq \sum_i 2^{-l_i} = \sum_i 2^{-\lceil \log_2 (1/p_i) \rceil} \leq \sum_i 2^{ -\log_2 (1/p_i) } = \sum_i p_i = 1 . \eeq Then we confirm \beq L(C,X) = \sum_i p_i \lceil \log (1/p_i) \rceil < \sum_i p_i ( \log (1/p_i) + 1 ) = H(X) + 1. \eeq % corrected < to = , 9802 % \end{prooflike} \subsection{The cost of using the wrong codelengths} If we use a code whose lengths are not equal to the optimal codelengths, the average message length will be larger than the entropy. %when we use the `wrong' code. If the true probabilities are $\{ p_i \}$ and we use a complete code with lengths $l_i$, % that satisfy the % \Kraft\ equality (that is, % the \Kraft\ inequality with equality), we can view those lengths as defining \ind{implicit probabilities} $q_i = 2^{-{l_i}}$. % l_i \eq \log 1/q_i$ such % that $\sum_i q_i \eq 1$, then Continuing from \eqref{eq.expected.length}, the average length is \beq L(C,X) = H(X)+\sum_i p_i \log p_i/q_i, \eeq \ie, it exceeds the entropy by the \ind{relative entropy} $D_{\rm KL}(\bp||\bq)$ (as defined on \pref{eq.KL}). 
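The bookkeeping in this subsection can be checked numerically. The following Python sketch is an illustration only; the function names are our own and are not part of the text. It computes the implicit probabilities $q_i = 2^{-l_i}/z$ of a code, its expected length, the entropy of the ensemble, and the relative entropy, and applies them to the fixed-length code $C_4$ used on the ensemble with probabilities $\{ \dhalf, \dquarter, \deighth, \deighth \}$.

```python
from math import log2

def implicit_probabilities(lengths):
    """Return (q, z) with q_i = 2**(-l_i) / z and z = sum_i 2**(-l_i)."""
    z = sum(2.0 ** -l for l in lengths)
    return [2.0 ** -l / z for l in lengths], z

def entropy(p):
    """Entropy H(X) in bits; outcomes with p_i = 0 contribute nothing."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

def expected_length(p, lengths):
    """Expected codeword length L(C,X) = sum_i p_i l_i."""
    return sum(pi * l for pi, l in zip(p, lengths))

def relative_entropy(p, q):
    """Relative entropy D_KL(p||q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true probabilities (ensemble of the C_3 example)
c4_lengths = [2, 2, 2, 2]       # codeword lengths of the fixed-length code C_4

q, z = implicit_probabilities(c4_lengths)
H = entropy(p)
L = expected_length(p, c4_lengths)
D = relative_entropy(p, q)
print("z =", z, " H =", H, " L =", L, " D_KL =", D)
```

For this ensemble $C_4$ is complete ($z = 1$), its implicit probabilities are all $\dquarter$, and the quarter of a bit by which $L(C_4,X) = 2$ exceeds $H(X) = 1.75$ is exactly $D_{\rm KL}(\bp||\bq)$.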
\section{Optimal source coding with symbol codes: Huffman coding} Given a set of probabilities $\P$, how can we design an optimal prefix code? For example, what is the best symbol code for the English language ensemble shown in \figref{fig.elfig}? \marginfig{\begin{center}\input{tex/_paz.tex}\end{center} \caption[a]{An ensemble in need of a symbol code.}\label{fig.elfig}} When we say `optimal', let's assume our aim is to minimize the expected length $L(C,X)$. \subsection{How not to do it} One might try to roughly split the set $\A_X$ in two, and continue bisecting the subsets so as to define a binary tree from the root. This construction has the right spirit, as in the weighing problem, % is how the {\em Shannon-Fano code\/} is constructed,\index{Shannon, Claude}\index{Fano} but it is not necessarily optimal; it achieves $L(C,X) \leq H(X) + 2$. % % find a reference for proof of this? % %{\em [Is Shannon-Fano % the correct name? According to Goldie and Pinch this has a different % meaning. Check.]} \subsection{The Huffman coding algorithm} We now present a beautifully simple algorithm for finding an optimal prefix code. \indexs{Huffman code}The trick is to construct the code {\em backwards\/} starting from the tails of the codewords; {\em we build the binary tree from its leaves}. \begin{algorithm}[h] \begin{framedalgorithmwithcaption}{\caption[a]{Huffman coding algorithm.}} \ben \item%[{\sf 1.}] Take the two least probable symbols in the alphabet. These two symbols will be given the longest codewords, which will have equal length, and differ only in the last digit. \item%[{\sf 2.}] Combine these two symbols into a single symbol, and repeat. \een \end{framedalgorithmwithcaption} \end{algorithm} Since each step reduces the size of the alphabet by one, this algorithm will have assigned strings to all the symbols after $|\A_X|-1$ steps. 
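The two steps of the algorithm can be sketched in a few lines of Python. This is an illustrative implementation rather than code from the text: the function name {\tt huffman\_code}, the use of a heap to find the two least probable symbols, and the counter used to break ties are all our own assumptions.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return {symbol: codeword} for a dict {symbol: probability}.

    Each heap entry is (probability, tiebreak, {symbol: partial codeword}).
    The tiebreak counter makes entries comparable when probabilities are equal.
    """
    if len(probs) == 1:                      # degenerate one-symbol alphabet
        return {next(iter(probs)): "0"}
    tiebreak = count()
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Step 1: take the two least probable (possibly merged) symbols.
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        # Their codewords differ in the digit added here; because we build
        # from the leaves, each new digit is prepended to the partial codeword.
        merged = {s: "0" + c for s, c in group0.items()}
        merged.update({s: "1" + c for s, c in group1.items()})
        # Step 2: combine them into a single symbol, and repeat.
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(code)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

With this particular tie-breaking rule the sketch reproduces the code $C_3$ for the ensemble $\{ \dhalf, \dquarter, \deighth, \deighth \}$. A different tie-breaking rule, or a different assignment of {\tt 0} and {\tt 1} at each merge, would give different codewords with the same expected length.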
\exampla{ % {\sf Example:} \begin{tabular}[t]{*{11}{@{\,}l}} Let \hspace{0.1in} & $\A_X$ &=&$\{$& {\tt a},&{\tt b},&{\tt c},&{\tt d},&{\tt e} &$\}$ \\ and \hspace{0.1in} & $\P_X$ &=&$\{$& 0.25, &0.25, & 0.2, & 0.15, & 0.15 & $\}$. \end{tabular} \begin{center} % \framebox{\psfig{figure=figs/huffman.ps,% %angle=-90}} \setlength{\unitlength}{0.015in}%was0125 \begin{picture}(200,95)(40,40) \put( 60,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}} \put( 60,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}} \put( 60,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.2}}} \put( 60,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.15}}} \put( 60,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.15}}} \put(100,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}} \put(100,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}} \put(100,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.2}}} \put(100,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.3}}} \put(140,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.25}}} \put(140,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.45}}} \put(140,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.3}}} \put(180,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.55}}} \put(180,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{0.45}}} \put(220,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{1.0}}} \put( 40,105){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt a}}} \put( 40,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt b}}} \put( 40,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt c}}} \put( 40,060){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt d}}} \put( 40,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt e}}} \put( 85,067){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}} \put( 85,045){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}} \put(125,097){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}} \put(125,075){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}} 
\put(165,112){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}} \put(165,065){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}} \put(205,112){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 0}}} \put(205,090){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\tt 1}}} \thinlines \put( 80,110){\line( 1, 0){ 15}} \put( 80,095){\line( 1, 0){ 15}} \put( 80,080){\line( 1, 0){ 15}} \put( 80,065){\line( 1, 0){ 15}} \put( 95,065){\line(-1,-1){ 15}} \put(120,110){\line( 1, 0){ 15}} \put(120,065){\line( 1, 0){ 15}} \put(120,095){\line( 1, 0){ 15}} \put(135,095){\line(-1,-1){ 15}} \put(160,095){\line( 1, 0){ 15}} \put(160,110){\line( 1, 0){ 15}} \put(175,110){\line(-1,-3){ 15}} \put(200,110){\line( 1, 0){ 15}} \put(215,110){\line(-1,-1){ 15}} \put( 40,125){\makebox(0,0)[bl]{\raisebox{0pt}[0pt][0pt]{$x$}}} \put( 85,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 1}}} \put(125,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 2}}} \put(165,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 3}}} \put(205,125){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{step 4}}} \end{picture} \end{center} The codewords are then obtained by concatenating the binary digits in reverse order: % Codewords $C = \{ {\tt{00}}, {\tt{10}} , {\tt{11}}, {\tt{010}}, {\tt{011}} \}$. \margintab{ \begin{center} \begin{tabular}{clrrl} \toprule $a_i$ & $p_i$ & \multicolumn{1}{c}{$h(p_i)$%$\log_2 \frac{1}{p_i}$} } & $l_i$ & $c(a_i)$ %{\rule[-3mm]{0pt}{8mm}}%strut \\ \midrule {\tt a} & 0.25 & 2.0 & 2 & {\tt 00} \\ {\tt b} & 0.25 & 2.0 & 2 & {\tt 10} \\ {\tt c} & 0.2 & 2.3 & 2 & {\tt 11} \\ {\tt d} & 0.15 & 2.7 & 3 & {\tt 010} \\ {\tt e} & 0.15 & 2.7 & 3 & {\tt 011} \\ \bottomrule \end{tabular} \end{center} \caption[a]{Code created by the Huffman algorithm.} \label{tab.huffman} } The codelengths selected by the Huffman algorithm (column 4 of \tabref{tab.huffman}) are in some cases longer and in some cases shorter than the ideal codelengths, the Shannon information contents $\log_2 \dfrac{1}{p_i}$ (column 3). 
The expected length of the code is $L=2.30$ bits, whereas the entropy is $H=2.2855$ bits.\ENDsolution } If at any point there is more than one way of selecting the two least probable symbols then the choice may be made in any manner -- the expected length of the code will not depend on the choice. \exercissxC{3}{ex.Huffmanconverse}{ % (Optional) Prove\index{Huffman code!`optimality'} that there is no better symbol code for a source than the Huffman code. } % \exampla{ We can make a Huffman code for the probability distribution over the alphabet introduced in \figref{fig.monogram}. The result is shown in \figref{fig.monogram.huffman}. This code has an expected length of 4.15 bits; the entropy of the ensemble is 4.11 bits. % It is interesting to notice how % some symbols, for example {\tt q}, receive codelengths that % differ by more than 1 bit from Observe the disparities between the assigned codelengths and the ideal codelengths $\log_2 \dfrac{1}{p_i}$. } %%%%%%%%%%%%%%%%%%%%%%%%% alphabet of english! 
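Returning to the five-symbol example of \tabref{tab.huffman}, the quoted figures are easy to check numerically. The short Python sketch below is an illustration with variable names of our own choosing. It evaluates the \Kraft\ sum, the expected length, the entropy, and the ideal codelengths for the probabilities and lengths listed in that table.

```python
from math import log2

p = [0.25, 0.25, 0.2, 0.15, 0.15]   # probabilities of a, b, c, d, e
lengths = [2, 2, 2, 3, 3]           # codelengths chosen by the Huffman algorithm

kraft_sum = sum(2.0 ** -l for l in lengths)          # equals 1 for a complete code
L = sum(pi * li for pi, li in zip(p, lengths))       # expected length, in bits
H = sum(pi * log2(1 / pi) for pi in p)               # entropy, in bits
ideal = [round(log2(1 / pi), 1) for pi in p]         # Shannon information contents

print(kraft_sum, round(L, 2), round(H, 4), ideal)
```

The \Kraft\ sum comes out equal to 1, so this Huffman code is complete, and the gap between $L$ and $H$ is about 0.015 bits per symbol.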
\begin{figure} \figuremargin{% \begin{center} \mbox{\small \begin{tabular}{clrrl} \toprule $a_i$ & $p_i$ & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} & $l_i$ & $c(a_i)$ %{\rule[-3mm]{0pt}{8mm}}%strut \\[0in] \midrule {\tt a}& 0.0575 & 4.1 & 4 & {\tt 0000 } \\ {\tt b}& 0.0128 & 6.3 & 6 & {\tt 001000 } \\ {\tt c}& 0.0263 & 5.2 & 5 & {\tt 00101 } \\ {\tt d}& 0.0285 & 5.1 & 5 & {\tt 10000 } \\ {\tt e}& 0.0913 & 3.5 & 4 & {\tt 1100 } \\ {\tt f}& 0.0173 & 5.9 & 6 & {\tt 111000 } \\ {\tt g}& 0.0133 & 6.2 & 6 & {\tt 001001 } \\ {\tt h}& 0.0313 & 5.0 & 5 & {\tt 10001 } \\ {\tt i}& 0.0599 & 4.1 & 4 & {\tt 1001 } \\ {\tt j}& 0.0006 & 10.7 & 10 & {\tt 1101000000 } \\ {\tt k}& 0.0084 & 6.9 & 7 & {\tt 1010000 } \\ {\tt l}& 0.0335 & 4.9 & 5 & {\tt 11101 } \\ {\tt m}& 0.0235 & 5.4 & 6 & {\tt 110101 } \\ {\tt n}& 0.0596 & 4.1 & 4 & {\tt 0001 } \\ {\tt o}& 0.0689 & 3.9 & 4 & {\tt 1011 } \\ {\tt p}& 0.0192 & 5.7 & 6 & {\tt 111001 } \\ {\tt q}& 0.0008 & 10.3 & 9 & {\tt 110100001 } \\ {\tt r}& 0.0508 & 4.3 & 5 & {\tt 11011 } \\ {\tt s}& 0.0567 & 4.1 & 4 & {\tt 0011 } \\ {\tt t}& 0.0706 & 3.8 & 4 & {\tt 1111 } \\ {\tt u}& 0.0334 & 4.9 & 5 & {\tt 10101 } \\ {\tt v}& 0.0069 & 7.2 & 8 & {\tt 11010001 } \\ {\tt w}& 0.0119 & 6.4 & 7 & {\tt 1101001 } \\ {\tt x}& 0.0073 & 7.1 & 7 & {\tt 1010001 } \\ {\tt y}& 0.0164 & 5.9 & 6 & {\tt 101001 } \\ {\tt z}& 0.0007 & 10.4 & 10 & {\tt 1101000001 } \\ {--}& 0.1928 & 2.4 & 2 & {\tt 01 } \\ \bottomrule %{\verb+-+}& 0.1928 & 2.4 & 2 & {\tt 01 } \\ \bottomrule \end{tabular} \hspace*{0.5in}\raisebox{-2in}{\psfig{figure=tex/sortedtree.eps,width=1.972in}} } \end{center} }{% \caption[a]{Huffman code for the English language ensemble (monogram statistics).} % introduced in \protect\figref{fig.monogram}.} \label{fig.monogram.huffman} }% \end{figure} % see \cite[p. 97]{Cover&Thomas} % \medskip \subsection{Constructing a binary tree top-down is suboptimal} In previous chapters we studied weighing problems in which we built ternary or binary trees. 
We noticed that balanced trees -- ones in which, at every step, the two possible outcomes were as close as possible to equiprobable -- appeared to describe the most efficient experiments. This gave an intuitive motivation for entropy as a measure of information content. It is not the case, however, that optimal codes can {\em always\/} be constructed by a greedy top-down method in which the alphabet is successively divided into subsets that are as near as possible to equiprobable. % /home/mackay/itp/huffman> huffman.p latex=1 < fiftywrong3 \exampla{ Find the optimal binary symbol code for the ensemble: \beq \begin{array}{*{3}{@{\,}c@{\,}}*{6}{c@{\,}}*{2}{@{\,}c}} \A_X & = & \{ & {\tt a}, & {\tt b}, & {\tt c}, & {\tt d}, & {\tt e}, & {\tt f}, & {\tt g} & \} \\ \P_X & = & \{ & 0.01, & 0.24, & 0.05, & 0.20, & 0.47, & 0.01, & 0.02 & \} \\ \end{array} . \eeq Notice that a greedy top-down method can split this set into two % equiprobable subsets $\{ {\tt a},{\tt b},{\tt c},{\tt d} \}$ and $\{{\tt e},{\tt f},{\tt g}\}$ which both have probability $1/2$, and that $\{ {\tt a},{\tt b},{\tt c},{\tt d} \}$ can be divided into % equiprobable subsets $\{ {\tt a},{\tt b} \}$ and $\{{\tt c},{\tt d}\}$, which have probability $1/4$; so a greedy top-down method gives the code shown in the third column of \tabref{tab.greed},\margintab{ \begin{center}\small \begin{tabular}{clll} \toprule $a_i$ & $p_i$ & Greedy & Huffman \\[0in] \midrule {\tt a} & .01 & {\tt 000} & {\tt 000000} \\ {\tt b} & .24 & {\tt 001} & {\tt 01} \\ {\tt c} & .05 & {\tt 010} & {\tt 0001} \\ {\tt d} & .20 & {\tt 011} & {\tt 001} \\ {\tt e} & .47 & {\tt 10} & {\tt 1} \\ {\tt f} & .01 & {\tt 110} & {\tt 000001} \\ {\tt g} & .02 & {\tt 111} & {\tt 00001} \\ \bottomrule \end{tabular} \end{center} \caption[a]{A greedily-constructed code compared with the Huffman code.} \label{tab.greed} } which has expected length 2.53. 
The Huffman coding algorithm yields the code shown in the fourth column, %\begin{center} %\begin{tabular}{clrrl} \toprule %$a_i$ & $p_i$ & \multicolumn{1}{c}{$\log_2 \frac{1}{p_i}$} & $l_i$ & $c(a_i)$ %%{\rule[-3mm]{0pt}{8mm}}%strut %\\[0in] \midrule %{\tt a} & 0.01 & 6.6 & 6 & {\tt 000000} \\ %{\tt b} & 0.24 & 2.1 & 2 & {\tt 01} \\ %{\tt c} & 0.05 & 4.3 & 4 & {\tt 0001} \\ %{\tt d} & 0.20 & 2.3 & 3 & {\tt 001} \\ %{\tt e} & 0.47 & 1.1 & 1 & {\tt 1} \\ %{\tt f} & 0.01 & 6.6 & 6 & {\tt 000001} \\ %{\tt g} & 0.02 & 5.6 & 5 & {\tt 00001} \\ % \bottomrule %\end{tabular} %\end{center} which has expected length 1.97.\ENDsolution % entropy 1.9323 % } %\subsection{Twenty questions} % The Huffman algorithm defines the optimal way to % play `twenty questions'. % % {\em [MORE HERE]} \section{Disadvantages of the Huffman code} \label{sec.huffman.probs} The Huffman\index{Huffman code!disadvantages}\index{symbol code!disadvantages} algorithm produces an optimal symbol code for an ensemble, but this is not the end of the story. Both the word `ensemble' and the phrase `symbol code' need careful attention. %\begin{description} %\item[Changing ensemble.] \subsection{Changing ensemble} If we wish to communicate a sequence of outcomes from one unchanging ensemble, then a Huffman code may be convenient. But often the appropriate ensemble changes. If for example we are compressing text, then the symbol frequencies will vary with context: in English the letter {\tt{u}} is much more probable after a {\tt{q}} than after an {\tt{e}} (\figref{fig.conbigrams}). And furthermore, our knowledge of these context-dependent symbol frequencies will also change as we learn % accumulate statistics on the statistical properties of the text source.\index{adaptive models} % So our probabilities should change Huffman codes do not handle changing ensemble probabilities with any elegance. One brute-force approach would be to recompute the Huffman code every time the probability over symbols changes. 
Another attitude is to deny the option of adaptation, and instead run through the entire file in advance and compute a good probability distribution, which will then remain fixed throughout transmission. The code itself must also be communicated in this scenario. Such a technique is not only cumbersome and restrictive, it is also suboptimal, since the initial message specifying the code and the document itself are partially redundant. % -- knowing the algorithm that % defines the code for a given document, one can deduce what the % initial header has to be from the . This technique therefore wastes bits. % flag this: % could discuss bits back here % \subsection{The extra bit} %item[The extra bit.] An equally serious problem with Huffman codes is the innocuous-looking `\ind{extra bit}' relative to the ideal average length of $H(X)$ -- a Huffman code achieves a length that satisfies $H(X) \leq L(C,X) < H(X) + 1,$ as proved in theorem \ref{th.source.coding.symbol}. %\eqref{eq.source.coding.symbol}). A Huffman code thus incurs an overhead of between 0 and 1 bits per symbol. If $H(X)$ were large, then this overhead would be an unimportant fractional increase. But for many applications, the entropy may be as low as one bit per symbol, or even smaller, so the overhead %`$+1$' $L(C,X)- H(X)$ may dominate the encoded file length. Consider English text: in some contexts, long strings of characters may be highly predictable. % , as we saw in the guessing game of chapter \chtwo. % given a simple model of the language. For example, in the context `{\verb+strings_of_ch+}', one might predict the next nine symbols to be `{\verb+aracters_+}' with a probability of 0.99 each. A traditional Huffman code would be obliged to use at least one bit per character, making a total cost of nine bits where virtually no information is being conveyed (0.13 bits in total, to be precise). 
The entropy of English, given a good model, is about one bit per character \cite{Shannon48}, so a Huffman code is likely to be highly % nearly 100\% inefficient. A traditional patch-up of Huffman codes uses them to compress {\dem blocks\/} of symbols, for example the `extended sources' $X^N$ we discussed in \chref{ch.two}. % \ref{ch2} % rather than defining a code for single symbols. The overhead per block is at most 1 bit, so the overhead per symbol % goes down as is at most $1/N$ bits. For sufficiently large blocks, the problem of the extra bit may be removed -- but only at the expense of (a) losing the elegant instantaneous decodeability of simple Huffman coding; and (b) having to compute the probabilities of all relevant strings and build the associated Huffman tree. One will end up explicitly computing the probabilities and codes for a huge number of strings, most of which will never actually occur. (See \exerciseref{ex.Huff99}.) % A further problem is that it may not be appropriate to model % successive symbols as coming independently from a single ensemble % $X$. As we already asserted, any decent model for text will % assign a probability over symbols that depends on the context. % A changing probability distribution over symbols is % not incompatible with the construction of Huffman codes for % blocks of symbols. One could consider each possible sequence, % computing the relevant probability distributions along the way % to evaluate the probability of the entire sequence, then build % a Huffman tree for the sequences. One could account for % dependences between blocks as well, if one were willing to % use a different Huffman code each time. But this modified % encoder would be % computationally expensive, since for large block sizes an % exponentially large number of possible sequences would have % to be considered along with their adaptive probabilities. %% is context-dependent. 
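
To put rough numbers on these two costs (an illustrative calculation, assuming the 27-character alphabet of \figref{fig.monogram.huffman}): nine symbols that are each predicted with probability 0.99 have a total information content of
\beq
 9 \log_2 \frac{1}{0.99} \simeq 9 \times 0.0145 = 0.13 \mbox{ bits},
\eeq
whereas a symbol code must spend at least 9 bits on them. If instead we code blocks of $N$ characters, the Huffman tree must have one leaf for every possible block, that is,
\beq
 27^N \mbox{ leaves; for example, } 27^5 = 14\,348\,907 \mbox{ when } N=5 ,
\eeq
in return for an overhead that is still as large as $1/5$ of a bit per character in the worst case.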
% \end{description} % \medskip \subsection{Beyond symbol codes} % Huffman codes, therefore, although widely trumpeted as `optimal', have many defects for practical purposes.\index{Huffman code!`optimality'} They {\em are\/} optimal {\em symbol\/} codes, but for practical purposes {\em we don't want a symbol code}. The defects of Huffman codes are rectified by {\dem arithmetic coding},\index{arithmetic coding} which dispenses with the restriction that each symbol must translate into an integer number of bits. Arithmetic coding is the main topic of the next chapter. % is not a symbol coding. This % we will discuss next. % In an arithmetic code, the probabilistic modelling is clearly % separated from the encoding operation. \section{Summary} \begin{description} \item[Kraft inequality\puncspace] If a code is {\dbf uniquely decodeable} its lengths must satisfy \beq \sum_i 2^{-l_i } \leq 1 . \eeq For any lengths satisfying the Kraft inequality, there exists a prefix code with those lengths. \item[Optimal source codelengths for an ensemble] are equal to the Shannon information contents\index{source code!optimal lengths}\index{source code!implicit probabilities} \beq l_i = \log_2 \frac{1}{p_i} , \eeq and conversely, any choice of codelengths defines {\dbf\ind{implicit probabilities}} \beq q_i = \frac{2^{-l_i}}{z} . \eeq \item[The \ind{relative entropy}] $D_{\rm KL}(\bp||\bq)$ measures how many bits per symbol are wasted by using a % mismatched code whose implicit probabilities are $\bq$, when the ensemble's true probability distribution is $\bp$. \item[Source coding theorem for symbol codes\puncspace] For an ensemble $X$, there exists a prefix code whose expected length satisfies \beq H(X) \leq L(C,X) < H(X) + 1 . \eeq % The expected length is only equal to the entropy if the \item[The Huffman coding algorithm] generates an optimal symbol code iteratively. At each iteration, the two least probable symbols are combined. 
\end{description} \section{Exercises} \exercisaxB{2}{ex.Cnud}{ Is the code $\{ {\tt 00}, {\tt 11}, {\tt 0101}, {\tt 111}, {\tt 1010}, {\tt 100100}, {\tt 0110} \}$ % $\{ 00,11,0101,111,1010,100100,0110 \}$ uniquely decodeable? } \exercisaxB{2}{ex.Ctern}{ Is the ternary code $\{ {\tt 00},{\tt 012},{\tt 0110},{\tt 0112},{\tt 100},{\tt 201},{\tt 212},{\tt 22} \}$ uniquely decodeable? } \exercissxA{3}{ex.HuffX2X3}{ Make Huffman codes for $X^2$, $X^3$ and $X^4$ where ${\cal A}_X = \{ 0,1 \}$ and ${\cal P}_X = \{ 0.9,0.1 \}$. Compute their expected lengths and compare them with the entropies $H(X^2)$, $H(X^3)$ and $H(X^4)$. Repeat this exercise for $X^2$ and $X^4$ where ${\cal P}_X = \{ 0.6,0.4 \}$. } \exercissxA{2}{ex.Huffambig}{ Find a probability distribution $\{ p_1,p_2,p_3,p_4 \}$ such that there are {\em two\/} optimal codes that assign different lengths $\{ l_i \}$ to the four symbols. } \exercisaxC{3}{ex.Huffambigb}{ (Continuation of \exerciseonlyref{ex.Huffambig}.) Assume that the four probabilities $\{ p_1,p_2,p_3,p_4 \}$ are ordered such that $p_1 \geq p_2 \geq p_3 \geq p_4 \geq 0$. Let $\cal Q$ be the set of all probability vectors $\bp$ such that there are {\em two\/} optimal codes with different lengths. Give a complete description of $\cal Q$. Find three probability vectors $\bq^{(1)}$, $\bq^{(2)}$, $\bq^{(3)}$, which are the \ind{convex hull} of $\cal Q$, \ie, such that any $\bp \in \cal Q$ can be written as \beq \bp = \mu_1 \bq^{(1)} + \mu_2 \bq^{(2)} +\mu_3 \bq^{(3)} , \eeq where $\{\mu_i\}$ are positive. } \exercisaxB{1}{ex.twenty.questions}{ Write a short essay discussing how to play the game of {\sf{\ind{twenty questions}}} optimally. [In twenty questions, one player thinks of an object, and the other player has to guess the object using as few binary questions as possible, preferably fewer than twenty.] 
} \exercisaxB{2}{ex.powertwogood}{ Show that, if each probability $p_i$ is equal to an integer power of 2 then there exists a source code whose expected length equals the entropy. } \exercissxB{2}{ex.make.huffman.suck}{ Make ensembles for which the difference between the entropy and the expected length of the Huffman code is as big as possible. }% 14. Gallager, R. G., "Variations on a Theme by Huffman", % IEEE Trans. on Information Theory, Vol. IT-24, No. 6, Nov. 1978, pp. 668-674. % %\exercisxB{2}{ex.huffman.biggerhalf}{ % If one of the probabilities $p_m$ is greater than $1/2$, how % big must the difference between the expected length and the entropy be? % Sketch a graph the %} % from {tex/huffmanI.tex} \exercissxB{2}{ex.huffman.uniform} { % from 02q.tex on rum A source $X$ has an alphabet of eleven characters $$\{ {\tt{a}} , {\tt{b}} , {\tt{c}} , {\tt{d}} , {\tt{e}} , {\tt{f}} , {\tt{g}} , {\tt{h}} , {\tt{i}} , {\tt{j}} , {\tt{k}} \},$$ all of which have equal probability, $1/11$. % State the meaning of the ideal codelengths Find an {optimal uniquely decodeable symbol code} for this source. How much greater is the expected length of this optimal code than the entropy of $X$? } \exercisaxB{2}{ex.huffman.uniform2}{ Consider the optimal symbol code for an ensemble $X$ with alphabet size $I$ from which all symbols have identical probability $p = 1/I$. $I$ is not a power of 2. Show that the fraction $f^+$ of the $I$ symbols that are assigned codelengths equal to \beq l^+ \equiv \lceil \log_2 I \rceil \eeq satisfies \beq f^+ = 2 - \frac{2^{l^+}}{I} \label{eq.HIf} \eeq and that the expected length of the optimal symbol code is \beq L = l^+ -1 + f^+ . \label{eq.HIL} \eeq By differentiating the excess length %\beq $ \Delta L \equiv L - H(X)$ %\eeq with respect to $I$, show that the excess length is bounded by \beq \Delta L \leq 1 - \frac{ \ln ( \ln 2 )}{ \ln 2} -\frac{ 1 }{ \ln 2} = 0.086 . 
\eeq } \exercisaxA{2}{ex.Huff99}{ Consider a sparse binary source with ${\cal P}_X = \{ 0.99 , 0.01 \}$. Discuss how Huffman codes could be used to compress this source {\em efficiently}.\index{Huffman code} Estimate how many codewords your proposed solutions require. % The entropy - hint: could think about run length encoding? % } \exercisaxB{2}{ex.poisonglass}{ % p.111 martin gardner mathematical carnival{Gardner:Carnival} {\em Scientific American\/} carried the following puzzle\index{poisoned glass}\index{puzzle!poisoned glass} in 1975. % roughly! \begin{description} \item[The poisoned glass\puncspace]% This should be \exercisetitlestyle ? `Mathematicians are curious birds', the police commissioner said to his wife. `You see, we had all those partly filled glasses lined up in rows on a table in the hotel kitchen. Only one contained poison, and we wanted to know which one before searching that glass for fingerprints. Our lab could test the liquid in each glass, but the tests take time and money, so we wanted to make as few of them as possible by simultaneously testing mixtures of small samples from groups of glasses. The university sent over a mathematics professor to help us. He counted the glasses, smiled and said: `$\,$``Pick any glass you want, Commissioner. We'll test it first.'' `$\,$``But won't that waste a test?'' I asked. `$\,$``No,'' he said, ``it's part of the best procedure. We can test one glass first. It doesn't matter which one.''$\,$' `How many glasses were there to start with?' the commissioner's wife asked. `I don't remember. Somewhere between 100 and 200.' What was the exact number of glasses? \end{description}% \cite{Gardner:Carnival} Solve this puzzle and then explain why the professor was in fact wrong and the commissioner was right. What is in fact the optimal procedure for identifying the one poisoned glass? What is the expected waste relative to this optimum if one followed the professor's strategy? 
Explain the relationship to symbol coding. } % could get worked up over the all zero codeword, which corresponds to % a possible non-detection; if this would require an extra test % then presumably the story is a bit different, with some deliberate % skewing of the tree to make it more likely that we get a positive %result along the way. \exercissxA{2}{ex.optimalcodep1}{% problem fixed Tue 12/12/00 Assume that a sequence of symbols from the ensemble $X$ introduced at the beginning of this chapter is compressed using the code $C_3$. \amarginfignocaption{t}{ \begin{center} $C_3$:\\[0.1in] \begin{tabular}{cllcc} \toprule $a_i$ & $c(a_i)$ & $p_i$ & \multicolumn{1}{c}{$h({p_i})$} & $l_i$ % {\rule[-3mm]{0pt}{8mm}}%strut \\ \midrule {\tt a} & {\tt 0} & \dhalf & 1.0 & 1 \\ {\tt b} & {\tt 10} & \dquarter & 2.0 & 2 \\ {\tt c} & {\tt 110} & \deighth & 3.0 & 3 \\ {\tt d} & {\tt 111} & \deighth & 3.0 & 3 \\ \bottomrule \end{tabular} \end{center} } Imagine picking one bit at random from the binary encoded sequence $\bc = c(x_1)c(x_2)c(x_3)\ldots$ . What is the probability that this bit is a 1? } \exercissxB{2}{ex.Huffmanqary}{ % (Optional) How should the\index{Huffman code!general alphabet} binary Huffman encoding scheme be modified to make optimal symbol codes in an encoding alphabet with $q$ symbols? (Also known as `\ind{radix} $q$'.) } % answer, Hamming p.73: % add enough states with probability zero to make the total % number of states equal to $k(q-1)+1$, for some integer $k$. % then repeatedly combine $q$ into 1 % \end{document} % % \item[A code $C(X)$ is {\em non-singular\/}] if every element of $\A_X$ % maps into a different string, \ie, % \beq % a_i \not = a_j \Rightarrow c(a_i) \not = c(a_j). % \eeq % % \item[The extension $C^+$ of a code $C$] is a mapping from finite length % strings of $\A_X$ to $\{0,1\}^+$ % % finite length strings of NAME? 
% defined by the concatentation: % \beq % c(x_1 x_2 \ldots x_N) = c(x_1)c(x_2)\ldots c(x_N) % \eeq % % \item[A code is uniquely decodeable] if its extension is non-singular. % \subsection*{Mixture codes} It is a tempting idea to construct a `\ind{metacode}' from several symbol codes that assign different-length codewords to the alternative symbols, then switch from one code to another, choosing whichever assigns the shortest codeword to the current symbol. Clearly we cannot do this for free.\index{bits back} If one wishes to choose between two codes, then it is necessary to lengthen the message in a way that indicates which of the two codes is being used. If we indicate this choice by a single leading bit, it will be found that the resulting code is suboptimal because it is incomplete (that is, it fails the Kraft equality). \exercissxA{3}{ex.mixsubopt}{ Prove that this metacode is incomplete, and explain why this combined code is suboptimal. } % % need more on prefix property to make clear how strings are decodeable, % self-punctuating. \dvips \section{Solutions}% to Chapter \protect\ref{ch3}'s exercises} \fakesection{solns 3} \soln{ex.C1101}{ Yes, $C_2 = \{ {\tt{1}} , {\tt{1}}{\tt{0}}{\tt{1}} \}$ % $C_2 = \{ 1 , 101 \}$ is uniquely decodeable, even though it is not a prefix code, because no two different strings can map onto the same string; only the codeword $c(a_2)={\tt 101}$ contains the symbol {\tt0}. } \soln{ex.KIconverse}{ We wish to prove that for any set of codeword lengths $\{ l_i \}$ satisfying the \Kraft\ inequality, there is a prefix code having those lengths. % % Symbol Coding Budget -- cut this figure later, it is already in _l3 % \begin{figure}[htbp] \figuremargin{% \begin{center} \mbox{\psfig{figure=figs/budget1.eps,height=3in}\ \psfig{figure=figs/budgetmax.eps,height=3in}} \end{center} }{% \caption[a]{The codeword supermarket and the symbol coding budget. 
The `cost' $2^{-l}$ of each codeword (with length $l$) is indicated by the size of the box it is written in. The total budget available when making a uniquely decodeable code is 1.} \label{fig.budget1a} }% \end{figure} This is readily proved by thinking of the codewords illustrated in \figref{fig.budget1a} as being in a `codeword supermarket', with size indicating cost. We imagine purchasing\index{source code!supermarket}\index{supermarket (codewords)} codewords one at a time, starting from the shortest codewords (\ie, the biggest purchases), using the budget shown at the right of \figref{fig.budget1a}. We start at one side of the codeword supermarket, say the top, and purchase the first codeword of the required length. We advance down the supermarket a distance $2^{-l}$, and purchase the next codeword of the next required length, and so forth. Because the codeword lengths are getting longer, and the corresponding intervals are getting shorter, we can always buy an adjacent codeword to the latest purchase, so there is no wasting of the budget. Thus at the $I$th codeword we have advanced a distance $\sum_{i=1}^{I} 2^{-l_i}$ down the supermarket; if $\sum 2^{-l_i} \leq 1$, we will have purchased all the codewords without running out of budget. } \soln{ex.Huffmanconverse}{ The proof that Huffman coding is optimal depends on proving that the key step in the algorithm -- the decision to give % combination of the two symbols with smallest probability equal encoded lengths -- cannot lead to a larger expected length than any other code. We can prove this by contradiction. Assume that the two symbols with smallest probability, called $a$ and $b$, to which the Huffman algorithm would assign equal length codewords, do {\em not\/} have equal lengths in {\em any\/} optimal symbol code. The optimal symbol code is some other rival code in which these two codewords have unequal lengths $l_a$ and $l_b$ with $l_a < l_b$. 
Without loss of generality we can assume that this other code is a complete prefix code, because any codelengths of a uniquely decodeable code can be realized by a prefix code. % We now consider transforming the other code into a new code % in which we interchange \ldots In this rival code, there must be some other symbol $c$ whose probability $p_c$ is greater than $p_a$ and whose length in the rival code is greater than or equal to $l_b$, because the code for $b$ must have an adjacent codeword of equal or greater length -- a complete prefix code never has a solo codeword of the maximum length. \begin{figure}%[htbp] \figuremargin{% \begin{tabular}{llllll} \toprule % \hline symbol & \multicolumn{2}{c}{probability} & Huffman & Rival code's & Modified rival \\ & & & codewords & codewords & code \\ \midrule % [0.1in]\hline $a$ & $p_a$ & \framebox[0.15in]{} & \framebox[1.50cm]{$c_{\rm H}(a)$} & \framebox[1.0cm]{$c_{\rm R}(a)$} & \framebox[1.6cm]{$c_{\rm R}(c)$} \\[0.1in] $b$ & $p_b$ & \framebox[0.1in]{} & \framebox[1.50cm]{$c_{\rm H}(b)$} & \framebox[1.5cm]{$c_{\rm R}(b)$} & \framebox[1.5cm]{$c_{\rm R}(b)$} \\[0.1in] $c$ & $p_c$ & \framebox[0.25in]{} & \framebox[0.95cm]{$c_{\rm H}(c)$} & \framebox[1.6cm]{$c_{\rm R}(c)$} & \framebox[1.0cm]{$c_{\rm R}(a)$} \\ \bottomrule % [0.1in] \hline \end{tabular} }{% \caption[a]{Proof that Huffman coding makes an optimal symbol code. % The proof works by contradiction. We assume that the rival code, which is said to be optimal, assigns {\em unequal\/} length codewords to the two symbols with smallest probability, $a$ and $b$. By interchanging codewords $a$ and $c$ of the rival code, where $c$ is a symbol with rival codelength as long as $b$'s, we can make a code better than the rival code. This shows that the rival code was not optimal. 
} \label{fig.huffman.optimal} }% \end{figure} Consider exchanging the codewords of $a$ and $c$ (\figref{fig.huffman.optimal}), so that $a$ is encoded with the longer codeword that was $c$'s, and $c$, which is more probable than $a$, gets the shorter codeword. Clearly this reduces the expected length of the code. The change in expected length is $(p_a-p_c)(l_c-l_a)$. Thus we have contradicted the assumption that the rival code is optimal. Therefore it is valid to give the two symbols with smallest probability equal encoded lengths. Huffman coding produces optimal symbol codes.\ENDsolution } %\soln{ex.Cnud}{ %\soln{ex.Ctern}{ \soln{ex.HuffX2X3}{ A Huffman code for $X^2$ where ${\cal A}_X = \{ {\tt 0},{\tt 1} \}$ and ${\cal P}_X = \{ 0.9,0.1 \}$ is $\{{\tt 00},{\tt 01},{\tt 10},{\tt 11}\} \rightarrow \{{\tt 1},{\tt 01},{\tt 000},{\tt 001}\}$. This code has $L(C,X^2) = 1.29$, whereas the entropy $H(X^2)$ is 0.938. A Huffman code for $X^3$ is \[ \begin{array}{c} \{{\tt 000},{\tt 100},{\tt 010},{\tt 001},{\tt 101},{\tt 011},{\tt 110},{\tt 111}\} \rightarrow\\ \hspace*{1in} \{{\tt 1},{\tt 011},{\tt 010},{\tt 001}, {\tt 00000},{\tt 00001},{\tt 00010},{\tt 00011}\}. \end{array} \] % corrected from 1.472 to 1.598 % 9802 This has expected length $L(C,X^3) = 1.598$ whereas the entropy $H(X^3)$ is 1.4069. A Huffman code for $X^4$ maps the sixteen source strings to the following codelengths: \[ \begin{array}{c} \{ {\tt 0000},{\tt 1000},{\tt 0100},{\tt 0010},{\tt 0001},{\tt 1100},{\tt 0110},{\tt 0011},{\tt 0101}, {\tt 1010},{\tt 1001},{\tt 1110},{\tt 1101}, \\ {\tt 1011},{\tt 0111},{\tt 1111} \} \rightarrow \:\: \{ 1,3,3,3,4,6,7,7,7,7,7,9,9,9,10,10 \}. % 10,10,9,9,9,7,7,7,7,7,6,4,3, 3,3,1\}. \end{array} \] This has expected length $L(C,X^4) = 1.9702$ whereas the entropy $H(X^4)$ is 1.876. % % 0.6,0.4 When ${\cal P}_X = \{ 0.6,0.4 \}$, the Huffman code for $X^2$ has lengths $\{ 2,2,2,2 \}$; the expected length is 2 bits, and the entropy is 1.94 bits. 
A Huffman code for $X^4$ is shown in \tabref{fig.X4huff2}. % , has lengths % $\{0000,1000,0100,0010,0001,1100,0110,0011,0101,1010,1001,1110,1101,1011,0111,1111\} \rightarrow$ % $\{3,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5\}$. The expected length is 3.92 bits, and the entropy is 3.88 bits. % see tmp3 for soln using huffman.p % $\{0000,1000,0100,0010,0001,1100,0110,0011,0101,1010,1001,1110,1101,1011,0111,1111\} \rightarrow \{5,5,5,5,4,4,4,4,4,4,4,4,4,4,3,3\}$. } % see tmp3 for use of huffman.p %\begin{figure} %\figuremargin{% \margintab{\footnotesize \begin{center} \begin{tabular}{clrl} \toprule % \hline $a_i$ & $p_i$ & % \multicolumn{1}{c}{$h({p_i})$} & $l_i$ & $c(a_i)$ % {\rule[-3mm]{0pt}{8mm}}%strut % \\[0.1in] \hline \\ \midrule {\tt 0000} & 0.1296 & 3 & {\tt 000 }\\ {\tt 0001} & 0.0864 & 4 & {\tt 0100 }\\ {\tt 0010} & 0.0864 & 4 & {\tt 0110 }\\ {\tt 0100} & 0.0864 & 4 & {\tt 0111 }\\ {\tt 1000} & 0.0864 & 3 & {\tt 100 }\\ {\tt 1100} & 0.0576 & 4 & {\tt 1010 }\\ {\tt 1010} & 0.0576 & 4 & {\tt 1100 }\\ {\tt 1001} & 0.0576 & 4 & {\tt 1101 }\\ {\tt 0110} & 0.0576 & 4 & {\tt 1110 }\\ {\tt 0101} & 0.0576 & 4 & {\tt 1111 }\\ {\tt 0011} & 0.0576 & 4 & {\tt 0010 }\\ {\tt 1110} & 0.0384 & 5 & {\tt 00110 }\\ {\tt 1101} & 0.0384 & 5 & {\tt 01010 }\\ {\tt 1011} & 0.0384 & 5 & {\tt 01011 }\\ {\tt 0111} & 0.0384 & 4 & {\tt 1011 }\\ {\tt 1111} & 0.0256 & 5 & {\tt 00111 }\\ \bottomrule %\hline %expected length 3.9248 %entropy 3.8838 \end{tabular} \end{center} %}{% \caption[a]{Huffman code for $X^4$ when $p_0=0.6$. Column 3 shows the assigned codelengths and column 4 the codewords. 
Some strings whose probabilities are identical, \eg, the fourth and fifth, receive different codelengths.} \label{fig.X4huff2} }% %\end{figure} \soln{ex.Huffambig}{ The set of probabilities $\{ p_1,p_2,p_3,p_4 \} = \{ \dsixth,\dsixth,\dthird,\dthird\}$ gives rise to two different optimal sets of codelengths, because at the second step of the Huffman coding algorithm we can choose any of the three possible pairings. We may either put them in a constant length code $\{ {\tt00},{\tt01},{\tt10},{\tt11} \}$ or the code $\{ {\tt000},{\tt001},{\tt01},{\tt1} \}$. Both codes have expected length 2. Another solution is $\{ p_1,p_2,p_3,p_4 \}$ $=$ $\{ \dfifth,\dfifth,\dfifth,\dtwofifth\}$. % =$ $\{ 0.2 , 0.2 , 0.2 , 0.4 \} $. And a third is $\{ p_1,p_2,p_3,p_4 \} = \{ \dthird,\dthird,\dthird,0\}$. } \soln{ex.make.huffman.suck}{ Let $p_{\max}$ be the largest probability in $p_1,p_2,\ldots,p_I$. The difference between the expected length $L$ and the entropy $H$ can be no bigger than $\max ( p_{\max} , 0.086 )$ \cite{Gallager78}. % See exercises \ref{ex.huffman.uniform}--\ref{ex.huffman.uniform2} to understand where the curious 0.086 comes from. } \soln{ex.huffman.uniform}{ % removed to cutsolutions.tex Length $-$ entropy = 0.086. %length / entropy 1.0249 } % \soln{ex.Huff99}{ % BORDERLINE \soln{ex.optimalcodep1}{% problem fixed Tue 12/12/00 There are two ways to answer this problem correctly, and one popular way to answer it incorrectly. Let's give the incorrect answer first: \begin{description} \item[Erroneous answer\puncspace] ``We can pick a random bit by first picking a random source symbol $x_i$ with probability $p_i$, then picking a random bit from $c(x_i)$. 
If we define $f_i$ to be the fraction of the bits of $c(x_i)$ that are {\tt 1}s, we find \marginpar[b]{\small \begin{center} $C_3$: \begin{tabular}{cllc} \toprule $a_i$ & $c(a_i)$ & $p_i$ & $l_i$ \\ \midrule {\tt a} & {\tt 0} & \dhalf & 1 \\ {\tt b} & {\tt 10} & \dquarter & 2 \\ {\tt c} & {\tt 110} & \deighth & 3 \\ {\tt d} & {\tt 111} & \deighth & 3 \\ \bottomrule \end{tabular} \end{center} } \beqan \!\!\!\!\!\!\!\!\!\! P(\mbox{bit is {\tt 1}}) &=& \sum_i p_i f_i \label{eq.wrongp1} \\ &=& \dfrac{1}{2} \times 0 + \dfrac{1}{4} \times \dfrac{1}{2} + \dfrac{1}{8} \times \dfrac{2}{3} + \dfrac{1}{8} \times 1 = \dthird \mbox{.''} \eeqan \end{description} This answer is wrong because it falls for the \index{bus-stop paradox}{bus-stop fallacy},\index{paradox} which was introduced in \exerciseref{ex.waitbus}: if buses arrive at random, and we are interested in `the average time from one bus until the next', we must distinguish two possible averages: (a) the average time from a randomly chosen bus until the next; (b) the average time between the bus you just missed and the next bus. The second `average' is twice as big as the first because, by waiting for a bus at a random time, you bias your selection of a bus in favour of buses that follow a large gap. You're unlikely to catch a bus that comes 10 seconds after a preceding bus! Similarly, the symbols {\tt c} and {\tt d} get encoded into longer-length binary strings than {\tt a}, so when we pick a bit from the compressed string at random, we are more likely to land in a bit belonging to a {\tt c} or a {\tt d} than would be given by the probabilities $p_i$ in the expectation (\ref{eq.wrongp1}). All the probabilities need to be scaled up by $l_i$, and renormalized. \begin{description} \item[Correct answer in the same style\puncspace] Every time symbol $x_i$ is encoded, $l_i$ bits are added to the binary string, of which $f_i l_i$ are {\tt 1}s. 
The expected number of {\tt 1}s added per symbol is \beq \sum_i p_i f_i l_i ; \eeq and the expected total number of bits added per symbol is \beq \sum_i p_i l_i . \eeq So the fraction of {\tt 1}s in the transmitted string is \beqan P(\mbox{bit is {\tt 1}}) &=& \frac{ \sum_i p_i f_i l_i }{ \sum_i p_i l_i } \label{eq.rightp1} \\ &=& \frac{ \dfrac{1}{2} \times 0 + \dfrac{1}{4} \times 1 + \dfrac{1}{8} \times 2 + \dfrac{1}{8} \times 3 }{ \dfrac{7}{4} } = \frac{\dfrac{7}{8}}{\dfrac{7}{4}} = 1/2 . \nonumber \eeqan \end{description} For a general symbol code and a general ensemble, the expectation (\ref{eq.rightp1}) is the correct answer. But in this case, we can use a more powerful argument. \begin{description} \item[Information-theoretic answer\puncspace] The encoded string $\bc$ is the output of an optimal compressor that compresses samples from $X$ down to an expected length of $H(X)$ bits. We can't expect to compress this data any further. But if the probability $P(\mbox{bit is {\tt 1}})$ were not equal to $\dhalf$ then it {\em would\/} be possible to compress the binary string further (using a block compression code, say). Therefore $P(\mbox{bit is {\tt 1}})$ must be equal to $\dhalf$; indeed the probability of any sequence of $l$ bits in the compressed stream taking on any particular value must be $2^{-l}$. The output of a perfect compressor is always perfectly random bits. \begincuttable To put it another way, if the probability $P(\mbox{bit is {\tt 1}})$ were not equal to $\dhalf$, then the information content per bit of the compressed string would be at most $H_2( P(\mbox{{\tt 1}}) )$, which would be less than 1; but this contradicts the fact that we can recover the original data from $\bc$, so the information content per bit of the compressed string must be $H(X)/L(C,X)=1$. 
\ENDcuttable \end{description} } % % this one is a new addition % \soln{ex.Huffmanqary}{ The \index{Huffman code!general alphabet}{general Huffman coding algorithm} for an encoding alphabet with $q$ symbols has one difference from the binary case. The process of combining $q$ symbols into 1 symbol reduces the number of symbols by $q\!-\!1$. So if we start with $A$ symbols, we'll only end up with a complete $q$-ary tree if $A \mod (q\!-\!1)$ is equal to 1. Otherwise, we know that whatever prefix code we make, it must be an incomplete tree with a number of missing leaves equal, modulo $(q\!-\!1)$, to $1 - A \mod (q\!-\!1)$. For example, if a ternary tree is built for eight symbols, then there will unavoidably be one missing leaf in the tree. The optimal $q$-ary code is made by putting these extra leaves in the longest branch of the tree. This can be achieved by adding the appropriate number of symbols to the original source symbol set, all of these extra symbols having probability zero. The total number of leaves is then equal to $r(q\!-\!1)+1$, for some integer $r$. The symbols are then repeatedly combined by taking the $q$ symbols with smallest probability and replacing them by a single symbol, as in the binary Huffman coding algorithm.} \soln{ex.mixsubopt}{ %This is important but I haven't written it yet. We wish to show that a greedy \ind{metacode}, which picks the code which gives the shortest encoding, is actually suboptimal, because it is incomplete: it fails to satisfy the Kraft equality. % For generality, let's call the % that the objects to be encoded, % $x$, `symbols'. We'll assume that each symbol $x$ is assigned lengths $l_k(x)$ by each of the candidate codes $C_k$. Let us assume there are $K$ alternative codes and that we can encode which code is being used with a header of length $\log_2 K$ bits. Then the metacode assigns lengths $l'(x)$ that are given by \beq l'(x) = \log_2 K + \min_k l_k(x) . 
\eeq We compute the Kraft sum: \beq S = \sum_x 2^{- l'(x)} = \frac{1}{K} \sum_x 2^{- \min_k l_k(x)} . \eeq Let's divide the set $\A_X$ into non-overlapping subsets $\{\A_k\}_{k=1}^{K}$ such that subset $\A_k$ contains all the symbols $x$ that the metacode sends via code $k$. Then \beq S = \frac{1}{K} \sum_k \sum_{x \in \A_{k}} 2^{- l_k(x)} . \eeq Now if one sub-code $k$ satisfies the Kraft equality $\sum_{x\in \A_X} 2^{- l_k(x)} \eq 1$, then it must be the case that \beq \sum_{x \in \A_{k}} 2^{- l_k(x)} \leq 1 , \label{eq.from.kraft} \eeq with equality only if all the symbols $x$ are in $\A_k$, which would mean that we are only using one of the $K$ codes. So \beq S \leq \frac{1}{K} \sum_{k=1}^K 1 = 1 , \eeq with equality only if \eqref{eq.from.kraft} is an equality for all codes $k$. But it's impossible for all the symbols to be in {\em all\/} the non-overlapping subsets $\{\A_k\}_{k=1}^{K}$, so we can't have equality (\ref{eq.from.kraft}) holding for {\em all\/} $k$. So %\beq % S < 1 . %\eeq $S < 1$. Another way of seeing that a mixture code is suboptimal is to consider the binary tree that it defines. Think of the special case of two codes. The first bit we send identifies which code we are using. Now, in a complete code, any subsequent binary string is a valid string. But once we know that we are using, say, code A, we know that what follows can only be a codeword corresponding to a symbol $x$ whose encoding is shorter under code A than code B. So some strings are invalid continuations, and the mixture code is incomplete and suboptimal. %%% MAYBE!!!!!!!!!!!!!! For further discussion of this issue and its relationship to probabilistic modelling read about `\ind{bits back} coding' in \secref{sec.bitsback} and in \citeasnoun{frey-98}. } % \dvipsb{solutions 3} \prechapter{About Chapter} \fakesection{prerequisites for chapter known as 4} Before reading \chref{ch.four}, you should have read the previous chapter and worked on most of the exercises in it. 
We'll also make use of some Bayesian modelling ideas that arrived in the vicinity of \exerciseref{ex.postpa}. % Arithmetic coding has been invented several times, % by Elias, by Rissanen, and % but is only slowly becoming well known % % {The description of Lempel--Ziv coding is based on that of Cover and Thomas (1991).} %\chapter{Data Compression III: Stream Codes} \mysetcounter{page}{126} \ENDprechapter \chapter{Stream Codes} \label{ch.four} \label{ch.ac} % _l4.tex \fakesection{Data Compression III: Stream Codes} % % still need to change notation for R(|) % \label{ch4} In this chapter we discuss two data compression schemes.\index{source code!stream codes|(}\index{stream codes|(} %% that constitute the state of the art. {\dem\indexs{arithmetic coding}Arithmetic coding} is a beautiful method that goes hand in hand with the philosophy that compression of data from a source entails probabilistic modelling of that source. As of 1999, the best compression methods for text files use arithmetic coding, and several state-of-the-art image compression systems use it too. {\dem\ind{Lempel--Ziv coding}} is a `\ind{universal}' method, % in my opinion an ugly hack, but designed under the philosophy that we would like a single compression algorithm that will do a reasonable job for {\em any\/} source. In fact, for many real life sources, this algorithm's universal properties hold only in the limit of unfeasibly large amounts of data, but, all the same, Lempel--Ziv compression is widely used and often effective. \section{The guessing game} \label{sec.startofch4} % \looseness=-1 this did not achieve what was advertised! As a motivation for these\index{game!guessing} two compression methods, % let us consider the redundancy in a typical % imagine compressing a \ind{English} text file. 
Such files have redundancy at several levels: for example, they contain the ASCII characters with non-equal frequency; certain consecutive pairs of letters are more probable than others; and entire words can be predicted given the context and a semantic understanding of the text. To illustrate the redundancy of English, and a curious way in which it could be compressed, we can imagine a \ind{guessing game} in which an English speaker repeatedly attempts to predict the next character in a text file. % \subsection{The guessing game} \label{sec.guessing} % Could discuss the compression of English text by guessing For simplicity, let us assume that the allowed alphabet consists of the 26 upper case letters {\tt A,B,C,\ldots, Z} and a space `{\tt -}'. The game involves asking the subject to guess the next character repeatedly, the only feedback being whether the guess is correct or not, until the character is correctly guessed. After a correct guess, we note the number of guesses that were made when the character was identified, and ask the subject to guess the next character in the same way. One sentence % given by Shannon gave the following result when a human was asked to guess a sentence. % in a guessing game. The numbers of guesses are listed below each character.\index{reverse}\index{motorcycle} % and the idea of having an identical twin. This introduces the idea % of mapping to a different alphabet with nonuniform probability. % The guessing game. From Shannon. 
\smallskip \begin{center}\hspace*{0.3in} %\begin{tabular}{*{36}{c@{\,\,}}} \begin{tabular}{*{36}{p{0.15in}@{}}} \small\tt T&\small\tt H&\small\tt E&\small\tt R&\small\tt E&\small\tt -&\small\tt I&\small\tt S&\small\tt -&\small\tt N&\small\tt O&\small\tt -&\small\tt R&\small\tt E&\small\tt V&\small\tt E&\small\tt R&\small\tt S&\small\tt E&\small\tt -&\small\tt O&\small\tt N&\small\tt -&\small\tt A&\small\tt -&\small\tt M&\small\tt O&\small\tt T&\small\tt O&\small\tt R&\small\tt C&\small\tt Y&\small\tt C&\small\tt L&\small\tt E&\small\tt -\\ \footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 5&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 1&\footnotesize \hspace{-0.05in}1\hspace{-0.25mm}5&\footnotesize 1&\footnotesize \hspace{-0.05in}1\hspace{-0.25mm}7&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 2&\footnotesize 1&\footnotesize 3&\footnotesize 2&\footnotesize 1&\footnotesize 2&\footnotesize 2&\footnotesize 7&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 4&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1&\footnotesize 1\\ \end{tabular} \smallskip \end{center} % attempt to tighten this para: \looseness=-1 Notice that in many cases, the next letter is guessed immediately, in one guess. In other cases, particularly at the start of syllables, more guesses are needed. What do this game and these results offer us? First, they demonstrate the redundancy of English from the point of view of an English speaker. Second, this game might be used in a data compression scheme, as follows. % encoding The string of numbers `1, 1, 1, 5, 1, \ldots', listed above, was obtained by presenting the text to the subject. 
The maximum number of guesses that the subject will make for a given letter is twenty-seven, so what the subject is doing for us is performing a time-varying mapping of the twenty-seven letters $\{ {\tt A,B,C,\ldots, Z,-}\}$ onto the twenty-seven numbers $\{1,2,3,\ldots, 27\}$, which we can view as symbols in a new alphabet. The total number of symbols has not been reduced, but since he uses some of these symbols much more frequently than others -- for example, 1 and 2 -- it should be easy to compress this new string of symbols. % ; we will discuss data compression %% the details of how to do this % properly shortly. % decoding How would the {\em uncompression\/} of the sequence of numbers `1, 1, 1, 5, 1, \ldots' work? At uncompression time, we do not have the original string `{\small\tt{THERE}}\ldots', we have only the encoded sequence. Imagine that our subject has an absolutely \ind{identical twin}\index{twin} %({\em absolutely\/} identical) who also plays the guessing game\index{guessing game} with us, as if we %, the experimenters, knew the source text. If we stop him whenever he has made a number of guesses equal to the given number, then he will have just guessed the correct letter, and we can then say `yes, that's right', and move to the next character. Alternatively, if the identical twin is not available, we could design a compression system with the help of just one human as follows. We choose a window length $L$, that is, a number of characters of context to show the human. For every one of the $27^L$ possible strings of length $L$, we ask them, `What would you predict is the next character?', and `If that prediction were wrong, what would your next guesses be?'. After tabulating their answers to these $26 \times 27^L$ questions, we could use two copies of these enormous tables at the encoder and the decoder in place of the two human twins. Such a language model is called an $L$th order \ind{Markov model}. 
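As a rough numerical illustration of the claim that the string of guess numbers should be easy to compress, consider the 36-character example above, in which the number 1 occurs 24 times, the number 2 occurs 6 times, and each of 3, 4, 5, 7, 15 and 17 occurs once. If we treat these counts as if they defined a fixed distribution over the new alphabet (a crude assumption, made here only for illustration, since the sample is tiny and successive guess numbers are not independent), the entropy per character is
\beq
 \frac{24}{36} \log_2 \frac{36}{24} + \frac{6}{36} \log_2 \frac{36}{6} + 6 \times \frac{1}{36} \log_2 36
 \simeq 0.39 + 0.43 + 0.86 \simeq 1.7 \mbox{ bits} ,
\eeq
compared with $\log_2 27 \simeq 4.75$ bits per character for a code that treats all twenty-seven characters as equiprobable. The tables, on the other hand, grow very quickly with the window length: even for $L=5$ there are $27^5 \simeq 1.4 \times 10^7$ contexts, and so $26 \times 27^5 \simeq 3.7 \times 10^8$ answers to collect.
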
These systems are clearly unrealistic for practical compression, but they illustrate several principles that we will make use of now. \section{Arithmetic codes} \label{sec.ac} % In lecture 2 we discussed fixed length block codes. When we discussed variable-length symbol codes, and the optimal Huffman algorithm for constructing them, we concluded by pointing out two practical and theoretical problems with Huffman codes (section \ref{sec.huffman.probs}). % % index decision: {arithmetic coding} not {arithmetic codes} % These defects are rectified by {\dem\index{arithmetic coding}{arithmetic codes}}, which were invented by Elias\nocite{EliasACmentionedpages61to62},\index{Elias, Peter} by \index{Rissanen, Jorma}{Rissanen} and by {Pasco}, and subsequently made practical by % Witten, Neal, and Cleary. \citeasnoun{arith_coding}.\index{Neal, Radford M.} % \index{Pasco, Richard} In an arithmetic code, the probabilistic modelling is clearly separated from the encoding operation. The system is rather similar to the guessing game.\index{guessing game} % that we considered in Chapter \chtwo. The human predictor is replaced by a {\dem\ind{probabilistic model}} of the source.\index{model} As each symbol is produced by the source, the probabilistic model supplies a {\dem\ind{predictive distribution}} over all possible values of the next symbol, that is, a list of positive numbers $\{ p_i \}$ that sum to one. If we choose to model the source as producing i.i.d.\ symbols with some known distribution, then the predictive distribution is the same every time; but arithmetic coding can with equal ease handle complex adaptive models that produce context-dependent % time-varying predictive distributions. The predictive model is usually implemented in a computer program. % a model which hypothesizes arbitrary % context-dependences and non-stationarities, and which learns as it % goes, so that predictive distributions in any given context gradually % sharpen up. 
% I will give an example later on, % of an adaptive model producing appropriate probabilities % but first let us discuss the arithmetic coding algorithm itself. The encoder makes use of the model's predictions to create a binary string. The decoder makes use of an identical twin of the model (just as in the guessing \index{guessing game}game) to interpret the binary string. Let the source alphabet be $\A_X = \{a_1 ,\ldots, a_I\}$, and let the $I$th symbol $a_I$ have the special meaning `end of transmission'. The source spits out a sequence $x_1,x_2,\ldots,x_n,\ldots.$ The source does {\em not\/} necessarily produce i.i.d.\ symbols. We will assume that a computer program is provided to the encoder that assigns a predictive probability distribution over $a_i$ given the sequence that has occurred thus far, $P(x_n \eq a_i \given x_1,\ldots,x_{n-1})$. % Nor will we assume that the source % is correctly modeled by $P$. But if it is, then arithmetic coding achieves % the Shannon rate. % % The encoder will send a binary transmission to the receiver. % The receiver has an identical program that produces the same predictive probability distribution $P(x_n \eq a_i \given x_1,\ldots,x_{n-1})$. % and uses it to interpret the received message. 
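As a hedged illustration of what such a program might look like, the following Python sketch defines a small adaptive model and checks that an encoder-side copy and a decoder-side copy stay in step. The class \verb+AdaptiveModel+, its add-one smoothing, and the use of the previous symbol as the only context are assumptions made for this example, not the model used in the text.

```python
from collections import defaultdict
from fractions import Fraction


class AdaptiveModel:
    """Illustrative adaptive model: counts which symbol follows which,
    and predicts the next symbol with add-one smoothing."""

    def __init__(self, alphabet):
        self.alphabet = tuple(alphabet)
        self.counts = defaultdict(lambda: {a: 0 for a in self.alphabet})
        self.previous = None               # no context before x_1

    def predict(self):
        """Predictive distribution over the next symbol; it sums to one."""
        seen = self.counts[self.previous]
        total = sum(seen.values()) + len(self.alphabet)
        return {a: Fraction(seen[a] + 1, total) for a in self.alphabet}

    def update(self, symbol):
        """Learn from the symbol that actually occurred."""
        self.counts[self.previous][symbol] += 1
        self.previous = symbol


# "$" plays the role of the end-of-transmission symbol a_I.
encoder_model = AdaptiveModel("ab$")
decoder_model = AdaptiveModel("ab$")
for symbol in "abab":
    # Both copies have seen the same history, so they agree at every step.
    assert encoder_model.predict() == decoder_model.predict()
    encoder_model.update(symbol)
    decoder_model.update(symbol)
```

Because the prediction depends only on symbols already processed, the decoder can reproduce each distribution before it knows the next symbol.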
\medskip \begin{figure}[htbp] \figuremargin{% \begin{center} \setlength{\unitlength}{1mm} \begin{picture}(50,40)(0,0) \put(18,40){\makebox(0,0)[r]{0.00}} \put(18,30){\makebox(0,0)[r]{0.25}} \put(18,20){\makebox(0,0)[r]{0.50}} \put(18,10){\makebox(0,0)[r]{0.75}} \put(18, 0){\makebox(0,0)[r]{1.00}} % % major horizontals % \put(20,40){\line(1,0){37}} \put(20,30){\line(1,0){13}} \put(20,20){\line(1,0){28}} \put(20,10){\line(1,0){13}} \put(20, 0){\line(1,0){37}} % % biggest intervals % \put(45,30){\vector(0,1){9}} \put(45,30){\vector(0,-1){9}} \put(47,30){\makebox(0,0)[l]{{\tt{0}}}} \put(45,10){\vector(0,1){9}} \put(45,10){\vector(0,-1){9}} \put(47,10){\makebox(0,0)[l]{{\tt{1}}}} % \put(35,25){\vector(0,1){4}} \put(35,25){\vector(0,-1){4}} \put(37,25){\makebox(0,0)[l]{{\tt{01}}}} % some subdivs \put(20,35){\line(1,0){7}} \put(20,25){\line(1,0){7}} \put(20,15){\line(1,0){7}} \put(20, 5){\line(1,0){7}} % % 01101 = 13/32 = 16.25 % 01110 = 14/32 = 17.5 \put(20,23.75){\line(1,0){4}} \put(20,22.50){\line(1,0){4}} \put(62,23.125){\makebox(0,0)[l]{{\tt{01101}}}} % % interrupted pointer: \put(60,23.125){\line(-1,0){14}} \put(44,23.125){\line(-1,0){8}} \put(34,23.125){\vector(-1,0){9.5}} % \end{picture} \end{center} }{% \caption[a]{Binary strings define real intervals within the real line [0,1). We first encountered a picture like this when we discussed the \index{supermarket (codewords)}\index{symbol code!supermarket}\index{source code!supermarket}{symbol-code supermarket} in \chref{ch3}. } \label{fig.arith.Rbinary} }% \end{figure} \subsection{Concepts for understanding arithmetic coding} \begin{aside} %\item[Notation for intervals.] {\sf Notation for intervals.} The interval $[0.01, 0.10)$ is all numbers between $0.01$ and $0.10$, including $0.01\dot{0}\equiv0.01000\ldots$ but not $0.10\dot{0}\equiv0.10000\ldots.$ \end{aside} A binary transmission defines an interval within the real line from 0 to 1. 
For example, the string {\tt{01}} is interpreted as a binary real number 0.01\ldots, which corresponds to the interval $[0.01, 0.10)$ in binary, \ie, the interval $[0.25,0.50)$ in base ten. The longer string {\tt{01101}} corresponds to a smaller interval $[0.01101, 0.01110)$. Because {\tt{01101}} has the first string, {\tt{01}}, as a prefix, the new interval is a sub-interval of the interval $[0.01, 0.10)$. A one-megabyte binary file ($2^{23}$ bits) is thus viewed as specifying a number between 0 and 1 to a precision of about two and a half million decimal places, because each byte translates into a little more than two decimal digits.

\medskip

Now, we can also divide the real line [0,1) into $I$ intervals of lengths equal to the probabilities $P(x_1 \eq a_i)$, as shown in \figref{fig.arith.R}.
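The correspondence between binary strings and intervals can be checked with a short Python sketch. The function name \verb+binary_interval+ is hypothetical; exact fractions are used so that the endpoints are not blurred by rounding.

```python
from fractions import Fraction


def binary_interval(bits):
    """Interval [lo, hi) of all reals in [0,1) whose binary expansion
    starts with the given string of '0'/'1' characters."""
    width = Fraction(1, 2 ** len(bits))
    lo = Fraction(int(bits, 2), 2 ** len(bits)) if bits else Fraction(0)
    return lo, lo + width


lo, hi = binary_interval("01")             # [1/4, 1/2)
sub_lo, sub_hi = binary_interval("01101")  # [13/32, 14/32)
# A string with "01" as a prefix always gives a sub-interval of "01".
assert lo <= sub_lo and sub_hi <= hi
```

Each extra bit halves the interval, so a string of $n$ bits corresponds to an interval of width $2^{-n}$.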
% upsidedown % p1 = 6 -- 34 mid: 37 w = 3-1 % p2 = 16 cum 22 -- 18 mid: 26 w = 8-1 % last = 6 cum -- 6 mid: 3 w = 3-1 \newcommand{\aonelevel}{34} \newcommand{\atwolevel}{18} \newcommand{\apenlevel}{6}% penultimate \newcommand{\apenmid}{12}% put dots here \newcommand{\aonemid}{37} \newcommand{\aonew}{2} \newcommand{\atwow}{7} \newcommand{\atwomid}{26} \newcommand{\aIw}{2} \newcommand{\aImid}{3} \begin{figure}[htbp] \figuremargin{% \begin{center} \setlength{\unitlength}{1mm} \begin{picture}(50,40)(0,0) \put(18,40){\makebox(0,0)[r]{0.00}} \put(18,\aonelevel){\makebox(0,0)[r]{$P(x_1\eq a_1)$}} \put(18,\atwolevel){\makebox(0,0)[r]{$P(x_1\eq a_1)+P(x_1\eq a_2)$}} \put(18,\apenlevel){\makebox(0,0)[r]{$P(x_1\eq a_1)+\ldots+P(x_1\eq a_{I\!-\!1})$}} \put(18, 0){\makebox(0,0)[r]{1.0}} % % major horizontals % \put(20,40){\line(1,0){37}} \put(20,\aonelevel){\line(1,0){20}} \put(20,\atwolevel){\line(1,0){20}} \put(20,\apenlevel){\line(1,0){20}} \put(20, 0){\line(1,0){37}} \put(30,\apenmid){\makebox(0,0)[l]{$\vdots$}} % % biggest intervals % \put(35,\aonemid){\vector(0,1){\aonew}} \put(35,\aonemid){\vector(0,-1){\aonew}} \put(37,\aonemid){\makebox(0,0)[l]{$a_1$}}% or $P(x_1\eq a_1)$}} \put(35,\atwomid){\vector(0,1){\atwow}} \put(35,\atwomid){\vector(0,-1){\atwow}} \put(37,\atwomid){\makebox(0,0)[l]{$a_2$}}% or $P(x_1\eq a_2)$}} \put(35,\aImid){\vector(0,1){\aIw}} \put(35,\aImid){\vector(0,-1){\aIw}} \put(37,\aImid){\makebox(0,0)[l]{$a_I$}}% or $P(x_1\eq a_I)$}} \put(37,\apenmid){\makebox(0,0)[l]{$\vdots$}} % \put(20,23){\line(1,0){4}}% beg of a5 \put(20,20){\line(1,0){4}}% end a5 % \put(62,21.5){\makebox(0,0)[l]{$a_2 a_5$}} % interrupted pointer: \put(60,21.5){\line(-1,0){24}} \put(34,21.5){\vector(-1,0){9.5}} % % a2a1: 34 is the top % \put(20,30){\line(1,0){4}}% end of a1 \put(20,28){\line(1,0){4}}% end of a2 \put(20,25){\line(1,0){4}}% end of a3 % \put(62,32){\makebox(0,0)[l]{$a_2 a_1$}} % interrupted pointer: \put(60,32){\line(-1,0){24}} \put(34,32){\vector(-1,0){9.5}} % 
\end{picture}
\end{center}
}{%
\caption[a]{A probabilistic model defines real intervals within the real line [0,1).}
\label{fig.arith.R}
}%
\end{figure}

We may then take each interval $a_i$ and subdivide it into intervals denoted $a_ia_1,a_ia_2,\ldots, a_ia_I$, such that the length of $a_ia_j$ is proportional to $P(x_2 \eq a_j \given x_1 \eq a_i)$. Indeed the length of the interval $a_ia_j$ will be precisely the joint probability
\beq
 P(x_1 \eq a_i,x_2\eq a_j)=P(x_1\eq a_i)P(x_2 \eq a_j \given x_1 \eq a_i).
\eeq
Iterating this procedure, the interval $[0,1)$ can be divided into a sequence of intervals corresponding to all possible finite-length strings $x_1x_2\ldots x_N$, such that the length of an interval is equal to the probability of the string given our model.

\subsection{Formulae describing arithmetic coding}
\begin{aside}
The process depicted in \figref{fig.arith.R} can be written explicitly as follows. The intervals are defined in terms of the lower and upper cumulative probabilities
\beqan
 Q_{n}(a_i \given x_1,\ldots,x_{n-1}) & \equiv & \sum_{i'\eq 1 }^{i-1} P(x_n \eq a_{i'} \given x_1,\ldots,x_{n-1}) , \label{eq.arith.Q} \\
 R_{n}(a_i \given x_1,\ldots,x_{n-1}) & \equiv & \sum_{i'\eq 1 }^{i} P(x_n \eq a_{i'} \given x_1,\ldots,x_{n-1}) . \label{eq.arith.R}
\eeqan
As the $n$th symbol arrives, we subdivide the $(n-1)$th interval at the points defined by $Q_n$ and $R_n$.
For example, starting with the first symbol, the intervals `$a_1$', `$a_2$', and `$a_I$' are
\beq
 a_1 \leftrightarrow [Q_{1}(a_1),R_{1}(a_1))= [0,P(x_1 \eq a_1)) ,
\eeq
\beq
 a_2 \leftrightarrow [Q_{1}(a_2),R_{1}(a_2))= \left[ P(x_1\eq a_1),P(x_1\eq a_1)+P(x_1\eq a_2) \right) ,
\eeq
and
\beq
 a_I \leftrightarrow \left[ Q_{1}(a_{I}) , R_{1}(a_{I}) \right) = \left[ P(x_1\eq a_1)+\ldots+P(x_1\eq a_{I\!-\!1}) ,1.0 \right) .
\eeq
Algorithm \ref{alg.ac} describes the general procedure.
\end{aside}

\begin{algorithm}
\begin{framedalgorithmwithcaption}{
\caption[a]{Arithmetic coding. Iterative procedure to find the interval $[u,v)$ for the string $x_1x_2\ldots x_N$. }
\label{alg.ac}
}
\begin{center}
\begin{tabular}{l}
{\tt $u$ := 0.0} \\
{\tt $v$ := 1.0} \\
{\tt $p$ := $v-u$} \\
{\tt for $n$ = 1 to $N$ \verb+{+ } \\
\hspace*{0.5in} Compute the cumulative probabilities $Q_n$ and $R_n$ \protect(\ref{eq.arith.Q},\,\ref{eq.arith.R})
\\
\\
\hspace*{0.5in} {\tt $v$ := $u + p R_{n}(x_n \given x_1,\ldots,x_{n-1}) $ } \\
\hspace*{0.5in} {\tt $u$ := $u + p Q_{n}(x_n \given x_1,\ldots,x_{n-1}) $ } \\
\hspace*{0.5in} {\tt $p$ := $v-u$} \\
{\tt \verb+}+ } \\
\end{tabular}
\end{center}
\end{framedalgorithmwithcaption}
\end{algorithm}

To encode a string $x_1x_2\ldots x_N$, we locate the interval corresponding to $x_1x_2\ldots x_N$, and send a binary string whose interval lies within that interval. This encoding can be performed on the fly, as we now illustrate.
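Algorithm \ref{alg.ac} and the choice of a binary string can be sketched in Python as follows. This is a minimal sketch rather than a practical encoder: it uses exact fractions instead of finite-precision arithmetic, it picks the \emph{shortest} binary string whose interval fits (one valid choice among several), and the names \verb+string_interval+, \verb+shortest_codeword+ and the toy i.i.d.\ model \verb+iid_model+ are assumptions made for the example.

```python
from fractions import Fraction
from math import ceil


def string_interval(symbols, alphabet, model):
    """Shrink [u, v) once per symbol, using the cumulative sums Q_n and R_n."""
    u, v = Fraction(0), Fraction(1)
    for n, x in enumerate(symbols):
        p = v - u
        dist = model(symbols[:n])          # P(x_n = a_i | x_1 ... x_{n-1})
        i = alphabet.index(x)
        q = sum((dist[a] for a in alphabet[:i]), Fraction(0))   # Q_n(x_n)
        r = q + dist[x]                                          # R_n(x_n)
        u, v = u + p * q, u + p * r
    return u, v


def shortest_codeword(u, v):
    """Shortest binary string whose interval lies inside [u, v)."""
    length = 0
    while True:
        scale = 2 ** length
        k = ceil(u * scale)                # smallest k with k/scale >= u
        if Fraction(k + 1, scale) <= v:
            return format(k, "b").zfill(length) if length else ""
        length += 1


ALPHABET = ("a", "b", "$")                 # "$" marks end of transmission


def iid_model(context):
    """Toy model (hypothetical): the same distribution in every context."""
    return {"a": Fraction(1, 2), "b": Fraction(1, 4), "$": Fraction(1, 4)}


u, v = string_interval("ab$", ALPHABET, iid_model)
codeword = shortest_codeword(u, v)
```

For this toy model the string \verb+ab$+ has probability $1/32$, so the interval has width $1/32$ and the codeword found is five bits long.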
% \eof defined in itprnnchapter \subsection{Example: compressing the tosses of a bent coin} Imagine that we watch as a \ind{bent coin} is tossed some number of times (\cf\ \exampleref{exa.bentcoin} and \secref{sec.bentcoin} (\pref{sec.bentcoin})). The two outcomes when the coin is tossed are denoted $\tt a$ and $\tt b$. A third possibility is that the experiment is halted, an event denoted by the `end of file' symbol, `$\eof$'. Because the coin is bent, we expect that the probabilities of the outcomes $\tt a$ and $\tt b$ are not equal, though beforehand we don't know which is the more probable outcome. % Let $\A_X=\{a,b,\eof\}$, where % $a$ and $\tb$ make up a binary alphabet with % $\eof$ is an `end of file' symbol. \subsubsection{Encoding\subsubpunc} Let the source string be `$\tt bbba\eof$'. We pass along the string one symbol at a time and use our model to compute the probability distribution of the next symbol given the string thus far. Let these probabilities be: \[\begin{array}{l*{3}{r@{\eq}l}} \toprule \mbox{Context } \\ \mbox{(sequence thus far) } & \multicolumn{6}{c}{\mbox{Probability of next symbol}} \\[0.05in] \midrule & P( \ta ) & 0.425 & P( \tb ) & 0.425 & P( \eof ) & 0.15 \\[0.05in] \tb& P( \ta \given \tb ) & 0.28 & P( \tb \given \tb ) & 0.57 & P( \eof \given \tb ) & 0.15 \\[0.05in] \tb\tb&P( \ta \given \tb\tb ) & 0.21 & P( \tb \given \tb\tb ) & 0.64 & P( \eof \given \tb\tb ) & 0.15 \\[0.05in] \tb\tb\tb&P( \ta \given \tb\tb\tb ) & 0.17 & P( \tb \given \tb\tb\tb ) & 0.68 & P( \eof \given \tb\tb\tb ) & 0.15 \\[0.05in] \tb\tb\tb\ta& P( \ta \given \tb\tb\tb\ta ) & 0.28 & P( \tb \given \tb\tb\tb\ta ) & 0.57 & P( \eof \given \tb\tb\tb\ta ) & 0.15 \\ \bottomrule \end{array} \] \Figref{fig.ac} shows the corresponding intervals. The interval $\tb$ is the middle 0.425 of $[0,1)$. The interval $\tb\tb$ is the middle 0.567 of $\tb$, and so forth. % in the following figure. 
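The interval boundaries for this example can be reproduced with a short Python sketch. The table gives the probabilities only to two decimal places, so the sketch assumes a rule that regenerates them: an add-one estimate for $\tt a$ and $\tt b$ from the counts so far, scaled so that the end-of-file symbol always has probability 0.15. That rule is an assumption of this example (it matches every row of the table after rounding) and is not stated in the passage above; the function names are hypothetical.

```python
from fractions import Fraction

ALPHABET = ("a", "b", "$")         # "$" stands for the end-of-file symbol
P_EOF = Fraction(15, 100)


def model(context):
    """Assumed rule: add-one estimate for a/b, with P(eof) fixed at 0.15."""
    n_a, n_b = context.count("a"), context.count("b")
    share = 1 - P_EOF
    return {
        "a": share * Fraction(n_a + 1, n_a + n_b + 2),
        "b": share * Fraction(n_b + 1, n_a + n_b + 2),
        "$": P_EOF,
    }


def prefix_intervals(source):
    """Yield (prefix, u, v) after each symbol of the source string."""
    u, v = Fraction(0), Fraction(1)
    for n, x in enumerate(source):
        dist, p = model(source[:n]), v - u
        q = sum((dist[a] for a in ALPHABET[:ALPHABET.index(x)]), Fraction(0))
        u, v = u + p * q, u + p * (q + dist[x])
        yield source[:n + 1], u, v


for prefix, u, v in prefix_intervals("bbba$"):
    print(f"{prefix:6s} [{float(u):.4f}, {float(v):.4f})")
```

Working with the unrounded probabilities matters here: with the two-decimal table values the upper end of the interval $\tt bbb$ would fall just below 0.75, whereas with the assumed rule it falls just above.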
\begin{figure}[htbp] \figuremargin{% \begin{center} % created by ac.p only_show_data=1 > ac/ac_data.tex %%%%%%% and edited by hand \mbox{ \hspace{-0.1in}\small \setlength{\unitlength}{4.8in} %\setlength{\unitlength}{5.75in} \begin{picture}(0.59130434782608698452,1)(-0.29565217391304349226,0) \thinlines % line 0.0000 from -0.5000 to 0.0000 \put( -0.2957, 1.0000){\line(1,0){ 0.2957}} % a at -0.4500, 0.2125 \put( -0.2811, 0.7875){\makebox(0,0)[r]{\tt{a}}} % line 0.4250 from -0.5000 to 0.0000 \put( -0.2957, 0.5750){\line(1,0){ 0.2957}} % b at -0.4500, 0.6375 \put( -0.2811, 0.3625){\makebox(0,0)[r]{\tt{b}}} % line 0.8500 from -0.5000 to 0.0000 \put( -0.2957, 0.1500){\line(1,0){ 0.2957}} % \teof at -0.4500, 0.9250 \put( -0.2811, 0.0750){\makebox(0,0)[r]{\teof}} % line 1.0000 from -0.5000 to 0.0000 \put( -0.2957, 0.0000){\line(1,0){ 0.2957}} % ba at -0.3500, 0.4852 \put( -0.2220, 0.5148){\makebox(0,0)[r]{\tt{ba}}} % line 0.5454 from -0.4500 to 0.0000 \put( -0.2661, 0.4546){\line(1,0){ 0.2661}} % bb at -0.3500, 0.6658 \put( -0.2220, 0.3342){\makebox(0,0)[r]{\tt{bb}}} % line 0.7862 from -0.4500 to 0.0000 \put( -0.2661, 0.2138){\line(1,0){ 0.2661}} % b\teof at -0.3500, 0.8181 \put( -0.2220, 0.1819){\makebox(0,0)[r]{\tt{b\teof}}} % bba at -0.2300, 0.5710 \put( -0.1510, 0.4290){\makebox(0,0)[r]{\tt{bba}}} % line 0.5966 from -0.3500 to 0.0000 \put( -0.2070, 0.4034){\line(1,0){ 0.2070}} % bbb at -0.2300, 0.6734 \put( -0.1510, 0.3266){\makebox(0,0)[r]{\tt{bbb}}} % line 0.7501 from -0.3500 to 0.0000 \put( -0.2070, 0.2499){\line(1,0){ 0.2070}} % bb\teof at -0.2300, 0.7682 \put( -0.1510, 0.2318){\makebox(0,0)[r]{\tt{bb\teof}}} % bbba at -0.1000, 0.6096 \put( -0.0741, 0.3904){\makebox(0,0)[r]{\tt{bbba}}} % line 0.6227 from -0.2300 to 0.0000 \put( -0.1360, 0.3773){\line(1,0){ 0.1360}} % bbbb at -0.1000, 0.6749 \put( -0.0741, 0.3251){\makebox(0,0)[r]{\tt{bbbb}}} % line 0.7271 from -0.2300 to 0.0000 \put( -0.1360, 0.2729){\line(1,0){ 0.1360}} % bbb\teof at -0.1000, 0.7386 \put( -0.0741, 
0.2614){\makebox(0,0)[r]{\tt{bbb\teof}}} % line 0.6040 from -0.1000 to 0.0000 \put( -0.0591, 0.3960){\line(1,0){ 0.0591}} % line 0.6188 from -0.1000 to 0.0000 \put( -0.0591, 0.3812){\line(1,0){ 0.0591}} % line 0.0000 from 0.0100 to 0.5000 \put( 0.0059, 1.0000){\line(1,0){ 0.2897}} % 0 at 0.0100, 0.2500 \put( 0.2811, 0.7500){\makebox(0,0)[l]{\tt0}} % line 0.5000 from 0.0100 to 0.5000 \put( 0.0059, 0.5000){\line(1,0){ 0.2897}} % 1 at 0.0100, 0.7500 \put( 0.2811, 0.2500){\makebox(0,0)[l]{\tt1}} % line 1.0000 from 0.0100 to 0.5000 \put( 0.0059, 0.0000){\line(1,0){ 0.2897}} % 00 at 0.0100, 0.1250 \put( 0.2397, 0.8750){\makebox(0,0)[l]{\tt00}} % line 0.2500 from 0.0100 to 0.4500 \put( 0.0059, 0.7500){\line(1,0){ 0.2602}} % 01 at 0.0100, 0.3750 \put( 0.2397, 0.6250){\makebox(0,0)[l]{\tt01}} % 000 at 0.0100, 0.0625 \put( 0.1806, 0.9375){\makebox(0,0)[l]{\tt000}} % line 0.1250 from 0.0100 to 0.3800 \put( 0.0059, 0.8750){\line(1,0){ 0.2188}} % 001 at 0.0100, 0.1875 \put( 0.1806, 0.8125){\makebox(0,0)[l]{\tt001}} % 0000 at 0.0100, 0.0312 % was at 0.1037, move 0.02 right -> 1207 \put( 0.1207, 0.9688){\makebox(0,0)[l]{\tt0000}} % line 0.0625 from 0.0100 to 0.2800 \put( 0.0059, 0.9375){\line(1,0){ 0.1597}} % 0001 at 0.0100, 0.0938 \put( 0.1207, 0.9062){\makebox(0,0)[l]{\tt0001}} % 00000 at 0.0100, 0.0156 \put( 0.0387, 0.9844){\makebox(0,0)[l]{\tt00000}} % line 0.0312 from 0.0100 to 0.1500 \put( 0.0059, 0.9688){\line(1,0){ 0.0828}} % 00001 at 0.0100, 0.0469 \put( 0.0387, 0.9531){\makebox(0,0)[l]{\tt00001}} % line 0.0156 from 0.0100 to 0.0400 \put( 0.0059, 0.9844){\line(1,0){ 0.0177}} % line 0.0078 from 0.0100 to 0.0200 \put( 0.0059, 0.9922){\line(1,0){ 0.0059}} % line 0.0234 from 0.0100 to 0.0200 \put( 0.0059, 0.9766){\line(1,0){ 0.0059}} % line 0.0469 from 0.0100 to 0.0400 \put( 0.0059, 0.9531){\line(1,0){ 0.0177}} % line 0.0391 from 0.0100 to 0.0200 \put( 0.0059, 0.9609){\line(1,0){ 0.0059}} % line 0.0547 from 0.0100 to 0.0200 \put( 0.0059, 0.9453){\line(1,0){ 0.0059}} % 00010 
at 0.0100, 0.0781 \put( 0.0387, 0.9219){\makebox(0,0)[l]{\tt00010}} % line 0.0938 from 0.0100 to 0.1500 \put( 0.0059, 0.9062){\line(1,0){ 0.0828}} % 00011 at 0.0100, 0.1094 \put( 0.0387, 0.8906){\makebox(0,0)[l]{\tt00011}} % line 0.0781 from 0.0100 to 0.0400 \put( 0.0059, 0.9219){\line(1,0){ 0.0177}} % line 0.0703 from 0.0100 to 0.0200 \put( 0.0059, 0.9297){\line(1,0){ 0.0059}} % line 0.0859 from 0.0100 to 0.0200 \put( 0.0059, 0.9141){\line(1,0){ 0.0059}} % line 0.1094 from 0.0100 to 0.0400 \put( 0.0059, 0.8906){\line(1,0){ 0.0177}} % line 0.1016 from 0.0100 to 0.0200 \put( 0.0059, 0.8984){\line(1,0){ 0.0059}} % line 0.1172 from 0.0100 to 0.0200 \put( 0.0059, 0.8828){\line(1,0){ 0.0059}} % 0010 at 0.0100, 0.1562 \put( 0.1207, 0.8438){\makebox(0,0)[l]{\tt0010}} % line 0.1875 from 0.0100 to 0.2800 \put( 0.0059, 0.8125){\line(1,0){ 0.1597}} % 0011 at 0.0100, 0.2188 \put( 0.1207, 0.7812){\makebox(0,0)[l]{\tt0011}} % 00100 at 0.0100, 0.1406 \put( 0.0387, 0.8594){\makebox(0,0)[l]{\tt00100}} % line 0.1562 from 0.0100 to 0.1500 \put( 0.0059, 0.8438){\line(1,0){ 0.0828}} % 00101 at 0.0100, 0.1719 \put( 0.0387, 0.8281){\makebox(0,0)[l]{\tt00101}} % line 0.1406 from 0.0100 to 0.0400 \put( 0.0059, 0.8594){\line(1,0){ 0.0177}} % line 0.1328 from 0.0100 to 0.0200 \put( 0.0059, 0.8672){\line(1,0){ 0.0059}} % line 0.1484 from 0.0100 to 0.0200 \put( 0.0059, 0.8516){\line(1,0){ 0.0059}} % line 0.1719 from 0.0100 to 0.0400 \put( 0.0059, 0.8281){\line(1,0){ 0.0177}} % line 0.1641 from 0.0100 to 0.0200 \put( 0.0059, 0.8359){\line(1,0){ 0.0059}} % line 0.1797 from 0.0100 to 0.0200 \put( 0.0059, 0.8203){\line(1,0){ 0.0059}} % 00110 at 0.0100, 0.2031 \put( 0.0387, 0.7969){\makebox(0,0)[l]{\tt00110}} % line 0.2188 from 0.0100 to 0.1500 \put( 0.0059, 0.7812){\line(1,0){ 0.0828}} % 00111 at 0.0100, 0.2344 \put( 0.0387, 0.7656){\makebox(0,0)[l]{\tt00111}} % line 0.2031 from 0.0100 to 0.0400 \put( 0.0059, 0.7969){\line(1,0){ 0.0177}} % line 0.1953 from 0.0100 to 0.0200 \put( 0.0059, 
0.8047){\line(1,0){ 0.0059}} % line 0.2109 from 0.0100 to 0.0200 \put( 0.0059, 0.7891){\line(1,0){ 0.0059}} % line 0.2344 from 0.0100 to 0.0400 \put( 0.0059, 0.7656){\line(1,0){ 0.0177}} % line 0.2266 from 0.0100 to 0.0200 \put( 0.0059, 0.7734){\line(1,0){ 0.0059}} % line 0.2422 from 0.0100 to 0.0200 \put( 0.0059, 0.7578){\line(1,0){ 0.0059}} % 010 at 0.0100, 0.3125 \put( 0.1806, 0.6875){\makebox(0,0)[l]{\tt010}} % line 0.3750 from 0.0100 to 0.3800 \put( 0.0059, 0.6250){\line(1,0){ 0.2188}} % 011 at 0.0100, 0.4375 \put( 0.1806, 0.5625){\makebox(0,0)[l]{\tt011}} % 0100 at 0.0100, 0.2812 \put( 0.1207, 0.7188){\makebox(0,0)[l]{\tt0100}} % line 0.3125 from 0.0100 to 0.2800 \put( 0.0059, 0.6875){\line(1,0){ 0.1597}} % 0101 at 0.0100, 0.3438 \put( 0.1207, 0.6562){\makebox(0,0)[l]{\tt0101}} % 01000 at 0.0100, 0.2656 \put( 0.0387, 0.7344){\makebox(0,0)[l]{\tt01000}} % line 0.2812 from 0.0100 to 0.1500 \put( 0.0059, 0.7188){\line(1,0){ 0.0828}} % 01001 at 0.0100, 0.2969 \put( 0.0387, 0.7031){\makebox(0,0)[l]{\tt01001}} % line 0.2656 from 0.0100 to 0.0400 \put( 0.0059, 0.7344){\line(1,0){ 0.0177}} % line 0.2578 from 0.0100 to 0.0200 \put( 0.0059, 0.7422){\line(1,0){ 0.0059}} % line 0.2734 from 0.0100 to 0.0200 \put( 0.0059, 0.7266){\line(1,0){ 0.0059}} % line 0.2969 from 0.0100 to 0.0400 \put( 0.0059, 0.7031){\line(1,0){ 0.0177}} % line 0.2891 from 0.0100 to 0.0200 \put( 0.0059, 0.7109){\line(1,0){ 0.0059}} % line 0.3047 from 0.0100 to 0.0200 \put( 0.0059, 0.6953){\line(1,0){ 0.0059}} % 01010 at 0.0100, 0.3281 \put( 0.0387, 0.6719){\makebox(0,0)[l]{\tt01010}} % line 0.3438 from 0.0100 to 0.1500 \put( 0.0059, 0.6562){\line(1,0){ 0.0828}} % 01011 at 0.0100, 0.3594 \put( 0.0387, 0.6406){\makebox(0,0)[l]{\tt01011}} % line 0.3281 from 0.0100 to 0.0400 \put( 0.0059, 0.6719){\line(1,0){ 0.0177}} % line 0.3203 from 0.0100 to 0.0200 \put( 0.0059, 0.6797){\line(1,0){ 0.0059}} % line 0.3359 from 0.0100 to 0.0200 \put( 0.0059, 0.6641){\line(1,0){ 0.0059}} % line 0.3594 from 0.0100 to 
0.0400 \put( 0.0059, 0.6406){\line(1,0){ 0.0177}} % line 0.3516 from 0.0100 to 0.0200 \put( 0.0059, 0.6484){\line(1,0){ 0.0059}} % line 0.3672 from 0.0100 to 0.0200 \put( 0.0059, 0.6328){\line(1,0){ 0.0059}} % 0110 at 0.0100, 0.4062 \put( 0.1207, 0.5938){\makebox(0,0)[l]{\tt0110}} % line 0.4375 from 0.0100 to 0.2800 \put( 0.0059, 0.5625){\line(1,0){ 0.1597}} % 0111 at 0.0100, 0.4688 \put( 0.1207, 0.5312){\makebox(0,0)[l]{\tt0111}} % 01100 at 0.0100, 0.3906 \put( 0.0387, 0.6094){\makebox(0,0)[l]{\tt01100}} % line 0.4062 from 0.0100 to 0.1500 \put( 0.0059, 0.5938){\line(1,0){ 0.0828}} % 01101 at 0.0100, 0.4219 \put( 0.0387, 0.5781){\makebox(0,0)[l]{\tt01101}} % line 0.3906 from 0.0100 to 0.0400 \put( 0.0059, 0.6094){\line(1,0){ 0.0177}} % line 0.3828 from 0.0100 to 0.0200 \put( 0.0059, 0.6172){\line(1,0){ 0.0059}} % line 0.3984 from 0.0100 to 0.0200 \put( 0.0059, 0.6016){\line(1,0){ 0.0059}} % line 0.4219 from 0.0100 to 0.0400 \put( 0.0059, 0.5781){\line(1,0){ 0.0177}} % line 0.4141 from 0.0100 to 0.0200 \put( 0.0059, 0.5859){\line(1,0){ 0.0059}} % line 0.4297 from 0.0100 to 0.0200 \put( 0.0059, 0.5703){\line(1,0){ 0.0059}} % 01110 at 0.0100, 0.4531 \put( 0.0387, 0.5469){\makebox(0,0)[l]{\tt01110}} % line 0.4688 from 0.0100 to 0.1500 \put( 0.0059, 0.5312){\line(1,0){ 0.0828}} % 01111 at 0.0100, 0.4844 \put( 0.0387, 0.5156){\makebox(0,0)[l]{\tt01111}} % line 0.4531 from 0.0100 to 0.0400 \put( 0.0059, 0.5469){\line(1,0){ 0.0177}} % line 0.4453 from 0.0100 to 0.0200 \put( 0.0059, 0.5547){\line(1,0){ 0.0059}} % line 0.4609 from 0.0100 to 0.0200 \put( 0.0059, 0.5391){\line(1,0){ 0.0059}} % line 0.4844 from 0.0100 to 0.0400 \put( 0.0059, 0.5156){\line(1,0){ 0.0177}} % line 0.4766 from 0.0100 to 0.0200 \put( 0.0059, 0.5234){\line(1,0){ 0.0059}} % line 0.4922 from 0.0100 to 0.0200 \put( 0.0059, 0.5078){\line(1,0){ 0.0059}} % 10 at 0.0100, 0.6250 \put( 0.2397, 0.3750){\makebox(0,0)[l]{\tt10}} % line 0.7500 from 0.0100 to 0.4500 \put( 0.0059, 0.2500){\line(1,0){ 0.2602}} % 11 
at 0.0100, 0.8750 \put( 0.2397, 0.1250){\makebox(0,0)[l]{\tt11}} % 100 at 0.0100, 0.5625 \put( 0.1806, 0.4375){\makebox(0,0)[l]{\tt100}} % line 0.6250 from 0.0100 to 0.3800 \put( 0.0059, 0.3750){\line(1,0){ 0.2188}} % 101 at 0.0100, 0.6875 \put( 0.1806, 0.3125){\makebox(0,0)[l]{\tt101}} % 1000 at 0.0100, 0.5312 \put( 0.1207, 0.4688){\makebox(0,0)[l]{\tt1000}} % line 0.5625 from 0.0100 to 0.2800 \put( 0.0059, 0.4375){\line(1,0){ 0.1597}} % 1001 at 0.0100, 0.5938 \put( 0.1207, 0.4062){\makebox(0,0)[l]{\tt1001}} % 10000 at 0.0100, 0.5156 \put( 0.0387, 0.4844){\makebox(0,0)[l]{\tt10000}} % line 0.5312 from 0.0100 to 0.1500 \put( 0.0059, 0.4688){\line(1,0){ 0.0828}} % 10001 at 0.0100, 0.5469 \put( 0.0387, 0.4531){\makebox(0,0)[l]{\tt10001}} % line 0.5156 from 0.0100 to 0.0400 \put( 0.0059, 0.4844){\line(1,0){ 0.0177}} % line 0.5078 from 0.0100 to 0.0200 \put( 0.0059, 0.4922){\line(1,0){ 0.0059}} % line 0.5234 from 0.0100 to 0.0200 \put( 0.0059, 0.4766){\line(1,0){ 0.0059}} % line 0.5469 from 0.0100 to 0.0400 \put( 0.0059, 0.4531){\line(1,0){ 0.0177}} % line 0.5391 from 0.0100 to 0.0200 \put( 0.0059, 0.4609){\line(1,0){ 0.0059}} % line 0.5547 from 0.0100 to 0.0200 \put( 0.0059, 0.4453){\line(1,0){ 0.0059}} % 10010 at 0.0100, 0.5781 \put( 0.0387, 0.4219){\makebox(0,0)[l]{\tt10010}} % line 0.5938 from 0.0100 to 0.1500 \put( 0.0059, 0.4062){\line(1,0){ 0.0828}} % 10011 at 0.0100, 0.6094 \put( 0.0387, 0.3906){\makebox(0,0)[l]{\tt10011}} % line 0.5781 from 0.0100 to 0.0400 \put( 0.0059, 0.4219){\line(1,0){ 0.0177}} % line 0.5703 from 0.0100 to 0.0200 \put( 0.0059, 0.4297){\line(1,0){ 0.0059}} % line 0.5859 from 0.0100 to 0.0200 \put( 0.0059, 0.4141){\line(1,0){ 0.0059}} % line 0.6094 from 0.0100 to 0.0400 \put( 0.0059, 0.3906){\line(1,0){ 0.0177}} % line 0.6016 from 0.0100 to 0.0200 \put( 0.0059, 0.3984){\line(1,0){ 0.0059}} % line 0.6172 from 0.0100 to 0.0200 \put( 0.0059, 0.3828){\line(1,0){ 0.0059}} % 1010 at 0.0100, 0.6562 \put( 0.1207, 0.3438){\makebox(0,0)[l]{\tt1010}} 
% line 0.6875 from 0.0100 to 0.2800 \put( 0.0059, 0.3125){\line(1,0){ 0.1597}} % 1011 at 0.0100, 0.7188 \put( 0.1207, 0.2812){\makebox(0,0)[l]{\tt1011}} % 10100 at 0.0100, 0.6406 \put( 0.0387, 0.3594){\makebox(0,0)[l]{\tt10100}} % line 0.6562 from 0.0100 to 0.1500 \put( 0.0059, 0.3438){\line(1,0){ 0.0828}} % 10101 at 0.0100, 0.6719 \put( 0.0387, 0.3281){\makebox(0,0)[l]{\tt10101}} % line 0.6406 from 0.0100 to 0.0400 \put( 0.0059, 0.3594){\line(1,0){ 0.0177}} % line 0.6328 from 0.0100 to 0.0200 \put( 0.0059, 0.3672){\line(1,0){ 0.0059}} % line 0.6484 from 0.0100 to 0.0200 \put( 0.0059, 0.3516){\line(1,0){ 0.0059}} % line 0.6719 from 0.0100 to 0.0400 \put( 0.0059, 0.3281){\line(1,0){ 0.0177}} % line 0.6641 from 0.0100 to 0.0200 \put( 0.0059, 0.3359){\line(1,0){ 0.0059}} % line 0.6797 from 0.0100 to 0.0200 \put( 0.0059, 0.3203){\line(1,0){ 0.0059}} % 10110 at 0.0100, 0.7031 \put( 0.0387, 0.2969){\makebox(0,0)[l]{\tt10110}} % line 0.7188 from 0.0100 to 0.1500 \put( 0.0059, 0.2812){\line(1,0){ 0.0828}} % 10111 at 0.0100, 0.7344 \put( 0.0387, 0.2656){\makebox(0,0)[l]{\tt10111}} % line 0.7031 from 0.0100 to 0.0400 \put( 0.0059, 0.2969){\line(1,0){ 0.0177}} % line 0.6953 from 0.0100 to 0.0200 \put( 0.0059, 0.3047){\line(1,0){ 0.0059}} % line 0.7109 from 0.0100 to 0.0200 \put( 0.0059, 0.2891){\line(1,0){ 0.0059}} % line 0.7344 from 0.0100 to 0.0400 \put( 0.0059, 0.2656){\line(1,0){ 0.0177}} % line 0.7266 from 0.0100 to 0.0200 \put( 0.0059, 0.2734){\line(1,0){ 0.0059}} % line 0.7422 from 0.0100 to 0.0200 \put( 0.0059, 0.2578){\line(1,0){ 0.0059}} % 110 at 0.0100, 0.8125 \put( 0.1806, 0.1875){\makebox(0,0)[l]{\tt110}} % line 0.8750 from 0.0100 to 0.3800 \put( 0.0059, 0.1250){\line(1,0){ 0.2188}} % 111 at 0.0100, 0.9375 \put( 0.1806, 0.0625){\makebox(0,0)[l]{\tt111}} % 1100 at 0.0100, 0.7812 \put( 0.1207, 0.2188){\makebox(0,0)[l]{\tt1100}} % line 0.8125 from 0.0100 to 0.2800 \put( 0.0059, 0.1875){\line(1,0){ 0.1597}} % 1101 at 0.0100, 0.8438 \put( 0.1207, 
0.1562){\makebox(0,0)[l]{\tt1101}} % 11000 at 0.0100, 0.7656 \put( 0.0387, 0.2344){\makebox(0,0)[l]{\tt11000}} % line 0.7812 from 0.0100 to 0.1500 \put( 0.0059, 0.2188){\line(1,0){ 0.0828}} % 11001 at 0.0100, 0.7969 \put( 0.0387, 0.2031){\makebox(0,0)[l]{\tt11001}} % line 0.7656 from 0.0100 to 0.0400 \put( 0.0059, 0.2344){\line(1,0){ 0.0177}} % line 0.7578 from 0.0100 to 0.0200 \put( 0.0059, 0.2422){\line(1,0){ 0.0059}} % line 0.7734 from 0.0100 to 0.0200 \put( 0.0059, 0.2266){\line(1,0){ 0.0059}} % line 0.7969 from 0.0100 to 0.0400 \put( 0.0059, 0.2031){\line(1,0){ 0.0177}} % line 0.7891 from 0.0100 to 0.0200 \put( 0.0059, 0.2109){\line(1,0){ 0.0059}} % line 0.8047 from 0.0100 to 0.0200 \put( 0.0059, 0.1953){\line(1,0){ 0.0059}} % 11010 at 0.0100, 0.8281 \put( 0.0387, 0.1719){\makebox(0,0)[l]{\tt11010}} % line 0.8438 from 0.0100 to 0.1500 \put( 0.0059, 0.1562){\line(1,0){ 0.0828}} % 11011 at 0.0100, 0.8594 \put( 0.0387, 0.1406){\makebox(0,0)[l]{\tt11011}} % line 0.8281 from 0.0100 to 0.0400 \put( 0.0059, 0.1719){\line(1,0){ 0.0177}} % line 0.8203 from 0.0100 to 0.0200 \put( 0.0059, 0.1797){\line(1,0){ 0.0059}} % line 0.8359 from 0.0100 to 0.0200 \put( 0.0059, 0.1641){\line(1,0){ 0.0059}} % line 0.8594 from 0.0100 to 0.0400 \put( 0.0059, 0.1406){\line(1,0){ 0.0177}} % line 0.8516 from 0.0100 to 0.0200 \put( 0.0059, 0.1484){\line(1,0){ 0.0059}} % line 0.8672 from 0.0100 to 0.0200 \put( 0.0059, 0.1328){\line(1,0){ 0.0059}} % 1110 at 0.0100, 0.9062 \put( 0.1207, 0.0938){\makebox(0,0)[l]{\tt1110}} % line 0.9375 from 0.0100 to 0.2800 \put( 0.0059, 0.0625){\line(1,0){ 0.1597}} % 1111 at 0.0100, 0.9688 \put( 0.1207, 0.0312){\makebox(0,0)[l]{\tt1111}} % 11100 at 0.0100, 0.8906 \put( 0.0387, 0.1094){\makebox(0,0)[l]{\tt11100}} % line 0.9062 from 0.0100 to 0.1500 \put( 0.0059, 0.0938){\line(1,0){ 0.0828}} % 11101 at 0.0100, 0.9219 \put( 0.0387, 0.0781){\makebox(0,0)[l]{\tt11101}} % line 0.8906 from 0.0100 to 0.0400 \put( 0.0059, 0.1094){\line(1,0){ 0.0177}} % line 0.8828 
from 0.0100 to 0.0200 \put( 0.0059, 0.1172){\line(1,0){ 0.0059}} % line 0.8984 from 0.0100 to 0.0200 \put( 0.0059, 0.1016){\line(1,0){ 0.0059}} % line 0.9219 from 0.0100 to 0.0400 \put( 0.0059, 0.0781){\line(1,0){ 0.0177}} % line 0.9141 from 0.0100 to 0.0200 \put( 0.0059, 0.0859){\line(1,0){ 0.0059}} % line 0.9297 from 0.0100 to 0.0200 \put( 0.0059, 0.0703){\line(1,0){ 0.0059}} % 11110 at 0.0100, 0.9531 \put( 0.0387, 0.0469){\makebox(0,0)[l]{\tt11110}} % line 0.9688 from 0.0100 to 0.1500 \put( 0.0059, 0.0312){\line(1,0){ 0.0828}} % 11111 at 0.0100, 0.9844 \put( 0.0387, 0.0156){\makebox(0,0)[l]{\tt11111}} % line 0.9531 from 0.0100 to 0.0400 \put( 0.0059, 0.0469){\line(1,0){ 0.0177}} % line 0.9453 from 0.0100 to 0.0200 \put( 0.0059, 0.0547){\line(1,0){ 0.0059}} % line 0.9609 from 0.0100 to 0.0200 \put( 0.0059, 0.0391){\line(1,0){ 0.0059}} % line 0.9844 from 0.0100 to 0.0400 \put( 0.0059, 0.0156){\line(1,0){ 0.0177}} % line 0.9766 from 0.0100 to 0.0200 \put( 0.0059, 0.0234){\line(1,0){ 0.0059}} % line 0.9922 from 0.0100 to 0.0200 \put( 0.0059, 0.0078){\line(1,0){ 0.0059}} \end{picture} \hspace{-0.04in}% was -.25 \raisebox{1.1895in}{% was 1.425 \setlength{\unitlength}{33.39in} %\setlength{\unitlength}{40in} \begin{picture}(0.085,0.04)(-0.0425,0.37) \thinlines % % wings added by hand \put( -0.0408 , 0.4082){\line(-1,-3){ 0.005}} \put( -0.0408 , 0.3730){\line(-1,3){ 0.005}} % % arrow identifying the final interval added by hand % the center of the interval is 0010 below this point % 10011110 (0.3809) % 0.0017 is the length of the stubby lines % % want vector's tip to end at height 0.37995 and x=0.0010 % 4*34 = 136 -> 36635 % this was perfectly positioned %\put( 0.0040, 0.36635){\makebox(0,0)[tl]{\tt100111101}} %\put( 0.0044, 0.36635){\vector(-1,4){0.0034}} % but I shifted it to this for arty reasons \put( 0.0048, 0.36635){\makebox(0,0)[tl]{\tt100111101}} \put( 0.0052, 0.36635){\vector(-1,4){0.0034}} % % line 0.5966 from -0.4800 to 0.0000 \put( -0.0408, 
0.4034){\line(1,0){ 0.0408}} % bbba at -0.2800, 0.6096 \put( -0.0252, 0.3904){\makebox(0,0)[r]{\tt{bbba}}} % line 0.6227 from -0.4200 to 0.0000 \put( -0.0357, 0.3773){\line(1,0){ 0.0357}} % bbbaa at -0.1000, 0.6003 \put( -0.0099, 0.3997){\makebox(0,0)[r]{\tt{bbbaa}}} % line 0.6040 from -0.2800 to 0.0000 \put( -0.0238, 0.3960){\line(1,0){ 0.0238}} % bbbab at -0.1000, 0.6114 \put( -0.0099, 0.3886){\makebox(0,0)[r]{\tt{bbbab}}} % line 0.6188 from -0.2800 to 0.0000 \put( -0.0238, 0.3812){\line(1,0){ 0.0238}} % bbba\eof at -0.1000, 0.6207 \put( -0.0099, 0.3793){\makebox(0,0)[r]{\tt{bbba\teof}}} % line 0.6250 from 0.0100 to 0.4900 \put( 0.0008, 0.3750){\line(1,0){ 0.0408}} % line 0.5938 from 0.0100 to 0.4200 \put( 0.0008, 0.4062){\line(1,0){ 0.0348}} % 10011 at 0.0100, 0.6094 \put( 0.0299, 0.3906){\makebox(0,0)[l]{\tt10011}} % moved left a bit, was.0329 % 10010111 at 0.0100, 0.5918 \put( 0.0040, 0.4082){\makebox(0,0)[l]{\tt10010111}} % line 0.5918 from 0.0100 to 0.0300 \put( 0.0008, 0.4082){\line(1,0){ 0.0017}} % line 0.6094 from 0.0100 to 0.3700 % shortened, was .0306 \put( 0.0008, 0.3906){\line(1,0){ 0.0276}} % line 0.6016 from 0.0100 to 0.3000 \put( 0.0008, 0.3984){\line(1,0){ 0.0246}} % 10011000 at 0.0100, 0.5957 \put( 0.0040, 0.4043){\makebox(0,0)[l]{\tt10011000}} % line 0.5977 from 0.0100 to 0.2100 \put( 0.0008, 0.4023){\line(1,0){ 0.0170}} % 10011001 at 0.0100, 0.5996 \put( 0.0040, 0.4004){\makebox(0,0)[l]{\tt10011001}} % line 0.5957 from 0.0100 to 0.0300 \put( 0.0008, 0.4043){\line(1,0){ 0.0017}} % line 0.5996 from 0.0100 to 0.0300 \put( 0.0008, 0.4004){\line(1,0){ 0.0017}} % 10011010 at 0.0100, 0.6035 \put( 0.0040, 0.3965){\makebox(0,0)[l]{\tt10011010}} % line 0.6055 from 0.0100 to 0.2100 \put( 0.0008, 0.3945){\line(1,0){ 0.0170}} % 10011011 at 0.0100, 0.6074 \put( 0.0040, 0.3926){\makebox(0,0)[l]{\tt10011011}} % line 0.6035 from 0.0100 to 0.0300 \put( 0.0008, 0.3965){\line(1,0){ 0.0017}} % line 0.6074 from 0.0100 to 0.0300 \put( 0.0008, 0.3926){\line(1,0){ 
0.0017}} % line 0.6172 from 0.0100 to 0.3000 \put( 0.0008, 0.3828){\line(1,0){ 0.0246}} % 10011100 at 0.0100, 0.6113 \put( 0.0040, 0.3887){\makebox(0,0)[l]{\tt10011100}} % line 0.6133 from 0.0100 to 0.2100 \put( 0.0008, 0.3867){\line(1,0){ 0.0170}} % 10011101 at 0.0100, 0.6152 \put( 0.0040, 0.3848){\makebox(0,0)[l]{\tt10011101}} % line 0.6113 from 0.0100 to 0.0300 \put( 0.0008, 0.3887){\line(1,0){ 0.0017}} % line 0.6152 from 0.0100 to 0.0300 \put( 0.0008, 0.3848){\line(1,0){ 0.0017}} % 10011110 at 0.0100, 0.6191 \put( 0.0040, 0.3809){\makebox(0,0)[l]{\tt10011110}} % line 0.6211 from 0.0100 to 0.2100 \put( 0.0008, 0.3789){\line(1,0){ 0.0170}} % 10011111 at 0.0100, 0.6230 \put( 0.0040, 0.3770){\makebox(0,0)[l]{\tt10011111}} % line 0.6191 from 0.0100 to 0.0300 \put( 0.0008, 0.3809){\line(1,0){ 0.0017}} % line 0.6230 from 0.0100 to 0.0300 \put( 0.0008, 0.3770){\line(1,0){ 0.0017}} % 10100000 at 0.0100, 0.6270 \put( 0.0040, 0.3730){\makebox(0,0)[l]{\tt10100000}} % line 0.6289 from 0.0100 to 0.2100 \put( 0.0008, 0.3711){\line(1,0){ 0.0170}} % line 0.6270 from 0.0100 to 0.0300 \put( 0.0008, 0.3730){\line(1,0){ 0.0017}} \end{picture} } } \end{center} }{% \caption[a]{Illustration of the arithmetic coding process as the sequence {$\tt bbba\eof$} is transmitted.} \label{fig.ac} }% \end{figure} When the first symbol `$\tb$' is observed, the encoder knows that the encoded string will start `{\tt{01}}', `{\tt{10}}', or `{\tt{11}}', but does not know which. The encoder writes nothing for the time being, and examines the next symbol, which is `$\tb$'. The interval `$\tt bb$' lies wholly within interval `{\tt{1}}', so the encoder can write the first bit: `{\tt{1}}'. The third symbol `$\tt b$' narrows down the interval a little, but not quite enough for it to lie wholly within interval `{\tt{10}}'. Only when the next `$\tt a$' is read from the source can we transmit some more bits. 
Interval `$\tt bbba$' lies wholly within the interval `{\tt{1001}}', so the encoder adds `{\tt{001}}' to the `{\tt{1}}' it has written. Finally, when the `$\eof$' arrives, we need a procedure for terminating the encoding. Magnifying the interval `$\tt bbba\eof$' (\figref{fig.ac}, right) we note that the marked interval `{\tt{100111101}}' is wholly contained by the interval `$\tt bbba\eof$', so the encoding can be completed by appending `{\tt{11101}}'.

\exercissxA{2}{ex.ac.terminate}{ Show that the overhead required to terminate a message is never more than 2 bits, relative to the ideal message length given the probabilistic model $\H$, $h(\bx \given \H) = \log [ 1/ P(\bx \given \H)]$. }
% \begin{center}
% % created by ac.p sub=1 unit=40 only_show_data=1 > ac/ac_sub_data.tex
% \input{figs/ac/ac_sub_data.tex}
% \end{center}

This is an important result. Arithmetic coding is very nearly optimal. The message length is always within two bits of the {Shannon information content}\index{information content} of the entire source string, so the expected message length is within two bits of the entropy of the entire message.

\subsubsection{Decoding\subsubpunc}
The decoder receives the string `{\tt{100111101}}' and passes along it one symbol at a time. First, the probabilities $P(\ta), P(\tb), P(\eof)$ are computed using the identical program that the encoder used and the intervals `$\ta$', `$\tb$' and `$\eof$' are deduced. Once the first two bits `{\tt{10}}' have been examined, it is certain that the original string must have started with a `$\tb$', since the interval `{\tt{10}}' lies wholly within interval `$\tb$'. The decoder can then use the model to compute $P(\ta \given \tb), P(\tb \given \tb), P(\eof \given \tb)$ and deduce the boundaries of the intervals `$\tb\ta$', `$\tb\tb$' and `$\tb\eof$'.
Continuing, we decode the second $\tb$ once we reach `{\tt{1001}}', the third $\tb$ once we reach `{\tt{100111}}', and so forth, with the unambiguous identification of `$\tb\tb\tb\ta\eof$' once the whole binary string has been read. With the convention that `$\eof$' denotes the end of the message, the decoder knows to stop decoding. \subsubsection{Transmission of multiple files\subsubpunc} How might one use arithmetic coding to communicate several distinct files over the binary channel? Once the $\eof$ character has been transmitted, we imagine that the decoder is reset into its initial state. There is no transfer of the learnt statistics of the first file to the second file. % We start a fresh arithmetic code. If, however, we did believe that there is a relationship among the files that we are going to compress, we could define our alphabet differently, introducing a second end-of-file character that marks the end of the file but instructs the encoder and decoder to continue using the same probabilistic model. % If we went this route, % we would only be able to uncompress the second file % after first uncompressing the first file. \subsection{The big picture} Notice that to communicate a string of $N$ letters % coming from an alphabet of size $|\A| = I$ both the encoder and the decoder needed to compute only $N|\A|$ conditional probabilities -- the probabilities of each possible letter in each context actually encountered -- just as in the guessing game.\index{guessing game} This cost can be contrasted with the alternative of using a Huffman code\index{Huffman code!disadvantages} with a large block size (in order to reduce the possible one-bit-per-symbol overhead discussed in % the previous chapter section \ref{sec.huffman.probs}), where {\em all\/} block sequences that could occur % be encoded % in a block must be considered and their probabilities evaluated. 
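
To put rough numbers on this contrast (an illustrative calculation under assumed sizes, not a result quoted from elsewhere in the text): for the five-symbol string $\tt bbba\eof$ and the three-symbol alphabet $\{\ta,\tb,\eof\}$ we have $N=5$ and $|\A|=3$, so the arithmetic coder evaluates
\beq
 N |\A| = 5 \times 3 = 15
\eeq
conditional probabilities, whereas a block code with block size $N$ over the same alphabet would have to consider up to
\beq
 |\A|^N = 3^5 = 243
\eeq
block sequences. The gap widens rapidly: assuming, purely for illustration, a 128-symbol alphabet and $N=100$, the counts are $N|\A| = 12\,800$ against $|\A|^N = 2^{700} \simeq 5 \times 10^{210}$.
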
Notice how flexible arithmetic coding is: it can be used with any source alphabet and any encoded alphabet. The size of the source alphabet and the encoded alphabet can change with time. Arithmetic coding can be used with any probability distribution, which can change utterly from context to context. Furthermore, if we would like the symbols of the encoding alphabet (say, {\tt 0} and {\tt 1}) to be used with {\em unequal\/} frequency, that can easily be arranged by subdividing the right-hand interval in proportion to the required frequencies. \subsection{How the probabilistic model might make its predictions} The technique of arithmetic coding does not force one to produce the predictive probability in any particular way, but the predictive distributions might naturally be produced by a Bayesian model. \Figref{fig.ac} was generated using a simple model that always assigns a probability of 0.15 to $\eof$, and assigns the remaining 0.85 to $\ta$ and $\tb$, divided in proportion to probabilities given by Laplace's rule, \beq P_{\rm L}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+1}{F_{\ta}+F_{\tb}+2} , \label{eq.laplaceagain} \eeq where $F_{\ta}(x_1,\ldots,x_{n-1})$ is the number of times that $\ta$ has occurred so far, and $F_{\tb}$ is the count of $\tb$s. These predictions correspond to a simple Bayesian model that expects and adapts to % is able to learn a non-equal frequency of use of the source symbols $\ta$ and $\tb$ within a file. % The end result will be an encoder that can adapt to a nonuniform source. \Figref{fig.ac2} displays the intervals corresponding to a number of strings of length up to five. Note that if the string so far has contained a large number of $\tb$s then the probability of $\tb$ relative to $\ta$ is increased, and conversely if many $\ta$s occur then $\ta$s are made more probable. Larger intervals, remember, require fewer bits to encode. 
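
As a worked check of these predictions, the table below follows the string $\tt bbba\eof$ through the model. It is a sketch under two assumptions: that the sub-intervals are laid out in the order $\ta$, $\tb$, $\eof$ starting from 0, and that $P(\ta \given \cdot) = 0.85 \times (F_{\ta}+1)/(F_{\ta}+F_{\tb}+2)$, with the analogous expression for $\tb$. Values are quoted to about four decimal places.
\begin{center}
\begin{tabular}{llllll} \toprule
Context & $P(\ta \given \cdot)$ & $P(\tb \given \cdot)$ & $P(\eof \given \cdot)$ & Next symbol & New interval \\ \midrule
(empty) & 0.4250 & 0.4250 & 0.15 & $\tb$ & $[0.4250, 0.8500)$ \\
$\tt b$ & 0.2833 & 0.5667 & 0.15 & $\tb$ & $[0.5454, 0.7862)$ \\
$\tt bb$ & 0.2125 & 0.6375 & 0.15 & $\tb$ & $[0.5966, 0.7501)$ \\
$\tt bbb$ & 0.1700 & 0.6800 & 0.15 & $\ta$ & $[0.5966, 0.6227)$ \\
$\tt bbba$ & 0.2833 & 0.5667 & 0.15 & $\eof$ & $[0.6188, 0.6227)$ \\ \bottomrule
\end{tabular}
\end{center}
The width of the final interval is the probability of the whole string,
\beq
 P({\tt bbba}\eof \given \H) = 0.425 \times 0.5667 \times 0.6375 \times 0.17 \times 0.15 \simeq 0.003915 ,
 \hspace{0.3in}
 \log_2 \frac{1}{0.003915} \simeq 8.0 \mbox{ bits} .
\eeq
The nine-bit string {\tt 100111101} corresponds to the binary interval $[317/512, 318/512) \simeq [0.6191, 0.6211)$, which lies inside $[0.6188, 0.6227)$. Neither eight-bit candidate fits: {\tt 10011110} is $[0.6172, 0.6211)$ and {\tt 10011111} is $[0.6211, 0.6250)$. So this message costs nine bits, about one bit more than the ideal length, consistent with an overhead of at most two bits.
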
% \begin{figure}[tbp] \figuremargin{% \begin{center} % created by ac.p only_show_data=1 > ac/ac_data.tex \mbox{ \setlength{\unitlength}{5.75in} \begin{picture}(0.59130434782608698452,1)(-0.29565217391304349226,0) \thinlines % line 0.0000 from -0.5000 to 0.0000 \put( -0.2957, 1.0000){\line(1,0){ 0.2957}} % a at -0.4500, 0.2125 \put( -0.2811, 0.7875){\makebox(0,0)[r]{\tt{a}}} % line 0.4250 from -0.5000 to 0.0000 \put( -0.2957, 0.5750){\line(1,0){ 0.2957}} % b at -0.4500, 0.6375 \put( -0.2811, 0.3625){\makebox(0,0)[r]{\tt{b}}} % line 0.8500 from -0.5000 to 0.0000 \put( -0.2957, 0.1500){\line(1,0){ 0.2957}} % \eof at -0.4500, 0.9250 \put( -0.2811, 0.0750){\makebox(0,0)[r]{\tt{\teof}}} % line 1.0000 from -0.5000 to 0.0000 \put( -0.2957, 0.0000){\line(1,0){ 0.2957}} % aa at -0.3500, 0.1204 \put( -0.2220, 0.8796){\makebox(0,0)[r]{\tt{aa}}} % line 0.2408 from -0.4500 to 0.0000 \put( -0.2661, 0.7592){\line(1,0){ 0.2661}} % ab at -0.3500, 0.3010 \put( -0.2220, 0.6990){\makebox(0,0)[r]{\tt{ab}}} % line 0.3612 from -0.4500 to 0.0000 \put( -0.2661, 0.6388){\line(1,0){ 0.2661}} % a\eof at -0.3500, 0.3931 \put( -0.2220, 0.6069){\makebox(0,0)[r]{\tt{a\teof}}} % aaa at -0.2300, 0.0768 \put( -0.1510, 0.9232){\makebox(0,0)[r]{\tt{aaa}}} % line 0.1535 from -0.3500 to 0.0000 \put( -0.2070, 0.8465){\line(1,0){ 0.2070}} % aab at -0.2300, 0.1791 \put( -0.1510, 0.8209){\makebox(0,0)[r]{\tt{aab}}} % line 0.2047 from -0.3500 to 0.0000 \put( -0.2070, 0.7953){\line(1,0){ 0.2070}} % aa\eof at -0.2300, 0.2228 \put( -0.1510, 0.7772){\makebox(0,0)[r]{\tt{aa\teof}}} % aaaa at -0.1000, 0.0522 \put( -0.0741, 0.9478){\makebox(0,0)[r]{\tt{aaaa}}} % line 0.1044 from -0.2300 to 0.0000 \put( -0.1360, 0.8956){\line(1,0){ 0.1360}} % aaab at -0.1000, 0.1175 \put( -0.0741, 0.8825){\makebox(0,0)[r]{\tt{aaab}}} % line 0.1305 from -0.2300 to 0.0000 \put( -0.1360, 0.8695){\line(1,0){ 0.1360}} % line 0.0740 from -0.1000 to 0.0000 \put( -0.0591, 0.9260){\line(1,0){ 0.0591}} % line 0.0887 from -0.1000 to 0.0000 
\put( -0.0591, 0.9113){\line(1,0){ 0.0591}} % line 0.1192 from -0.1000 to 0.0000 \put( -0.0591, 0.8808){\line(1,0){ 0.0591}} % line 0.1266 from -0.1000 to 0.0000 \put( -0.0591, 0.8734){\line(1,0){ 0.0591}} % aaba at -0.1000, 0.1666 \put( -0.0741, 0.8334){\makebox(0,0)[r]{\tt{aaba}}} % line 0.1796 from -0.2300 to 0.0000 \put( -0.1360, 0.8204){\line(1,0){ 0.1360}} % aabb at -0.1000, 0.1883 \put( -0.0741, 0.8117){\makebox(0,0)[r]{\tt{aabb}}} % line 0.1970 from -0.2300 to 0.0000 \put( -0.1360, 0.8030){\line(1,0){ 0.1360}} % line 0.1683 from -0.1000 to 0.0000 \put( -0.0591, 0.8317){\line(1,0){ 0.0591}} % line 0.1757 from -0.1000 to 0.0000 \put( -0.0591, 0.8243){\line(1,0){ 0.0591}} % line 0.1870 from -0.1000 to 0.0000 \put( -0.0591, 0.8130){\line(1,0){ 0.0591}} % line 0.1944 from -0.1000 to 0.0000 \put( -0.0591, 0.8056){\line(1,0){ 0.0591}} % aba at -0.2300, 0.2664 \put( -0.1510, 0.7336){\makebox(0,0)[r]{\tt{aba}}} % line 0.2920 from -0.3500 to 0.0000 \put( -0.2070, 0.7080){\line(1,0){ 0.2070}} % abb at -0.2300, 0.3176 \put( -0.1510, 0.6824){\makebox(0,0)[r]{\tt{abb}}} % line 0.3432 from -0.3500 to 0.0000 \put( -0.2070, 0.6568){\line(1,0){ 0.2070}} % ab\eof at -0.2300, 0.3522 \put( -0.1510, 0.6478){\makebox(0,0)[r]{\tt{ab\teof}}} % abaa at -0.1000, 0.2539 \put( -0.0741, 0.7461){\makebox(0,0)[r]{\tt{abaa}}} % line 0.2669 from -0.2300 to 0.0000 \put( -0.1360, 0.7331){\line(1,0){ 0.1360}} % abab at -0.1000, 0.2756 \put( -0.0741, 0.7244){\makebox(0,0)[r]{\tt{abab}}} % line 0.2843 from -0.2300 to 0.0000 \put( -0.1360, 0.7157){\line(1,0){ 0.1360}} % line 0.2556 from -0.1000 to 0.0000 \put( -0.0591, 0.7444){\line(1,0){ 0.0591}} % line 0.2630 from -0.1000 to 0.0000 \put( -0.0591, 0.7370){\line(1,0){ 0.0591}} % line 0.2743 from -0.1000 to 0.0000 \put( -0.0591, 0.7257){\line(1,0){ 0.0591}} % line 0.2817 from -0.1000 to 0.0000 \put( -0.0591, 0.7183){\line(1,0){ 0.0591}} % abba at -0.1000, 0.3007 \put( -0.0741, 0.6993){\makebox(0,0)[r]{\tt{abba}}} % line 0.3094 from -0.2300 to 
0.0000 \put( -0.1360, 0.6906){\line(1,0){ 0.1360}} % abbb at -0.1000, 0.3225 \put( -0.0741, 0.6775){\makebox(0,0)[r]{\tt{abbb}}} % line 0.3355 from -0.2300 to 0.0000 \put( -0.1360, 0.6645){\line(1,0){ 0.1360}} % line 0.2994 from -0.1000 to 0.0000 \put( -0.0591, 0.7006){\line(1,0){ 0.0591}} % line 0.3068 from -0.1000 to 0.0000 \put( -0.0591, 0.6932){\line(1,0){ 0.0591}} % line 0.3168 from -0.1000 to 0.0000 \put( -0.0591, 0.6832){\line(1,0){ 0.0591}} % line 0.3316 from -0.1000 to 0.0000 \put( -0.0591, 0.6684){\line(1,0){ 0.0591}} % ba at -0.3500, 0.4852 \put( -0.2220, 0.5148){\makebox(0,0)[r]{\tt{ba}}} % line 0.5454 from -0.4500 to 0.0000 \put( -0.2661, 0.4546){\line(1,0){ 0.2661}} % bb at -0.3500, 0.6658 \put( -0.2220, 0.3342){\makebox(0,0)[r]{\tt{bb}}} % line 0.7862 from -0.4500 to 0.0000 \put( -0.2661, 0.2138){\line(1,0){ 0.2661}} % b\eof at -0.3500, 0.8181 \put( -0.2220, 0.1819){\makebox(0,0)[r]{\tt{b\teof}}} % baa at -0.2300, 0.4506 \put( -0.1510, 0.5494){\makebox(0,0)[r]{\tt{baa}}} % line 0.4762 from -0.3500 to 0.0000 \put( -0.2070, 0.5238){\line(1,0){ 0.2070}} % bab at -0.2300, 0.5018 \put( -0.1510, 0.4982){\makebox(0,0)[r]{\tt{bab}}} % line 0.5274 from -0.3500 to 0.0000 \put( -0.2070, 0.4726){\line(1,0){ 0.2070}} % ba\eof at -0.2300, 0.5364 \put( -0.1510, 0.4636){\makebox(0,0)[r]{\tt{ba\teof}}} % baaa at -0.1000, 0.4381 \put( -0.0741, 0.5619){\makebox(0,0)[r]{\tt{baaa}}} % line 0.4511 from -0.2300 to 0.0000 \put( -0.1360, 0.5489){\line(1,0){ 0.1360}} % baab at -0.1000, 0.4598 \put( -0.0741, 0.5402){\makebox(0,0)[r]{\tt{baab}}} % line 0.4685 from -0.2300 to 0.0000 \put( -0.1360, 0.5315){\line(1,0){ 0.1360}} % line 0.4398 from -0.1000 to 0.0000 \put( -0.0591, 0.5602){\line(1,0){ 0.0591}} % line 0.4472 from -0.1000 to 0.0000 \put( -0.0591, 0.5528){\line(1,0){ 0.0591}} % line 0.4585 from -0.1000 to 0.0000 \put( -0.0591, 0.5415){\line(1,0){ 0.0591}} % line 0.4659 from -0.1000 to 0.0000 \put( -0.0591, 0.5341){\line(1,0){ 0.0591}} % baba at -0.1000, 0.4849 \put( 
-0.0741, 0.5151){\makebox(0,0)[r]{\tt{baba}}} % line 0.4936 from -0.2300 to 0.0000 \put( -0.1360, 0.5064){\line(1,0){ 0.1360}} % babb at -0.1000, 0.5066 \put( -0.0741, 0.4934){\makebox(0,0)[r]{\tt{babb}}} % line 0.5197 from -0.2300 to 0.0000 \put( -0.1360, 0.4803){\line(1,0){ 0.1360}} % line 0.4836 from -0.1000 to 0.0000 \put( -0.0591, 0.5164){\line(1,0){ 0.0591}} % line 0.4910 from -0.1000 to 0.0000 \put( -0.0591, 0.5090){\line(1,0){ 0.0591}} % line 0.5010 from -0.1000 to 0.0000 \put( -0.0591, 0.4990){\line(1,0){ 0.0591}} % line 0.5158 from -0.1000 to 0.0000 \put( -0.0591, 0.4842){\line(1,0){ 0.0591}} % bba at -0.2300, 0.5710 \put( -0.1510, 0.4290){\makebox(0,0)[r]{\tt{bba}}} % line 0.5966 from -0.3500 to 0.0000 \put( -0.2070, 0.4034){\line(1,0){ 0.2070}} % bbb at -0.2300, 0.6734 \put( -0.1510, 0.3266){\makebox(0,0)[r]{\tt{bbb}}} % line 0.7501 from -0.3500 to 0.0000 \put( -0.2070, 0.2499){\line(1,0){ 0.2070}} % bb\eof at -0.2300, 0.7682 \put( -0.1510, 0.2318){\makebox(0,0)[r]{\tt{bb\teof}}} % bbaa at -0.1000, 0.5541 \put( -0.0741, 0.4459){\makebox(0,0)[r]{\tt{bbaa}}} % line 0.5628 from -0.2300 to 0.0000 \put( -0.1360, 0.4372){\line(1,0){ 0.1360}} % bbab at -0.1000, 0.5759 \put( -0.0741, 0.4241){\makebox(0,0)[r]{\tt{bbab}}} % line 0.5889 from -0.2300 to 0.0000 \put( -0.1360, 0.4111){\line(1,0){ 0.1360}} % line 0.5528 from -0.1000 to 0.0000 \put( -0.0591, 0.4472){\line(1,0){ 0.0591}} % line 0.5602 from -0.1000 to 0.0000 \put( -0.0591, 0.4398){\line(1,0){ 0.0591}} % line 0.5702 from -0.1000 to 0.0000 \put( -0.0591, 0.4298){\line(1,0){ 0.0591}} % line 0.5850 from -0.1000 to 0.0000 \put( -0.0591, 0.4150){\line(1,0){ 0.0591}} % bbba at -0.1000, 0.6096 \put( -0.0741, 0.3904){\makebox(0,0)[r]{\tt{bbba}}} % line 0.6227 from -0.2300 to 0.0000 \put( -0.1360, 0.3773){\line(1,0){ 0.1360}} % bbbb at -0.1000, 0.6749 \put( -0.0741, 0.3251){\makebox(0,0)[r]{\tt{bbbb}}} % line 0.7271 from -0.2300 to 0.0000 \put( -0.1360, 0.2729){\line(1,0){ 0.1360}} % line 0.6040 from -0.1000 to 
0.0000 \put( -0.0591, 0.3960){\line(1,0){ 0.0591}} % line 0.6188 from -0.1000 to 0.0000 \put( -0.0591, 0.3812){\line(1,0){ 0.0591}} % line 0.6375 from -0.1000 to 0.0000 \put( -0.0591, 0.3625){\line(1,0){ 0.0591}} % line 0.7114 from -0.1000 to 0.0000 \put( -0.0591, 0.2886){\line(1,0){ 0.0591}} % line 0.0000 from 0.0100 to 0.5000 \put( 0.0059, 1.0000){\line(1,0){ 0.2897}} % 0 at 0.0100, 0.2500 \put( 0.2811, 0.7500){\makebox(0,0)[l]{\tt0}} % line 0.5000 from 0.0100 to 0.5000 \put( 0.0059, 0.5000){\line(1,0){ 0.2897}} % 1 at 0.0100, 0.7500 \put( 0.2811, 0.2500){\makebox(0,0)[l]{\tt1}} % line 1.0000 from 0.0100 to 0.5000 \put( 0.0059, 0.0000){\line(1,0){ 0.2897}} % 00 at 0.0100, 0.1250 \put( 0.2397, 0.8750){\makebox(0,0)[l]{\tt00}} % line 0.2500 from 0.0100 to 0.4500 \put( 0.0059, 0.7500){\line(1,0){ 0.2602}} % 01 at 0.0100, 0.3750 \put( 0.2397, 0.6250){\makebox(0,0)[l]{\tt01}} % 000 at 0.0100, 0.0625 \put( 0.1806, 0.9375){\makebox(0,0)[l]{\tt000}} % line 0.1250 from 0.0100 to 0.3800 \put( 0.0059, 0.8750){\line(1,0){ 0.2188}} % 001 at 0.0100, 0.1875 \put( 0.1806, 0.8125){\makebox(0,0)[l]{\tt001}} % 0000 at 0.0100, 0.0312 \put( 0.1207, 0.9688){\makebox(0,0)[l]{\tt0000}} % line 0.0625 from 0.0100 to 0.2800 \put( 0.0059, 0.9375){\line(1,0){ 0.1597}} % 0001 at 0.0100, 0.0938 \put( 0.1207, 0.9062){\makebox(0,0)[l]{\tt0001}} % 00000 at 0.0100, 0.0156 \put( 0.0387, 0.9844){\makebox(0,0)[l]{\tt00000}} % line 0.0312 from 0.0100 to 0.1500 \put( 0.0059, 0.9688){\line(1,0){ 0.0828}} % 00001 at 0.0100, 0.0469 \put( 0.0387, 0.9531){\makebox(0,0)[l]{\tt00001}} % line 0.0156 from 0.0100 to 0.0400 \put( 0.0059, 0.9844){\line(1,0){ 0.0177}} % line 0.0078 from 0.0100 to 0.0200 \put( 0.0059, 0.9922){\line(1,0){ 0.0059}} % line 0.0234 from 0.0100 to 0.0200 \put( 0.0059, 0.9766){\line(1,0){ 0.0059}} % line 0.0469 from 0.0100 to 0.0400 \put( 0.0059, 0.9531){\line(1,0){ 0.0177}} % line 0.0391 from 0.0100 to 0.0200 \put( 0.0059, 0.9609){\line(1,0){ 0.0059}} % line 0.0547 from 0.0100 to 0.0200 
\put( 0.0059, 0.9453){\line(1,0){ 0.0059}} % 00010 at 0.0100, 0.0781 \put( 0.0387, 0.9219){\makebox(0,0)[l]{\tt00010}} % line 0.0938 from 0.0100 to 0.1500 \put( 0.0059, 0.9062){\line(1,0){ 0.0828}} % 00011 at 0.0100, 0.1094 \put( 0.0387, 0.8906){\makebox(0,0)[l]{\tt00011}} % line 0.0781 from 0.0100 to 0.0400 \put( 0.0059, 0.9219){\line(1,0){ 0.0177}} % line 0.0703 from 0.0100 to 0.0200 \put( 0.0059, 0.9297){\line(1,0){ 0.0059}} % line 0.0859 from 0.0100 to 0.0200 \put( 0.0059, 0.9141){\line(1,0){ 0.0059}} % line 0.1094 from 0.0100 to 0.0400 \put( 0.0059, 0.8906){\line(1,0){ 0.0177}} % line 0.1016 from 0.0100 to 0.0200 \put( 0.0059, 0.8984){\line(1,0){ 0.0059}} % line 0.1172 from 0.0100 to 0.0200 \put( 0.0059, 0.8828){\line(1,0){ 0.0059}} % 0010 at 0.0100, 0.1562 \put( 0.1207, 0.8438){\makebox(0,0)[l]{\tt0010}} % line 0.1875 from 0.0100 to 0.2800 \put( 0.0059, 0.8125){\line(1,0){ 0.1597}} % 0011 at 0.0100, 0.2188 \put( 0.1207, 0.7812){\makebox(0,0)[l]{\tt0011}} % 00100 at 0.0100, 0.1406 \put( 0.0387, 0.8594){\makebox(0,0)[l]{\tt00100}} % line 0.1562 from 0.0100 to 0.1500 \put( 0.0059, 0.8438){\line(1,0){ 0.0828}} % 00101 at 0.0100, 0.1719 \put( 0.0387, 0.8281){\makebox(0,0)[l]{\tt00101}} % line 0.1406 from 0.0100 to 0.0400 \put( 0.0059, 0.8594){\line(1,0){ 0.0177}} % line 0.1328 from 0.0100 to 0.0200 \put( 0.0059, 0.8672){\line(1,0){ 0.0059}} % line 0.1484 from 0.0100 to 0.0200 \put( 0.0059, 0.8516){\line(1,0){ 0.0059}} % line 0.1719 from 0.0100 to 0.0400 \put( 0.0059, 0.8281){\line(1,0){ 0.0177}} % line 0.1641 from 0.0100 to 0.0200 \put( 0.0059, 0.8359){\line(1,0){ 0.0059}} % line 0.1797 from 0.0100 to 0.0200 \put( 0.0059, 0.8203){\line(1,0){ 0.0059}} % 00110 at 0.0100, 0.2031 \put( 0.0387, 0.7969){\makebox(0,0)[l]{\tt00110}} % line 0.2188 from 0.0100 to 0.1500 \put( 0.0059, 0.7812){\line(1,0){ 0.0828}} % 00111 at 0.0100, 0.2344 \put( 0.0387, 0.7656){\makebox(0,0)[l]{\tt00111}} % line 0.2031 from 0.0100 to 0.0400 \put( 0.0059, 0.7969){\line(1,0){ 0.0177}} % line 
0.1953 from 0.0100 to 0.0200 \put( 0.0059, 0.8047){\line(1,0){ 0.0059}} % line 0.2109 from 0.0100 to 0.0200 \put( 0.0059, 0.7891){\line(1,0){ 0.0059}} % line 0.2344 from 0.0100 to 0.0400 \put( 0.0059, 0.7656){\line(1,0){ 0.0177}} % line 0.2266 from 0.0100 to 0.0200 \put( 0.0059, 0.7734){\line(1,0){ 0.0059}} % line 0.2422 from 0.0100 to 0.0200 \put( 0.0059, 0.7578){\line(1,0){ 0.0059}} % 010 at 0.0100, 0.3125 \put( 0.1806, 0.6875){\makebox(0,0)[l]{\tt010}} % line 0.3750 from 0.0100 to 0.3800 \put( 0.0059, 0.6250){\line(1,0){ 0.2188}} % 011 at 0.0100, 0.4375 \put( 0.1806, 0.5625){\makebox(0,0)[l]{\tt011}} % 0100 at 0.0100, 0.2812 \put( 0.1207, 0.7188){\makebox(0,0)[l]{\tt0100}} % line 0.3125 from 0.0100 to 0.2800 \put( 0.0059, 0.6875){\line(1,0){ 0.1597}} % 0101 at 0.0100, 0.3438 \put( 0.1207, 0.6562){\makebox(0,0)[l]{\tt0101}} % 01000 at 0.0100, 0.2656 \put( 0.0387, 0.7344){\makebox(0,0)[l]{\tt01000}} % line 0.2812 from 0.0100 to 0.1500 \put( 0.0059, 0.7188){\line(1,0){ 0.0828}} % 01001 at 0.0100, 0.2969 \put( 0.0387, 0.7031){\makebox(0,0)[l]{\tt01001}} % line 0.2656 from 0.0100 to 0.0400 \put( 0.0059, 0.7344){\line(1,0){ 0.0177}} % line 0.2578 from 0.0100 to 0.0200 \put( 0.0059, 0.7422){\line(1,0){ 0.0059}} % line 0.2734 from 0.0100 to 0.0200 \put( 0.0059, 0.7266){\line(1,0){ 0.0059}} % line 0.2969 from 0.0100 to 0.0400 \put( 0.0059, 0.7031){\line(1,0){ 0.0177}} % line 0.2891 from 0.0100 to 0.0200 \put( 0.0059, 0.7109){\line(1,0){ 0.0059}} % line 0.3047 from 0.0100 to 0.0200 \put( 0.0059, 0.6953){\line(1,0){ 0.0059}} % 01010 at 0.0100, 0.3281 \put( 0.0387, 0.6719){\makebox(0,0)[l]{\tt01010}} % line 0.3438 from 0.0100 to 0.1500 \put( 0.0059, 0.6562){\line(1,0){ 0.0828}} % 01011 at 0.0100, 0.3594 \put( 0.0387, 0.6406){\makebox(0,0)[l]{\tt01011}} % line 0.3281 from 0.0100 to 0.0400 \put( 0.0059, 0.6719){\line(1,0){ 0.0177}} % line 0.3203 from 0.0100 to 0.0200 \put( 0.0059, 0.6797){\line(1,0){ 0.0059}} % line 0.3359 from 0.0100 to 0.0200 \put( 0.0059, 
0.6641){\line(1,0){ 0.0059}} % line 0.3594 from 0.0100 to 0.0400 \put( 0.0059, 0.6406){\line(1,0){ 0.0177}} % line 0.3516 from 0.0100 to 0.0200 \put( 0.0059, 0.6484){\line(1,0){ 0.0059}} % line 0.3672 from 0.0100 to 0.0200 \put( 0.0059, 0.6328){\line(1,0){ 0.0059}} % 0110 at 0.0100, 0.4062 \put( 0.1207, 0.5938){\makebox(0,0)[l]{\tt0110}} % line 0.4375 from 0.0100 to 0.2800 \put( 0.0059, 0.5625){\line(1,0){ 0.1597}} % 0111 at 0.0100, 0.4688 \put( 0.1207, 0.5312){\makebox(0,0)[l]{\tt0111}} % 01100 at 0.0100, 0.3906 \put( 0.0387, 0.6094){\makebox(0,0)[l]{\tt01100}} % line 0.4062 from 0.0100 to 0.1500 \put( 0.0059, 0.5938){\line(1,0){ 0.0828}} % 01101 at 0.0100, 0.4219 \put( 0.0387, 0.5781){\makebox(0,0)[l]{\tt01101}} % line 0.3906 from 0.0100 to 0.0400 \put( 0.0059, 0.6094){\line(1,0){ 0.0177}} % line 0.3828 from 0.0100 to 0.0200 \put( 0.0059, 0.6172){\line(1,0){ 0.0059}} % line 0.3984 from 0.0100 to 0.0200 \put( 0.0059, 0.6016){\line(1,0){ 0.0059}} % line 0.4219 from 0.0100 to 0.0400 \put( 0.0059, 0.5781){\line(1,0){ 0.0177}} % line 0.4141 from 0.0100 to 0.0200 \put( 0.0059, 0.5859){\line(1,0){ 0.0059}} % line 0.4297 from 0.0100 to 0.0200 \put( 0.0059, 0.5703){\line(1,0){ 0.0059}} % 01110 at 0.0100, 0.4531 \put( 0.0387, 0.5469){\makebox(0,0)[l]{\tt01110}} % line 0.4688 from 0.0100 to 0.1500 \put( 0.0059, 0.5312){\line(1,0){ 0.0828}} % 01111 at 0.0100, 0.4844 \put( 0.0387, 0.5156){\makebox(0,0)[l]{\tt01111}} % line 0.4531 from 0.0100 to 0.0400 \put( 0.0059, 0.5469){\line(1,0){ 0.0177}} % line 0.4453 from 0.0100 to 0.0200 \put( 0.0059, 0.5547){\line(1,0){ 0.0059}} % line 0.4609 from 0.0100 to 0.0200 \put( 0.0059, 0.5391){\line(1,0){ 0.0059}} % line 0.4844 from 0.0100 to 0.0400 \put( 0.0059, 0.5156){\line(1,0){ 0.0177}} % line 0.4766 from 0.0100 to 0.0200 \put( 0.0059, 0.5234){\line(1,0){ 0.0059}} % line 0.4922 from 0.0100 to 0.0200 \put( 0.0059, 0.5078){\line(1,0){ 0.0059}} % 10 at 0.0100, 0.6250 \put( 0.2397, 0.3750){\makebox(0,0)[l]{\tt10}} % line 0.7500 from 0.0100 
to 0.4500 \put( 0.0059, 0.2500){\line(1,0){ 0.2602}} % 11 at 0.0100, 0.8750 \put( 0.2397, 0.1250){\makebox(0,0)[l]{\tt11}} % 100 at 0.0100, 0.5625 \put( 0.1806, 0.4375){\makebox(0,0)[l]{\tt100}} % line 0.6250 from 0.0100 to 0.3800 \put( 0.0059, 0.3750){\line(1,0){ 0.2188}} % 101 at 0.0100, 0.6875 \put( 0.1806, 0.3125){\makebox(0,0)[l]{\tt101}} % 1000 at 0.0100, 0.5312 \put( 0.1207, 0.4688){\makebox(0,0)[l]{\tt1000}} % line 0.5625 from 0.0100 to 0.2800 \put( 0.0059, 0.4375){\line(1,0){ 0.1597}} % 1001 at 0.0100, 0.5938 \put( 0.1207, 0.4062){\makebox(0,0)[l]{\tt1001}} % 10000 at 0.0100, 0.5156 \put( 0.0387, 0.4844){\makebox(0,0)[l]{\tt10000}} % line 0.5312 from 0.0100 to 0.1500 \put( 0.0059, 0.4688){\line(1,0){ 0.0828}} % 10001 at 0.0100, 0.5469 \put( 0.0387, 0.4531){\makebox(0,0)[l]{\tt10001}} % line 0.5156 from 0.0100 to 0.0400 \put( 0.0059, 0.4844){\line(1,0){ 0.0177}} % line 0.5078 from 0.0100 to 0.0200 \put( 0.0059, 0.4922){\line(1,0){ 0.0059}} % line 0.5234 from 0.0100 to 0.0200 \put( 0.0059, 0.4766){\line(1,0){ 0.0059}} % line 0.5469 from 0.0100 to 0.0400 \put( 0.0059, 0.4531){\line(1,0){ 0.0177}} % line 0.5391 from 0.0100 to 0.0200 \put( 0.0059, 0.4609){\line(1,0){ 0.0059}} % line 0.5547 from 0.0100 to 0.0200 \put( 0.0059, 0.4453){\line(1,0){ 0.0059}} % 10010 at 0.0100, 0.5781 \put( 0.0387, 0.4219){\makebox(0,0)[l]{\tt10010}} % line 0.5938 from 0.0100 to 0.1500 \put( 0.0059, 0.4062){\line(1,0){ 0.0828}} % 10011 at 0.0100, 0.6094 \put( 0.0387, 0.3906){\makebox(0,0)[l]{\tt10011}} % line 0.5781 from 0.0100 to 0.0400 \put( 0.0059, 0.4219){\line(1,0){ 0.0177}} % line 0.5703 from 0.0100 to 0.0200 \put( 0.0059, 0.4297){\line(1,0){ 0.0059}} % line 0.5859 from 0.0100 to 0.0200 \put( 0.0059, 0.4141){\line(1,0){ 0.0059}} % line 0.6094 from 0.0100 to 0.0400 \put( 0.0059, 0.3906){\line(1,0){ 0.0177}} % line 0.6016 from 0.0100 to 0.0200 \put( 0.0059, 0.3984){\line(1,0){ 0.0059}} % line 0.6172 from 0.0100 to 0.0200 \put( 0.0059, 0.3828){\line(1,0){ 0.0059}} % 1010 at 
0.0100, 0.6562 \put( 0.1207, 0.3438){\makebox(0,0)[l]{\tt1010}} % line 0.6875 from 0.0100 to 0.2800 \put( 0.0059, 0.3125){\line(1,0){ 0.1597}} % 1011 at 0.0100, 0.7188 \put( 0.1207, 0.2812){\makebox(0,0)[l]{\tt1011}} % 10100 at 0.0100, 0.6406 \put( 0.0387, 0.3594){\makebox(0,0)[l]{\tt10100}} % line 0.6562 from 0.0100 to 0.1500 \put( 0.0059, 0.3438){\line(1,0){ 0.0828}} % 10101 at 0.0100, 0.6719 \put( 0.0387, 0.3281){\makebox(0,0)[l]{\tt10101}} % line 0.6406 from 0.0100 to 0.0400 \put( 0.0059, 0.3594){\line(1,0){ 0.0177}} % line 0.6328 from 0.0100 to 0.0200 \put( 0.0059, 0.3672){\line(1,0){ 0.0059}} % line 0.6484 from 0.0100 to 0.0200 \put( 0.0059, 0.3516){\line(1,0){ 0.0059}} % line 0.6719 from 0.0100 to 0.0400 \put( 0.0059, 0.3281){\line(1,0){ 0.0177}} % line 0.6641 from 0.0100 to 0.0200 \put( 0.0059, 0.3359){\line(1,0){ 0.0059}} % line 0.6797 from 0.0100 to 0.0200 \put( 0.0059, 0.3203){\line(1,0){ 0.0059}} % 10110 at 0.0100, 0.7031 \put( 0.0387, 0.2969){\makebox(0,0)[l]{\tt10110}} % line 0.7188 from 0.0100 to 0.1500 \put( 0.0059, 0.2812){\line(1,0){ 0.0828}} % 10111 at 0.0100, 0.7344 \put( 0.0387, 0.2656){\makebox(0,0)[l]{\tt10111}} % line 0.7031 from 0.0100 to 0.0400 \put( 0.0059, 0.2969){\line(1,0){ 0.0177}} % line 0.6953 from 0.0100 to 0.0200 \put( 0.0059, 0.3047){\line(1,0){ 0.0059}} % line 0.7109 from 0.0100 to 0.0200 \put( 0.0059, 0.2891){\line(1,0){ 0.0059}} % line 0.7344 from 0.0100 to 0.0400 \put( 0.0059, 0.2656){\line(1,0){ 0.0177}} % line 0.7266 from 0.0100 to 0.0200 \put( 0.0059, 0.2734){\line(1,0){ 0.0059}} % line 0.7422 from 0.0100 to 0.0200 \put( 0.0059, 0.2578){\line(1,0){ 0.0059}} % 110 at 0.0100, 0.8125 \put( 0.1806, 0.1875){\makebox(0,0)[l]{\tt110}} % line 0.8750 from 0.0100 to 0.3800 \put( 0.0059, 0.1250){\line(1,0){ 0.2188}} % 111 at 0.0100, 0.9375 \put( 0.1806, 0.0625){\makebox(0,0)[l]{\tt111}} % 1100 at 0.0100, 0.7812 \put( 0.1207, 0.2188){\makebox(0,0)[l]{\tt1100}} % line 0.8125 from 0.0100 to 0.2800 \put( 0.0059, 0.1875){\line(1,0){ 
0.1597}} % 1101 at 0.0100, 0.8438 \put( 0.1207, 0.1562){\makebox(0,0)[l]{\tt1101}} % 11000 at 0.0100, 0.7656 \put( 0.0387, 0.2344){\makebox(0,0)[l]{\tt11000}} % line 0.7812 from 0.0100 to 0.1500 \put( 0.0059, 0.2188){\line(1,0){ 0.0828}} % 11001 at 0.0100, 0.7969 \put( 0.0387, 0.2031){\makebox(0,0)[l]{\tt11001}} % line 0.7656 from 0.0100 to 0.0400 \put( 0.0059, 0.2344){\line(1,0){ 0.0177}} % line 0.7578 from 0.0100 to 0.0200 \put( 0.0059, 0.2422){\line(1,0){ 0.0059}} % line 0.7734 from 0.0100 to 0.0200 \put( 0.0059, 0.2266){\line(1,0){ 0.0059}} % line 0.7969 from 0.0100 to 0.0400 \put( 0.0059, 0.2031){\line(1,0){ 0.0177}} % line 0.7891 from 0.0100 to 0.0200 \put( 0.0059, 0.2109){\line(1,0){ 0.0059}} % line 0.8047 from 0.0100 to 0.0200 \put( 0.0059, 0.1953){\line(1,0){ 0.0059}} % 11010 at 0.0100, 0.8281 \put( 0.0387, 0.1719){\makebox(0,0)[l]{\tt11010}} % line 0.8438 from 0.0100 to 0.1500 \put( 0.0059, 0.1562){\line(1,0){ 0.0828}} % 11011 at 0.0100, 0.8594 \put( 0.0387, 0.1406){\makebox(0,0)[l]{\tt11011}} % line 0.8281 from 0.0100 to 0.0400 \put( 0.0059, 0.1719){\line(1,0){ 0.0177}} % line 0.8203 from 0.0100 to 0.0200 \put( 0.0059, 0.1797){\line(1,0){ 0.0059}} % line 0.8359 from 0.0100 to 0.0200 \put( 0.0059, 0.1641){\line(1,0){ 0.0059}} % line 0.8594 from 0.0100 to 0.0400 \put( 0.0059, 0.1406){\line(1,0){ 0.0177}} % line 0.8516 from 0.0100 to 0.0200 \put( 0.0059, 0.1484){\line(1,0){ 0.0059}} % line 0.8672 from 0.0100 to 0.0200 \put( 0.0059, 0.1328){\line(1,0){ 0.0059}} % 1110 at 0.0100, 0.9062 \put( 0.1207, 0.0938){\makebox(0,0)[l]{\tt1110}} % line 0.9375 from 0.0100 to 0.2800 \put( 0.0059, 0.0625){\line(1,0){ 0.1597}} % 1111 at 0.0100, 0.9688 \put( 0.1207, 0.0312){\makebox(0,0)[l]{\tt1111}} % 11100 at 0.0100, 0.8906 \put( 0.0387, 0.1094){\makebox(0,0)[l]{\tt11100}} % line 0.9062 from 0.0100 to 0.1500 \put( 0.0059, 0.0938){\line(1,0){ 0.0828}} % 11101 at 0.0100, 0.9219 \put( 0.0387, 0.0781){\makebox(0,0)[l]{\tt11101}} % line 0.8906 from 0.0100 to 0.0400 \put( 
0.0059, 0.1094){\line(1,0){ 0.0177}} % line 0.8828 from 0.0100 to 0.0200 \put( 0.0059, 0.1172){\line(1,0){ 0.0059}} % line 0.8984 from 0.0100 to 0.0200 \put( 0.0059, 0.1016){\line(1,0){ 0.0059}} % line 0.9219 from 0.0100 to 0.0400 \put( 0.0059, 0.0781){\line(1,0){ 0.0177}} % line 0.9141 from 0.0100 to 0.0200 \put( 0.0059, 0.0859){\line(1,0){ 0.0059}} % line 0.9297 from 0.0100 to 0.0200 \put( 0.0059, 0.0703){\line(1,0){ 0.0059}} % 11110 at 0.0100, 0.9531 \put( 0.0387, 0.0469){\makebox(0,0)[l]{\tt11110}} % line 0.9688 from 0.0100 to 0.1500 \put( 0.0059, 0.0312){\line(1,0){ 0.0828}} % 11111 at 0.0100, 0.9844 \put( 0.0387, 0.0156){\makebox(0,0)[l]{\tt11111}} % line 0.9531 from 0.0100 to 0.0400 \put( 0.0059, 0.0469){\line(1,0){ 0.0177}} % line 0.9453 from 0.0100 to 0.0200 \put( 0.0059, 0.0547){\line(1,0){ 0.0059}} % line 0.9609 from 0.0100 to 0.0200 \put( 0.0059, 0.0391){\line(1,0){ 0.0059}} % line 0.9844 from 0.0100 to 0.0400 \put( 0.0059, 0.0156){\line(1,0){ 0.0177}} % line 0.9766 from 0.0100 to 0.0200 \put( 0.0059, 0.0234){\line(1,0){ 0.0059}} % line 0.9922 from 0.0100 to 0.0200 \put( 0.0059, 0.0078){\line(1,0){ 0.0059}} \end{picture} } \end{center} }{% \caption[a]{Illustration of the intervals defined by a simple Bayesian probabilistic model. The size of an interval is proportional to the probability of the string.
This model anticipates that the source is likely to be biased towards one of {\tt{a}} and {\tt{b}}, so sequences having lots of {\tt{a}}s or lots of {\tt{b}}s have larger intervals than sequences of the same length that are 50:50 {\tt{a}}s and {\tt{b}}s.} \label{fig.ac2} }% \end{figure} \begin{aside} \subsection{Details of the Bayesian model} Having emphasized that any model could be used -- arithmetic coding is not wedded to any particular set of probabilities -- let me explain the simple adaptive probabilistic model used in the preceding example; we first encountered this model in % chapter \ref{ch1} % (page \pageref{ex.postpa}) \exerciseref{ex.postpa}. % % % {\em (This material may be a repetition of material in \chref{ch1}.)} % \subsubsection{Assumptions} The model will be described using parameters $p_{\eof}$, $p_{\ta}$ and $p_{\tb}$, defined below, which should not be confused with the predictive probabilities {\em in a particular context\/}, for example, $P(\ta \given \bs\eq {\tb\ta\ta} )$. % An analogy for this model, as I indicated % at the start, % is the tossing of a bent coin (\secref{sec.bentcoin}). A bent coin labelled $\ta$ and $\tb$ is tossed some number of times $l$, which we don't know beforehand. The coin's probability of coming up $\ta$ when tossed is $p_{\ta}$, and $p_{\tb} = 1-p_{\ta}$; the parameters $p_{\ta},p_{\tb}$ are not known beforehand. The source string $\bs = \tt baaba\eof$ indicates that $l$ was 5 and the sequence of outcomes was $\tt baaba$. \ben \item It is assumed that the length of the string $l$ has an exponential probability distribution \beq P(l) = (1 - p_{\eof})^l p_{\eof} . \eeq This distribution corresponds to assuming a constant probability $p_{\eof}$ for the termination symbol `$\eof$' at each character. 
\item It is assumed that the non-terminal characters in the string are selected independently at random from an ensemble with probabilities % distribution $\P = \{p_{\ta},p_{\tb}\}$; the probability $p_{\ta}$ is fixed throughout the string to some unknown value that could be anywhere between $0$ and $1$. The probability of an $\ta$ occurring as the next symbol, given $p_{\ta}$ (if only we knew it), is $(1-p_{\eof})p_{\ta}$. % given that it is not The probability, given $p_{\ta}$, that an unterminated string of length $F$ is a given string $\bs$ that contains $\{F_{\ta},F_{\tb}\}$ counts of the two outcomes % $\{ a , b \}$ is the \ind{Bernoulli distribution} \beq P( \bs \given p_{\ta} , F ) = p_{\ta}^{F_{\ta}} (1-p_{\ta})^{F_{\tb}} . \label{eq.pa.like} \eeq \item We assume a uniform prior distribution for $p_{\ta}$, \beq P(p_{\ta}) = 1 , \: \: \: \: \: \: p_{\ta} \in [0,1] , \label{eq.pa.prior} \eeq and define $p_{\tb} \equiv 1-p_{\ta}$. It would be easy to assume other priors on $p_{\ta}$, with beta distributions being the most convenient to handle. \een This model was studied in \secref{sec.bentcoin}. The key result we require is the predictive distribution for the next symbol, given the string so far, $\bs$. This probability that the next character is $\ta$ or $\tb$ (assuming that it is not `$\eof$') was derived in \eqref{eq.laplacederived} and is precisely Laplace's rule (\ref{eq.laplaceagain}). \end{aside} \exercisaxB{3}{ex.ac.vs.huffman}{ Compare the expected message length when an ASCII file is compressed by the following three methods. \begin{description} \item[Huffman-with-header\puncspace] Read the whole file, find the empirical frequency of each symbol, construct a Huffman code for those frequencies, transmit the code by transmitting the lengths of the Huffman codewords, then transmit the file using the Huffman code. (The actual codewords don't need to be transmitted, since we can use a deterministic method for building the tree given the codelengths.) 
\item[Arithmetic code using the \ind{Laplace model}\puncspace] \beq P_{\rm L}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+1} {\sum_{{\ta'}}(F_{{\ta'}}+1)}. \eeq \item[Arithmetic code using a \ind{Dirichlet model}\puncspace] This model's predictions are: \beq P_{\rm D}(\ta \given x_1,\ldots,x_{n-1})=\frac{F_{\ta}+\alpha} {\sum_{{\ta'}}(F_{{\ta'}}+\alpha)}, \eeq where $\alpha$ is fixed to a number such as 0.01. A small value of $\alpha$ corresponds to a more responsive version of the Laplace model; the probability over characters is expected to be more nonuniform; $\alpha=1$ reproduces the Laplace model. \end{description} Take care that the header of your Huffman message is self-delimiting. Special cases worth considering are (a) short files with just a few hundred characters; (b) large files in which some characters are never used. } \section{Further applications of arithmetic coding} \subsection{Efficient generation of random samples} \label{sec.ac.efficient} Arithmetic coding not only offers a way to compress strings believed to come from a given model; it also offers a way to generate random strings from a model. Imagine sticking a pin into the unit interval at random, that line having been divided into subintervals in proportion to probabilities $p_i$; the probability that your pin will lie in interval $i$ is $p_i$. So to generate a sample from a model, all we need to do is feed ordinary random bits into an arithmetic {\em decoder\/}\index{arithmetic coding!decoder} for that model.\index{arithmetic coding!uses beyond compression} An infinite random bit sequence corresponds to the selection of a point at random from the line $[0,1)$, so the decoder will then select a string at random from the assumed distribution. This arithmetic method is guaranteed to use very nearly the smallest number of random bits possible to make the selection -- an important point in communities where random numbers are expensive! [{This is not a joke. 
Large amounts of money are spent on generating random bits in software and hardware. Random numbers are valuable.}] A simple example of the use of this technique is in the generation of random bits with a nonuniform distribution $\{ p_0,p_1 \}$. % This is a useful technique \exercissxA{2}{ex.usebits}{ Compare the following two techniques for generating random symbols from a nonuniform distribution $\{ p_0,p_1 \} = \{ 0.99,0.01\}$: \ben \item The standard method: use a standard random number generator to generate an integer between 1 and $2^{32}$. Rescale the integer to $(0,1)$. Test whether this uniformly distributed random variable is less than $0.99$, and emit a {\tt{0}} or {\tt{1}} accordingly. \item Arithmetic coding using the correct model, fed with standard random bits. \een Roughly how many random bits will each method use to generate a thousand samples from this sparse distribution? } \subsection{Efficient data-entry devices} When we enter text into a computer, we make gestures of some sort -- maybe we tap a keyboard, or scribble with a pointer, or click with a mouse; an {\em efficient\/} \index{user interfaces}\index{data entry}\ind{text entry} system is one where the number of gestures required to enter a given text string is {\em small\/}. Writing\index{writing}%\index{text entry}% \marginfignocaption{\small \begin{center} \begin{tabular}{rcl} \multicolumn{3}{l}{Compression:}\\ text& $\rightarrow$ &bits\\[0.2in] \multicolumn{3}{l}{Writing:} \\ text &$\leftarrow$& gestures\\[0.2in] \end{tabular} \end{center} } can be viewed as an inverse process\index{arithmetic coding!uses beyond compression} to data compression. In data compression, the aim is to map a given text string into a {\em small\/} number of bits. In text entry, we want a small sequence of gestures to produce our intended text. 
By inverting an arithmetic coder, we can obtain \index{inverse-arithmetic-coder}an information-efficient text entry device that is driven by continuous pointing gestures \cite{ward2000}. In this system, called \ind{Dasher},\index{human--machine interfaces}\index{software!Dasher} the user zooms in on the unit interval to locate the % \index{text entry} interval corresponding to their intended string, in the same style as \figref{fig.ac}. A \ind{language model} (exactly as used in text compression) controls the sizes of the intervals such that probable strings are quick and easy to identify. After an hour's practice, a novice user can write with one \ind{finger} driving {Dasher} at about 25 words per minute -- that's about half their normal ten-finger \index{QWERTY}typing speed on a regular \ind{keyboard}. It's even possible to write at 25 words per minute, {\em hands-free}, using gaze direction to drive Dasher \cite{wardmackay2002}. Dasher is available as free software for various platforms.\footnote{ {\tt http://www.inference.phy.cam.ac.uk/dasher/}} \label{sec.stopbeforeLZ} \section{Lempel--Ziv coding\nonexaminable} The \index{Lempel--Ziv coding|(}Lempel--Ziv algorithms, which are widely used for data compression (\eg, the {\tt\ind{compress}} and {\tt\ind{gzip}} commands), are different in philosophy to arithmetic coding. There is no separation between modelling and coding,\index{philosophy} and no opportunity for explicit modelling.\index{source code!algorithms} \subsection{Basic Lempel--Ziv algorithm} The method of compression is to replace a \ind{substring} with a \ind{pointer} to an earlier occurrence of the same substring. For example if the string is {\tt{1011010100010}}\ldots, we \ind{parse} it into an ordered {\dem\ind{dictionary}\/} of substrings that have not appeared before as follows: $\l$, {\tt{1}}, {\tt{0}}, {\tt{11}}, {\tt{01}}, {\tt{010}}, {\tt{00}}, {\tt{10}}, \dots. 
We include the \index{empty string}empty substring \ind{$\lambda$} as the first substring in the dictionary and order the substrings in the dictionary by the order in which they emerged from the source. After every comma, we look along the next part of the input sequence until we have read a substring that has not been marked off before. A moment's reflection will confirm that this substring is longer by one bit than a substring that has occurred earlier in the dictionary. This means that we can encode each substring by giving a {\dem pointer\/} to the earlier occurrence of that prefix and then sending the extra bit by which the new substring in the dictionary differs from the earlier substring. If, at the $n$th bit, we have enumerated $s(n)$ substrings, then we can give the value of the pointer in $\lceil \log_2 s(n) \rceil$ bits. The code for the above sequence is then as shown in the fourth line of the following table (with punctuation included for clarity), the upper lines indicating the source string and the value of $s(n)$: % % \beginfullpagewidth%% defined in chapternotes.sty, uses {narrow} \[ \begin{array}{l|*{8}{l}} \mbox{source substrings}&\lambda & {\tt{1}} & {\tt{0}} & {\tt{11}} & {\tt{01}} & {\tt{010}} & {\tt{00}} & {\tt{10}} \\ s(n) & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ s(n)_{\rm binary} & {\tt{000}} & {\tt{001}} & {\tt{010}} & {\tt{011}} & {\tt{100}} & {\tt{101}} & {\tt{110}} & {\tt{111}} \\ (\mbox{pointer},\mbox{bit})& & (,{\tt{1}}) & ({\tt{0}},{\tt{0}}) & ({\tt{01}},{\tt{1}}) & ({\tt{10}},{\tt{1}}) & ({\tt{100}},{\tt{0}})& ({\tt{010}},{\tt{0}}) & ({\tt{001}},{\tt{0}}) \end{array} \] \end{narrow} % The pointer Notice that the first pointer we send is empty, because, given that there is only one substring in the dictionary -- the string $\lambda$ -- no bits are needed to convey the `choice' of that substring as the prefix. The encoded string is {\tt 100011101100001000010}. 
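As a check on this length (a count added here for illustration; it follows directly from the table above), the seven non-empty substrings are encoded when the dictionary holds $s = 1, 2, \ldots, 7$ entries, so their pointers cost $\lceil \log_2 s \rceil$ bits each, and every substring costs one further bit:
\[
 \underbrace{(0+1+2+2+3+3+3)}_{\mbox{\small pointer bits}} \: + \underbrace{7}_{\mbox{\small extra bits}} = 14 + 7 = 21 \mbox{ bits} ,
\]
which agrees with the 21-bit encoded string, whereas the portion of the source parsed so far, {\tt 1011010100010}, contains only 13 bits.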
The encoding, in this simple case, is actually a longer string than the source string, because there was no obvious redundancy in the source string. \exercisaxB{2}{ex.Clengthen}{ Prove that {\em any\/} uniquely decodeable code from $\{{\tt{0}},{\tt{1}}\}^+$ to $\{{\tt{0}},{\tt{1}}\}^+$ necessarily makes some strings longer if it makes some strings shorter. } One reason why the algorithm described above lengthens a lot of strings is that it is inefficient -- it transmits unnecessary bits; to put it another way, its code is not complete.\label{sec.LZprune} % is not necessarily the explanation for the above lengthening, % however, because % see also {ex.LZprune}{ % the algorithm described is certainly inefficient: o Once a substring in the {dictionary} has been joined there by both of its children, then we can be sure that it will not be needed (except possibly as part of our protocol for terminating a message); so at that point we could drop it from our dictionary of substrings and shuffle them all along one, thereby reducing the length of subsequent pointer messages. Equivalently, we could write the second prefix into the dictionary at the point previously occupied by the parent. A second unnecessary overhead is the transmission of the new bit in these cases -- the second time a prefix is used, we can be sure of the identity of the next bit. % This is easy to do in a computer but not so easy for a human % to cope with. \subsubsection{Decoding} The decoder again involves an identical twin at the decoding end who constructs the dictionary of substrings as the data are decoded. \exercissxB{2}{ex.LZencode}{ Encode the string {\tt{000000000000100000000000}} using the basic Lempel--Ziv algorithm described above. 
} % lambda 0 00 000 0000 001 00000 000000 % 000 001 010 011 100 101 110 111 % ,0 1,0 10,0 11,0 010,1 100,0 110,0 % answer % 010100110010110001100 \exercissxB{2}{ex.LZdecode}{ Decode the string \begin{center} {\tt{00101011101100100100011010101000011}} \end{center} that was encoded using the basic Lempel--Ziv algorithm. } % answer % 0100001000100010101000001 001000001000000 % lamda, 0, 1, 00, 001, 000, 10, 0010, 101, 0000, 01, 00100, 0001, 00000 % 0 , 1, 10,11, 100, 101,110, 111, 1000, 1001, 1010, 1011,1100, 1101 % ,0 0,1 01,0 11,1 011,0 010,0 100,0 110,1 0101,0 0001,1 bored! % 10101011101100100100011010101000011 % % see tcl/lempelziv.tcl \subsubsection{Practicalities} In this description I have not discussed the method for terminating a string. There are many variations on the Lempel--Ziv algorithm, all exploiting the same idea but using different procedures for dictionary management, etc. % Two of the best known % variations are called the Ziv-Lempel algorithm % and the LZW algorithm. % The resulting programs are fast, but their performance on compression of English text, although useful, does not match the standards set in the arithmetic coding literature. \subsection{Theoretical properties} In contrast to the block code, Huffman code, and arithmetic coding methods we discussed in the last three chapters, the Lempel--Ziv algorithm is defined without making any mention of a \ind{probabilistic model} for the source. Yet,\index{model} % in fact, given any \ind{ergodic} %\footnote{Need to clarify this. It means % the source is memoryless on sufficiently long timescales.} source (\ie, one that is memoryless on sufficiently long timescales), the Lempel--Ziv algorithm can be proven {\em asymptotically\/} to compress down to the entropy of the source. This is why it is called a `\ind{universal}' compression algorithm. For a proof of this property, see \citeasnoun{Cover&Thomas}. % Cover and Thomas (1991). 
It achieves its compression, however, only by {\em memorizing\/} substrings that have happened so that it has a short name for them the next time they occur. The asymptotic timescale on which this universal performance is achieved %%is likely to be the time that it takes for %% if the source has not been observed long enough for % {\em all\/} typical sequences of length $n^*$ % to occur, where $n^*$ is the longest lengthscale associated with the % statistical fluctuations in the source. %the longest lengthscale on % which there are correlations in . % red then % For many sources the time for all typical sequences to % occur is may, for many sources, be unfeasibly long, because the number of typical substrings that need memorizing may be enormous. % The useful performance of the algorithm in practice is a reflection of the fact that many files contain multiple repetitions of particular short sequences of characters, a form of redundancy to which the algorithm is well suited. \subsection{Common ground} I have emphasized the difference in philosophy behind arithmetic coding and Lempel--Ziv coding. There is common ground between them, though: in principle, one can design adaptive probabilistic models, and thence arithmetic codes, that are `\ind{universal}', that is, models that will asymptotically compress {\em any source in some class\/} to within some factor (preferably 1) of its entropy.\index{compression!universal} However, {for practical purposes\/}, I think such universal models can only be constructed if the class of sources is severely restricted. A general purpose compressor that can discover the probability distribution of {\em any\/} source would be a general purpose \ind{artificial intelligence}! A general purpose artificial intelligence does not yet exist. % \subsection{Comments} % The Lempel--Ziv algorithm can be generalized to any finite alphabet % as long as the input and output alphabets are the same. 
I believe % it is not convenient to use unequal alphabets. \section{Demonstration} An interactive aid for exploring arithmetic coding, {\tt dasher.tcl}, is available.\footnote{{\tt http://www.inference.phy.cam.ac.uk/mackay/itprnn/softwareI.html}} % http://www.inference.phy.cam.ac.uk/mackay/itprnn/code/tcl/dasher.tcl A demonstration arithmetic-coding\index{source code!algorithms} \index{arithmetic coding!software}\index{software!arithmetic coding}software\index{source code!software} package written by \index{Neal, Radford M.}{Radford Neal}\footnote{% % is available from \\ \noindent {\tt ftp://ftp.cs.toronto.edu/pub/radford/www/ac.software.html}} % This package consists of encoding and decoding modules to which the user adds a module defining the probabilistic model. It should be emphasized that there is no single general-purpose arithmetic-coding compressor; a new model has to be written for each type of source. % application. % Radford Neal's\index{Neal, Radford M.} package includes a simple adaptive model similar to the Bayesian model demonstrated in section \ref{sec.ac}. The results using this Laplace model should be viewed as a basic benchmark since it is the simplest possible probabilistic model -- it % These results are anecdotal and should not be taken too % seriously, but it is interesting that the highly developed gzip % software only does a little better than the benchmark % of the simple Laplace model, simply assumes the characters in the file come independently from a fixed ensemble. The counts $\{ F_i \}$ of the symbols $\{ a_i \}$ are rescaled and rounded as the file is read such that all the counts lie between 1 and 256. 
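To make the benchmark concrete, here is a small worked instance of Laplace's rule (an illustration only: it assumes a two-symbol alphabet $\{{\tt{0}},{\tt{1}}\}$ and ignores the rescaling of counts used in the package). After 100 symbols with counts $F_{\tt{0}} = 99$ and $F_{\tt{1}} = 1$, the model predicts
\[
 P_{\rm L}({\tt{0}}) = \frac{99+1}{(99+1)+(1+1)} = \frac{100}{102} \simeq 0.98 ,
 \hspace{0.3in}
 P_{\rm L}({\tt{1}}) = \frac{1+1}{(99+1)+(1+1)} = \frac{2}{102} \simeq 0.02 ,
\]
so an arithmetic coder driven by these predictions spends about $\log_2 (102/100) \simeq 0.03$ bits if the next symbol is a {\tt{0}}, and about $\log_2 (102/2) \simeq 5.7$ bits if it is a {\tt{1}}.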
\index{DjVu}\index{deja vu}\index{Le Cun, Yann}\index{Bottou, Leon} % Yann Le Cun, Leon Bottou and colleagues at AT{\&}T Labs % have written a A state-of-the-art compressor for documents containing text and images, {\tt{DjVu}}, uses arithmetic coding.\footnote{% % {\tt{DjVu}} is described at \tt http://www.djvuzone.org/} % (better Reference for deja vu?) It uses a carefully designed approximate arithmetic coder for binary alphabets called the Z-coder \cite{bottou98coder}, which is much faster than the arithmetic coding software described above. One of the neat tricks the Z-coder uses is this: the adaptive model adapts only occasionally (to save on computer time), with the decision about when to adapt being pseudo-randomly controlled by whether the arithmetic encoder emitted a bit. The JBIG image compression standard for binary images uses arithmetic coding with a context-dependent model, which adapts using a rule similar to Laplace's rule. PPM \cite{Teahan95a} is a leading method for text compression, and it uses arithmetic coding. There are many Lempel--Ziv-based programs. {\tt gzip} is based on a version of Lempel--Ziv called `{\tt LZ77}' \cite{Ziv_Lempel77}\nocite{Ziv_Lempel78}. {\tt compress} is based on `{\tt LZW}' \cite{Welch84}. In my experience the best is {\tt gzip}, with {\tt compress} being inferior on most files. % To % give further credit to {\tt gzip}, it stores additional information in % the compressed file such as the name of the file and its % last modification date. {\tt bzip} is a {\dem{\ind{block-sorting} file compressor\/}}, which makes use of a neat hack called the {\dem\ind{Burrows--Wheeler transform}}\index{source code!Burrows--Wheeler transform}\index{source code!block-sorting compression} \cite{bwt}. 
This method is not based on an explicit probabilistic model, and it only works well for files larger than several thousand characters; but in practice it is a very effective compressor for files in which the context of a character is a good predictor for that character.% % Maybe I'll describe it in a future edition of this % book. \footnote{There is a lot of information about the Burrows--Wheeler transform on the net. {\tt{http://dogma.net/DataCompression/BWT.shtml}} } %bzip2 compresses files using the Burrows--Wheeler block-sorting text compression algorithm, and Huffman %coding. Compression is generally considerably better than that achieved by more conventional %LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical %compressors. \subsubsection{Compression of a text file} Table \ref{tab.zipcompare1} gives the computer time in seconds taken and the compression achieved when these programs are applied to the \LaTeX\ file containing the text of this chapter, of size 20,942 bytes. \begin{table}[htbp] \figuremargin{ \begin{center} \begin{tabular}{lccc} \toprule Method & Compression & Compressed size & Uncompression \\ & time$ \,/\, $sec & (\%age of 20,942) & time$ \,/\, $sec \\ \midrule %Adaptive encoder, Laplace model & 0.28 & $12\,974$ (61\%) & 0.32 \\ %{\tt gzip / gunzip} & {\tt gzip} & 0.10 & \hspace{0.06in}$ 8\,177$ (39\%) & {\bf 0.01} \\ {\tt compress} %/ uncompress} & 0.05 & $10\,816$ (51\%) & 0.05 \\ \midrule {\tt bzip} % / bunzip} & & \hspace{0.06in}$ 7\,495$ (36\%) & \\ {\tt bzip2} %/ bunzip2} & & \hspace{0.06in}$ 7\,640$ (36\%) & \\ {\tt ppmz } & & \hspace{0.06in}{\bf 6$\,$800 (32\%)} & \\ \bottomrule \end{tabular} \end{center} }{ \caption[a]{Comparison of compression algorithms applied to a text file. 
} \label{tab.zipcompare1} } \end{table} % I will report the value of ``u'' % django: % 0.410u 0.060s 0:00.60 78.3% 0+0k 0+0io 109pf+0w % 6800 Nov 25 18:05 ../l4.tex.ppm % time ppmz ../l4.tex.ppm ../l4.tex.up % 0.480u 0.040s 0:00.60 86.6% 0+0k 0+0io 109pf+0w % % 108:wol:/home/mackay/_tools/ac0> time adaptive_encode < ~/_courses/itprnn/l4.tex > l4.tex.aez % 0.280u 0.040s 0:00.55 58.1% 0+105k 2+3io 0pf+0w % 109:wol:/home/mackay/_tools/ac0> time gzip ~/_courses/itprnn/l4.tex % 0.100u 0.060s 0:00.28 57.1% 0+161k 2+12io 0pf+0w % 110:wol:/home/mackay/_tools/ac0> ls -lisa ~/_courses/itprnn/l4.tex.gz % 110131 8 8177 Jan 10 15:40 /home/mackay/_courses/itprnn/l4.tex.gz % 111:wol:/home/mackay/_tools/ac0> gunzip ~/_courses/itprnn/l4.tex.gz % 112:wol:/home/mackay/_tools/ac0> ls -lisa ~/_courses/itprnn/l4.tex l4.tex.aez % 109904 21 20942 Jan 10 15:40 /home/mackay/_courses/itprnn/l4.tex % 444691 13 12974 Jan 10 15:40 l4.tex.aez % 113:wol:/home/mackay/_tools/ac0> time gzip ~/_courses/itprnn/l4.tex % 0.100u 0.050s 0:00.24 62.5% 0+150k 0+13io 0pf+0w % 114:wol:/home/mackay/_tools/ac0> time gunzip ~/_courses/itprnn/l4.tex.gz % 0.010u 0.060s 0:00.17 41.1% 0+80k 0+8io 0pf+0w % 115:wol:/home/mackay/_tools/ac0> time adaptive_decode < l4.tex.aez > l4.tex % 0.320u 0.030s 0:00.39 89.7% 0+101k 6+4io 5pf+0w % % django: bzip and gunzip: % 149:django.ucsf.edu:/home/mackay/_tools/ac0> time bzip l4.tex % BZIP, a block-sorting file compressor. Version 0.21, 25-August-96. 
% 0.060u 0.020s 0:00.22 36.3% 0+0k 0+0io 107pf+0w % 7495 Jan 10 1997 l4.tex.bz % 153:django.ucsf.edu:/home/mackay/_tools/ac0> time bunzip l4.tex.bz % 0.020u 0.010s 0:00.14 21.4% 0+0k 0+0io 93pf+0w % 20942 Jan 10 1997 l4.tex % 155:django.ucsf.edu:/home/mackay/_tools/ac0> time bzip2 l4.tex % 0.050u 0.000s 0:00.37 13.5% 0+0k 0+0io 90pf+0w % 7640 Jan 10 1997 l4.tex.bz2 % 157:django.ucsf.edu:/home/mackay/_tools/ac0> time bunzip2 l4.tex.bz2 % 0.020u 0.000s 0:00.15 13.3% 0+0k 0+0io 85pf+0w % time gzip l4.tex % 0.010u 0.010s 0:00.28 7.1% 0+0k 0+0io 84pf+0w % 8177 Jan 10 1997 l4.tex.gz % time gunzip l4.tex % 0.000u 0.010s 0:00.12 8.3% 0+0k 0+0io 87pf+0w % \subsubsection{Compression of a sparse file} Interestingly, {\tt gzip} does not always do so well. Table \ref{tab.zipcompare2} gives the % computer time in seconds taken and the compression achieved when these programs are applied to a text file containing $10^6$ characters, each of which is either {\tt0} or {\tt1} with probabilities 0.99 and 0.01. The Laplace model is quite well matched to this source, and the benchmark arithmetic coder gives good performance, followed closely by {\tt compress}; {\tt gzip} % , interestingly, is worst. % see /home/mackay/_tools/ac0 % % , and {\tt gzip --best} does no better. % has identical performance to {\tt gzip} on this example.}] An ideal model for this source would compress the file into about $10^6 H_2(0.01)/8 \simeq 10\,100$ bytes. The Laplace-model compressor falls short of this performance because it is implemented using only eight-bit precision. 
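The ideal figure of about $10\,100$ bytes can be checked directly (a worked evaluation added for illustration):
\[
 H_2(0.01) = 0.01 \log_2 \frac{1}{0.01} + 0.99 \log_2 \frac{1}{0.99}
 \simeq 0.0664 + 0.0144 = 0.0808 \mbox{ bits} ,
\]
so $10^6$ characters require about $80\,800$ bits, that is, about $10\,100$ bytes.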
The {\tt{ppmz}} compressor compresses the best of all, but takes much more computer time.\index{Lempel--Ziv coding|)} \begin{table}[htbp] \figuremargin{ \begin{center} \begin{tabular}{lccc} \toprule Method & Compression & Compressed size & Uncompression \\ & time$ \,/\, $sec & $ \,/\, $bytes & time$ \,/\, $sec \\ \midrule % Adaptive encoder, % Laplace model & % 6.4 & 14089 (1.4\%)\hspace{0.06in} & 9.2 \\ %{\tt gzip / gunzip} & % 2.1 & 20548 (2.1\%)\hspace{0.06in} & 0.43 \\ %{\tt compress / uncompress} & % 0.73 & 14692 (1.47\%) & 0.76 \\ \bottomrule %{\tt bzip / bunzip} & % & & (\%) & \\ %{\tt bzip2 / bunzip2} & % & & (\%) & \\ \hline Laplace model & 0.45 & $14\,143$ (1.4\%)\hspace{0.06in} & 0.57 \\ {\tt gzip } & 0.22 & $20\,646$ (2.1\%)\hspace{0.06in} & 0.04 \\ {\tt gzip {\tt-}{\tt-}best} & %{\tt gzip \verb+--best+} & 1.63 & $15\,553$ (1.6\%)\hspace{0.06in} & 0.05 \\ {\tt compress} & 0.13 & $14\,785$ (1.5\%)\hspace{0.06in} & 0.03 \\ \midrule {\tt bzip } & 0.30 & $10\,903$ (1.09\%) & 0.17 \\ {\tt bzip2} & 0.19 & $11\,260$ (1.12\%) & 0.05 \\ {\tt ppmz} & 533 & {\bf 10$\,$447 (1.04\%)} & 535 \\ \bottomrule \end{tabular} \end{center} % ideal length = 0.0807931 * 10^6 = 80793 bits = 10099 bytes % /home/mackay/_tools/ac0/README1 }{ \caption[a]{Comparison of compression algorithms applied to a random file of $10^6$ characters, 99\% {\tt0}s and 1\% {\tt1}s. } \label{tab.zipcompare2} } \end{table} \section{Summary} In the last three chapters we have studied three classes of data compression codes. \begin{description} \item[Fixed-length block codes] (Chapter \chtwo). These are mappings from a fixed number of source symbols to a fixed-length binary message. % Most source strings are given no encoding; Only a tiny fraction of the source strings are given an encoding. These codes were fun for identifying the entropy as the measure of compressibility but they are of little practical use. \item[Symbol codes] (Chapter \chthree). 
Symbol codes employ a variable-length code for each symbol in the source alphabet, the codelengths being integer lengths determined by the probabilities of the symbols. Huffman's algorithm constructs an optimal symbol code for a given set of symbol probabilities. Every source string has a uniquely decodeable encoding, and if the source symbols come from the assumed distribution then the symbol code will compress to an expected length per character $L$ lying in the interval $[H,H\!+\!1)$. Statistical fluctuations in the source may make the actual length longer or shorter than this mean length. If the source is not well matched to the assumed distribution then the mean length is increased by the relative entropy $D_{\rm KL}$ between the source distribution and the code's implicit distribution. For sources with small entropy, the symbol code has to emit at least one bit per source symbol; compression below one bit per source symbol can be achieved only by the cumbersome procedure of putting the source data into blocks. \item[Stream codes\puncspace] The distinctive property of stream codes, compared with symbol codes, is that they are not constrained to emit at least one bit for every symbol read from the source stream. So large numbers of source symbols may be coded into a smaller number of bits. % , but unlike block codes, this is achieved This property could be obtained using a symbol code only if the source stream were somehow chopped into blocks. \bit \item {Arithmetic codes} combine a probabilistic model with an encoding algorithm that identifies each string with a sub-interval of $[0,1)$ of size equal to the probability of that string under the model. This code is almost optimal in the sense that the compressed length of a string $\bx$ closely matches the Shannon information content of $\bx$ given the probabilistic model. 
Arithmetic codes fit with the philosophy that good compression requires %intelligence {\dem data modelling}, in the form of an adaptive Bayesian model. \item % [Stream codes: Lempel--Ziv codes\puncspace] Lempel--Ziv codes are adaptive in the sense that they memorize strings that have already occurred. They are built on the philosophy that we don't know anything at all about what the probability distribution of the source will be, and we want a compression algorithm that will perform reasonably well whatever that distribution is. \eit \end{description} %\section{Optimal compression must involve artificial intelligence} %\subsection{A rant about `universal' compression} % moved this to rant.tex for the time being Both arithmetic codes and Lempel--Ziv codes will fail to decode correctly if any of the bits of the compressed file are altered. So if compressed files are to be stored or transmitted over noisy media, error-correcting codes will be essential. Reliable communication over unreliable channels is the topic of \partnoun\ \noisypart. % the next few chapters. %Exercises \section{Exercises on stream codes}%{Problems} \exercisaxA{2}{ex.AC52}{ Describe an arithmetic coding algorithm to encode random bit strings of length $N$ and weight $K$ (\ie, $K$ ones and $N-K$ zeroes) where $N$ and $K$ are given. For the case $N\eq 5$, $K \eq 2$, show in detail the intervals corresponding to all source substrings of lengths 1--5. } \exercissxB{2}{ex.AC52b}{ How many bits are needed to specify a selection of % an unordered collection of $K$ objects from $N$ objects? ($N$ and $K$ are assumed to be known and the selection of $K$ objects is unordered.) How might such a selection be made at random without being wasteful of random bits? } \exercisaxB{2}{ex.HuffvAC}{ % from 2001 exam A binary source $X$ emits independent identically distributed symbols with probability distribution $\{ f_{0},f_1 \}$, where $f_1 = 0.01$. 
Find an optimal uniquely-decodeable symbol code for a string $\bx=x_1x_2x_3$ of {\bf{three}} successive samples from this source. Estimate (to one decimal place) the factor by which the expected length of this optimal code is greater than the entropy of the three-bit string $\bx$. [$H_2(0.01) \simeq 0.08$, where $H_2(x) = x \log_2 (1/x) + (1-x) \log_2 (1/(1-x))$.] %\medskip An {{arithmetic code}\/} is used to compress a string of $1000$ samples from the source $X$. Estimate the mean and standard deviation of the length of the compressed file. % This is example 6.3, identical, except we are talking about compressing % rather than generating. } \exercisaxB{2}{ex.ACNf}{ Describe an arithmetic coding algorithm to generate random bit strings of length $N$ with density $f$ (\ie, each bit has probability $f$ of being a one) where $N$ is given. } \exercisaxC{2}{ex.LZprune}{ Use a modified Lempel--Ziv algorithm in which, as discussed on \pref{sec.LZprune}, the dictionary of prefixes is % effectively pruned by writing new prefixes into the space occupied by prefixes that will not be needed again. Such prefixes can be identified when both their children have been added to the dictionary of prefixes. (You may neglect the issue of termination of encoding.) Use this algorithm to encode the string {\tt{0100001000100010101000001}}. Highlight the bits that follow a prefix on the second occasion that that prefix is used. (As discussed earlier, these bits could be omitted.) % from the encoding if we adopted the convention (discussed % earlier) % of not transmitting the bit that follows a prefix on the % second occasion that that prefix is used. % nb this is same as an earlier example. % i get % ,0 0,1 1,0 10,1 10,0 00,0 011,0 100,1 010,0 001,1 } \exercissxC{2}{ex.LZcomplete}{ Show that this modified Lempel--Ziv code is still not `complete', that is, there are binary strings that are not encodings of any string. } % answer: this is because there are illegal prefix names, e.g. 
at the % 5th step, 111 is not legal. % \exercissxB{3}{ex.LZfail}{ Give examples of simple sources that have low entropy but would not be compressed well by the Lempel--Ziv algorithm. } % % Ideas: add a figure showing the flow diagram -- source, model. % % % \begin{thebibliography}{} % \bibitem[\protect\citeauthoryear{Witten {\em et~al.\/}}{1987}]{arith_coding} % {\sc Witten, I.~H.}, {\sc Neal, R.~M.}, \lsaand {\sc Cleary, J.~G.} % \newblock (1987) % \newblock Arithmetic coding for data compression. % \newblock {\em Communications of the ACM\/} {\bf 30} (6):~520-540. % % \end{thebibliography} % \part{Noisy Channel Coding} % \end{document} \dvips % \section{Further exercises on data compression} %\chapter{Further Exercises on Data Compression} \label{ch_f4} % % _f4.tex: exercises to follow chapter 4 in a 'review, revision, further topics' % exercise zone. % \fakesection{Post-compression general extra exercises} The following exercises may be skipped by the reader who is eager to learn about noisy channels. % % DOES THIS BELONG HERE? Maybe move to p92. % \fakesection{RNGaussian} \exercissxA{3}{ex.RNGaussian}{ \index{life in high dimensions}\index{high dimensions, life in} % Consider a Gaussian distribution\index{Gaussian distribution!$N$--dimensional} in $N$ dimensions, \beq P(\bx) = \frac{1}{(2 \pi \sigma^2)^{N/2}} \exp \left( - \frac{\sum_n x_n^2}{2 \sigma^2} \right) . \label{first.gaussian} \eeq % Show that Define the radius of a point $\bx$ to be $r = \left( {\sum_n x_n^2} \right)^{1/2}$. Estimate the mean and variance of the square of the radius, $r^2 = \left( {\sum_n x_n^2} \right)$. \begin{aside}%{\small You may find helpful the integral \beq \int \! \d x\: \frac{1}{(2 \pi \sigma^2)^{1/2}} \: x^4 \exp \left( - \frac{x^2}{2 \sigma^2} \right) = 3 \sigma^4 , \label{eq.gaussian4thmoment} \eeq though you should be able to estimate the required quantities without it. 
\end{aside} % If you like gamma integrals % derive the probability density of the radius $r = \left( {\sum_n % x_n^2} \right)^{1/2}$, and find the most probable % radius. %\amarginfig{b}{% in first printing, before asides changed \amarginfig{t}{% \setlength{\unitlength}{0.7mm} % there is a strip without ink at the left, hence I use -19 % instead of -21 as the left coordinate \begin{picture}(42,42)(-19,-21)% original is 6in by 6in, so 7unitlength=1in % use 42 unitlength for width \put(-21,-21){\makebox(42,42){\psfig{figure=figs/typicalG.ps,angle=-90,width=29.4mm}}} %\put(14,14){\makebox(0,0)[l]{\small probability density is maximized here}} \put(10,18){\makebox(0,0)[bl]{\small probability density}} \put(13,13){\makebox(0,0)[bl]{\small is maximized here}} %\put(14,-14){\makebox(0,0)[l]{\small almost all probability mass is here}} \put(9,-16){\makebox(0,0)[l]{\small almost all}} \put(2,-21){\makebox(0,0)[l]{\small probability mass is here}} %\put(15,-26){\makebox(0,0)[l]{\small is here}} \put(-2,-2){\makebox(0,0)[tr]{\small $\sqrt{N} \sigma$}} \end{picture} \caption[a]{Schematic representation of the typical set of an $N$-dimensional Gaussian distribution.} } Assuming that $N$ is large, show that nearly all the probability of a Gaussian is contained in a \ind{thin shell} of radius $\sqrt{N} \sigma$. Find the thickness of the shell. Evaluate the probability density % in $\bx$ space (\ref{first.gaussian}) at a point in that thin shell and at the origin $\bx=0$ and compare. Use the case $N=1000$ as an example. Notice that nearly all the probability mass % the bulk of the probability density is located in a different part of the space from the region of highest probability density. % } % % extra exercises that are appropriate once source compression has been % discussed. % % contents: % % simple huffman question % Phone chat using rings (originally in mockexam.tex, now in M.tex) % Bridge bidding as communication (where?) 
% \fakesection{Compression exercises: bidding in bridge, etc} % \exercisaxA{2}{ex.source_code}{ % Explain what is meant by an {\em optimal binary symbol code\/}. Find an optimal binary symbol code for the ensemble: \[ \A = \{ {\tt{a}},{\tt{b}},{\tt{c}},{\tt{d}},{\tt{e}},{\tt{f}},{\tt{g}},{\tt{h}},{\tt{i}},{\tt{j}} \} , \] \[ \P = \left\{ \frac{1}{100} , \frac{2}{100} , \frac{4}{100} , \frac{5}{100} , \frac{6}{100} , \frac{8}{100} , \frac{9}{100} , \frac{10}{100} , \frac{25}{100} , \frac{30}{100} \right\} , \] and compute the expected length of the code. } \exercisaxA{2}{ex.doublet.huffman}{ A string $\by=x_1 x_2$ consists of {\em two\/} independent samples from an ensemble \[ X : {\cal A}_X = \{ {\tt{a}} , {\tt{b}} , {\tt{c}} \} ; {\cal P}_X = \left\{ \frac{1}{10} , \frac{3}{10} , \frac{6}{10} \right\} . \] What is the entropy of $\by$? Construct an optimal binary symbol code for the string $\by$, and find its expected length. } \exercisaxA{2}{ex.ac_expected}{ % (Cambridge University Part III Maths examination, 1998.) % Strings of $N$ independent samples from an ensemble with $\P = \{ 0.1 , 0.9 \}$ are compressed using an {arithmetic code} that is matched to that ensemble. Estimate the mean and standard deviation of the compressed strings' lengths for the case $N=1000$. % [$H_2(0.1) \simeq 0.47$] % ; $\log_2(9) \simeq 3$.] % .47, 3.17 % my answer: 470 pm 30 } % from M.tex, in which model solns are found too \exercisaxA{3}{ex.phone_chat}{%(Cambridge University Part III Maths examination, 1998.) {\sf Source coding with variable-length symbols.} % -- Source coding / optimal use of channel} \begin{quote} In the chapters on source coding, we assumed that we were encoding into a binary alphabet $\{ {\tt0} , {\tt1} \}$ in which both symbols\index{source code!variable symbol durations} % had the same associated cost. Clearly a good compression algorithm % uses both these symbols with equal frequency, and the capacity of % this alphabet is one bit per character. 
should be used with equal frequency. In this question we explore how the encoding alphabet should be used % what happens if the symbols take different times to transmit. % have different costs. % the \end{quote} % A poverty-stricken \ind{student} communicates for free with a friend using a \index{phone}{telephone} by selecting an integer $n \in \{ 1,2,3\ldots \}$, making the friend's phone ring $n$ times, then hanging up in the middle of the $n$th ring. This process is repeated so that a string of symbols $n_1 n_2 n_3 \ldots$ is received. What is the optimal way to communicate? If large integers $n$ are selected then the message takes longer to communicate. If only small integers $n$ are used then the information content per symbol is small. We aim to maximize the rate of information transfer, per unit time. Assume that the time taken to transmit a number of rings $n$ and to redial %, including the space that separates them from the next sequence of rings is $l_n$ seconds. Consider a probability distribution over $n$, $\{ p_n \}$. Defining the average duration {\em per symbol\/} to be \beq L(\bp) = \sum_n p_n l_n \eeq and the entropy {\em per symbol\/} to be \beq H(\bp) = \sum_n p_n \log_2 \frac{1}{p_n } , \eeq show that for the average information rate {\em per second\/} to be maximized, the symbols must be used with probabilities of the form \beq p_n = \frac{1}{Z} 2^{-\beta l_n} \label{eq.phone.1} \eeq where % $\beta$ is a Lagrange multiplier %and $Z = \sum_n 2^{-\beta l_n}$ and $\beta$ satisfies the implicit equation % \marginpar{[6]} \beq \beta = \frac{H(\bp)}{L(\bp)} , \label{eq.phone.2} \eeq that is, $\beta$ is the rate of communication. %is set so as to maximize %\beq % R(\beta) = - \beta - \frac{\log Z(\beta)}{L(\beta)} %\eeq % where $L(\beta)=\sum p_n l_n$. % By differentiating $R(\beta)$, show that % $\beta^*$ satisfies Show that these two equations (\ref{eq.phone.1}, \ref{eq.phone.2}) imply that $\beta$ must be set such that \beq \log Z =0. 
\label{eq.phone.3} \eeq % Assuming that the channel has the property % redialling takes the same time as one ring, so that \beq l_n = n \: \mbox{seconds}, \label{eq.phone.4} \eeq find the optimal distribution $\bp$ and show that the maximal information rate is 1 bit per second. % $\log xxxx$ % and that the mean number of rings % in a group is xxxx and that the information per % ring is xxxx. How does this compare with the information rate per second achieved if $\bp$ is set to $(1/2,1/2,0,0,0,0,\ldots)$ --- that is, only the symbols $n=1$ and $n=2$ are selected, and they have equal probability? Discuss the relationship between the results (\ref{eq.phone.1}, \ref{eq.phone.3}) derived above, and the Kraft inequality from source coding theory. How might a random binary source be efficiently encoded into a sequence of symbols $n_1 n_2 n_3 \ldots$ for transmission over the channel defined in \eqref{eq.phone.4}? } \exercisaxB{1}{ex.shuffle}{How many bits does it take to shuffle a pack of cards? % [In case this is not clear, here's the long-winded % version: imagine using a random number generator % to generate perfect shuffles of a deck of cards. % What is the smallest number of random bits % needed per shuffle?] } \exercisaxB{2}{ex.bridge}{In the card game\index{game!Bridge} Bridge,\index{Bridge} the four players receive 13 cards each from the deck of 52 and start each game by looking at their own hand and bidding. The legal bids are, in ascending order $1 \clubsuit, 1 \diamondsuit, 1 \heartsuit, 1\spadesuit,$ $1NT,$ $2 \clubsuit,$ $2 \diamondsuit,$ % 2 \heartsuit, 2\spadesuit, 2NT, $\ldots$ % 7 \clubsuit, 7 \diamondsuit, $7 \heartsuit, 7\spadesuit, 7NT$, and successive bids must follow this order; a bid of, say, $2 \heartsuit$ may only be followed by higher bids such as $2\spadesuit$ or $3 \clubsuit$ or $7 NT$. (Let us neglect the `double' bid.) % The outcome of the bidding process determines the subsequent % game. The players have several aims when bidding. 
One of the aims is for two partners to communicate to each other as much as possible about what cards are in their hands. % There are many bidding systems whose aim is, among other things, % to communicate this information. Let us concentrate on this task. \begin{enumerate} \item After the cards have been dealt, how many bits are needed for North to convey to South what her hand is? \item Assuming that E and W do not bid at all, what is the maximum total information that N and S can convey to each other while bidding? Assume that N starts the bidding, and that once either N or S stops bidding, the bidding stops. \end{enumerate} } \exercisaxB{2}{ex.microwave}{ My old `\ind{arabic}' \ind{microwave oven}\index{human--machine interfaces} had 11 buttons for entering cooking times, and my new `\ind{roman}' microwave has just five. The buttons of the roman microwave are labelled `10 minutes', `1 minute', `10 seconds', `1 second', and `Start'; I'll abbreviate these five strings to the symbols {\tt M}, {\tt C}, {\tt X}, {\tt I}, $\Box$. % The two keypads then look as follows. 
% included by _e4.tex \amarginfig{b}{% \begin{center} \begin{tabular}[t]{c}%%%%%%%%%% table containing microwave buttons %\toprule Arabic \\ \midrule % The keypad \begin{tabular}[t]{*{3}{p{.1in}}} \framebox{1} & \framebox{2} & \framebox{3} \\ \framebox{4} & \framebox{5} & \framebox{6} \\ \framebox{7} & \framebox{8} & \framebox{9} \\ & \framebox{0} & \framebox{$\!\Box\!$} \\ \end{tabular} \\ %\bottomrule %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% end all micro table \end{tabular} \begin{tabular}[t]{c}%%%%%%%%%% table containing microwave buttons %\toprule Roman \\ \midrule % The keypad \begin{tabular}[t]{*{3}{p{.1in}}} \framebox{{\tt{M}}} & \framebox{{\tt{X}}} & \\ \framebox{{\tt{C}}} & \framebox{{\tt{I}}} & \framebox{$\!\Box\!$} \\ \end{tabular} \\ %\bottomrule %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% end all micro table \end{tabular}\\ \mbox{$\:$} \end{center} \caption[a]{Alternative keypads for microwave ovens.} } To enter one minute and twenty-three seconds (1:23), the arabic sequence is \beq {\tt{123}}\Box, \eeq and the roman sequence is \beq {\tt{CXXIII}}\Box . \eeq Each of these keypads defines a code mapping the 3599 cooking times from 0:01 to 59:59 into a string of symbols. \ben \item Which times can be produced with two or three symbols? (For example, 0:20 can be produced by three symbols in either code: ${\tt{XX}}\Box$ and ${\tt{20}}\Box$.) \item Are the two codes complete? Give a detailed answer. % Discuss all the ways in which these two codes are not complete. \item For each code, name a cooking time % couple of times that it can produce in four symbols that the other code cannot. \item Discuss the implicit probability distributions over times to which each of these codes is best matched. \item Concoct a plausible probability distribution over times that a real user might use, and evaluate roughly the expected number of symbols, and maximum number of symbols, that each code requires. Discuss the ways in which each code is inefficient or efficient. 
\item Invent a more efficient cooking-time-encoding system for a microwave oven. \een %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% } % \fakesection{Cinteger} %\input{tex/_Cinteger} \exercissxC{2}{ex.Cinteger}{ Is the standard binary representation for positive integers (\eg\ $c_{\rm b}(5) = {\tt 101}$) a uniquely decodeable code? Design a binary code for the positive integers, \ie, a mapping from $n \in \{ 1,2,3,\ldots \}$ to $c(n) \in \{{\tt 0},{\tt 1}\}^+$, that is uniquely decodeable. Try to design codes that are prefix codes and that satisfy the \Kraft\ equality $\sum_n 2^{-l_n} \eq 1$. % % Not a typo. % \begin{aside} Motivations: any data file terminated by a special end of file character can be mapped onto an integer, so a prefix code for integers can be used as a self-delimiting encoding of files too. Large files correspond to large integers. Also, one of the building blocks of a `universal' coding scheme -- that is, a coding scheme that will work OK for a large variety of sources -- is the ability to encode integers. Finally, in microwave ovens, cooking times are positive integers! \end{aside} Discuss criteria by which one might compare alternative codes for integers (or, equivalently, alternative self-delimiting codes for files). } % % % \section{Solutions}% to Chapter \protect\ref{ch4}'s exercises} % % solns to exercises in l4.tex % \fakesection{solns to exercises in l4.tex} \soln{ex.ac.terminate}{ The worst-case situation is when the interval to be represented lies just inside a binary interval. In this case, we may choose either of two binary intervals as shown in \figref{fig.ac.worst.case}. These binary intervals are no smaller than $P(\bx|\H)/4$, so the binary encoding has a length no greater than $\log_2 1/ P(\bx|\H) + \log_2 4$, which is two bits more than the ideal message length. } % % HELP HELP HELP RESTORE ME! 
% \input{tex/acvshuffman.tex} % % \soln{ex.usebits}{ The standard method uses 32 random bits per generated symbol and so requires $32\,000$ bits to generate one thousand samples. % this is displaced down a bit. \begin{figure}%[htbp] \figuremargin{% \begin{center} % created by ac.p only_show_data=1 > ac/ac_data.tex \mbox{ \small \setlength{\unitlength}{1.62in} \begin{picture}(2,1.2)(0,0) \thicklines % desired interval on left \put( 0.0, 1.01){\makebox(0,0)[bl]{Source string's interval}} \put( 0.5, 0.5){\makebox(0,0){$P(\bx|\H)$}} \put( 0.0, 0.05){\line(1,0){ 1.0}} \put( 0.0, 0.95){\line(1,0){ 1.0}} % % binary intervals \put( 1.0, 1.03){\makebox(0,0)[bl]{Binary intervals}} \put( 1.0, 0.0){\line(1,0){ 1.0}} \put( 1.0, 1.0){\line(1,0){ 1.0}} % \thinlines % \put( 0.5, 0.4){\vector(0,-1){0.35}} \put( 0.5, 0.6){\vector(0,1){0.35}} % \put( 1.0, 0.5){\line(1,0){ 0.5}} \put( 1.0, 0.25){\line(1,0){ 0.25}} \put( 1.0, 0.75){\line(1,0){ 0.25}} % \put( 1.125, 0.625){\vector(0,1){0.125}} \put( 1.125, 0.625){\vector(0,-1){0.125}} \put( 1.125, 0.375){\vector(0,1){0.125}} \put( 1.125, 0.375){\vector(0,-1){0.125}} \end{picture} } \end{center} }{% \caption[a]{Termination of arithmetic coding in the worst case, where there is a two bit overhead. Either of the two binary intervals marked on the right-hand side may be chosen. These binary intervals are no smaller than $P(\bx|\H)/4$.} \label{fig.ac.worst.case} }% \end{figure} Arithmetic coding uses on average about $H_2 (0.01)=0.081$ bits per generated symbol, and so requires about 83 bits to generate one thousand samples (assuming an overhead of roughly two bits associated with termination). Fluctuations in the number of {\tt{1}}s would produce variations around this mean with standard deviation 21. 
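
As a rough check of the standard deviation of 21 quoted above, here is a sketch under assumed notation: $N=1000$ is the number of samples, $f=0.01$ is the probability of a {\tt{1}}, and $k$ is the number of {\tt{1}}s among the samples (the symbol $k$ is introduced here for illustration only). The ideal number of bits is $k \log_2 (1/f) + (N-k) \log_2 (1/(1-f))$, so each additional {\tt{1}} changes the length by $\log_2 [(1-f)/f] = \log_2 99 \simeq 6.63$ bits. The number of {\tt{1}}s is binomially distributed with mean $Nf = 10$ and standard deviation
\beq
	\sqrt{ N f (1-f) } = \sqrt{ 1000 \times 0.01 \times 0.99 } \simeq 3.15 ,
\eeq
so the standard deviation of the number of bits is roughly
\beq
	3.15 \times 6.63 \simeq 20.9 ,
\eeq
which rounds to the value of 21 given above.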
} % 57 %\soln{ex.Clengthen}{ % moved to cutsolutions.tex \soln{ex.LZencode}{ The encoding is {\tt010100110010110001100}, which comes from the parsing \beq \tt 0, 00, 000, 0000, 001, 00000, 000000 \eeq which is encoded thus: \beq {\tt (,0),(1,0),(10,0),(11,0),(010,1),(100,0),(110,0) } . \eeq } \soln{ex.LZdecode}{ The decoding is \begin{center} {\tt 0100001000100010101000001}. \end{center} } %\soln{ex.AC52}{ \soln{ex.AC52b}{ This problem is equivalent to \exerciseref{ex.AC52}. The selection of $K$ objects from $N$ objects requires $\lceil \log_2 {N \choose K}\rceil$ bits $\simeq N H_2(K/N)$ bits. % This selection could be made using arithmetic coding. The selection corresponds to a binary string of length $N$ in which the {\tt{1}} bits represent which objects are selected. Initially the probability of a {\tt{1}} is $K/N$ and the probability of a {\tt{0}} is $(N\!-\!K)/N$. Thereafter, given that the emitted string thus far, of length $n$, contains $k$ {\tt{1}}s, the probability of a {\tt{1}} is $(K\!-\!k)/(N\!-\!n)$ and the probability of a {\tt{0}} is $1 - (K\!-\!k)/(N\!-\!n)$. } \soln{ex.LZcomplete}{ This modified Lempel--Ziv code is still not `complete', because, for example, after five prefixes have been collected, the pointer could be any of the strings $\tt000$, $\tt001$, $\tt010$, $\tt011$, $\tt100$, but it cannot be $\tt101$, $\tt110$ or $\tt111$. Thus there are some binary strings that cannot be produced as encodings. } \soln{ex.LZfail}{ Sources with low entropy that are not well compressed by Lempel--Ziv include:\index{Lempel--Ziv coding!criticisms} \ben \item Sources with some symbols that have long range correlations and intervening random junk. An ideal model should capture what's correlated and compress it. Lempel--Ziv can compress the correlated features only by memorizing all cases of the intervening junk. 
As a simple example, consider a \index{phone number}telephone book in which every line contains an (old number, new number) pair: \begin{center} {\tt{285-3820:572-5892}}\teof\\ {\tt{258-8302:593-2010}}\teof\\ \end{center} The number of characters per line is 18, drawn from the 13-character alphabet $\{ {\tt{0}},{\tt{1}},\ldots,{\tt{9}},{\tt{-}},{\tt{:}},\eof\}$. The characters `{\tt{-}}', `{\tt{:}}' and `\teof' occur in a predictable sequence, so the true information content per line, assuming all the phone numbers are seven digits long, and assuming that they are random sequences, is about 14 \dits. (A \dit\ is the information content of a random integer between 0 and 9.) A finite state language model could easily capture the regularities in these data. A Lempel--Ziv algorithm will take a long time before it compresses such a file down to 14 bans per line, % by a factor of $14/18$, however, because in order for it to `learn' that the string {\tt{:}}$ddd$ is always followed by {\tt{-}}, for any three digits $ddd$, it will have to {\em see\/} all those strings. So near-optimal compression will only be achieved after thousands of lines of the file have been read.\medskip % figs/wallpaper.ps made by pepper.p \begin{figure}[htbp] \fullwidthfigureright{% %\figuremargin{% \small \begin{center} \mbox{%(a) \psfig{figure=figs/wallpaper.ps}}\\ %\mbox{(b) \psfig{figure=figs/wallpaperc.ps}}\\ %\mbox{(c) \psfig{figure=figs/wallpaperb.ps}} \end{center} }{% \caption[a]{ A source with low entropy that is not well compressed by Lempel--Ziv. The bit sequence is read from left to right. Each line differs from the line above in $f=5$\% of its bits. The image width is 400 pixels. % % Three % sources with low entropy that are not well compressed by Lempel--Ziv. % The bit sequence is read from left to right. The image width is 400 pixels % in each case. % % (a) Each line differs from the line above in $p=$5\% of its bits. 
% % (b) % Each column $c$ has its own transition probability $p_c$ such that % successive vertical bits are identical with probability $p_c$. The % probabilities $p_c$ are drawn from a uniform distribution over $[0,0.5]$. % % (c) As in b, but the probabilities $p_c$ are drawn from a uniform % distribution over $[0,1]$. } % ; in columns with $p_c \simeq 1$, successive % vertical bits are likely to be opposite to each other. } % \label{fig.pepper} }% \end{figure} % % this is beautiful but gratuitous % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55 %\begin{figure}[htbp] %\figuremargin{% %\begin{center} %\mbox{\psfig{figure=figs/automaton346.big1.ps,height=7in}}\\ %\end{center} %}{% %\caption[a]{A longer cellular automaton history. %} %\label{fig.automatonII} %\end{figure} \vspace*{-10pt}% included to undo the cumulation of item space and figure space. \item Sources with long-range correlations, for example two-dimensional images that are represented by a sequence of pixels, row by row, so that vertically adjacent pixels are a distance $w$ apart in the source stream, where $w$ is the image width. Consider, for example, a fax transmission in which each line is very similar to the previous line (\figref{fig.pepper}). The true entropy is only $H_2(f)$ per pixel, where $f$ is the probability that a pixel differs from its parent. % except for a light peppering % of noise. % Each line is somewhat similar to the previous line but not identical, % so there is no previous occurrence of a long string % to point to; some algorithms in the Lempel--Ziv class % will achieve a certain degree of compression % by memorizing recent short strings, but the compression achieved % will not equal the true entropy. % and after a few lines, % the pattern has moved on by a random walk, so memorizing ancient patterns % is of no use. Lempel--Ziv algorithms will only compress down to the entropy once {\em all\/} $2^w = 2^{400}$ strings of length $w$ have occurred and their successors have been memorized.
There are only about $2^{300}$ particles in the universe, so we can confidently say that Lempel--Ziv codes will {\em never\/} capture the redundancy of such an image. % figs/wallpaper.ps made by pepper.p \begin{figure}[htbp] %\figuremargin{% \fullwidthfigureright{% \begin{center} %\mbox{(a) \psfig{figure=figs/wallpaperx.ps}}\\ \mbox{%(b) \psfig{figure=figs/wallpaperx2.ps}}\\ %\mbox{(c) \psfig{figure=figs/automaton346.2.ps}}\\ % see also figs/automaton346.big1.pbm \end{center} }{% \caption[a]{%A second source with low entropy that is not optimally compressed by Lempel--Ziv. A texture consisting of horizontal and vertical pins dropped at random on the plane. % (c) The 100-step time-history of a cellular automaton with 400 cells. } \label{fig.wallpaper} }% \end{figure} Another highly redundant texture is shown in \figref{fig.wallpaper}. The image was made by dropping horizontal and vertical pins randomly on the plane. It contains both long-range vertical correlations and long-range horizontal correlations. There is no practical way that Lempel--Ziv, fed with a pixel-by-pixel scan of this image, could capture both these correlations. % gzip on the pbm gives: 2374 wallpaperx.pbm.gz % That is better than 50%. % Saved as a gif, wallpaperx.pbm is 2926 characters. Original 40000 pixels would be 5000 characters. % That is worse than 50% compression. % cf. perl program, stripwallpaper.p % is % 0 8 274 /home/mackay/bin/stripwallpaper.p.gz % 0 16 631 wallpaperx.asc.gz % 0 24 905 total <-------- % 18 65 368 /home/mackay/bin/stripwallpaper.p % 162 484 1390 wallpaperx.asc % 180 549 1758 total % lossless jpg is terrible!: % 38828 wallpaperx.jpg % would be nice to try JBIG on this. 
% It is worth emphasizing that b Biological computational systems can readily identify the redundancy in these images and in images that are much more complex; thus we might anticipate that the best data compression algorithms will result from the development of \ind{artificial intelligence} methods.\index{compression!future methods} \item Sources with intricate redundancy, such as files generated by computers. For example, a \LaTeX\ file followed by its encoding into a PostScript file. The information content of this pair of files is roughly equal to the information content of the \LaTeX\ file alone. \item A picture of the Mandelbrot set. The picture has an information content equal to the number of bits required to specify the range of the complex plane studied, the pixel sizes, and the colouring rule used. % mapping of set membership to pixel colour. % \item % Encoded transmissions arising from an error-correcting code of rate $K/N$. % These are very easily compressed by a factor % $K/N$ if the generator operation is known. % see README2 in /home/mackay/_courses/comput/newising_mc \item A picture of a ground state of a frustrated antiferromagnetic \ind{Ising model} (\figref{fig.ising.ground}), which we will discuss in \chref{ch.ising}. Like \figref{fig.wallpaper}, this binary image has interesting correlations in two directions. \begin{figure}[htbp] \figuremargin{% \begin{center} \mbox{\bighisingsample{hexagon2}} \end{center} }{% \caption[a]{Frustrated triangular Ising model in one of its ground states.} \label{fig.ising.ground} }% \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \item Cellular automata -- \figref{fig.wallpaperc} shows the state history of 100 steps of a \ind{cellular automaton} with 400 cells. The update rule, in which each cell's new state depends on the state of five preceding cells, was selected at random. 
The information content is equal to the information in the boundary (400 bits), and the propagation rule, which here can be described in 32 bits. An optimal compressor will thus give a compressed file length which is essentially constant, independent of the vertical height of the image. Lempel--Ziv would only give this zero-cost compression once the cellular automaton has entered a periodic limit cycle, which could easily take about $2^{100}$ iterations. In contrast, the JBIG compression method, which models the probability of a pixel given its local context and uses arithmetic coding, would do a good job on these images. %\item % And finally, an example relating to error-correcting codes: % the\index{error-correcting code!and compression}\index{difficulty of compression}\index{compression!difficulty of} % received transmissions arising when encoded transmissions are % sent over a noisy channel. Such received strings have an entropy % equal to the source entropy plus the channel noise's % entropy. If a \index{Lempel--Ziv coding|)}Lempel--Ziv % algorithm could compress these strings, % this would be tantamount to solving the decoding problem for % the error-correcting code! % % We have not got to this topic yet, but we will see later that % the decoding of a general error-correcting code is % a challenging intractable problem. % automaton.p \begin{figure}%[htbp] %\figuremargin{% \fullwidthfigureright{% \begin{center} \mbox{%(c) \psfig{figure=figs/automaton346.2.ps}}\\ % see also figs/automaton346.big1.pbm \end{center} }{% \caption[a]{% Another source with low entropy that is not optimally compressed by Lempel--Ziv. The 100-step time-history of a cellular automaton with 400 cells. 
} \label{fig.wallpaperc} }% \end{figure} \een } \index{source code!stream codes|)}\index{stream codes|)} \dvipsb{solutions stream codes} % % %\section{Solutions}% to Chapter \protect\ref{ch_f4}'s exercises} % \section{Solutions to section \protect\ref{ch_f4}'s exercises} \fakesection{RNGaussian} \soln{ex.RNGaussian}{ For a one-dimensional Gaussian, the variance of $x$, $\Exp[x^2]$, is $\sigma^2$. So the mean value of $r^2$ in $N$ dimensions, since the components of $\bx$ are independent random variables, is \beq \Exp[ r^2] = N \sigma^2 . \eeq The variance of $r^2$, similarly, is $N$ times the variance of $x^2$, where $x$ is a one-dimensional Gaussian variable. \beq \var (x^2 ) = \int \! \d x \: \frac{1}{(2 \pi \sigma^2)^{1/2}} x^4 \exp \left( - \frac{x^2}{2 \sigma^2} \right) - \sigma^4 . \eeq The integral is found to be $3 \sigma^4$ (\eqref{eq.gaussian4thmoment}), so $\var(x^2) = 2 \sigma^4$. Thus the variance of $r^2$ is $2 N \sigma^4$. For large $N$, the \ind{central-limit theorem} % law of large numbers indicates that $r^2$ has a Gaussian distribution with mean $N \sigma^2$ and standard deviation $\sqrt{2 N} \sigma^2$, so the probability density of $r$ must similarly be concentrated about $r \simeq \sqrt{N} \sigma$. The thickness of this shell is given by turning the standard deviation of $r^2$ into a standard deviation on $r$: for small $\delta r/r$, $\delta \log r = \delta r/r = (\dhalf) \delta \log r^2 = (\dhalf) \delta (r^2)/r^2$, so setting $\delta (r^2) = \sqrt{2 N} \sigma^2$, $r$ has standard deviation $\delta r = (\dhalf) r \delta (r^2)/r^2$ % $=$ $(\dhalf) \sqrt{2 N} \sigma^2 / \sqrt{( N \sigma^2)}$ $=\sigma/\sqrt{2}$. The probability density of the Gaussian at a point $\bx_{\rm shell}$ where $r = \sqrt{N} \sigma$ is \beq P(\bx_{\rm shell}) = \frac{1}{(2 \pi \sigma^2)^{N/2}} \exp \left( - \frac{N \sigma^2}{2 \sigma^2} \right) = \frac{1}{(2 \pi \sigma^2)^{N/2}} \exp \left( - \frac{N}{2} \right) . 
\eeq By contrast, the probability density at the origin is \beq P(\bx\eq 0) = \frac{1}{(2 \pi \sigma^2)^{N/2}} . \eeq Thus $P(\bx_{\rm shell})/P(\bx\eq 0) = \exp \left( - \linefrac{N}{2} \right) .$ The probability density at the typical radius is smaller than the density at the origin by a factor of $e^{N/2}$. If $N=1000$, then the probability density at the origin is $e^{500}$ times greater. % } % % % for _e4.tex % \fakesection{Source coding problems solutions} %\soln{ex.forward-backward-language}{ %% (Draft.) %% % If we write down a language model for strings in forward-English, % the same model defines a probability distribution over strings % of backward English. The probability distributions have % identical entropy, so the average information contents % of the reversed % language and the forward language are equal. %} %\soln{ex.microwave}{ % moved to cutsolutions.tex % removed to cutsolutions.tex % \soln{ex.bridge}{(Draft.) \dvipsb{solutions further data compression f4} %\subchapter{Codes for integers \nonexaminable} \chapter{Codes for Integers \nonexaminable} \label{ch.codesforintegers} This chapter is an aside, which may safely be skipped. \section*{Solution to \protect\exerciseref{ex.Cinteger}}% was fiftythree \label{sec.codes.for.integers}\label{ex.Cinteger.sol}% special by hand %\soln{ex.Cinteger}{ %} \fakesection{Cinteger Solutions to problems} % % original integer stuff is in old/s_integer.tex % % chapter 2 , coding of integers To discuss the coding of integers\index{source code!for integers} we need some definitions.\index{binary representations} \begin{description} \item[The standard binary representation of a positive integer] $n$ will be denoted by $c_{\rm b}(n)$, \eg, $c_{\rm b}(5) = {\tt 101}$, $c_{\rm b}(45) = {\tt 101101}$. \item[The standard binary length of a positive integer] $n$, $l_{\rm b}(n)$, is the length of the string $c_{\rm b}(n)$. For example, $l_{\rm b}(5) = 3$, $l_{\rm b}(45) = 6$.
\end{description} The standard binary representation $c_{\rm b}(n)$ is {\em not\/} a uniquely decodeable code for integers since there is no way of knowing when an integer has ended. For example, $c_{\rm b}(5)c_{\rm b}(5)$ is identical to $c_{\rm b}(45)$. It would be uniquely decodeable if we knew the standard binary length of each integer before it was received. Noticing that all positive integers have a standard binary representation that starts with a {\tt{1}}, we might define another representation: \begin{description} \item[The headless binary representation of a positive integer] $n$ will be denoted by $c_{\rm B}(n)$, \eg, $c_{\rm B}(5) = {\tt 01}$, $c_{\rm B}(45) = {\tt 01101}$ and $c_{\rm B}(1) = \lambda$ (where $\lambda$ denotes the null string). \end{description} This representation would be uniquely decodeable if we knew the length $l_{\rm b}(n)$ of the integer. So, how can we make a uniquely decodeable code for integers? Two strategies can be distinguished. \ben \item {\bf Self-delimiting codes}. We first communicate somehow % An alternative strategy is to make the code self-delimiting \index{symbol code!self-delimiting}\index{self-delimiting}the length of the integer, $l_{\rm b}(n)$, which is also a positive integer; then communicate the original integer $n$ itself using $c_{\rm B}(n)$. \item {\bf Codes with `end of file' characters}. We code the integer into blocks of length $b$ bits, and reserve one of the $2^b$ symbols to have the special meaning `end of file'. The coding of integers into blocks is arranged so that this reserved symbol is not needed for any other purpose. \een The simplest uniquely decodeable code for integers is the unary code, which can be viewed as a code with an end of file character. \begin{description} \item[Unary code\puncspace] An integer $n$ is encoded by sending a string of $n\!-\!1$ {\tt 0}s % zeroes followed by a {\tt 1}.
\[ \begin{array}{cl} \toprule n & c_{\rm U}(n) \\ \midrule 1 & {\tt 1} \\ 2 & {\tt 01} \\ 3 & {\tt 001} \\ 4 & {\tt 0001} \\ 5 & {\tt 00001} \\ \vdots & \\ 45 & {\tt 000000000000000000000000000000000000000000001} \\ \bottomrule \end{array} \] The unary code has length $l_{\rm U}(n) = n$. The unary code is the optimal code for integers if the probability distribution over $n$ is $p_{\rm U}(n) = 2^{-{n}}$. \end{description} \subsubsection*{Self-delimiting codes} We can use the unary code to encode the {\em length\/} of the binary encoding of $n$ and make a self-delimiting code: \begin{description} \item[Code $C_\alpha$\puncspace] % The length of the standard binary representation is a positive integer We send the unary code for $l_{\rm b}(n)$, followed by the headless binary representation of $n$. \beq c_{\alpha}(n) = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n) . \eeq Table \ref{tab.calpha} shows the codes for some integers. The overlining indicates the division of each string into the parts $c_{\rm U}[ l_{\rm b}(n) ]$ and $c_{\rm B}(n)$. \margintab{\footnotesize \[ \begin{array}{clll} \toprule n & c_{\rm b}(n) & \makebox[0in][c]{$l_{\rm b}(n)$} & c_{\alpha}(n) % = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n) \\ \midrule 1 & {\tt 1 } & 1 & {\tt {\overline{1}}} \\ 2 & {\tt 10 } & 2 & {\tt {\overline{01}}0} \\ 3 & {\tt 11 } & 2 & {\tt {\overline{01}}1} \\ 4 & {\tt 100} & 3 & {\tt {\overline{001}}00} \\ 5 & {\tt 101} & 3 & {\tt {\overline{001}}01} \\ 6 & {\tt 110} & 3 & {\tt {\overline{001}}10} \\ \vdots & \\ 45 & {\tt 101101} & 6 & {\tt {\overline{000001}}01101} \\ \bottomrule \end{array} \] \caption[a]{$C_\alpha$.} \label{tab.calpha} } We might equivalently view $c_{\alpha}(n)$ as consisting of a string of $(l_{\rm b}(n)-1)$ zeroes followed by the standard binary representation of $n$, $c_{\rm b}(n)$. The codeword $c_{\alpha}(n)$ has length $l_{\alpha}(n) = 2 l_{\rm b}(n) - 1$. 
The implicit probability distribution over $n$ for the code $C_{\alpha}$ is separable into the product of a probability distribution over the length $l$, \beq P(l) = 2^{-l} , \eeq and a uniform distribution over integers having that length, \beq P(n\given l) = \left\{ \begin{array}{cl} 2^{-l+1} & l_{\rm b}(n) = l \\ 0 & \mbox{otherwise}. \end{array} \right. \eeq \end{description} Now, for the above code, the header that communicates the length always occupies the same number of bits as the standard binary representation of the integer (give or take one). If we are expecting to encounter large integers (large files) then this representation seems suboptimal, since it leads to all files occupying a size that is double their original uncoded size. Instead of using the unary code to encode the length $l_{\rm b}(n)$, we could use $C_{\alpha}$.% % see graveyard for original \margintab{{\footnotesize \[ \begin{array}{cll} \toprule n & c_{\beta}(n) & c_{\gamma}(n) \\ \midrule 1 & {\tt{\overline{1}}} & {\tt{\overline{1}}} \\ 2 & {\tt{\overline{010}}0} & {\tt{\overline{0100}}0} \\ 3 & {\tt{\overline{010}}1} & {\tt{\overline{0100}}1} \\ 4 & {\tt{\overline{011}}00}& {\tt{\overline{0101}}00} \\ 5 & {\tt{\overline{011}}01}& {\tt{\overline{0101}}01} \\ 6 & {\tt{\overline{011}}10}& {\tt{\overline{0101}}10} \\ \vdots & \\ 45 & {\tt{\overline{00110}}01101} & {\tt{\overline{01110}}01101} \\ \bottomrule \end{array} \] } \caption[a]{$C_\beta$ and $C_{\gamma}$.} \label{tab.cbeta} } \begin{description} \item[Code $C_\beta$\puncspace] % The length of the standard binary representation is a positive integer We send the length $l_{\rm b}(n)$ using $C_{\alpha}$, followed by the headless binary representation of $n$. \beq c_{\beta}(n) = c_{\alpha}[ l_{\rm b}(n) ] c_{\rm B}(n) . \eeq \end{description} Iterating this procedure, we can define a sequence of codes. \begin{description} \item[Code $C_{\gamma}$\puncspace] \beq c_{\gamma}(n) = c_{\beta}[ l_{\rm b}(n) ] c_{\rm B}(n) . 
\eeq % see graveyard for gamma table \item[Code $C_\delta$\puncspace] \beq c_{\delta}(n) = c_{\gamma}[ l_{\rm b}(n) ] c_{\rm B}(n) . \eeq \end{description} \subsection{Codes with end-of-file symbols} We can also make byte-based representations. (Let's use the term \ind{byte} flexibly here, to denote any fixed-length string of bits, not just a string of length 8 bits.) If we encode the number in some base, for example decimal, then we can represent each digit in a byte. In order to represent a digit from 0 to 9 in a byte we need four bits. Because $2^4 = 16$, this leaves 6 extra four-bit symbols, $\{${\tt 1010}, {\tt 1011}, {\tt 1100}, {\tt 1101}, {\tt 1110}, {\tt 1111}$\}$, that correspond to no decimal digit. We can use these as end-of-file symbols to indicate the end of our positive integer. % Such a code can also code the integer zero, for which % we have not been providing a code up till now. Clearly it is redundant to have more than one end-of-file symbol, so a more efficient code would encode the integer into base 15, and use just the sixteenth symbol, {\tt 1111}, as the punctuation character. Generalizing this idea, we can make similar byte-based codes for integers in bases 3 and 7, and in any base of the form $2^n-1$. \margintab{\small \[ \begin{array}{cll} \toprule n & c_3(n) & c_{7}(n) % = c_{\rm U}[ l_{\rm b}(n) ] c_{\rm B}(n) \\ \midrule 1 & {\tt 01\, 11 } & {\tt 001\, 111} \\ 2 & {\tt 10\, 11 } & {\tt 010\, 111} \\ 3 & {\tt 01\, 00\, 11 } & {\tt 011\, 111} \\ \vdots & \\ 45 & {\tt 01\, 10\, 00\, 00\, 11} & {\tt 110\, 011\, 111} \\ \bottomrule \end{array} \] \caption[a]{Two codes with end-of-file symbols, $C_3$ and $C_7$. Spaces have been included to show the byte boundaries. } } These codes are almost complete. (Recall that a code is `complete' if it satisfies the Kraft inequality with equality.) The codes' remaining inefficiency is that they provide the ability to encode the integer zero and the empty string, neither of which was required. 
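To see how close to complete these codes are, we can sum the Kraft terms directly. The following is only a sketch, and it rests on one assumption not stated above: that integers are written without leading zero digits. The symbols $B$ and $k$ are introduced here purely for illustration. With blocks of $b$ bits the base is $B = 2^b - 1$, there are $(B-1)B^{k-1}$ positive integers with exactly $k$ digits, and each of them is encoded in $k+1$ blocks ($k$ digits plus the end-of-file symbol), that is, in $b(k+1)$ bits. The Kraft sum over all positive integers is then
\beq
	\sum_{k=1}^{\infty} (B-1) B^{k-1} \, 2^{-b(k+1)}
	= \frac{ (B-1) \, 2^{-2b} }{ 1 - B \, 2^{-b} }
	= \frac{ B-1 }{ 2^{b} }
	= 1 - 2^{1-b} ,
\eeq
where the second equality uses $1 - B \, 2^{-b} = 2^{-b}$. For $C_3$ ($b=2$) the sum is $1/2$, for $C_7$ ($b=3$) it is $3/4$, and for the base-15 code ($b=4$) it is $7/8$. Under the stated assumption, the missing weight $2^{1-b}$ splits into $2^{-b}$ for the end-of-file symbol sent on its own (the empty string) and $2^{-b}$ for all strings that begin with the digit zero, which include the integer zero. The shortfall halves each time the block size grows by one bit.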
\exercissxB{2}{ex.intEOF}{ Consider the implicit probability distribution over integers corresponding to the code with an end-of-file character. \ben \item If the code has eight-bit blocks (\ie, the integer is coded in base 255), what is the mean length in bits of the integer, under the implicit distribution? \item If one wishes to encode binary files of expected size about one hundred \kilobytes\ using a code with an end-of-file character, what is the optimal block size? \een } \subsection*{Encoding a tiny file} % see claude.p in itp/tex To illustrate the codes we have discussed, we now use each code to encode a small file consisting of just 14 characters, \[ \framebox{\tt{Claude Shannon}}. \] \bit \item If we map the ASCII characters onto seven-bit symbols (\eg, in decimal, ${\tt C}=67$, ${\tt l}=108$, etc.), this 14 character file corresponds to the integer \[ n = 167\,987\,786\,364\,950\,891\,085\,602\,469\,870 \:\:\mbox{(decimal)}. \] \item The unary code for $n$ consists of this many (less one) zeroes, followed by a one. If all the oceans were turned into ink, and if we wrote a hundred bits with every cubic millimeter, % or microlitre there % would be roughly might be enough ink to write $c_{\rm U}(n)$. \item The standard binary representation of $n$ is this length-98 sequence of bits: \beqa c_{\rm b}(n) &=& \begin{array}[t]{l} \tt 1000011110110011000011110101110010011001010100000 \\ \tt 1010011110100011000011101110110111011011111101110. \end{array} \eeqa % To store this self-delimiting file % on a disc, we would need \eit \exercisaxB{2}{ex.claudeshannonn}{ Write down or describe the following self-delimiting representations of the above number $n$: $c_{\alpha}(n)$, $c_{\beta}(n)$, $c_{\gamma}(n)$, $c_{\delta}(n)$, $c_{3}(n)$, $c_{7}(n)$, and $c_{15}(n)$. Which of these encodings is the shortest? [{\sf{Answer:}} $c_{15}$.] 
} % % solution moved to cutsolutions.tex % \subsection{Comparing the codes} One could answer the question `which of two codes is superior?' by a sentence of the form `For $n>k$, code 1 is superior, for $n Secondly, the depiction in terms of Venn diagrams encourages one to believe that all the areas correspond to positive quantities. In the special case of two random variables it is indeed true that $H(X \given Y)$, $\I(X;Y)$ and $H(Y \given X)$ are positive quantities. But as soon as we progress to three-variable ensembles, we obtain a diagram with positive-looking areas that may actually correspond to negative quantities. \Figref{fig.venn3} correctly shows relationships such as \beq H(X) + H(Z \given X) + H(Y \given X,Z) = H(X,Y,Z) . \eeq But it gives the misleading impression that the conditional mutual information $\I(X;Y \given Z)$ is {\em less than\/} the mutual information $\I(X;Y)$. \begin{figure} \figuremargin{%3/4 \begin{center} \mbox{\psfig{figure=figs/venn3.ps,angle=-90,width=5.25in}} \end{center} }{% \caption[a]{A misleading representation of entropies, continued.} \label{fig.venn3} }% \end{figure} In fact the area labelled $A$ can correspond to a {\em negative\/} quantity. Consider the joint ensemble $(X,Y,Z)$ in which $x \in \{0,1\}$ and $y \in \{0,1\}$ are independent binary variables and $z \in \{0,1\}$ is defined to be $z=x+y \mod 2$. Then clearly $H(X) = H(Y) = 1$ bit. Also $H(Z) = 1$ bit. And $H(Y \given X) = H(Y) = 1$ since the two variables are independent. So the mutual information between $X$ and $Y$ is zero. $\I(X;Y) = 0$. However, if $z$ is observed, $X$ and $Y$ become dependent --- % correlated --- knowing $x$, given $z$, tells you what $y$ is: $y = z - x \mod 2$. So $\I(X;Y \given Z) = 1$ bit. Thus the area labelled $A$ must correspond to $-1$ bits for the figure to give the correct answers. The above example is not at all a capricious or exceptional illustration. 
The binary symmetric channel with input $X$, noise $Y$, and output $Z$ % The classic\index{earthquake and burglar alarm}\index{burglar alarm and earthquake} % earthquake-burglar-alarm ensemble \exercisebref{ex.burglar}\ %% (section ???), % with % earthquake $= X$, % burglar $ = Y$ and alarm $= Z$, % is a perfect example of a is a situation in which $\I(X;Y)=0$ (input and noise are independent) % uncorrelated but $\I(X;Y \given Z) > 0$ (once you see the output, the unknown input and the unknown noise are intimately related!). The Venn diagram representation is therefore valid only if one is aware that positive areas may represent negative quantities. With this proviso % As long as this possibility is kept in mind, the interpretation of entropies in terms of sets can be helpful \cite{Yeung1991}. % The quantity corresponding to $A$ is denoted $I(X;Y;Z)$ % by \citeasnoun{Yeung1991}. } \soln{ex.dataprocineq}{% BORDERLINE %{\bf New answer:} For any joint ensemble $XYZ$, the following chain rule for mutual information holds. \beq \I(X;Y,Z) = \I(X;Y) + \I(X;Z \given Y) . \eeq Now, in the case $w \rightarrow d \rightarrow r$, $w$ and $r$ are independent given $d$, so $\I(W;R \given D) = 0$. Using the chain rule twice, we have: \beq \I(W;D,R) = \I(W;D) \eeq and \beq \I(W;D,R) = \I(W;R) + \I(W;D \given R) , \eeq so \beq \I(W;R) - \I(W;D) \leq 0 . \eeq % for more solutions to this problem see % Igraveyard.tex } \prechapter{About Chapter} \fakesection{prerequisites for chapter 5} Before reading \chref{ch.five}, you should have read \chapterref{ch.one} and worked on \exerciseref{ex.rel.ent}, and \exerciserefrange{ex.Hcondnal}{ex.zxymod2}. % \exfifteen--\exeighteen, % \extwenty--\extwentyone, and \extwentythree. % uvw to HXY>0 % {ex.Hmutualineq}{ex.joint}, % \exerciserefrangeshort{ex.rel.ent} % load of H() and I() stuff shoved in here now. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \ENDprechapter \chapter{Communication over a Noisy Channel} \label{ch.five} % % l5.tex % % useful program: bin/capacity.p for checking channel % capacities % % % \part{Noisy Channel Coding} % \chapter{Communication over a noisy channel} % % The noisy-channel coding theorem, part a} % % \chapter{The noisy channel coding theorem, part a} \label{ch5} \section{The big picture} % \setlength{\unitlength}{1mm} \begin{realcenter} %\begin{floatingfigure}[l]{3.2in} \begin{picture}(85,50)(-40,5) \thinlines \put(0,5){\framebox(25,10){\begin{tabular}{c}Noisy\\ channel\end{tabular}}} \put(-20,20){\framebox(25,10){\begin{tabular}{c}Encoder\end{tabular}}} \put(20,20){\framebox(25,10){\begin{tabular}{c}Decoder\end{tabular}}} \put(-20,40){\framebox(25,10){\begin{tabular}{c}Compressor\end{tabular}}} \put(20,40){\framebox(25,10){\begin{tabular}{c}Decompressor\end{tabular}}} \put(-40,40){\makebox(15,10){\begin{tabular}{c}{\sc Source}\\{\sc coding}\end{tabular}}} \put(-40,20){\makebox(15,10){\begin{tabular}{c}{\sc Channel}\\{\sc coding}\end{tabular}}} \put(-20,55){\makebox(25,10){Source}} % \put(-7.5,18){\line(0,-1){8}} \put(-7.5,10){\vector(1,0){6}} \put(32.5,10){\vector(0,1){8}} \put(32.5,10){\line(-1,0){6}} % \put(32.5,31){\vector(0,1){8}} \put(32.5,51){\vector(0,1){6}} \put(-7.5,39){\vector(0,-1){8}} \put(-7.5,57){\vector(0,-1){6}} \end{picture} \end{realcenter} % In\index{channel!noisy} Chapters \ref{ch2}--\ref{ch4}, we discussed source coding with block codes, symbol codes and stream codes. We implicitly assumed that the channel from the compressor to the decompressor was noise-free. Real channels are noisy. We will now spend two chapters on the subject of noisy-channel coding -- the fundamental possibilities and limitations of error-free \ind{communication} through a noisy channel. The aim of channel coding is to make the noisy channel behave like a noiseless channel. 
We will assume that the data to be transmitted has been through a good compressor, so the bit stream has no obvious redundancy. The channel code, which makes the transmission, will put\index{redundancy!in channel code} back % into the transmission redundancy of a special sort, designed to make the noisy received signal decodeable.\index{decoder} Suppose we transmit 1000 bits per second\index{channel!binary symmetric} with $p_0 = p_1 = \dhalf$ over a noisy channel that flips bits with probability $f = 0.1$. What is the rate of transmission of information? % shannon p.35 We might guess that the rate is 900 bits per second by subtracting the expected number of errors per second. But this is not correct, because the recipient does not know where the errors occurred. Consider the case where the noise is so great that the received symbols are independent of the transmitted symbols. This corresponds to a noise level of $f=0.5$, since half of the received symbols are correct due to chance alone. But when $f=0.5$, no information is transmitted at all. % ? cut this clearly? \label{sec.ch5.intro} % refer to exercise {ex.zxymod2}. Given what we have learnt about entropy, it seems reasonable that a measure of the information transmitted is given by the \ind{mutual information} between the source and the received signal, that is, the entropy of the source minus the \ind{conditional entropy} of the source given the received signal. % % shannon calls the conditional entropy the equivocation % and points out that the equivocation is the amount of extra % information needed for a correcting device to figure out % what is going on We will now review the definition of conditional entropy and mutual information. Then we will examine % progress to the question of whether it is possible to use such a noisy channel to communicate {\em reliably}. 
We will % Our aim here is to show that for any channel $Q$ there is a non-zero rate, the \inds{capacity}\index{channel!capacity} $C(Q)$, up to which information can be sent with arbitrarily small probability of error. \section{Review of probability and information} % conditional, joint and mutual information} % We now build on As an example, we take the joint distribution $XY$ from \extwentyone.\label{ex.joint.sol} % % A useful picture breaks down the total information content $H(X,Y)$ % of a joint ensemble thus: % \begin{center} % \setlength{\unitlength}{1in} % \begin{picture}(3,1.13)(0,-0.2) % \put(0,0.7){\framebox(3,0.20){$H(X,Y)$}} % \put(0,0.4){\framebox(2.2,0.20){$H(X)$}} % \put(1.5,0.1){\framebox(1.5,0.20){$H(Y)$}} % \put(1.5125,-0.2){\framebox(0.675,0.20){$\I(X;Y)$}} % \put(0,-0.2){\framebox(1.475,0.20){$H(X \given Y)$}} % \put(2.225,-0.2){\framebox(0.775,0.20){$H(Y \specialgiven X)$}} % \end{picture} % \end{center} % % \subsection{Example of a joint ensemble} % A joint ensemble $XY$ has the following joint distribution. The marginal distributions $P(x)$ and $P(y)$ are shown in the margins.\index{marginal probability} % $P(x,y)$: \[ \begin{array}{cc|cccc|c} \multicolumn{2}{c}{P(x,y)} & \multicolumn{4}{|c|}{x} & P(y) \\[0.051in] & & 1 & 2 & 3 & 4 & \\[0.011in] \hline \strutf &1 & \dfrac{1}{8} & \dfrac{1}{16} & \dfrac{1}{32} & \dfrac{1}{32} & \dfrac{1}{4} \\[0.01in] \raisebox{0mm}{\mbox{$y$}} &2 & \dfrac{1}{16} & \dfrac{1}{8} & \dfrac{1}{32} & \dfrac{1}{32} & \dfrac{1}{4} \\[0.01in] &3 & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{16} & \dfrac{1}{4} \\[0.01in] &4 & \dfrac{1}{4} & 0 & 0 & 0 & \dfrac{1}{4} \\[0.01in] \hline \multicolumn{2}{c|}{P(x)} & \strutf\dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{8} & \dfrac{1}{8} & \\[0.051in] \end{array} \] The joint entropy is $H(X,Y)=27/8$ bits. The marginal entropies are $H(X) = 7/4$ bits and $H(Y) = 2$ bits. 
We can compute the conditional distribution of $x$ for each value of $y$, and the entropy of each of those conditional distributions: \[ \begin{array}{cc|cccc|c} \multicolumn{2}{c|}{P(x \given y)} & \multicolumn{4}{c|}{x} & H(X \given y) / \mbox{bits} \\[0.051in] & & 1 & 2 & 3 & 4 & \\[0.011in] \hline \strutf &1 & \dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{8} & \dfrac{1}{8} & \dfrac{7}{4} \\[0.01in] \raisebox{0mm}{\mbox{$y$}} &2 & \dfrac{1}{4} & \dfrac{1}{2} & \dfrac{1}{8} & \dfrac{1}{8} & \dfrac{7}{4} \\[0.01in] &3 & \dfrac{1}{4} & \dfrac{1}{4} & \dfrac{1}{4} & \dfrac{1}{4} & 2 \\[0.01in] &4 & 1 & 0 & 0 & 0 & 0 \\[0.01in] \hline \multicolumn{3}{c}{\strutf } & \multicolumn{4}{r}{H(X \given Y) = \dfrac{11}{8}} \\[0.1in] \end{array} \] Note that whereas $H(X \given y\eq 4) = 0$ is less than $H(X)$, $H(X \given y\eq 3)$ is greater than $H(X)$. % _s5A.tex has a solution link already \label{ex.Hcondnal.sol} % \label{ex.joint.sol} So in some cases, learning $y$ can % make us more uncertain {\em increase\/} our uncertainty about $x$. Note also that although $P(x \given y\eq 2)$ is a different distribution from $P(x)$, the conditional entropy $H(X \given y\eq 2)$ is equal to $H(X)$. So learning that $y$ is 2 changes our knowledge about $x$ but does not reduce the uncertainty of $x$, as measured by the entropy. On average though, learning $y$ does convey information about $x$, since $H(X \given Y) < H(X)$. One may also evaluate $H(Y \specialgiven X) = 13/8$ bits. The mutual information is $\I(X;Y) = H(X) - H(X \given Y) = 3/8$ bits. 
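As a check on these figures, the averages can be written out explicitly. This is only a restatement of the numbers in the two tables above; nothing new is assumed. Since $P(y) = 1/4$ for every $y$,
\beqan
	H(X \given Y) &=& \sum_{y} P(y) \, H(X \given y) \nonumber \\
	&=& \frac{1}{4} \left( \frac{7}{4} + \frac{7}{4} + 2 + 0 \right) \:\:=\:\: \frac{11}{8} \mbox{ bits}.
\eeqan
The chain rule reproduces the joint entropy, $H(X,Y) = H(Y) + H(X \given Y) = 2 + 11/8 = 27/8$ bits, and applying it the other way round gives $H(Y \specialgiven X) = H(X,Y) - H(X) = 27/8 - 7/4 = 13/8$ bits. The two expressions for the mutual information agree:
\beqan
	\I(X;Y) &=& H(X) - H(X \given Y) \:\:=\:\: 7/4 - 11/8 \:\:=\:\: 3/8 \mbox{ bits}, \nonumber \\
	\I(X;Y) &=& H(Y) - H(Y \specialgiven X) \:\:=\:\: 2 - 13/8 \:\:=\:\: 3/8 \mbox{ bits}.
\eeqan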
% INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % % \subsection{Solutions to a few other exercises} % \input{tex/entropy_soln.tex} % % INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % INCLUDE ME LATER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %\mynewpage MNBV \section{Noisy channels} \begin{description} \item[A discrete memoryless channel $Q$] is\index{channel!discrete memoryless} characterized by an input alphabet $\A_X$, an output alphabet $\A_Y$, and a set of conditional probability distributions $P(y \given x)$, one for each $x \in \A_X$. These {\dbf{transition probabilities}} may be written in a matrix\index{transition probability} \beq Q_{j|i} = P(y\eq b_j \given x\eq a_i) . \eeq % \index{conventions!matrices}\index{conventions!vectors \begin{aside} I\index{notation!conventions of this book}\index{notation!matrices}\index{notation!vectors}\index{notation!transition probability} usually orient this matrix with the output variable $j$ indexing the rows and the input variable $i$ indexing the columns, so that each column of $\bQ$ is a probability vector. With this convention, we can obtain the probability of the output, $\bp_Y$, from a probability distribution over the input, $\bp_X$, by right-multiplication: \beq \bp_Y = \bQ \bp_X . \eeq % \end{aside} % \end{description} \noindent Some useful model channels are: \begin{description} % bsc \item[Binary symmetric channel\puncspace] \indexs{channel!binary symmetric}\indexs{binary symmetric channel} $\A_X \eq \{{\tt 0},{\tt 1}\}$. $\A_Y \eq \{{\tt 0},{\tt 1}\}$. 
\[ \begin{array}{c} \setlength{\unitlength}{0.46mm} \begin{picture}(30,20)(-5,0) \put(-4,9){{\makebox(0,0)[r]{$x$}}} \put(5,2){\vector(1,0){10}} \put(5,16){\vector(1,0){10}} \put(5,4){\vector(1,1){10}} \put(5,14){\vector(1,-1){10}} \put(4,2){\makebox(0,0)[r]{1}} \put(4,16){\makebox(0,0)[r]{0}} \put(16,2){\makebox(0,0)[l]{1}} \put(16,16){\makebox(0,0)[l]{0}} \put(24,9){{\makebox(0,0)[l]{$y$}}} \end{picture} \end{array} \hspace{1in} \begin{array}{ccl} P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1 - \q ; \\ P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& \q ; \end{array} \begin{array}{ccl} P(y\eq {\tt 0} \given x\eq {\tt 1}) &=& \q ; \\ P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1 - \q . \end{array} \hspace{1in} \begin{array}{c} \ecfig{bsc15.1} \end{array} \] % % \BEC bec BEC % \item[Binary erasure channel\puncspace] \indexs{channel!binary erasure}\indexs{binary erasure channel} $\A_X \eq \{{\tt 0},{\tt 1}\}$. $\A_Y \eq \{{\tt 0},\mbox{\tt ?},{\tt 1}\}$. \[ \begin{array}{c} \setlength{\unitlength}{0.46mm} \begin{picture}(30,30)(-5,0) \put(-4,15){{\makebox(0,0)[r]{$x$}}} \put(5,5){\vector(1,0){10}} \put(5,25){\vector(1,0){10}} \put(5,5){\vector(1,1){10}} \put(5,25){\vector(1,-1){10}} \put(4,5){\makebox(0,0)[r]{\tt 1}} \put(4,25){\makebox(0,0)[r]{\tt 0}} \put(16,5){\makebox(0,0)[l]{\tt 1}} \put(16,25){\makebox(0,0)[l]{\tt 0}} \put(16,15){\makebox(0,0)[l]{\tt ?}} \put(24,15){{\makebox(0,0)[l]{$y$}}} \end{picture} \end{array} \hspace{1in} \begin{array}{ccl} P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1 - \q ; \\ P(y\eq \mbox{\tt ?} \given x\eq {\tt 0}) &=& \q ; \\ P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& 0 ; \end{array} \begin{array}{ccl} P(y\eq {\tt 0} \given x\eq {\tt 1}) &=& 0 ; \\ P(y\eq \mbox{\tt ?} \given x\eq {\tt 1}) &=& \q ; \\ P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1 - \q . 
\end{array} \hspace{1in} \begin{array}{c} \ecfig{bec.1} \end{array} \] \item[Noisy typewriter\puncspace] \indexs{channel!noisy typewriter}\indexs{noisy typewriter} $\A_X = \A_Y = \mbox{the 27 letters $\{${\tt A}, {\tt B}, \ldots, {\tt Z}, {\tt -}$\}$}$. The letters are arranged in a circle, and when the typist attempts to type {\tt B}, what comes out is either {\tt A}, {\tt B} or {\tt C}, with probability \dfrac{1}{3} each; when the input is {\tt C}, the output is {\tt B}, {\tt C} or {\tt D}; and so forth, with the final letter `{\tt -}' % being adjacent to the first letter {\tt A}. \[ \begin{array}{c} \setlength{\unitlength}{1pt} \begin{picture}(48,130)(0,2) \thinlines \put(5,5){\vector(3,0){30}} \put(5,25){\vector(3,0){30}} \put(5,15){\vector(3,0){30}} \put(5,5){\vector(3,1){30}} \put(5,25){\vector(3,-1){30}} \put(4,5){\makebox(0,0)[r]{{\tt -}}} \put(4,15){\makebox(0,0)[r]{{\tt Z}}} \put(4,25){\makebox(0,0)[r]{{\tt Y}}} \put(36,5){\makebox(0,0)[l]{{\tt -}}} \put(36,15){\makebox(0,0)[l]{{\tt Z}}} \put(36,25){\makebox(0,0)[l]{{\tt Y}}} % \put(5,15){\vector(3,1){30}} \put(5,15){\vector(3,-1){30}} \put(5,25){\vector(3,0){30}} \put(5,25){\vector(3,1){30}} \put(20,43){\makebox(0,0){$\vdots$}} % %\put(5,35){\vector(3,0){30}} %\put(5,35){\vector(3,1){30}} \put(5,35){\vector(3,-1){30}} %\put(5,45){\vector(3,0){30}} \put(5,45){\vector(3,1){30}} %\put(5,45){\vector(3,-1){30}} \put(5,55){\vector(3,0){30}} \put(5,55){\vector(3,1){30}} \put(5,55){\vector(3,-1){30}} \thicklines \put(5,65){\vector(3,0){30}} \put(5,65){\vector(3,1){30}} \put(5,65){\vector(3,-1){30}} \thinlines \put(5,75){\vector(3,0){30}} \put(5,75){\vector(3,1){30}} \put(5,75){\vector(3,-1){30}} \put(5,85){\vector(3,0){30}} \put(5,85){\vector(3,1){30}} \put(5,85){\vector(3,-1){30}} \put(5,95){\vector(3,0){30}} \put(5,95){\vector(3,1){30}} \put(5,95){\vector(3,-1){30}} \put(5,105){\vector(3,0){30}} \put(5,105){\vector(3,1){30}} \put(5,105){\vector(3,-1){30}} \put(5,115){\vector(3,0){30}} 
\put(5,115){\vector(3,1){30}} \put(5,115){\vector(3,-1){30}} \put(5,125){\vector(3,0){30}} \put(5,125){\vector(3,-1){30}} \put(5,5){\vector(1,4){30}} \put(5,125){\vector(1,-4){30}} %\put(4,35){\makebox(0,0)[r]{{\tt J}}} %\put(36,35){\makebox(0,0)[l]{{\tt J}}} %\put(4,45){\makebox(0,0)[r]{{\tt I}}} %\put(36,45){\makebox(0,0)[l]{{\tt I}}} \put(4,55){\makebox(0,0)[r]{{\tt H}}} \put(36,55){\makebox(0,0)[l]{{\tt H}}} \put(4,65){\makebox(0,0)[r]{{\tt G}}} \put(36,65){\makebox(0,0)[l]{{\tt G}}} \put(4,75){\makebox(0,0)[r]{{\tt F}}} \put(36,75){\makebox(0,0)[l]{{\tt F}}} \put(4,85){\makebox(0,0)[r]{{\tt E}}} \put(36,85){\makebox(0,0)[l]{{\tt E}}} \put(4,95){\makebox(0,0)[r]{{\tt D}}} \put(36,95){\makebox(0,0)[l]{{\tt D}}} \put(4,105){\makebox(0,0)[r]{{\tt C}}} \put(36,105){\makebox(0,0)[l]{{\tt C}}} \put(4,115){\makebox(0,0)[r]{{\tt B}}} \put(36,115){\makebox(0,0)[l]{{\tt B}}} \put(4,125){\makebox(0,0)[r]{{\tt A}}} \put(36,125){\makebox(0,0)[l]{{\tt A}}} \end{picture} \end{array} \hspace{1in} \begin{array}{ccl} & \vdots & \\ P(y\eq {\tt F} \given x\eq {\tt G}) &=& 1/3 ; \\ P(y\eq {\tt G} \given x\eq {\tt G}) &=& 1/3 ; \\ P(y\eq {\tt H} \given x\eq {\tt G}) &=& 1/3 ; \\ & \vdots & \end{array} \hspace{1.2in} \begin{array}{c} \ecfig{type} \end{array} \] \item[Z channel\puncspace] \indexs{channel!Z channel}\indexs{Z channel} $\A_X \eq \{{\tt 0},{\tt 1}\}$. $\A_Y \eq \{{\tt 0},{\tt 1}\}$. 
\[ % \begin{array}{c} % \setlength{\unitlength}{0.46mm} % \begin{picture}(20,20)(0,0) % \put(5,5){\vector(1,0){10}} % \put(5,15){\vector(1,0){10}} % \put(5,5){\vector(1,1){10}} % \put(4,5){\makebox(0,0)[r]{1}} % \put(4,15){\makebox(0,0)[r]{0}} % \put(16,5){\makebox(0,0)[l]{1}} % \put(16,15){\makebox(0,0)[l]{0}} % \end{picture} % \end{array} \begin{array}{c} \setlength{\unitlength}{0.46mm} \begin{picture}(30,20)(-5,0) \put(-4,9){{\makebox(0,0)[r]{$x$}}} \put(5,2){\vector(1,0){10}} \put(5,16){\vector(1,0){10}} \put(5,4){\vector(1,1){10}} % \put(5,14){\vector(1,-1){10}} \put(4,2){\makebox(0,0)[r]{1}} \put(4,16){\makebox(0,0)[r]{0}} \put(16,2){\makebox(0,0)[l]{1}} \put(16,16){\makebox(0,0)[l]{0}} \put(24,9){{\makebox(0,0)[l]{$y$}}} \end{picture} \end{array} \hspace{1in} \begin{array}{ccl} P(y\eq {\tt 0} \given x\eq {\tt 0}) &=& 1 ; \\ P(y\eq {\tt 1} \given x\eq {\tt 0}) &=& 0 ; \\ \end{array} \begin{array}{ccl} P(y\eq {\tt 0} \given x\eq {\tt 1}) &=& \q ; \\ P(y\eq {\tt 1} \given x\eq {\tt 1}) &=& 1- \q .\\ \end{array} \hspace{1in} %\:\:\:\:\:\: \begin{array}{c} \ecfig{z15.1} \end{array} \] % {\em Check if this orientation of the channel disagrees % with any demonstrations.} \end{description} \section{Inferring the input given the output} % was a subsection % a single transmission} If we assume that the input $x$ to a channel comes from an ensemble $X$, then we obtain a joint ensemble $XY$ in which the random variables $x$ and $y$ have the joint distribution: \beq P(x,y) = P(y \given x) P(x) . \eeq Now if we receive a particular symbol $y$, what was the input symbol $x$? We typically won't know for certain. We can write down the posterior distribution of the input using \Bayes\ theorem:\index{Bayes' theorem} \beq P(x \given y) = \frac{ P(y \given x) P(x) }{P(y) } = \frac{ P(y \given x) P(x) }{\sum_{x'} P(y \given x') P(x') } . 
\eeq \exampla{ %{\sf Example 1:} Consider a % \index{channel!binary symmetric}\ind{binary symmetric channel} {binary symmetric channel} with probability of error $\q\eq 0.15$. Let the input ensemble be $\P_X: \{p_0 \eq 0.9, p_1 \eq 0.1\}$. Assume we observe $y\eq 1$. \beqan P(x\eq 1 \given y\eq 1) &=&\frac{ P(y\eq 1 \given x\eq 1) P(x\eq 1) }{\sum_{x'} P(y \given x') P(x') } \nonumber \\ &\eq & \frac{ 0.85 \times 0.1 }{ 0.85 \times 0.1 + 0.15 \times 0.9 } \nonumber \\ &=& \frac{ 0.085 }{ 0.22 } \:\:=\:\: 0.39 . \eeqan Thus `$x\eq 1$' is still less probable than `$x\eq 0$', although it is not as improbable as it was before. } % Could turn this into an exercise. % Alternatively, assume we observe $y\eq 0$. % \beqa % P(x\eq 1 \given y\eq 0) &=& \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'} P(y \given x') P(x')} \\ % &=& \frac{ 0.15 \times 0.1 }{ 0.15 \times 0.1 + 0.85 \times 0.9 } \\ % &=& \frac{ 0.015 }{0.78} = 0.019 . % \eeqa \exercissxA{1}{ex.bscy0}{ Now assume we observe $y\eq 0$. Compute the probability of $x\eq 1$ given $y\eq 0$. } \exampla{ %{\sf Example 2:} Consider a \ind{Z channel}\index{channel!Z channel} with probability of error $\q\eq 0.15$. Let the input ensemble be $\P_X: \{p_0 \eq 0.9, p_1 \eq 0.1\}$. Assume we observe $y\eq 1$. \beqan P(x\eq 1 \given y\eq 1) &=& \frac{ 0.85 \times 0.1 }{ 0.85 \times 0.1 + 0 \times 0.9 } \nonumber \\ &=& \frac{ 0.085}{0.085} \:\:=\:\: 1.0 . \eeqan So given the output $y\eq 1$ we become certain of the input. } % Alternatively, assume we observe $y\eq 0$. % \beqa % P(x\eq 1 \given y\eq 0) % % &=& \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'} P(y \given x') P(x')} \\ % &=& \frac{ 0.15 \times 0.1 }{ 0.15 \times 0.1 + 1.0 \times 0.9 } \\ % &=& \frac{ 0.015}{ 0.915} = 0.016 . % \eeqa \exercissxA{1}{ex.zcy0}{ Alternatively, assume we observe $y\eq 0$. Compute $P(x\eq 1 \given y\eq 0)$. } \section{Information conveyed by a channel} We now consider how much information can be communicated through a channel. 
In {\em operational\/} terms, we are interested in finding ways of using the channel such that all the bits that are communicated are recovered with negligible probability of error. In {\em mathematical\/} terms, assuming a particular input ensemble $X$, we can measure how much information the output conveys about the input by the mutual information: \beq \I(X;Y) \equiv H(X) - H(X \given Y) = H(Y) - H(Y \specialgiven X) . \eeq Our aim is to establish the connection between these two ideas. Let us evaluate $\I(X;Y)$ for some of the channels above. \subsection{Hint for computing mutual information} \index{hint for computing mutual information}\index{mutual information!how to compute}We will tend to think of $\I(X;Y)$ as $H(X) - H(X \given Y)$, \ie, how much the uncertainty of the input $X$ is reduced when we look at the output $Y$. But for computational purposes it is often handy to evaluate $H(Y) - H(Y \specialgiven X)$ instead. %\medskip % this reproduced from _p5A.tex, figure 9.1 {fig.entropy.breakdown} \begin{figure}[htbp] \figuremargin{% \begin{center} % % included by l1.tex % \setlength{\unitlength}{1in} \begin{picture}(3,1.13)(0,-0.2) \put(0,0.7){\framebox(3,0.20){$H(X,Y)$}} \put(0,0.4){\framebox(2.2,0.20){$H(X)$}} \put(1.5,0.1){\framebox(1.5,0.20){$H(Y)$}} \put(1.5125,-0.2){\framebox(0.675,0.20){$\I(X;Y)$}} \put(0,-0.2){\framebox(1.475,0.20){$H(X\,|\,Y)$}} \put(2.225,-0.2){\framebox(0.775,0.20){$H(Y|X)$}} \end{picture} \end{center} }{% \caption[a]{The relationship between joint entropy, marginal entropy, conditional entropy and mutual information. This figure is important, so I'm showing it twice.} \label{fig.entropy.breakdown.again} }% \end{figure} %\begin{center} %\input{tex/entropyfig.tex} %\end{center} %\noindent \exampla{ %{\sf Example 1:} Consider the % \index{channel!binary symmetric}\index{binary symmetric channel} \BSC\ again, with $\q\eq 0.15$ and $\P_X: \{p_0 \eq 0.9, p_1 \eq 0.1\}$.
We already evaluated the marginal probabilities $P(y)$ implicitly above: $P(y\eq 0) = 0.78$; $P(y\eq 1) = 0.22$. The mutual information is: \beqa \I(X;Y) &=& H(Y) - H(Y \specialgiven X) . \eeqa What is $H(Y \specialgiven X)$? It is defined to be the weighted sum over $x$ of $H(Y \given x)$; but $H(Y \given x)$ is the same for each value of $x$: $H(Y \given x\eq{\tt{0}})$ is $H_2(0.15)$, and $H(Y \given x\eq{\tt{1}})$ is $H_2(0.15)$. So \beqan \I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber \\ &=& H_2(0.22) - H_2(0.15) \nonumber \\ & =& 0.76 - 0.61 \:\: = \:\: 0.15 \mbox{ bits}. \eeqan % this used to be in error (0.15) This may be contrasted with the entropy of the source $H(X) = H_2(0.1) = 0.47$ bits. Note: here we have used the binary entropy function $H_2(p) \equiv H(p,1\!-\!p)=p \log \frac{1}{p} + (1-p)\log \frac{1}{(1-p)}$.\marginpar{\small\raggedright\reducedlead{Throughout this book, $\log$ means $\log_2$.}} } %\medskip % \noindent \exampla{ % {\sf Example~2:} And now the \ind{Z channel}\index{channel!Z channel}, with $\P_X$ as above. % $P(y\eq 0)\eq 0.915; $P(y\eq 1)\eq 0.085$. \beqan \I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber \\ &=& H_2(0.085) - [ 0.9 H_2(0) + 0.1 H_2(0.15) ] \nonumber \\ &=& 0.42 - ( 0.1 \times 0.61 ) = 0.36 \mbox{ bits}. \eeqan The entropy of the source, as above, is $H(X) = 0.47$ bits. Notice that the mutual information $\I(X;Y)$ for the Z channel is bigger than the mutual information for the binary symmetric channel with the same $\q$. The Z channel is a more reliable channel. % is fits with our intuition that the } \exercissxA{1}{ex.bscMI}{Compute the mutual information between $X$ and $Y$ for the \BSC\ with $\q\eq 0.15$ when the input distribution is $\P_X = \{p_0 \eq 0.5, p_1 \eq 0.5\}$. } \exercissxA{2}{ex.zcMI}{Compute the mutual information between $X$ and $Y$ for the Z channel with $\q=0.15$ when the input distribution is $\P_X: \{p_0 \eq 0.5, p_1 \eq 0.5\}$. 
} \subsection{Maximizing the mutual information} We have observed in the above examples that the mutual information between the input and the output depends on the chosen {input ensemble}\index{channel!input ensemble}. Let us assume that we wish to maximize the mutual information conveyed by the channel by choosing the best possible input ensemble. We define the {\dbf\inds{capacity}\/} of the channel\index{channel!capacity} to be its maximum \ind{mutual information}. \begin{description} \item[The capacity] of a channel $Q$ is: \beq C(Q) = \max_{\P_X} \, \I(X;Y) . \eeq The distribution $\P_X$ that achieves the maximum is called the {\dem{\optens}},\indexs{optimal input distribution} denoted by $\P_X^*$. [There may be multiple {\optens}s achieving the same value of $\I(X;Y)$.] \end{description} % In \chref{ch6} we will show that the capacity does indeed measure the maximum amount of error-free information that can be transmitted % is transmittable % yes, spell checked over the channel per unit time. % \medskip % Sun 22/8/04 am having problems trying to get fig 9.2 to go at head % of p 151 - putting it there causes text to move. %\noindent \exampla{ %{\sf Example 1:} Consider the \BSC\ with $\q \eq 0.15$. Above, we considered $\P_X = \{p_0 \eq 0.9, p_1 \eq 0.1\}$, and found $\I(X;Y) = 0.15$ bits. % the page likes to break here How much better can we do? By symmetry, the \optens\ is $\{ 0.5, 0.5\}$ and% \amarginfig{t}{ \mbox{% %\begin{figure}[htbp] \small %\floatingmargin{% %\figuremargin{% \raisebox{0.91in}{$\I(X;Y)$}% \hspace{-0.42in}% \begin{tabular}{c} \mbox{\psfig{figure=figs/IXY.15.ps,% width=45mm,angle=-90}}\\[-0.1in] $p_1$ \end{tabular} } %}{% \caption[a]{The mutual information $\I(X;Y)$ for a binary symmetric channel with $\q=0.15$ as a function of the input distribution. % (\eqref{eq.IXYBSC}). } \label{fig.IXYBSC} } %%% the capacity is \beq C(Q_{\rm BSC}) \:=\: H_2(0.5) - H_2(0.15) \:=\: 1.0 - 0.61 \:=\: 0.39 \ubits. 
\eeq We'll justify the \ind{symmetry argument}\index{capacity!symmetry argument} later. If there's any doubt about the % such a symmetry argument, we can always resort to explicit maximization of the \ind{mutual information} $I(X;Y)$, \beq I(X;Y) = H_2( (1\!-\!\q)p_1 + (1\!-\!p_1)\q ) - H_2(\q) \ \ \mbox{ (\figref{fig.IXYBSC}). } \label{eq.IXYBSC} \eeq } % \medskip % \noindent % {\sf Example 2:} \exampl{exa.typewriter}{ The noisy typewriter. The \optens\ is a uniform distribution over $x$, and gives $C = \log_2 9$ bits. } % \medskip % \noindent \exampl{exa.Z.HXY}{ % {\sf Example 3:} Consider the \ind{Z channel} with $\q \eq 0.15$. Identifying the \optens\ is not so straightforward. We evaluate $\I(X;Y)$ explicitly for $\P_X = \{p_0, p_1\}$. First, we need to compute $P(y)$. The probability of $y\eq 1$ is easiest to write down: \beq P(y\eq 1) \:\:=\:\: p_1 (1-\q) . \eeq Then% \amarginfig{t}{ %\begin{figure}[htbp] \mbox{% \small %\floatingmargin{% %\figuremargin{% \raisebox{0.91in}{$\I(X;Y)$}% \hspace{-0.42in}% \begin{tabular}{c} \mbox{\psfig{figure=figs/HXY.ps,% width=45mm,angle=-90}}\\[-0.1in] $p_1$ \end{tabular} } %}{% \caption{The mutual information $\I(X;Y)$ for a Z channel with $\q=0.15$ as a function of the input distribution.} \label{hxyz} } %\end{figure} %%%%%%%%%%%%% old: %\begin{figure}[htbp] %\small %\begin{center} %\raisebox{1.3in}{$\I(X;Y)$}% %\hspace{-0.2in}% %\begin{tabular}{c} %\mbox{\psfig{figure=figs/HXY.ps,% %width=60mm,angle=-90}}\\ %$p_1$ %\end{tabular} %\end{center} %\caption[a]{The mutual information $\I(X;Y)$ for a Z channel with $\q=0.15$ % as a function of the input distribution.} %% (Horizontal axis $=p_1$.)} %\label{hxyz.old} %\end{figure} the mutual information is: \beqan \I(X;Y) &=& H(Y) - H(Y \specialgiven X) \nonumber \\ &=& H_2(p_1 (1-\q)) - ( p_0 H_2(0) + p_1 H_2(\q) ) \nonumber \\ &=& H_2(p_1 (1-\q)) - p_1 H_2(\q) . \eeqan This is a non-trivial function of $p_1$, shown in \figref{hxyz}. 
It is maximized for $\q=0.15$ by % the \optens\ $p_1^* = 0.445$. We find $C(Q_{\rm Z}) = 0.685$. Notice % that the \optens\ is not $\{ 0.5,0.5 \}$. We can communicate slightly more information by using input symbol {\tt{0}} more frequently than {\tt{1}}. } %\noindent {\sf Exercise b:} \exercissxA{1}{ex.bscC}{ What is the capacity of the \ind{binary symmetric channel} for general $\q$?\index{channel!binary symmetric} } \exercissxA{2}{ex.becC}{ Show that the capacity of the \ind{binary erasure channel}\index{channel!binary erasure} with $\q=0.15$ is $C_{\rm BEC} = 0.85$. What is its capacity for general $\q$? Comment. } % \bibliography{/home/mackay/bibs/bibs} %\section{The Noisy Channel Coding Theorem} \section{The noisy-channel coding theorem} It seems plausible that the `capacity' we have defined may be a measure of information conveyed by a channel; what is not obvious, and what we will prove in the next chapter, is that the \ind{capacity} indeed measures the rate at which blocks of data can be communicated over the channel {\em with arbitrarily small probability of error}. We make the following definitions.\label{sec.whereCWMdefined} \begin{description} \item[An $(N,K)$ {block code}] for\indexs{error-correcting code!block code} a channel $Q$ is a list of $\cwM=2^K$ codewords $$\{ \bx^{(1)}, \bx^{(2)}, \ldots, \bx^{({2^K)}} \}, \:\:\:\:\:\bx^{(\cwm)} \in \A_X^N ,$$ each of length $N$. Using this code we can encode a signal $\cwm \in \{ 1,2,3,\ldots, 2^K\}$ % The signal to be encoded is assumed to come from an % alphabet of size $2^K$; signal $m$ is encoded as $\bx^{(\cwm)}$. [The number of codewords $\cwM$ is an integer, but the number of bits specified by choosing a codeword, $K \equiv \log_2 \cwM$, is not necessarily an integer.] The {\dbf \inds{rate}\/} of\index{error-correcting code!rate} the code is $R = K/N$ bits per channel use. % character. 
[We will use this definition of the rate for any channel, not only channels with binary inputs; note however that it is sometimes conventional to define the rate of a code for a channel with $q$ input symbols to be $K/(N\log q)$.] % \item[A linear $(N,K)$ block code] is a block code in which all % moved into leftovers.tex \item[A \ind{decoder}] for an $(N,K)$ block code is a mapping from the set of length-$N$ strings of channel outputs, $\A_Y^N$, to a codeword label $\hat{\cwm} \in \{ 0 , 1 , 2 , \ldots, 2^K \}$. The extra symbol $\hat{\cwm} \eq 0$ can be used to indicate a `failure'. \item[The \ind{probability of block error}\index{error probability!block}] % $p_B$ of a code and decoder, for a given channel, and for a given probability distribution over the encoded signal $P(\cwm_{\rm in})$, is: \beq p_{\rm B} = \sum_{\cwm_{\rm in}} P( \cwm_{\rm in} ) P( \cwm_{\rm out} \! \not = \! \cwm_{\rm in} \given \cwm_{\rm in} ) . \eeq % the probability % that the decoded signal $\cwm_{\rm out}$ is not equal to $\cwm_{\rm in}$. \item[The maximal probability of block error] is \beq p_{\rm BM} = \max_{\cwm_{\rm in}} P( \cwm_{\rm out} \! \not = \! \cwm_{\rm in} \given \cwm_{\rm in} ) . \eeq \item[The \ind{optimal decoder}] for a channel code is the one that minimizes the probability of block error. It decodes an output $\by$ as the input $\cwm$ that has maximum \ind{posterior probability} $P(\cwm \given \by)$. \beq P(\cwm \given \by) = \frac{ P(\by \given \cwm ) P(\cwm) } { \sum_{\cwm' } P(\by \given \cwm') P(\cwm') } \eeq \beq \hat{\cwm}_{\rm optimal} = \argmax % _{\cwm} % did not appear underneath P(\cwm \given \by) . \eeq A uniform prior distribution on $\cwm$ is usually assumed, in which case the optimal decoder is also the {\dem \ind{maximum likelihood} decoder}, \ie, the decoder that maps an output $\by$ to the input $\cwm$ that has maximum {\dem \ind{likelihood}} $P(\by \given \cwm )$. 
\item[The probability of bit error] $p_{\rm b}$ is defined assuming that the codeword number $\cwm$ is represented by a binary vector $\bs$ of length $K$ bits; it is the average probability that a bit of $\bs_{\rm out}$ is not equal to the corresponding bit of $\bs_{\rm in}$ (averaging over all $K$ bits). \item[Shannon's\index{Shannon, Claude} \ind{noisy-channel coding theorem} (part one)\puncspace] %\begin{quote} Associated with each discrete memoryless channel, \marginfig{ \begin{center} \setlength{\unitlength}{2pt} \begin{picture}(60,45)(-2.5,-7) \thinlines \put(0,0){\vector(1,0){60}} \put(0,0){\vector(0,1){40}} \put(30,-3){\makebox(0,0)[t]{$C$}} \put(55,-2){\makebox(0,0)[t]{$R$}} \put(-1,35){\makebox(0,0)[r]{$p_{\rm BM}$}} \thicklines \put(0,0){\dashbox{3}(30,30){achievable}} % \put(0,0){\line(0,1){50}} % \end{picture} \end{center} \caption[a]{Portion of the $R,p_{\rm BM}$ plane asserted to be achievable by the first part of Shannon's noisy channel coding theorem.} \label{fig.belowCthm} }%end marginfig there is a non-negative number $C$ (called the channel capacity) with the following property. For any $\epsilon > 0$ and $R < C$, for large enough $N$, there exists a block code of length $N$ and rate $\geq R$ and a decoding algorithm, such that the maximal probability of block error is $< \epsilon$. 
%\end{quote} % \item[The negative part of the theorem\puncspace] moved to graveyard.tex Sun 3/2/02 \end{description} \begin{figure}[htbp] \figuremargin{% \[ \begin{array}{c} \setlength{\unitlength}{1pt} \begin{picture}(48,120)(0,5) \thinlines %\put(5,5){\vector(3,0){30}} %\put(5,25){\vector(3,0){30}} \put(5,15){\vector(3,0){30}} %\put(5,5){\vector(3,1){30}} %\put(5,25){\vector(3,-1){30}} % \put(4,5){\makebox(0,0)[r]{{\tt -}}} \put(4,15){\makebox(0,0)[r]{{\tt Z}}} % \put(4,25){\makebox(0,0)[r]{{\tt Y}}} \put(36,5){\makebox(0,0)[l]{{\tt -}}} \put(36,15){\makebox(0,0)[l]{{\tt Z}}} \put(36,25){\makebox(0,0)[l]{{\tt Y}}} % \put(5,15){\vector(3,1){30}} \put(5,15){\vector(3,-1){30}} %\put(5,25){\vector(3,0){30}} %\put(5,25){\vector(3,1){30}} \put(20,40){\makebox(0,0){$\vdots$}} % %\put(5,35){\vector(3,0){30}} %\put(5,35){\vector(3,1){30}} % \put(5,35){\vector(3,-1){30}} %\put(5,45){\vector(3,0){30}} % \put(5,45){\vector(3,1){30}} %\put(5,45){\vector(3,-1){30}} \put(5,55){\vector(3,0){30}} \put(5,55){\vector(3,1){30}} \put(5,55){\vector(3,-1){30}} % \thicklines % \put(5,65){\vector(3,0){30}} % \put(5,65){\vector(3,1){30}} % \put(5,65){\vector(3,-1){30}} % \thinlines % \put(5,75){\vector(3,0){30}} % \put(5,75){\vector(3,1){30}} % \put(5,75){\vector(3,-1){30}} \put(5,85){\vector(3,0){30}} \put(5,85){\vector(3,1){30}} \put(5,85){\vector(3,-1){30}} % \put(5,95){\vector(3,0){30}} % \put(5,95){\vector(3,1){30}} % \put(5,95){\vector(3,-1){30}} %\put(5,105){\vector(3,0){30}} %\put(5,105){\vector(3,1){30}} %\put(5,105){\vector(3,-1){30}} \put(5,115){\vector(3,0){30}} \put(5,115){\vector(3,1){30}} \put(5,115){\vector(3,-1){30}} %\put(5,125){\vector(3,0){30}} %\put(5,125){\vector(3,-1){30}} % %\put(5,5){\vector(1,4){30}} %\put(5,125){\vector(1,-4){30}} \put(36,45){\makebox(0,0)[l]{{\tt I}}} \put(4,55){\makebox(0,0)[r]{{\tt H}}} \put(36,55){\makebox(0,0)[l]{{\tt H}}} % \put(4,65){\makebox(0,0)[r]{{\tt G}}} \put(36,65){\makebox(0,0)[l]{{\tt G}}} % \put(4,75){\makebox(0,0)[r]{{\tt F}}} 
\put(36,75){\makebox(0,0)[l]{{\tt F}}} \put(4,85){\makebox(0,0)[r]{{\tt E}}} \put(36,85){\makebox(0,0)[l]{{\tt E}}} % \put(4,95){\makebox(0,0)[r]{{\tt D}}} \put(36,95){\makebox(0,0)[l]{{\tt D}}} % \put(4,105){\makebox(0,0)[r]{{\tt C}}} \put(36,105){\makebox(0,0)[l]{{\tt C}}} \put(4,115){\makebox(0,0)[r]{{\tt B}}} \put(36,115){\makebox(0,0)[l]{{\tt B}}} % \put(4,125){\makebox(0,0)[r]{{\tt A}}} \put(36,125){\makebox(0,0)[l]{{\tt A}}} \end{picture} \end{array} \hspace{1.5in} \begin{array}{c} % roughly 8pts from col to col \setlength{\unitlength}{1.005pt}% this was 1pt in jan 2000, I tweaked it \begin{picture}(50,110)(-5,-5) \thinlines \put(-5,-5){\ecfig{type}} \multiput(7.95,-3)(12,0){9}{\framebox(4,126){}} %\put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}} %\put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}} % \end{picture} \end{array} \] }{% \caption[a]{A non-confusable subset of inputs for the noisy typewriter.} \label{fig.typenine} } \end{figure} \subsection{Confirmation of the theorem for the noisy typewriter channel} In the case of the \ind{noisy typewriter}\index{channel!noisy typewriter}, we can easily confirm the % positive part of the theorem, % For this channel, because we can create a % n {\em error-free\/} completely error-free communication strategy using a block code of length $N =1$: we use only the letters {\tt B}, {\tt E}, {\tt H}, \ldots, {\tt Z}, \ie, every third letter. These letters form a {\dem non-confusable subset\/}\index{non-confusable inputs} of the input alphabet (see \figref{fig.typenine}). Any output can be uniquely decoded. The number of inputs in the non-confusable subset is 9, so the error-free information rate of this system is $\log_2 9$ bits, which is equal to the capacity $C$, which we evaluated in \exampleref{exa.typewriter}. % How does this translate into the terms of the theorem? 
The following table explains.\medskip
%\begin{center}
\begin{raggedright}
\noindent
% THIS TABLE IS DELIBERATELY FULL WIDTH
% for textwidth, use this
% \begin{tabular}{p{2.2in}p{2.5in}}
\begin{tabular}{@{}p{2.7in}p{4.1in}@{}}
\multicolumn{1}{@{}l}{\sf The theorem} & \multicolumn{1}{l}{\sf How it applies to the noisy typewriter } \\ \midrule
\raggedright\em Associated with each discrete memoryless channel, there is a non-negative number $C$.
% (called the channel capacity).
 & The capacity $C$ is $\log_2 9$. \\[0.047in]
\raggedright\em For any $\epsilon > 0$ and $R < C$, for large enough $N$,
 &
% Assume we are given an $R < C$ and an $\epsilon > 0$.
 No matter what $\epsilon$ and $R$ are, we set the blocklength $N$ to 1. \\[0.047in]
\raggedright\em there exists a block code of length $N$ and rate $\geq R$
 & The block code is
% can be
 the following list of nine codewords: $\{{\tt B,E,\ldots,Z}\}$. The value of $K$ is given by $2^K = 9$, so $K=\log_2 9$, and this code has rate $\log_2 9$, which is greater than the requested value of $R$. \\[0.047in]
\raggedright\em and a decoding algorithm,
 & The decoding algorithm maps the received letter to the nearest letter in the code; \\[0.047in]
\raggedright\em such that the maximal probability of block error is $< \epsilon$.
 & the maximal probability of block error is zero, which is less than the given $\epsilon$. \\
\end{tabular}
\end{raggedright}
%\end{center}
% is greater than or equal
% to 1
% source RUNME
\section{Intuitive preview of proof}
\subsection{Extended channels}
 To prove the theorem for any given channel, we consider the {\dem \ind{extended channel}\index{channel!extended}} corresponding to $N$ uses of the
% original
 channel. The extended channel has $|\A_X|^N$ possible inputs $\bx$ and $|\A_Y|^N$ possible outputs.
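
 As a small worked illustration (a sketch only, assuming the channel is memoryless, as in the statement of the theorem, and writing $Q(y \given x)$ for the transition probabilities of a single use), the transition probabilities of the extended channel factorize over the $N$ uses:
\beq
 Q(\by \given \bx) = \prod_{n=1}^{N} Q(y_n \given x_n) .
\eeq
 For the \BSC\ with $\q \eq 0.15$ and $N \eq 2$, for example, there are $2^2 = 4$ inputs and $4$ outputs, and every column of the $4 \times 4$ transition probability matrix contains the entries
\beq
 (1-\q)^2 = 0.7225 , \:\:\:\: \q (1-\q) = 0.1275 \mbox{ (twice)} , \:\:\:\: \q^2 = 0.0225 ,
\eeq
 which sum to one, as the entries of a column must.
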
% {\em add a picture of extended channel here.}
%
\begin{figure}
\figuremargin{%
\small\begin{center}
\begin{tabular}{cccc}
%$\bQ$
 & \ecfig{bsc15.1} & \ecfig{bsc15.2} & \ecfig{bsc15.4} \\
 & $N=1$ & $N=2$ & $N=4$ \\
\end{tabular}
\end{center}
}{%
\caption{Extended channels obtained from a binary symmetric channel with transition probability 0.15.}
\label{fig.extended.bsc15}
}
\end{figure}
%
\begin{figure}
\figuremargin{%
\small\begin{center}
\begin{tabular}{cccc}
%$\bQ$
 & \ecfig{z15.1} & \ecfig{z15.2} & \ecfig{z15.4} \\
 & $N=1$ & $N=2$ & $N=4$ \\
\end{tabular}
\end{center}
}{%
\caption{Extended channels obtained from a Z channel with transition probability 0.15. Each column corresponds to an input, and each row is a different output.}
\label{fig.extended.z15}
}
\end{figure}
%
%
% these figures made using
% cd itp/extended
 Extended channels obtained from a \BSC\ and from a Z channel are shown in figures \ref{fig.extended.bsc15} and \ref{fig.extended.z15}, with $N=2$ and $N=4$.
\exercissxA{2}{ex.extended}{
 Find the transition probability matrix $\bQ$ for the extended channel, with $N=2$, derived from the binary erasure channel having erasure probability 0.15.
%\item the extended channel with $N=2$ derived from
% the ternary confusion channel,
 By selecting two columns of this transition probability matrix,
% that have minimal overlap,
 we can define a rate-\dhalf\ code for this channel with blocklength $N=2$. What is the best choice of two columns? What is the decoding algorithm?
}

 To prove the noisy-channel coding theorem, we make use of large blocklengths $N$. The intuitive idea is that, if $N$ is large, {\em an extended channel looks a lot like the noisy typewriter.} Any particular input $\bx$ is very likely to produce an output in a small subspace of the output alphabet -- the typical output set, given that input. So we can find a non-confusable subset of the inputs that produce essentially disjoint output sequences.
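
 To put rough numbers on this picture (an illustration only, using the \BSC\ with $\q \eq 0.15$ and a blocklength $N \eq 100$ chosen here purely for convenience): each bit of a given input $\bx$ is flipped independently with probability $\q$, so the number of flipped bits has mean and standard deviation
\beq
 N \q = 15 , \:\:\:\: \sqrt{N \q (1-\q)} \simeq 3.6 .
\eeq
 The output is therefore very likely to differ from $\bx$ in roughly 15 of the 100 positions. To leading order in the exponent, the number of such outputs is about $2^{N H_2(\q)} \simeq 2^{61}$, a tiny fraction of the $2^{100}$ strings in the output alphabet.
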
% % add something like: % Remember what we learnt % in chapter \ref{ch2}: % For a given $N$, let us consider a way of generating such a non-confusable subset of the inputs, and count up how many distinct inputs it contains. Imagine making an input sequence $\bx$ for the extended channel by drawing it from an ensemble $X^N$, where $X$ is an arbitrary ensemble over the input alphabet. Recall the source coding theorem of \chapterref{ch.two}, and consider the number of probable output sequences $\by$. The total number of typical output sequences $\by$ % , when $\bx$ comes from the ensemble $X^N$, is $2^{N H(Y)}$, all having similar probability. For any particular typical input sequence $\bx$, there are about $2^{N H(Y \specialgiven X)}$ probable sequences. Some of these subsets of $\A_Y^N$ are depicted by circles in figure \ref{fig.ncct.typs}a. \begin{figure}%[htbp] \small \figuremargin{% \begin{center} \hspace*{-1mm}\begin{tabular}{cc} \framebox{ \setlength{\unitlength}{0.69mm}%was 0.8mm \begin{picture}(80,80)(0,0) \put(0,80){\makebox(0,0)[tl]{$\A_Y^N$}} \thicklines \put(40,40){\oval(50,50)} \thinlines \put(40,67){\makebox(0,0)[b]{Typical $\by$}} \put(30,50){\circle{12.5}} \put(50,40){\circle{12.5}} \put(35,52){\circle{12.5}} \put(58,33){\circle{12.5}} \put(33,40){\circle{12.5}} \put(35,45){\circle{12.5}} \put(50,30){\circle{12.5}} \put(40,50){\circle{12.5}} \put(52,35){\circle{12.5}} \put(33,58){\circle{12.5}} \put(40,33){\circle{12.5}} \put(45,35){\circle{12.5}} \put(50,50){\circle{12.5}} \put(23,55){\circle{12.5}} \put(24,45){\circle{12.5}} \put(27,57){\circle{12.5}} \put(25,40){\circle{12.5}} \put(55,42){\circle{12.5}} \put(55,52){\circle{12.5}} \put(58,53){\circle{12.5}} \put(53,40){\circle{12.5}} \put(35,22){\circle{12.5}} \put(27,30){\circle{12.5}} \put(40,24){\circle{12.5}} \put(40,39){\circle{12.5}} \put(46,43){\circle{12.5}} \put(55,40){\circle{12.5}} \put(40,55){\circle{12.5}} \put(52,23){\circle{12.5}} \put(50,26){\circle{12.5}} \put(40,54){\circle{12.5}} 
\put(52,55){\circle{12.5}} \put(33,28){\circle{12.5}} \put(57,33){\circle{12.5}} \put(25,35){\circle{12.5}} \put(55,25){\circle{12.5}} \put(25,26){\circle{12.5}} \multiput(23,22)(13,0){3}{\circle{12.5}} \multiput(30,34)(13,0){3}{\circle{12.5}} \multiput(23,46)(13,0){3}{\circle{12.5}} \multiput(30,58)(13,0){3}{\circle{12.5}} \thicklines \put(23,30){\circle{12.5}} \put(21,11){\vector(0,1){13}} \put(8,6){\makebox(0,0)[l]{ Typical $\by$ for a given typical $\bx$}} \end{picture} } & \framebox{ \setlength{\unitlength}{0.69mm} \begin{picture}(80,80)(0,0) \put(0,80){\makebox(0,0)[tl]{$\A_Y^N$}} \thicklines \put(40,40){\oval(50,50)} \thinlines \put(40,67){\makebox(0,0)[b]{Typical $\by$}} % \thicklines \multiput(23,22)(13,0){3}{\circle{12.5}} \multiput(30,34)(13,0){3}{\circle{12.5}} \multiput(23,46)(13,0){3}{\circle{12.5}} \multiput(30,58)(13,0){3}{\circle{12.5}} %\put(30,34){\circle{12.5}} %\put(43,34){\circle{12.5}} %\put(56,34){\circle{12.5}} %\put(23,45){\circle{12.5}} %\put(36,45){\circle{12.5}} %\put(49,45){\circle{12.5}} %\put(30,56){\circle{12.5}} %\put(43,56){\circle{12.5}} %\put(56,56){\circle{12.5}} \end{picture} }\\ (a)&(b) \\ \end{tabular} \end{center} }{% \caption[a]{(a) Some typical outputs in $\A_Y^N$ corresponding to typical inputs $\bx$. (b) A subset of the \ind{typical set}s shown in (a) that do not overlap each other. This picture can be compared with the solution to the \ind{noisy typewriter} in \figref{fig.typenine}.} \label{fig.ncct.typs} \label{fig.ncct.typs.no.overlap} } \end{figure} We now imagine restricting ourselves to a subset of the typical\index{typical set!for noisy channel} inputs $\bx$ such that the corresponding typical output sets do not overlap, as shown in \figref{fig.ncct.typs.no.overlap}b. We can then bound the number of non-confusable inputs by dividing the size of the typical $\by$ set, $2^{N H(Y)}$, by the size of each typical-$\by$-given-typical-$\bx$ set, $2^{N H(Y \specialgiven X)}$. 
So the number of non-confusable inputs, if they are selected from the set of typical inputs $\bx \sim X^N$, is $\leq 2^{N H(Y) - N H(Y \specialgiven X)} = 2^{N \I(X;Y)}$. % \begin{figure} % \begin{center} % \framebox{ % \setlength{\unitlength}{0.8mm} % } % \end{center} % \caption[a]{A subset of the typical sets shown in % \protect\figref{fig.ncct.typs} that do not overlap.} % \label{fig.ncct.typs.no.overlap} % \end{figure} The maximum value of this bound is achieved if $X$ is the ensemble that maximizes $\I(X;Y)$, in which case the number of non-confusable inputs is $\leq 2^{NC}$. Thus asymptotically up to $C$ bits per cycle, and no more, can be communicated with vanishing error probability.\ENDproof This sketch has not rigorously proved that reliable communication really is possible -- that's our task for the next chapter. \section{Further exercises} % \noindent % \exercissxA{3}{ex.zcdiscuss}{ Refer back to the computation of the capacity of the \ind{Z channel} with $\q=0.15$. \ben \item Why is $p_1^*$ less than 0.5? One could argue that it is good to favour the {\tt{0}} input, since it is transmitted without error -- and also argue that it is good to favour the {\tt1} input, since it often gives rise to the highly prized {\tt1} output, which allows certain identification of the input! Try to make a convincing argument. \item In the case of general $\q$, show that the \optens\ is \beq p_1^* = \frac{ 1/(1-\q) } { \displaystyle 1 + 2^{ \left( H_2(\q) / ( 1 - \q ) \right)} } . \eeq \item What happens to $p_1^*$ if the noise level $\q$ is very close to 1? \een } % see also ahmed.tex for a nice bound 0.5(1-q) on the capacity of the Z channel % and related graphs CZ.ps CZ2.ps CZ.gnu % \exercissxA{2}{ex.Csketch}{ Sketch graphs of the capacity of the \ind{Z channel}, the \BSC\ and the \BEC\ as a function of $\q$. 
% answer in figs/C.ps % \medskip } \exercisaxB{2}{ex.fiveC}{ What is the capacity of the five-input, ten-output channel % \index{channel!others} whose transition probability matrix is {\small \beq \left[ \begin{array}{*{5}{c}} 0.25 & 0 & 0 & 0 & 0.25 \\ 0.25 & 0 & 0 & 0 & 0.25 \\ 0.25 & 0.25 & 0 & 0 & 0 \\ 0.25 & 0.25 & 0 & 0 & 0 \\ 0 & 0.25 & 0.25 & 0 & 0 \\ 0 & 0.25 & 0.25 & 0 & 0 \\ 0 & 0 & 0.25 & 0.25 & 0 \\ 0 & 0 & 0.25 & 0.25 & 0 \\ 0 & 0 & 0 & 0.25 & 0.25 \\ 0 & 0 & 0 & 0.25 & 0.25 \\ \end{array} \right] \hspace{0.4in} \begin{array}{c}\ecfig{five}\end{array} ? \eeq } } \exercissxA{2}{ex.GC}{ Consider a \ind{Gaussian channel}\index{channel!Gaussian} with binary input $x \in \{ -1, +1\}$ and {\em real\/} output alphabet $\A_Y$, with transition probability density \beq Q(y \given x,\sa,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\smallfrac{(y-x \sa)^2}{2 \sigma^2}} , \eeq where $\sa$ is the signal amplitude. \ben \item Compute the posterior probability of $x$ given $y$, assuming that the two inputs are equiprobable. Put your answer in the form \beq P(x\eq 1 \given y,\sa,\sigma) = \frac{1}{1+e^{-a(y)}} . \eeq Sketch the value of $P(x\eq 1 \given y,\sa,\sigma)$ as a function of $y$. \item Assume that a single bit is to be transmitted. What is the optimal decoder, and what is its probability of error? Express your answer in terms of the signal-to-noise ratio $\sa^2/\sigma^2$ and the \label{sec.erf}\ind{error function}%\index{conventions!error function} \index{erf}\index{notation!error function} (the \ind{cumulative probability function} of the Gaussian distribution), \beq \Phi(z) \equiv \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} \, e^{-\textstyle\frac{z^2}{2}} \: \d z. \eeq % % P(x \given y,s,sigma) = 1/(1+e^{-a}), a = 2 ( s / \sigma^2 ). y % [Note that this definition of the error function $\Phi(z)$ may not correspond to other people's.] % definitions of the `error function'. 
% Some people %% and some software libraries % leave out factors of two in the definition.] % I think that the % above definition is the only natural one. \een } % \section{ \subsection*{Pattern recognition as a noisy channel} We may think of many pattern recognition problems in terms of\index{pattern recognition} \ind{communication} channels. Consider the case of recognizing handwritten digits (such as postcodes on envelopes). The author of the digit wishes to communicate a message from the set $\A_X = \{ 0,1,2,3,\ldots, 9 \}$; this selected message is the input to the channel. What comes out of the channel is a pattern of ink on paper. If the ink pattern is represented using 256 binary pixels, the channel $Q$ has as its output a random variable $y \in \A_Y = \{0,1\}^{256}$. % Here is an example of an element from this alphabet. An example of an element from this alphabet is shown in the margin. % % hintond.p zero=0.0 range=1.25 rows=16 background=1.0 pos=0.0 o=/home/mackay/_applications/characters/ex2.ps 16 < /home/mackay/_applications/characters/example2 % %\[ \marginpar{ {\psfig{figure=/home/mackay/_applications/characters/ex2.ps,width=1.1in}} }%\end{marginpar} %\] \exercisaxA{2}{ex.twos}{ Estimate how many patterns in $\A_Y$ are recognizable as the character `2'. [The aim of this problem is to % Try not to underestimate this number --- try to demonstrate the existence of {\em as many patterns as possible\/} that are recognizable as 2s.] 
\amarginfig{t}{
\begin{center}
\mbox{\psfig{figure=figs/random2.ps}} \\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/2random2.ps}} \\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/6random2.ps}} \\[0.15in]%\hspace{0.42in}
\mbox{\psfig{figure=figs/7random2.ps}}
\end{center}
\caption[a]{Some more 2s.}
\label{fig.random2s}
%\end{figure}
}%end{marginfig}
% made using figs/random2.ps seed=7
 Discuss how one might model the channel $P(y \given x\eq 2)$.\index{2s}\index{twos}\index{handwritten digits}
% in the case of handwritten digit recognition.
 Estimate the entropy of the probability distribution $P(y \given x\eq 2)$.
% Recognition of isolated handwritten digits
% Digit 2 -> Q -> y $\in \{0,1\}^{256})$
% 3
% Estimate how many 2's there are.
 One strategy for doing \ind{pattern recognition} is to create a model for $P(y \given x)$ for each value of the input $x \in \{ 0,1,2,3,\ldots, 9 \}$, then use \Bayes\ theorem to infer $x$ given $y$.
\beq
 P(x \given y) = \frac{ P(y \given x) P(x) } { \sum_{x'} P(y \given x') P(x') } .
\eeq
 This strategy is known as {\dbf \ind{full probabilistic model}ling\/} or {\dbf \ind{generative model}ling\/}. This is essentially how current speech recognition systems work. In addition to the channel model, $P(y \given x)$, one uses a prior probability distribution $P(x)$, which in the case of both character recognition and speech recognition is a language model that specifies the probability of the next character/word given the context and the known grammar and statistics of the language.
%
% Alternative, model $P(x \given y)$ directly.
% Discriminative modelling; conditional modelling.
% Feature extraction -- compute some $f(y)$ then model $P(f \given x)$
% - generative modelling in feature space.
% or else model $P(x \given f)$
% which is still discriminative modelling / conditional modelling.
% Notice number of parameters.
% % } \subsection*{Random coding} \exercissxA{2}{ex.birthday}{ Given %\index{random coding} % \index{code!random} \index{random code}twenty-four people in a room, % at a party, what is the probability that there are at least two people present who % of them have the same \ind{birthday} (\ie, day and month of birth)? What is the expected number of pairs of people with the same birthday? Which of these two questions is easiest to solve? Which answer gives most insight? You may find it helpful to solve these problems and those that follow using notation such as $A=$ number of days in year $=365$ and $S=$ number of people $=24$. } \exercisaxB{2}{ex.birthdaycode}{ The birthday problem may be related to a coding scheme. Assume we wish to convey a message to an outsider identifying one of the twenty-four people. We could simply communicate a number $\cwm$ from $\A_S = \{ 1,2, \ldots, 24 \}$, having agreed a mapping of people onto numbers; alternatively, we could convey a number from $\A_X = \{ 1 ,2 , \ldots, 365\}$, identifying the day of the year that is the selected person's \ind{birthday} (with apologies to leapyearians). [The receiver is assumed to know all the people's birthdays.] What, roughly, is the probability of error of this communication scheme, assuming it is used for a single transmission? What is the capacity of the communication channel, and what is the rate of communication attempted by this scheme? } % % CHRIS SAYS ``this is not CLEAR''................. : % \exercisaxB{2}{ex.birthdaycodeb}{ Now imagine that there are $K$ rooms in a building, each containing $q$ people. (You might think of $K=2$ and $q=24$ as an example.) The aim is to communicate a selection of one person from each room by transmitting an ordered list of $K$ days (from $\A_X$). Compare the probability of error of the following two schemes. \ben \item As before, where each room transmits the \ind{birthday} of the selected person. 
\item To each $K$-tuple of people, one drawn from each room, an ordered $K$-tuple of randomly selected days from $\A_X$ is assigned (this $K$-tuple has nothing to do with their birthdays). This enormous list of $S = q^K$ strings is known to the receiver. When the building has selected a particular person from each room, the ordered string of days corresponding to that $K$-tuple of people is transmitted.
\een
 What is the probability of error when $q=364$ and $K=1$? What is the probability of error when $q=364$ and $K$ is large, \eg\ $K=6000$?
}
% see synchronicity.tex
% for cut example
\dvips
\section{Solutions}% to Chapter \protect\ref{ch5}'s exercises}
% \fakesection{solns to exercises in l5.tex}
%
\soln{ex.bscy0}{
 If we assume we observe $y\eq 0$,
\beqan
 P(x\eq 1 \given y\eq 0) &=& \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'} P(y\eq 0 \given x') P(x')} \\
 &=& \frac{ 0.15 \times 0.1 }{ 0.15 \times 0.1 + 0.85 \times 0.9 } \\
 &=& \frac{ 0.015 }{0.78} \:=\: 0.019 .
\eeqan
}
\soln{ex.zcy0}{
 If we observe $y=0$,
\beqan
 P(x\eq 1 \given y\eq 0)
% &=& \frac{ P(y\eq 0 \given x\eq 1) P(x\eq 1)}{\sum_{x'} P(y \given x') P(x')} \\
 &=& \frac{ 0.15 \times 0.1 }{ 0.15 \times 0.1 + 1.0 \times 0.9 } \\
 &=& \frac{ 0.015}{ 0.915} \:=\: 0.016 .
\eeqan
}
\soln{ex.bscMI}{
 The probability that $y=1$ is $0.5$, so the mutual information is:
\beqan
 \I(X;Y) &=& H(Y) - H(Y \given X) \\
 &=& H_2(0.5) - H_2(0.15)\\
 & =& 1 - 0.61 \:\: = \:\: 0.39 \mbox{ bits}.
\eeqan
}
\soln{ex.zcMI}{
 We again compute the mutual information using $\I(X;Y) = H(Y) - H(Y \given X)$.
% fixed Tue 18/2/03
 The probability that $y=0$ is $0.575$, and $H(Y \given X) = \sum_x P(x) H(Y \given x) = P(x\eq1) H(Y \given x\eq1) $ $+$ $P(x\eq0) H(Y \given x\eq0)$ so the mutual information is:
\beqan
 \I(X;Y) &=& H(Y) - H(Y \given X) \\
 &=& H_2(0.575) - [0.5 \times H_2(0.15)+0.5 \times 0 ] \\
 & =& 0.984 - 0.305 \:\: = \:\: 0.679 \mbox{ bits}.
\eeqan
}
\soln{ex.bscC}{
 By symmetry, the \optens\ is $\{0.5,0.5\}$.
Then the capacity is \beqan C \:=\: \I(X;Y) &=& H(Y) - H(Y \given X) \\ &=& H_2(0.5) - H_2(\q)\\ & =& 1 - H_2(\q) . \eeqan Would you like to find the \optens\ without invoking symmetry? We can do this by computing the mutual information in the general case where the input ensemble is $\{p_0,p_1\}$: \beqan \I(X;Y) &=& H(Y) - H(Y \given X) \\ &=& H_2(p_0 \q+ p_1(1-\q) ) - H_2(\q) . \eeqan The only $p$-dependence is in the first term $H_2(p_0\q+ p_1(1-\q) )$, which is maximized by setting the argument to 0.5. This value is given by setting $p_0=1/2$. } \soln{ex.becC}{ \noindent {\sf Answer 1}. By symmetry, the \optens\ is $\{0.5,0.5\}$. The capacity is most easily evaluated by writing the mutual information as $\I(X;Y) = H(X) - H(X \given Y)$. The conditional entropy $H(X \given Y)$ is $\sum_y P(y) H(X \given y)$; when $y$ is known, $x$ is uncertain only if $y=\mbox{\tt{?}}$, which occurs with probability $\q/2+\q/2$, so the conditional entropy $H(X \given Y)$ is $\q H_2(0.5)$. \beqan C \:=\: \I(X;Y) &=& H(X) - H(X \given Y) \\ &=& H_2(0.5) - \q H_2(0.5)\\ & =& 1 - \q . \eeqan % The conditional entropy $H(X \given Y)$ is $\q H_2(0.5)$. % The binary erasure channel fails a fraction $\q$ of the time. Its capacity is precisely $1-\q$, which is the fraction of the time that the channel is reliable. % functional. % , even though the sender % does not know when the channel will % fail. This result seems very reasonable, but it is far from obvious how to encode information so as to communicate {\em reliably\/} over this channel. \smallskip \noindent {\sf Answer 2}. Alternatively, without invoking the symmetry assumed above, we can start from the input ensemble $\{p_0,p_1\}$. The probability that $y=\mbox{\tt{?}}$ is $p_0 \q+ p_1 \q = \q$, and when we receive $y=\mbox{\tt{?}}$, the posterior probability of $x$ is the same as the prior probability, so: \beqan \I(X;Y) &=& H(X) - H(X \given Y) \\ &=& H_2(p_1) - \q H_2(p_1)\\ & =& (1 - \q ) H_2(p_1) . 
\eeqan This mutual information achieves its maximum value of $(1-\q)$ when $p_1=1/2$. } % % % \begin{figure}[htbp] \figuremargin{% \begin{center} \begin{tabular}{ccccc} $\bQ$ & \ecfig{bec.1} &{\small{(a)}} \, \ecfig{bec.2} &{\small{(b)}} \, % roughly 8pts from col to col \setlength{\unitlength}{1pt} \begin{picture}(50,110)(-5,-5) \put(-5,-5){\ecfig{bec.2}} \put(3.95,-3){\framebox(8,96){}} \put(28.5,-3){\framebox(8,96){}} \put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}} \put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}} \end{picture} &{\small{(c)}} \, % roughly 8pts from col to col \setlength{\unitlength}{1pt} \begin{picture}(50,110)(-5,-5) \put(-5,-5){\ecfig{bec.2}} \put(3.95,-3){\framebox(8,96){}} \put(28.5,-3){\framebox(8,96){}} \put(2.5,97){\makebox(0,0)[bl]{\small$\bx^{(1)}$}} \put(26.5,97){\makebox(0,0)[bl]{\small$\bx^{(2)}$}} % roughly 8pts from col to col %\setlength{\unitlength}{1pt} %\begin{picture}(100,110)(-5,-5) %\put(-5,-5){\ecfig{bec.2}} %\put(3.95,-3){\framebox(8,96){}} %\put(28.5,-3){\framebox(8,96){}} %\put(2.5,97){\makebox(0,0)[bl]{$\bx^{(1)}$}} %\put(26.5,97){\makebox(0,0)[bl]{$\bx^{(2)}$}} % \multiput(-4,3)(0,8){2}{\line(1,0){8}} \multiput(-4,27)(0,8){3}{\line(1,0){8}} \multiput(-4,59)(0,8){2}{\line(1,0){8}} \multiput(37,3)(0,8){2}{\vector(1,0){14}} \multiput(37,27)(0,8){3}{\vector(1,0){14}} \multiput(37,59)(0,8){2}{\vector(1,0){14}} \multiput(57,4)(0,8){2}{\makebox(0,0)[l]{\tiny$\hat{m}=2$}} \multiput(57,28)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=2$}} \multiput(57,44)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=1$}} \multiput(57,60)(0,8){2}{\makebox(0,0)[l]{\tiny$\hat{m}=1$}} \multiput(57,36)(0,8){1}{\makebox(0,0)[l]{\tiny$\hat{m}=0$}} % the box starts exactly at x=0. \end{picture} \\ & $N=1$ & $N=2$ & \\[-0.1in] \end{tabular} \end{center} }{% \caption[a]{(a) The {\ind{extended channel}} ($N=2$) obtained from a binary erasure channel with erasure probability 0.15. (b) A block code consisting of the two codewords {\tt 00} and {\tt 11}. 
(c) The optimal decoder for this code. } \label{fig.extended.bec} } \end{figure} % \soln{ex.extended}{ The extended channel is shown in \figref{fig.extended.bec}. The best code for this channel with $N=2$ is obtained by choosing two columns that have minimal overlap, for example, columns {\tt 00} and {\tt 11}. The decoding algorithm returns `{\tt 00}' if the extended channel output is among the top four % either output is {\tt 0}, and `{\tt 11}' if it's among the bottom four, % if either output is {\tt 1}, and gives up if the output is `{\tt ??}'. } % % end of chapter % \soln{ex.zcdiscuss}{ In \exampleref{exa.Z.HXY} % \exaseven\ of chapter \chfive\ we showed that the mutual information between input and output of the Z channel is \beqan \I(X;Y) &=& H(Y) - H(Y \given X) \nonumber \\ &=& H_2(p_1 (1-\q)) - p_1 H_2(\q) . \eeqan We differentiate this expression with respect to $p_1$, taking care not to confuse $\log_2$ with $\log_e$: \beq \frac{\d}{\d p_1} \I(X;Y) = (1-\q) \log_2 \frac{ 1- p_1 (1-\q) }{ p_1 (1-\q) } - H_2(\q) . \eeq Setting this derivative to zero and rearranging using skills developed in \exthirtyone, we obtain: \beq { p_1^* (1-\q) } = \frac{1}{1 + \displaystyle 2^{H_2(\q)/(1-\q)}} , \eeq so the \optens\ is \beq p_1^* = \frac{ 1/(1-\q) } { \displaystyle 1 + 2^{ \left( H_2(\q) / ( 1 - \q ) \right)} } . \eeq As the noise level $\q$ tends to 1, this expression tends to $1/e$ (as you can prove using L'H\^opital's rule). For all values of $\q\!$, $p_1^*$ is smaller than $1/2$. A rough intuition for why input {\tt1} is used less than input {\tt0} is that when input {\tt1} is used, the noisy channel injects entropy into the received string; whereas when input {\tt0} is used, the noise has zero entropy. %% RUBBISH % Thus starting from $p_1=1/2$, a perturbation % towards smaller $p_1$ will reduce the conditional entropy % $H(Y \given X)$ linearly while leaving $H(Y)$ unchanged, to first order. % $H(Y)$ decreases only quadratically in $(p_1-\dhalf)$. 
} \soln{ex.Csketch}{ The capacities of the three channels are shown in \figref{fig.capacities}. % below. \amarginfig{b}{ \begin{center} \mbox{\psfig{figure=figs/C.ps,angle=-90,width=2in} } \end{center} \caption[a]{Capacities of the Z channel, \BSC, and binary erasure channel.} \label{fig.capacities} }%end marginpar For any $\q <0.5$, % the channels can be ordered with the BEC being the the BEC is the channel with highest capacity and the BSC the lowest. } \soln{ex.GC}{ The logarithm of the posterior probability ratio, given $y$, is \beq a(y) = \ln \frac{P(x\eq 1 \given y,\sa,\sigma)}{P(x\eq -1 \given y,\sa,\sigma)} = \ln \frac{Q(y \given x\eq 1,\sa,\sigma)}{Q(y \given x\eq -1,\sa,\sigma)} = 2 \frac{\sa y}{\sigma^2} . % corrected march 2000 % and corrected log to ln Sun 22/8/04 \eeq Using our skills picked up from % in chapter \ref{ch1}, \exerciseref{ex.logit}, we rewrite % from exercise \label{eq.sigmoid} \label{eq.logistic} this in the form \beq P(x\eq 1 \given y,\sa,\sigma) = \frac{1}{1+e^{-a(y)}} . \eeq The optimal decoder selects the most probable hypothesis; this can be done simply by looking at the sign of $a(y)$. If $a(y)>0$ then decode as $\hat{x}=1$. The probability of error is \beq p_{\rm b} = \int_{-\infty}^{0} \!\! \d y \: Q(y \given x\eq 1,\sa,\sigma) = % chris suggests removing the x (=1) from what follows (twice) \int_{-\infty}^{- x \sa} \! \d y \: \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\smallfrac{y^2}{2 \sigma^2}} = \Phi \left( - \frac{ x\sa }{ \sigma } \right) . % corrected march 2000 \eeq % where %\beq % \Phi(z) \equiv \int_{z}^{\infty} \frac{1}{\sqrt{2 \pi}} % e^{-\frac{z^2}{2}} . %\eeq %\beq % \Phi(z) \equiv \int_{-\infty}^{z}{\smallfrac{1}{\sqrt{2 \pi}}} % e^{-\textstyle\frac{z^2}{2}} . %\eeq } \subsection*{Random coding} \soln{ex.birthday}{ The probability that $S=24$ people whose birthdays are drawn at random from $A=365$ days all have {\em distinct\/} birthdays is \beq \frac{ A(A-1)(A-2)\ldots(A-S+1) }{ A^S } . 
\eeq The probability that two (or more) people share a \ind{birthday} is one minus this quantity, which, for $S=24$ and $A=365$, is about 0.5. This exact way of answering the question is not very informative since it is not clear for what value of $S$ the probability changes from being close to 0 to being close to 1. The number of pairs is $S(S-1)/2$, and the probability that a particular pair shares a birthday is $1/A$, so the {\em expected number\/} of collisions is \beq \frac{ S(S-1)}{2 } \frac{1}{A} . \eeq This answer is more instructive. The expected number of collisions is tiny if $S \ll \sqrt{A}$ and big if $S \gg \sqrt{A}$. We can also approximate the probability that all birthdays are distinct, for small $S$, thus: \beqan \lefteqn{\hspace*{-0.7in} \frac{ A(A-1)(A-2)\ldots(A-S+1) }{ A^S } \:\:=\:\: (1)(1-\dfrac{1}{A})(1-\dfrac{2}{A})\ldots(1-\dfrac{(S\!-\!1)}{A}) \hspace*{1.7in}} % this hspace{ no good \nonumber \\ &\simeq& \exp( 0 ) \exp ( -\linefrac{1}{A}) \exp ( -\linefrac{2}{A}) \ldots \exp ( -\linefrac{(S\!-\!1)}{A}) \\ &\simeq& \exp \left( - \frac{1}{A} \sum_{i=1}^{S-1} i \right) = \exp \left( - \frac{S(S-1)/2}{A} \right) . \eeqan } \dvipsb{solutions noisy channel s5} \prechapter{About Chapter} \fakesection{prerequisites for chapter 6} Before reading \chref{ch.six}, you should have read Chapters \chtwo\ and \chfive. \Exerciseref{ex.extended} is especially recommended. % and worked on \exerciseref{ex.dataprocineq}. % % \extwentytwo\ from chapter \chone. % Please note that you {\em don't\/} need to understand % this proof in order to be able to solve most of the % problems involving noisy channels. 
%\footnote % {This exposition is based on that of Cover and Thomas (1991).} \subsection*{Cast of characters} \noindent% \begin{tabular}{lp{4in}} \toprule $Q$ & the noisy channel \\ $C$ & the capacity of the channel \\ $X^N$ & an ensemble used to create a \ind{random code} \\ $\C$ & a random code \\ $N$ & the length of the codewords \\ $\bx^{(\cwm)}$ & a codeword, the $\cwm$th in the code \\ $\cwm$ % $s$ & the number of a chosen codeword (mnemonic: the {\em source\/} selects $\cwm$) \\ $\cwM = 2^{K}$ % $S$ & the total number of codewords in the code\\ $K=\log_2 \cwM$ & the number of bits conveyed by the choice of one codeword from $\cwM$, assuming it is chosen with uniform probability \\ $\bs$ & a binary representation of the number $\cwm$ \\ $R = K/N$ & the rate of the code, in bits per channel use (sometimes called $R'$ instead) \\ % $R'$ & another rate, close to $R$ \\ $\hat{\cwm}$ % $s$ & the decoder's guess of $\cwm$ \\ \bottomrule \end{tabular} \medskip %{\sf Typo Warning:} % the letter $m$ may turn up where it should read $\cwm$. %%%% !!!!!!!!!!!!!! ok??????????????????????? \ENDprechapter \chapter{The Noisy-Channel Coding Theorem} % {The noisy-channel coding theorem}% Proof of \label{ch.six} % % \lecturetitle{The noisy-channel coding theorem, part b} % \chapter{The noisy channel coding theorem}% Proof of \label{ch6} \section{The theorem}\index{noisy-channel coding theorem}\index{communication} The theorem has three parts, two positive and one negative. The main positive result is the first. 
\amarginfig{t}{ \begin{center}\small \setlength{\unitlength}{2pt} \begin{picture}(60,45)(-2.5,-7) \thinlines \put(0,0){\vector(1,0){60}} \put(0,0){\vector(0,1){40}} \put(30,0){\line(0,1){30}} \put(30,0){\line(1,2){10}} \put(30,-3){\makebox(0,0)[t]{$C$}} \put(55,-2){\makebox(0,0)[t]{$R$}} \put(42,22){\makebox(0,0)[bl]{$R(p_{\rm b})$}} \put(-1,35){\makebox(0,0)[r]{$p_{\rm b}$}} \thicklines \put(0,0){\makebox(30,30){1}} \put(30,0){\makebox(7.5,35){2}} \put(35,0){\makebox(30,20){3}} % \put(0,0){\line(0,1){50}} % \end{picture} \end{center} \caption[a]{Portion of the $R,p_{\rm b}$ plane to be proved achievable (1,$\,$2) and not achievable (3). } \label{fig.belowCcoming} }%end marginfig \ben%gin{itemize} \item For every discrete memoryless channel, the channel capacity \beq C = \max_{\P_X}\, \I(X;Y) \eeq has the following property. For any $\epsilon > 0$ and $R < C$, for large enough $N$, there exists a code of length $N$ and rate $\geq R$ and a decoding algorithm, such that the maximal probability of block error is $< \epsilon$. \item If a probability of bit error $p_{\rm b}$ is acceptable, rates up to $R(p_{\rm b})$ are achievable, where \beq R(p_{\rm b}) = \frac{ C } {1 - H_2(p_{\rm b})} . \eeq \item For any $p_{\rm b}$, rates greater than $R(p_{\rm b})$ are not achievable. \een%d{itemize} \section{Jointly-typical sequences} We formalize the intuitive preview of the last chapter.\index{typicality} We will define codewords $\bx^{(\cwm )}$ as coming from an ensemble $X^N$, and consider the random selection of one codeword and a corresponding channel output $\by$, thus defining a joint ensemble $(XY)^N$. %, corresponding to random generation of a codeword and a corresponding channel output. We will use a {\dem typical-set decoder}, which decodes a received signal $\by$ as $\cwm$ if $\bx^{(\cwm )}$ and $\by$ are {\dem jointly typical}, a term to be defined shortly. 
The proof will then centre on determining the probabilities (a) that the true input codeword is {\em not\/} jointly \index{typicality}{typical} with the output sequence; and (b) that a {\em false\/} input codeword {\em is\/} jointly typical with the output. We will show that, for large $N$, both probabilities
% $\rightarrow 0$,
 go to zero as long as there are fewer than $2^{NC}$ codewords, and the ensemble $X$ is the \index{optimal input distribution}{\optens}.
\newcommand{\JNb}{\mbox{$J_{N \beta}$}}
\begin{description}
\item[Joint typicality\puncspace] A pair of sequences $\bx,\by$ of length $N$ is defined to be {jointly typical (to tolerance $\beta$)}\index{joint typicality} with respect to the distribution $P(x,y)$ if
\beqan
 \mbox{$\bx$ is typical of $P(\bx)$,} & \mbox{\ie,} & \left| \frac{1}{N} \log \frac{1}{P(\bx)} - H(X) \right| < \beta , \nonumber \\
 \mbox{$\by$ is typical of $P(\by)$,} & \mbox{\ie,} & \left| \frac{1}{N} \log \frac{1}{P(\by)} - H(Y) \right| < \beta , \nonumber \\
 \mbox{and $\bx,\by$ is typical of $P(\bx,\by)$,} & \mbox{\ie,} & \left| \frac{1}{N} \log \frac{1}{P(\bx,\by)} - H(X,Y) \right| < \beta . \nonumber
\eeqan
\item[The jointly-typical set] $\JNb$ is the set of all jointly-typical sequence pairs of length $N$.
% It has the following three properties,
\end{description}
%\begin{example}
\noindent {\sf Example.} Here is a jointly-typical pair of length $N=100$ for the ensemble $P(x,y)$ in which $P(x)$ has $(p_0,p_1) = (0.9,0.1)$ and $P(y \given x)$ corresponds to a binary symmetric channel with noise level $0.2$.
\[%beq \mbox{ \begin{tabular}{cc} $\bx$ &\mbox{\footnotesize\tt 1111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}\\ $\by$ &\mbox{\footnotesize\tt 0011111111000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111} \end{tabular} } \]%eeq Notice that $\bx$ has 10 {\tt 1}s, and so is typical of the probability $P(\bx)$ (at any tolerance $\beta$); and $\by$ has % 18 + 8 = 26 26 {\tt 1}s, so it is typical of $P(\by)$ (because $P(y\eq 1) = 0.26$); and $\bx$ and $\by$ differ in % 18 + 2 20 bits, which is the typical number of flips for this channel. %\end{example} \begin{description} \item[Joint typicality theorem\puncspace] Let $\bx,\by$ be drawn from the ensemble $(XY)^N$ defined by $$P(\bx,\by)=\prod_{n=1}^N P(x_n,y_n).$$ Then\index{joint typicality theorem}\label{theorem.jtt} \ben \item the probability that $\bx,\by$ are jointly typical (to tolerance $\beta$) tends to 1 as $N \rightarrow \infty$; \item the number of jointly-typical sequences $|\JNb|$ is close to $2^{N H(X,Y) }$. To be precise, \beq |\JNb| \leq 2^{N ( H(X,Y) + \beta ) }; \eeq \item if $\bx'\sim X^N$ and $\by'\sim Y^N$, \ie, $\bx'$ and $\by'$ are {\em independent\/} samples with the same marginal distribution as $P(\bx,\by)$, then the probability that $(\bx' ,\by')$ lands in the jointly-typical set is about $2^{- N \I(X;Y)}$. To be precise, \beq P( (\bx' ,\by') \in \JNb ) \leq 2^{- N ( \I(X;Y) - 3 \beta ) } . \eeq % also, for the proof of the converse, we want... % for sufficiently large N % P( (\bx' ,\by') \in \JNb % \geq (1-\beta) 2^{- N ( \I(X;Y) + 3 \beta ) } \een \item[{\sf Proof.}] The proof of parts 1 and 2 by the law of large numbers follows that of the source coding theorem in \chref{ch2}. For part 2, let the pair $x,y$ play the role of $x$ in the source coding theorem, replacing $P(x)$ there by the probability distribution $P(x,y)$. 
% \marginpar{\footnotesize } For the third part, \beqan % \begin{array}{lll} % was (thin column) -- % \multicolumn{3}{l}{ % P( (\bx' ,\by') \in \JNb ) % \: = \: \sum_{(\bx ,\by) \in \JNb} P(\bx ) P(\by)} % \\[0.06in] % &\leq & |\JNb| \, 2^{-N(H(X)-\beta)} 2^{-N(H(Y)-\beta)} % \\[0.045in] % &\leq& 2^{N( H(X,Y) + \b) - N(H(X)+H(Y)-2\b)} % \\ % & =& 2^{-N ( \I(X;Y) - 3 \beta )} P( (\bx' ,\by') \in \JNb ) & = & \sum_{(\bx ,\by) \in \JNb} P(\bx ) P(\by) \\% [0.06in] &\leq & |\JNb| \, 2^{-N(H(X)-\beta)} \, 2^{-N(H(Y)-\beta)} \\% [0.045in] &\leq& 2^{N( H(X,Y) + \b) - N(H(X)+H(Y)-2\b)} \\ & =& 2^{-N ( \I(X;Y) - 3 \beta )} . \hspace{1in}\epfsymbol \eeqan % This quantity is a bound on the probability of confusing \end{description} A cartoon of the jointly-typical set is shown in \figref{fig.joint.typ}. % The property just proved, that t Two independent typical vectors are jointly typical with probability \beq P( (\bx' ,\by') \in \JNb ) \: \simeq \: 2^{-N ( \I(X;Y))} \eeq % because %, is readily understood by noticing that because the {\em total\/} number of independent typical pairs is the area of the dashed rectangle, $2^{NH(X)} 2^{NH(Y)}$, and the number of jointly-typical pairs is roughly $2^{NH(X,Y)}$, so the probability of hitting a jointly-typical pair is roughly \beq 2^{NH(X,Y)}/2^{NH(X)+NH(Y)} = 2^{-N\I(X;Y)}. 
\eeq % % the above eq was in-line but it looked ugly % \newcommand{\rad}{0.81} \begin{figure} \small \figuremargin{% \begin{center}\small \setlength{\unitlength}{1mm}% original picture is 9.75 in by 5.25 in \begin{picture}(74,105)(-15,-5) % \put(-10,-7){\framebox(62,99){}} % as well as box put Ax and Ay sizes \put(0,93.5){\vector(-1,0){10}} \put(0,93.5){\vector(1,0){52}} \put(-11,8){\vector(0,-1){15}} \put(-11,8){\vector(0,1){84}} %\put(0,92){\vector(-1,0){10}} %\put(0,92){\vector(1,0){52}} %\put(-10,8){\vector(0,-1){15}} %\put(-10,8){\vector(0,1){84}} % % width indicator \put(21,90){\vector(1,0){21}} \put(21,90){\vector(-1,0){21}} \put(21,88.7){\makebox(0,0)[t]{$2^{NH(X)}$}} % % height indicator \put(-2,45){\vector(0,1){43}} \put(-2,45){\vector(0,-1){43}} \put(0,30){\makebox(0,0)[l]{$2^{NH(Y)}$}}% was 45 % % RECTANGLE %\put(-1,0){\framebox(45,89){}} \put(-1,1){\dashbox{1}(44.5,88){}} % % strip width indicator \put(26,35){\vector(1,0){2}} \put(26,35){\vector(-1,0){2}} \put(26,15){\vector(1,0){2}} \put(26,15){\vector(-1,0){2}} \put(26,14){\makebox(0,0)[t]{$2^{NH(X|Y)}$}} % % strip height indicator \put(21,45){\vector(0,1){5}} \put(21,45){\vector(0,-1){5}} \put(28,45){\vector(0,1){5}}% was at 31,32 \put(28,45){\vector(0,-1){5}} \put(29,45){\makebox(0,0)[l]{$2^{NH(Y|X)}$}} % % JT set \multiput(2,88)(2,-4){21}{\circle*{\rad}} \multiput(2,86)(2,-4){21}{\circle*{\rad}} \multiput(0,86)(2,-4){22}{\circle*{\rad}} \multiput(0,88)(2,-4){22}{\circle*{\rad}} \multiput(0,82)(2,-4){21}{\circle*{\rad}} \multiput(0,84)(2,-4){21}{\circle*{\rad}} % %\put(38,20){\makebox(0,0)[l]{$2^{NH(X,Y)}$}} \put(18,64){\makebox(0,0)[l]{$2^{NH(X,Y)}$ dots}} \put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}} \put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}} \end{picture} \end{center} }{% \caption[a]{{The jointly-typical set.} The horizontal direction represents $\A_{X}^N$, the set of all input strings of length $N$. The vertical direction represents $\A_{Y}^N$, the set of all output strings of length $N$. 
The outer box contains all conceivable input--output pairs. Each dot represents a jointly-typical pair of sequences $(\bx,\by)$. The total number of jointly-typical sequences is about $2^{NH(X,Y)}$.
% [Compare with \protect\figref{fig.extended.bec}a,
% \protect\pref{fig.extended.bec}.]
% page \protect\pageref{fig.extended.bec}.]
}
\label{fig.joint.typ}
}
\end{figure}

\section{Proof of the noisy-channel coding theorem}
\subsection{Analogy}
Imagine that we wish to prove that there is a baby\index{weighing babies} in a class of one hundred babies who weighs less than 10\kg. Individual babies are difficult to catch and weigh.%
\amarginfig{c}{
\begin{center}
\mbox{\psfig{figure=figs/babiesscale4.ps,width=53mm}}
\end{center}
\caption[a]{Shannon's method for proving one baby weighs less than 10\kg.}
}
 Shannon's method of\index{Shannon, Claude} solving the task is to scoop up all the babies and weigh them all at once on a big weighing machine. If we find that their {\em average\/} weight is
% smaller than 1000\kg\ then the children's average weight
% must be
 smaller than 10\kg, there must exist {\em at least one\/} baby who weighs less than 10\kg\ -- indeed there must be many!
% In the context of weighing children,
 Shannon's method isn't guaranteed to reveal the existence of an underweight child, since it relies on there being a tiny number of elephants in the class. But if we use his method and get a total weight smaller than 1000\kg\ then our task is solved.

\subsection{From skinny children to fantastic codes}
We wish to show that there exists a code and a decoder having small probability of error. Evaluating the probability of error of any particular coding and decoding system is not easy. Shannon's innovation was this: instead of constructing a good coding and decoding system and evaluating its error probability, Shannon calculated the average probability of block error of {\em all\/} codes, and proved that this average is small.
There must then exist individual codes that have small probability of block error. % Finally % to prove that the {\em maximal\/} probability of error is small too, % we modify one of these good codes by throwing away the worst 50\% % of its codewords. \begin{figure} \small \figuremargin{% \begin{center} \begin{tabular}{cc} \setlength{\unitlength}{0.81mm}% original picture is 9.75 in by 5.25 in %\begin{picture}(74,100)(-15,-5) \begin{picture}(62,100)(-5,-5) % %\put(-10,-2){\framebox(62,94){}} % codewords \put( 5,0){\framebox(2,91){}} \put(13,0){\framebox(2,91){}} \put(31,0){\framebox(2,91){}} \put(35,0){\framebox(2,91){}} % \put(5,94){\makebox(0,2.5)[bl]{$\bx^{(3)}$}} \put(13,94){\makebox(0,2.5)[bl]{$\bx^{(1)}$}} \put(29,94){\makebox(0,2.5)[bl]{$\bx^{(2)}$}} \put(37,94){\makebox(0,2.5)[bl]{$\bx^{(4)}$}} % JT set \multiput(2,88)(2,-4){21}{\circle*{\rad}} \multiput(2,86)(2,-4){21}{\circle*{\rad}} \multiput(0,86)(2,-4){22}{\circle*{\rad}} \multiput(0,88)(2,-4){22}{\circle*{\rad}} \multiput(0,82)(2,-4){21}{\circle*{\rad}} \multiput(0,84)(2,-4){21}{\circle*{\rad}} % %\put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}} %\put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}} \end{picture} & \setlength{\unitlength}{0.81mm} \begin{picture}(78,100)(-15,-5) % %\put(-10,-2){\framebox(62,94){}} % codewords \put(5,0){\framebox(2,91){}} \put(13,0){\framebox(2,91){}} \put(31,0){\framebox(2,91){}} \put(35,0){\framebox(2,91){}} % \put(5,94){\makebox(0,2.5)[bl]{$\bx^{(3)}$}} \put(13,94){\makebox(0,2.5)[bl]{$\bx^{(1)}$}} \put(29,94){\makebox(0,2.5)[bl]{$\bx^{(2)}$}} \put(37,94){\makebox(0,2.5)[bl]{$\bx^{(4)}$}} % % decodings \put(-13,10){\makebox(0,0)[r]{$\by_c$}} \put(-13,20){\makebox(0,0)[r]{$\by_d$}} \put(-13,72){\makebox(0,0)[r]{$\by_b$}} \put(-13,82){\makebox(0,0)[r]{$\by_a$}} \put(-11.3,10){\vector(1,0){63}} \put(-11.3,20){\vector(1,0){63}} \put(-11.3,72){\vector(1,0){63}} \put(-11.3,82){\vector(1,0){63}} \put(54,10){\makebox(0,0)[l]{$\hat{\cwm}(\by_c)\eq 4$}}% was 10, 
\put(54,20){\makebox(0,0)[l]{$\hat{\cwm}(\by_d)\eq 0$}}% was 25, \put(54,72){\makebox(0,0)[l]{$\hat{\cwm}(\by_b)\eq 3$}} \put(54,82){\makebox(0,0)[l]{$\hat{\cwm}(\by_a)\eq 0$}} % top end % % JT set \multiput(2,88)(2,-4){21}{\circle*{\rad}} \multiput(2,86)(2,-4){21}{\circle*{\rad}} \multiput(0,86)(2,-4){22}{\circle*{\rad}} \multiput(0,88)(2,-4){22}{\circle*{\rad}} \multiput(0,82)(2,-4){21}{\circle*{\rad}} \multiput(0,84)(2,-4){21}{\circle*{\rad}} % %\put(21,96){\makebox(0,0)[b]{$\A_{X}^N$}} %\put(-12,45){\makebox(0,0)[r]{$\A_{Y}^N$}} \end{picture} \\ (a) & (b) \\ \end{tabular} \end{center} }{% \caption[a]{(a) {A \ind{random code}.} % A random code is a selection of input % sequences $\{ \bx^{(1)}, \ldots, \bx^{(\cwM)}\}$ from the ensemble % $X^N$. Each codeword % $\bx^{(\cwm)}$ is likely to be a typical sequence. % [Compare with \protect\figref{fig.extended.bec}b, % page \protect\pageref{fig.extended.bec}.] (b) {Example decodings by the typical set decoder.} A sequence that is not jointly typical with any of the codewords, such as $\by_a$, is decoded as $\hat{\cwm}=0$. A sequence that is jointly typical with codeword $\bx^{(3)}$ alone, $\by_b$, is decoded as $\hat{\cwm}=3$. Similarly, $\by_c$ is decoded as $\hat{\cwm}=4$. A sequence that is jointly typical with more than one codeword, such as $\by_d$, is decoded as $\hat{\cwm}=0$. % [Compare with \protect\figref{fig.extended.bec}c, % page \protect\pageref{fig.extended.bec}.] } \label{fig.rand.code} \label{fig.typ.set.dec} } \end{figure} \subsection{Random coding and typical-set decoding} Consider the following encoding--decoding system, whose rate is $R'$.\index{random code} \ben \item We fix $P(x)$ and generate the $\cwM = 2^{NR'}$ codewords of a $(N,NR')=(N,K)$ code $\C$ at random according to \beq P(\bx) = \prod_{n=1}^{N} P(x_n) . \eeq A random code is shown schematically in \figref{fig.rand.code}a. \item The code is known to both sender and receiver. 
\item A message $\cwm$ is chosen from $\{1,2,\ldots, 2^{NR'}\}$, and $\bx^{(\cwm )}$ is transmitted. The received signal is $\by$, with \beq P(\by \given \bx^{(\cwm )} ) = \prod_{n=1}^{N} P(y_n \given x^{(\cwm )}_n) . \eeq \item The signal is decoded by {\dem{typical-set decoding}\index{typical-set decoder}}. \begin{description} \item[Typical-set decoding\puncspace] Decode $\by$ as $\hat{\cwm }$ {\sf if} $(\bx^{(\hat{\cwm })},\by)$ are jointly typical {\em and\/} there is no other $\cwm' $ such that $(\bx^{(\cwm')},\by)$ are jointly typical;\\ {\sf otherwise} declare a failure $(\hat{\cwm }\eq 0)$. \end{description} This is not the optimal decoding algorithm, but it will be good enough, and easier to \analyze. The typical-set decoder is illustrated in \figref{fig.typ.set.dec}b. \item A decoding error occurs if $\hat{\cwm } \not = \cwm $. \een There are three probabilities of error that we can distinguish. First, there is the probability of block error for a particular code $\C$, that is, \beq p_{\rm B}(\C) \equiv P(\hat{\cwm } \neq \cwm \given \C). \eeq This is a difficult quantity to evaluate for any given code. Second, there is the average over all codes of this block error probability, \beq \langle p_{\rm B} \rangle \equiv \sum_{\C} P(\hat{\cwm } \neq \cwm \given \C) P(\C) . \eeq Fortunately, this quantity is much easier to evaluate than the first quantity $P(\hat{\cwm } \neq \cwm \given \C)$.% \marginpar{\small\raggedright\reducedlead{$\langle p_{\rm B} \rangle$ is just the probability that there is a decoding error at step 5 of the five-step process on the previous page.}} Third, the maximal block error probability of a code $\C$, \beq p_{\rm BM}(\C) \equiv \max_{\cwm } P(\hat{\cwm } \neq \cwm \given \cwm, \C), \eeq is the quantity we are most interested in: we wish to show that there exists a code $\C$ with the required rate whose maximal block error probability is small. 
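
It may help to write out how these three quantities are related. The following is only a sketch of the bookkeeping, assuming (as in the cast of characters) that $\cwm$ is chosen with uniform probability from the $\cwM$ codewords:
\beqan
 p_{\rm B}(\C) & = & \frac{1}{\cwM} \sum_{\cwm=1}^{\cwM} P(\hat{\cwm } \neq \cwm \given \cwm, \C) \:\leq\: p_{\rm BM}(\C) , \nonumber \\
 \min_{\C} \, p_{\rm B}(\C) & \leq & \langle p_{\rm B} \rangle . \nonumber
\eeqan
The first line says that the block error probability of a code is an average over its codewords, so it cannot exceed the maximal block error probability; the second says that the best code in the ensemble is at least as good as the average code. Neither line by itself bounds $p_{\rm BM}(\C)$, which is the quantity that the theorem is about.
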
We will get to this result by first finding the average block error probability, $\langle p_{\rm B} \rangle$. Once we have shown that this can be made smaller than a desired small number, we immediately deduce that there must exist {\em at least one\/} code $\C$ whose block error probability is also less than this small number. Finally, we show that this code, whose block error probability is satisfactorily small but whose maximal block error probability is unknown (and could conceivably be enormous), can be modified to make a code of slightly smaller rate whose maximal block error probability is also guaranteed to be small. We modify the code by throwing away the worst 50\% of its codewords.

We therefore now embark on finding the average probability of block error.

\subsection{Probability of error of typical-set decoder}
There are two sources of error when we use typical-set decoding. Either (a) the output $\by$ is not jointly typical with the transmitted codeword $\bx^{(\cwm )}$, or (b) there is some other codeword in $\C$ that is jointly typical with $\by$.

By the symmetry of the code construction, the probability of error averaged over all codes does not depend on the selected value of $\cwm$; we can assume without loss of generality that $\cwm=1$.

(a) The probability that the input $\bx^{(1)}$ and the output $\by$ are not jointly typical vanishes as $N \rightarrow \infty$, by the joint typicality theorem's first part (\pref{theorem.jtt}). We give a name, $\delta$, to the upper bound on this probability,
% .
 satisfying $\delta \rightarrow 0$ as $N \rightarrow \infty$; for any desired $\delta$, we can find a blocklength $N(\delta)$ such that $P( (\bx^{(1)},\by) \not \in \JNb) \leq \delta$.

(b) The probability that $\bx^{(\cwm')}$ and $\by$
% $(\bx^{(\cwm' )},\by)$
 are jointly typical, for a {\em given\/} $\cwm' \not = 1$, is $\leq 2^{-N(\I(X;Y)-3 \beta)}$, by part 3. And there are $(2^{NR'}-1)$ rival values of $\cwm'$ to worry about.
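
Since every rival codeword obeys the same bound, the total contribution of (b) can be bounded in one line. As a rough numerical illustration, the values $\I(X;Y)=0.5$ bits, $R'=0.4$, $\beta=0.01$ and $N=1000$ are assumed here purely as an example and do not describe any particular channel:
\beqan
 (2^{NR'}-1) \, 2^{-N(\I(X;Y)-3 \beta)} & \leq & 2^{NR'} \, 2^{-N(\I(X;Y)-3 \beta)} \:=\: 2^{-N(\I(X;Y)- R' -3 \beta)} \nonumber \\
 & = & 2^{-1000 \, (0.5 - 0.4 - 0.03)} \:=\: 2^{-70} \:\simeq\: 8.5 \times 10^{-22} . \nonumber
\eeqan
With these assumed numbers the exponent is negative because $R'$ lies below $\I(X;Y) - 3\beta$; had $R'$ been larger than that, the same expression would grow with $N$ instead of shrinking.
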
Thus the average probability of error $\langle p_{\rm B} \rangle$ satisfies: \beqan \langle p_{\rm B} \rangle &\leq & \delta + \sum_{\cwm' =2}^{2^{NR'}} 2^{-N(\I(X;Y)-3 \beta)} \label{eq.uniona} \\ &\leq & \delta + 2^{-N(\I(X;Y)- R' -3 \beta)} . \label{eq.unionaa} \eeqan % MARGINPAR should align with the eqn if possible (above) \begin{aside} {The inequality (\ref{eq.uniona}) that bounds a total probability of error $P_{\rm TOT}$ by the sum of the probabilities $P_{s'}$ of all sorts of events $s'$ each of which is sufficient to cause error, $$P_{\rm TOT} \leq P_1 + P_2 + \cdots, $$ is called a {\dem\ind{union bound}}.\index{bound!union} It is only an equality if the different events that cause error never occur at the same time as each other. } \end{aside} The average probability of error (\ref{eq.unionaa}) can be made $< 2 \delta$ by increasing $N$ if % {\em if\/} \beq R' < \I(X;Y) -3 \beta . \eeq We are almost there. We make three modifications: \newcommand{\expurgfig}[1]{% \hspace*{-0.3in}\raisebox{-0.975in}[2.05in][0pt]{\psfig{figure=figs/expurgate#1.ps,width=3.2in}}\hspace*{-0.3in}} \begin{figure} \figuremargin{ %\marginfig{ \begin{center}\small \begin{tabular}{c@{}c@{}c} \expurgfig{1} &$\Rightarrow$ & \expurgfig{2} \\ (a) A random code $\ldots$ & & (b) expurgated \\ \end{tabular} \end{center} }{ \caption[a]{How expurgation works. (a) In a typical random code, a small fraction of the codewords are involved in collisions -- pairs of codewords are sufficiently close to each other that the probability of error when either codeword is transmitted is not tiny. We obtain a new code from a random code by deleting all these confusable codewords. (b) The resulting code has slightly fewer codewords, so has a slightly lower rate, and its maximal probability of error is greatly reduced. } \label{fig.expurgate} } \end{figure} % \newcommand{\optens}{optimal input distribution} \ben \item We choose $P(x)$ in the proof to be the \optens\ of the channel. 
Then the condition $R'<\I(X;Y) -3 \beta$ becomes $R' < C - 3 \beta$. \ldots\ $\I(\cwm;\hat{\cwm }) > N C$ is not achievable, so $R > \smallfrac{C}{1-H_2(p_{\rm b})}$ is not achievable.\ENDproof

\exercisxC{3}{ex.m.s.I.aboveC}{
 Fill in the details in the preceding argument. If the bit errors between $\hat{\cwm }$ and $\cwm$ are independent then we have $\I(\cwm;\hat{\cwm }) = N R ( 1 - H_2(p_{\rm b}))$. What if we have complex correlations among those bit errors? Why does the inequality $\I(\cwm;\hat{\cwm }) \geq N R ( 1 - H_2(p_{\rm b}))$ hold?
}

\section{Computing capacity\nonexaminable}
\label{sec.compcap}
 We\marginpar[c]{\small\raggedright\reducedlead{Sections \ref{sec.compcap}--\ref{sec.codthmpractice} contain advanced material. The first-time reader is encouraged to skip to section \ref{sec.codthmex} (\pref{sec.codthmex}).}} have proved that the capacity of a channel is the maximum rate at which reliable communication can be achieved. How can we compute the capacity of a given discrete memoryless channel? We need to find its \optens. In general we can find the \optens\ by a computer search, making use of the derivative of the mutual information with respect to the input probabilities.

\exercisxB{2}{ex.Iderivative}{
 Find the derivative of $\I(X;Y)$ with respect to the input probability $p_i$, $\partial \I(X;Y)/\partial p_i$, for a channel with conditional probabilities $Q_{j|i}$.
}
\exercisxC{2}{ex.Iconcave}{
 Show that $\I(X;Y)$ is a \concavefrown\ function of the input probability vector $\bp$.
}

 Since $\I(X;Y)$ is \concavefrown\ in the input distribution $\bp$, any probability distribution $\bp$ at which
% that has $\partial \I(X;Y)/\partial p_i$
 $\I(X;Y)$ is stationary must be a global maximum of $\I(X;Y)$.
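
As a small check of this claim, consider the binary symmetric channel with noise level $\q$, whose mutual information for the input ensemble $\{p_0,p_1\}$ was found earlier to be $H_2(p_0 \q+ p_1(1-\q) ) - H_2(\q)$. The following is a sketch of the calculation for that one channel only; the symbol $r \equiv p_0 \q+ p_1(1-\q)$ is introduced here just for this illustration, and we substitute $p_0 = 1-p_1$ so that $r = \q + p_1 (1 - 2\q)$:
\[
 \frac{\d}{\d p_1} \I(X;Y) \:=\: (1-2\q) \log_2 \frac{ 1- r }{ r } .
\]
For $\q \neq 1/2$ this derivative vanishes only when $r=1/2$, that is, when $p_1 = 1/2$. By concavity this single stationary point is the global maximum, in agreement with the capacity $1 - H_2(\q)$ obtained from the symmetry argument.
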
% So it is tempting to put the derivative of $\I(X;Y)$ into a routine that finds a local maximum of $\I(X;Y)$, that is, an input distribution $P(x)$ such that \beq \frac{\partial \I(X;Y)}{\partial p_i} = \lambda \:\:\: \mbox{for all $i$}, \label{eq.Imaxer} \eeq where $\lambda$ is a Lagrange multiplier associated with the constraint $\sum_i p_i = 1$. However, this approach may fail to find the right answer, because $\I(X;Y)$ might be maximized by a distribution that has $p_i \eq 0$ for some inputs. A simple example is given by the ternary confusion channel. \begin{description} % \item[Ternary confusion channel\puncspace] $\A_X \eq \{0,{\query},1\}$. $\A_Y \eq \{0,1\}$. \[ \begin{array}{c} \setlength{\unitlength}{0.46mm} \begin{picture}(20,30)(0,0) \put(5,5){\vector(1,0){10}} \put(5,25){\vector(1,0){10}} \put(5,15){\vector(1,1){10}} \put(5,15){\vector(1,-1){10}} \put(4,5){\makebox(0,0)[r]{1}} \put(4,25){\makebox(0,0)[r]{0}} \put(16,5){\makebox(0,0)[l]{1}} \put(16,25){\makebox(0,0)[l]{0}} \put(4,15){\makebox(0,0)[r]{{\query}}} \end{picture} \end{array} \begin{array}{c@{\:\:\,}c@{\:\:\,}l} P(y\eq 0 \given x\eq 0) &=& 1 \,; \\ P(y\eq 1 \given x\eq 0) &=& 0 \,; \end{array} \begin{array}{c@{\:\:\,}c@{\:\:\,}l} P(y\eq 0 \given x\eq {\query}) &=& 1/2 \,; \\ P(y\eq 1 \given x\eq {\query}) &=& 1/2 \,; \end{array} \begin{array}{c@{\:\:\,}c@{\:\:\,}l} P(y\eq 0 \given x\eq 1) &=& 0 \,; \\ P(y\eq 1 \given x\eq 1) &=& 1 . \end{array} \] Whenever the input $\mbox{\query}$ is used, the output is random; the other inputs are reliable inputs. The maximum information rate of 1 bit is achieved by making no use of the input $\mbox{\query}$. \end{description} \exercissxB{2}{ex.Iternaryconfusion}{ Sketch the mutual information for this channel as a function of % $$a\in (0,1)$ and $b\in (0,1)$, the input distribution $\bp$. Pick a convenient two-dimensional representation of $\bp$. 
} The \ind{optimization} routine must therefore take account of the possibility that, as we go up hill on $\I(X;Y)$, we may run into the inequality constraints $p_i \geq 0$. \exercissxB{2}{ex.Imaximizer}{ Describe the condition, similar to \eqref{eq.Imaxer}, that is satisfied at a point where $\I(X;Y)$ is maximized, and describe a computer program for finding the capacity of a channel. } \subsection{Results that may help in finding the \optens} % The following results \ben \item {All outputs must be used}. \item {$\I(X;Y)$ is a \convexsmile\ function of the channel parameters.}\marginpar{\small\raggedright\reducedlead {\sf Reminder:} The term `\convexsmile' means `convex', and the term `\concavefrown' means `concave'; the little smile and frown symbols are included simply to remind you what convex and concave mean.} \item {There may be several {\optens}s, but they all look the same at the output.} \een %\subsubsection{All outputs must be used\subsubpunc} \exercisxB{2}{ex.Iallused}{ Prove that no output $y$ is unused by an \optens, unless it is unreachable, that is, has $Q(y \given x)=0$ for all $x$. } %\subsubsection{Convexity of $\I(X;Y)$ with respect to the channel parameters\subsubpunc} \exercisxC{2}{ex.Iconvex}{ Prove that $\I(X;Y)$ is a \convexsmile\ function of $Q(y \given x)$. } %\subsubsection{There may be several {\optens}s, but they all look the same at the output\subsubpunc} \exercisxC{2}{ex.Imultiple}{ Prove that all {\optens}s of a channel have the same output probability distribution $P(y) = \sum_x P(x)Q(y \given x)$. } These results, along with the fact that $\I(X;Y)$ is a \concavefrown\ function of the input probability vector $\bp$, prove the validity of the symmetry argument that we have used when finding the capacity of symmetric channels. 
If a channel is invariant under a group of symmetry operations -- for example,
 interchanging the input symbols and interchanging the output symbols -- then,
 given any \optens\ that is not symmetric, \ie, is not invariant under these operations,
 we can create another input distribution by averaging this \optens\ with
 all the permuted forms obtained by applying the symmetry operations to it.
 By symmetry, each permuted distribution has the same $\I(X;Y)$ as the
 original; so, because of the concavity of $\I$, the averaged distribution
 has $\I(X;Y)$ greater than or equal to that of the original distribution.
 Since the original was optimal, the averaged distribution, which is invariant
 under the symmetry operations, is also an \optens.
% see capacity.p
\subsection{Symmetric channels}
\label{sec.Symmetricchannels}
 In order to use symmetry arguments, it will help to have a definition
 of a symmetric channel. I like \quotecite{Gallager68}
% Gallager's
 definition.\index{Gallager, Robert G.}
% page 94
%\subsubsection{Gallager's definition of a symmetric channel}
\begin{description}
\item[A discrete memoryless channel is a symmetric channel]
 if the set of outputs can be partitioned into subsets in such a way that
 for each subset the matrix of transition probabilities
% (using inputs as columns and outputs in the subset as rows)
 has the property that each row (if more than 1) is a permutation of each
 other row and each column is a permutation of each other column.
\end{description}
\exampl{exSymmetric}{
 This channel
\beq
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
 P(y\eq 0 \given x\eq 0) &=& 0.7 \,; \\
 P(y\eq {\query} \given x\eq 0) &=& 0.2 \,; \\
 P(y\eq 1 \given x\eq 0) &=& 0.1 \,;
\end{array}
\begin{array}{c@{\:\:\,}c@{\:\:\,}l}
 P(y\eq 0 \given x\eq 1) &=& 0.1 \,; \\
 P(y\eq {\query} \given x\eq 1) &=& 0.2 \,; \\
 P(y\eq 1 \given x\eq 1) &=& 0.7.
\end{array} \eeq is a symmetric channel because its outputs can be partitioned into $(0,1)$ and ${\query}$, so that the matrix can be rewritten: \beq \begin{array}{cc} \midrule \begin{array}{ccl}%{c@{}c@{}l} P(y\eq 0 \given x\eq 0) &=& 0.7 \,; \\ P(y\eq 1 \given x\eq 0) &=& 0.1 \,; \end{array} & \begin{array}{ccl}%{c@{}c@{}l} P(y\eq 0 \given x\eq 1) &=& 0.1 \,; \\ P(y\eq 1 \given x\eq 1) &=& 0.7 \,; \end{array} \\ \midrule \begin{array}{ccl}%{c@{}c@{}l} P(y\eq {\query} \given x\eq 0) &=& 0.2 \,; \\ \end{array} & \begin{array}{ccl}%{c@{}c@{}l} P(y\eq {\query} \given x\eq 1) &=& 0.2 . \\ \end{array} \\ \midrule \end{array} % \eeq } Symmetry is a useful property because, as we will see in a later chapter, communication at capacity can be achieved over symmetric channels by {\em{linear}\/} codes.\index{error-correcting code!linear}\index{linear block code} % that are good codes %-- a considerable simplification of the task of finding excellent codes. \exercisxC{2}{ex.Symmetricoptens}{ Prove that for a \ind{symmetric channel} with any number of inputs,\index{channel!symmetric} the uniform distribution over the inputs is an {\optens}. } \exercissxB{2}{ex.notSymmetric}{ Are there channels that are not symmetric whose {\optens}s are uniform? Find one, or prove there are none. } \section{Other coding theorems}% this star indicates skippable \label{sec.othercodthm} The noisy-channel coding theorem that we proved in this chapter\index{error-correcting code!error probability} is quite general, applying to any discrete memoryless channel; but it is not very specific. The theorem only says that reliable communication with error probability $\epsilon$ and rate $R$ % can be achieved over a channel can be achieved by using codes with {\em sufficiently large\/} blocklength $N$. The theorem does not say how large $N$ needs to be % as a function to achieve given values of $R$ and $\epsilon$. Presumably, the smaller $\epsilon$ is and the closer $R$ is to $C$, the larger $N$ has to be. 
% The task of proving explicit results about the blocklength % is challenging and solutions to this problem are considerably % more complex than the theorem we proved in this chapter. %\begin{figure} \marginfig{ \begin{center} \mbox{\raisebox{0.5in}{$E_{\rm r}(R)$}\psfig{figure=figs/Er.eps,width=0.97in}} \end{center} \caption[a]{A typical random-coding exponent.} \label{fig.Er} %\end{figure} }%\end{marginfig} % % \subsection{Noisy-channel coding theorem -- version with explicit $N$-dependence} % explicit blocklength dependence} \index{noisy-channel coding theorem} \begin{quote} For a discrete memoryless channel, a blocklength $N$ and a rate $R$, there exist block codes of length $N$ whose average probability of error satisfies: \beq p_{\rm B} \leq \exp \left[ -N E_{\rm r}(R) \right] \label{eq.pbEr} \eeq where $E_{\rm r}(R)$ is the {\dem\ind{random-coding exponent}\/} of the channel, a \convexsmile, decreasing, positive function of $R$ %which % satisfies %\beq % E_{\rm r}(R) > 0 \:\: \mbox{for all $R$ satisfying $0 \leq R < C$} . %\eeq for $0 \leq R < C$. The {random-coding exponent} is also known as the \ind{reliability function}. [By an \ind{expurgation} argument it can also be shown that there exist block codes for which the {\em{maximal\/}} probability of error $p_{\rm BM}$ % , like $p_{\rm B}$ in \eqref{eq.pbEr}, is also exponentially small in $N$.] \end{quote} The definition of $E_{\rm r}(R)$ is given in \citeasnoun{Gallager68}, p.$\,$139. $E_{\rm r}(R)$ approaches zero as $R \rightarrow C$; the typical behaviour of this function is illustrated in \figref{fig.Er}. The computation of the {random-coding exponent} for interesting channels is a challenging task on which much effort has been expended. Even for simple channels like the \BSC, there is no simple expression for $E_{\rm r}(R)$. 
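
 As a rough illustration of how the bound \eqref{eq.pbEr} relates
 the blocklength to the other quantities, here is a sketch; the numerical value of the
 exponent used below is a made-up assumption, not a property of any particular
 channel. To guarantee $p_{\rm B} \leq \epsilon$ it is sufficient (though
 perhaps not necessary) that
\[
	\exp \left[ -N E_{\rm r}(R) \right] \leq \epsilon
	\hspace{0.3in} \Leftrightarrow \hspace{0.3in}
	N \geq \frac{ \ln (1/\epsilon) }{ E_{\rm r}(R) } .
\]
 For instance, if we suppose $E_{\rm r}(R) = 0.01$ at the rate of interest
 and ask for $\epsilon = 10^{-6}$, then $\ln (1/\epsilon) \simeq 13.8$ and a
 blocklength of $N \simeq 1400$ suffices. Because $E_{\rm r}(R)$
 approaches zero as $R \rightarrow C$, this sufficient blocklength grows
 without limit as the rate approaches the capacity.
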
\subsection{Lower bounds on the error probability as a function of blocklength}
 The theorem stated above
% gives an upper bound on the error probability:
 asserts that there are codes with $p_{\rm B}$ no larger than
 $\exp \left[ -N E_{\rm r}(R) \right]$.
 But how small can the error probability be? Could it be much smaller?
\begin{quote}
 For any code with blocklength $N$ on a discrete memoryless channel,
 the probability of error, assuming all source messages are used with
 equal probability, satisfies
\beq
	p_{\rm B} \gtrsim \exp[ - N E_{\rm sp}(R) ] ,
\eeq
 where the function $E_{\rm sp}(R)$, the
 {\dem\ind{sphere-packing exponent}\/} of the channel, is
 a \convexsmile, decreasing, positive function of $R$ for $0 \leq R < C$.
\end{quote}
 For a precise statement of this result and further references,
 see \citeasnoun{Gallager68}, \mbox{p.$\,$157}.
%% \index{Gallager, Robert G.}
\section{Noisy-channel coding theorems and coding practice}
\label{sec.codthmpractice}
 Imagine a customer who wants to buy an error-correcting code and decoder
 for a noisy channel. The results described above allow us to offer the
 following service: if he tells us the properties of his channel, the
 desired rate $R$ and the desired error probability $p_{\rm B}$, we can,
 after working out the relevant functions $C$, $E_{\rm r}(R)$, and
 $E_{\rm sp}(R)$, advise him that there exists a solution to his problem
 using a particular blocklength $N$; indeed that almost any randomly chosen
 code with that blocklength should do the job. Unfortunately we have not
 found out how to implement these encoders and decoders in practice; the
 cost of implementing the encoder and decoder for a random code with large
 $N$ would be exponentially large in $N$.
 Furthermore, for practical purposes, the customer is unlikely to know
 exactly what channel he is dealing with.
% and might be reluctant to specify a desired rate So \citeasnoun{Berlekamp80} suggests that\index{Berlekamp, Elwyn} the sensible way to approach error-correction is to design encoding-decoding systems and plot their performance on a {\em variety\/} of idealized channels as a function of the channel's noise level. These charts (one of which is illustrated on page \pageref{fig:GCResults}) can then be shown to the customer, who can choose among the systems on offer without having to specify what he really thinks his channel is like. With this attitude to the practical problem, the importance of the functions $E_{\rm r}(R)$ and $E_{\rm sp}(R)$ is diminished. % % put this back somewhere. : % % %\subsection{Noisy-channel coding theorem with errors allowed: % rate-distortion theory} % See Gallager p.466$\pm 20$. % %\subsection{Special case of linear codes} % Give Gallager's p.94 definition of a discrete symmetric channel. % Give coding theorem for linear codes on any symmetric channel % (including with memory). % %\subsection{More general case of % channels with memory} % %\subsection{Finite state channels} % Channels with and without intersymbol interference and % with and without noise. (Is it worth discussing these in any % individual detail, or shall % I just have a general channels with memory discussion?) 
% % end detour \section{Further exercises} \label{sec.codthmex} \exercisaxA{2}{ex.exam01}{ A binary erasure channel with input $x$ and output $y$ has transition probability matrix: \[ \bQ = \left[ \begin{array}{cc} 1-q & 0 \\ q & q \\ 0 & 1-q \end{array} \right] \hspace{1in} \begin{array}{c} \setlength{\unitlength}{0.13mm} \begin{picture}(100,100)(0,0) \put(18,0){\makebox(0,0)[r]{\tt 1}} % \put(18,80){\makebox(0,0)[r]{\tt 0}} \put(20,0){\vector(1,0){38}} \put(20,80){\vector(1,0){38}} % \put(20,0){\vector(1,1){38}} \put(20,80){\vector(1,-1){38}} % \put(62,0){\makebox(0,0)[l]{\tt 1}} \put(62,40){\makebox(0,0)[l]{\tt ?}} \put(62,80){\makebox(0,0)[l]{\tt 0}} \end{picture} \end{array} \] Find the {\em{mutual information}\/} $I(X;Y)$ between the input and output for general input distribution $\{ p_0,p_1 \}$, and show that the {\em{capacity}\/} of this channel is $C = 1-q$ bits. \medskip \item %\noindent (c) A Z channel\index{channel!Z channel} has transition probability matrix: \[ \bQ = \left[ \begin{array}{cc} 1 & q \\ 0 & 1-q \end{array} \right] \hspace{1in} \begin{array}{c} \setlength{\unitlength}{0.1mm} \begin{picture}(100,100)(0,0) \put(18,0){\makebox(0,0)[r]{\tt 1}} % \put(18,80){\makebox(0,0)[r]{\tt 0}} \put(20,0){\vector(1,0){38}} \put(20,80){\vector(1,0){38}} % \put(20,0){\vector(1,2){38}} % \put(62,0){\makebox(0,0)[l]{\tt 1}} \put(62,80){\makebox(0,0)[l]{\tt 0}} \end{picture} \end{array} \] Show that, using a $(2,1)$ code, % of blocklength 2, {\bf{two}} uses of a Z channel can be made to emulate {\bf{one}} use of an erasure channel, and state the erasure probability of that erasure channel. Hence show that the capacity of the Z channel, $C_{\rm Z}$, satisfies $C_{\rm Z} \geq \frac{1}{2}(1-q)$ bits. Explain why the result $C_{\rm Z} \geq \frac{1}{2}(1-q)$ is an inequality rather than an equality. 
} \exercissxC{3}{ex.wirelabelling}{ A \ind{transatlantic} cable contains $N=20$ indistinguishable electrical wires.\index{puzzle!transatlantic cable}\index{puzzle!cable labelling} You have the job of figuring out which wire is which, that is, % Alice and Bob, located at the opposite ends of the % cable, wish to create a consistent labelling of the wires at each end. Your only tools are the ability to connect wires to each other in groups of two or more, and to test for connectedness with a continuity tester. What is the smallest number of transatlantic trips you need to make, and how do you do it? How would you solve the problem for larger $N$ such as $N=1000$? As an illustration, if $N$ were 3 then the task can be solved in two steps by labelling one wire at one end $a$, connecting the other two together, crossing the \ind{Atlantic}, measuring which two wires are connected, labelling them $b$ and $c$ and the unconnected one $a$, then connecting $b$ to $a$ and returning across the Atlantic, whereupon on disconnecting $b$ from $c$, the identities of $b$ and $c$ can be deduced. This problem can be solved by persistent search, but the reason it is posed in this chapter is that it can also be solved by a greedy approach based on maximizing the acquired {\em information}. Let the unknown permutation of wires be $x$. % , drawn from an ensemble $X$. Having chosen a set of connections of wires $\cal C$ at one end, you can then make measurements at the other end, and these measurements $y$ convey {\em information\/} about $x$. How much? And for what set of connections is the information that $y$ conveys about $x$ maximized? } \dvips \section{Solutions}% to Chapter \protect\ref{ch6}'s exercises} % 80,82,84,85,86 % solutions to _l6.tex % %\soln{ex.m.s.I.aboveC}{ %%\input{tex/aboveC.tex} % {\em [More work needed here.]} %} %\soln{ex.Iderivative}{ %% Find derivative of $I$ w.r.t $P(x)$. % Get a specific mutual information % like object minus $\log e$. 
%} %\soln{ex.Iconcave}{ % $\I(X,Y) = \sum_{x,y} P(x) Q(y|x) \log \frac{Q(y|x)}{P(x)Q(y|x)}$ % is a \concavefrown\ function of $P(x)$. % Easy Proof in Gallager p.90, using \verb+z->x->y+, where $z$ chooses % between the two things we are mixing. % This satisfies $I(X;Y|Z) = 0$ % (data processing inequality). %} \soln{ex.Iternaryconfusion}% { \marginpar{\[ \begin{array}{c} \setlength{\unitlength}{1mm} \begin{picture}(20,30)(0,0) \put(5,5){\vector(1,0){8}} \put(5,25){\vector(1,0){8}} \put(5,15){\vector(1,1){8}} \put(5,15){\vector(1,-1){8}} \put(10,18){\makebox(0,0)[l]{\dhalf}} \put(10,12){\makebox(0,0)[l]{\dhalf}} \put(4,5){\makebox(0,0)[r]{\tt1}} \put(4,25){\makebox(0,0)[r]{\tt0}} \put(16,5){\makebox(0,0)[l]{\tt1}} \put(16,25){\makebox(0,0)[l]{\tt0}} \put(4,15){\makebox(0,0)[r]{\tt{?}}} \end{picture} \end{array} \] } If the input distribution is $\bp=(p_0,p_{\tt{?}},p_1)$, the mutual information is \beq I(X;Y) = H(Y) - H(Y|X) = H_2(p_0 + p_{{\tt{?}}}/2) - p_{{\tt{?}}} . \eeq We can build a good sketch of this function in two ways: by careful inspection of the function, or by looking at special cases. For the plots, the two-dimensional representation of $\bp$ I will use has $p_0$ and $p_1$ as the independent variables, so that $\bp=(p_0,p_{\tt{?}},p_1) = (p_0,(1-p_0-p_1),p_1)$. \medskip \noindent {\sf By inspection.} If we use the quantities $p_* \equiv p_0 + p_{{\tt{?}}}/2$ and $p_{\tt{?}}$ as our two degrees of freedom, the mutual information becomes very simple: $I(X;Y) = H_2(p_*) - p_{{\tt{?}}}$. Converting back to $p_0 = p_* - p_{{\tt{?}}}/2$ and $p_1 = 1 - p_* - p_{{\tt{?}}}/2$, we obtain the sketch shown at the left below. This function is like a tunnel rising up the direction of increasing $p_0$ and $p_1$. To obtain the required plot of $I(X;Y)$ we have to strip away the parts of this tunnel that live outside the feasible \ind{simplex} of probabilities; we do this by redrawing the surface, showing only the parts where $p_0>0$ and $p_1>0$. 
A full plot of the function is shown at the right. \medskip \begin{center} \mbox{% \hspace*{2.3in}% \makebox[0in][r]{\raisebox{0.3in}{$p_0$}}% \hspace*{-2.3in}% \raisebox{0in}[1.9in]{\psfig{figure=figs/confusion.view1.ps,angle=-90,width=3.62in}}% \hspace{-0.3in}% \makebox[0in][r]{\raisebox{0.87in}{$p_1$}}% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \hspace*{2.3in}% \makebox[0in][r]{\raisebox{0.3in}{$p_0$}}% \hspace*{-2.3in}% \raisebox{0in}[1.709in]{\psfig{figure=figs/confusion.view2.ps,angle=-90,width=3.62in}}% \hspace{-0.3in}% \makebox[0in][r]{\raisebox{0.87in}{$p_1$}}% }\\[-0.3in] \end{center} \medskip \noindent {\sf Special cases.} In the special case $p_{{\tt{?}}}=0$, the channel is a noiseless binary channel, and $I(X;Y) = H_2(p_0)$. In the special case $p_0=p_1$, the term $H_2(p_0 + p_{{\tt{?}}}/2)$ is equal to 1, so $I(X;Y) = 1-p_{{\tt{?}}}$. In the special case $p_0=0$, the channel is a Z channel with error probability 0.5. We know how to sketch that, from the previous chapter (\figref{hxyz}). \amarginfig{c}{\small% skeleton fixed Thu 10/7/03 \begin{center}% was -0.51in until Sat 24/5/03 \hspace*{-0.31in}\mbox{% \hspace*{1.62in}% \makebox[0in][r]{\raisebox{0.25in}{$p_0$}}% \hspace*{-1.62in}% {\psfig{figure=figs/confusion.skel.ps,angle=-90,width=2.5in}}%was 3in \hspace{-0.3in}% \makebox[0in][r]{\raisebox{0.77in}{$p_1$}}}\vspace{-0.2in}% \end{center} \caption[a]{Skeleton of the mutual information for the ternary confusion channel.} \label{fig.skeleton} }% end marginpar These special cases allow us to construct the skeleton shown in \figref{fig.skeleton}. % below. } \soln{ex.Imaximizer}{ Necessary and sufficient conditions for $\bp$ to maximize $\I(X;Y)$ are \beq \left. 
\begin{array}{rclcc} \frac{\partial \I(X;Y)}{\partial p_i} & =& \lambda & \mbox{and} & p_i>0 \\[0.05in] \frac{\partial \I(X;Y)}{\partial p_i} & \leq & \lambda & \mbox{and} & p_i=0 \\ \end{array} \right\} \:\:\: \mbox{for all $i$}, \label{eq.IequalsC} \eeq where $\lambda$ is a constant related to the capacity by $C = \lambda + \log_2 e$. This result can be used in a computer program that evaluates the derivatives, and increments and decrements the probabilities $p_i$ in proportion to the differences between those derivatives. This result is also useful for lazy human capacity-finders who are good guessers. Having guessed the \optens, one can simply confirm that \eqref{eq.IequalsC} holds. } %\soln{ex.Iallused}{ % coming %} %\soln{ex.Iconvex}{ % Easy Proof, using \verb+(x,z)->y+. %} %\soln{ex.Imultiple}{ %% If there are several \optens, they all give the same %% output probability (theorem). This is a general proof that %% the `by symmetry' argument is valid. % coming %} %\soln{ex.Symmetricoptens}{ % This can be proved by the symmetry argument given in the chapter. % % Alternatively see p.94 of Gallager. %} \soln{ex.notSymmetric}{ We certainly expect nonsymmetric channels with uniform {\optens}s to exist, since when inventing a channel we have $I(J-1)$ degrees of freedom whereas the \optens\ is just $(I-1)$-dimensional; so in the $I(J\!-\!1)$-dimensional space of perturbations around a symmetric channel, we expect there to be a subspace of perturbations of dimension $I(J-1)-(I-1) = I(J-2)+1$ that leave the \optens\ unchanged. Here is an explicit example, a bit like a Z channel. \beq \bQ = \left[ \begin{array}{cccc} 0.9585 & 0.0415 & 0.35 & 0.0 \\ 0.0415 & 0.9585 & 0.0 & 0.35 \\ 0 & 0 & 0.65 & 0 \\ 0 & 0 & 0 & 0.65 \\ \end{array} \right] \eeq } % removed to cutsolutions.tex % \soln{ex.exam01}{ \soln{ex.wirelabelling}{ The labelling problem can be solved for any $N>2$ with just two trips, one each way across the Atlantic. 
The key step in the information-theoretic approach to this problem is to write down the information content of one {\dem\ind{partition}}, the combinatorial object that is the connecting together of subsets of wires. If $N$ wires are grouped together into $g_1$ subsets of size $1$, $g_2$ subsets of size $2$, $\ldots,$ % $g_r$ groups of size $r$ $\ldots,$ then the number of such partitions is \beq \Omega = \frac{ N! }{\displaystyle \prod_r \left( r! \right)^{g_r} g_r! } , \eeq and the information content of one such \ind{partition} is the $\log$ of this quantity. In a greedy strategy we choose the first partition to maximize this information content. One game we can play is to maximize this information content with respect to the quantities $g_r$, treated as real numbers, subject to the constraint $\sum_r g_r r = N$. Introducing a \ind{Lagrange multiplier} $\l$ for the constraint, the derivative is \beq \frac{ \partial }{\partial g_r} \left( \log \Omega + \l \sum_r g_r r \right) = - \log r! - \log g_r + \l r , \eeq which, when set to zero, leads to the rather nice expression \beq g_r = \frac{ e^{\l r} }{ r! } ; % \:\:(r \geq 1) \eeq the optimal $g_r$ is proportional to a \ind{Poisson distribution}\index{distribution!Poisson}! We can solve for the Lagrange multiplier by plugging $g_r$ into the constraint $\sum_r g_r r = N$, which gives the implicit equation \beq N = \mu \, e^{\mu}, \eeq where $\mu \equiv e^{\l}$ is a convenient reparameterization of the Lagrange multiplier. \Figref{fig.atlantic}a shows a graph of $\mu(N)$; \figref{fig.atlantic}b shows the deduced non-integer assignments $g_r$ when $\mu=2.2$, and nearby integers $g_r = \{1,2,2,1,1\}$ that motivate setting the first partition to (a)(bc)(de)(fgh)(ijk)(lmno)(pqrst). 
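
 A quick numerical check of these values may be helpful; this is a sketch, with
 all numbers rounded. With $\mu = 2.2$ the implicit equation gives
\[
	\mu \, e^{\mu} \simeq 2.2 \times 9.03 \simeq 19.9 ,
\]
 which is close to $N=20$, and the non-integer assignments
 $g_r = \mu^r / r!$ are
\[
	g_1 \simeq 2.20, \:\:\:
	g_2 \simeq 2.42, \:\:\:
	g_3 \simeq 1.77, \:\:\:
	g_4 \simeq 0.98, \:\:\:
	g_5 \simeq 0.43 .
\]
 The integers $\{1,2,2,1,1\}$ are not simple roundings of these numbers
 ($g_1$ is lowered and $g_5$ is raised); they are nearby values that
 satisfy the constraint exactly:
 $\sum_r g_r r = 1 \times 1 + 2 \times 2 + 3 \times 2 + 4 \times 1 + 5 \times 1 = 20$.
 Substituting these $g_r$ into the expression for $\Omega$ gives
\[
	\Omega = \frac{ 20! }{ (2!)^2 \, 2! \: (3!)^2 \, 2! \: 4! \: 5! }
	 = \frac{ 20! }{ 1\,658\,880 } ,
\]
 so that $\log_2 \Omega \simeq 61.08 - 20.66 \simeq 40.4\ubits$.
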
\marginfig{\footnotesize \begin{center}\hspace*{-0.2in} \begin{tabular}{r@{\hspace{0.2in}}l} (a)&\mbox{\psfig{figure=figs/atlanticmuN.ps,width=1.5in,angle=-90}}\\[0.2in] (b)&\mbox{\psfig{figure=figs/atlanticpoi.ps,width=1.5in,angle=-90}}\\ \end{tabular} \end{center} \caption[a]{Approximate solution of the \index{cable labelling}{cable-labelling} problem using Lagrange multipliers. (a) The parameter $\mu$ as a function of $N$; the value $\mu(20) = 2.2$ is highlighted. (b) Non-integer values of the function $g_r = \dfrac{ \mu^{r} }{ r! }$ are shown by lines and integer values of $g_r$ motivated by those non-integer values are shown by crosses. } \label{fig.atlantic} } This partition produces a random partition at the other end, which has an information content of $\log \Omega =40.4\ubits$, % pr log(20!*1.0/( (2!)**2 * 2 * (3!)**2 * 2 * (4!) * (5!) ) )/log(2.0) % pr log(20!*1.0/( (2!)**10 * 10! ))/log(2.0) which is a lot more than half the total information content we need to acquire to infer the transatlantic permutation, $\log 20! \simeq 61\ubits$. [In contrast, if all the wires are joined together in pairs, the information content generated is only about 29$\ubits$.] How to choose the second partition is left to the reader. A Shannonesque approach is appropriate, picking a random partition at the other end, using the same $\{g_r\}$; you need to ensure the two partitions are as unlike each other as possible. If $N \neq 2$, 5 or 9, then the labelling problem has solutions that are particularly simple to implement, called \ind{Knowlton--Graham partitions}: partition $\{1,\ldots,N\}$ into disjoint sets in two ways $A$ % $A_1,\ldots,A_p$ and and $B$, % $B_1,\ldots,B_q$, subject to the condition that at most one element appears both in an $A$~set of cardinality~$j$ and in a $B$~set of cardinality~$k$, for each $j$ and~$k$ \cite{Graham66,GrahamKnowlton68}.\index{Graham, Ronald L.} % (R. L. 
Graham, ``On partitions of a finite set,'' {\sl Journal of Combinatorial % Theory\/ \bf 1} (1966), 215--223;\index{Graham, Ronald L.} % Ronald L. Graham and Kenneth C. Knowlton, ``Method of identifying conductors in % a cable by establishing conductor connection groupings at both ends of the % cable,'' U.S. Patent 3,369,177 (13~Feb 1968).) } %%%%%%%%%%%%%%%%%%%%%%%% % % end of chapter % %%%%%%%%%%%%%%%%%%%%%%%% % \dvipsb{solutions noisy channel s6} % % % CHAPTER 12 (formerly 7) % \prechapter{About Chapter} \fakesection{prerequisites for chapter 7} Before reading \chref{ch.ecc}, you should have read Chapters \ref{ch.five} and \ref{ch.six}. You will also need to be familiar with the {\dem\inds{Gaussian distribution}}. \label{sec.gaussian.props} \begin{description} \item[One-dimensional Gaussian distribution\puncspace] If a random variable $y$ is Gaussian and has mean $\mu$ and variance $\sigma^2$, which we write: \beq y \sim \Normal(\mu,\sigma^2) ,\mbox{ or } P(y) = \Normal(y;\mu,\sigma^2) , \eeq then the distribution of $y$ is: % a Gaussian distribution: \beq P(y\given \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - ( y - \mu )^2 / 2 \sigma^2 \right] . \eeq [I use the symbol $P$ for both probability densities and probabilities.] The inverse-variance $\tau \equiv \dfrac{1}{\sigma^2}$ is sometimes called the {\dem\inds{precision}\/} of the Gaussian distribution. \item[Multi-dimensional Gaussian distribution\puncspace] If $\by = (y_1,y_2,\ldots,y_N)$ has a \ind{multivariate Gaussian} {distribution}, then \beq P( \by \given \bx, \bA ) = \frac{1}{Z(\bA)} \exp \left( - \frac{1}{2} (\by -\bx)^{\T} \bA (\by -\bx) \right) , \eeq where $\bx$ is the mean of the distribution, $\bA$ is the inverse of the \ind{variance--covariance matrix}\index{covariance matrix}, and the normalizing constant is ${Z(\bA)} = \left( { {\det}\! \left( \linefrac{\bA}{2 \pi} \right) } \right)^{-1/2}$. 
This distribution has the property that the variance $\Sigma_{ii}$ of $y_i$, and the covariance $\Sigma_{ij}$ of $y_i$ and $y_j$ are given by \beq \Sigma_{ij} \equiv \Exp \left[ ( y_i - \bar{y}_i ) ( y_j - \bar{y}_j ) \right] = A^{-1}_{ij} , \eeq where $\bA^{-1}$ is the inverse of the matrix $\bA$. The marginal distribution $P(y_i)$ of one component $y_i$ is Gaussian; the joint marginal distribution of any subset of the components is multivariate-Gaussian; and the conditional density of any subset, given the values of another subset, for example, $P(y_i\given y_j)$, is also Gaussian. \end{description} %\chapter{Error correcting codes \& real channels} % ampersand used to keep the title on one line on the chapter's opening page \ENDprechapter \chapter[Error-Correcting Codes and Real Channels]{Error-Correcting Codes \& Real Channels} \label{ch.ecc}\label{ch7} % % : l7.tex -- was l78.tex % \setcounter{chapter}{6}% set to previous value % \setcounter{page}{70} % set to current value % \setcounter{exercise_number}{89} % set to imminent value % % % \chapter{Error correcting codes \& real channels} % \label{ch7} The noisy-channel coding theorem that we have proved shows that there exist reliable % `very good' error-correcting codes for any noisy channel. In this chapter we address two questions. First, many practical channels have real, rather than discrete, inputs and outputs. What can Shannon tell us about these continuous channels? And how should digital signals be mapped into analogue waveforms, and {\em vice versa}? Second, how are practical error-correcting codes made, and what is achieved in practice, relative to the possibilities proved by Shannon? \section{The Gaussian channel} The most popular model of a real-input, real-output channel is the \inds{Gaussian channel}.\index{channel!Gaussian} \begin{description} \item[The Gaussian channel] has a real input $x$ and a real output $y$. 
The conditional distribution of $y$ given $x$ is a Gaussian distribution: \beq P(y\given x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - ( y - x )^2 / 2 \sigma^2 \right] . \label{eq.gaussian.channel.def} \eeq % This channel has a continuous input and output but is discrete in time. We will show below that certain continuous-time channels are equivalent to the discrete-time Gaussian channel. This channel is sometimes called the additive white Gaussian noise (AWGN) channel.\index{channel!AWGN}\index{channel!Gaussian}\index{AWGN} \end{description} % Why is this a useful channel model? And w As with discrete channels, we will discuss what rate of error-free information communication can be achieved over this channel. \subsection{Motivation % for the Gaussian channel in terms of a continuous-time channel \nonexaminable} Consider a physical (electrical, say) channel with inputs and outputs that are continuous in time. We put in $x(t)$, % which is a %% some sort of % band-limited signal, and out comes $y(t) = x(t) + n(t)$. Our transmission has a power cost. The average power of a transmission of length $T$ may be constrained thus: \beq \int_0^T \d t \: [x(t)]^2 / T \leq P . \eeq The received signal is assumed to differ from $x(t)$ by additive noise $n(t)$ (for example \ind{Johnson noise}), which we will model as white\index{white noise}\index{noise!white} Gaussian noise. The magnitude of this noise is quantified by the {\dem noise spectral density}, $N_0$.\index{noise!spectral density}\index{E$_{\rm b}/N_0$}\index{signal-to-noise ratio} % , which might depend on the effective temperature of the system. How could such a channel be used to communicate information? 
\amarginfig{t}{ \begin{tabular}{r@{}l} $\phi_1(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi1.ps,angle=-90,width=1in}}\\ $\phi_2(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi2.ps,angle=-90,width=1in}}\\ $\phi_3(t)$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/phi3.ps,angle=-90,width=1in}}\\ $x(t)$ &\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/xt.ps,angle=-90,width=1in}}\\ \end{tabular} % \caption[a]{Three basis functions, and a weighted combination of them, $ x(t) = \sum_{n=1}^N x_n \phi_n(t) , $ with $x_1 \eq 0.4$, $x_2 \eq -0.2$, and $x_3 \eq 0.1$. % see figs/realchannel.gnu } \label{fig.continuousfunctionexample} } Consider transmitting a set of $N$ real numbers $\{ x_n \}_{n=1}^N$ in a signal of duration $T$ made up of a weighted combination of orthonormal basis functions $\phi_n(t)$, \beq x(t) = \sum_{n=1}^N x_n \phi_n(t) , \eeq where $\int_0^T \: \d t \: \phi_n(t) \phi_m(t) = \delta_{nm}$. The receiver can then compute the scalars: \beqan y_n \:\: \equiv \:\: \int_0^T \: \d t \: \phi_n(t) y(t) &=& x_n + \int_0^T \: \d t \: \phi_n(t) n(t) \\ &\equiv& x_n + n_n \eeqan for $n=1 \ldots N$. If there were no noise, then $y_n$ would equal $x_n$. The white Gaussian noise $n(t)$ adds scalar noise $n_n$ to the estimate $y_n$. This noise is Gaussian: \beq n_n \sim \Normal(0,N_0/2), \eeq where $N_0$ is the spectral density introduced above. % [This is the definition of $N_0$.] Thus a continuous channel used in this way is equivalent to the Gaussian channel defined at \eqref{eq.gaussian.channel.def}. The power constraint $\int_0^T \d t \, [x(t)]^2 \leq P T$ defines a constraint on the signal amplitudes $x_n$, \beq \sum_n x_n^2 \leq PT \hspace{0.5in} \Rightarrow \hspace{0.5in} \overline{x_n^2} \leq \frac{PT}{N} . 
\eeq Before returning to the Gaussian channel, we define the {\dbf\ind{bandwidth}} (measured in \ind{Hertz}) of the \ind{continuous channel}\index{channel!continuous} to be: \beq W = \frac{N^{\max}}{2 T}, \eeq where $N^{\max}$ is the maximum number of orthonormal functions that can be produced in an interval of length $T$. This definition can be motivated by imagining creating a \ind{band-limited signal} of duration $T$ from orthonormal cosine and sine curves of maximum frequency $W$. The number of orthonormal functions is $N^{\max} = 2 W T$. This definition relates to the \ind{Nyquist sampling theorem}: if the highest frequency present in a signal is $W$, then the signal can be fully determined from its values at a series of discrete sample points separated by the Nyquist interval $\Delta t = \dfrac{1}{2W}$ seconds. So the use of a real continuous channel with bandwidth $W$, noise spectral density $N_0$, and power $P$ is equivalent to $N/T = 2 W$ uses per second of a Gaussian channel with noise level $\sigma^2 = N_0/2$ and subject to the signal power constraint $\overline{x_n^2} \leq\dfrac{P}{2W}$. \subsection{Definition of $E_{\rm b}/N_0$\nonexaminable} Imagine\index{E$_{\rm b}/N_0$} %\index{signal-to-noise ratio} that the Gaussian channel $y_n = x_n + n_n$ is used {with % an % error-correcting code an encoding system} to transmit {\em binary\/} source bits at a rate of $R$ bits per channel use. % , where a rate of 1 corresponds to the uncoded case. How can we compare two encoding systems that have different rates of \ind{communication} $R$ and that use different powers $\overline{x_n^2}$? Transmitting at a large rate $R$ is good; using small power is good too. It is conventional to measure the rate-compensated \ind{signal-to-noise ratio} % \marginpar{\footnotesize{I'm using signal to noise ratio in two different ways. Elsewhere it is defined to be $\frac{\overline{x_n^2}}{\sigma^2}$. 
Should I modify this phrase?}} by the ratio of the power per source bit $E_{\rm b} = \overline{x_n^2}/R$ to the noise spectral density $N_0$:\marginpar[t]{\small\raggedright\reducedlead {$E_{\rm b}/N_0$ is dimensionless, but it is usually reported in the units of \ind{decibel}s; the value given is $10 \log_{10} E_{\rm b}/N_0$.}} \beq E_{\rm b}/N_0 = \frac{\overline{x_n^2}}{2 \sigma^2 R} . \eeq % This signal-to-noise measure equates low rate, low power % cf ebno.p % The difference in $E_{\rm b}/N_0$ is one of the measures used to compare coding schemes for Gaussian channels. \section{Inferring the input to a real channel} \subsection{`The best detection of pulses'} \label{sec.pulse} In 1944 Shannon wrote a memorandum \cite{shannon44} on the problem of best differentiating between two types of pulses of known shape, represented by vectors $\bx_0$ and $\bx_1$, given that one of them has been transmitted over a noisy channel. This is a \ind{pattern recognition} problem.% \amarginfig{t}{ \begin{tabular}{r@{}l} $\bx_0$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/x0.ps,angle=-90,width=1in}}\\ $\bx_1$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/x1.ps,angle=-90,width=1in}}\\ $\by$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/xn1.ps,angle=-90,width=1in}}\\ \end{tabular} % \caption[a]{Two pulses $\bx_0$ and $\bx_1$, represented as 31-dimensional vectors, and a noisy version of one of them, $\by$. % see figs/realchannel.gnu } \label{fig.detectionofpulses} } It is assumed that the noise is Gaussian with probability density \beq P( \bn ) = \left[ {\det}\left( \frac{\bA}{2 \pi} \right) \right]^{1/2} \exp \left( - \frac{1}{2} \bn^{\T} \bA \bn \right) , \eeq where $\bA$ is the inverse of the variance--covariance matrix of the noise, a symmetric and positive-definite matrix. 
(If $\bA$ is a multiple of the identity matrix, $\bI/\sigma^2$, then the noise is `white'.\index{noise!white}\index{white noise} For more general $\bA$, the noise is \index{noise!coloured}\index{coloured noise}`{coloured}'.) The probability of the received vector $\by$ given that the source signal was $s$ (either zero or one) is then \beq P( \by \given s ) = \left[ { {\det} \left( \frac{\bA}{2 \pi} \right) }\right]^{1/2} \exp \left( - \frac{1}{2} (\by -\bx_s)^{\T} \bA (\by -\bx_s) \right) . \eeq The optimal detector is based on the posterior probability ratio: \beqan \hspace{-0.6cm} \lefteqn{\frac{ P( s \eq 1\given \by )}{P(s \eq 0\given \by )} = \frac{ P( \by \given s \eq 1 ) }{ P( \by \given s \eq 0)} \frac{ P( s \eq 1 )}{P(s \eq 0 )} } \\ &=& \exp \left( - \frac{1}{2} (\by -\bx_1)^{\T} \bA (\by -\bx_1) + \frac{1}{2} (\by -\bx_0)^{\T} \bA (\by -\bx_0) + \ln \frac{ P( s \eq 1 )}{P(s \eq 0 )} \right) \nonumber \\ &=& \exp \left( \by^{\T} \bA ( \bx_1 -\bx_0) + \theta \right), \eeqan where $\theta$ is a constant independent of the received vector $\by$, \beq \theta = - \frac{1}{2} \bx_1^{\T} \bA \bx_1 + \frac{1}{2} \bx_0^{\T} \bA \bx_0 + \ln \frac{ P( s \eq 1 )}{P(s \eq 0 )} . \eeq If the detector is forced to make a decision (\ie, guess either $s \eq 1$ or $s \eq 0$) then the decision that minimizes the probability of error is to guess the most probable hypothesis. We can write the optimal decision in terms of a {\dem\ind{discriminant function}}: \beq a(\by) \equiv \by^{\T} \bA ( \bx_1 -\bx_0) + \theta \eeq with the decisions \marginfig{ \begin{tabular}{r@{}l} $\bw$&\raisebox{-0.8cm}{\psfig{figure=figs/realchannel/w.ps,angle=-90,width=1in}}\\ \end{tabular} % \caption[a]{The weight vector $\bw \propto \bx_1 -\bx_0$ that is used to discriminate between $\bx_0$ and $\bx_1$. 
% see figs/realchannel.gnu } \label{fig.detectionofpulses.w} } \beq \begin{array}{ccl} a(\by) > 0& \rightarrow & \mbox{guess $s \eq 1$} \\ a(\by) < 0& \rightarrow & \mbox{guess $s \eq 0$} \\ a(\by)=0 & \rightarrow & \mbox{guess either.} \end{array} \eeq Notice % It should be noted that $a(\by)$ is a linear function of the received vector, \beq a(\by) = \bw^{\T} \by + \theta , \eeq where $\bw \equiv \bA ( \bx_1 -\bx_0)$. \section{Capacity of Gaussian channel} \label{sec.entropy.continuous} Until now we have measured the joint, marginal, and conditional entropy of discrete variables only. In order to define the information conveyed by continuous variables, there are two issues we must address -- the infinite length of the real line, and the infinite precision of real numbers. \subsection{Infinite inputs} How much information can we convey in one use of a Gaussian channel? If we are allowed to put {\em any\/} real number $x$ into the Gaussian channel, we could communicate an enormous string of $N$ digits $d_1d_2d_3\ldots d_N$ by setting $x = d_1d_2d_3\ldots d_N 000\ldots 000$. The amount of error-free information conveyed in just a single transmission could be made arbitrarily large by increasing $N$, and the communication could be made arbitrarily reliable by increasing the number of zeroes at the end of $x$. There is usually some \ind{power cost} associated with large inputs, however, not to mention practical limits in the dynamic range acceptable to a receiver. It is therefore conventional to introduce a {\dem\ind{cost function}\/} $v(x)$ for every input $x$, and constrain codes to have an average cost $\bar{v}$ less than or equal to some maximum value. % a maximum average cost $\bar{v}$. A generalized channel coding theorem, including a cost function for the inputs, can be proved % for the discrete channels discussed previously -- see McEliece (1977).\nocite{McEliece77} The result is a channel capacity $C(\bar{v})$ that is a function of the permitted cost. 
For the Gaussian channel we will assume a cost \beq v(x) = x^2 \eeq such that the `average power' $\overline{x^2}$ of the input is constrained. We motivated this cost function above in the case of real electrical channels in which the physical power consumption is indeed quadratic in $x$. The constraint $\overline{x^2}=\bar{v}$ makes it impossible to communicate infinite information in one use of the Gaussian channel. \subsection{Infinite precision} \amarginfig{b}{ {\footnotesize\setlength{\unitlength}{1mm} \begin{tabular}{lc} (a)&{\psfig{figure=gnu/grainI.ps,angle=-90,width=1.3in}}\\ (b)&\makebox[0in]{\hspace*{4mm}\begin{picture}(20,10)% \put(17.65,6){\vector(1,0){1.42}} \put(17.65,6){\vector(-1,0){1.42}} \put(17.5,8){\makebox(0,0){$g$}} % \end{picture}}% {\psfig{figure=gnu/grain10.ps,angle=-90,width=1.3in}}\\ &{\psfig{figure=gnu/grain18.ps,angle=-90,width=1.3in}}\\ &{\psfig{figure=gnu/grain34.ps,angle=-90,width=1.3in}}\\ & $\vdots$ \\ \end{tabular} } % \caption[a]{(a) A probability density $P(x)$. {\sf Question:} can we define the `entropy' of this density? (b) We could evaluate the entropies of a sequence of probability distributions with decreasing grain-size $g$, but these entropies tend to $\displaystyle \int P(x) \log \frac{1}{ P(x) g } \, \d x$, which is not independent of $g$: % increases as $g$ decreases: the entropy goes up by one bit for every halving of $g$. $\displaystyle \int P(x) \log \frac{1}{ P(x) } \, \d x$ is an\index{sermon!illegal integral} % \\ \hspace illegal integral.} % see gnu/grain.gnu \label{fig.grain} } It is tempting to define joint, marginal, and conditional entropies\index{entropy!of continuous variable}\index{grain size} for real variables simply by replacing summations by integrals, but this is not a well defined operation. As we discretize an interval into smaller and smaller divisions, the entropy of the discrete distribution diverges (as the logarithm of the granularity) (\figref{fig.grain}). 
Also, it is not permissible to take the logarithm of a dimensional quantity such as a probability density $P(x)$ (whose dimensions are $[x]^{-1}$).\index{sermon!dimensions}\index{dimensions} There is one information measure, however, that has a well-behaved limit, namely the mutual information -- and this is the one that really matters, since it measures how much information one variable conveys about another. In the discrete case, \beq \I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)} . \eeq Now because the argument of the log is a ratio of two probabilities over the same space, it is OK to have $P(x,y)$, $P(x)$ and $P(y)$ be probability densities % (as long as they are not pathological) % densities) and replace the sum by an integral: \beqan \I(X;Y)& =& \int \! \d x \: \d y \: P(x,y) \log \frac{P(x,y)}{P(x)P(y)} \\ &=& \int \! \d x \: \d y \: P(x)P(y\given x) \log \frac{P(y\given x)}{P(y)} . \eeqan We can now ask these questions for the Gaussian channel: (a) what probability distribution $P(x)$ maximizes the mutual information (subject to the constraint $\overline{x^2}={v}$)? and (b) does the maximal mutual information still measure the maximum error-free communication rate of this real channel, as it did for the discrete channel? \exercissxD{3}{ex.gcoptens}{ Prove that the probability distribution $P(x)$ that maximizes the mutual information (subject to the constraint $\overline{x^2}={v}$) is a Gaussian distribution of mean zero and variance $v$. } % solution is in tex/sol_gc.tex \exercissxB{2}{ex.gcC}{ % Show that the mutual information $\I(X;Y)$, in the case of this optimized distribution, is \beq C = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right) . \eeq } This is an important result. We see that the capacity of the Gaussian channel is a function of the {\dem signal-to-noise ratio} $v/\sigma^2$. 
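To get a feel for the scale of this result, here is a small worked example. The signal-to-noise values are illustrative choices, not taken from any particular system, and logarithms are taken to base 2 so that $C$ is measured in bits per channel use. At $v/\sigma^2 = 1$,
\beq
	C = \frac{1}{2} \log_2 \left( 1 + 1 \right) = 0.5 \mbox{ bits},
\eeq
while at $v/\sigma^2 = 15$,
\beq
	C = \frac{1}{2} \log_2 \left( 1 + 15 \right) = 2 \mbox{ bits}.
\eeq
At high signal-to-noise ratio, each quadrupling of $v/\sigma^2$ buys roughly one extra bit per channel use.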
\subsection{Inferences given a Gaussian input distribution} If $ P(x) = \Normal(x;0,v) \mbox{ and } P(y\given x) = \Normal(y;x,\sigma^2) $ then the marginal distribution of $y$ is $ P(y) = \Normal(y;0,v\!+\!\sigma^2) $ and the posterior distribution of the input, given that the output is $y$, is: \beqan P(x\given y) &\!\!\propto\!\!& P(y\given x)P(x) \\ &\!\!\propto\!\!& \exp( -(y-x)^2/2 \sigma^2) \exp( -x^2/2 v) \label{eq.two.gaussians} \\ &\!\! =\!\! & \Normal\left( x ; \frac{ v}{v+\sigma^2} \, y \, , \, \left({\frac{1}{v}+\frac{1}{\sigma^2}}\right)^{\! -1} \right) . \label{eq.infer.mean.gaussian} \eeqan % % label this bit for reference when we get to Gaussian land [The step from (\ref{eq.two.gaussians}) to (\ref{eq.infer.mean.gaussian}) is made by completing the square in the exponent.] This \label{sec.infer.mean.gaussian} formula deserves careful study. The mean of the posterior distribution, $\frac{ v}{v+\sigma^2} \, y $, can be viewed as a weighted combination of the value that best fits the output, $x=y$, and the value that best fits the prior, $x=0$: \beq \frac{ v}{v+\sigma^2} \, y = \frac{1/\sigma^2 }{1/v+1/\sigma^2} \, y + \frac{1/v}{1/v+1/\sigma^2} \, 0 . \eeq The weights $1/\sigma^2$ and $1/v$ are the {\dem\ind{precision}s\/} % parameters' of the two Gaussians that we multiplied together in \eqref{eq.two.gaussians}: the prior and the likelihood. %-- the probability of the output given the input, % and the prior probability of the input. The precision of the posterior distribution is the sum of these two precisions. This is a general property: whenever two independent sources contribute information, via Gaussian distributions, about an unknown variable, the\index{precisions add} precisions add. [This is the dual to the better-known relationship `when independent variables are added, their variances add'.]\index{variances add} % inverse-variances add to define the inverse-variance of the % posterior distribution. 
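As a concrete check of \eqref{eq.infer.mean.gaussian}, consider the illustrative values $v = 4$, $\sigma^2 = 1$ and $y = 5$ (chosen here only for convenience). The posterior mean and variance are
\beqan
	\frac{v}{v+\sigma^2} \, y &=& \frac{4}{5} \times 5 = 4 , \\
	\left( \frac{1}{v} + \frac{1}{\sigma^2} \right)^{\! -1} &=& \left( \frac{1}{4} + 1 \right)^{\! -1} = 0.8 .
\eeqan
The posterior precision, $1.25$, is the sum of the prior precision $0.25$ and the likelihood precision $1$, and the posterior mean is pulled from $y=5$ towards the prior mean $0$.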
\subsection{Noisy-channel coding theorem for the Gaussian channel} We\index{noisy-channel coding theorem!Gaussian channel} have evaluated a maximal mutual information. Does it correspond to a maximum possible rate of error-free information transmission? One way of proving that this is so is to define a sequence of discrete channels, all derived from the Gaussian channel, with increasing numbers of inputs and outputs, and prove that the maximum mutual information of these channels tends to the asserted $C$. The noisy-channel coding theorem for discrete channels applies to each of these derived channels, thus we obtain a coding theorem for the continuous channel. % coding theorem is then proved. % (with discrete inputs and % discrete outputs) by chopping the output into bins and using a % finite set of inputs, and then defining a sequence of such channels with % increasing numbers of inputs and outputs. A proof that the maximum % mutual information % of these channels tends to $C$ then completes the job, as we have already % proved the noisy channel coding theorem for discrete channels. % % A more intuitive argument for the coding theorem may be preferred. Alternatively, we can make an intuitive argument for the coding theorem specific for the Gaussian channel. \subsection{Geometrical view of the noisy-channel coding theorem: sphere packing} \index{sphere packing}Consider a sequence $\bx = (x_1,\ldots, x_N)$ of inputs, and the corresponding output $\by$, as defining two points in an $N$ dimensional space. For large $N$, the noise power is very likely to be close (fractionally) to $N \sigma^2$. The output $\by$ is therefore very likely to be close to the surface of a sphere of radius $\sqrt{ N \sigma^2}$ centred on $\bx$. 
Similarly, if the original signal $\bx$ is generated at random subject to an average power constraint $\overline{x^2} = v$, then $\bx$ is likely to lie close to a sphere, centred on the origin, of radius $\sqrt{N v}$; and because the total average power of $\by$ is $v+\sigma^2$, the received signal $\by$ is likely to lie close to the surface of a sphere of radius $\sqrt{N (v+\sigma^2)}$, centred on the origin. The volume of an $N$-dimensional sphere of radius $r$ is
%
% this also appeared in _s1.tex
%
\beq
\textstyle
	V(r,N) = \smallfrac{ \pi^{N/2} }{ \Gamma( N/2 + 1 ) } r^N .
\eeq
Now consider making a communication system based on non-confusable inputs $\bx$, that is, inputs whose spheres do not overlap significantly. The maximum number $S$ of non-confusable inputs is bounded by the ratio of the volume of the sphere of probable $\by$s to the volume of the sphere for $\by$ given $\bx$:
%
% An upper bound for the number $S$ of non-confusable inputs is:
\beq
	S \leq \left( \frac{ \sqrt{N (v+\sigma^2)} }{ \sqrt{ N \sigma^2} } \right)^{\! N} .
\eeq
Thus the capacity is bounded by:\index{capacity!Gaussian channel}
\beq
	C = \frac{1}{N} \log S \leq \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right) .
\eeq
A more detailed argument
% using the law of large numbers
like the one used in the previous chapter can establish equality.

\subsection{Back to the continuous channel}
Recall that the use of a real continuous channel with bandwidth $W$, noise spectral density $N_0$ and power $P$ is equivalent to $N/T = 2 W$ uses per second of a Gaussian channel with $\sigma^2 = N_0/2$ and subject to the constraint $\overline{x_n^2} \leq P/2W$. Substituting the result for the capacity of the Gaussian channel, we find the capacity of the continuous channel to be:
\beq
	C = W \log \left( 1 + \frac{P}{N_0 W} \right) \: \mbox{ bits per second.}
\eeq
This formula gives insight into the tradeoffs of practical \ind{communication}. Imagine that we have a fixed power constraint.
What is the best \ind{bandwidth} to make use of that power? Introducing $W_0=P/N_0$, \ie, the bandwidth for which the signal-to-noise ratio is 1, figure \ref{fig.wideband} shows $C/W_0 = W/W_0 \log \! \left( 1 + W_0/W \right)$ as a function of $W/W_0$. The capacity increases to an asymptote of $W_0 \log e$. It is dramatically better (in terms of capacity for fixed power) to transmit at a low signal-to-noise ratio over a large bandwidth, than with high signal-to-noise in a narrow bandwidth; this is one motivation for wideband communication methods such as the `direct sequence spread-spectrum'\index{spread spectrum} approach used in {3G} \ind{mobile phone}s. Of course, you are not alone, and your electromagnetic neighbours may not be pleased if you use a large bandwidth, so for social reasons, engineers often have to make do with higher-power, narrow-bandwidth transmitters. %\begin{figure} %\figuremargin{% \marginfig{ % figs: load 'wideband.com' \begin{center} \mbox{\psfig{figure=figs/wideband.ps,% width=1.75in,angle=-90}} \end{center} %}{% \caption[a]{Capacity versus bandwidth for a real channel: $C/W_0 = W/W_0 \log \left( 1 + W_0/W \right)$ as a function of $W/W_0$.} \label{fig.wideband} }% %\end{figure} \section{What are the capabilities of practical error-correcting codes?\nonexaminable} \label{sec.bad.code.def}% see also {sec.good.codes}! % cf also \ref{sec.bad.dist.def} % in _linear.tex % Description of Established Codes} % Nearly all codes are good, but nearly all codes require exponential look-up tables for practical implementation of the encoder and decoder -- exponential in the blocklength $N$. And the coding theorem required $N$ to be large. By a {\dem\ind{practical}\/} error-correcting code, we mean one that can be encoded and decoded in a reasonable amount of time, for example, a time that scales as a polynomial function of the blocklength $N$ -- preferably linearly. 
\subsection{The Shannon limit is not achieved in practice} The non-constructive proof of the noisy-channel coding theorem showed that good block codes exist for any noisy channel, and indeed that nearly all block codes are good. But writing down an explicit and {practical\/} encoder and decoder that are as good as promised by Shannon is still an unsolved problem. % Most of the explicit families of codes that have been written down have the % property that they can achieve a vanishing error probability $p_{\rm b}$ % as $N \rightarrow \infty$ only if the rate $R$ also goes to zero. % % There is one exception to this statement: % , given by a family of codes based on % {\dbf concatentation}. \label{sec.good.codes} \begin{description} \item[Very good codes\puncspace] Given a channel, a family of block\index{error-correcting code!very good} codes that achieve arbitrarily small probability of error at any communication rate up to the capacity of the channel are called `very good' codes for that channel. \item[Good codes] are code families that achieve arbitrarily small probability of error at non-zero communication rates up to some maximum rate that may be {\em less than\/} the \ind{capacity} of the given channel.\index{error-correcting code!good} \item[Bad codes] are code families that cannot achieve arbitrarily small probability of error, or that can achieve arbitrarily small probability of error\index{error-correcting code!bad} % $\epsilon$ `bad' only by decreasing the information rate % $R$ to zero. Repetition codes\index{error-correcting code!repetition}\index{repetition code}% \index{error-correcting code!bad} are an example of a bad code family. (Bad codes are not necessarily useless for practical purposes.) \item[Practical codes] are code families that can be\index{error-correcting code!practical} encoded and decoded in time and space polynomial in the blocklength. 
\end{description} \subsection{Most established codes are linear codes} Let us review the definition of a block code, and then add the definition of a linear block code.\index{error-correcting code!block code}\index{error-correcting code!linear}\index{linear block code} \begin{description} \item[An $(N,K)$ block code] for a channel $Q$ is a list of $\cwM=2^K$ codewords $\{ \bx^{(1)}, \bx^{(2)}, \ldots, \bx^{({2^K)}} \}$, each of length $N$: $\bx^{(\cwm)} \in \A_X^N$. The signal to be encoded, $\cwm$, which comes from an alphabet of size $2^K$, is encoded as $\bx^{(\cwm)}$. % The {\dbf\ind{rate}} of the code\index{error-correcting code!rate} is $R = K/N$ bits. % % [This definition holds for any channels, not only binary channels.] \item[A linear $(N,K)$ block code] is a block code in which the codewords $\{ \bx^{(\cwm)} \}$ make up a $K$-dimensional subspace of $\A_X^N$. The encoding operation can be represented by an $N \times K$ binary matrix\index{generator matrix} $\bG^{\T}$ such that if the signal to be encoded, in binary notation, is $\bs$ (a vector of length $K$ bits), then the encoded signal is $\bt = \bG^{\T} \bs \mbox{ modulo } 2$. The codewords $\{ \bt \}$ can be defined as the set of vectors satisfying $\bH \bt = {\bf 0} \mod 2$, where $\bH$ is the {\dem\ind{parity-check matrix}\/} of the code. \end{description} \marginpar[c]{\[%beq \bG^{\T} = {\small \left[ \begin{array}{@{\,}*{4}{c@{\,}}} 1 & \cdot & \cdot & \cdot \\[-0.05in] \cdot & 1 & \cdot & \cdot \\[-0.05in] \cdot & \cdot & 1 & \cdot \\[-0.05in] \cdot & \cdot & \cdot & 1 \\[-0.05in] 1 & 1 & 1 & \cdot \\[-0.05in] \cdot & 1 & 1 & 1 \\[-0.05in] 1 & \cdot & 1 & 1 \end{array} \right] } % nb different from l1.tex, no longer \]%eeq } For example the $(7,4)$ \ind{Hamming code} of section \ref{sec.ham74} takes $K=4$ signal bits, $\bs$, and transmits them followed by three parity-check bits. The $N=7$ transmitted symbols are given by $\bG^{\T} \bs \mod 2$. 
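As a worked illustration (the source block here is an arbitrary choice made for this example), take $\bs = (1,0,1,1)^{\T}$. The first four rows of $\bG^{\T}$ copy the source bits, and the last three rows give the parity bits:
\beq
\begin{array}{rcl}
	t_5 &=& s_1 + s_2 + s_3 = 1 + 0 + 1 = 0 \mod 2 \\
	t_6 &=& s_2 + s_3 + s_4 = 0 + 1 + 1 = 0 \mod 2 \\
	t_7 &=& s_1 + s_3 + s_4 = 1 + 1 + 1 = 1 \mod 2 ,
\end{array}
\eeq
so the transmitted codeword is $\bt = (1,0,1,1,0,0,1)^{\T}$.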
% , where:

Coding theory was born with the work of Hamming, who invented a family of practical error-correcting codes, each able to correct one error in a block of length $N$, of which the repetition code $R_3$ and the $(7,4)$ code are the simplest. Since then most established codes have been generalizations of Hamming's codes:
% `BCH' (Bose, Chaudhuri and Hocquenghem)
Bose--Chaudhuri--Hocquenghem
% The search for decodeable codes has produced the following families.
codes, Reed--Muller codes, Reed--Solomon codes, and Goppa codes, to name a few.

\subsection{Convolutional codes}
Another family of linear codes are {\dem\ind{convolutional code}s}, which do not divide the source stream into blocks, but instead read and\index{error-correcting code!convolutional} transmit bits continuously. The transmitted bits are a linear function of the past source bits.
% both bits and parity checks in some fixed proportion.
Usually the rule for generating the transmitted bits
% parity checks
involves feeding the present source bit into a \lfsr\index{linear-feedback shift-register} of length $k$, and transmitting one or more linear functions of the state of the shift register at each iteration. The resulting transmitted bit stream is
%can be thought of as
the convolution of the source stream with a linear filter. The impulse-response function of this filter may have finite or infinite duration, depending on the choice of feedback shift-register.
% it is

We will discuss convolutional codes in \chapterref{ch.convol}.

\subsection{Are linear codes `good'?}
One might ask: is the Shannon limit not achieved in practice because linear codes are inherently not\index{error-correcting code!linear}\index{error-correcting code!good}\index{error-correcting code!random} as good as random codes?\index{random code} The answer is no: the noisy-channel coding theorem can still be proved for linear codes, at least for some channels (see \chapterref{ch.linear.good}), though the proofs, like Shannon's proof for random codes, are non-constructive.
%(We will prove that
% there exist linear codes that are very good codes
% in chapter \ref{ch.linear.good}.
% and in particular for `cyclic codes',
% a class to which BCH and Reed--Solomon codes belong.

Linear codes are easy to implement at the encoding end. Is decoding a linear code also easy? Not necessarily. The general decoding problem\index{error-correcting code!decoding}\index{linear block code!decoding} (find the maximum likelihood $\bs$ in the equation $\bG^{\T} \bs + \bn = \br$) is in fact \inds{NP-complete} \cite{BMT78}. [NP-complete problems are computational problems that are all equally difficult and which are widely believed to require exponential computer time to solve in general.] So attention focuses on families of codes
% (such as those listed above)
for which there is a fast decoding algorithm.
\subsection{Concatenation} One trick for building codes with practical decoders is the idea of {concatenation}.\index{error-correcting code!concatenated}\index{concatenation!error-correcting codes} An\amarginfignocaption{t}{ \begin{center} \setlength{\unitlength}{1mm} \begin{picture}(25,10)% \put(17.5,8){\makebox(0,0){$\C' \rightarrow \underbrace{\C \rightarrow Q \rightarrow \D} \rightarrow \D'$}} \put(17.5,3){\makebox(0,0){$Q'$}} % \end{picture}% \end{center} %\caption[a]{none} } encoder--channel--decoder system $\C \rightarrow Q \rightarrow \D$ can be viewed as defining a \ind{super-channel} $Q'$ with a smaller probability of error, and with complex\index{channel!complex} correlations among its errors. We can create an encoder $\C'$ and decoder $\D'$ for this super-channel $Q'$. The code consisting of the outer code $\C'$ followed by the inner code $\C$ is known as a {\dem{concatenated code}}.\index{concatenation!error-correcting codes} Some concatenated codes make use of the idea of {\dbf \ind{interleaving}}. We read % Interleaving involves encoding the data in blocks, the size of each block being larger than the blocklengths of the constituent codes $\C$ and $\C'$. After encoding the data of one block using code $\C'$, the bits are reordered within the block in such a way that nearby bits are separated from each other once the block is fed to the second code $\C$. A simple example of an interleaver is a {\dbf\ind{rectangular code}\/} or\index{error-correcting code!rectangular}\index{error-correcting code!product code} {\dem\ind{product code}\/} in which the data are arranged in a $K_2 \times K_1$ block, and encoded horizontally using an $(N_1,K_1)$ linear code, then vertically using a $(N_2,K_2)$ linear code. \exercisaxB{3}{ex.productorder}{ Show that either of the two codes can be viewed as the \ind{inner code} or the \ind{outer code}. 
}
%\subsection{}
% see also _concat2.tex
As an example, \figref{fig.concath1} shows a product code in which we
% encode horizontally
% For example, if we
encode first with the repetition code $\Rthree$ (also known as the \ind{Hamming code} $H(3,1)$) horizontally then with $H(7,4)$ vertically. The blocklength of the concatenated\index{concatenation} code is $3 \times 7 = 21$. The number of source bits per codeword is four, shown by the small rectangle.
% The code would be equivalent if we
% encoded first with $H(7,4)$ and second with $\Rthree$.
\begin{figure}
\figuremargin{%
\setlength{\unitlength}{0.4mm}
\begin{center}
\begin{tabular}{rrrrr}
(a) \begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){0}}
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){0}}
\put(5,5){\makebox(0,0){1}}
\put(15,65){\makebox(0,0){1}}
\put(15,55){\makebox(0,0){0}}
\put(15,45){\makebox(0,0){1}}
\put(15,35){\makebox(0,0){1}}
\put(15,25){\makebox(0,0){0}}
\put(15,15){\makebox(0,0){0}}
\put(15,5){\makebox(0,0){1}}
\put(25,65){\makebox(0,0){1}}
\put(25,55){\makebox(0,0){0}}
\put(25,45){\makebox(0,0){1}}
\put(25,35){\makebox(0,0){1}}
\put(25,25){\makebox(0,0){0}}
\put(25,15){\makebox(0,0){0}}
\put(25,5){\makebox(0,0){1}}
\end{picture}&
%
% noise picture
%
(b) \begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,55){\makebox(0,0){$\star$}}%
\put(5,15){\makebox(0,0){$\star$}}%
%
\put(15,55){\makebox(0,0){$\star$}}%
\put(15,35){\makebox(0,0){$\star$}}%
%
\put(25,25){\makebox(0,0){$\star$}}%
\end{picture}&
%
% received vector picture
%
(c) \begin{picture}(30,70)(0,0)
\put(0,0){\framebox(30,70)}
\put(0,30){\framebox(10,40)}
\put(5,65){\makebox(0,0){1}}
\put(5,55){\makebox(0,0){1}}%
\put(5,45){\makebox(0,0){1}}
\put(5,35){\makebox(0,0){1}}
\put(5,25){\makebox(0,0){0}}
\put(5,15){\makebox(0,0){1}}%
\put(5,5){\makebox(0,0){1}}
%
\put(15,65){\makebox(0,0){1}} \put(15,55){\makebox(0,0){1}}% \put(15,45){\makebox(0,0){1}} \put(15,35){\makebox(0,0){0}}% \put(15,25){\makebox(0,0){0}} \put(15,15){\makebox(0,0){0}} \put(15,5){\makebox(0,0){1}} % \put(25,65){\makebox(0,0){1}} \put(25,55){\makebox(0,0){0}} \put(25,45){\makebox(0,0){1}} \put(25,35){\makebox(0,0){1}} \put(25,25){\makebox(0,0){1}}% \put(25,15){\makebox(0,0){0}} \put(25,5){\makebox(0,0){1}} \end{picture} & % after R3 correction (d) \begin{picture}(30,70)(0,0) \put(0,0){\framebox(30,70)} \put(0,30){\framebox(10,40)} \put(5,65){\makebox(0,0){1}} \put(5,55){\makebox(0,0){1}}% \put(5,45){\makebox(0,0){1}} \put(5,35){\makebox(0,0){1}} \put(5,25){\makebox(0,0){0}} \put(5,15){\makebox(0,0){{\bf 0}}}% \put(5,5){\makebox(0,0){1}} % \put(15,65){\makebox(0,0){1}} \put(15,55){\makebox(0,0){1}}% \put(15,45){\makebox(0,0){1}} \put(15,35){\makebox(0,0){{\bf 1}}}% \put(15,25){\makebox(0,0){0}} \put(15,15){\makebox(0,0){0}} \put(15,5){\makebox(0,0){1}} % \put(25,65){\makebox(0,0){1}} \put(25,55){\makebox(0,0){{\bf 1}}} \put(25,45){\makebox(0,0){1}} \put(25,35){\makebox(0,0){1}} \put(25,25){\makebox(0,0){{\bf 0}}}% \put(25,15){\makebox(0,0){0}} \put(25,5){\makebox(0,0){1}} \end{picture}& % after 74 correction (e) \begin{picture}(30,70)(0,0) \put(0,0){\framebox(30,70)} \put(0,30){\framebox(10,40)} \put(5,65){\makebox(0,0){1}} \put(5,55){\makebox(0,0){{\bf 0}}}% \put(5,45){\makebox(0,0){1}} \put(5,35){\makebox(0,0){1}} \put(5,25){\makebox(0,0){0}} \put(5,15){\makebox(0,0){{0}}}% \put(5,5){\makebox(0,0){1}} % \put(15,65){\makebox(0,0){1}} \put(15,55){\makebox(0,0){{\bf 0}}}% \put(15,45){\makebox(0,0){1}} \put(15,35){\makebox(0,0){{1}}}% \put(15,25){\makebox(0,0){0}} \put(15,15){\makebox(0,0){0}} \put(15,5){\makebox(0,0){1}} % \put(25,65){\makebox(0,0){1}} \put(25,55){\makebox(0,0){{\bf 0}}} \put(25,45){\makebox(0,0){1}} \put(25,35){\makebox(0,0){1}} \put(25,25){\makebox(0,0){{0}}}% \put(25,15){\makebox(0,0){0}} \put(25,5){\makebox(0,0){1}} \end{picture}\\ & 
% % noise picture % & & % after 74 correction (d$^{\prime}$) \begin{picture}(30,70)(0,0) \put(0,0){\framebox(30,70)} \put(0,30){\framebox(10,40)} \put(5,65){\makebox(0,0){1}} \put(5,55){\makebox(0,0){1}}% \put(5,45){\makebox(0,0){1}} \put(5,35){\makebox(0,0){1}} \put(5,25){\makebox(0,0){{\bf 1}}}% \put(5,15){\makebox(0,0){1}}% \put(5,5){\makebox(0,0){1}} % \put(15,65){\makebox(0,0){{\bf 0}}}% \put(15,55){\makebox(0,0){1}}% \put(15,45){\makebox(0,0){1}} \put(15,35){\makebox(0,0){0}}% \put(15,25){\makebox(0,0){0}} \put(15,15){\makebox(0,0){0}} \put(15,5){\makebox(0,0){1}} % \put(25,65){\makebox(0,0){1}} \put(25,55){\makebox(0,0){0}} \put(25,45){\makebox(0,0){1}} \put(25,35){\makebox(0,0){1}} \put(25,25){\makebox(0,0){{\bf 0}}}% \put(25,15){\makebox(0,0){0}} \put(25,5){\makebox(0,0){1}} \end{picture} & % after R3 correction (e$^{\prime}$) \begin{picture}(30,70)(0,0) \put(0,0){\framebox(30,70)} \put(0,30){\framebox(10,40)} \put(5,65){\makebox(0,0){1}} \put(5,55){\makebox(0,0){(1)}} \put(5,45){\makebox(0,0){1}} \put(5,35){\makebox(0,0){1}} \put(5,25){\makebox(0,0){{\bf 0}}} \put(5,15){\makebox(0,0){{\bf 0}}}% \put(5,5){\makebox(0,0){1}} % \put(15,65){\makebox(0,0){{\bf 1}}} \put(15,55){\makebox(0,0){(1)}}% \put(15,45){\makebox(0,0){1}} \put(15,35){\makebox(0,0){{\bf 1}}}% \put(15,25){\makebox(0,0){0}} \put(15,15){\makebox(0,0){0}} \put(15,5){\makebox(0,0){1}} % \put(25,65){\makebox(0,0){1}} \put(25,55){\makebox(0,0){(1)}} \put(25,45){\makebox(0,0){1}} \put(25,35){\makebox(0,0){1}} \put(25,25){\makebox(0,0){{0}}}% \put(25,15){\makebox(0,0){0}} \put(25,5){\makebox(0,0){1}} \end{picture}\\ \end{tabular} \end{center} }{% \caption[a]{A product code. (a) A string {\tt{1011}} encoded using a concatenated code consisting of two Hamming codes, $H(3,1)$ and $H(7,4)$. (b) a noise pattern that flips 5 bits. (c) The received vector. (d) After decoding using the horizontal $(3,1)$ decoder, and (e) after subsequently using the vertical $(7,4)$ decoder. 
The decoded vector matches the original. (d$^{\prime}$, e$^{\prime}$) After decoding in the other order, three errors still remain.} \label{fig.concath1} }% \end{figure} \label{sec.concatdecode}We can decode conveniently (though not optimally) by using the individual decoders for each of the subcodes in some sequence. It makes most sense to first decode the code which has the lowest rate and hence the greatest error-correcting ability. \Figref{fig.concath1}(c--e) shows what happens if we receive the codeword of \figref{fig.concath1}a with some errors (five bits flipped, as shown) and apply the decoder for $H(3,1)$ first, and then the decoder for $H(7,4)$. The first decoder corrects three of the errors, but erroneously modifies the third bit in the second row where there are two bit errors. The $(7,4)$ decoder can then correct all three of these errors. \Figref{fig.concath1}(d$^{\prime}$--$\,$e$^{\prime}$) shows what happens if we decode the two codes in the other order. In columns one and two there are two errors, so the $(7,4)$ decoder introduces two extra errors. It corrects the one error in column 3. The $(3,1)$ decoder then cleans up four of the errors, but erroneously infers the second bit. % To make simple decoding possible, % we split up bits that are in a single codeword at the first level, % grouping them with other bits. Rectangular arrangement makes this easiest % to see. \subsection{Interleaving} The motivation for interleaving is that by spreading out bits that are nearby in one code, we make it possible to ignore % forget about the complex correlations among the errors that are produced by the inner code. Maybe the inner code will mess up an entire codeword; but that codeword is spread out one bit at a time over several codewords of the outer code. 
So we can treat the errors introduced by the inner code as if they are independent.\index{approximation!of complex distribution} % by a simpler one} % % By iterating this process, with each successive % code adding a small amount of redundancy to a geometrically increasing block, % we can define an explicit sequence of codes with the property that % $p_{\rm b} \rightarrow 0$ for some rate $R > 0$ (but not any $R$ up to the % capacity $C$). % % There is also a proof by Forney that better concatenations % exist, which achieve rates up to capacity and have encoding and decoding % complexity of order $O(N^4)$. But the proof is non-constructive. % % gf.tex could be included here % % \subsection{Coding theory sells you short} % At this point could discuss the universalist `this code corrects % all errors up to $t$' with the Shannonist `the prob of error % is tiny'. The latter attitude allows you to communicate at far % greater rates. The former attitude is happy with something that % is only halfway. % % Distance % Show Prob of error of ideal decoder (Schematic) as function of noise level. % Show that you can cope with double the noise. \subsection{Other channel models} % Most of the codes mentioned above are designed in terms of In addition to the binary symmetric channel and the Gaussian channel, % or in terms of the number of errors they can correct, but coding theorists keep more complex channels in mind also. %\index{burst-error channels} {\dem Burst-error channels\/}\index{channel!bursty}\index{burst errors} are important models in practice. \ind{Reed--Solomon code}s use \ind{Galois field}s (see \appendixref{app.GF}) with large numbers of elements (\eg\ $2^{16}$) as their input alphabets, and thereby automatically achieve a degree of burst-error tolerance in that even if 17 successive bits are corrupted, only 2 successive symbols in the Galois field representation are corrupted. 
Concatenation and interleaving can give further % fortuitous protection against % \index{concatenated code} burst errors. The concatenated\index{concatenation!error-correcting codes}\index{error-correcting code!concatenated} Reed--Solomon codes used on digital compact discs % DISKS? are able to correct bursts of errors of length 4000 bits. \exercissxB{2}{ex.interleaving.dumb}{ The technique of \ind{interleaving},\index{implicit assumptions} which allows bursts of\index{error-correcting code!interleaving} errors to be treated as independent, is widely used, but is theoretically a poor way to protect data against \ind{burst errors}, in terms of the amount of redundancy required. Explain why interleaving is a poor method, using the following burst-error channel as an example. Time is divided into chunks of length $N=100$ clock cycles; during each chunk, there is a burst with probability $b=0.2$; during a burst, the channel is a binary symmetric channel with $f=0.5$. If there is no burst, the channel is an error-free binary channel. Compute the capacity of this channel and compare it with the maximum communication rate that could conceivably be achieved if one used interleaving and treated the errors as independent. } % The BSC is an inadequate channel model for a second reason: many % channels have {\em real outputs}. For example, a % binary input $x$ may give rise to a % probability distribution over a real output $y$. Codes whose decoders % can handle real outputs (log likelihood ratios) are therefore % important. `Convolutional codes' are such codes, as are some block codes. {\dem\index{fading channel}{Fading channels}\/} are real\index{channel!fading} channels like Gaussian\index{channel!Gaussian} channels except that the received power is assumed to vary with time. A moving \ind{mobile phone}\index{cellphone|see{mobile phone}}\index{phone!cellular|see{mobile phone}} is an important example. 
The incoming \ind{radio} signal is reflected off nearby objects so that there are interference patterns and the intensity of the signal received by the phone varies with its location. The received power can easily vary by 10 decibels\index{decibel} (a factor of ten) as the phone's antenna moves through a distance similar to the wavelength of the radio signal (a few centimetres). %Fading channels are used as models % of the radio channel of mobile phones, in which the received power % varies rapidly \section{The state of the art} What are the best known codes for communicating over Gaussian channels? All the practical codes are linear codes, and are either based on convolutional codes or block codes.\index{linear block code} \subsection{Convolutional codes, and codes based on them} \begin{description} \item[Textbook convolutional codes\puncspace] The `de facto standard' % cite golomb? error-correcting code for\index{communication} \ind{satellite communications} is a convolutional code with constraint length 7. Convolutional codes are discussed in \chref{ch.convol}. \item[Concatenated convolutional codes\puncspace] The above \ind{convolutional code} can be used as the inner code of a\index{error-correcting code!concatenated} concatenated code whose outer code is a {Reed--Solomon code} with eight-bit symbols. This code was used in deep space communication systems such as the Voyager spacecraft. For further reading about Reed--Solomon codes, see \citeasnoun{lincostello83}. \item[The code for \index{Galileo code}{Galileo}\puncspace] A code using the same format but using a longer constraint length -- 15 -- for its convolutional code and a larger Reed--Solomon code was developed by the \ind{Jet Propulsion Laboratory} \cite{JPLcode}. The details of this code are unpublished outside JPL, and the decoding is only possible using a room full of special-purpose hardware. In 1992, this was the best code known of rate \dfrac{1}{4}. 
\item[Turbo codes\puncspace] In 1993, \index{Berrou, C.}{Berrou}, \index{Glavieux, A.}{Glavieux} and \index{Thitimajshima, P.}{Thitimajshima} \nocite{Berrou93:Turbo}reported work on {\dem\ind{turbo code}s}. The encoder of a turbo code is based on the encoders of two % or more constituent codes. In % the original paper the two constituent codes were convolutional codes. The source bits are fed into each encoder, the order of the source bits being permuted in a random way, and the resulting parity bits from each constituent code are transmitted. The decoding algorithm % invented by Berrou {\em et al\/} involves iteratively decoding each constituent code% \amarginfig{b}{ \begin{center} \setlength{\unitlength}{1mm} \begin{picture}(25,30)(0,8)% \put(15,18){\framebox(8,8){$C_1$}} \put(15, 8){\framebox(8,8){$C_2$}} \put( 9,12){\circle{6}} \put( 5, 8){\framebox(8,8){$\pi$}} \put(9.7,14.875){\vector(1,0){0.1}}% right pointing circle vector % was 975 \put(23,22){\vector(1,0){3}} \put(23,12){\vector(1,0){3}} \put(13,12){\line(1,0){2}} \put( 2,12){\vector(1,0){3}} \put( 0,22){\vector(1,0){15}} \put( 2,22){\line(0,-1){10}} % \end{picture}% \end{center} \caption[a]{The encoder of a turbo code. Each box $C_1$, $C_2$, contains a convolutional code. The source bits are reordered using a permutation $\pi$ before they are fed to $C_2$. The transmitted codeword is obtained by concatenating or interleaving the outputs of the two convolutional codes. The random permutation is chosen when the code is designed, and fixed thereafter. } } using its standard decoding algorithm, then using the output of the decoder as the input to the other decoder. This decoding algorithm is an instance of a {\dbf{message-passing}}\index{message passing} algorithm called the {\dbf\ind{sum--product algorithm}}. Turbo codes are discussed in \chref{ch.turbo}, and message passing in Chapters \ref{ch.message}, \ref{ch.noiseless}, \ref{ch.exact}, and \ref{ch.sumproduct}. 
\end{description} \subsection{Block codes} \begin{description} \item[Gallager's low-density parity-check codes\puncspace] The% \amarginfig{c}{ \[ \raisebox{0.425in}{ \bH \hspace{0.02in} =}\hspace{-0.1in} \psfig{figure=MNCfigs/12.4.3.111/A.ps,angle=-90,width=1.5in,height=1in} \] \begin{center} \mbox{ \psfig{figure=/home/mackay/itp/figs/gallager/16.12.ps,width=2in,angle=-90} }\end{center} \caption[a]{A low-density parity-check matrix and the corresponding graph of a rate-\dfrac{1}{4} low-density parity-check code with % $(j,k) = (3,4)$, blocklength $N \eq 16$, and $M \eq 12$ constraints. Each white circle represents a transmitted bit. Each bit participates in $j=3$ constraints, represented by \plusnode\ squares. Each % \plusnode\ constraint forces the sum of the $k=4$ bits to which it is connected to be even. This code is a $(16,4)$ code. Outstanding performance is obtained when the blocklength is increased to $N \simeq 10\,000$. } \label{fig.ldpccIntro} } best block codes known for Gaussian channels were invented by Gallager\index{Gallager, Robert G.} in 1962 but were promptly forgotten by most of the coding theory community. % by MacKay and Neal, They were rediscovered in 1995\nocite{mncEL,wiberg:phd}\index{Wiberg, Niclas}\index{MacKay, David J.C.}\index{error-correcting code!low-density parity-check}\index{Neal, Radford M.} and shown to have outstanding theoretical and practical properties.\index{error-correcting code!practical} Like turbo codes, they are decoded by message-passing algorithms. We will discuss these beautifully simple codes in Chapter % \ref{ch.belief.propagation} and \ref{ch.gallager}. \end{description} The performances of the above codes are compared for Gaussian channels in \figref{fig:GCResults}, \pref{fig:GCResults}.%{fig.gl.gc}. % the Galileo code and % Only the Galileo code and turbo codes outperform the original % regular, binary Gallager codes. 
% The best known Gallager codes, which are irregular, %% and non-binary, % outperform the Galileo code and turbo codes too \cite{DaveyMacKay96,Richardson2001b}. \section{Summary} \begin{description} \item[Random codes] are good, but they require exponential resources to encode and decode them. \item[Non-random codes] tend for the most part not to be as good as random codes. For a non-random code, encoding may be easy, but even for simply-defined linear codes, the decoding problem remains very difficult. \item[The best practical codes] %\ben %\item (a) employ very large block sizes; (b) % \item are based on semi-random code constructions; and (c) %\item make use of probability-based decoding algorithms. % \een \end{description} \section{Nonlinear codes} Most practically used codes are linear, but not all.\index{error-correcting code!nonlinear}\index{nonlinear code} Digital soundtracks are encoded onto cinema film as a binary pattern. The likely errors affecting the film involve dirt and scratches, which produce large numbers of {\tt{1}}s and {\tt{0}}s respectively. We want none of the codewords to look like all-{\tt{1}}s or all-{\tt{0}}s, so that it will be easy to detect errors caused by dirt and scratches. One of the codes used in \ind{digital cinema}\index{cinema} \ind{sound} systems is a nonlinear $(8,6)$ code consisting of 64 of the ${{8}\choose{4}}$ binary patterns of weight 4. % That's 70 patterns. Pick 64. \section{Errors other than noise} Another source of uncertainty for the receiver is uncertainty about the {\em{\ind{timing}}\/} of the transmitted signal $x(t)$. In ordinary coding theory and information theory, the transmitter's time $t$ and the receiver's time $u$ are assumed to be perfectly synchronized. 
% If a bit sequence is encoded by a simple signal % $x(t) \in \pm 1$, information is easily conveyed if % the transmitter and the receiver both know the same % time $t$; But if the receiver receives a signal $y(u)$, where the receiver's time, $u$, is an imperfectly known function $u(t)$ of the transmitter's time $t$, then the capacity of this channel for communication is reduced. The theory of such channels is incomplete, compared with the % ordinary % `normal' synchronized channels\index{insertions}\index{deletions} we have discussed thus far. Not even the {\em capacity\/} of channels with \ind{synchronization errors}\index{capacity!channel with synchronization errors} is known \cite{Levenshtein66,Ferreira97}; % % ear recommends citing zigangirov69 ullman67 % codes for reliable communication over channels with synchronization errors remain an active research area \cite{DaveyMacKay99b}. % ear recommends citing ratzer2003 \subsection*{Further reading} For a review of the history of spread-spectrum\index{spread spectrum} methods, see \citeasnoun{Scholtz82}. \section{Exercises} \subsection{The Gaussian channel} \exercissxB{2}{ex.gcCb}{ Consider a Gaussian channel with a real input $x$, and signal to noise ratio $v/\sigma^2$. \ben \item What is its capacity $C$? \item If the input is constrained to be binary, $x \in \{ \pm \sqrt{v} \}$, what is the capacity $C'$ of this constrained channel? \item If in addition the output of the channel is thresholded using the mapping \beq y \rightarrow y' = \left\{ \begin{array}{cc} 1 & y > 0 \\ 0 & y \leq 0, \end{array} \right. \eeq what is the capacity $C''$ of the resulting channel? \item Plot the three capacities above as a function of $v/\sigma^2$ from 0.1 to 2. [You'll need to do a numerical integral to evaluate $C'$.] \een } \exercisaxB{3}{ex.codeslinear}{ For large integers $K$ and $N$, what fraction of all binary error-correcting codes of length $N$ and rate $R=K/N$ are {\em{linear}\/} codes? 
[The answer will depend on whether you choose to define the code to be an {\em{ordered}\/} list of $2^K$ codewords, that is, a mapping from $s \in \{1,2,\ldots,2^K\}$ to $\bx^{(s)}$, or to define the code to be an unordered list, so that two codes consisting of the same codewords are identical. Use the latter definition: a code\index{error-correcting code} is a set of codewords; how the encoder operates is not part of the definition of the code.]
}
% that have not already been covered.
\subsection{Erasure channels}
\exercisxB{4}{ex.beccode}{
 Design a code for the binary erasure channel, and a decoding algorithm, and evaluate their probability of error. [The design of good codes for erasure channels\index{erasure correction}\index{channel!erasure} is an active research area \cite{spielman-96,LubyDF}; see also \chref{chdfountain}.]
% Have fun!]
%
}
\exercisaxB{5}{ex.qeccode}{
 Design a code for the $q$-ary erasure channel, whose input $x$ is drawn from $0,1,2,3,\ldots,(q-1)$, and whose output $y$ is equal to $x$ with probability $(1-f)$ and equal to {\tt{?}} otherwise. [This erasure channel is a good model for \ind{packet}s transmitted over the \ind{internet}, which are either received reliably or are lost.]
}
\exercissxC{3}{ex.raid}{
 How do redundant arrays of independent disks (RAID) work?\marginpar{%
\small\raggedright\reducedlead{%
% aside
[Some people say RAID stands for `redundant array of inexpensive disks', but I think that's silly -- RAID would still be a good idea\index{RAID}\index{redundant array of independent disks} even if the disks were expensive!]
% end aside
}}
 These are information storage systems consisting of about\index{erasure correction} ten \disc{} drives,\index{disk drive} of which any two or three can be disabled and the others are still able to reconstruct any requested file.\index{file storage}
 What codes are used, and how far are these systems from the Shannon limit for the problem they are solving?
How would {\em you\/} design a better RAID system? % Some information is provided in the solution section. See {\tt http://{\breakhere}www.{\breakhere}acnc.{\breakhere}com/{\breakhere}raid2.html}; see also \chref{chdfountain}. % and {\tt http://www.digitalfountain.com/} for more. } %%%%\input{tex/_e7.tex} \dvips \section{Solutions}% to Chapter \protect\ref{ch.ecc}'s exercises} % % ex 89 \soln{ex.gcoptens}{ % \subsection{Maximization} Introduce a Lagrange multiplier $\l$ for the power constraint and another, $\mu$, for the constraint of normalization of $P(x)$. \beqan F &\eq & \I(X;Y) - { \l \textstyle \int \d x \, P(x) x^2 - \mu \textstyle \int \d x \, P(x) } \\ &\eq & \int \! \d x \, P(x) \left[ \int \! \d y \, P(y\given x) \ln \frac{P(y\given x)}{P(y)} - \l x^2 - \mu \right] . \eeqan Make the functional derivative with respect to $P(x^*)$. \beqan \frac{\delta F}{\delta P(x^*)} &=& \int \! \d y \, P(y\given x^*) \ln \frac{P(y\given x^*)}{P(y)} - \l {x^*}^2 - \mu \nonumber \\ && - \int \! \d x \: P(x) \int \! \d y \: P(y\given x) \frac{1}{P(y)} \frac{\delta P(y)}{\delta P(x^*)} . \hspace{0.5cm} \eeqan The final factor $\delta P(y)/\delta P(x^*)$ is found, using $P(y) = \int \! \d x \, P(x) P(y\given x)$, to be $P(y\given x^*)$, and the whole of the last term collapses in a puff of smoke to 1, which can be absorbed into the $\mu$ term. % We now substitute Substitute $P(y\given x) = \exp( -(y-x)^2/2 \sigma^2) / \sqrt{2 \pi \sigma^2}$ and set the derivative to zero: \beq \int \! \d y \, P(y\given x) \ln \frac{P(y\given x)}{P(y)} - \l x^2 - \mu' = 0 \eeq \beq \Rightarrow \int \! \d y \, \frac{\exp( -(y-x)^2/2 \sigma^2)}{\sqrt{2 \pi \sigma^2} } \ln \left[ P(y) \sigma \right] = - \l x^2 - \mu' - \frac{1}{2} . \label{eq.theconstr} \eeq This condition must be satisfied by $\ln \! \left[ P(y) \sigma \right]$ for all $x$. Writing a Taylor expansion of $\ln \! \left[ P(y) \sigma \right] = a + b y + c y^2 + \cdots$, only a quadratic function $\ln \! 
\left[ P(y) \sigma \right] = a + c y^2$ would satisfy the constraint (\ref{eq.theconstr}). (Any higher order terms $y^p$, $p>2$, would produce terms in $x^p$ that are not present on the right-hand side.) Therefore $P(y)$ is Gaussian. We can obtain this optimal output distribution by using a Gaussian input distribution $P(x)$. % \footnote{Note in passing that % the Gaussian is the probability distribution that has maximum % pseudo-entropy } \soln{ex.gcC}{ Given a Gaussian input distribution of variance $v$, the output distribution is $\Normal(0,v\!+\!\sigma^2)$, since $x$ and the noise are independent random variables, and variances add for independent random variables. The mutual information is: \beqan \!\!\!\!\!\!\!\!\!\! \I(X;Y)& =& \!\! \int \! \d x \, \d y \: P(x)P(y\given x) \log {P(y\given x)} - \int \! \d y \: P(y) \log {P(y)} \\ &=& \frac{1}{2} \log \frac{1}{\sigma^2} - \frac{1}{2} \log \frac{1}{v+\sigma^2} \\ &=& % \frac{1}{2} \log \frac{v+\sigma^2}{\sigma^2} = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right) . \eeqan } \soln{ex.interleaving.dumb}{ The capacity of the channel is one minus the information content of the noise that it adds. That information content is, per chunk, the entropy of the selection of whether the chunk is bursty, $H_2(b)$, plus, with probability $b$, the entropy of the flipped bits, $N$, which adds up to $H_2(b) + Nb$ per chunk (roughly; accurate if $N$ is large). So, per bit, the capacity is, for $N=100$, \beq C = 1 - \left( \frac{1}{N} H_2(b) + b \right) = 1 - 0.207 = 0.793 . \eeq In contrast, interleaving, which treats bursts of\index{sermon!interleaving} errors as independent, causes the channel to be treated as a binary symmetric channel with $f= 0.2 \times 0.5 = 0.1$, whose capacity is about 0.53. Interleaving throws away the useful information about the correlatedness of the errors. 
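As a check on the arithmetic, here is a rough evaluation of the two capacities just quoted, using base-2 logarithms and rounded values; the label $C_{\rm interleaved}$ is introduced here only for illustration and denotes the capacity of the binary symmetric channel with $f=0.1$.
\beqan
 H_2(0.2) &=& 0.2 \log_2 \frac{1}{0.2} + 0.8 \log_2 \frac{1}{0.8} \simeq 0.464 + 0.258 = 0.722 , \\
 C &=& 1 - \left( \frac{0.722}{100} + 0.2 \right) \simeq 0.793 , \\
 C_{\rm interleaved} &=& 1 - H_2(0.1) \simeq 1 - 0.469 = 0.531 .
\eeqan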
Theoretically, we should be able to communicate about $(0.79/0.53) \simeq 1.5$ times faster using a code and decoder that explicitly treat bursts as bursts.
}
% ex 91
\soln{ex.gcCb}{
\ben
\item Putting together the results of exercises \ref{ex.gcoptens} and \ref{ex.gcC}, we deduce that a Gaussian channel with real input $x$, and signal to noise ratio $v/\sigma^2$ has capacity
\beq
 C = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right) .
\label{eq.unconstrained.cap}
\eeq
\item If the input is constrained to be binary, $x \in \{ \pm \sqrt{v} \}$, the capacity is achieved by using these two inputs with equal probability. The capacity is reduced to a somewhat messy integral,
\beq
 C' = \int_{-\infty}^{\infty} \d y \, N(y;0) \log N(y;0)
%\nonumber \\
%& &
 - \int_{-\infty}^{\infty} \d y \, P(y) \log P(y) ,
\eeq
where $N(y;x) \equiv (1/\sqrt{2 \pi}) \exp [ - ( y-x)^2/2 ]$, $x\equiv \sqrt{v}/ \sigma$, and $P(y) \equiv [ N(y;x)+N(y;-x) ]/2$. This capacity is smaller than the unconstrained capacity (\ref{eq.unconstrained.cap}), but for small signal-to-noise ratio, the two capacities are close in value.
\item If the output is thresholded, then the Gaussian channel is turned into a binary symmetric channel whose transition probability is given by the error function $\erf$ defined on page \pageref{sec.erf}. The capacity is
%%%%%%%
\marginfig{%
\begin{center}
\psfig{figure=/home/mackay/_doc/code/brendan/gc.ps,width=1.85in,angle=-90}
\mbox{\psfig{figure=/home/mackay/_doc/code/brendan/gc.l.ps,width=1.85in,angle=-90}}\\[-0.05in]
\end{center}
%
\caption[a]{Capacities (from top to bottom in each graph) $C$, $C'$, and $C''$, versus the signal-to-noise ratio $(\sqrt{v}/\sigma)$. The lower graph is a log--log plot.}
}
%%%%%%%%
\beq
 C'' = 1 - H_2( f ), \mbox{ where $f= \erf(\sqrt{v}/\sigma)$} .
\eeq
%\item
% The capacities are plotted in the margin.
\een } %\soln{ex.beccode}{ % The design of good codes for erasure channels\index{erasure correction} % is an active research area % \cite{spielman-96,LubyDF}. Have fun! %} % RAID \soln{ex.raid}{ There are several RAID systems. One of the easiest to understand consists of 7 \disc{} drives which store data\index{erasure correction} at rate $4/7$ using a $(7,4)$ \ind{Hamming code}: each successive\index{RAID}\index{redundant array of independent disks} four bits are encoded with the code and the seven codeword bits are written one to each disk. Two or perhaps three disk drives can go down and the others can recover the data. The effective channel model here is a binary erasure channel, because it is assumed that we can tell when a disk is dead. It is not possible to recover the data for {\em some\/} choices of the three dead disk drives; can you see why? } \exercissxB{2}{ex.raid3}{ Give an example of three \disc{} drives that, if lost, lead to failure of the above RAID system, and three that can be lost without failure. } \soln{ex.raid3}{ The $(7,4)$ Hamming code has codewords of weight 3. If any set of three \disc{} drives\index{erasure correction} corresponding to one of those codewords is lost, then the other four disks can recover only 3 bits of information about the four source bits; a fourth bit is lost. [\cf\ \exerciseref{ex.qeccodeperfect} with $q=2$: there are no binary MDS codes. This deficit is discussed further in \secref{sec.RAIDII}.] Any other set of three disk drives can be lost without problems because the corresponding four by four submatrix of the generator matrix is invertible. % The simplest % example of a recoverable failure is when the three parity % drives (5,6,7) go down. A better code would be a digital fountain -- see \chref{chdfountain}. % \cite{LubyDF},\footnote{{\tt http://www.digitalfountain.com/}} } \dvipsb{solutions real channels s7} %%%%%%% was a chapter on further exercises here once! 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%% PART %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \renewcommand{\partfigure}{\poincare{8.2}} \part{Further Topics in Information Theory} \prechapter{About Chapter} In Chapters \ref{ch1}--\ref{ch7}, we concentrated on two aspects of information theory and coding theory: source coding -- the compression of information so as to make efficient use of data transmission and storage channels; and channel coding -- the redundant encoding of information so as to be able to detect and correct \ind{communication} errors. In both these areas we started by ignoring practical considerations, concentrating on the question of the theoretical limitations and possibilities of coding. We then discussed practical source-coding and channel-coding schemes, shifting the emphasis towards computational feasibility. But the prime criterion for comparing encoding schemes remained the efficiency of the code in terms of the channel resources it required: the best source codes were those that achieved the greatest compression; the best channel codes were those that communicated at the highest rate with a given probability of error. In this chapter we now shift our viewpoint a little, thinking of {\em ease of information retrieval\/} as a primary goal. It turns out that the random codes\index{random code} which were theoretically useful in our study of channel coding are also useful for rapid information retrieval. Efficient information retrieval is one of the problems that brains seem to solve effortlessly, and \ind{content-addressable memory}\index{memory!content-addressable} is one of the topics we will study when we look at neural networks. 
\medskip %\chapter{Hash codes: codes for efficient information retrieval} \ENDprechapter \chapter{Hash Codes: Codes for Efficient Information Retrieval \nonexaminable} % 9 \label{ch.hash} % \chapter{Hash codes: codes for efficient information retrieval} % \input{tex/_lhash.tex} % % prerequisites -- the birthday problem questions % postreqs: hopfield nets % % exercises also in _e8.tex AND _e7.tex, solns in _shash and _se8 % _e8 has ones relevant to hashes % % \label{ch.hash} % % SUGGESTION: % % include an illustrative example at start. % add a diagram showing buckets, memory.... \newcommand{\hashS}{S} \newcommand{\hashs}{s} \newcommand{\hashN}{N} \newcommand{\hashT}{T} % \newcommand{\hashn}{n} \section{The information-retrieval problem} A simple example of an \index{information retrieval}{information-retrieval}\index{hash code}\ %\index{code!hash} problem is the task of implementing a \ind{phone directory} service, which, in response to a person's {\dem name}, returns (a) a confirmation that that person is listed in the directory; and (b) the person's {phone number} and other details. We could formalize this problem as follows, with $\hashS$ being the number of names that must be stored in the \ind{directory}. \marginfig{\small \begin{tabular}{@{}p{1.20in}l} \toprule \parbox[t]{1.2in}{\small string length} & $N \simeq 200$ \\ \parbox[t]{1.2in}{\small\raggedright number of strings} & $S \simeq 2^{23}$ \\ \parbox[t]{1.2in}{\small\raggedright number of possible} & $2^N \simeq 2^{200}$ \\ \parbox[t]{1.2in}{\small\raggedright \hspace{0.2in} strings} & \\ \bottomrule % WOULD love this paragraph to be indented differently % HELP \end{tabular} \caption[a]{Cast of characters.} } % Imagine that y You are given a list of $\hashS$ binary strings of length $\hashN$ bits, $\{\bx^{(1)}, \ldots, \bx^{(\hashS)}\}$, where $\hashS$ is considerably smaller than the total number of possible strings, $2^\hashN$. 
We will call the superscript `$\hashs$' in $\bx^{(\hashs)}$ the {\dem record number\/} of the string. The idea is that $\hashs$ runs over customers in the order in which they are added to the directory and $\bx^{(\hashs)}$ is the name of customer $\hashs$. We assume for simplicity that all people have names of the same length. The name length might be, say, $\hashN = 200$ bits, and we might want to store the details of ten million customers, so $\hashS \simeq 10^7 \simeq 2^{23}$. We will ignore the possibility that two customers have identical names. The task is to construct the inverse of the mapping from $s$ to $\bx^{(\hashs)}$, \ie, to make a system that, given a string $\bx$,
% with an unknown record number,
 returns the value of $\hashs$ such that $\bx = \bx^{(\hashs)}$ if one exists, and otherwise reports that no such $\hashs$ exists. (Once we have the record number, we can go and look in memory location $\hashs$ in a separate memory full of phone numbers to find the required number.) The aim, when solving this task, is to
% is system should
 use minimal computational resources in terms of the amount of memory used to store the inverse mapping from $\bx$ to $\hashs$ and the amount of time to compute the inverse mapping. And, preferably, the inverse mapping should be implemented in such a way that further new strings can be added to the directory in a small amount of computer time too.\index{content-addressable memory}
%
% add picture to show lookup table
%
\subsection{Some standard solutions}
\label{sec.simplehash}
 The simplest and dumbest solutions to the information-retrieval problem are a look-up table and a raw list.
\begin{description}
\item[The look-up table] is a piece of memory of size $2^N \log_2 \hashS$, $\log_2 \hashS$ being the amount of memory required to store an integer between 1 and $\hashS$.
In each of the $2^N$ locations, we put a zero, except for the locations $\bx$ that correspond to strings $\bx^{(\hashs)}$, into which we write the value of $\hashs$. The look-up table is a simple and quick solution, but only if there is sufficient memory for the table, and if the cost of looking up entries in memory is independent of the memory size. But in our definition of the task, we assumed that $N$ is % sufficiently large about 200 bits or more, so the amount of memory required would be of size $2^{200}$; this solution is completely out of the question. Bear in mind that the number of particles in the solar system is only about $2^{190}$. % particles in the known universe is \item[The raw list] is a simple list of ordered pairs $(\hashs, \bx^{(\hashs)} )$ ordered by the value of $\hashs$. The mapping from $\bx$ to $\hashs$ is achieved by searching through the list of strings, starting from the top, and comparing the incoming string $\bx$ with each record $\bx^{(\hashs)}$ until a match is found. This system is very easy to maintain, and uses a small amount of memory, about $\hashS \hashN$ bits, but is rather slow to use, since on average five million pairwise comparisons will be made. \end{description} \exercissxB{2}{ex.meanhash}{ Show that the average time taken to find the required string in a raw list, assuming that the original names were chosen at random, is about $\hashS + N$ binary comparisons. (Note that you don't have to compare the whole string of length $N$, since a comparison can be terminated as soon as a mismatch occurs; show that you need on average two binary comparisons per incorrect string match.) Compare this with the worst-case search time -- assuming that the devil chooses the set of strings and the search key. } The standard way in which phone directories are made improves on the look-up table and the raw list by using an {\dem{{alphabetically-ordered list}}}\index{alphabetical ordering}. 
\begin{description} \item[Alphabetical list\puncspace] The strings $\{ \bx^{(\hashs)} \}$ % $...$ are sorted into alphabetical order. Searching for an entry now usually takes less time than was needed for the raw list because we can take advantage of the sortedness; for example, we can open the phonebook at its middle page, and compare the name we find there with the target string; if the target is `greater' than the middle string then we know that the required string, if it exists, will be found in the second half of the alphabetical directory. Otherwise, we look in the first half. By iterating this splitting-in-the-middle procedure, we can identify the target string, or establish that the string is not listed, in $\lceil \log_2 \hashS \rceil$ string comparisons. The expected number of binary comparisons per string comparison will tend to increase as the search progresses, %, because the leading bits of the two strings involved % in the comparison are expected to become similar; but by being smart % and keeping track of which leading bits we have looked at % already in previous searches, it seems plausible that % we can reduce the number of binary % operations to about $\lceil \log_2 \hashS \rceil + N$ binary comparisons. but the total number of binary comparisons required will be no greater than $\lceil \log_2 \hashS \rceil N$. The amount of memory required is the same as that required for the raw list. Adding new strings to the database requires that we insert them in the correct location in the list. To find that location takes about $\lceil \log_2 \hashS \rceil$ binary comparisons. %Then shuffling along all % of the subsequent entries in the directory to make space for the % new entry may take some computer time, depending on how the memory works. \end{description} Can we improve on the well-established alphabetized list? Let us consider our task from some new viewpoints. % for a moment and think of other ways of viewing it. 
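Before moving on, it may help to put rough numbers on the alphabetical list for the running example, taking $\hashS = 2^{23}$ and $\hashN = 200$ (the values in the cast of characters, used here as illustrative assumptions). The raw list needs about $\hashS/2$ string comparisons on average, that is, several million, whereas the alphabetical list needs at most
\beq
 \lceil \log_2 \hashS \rceil = 23 \mbox{ string comparisons, or at most } \lceil \log_2 \hashS \rceil \, \hashN = 23 \times 200 = 4600 \mbox{ binary comparisons} .
\eeq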
The task is to construct a mapping $\bx \rightarrow \hashs$ from $N$ bits % ($\bx$) to $\log_2 \hashS$ bits. % ($\hashs$). % % what does this mean? % This is a pseudo-invertible mapping, since for any $\bx$ that maps to a non-zero $\hashs$, the customer database contains the pair $(\hashs , \bx^{(\hashs)})$ that takes us back. Where have we come across the idea of mapping from $N$ bits to $M$ bits before? We encountered this idea twice: first, in source coding, we studied block codes which were mappings from strings of $N$ symbols to a selection of one label in a list. % $...$. The task of information retrieval is similar % pretty much identical to the task (which we never actually solved) of making an encoder for a typical-set compression code. The second time that we mapped bit strings to bit strings of another dimensionality was when we studied channel codes. There, we considered codes that mapped from $K$ bits to $N$ bits, with $N$ greater than $K$, and we made theoretical progress using {\em random\/} codes. In hash codes, we put together these two notions. We will study {random codes that map from $N$ bits to $M$ bits where $M$ is {\em smaller\/} than $N$}.\index{random code} % Another strand: the dumb look-up table would be really nice, very quick, % the only problem is it requires too much memory. But there are so % few vectors, what if we project them down into a lower-dimensional % space? A few will collide, but if they are mainly distinct then % we can just implement the look-up table in a lower dimensional % space. The idea is that we will map the original high-dimensional space down into a lower-dimensional space, one in which it is feasible to implement the dumb look-up table method which we rejected a moment ago. 
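As a rough illustration of why the lower-dimensional space makes the look-up table feasible, suppose (an assumption made here purely for illustration) that the shorter strings have $M = 30$ bits and that each table location stores a record number of $\log_2 \hashS \simeq 23$ bits, as in the look-up table described earlier. Then the two memory requirements compare as
\beqan
 \mbox{table indexed by $N$-bit strings:} & & 2^{N} \log_2 \hashS \simeq 2^{200} \times 23 \mbox{ bits} , \\
 \mbox{table indexed by $M$-bit strings:} & & 2^{M} \log_2 \hashS \simeq 2^{30} \times 23 \simeq 2.5 \times 10^{10} \mbox{ bits} ,
\eeqan
the latter being about three gigabytes.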
\amarginfig{t}{\small
\begin{tabular}{@{}p{1.2in}l} \toprule
\parbox[t]{1.2in}{\small string length} & $N \simeq 200$ \\
\parbox[t]{1.2in}{\small number of strings} & $S \,\simeq 2^{23}$ \\
\parbox[t]{1.2in}{\small size of hash function} & $M \simeq 30\ubits$ \\[0.01in]
\parbox[t]{1.2in}{\small size of hash table} & $T = 2^M $\\
 & $\:\:\:\:\: \simeq 2^{30}$ \\ \bottomrule
% HELP the spacing between successive rows
% is smaller than the spacing between lines!! :-(
% HELP
\end{tabular}
\caption[a]{Revised cast of characters.}
}
\section{Hash codes}
First we will describe how a hash code works, then we will study the properties of idealized hash codes. A hash code implements a solution to the information-retrieval problem, that is, a mapping from $\bx$ to $s$, with the help of a pseudo-random function called a {\dem\ind{hash function}}, which maps the $N$-bit string $\bx$ to an $M$-bit string $\bh(\bx)$, where $M$ is smaller than $N$. $M$ is typically chosen
% to be sufficiently small
such that the `table size' $\hashT \simeq 2^M$ is a little bigger than $S$ -- say, ten times
% one or two orders of magnitude
bigger. For example, if we were expecting
% $S$ a million values for $\bx$
$S$ to be about a million, we might map
%a 200-bit
$\bx$ into a 30-bit hash $\bh$ (regardless of the size $N$ of each item $\bx$). The hash function is some fixed deterministic function which should ideally be indistinguishable from a fixed random code. For practical purposes, the hash function must be quick to compute. Two simple examples of \ind{hash function}s are:
\begin{description}
\item[Division method\puncspace] The table size $\hashT$ is a prime number, preferably one that is not close to a power of 2. The hash value is the remainder when the integer $\bx$ is divided by $\hashT$.
\item[Variable string addition method\puncspace] This method assumes that $\bx$ is a string of bytes and that the table size $\hashT$ is 256. The characters of $\bx$ are added, modulo 256.
%
% http://members.xoom.com/thomasn/s_man.htm
%
% This hash function does not distinguish anagrams.
This hash function has the defect that it maps strings that are anagrams of each other onto the same hash. It may be improved by putting the running total through a fixed pseudorandom permutation after each character is added.
%
%\item[
In the\index{hash function} {\dem variable string exclusive-or method\/} with table size $\leq 65\,536$, the string is hashed twice in this way, with the initial running total being set to 0 and 1 respectively (\algref{alg.hashxor}). The result is a 16-bit hash.
\end{description}
%
% probably a good idea to include this code stolen from Thomas Niemann
% typedef unsigned short int HashIndexType; (changed to int)
%
\begin{algorithm}% figure}
\begin{framedalgorithmwithcaption}%
{
\caption[a]{{\tt C} code implementing the variable string exclusive-or method to create a hash {\tt h} in the range $0\ldots 65\,535$ from a string {\tt x}. Author: Thomas Niemann.}
\label{alg.hashxor}
}
\small
\begin{verbatim}
unsigned char Rand8[256];     // This array contains a random
                              //   permutation from 0..255 to 0..255
int Hash(char *x) {           // x is a pointer to the first char;
     int h;                   //    *x is the first character
     unsigned char h1, h2;
     unsigned char c;         // Current character, as unsigned

     if (*x == 0) return 0;   // Special handling of empty string
     h1 = *x; h2 = *x + 1;    // Initialize two hashes
     x++;                     // Proceed to the next character
     while (*x) {
         c = (unsigned char) *x;  // Keeps the index in 0..255
                                  //   even if char is signed
         h1 = Rand8[h1 ^ c];  // Exclusive-or with the two hashes
         h2 = Rand8[h2 ^ c];  //   and put through the randomizer
         x++;
     }                        // End of string is reached when *x=0
     h = ((int)(h1)<<8) |     // Shift h1 left 8 bits and add h2
          (int) h2 ;
     return h ;               // Hash is concatenation of h1 and h2
}
\end{verbatim}
% original code stored in tex/_hash.code
\end{framedalgorithmwithcaption}
\end{algorithm}% figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\figuremargin{\footnotesize
\setlength{\unitlength}{1mm}
\thinlines
\begin{picture}(100,100)(-20,-40)
\put(65,-40){\line(0,1){90}}
\put(75,-40){\line(0,1){90}} \multiput(65,-40)(0,3){31}{\line(1,0){10}} \newcommand{\xvector}[2]{\put(-10,#1){\framebox(40,4){$\bx^{(#2)}$}}} \newcommand{\hvector}[2]{\put(53,#1){\makebox(5,0){$\bh(\bx^{(#2)})\rightarrow$}}} \newcommand{\svector}[2]{\put(74.3,#1){\makebox(0,0)[r]{$#2$}}} \newcommand{\slvector}[2]{\put(35,#1){\vector#2{10}}} \newcommand{\xhs}[4]{\xvector{#1}{#2}\hvector{#3}{#2}\svector{#3}{#2}\slvector{#1}{#4}} \xhs{30}{1}{18.7}{(1,-1)} \xhs{24}{2}{45.536}{(1,2)} \xhs{18}{3}{6.7}{(1,-1)} \xhs{0}{s}{-20.5}{(1,-2)} % labels \put(39,65){\makebox(0,0){Hash}} \put(39,62){\makebox(0,0){function}} \put(34,59){\vector(1,0){11}} \put(10,60){\makebox(0,0){Strings}} \put(48,58.60){\makebox(0,0)[l]{hashes}} \put(70,62){\makebox(0,0){Hash table}} % \put(10,12){\makebox(0,0){$\vdots$}} \put(10,-8){\makebox(0,0){$\vdots$}} % N range indication \put(10,40){\vector(-1,0){20}} \put(10,40){\vector(1,0){20}} \put(10,43){\makebox(0,0){$N$ bits}} % M range indication \put(70,54){\vector(-1,0){5}} \put(70,54){\vector(1,0){5}} \put(70,57){\makebox(0,0){$M$ bits}} % 2^M range \put(82,5){\vector(0,-1){45}} \put(82,5){\vector(0,1){45}} \put(84,5){\makebox(0,0)[l]{$2^M$}} % S range \put(-15,10){\vector(0,1){23}} \put(-15,10){\vector(0,-1){30}} \put(-17,10){\makebox(0,0)[r]{$S$}} % \end{picture} }{ \caption[a]{Use of hash functions for information retrieval. For each string $\bx^{(s)}$, the hash $\bh= \bh(\bx^{(s)})$ is computed, and the value of $s$ is written into the $\bh$th row of the hash table. Blank rows in the hash table contain the value zero. The table size is $T = 2^M$.} \label{fig.hashtable} } \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Having picked a hash function $\bh(\bx)$, we implement an % efficient information retriever as follows. (See \figref{fig.hashtable}.) 
\begin{description} \item[Encoding\puncspace] A piece of memory called the {\em hash table\/} is created of size $2^Mb$ memory units, where $b$ is the amount of memory needed to represent an integer between $0$ and $\hashS$. This table is initially set to zero throughout. Each memory $\bx^{(\hashs)}$ is put through the hash function, and at the location in the hash table corresponding to the resulting vector $\bh^{(\hashs)} = \bh( \bx^{(\hashs)} )$, the integer $\hashs$ is written -- unless that entry in the hash table is already occupied, in which case we have a {\em collision\/} between $\bx^{(\hashs)}$ and some earlier $\bx^{(\hashs')}$ which both happen to have the same hash code. Collisions can be handled in various ways -- we will discuss some in a moment -- but first let us complete the basic picture. \item[Decoding\puncspace] To retrieve a piece of information corresponding to a target vector $\bx$, we compute the hash $\bh$ of $\bx$ and look at the corresponding location in the hash table. If there is a zero, then we know immediately that the string $\bx$ is not in the database. The cost of this answer is the cost of one hash-function evaluation and one look-up in the table of size $2^M$. If, on the other hand, there is a non-zero entry $\hashs$ in the table, there are two possibilities: either the vector $\bx$ is indeed equal to $\bx^{(\hashs)}$; or the vector $\bx^{(\hashs)}$ is another vector that happens to have the same hash code as the target $\bx$. (A third possibility is that this non-zero entry might have something to do with our yet-to-be-discussed collision-resolution system.) To check whether $\bx$ is indeed equal to $\bx^{(\hashs)}$, we take the tentative answer $\hashs$, look up $\bx^{(\hashs)}$ in the original forward database, and compare it bit by bit with $\bx$; if it matches then we report $\hashs$ as the desired answer. 
This successful retrieval has an overall cost of one hash-function evaluation, one look-up in the table of size $2^M$, another look-up in a table of size $\hashS$, and % up to $N$ binary comparisons -- which may be much cheaper than the simple solutions presented in section \ref{sec.simplehash}. \end{description} \exercissxB{2}{ex.hash.retrieval}{ If we have checked the first few bits of $\bx^{(\hashs)}$ with $\bx$ and found them to be equal, what is the probability that the correct entry has been retrieved, if the alternative hypothesis is that $\bx$ is actually not in the database? Assume that the original source strings are random, and the hash function is a random hash function. How many % Could have an exercise here on the number of binary evaluations are needed to be sure with odds of a billion to one that the correct entry has been retrieved? % [Note we are not assuming that the % original strings $\{ \bx^{(\hashs)} \}$ are random; they may be % very similar to each other. We are just assuming that the hash function % is random.] } % % view as a kind of source % encoding - reduces huge redundancy, where the redundancy % has the form P(x) = sum_x pi_c delta(x_c) % % does so using random coding. The hashing method of information retrieval can be used for strings $\bx$ of arbitrary length, if the hash function $\bh(\bx)$ can be applied to strings of any length. \section{Collision resolution} We will study two ways of resolving collisions: appending in the table, and storing elsewhere. \subsection{Appending in table} When encoding, if a collision occurs, we continue down the hash table and write the value of $s$ into the next available location in memory that currently contains a zero. If we reach the bottom of the table before encountering a zero, we continue from the top. 
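
As a small illustration of this encoding rule (the table size and hash values below are invented for the example and do not come from any particular hash function), suppose the table has $T = 8$ rows, numbered 0 to 7, and that the first four strings hash to
\beq
 \bh(\bx^{(1)}) = 6 , \:\: \bh(\bx^{(2)}) = 6 , \:\: \bh(\bx^{(3)}) = 7 , \:\: \bh(\bx^{(4)}) = 7 .
\eeq
Then $s=1$ is written in row 6; $s=2$ collides there and is written in the next row containing a zero, row 7; $s=3$ finds row 7 occupied, reaches the bottom of the table, and is written in row 0; and $s=4$ finds rows 7 and 0 occupied and is written in row 1. Notice that $\bx^{(3)}$ is displaced not by a string that shares its hash but by an earlier displaced string.
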
When decoding, if we compute the hash code for $\bx$ and find that the $s$ contained in the table doesn't point to an $\bx^{(s)}$ that matches the cue $\bx$, we continue down the hash table until we either find an $s$ whose $\bx^{(s)}$ does match the cue % key $\bx$, in which case we are done, or else encounter a zero, in which case we know that the cue $\bx$ is not in the database. For this method, it is essential that the table be substantially bigger in size than $\hashS$. If $2^M < \hashS$ then the encoding rule will become stuck with nowhere to put the last strings. \subsection{Storing elsewhere} A more robust and flexible method is to use {\dem pointers\/} to additional pieces of memory in which collided strings are stored. There are many ways of doing this. As an example, we could store in location $\bh$ in the hash table a pointer (which must be distinguishable from a valid record number $s$) to a `bucket' where all the strings that have hash code $\bh$ are stored in a {\dem sorted list}. The encoder sorts the strings in each bucket alphabetically as the hash table and buckets are created. The decoder simply has to go and look in the relevant bucket and then check the short list of strings that are there by a brief alphabetical search. % of strings that have this encoding. This method of storing the strings in buckets allows the option of making the hash table quite small, which may have practical benefits. We may make it so small that almost all strings are involved in collisions, so all buckets contain a small number of strings. It only takes a small number of binary comparisons to identify which of the strings in the bucket matches the cue $\bx$. \section{Planning for collisions: a birthday problem} \index{birthday} \exercissxA{2}{ex.hash.collision}{ If we wish to store $S$ entries using a hash function whose output has $M$ bits, how many collisions should we expect to happen, assuming that our hash function is an ideal random function? 
What size $M$ of hash table is needed if we would like the expected number of collisions to be smaller than 1? What size $M$ of hash table is needed if we would like the expected number of collisions to be a small fraction, say 1\%, of $S$? } [Notice the similarity of this problem to \exerciseref{ex.birthday}.]
\section{Other roles for hash codes}
\subsection{Checking arithmetic}
\index{error detection}If you wish to check an addition that was done by hand, you may find useful the method of {\dem{\ind{casting out nines}}}.\index{nines} In casting out nines, one finds the sum, modulo nine, of all the {\em digits\/} of the numbers to be summed and compares it with the sum, modulo nine, of the digits of the putative answer. [With a little practice, these sums can be computed much more rapidly than the full original addition.]
% calculation proper.]
\exampla{%???????????
% want this to have reference: {ex.nines}{
In the calculation shown in the margin \marginpar{\begin{center} \begin{tabular}[t]{r} {\tt 189} \\ {\tt +1254} \\ {\tt + 238} \\ \hline {\tt 1681} \\ \end{tabular} \end{center}} the sum, modulo nine, of the digits in {\tt 189+1254+238} is {\tt 7}, and the sum, modulo nine, of {\tt 1+6+8+1} is {\tt 7}. The calculation thus passes the casting-out-nines test. }
Casting out nines gives a simple example of a hash function. For any addition expression of the form $a+b+c+\cdots$, where $a, b, c, \ldots$ are decimal numbers, we define $h \in \{0,1,2,3,4,5,6,7,8\}$ by \beq h(a+b+c+\cdots) = \mbox{ sum modulo nine of all digits in $a,b,c,\ldots$ } ; \eeq then it is a nice property of decimal arithmetic that if \beq a+b+c+\cdots = m+n+o+\cdots \eeq then the hashes $h(a+b+c+\cdots)$ and $h(m+n+o+\cdots)$ are equal.
\exercissxB{1}{ex.nines.p}{ What evidence\index{model comparison} does a correct casting-out-nines match give in favour of the hypothesis that the addition has been done correctly?
} \subsection{Error detection among friends} \index{error detection}Are two files the same? If the files are on the same computer, we could just compare them bit by bit. But if the two files are on separate machines, it would be nice to have a way of confirming that two files are identical without having to transfer one of the files from A to B. [And even if we did transfer one of the files, we would still like a way to confirm whether it has been received without modifications!] This problem can be solved using hash codes. % Alice sends a file to Bob, and wants to do error detection. Let Alice and Bob be the holders of the two files; Alice sent the file to Bob, and they wish to confirm it has been received without error. If Alice computes the hash % function of her file and sends it to Bob, and Bob computes the hash % function of his file, using the same $M$-bit hash function, and the two hashes match, then Bob can deduce that the two files are almost surely the same. % should have some sort of reference to digest? % The hash of the file is often called the {\dem\ind{digest}}. \exampl{example.hash.II}{ What is the probability of a false negative, \ie, the probability, given that the two files do differ, that the two hashes % Bob concludes are nevertheless identical? } % Solution:::::::: If we assume that the hash function is random and that the % unrelated process that causes the files to differ knows nothing about the hash function, then the probability of a false negative is $2^{-M}$.\ENDsolution A 32-bit hash gives a probability of false negative of about $10^{-10}$. % 2.3283064365387e-10 It is common practice to use a linear hash function called a 32-bit cyclic redundancy check to detect errors in files. (A cyclic redundancy check is a set of 32 \ind{parity-check bits} similar to the 3 parity-check bits of the $(7,4)$ Hamming code.) 
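
The figure of about $10^{-10}$ quoted above for a 32-bit hash follows, under the same assumption of a random hash function, from the arithmetic
\beq
 2^{-32} = \frac{1}{4\,294\,967\,296} \simeq 2.3 \times 10^{-10} ,
\eeq
that is, roughly one undetected difference for every four billion pairs of differing files that are compared.
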
%%%%%%%%% end solution \begin{conclusionbox} To have a false-negative rate smaller than one in a billion, $M = 32$ bits is plenty, if the errors are produced by noise. \end{conclusionbox} \exercissxB{2}{ex.whyonlyCRC}{ Such a simple parity-check code only detects errors; it doesn't help correct them. Since error-{\em{correcting\/}} codes exist, why not use one of them to get some error-correcting capability too? } % % more maths requested here % \subsection{Tamper detection} \index{security}\index{tamper detection}\index{detection of forgery}\index{forgery}What if the differences between the two files are not simply `noise', but are introduced by an adversary, a clever {\dem forger\/} called Fiona, who modifies the original file to make a \ind{forgery}\index{cryptography!digital signatures}\index{cryptography!tamper detection} that purports to be \ind{Alice}'s file? How can Alice make a \ind{digital signature} for the file so that \ind{Bob} can confirm that no-one has tampered with the file? And how can we prevent Fiona from listening in on Alice's signature and attaching it to other files? Let's assume that Alice computes a hash function for the file and sends it securely to Bob. % , in the same way as for error-detection above. If Alice computes a simple hash function for the file like the linear cyclic redundancy check, and Fiona knows that this is the method of verifying the file's integrity, Fiona can make her chosen modifications to the file and then easily identify (by linear algebra) a further 32-or-so single bits that, when flipped, restore the hash function of the file to its original value. {\em Linear hash functions give no security against forgers.} We must therefore require that the hash function\index{inversion of hash function} be {\em hard to invert\/} so that no-one can construct a tampering that leaves the hash function unaffected. 
We would still like the hash function to be easy to compute, however, so that Bob doesn't have to do hours of work to verify every file he received. Such a hash function -- easy to compute, but hard to invert -- is called a {\dem\ind{one-way hash function}}.\index{hash function!one-way} Finding such functions is one of the active research areas of \ind{cryptography}. % Don't want to use an ecc, because with a linear ecc it is easy to construct % a pair of tamperings which have the same syndrome and % so leave the hash unaffected. %How can we invent a function that has the %property that h(x) is easy to compute, but %it is very hard to find an x %suxh that h(x) has a chosen value h? %A lot of research is being done on this question %still, and the sort of functions people use %to make a one-way hash function are functions like: % % exponentiation-modulo-M % %Definition: % take x, and think of it as a number. % compute 1023^(x) modulo M, % where "^" means "1023 to the power x", % and M is some other integer, eg 97. % %Apparently it is hard to invert this sort of % function (i.e. to take the "discrete logarithm"). % %Real one-way hash functions are more complicated than %this, but I hope this gives the idea. % A hash function that is widely used in the free software\index{software!hash function} community to confirm that two files do not differ is {\tt\ind{MD5}}, which produces a 128-bit hash. The details of how it works are quite complicated, involving convoluted exclusive-or-ing and if-ing and and-ing.\footnote{{\tt http://www.freesoft.org/CIE/RFC/1321/3.htm}} % % of bits with each other % % Cryptography is the topic of the next chapter. % % rsync uses MD4 with a 128-bit checksum (for files with a matching size % and date) initially. But (from the man entry): % Current versions of rsync actually use an adaptive % algorithm for the checksum length by default, using % a 16 byte file checksum to determine if a 2nd pass % is required with a longer block checksum. 
% Only use this option if you have read the source code and
% know what you are doing.
% The `md5sum' program also uses 128 bits.
Even with a good one-way hash function, the digital signatures described above are still vulnerable to attack, if Fiona has access to the hash function. Fiona could take the tampered file and hunt for a further tiny modification to it such that its hash matches the original hash of Alice's file. This would take some time -- on average, about $2^{32}$ attempts, if the hash function has 32 bits -- but eventually Fiona would find a tampered file that matches the given hash. To be secure against forgery, \ind{digital signature}s must either have enough bits for such a random search to take too long, or the \ind{hash function} itself must be kept \ind{secret}.
\begin{conclusionbox} Fiona has to hash $2^M$ files to cheat. $2^{32}$ file modifications is not very many, so a 32-bit hash function is not large enough for \ind{forgery} prevention. \end{conclusionbox}
% If Fiona works as
Another person who might have a motivation for forgery is Alice herself. For example, she might be making a bet on the outcome of a race, without wishing to broadcast her prediction publicly; a method for placing bets would be for her to send to Bob the bookie the hash of her bet. Later on, she could send Bob the details of her bet. Everyone can confirm that her bet is consistent with the previously publicized hash. [This method of secret publication
% shing ideas
was used by Isaac Newton and Robert Hooke\index{Newton, Isaac}\index{Hooke, Robert}
% (1635-1703)
when they wished to establish priority for scientific ideas without revealing them. Hooke's hash function was alphabetization
% ed latin statements,
as illustrated by the conversion of {\em UT TENSIO, SIC VIS\/} into the \ind{anagram} {\tt{CEIIINOSSSTTUV}}.]
% http://www.microscopy-uk.org.uk/mag/artmar00/hooke2.html % http://www.rod.beavon.clara.net/leonardo.htm % It was in his Helioscopes in 1676 that Hooke followed the popular seventeenth-century conceit of announcing a discovery in an anagram: cediinnoopsssttuu. He published its key two years later, in his most complete treatment of elasticity, in De Potentia Bestitutiva, or Of Spring. Here Hooke enunciated the original formulation of the law that bears his name: Ut Pondus sic Tensia, or 'the weight is equal to the tension'. [33] As the tension was seen as the product of an increasing series of weights in pans suspended on coiled springs, it is easy in this pre-Newtoniangravitation age to understand how Hooke spoke of the pondus, or weight, as acting on the spring. The formulation of 'Hooke's Law' with which we are more familiar today is Ut Tensia, sic Vis, or 'the tension is equal to the force'. % % http://www.aero.ufl.edu/~uhk/strength/strength.htm ??? CEIIOSSOTTUU ??? CEIINOSSITTUV % ??? ceiiinosssttvv % all accounts differ! % http://arc-gen1.life.uiuc.edu/Bioph354/lect19.html Such a protocol relies on the assumption that Alice cannot change her bet after the event without the hash coming out wrong. How big a hash function do we need to use to ensure that Alice cannot cheat? The answer is different from the size of the hash we needed in order to defeat Fiona above, because Alice is the author of {\em both\/} files. Alice could \ind{cheat} by searching for two files that have identical hashes to each other. For example, if she'd like to cheat by placing two bets for the price of one, she could make a large number $N_1$ of versions of bet one (differing from each other in minor details only), and a large number $N_2$ of versions of bet two, and hash them all. If there's a \ind{collision} between the hashes of two bets of different types, then she can submit the common hash and thus buy herself the option of placing either \ind{bet}. 
\exampl{example.hashN1N2}{ If the hash has $M$ bits, how big do $N_1$ and $N_2$ need to be for Alice to have a good chance of finding two different bets with the same hash? }
% solution
This is a \ind{birthday} problem like \exerciseref{ex.birthday}. If there are $N_1$ Montagues and $N_2$ Capulets at a party, and each is assigned a `birthday' of $M$ bits, the expected number of \ind{collision}s between a Montague and a Capulet is \beq N_1 N_2 2^{-M} , \eeq so to minimize the number of files hashed, $N_1+N_2$, Alice should make $N_1$ and $N_2$ equal, and will need to hash about $2^{M/2}$ files until she finds two that match.\ENDsolution
\begin{conclusionbox} Alice has to hash $2^{M/2}$ files to cheat. [This is the square root of the number of hashes Fiona had to make.] \end{conclusionbox}
If Alice has the use of $C=10^6$ computers for $T=10$\,years, each computer taking $t=1\,$ns to evaluate a hash, the bet-communication system is\index{security} secure against Alice's dishonesty only if $M \gg 2 \log_2 (CT/t) \simeq 160$ bits. Here $CT/t \simeq 3 \times 10^{23} \simeq 2^{78}$ is the number of hashes that Alice can evaluate in that time.
% end solution
\section*{Further reading}
The Bible for hash codes is volume 3 of \citeasnoun{KnuthAll}. I highly recommend the story of Doug McIlroy's {\tt{\ind{spell}}} program, as told in section 13.8 of {\em{Programming Pearls}} \cite{Bentley2}. This astonishing piece of software makes use of a 64-\kilobyte\ data structure to store the spellings of all the words of a $75\,000$-word dictionary.
% also has some hash functions for strings on p 161, chapter 15.
% and random text generator.
\section{Further exercises}
% removed and returned (maybe should transfer some of these?)
% solutions in _se8.tex
% oct 97
%
% info theory and the real world
%
\fakesection{Information theory and the real world (questions relating to hash functions)}
\exercisaxA{1}{ex.address}{ What is the shortest the \ind{address} on a typical international {letter} could be, if it is to get to a unique human recipient? (Assume the permitted characters are {\tt{[A-Z,0-9]}}.)
How long are typical \ind{email} addresses? } \exercissxA{2}{ex.uniquestring}{ How long does a piece of text need to be for you to be pretty sure that no human has written that string of characters before? How many notes are there in a new \ind{melody}\index{music} that has not been composed before? } \exercissxB{3}{ex.proteinmatch}{ {\sf Pattern recognition by \ind{molecules}}.\index{pattern recognition} Some proteins produced in a cell have a regulatory role. A regulatory {protein} controls the transcription of specific \ind{genes} in the \ind{genome}. % that might code for other proteins or sometimes the protein itself. This control often involves the protein's binding to a particular \ind{DNA} sequence in the vicinity of the regulated gene. The presence of the bound protein either promotes or inhibits transcription of the gene. \ben \item Use information-theoretic arguments to obtain a lower bound on the size of a typical protein that acts as a regulator specific to one gene in the whole human genome. Assume that the genome is a sequence of $3 \times 10^{9}$ nucleotides drawn from a four letter alphabet $\{{\tt A},{\tt C},{\tt G},{\tt T}\}$;\index{amino acid}\index{nucleotide}\index{binding DNA} a protein is a sequence of amino acids drawn from a twenty letter alphabet. [Hint: establish how long the recognized DNA sequence has to be in order for that sequence to be unique to the vicinity of one gene, treating the rest of the genome as a random sequence. Then discuss how big the protein must be to recognize a sequence of that length uniquely.] \item Some of the sequences recognized by \ind{DNA}-binding regulatory\index{protein!regulatory} proteins consist of a subsequence that is repeated twice or more, for example the sequence \beq \mbox{{\tt{\underline{GCCCCC}CACCCCT\underline{GCCCCC}}}} \eeq is a binding site found upstream of the alpha-actin gene in humans. %; this is a binding site for a transcription factor called Sp1. 
Does the fact that some binding sites consist of a {repeated\/} subsequence influence your answer to part (a)? \een } % % stole information acquisition exercises from here to move to gene chapter % \dvips \section{Solutions}% to Chapter \protect\ref{ch.hash}'s exercises} % \soln{ex.meanhash}{ First imagine comparing the string $\bx$ with another random string $\bx^{(s)}$. The probability that the first bits of the two strings match is $1/2$. The probability that the second bits match is $1/2$. Assuming we stop comparing once we hit the first mismatch, the expected number of matches is 1, so the expected number of comparisons is 2 \exercisebref{ex.waithead}. % errors corrected in draft 2.0.7 on Sun 31/12/00 Assuming the correct string is located at random in the raw list, we will have to compare with an average of $\hashS/2$ strings before we find it, which costs $2 \hashS/2$ binary comparisons; and comparing the correct strings takes $N$ binary comparisons, giving a total expectation of $\hashS + N$ binary comparisons, if the strings are chosen at random. In the worst case (which may indeed happen in practice), the other strings are very similar to the search key, so that a lengthy sequence of comparisons is needed to find each mismatch. The worst case is when the correct string is last in the list, and all the other strings differ in the last bit only, giving a requirement of $\hashS N$ binary comparisons. } \soln{ex.hash.retrieval}{ The likelihood ratio for the two hypotheses, $\H_0$: $\bx^{(\hashs)} = \bx$, and $\H_1$: $\bx^{(\hashs)} \neq \bx$, contributed by the datum `the first bits of $\bx^{(\hashs)}$ and $\bx$ are equal' is \beq \frac{ P( \mbox{Datum} \given \H_0 ) } { P( \mbox{Datum} \given \H_1 ) } = \frac{1}{1/2} = 2. \eeq If the first $r$ bits all match, the likelihood ratio is $2^r$ to one. On finding that 30 bits match, the odds are a billion to one in favour of $\H_0$, assuming we start from even odds. 
[For a complete answer, we should compute the evidence % prior probability of $\H_0$ and $\H_1$ given by the prior information that the hash entry $s$ has been found in the table at $\bh(\bx)$. This fact gives further evidence in favour of $\H_0$.] } \soln{ex.hash.collision}{ Let the hash function have an output alphabet of size $T = 2^M$. If $M$ were equal to $\log_2 S$ then we would have exactly enough bits for each entry to have its own unique hash. The probability that one particular pair of entries collide under a random hash function is $1/T$. The number of pairs is $S(S-1)/2$. So the expected number of collisions between pairs is exactly \beq S(S-1)/(2T). \eeq If we would like this to be smaller than 1, then we need $ T > S(S-1)/2 % S(S-1) < 2A \:\: \Rightarrow \:\: S < \sqrt{2A} $ so \beq M > 2 \log_2 S. \label{eq.M2Shash} \eeq We need {\em twice as many\/} bits as the number of bits, $\log_2 S$, that would be sufficient to give each entry a unique name. % fS = S(S-1)/(2A) % A = (S-1) / (2 f ) If we are happy to have occasional collisions, involving a fraction $f$ of the names $S$, then we need $T > S/f$ (since the probability that one particular name is collided-with is $f \simeq S/T$) so \beq M > \log_2 S + \log_2 [1/f] , \label{eq.MShash} \eeq which means for $f \simeq 0.01$ that we need an extra 7 bits above $\log_2 S$. The important point to note is the \ind{scaling} of $T$ with $S$ in the two cases (\ref{eq.M2Shash},$\,$\ref{eq.MShash}). If we want the hash function to be collision-free, then we must have $T$ greater than $\sim \! S^2$. If we are happy to have a small frequency of collisions, then $T$ needs to be of order $S$ only. 
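
As a numerical check on these two scalings (a sketch assuming $S \simeq 2^{23}$ entries, the value given in this chapter's revised cast of characters), equation (\ref{eq.M2Shash}) says that a collision-free table needs
\beq
 M > 2 \log_2 S \simeq 46 \mbox{ bits} ,
\eeq
whereas equation (\ref{eq.MShash}) with $f \simeq 0.01$ gives
\beq
 M > \log_2 S + \log_2 [1/f] \simeq 23 + 6.6 \simeq 30 \mbox{ bits} ,
\eeq
which is consistent with the choice $M \simeq 30$ made there.
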
% some factor greater than
}
%
%
%
\soln{ex.nines.p}{ The posterior probability ratio for the two hypotheses, $\H_{+} = $ `calculation correct' and $\H_{-} = $ `calculation incorrect' is the product of the prior probability ratio $P(\H_{+})/P(\H_{-})$ and the likelihood ratio, $P(\mbox{match} \given \H_{+})/P(\mbox{match} \given \H_{-})$. This second factor is the answer to the question. The numerator $P(\mbox{match} \given \H_{+})$ is equal to 1. The denominator's value depends on our model of errors. If we know that the human calculator is prone to errors involving multiplication of the answer by 10, or to transposition of adjacent digits, neither of which affects the hash value, then $P(\mbox{match} \given \H_{-})$ could be equal to 1 also, so that the correct match gives no evidence in favour of $\H_{+}$. But if we assume that errors are `random from the point of view of the hash function' then the probability of a false positive is $P(\mbox{match} \given \H_{-}) = 1/9$, and the correct match gives evidence 9:1 in favour of $\H_{+}$. }
%
%
%
\soln{ex.whyonlyCRC}{ If you add a tiny $M=32$ extra bits of hash to a huge $N$-bit file you get pretty good \ind{error detection}\index{error-correcting code} --
% $1-2^{-M}$
the probability that an
% of detecting an error, less than a one-in-a-billion chance that the
error is undetected is $2^{-M}$, less than one in a billion. To do error {\em correction\/} requires far more check bits, the number depending on the expected types of corruption, and on the file size.
For example, if just eight random bits in a megabyte file are corrupted, it would take % $\log_2 {{ 8\times 10^{6}} \choose {8} } \simeq 180$ about $\log_2 {{ 2^{23} }\choose{8} } \simeq 23 \times 8 \simeq 180$ bits to specify which are the corrupted bits, and the number of \ind{parity-check bits} used by a successful error-correcting code would have to be at least this number, by the counting argument of \exerciseonlyref{ex.makecode2error} (solution, \pref{ex.makecode2error.sol}). % Shannon's \ind{noisy-channel coding theorem}. } % see also _se8.tex \fakesection{se8} %\begincuttable% NO, I LIKE IT \soln{ex.uniquestring}{ We want to know the length $L$ of a string such that it is very improbable that that string matches any part of the entire writings of humanity. Let's estimate that these writings total about one book for each person living, and that each book contains two million characters (200 pages with $10\,000$ characters per page) -- that's % $5\times 10^9 \times 2 \times 10^6 = $10^{16}$ characters, drawn from an alphabet of, say, 37 characters. The probability that a randomly chosen string of length $L$ matches at one point in the collected works of humanity is $1/37^{L}$. So the expected number of matches is $10^{16} /37^{L}$, which is vanishingly small if $L \geq 16/\log_{10} 37 \simeq 10$. % 10.2 Because of the redundancy and repetition of humanity's writings, it is possible that $L \simeq 10$ is an overestimate. So, if you want to write something unique, sit down and compose a string of ten characters. But don't write {\tt{gidnebinzz}}, because I already thought of that string. As for a new \ind{melody},\index{music} if we focus on the sequence of notes, ignoring duration and stress, and allow leaps of up to an octave at each note, then the number of choices per note is 23. The pitch of the first note is arbitrary. 
The number of melodies of length $r$ notes in this rather ugly ensemble of \ind{Sch\"onberg}ian tunes is $23^{r-1}$; for example, there are $250\,000$ of length $r=5$. Restricting the permitted intervals will reduce this figure; including duration and stress will increase it again. [If we restrict the permitted intervals to repetitions and tones or semitones, the reduction is particularly severe; is this why the melody of `\ind{Ode to Joy}' sounds so boring?] The number of recorded compositions is probably less than a million. % top of the pops for 50 * 50 weeks with 100 new songs per week If you learn 100 new melodies per week for every week of your life then you will have learned $250\,000$ melodies at age 50. Based on empirical experience of playing the game\index{game!guess that tune} `{\tt{guess that tune}}',\marginpar{\small\raggedright\reducedlead{In {\tt{guess that tune}}, one player chooses a melody, and sings a gradually-increasing number of its notes, while the other participants try to guess the whole melody.\medskip % aka http://www.melodyhound.com/ The {\dem\ind{Parsons code}\/} is a related hash function for melodies: % . To make the Parsons code of a melody, each pair of consecutive notes is coded as {\tt{U}} (`up') if the second note is higher than the first, {\tt{R}} (`repeat') if the pitches are equal, and {\tt{D}} (`down') otherwise. You can find out how well this hash function works at {\tt{http://{\breakhere}musipedia.{\breakhere}org/}}. % {\tt{www.{\breakhere}name-{\breakhere}this-{\breakhere}tune.{\breakhere}com}}. }} it seems to me that whereas many four-note sequences are shared in common between melodies, the number of collisions between five-note sequences is rather smaller -- most famous five-note sequences are unique. } %\ENDcuttable \soln{ex.proteinmatch}{ %\ben %\item (a) Let the DNA-binding {protein} recognize a sequence of length $L$ nucleotides. 
That is, it binds preferentially to that \ind{DNA} sequence, and not to any other pieces of DNA in the whole genome. (In reality, the recognized sequence may contain some wildcard characters, \eg, the {\tt{*}} in {\tt{TATAA*A}}, which denotes `any of {\tt{A}}, {\tt{C}}, {\tt{G}} and {\tt{T}}'; so, to be precise, we are assuming that the recognized sequence contains $L$ non-wildcard characters.) % in a sequence whose length can be greater than $L$.) Assuming the rest of the genome is `random', \ie, that the sequence consists of random nucleotides {\tt{A}}, {\tt{C}}, {\tt{G}} and {\tt{T}} with equal probability -- which is obviously untrue, but it shouldn't make too much difference to our calculation -- the chance that there is no other occurrence of the target sequence in the whole genome, of length $N$ nucleotides, is roughly \beq (1 - (1/4)^L )^N \simeq \exp ( - N (1/4)^L ) , \eeq which is close to one only if \beq N 4^{-L} \ll 1 , \eeq that is, \beq L > \log N / \log 4 . \eeq Using $N= 3 \times 10^9$, % from cell p.386 we require the recognized sequence to be longer than $L_{\min} = 16$ nucleotides. What size of \ind{protein} does this imply? %\ben \bit \item % A weak lower bound can be obtained by assuming that the information content of the protein sequence itself is greater than the information content of the \ind{nucleotide} sequence the protein prefers to bind to (which we have argued above must be at least 32 bits). This gives a minimum protein length of $32 / \log_2(20) \simeq 7$ \ind{amino acid}s. \item Thinking realistically, the \ind{recognition} of the DNA sequence by the protein presumably involves the protein coming into contact with all sixteen nucleotides in the target sequence. If the protein is a monomer, it must be big enough that it can simultaneously make contact with sixteen nucleotides of DNA. 
One helical turn of DNA containing ten nucleotides has a length of 3.4$\,$nm, so a contiguous sequence of sixteen nucleotides has a length of 5.4$\,$nm. The diameter of the protein must therefore be about 5.4$\,$nm or greater. Egg-white lysozyme is a small globular protein with a length of 129 amino acids % cell p.90 and a diameter of about 4$\,$nm. % cell p.130. Assuming that volume is proportional to sequence length and that volume scales as the cube of the diameter, a protein of diameter 5.4$\,$nm must have a sequence of length $2.5 \times 129 \simeq 324$ amino acids. %\een \eit % \item % (b) If, however, a target sequence consists of a twice-repeated sub-sequence, we can get by with a much smaller protein that recognizes only the sub-sequence, and that binds to the \ind{DNA} strongly only if it can form a {\em\ind{dimer}}, both halves of which are bound to the recognized sequence. % , which must appear twice in succession in the DNA. % with a neighbour. Halving the diameter of the protein, we now only need a protein whose length is greater than 324/8 = 40 amino acids. A protein of length smaller than this cannot by itself serve as a regulatory protein\index{protein!regulatory} specific to one gene, because it's simply too small to be able to make a sufficiently specific match -- its available surface does not have enough information content. % \een } % \dvips % % ch 8 LINEAR %\chapter{Linear Error correcting codes and perfect codes} %\chapter{Linear Error Correcting Codes and Perfect Codes \nonexaminable} \prechapter{About Chapter} % prechapter for linear codes / binary codes In Chapters \ref{ch.prefive}--\ref{ch.ecc}, we established Shannon's noisy-channel coding theorem for a general channel with any input and output alphabets. A great deal of attention in coding theory focuses on the special case of channels with binary inputs. 
Constraining ourselves to these channels simplifies matters, and leads us into an exceptionally rich world, which we will only taste in this book. One of the aims of this chapter is to point out a contrast between Shannon's aim of achieving reliable communication over a noisy channel and the apparent aim of many in the % this wonderful world of \ind{coding theory}.\index{sphere packing} Many coding theorists take as their fundamental problem the task of packing as many spheres as possible, with radius as large as possible, into an $N$-dimensional space, {\em with no spheres overlapping}. Prizes are awarded to people who find packings that squeeze in an extra few spheres. % of a given radius. While this is a fascinating mathematical topic, we shall see that the aim of maximizing the \ind{distance} between codewords in a code has only a tenuous relationship to Shannon's aim of reliable \ind{communication}. \ENDprechapter \chapter{Binary Codes \nonexaminable} \label{ch.linearecc} \label{ch.linear} % see also linearblock.tex % % chapter 8: linear error correcting codes % % distance % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % see also NOTES.tex We've established Shannon's noisy-channel coding theorem\index{linear block code} for a general channel with any input and output alphabets. A great deal of attention in coding theory focuses on the special case of channels with binary inputs, the first implicit choice being the binary symmetric channel.\index{channel!binary symmetric} The optimal decoder for a code, given a binary symmetric channel, finds the codeword that is closest to the received vector, closest\marginpar[b]{\small{{\sf Example:}\\[0.0012in] %\begin{center} \begin{tabular}{rl} \multicolumn{2}{c}{ The Hamming distance }\\ {between}& {\tt{00001111}}\\ and & {\tt{11001101}}\\ \multicolumn{2}{c}{ is 3. 
}\\ \end{tabular} %\end{center} }} in {\dem\ind{Hamming distance}}.\index{distance!Hamming} The Hamming distance between two binary vectors is the number of coordinates in which the two vectors differ. Decoding errors will occur if the noise takes us from the transmitted codeword $\bt$ to a received vector $\br$ that is closer to some other codeword. The {\dem{distances\/}} between codewords are thus relevant to the probability of a decoding error.\index{distance!of code} \section{Distance properties of a code} %\begin{description} %\item[The {\dem{distance}\/} of a\index{distance!of code} code] The {\dem{distance}\/} of a\index{distance!of code} % \index{error-correcting code!distance} code is the smallest separation between two of its codewords. % \end{description} % \begin{ indented \exampl{ex.hamm74dist}{ %\noindent {\sf Example:} The $(7,4)$ Hamming code (\pref{sec.ham74}) has distance $d= 3$. All pairs of its codewords differ in at least 3 bits. The maximum number of errors it can correct is $t=1$; in general a code with distance $d$ is $\lfloor (d\!-\!1)/2 \rfloor$-error-correcting. } % , and % the distance is related to this quantity by % $d=2t+1$. % \end{indented A more precise term for distance is the {\dem\ind{minimum distance}\/} of the code. The distance of a code is often denoted by $d$ or $d_{\min}$. % % \section{Weight enumerator function} % see code/bucky/README \index{error-correcting code!weight enumerator}% %\index{error-correcting code!distance distribution}% We'll now constrain our attention to linear codes. In a linear code, all codewords have identical distance properties, so we can summarize % the dis. % are equivalent, % from the point of view of the spectrum of % distances to other codewords. % summarizes all the distances between the code's codewords by counting the distances from the all-zero codeword. 
%\begin{description} %\item[The {\dem\ind{weight enumerator} function} of a code,] $A(w)$, The {\dem\ind{weight enumerator} function} of a code, $A(w)$, % $A(w)$ is defined to be the number of codewords in the code that have weight $w$. \amarginfig{b}{% \footnotesize \begin{tabular}{c} \raisebox{0.2in}{\buckypsfig{H74.eps}} \\ %# weight enumerator of $(7,4)$ code %# w A(w) C Random Random N-choose-w \begin{tabular}[b]{rr} \toprule $w$ & $A(w)$ \\ \midrule %%%%%%%%%%%%%%%%%%%%%%%%%%%% 0 & 1 \\ 3 & 7 \\ 4 & 7 \\ 7 & 1 \\ \midrule %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55 Total & 16\\ \bottomrule \end{tabular} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \\ \buckypsgraphb{H74.Aw.ps} \end{tabular} % see /home/mackay/code/bucky/H74.gnu \caption[a]{The graph of the $(7,4)$ Hamming code, and its weight enumerator function.} \label{fig.wef.h74} } % The weight enumerator function is also known as the {\dem{{distance distribution}\index{distance!distance distribution}}\/} of the code. %\end{description} % original is in graveyard.tex \begin{figure} \figuremargin{% \footnotesize \begin{tabular}{ccc} \buckypsfig{dodec.eps} & %# weight enumerator of (30,11) code dodec2.G %# w A(w) C Random Random N-choose-w \begin{tabular}{rr} \toprule $w$ & $A(w)$ \\ \midrule %%%%%%%%%%%%%%%%%%%%%%%%%%%% 0 & 1 \\ 5 & 12 \\ 8 & 30 \\ 9 & 20 \\ 10 & 72 \\ 11 & 120 \\ 12 & 100 \\ 13 & 180 \\ 14 & 240 \\ 15 & 272 \\ 16 & 345 \\ 17 & 300 \\ 18 & 200 \\ 19 & 120 \\ 20 & 36 \\ \midrule %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55 Total & 2048\\ \bottomrule \end{tabular} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% & \begin{tabular}{@{}c@{}} \buckypsgraphB{dodec2.Aw.ps} \\ \buckypsgraphB{dodec2.Aw.l.ps} \end{tabular} \\% see /home/mackay/code/bucky \end{tabular} }{ \caption[a]{ The graph defining the $(30,11)$ \ind{dodecahedron code}\index{error-correcting code!dodecahedron} % first introduced in secref{sec.dodecahedron} (the circles are the 30 transmitted bits and the triangles are 
the 20 parity checks, one of which is redundant) and the % (b-ii) The weight enumerator function (solid lines). The dotted lines show the average weight enumerator function of all random linear codes with the same size of generator matrix, % (dotted lines), which will be computed shortly. The lower figure shows the same functions on a log scale. %%%%%%%%%%%%%%% CHECK %%%%%%%%%%%%%%% % {\em (Check for cross-reference to earlier occurrence?)} } \label{fig.Aw} } \end{figure} % \begin{ indented ? \exampl{ex.hamm74Aw}{ % \noindent {\sf Example:} The weight enumerator functions of the $(7,4)$ Hamming code and the \ind{dodecahedron code}\index{error-correcting code!dodecahedron} are shown in figures \ref{fig.wef.h74} and \ref{fig.Aw}. % \end{indented } \section{Obsession with distance} Since the maximum number of errors that a code can {\em guarantee\/} to correct, $t$, is related to its distance $d$ by $t= \lfloor (d\!-\!1)/2 \rfloor$,\marginpar{\small{% $d=2t+1$ if $d$ is odd, and\\ $d=2t+2$ if $d$ is even.}} many coding theorists focus on the\index{distance!of code} distance of a code, searching for codes of a given size that have the biggest possible distance. Much of practical coding theory has focused on decoders that give the optimal decoding for all error patterns of weight up to the half-distance $t$ of their codes. \begin{description} \item[A \ind{bounded-distance decoder}]\index{decoder!bounded-distance} is a decoder that returns the closest codeword to a received\label{sec.bdd} binary vector $\br$ if the distance from $\br$ to that codeword is less than or equal to $t$; otherwise it returns a failure message. \end{description} The rationale for not trying to decode when more than $t$ errors have occurred might be `we can't {\em guarantee\/} that we can correct more than $t$ errors, so we won't bother trying -- who would be interested in a decoder that corrects some\index{sermon!worst-case-ism} error patterns of weight greater than $t$, but not others?' 
This defeatist attitude is an example of {\dem\ind{worst-case-ism}}, a widespread mental ailment % yes, spell checked which this book is intended to cure. The fact is that bounded-distance decoders cannot reach the\wow\ Shannon limit of the binary symmetric channel; only a decoder that often corrects more than $t$ errors can do this. State-of-the-art error-correcting codes have decoders that work way beyond the minimum distance of the code. \subsection{Definitions of good and bad distance properties} \index{distance!of code!good/bad}Given a family of codes of increasing blocklength $N$, and with rates approaching a limit $R>0$, we may be able to put that family in one of the following categories, which have some similarities to the categories of `good' and `bad' codes defined earlier (\pref{sec.bad.code.def}):\index{error-correcting code!good}\index{error-correcting code!bad}\index{error-correcting code!very bad}\index{distance!good}\index{distance!bad}\index{distance!very bad} \label{sec.bad.dist.def} \begin{description} \item[A sequence of codes has `good' distance] if $d/N$ tends to a constant greater than zero. \item[A sequence of codes has `bad' distance] if $d/N$ tends to zero. \item[A sequence of codes has `very bad' distance] if $d$ tends to a constant. \end{description} % THIS really belongs over the page \amarginfig{b}{ \begin{center} \mbox{ \psfig{figure=/home/mackay/itp/figs/gallager/16.12.G.ps,width=2in,angle=-90} }\end{center} \caption[a]{The graph of a rate-\dfrac{1}{2} low-density generator-matrix code. The rightmost $M$ of the transmitted bits are each connected to a single distinct parity constraint. The leftmost $K$ transmitted bits are each connected to a small number of parity constraints. } \label{fig.ldgmc} } \exampl{example.badcode}{ A {\dem\ind{low-density generator-matrix code}\/} is a linear code whose $K \times N$ generator matrix $\bG$ has a small number $d_0$ of {\tt{1}}s per row, regardless of how big $N$ is. 
The minimum distance of such a code is at most $d_0$, so {low-density generator-matrix code}s have `very bad' distance. } While having large distance is no bad thing, we'll see, later on, why an emphasis on distance can be unhealthy. \begin{figure}[htbp] \figuremargin{ \mbox{\psfig{figure=figs/caveperfect.ps,angle=-90,width=3in}} }{ \caption[a]{Schematic picture of part of Hamming space perfectly filled by $t$-spheres centred on the codewords of a perfect code.} \label{fig.caveperfect} } \end{figure} \section{Perfect codes} A $t$-sphere (or a sphere of radius $t$) in Hamming space, centred on a point $\bx$, is the set of points whose Hamming distance from $\bx$ is less than or equal to $t$. The $(7,4)$ \ind{Hamming code}\index{perfect code}\index{error-correcting code!perfect} has the beautiful property that if we place 1-spheres % of radius 1 about each of its 16 codewords, those spheres perfectly fill Hamming space without overlapping. As we saw in \chref{ch1}, every binary vector of length 7 is within a distance of $t=1$ of exactly one codeword of the Hamming code. \begin{description} \item[A code is a perfect $t$-error-correcting code] if the set of $t$-spheres centred on the codewords of the code fill the Hamming space without overlapping. (See \figref{fig.caveperfect}.) \end{description} Let's recap our cast of characters. The number of codewords is $S=2^K$. The number of points in the entire Hamming space is $2^N$. The number of points in a Hamming sphere of radius $t$ is \beq \sum_{w=0}^{t} {{N}\choose{w}} . \eeq For a code to be perfect with these parameters, we require $S$ times the number of points in the $t$-sphere to equal $2^N$: \beqan \mbox{for a perfect code, } \:\: 2^K \sum_{w=0}^{t} {{N}\choose{w}} & =& 2^N \\ \mbox{or, equivalently, }\:\: \sum_{w=0}^{t} {{N}\choose{w}} & =& 2^{N-K} . \eeqan For a perfect code, the number of noise vectors in one sphere must equal the number of possible syndromes. 
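\medskip\noindent To see how restrictive this counting condition is, here are two quick checks (the parameter values are chosen for illustration only). For $N=15$ and $t=1$,
\beq
 1 + {{15}\choose{1}} = 16 = 2^{4} ,
\eeq
so the condition can be met with $N-K=4$, that is, $K=11$. For $N=8$ and $t=1$,
\beq
 1 + {{8}\choose{1}} = 9 ,
\eeq
which is not a power of two, so no choice of $K$ satisfies the condition.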
The $(7,4)$ Hamming code satisfies this numerological condition\index{numerology} because \beq 1 + {{7}\choose{1}} = 2^3 . \label{eq.coincidence} \eeq % Interestingly, the first appearance of the ternary Golay code predated %Golay's publication by a good year. A Finnish devotee of football pools thought it up in list form (!) and published it % in 1947. % Covering codes. %G.Cohen, I.Honkala. S.Litsyn, and A.Lobstein %North-Holland Publishing Co., Amsterdam, 1997. xxii+542 pp. ISBN 0-444-82511-8 % It is this "ternary" Golay code which was first discovered by a Finn who was % determining good strategies for betting on blocks of 11 soccer games. Here, % one places a bet by predicting a Win, Lose, or Tie for all 11 games, and as % long as you do not miss more than two of them, you get a payoff. If a group % gets together in a "pool" and makes multiple bets to "cover all the options" % (so that no matter what the outcome, somebody's bet comes within 2 of the % actual outcome), then the codewords of a 2-error-correcting perfect code % provide a very nice option; the balls around its codewords fill all of the % space, with none left over. % % It was in this vein that the ternary Golay code was first constructed; its % discover, Juhani Virtakallio, exhibited it merely as a good betting system % for football-pools, and its 729 codewords appeared in the football-pool % magazine Veikkaaja. For more on this, see Barg's article [1]. % % [1] Barg, Alexander. "At the Dawn of the Theory of Codes," The Mathematical % Intelligencer, Vol. 15 (1993), No. 1, pp. 20--26. \subsection{How happy we would be to use perfect codes} If there were large numbers of perfect codes to choose from, with a wide range of blocklengths and rates, then these would be the perfect solution to Shannon's problem. 
We could communicate over a binary symmetric channel with noise level $f$, for example, by picking a perfect $t$-error-correcting code with blocklength $N$ and $t=f^* N$, where $f^* = f + \delta$ and $N$ and $\delta$ are chosen such that the probability that the noise flips more than $t$ bits is satisfactorily small. However, {\em there are almost no perfect codes}.\wow\ The only nontrivial perfect binary codes are \ben \item the Hamming codes, which are perfect codes with $t=1$ % -error-correcting with and blocklength $N=2^M-1$, defined below; the rate of a \ind{Hamming code} approaches 1 as its blocklength $N$ increases; \item the repetition codes of odd blocklength $N$, which are perfect codes with $t=(N-1)/2$; the rate of repetition codes goes to zero as $1/N$; and \item one remarkable $3$-error-correcting code with $2^{12}$ codewords of blocklength $N=23$ known as the binary \ind{Golay code}\index{error-correcting code!Golay}. [A second 2-error-correcting Golay code of length $N=11$ over a ternary alphabet was % 729 cw's in football-pool magazine Veikkaaja. discovered by a Finnish football-pool enthusiast\index{football pools}\index{bet}\index{design theory} % \index{Finland} called Juhani Virtakallio\index{Virtakallio, Juhani} in 1947.] % 1+23+23*11 + 23*11*7 = 2048 % If we allow more symbols in our alphabet than just 0 and 1, then we get analogues of the % Hamming codes, and another Golay code of length 11, this time on three letters (say 0, +, and -) and with parameters (11, % 3^6, 5). This completes the list of all linear perfect codes. parameters (11,3^6, 5). % http://lev.yudalevich.tripod.com/ECC/betting.html % % [1] Barg, Alexander. "At the Dawn of the Theory of Codes," The Mathematical Intelligencer, Vol. 15 (1993), No. 1, pp. % 20--26. \een There are no other binary perfect codes. Why this shortage of perfect codes? 
Is it because precise numerological coincidences like those satisfied by the parameters of the Hamming code (\ref{eq.coincidence}) and the Golay code, \beq 1 + {{23}\choose{1}} + {{23}\choose{2}} + {{23}\choose{3}} = 2^{11}, \eeq are rare? Are there plenty of `almost-perfect' codes for which the $t$-spheres fill {\em almost\/} the whole space? No. In fact, the picture of Hamming spheres centred on the codewords {\em{almost}\/} filling Hamming space (\figref{fig.cavenotquite}) is a misleading one: for most codes, whether they are good codes or bad codes,\index{sermon!sphere-packing} % almost all the Hamming space is taken up by the space {\em{between}\/} $t$-spheres % \wow\ (which is shown in grey in \figref{fig.cavenotquite}). \begin{figure} \figuremargin{ \mbox{\psfig{figure=figs/cavenotquite.ps,angle=-90,width=3in}} }{ \caption[a]{Schematic picture of Hamming space not perfectly filled by $t$-spheres centred on the codewords of a code. The grey regions show points that are at a Hamming distance of more than $t$ from any codeword. This is a misleading picture, as, for any code with large $t$ in high dimensions, the grey space between the spheres takes up almost all of Hamming space. } \label{fig.cavenotquite} } \end{figure} Having established this gloomy picture, we spend a moment filling in the properties of the perfect codes mentioned above. \subsection{The Hamming codes} The $(7,4)$ Hamming code can be defined as the linear code whose $3\times 7$ parity-check matrix contains, as its columns, all the 7 ($=2^3-1$) non-zero vectors of length 3. Since these 7 vectors are all different, any single bit-flip produces a distinct syndrome, so all single-bit errors can be detected and corrected. % from \input{tex/_concat2.tex} We can generalize this code, with $M=3$ parity constraints, as follows. 
The Hamming codes are single-error-correcting codes defined by picking a number of parity-check constraints, $M$; the blocklength $N$ is $N = 2^M-1$; the parity-check matrix contains, as its columns, all the $N$ non-zero vectors of length $M$ bits. The first few Hamming codes have the following rates: \medskip% added because of my change to the center environment \begin{center} \begin{tabular}{cr@{,$\,$}llp{1.4in}} \toprule % checks & %% (block length, source bits) % & rate & \\ % $M$ & ($N = 2^M-1$ , $K = N - M$) & $R=K/N$ & \\ \midrule \multicolumn{1}{c}{Checks, $M$} & \multicolumn{2}{c}{($N,K$)} & $R=K/N$ & \\ \midrule 2 & (3&1) & 1/3 & repetition code $R_3$ \\ 3 & (7&4) & 4/7 & $(7,4)$ Hamming code \\ 4 & (15&11) & 11/15 & \\ 5 & (31&26) & 26/31 & \\ 6 & (63&57) & 57/63 & \\ \bottomrule \end{tabular} \end{center} \exercissxA{2}{ex.HammingP}{ What is the probability of block error of the $(N,K)$ Hamming code to leading order, when the code is used for a binary symmetric channel with noise density $f$? } \section{Perfectness is unattainable -- first proof \nonexaminable} We will show in several ways that useful \ind{perfect code}s do not exist (here, `useful' means `having large blocklength $N$, and rate close neither to 0 nor 1'). % First, let's study a pithy, no-nonsense example. Shannon proved that, given a binary symmetric channel with any noise level $f$, there exist codes with large blocklength $N$ and rate as close as you like to $C(f) = 1 - H_2(f)$ that enable \ind{communication} with arbitrarily small error probability. For large $N$, the number of errors per block will typically be about $\fN$, so these codes of Shannon are `almost-certainly-$\fN$-error-correcting' codes. Let's pick the special case of a noisy channel with $f \in ( 1/3, 1/2)$. Can we find a large {\em perfect\/} code that is $\fN$-error-correcting? % with large blocklength for this channel? Well, let's suppose that such a code has been found, and examine just three of its codewords. 
(Remember that the code ought to have rate $R \simeq 1-H_2(f)$, so it should have an enormous number ($2^{NR}$) of codewords.) \begin{figure} \figuremargin{ \mbox{\psfig{figure=figs/noperfect3.ps,% width=64mm,angle=-90}} }{% \caption[a]{Three codewords. } \label{fig.noperfect} } % load 'gnuR' \end{figure} Without loss of generality, we choose one of the codewords to be the all-zero codeword and define the other two to have overlaps with it as shown in \figref{fig.noperfect}. The second codeword differs from the first in a fraction $u+v$ of its coordinates. The third codeword differs from the first in a fraction $v+w$, and from the second in a fraction $u+w$. A fraction $x$ of the coordinates have value zero in all three codewords. Now, if the code is $\fN$-error-correcting, its minimum distance must be greater than $2\fN$, so \beq u+v > 2f, \:\:\: v+w > 2f, \:\:\: \mbox{and} \:\:\: u+w > 2f . \eeq Summing these three inequalities and dividing by two, we have \beq u +v+w > 3f . \eeq So if $f>1/3$, we can deduce $u+v+w > 1$, so that $x<0$, which is impossible. Such a code cannot exist. So the code cannot have {\em three\/} codewords, let alone $2^{NR}$. We conclude that, whereas Shannon proved there are plenty of codes for communicating over a \ind{binary symmetric channel}\index{channel!binary symmetric}\index{perfect code} with $f>1/3$, {\em there are no perfect codes\index{error-correcting code!perfect} that can do this.} We now study a more general argument that indicates that there are no large perfect linear codes for general rates (other than 0 and 1). We do this by finding the typical distance of a random linear code. 
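\medskip\noindent As a numerical check of the three-codeword argument of this section, take the illustrative value $f = 0.4$ (an assumption; any $f \in (1/3,1/2)$ behaves in the same way). Each of the three pairwise distances must exceed $2f = 0.8$, so
\beq
 u+v+w > \frac{3 \times 0.8}{2} = 1.2 , \:\:\: \mbox{and hence} \:\:\: x = 1 - (u+v+w) < -0.2 ,
\eeq
a negative fraction of the coordinates, which is impossible.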
%\mynewpage \section{Weight enumerator function of random linear codes \nonexaminable} \label{sec.wef.random} Imagine % H=rand(12,24)>0.5 % octave \marginfig{\tiny{ \[%\mbox{\footnotesize{$\bH=$}} \hspace{-2mm}\begin{array}{c} {N}\\ \overbrace{\left.\hspace{-2mm}\left[\begin{array}{@{}*{24}{c@{\hspace{0.45mm}}}} 1&0&1&0&1&0&1&0&0&1&0&0&1&1&0&1&0&0&0&1&0&1&1&0\\ 0&0&1&1&1&0&1&1&1&1&0&0&0&1&1&0&0&1&1&0&1&0&0&0\\ 1&0&1&1&1&0&1&1&1&0&0&1&0&1&1&0&0&0&1&1&0&1&0&0\\ 0&0&0&0&1&0&1&1&1&1&0&0&1&0&1&1&0&1&0&0&1&0&0&0\\ 0&0&0&0&0&0&1&1&0&0&1&1&1&1&0&1&0&0&0&0&0&1&0&0\\ 1&1&0&0&1&0&0&0&1&1&1&1&1&0&0&0&0&0&1&0&1&1&1&0\\ 1&0&1&1&1&1&1&0&0&0&1&0&1&0&0&0&0&1&0&0&1&1&1&0\\ 1&1&0&0&1&0&1&1&0&0&0&1&1&0&1&0&1&1&1&0&1&0&1&0\\ 1&0&0&0&1&1&1&0&0&1&0&1&0&0&0&0&1&0&1&1&1&1&0&1\\ 0&1&0&0&0&1&0&0&0&0&1&0&1&0&1&0&0&1&1&0&1&0&1&0\\ 0&1&0&1&1&1&1&1&0&1&1&1&1&1&1&1&1&0&1&1&1&0&1&0\\ 1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&1&0&1&0&0&0&0&1&1 \end{array}\right]\right\} M \hspace{-2mm}\hspace{-0.25in} } \end{array} \] } \caption[a]{A random binary parity-check matrix.} \label{fig.randommatrix} }% making a code by picking the binary entries in the $M \times N$ parity-check matrix $\bH$ at random.\index{error-correcting code!random linear} What weight enumerator function should we expect? The \ind{weight enumerator} of one particular code with parity-check matrix $\bH$, $A(w)_{\bH}$, is the number of codewords of weight $w$, which can be written \beq A(w)_{\bH} = \sum_{\bx: |\bx| = w} \truth\! \left[ \bH \bx = 0 \right] , \eeq where the sum is over all vectors $\bx$ whose weight is $w$ and the \ind{truth function} $\truth\! \left[ \bH \bx = 0 \right]$ equals one if % it is true that $\bH \bx = 0$ and zero otherwise. We can find the expected value of $A(w)$, \beqan \langle A(w) \rangle &=& \sum_{\bH} P(\bH) A(w)_{\bH} \\ &=& \sum_{\bx: |\bx| = w} \sum_{\bH} P(\bH) \, \truth\! 
\left[ \bH \bx \eq 0 \right] , \label{eq.expAw} \eeqan by evaluating the probability that a particular word of weight $w>0$ is a codeword of the code (averaging over all binary linear codes in our ensemble). By symmetry, this probability depends only on the weight $w$ of the word, not on the details of the word. The probability that the entire syndrome $\bH \bx$ is zero can be found by multiplying together the probabilities that each of the $M$ bits in the syndrome is zero. Each bit $z_m$ of the syndrome is a sum (mod 2) of $w$ independent random bits, so the probability that $z_m \eq 0$ is $\dhalf$. The probability that $\bH \bx \eq 0$ is thus \beq \sum_{\bH} P(\bH) \, \truth\! \left[ \bH \bx \eq 0 \right] = (\dhalf)^M = 2^{-M}, \eeq independent of $w$. The expected number of codewords of weight $w$ (\ref{eq.expAw}) is given by summing, over all words of weight $w$, the probability that each word is a codeword. The number of words of weight $w$ is ${{N}\choose{w}}$, so \beq \langle A(w) \rangle = {{N}\choose{w}} 2^{-M} \:\:\mbox{for any $w>0$}. \eeq For large $N$, we can use $\log_2 {{N}\choose{w}} \simeq N H_2(w/N)$ and $R\simeq 1-M/N$ to write \beqan \log_2 \langle A(w) \rangle &\simeq& N H_2(w/N) -M \\ &\simeq& N [ H_2(w/N) - (1-R) ] \:\:\mbox{for any $w>0$}. \label{eq.wef.random} \eeqan As a concrete example, \figref{fig.Aw.540} shows the expected weight enumerator function of a rate-$1/3$ random linear code\index{error-correcting code!random linear} with $N=540$ and $M=360$. 
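\medskip\noindent For a case small enough to check by hand, consider random parity-check matrices with $N=7$ and $M=3$, the same size as the parity-check matrix of the $(7,4)$ Hamming code (this small case is an illustration added here, not one of the plotted examples). Then $\langle A(w) \rangle = {{7}\choose{w}}/8$ for $w>0$:
\begin{center}
\begin{tabular}{rrr} \toprule
$w$ & ${{7}\choose{w}}$ & $\langle A(w) \rangle$ \\ \midrule
1 & 7 & 0.875 \\
2 & 21 & 2.625 \\
3 & 35 & 4.375 \\
4 & 35 & 4.375 \\
5 & 21 & 2.625 \\
6 & 7 & 0.875 \\
7 & 1 & 0.125 \\ \midrule
Total & 127 & 15.875 \\ \bottomrule
\end{tabular}
\end{center}
The total, $127/8 \simeq 15.9$, is close to the 15 non-zero codewords of a code with $K=4$; it is a little larger because a random $\bH$ may have linearly dependent rows, in which case the code has more than $2^4$ codewords.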
\marginfig{ \begin{center} \mbox{% \small \hspace{-0.01in}% \begin{tabular}{c} \hspace{-0.15in}\mbox{\psfig{figure=/home/mackay/_doc/code/gallager/Am540R.ps,% width=41.5mm,angle=-90}}\\[-0.01in] \hspace{0.1in}\mbox{\hspace*{-0.35in}\psfig{figure=/home/mackay/_doc/code/gallager/Am540Rl.ps,% width=41.5mm,angle=-90}}\\[-0.1in] \end{tabular} } \end{center} %}{% \caption[a]{The expected weight enumerator function $\langle A(w) \rangle$ of a \index{error-correcting code!random linear}random linear code with $N=540$ and $M=360$. Lower figure shows $\langle A(w) \rangle$ on a logarithmic scale. } \label{fig.Aw.540} % load 'gnuR' } \subsection{Gilbert--Varshamov distance} For weights $w$ such that $H_2(w/N) < (1-R)$, the expectation of $A(w)$ is smaller than 1; for weights such that $H_2(w/N) > (1-R)$, the expectation is greater than 1. We thus expect, for large $N$, that the minimum distance of a random linear code will be close to the distance $d_{\rm GV}$ defined by \beq H_2(d_{\rm GV}/N) = (1-R) . \label{eq.GV.def} \eeq % INDENT ME? \noindent {\sf Definition.} This distance, $d_{\rm GV} \equiv N H_2^{-1}(1-R)$, is % known as the {\dem{Gilbert--Varshamov\index{distance!Gilbert--Varshamov}\index{Gilbert--Varshamov distance} distance}\/} for rate $R$ and blocklength $N$. The {\dem{Gilbert--Varshamov conjecture}}, widely believed, asserts that (for large $N$) it is not possible to\index{Gilbert--Varshamov conjecture} create binary codes with minimum distance significantly greater than $d_{\rm GV}$. \medskip % INDENT ME? \noindent {\sf Definition.} The {\dem{\index{Gilbert--Varshamov rate}Gilbert--Varshamov rate}\/} $R_{\rm GV}$ is the maximum rate at which you can reliably communicate with a \ind{bounded-distance decoder} (as defined on \pref{sec.bdd}), assuming that the Gilbert--Varshamov conjecture\index{Gilbert--Varshamov conjecture} is true. 
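\medskip\noindent As a numerical illustration of the Gilbert--Varshamov distance (the two rates below are chosen for concreteness), solving $H_2(d_{\rm GV}/N) = 1-R$ numerically gives
\beqan
 R = 1/2 : \:\: d_{\rm GV}/N & \simeq & 0.110 , \\
 R = 1/3 : \:\: d_{\rm GV}/N & \simeq & 0.174 .
\eeqan
For the rate-$1/3$ ensemble with $N=540$ of \figref{fig.Aw.540}, this puts $d_{\rm GV}$ at about $0.174 \times 540 \simeq 94$, to leading order in $N$.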
% \section{Perfect codes} A \index{error-correcting code!perfect}\see{perfect code}{code} \subsection{Why sphere-packing is a bad perspective, and an obsession with distance is inappropriate} If one uses a \ind{bounded-distance decoder},\index{sermon!sphere-packing} the maximum tolerable noise level will flip a fraction $f_{\rm bd} = \half d_{\min}/N$ of the bits. So, assuming $d_{\min}$ is equal to the \index{Gilbert--Varshamov distance}Gilbert distance $d_{\rm GV}$ (\ref{eq.GV.def}), we have:% \amarginfig{b}{ \begin{center} \mbox{\psfig{figure=figs/RGV.ps,angle=-90,width=1.7in}}\\[-0.1in] $f$ \end{center} \caption[a]{Contrast between Shannon's channel capacity $C$ and the Gilbert rate $R_{\rm GV}$ -- the maximum communication rate achievable using a \ind{bounded-distance decoder}, as a function of noise level $f$. For any given rate, $R$, the maximum tolerable noise level for Shannon is twice as big as the maximum tolerable noise level for a `worst-case-ist' who uses a bounded-distance decoder. } } \beq H_2(2 f_{\rm bd}) = (1-R_{\rm GV}) . \label{eq.idiotf} \eeq \beq R_{\rm GV} = 1 - H_2(2 f_{\rm bd}). \eeq Now, here's the crunch: what did Shannon say is achievable?\index{Shannon, Claude} He said the maximum possible rate of communication is the capacity, \beq C = 1 - H_2(f) . \eeq So for a given rate $R$, the maximum tolerable noise level, according to Shannon, is given by \beq H_2(f) = (1-R) . \label{eq.shannonf} \eeq Our conclusion: imagine a good code of rate $R$ has been chosen; equations (\ref{eq.idiotf}) and (\ref{eq.shannonf}) respectively define the maximum noise levels tolerable by a bounded-distance decoder, $f_{\rm bd}$, and by Shannon's decoder, $f$. \beq f_{\rm bd} = f/2 . \eeq Bounded-distance decoders can only ever cope with {\em half\/} the noise-level that Shannon proved is tolerable! % Need to show implication for perfect codes at the same time. How does this relate to perfect\index{error-correcting code!perfect} codes? 
A code is perfect if there are $t$-spheres around its codewords that fill Hamming space without overlapping. But when a typical random linear code is used to communicate over a binary symmetric channel near to the Shannon limit, the typical number of bits flipped is $\fN$, and the minimum distance between codewords is also $\fN$, or a little bigger, if we are a little below the Shannon limit. So the $\fN$-spheres around the codewords overlap with each other sufficiently that each sphere almost contains the centre of its nearest neighbour! \marginfig{\begin{center} \mbox{\psfig{figure=figs/overlap.eps,width=1.7in}}\\[-0.02in] \end{center} \caption[a]{Two overlapping spheres whose radius is almost as big as the distance between their centres. } \label{fig.overlap} } This overlap is not disastrous because, in high dimensions, the volume associated with the overlap, shown shaded in \figref{fig.overlap}, is a tiny fraction of either sphere, so the probability of landing in it is extremely small. The moral of the story is that \ind{worst-case-ism} can be bad for you, halving your ability to tolerate noise. You have to be able to decode {\em way\/} beyond the minimum distance of a code to get to the Shannon limit! Nevertheless, the minimum distance of a code is of interest in practice, because, under some conditions, the minimum distance dominates the errors made by a code. % On to the bat cave. (Could also dissect the random code % in more detail.) \section{Berlekamp's bats} \label{sec.bats} A blind \ind{bat}\index{Berlekamp, Elwyn} lives in a cave. It flies about the centre of the cave, which corresponds to one codeword, with its typical distance from the centre controlled by a friskiness parameter $f$. (The displacement of the bat from the centre corresponds to the noise vector.) The boundaries of the cave are made up of stalactites that point in towards the centre of the cave (\figref{fig.cavereal}).
Each stalactite is analogous to the boundary between the home codeword and another codeword. The stalactite is like the shaded region in \figref{fig.overlap}, but reshaped to convey the idea that it is a region of very small volume. Decoding errors correspond to the bat's intended trajectory passing inside a stalactite. Collisions with stalactites at various distances from the centre are possible. If the friskiness % (noise level) is very small, the bat is usually very close to the centre of the cave; collisions will be rare, and when they do occur, they will usually involve the stalactites whose tips are closest to the centre point. Similarly, under low-noise conditions, decoding errors will be rare, and they will typically involve low-weight codewords. Under low-noise conditions, the minimum distance of a code is relevant to the (very small) probability of error. \begin{figure}[hbtp] \figuremargin{ \mbox{\psfig{figure=figs/cavereal.ps,angle=-90,width=3in}} }{ \caption[a]{Berlekamp's schematic picture of Hamming space in the vicinity of a codeword. The jagged solid line encloses all points to which this codeword is the closest. The $t$-sphere around the codeword takes up a small fraction of this space. } \label{fig.cavereal} } \end{figure} If the friskiness is higher, the bat may often make excursions beyond the safe distance $t$ where the longest stalactites start, but % it is quite possible that it will collide most frequently with more distant stalactites, owing to their greater number. There's only a tiny number of \ind{stalactite}s at the minimum distance, so they are relatively unlikely to cause the errors. Similarly, errors in a real error-correcting code depend on the properties of the \ind{weight enumerator} function. At very high friskiness, the \ind{bat} is always a long way from the centre of the \ind{cave}, and almost all its collisions involve contact with distant stalactites. % bat in a cave. 
Under these conditions, the bat's collision frequency has nothing to do with the distance from the centre to the closest stalactite. %\section{Concatenation} % see also _concat.tex % this is the bit where we do the ``hamming are good'' story \section{Concatenation of Hamming codes\nonexaminable} \label{sec.concatenation} It is instructive to play some more with the \ind{concatenation} of \ind{Hamming code}s,\index{error-correcting code!Hamming} a concept we first visited in \figref{fig.concath1}, because we will get insights into the notion of good codes and the relevance or otherwise of the \ind{minimum distance} of a code.\index{distance!of code} We can create a concatenated code for a binary symmetric channel with noise density $f$ by encoding with several Hamming codes in succession. % /home/mackay/bin/concath.p~ % /home/mackay/_courses/itprnn/hamming/concath % /home/mackay/_courses/itprnn/hamming/concath.gnu The table recaps the key properties of the Hamming codes, indexed by number of constraints, $M$. All the Hamming codes have minimum distance $d=3$ and can correct one error in $N$. \medskip% because of modified center \begin{center} \begin{tabular}{ll}\toprule $N = 2^M-1$ & blocklength \\ % $K$ & $K = N - M$ & number of source bits \\ $p_{\rm B} = \smallfrac{3}{N} {{N} \choose {2}} f^2$ & probability of block error to leading order \\ \bottomrule % $R$ & $K/N$ \\ \end{tabular} \medskip \end{center} \marginfig{ \begin{center} %\mbox{% \footnotesize \raisebox{0.3591in}{$R$}% \hspace{0.2in}% \begin{tabular}{c} \mbox{\psfig{figure=hamming/concath.rate.ps,% width=40.5mm,angle=-90}}\\[0.1in] \hspace{0.3in}$C$ \end{tabular} %} \end{center} %}{% \caption[a]{The rate $R$ of the concatenated Hamming code as a function of the number of concatenations, $C$. 
} \label{fig.concath.rate} } % % \subsection{Proving that good codes can be made by concatenation} If we make a \ind{product code} by\index{error-correcting code!good}\index{error-correcting code!product code} concatenating a sequence of $C$ Hamming codes with increasing $M$, we can choose those parameters $\{ M_c \}_{c=1}^{C}$ in such a way that the rate of the product code % $R_C$ \beq R_C = \prod_{c=1}^C \frac{N_c - M_c}{N_c} \eeq tends to a non-zero limit as $C$ increases. For example, if we set $M_1 =2$, $M_2=3$, $M_3=4$, etc., then the asymptotic rate is 0.093 (\figref{fig.concath.rate}). The blocklength $N$ is a rapidly-growing function of $C$, so these codes are somewhat impractical. A further weakness of these codes is\index{distance!of concatenated code} that\index{error-correcting code!concatenated} their\index{error-correcting code!product code} minimum distance is not very good (\figref{fig.concath.n.d}).% \amarginfig{b}{ \begin{center} \small\footnotesize % %\hspace{0.042in}% %\begin{tabular}{c} %\mbox{\psfig{figure=hamming/concath.n.k.l.ps,% %width=40.5mm,angle=-90}} %% \\[-0.1in] $C$ %\end{tabular}\\[0.13in] \hspace*{0.2042in}% \begin{tabular}{c} \mbox{\psfig{figure=hamming/concath.n.d.ps,% width=40.5mm,angle=-90}}\\[0.1in] \hspace{0.3in}$C$\\[-0.05in] \end{tabular} \end{center} %}{% \caption[a]{The blocklength $N_C$ (upper curve) and % $(N,K)$ (upper figure) and minimum distance $d_C$ (lower curve) % (lower figure) of the concatenated Hamming code as a function of the number of concatenations $C$. } \label{fig.concath.n.k.l} \label{fig.concath.n.d} } % % why is this fig not taking up its correct space? % % The blocklength $N$ is a rapidly growing function of $C$, so these codes % are mainly of theoretical interest. % Every one of the constituent Hamming codes has \ind{minimum distance}\index{distance!of code} 3, so the minimum distance of the $C$th product is $3^C$. 
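To illustrate with the choice $M_1=2$, $M_2=3$, $M_3=4$, etc., mentioned above, the first few concatenations can be tabulated as follows. This is a sketch that assumes the blocklength of the product code is the product of the constituent blocklengths, $N_C = \prod_{c=1}^{C} N_c$ with $N_c = 2^{M_c}-1$; the decimal values are rounded.
\begin{center}
\begin{tabular}{cccccc} \toprule
$C$ & $M_C$ & $N_C$ & $R_C$ & $d_C = 3^C$ & $d_C/N_C$ \\ \midrule
1 & 2 & 3 & $1/3 \simeq 0.333$ & 3 & 1 \\
2 & 3 & $3 \times 7 = 21$ & $4/21 \simeq 0.190$ & 9 & 0.43 \\
3 & 4 & $21 \times 15 = 315$ & $44/315 \simeq 0.140$ & 27 & 0.086 \\
4 & 5 & $315 \times 31 = 9765$ & $1144/9765 \simeq 0.117$ & 81 & 0.0083 \\ \bottomrule
\end{tabular}
\end{center}
Each additional factor $(N_c - M_c)/N_c$ is closer to 1 than the one before, which is why the rate settles down to a non-zero value, while the ratio $d_C/N_C$ collapses rapidly.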
The blocklength $N$ grows faster % with $C$ than $3^C$, so the ratio $d/N$ tends to zero as $C$ increases. In contrast, for typical random codes, the ratio $d/N$ tends to a constant\index{random code} % distance tends to a fraction of $N$, such that $H_2(d/N) = 1-R$.\index{Hamming code} Concatenated Hamming codes\index{distance!bad}\index{distance!of product code} thus have `bad' distance.% \pref{distance.defs} Nevertheless, it turns out that this simple sequence of codes yields good codes\index{error-correcting code!good} for some channels -- but not very good codes (see \sectionref{sec.good.codes} to recall the definitions of the terms `good' and `very good'). Rather than prove this result, we will simply explore it numerically. \Figref{fig.concath.rateeb} shows the bit error probability $p_{\rm b}$ of the concatenated codes assuming that the constituent codes are decoded in sequence, as described in section \ref{sec.concatdecode}. [This one-code-at-a-time decoding is suboptimal, as we saw there.] % refers to {tex/_concat.tex}% contains simple example % % concath.p The horizontal axis shows the rates of the codes. As the number of concatenations increases, the rate drops to 0.093 and the error probability drops towards zero. The channel assumed in the figure is the binary symmetric channel with $\q=0.0588$. This is the highest noise level that can be tolerated using this concatenated code. \amarginfig{c}{ \begin{center} \footnotesize \mbox{% \raisebox{0.591in}{$p_{\rm b}$}% \hspace{0.2042in}% \begin{tabular}{c} \mbox{\psfig{figure=hamming/concath.rate.058.ps,% width=40mm,angle=-90}}\\[0.1in] \hspace{0.54in}$R$\\[-0.03in] \end{tabular}} \end{center} %}{% \caption[a]{The bit error probabilities versus the rates $R$ of the concatenated Hamming codes, for the binary symmetric channel with $\q=0.0588$. Labels alongside the points show the blocklengths, $N$. The solid line shows the Shannon limit for this channel. 
The bit error probability drops to zero while the rate tends to 0.093, so the concatenated Hamming codes are a `good' code family. } \label{fig.concath.rateeb} } %%%%%%%%%%%%%%%%%%%% there is a major margin object problem here, % don't understand it! The take-home message from this story is {\em{distance isn't everything}}.\index{distance!isn't everything} % Indeed, t The minimum distance of a code, although widely worshipped by coding theorists, is not of fundamental importance\index{coding theory} to Shannon's mission of achieving reliable \ind{communication} over noisy channels.\index{Shannon, Claude}\index{coding theory} \exercisxB{3}{ex.distancenotE}{ Prove that there exist families of codes with `bad' distance that are `very good' codes. } % soln in _linear.tex \section{Distance isn't everything} Let's\index{error-correcting code!error probability} % look at this assertion some more in order to get a quantitative feeling for the effect of the minimum distance\index{error probability!and distance} of a code, for the special case of a \ind{binary symmetric channel}.\index{channel!binary symmetric} %\exampl{ex.bhat}{ \subsection{The error probability associated with one low-weight codeword} \label{sec.err.prob.one} % begin INTRO Let a binary code have blocklength $N$ and just two codewords, which differ in $d$ places. For simplicity, let's assume $d$ is even. What is the error probability if this code is used on a binary symmetric channel with noise level $f$? Bit flips matter only in places where the two codewords differ. % Only flips of bits in the places that differ matter. The error probability is dominated by the probability that $d/2$ of these bits are flipped. What happens to the other bits is irrelevant, since the optimal decoder ignores them. \beqan P(\mbox{block error}) & \simeq & {{d}\choose{d/2}} f^{d/2} (1-f)^{d/2} . 
% \geq here if you want \eeqan This error probability associated with a single codeword of weight $d$ is plotted in \figref{fig.dist}.% \amarginfig{c}{% \footnotesize \begin{tabular}{c} \hspace*{0.2in}\psfig{figure=gnu/errorVdist.ps,width=1.8in,angle=-90}\\[0.1in] \end{tabular} % see /home/mackay/itp/gnu/dist.gnu \caption[a]{ The error probability associated with a single codeword of weight $d$, ${{d}\choose{d/2}} f^{d/2} (1-f)^{d/2}$, as a function of $f$.} \label{fig.dist} } Using the approximation for the binomial coefficient (\ref{eq.stirling.choose}), we can further approximate \beqan P(\mbox{block error}) % \leq here if you want & \simeq & \left[ 2 f^{1/2} (1-f )^{1/2} \right]^{d} \\ & \equiv & [\beta(f)]^{d} , \label{eq.bhatta} \eeqan where $\beta(f) = 2 f^{1/2} (1-f )^{1/2}$ is called the \ind{Bhattacharyya parameter} of the channel.\nocite{Bhattacharyya} %\marginpar{\footnotesize{You don't need % to memorize this name; indeed, I need to check this is the correct name, as it is not in the %index of any coding theory books on my shelf! Must check in McEliece.}} % % Bhattacharyya, A.On a measure of divergence between two statistical % populations defined by their probability distributions. Bull. % Calcutta Math. Soc. 35 (1943), pp. 99-110. % % A recent book that calls your $\beta$ the Bhattacharyya parameter is % Johanesson and Zigangirov's book on convolutional codes. I think some % of Viterbi's books also use the term. % % end INTRO % \subsection{Recap of `very bad' distance} Now, consider a general linear code with distance $d$. Its block error probability must be at least ${{d}\choose{d/2}} f^{d/2} (1-f)^{d/2}$, independent of the blocklength $N$ of the code. For this reason, a sequence of codes of increasing blocklength $N$ and constant distance $d$ (\ie, `very bad' distance)\label{sec.verybadisbad} cannot have a block error probability that tends to zero, on any binary symmetric channel. 
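As a numerical check on the approximation (\ref{eq.bhatta}), take $d=20$ and $f=0.001$ (values chosen for illustration, and rounded). The binomial expression gives
\beq
{{20}\choose{10}} f^{10} (1-f)^{10} \simeq 184\,756 \times 10^{-30} \times 0.990 \simeq 1.8 \times 10^{-25} ,
\eeq
whereas the Bhattacharyya form gives
\beq
[\beta(f)]^{20} = 2^{20} f^{10} (1-f)^{10} \simeq 1\,048\,576 \times 10^{-30} \times 0.990 \simeq 1.0 \times 10^{-24} .
\eeq
The two differ by the factor $2^{20} / {{20}\choose{10}} \simeq 5.7$, which is close to $\sqrt{\pi d/2}$. So $[\beta(f)]^{d}$ overestimates the binomial expression by a factor that grows only slowly with $d$, while capturing its exponential dependence on $d$.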
If we are interested in making superb error-correcting codes with tiny, tiny error probability, we might therefore shun codes with bad distance. However, being pragmatic, we should look more carefully at \figref{fig.dist}. In \chref{ch1} we argued that codes for disk drives need an error probability smaller than about $10^{-18}$. If the raw error probability in the \ind{disk drive} is about $0.001$, the error probability associated with one codeword at distance $d=20$ is smaller than $10^{-24}$. If the raw error probability in the disk drive is about $0.01$, the error probability associated with one codeword at distance $d=30$ is smaller than $10^{-20}$. For practical purposes, therefore, it is not essential for a code to have good distance. For example, codes of blocklength $10\,000$, known to have many codewords of weight 32, can nevertheless correct errors of weight 320 with tiny error probability. I wouldn't want you to think I am {\em recommending\/} the use of codes with bad distance; in \chref{ch.ldpcc} we will discuss low-density parity-check codes, my favourite codes, which have both excellent performance and {\em good\/} distance. % These are my favourite codes. % It's as a matter of honesty that I am pointing out % that having good distance scarcely matters. % So regardless of the blocklength used, \section{The union bound} The error probability of a code on the binary symmetric channel can be bounded in terms of its \ind{weight enumerator} function by adding up appropriate multiples of the error probability associated with a single codeword (\ref{eq.bhatta}): \beq P(\mbox{block error}) \leq \sum_{w>0} A(w) [\beta(f)]^w . \label{eq.unionB} \eeq % could include Bob's poor man's coding theorem here. 
This inequality, which is an example of a {\dem\ind{union bound}},\index{bound!union} is accurate for low noise levels $f$, but inaccurate for high noise levels, because it overcounts the contribution of errors that cause confusion with more than one codeword at a time. %MNBV\newpage \exercisxB{3}{ex.poormancoding}{ {\sf Poor man's noisy-channel coding theorem}.\index{noisy-channel coding theorem!poor man's version}\index{poor man's coding theorem} Pretending that the union bound (\ref{eq.unionB}) {\em is\/} accurate, and using the average {\ind{weight enumerator} function of a random linear code} (\ref{eq.wef.random}) (\secref{sec.wef.random}) as $A(w)$, estimate the maximum rate $R_{\rm UB}(f)$ at which one can communicate over a binary symmetric channel. Or, to look at it more positively, using the union bound (\ref{eq.unionB}) as an inequality, show that communication at rates up to $R_{\rm UB}(f)$ is possible over the binary symmetric channel. % In proving this result, you are proving a `poor man's version' of % {Shannon}'s noisy-channel coding theorem. } In the following chapter, by analysing the probability of error of {\em \ind{syndrome decoding}\/} for a binary linear code, and using a union bound, we will prove Shannon's noisy-channel coding theorem (for symmetric binary channels), and thus show that {\em very good linear codes exist}. % possible point for exercise from exact.tex to be included. \section{Dual codes\nonexaminable} A concept that has some importance in coding theory,\index{error-correcting code!dual} though we will have no immediate use for it in this book, is the idea of the {\dem\ind{dual}} of a linear error-correcting code. An $(N,K)$ linear error-correcting code can be thought of as a set of $2^{K}$ codewords generated by adding together all combinations of $K$ independent basis codewords. The generator matrix of the code consists of those $K$ basis codewords, conventionally written as row vectors. 
For example, the $(7,4)$ Hamming code's generator matrix (from \pref{eq.Generator}) % \eqref{eq.Generator}, is \beq \bG = \left[ \begin{array}{ccccccc} \tt 1& \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1 \\ \tt 0& \tt 1& \tt 0& \tt 0& \tt 1& \tt 1& \tt 0 \\ \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1& \tt 1 \\ \tt 0& \tt 0& \tt 0& \tt 1& \tt 0& \tt 1& \tt 1 \\ \end{array} \right] \label{eq.Generator2} \eeq and its sixteen codewords were displayed in \tabref{tab.74h} (\pref{tab.74h}). The codewords of this code are linear combinations of the four vectors $\left[ \tt 1 \: \tt 0 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1 \right]$, $\left[ \tt 0 \: \tt 1 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 1 \: \tt 0 \right]$, $\left[ \tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1 \: \tt 1 \: \tt 1 \right]$, and $\left[ \tt 0 \: \tt 0 \: \tt 0 \: \tt 1 \: \tt 0 \: \tt 1 \: \tt 1 \right]$. An $(N,K)$ code may also be described in terms of an $M \times N$ parity-check matrix (where $M=N-K$) as the set of vectors $\{ \bt \}$ that satisfy \beq \bH \bt = {\bf 0} . \eeq One way of thinking of this equation is that each row of $\bH$ specifies a vector to which $\bt$ must be orthogonal if it is a codeword. \medskip \noindent \begin{conclusionbox} The generator matrix specifies $K$ vectors {\em from which\/} all codewords can be built, and the parity-check matrix specifies a set of $M$ vectors {\em to which\/} all codewords are orthogonal. \smallskip The dual of a code is obtained by exchanging the generator matrix and the parity-check matrix. \end{conclusionbox} \medskip \noindent {\sf Definition.} The set of {\em all\/} vectors of length $N$ that are orthogonal to all codewords in a code, $\C$, is called the dual of the code, $\C^{\perp}$. \medskip If $\bt$ is orthogonal to $\bh_1$ and $\bh_2$, then it is also orthogonal to $\bh_3 \equiv \bh_1 + \bh_2$; so all codewords are orthogonal to any linear combination of the $M$ rows of $\bH$. 
So the set of all linear combinations of the rows of the parity-check matrix is the dual code. % called the dual of the code. % The dual is itself a linear % error-correcting code, whose generator matrix is $\bH$. %% And similarly, t % The parity-check matrix of the dual is $\bG$, % the generator matrix of the first code. For our Hamming $(7,4)$ code, the parity-check matrix is (from \pref{eq.pcmatrix}): \beq \bH = \left[ \begin{array}{cc} \bP & \bI_3 \end{array} \right] = \left[ \begin{array}{ccccccc} \tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0&\tt 0 \\ \tt 0&\tt 1&\tt 1&\tt 1&\tt 0&\tt 1&\tt 0 \\ \tt 1&\tt 0&\tt 1&\tt 1&\tt 0&\tt 0&\tt 1 \end{array} \right] . \label{eq.pcmatrix2} \eeq % and the three vectors to which the codewords are % orthogonal are %$\left[ %\tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0\: \tt 0 % \right]$, %$\left[ %\tt 0\: \tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0 % \right]$, % and %$\left[ %\tt 1\: \tt 0\: \tt 1\: \tt 1\: \tt 0\: \tt 0\: \tt 1 % \right]$. % The codewords are not orthogonal to these $M$ % vectors only, however. I The dual of the $(7,4)$ Hamming code $\H_{(7,4)}$ is the code shown in \tabref{tab.74h.dual}. \begin{table}[htbp] \figuremargin{% \begin{center} \mbox{\small \begin{tabular}{c} \toprule % Transmitted sequence % $\bt$ \\ \midrule \tt 0000000 \\% yes \tt 0010111 \\% yes \bottomrule \end{tabular} \hspace{0.02in} \begin{tabular}{c} \toprule % $\bt$ \\ \midrule \tt 0101101 \\% yes \tt 0111010 \\ \bottomrule % yes \end{tabular} \hspace{0.02in} \begin{tabular}{c} \toprule % $\bt$ \\ \midrule \tt 1001110 \\% yes \tt 1011001 \\ \bottomrule % yes \end{tabular} \hspace{0.02in} \begin{tabular}{c} \toprule % $\bt$ \\ \midrule \tt 1100011 \\% yes \tt 1110100 \\ % yes \bottomrule \end{tabular} }%%%%%%%%% end of row of four tables \end{center} }{% \caption[a]{The eight codewords % $\{ \bt \}$ of the dual of the $(7,4)$ Hamming code. [Compare with \protect\tabref{tab.74h}, \protect\pref{tab.74h}.] 
} \label{tab.74h.dual} } \end{table} % STRANGE MISREF????????? CHECK A possibly unexpected property of this pair of codes is that the dual, $\H_{(7,4)}^{\perp}$, is contained within the code $\H_{(7,4)}$ itself: every word in the dual code is a codeword of the original $(7,4)$ Hamming code. This relationship can be written using set notation: \beq \H_{(7,4)}^{\perp} \subset \H_{(7,4)} . \eeq The possibility that the set of dual vectors can overlap the set of codeword vectors is counterintuitive if we think of the vectors as real vectors -- how can a vector be orthogonal to itself? But when we work in modulo-two arithmetic, many non-zero vectors are indeed orthogonal % perpendicular to themselves! \exercissxB{1}{ex.perp}{ Give a simple rule that distinguishes whether a binary vector is orthogonal to itself, as is each of the three vectors $\left[ \tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0\: \tt 0 \right]$, $\left[ \tt 0\: \tt 1\: \tt 1\: \tt 1\: \tt 0\: \tt 1\: \tt 0 \right]$, and $\left[ \tt 1\: \tt 0\: \tt 1\: \tt 1\: \tt 0\: \tt 0\: \tt 1 \right]$. } \subsection{Some more duals} In general, if a code has a systematic generator matrix, \beq \bG = \left[ \bI_K | \bP^{\T} \right] , \eeq where $\bP$ is an $M \times K$ matrix, then its parity-check matrix is \beq \bH = \left[ \bP | \bI_M \right] . \eeq \exampl{example.rthreedual}{ The repetition code $\Rthree$ has generator matrix \beq \bG =\left[ \begin{array}{ccc} \tt 1 &\tt 1 &\tt 1 \end{array} \right]; % [{\tt 1\:1\:1} ] ; \eeq its parity-check matrix is \beq \bH = \left[ \begin{array}{ccc} \tt 1 &\tt 1 &\tt 0 \\ \tt 1 &\tt 0 &\tt 1 \end{array} \right] . \eeq The two codewords are [{\tt 1 1 1}] and [{\tt 0 0 0}].
The dual code has generator matrix \beq \bG^{\perp} = \bH = \left[ \begin{array}{ccc} \tt 1 &\tt 1 &\tt 0 \\ \tt 1 &\tt 0 &\tt 1 \end{array} \right] \eeq or equivalently, modifying $\bG^{\perp}$ into systematic form by row additions, % manipulations, \beq \bG^{\perp} = \left[ \begin{array}{ccc} \tt 1 &\tt 0 &\tt 1 \\ \tt 0 &\tt 1 &\tt 1 \end{array} \right] . \eeq We call this dual code the {\dem{simple parity code}} P$_3$;\index{error-correcting code!P$_3$}\index{error-correcting code!simple parity}\index{error-correcting code!dual} it is the code with one parity-check bit, which is equal to the sum of the two source bits. The dual code's four codewords are $ \left[ \tt 1 \: \tt 1 \: \tt 0 \right] $, $ \left[ \tt 1 \: \tt 0 \: \tt 1 \right] $, $ \left[ \tt 0 \: \tt 0 \: \tt 0 \right] $, and $ \left[ \tt 0 \: \tt 1 \: \tt 1 \right] $. In this case, the only vector common to the code and the dual is the all-zero codeword. } \subsection{Goodness of duals} If a sequence of codes is `good', are their \index{error-correcting code!dual}duals {good} too?\index{error-correcting code!good} Examples can be constructed of all cases: good codes with good duals (random linear codes); bad codes with bad duals; and good codes with bad duals. The last category is especially important: many state-of-the-art codes have the property that their duals are bad. The classic example is the low-density parity-check code, whose dual is a low-density generator-matrix code.\index{error-correcting code!low-density generator-matrix} \exercisxB{3}{ex.ldgmbad}{ Show that low-density generator-matrix codes are bad. A family of low-density generator-matrix codes is defined by two parameters $j,k$, which are the column weight and row weight of all columns and rows, respectively, of $\bG$. These weights are fixed, independent of $N$; for example, $(j,k)=(3,6)$. [Hint: show that the code has low-weight codewords, then use the argument from \pref{sec.verybadisbad}.]
} \exercisxD{5}{ex.ldpcgood}{ Show that low-density parity-check codes are good, and have good distance.\index{error-correcting code!low-density parity-check} (For solutions, see \citeasnoun{Gallager63} and \citeasnoun{mncN}.) } \subsection{Self-dual codes} The $(7,4)$ Hamming code had the property that the dual was contained in the code itself. % used to say - % A code is {\dem{\ind{self-orthogonal}}} if it contains its dual. A code is {\dem{\ind{self-orthogonal}}\/} if it is contained in its dual. For example, the dual of the $(7,4)$ Hamming code is a self-orthogonal code. One way of seeing this is that the overlap between any pair of rows of $\bH$ is even. %\marginpar{Is % it an accepted abuse of terminology to also say % a code is self-orthogonal if it contains its dual?} Codes that contain their duals are important in quantum error-correction \cite{ShorCSS}. It is intriguing, though not necessarily useful, to look at codes that are {\dem\ind{self-dual}}. A code $\C$ is self-dual if the dual of the code is identical to the code. % Here, we are looking for codes that satisfy \beq \C^{\perp} = \C . \eeq Some properties of self-dual codes can be deduced: % \ben \item If a code is self-dual, then its generator matrix is also a parity-check matrix for the code. \item Self-dual codes have rate $1/2$, \ie, $M=K=N/2$. \item All codewords have even weight. \een \exercissxB{2}{ex.selfdual}{ What property must the matrix $\bP$ satisfy, if the code with generator matrix $\bG = \left[ \bI_K | \bP^{\T} \right]$ is self-dual? } \subsubsection{Examples of self-dual codes} \ben \item The repetition code R$_2$ is a simple example of a self-dual code. \beq \bG = \bH = \left[ \begin{array}{cc} \tt 1 &\tt 1 \end{array} \right] . % [{\tt 1 \: 1 } ] \eeq \item The smallest non-trivial self-dual code is the following $(8,4)$ code. 
\beq \bG = \left[ \begin{array}{c|c} \bI_4 & \bP^{\T} \end{array} \right] = \left[ \begin{array}{cccc|cccc} \tt 1&\tt 0&\tt 0 &\tt 0 &\tt 0&\tt 1&\tt 1&\tt 1\\ \tt 0&\tt 1&\tt 0 &\tt 0 &\tt 1&\tt 0&\tt 1&\tt 1\\ \tt 0&\tt 0&\tt 1 &\tt 0 &\tt 1&\tt 1&\tt 0&\tt 1\\ \tt 0&\tt 0&\tt 0 &\tt 1 &\tt 1&\tt 1&\tt 1&\tt 0 \end{array} \right] . \label{eq.selfdual84G} \eeq \een \exercissxB{2}{ex.dual84.74}{ Find the relationship of the above $(8,4)$ code to the $(7,4)$ Hamming code. } \subsection{Duals and graphs} Let a code be represented by a graph in which there are nodes of two types, parity-check constraints and equality constraints, joined by edges which represent the bits of the code (not all of which need be transmitted). The dual code's graph is obtained by replacing all \ind{parity-check nodes} by equality nodes and {\em vice versa}. This type of graph is called a \ind{normal graph} by \citeasnoun{Forney2001}. % Forney % added Thu 16/1/03 \subsection*{Further reading} Duals are important in coding theory because functions involving a code (such as the posterior distribution over codewords) can be transformed by a \ind{Fourier transform} into functions over the dual code. For an accessible introduction to Fourier analysis on finite groups, see \citeasnoun{Terras99}. See also \citeasnoun{macwilliams&sloane}. \section{Generalizing perfectness to other channels} Having given up on the search for \ind{perfect code}s for the binary symmetric channel, we could console ourselves by changing channel.\index{error-correcting code!perfect} We could call a code `a perfect $u$-error-correcting code for the binary \ind{erasure channel}'\index{channel!erasure} if it can restore any $u$ erased bits, and never more than $u$.% \marginpar{\small\raggedright\reducedlead{In a perfect $u$-error-correcting code for the binary {erasure channel}, the number of redundant bits must be $N-K=u$. 
}} Rather than using the word perfect, however, the conventional term for such a code is a `\ind{maximum distance separable} code', or MDS code. \label{sec.RAIDII} % Examples: As we already noted in \exerciseref{ex.raid3}, the $(7,4)$ \ind{Hamming code} is {\em not\/} an MDS % maximum distance separable code. It can recover {\em some\/} sets of 3 erased bits, but not all. If any 3 bits corresponding to a codeword of weight 3 are erased, then one bit of information is unrecoverable. This is why the $(7,4)$ code is a poor choice for a \ind{RAID} system.\index{redundant array of independent disks} %A maximum distance separable (MDS) block code is a linear code whose distance is maximal among all linear % block codes of rate k/n. It is well known that MDS block codes do exist if the field size is more than n. A tiny example of a maximum distance separable code\index{erasure correction}\index{error-correcting code!maximum distance separable}\index{error-correcting code!parity-check code}\index{MDS} is the simple parity-check code $P_{3}$\index{parity-check code} whose parity-check matrix is $\bH = [{\tt 1\, 1\, 1}]$. This code has 4 codewords, all of which have even parity. All codewords are separated by a distance of 2. Any single erased bit can be restored by setting it to the parity of the other two bits. The repetition codes are also maximum distance separable codes. \exercissxB{5}{ex.qeccodeperfect}{ Can you make an $(N,K)$ code, with $M=N-K$ parity symbols, for a $q$-ary erasure channel, such that the decoder can recover the codeword when {\em{any}\/} $M$ symbols are erased in a block of $N$? [Example: for % There do exist some such codes: for example, for the channel with $q=4$ symbols there is an $(N,K) = (5,2)$ code which can correct any $M=3$ erasures.] % ; and for $q=8$ there is a $(9,2)$ code.] } For the $q$-ary erasure channel with $q>2$, there are large numbers of MDS codes, of which the Reed--Solomon codes are the most famous and most widely used. 
As long as the field size $q$ is bigger than the blocklength $N$, MDS block codes of any rate can be found. (For further reading, see \citeasnoun{lincostello83}.) % according to my notes. % 4-ary erasure channel. % Include tournament example. GF4, 16 individuals. can tolerate 3 erasures. % Reed--Solomon codes. \section{Summary} Shannon's codes for the binary symmetric channel can almost always correct $\fN$ errors, but they are not $\fN$-error-correcting codes. %\noindent \subsection*{Reasons why the distance of a code has little relevance} \ben \item The Shannon limit shows that the best codes must be able to cope with a noise level twice as big as the maximum noise level for a bounded-distance decoder. \item When the binary symmetric channel has $f>1/4$, no code with a bounded-distance decoder can communicate at all; but Shannon says good codes exist for such channels. \item Concatenation shows that we can get good performance even if the distance is bad.\index{concatenation}\index{distance!of code} \een % % Furthermore, `distance isn't everything' -- you can actually % get to the Shannon limit with a code whose distance is `bad'. % % Exercise - prove that if a sequence of codes is very bad then it can't % have arbitrarily small error probability. The whole weight enumerator function is relevant to the question of whether a code is a good code. The relationship between good codes and distance properties is discussed further in \exerciseref{ex.prob.error.match}. % ex.equal.threshold}. %\section*{Further reading} % For a paper with codes having the property % distance, but for practical purposes a code with blocklength $N=10\,000$ % can have codewords of weight $d=32$ and the error probability % can remain negligibly small even when the channel % is creating errors of weight 320. 
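 To put a number on the second of the reasons listed above, here is a small
 worked calculation. It is only an illustration, and it assumes nothing beyond the
 standard expression $C = 1 - H_2(f)$ for the capacity of the binary symmetric channel.
 At the threshold $f = 1/4$, beyond which a bounded-distance decoder cannot communicate at all,
 the capacity is
\beq
 C = 1 - H_2(1/4)
 = 1 - \left[ \frac{1}{4} \log_2 4 + \frac{3}{4} \log_2 \frac{4}{3} \right]
 \simeq 1 - 0.811 = 0.189 \:\: \mbox{bits} ,
\eeq
 so Shannon's theorem still promises good codes at any rate below about $0.19$ at that noise level.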
% {mackaymitchisonmcfadden2003} \section{Further exercises} % also known as {ex.equal.threshold} \exercissxC{3}{ex.prob.error.match}{ A codeword $\bt$ is selected from a linear $(N,K)$ code $\C$, and it is transmitted over a noisy channel; the received signal is $\by$. We assume that the channel is a memoryless channel such as a Gaussian channel. Given an assumed channel model $P(\by \given \bt)$, there are two decoding problems. \begin{description} \item[The codeword decoding problem] is the task of\index{decoder!codeword} inferring which codeword $\bt$ was transmitted given the received signal. \item[The bitwise decoding problem] is the task of inferring\index{decoder!bitwise} for each transmitted bit $t_n$ how likely it is that that bit was a one rather than a zero. \end{description} Consider optimal decoders for these two decoding problems. % % these will be presented again in % section \ref{sec.decoding.problems} % exact.tex % Prove that the probability of error of the optimal bitwise-decoder is closely related to the probability of error of the optimal codeword-decoder, by proving the following theorem.\index{decoder!probability of error} \begin{ctheorem} If a binary linear code\index{error probability!and distance}\index{error-correcting code!error probability} % \index{distance!of code, and error probability} has minimum distance $d_{\min}$, then, for any given channel, the codeword bit error probability of the optimal bitwise decoder, $p_{\rm b}$, and the block error probability of the maximum likelihood decoder, $p_{\rm B}$, are related by: \beq p_{\rm B} \geq p_{\rm b} \geq \frac{1}{2} \frac{d_{\min}}{N} p_{\rm B} . \label{eq.thmpBpb} \eeq % [I am sure this theorem is well-known; I am not claiming it is original.] \end{ctheorem} } \exercisaxA{1}{ex.HammingD}{ What are the minimum distances of the $(15,11)$ Hamming code and the $(31,26)$ Hamming code? 
} \exercisaxB{2}{ex.estimate.wef}{ Let $A(w)$ be the average weight enumerator function of a rate-$1/3$ random linear code with $N=540$ and $M=360$. Estimate, from first principles, the value of $A(w)$ at $w=1$. } \exercisaxC{3C}{ex.handshakecode}{ {\sf A code with minimum distance\index{Gilbert--Varshamov distance}\index{distance!Gilbert--Varshamov} greater than $d_{\rm GV}$.} % Another way to make a code is to define a generator matrix % or parity-check matrix. A rather nice $(15,5)$ code is generated by this generator matrix, which is based on measuring the parities of all the ${{5}\choose{3}} = 10$ triplets of source bits: \beq \bG = \left[ \begin{array}{*{15}{c}} 1&\tinyo&\tinyo&\tinyo&\tinyo&\tinyo&1&1&1&\tinyo&\tinyo&1&1&\tinyo&1 \\ \tinyo&1&\tinyo&\tinyo&\tinyo&\tinyo&\tinyo&1&1&1&1&\tinyo&1&1&\tinyo \\ \tinyo&\tinyo&1&\tinyo&\tinyo&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo&1&1\\ \tinyo&\tinyo&\tinyo&1&\tinyo&1&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo&1\\ \tinyo&\tinyo&\tinyo&\tinyo&1&1&1&1&\tinyo&\tinyo&1&1&\tinyo&1&\tinyo \end{array} \right] . \eeq Find the minimum distance and weight enumerator function of this code. } \exercisaxC{3C}{ex.findAwmonodec}{ % {\sf A code with minimum distance\index{Gilbert--Varshamov distance}\index{distance!Gilbert--Varshamov} % slightly greater than $d_{\rm GV}$.} Find the minimum distance of the `{pentagonful}\index{pentagonful code}'\index{error-correcting code!pentagonful}% \amarginfig{t}{ \begin{center} \buckypsfigw{pentagon.eps} \end{center} \caption[a]{The graph of the pentagonful low-density parity-check code with 15 bit nodes (circles) and 10 parity-check nodes (triangles). [This graph is known as the \ind{Petersen graph}.] 
} } low-density parity-check code whose parity-check matrix is \beq \bH = \left[ \begin{array}{*{5}{c}|*{5}{c}|*{5}{c}} 1 & \tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ \tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo \\ \hline \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & 1 \\ \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & 1 & 1 & \tinyo & \tinyo & \tinyo \\ \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & 1 & 1 & \tinyo & \tinyo \\ \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & 1 & 1 & \tinyo \\ \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & \tinyo & \tinyo & \tinyo & \tinyo & \tinyo & 1 & 1 \end{array} \right] . \label{eq.monodec} \eeq Show that nine of the ten rows are independent, so the code has parameters $N=15$, $K=6$. Using a computer, find its weight enumerator function. % Find its weight enumerator function. } \exercisxB{3C}{ex.concateex}{ Replicate the calculations used to produce \figref{fig.concath.rate}. Check the assertion that the highest noise level that's correctable is 0.0588. Explore alternative concatenated sequences of codes. Can you find a better sequence of concatenated codes -- better in the sense that it has either higher asymptotic rate $R$ or can tolerate a higher noise level $\q$? 
}
\exercissxA{3}{ex.syndromecount}{
 Investigate the possibility of achieving the Shannon limit with linear block codes, using the following \ind{counting argument}.
 Assume a linear code of large blocklength $N$ and rate $R=K/N$.
 The code's parity-check matrix $\bH$ has $M = N - K$ rows.
 Assume that the code's optimal decoder, which solves the syndrome decoding problem $\bH \bn = \bz$, allows reliable communication over a binary symmetric channel with flip probability $f$.

 How many `typical' noise vectors $\bn$ are there?

 Roughly how many distinct syndromes $\bz$ are there?

 Since $\bn$ is reliably deduced from $\bz$ by the optimal decoder, the number of syndromes must be greater than or equal to the number of typical noise vectors.
 What does this tell you about the largest possible value of rate $R$ for a given $f$?
}
\exercisxB{2}{ex.zchanneldeficit}{
 Linear binary codes use the input symbols {\tt{0}} and {\tt{1}} with equal probability, implicitly treating the channel as a symmetric channel.
 Investigate how much loss in communication rate is caused by this assumption, if in fact the channel is a highly asymmetric channel.
 Take as an example a Z-channel.
 How much smaller is the maximum possible rate of communication using symmetric inputs than the capacity of the channel?
 [Answer: about 6\%.]
}
\exercisxC{2}{ex.baddistbad}{
 Show that codes with `very bad' distance are `bad' codes, as defined in \secref{sec.bad.code.def} (\pref{sec.bad.code.def}).
%
% Show that there exist codes with `bad' distance
% that are `very good' codes.
%
% this bit already done in {ex.distancenotE}{
}
\exercisxC{3}{ex.puncture}{
 One linear code can be obtained from another by {\dem{\ind{puncturing}}}.
 Puncturing means taking each codeword and deleting a defined set of bits.
 Puncturing turns an $(N,K)$ code into an $(N',K)$ code, where $N' < N$.
}
\soln{ex.qeccodeperfect}{
 For the $q$-ary erasure channel with $q>2$, some MDS codes can be found.

 As a simple example, here is a $(9,2)$ code for the $8$-ary erasure channel.
The code is defined in terms of the\index{Galois field} % \index{finite field} multiplication and addition rules of $GF(8)$, which are given in \appendixref{sec.gf8}. The elements of the input alphabet are $\{0,1,A,B,C,D,E,F\}$ and the generator matrix of the code is \beq \bG = \left[ \begin{array}{*{9}{c}} 1 &0 &1 &A &B &C &D &E &F \\ 0 &1 &1 &1 &1 &1 &1 &1 &1 \\ \end{array} \right] . \eeq The resulting 64 codewords are:\smallskip {\footnotesize\tt \begin{narrow}{0in}{-\margindistancefudge}% \begin{realcenter} \begin{tabular}{*{8}{c}} 000000000 & 011111111 & 0AAAAAAAA & 0BBBBBBBB & 0CCCCCCCC & 0DDDDDDDD & 0EEEEEEEE & 0FFFFFFFF \\ 101ABCDEF & 110BADCFE & 1AB01EFCD & 1BA10FEDC & 1CDEF01AB & 1DCFE10BA & 1EFCDAB01 & 1FEDCBA10 \\ A0ACEB1FD & A1BDFA0EC & AA0EC1BDF & AB1FD0ACE & ACE0AFDB1 & ADF1BECA0 & AECA0DF1B & AFDB1CE0A \\ B0BEDFC1A & B1AFCED0B & BA1CFDEB0 & BB0DECFA1 & BCFA1B0DE & BDEB0A1CF & BED0B1AFC & BFC1A0BED \\ C0CBFEAD1 & C1DAEFBC0 & CAE1DC0FB & CBF0CD1EA & CC0FBAE1D & CD1EABF0C & CEAD10CBF & CFBC01DAE \\ D0D1CAFBE & D1C0DBEAF & DAFBE0D1C & DBEAF1C0D & DC1D0EBFA & DD0C1FAEB & DEBFAC1D0 & DFAEBD0C1 \\ E0EF1DBAC & E1FE0CABD & EACDBF10E & EBDCAE01F & ECABD1FE0 & EDBAC0EF1 & EE01FBDCA & EF10EACDB \\ F0FDA1ECB & F1ECB0FDA & FADF0BCE1 & FBCE1ADF0 & FCB1EDA0F & FDA0FCB1E & FE1BCF0AD & FF0ADE1BC \\ \end{tabular} \end{realcenter} \end{narrow} } } % from exercise section in _linear.tex % % this was in _sexact % % ex.prob.error.match \soln{ex.prob.error.match}{% ex.equal.threshold}{ {\sf Quick, rough proof of the theorem.} Let $\bx$ denote the difference between the reconstructed codeword and the transmitted codeword. For any given channel output $\br$, there is a posterior distribution over $\bx$. This posterior distribution is positive only on vectors $\bx$ belonging to the code; the sums that follow are over codewords $\bx$. The block error probability is: \beq p_{\rm B} = \sum_{\bx \neq 0} P(\bx \given \br) . 
\label{eq.pBdef}
\eeq
 The average bit error probability, averaging over all bits in the codeword, is:
\beq
 p_{\rm b} = \sum_{\bx \neq 0} P(\bx \given \br) \frac{w(\bx)}{N} ,
\label{eq.pbdef}
\eeq
 where $w(\bx)$ is the weight of codeword $\bx$.
 Now the weights of the non-zero codewords satisfy
\beq
 1 \geq \frac{w(\bx)}{N} \geq \frac{d_{\min}}{N} .
\label{eq.ineq}
\eeq
 Substituting the inequalities (\ref{eq.ineq}) into the definitions (\ref{eq.pBdef},$\,$\ref{eq.pbdef}), we obtain:
%
\beq
 p_{\rm B} \geq p_{\rm b} \geq
% \frac{1}{2}
 \frac{d_{\min}}{N} p_{\rm B} ,
\label{eq.thmpBpbA}
\eeq
 which is a factor of two stronger, on the right, than the stated result (\ref{eq.thmpBpb}).
 In making the proof watertight, I have weakened the result a little.\medskip
% So the bit and block {\em thresholds\/} of a code with good distance
% are identical.
%\section

\noindent
{\sf Careful proof.}
 The theorem relates the performance of the optimal block decoding algorithm and the optimal bitwise decoding algorithm.

 We introduce another pair of decoding algorithms, called the block-guessing decoder and the bit-guessing decoder.\index{guessing decoder}
 The idea is that these two algorithms are similar to the optimal block decoder and the optimal bitwise decoder, but lend themselves more easily to analysis.

 We now define these decoders.
 Let $\bx$ denote the inferred codeword.
 For any given code:
\begin{description}
\item[The optimal block decoder] returns the codeword $\bx$ that maximizes the posterior probability $P(\bx \given \br)$, which is proportional to the likelihood $P( \br \given \bx)$.

 The probability of error of this decoder is called $\PB$.
\item[The optimal bit decoder] returns for each of the $N$ bits, $x_n$, the value of $a$ that maximizes the posterior probability $P( x_n \eq a \given \br ) = \sum_{\bx} P(\bx \given \br) \,\truth\! [ x_n\eq a ]$.

 The probability of error of this decoder is called $\Pb$.
\item[The block-guessing decoder] returns a random codeword $\bx$ with probability distribution given by the posterior probability $P(\bx \given \br)$.

 The probability of error of this decoder is called $\PGB$.
\item[The bit-guessing decoder] returns for each of the $N$ bits, $x_n$, a random bit from the probability distribution $P( x_n \eq a \given \br )$.

 The probability of error of this decoder is called $\PGb$.
\end{description}
 The theorem states that the optimal bit error probability $\Pb$ is bounded above by $\PB$ and below by a given multiple of $\PB$ (\ref{eq.thmpBpb}).
%
%\beq
% P_B \geq P_b \geq \frac{1}{2} \frac{d_{\min}}{N} P_B .
%\label{eq.thmpBpb.again}
%\eeq

 The left-hand inequality in (\ref{eq.thmpBpb}) is trivially true -- if a block is correct, all its constituent bits are correct; so if the optimal block decoder outperformed the optimal bit decoder, we could make a better bit decoder from the block decoder.

 We prove the right-hand inequality by establishing that:
% the following two lemmas:
\ben
\item
 the bit-guessing decoder is nearly as good as the optimal bit decoder:
\beq
 \PGb \leq 2 \Pb .
\label{eq.guess}
\eeq
\item
 the bit-guessing decoder's error probability is related to the block-guessing decoder's by
\beq
 \PGb \geq \frac{d_{\min}}{N} \PGB .
\eeq
\een
 Then since $\PGB \geq \PB$, we have
\beq
 \Pb \geq \frac{1}{2} \PGb \geq \frac{1}{2} \frac{d_{\min}}{N} \PGB \geq \frac{1}{2} \frac{d_{\min}}{N} \PB .
\eeq
 We now prove the two lemmas.\medskip

\noindent
%\subsection
{\sf Near-optimality of guessing:}
 Consider first the case of a single bit, with posterior probability $\{ p_0, p_1 \}$.
% Without loss of generality, let $p_0 \geq p_1$.
 The optimal bit decoder
% picks $\argmax_a p_a$,
% \ie, 0,
% and
 has probability of error
\beq
% \Pb
 P^{\rm{optimal}} = \min (p_0,p_1).
\eeq
% $p_1$.
 The guessing decoder picks 0 or 1 at random, with probabilities $p_0$ and $p_1$.
 The truth is also distributed with these probabilities.
 The probability that the guesser and the truth match is $p_0^2 + p_1^2$; the probability that they mismatch is the guessing error probability,
\beq
% \PGb
 P^{\rm guess} = 2 p_0 p_1 \leq 2 \min (p_0,p_1) = 2 P^{\rm{optimal}} .
\eeq
 Since $\PGb$ is the average of many such error probabilities, $P^{\rm guess}$, and $\Pb$ is the average of the corresponding optimal error probabilities, $P^{\rm{optimal}}$, we obtain the desired relationship (\ref{eq.guess}) between $\PGb$ and $\Pb$.\ENDproof
% \medskip
%\subsection

\noindent
{\sf Relationship between bit error probability and block error probability:}
 The bit-guessing and block-guessing decoders can be combined in a single system:
% The posterior probability of a bit $x_n$ and a block $\bx$
% is given by
%\beq
% P( x_n = a , \bx \given \br ) =
% P( \bx \given \br ) P( x_n = a \given \bx, \br ) =
%\eeq
% So w
 we can draw a sample $x_n$ from the marginal distribution $P(x_n \given \br)$ by drawing a sample $( x_n , \bx )$ from the joint distribution $P( x_n , \bx \given \br )$, then discarding the value of $\bx$.

 We can distinguish between two cases: the discarded value of $\bx$ is the correct codeword, or not.
 The probability of bit error for the bit-guessing decoder can then be written as a sum of two terms:
\beqa
 \PGb &\eq & P(\mbox{$\bx$ correct}) P(\mbox{bit error} \given \mbox{$\bx$ correct}) \nonumber \\
 & & + \, P(\mbox{$\bx$ incorrect}) P(\mbox{bit error} \given \mbox{$\bx$ incorrect}) \\
 &=&
% P(\mbox{$\bx$ correct}) \times
 0 + \PGB P(\mbox{bit error} \given \mbox{$\bx$ incorrect}) .
\eeqa
% The first of these terms is zero.
 Now, whenever the guessed $\bx$ is incorrect, the true $\bx$ must differ from it in at least $d_{\min}$ bits, so the probability of bit error in these cases is at least $d_{\min}/N$.
 So
\[%beq
 \PGb \geq \frac{d_{\min}}{N} \PGB .
% \eepf
\]%eeq
QED.\hfill $\epfsymbol$
}
\soln{ex.syndromecount}{
 The number of `typical' noise vectors $\bn$ is roughly $2^{NH_2(f)}$.
% , where $H=H_2(f)$.
The number of distinct syndromes $\bz$ is $2^M$. So reliable communication implies \beq M \geq NH_2(f) , \eeq or, in terms of the rate $R = 1-M/N$, \beq R \leq 1 - H_2(f) , \eeq a bound which agrees precisely with the capacity of the channel. This argument is turned into a proof in the following chapter. } % BORDERLINE \soln{ex.hat.puzzle}{ % Mathematicians credit the problem to Dr. Todd Ebert, a computer % science instructor at the University of California at Irvine, who % introduced it in his Ph.D. thesis at the University of California at % Santa Barbara in 1998. In the three-player case, it is possible for the group to win three-quarters of the time. Three-quarters of the time, two of the players will have hats of the same colour and the third player's hat will be the opposite colour. The group can win every time this happens by using the following strategy. Each player looks at the other two players' hats. If the two hats are {\em different\/} colours, he passes. If they are the {\em same\/} colour, the player guesses his own hat is the {\em opposite\/} colour. This way, every time the hat colours are distributed two and one, one player will guess correctly and the others will pass, and the group will win the game. When all the hats are the same colour, however, {\em all three\/} players will guess incorrectly and the group will lose. When any particular player guesses a colour, it is true that there is only a 50:50 chance that their guess is right. The reason that the group wins 75\% of the time is that their strategy ensures that when players are guessing wrong, a great many are guessing wrong. For larger numbers of players, the aim is to ensure that most of the time no one is wrong and occasionally everyone is wrong at once. In the game with 7 players, there is a strategy for which the group wins 7 out of every 8 times they play. In the game with 15 players, the group can win 15 out of 16 times. 
If you have not figured out these winning strategies for teams of 7 and 15, I recommend thinking about the solution to the three-player game in terms of the locations of the winning and losing states on the three-dimensional hypercube, then thinking laterally. \begincuttable If the number of players, $N$, is $2^r-1$, the optimal strategy can be defined using a Hamming code of length $N$, and the probability of winning the prize is $\linefrac{N}{(N+1)}$. Each player is identified with a number $n \in 1\ldots N$. The two colours are mapped onto {\tt{0}} and {\tt{1}}. Any state of their hats can be viewed as a received vector out of a binary channel. A random binary vector of length $N$ is either a codeword of the Hamming code, with probability $1/(N+1)$, or it differs in exactly one bit from a codeword. % There is a probability Each player looks at all the other bits and considers whether his bit can be set to a colour such that the state is a codeword (which can be deduced using the decoder of the Hamming code). If it can, then the player guesses that his hat is the {\em other\/} colour. If the state is actually a codeword, all players will guess and will guess wrong. If the state is a non-codeword, only one player will guess, and his guess will be correct. It's quite easy to train seven players to follow the optimal strategy if the cyclic representation of the $(7,4)$ Hamming code is used (\pref{sec.h74cyclic}). % I am not sure of the optimal solution for the `Scottish version' % of the rules in which the prize is only awarded to the group % if they {\em all\/} guess correctly. % As a starting point, if one flips the guesses of the winning strategy % for the original game, the group % will win whenever it is in a codeword state, which % happens with probability $1/(N+1)$. The question is % what to do with the `passes'. %% since passing is never in one's interests. % Can the group do better than replacing passes with random guessing? 
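
 As a quick check of the stated winning probability, here is the counting for the two smallest cases.
 It is a sketch that assumes the hats are assigned uniformly at random, and it uses the standard fact that a Hamming code of length $N = 2^r - 1$ has $2^{N-r}$ codewords.
 Of the $2^N$ equiprobable hat states, the group loses only on the codewords, so
\beq
 P(\mbox{win}) = 1 - \frac{2^{N-r}}{2^{N}} = 1 - 2^{-r} = \frac{N}{N+1} .
\eeq
 For $N=3$ ($r=2$) the code is the repetition code $\{ {\tt{000}}, {\tt{111}} \}$, so the group wins in $6$ of the $8$ states, a fraction $3/4$.
 For $N=7$ ($r=3$) the $(7,4)$ Hamming code has $16$ codewords among $128$ states, so the group wins in $112$ of them, a fraction $7/8$.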
} % \soln{ex.selforthog}{ % removed to cutsolutions.tex % end from _linear.tex \dvips %\section{Solutions to Chapter \protect\ref{ch.linearecc}'s exercises} % %\section{Solutions to Chapter \protect\ref{ch.linearecc}'s exercises} % \dvipsb{solutions linear} \dvips \prechapter{About Chapter} In this chapter we will draw together several ideas that we've encountered so far in one nice short proof. We will simultaneously prove both Shannon's noisy-channel coding theorem (for symmetric binary channels) and his source coding theorem (for binary sources). While this proof has connections to many preceding chapters in the book, it's not essential to have read them all. On the noisy-channel coding side, our proof will be more constructive than the proof given in \chref{ch.six}; there, we proved that almost any random code is `very good'. Here we will show that almost any {\em linear\/} code is very good. We will make use of the idea of typical sets (Chapters \ref{ch.two} and \ref{ch.six}), and we'll borrow from the previous chapter's calculation of the weight enumerator function of random linear codes (\secref{sec.wef.random}). On the source coding side, our proof will show that {\em random linear \ind{hash function}s} can be used for compression of compressible binary sources, thus giving a link to \chref{ch.hash}. \ENDprechapter \chapter{Very Good Linear Codes Exist} \label{ch.lineartypical} % % very good linear codes exist % In this chapter\index{linear block code} we'll use a single calculation to prove simultaneously the \ind{source coding theorem} and the\index{noisy-channel coding theorem} noisy-channel coding theorem for the \ind{binary symmetric channel}.\index{channel!binary symmetric}\index{noisy-channel coding theorem!linear codes}\index{linear block code!coding theorem}\index{error-correcting code!linear!coding theorem} {Incidentally, this proof works for much more general channel models, not only the binary symmetric channel. 
For example, the proof can be reworked for channels with non-binary outputs, for time-varying channels and for channels with memory, as long as they have binary inputs satisfying a symmetry property, \cf\ \secref{sec.Symmetricchannels}.} % \label{ch.linear.good} \section{A simultaneous proof of the source coding and noisy-channel coding theorems} We consider a linear error-correcting code with binary \ind{parity-check matrix} $\bH$. The matrix has $M$ rows and $N$ columns. Later in the proof we will increase $N$ and $M$, keeping $M \propto N$. The rate of the code satisfies \beq R \geq 1 - \frac{M}{N}. \eeq If all the rows of $\bH$ are independent then this is an equality, $R = 1 -M/N$. In what follows,\index{error-correcting code!rate}\index{error-correcting code!linear} we'll assume the equality holds. Eager readers may work out the expected rank of a random binary matrix $\bH$ (it's very close to $M$) and pursue the effect that the difference ($M - \mbox{rank}$) has % small number of linear dependences have on the rest of this proof (it's negligible). A codeword $\bt$ is selected, satisfying \beq \bH \bt = {\bf 0} \mod 2 , \eeq and a binary symmetric channel adds noise $\bx$, giving the received signal\marginpar{\small\raggedright\reducedlead{In this chapter $\bx$ denotes the noise added by the channel, not the input to the channel.}} \beq \br = \bt + \bx \mod 2. \eeq The receiver aims to infer both $\bt$ and $\bx$ from $\br$ using a \index{syndrome decoding}{syndrome-decoding} approach. Syndrome decoding was first introduced in \secref{sec.syndromedecoding} (\pref{sec.syndromedecoding} and \pageref{sec.syndromedecoding2}). % and \secref{sec.syndromedecoding2}. The receiver computes the syndrome \beq \bz = \bH \br \mod 2 = \bH \bt + \bH \bx \mod 2 = \bH \bx \mod 2 . \eeq % Since $\bH \bt = {\bf 0}$, t The syndrome only depends on the noise $\bx$, and the decoding problem is to find the most probable $\bx$ that satisfies \beq \bH \bx = \bz \mod 2. 
\eeq This best estimate for the noise vector, $\hat{\bx}$, is then subtracted from $\br$ to give the best guess for $\bt$. Our aim is to show that, as long as $R < 1-H(X) = 1-H_2(f)$, where $f$ is the flip probability of the binary symmetric channel, the optimal decoder for this syndrome-decoding problem has vanishing probability of error, as $N$ increases, for random $\bH$. % and averaging over all binary matrices $\bH$. We prove this result by studying a sub-optimal strategy for solving the decoding problem. Neither the optimal decoder nor this {\em \ind{typical-set decoder}\/} would be easy to implement, but the typical-set decoder is easier to \analyze. The typical-set decoder examines the typical set $T$ of noise vectors, the set of noise vectors $\bx'$ that satisfy $\log \dfrac{1}{P(\bx')} \simeq NH(X)$,\marginpar{\small\raggedright\reducedlead{We'll leave out the $\epsilon$s and $\beta$s that make a typical-set definition rigorous. Enthusiasts are encouraged to revisit \secref{sec.ts} and put these details into this proof.}} checking to see if any of those typical vectors $\bx'$ satisfies the observed syndrome, \beq \bH \bx' = \bz . \eeq If exactly one typical vector $\bx'$ does so, the typical set decoder reports that vector as the hypothesized noise vector. If no typical vector matches the observed syndrome, or more than one does, then the typical set decoder reports an error. The probability of error of the typical-set decoder, for a given matrix $\bH$, can be written as a sum of two terms, \beq P_{{\rm TS}|\bH} = P^{(I)} + P^{(II)}_{{\rm TS}|\bH} , \eeq where $P^{(I)}$ is the probability that the true noise vector $\bx$ is itself not typical, and $P^{(II)}_{{\rm TS}|\bH}$ is the probability that the true $\bx$ is typical and at least one other typical vector clashes with it. The first probability vanishes as $N$ increases, as we proved when we first studied typical sets (\chref{ch.two}). We concentrate on the second probability. 
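
 To get a feel for the sizes involved, here is an illustrative calculation.
 The values $f = 0.1$ and $N = 1000$ are assumptions chosen purely for illustration; they are not part of the proof.
 The entropy of the noise per bit is
\beq
 H(X) = H_2(0.1) = 0.1 \log_2 \frac{1}{0.1} + 0.9 \log_2 \frac{1}{0.9}
 \simeq 0.332 + 0.137 = 0.469 \:\: \mbox{bits} ,
\eeq
 so the typical set $T$ contains roughly $2^{NH(X)} \simeq 2^{469}$ noise vectors, each with about $fN = 100$ flipped bits.
 This is a vanishingly small fraction, about $2^{-531}$, of all $2^{1000}$ binary vectors of length $N$.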
% , the probability of a type-II error. To recap, we're imagining a true noise vector, $\bx$; and if {\em any\/} of the typical noise vectors $\bx'$, different from $\bx$, satisfies $\bH (\bx' - \bx) = 0$, then we have an error. We use the truth function \beq \truth \! \left[ \bH (\bx' - \bx) = 0 \right], \eeq whose value is one if the statement $\bH (\bx' - \bx) = 0$ is true and zero otherwise. We can bound the number of type II errors made when the noise is $\bx$ thus: \newcommand{\xprimecondition}{\raisebox{-4pt}{\footnotesize\ensuremath{\bx'}:} \raisebox{-3pt}[0.025in][0.0in]{% prevent it from hanging down and pushing other stuff down \makebox[0.2in][l]{\tiny$\!\begin{array}{l} {\tiny\bx' \!\in T}\\ {\tiny\bx' \! \neq \bx} \end{array}$}}} \beq \left[\mbox{Number of errors given $\bx$ and $\bH$}\right] \: \leq \: \sum_{\xprimecondition} \truth\! \left[ \bH (\bx' - \bx) = 0 \right] . \label{eq.lt.union} \eeq The number of errors is either zero or one; the sum on the right-hand side may exceed one,\marginpar{\small\raggedright\reducedlead{\Eqref{eq.lt.union} is a \ind{union bound}.}}\index{bound!union} in cases where several typical noise vectors have the same syndrome. We can now write down the probability of a type-II error by averaging over $\bx$: \beq P^{(II)}_{{\rm TS}|\bH} \: \leq \: \sum_{\bx \in T} P(\bx) \sum_{\xprimecondition} \truth\! \left[ \bH (\bx' - \bx) = 0 \right] . \eeq Now, we will find the average of this probability of type-II error over all linear codes by averaging over $\bH$. By showing that the {\em average\/} probability of type-II error vanishes, we will thus show that there exist linear codes with vanishing error probability, indeed, that almost all linear codes are very good. We denote averaging over all binary matrices $\bH$ by $\left< \ldots \right>_{\bH}$. 
 The average probability of type-II error is
\beqan
 \bar{P}^{(II)}_{{\rm TS}} & =& \sum_{\bH} P(\bH) P^{(II)}_{{\rm TS}|\bH} \: = \: \left< P^{(II)}_{{\rm TS}|\bH} \right>_{\bH} \\
 &\leq& \left< \sum_{\bx \in T} P(\bx) \sum_{\xprimecondition} \truth\! \left[ \bH (\bx' - \bx) = 0 \right] \right>_{\!\bH} \\
 &=& \sum_{\bx \in T} P(\bx) \sum_{\xprimecondition} \left< \truth\! \left[ \bH (\bx' - \bx) = 0 \right] \right>_{\bH} .
\eeqan
 Now, the quantity $\left< \truth\! \left[ \bH (\bx' - \bx) = 0 \right] \right>_{\bH}$ already cropped up when we were calculating the expected weight enumerator function of random linear codes (\secref{sec.wef.random}): for any non-zero binary vector $\bv$, the probability that $\bH \bv =0$, averaging over all matrices $\bH$, is $2^{-M}$.
 So
\beqan
 \bar{P}^{(II)}_{{\rm TS}} & \leq & \left( \sum_{\bx \in T} P(\bx) \right) \left( |T| - 1 \right) 2^{-M}\\
 & \leq & |T| \: 2^{-M} ,
\eeqan
 where $|T|$ denotes the size of the typical set.
 As you will recall from \chref{ch.two}, there are roughly $2^{NH(X)}$ noise vectors in the typical set.
 So
\beqan
 \bar{P}^{(II)}_{{\rm TS}} & \leq & 2^{NH(X)} 2^{-M} .
\eeqan
 This bound on the probability of error either vanishes or grows exponentially as $N$ increases (remembering that
% , as we are fixing the code rate
% $R = 1-M/N$,
 we are keeping $M$ proportional to $N$ as $N$ increases).
 It vanishes if
\beq
 H(X) < M/N .
\eeq
% this clause is cuttable
% CUT ME?
% and grows if
%\beq
% NH(X) > M .
%\eeq
% end CUT ME
 Substituting $R=1-M/N$, we have thus established the
% positive half of Shannon's
 noisy-channel coding theorem for the binary symmetric channel: very good linear codes exist
%as long as
%$H(X) < M/N$, \ie, as long as
 for any rate $R$ satisfying
\beq
 R < 1-H(X) ,
\eeq
 where $H(X)$ is the entropy of the channel noise, per bit.\ENDproof
\exercisxC{3}{ex.generalchannel}{
 Redo the proof for a more general channel.
} \section{Data compression by linear hash codes} The decoding game we have just played can also\index{random code!for compression} be viewed as an {\dem\ind{uncompression}\/} game.\index{hash code} The world produces a binary noise vector $\bx$ from a source $P(\bx)$. The noise has redundancy (if the flip probability is not 0.5). We compress it with a linear compressor that maps the $N$-bit input $\bx$ (the noise) to the $M$-bit output $\bz$ (the syndrome).\index{hash function!linear}\index{hash code} Our uncompression task is to recover the input $\bx$ from the output $\bz$. The rate of the compressor is \beq R_{\rm compressor} \equiv M/N . \eeq [We don't care about the possibility of linear redundancies in our definition of the rate, here.] The result that we just found, that the decoding problem can be solved, for almost any $\bH$, with vanishing error probability, as long as $H(X) < M/N$, thus instantly proves a \ind{source coding theorem}: \begin{quote} Given a binary source $X$ of entropy $H(X)$, and a required compressed rate $R > H(X)$, there exists a linear compressor $\bx \rightarrow \bz = \bH \bx \mod 2$ having rate $M/N$ equal to that required rate $R$, and an associated uncompressor, that is virtually lossless. \end{quote} % To put it another way, if you have a source of % entropy $H(X)$ and you encode a string of % $N$ bits from it using a \ind{hash code} (\chref{ch.hash}) % where the hash $\bz$ is of length $M$ bits, % where $M > N H(X)$, % a random linear hash function $\bz = \bH \bx \mod 2$ % is just as good (for collision avoidance) as a % fully random hash function. %% there are very unlikely to be any collisions among %% the hashes {This theorem is true not only for a source of independent identically distributed symbols but also for any source for which a typical set can be defined: sources with memory, and time-varying sources, for example; all that's required is that the source be ergodic. 
}

\subsection*{Notes}
 This method for proving that codes are good can be applied to other linear codes, such as low-density parity-check codes \cite{mncN,McElieceMacKay00}. For each code we need an approximation of its expected weight enumerator function.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%55
% \dvips
% \chapter{Further exercises on information theory}
\chapter{Further Exercises on Information Theory}
% this was two chapters once
\label{ch_fInfo}
% {noisy channels}
\label{ch_f8}
\fakesection{Further exercises on noisy channels}
% I've been asked to include some exercises {\em without\/} worked
% solutions. Here are a few. Numerical solutions to some of them
% are provided on page \pageref{sec.solf8}.
%
 The most exciting exercises, which will introduce you to further ideas in information theory, are towards the end of this chapter.
%\section{Exercises}
\subsection*{Refresher exercises on source coding and noisy channels}
\exercisaxB{2}{ex.X100}{
% from Yaser
 Let $X$ be an ensemble with $\A_X = \{0,1\}$ and $\P_X = \{ 0.995, 0.005\}$. Consider source coding using the block coding of $X^{100}$ where every $\bx \in X^{100}$ containing 3 or fewer 1s is assigned a distinct codeword, while the other $\bx$s are ignored.
\ben
\item If the assigned codewords are all of the same length, find the minimum length required to provide the above set with distinct codewords.
\item Calculate the probability of getting an $\bx$ that will be ignored.
\een
}
\exercisaxB{2}{ex.0001}{
 Let $X$ be an ensemble with $\P_X = \{ 0.1,0.2,0.3,0.4 \}$. The ensemble is encoded using the symbol code $\C = \{ 0001 , 001 , 01 , 1 \}$. Consider the codeword corresponding to $\bx \in X^N$, where $N$ is large.
\ben
\item Compute the entropy of the fourth bit of transmission.
\item Compute the conditional entropy of the fourth bit given the third bit.
\item Estimate the entropy of the hundredth bit.
\item Estimate the conditional entropy of the hundredth bit given the ninety-ninth bit.
% \item
\een
}
\exercisaxA{2}{ex.dicetree}{
 Two fair dice are rolled by Alice and the sum is recorded. Bob's task is to ask a sequence of questions with yes/no answers to find out this number. Devise in detail a strategy that achieves the minimum possible average number of questions.
}
% added Wed 22/1/03
\exercisxB{2}{ex.fairstraws}{
 How can you use a coin to \ind{draw straws} among 3 people?\index{straws, drawing}
}% my solution: arithmetic coding.
% perhaps use this in exam?
% - could also use exact sampling method! (see mcexact.tex)
\exercisxB{2}{ex.magicnumber}{
 In a {magic} trick,\index{puzzle!magic trick} there are three participants: the \ind{magician}, an assistant, and a volunteer. The assistant, who claims to have \ind{paranormal}\index{conjuror}\index{puzzle!magic trick} abilities, is in a soundproof room. The magician gives the volunteer six blank cards, five white and one blue. The volunteer writes a different integer from 1 to 100 on each \ind{card}, as the magician is watching. The volunteer keeps the blue card. The magician arranges the five white cards in some order and passes them to the assistant. The assistant then announces the number on the blue card. How does the trick work?
}
% card trick
\exercisxB{3}{ex.magicnumber2}{
 How does {\em this\/} trick work?
\begin{quote}
 `Here's an ordinary pack of cards, shuffled into random order. Please choose five cards from the pack, any that you wish. Don't let me see their faces. No, don't give them to me: pass them to my assistant Esmerelda. She can look at them.
 `Now, Esmerelda, show me four of the cards. Hmm$\ldots$ nine of spades, six of clubs, four of hearts, ten of diamonds. The hidden card, then, must be the queen of spades!'
\end{quote}
 The trick can be performed as described above\index{puzzle!magic trick} for a pack of 52 cards.
 Use information theory to give an upper bound on the number of cards for which the trick can be performed.
% (This exercise is much harder than \exerciseonlyref{ex.magicnumber}.)
% Hint: think of X = the 5 cardds, Y = the seque of 4 cards. how does H(X) compare with H(Y)?
% n choose 5 cf. n....(n-3) -> (n-4)/5! = 1 -> n=124.
}
% see l/iam for soln
\exercisxB{2}{ex.Hinfty}{
 Find a probability sequence $\bp = (p_1,p_2, \ldots)$ such that $H(\bp) = \infty$.
}
\exercisaxB{2}{ex.typical2488}{
 Consider a discrete memoryless source with $\A_X = \{a,b,c,d\}$ and $\P_X =$ $\{1/2,1/4,$ $1/8,1/8\}$. There are $4^8 = 65\,536$ eight-letter words that can be formed from the four letters. Find the total number of such words that are in the typical set $T_{N\beta}$ (equation \ref{eq.TNb}) where $N=8$ and $\beta = 0.1$.
%The definition of $T_{N\b}$, from
% chapter \chtwo, is:% equation \ref{eq.TNb}
%\beq
% T_{N\b} = \left\{ \bx\in\A_X^N :
% \left| \frac{1}{N} \log_2 \frac{1}{P(\bx)} - H \right| < \b
% \right\} .
%\eeq
}
% source coding and channels...........
\exercisxB{2}{ex.sourcechannel}{
 Consider the source $\A_S = \{ a,b,c,d,e\}$, $\P_S = \{ \dthird, \dthird, \dfrac{1}{9}, \dfrac{1}{9}, \dfrac{1}{9} \}$ and the channel whose transition probability matrix is
\beq
 Q = \left[ \begin{array}{cccc}
 1 & 0 & 0 & 0 \\
 0 & 0 & \dfrac{2}{3} & 0 \\
 0 & 1 & 0 & 1 \\
 0 & 0 & \dthird & 0 \\
% 1 & 0 & 0 & 0 \\
% 0 & 0 & 1 & 0 \\
% 0 & \dfrac{2}{3} & 0 & \dthird \\
% 0 & 0 & 1 & 0 \\
 \end{array}\right] .
\eeq
 Note that the source alphabet
% $\A_S = \{a,b,c,d,e\}$
 has five symbols, but the channel alphabet $\A_X = \A_Y = \{0,1,2,3\}$ has only four. Assume that the source produces symbols at exactly 3/4 the rate that the channel accepts channel symbols. For a given (tiny) $\epsilon>0$, explain how you would design a system for communicating the source's output over the channel with an
% overall
 average error probability per source symbol less than $\epsilon$. Be as explicit as possible.
 In particular, {\em do not\/} invoke Shannon's noisy-channel coding theorem.
}
% \subsection{Noisy Channels}
\exercisxB{2}{ex.C0000}{Consider a binary symmetric channel and a code $C = \{ 0000,0011,1100,1111 \}$; assume that the four codewords are used with probabilities $\{ 1/2, 1/8,1/8,1/4\}$. What is the decoding rule that minimizes the probability of decoding error? [The optimal decoding rule depends on the noise level $f$ of the binary symmetric channel. Give the decoding rule for each range of values of $f$, for $f$ between 0 and $1/2$.]
}
\exercisaxA{2}{ex.C3channel}{
 Find the capacity and \optens\ % optimizing input distribution
 for the three-input, three-output channel whose transition probabilities are:
\beq
 Q = \left[ \begin{array}{ccc}
 1 & 0 & 0 \\
 0 & \dfrac{2}{3} & \dthird \\
 0 & \dthird & \dfrac{2}{3}
 \end{array}\right] .
\eeq
}
%
% I am not sure I like this ex:
%
%\exercis{ex.Herrors}{
% Consider the $(7,4)$ Hamming code.
%\ben\item
% What is the probability of bit error if 3 channel errors occur
% in a single block?
%\item
% What is the probability of bit error if 4 channel errors occur
% in a single block?
%\een
%}
% \end{document}
% see also _e6.tex
%
% extra exercises do-able after chapter 6.
%
\fakesection{e6 exam qs}
\exercissxA{3}{ex.85channel}{
% Describe briefly the encoder for a $(7,4)$ Hamming code.
%
% Assuming that one codeword of this code is sent over a
% binary symmetric channel, define the {\em syndrome\/} $\bf z$
% of the received vector $\bf r$; state how many different possible syndromes
% there are; and state
% the maximum number of channel errors that the optimal decoder
%% code
% can correct.
%
% Define the {\em capacity\/} of a channel with input $x$ and output $y$
% and transition probability matrix $Q(y|x)$.
%
 The input to a channel $Q$ is a word of 8 bits. The output is also a word of 8 bits.
% A message block consisting of 8 bits is transmitted over a channel which
 Each time it is used, the channel flips {\em exactly one\/} of the transmitted bits, but the receiver does not know which one. The other seven bits are received without error. All 8 bits are equally likely to be the one that is flipped. Derive the capacity of this channel.
% Tough version:
%
% {\bf Either} show, by constructing an explicit encoder and decoder using a
% linear (8,5) code that it
% is possible to reliably communicate 5 bits per cycle
% over this channel, {\bf or} prove that no such linear (8,5) code exists.
%
% Wimps version:
% practical
 Show, by describing an {\em explicit\/} encoder
% {\em and\/}
 and decoder that it is possible {\em reliably\/} (that is, with {\em zero\/} error probability) to communicate 5 bits per cycle over this channel.
% Your description should be
% {\em should I give a hint here?}
% [Hint: a solution exists that involves a simple $(8,5)$ code.]
}
\exercisxB{2}{ex.rstu}{
 A channel with input $x \in \{ {\tt a},{\tt b},{\tt c} \}$ and output $y \in \{ {\tt r},{\tt s},{\tt t} ,{\tt u} \}$ has conditional probability matrix:
\[
 \bQ = \left[ \begin{array}{ccc}
 \dhalf & 0 & 0 \\
 \dhalf & \dhalf & 0 \\
 0 & \dhalf & \dhalf \\
 0 & 0 & \dhalf \\
 \end{array} \right] .
 \hspace{1in}
 \begin{array}{c}
 \setlength{\unitlength}{0.13mm}
 \begin{picture}(100,140)(0,-20)
 \put(18,0){\makebox(0,0)[r]{\tt c}}
 \put(18,40){\makebox(0,0)[r]{\tt b}}
 \put(18,80){\makebox(0,0)[r]{\tt a}}
%
 \multiput(20,0)(0,40){3}{\vector(2,1){36}}
 \multiput(20,0)(0,40){3}{\vector(2,-1){36}}
%
 \put(62,-20){\makebox(0,0)[l]{\tt u}}
 \put(62,20){\makebox(0,0)[l]{\tt t}}
 \put(62,60){\makebox(0,0)[l]{\tt s}}
 \put(62,100){\makebox(0,0)[l]{\tt r}}
 \end{picture}
 \end{array}
\]
 What is its capacity?
}
\exercisxB{3}{ex.isbn}{
 The ten-digit number on the cover of a book known as the\index{book ISBN} \ind{ISBN}\amargintab{t}{
\begin{center}
\begin{tabular}{l}
 0-521-64298-1 \\
 1-010-00000-4 \\
\end{tabular}
\end{center}
\caption[a]{Some valid ISBNs. [The hyphens are included for legibility.] }
}
 incorporates an error-detecting code. The number consists of nine source digits $x_1,x_2,\ldots,x_{9}$, satisfying $x_n \in \{ 0,1,\ldots,9 \}$, and a tenth check digit whose value is given by
\[
 x_{10} = \left( \sum_{n=1}^{9} n x_n \right) \mod 11 .
\]
 Here $x_{10} \in \{ 0,1,\ldots,9 , 10 \}.$ If $x_{10} = 10$ then the tenth digit is shown using the roman numeral X.
% $\tt X$.
% For example, 1-010-00000-4 is a valid ISBN.
% bishop
% 0-19-853864-2
% see lewis:con/isbn.p
 Show that a valid ISBN satisfies:
\[
 \left( \sum_{n=1}^{10} n x_n \right) \mod 11 = 0 .
\]
 Imagine that an ISBN is communicated over an unreliable human channel which sometimes {\em modifies\/} digits and sometimes {\em reorders\/} digits. Show that this code can be used to detect (but not correct) all errors in which any one of the ten digits is modified (for example, 1-010-00000-4 $\rightarrow$ 1-010-00080-4). Show that this code can be used to detect all errors in which any two adjacent digits are transposed (for example, 1-010-00000-4 $\rightarrow$ 1-100-00000-4). What other transpositions of pairs of {\em non-adjacent\/} digits can be detected?
% What types of error can be detected {\em and corrected?}
 If the tenth digit were defined to be
\[
 x_{10} = \left( \sum_{n=1}^{9} n x_n \right) \mod 10 ,
\]
 why would the code not work so well? (Discuss the detection of
% errors
% involving
 both modifications of single digits and transpositions of digits.)
}
\exercisaxA{3}{ex.two.bsc.choose}{
 A\marginpar{\[
 \setlength{\unitlength}{0.17mm}
 \begin{picture}(100,140)(0,-45)
 \put(15,-40){\makebox(0,0)[r]{d}}
 \put(15,0){\makebox(0,0)[r]{{c}}}
 \put(15,40){\makebox(0,0)[r]{b}}
 \put(15,80){\makebox(0,0)[r]{a}}
 \put(20,0){\vector(1,0){34}}
 \put(20,40){\vector(1,0){34}}
 \put(20,-40){\vector(1,0){34}}
 \put(20,80){\vector(1,0){34}}
 \put(20,40){\vector(1,1){34}}
% \put(20,40){\vector(1,-1){34}}
 \put(20,-40){\vector(1,1){34}}
 \put(20,0){\vector(1,-1){34}}
% \put(20,0){\vector(1,1){34}}
 \put(20,80){\vector(1,-1){34}}
%
 \put(65,-40){\makebox(0,0)[l]{d}}
 \put(65,0){\makebox(0,0)[l]{c}}
 \put(65,40){\makebox(0,0)[l]{b}}
 \put(65,80){\makebox(0,0)[l]{a}}
 \end{picture}
\]
}
 channel with input $x$ and output $y$ has transition probability matrix:
\[
 Q = \left[ \begin{array}{cccc}
 1-f & f & 0 & 0 \\
 f & 1-f & 0 & 0 \\
 0 & 0 & 1-g & g \\
 0 & 0 & g & 1-g
 \end{array} \right] .
\]
 Assuming an input distribution of the form
\[
 {\cal P}_X = \left\{ \frac{p}{2}, \frac{p}{2} , \frac{1-p}{2} , \frac{1-p}{2} \right\},
\]
 write down the entropy of the output, $H(Y)$, and the conditional entropy of the output given the input, $H(Y|X)$. Show that the optimal input distribution is given by
\[
% corrected!
 p = \frac{1}{1 + 2^{-H_2(g) + H_2(f) }} ,
\]
 where $H_2(f) = f \log_2 \frac{1}{f} + (1-f) \log_2 \frac{1}{(1-f)}$.
% CUTTABLE
% [You may find the identity
% $\frac{\d}{\d p} H_2(p) = \log_2 \frac{1-p}{p}$ helpful.]
\marginpar{\small\raggedright\reducedlead{Remember $\frac{\d}{\d p} H_2(p) = \log_2 \frac{1-p}{p}$.}}
 Write down the optimal input distribution and the capacity of the channel in the case $f=1/2$, $g=0$, and comment on your answer.
}
\exercisxB{2}{ex.detect.vs.correct}{
 What are the differences in the redundancies needed in an error-detecting code (which can reliably detect that a block of data has been corrupted) and an error-correcting code (which can detect and correct errors)?
}
% difficult exercises see _e7
% \input{tex/_fInfo.tex}
% included directly by thebook.tex after _f8.tex
\subsection{Further tales from information theory}
 The following exercises give you the chance to discover for yourself the answers to some more surprising results of information theory.
% \subsection{Further tales from information theory}
% \input{tex/_e7.tex}
% \noindent
\ExercisxC{3}{ex.corrinfo}{
% \item[Communication of correlated information.]
 {\sf Communication of information from correlated
% dependent <--- would be better, but I want to keep same name for exercise as in first edn.
 sources.}\index{channel!with dependent sources}
 Imagine that we want to communicate data from two data sources $X^{(A)}$ and $X^{(B)}$ to a central location C via noise-free one-way \index{communication!of dependent information}{communication} channels (\figref{fig.achievableXY}a). The signals $x^{(A)}$ and $x^{(B)}$ are strongly dependent, so their joint information content is only a little greater than the marginal information content of either of them. For example, C is a \ind{weather collator} who wishes to receive a string of reports saying whether it is raining in Allerton ($x^{(A)}$) and whether it is raining in Bognor ($x^{(B)}$). The joint probability of $x^{(A)}$ and $x^{(B)}$ might be
\beq
\fourfourtabler{{$P(x^{(A)},x^{(B)})$}}{$x^{(A)}$}{{\mathsstrut}0}{{\mathsstrut}1}{{\mathsstrut}$x^{(B)}$}{0.49}{0.01}{0.01}{0.49}
%\fourfourtable{\makebox[0.2in][r]{$P(x^{(A)},x^{(B)})$}}{$x^{(A)}$}{{\mathsstrut}0}{{\mathsstrut}1}{{\mathsstrut}$x^{(B)}$}{0.49}{0.01}{0.01}{0.49}
%\:\:
%\begin{array}{c|cc}
%x^{(A)} :x^{(B)} & 0 & 1 \\ \hline
%0 & 0.49 & 0.01 \\
%1 & 0.01 & 0.49 \\
%\end{array}
\eeq
 The weather collator would like to know $N$ successive values of $x^{(A)}$ and $x^{(B)}$ exactly, but, since he has to pay for every bit of information he receives, he is interested in the possibility of avoiding buying $N$ bits from source $A$ {\em and\/} $N$ bits from source $B$.
 Assuming that variables $x^{(A)}$ and $x^{(B)}$ are generated repeatedly from this distribution, can they be encoded at rates $R_A$ and $R_B$ in such a way that C can reconstruct all the variables, with the sum of information transmission rates on the two lines being less than two bits per cycle?
% For simplicity, assume that the
% one-way communication channels are noise-free binary channels.
% Encoding of correlated sources. Slepian Wolf (409)
\begin{figure}
\figuremargin{%
\begin{center}\small
\begin{tabular}{cc}
\raisebox{0.71in}{(a)\hspace{0.2in}{\input{tex/corrinfo.tex}}} &
\mbox{(b)\footnotesize
\setlength{\unitlength}{0.075in}
\begin{picture}(28,21)(-7.5,-1)
\put(0.3,0){\makebox(0,0)[bl]{\psfig{figure=figs/achievableXY.eps,width=1.5in}}}
\put(0,6.5){\makebox(0,0)[r]{\footnotesize$H(X^{(B)} \given X^{(A)})$}}
\put(0,14){\makebox(0,0)[r]{\footnotesize$H(X^{(B)})$}}
\put(0,17.5){\makebox(0,0)[r]{\footnotesize$H(X^{(A)},X^{(B)})$}}
\put(0,20){\makebox(0,0)[r]{\footnotesize$R_B$}}
%
\put(20,-0.27){\makebox(0,0)[t]{\footnotesize$R_A$}}
\put(2.5,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)} \given X^{(B)})$}}
\put(12,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)})$}}
%\put(15,-0.5){\makebox(0,0)[t]{\footnotesize$H(X^{(A)},X^{(B)})$}}
\end{picture}
}\\
\end{tabular}
\end{center}
}{%
\caption[a]{Communication of
% correlated
 information from dependent sources. (a)
% The communication situation:
 $x^{(A)}$ and $x^{(B)}$ are dependent sources (the dependence is represented by the dotted arrow). Strings of values of each variable are encoded using codes of rate $R_A$ and $R_B$ into transmissions $\bt^{(A)}$ and $\bt^{(B)}$, which are communicated over noise-free channels to a receiver $C$. (b) The achievable rate region. Both strings can be conveyed without error even though $R_A < H(X^{(A)})$ and $R_B < H(X^{(B)})$.
}
%
% this copy is all ready to work on......
%
% cp achievableXY.fig achievableXYAB.fig
\label{fig.achievableXY}
}%
\end{figure}
 The answer, which you should demonstrate,\index{dependent sources}\index{correlated sources}
%\index{Slepian--Wolf|see{dependent sources}}
 is indicated in \figref{fig.achievableXY}. In the general case of two dependent sources $X^{(A)}$ and $X^{(B)}$, there exist codes for the two transmitters that can achieve reliable communication of both $X^{(A)}$ and $X^{(B)}$ to C, as long as: the information rate from $X^{(A)}$, $R_A$, exceeds $H(X^{(A)} \given X^{(B)})$; the information rate from $X^{(B)}$, $R_B$, exceeds $H(X^{(B)} \given X^{(A)})$; and the total information rate $R_A+R_B$ exceeds the joint entropy $H(X^{(A)},X^{(B)})$ \cite{SlepianWolf}.
% In the general
% case of two correlated sources $X$ and $Y$, there exist codes for
% the two transmitters that can achieve reliable communication
% of both $X$ and $Y$ to C, as long as: the information rate from
% $X$, $R(X)$, exceeds $H(X \given Y)$; the information rate from
% $Y$, $R(Y)$, exceeds $H(Y \given X)$; and the total information rate
% $R(X)+R(Y)$ exceeds the joint information $H(X,Y)$.
 So in the case of $x^{(A)}$ and $x^{(B)}$ above, each transmitter must transmit at a rate greater than $H_2(0.02) = 0.14$ bits, and the total rate $R_A+R_B$ must be greater than 1.14 bits, for example $R_A=0.6$, $R_B=0.6$. There exist codes that can achieve these rates. Your task is to figure out why this is so. Try to find an explicit solution in which one of the sources is sent as plain text, $\bt^{(B)} = \bx^{(B)}$, and the other is encoded.
}
% \end{description}
%\noindent
\ExercisxC{3}{ex.multaccess}{
 {\sf \index{multiple access channel}Multiple access channels}.\index{channel!multiple access}
 Consider a channel with two sets of inputs and one output -- for example, a shared telephone line (\figref{fig.achievableAB}a).
A simple model system has two binary inputs $x^{(A)}$ and $x^{(B)}$ and a ternary output $y$ equal to the arithmetic sum of the two inputs, that's 0, 1 or 2. There is no noise. Users $A$ and $B$ cannot communicate with each other, and they cannot hear the output of the channel. If the output is a 0, the receiver can be certain that both inputs were set to 0; and if the output is a 2, the receiver can be certain that both