#### 14 Nov 2000 schoen»(Master)

I posted a diary entry earlier today which contains a proof of the arithmetic mean/geometric mean inequality and some comments on solar thermal power plants and other stuff. You can look at that earlier diary entry, if you'd like (it's still there). I don't like editing older diary entries, if I can avoid it.

My friend Eric sent me some Moxie! I've got Moxie!

Pseudonym: a very interesting problem. Hmmmm....

By the way, is there such a thing as a "logarithmic mean"? I'd look at MathWorld, but it's subject to "downtime by law" right now.

ahosey, 8-bit bytes have 256 possibilities each. So n bytes represent 256^n possibilities. (I slipped up with exponents writing about that last week... thanks, Jamie.)

OK, so if you have 256^n possibilities and you want to write them in decimal digits, just remember that writing them in an alphabet of 10 symbols will require a total of log(256^n)/log(10)=n log(256)/log(10)=2.408n symbols; if you must use a whole number of symbols, and without loss of generality for any n, you'd better use 3n of them.

That number 2.408... is also equal to 8/log_2(10), which is the quantity that would give you the same result if you'd phrased the question as "how many digits do we need to use to write a number with n bits" and then multiplied by 8 to convert that answer from bit-units to byte-units.

The general formula is ceil(log_b(a)) or equivalently ceil(log(a)/log(b)) to give you the largest number of symbols in an alphabet of size b needed to code for a symbol from an alphabet of size a. So if you prefer, in the specific application, n*ceil(log_b(a)) is the largest number of digits possible in an n-digit base a number after it's been converted to base b. n*log_b(a) is the average number of digits, but we have to be careful about what kind of "average" that is, which reduces this to the previous conversation. :-)

OK, if you want a more rigorous proof, you need to show that place value representations always use each symbol with equal probability -- which can be phrased in other ways. The log-ratio thing is stolen from information theory, although there are many different ways to derive it.

Actually, it suffices to show that in base b, there are exactly b^n numbers, counting from zero, whose representations are not longer than n digits (alternatively, "shorter than the representation of b^n", since the representation of b^n in base b has exactly n+1 digits, and b^n is the smallest number whose representation has at least n+1 digits). This is reasonably easy to show, because there are b^n numbers from 0 to (b^n)-1, since (b^n)-1+1=(b^n)-1. See, I told you it was easy!

Given that there are b^n numbers, counting from zero, whose representations are not longer than n digits in base b, we are justified in using that logarithm rule above, even if we don't do any information theory or probability stuff.

We can also see from information theory that where log_b(a) is not an integer, there will sometimes be more than one possible choice in the coding, so that we can choose to store additional information if we want. For example, if you have 3 decimal digits but you know that you're only writing an 8-bit number, you can say that the most significant digit should be interpreted modulo 3; then we actually have access to a spare bit or more to play with in coding each time! (There is even more information than that available some of the time; I'll leave it as an exercise for someone to propose an encoding of 8-bit numbers in 3-digit numbers that yields at least 1 bit free all the time and an average of more than 2 bits free in general. Of course, it's not really a matter of "propose" but of "show how to make use of".)

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.