## Saturday, April 16, 2016

### Bits are most efficient, entropy/energy => Zipf's law, new memory systems have a solvable problem

This will need a strong interest in Shannon entropy, Landauer's limit, and Zipf's law.

Consider a message of N symbols composed of n unique symbols, and the unique symbols have an energy cost that it is a linear function. For example, a brick at 10 different heights on an inclined plane. To be distinguishable states from a distance, they have to be some fixed minimal distance apart. If the inclined plane is not even with the ground, but raised a distance "b" and the vertical distance between distinguishable states is "a", then energy required for each symbol is E=a*i+b where i=1 to n assigned to the n unique symbols.

It will cost a lot of energy to use each symbol with equal probability, but that is what gives the most information (entropy).  What is the tradeoff? What function of "i" usage will result in the highest information per energy required? Underlying all this is the energy required to change a bit state, E=kb*T*ln(2), Landauer's limit. "a" and "b" in my linear E equation above have this E as a constant in them. a+b>kb*T*ln(2)which I'll discuss later.

The Zipf law (Here's a good overview) is a good solution in many energy-increasing profiles of the symbols. Zipf-mandelbrot law is Ci =  C1*p/(i+q)^d where p=1 and q=0 for this discussion because p will cancel and q, at best, will make 100% efficiency possible when 80% is possible without it, so I'll not investigate it except to say using d=q=5x the d's I gets 100% in many cases.

So I'll investigate this form of Zipf's law:
Ci =  C1/i^d
where Ci is the count of symbol "i" out of n unique symbols occurring in a message of N total symbols, C1 is the count of the most-used symbol, "i" is the ranking of the symbol, and what d values are optimal is what all this is about. Note: max entropy (information) per symbol occurs when d=0 which means every symbol occurs with same probability. [update: I found a paper that compares normalized entropy (inherent disorder from 0 to 1 without regard to n or N) to Zipf's law. The most interesting text, half way between order and disorder with normalized entropy=0.5 is when d=1 for very large n, greater 50,000. Shakespeare has d=1 verses others of only d=0.8. Language approaches this, or well exceeds it if you consider word pairs as symbols, and we do indeed remember them in pairs or more. For small set of symbols, n=100, n=100 to get H=0.5. Maybe the word-types diagrammed could be a set this small and follow this rule if the writer is smarter. http://arxiv.org/pdf/1001.2733.pdf ]

Note: using a=1, it turns out my efficiency equation below has 100% as its max.

There is an ideal Zipf factor (d) that determines the way computers should use symbols in order to give max bits/energy if they are going to shift towards memristor-type memories and computation that uses more than 2 bits. The d will increase from our current ideal of 0 as the energy baseline b divided by the number of unique symbols decreases as computers get better.

Heat generated by computation is always the limitation in chips having to synchronize with a clock because at high clock speeds, the light signal can only travel so far before it is no longer letting distance parts be in sync. A 10 Ghz square wave that needs everyone synced up in 2/3 of its cycle can only travel 2 cm down the twisted wiring inside a 1 cm chip at the speed of light. This means the chip is limited in size. As you pack more and more in, the heat generated starts throwing low energy signals out of where they are supposed to be. Using higher energy signals with higher voltage just makes it hotter even faster.

From this, one might not ever want to consider going to more than 2 symbols because it is going to require higher and higher energy states. That's generally true. The 66% my equation gives for bits can be made 80% even when using more than 2 symbols by letting d=2.4, if it is a very efficient idealized computer.

After numerical experimentation with the equation below, I've found (drum roll please):

d=0 if b greater than 10*n and n greater than 2 (modern computers)
d=0.6 if b=n
d=1 if b=n/3 and n greater than 9 (language)
d=1.5 if b=10 and n=b^2.
d=2.4-2.7 if b less than 1, n <1 and="" n="" nbsp="">100 to 30,000 (idealized computer)

An equation for d as a function of b and n could be developed, but seeing the above was good enough for me.

Bits are by far the most energy-efficient (bits information / joule energy cost) when maximizing the entropy per N symbol (i.e., when you require the symbols to occur with equal frequency).

d=1 does not, by far, use equal frequency of symbols which is what is required for maximal entropy per symbol used in a message of N symbols. This is partly why languages are only 9.8 bits per word out of a possible 13.3 (Shannon's entropy H=13.3 bits per word if words were used with equal frequency because 2^13.3 = 10,000 words)

New ideal memory systems having more symbols per memory slot and operating at the lowest theoretical energy levels (b=0) under this model (E=a*i+b) will be 0.06*n times less efficient if they use the memory states (symbols) with equal frequency (which is max entropy per symbol) than if they follow the rule of d=2.4.  For example, if you have 100 possible states per memory slot, you will be 6x less efficient if you use the states with equal frequency. For larger b of systems in the near future with b=10*a and n=b^2, d=1.5 is 2x more efficient than d=0.

Side note:  Here is a paper that assumes it is logarithmic instead of linear. It may be an exponential function is many situations: words rarely used may be hard to think of and understand, so it may be more efficient to use 2 well-known words that have the same meaning. Conversely, trying to put together two common words to describe one nearly-common word can be more costly.

The a+b is also the minimal energy a "0" state on a computer needs to be at above background temperature energy fluctuations in order to have a VERY small probability of accidentally being in that state.   The probability of a thermal energy state is e^(-(a+b)/kT), so a+b needs to be a lot greater than kT, which modern computers are getting close to having problems with. Distinct symbols contained in same memory slot would have to be to be energies above this. Computers use only 2 states, 0 and1, but I am interested in language with words as the symbols. We require different amounts of energy when speaking and in deciding the next word to speak.  The energy between each of these separate symbols could be less and less and maintain the same reliability (logarithmic instead of linear), but if the reader of such states has a fixed ability to distinguish energy levels, then there is an "a" slope needed, having units of energy per symbol.

What is the most efficient way for the sender to send a message of x bits? "Counti" below is the count of unique symbol i.

Actually the units are joules/joules, because as I said before "a" and "b" have a k*T*ln(2) constant in them based on Landauer's limit and how reliable the symbols should be.

Also, it's not H=1 at the end but H=log2(n), so my last line should be Efficiency=c*log2(n)/n^2 when the counts are all equal, but the point remains: from the last equation, bits are the most energy-efficient way of sending information if you are requiring the symbols to occur with equal frequency, which is normally the case if you want the most information per symbol. But intuition already knows this:  it's no surprise at all that using the 2 symbols that require the least amount of energy will be the most energy-efficient at sending entropy. n can't be 1 because entropy = 0 for that limit.

This does not mean using only the 2 lowest-energy symbols will be the most energy-efficient method of sending information. Remember, the previous paragraph requires the symbols to be used with equal frequency. It's a good assumption these days in wasteful computers, but it is not the most energy-efficient method when the computer is efficient enough to be affected by the energy cost function of the symbols. On less-frequent occasions, you could use the energy-expensive symbols to "surprise" the receiver which carries a lot more information. To find the optimal usage, the efficiency equation above must be optimized by experimentation because it's easy to analytically solve only when H = 1 (or log2(n)). doing that resulted in my 4 claims above.

Here is a 2015 paper that professionally does what I am going to do, but they just look at logarithmi energy functions which I agree might be better for things like cities, but for signal generators, I think the distinct energy levels between state is an important category.

The rest of this post are my older notes on this. It will ramble. Most of it is trying to see how an energy cost results in Zipf-like distributions.

Zipf's law states that 1/rank is roughly the number of people in a city times the population of the largest city. It seems to usually be based on a feedback effect where efficiency is gained by more "people" joining causes an already-efficient thing (like a city, or buyers of a product or service) to become more efficient. Mandelbrot increased its accuracy by using c1/(rank+c2)^c3.  I'm going to use to find the best constants for different energy functions that result in the highest entropy.  How does a set of cities correspond to seeking highest entropy/efficiency?  I might have the sequence backwards: it might be that since people are potentially randomly distributed (high entropy) that the only thing guiding their selection of city is based on efficiency. There's not one mega city because its success in some ways causes its failure in other ways. It can be a fractal pattern from seeking the more efficient nearby options, like side sparks on lightening bolts and blood vessel routes being established, if not brain wiring. Current in wires will seek other routes as highly efficient (low resistance) routes become too heated from too much success. After defining symbols to represent these situations, it will be found to follow some zipf-mandelbrot law. But I want to derive it. Mandelbrot talked about fractals and the erngy cost also.   We have market dominators like Walmart and Amazon who do a good job because lots of people support them, and lots of people support them because they did a good job. So it's a feedback. It's been said China and Soviet union cities, at least in the past, did not follow the rule, apparently because aggressive socialism interfered with market forces.  Others tried to dismiss it as just a statistical effect, but those efforts were never convincing and are out of favor.

In terms of word frequency, there can be feedback.  "a" is an efficient word. A lot of words like to use it, probably because a lot of other words like it. By everyone using it, it might have become very price for accuracy or general. It might have been said less efficient in the past, but it then got used more which made it more efficient. My point is to mention this feedback before carrying on with relating the resulting energy efficiency with frequency of occurrence.  It also shows my efficiency equation might need to be logarithmic, but unlike others, that is not my starting point.

If we find a little profit from easy action, the language of our action will have a high count for that action despite the small profit. And if actions of an ever-increasing amount of energy can be combined in a clever way (high entropy), we can get a lot more energy from the world by carefully selecting how often we take the small actions compared to the energy-intensive actions. But by making these actions interdependent to achieve a larger goal (including the order in which they occur) then a lot more "heat" can be generated.  "Heat" needs to be used imaginatively here to apply, but it does apply in a real way.

If we have a bunch of levers to control something, we will make sure the levers we use the most will have the lowest required energy for action. Short words are easier to say, so we give them the definitions we need to use the most. The mechanics of speech and hearing do not allow for us to choose the next syllable out of every possible syllable with equal ease.  The most frequently used words, syllables, or whatever sequences are the ones that require the least amount of food energy.  We've assigned them to the things we need to say the most. But that's probably not exactly right: we say things in a way that allows use to use the easy words more frequently.

I tried 100 symbols in excel with a=1 and b=0.  Result: H=1 (d=0) is half as efficient (13%) as countsi=1/i which had 27% efficiency.  Value of "a" does not matter because it kills both efficiencies.  I tried d/(e*n)^c as a count per symbol factor, and 2.6 for c was best at 80%. The d and e had no effect. 1/(n+d)^c got it higher and higher to finally 0.99 as both d and c rose above 6 with n=100. It might needed to be higher with higher n.  If a of the linear function changed, the 80%increased by 1/a.  With significant b, c wanted to come down to 1.  And the d and c going up did not help much and needed to be lower.  At a=1 and b=10, c=1.5 was good.  Higher b=100 needed to be closer to 1, but 0.5 to 2 was OK, no so closer to the law for the cities.

In conclusion, the frequency=1/rank^(<1 accurate="" b="" be="" for="" nbsp="" rule="" should="">>a which is probably normally the case (speaking requires more energy than choosing what to say), if we are selecting words based on the amount of energy requiring in choosing words.  Trying to determine the energy function is a fuzzy thing: it includes how hard it is to say some words after others and how good you can recall words. If this later effect is large, great variability between writers might be seen.  When an author sees little difficulty in using a wide variety of words, the energy difference will be less and he will be closer to use words more evenly with a smaller power factor in 1/r^c.  c=0 is the same as saying H=1 (aka H=log2(n)).  Reducing n did not have a big effect. I tested mostly on n=100.  I see no reason to think that this idea is not sufficient or that it is less effective than other views.