Economics, A.I., physics, & evolution: entropy: relation between physical and information

== The relation between S and H ==

H is intensive S
Shannon entropy H appears to be more like bits per symbol of a message (see examples below for proof) which has abetter parallel with '''intensive''' entropy (S per volume or S per mole), NOT the usual '''extensive''' entropy S. This allows "n" distinct symbols like a,b,c in a Shannon message of length N to correlate with "n" energy levels in N particles or energy modes.

H as possible extensive S? Not really
However, a larger block of matter will have more longer wavelength phonons so that more Shannon symbols will be needed to represent the larger number of energy levels, so maybe it can't be taken as intensive S in all occasions. But the same error will apply if you try to use intensive S. So intensive S seems to be exactly equal to Shannon, as long as the same reference size block is used. This problem does not seem to apply to gases. But does a double-sized volume of gas with double number of molecules have double entropy as I would like to use Shannon entropy? Yes, according to the equation I found (at the bottom) for S for a gas: S = N*( a*ln(U/N) + b*ln(V/N)+c* ). If N, U, and V all double, then entropy merely doubles.

Why Shannon entropy H is not like the usual extensive entropy S
Examples of Shannon entropy (using the online Shannon entropy calculator): http://www.shannonentropy.netmark.pl/
01, 0011, and 00101101 all have Shannon entropy H = log2(2) = 1
abc, aabbcc, abcabcabc, and abccba all have Shannon entropy H = log2(3) = 1.58

The important point is to notice it does not depend on the message length, whose analogy is number of particles. Clearly physical extensive entropy increases with number of particles.

Suggested conversion between H and regular extensive S
To convert Shannon entropy H to extensive entropy S or vice versa use
'''S=N*kb*ln(2)*H'''
where N is number of oscillators or particles, kb is Boltzmann's constant, ln(2) is the conversion from bits to nats, and N is the number of particles to which you've assigned a "Shannon" symbol representing the distinct energy level (if volume is constant) or microstates (a combination of volume and energy) of the particles or other oscillators of energy. A big deal should not be made out of kb (and therefore S) having units of J/K because temperature is a precise measure of a kinetic energy distribution, so it is unitless in a deep sense. It is merely a slope that allows zero heat energy and zero temperature to meet at the same point, both of which are Joules per particle.

Physical entropy of an Einstein solid derivable from information theory?
Consider an Einstein Solidhttp://hyperphysics.phy-astr.gsu.edu/hbase/therm/einsol.html with 3 possible energy states (0, 1, 2) of in each of N oscillators. Let the average energy per oscillator be 1 which means the total energy q=N. The physical extensive entropy S is (by the reference above) very close to S = kb*ln(2)*2N = 1.38*N*kb for large N. You can verify this by plugging in example N's, or by Sterling's approximation and some math.

Shannon entropy for this system is as follows: each oscillator's energy is represented by 1 of 3 symbols of equal probability (i.e. think of a, b, c substituted for energies 0,1,2). Then H = log2(3) = 1.585 which does not depend on N (see my Shannon entropy examples above). Using the H to S conversion equation above gives S=N*kb*ln(2)*log2(3) = ln(3)*N*kb = 1.10*N*kb. This is 56% lower than the Debye model. I think this means there are 14 possible states of even probability in the message of oscillators for each oscillator N giving S = k*N*ln(14) and K.E.= N*energy/N.

log2(3) = ln(3)/ln(2)

Note that the physical S is 25% more than the S derived from information theory, 2*ln(2) verses ln(3). The reason is that the physical entropy is not restricted to independently random oscillator energy levels which average 1. It only requires the total energy be q=N.

You might say this proves Shannon entropy H is not exactly like intensive entropy S. But I wonder if this idealized multiplicity is a physical reality. Or maybe kb contains a "hidden" 25% reduction factor in which case I would need to correct my H to S conversion above by a factor of 0.75 if used in multiplicity examples instead of classical thermodynamics. Another possibility is that my approach in assigning a symbol to an energy level and a symbol location to a particle is wrong. But it's hard to imagine the usefulness of looking at it any other way.

Going further towards Debye model of solids
However, I might be doing it wrong. The oscillators might also be all 1's or half 2's and half 0's ( all 1's has log2(1)=0) . This would constitute 3 different types of signal sources as the possible source of the total energy with entropy of maybe H=log2(3)+log2(2)+log2(1)=2.58. Times ln(2) gives 1.77 which is higher than the 1.38, but close to Debye's model of solids which predicts temps 1/0.806 more than einstein for a given total energy which is S/kb=1.72 (because U/N in both models ~T*S). The model itself is supposed to be the problem with einstein solid, so H should not come out closer based on it, unless I'm cheating or re-interpreting in a way that makes it closer to the Debye model. To complete this, I need to convert the debye model to a shannon signal source, assigning a symbol to each energy level. Each symbol would occur in the signal source (the solid) based on how many modes are allowing it. So N would not be particles but number of phonons, which determine plank radiation even at low temps, and heat capacity. It is strange that energy level quantity does not change the entropy, just the amount of variation in the possibilities. This probably is the way the physical entropy is calculated so it's garanteed to be correct.

Debye Temp ~ s* sqrt(N/V) where s=sonic velocity of the solid . "Einstein Temp" = 0.806* debye temp. This means my treatment

C = a*T^3 (heat capacity)
S=a/3*C

Entropy of a single harmonic oscillator (giving rise to phonons?) at "high" temp is
for solids S=k*ln(kT/hf + 1) where f is frequency of the oscillations which is higher for stronger bonds. So if Earth acquires stronger bonds per oscillator and maintains a constant temperature, S decreases. For indpendent oscillators in 1 D I think it might be S=kN*(ln(kT/hf/N) + 1). In terms of S=n*H entropy: S=X*[-N/X*[ln(N/X)-1]] where X=kT/hf. For 3D X^3. See Vibrational Thermodynamics of Materials, Fultz paper.

Example that includes volume
The entropy of an ideal gas according to internal energy page.
S = k * N *[ cln(U/N) + R*ln(V/N) ]
by using log rules applied to Wikipedia's internal energy page, where c is heat capacity. You could use two sets of "Shannon" symbols, one for internal energy states U and a set for V, or use one set to represent the microstates determined by U and V to give S=constant*N*ln(2)*H. H using distinct symbols for microstates is thereby the Shannon entropy of an ideal gas.

ideal gas not near absolute zero, more accurately:
S=kN*ln(V/N*(mU4pi/3Nh^2)^3/2 + 5/2))
Sackur–Tetrode equation
S~N*( ln(V/N*(mU/N)^3/2 + a) +b)
S~N*(ln(V/N) + a*ln(mU/N) +b)
The separate ln()'s are because there are 2 different sets of symbols that need to be on the same absolute base, but have "something", maybe different "symbol lengths" as a result of momentum being the underlying quantity and it affects pressure per particle as v whereas it affects U as v^2. In PV=nRT, T is v^2 for a given mass, and P is v. P1/P2 = sqrt(m2/m1) when T1 and T2 are the same for a given N/V.

==================
edit to Wikipedia: (entropy article)

The closest connection between entropy in information theory and physical entropy can be seen by assigning a symbol to each distinct way a quantum energy level (microstate) can occur per mole, kilogram, volume, or particle of homogeneous substance. The symbols will thereby "occur" in the unit substance with different probability corresponding to the probability of each microstate. Physical entropy on a "per quantity" basis is called "[[Intensive_and_extensive_properties|intensive]]" entropy as opposed to total entropy which is called "extensive" entropy. When there are N moles, kilograms, volumes, or particles of the substance, the relationship between this assigned Shannon entropy H in bits and physical extensive entropy in nats is:
:S = k_\mathrm{B} \ln(2) N H
where ln(2) is the conversion factor from base 2 of Shannon entropy to the natural base e of physical entropy. [[Landauer's principle]] demonstrates the reality of this connection: the minimum energy E required and therefore heat Q generated by an ideally efficient memory change or logic operation from irreversibly erasing or merging N*H bits of information will be S times the temperature,
:E = Q = T k_\mathrm{B} \ln(2) N H,
where H is in information bits and E and Q are in physical Joules. This has been experimentally confirmed.{{Citation |author1=Antoine Bérut |author2=Artak Arakelyan |author3=Artyom Petrosyan |author4=Sergio Ciliberto |author5=Raoul Dillenschneider |author6=Eric Lutz |doi=10.1038/nature10872 |title=Experimental verification of Landauer’s principle linking information and thermodynamics |journal=Nature |volume=483 |issue=7388 |pages=187–190 |date=8 March 2012 |url=http://www.physik.uni-kl.de/eggert/papers/raoul.pdf|bibcode = 2012Natur.483..187B }}

(scott note: H = sum(count/N*log2 (N/count) )

BEST wiki article
When a file or message is viewed as all the data a source will ever generate, the Shannon entropy H in '''bits per symbol''' is
H = - \sum_{i=1}^{n} p_i \log_2 (p_i) = \sum_{i=1}^{n}count_i/N \log_2 (N/count_i)

where i is for each distinct symbol, p_i is "probability of symbol being received, and count_i is the number of times symbol i occurs in the message that is N symbols long. This equation is "bits" per symbol because it is in a logarithm of base 2. It is "entropy per symbol" or "normalized entropy" that ranges from 0 to1 when the logarithm base is equal to the n distinct symbols. This can be calculated by multiplying H in bits/symbol by ln(2)/ln(n). The "bits per symbol" of data that has only 2 symbols is therefore also "entropy per symbol".

The "entropy" of a message, file, source, or other data is S=N*H, in keeping with Boltzmann's [[H-theorem]] for physical entropy that Claude Shannon's 1948 paper cited as analogous to his information entropy. To clarify why Shannon's H is often called "entropy" instead of "entropy per symbol" or "bits per symbol", section 1.6 of Shannon's paper refers to H as the "entropy of the set of probabilities" which is not the same as the entropy of N symbols. H is analogous to physic's [[Intensive and extensive properties|intensive]] entropy So that is on a per mole or per kg basis, which is different from the more common extensive entropy S. In section 1.7 Shannon more explicitly says H is "entropy per symbol" or "bits per symbol".

It is instructive to see H values in bits per symbol (log2) for several examples of short messages by using an online entropy (bits per symbol) calculator:http://www.shannonentropy.netmark.pl/http://planetcalc.com/2476/
* H=0 for "A", "AAAAA", and "11111"
* H=1 for "AB", "ABAB", "0011", and "0110101001011001" (eight 1's and eight 0's)
* H=1.58 for "ABC", "ABCABCABCABC", and "aaaaBBBB2222"
* H=2 for "ABCD" and "AABBCCDD"

The entropy in each of these short messages is H*N.

Claude Shannon's 1948 paper was the first to define an "entropy" for use in information theory. His H function (formally defined below) is named after Boltzmann's [[H-theorem]] which was used to define physical entropy by S=kb*N*H of an ideal gas. Boltzmann's H is entropy in [[nats_(unit)|nats]] per particle, so each symbol in a message has an analogy to each particle in an ideal gas, and the probability of a specific symbol is analogous to the probability of a particle's microstate. The "per particle" basis does not work in solids because they are interacting. But the information entropy maintains mathematical equivalency with bulk [[Intensive and extensive properties|intensive]] physical entropy So which is on a per mole or per kilogram basis (molar entropy and specific entropy). H*N is mathematically analogous to total [[Intensive and extensive properties|extensive]] physical entropy S. If the probability of a symbol in a message represents the probability of a microstate per mole or kg, and each symbol represents a specific mole or kg, then S=kb*ln(2)*N*Hbits. Temperature is a measure of the average kinetic energy per particle in an ideal gas (Kelvins = Joules*2/3/kb) so the Joules/Kelvins units of kbare fundamentally unitless (Joules/Joules), so S is fundamentally an entropy in the same sense as H.
[[Landauer's principle]] shows a change in information entropy is equal to a change in physical entropy S when the information system is perfectly efficient. In other words, a device can't irreversibly change its information entropy content without causing an equal or greater increase in physical entropy. Sphysical=k*ln(2)*(H*N)bits where k, being greater than Boltzmann's constant kb, represents the system's inefficiency at erasing data and thereby losing energy previously stored in bits as heat dQ=T*dS.

==============
on the information and entropy article: (my best version)

The Shannon entropy H in information theory has units of bits per symbol (Chapter 1, section 7). For example, the messages "AB" and "AAABBBABAB" both have Shannon entropy H=log2(2) = 1 bit per symbol because the 2 symbols A and B occur with equal probability. So when comparing it to physical entropy, the physical entropy should be on a "per quantity" basis which is called "[[Intensive_and_extensive_properties|intensive]]" entropy instead of the usual total entropy which is called "extensive" entropy. The "shannons" of a message are its total "extensive" information entropy and is H times the number of bits in the message.
A direct and physically real relationship between H and S can be found by assigning a symbol to each microstate that occurs per mole, kilogram, volume, or particle of a homogeneous substance, then calculating the H of these symbols. By theory or by observation, the symbols (microstates) will occur with different probabilities and this will determine H. If there are N moles, kilograms, volumes, or particles of the unit substance, the relationship between H (in bits per unit substance) and physical extensive entropy in nats is:
:S = k_\mathrm{B} \ln(2) N H
where ln(2) is the conversion factor from base 2 of Shannon entropy to the natural base e of physical entropy. N*H is the amount of information in bits needed to describe the state of a physical system with entropy S. [[Landauer's principle]] demonstrates the reality of this by stating the minimum energy E required (and therefore heat Q generated) by an ideally efficient memory change or logic operation by irreversibly erasing or merging N*H bits of information will be S times the temperature which is
:E = Q = T k_\mathrm{B} \ln(2) N H
where H is in informational bits and E and Q are in physical Joules. This has been experimentally confirmed.{{Citation |author1=Antoine Bérut |author2=Artak Arakelyan |author3=Artyom Petrosyan |author4=Sergio Ciliberto |author5=Raoul Dillenschneider |author6=Eric Lutz |doi=10.1038/nature10872 |title=Experimental verification of Landauer’s principle linking information and thermodynamics |journal=Nature |volume=483 |issue=7388 |pages=187–190 |date=8 March 2012 |url=http://www.physik.uni-kl.de/eggert/papers/raoul.pdf|bibcode = 2012Natur.483..187B }}
Temperature is a measure of the average kinetic energy per particle in an ideal gase (Kelvins = 2/3*Joules/kb) so the J/K units of kb is fundamentally unitless (Joules/Joules). kb is the conversion factor from energy in 3/2*Kelvins to Joules for an ideal gas. If kinetic energy measurements per particle of an ideal gas were expressed as Joules instead of Kelvins, kb in the above equations would be replaced by 3/2. This shows S is a true statistical measure of microstates that does not have a fundamental physical unit other than "nats" which is just a statement of which logarithm base was chosen by convention.

===========
Example:
8 molecules of gas, with 2 equally probable internal energy states and 2 equally probable positions in a box. If at first they are on the same side of the box with half in 1 energy state and the other half in the other energy state, then the state of the system could be written ABABABAB. Now if they are allowed to distribute evenly in the box keeping their previous energy levels it is written ABCDABCD. For an ideal gas: S ~ N*ln(U*V) where U and V are averages per particle.

An ideal gas uses internal energy and volume. Solids seem to not include volume, and depend on phonons with have bond-stretching energies in place of rotational energies. But it seems like solids should follow S~N*ln(U) = k*ln(2)*H where H would get into quantum probabilities of each energy state. The problem is that phonon waves across the solid are occurring. So that instead of U, an H like this might need to be calculated from phase-space (momentum and position of the atoms and their electrons? )
=============
comment Wikipedia
Leegrc, "bits" are the invented name for when the log base is 2 is used. There is, like you say, no "thing" in the DATA itself you can point to. Pointing to the equation itself to declare a unit is, like you are thinking, suspicious. But physical entropy itself is in "nats" for natural units for the same reason (they use base "e"). The only way to take out this "arbitrary unit" is to make the base of the logarithm equal to the number of symbols. The base would be just another variable to plug a number in. Then the range of the H function would stay between 0 and 1. Then it is a true measure of randomness of the message per symbol. But by sticking with base two, I can look at any set of symbols and know how many bits (in my computing system that can only talk in bits) would be required to convey the same amount of information. If I see a long file of 26 letters having equal probability, then I need H = log2(26) = 4.7 bits to re-code each letter in 1's and 0's. There are H=4.7 bits per letter.
PAR, as far as I know, H should be used blind without knowledge of prior symbol probabilities, especially if looking for a definition of entropy. You are talking about watching a transmitter for a long time to determine probabilities, then looking at a short message and using the H function with the prior probabilities.

Let me give an example of why a blind and simple H can be extremely useful. Let's say there is a file that has 8 bytes in it. One moment it say AAAABBBB and the next moment it says ABCDABCD. I apply H blindly not knowing what the symbols represent. H=1 in the first case and H=2 in the second. H*N went from 8 to 16. Now someone reveals the bytes were representing microstates of 8 gas particles. I know nothing else. Not the size of the box they were in, not if the temperature had been raised, not if a partition had been lifted, and not even if these were the only possible microstates (symbols). But there was a physical entropy change everyone agrees upon from S1=kb*ln(2)*8 to S2=kb*ln(2)*16. So I believe entropy H*N as I've described it is as fundamental in information theory as it is in physics. Prior probabilities and such are useful but need to be defined how they are used. H on a per message basis will be the fundamental input to those other ideas, not to be brushed aside or detracted from.
=============
I agree you can shorten up the H equation by entering the p's directly by theory or by experience. But you're doing the same thing as me when I calculate H for large N, but I do not make any assumption about the symbol probabilities. You and I will get the same entropy H and "extensive" entropy N*H for a SOURCE. Your N*H extensive entropy is N*sum(p*log(p)). The online entropy calculators and I use N*H = N*sum[ count/N*log(count/N) ] ( they usually give H without the N). These are equal for large N if the source and channel do not change. "My" H can immediately detect if a source has deviated from its historical average. "My" H will fluctuate around the historical or theoretical average H for small N. You should see this method is more objective and more general than your declaration it can't be applied to a file or message without knowing prior p's. For example, let a partition be removed to allow particles in a box to get to the other side. You would immediately calculate the N*H entropy for this box from theory. "My" N*H will increase until it reaches your N*H as the particles reach maximum entropy. This is how thermodynamic entropy is calculated and measured. A message or file can have a H entropy that deviates from the expected H value of the source.
The distinct symbols A, B, C, D are distinct microstates at the lowest level. The "byte" POSITION determines WHICH particle (or microvolume if you want) has that microstate: that is the level to which this applies. The entropy of any one of them, is "0" by the H function, or "meaningless" as you stated. A sequence of these "bytes" tells the EXACT state of each particle and system, not a particular microstate (because microstate does not care about the order unless it is relevant to it's probability). A single MACROstate would be combinations of these distinct states. One example macrostate of this is when the gas might be in any one of these 6 distinct states: AABB, ABAB, BBAA, BABA, ABBA, or BAAB. You can "go to a higher level" than using A and B as microstates, and claim AA, BB, AB, and BA are individual microstates with a certain probabilities. But the H*N entropy will come out the same. There was not an error in my AAAABBBBB example and I did not make an assumption. It was observed data that "just happened" to be equally likely probabilities (so that my math was simple). I just blindly calculated the standard N*H entropy, and showed how it give the same result physics gets when a partition is removed and the macrostate H*N entropy went from 8 to 16 as the volume of the box doubled. The normal S increases S2-S1=kb*N*ln(2) as it always should when mid-box partition is removed.
I can derive your entropy from the way the online calculators and I use Shannon's entropy, but you can't go the opposite way.
Now is the time to think carefully, check the math, and realize I am correct. There are a lot of problems in the article because it does not distinguish between intensive Shannon entropy H in bits/symbol and extensive entropy N*H in bits (or "shannons to be more precise to distinguish it from the "bit" count of a file which may not have 1's and 0's of equal probability).
BTW, the entropy of an ideal gas is S~N*log2(u*v) where u and v are internal energy and volume per particle. u*v gives the number of microstates per particle. Quantum mechanics determines that u can take on a very large number of values and v is the requirement that the particles are not occupying the same spot, roughly 1000 different places per particle at standard conditions. The energy levels will have different probabilities. By merely observing large N and counting, H will automatically include the probabilities.
In summary, there are only 3 simple equations I am saying. They precisely lay the foundation of all further information entropy considerations. These equations should replace 70% of the existing article. These are not new equations, but defining them and how to use them is hard to come across since there is so much clutter and confusion regarding entropy as a result of people not understanding these statements.
1) Shannon's entropy is "intensive" bits/symbol = H = sum[ count/N*log2(count/N) ] where N is the length of a message and count is for each distinct symbol.
2) Absolute ("extensive") information entropy is in units of bits or shannons = N*H.
3) S = kb*ln(2)*N*H where each N has a distinct microstate which is represented by a symbol. H is calculated directly from these symbols for all N. This works from the macro down to the quantum level.
==========
Example: 3 interacting particles with sum total energy 2 and possible individual energies 0,1,2 may have possible energy distributions 011, 110, 101, 200, 020, or 002. I believe the order is not relevant to what is called a microstate, so you have only 2 symbols for 2 microstates, and get the probability for each is 50-50. Maybe there is usually something that skews this towards low energies. I would simply call each one of the 6 "sub-micro states" a microstate and let the count be included in H. Assuming equal p's again, the first case gives log(2)=1 and the 2nd log(6)=2.58. I believe the first one is the physically correct entropy (the approach, that is, not the exact number I gave). If I had let 0,1,2 be the symbols, then it would have 3*1.46 = 4.38 which is wrong.
Physically, because of the above, when saying S=k*ln(2)*NH, it requires that you look at specific entropy So and make it = k*ln(2)*H, so you'll have the correct H. This back-calculates the correct H. This assumes you are like me and can't derive Boltzmann's thermodynamic H from first (quantum or not) principles. I may be able to do it for an ideal gas. I tried to apply H to Einstein's oscillators (he was not aware of Shannon's entropy at the time) for solids, and I was 25% lower than his multiplicity, which is 25% lower than the more accurate Debye model. So a VERY simplistic approach to entropy with information theory was only 40% lower than experiment and good theory, for the one set of conditions I tried. I assumed the oscillators had only 4 energy states and got S=1.1*kT where Debye via Einstein said S=1.7*kT
My point is this: looking at a source of data and choosing how we group the data into symbols can result in different values for H and NH, [edit: if not independent]. Using no grouping on the original data is no compression and is the only one that does not use an algorithm plus lookup table. Higher grouping on independent data means more memory is required with no benefit to understanding (better understanding=lower NH). People with bad memories are forced to develop better compression methods (lower NH), which is why smart people can sometimes be so clueless about the big picture, reading too much with high NH in their brains and thinking too little, never needing to reduce the NH because they are so smart. Looking for a lower NH by grouping the symbols is the simplest compression algorithm. The next step up is run-length encoding, a variable symbol length. All compression and pattern recognition create some sort of "lookup table" (symbols = weighting factors) to run through an algorithm that may combine symbols to create on-the-fly higher-order symbols in order to find the lowest NH to explain higher original NH. The natural, default non-compressed starting point should be to take the data as it is and apply the H and NH statistics, letting each symbol be a microstate. Perfect compression for generalized data is not a solvable problem, so we can't start from the other direction with an obvious standard.
This lowering of NH is important because compression is 1 of 3 requirements for intelligence. Intelligence is the ability to acquire highest profit divided by noise*log(memory*computation) in the largest number of environments. Memory on a computing device has a potential energy cost and computation has a kinetic energy cost. The combination is internal energy U. Specifically, for devices with a fixed volume, in both production machines and computational machines, profit = Work output/[k*Temp*N*ln(U/N)] = Work/(kTNH). This is Carnot efficiency W/Q, except the work output includes acquisition of energy from the environment so that the ratio can be larger than 1. The thinking machine must power itself from its own work production, so I should write (W-Q)/Q instead. W-Q feeds back to change Q to improve the ratio. The denominator represents a thinking machine plus its body (environment manipulator) that moves particles, ions (in brains), or electrons (in computers) to model much larger objects in the external world to try different scenarios before deciding where to invest W-Q. "Efficient body" means trying to lower k for a given NH. NH is the thinking machine's algorithmic efficiency for a giving k. NH has physical structure with U losses, but that should be a conversion factor moved out to be part of the kT so that NH could be a theoretical information construct. The ultimate body is bringing kT down to kb at 0 C. The goal of life and a more complete definition of intelligence is to feed Work back to supply the internal energy U and to build physical structures that hold more and more N operating at lower and lower k*T. A Buddhist might say we only need to stop being greedy and stop trying to raise N (copies of physical self, kT, aka the number of computations) and U and we could leave k alone. This assumes constant volume, otherwise replace U/N with V/N*(U/N)^3/2 (for an ideal gas, anyway, maybe UV/NN is ok for solids. Including volume means we want to make it lower to lower kTNH. So denser thinking machine. The universe itself increases V/N (Hubble expansion) buth it cancels in determining Q because it causes U/N to decrease at possibly the same rate. This keeps entropy and energy CONSTANT on a universal COMOVING basis (ref: Weinberg's 1977 famous book "First 3 Minutes"), which causes entropy to be emitted (not universally "increased" as the laymen's books still say) from gravitational systems like Earth and Galaxies. The least action principle (the most general form of Newton's law, better than Hamiltonian & Lagrangian for developing new theories, see Feynman's red books) appears to me to have an inherent bias against entropy, preferring PE over KE over all time scales, and thereby tries to lower temp and raise the P.E. part of U for each N on Earth. This appears to be the source of evolution and why machines are replacing biology, killing off species 50,000 times faster than the historical rate. The legal requirement of all public companies is to dis-employ workers because they are expensive and to extract as much wealth from society as possible so that the machine can grow. Technology is even replacing the need for shareholders and skill (2 guys started MS, Apple, google, youtube, facebook, and snapchat and you can see trend in decreasing intelligence and age and increasing random luck needed to get your first $billion). Silicon, carbon-carbon, and matals are higher energy bonds (which least action prefers over kinetic energy) enabling lower N/U and k, and even capturing 20 times more Work energy per m^2 than photosynthesis. Ions that brains have to model objects with still weigh 100,000 times more than the electrons computers use.
In the case of the balance and 13 balls, we applied the balance like asking a question and organize thigs to get the most data out of the test. We may seek more NH answers from people or nature than we give in order to profit, but in producing W, we want to spend as little NH as possible.
[edit: I originally backtracked on dependency but corrected it, and I made a lot errors with my ratios from not letting k be positive for the ln().]Ywaz (talk) 23:09, 8 December 2015 (UTC)

Rubik's cube
Let the number of quarter turns from a specific disorder to ordered state be a microstate with a probability based on number of turns. Longer routes to solution are more likely. The shortest route uses the least energy which is indicative of intelligence. That's why we prize short solutions. The N in N*H = N*sum(count/N*log2(N/count)) is the number of turns. Count is the number of different routes with that N. Speakin in terms Carnot understood and reversing time, the fall from low entropy to high entropy should be fast in order minimize work production (work absorption when forward in time). The problem is always determining if a single particular step is the most efficient towards the goal or not. There's no incremental measure for profit increase in the problems we can't solve. Solving problems is a decrease in entropy. Is working backwards to generate the most mixed up state in as few turns as possible a beginning point?

Normalized entropy
There may not be a fundamental cost to a large memory, i.e., a large number of symbols aka classifications. That is potential energy which is an investment in infrastructure. Maybe there is a "comparison","lookup", or even transmission cost (more bits are still required), but maybe those are reversible. Maybe you can take large entropy, classify it the same as with microbits, and the before and after entropy cost is the same, but maybe computation is not reversible (indeed, dispersion calls it into question) but memory access is, so a larger memory set for classification is less energy to computation. If there are no obvious dependencies to the H function, the larger number of symbols absorbing more bits/symbol in the message will have the same NH. If H is not equal to 1 for binary, then this may be an unusual situation. Let i=0 to n=number of unique symbols of message N symbols long then H=sum(log(N/count_i)) . So

H' = H*ln(2)/ln(n) where "2" should be replaced if the H was not calculated with log base 2.

= normalized entropy where each "bit" position can take on i states, i.e., an access to a memory of symbols has i choices, the number of symbols, aka the number of distinct microstates. H' varies from 0 to 1, which Shannon did not point out, but let "entropy" vary in definition based on the log base. So if there is not a cost to accessing a large memory bank and it speeds up computation or discussion, then even if NH is the same, NH' will be lower for the larger memory bank (the division makes it smaller). But again, NH is only the same between the two anyway if symbol frequencies are equal and this occurs only if you see no big patterns at all and it looks like noise in both cases, or if you have really chosen the optimum set of symbols for the data (physics equations seem to be seeking smallest NH, not smallest NH'). It seems like a larger number of symbols available will almost always detect more patterns, with no problems about narrow-mindedness if the training data set is large enough. The true "disorder" of a physical object relative to itself (per particle or per symbol) rather than to it's mass or size (So) or information is H' = S/ln(number of microstates). If one is trying to compare the total entropy of objects or data that use completely different sets of microstates (symbols) that are very different, the measure N*H' is still completely objective.

With varying-length symbols, NH' has to more seriously lose all connection with physical entropy and energy on a per bit level. It has to be strictly on a per symbol level. Varying symbols lose all connection to physics. consider the following: n times symbol bit length is no longer representing a measure of the bit memory space required because bit length is varying. N divided by symbol bit length is also not the number of symbols in a message. There are more memory locations addressed, with total memory required being n*average symbol length. It seems like it would be strictly "symbols only", especially if the log base is chosen based on n. The log base loses the proportional connection to "energy" (if energy were still connected to number of bits instead of symbols).

For varying-length symbols there is also a problem when trying to relate computations to energy and memory even if I keep base 2. To keep varying symbol lengths on a bit basis for the entropy in regards to computation energy required per information entropy, I need the following: For varying bit-length symbols where symbol i has bit length Li, I get
H=1/N*sum ( count_i * Li * log2(N/(count_i*Li) ) )
This is average bit (and real entropy since N is in bits) variation per bit communicated.
where N has to be in bits instead of symbols and the log base has to be in bits. Inside the log is the ability of the count to reduce the number of bits needed to be transmitted. Remember for positive H it is p*log(1/p) so this represents a higher probability of "count" occurring for a particular bit if L is higher. I don't know how to convert the log to a standard base where it varies from 0 to 1. Because they are varying length, it kind of loses relevance. Maybe the sum frequency of symbol occurring times tits length divided by sum (count_i/N*Li/(sum( L)/n))

Log base two on varying symbol bit-lengths loses its connection to bits required to re-code, so the above has to be used.

So in order to keep information connected to physics, you need to stick with log base 2 so you can count the changes in energy accurately. The symbols can vary in bit length as long as they are the same length.

So H' is a disorder measurement per symbol, comparable across all types of systems with varying number of n and N symbols. It varies from 0 to 1. It is "per energy" only if there is a fixed energy cost per symbol transmission and storage without regard to symbol length. If someone has remembered a lot of symbols of varying lengths, they might know a sequence of bits immediately as 2 symbols and get a good NH' on the data at hand but not a better NH if they had remembered all sequences of that length. If they know the whole message as 1 symbol they already know the message it is 0 in both and the best in both. Being able to utilize the symbol for profit is another matter (low k factor as above). A huge memory will have a cost that raises k. Knowing lots of facts but not being able to know how to use them is a high k (no profit). Facts without profit are meaningless. You might as well read data and just store it.

Maybe "patterns" or "symbols" or "microstates" to be recognized are the ones that lead to profit. You run scenarios and look at the output. That requires patterns as constraints and capabilities. Or look at profit patterns and reverse-invent scenarios of possible patterns under the constraints.

Someone who knows many symbols of a Rubik's cube is one who recognizes the patterns that lead to the quickest profit. The current state plus the procedure to take is a pattern that leads to profit. The goal is speed. Energy losses for computation and turn cost are not relevant. They will have the lowest NH', so H' is relevant.
================
Thanks for the link to indistinguishable particles. The clearest explanation seems to be here, [[Gibbs_paradox#The_mixing_paradox|the mixing paradox]]. The idea is this: if we need to know the kTNH energy required (think NH for a given kT) to return to the initial state at the level 010, 100, 001 with correct sequence from a certain final sequence, then we need do the microstates at that low level. Going the other way, "my" method should be mathematically the same as "yours" if it is required to NOT specify the exact initial and final sequences, since those were implicitly not measured. Measuring the initial state sequences without the final state sequences would be changing the base of the logarithm mid-stream. H is in units of true entropy per symbol when the base of logarithm is equal to the number of distinct symbols. In this way H always varies from 0 to 1 for all objects and data packets, giving a true disorder (entropy) per symbol (particle). You multiply by ln(2)/ln(n) to change base from 2 to n symbols. Therefore the ultimate objective entropy (disorder or information) in all systems, physical or information, when applied to data that accurately represents the source should be

Entropy = N*(-H) = \sum_i count_i \log_n (N/count_i)

Entropy = sum(count*logn(N/count))

where i=1 to n distinct symbols in data N symbols long. Shannon did not specify which base H uses, so it is a valid H. To convert it to nats of normal physical entropy or entropy in bits, multiply by ln(n) or log2(n). The count/N is inverted to make H positive. In this equation, with the ln(2) conversion factor, this entropy of "data" is physically same as the entropy of "physics" if the symbols are indistinguishable, and we use energy to change the state of our system E=kT*NH where our computer system has a k larger than kb due to inefficiency. Notice that changes in entropy will be the same without regard to k, which seems to explain why ultimately distinguishable states get away with using higher-level microstates definitions that are different with different absolute entropy. For thermo, kb is what appears to have fixed not caring about the deeper states that were ultimately distinguishable.

In the equation above, there is a penalty if you chose a larger symbol set. Maybe that accounts for the extra memory space required to define symbols.

The best wiki articles are going to be like this: you derive what is the simplest but perfectly accurate view, then find the sources using that conclusion to justifyits inclusion.
So if particles (symbols) are distinguishable and we use that level of distinguishability, the count at the 010 level has to be used. Knowing the sequence means knowing EACH particle's energy. The "byte-position" in a sequence of bits represents WHICH particle. This is not mere symbolism because the byte positions on a computer have a physical location in a volume, so that memory core and CPU entropy changes are exactly the physical entropy changes if they are at 100% efficiency ([[Landauer's principle]]). (BTW the isotope method won't work better than different molecules because it has more mass. This does not affect temperature, but it affects pressure, which means the count has to be different so that pressure is the same. So if you do not do anything that changes P/n in PV=nRT, using different gases will have no effect to your measured CHANGE in entropy, and you will not know if they mixed or not. ) [[User:Ywaz|Ywaz]] ([[User talk:Ywaz|talk]]) 11:48, 10 December 2015 (UTC)

By using indistinguishable states, physics seems to be using a non-fundamental set of symbols, which allows it to define states that work in terms of energy and volume as long as kb is used. The ultimate, as far as physicists might know, might be phase space (momentum and position) as well as spin, charge, potential energy and whatever else. Momentum and position per particle are 9 more variables because unlike energy momentum is a 6D vector (including angular), and a precise description of the "state" of a system would mean which particle has the quantities matters, not just the total. Thermo gets away with just assigning states based on internal energy and volume, each per particle. I do not see kb in the ultimate quantum description of entropy unless they are trying to bring it back out in terms of thermo. If charge, spin, and particles are made up of even smaller distinguishable things, it might be turtles all the way down, in which case, defining physical entropy as well as information entropy in the base of the number of symbols used (our available knowledge) might be best.

I didn't make it up. It's normally called normalized entropy, although they normally refer to this H with logn "as normalized entropy" when according to Shannon they should say "per symbol" and use NH to call it an entropy. I'm saying there's a serious objectivity to it that I did not realize until reading about indistinguishable states.
I hope you agree "entropy/symbol" is a number that should describe a certain variation in a probability distribution, and that if a set of n symbols were made of continuous p's, then a set of m symbols should have the same continuous distribution. But you can't do that (get the same entropy number) for the exact same "extrapolated" probability distributions if they use a differing number of symbols. You have to let the log base equal the number of symbols. I'll get back to the issue of more symbols having a "higher resolution". The point is that any set of symbols can have the same H and have the same continuous distribution if extrapolated.
If you pick a base like 2, you are throwing in an arbitrary element, and then have to call it (by Shannon's own words) "bits/symbol" instead of "entropy/symbol". Normalized entropy makes sense because of the following
entropy in bits/symbol = log2(2^("avg" entropy variation/symbol))
entropy per symbol = logn(n^("avg" entropy variation/symbol))
The equation I gave above is the normalized entropy that gives this 2nd result.
Previously we showed for a message of N bits, NH=MH' if the bits are converted to bytes and you calculate H' based on the byte symbols using the same log base as the bits, and if the bits were independent. M = number of byte symbols = N/8. This is fine for digital systems that have to use a certain amount of energy per bit. But what if energy is per symbol? We would want NH = M/8*H' because the byte system used 8 fewer symbols.
By using log base n, H=H' for any set of symbols with the same probability distribution, and N*H=M/8*H.
Bytes can take on an infinite number of different p distributions for the same H value, whereas bits are restricted to a certain pair of values for p0 and p1 (or p1 and p0) for a certain H, since p0=1-p0. So bytes have more specificity, that could allow for higher compression or describing things like 6-vector momentum instead of just a single scalar for energy, using the same number of SYMBOLS. The normalized entropy allows them to have the same H to get the same kTNH energy without going through contortions. So for N particles let's say bits are being used to describe each one's energy with entropy/particle H, and bytes are used to described their momentums with entropy/particle H'. Momentums uniquely describe the energy (but not vice versa). NH=NH'. And our independent property does not appear to be needed: H' can take on a specific values of p's that satisfy H=H', not some sort of average of those sets. Our previous method of NH=MH' is not as nice, violating Occam's razor.

entropy and kolmogorov complexity
== Shannon entropy of a universal measure of Kolmogorov complexity ==
What is the relation of Kolmogorov complexity to [[Information theory]]? It seems very close to Shannon entropy, in that both are maximised by randomness, although one seems to deal more with the message than the context. [[User:Cesiumfrog|Cesiumfrog]] ([[User talk:Cesiumfrog|talk]]) 00:03, 14 October 2010 (UTC)
:Shannon entropy is a statistical measure applied to data which shows how efficiently the symbols are transferring bits based solely on how many times the symbols occur relative to each other. It's like the dumbest, simplest way of looking for patterns in data, which is ironically why it is useful. Its biggest most basic use is for taking data with a large number of different symbols and stating how many bits are needed to say the same thing without doing any compression on the data.
:Kolmogorov complexity is the absolute extreme in the direction Shannon entropy takes step 0. It is the highest possible level of intelligent compression of a data set. It is not computable in most cases, it's just a theoretical idea. But Shannon entropy is always computable and blind. By "compression" I mean an algorithm has been added to the data, and the data reduced to a much smaller set that is the input to the algorithm to generate the original data. Kolmogorov is the combination in bits of the program plus the smaller data set.
:I would use Shannon entropy to help determine the "real" length in bits of a program that is proposing to be better than others at get close to the idealized Kolmogorov complexity. Competing programs might be using a different or larger sets of functions, which I would assign a symbol to. Programs that use a larger set of functions would be penalized when the Shannon entropy measure is applied. There are a lot of problems with this staring point that I'll get to, but I would assign a symbol to each "function" like a=next line, b=*, e=OR, f=space between arguments, and so on. Numbers would remain number symbols 1,2,3, because they are already efficiently encoded. Then the Shannon entropy (in bits) and therefore the "reduced" level attempted Kolgomorov complexity (Shannon entropy in bits) is
: N*H = - N \sum_i f_i \log_2 (f_i) = \sum_i count_i \log_2 (N/count_i)
:where i=1 to n is for each of n distinct symbols, f_i is "frequency of symbol i occurring in program", and count_i is the number of times symbol i occurs in the program that is N symbols long. This is the Shannon bits (the unit is called "shannons") in the program. The reason this is a better measure of k is because if
:normalized H = \sum_i count_i/N \log_n (N/count_i)
:is < 1, then there is an inefficiency in the way the symbols themselves were used (which has no relevance to the efficiency of the logic of the algorithm) that is more efficient when expressed as bits. Notice the log base is "n" in this 2nd equation. This H is normalized and is the truest entropy per symbol. To calculate things in this base use logn(x) = ln(x)/ln(n)
:But there is a big problem. Suppose you want to define a function to be equal to a longer program and just assign a symbol to it, so the program length is reduced to nearly "1". Or maybe a certain function in a certain language is more efficient for certain problems. So there needs to be a reference "language" (set of allowable functions) to compare one K program to the next. All standard functions have a known optimal expression in Boolean logic: AND, NOT, and OR. So by choosing those 3 as the only functions, any program in any higher-level language can be reduced back to a standard. Going even further, these can be reduced back to a transistor count or to a single universal logic gate like NAND or XOR, or a single universal reversible logic gate like Toffoli or Fredkin. Transistors are just performing a function to, so i guess they are universal to, but I am not sure. The universal gates can be wired to performs all logic operations and are Turing complete. So any program in any language can be re-expressed into a single fundamental function. The function itself will not even need a symbol in the program to identify it (as I'll show), so that the only code in the program is describing the wiring between "addresses". So the complexity describes the complexity of the physical connections, the path the electrons or photons in the program take, without regard to the distance between them. Example: sequence of symbols ABCCDA&CEFA&G-A would mean "perform A NAND B and send output to C, perform C NAND D and send output to A and C, perform E NAND F and send output to A and F, go to A. The would define input and output addresses, I would use something like ABD>(program)>G. The "goto" symbol "-" is not used if the program is expanded to express Levine's modified K complexity which penalizes loop \s by expanding them. It thereby takes into account computational time (and therefore a crude measure of energy energy) as well and a different type of program length.
:The program length by itself could or should be a measure of the computational infrastructure required (measured as energy required to create the transistors), which is another reason it should be in terms of something like a single gate: so it's cost to implement can be measured. What's the cost to build an OR verses a fourier transform? Answer: Count the required transistors or the NAND gates (which are made up of a distinct number of transistors). All basic functions already have had their Boolean logic and transistors reduce to a minimal level so it's not arbitrary or difficult.
:I think this is how you can get a comparable K in all programs in all languages: express them in a single function with distinct symbols representing the input and ouput addresses. Then calculate Shannon's N*H to get its K or Levine complexity in absolute physical terms. Using [[Landauer's principle]], counting the number of times a bit address (at the input or output of a NAND gate) changes state will give the total energy in joules that the program used, Joules=k*T*ln(2)*N where k is larger than Boltzmann's constant kb as a measure of the inefficiency of the particular physical implementation of the NAND gates. If a theoretically reversible gate is used there is theoretically no computational energy loss, except for the input and output changes. [[User:Ywaz|Ywaz]] ([[User talk:Ywaz|talk]]) 22:20, 12 December 2015 (UTC)

========
simplifying ST equation for ideal gas and looking at it from an entropy view.
== Derivation from uncertainty principle and relation to information entropy ==
Volume and energy are not the direct source of the states that generate entropy, so I wanted to express it in terms of x*p/h' number of states for each N. Someone above asked for a derivation from the uncertainty principle ("the U.P.") and he says it's pretty easy. S-T pre-dates U.P., so it may be only for historical reasons that a more efficient derivation is not often seen.

The U.P. says x'p'>h/4pi where x'p' are the standard deviations of x and p. The x and p below are the full range, not the 34.1% of the standard deviation so I multiplied x and p each by 0.341. It is 2p because p could be + or - in x,y, and z. By plugging the variables in and solving, it comes very close to the S-T equation. For Ω/N=1000 this was even more accurate than S-T, 0.4% lower.

Sackur-Tetrode equation:

S = k_B \ln\left(\frac{\Omega^N}{N!}\right) \approx k_B N \left(\ln\left(\frac{\Omega}{N} \right) +1\right)

where

\Omega= \left(\frac{2px}{\hbar/2}\right)^{3} = \left(\frac{2p_x x_x}{\hbar/2}\right)\left(\frac{2p_y x_y}{\hbar/2}\right)\left(\frac{2p_z x_z}{\hbar/2}\right)

\sigma=0.341, x=\sigma V^\frac{1}{3},  p = \sigma\left(\frac{2mU*b}{N!}\right)^\frac{1}{2}, U*b=K.E.=3/2 k_B T

Stirling's approximation N!=(N/e)^N is used in two places that results in a 1/N^(5/2) and and e^(5/2) which is where the 5/2 factor comes from. The molecules' internal energy U is kinetic energy for the monoatomic gas case for which the S-T applies. b=1 for monoatomic, and it may simply be changed for other non-monoatomic gases that have a different K.E./U ratio. The equation for p is the only difficult part of getting from the U.P. to the S-T equation and it is difficult only because the thermodynamic measurements T (kinetic energy per atom) and V are an energy and a distance where the U.P. needs x*p or t*E. This strangeness is where the 3/2 and 5/2 factors come from. The 2m is to get 2*m*1/2*m*V^2 = p^2. Boltzmann's entropy assumes it is a max for the given T, V, and P which I believe means the N's are evenly distributed in x^3 and assumes all are carrying the same magnitude p momentum.

To show how this can come directly from information theory, first remember that Shannon's H function is valid only for a random variable. In this physical case, there are only 2 possible values that each phase space can have: with an atom in it, or not, so it is like a binary file. But unlike normal information entropy, some or many of the atoms may have zero momentum as the others carry more: the total energy just has to be the same. So physical entropy can use anywhere from N to 1 symbols (atoms) to carry the same message (the energy), whereas information entropy is stuck with N. Where information entropy has a log(1/N^N)=N*log(N) factor, physical entropy has log(1/N!)=N*log(N)+N which is a higher entropy. My correction to Shannon's H shown below is known as the sum of the surprisals and is equal to the information (entropy) content in bits. The left sum is the information contributions from the empty phase space slots and the right side are those where an atom occurs. The left sum is about 0.7 without regard to N (1 if ln() had been used) and the right side is about 17*N for a gas at room temperature (for Ω/N ~ 100,000 states/atom):

H*N = \sum_{i=1}^{\Omega-N} \log_2\left(\frac{\Omega}{\Omega-i N/(\Omega-N)}\right) + \sum_{j=0}^{N-1} \log_2\left(\frac{\Omega}{N-j}\right) \approx \log_2\left(\frac{\Omega^N}{N!}\right)

Physical entropy S then comes directly from this Shannon entropy:

S = H*N k_B \ln(2)

A similar procedure can be applied to phonons in solids to go from information theory to physical entropy. For reference here is a 1D oscillator in solids:

S = k_B \left[ \ln \left( \frac{k_B T}{hf}\right) +1 \right]

[[User:Ywaz|Ywaz]] ([[User talk:Ywaz|talk]]) 02:37, 15 January 2016 (UTC)
=============
email to jon bain, NYU 1/17/2016 (this conatins errors. See above for the best stuff..
You have a nice entropy presentation, but I wanted to show you Shannon's H is an "intensive" entropy which is entropy per symbol of a message as he states explicitly in section 1.7 of his paper. This has much more of a direct analogy to Boltzmann's H-theorem that Shannon says was his inspiration.

Specific entropy in physics is S/kg or S/mole which is what both the H's are. Regular extensive entropy S for both Shannon and Boltzmann is:

S=N*H

So conceptually it seems the only difference from Shannon/Boltzmann and any other physical entropy is that Shannon/Boltzmann make use independent (non-interacting) symbols (particles) because they are available. It seems like they could be forced to generalize towards Gibbs or QM entropy as needed, mostly by just changing definitions of the variables (semantics).

Boltzmann's N particles
Let O{N!} = "O!" = # of QM-possible mAcrostates for N at E
Only a portion of N may carry E (and therefore p) and each Ni may not occupy less than (xp/h')^3 phase space:
O{N!} = (2xp/h')^3N = [2x*sqrt(2m*E/N!)/h']^3N , N!=(N/e)^N
2xp because p could be + or -.
ln [ (O{N!})^3 / N! ] = ln [O^3N*(1/e)^(-3/2*N) / (N/e)^N ] = 5/2*N*[ln(O/N)+1]
O= 2xP/N/h' = 2xp/h'
In order to count O possible states I have to consider than some of the Ni's did not have the minimum xp'/h' which is why S-T has error for small N: they can't be zero by QM.
S=k*ln(O{N!}/N!)=5/2*k*N*[ln(O/N) + 1]
results in a constant in the S-T equation: S=kN(ln(O/N) + 5/2)

S= k*ln(Gamma) = k*ln("O!"/N!) = k*5/2*N*(ln(O/N)+1] )
S=k*N*(-1*Oi/N*sum[ln(N/Oi) -const]
S=k*N*(sum[Oi/N*const - Oi/N*ln(N/Oi)]

Shannon: N=number of symbols in the data, O^O=distinct symbols,

S=N*(-sum(O/N*(log(O/N))

So the Shannon entropy appears to me to be Boltzmann's entropy except that particles can be used only once and symbols can be used a lot. It seems to me Shannon entropy math could be applied to Boltzmann's physics (or vice versa) it get the right answer with only a few changes to semantics.

If I sent messages with uniquely-colored marbles from a bag with gaps in the possible time slots, I am not sure there is a difference between Shannon and Boltzmann entropy.

Can the partition function be expressed as a type of N!/O! ?

The partition function's macrostates seems to work backwards to get to Shannon's H because measurable macrostates are the starting point.

The S-T equation for a mono-atomic gas is S=kN*(ln(O/N)+5/2) where O is (2px/h')^3=(kT/h'f)^3 in the V=x^3 volume where h' is an adjusted h-bar to account for p*x in the uncertainty principle being only 1 standard deviation (34.1%) instead of 100% of the possibilities, so h'=h-bar/(0.34.1)^2 and there are no other constants or pi's in S-T equation except for 5/2 which comes from [(xp/h')^3N]/N! = [x*sqrt(2m*E/N!)/h']^3 ~ (sqrt(E/e))^(3N*sqrt(E)) / (N/e)^N = 5/2

k is just a conversion factor from the kinetic energy per N as measured by temperature in order to get it into joules when you use dQ=dS*T. If we measure the kinetic energy of the N's in Joules instead of temp, then k is absent. Of course ln(2) is the other man-made historical difference.

If the logarithm base is made N, then it is a normalized entropy that ranges from 0 to 1 for all systems, giving a universal measure of how close the system is to maximum or minimum entropy.

I think your equations should keep "const" inside the parenthesis like this because it depends on N and I believe it is a simple ratio. I think the const is merely the result of "sampling without replacement" factorials.
=========
they say Boltzmann's entropy is when the system has reached thermal equilibrium so that S=N*H works, or something like that, and that Gibbs is more general where the systems macrostate measurements may have a lower entropy than boltzmann's calculation. I think this is like a message where you have calculated the p's, but it sends you something different. Or that you calculate S=N*H and source's H is different than it turns out to be in the future.
=======
post to rosettacode.org on "entropy" where they give code for H in many languages.

== This is not exactly "entropy". H is in bits/symbol. ==
This article is confusing Shannon entropy with information entropy and incorrectly states Shannon entropy H has units of bits.

There are many problems in applying H= -1*sum(p*log(p)) to a string and calling it the entropy of that string. H is called entropy but its units are bits/symbol, or entropy/symbol if the correct log base is chosen. For example, H of 01 and 011100101010101000011110 are exactly the same "entropy", H=1 bit/symbol, even though the 2nd one obviously carries more information entropy than the 1st. Another problem is that if you simply re-express the same data in hexadecimal, H gives a different answer for the same information entropy. The best and real information entropy of a string is 4) below. Applying 4) to binary data gives the entropy in "bits", but since the data was binary, its units are also a true "entropy" without having to specify "bits" as a unit.

Total entropy (in an arbitrarily chosen log base, which is not the best type of "entropy") for a file is S=N*H where N is the length of the file. Many times in [http://worrydream.com/refs/Shannon%20-%20A%20Mathematical%20Theory%20of%20Communication.pdf Shannon's book] he says H is in units of "bits/symbol", "entopy/symbol", and "information/symbol". Some people don't believe Shannon, so [https://schneider.ncifcrf.gov/ here's a modern respected researcher's home page] that tries to clear the confusion by stating the units out in the open.

Shannon called H "entropy" when he should have said "specific entropy" which is analogous to physics' S⁰ that is on a per kg or per mole basis instead of S. On page 13 of [http://worrydream.com/refs/Shannon%20-%20A%20Mathematical%20Theory%20of%20Communication.pdf Shannon's book], you easily can see Shannon's horrendous error that has resulted in so much confusion. On that page he says H, his "entropy", is in units of "entropy per symbol". This is like saying some function "s" is called "meters" and its results are in "meters/second". He named H after Boltzmann's H-theorem where H is a specific entropy on a per molecule basis. Boltzmann's entropy S = k*N*H = k*ln(states).

There 4 types of entropy of a file of N symbols long with n unique types of symbols:

1) Shannon (specific) entropy '''H = sum(count_i / N * log(N / count_i))'''
where count_i is the number of times symbol i occured in N.
Units are bits/symbol if log is base 2, nats/symbol if natural log.

2) Normalized specific entropy: '''H_n = H / log(n).'''
The division converts the logarithm base of H to n. Units are entropy/symbol. Ranges from 0 to 1. When it is 1 it means each symbol occurred equally often, n/N times. Near 0 means all symbols except 1 occurred only once, and the rest of a very long file was the other symbol. "Log" is in same base as H.

3) Total entropy '''S' = N * H.'''
Units are bits if log is base 2, nats if ln()).

4) Normalized total entropy '''S_n' = N * H / log(n).''' See "gotcha" below in choosing n.
Unit is "entropy". It varies from 0 to N

5) Physical entropy S of a binary file when the data is stored perfectly efficiently (using Landauer's limit): '''S = S' * k_B / log(e)'''

6) Macroscopic information entropy of an ideal gas of N identical molecules in its most likely random state (n=1 and N is known a priori): '''S' = S / k_B / ln(1)''' = k_B*[ln(states^N/N!)] = k_B*N* [ln(states/N)+1].

*Gotcha: a data generator may have the option of using say 256 symbols but only use 200 of those symbols for a set of data. So it becomes a matter of semantics if you chose n=256 or n=200, and neither may work (giving the same entropy when expressed in a different symbol set) because an implicit compression has been applied.

=====
rosetta stone, on the main entropy page:

Calculate the Shannon entropy H of a given input string.

Given the discreet random variable

X

that is a string of

N

"symbols" (total characters) consisting of

n

different characters (n=2 for binary), the Shannon entropy of X in '''bits/symbol''' is :
:

H_2(X) = -\sum_{i=1}^n \frac{count_i}{N} \log_2 \left(\frac{count_i}{N}\right)

where

count_i

is the count of character

n_i

.

For this task, use X="1223334444" as an example. The result should be 1.84644... bits/symbol. This assumes X was a random variable, which may not be the case, or it may depend on the observer.

This coding problem calculates the "specific" or "[[wp:Intensive_and_extensive_properties|intensive]]" entropy that finds its parallel in physics with "specific entropy" S⁰ which is entropy per kg or per mole, not like physical entropy S and therefore not the "information" content of a file. It comes from Boltzmann's H-theorem where

S=k_B  N  H

where N=number of molecules. Boltzmann's H is the same equation as Shannon's H, and it gives the specific entropy H on a "per molecule" basis.

The "total", "absolute", or "[[wp:Intensive_and_extensive_properties|extensive]]" information entropy is
:

S=H_2 N

bits
This is not the entropy being coded here, but it is the closest to physical entropy and a measure of the information content of a string. But it does not look for any patterns that might be available for compression, so it is a very restricted, basic, and certain measure of "information". Every binary file with an equal number of 1's and 0's will have S=N bits. All hex files with equal symbol frequencies will have

S=N \log_2(16)

bits of entropy. The total entropy in bits of the example above is S= 10*18.4644 = 18.4644 bits.

The H function does not look for any patterns in data or check if X was a random variable. For example, X=000000111111 gives the same calculated entropy in all senses as Y=010011100101. For most purposes it is usually more relevant to divide the gzip length by the length of the original data to get an informal measure of how much "order" was in the data.

Two other "entropies" are useful:

Normalized specific entropy:
:

H_n=\frac{H_2 * \log(2)}{\log(n)}

which varies from 0 to 1 and it has units of "entropy/symbol" or just 1/symbol. For this example, H_{n<\sub>= 0.923.}

Normalized total (extensive) entropy:
:

S_n = \frac{H_2 N * \log(2)}{\log(n)}

which varies from 0 to N and does not have units. It is simply the "entropy", but it needs to be called "total normalized extensive entropy" so that it is not confused with Shannon's (specific) entropy or physical entropy. For this example, S_{n<\sub>= 9.23.}

Shannon himself is the reason his "entropy/symbol" H function is very confusingly called "entropy". That's like calling a function that returns a speed a "meter". See section 1.7 of his classic [http://worrydream.com/refs/Shannon%20-%20A%20Mathematical%20Theory%20of%20Communication.pdf A Mathematical Theory of Communication] and search on "per symbol" and "units" to see he always stated his entropy H has units of "bits/symbol" or "entropy/symbol" or "information/symbol". So it is legitimate to say entropy NH is "information".

In keeping with Landauer's limit, the physics entropy generated from erasing N bits is

S = H_2 N k_B \ln(2)

if the bit storage device is perfectly efficient. This can be solved for H₂*N to (arguably) get the number of bits of information that a physical entropy represents.

=========
== From Shannon's H to ideal gas S ==
This is a way I can go from information theory to the Sackur-Tetrode equation by simply using the sum of the surprisals or in a more complicated way by using Shannon's H. It gives the same result and has 0.16% difference from ST for neon at standard conditions.

Assume each of the N atoms can either be still or be moving with the total energy divided among "i" of them. The message the set of atoms send is "the the total kinetic energy in this volume is E". The length of each possible message is the number of moving atoms "i". The number of different symbols they can use is N different energy levels as the number of moving atoms ranges from 1 to N. When fewer atoms are carrying the same total kinetic energy E, they will each have a larger momentum which increases the number of possible states they can have inside the volume in accordance with the uncertainty principle. A complication is that momentum increases as the square root energy and can go in 3 different directions (it's a vector, not just a magnitude), so there is a 3/2 power involved. The information theory entropy S in bits is the sum of the surprisals. The log gives the information each message sends, summed over N messages.

S_2 = \sum_{i=1}^{N} \log_2\left(\frac{\Omega_i}{i} \right)

where

\Omega_i = \Omega_N*\left(\frac{N}{i} \right)^\frac{3}{2}

To convert this to physical entropy, change the base of the logarithm to ln():

S=k_B \ln(2) S_2

You can make the following substitions to get the Sackur-Tetrode equation:

\Omega_N = \left(\frac{xp}{\frac{\hbar}{2\sigma^2}}\right)^3, x=V^\frac{1}{3}, p=\left(2mU/N\right)^\frac{1}{2}, \sigma = 0.341, U/N=\frac{3}{2} k_B T

The probability of encountering an atom in a certain momentum state depends (through the total energy constraint) on the state of the other atoms. So the probability of the states of the individual atoms are not a random variable with regard to the other atoms, so I can't write H as a function of the state of each atom (I can't use S=N*H) directly. But by looking at the math, it seems valid to re-interpret interpret an S=ΩH as an S=N*H. The H inside () is entropy per state. The 1/i makes it entropy per moving atom, then the sum over N gives total entropy. The sum for i over N was for the H, but then the 1/i re-interpreted it to be per atom. So it is an odd type of S=NH. Notice my sum for j does not actually use j as it is just adding the same p_i up for all states.

S_2 = \sum_{i=1}^{N}\left[ \frac{1}{i} *\left( -1*\sum_{j=1}^{\Omega_i} \frac{i}{\Omega_i}\log_2 \frac{i}{\Omega_i}\right)\right] = same =  \sum_{i=1}^{N} \log_2\left(\frac{\Omega_i}{i} \right)

Here's how I used Shannon's H in a more direct but complicated way but got the same result:

There will be N symbols to count in order to measure the Shannon entropy: empty states (there are Ω-N of them), states with a moving atom (i of them), and states with a still atom (N-i). Total energy determines what messages the physical system can send, not the length of the message. It can send many different messages. This is why it's hard to connect information entropy to physical entropy: physical entropy has more freedom than a normal message source. So in this approximation, N messages will have their Shannon H calculated and averaged. Total energy is evenly split among "i" moving atoms, where "i" will vary from 1 to N, giving N messages. The number of phase states (the length of each message) increases as "i" (moving atoms) decreases because each atom has to have more momentum to carry the total energy. The Ω_i states (message length) for a given "i" of moving atoms is a 3/2 power of energy because the uncertainty principle determining number of states is 1D and volume is 3D, and momentum is a square root of energy.

S_2 = \frac{1}{N} \sum_{i=1}^{N} \left[ H_i \Omega_i \right]

Again use

S=k_B \ln(2) S_2

to convert to physical entropy. Shannon's entropy H_i for the 3 symbols is the sum of the probability of encountering an empty state, a moving-atom state, and a "still" atom state. I am employing a cheat beyond the above reasoning by counting only 1/2 the entropy of the empty states. Maybe that's a QM effect.

H_i= -0.5*\frac{\Omega_i - N}{\Omega_i} \log_2\left(\frac{\Omega_i-N}{\Omega_i}\right) -  \frac{i}{\Omega_i} \log_2\left(\frac{i}{\Omega_i}\right) -  \frac{N-i}{\Omega_i} \log_2\left(\frac{N-i}{\Omega_i}\right)

Notice H*Ω simplifies to the count equation programmers use. E=empty state, M=state with moving atom, S=state with still atom.

S_i=H_i \Omega_i= 0.5 E \log_2\left(\frac{\Omega_i}{E}\right) + M\log_2\left(\frac{\Omega_i}{M}\right) + S \log_2\left(\frac{\Omega_i}{S}\right)

By some miracle the above simplifies to the previous equation. I couldn't do it, so I wrote a Perl program to calculate it directly to compare it to the ST equation for neon gas at ambient conditions for a 0.2 micron cube (so small to reduce number of loops to N<1 confirmed="" correct="" entropy="" equation="" is="" mega="" million.="" million="" molar="" nbsp="" official="" s="" st="" standard="" sup="" the="" with="">0

for neon: 146.22 entropy/mole / 6.022E23 * N. It was within 0.20%. I changed P, T, or N by 1/100 and 100x and the difference was from 0.24% and 0.12%. #!/usr/bin/perl
# neon gas entropy by Sackur-Tetrode (ST), sum of surprisals (SS), and Shannon's H (SH)
$T=298; $V=8E-21; $kB=1.381E-23; $m=20.8*1.66E-27; $h=6.6262E-34; $P=101325; # neon, 1 atm, 0.2 micron sides cube
$N = int($P*$V/$kB/$T+0.5); $U=$N*3/2*$kB*$T;
$ST = $kB*$N*(log($V/$N*(4*3.142/3*$m*$U/$N/$h**2)**1.5)+5/2)/log(2.718);
$x = $V**0.33333; $p = (2*$m*$U/$N)**0.5; $O = ($x*$p/($h/(4*3.142*0.341**2)))**3;
for ($i=1;$i<$N;$i++) { $Oi=$O*($N/$i)**1.5;
$SH += 0.5*($Oi-$N)*log($Oi/($Oi-$N)) + $i*log($Oi/$i) + ($N-$i)*log($Oi/($N-$i));
$SS += log($Oi/($i)); }
$SH += 0.5*($O-$N)*log($O/($O-$N)) + $N*log($O/$N); # for $i=$N
$SH = $kB*$SH/log(2.718)/$N;
$SS = $kB*$SS/log(2.718);
print "SH=$SH, SS=$SS, ST=$ST, SH/ST=".$SH/$ST.", N=".int($N).", Omega=".int($O).", O/N=".int($O/$N);
exit;
[[User:Ywaz|Ywaz]] ([[User talk:Ywaz|talk]]) 19:54, 27 January 2016 (UTC)

Economics, A.I., physics, & evolution

Tuesday, November 24, 2015

entropy: relation between physical and information

No comments:

Post a Comment