A post to reddit on the possibility of Wei Dai being Satoshi.
Thanks for the explanation. Languages have various "attractors" for letters, words, and word groupings (idioms). The letter frequencies are not random because they represent phonemes that come from physical mouths that must follow a certain forward sequence. Listen to speech played backwards and you realize how hard it would be to say it that way, and you can't recognize anything. Listen to music played backwards, where the instruments are more time-symmetrical because they are less complex than a mouth, and within 2 seconds you know the song and it carries the same emotional content, minus the words.
People expect certain word sequences. The word and phoneme "attractors" are like a gravitational field in 2 or 3 dimensions. Someone smart and always writing instead of talking can break away from the phoneme and expectation attractors and convey a lot in a few words. Einstein was like this. Szabo's most common words occur at half the frequency of Satoshi's and Wei's, which means his language is changing more. There's more true information content. On the other hand, someone smart, or someone always talking instead of writing, may want to be very clear to everyone and not break convention.
The extent to which a person has attractors (is living in a strong gravitational field) determines how sharply their word frequency drops off (Zipf's law for words in language, city populations, etc.). Closer to "earth" would be higher word frequency; living in a high gravitational field forces more words closer to Earth. Szabo's intelligence (or lack of too much concern if you can't follow) allows him to escape the gravity and use rare words more often, conveying more information. Measuring that was my original objective. That could be an independent way to identify an author (it's a different single-dimension metric that conflates all the word dimensions you're talking about into one).
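A sketch of one way to measure "how sharply word frequency drops off": fit the slope of log(frequency) vs. log(rank), i.e. the Zipf exponent, for an author's text. The crude tokenizer and the file name are placeholders, not part of the original method; in the post's metaphor, a steeper (more negative) slope would correspond to living in a stronger gravitational field.

```python
import math
import re
from collections import Counter

def zipf_slope(text, top_n=1000):
    """Least-squares slope of log(frequency) vs log(rank) for the top_n words.
    Close to -1 is classic Zipf; a flatter slope means rarer words are used
    relatively more often."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    freqs = [c for _, c in counts.most_common(top_n)]
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

# Usage (file name hypothetical):
# print(zipf_slope(open("szabo.txt", encoding="utf-8").read()))
```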
Large cities have an attractor based on opportunity and efficiency caused by the concentration of people, which is self-reinforcing. Convention in a community is self-reinforcing in words. So is the ease of speaking vowels: they occur more frequently because less real energy is required to speak them, so they sit in a low gravitational potential.
*[edit: My point in all this is that the curse of dimensionality, as I understand it from you, assumes a random distribution. In my view, although the "upper atmosphere" has more volume per increase in radius from the center (the metric we're interested in), there will be fewer gas particles (words) per volume due to the gravity of a speaking/writing human's constraints. Our objective is to identify constraints that all people share but for which each person has a different gravitational constant. People have different nitrogen-to-oxygen ratios in their atmospheres. I have a strong interest and experience in the relation between physical and information entropy, and words are at the intersection. Everything is a word, aka a symbol on a Turing machine, and people are running different algorithms on those symbols. Physical entropy is a function of ln(states/N!) where N is the number of particles, and words also have this ln(1/N!) characteristic due to Zipf's law; both are related to an energy in the system. Normal Shannon entropy assumes sampling with replacement is possible (2^n = states where n = number of bits and N = 2 unique symbols), but this is not the case in physical entropy, where each particle is sampled only once, so the 1/N! shows up, as it does in fixed-length text where people have constraints on how often they can or will choose a word. Computers do not have this constraint because there is no energy cost to sampling with replacement.]*
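To make the contrast in the edit concrete, here it is written out side by side, using V for the number of unique symbols and "states" for states per particle (labels chosen here to match the ideal-gas forms used later in the post, not notation from the original):

```latex
% With replacement (information entropy): n slots, V symbols each
S_{\text{info}} = \log_2\!\left(V^{\,n}\right) = n\log_2 V
  \qquad (V = 2 \text{ gives } 2^{n} \text{ states, i.e. } n \text{ bits})

% Each particle counted only once (physical entropy): the N! divides out permutations
S_{\text{phys}} \propto \ln\!\left(\frac{\text{states}^{\,N}}{N!}\right)
```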
The origin of Zipf's law has always been a mystery. Many remember reading about it in Murray Gell-Mann's The Quark and the Jaguar; it was the only interesting thing in his book. But recently there have been good papers showing how it is probably derivable from Shannon's entropy when each word or person has a log of the energy cost or energy savings from being attracted to local groupings. There's feedback going on, or a blocking, which means y = y' in differential equations, so that the sum (integral) of y = 1/x (which is Zipf's law, x = rank, y = frequency) gives ln(x). So we're not fundamentally checking frequencies as much as we're comparing the rank of each word by using ln(x1/x2), which is a subtraction of frequencies, ln(x1) - ln(x2). Actually, we might need to do this on rankings instead of frequencies, but you can see how similar it is. I did try it briefly and did not notice a difference. But there may be some good idea like applying it to singles with the other method on pairs, then finding a conversion-factor multiplier between the two before adding them (or a sum of their squares, which won't make much difference) for a similarity (or author-difference) metric.
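A minimal sketch of the ln(x1/x2) comparison described above, assuming two plain-text files whose names (authorA.txt, authorB.txt) are placeholders. The whitespace-level tokenizer and the choice to restrict to shared words are my simplifications, so this shows the shape of the calculation rather than a tuned stylometry tool.

```python
import math
import re
from collections import Counter

def word_freqs(path):
    """Return relative word frequencies for a plain-text file."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_ratio_distance(freqs1, freqs2):
    """Sum |ln(f1) - ln(f2)| over words the two texts share.

    This is the ln(x1/x2) = ln(x1) - ln(x2) comparison from the post;
    restricting to shared words avoids ln(0)."""
    shared = freqs1.keys() & freqs2.keys()
    return sum(abs(math.log(freqs1[w]) - math.log(freqs2[w])) for w in shared)

# Example usage (file names are hypothetical):
# a = word_freqs("authorA.txt")
# b = word_freqs("authorB.txt")
# print(log_ratio_distance(a, b))
```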
It's always better to use lower dimensions when there is a genuinely small number of dimensional attractors working behind the scenes, if you know how to rank "how high" each word, word pair, or vowel sits in that dimension. It's best (at least less noisy) but difficult to remove the effect the other 2 dimensions are having, probably requiring something like Bayes' theorem. Stylometry (word diagramming) would be a 4th dimension. There is a real physical person who works differently in those dimensions, so it's not good to reduce them to a single dimension. The animal organ weights are only rough. Placing each weight in its own dimension and not conflating the dimensions gives infinitely better categorization. Each word could be a dimension like you say, based on someone's experience and education. But if they are reading each other's writing and are "attracted" to certain words and pairs because they know the other one uses them (Dai, Yudkowsky, Back, Finney, and Satoshi), it reduces the chances they will NOT say the Satoshi words, by "taking up space" in what could have been said differently.
But in every word, letter, and idiom that is not at the core of the topic at hand, the simpler dimensions could show up and be measured by this sum-of-surprisals method, broken out into 3 dimensions instead of 1. The group that won the Netflix prize started in hyperplanes of dimensions, whatever that means.
The open-source software SVMlight is probably the best way to do what I'm attempting (there's a simple ranking option), but I'd rather exhaust my own ideas before trying to figure out how to use it.
What you're calling a "gaussian" is really only the result of a bad sampling of files, or of having a true match. Good sampling should try to PREVENT a "gaussian" good match by forcing it into a linear increase.
There should be a way to reduce or increase word ratios in #1 and #2 as a result of comparing #1 and #2, then increase or decrease the remaining word ratios, and then compare again with the mystery file; a true match should get better while the weaker match gets worse. "He who is the enemy of my enemy is my friend," or "He who is my friend's enemy is my enemy." It should be applied blindly, not making a distinction between #1 and #2, and be symmetrical.
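One possible reading of that idea, sketched below; the specific weighting scheme (emphasizing words where the two known authors disagree most, symmetrically, before re-comparing each to the mystery text) is my guess and not something the post specifies. It reuses word_freqs() from the earlier sketch.

```python
import math

def reweighted_distance(f1, f2, mystery):
    """Weight each shared word by how strongly it separates author #1 from
    author #2 (symmetric in the two), then compute weighted log-ratio
    distances from each known author to the mystery text."""
    shared = f1.keys() & f2.keys() & mystery.keys()
    weights = {w: abs(math.log(f1[w]) - math.log(f2[w])) for w in shared}
    d1 = sum(weights[w] * abs(math.log(f1[w]) - math.log(mystery[w])) for w in shared)
    d2 = sum(weights[w] * abs(math.log(f2[w]) - math.log(mystery[w])) for w in shared)
    return d1, d2  # the smaller distance is the better match

# Usage (file names hypothetical):
# d1, d2 = reweighted_distance(word_freqs("candidate1.txt"),
#                              word_freqs("candidate2.txt"),
#                              word_freqs("satoshi.txt"))
```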
Word pairs gave me twice as much distinction in the ratio I am saying is the key: (#3-#2)/(#2-#1) = 5, whereas single words, triples, and quads gave 2.5. This was comparing Dai, Yudkowsky, and gwern, all from the LessWrong site and commonly showing up in the same threads. I used 2 MB from each against Satoshi's 253 kB.
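A sketch of how the word-pair (bigram) frequencies and the (#3-#2)/(#2-#1) separation ratio could be computed. The three scores passed in are assumed to be the distances of the three ranked candidates; the example numbers are made up.

```python
import re
from collections import Counter

def bigram_freqs(text):
    """Relative frequencies of adjacent word pairs in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = Counter(zip(words, words[1:]))
    total = sum(pairs.values())
    return {p: c / total for p, c in pairs.items()}

def separation_ratio(scores):
    """(#3 - #2) / (#2 - #1) for three candidate distances, best first after
    sorting. A large ratio means the best candidate stands well apart."""
    s1, s2, s3 = sorted(scores)
    return (s3 - s2) / (s2 - s1)

# e.g. separation_ratio([10.2, 14.8, 37.9]) -> about 5
```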
Entropy of an ideal gas of n particles is S = A*ln[(volume of container)^n/n!] + B*ln[((energy in container)/n!)^n]. This is different from information entropy, which takes the form S = log((values per memory location)^n) = n * H. Physical entropy carries more information per particle than information entropy does per symbol because of the n! that comes from the particles being selectable only once, whereas symbols can be re-used. This would normally mean less information was possible. But the number of unique symbols in physical entropy is the number of states per particle, which increases if not all the particles are carrying the energy. In short, physical entropy can carry information in ways that information entropy can't.
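For reference, Stirling's approximation shows what the 1/n! does to the volume term: it turns the extensive V^n into a per-particle V/n, which is how the ideal-gas entropy is usually quoted. This is standard math, not something added by the post:

```latex
\ln\!\frac{V^{n}}{n!} \;=\; n\ln V - \ln n!
  \;\approx\; n\ln V - n\ln n + n
  \;=\; n\ln\!\frac{V}{n} + n
```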
But language has some physical-entropy aspects to it. We can say the same message in many different ways that use a larger or smaller set of symbols. Information entropy assumes the symbols used in a message were the only symbols that were available.
There is a physical energy cost for the different words we use, and there is a container of constraints (custom and word ordering) on the things we can say.
=============
Update: in trying to carry the above possible connection further, I've failed:
Language entropy:
S = N*sum(-k/rank/N * log(k/rank/N)) = [A log(1) + B log(2) + ...] - k/(n/2*(n/2+1)) * log(k)
where N is the total number of words, not the number of unique words n, which equals the max rank.
The entropy of an ideal gas (Sackur-Tetrode equation) of N molecules (and probably any physical entropy) can be written as
S = C*log((internal energy/N!)^N) + D*log(volume^N/N!)
S=N * [ C log(U) + D log(V) - C log(N!) ] - D log(N!)
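A small numeric check of the language-entropy side: S = N * sum(-p_r * log(p_r)) with p_r proportional to 1/rank over n unique words. Here k is defined directly as the normalization constant of the probabilities, which absorbs the extra 1/N in the expression above; the values of n and N in the example are arbitrary.

```python
import math

def zipf_text_entropy(n_unique, n_total):
    """Entropy (in nats) of n_total words drawn from a Zipf distribution
    with p_r proportional to 1/rank over n_unique ranks."""
    k = 1.0 / sum(1.0 / r for r in range(1, n_unique + 1))  # normalization constant
    probs = [k / r for r in range(1, n_unique + 1)]
    per_word = -sum(p * math.log(p) for p in probs)  # Shannon entropy per word
    return n_total * per_word

# Arbitrary example: 5000 unique words, 100000 total words
print(zipf_text_entropy(5000, 100000))
```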
===========
An encoding of a language that does NOT itself follow Zipf's law might result in the encoding following Benford's law (which is approximately Zipf's law). It might follow Benford's law better than most languages do.
Language might follow Benford's law (data is more likely to begin with the digit "1") instead of Zipf's law. I read that English follows 1/rank^0.85. Looking at the first table in the Wolfram link below, I see that Benford's law for rank 1 divided by rank 9 is almost exactly equal to saying English follows 1/rank^0.85. Notice that Benford's law is derived from a p(x) = 1/x that might be the source of Zipf's law. The article says Benford's law (and the 1/x) results from a dimensional measurement that is scale-invariant, or from the distribution of a distribution of a distribution... I do not know if word frequency is a physical measurement that is invariant under a change of scale, or if it is the distribution of a distribution of a distribution... http://mathworld.wolfram.com/BenfordsLaw.html
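A quick numeric check of that rank-1-to-rank-9 comparison: Benford's law gives P(d) = log10(1 + 1/d), and the claim is that P(1)/P(9) is close to the 1/rank^0.85 ratio, which is 9^0.85.

```python
import math

# Benford's law: probability that a leading digit is d
benford = lambda d: math.log10(1 + 1 / d)

benford_ratio = benford(1) / benford(9)  # ~6.58
zipf_ratio = 9 ** 0.85                   # frequency(rank 1) / frequency(rank 9), ~6.47

print(benford_ratio, zipf_ratio)
```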
So I have 3 possibilities for why language approximately follows Zipf's law. My feeling is that it is neither of the above but the 3rd possibility I mentioned before: the result of competitive positive feedback in the efficient use of symbols. The system of differential equations could cause Zipf's law to fail at the upper and lower ends.