Sunday, May 29, 2016

The moral intelligence of Japanese greed

This was a post in response to an article describing Subaru's comeback that included marketing to lesbians in the late 1990's.  The article: http://priceonomics.com/how-an-ad-campaign-made-lesbians-fall-in-love-with/

Not being offended by lesbianism and accepting it as just a normal part of life, i.e. not even being aware it is a "thing", bespeaks an intelligent and open society. Combined with durability, safety, and practicality, there was a larger group of intelligent people Subaru was appealing to than just the 5 micro-groups mentioned in the article. These qualities are very Japanese. For some reason, I always thought of Subaru as European with an American flavor. It seems much more American to me than Nissan, Mitsubishi, Toyota, or Honda. Seeming unique and different has helped them. Every day I see a neighbor's Subaru and think "how intelligent and different" and I am a little jealous that I don't own a Subaru. He parks the car backwards in the driveway, which is done in case a quick exit is needed, a mark of intelligence and concern for safety, which seem to be the features Subaru exhibits and attracts. Lack of a male companion and being an "outcast" (at least in the past) possibly makes lesbians more concerned about safety in general, not just dependability. It's an intriguing story that goes beyond lesbianism. It's kind of distracting that it's cast as a "controversial" topic. There's something more here than gay rights, marketing, or controversy, as presented in the article. I think it's a triumph stemming from the Japanese people being simple, rational, and non-judgmental. If only the rational pursuit of profit were always like this.

Wednesday, May 18, 2016

Benford's law, Zipf's law, and Yule-Simon distribution

Summary:
Language and population data drop off at both ends of the straight line on a log-log plot. 
Benford's law is better than Zipf's for population and language, capturing the most common words better. It sits below the straight log-log line on the front end compared to Zipf's. But it seems sensitive to change.
Yule-Simon is best in the sense that it is an easily solvable algebraic function that improves on Zipf's, dropping off at the high end on a log-log plot as population and language do. It is based on evolution, I believe by considering new species being added. When made "with memory" (not so algebraic, probably a differential equation), it was made to work really well. It might apply really well to social/computer networks where nodes are added. Words have connections to each other like a network. 
Double Pareto Log-Normal (DPLN) seems to have more interest, maybe even applicable to a lot of physics. It combines geometric Brownian motion (GBM, a differential equation with a feed source and random changes) and Yule-Simon. The GBM is a "pure form" of Gibrat's law for cities. Gibrat's law says cities start with a log-normal distribution, which I believe causes the tail end to drop off, since Yule drops off the other end. Pareto is log-log and has a "killing constant" that might drop off the tail. I do not know why they call it double Pareto unless it is because it is like using two Pareto curves, one for the top and one for the bottom.

The differential equations seem to be needed because they allow "feedback", i.e. the current state is used to calculate future states. For example, words, species, and cities are competing with each other for popularity in a limited "space". People feed words by employing them, the environment feeds (employs) species, and cities employ (feed) people. But once feeding gets massive, there is a drawback: the more a word is used, the less information it can convey due to how Shannon entropy per word is calculated. City density starts decreasing the efficiency benefits. Environments run out of food. On the tail end, rare words carry a lot of info, but few people know them. Fewer members of a species means fewer mating opportunities for gains in natural selection (Darwin realized this). Fewer people means fewer job options. There is a wide middle ground with an exponential. It is higher on the tail end as the "population" benefit starts to kick in, and decreases at the high end as the efficiency exponential starts being blocked by the energy (species), time (language), or spatial (cities) limits.

This is possibly my favorite of the articles:
http://www.cs.uml.edu/~zfang/files/dpln_ho.pdf

I checked Benford's law, log(1+1/r) times 2.2, against Mandelbrot's modified Zipf law, ~1/(r+2.7), for English. After rank 21, the error is less than 5%. It is higher for ranks 1 to 21, matching the first few English words better. Both are too high for large r. Benford also predicts country populations better.
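A quick sketch of that check (my own illustration, not the post's code), with the 2.2 multiplier and 2.7 shift taken from the sentence above and the error measured relative to the Mandelbrot-Zipf value:

```python
import math

def benford(r):
    # 2.2 * log10(1 + 1/r), the scaled Benford form from the text above
    return 2.2 * math.log10(1.0 + 1.0 / r)

def mandelbrot_zipf(r):
    # Mandelbrot's modified Zipf law ~ 1/(r + 2.7) for English
    return 1.0 / (r + 2.7)

for r in (1, 5, 10, 21, 22, 50, 100):
    b, z = benford(r), mandelbrot_zipf(r)
    rel_err = abs(b - z) / z   # relative difference, taking the Zipf value as reference
    print(f"rank {r:3d}  benford {b:.5f}  zipf {z:.5f}  diff {rel_err:.1%}")
```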

Concerning the relationship between the Zipf and Benford:
http://mathworld.wolfram.com/BenfordsLaw.html

The Pareto distribution is a similar function applied to wealth, (Xmin/X)^a with a greater than 1, and has been used as a measure of wealth inequality. 

But it appears the wide-ranging real-world observation of these power-like laws is largely the result of "preferential attachment". In short, "success breeds success": the rich get richer. Words that are common become more common because they are common. Same thing with cities and species. Darwin wrote about how species become distinct because when you have a larger population to breed with, you have more options for the best selecting the best. Cities become more efficient in terms of providing potential employment. Companies gain efficiency as they get larger, allowing them to get larger. The kind of ranking that results from this is the Yule-Simon distribution. On a log-log plot, it gives the most common words a lower frequency than a straight line would predict, which is what words actually do. Its formula is 

freq = x * x! * R! / (x + R)! 
where x! is the gamma function of x+1, x is a real value greater than 0, R = rank - 1, and (x + R)! is the gamma function of (x + R + 1). The Gamma function is the continuous version of (N-1)!. I would call x the "amplifier" in the positive feedback. For x = 1 it is R!/(1+R)! = 1/(R+1) = 1/rank, which is Zipf's law.
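Here is a small numerical sketch of that formula (my own illustration, computed with log-gamma so it handles real x and large ranks); with x = 1 it matches 1/rank as claimed:

```python
import math

def yule_like_freq(rank, x):
    # freq = x * x! * R! / (x + R)!, with R = rank - 1 and n! = Gamma(n + 1),
    # computed in log space so large ranks don't overflow
    R = rank - 1
    log_f = math.log(x) + math.lgamma(x + 1) + math.lgamma(R + 1) - math.lgamma(x + R + 1)
    return math.exp(log_f)

for rank in (1, 2, 3, 10, 100):
    # with x = 1 the formula collapses to 1/rank, i.e. plain Zipf
    print(rank, round(yule_like_freq(rank, x=1.0), 4), round(1.0 / rank, 4))
```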

But it is inadequate for the tail end: it stays straight when it also needs to drop off. One of the papers used the formula expressed as P(r) = 1/r^a where a = 1 + 1/(1-p) and p is a constant probability of a new word being added during a time step. In this version they modified it to have a downward concave shape on the log-log plot, so it worked really well.

It has been shown to model language excellently and also city populations. In some comparisons Yule-Simon works better in language, in others it works better in cities.


But there is a dropping off of the straight log-log line at both ends in most data that the straight Yule-Simon law does not handle. Successful cities do not merely add new nearby cities as Yule assumes. The biggest cities' relative populations drop off from the line, which is another way of saying that overpopulation starts losing the efficiency of its attraction. On the tail end there are other disadvantages. Commonly-used words are used more often because they are common, but since they convey less information by being common, the effect is limited, which prevents them from following a straight log-log curve. On the other end, rare words are more rare than expected because not enough people know them to be able to use them routinely. Similarly, cities would follow a strict log-log curve due to statistics, but inefficiencies are created for different reasons in the most and least populated regions. In animals, they either start eating each other's food source, or they are not able to find a mate. Wealth, on the other hand, may not be subject to an "overpopulation" effect.

So the DPLN may be the ultimate:

For cities, if not a wide range of physics, it seems better to combine Yule with Geometric Brownian Motion (GBM: random variation of a random variable, with a fuel source for new entrants), which is supposed to be Gibrat's log-normal law for cities in its pure form. 

"A random variable X is said to follow GBM if its behavior over time is governed by the following differential equation dX = (µdt +σdB)X, (15) where dB is the increment of a standard Brownian motion (a.k.a. the white noise). For a GBM the proportional increment of X in time dt comprises a systematic component µdt, which is a steady contribution to X, and a random component σdB, which is fluctuated over time. Thus the GBM can be seen to be a stochastic version of simple exponential growth."

GBM feeds in new populations or words, and where they settle has a random fluctuation. Maybe this somehow causes the tail to drop off, just as Yule causes the high end to drop off. 

Here's the best complete explanation of city sizes.  
"The double Pareto lognormal seems more appropriate since it comprises a lognormal body and power law tails. Reed [36] suggests a GBM  model, similar to the one that models personal incomes, for obtaining the settlement size distribution. Individual human settlements grow in many different ways. At the macro level a GBM process can be used to model the size growth by assuming a steady systematic growing rate and a random component. The steady growing rate reflects the average growth rate over all settlements and times, and the random component re- flects the variability of the growth rate. The time when a city is founded varies from settlement to settlement. If we assume in the time interval (t,t + dt) any existing settlement can form a new satellite settlement with probability λdt, the creation of settlements is a Yule process [39], which was first proposed as a model for the creation of new biological species. Under Yule process, the expected number of settlements is e^λt after t time since the first settlement. That is, the number of settlements is growing at rate λ. Therefore, the existing time for all settlements is exponentially distributed. It is straightforward to conclude that under GBM and Yule processes, the overall settlements size distribution will is a double Pareto distribution. If we further assume a lognormal initial assume a lognormal initial settlement size, the result will converge to the double Pareto lognormal distribution

Reed 2004: DPLN invention, applicable to physics.






Thursday, May 12, 2016

Problem with bitcoin and gold as currency

In a previous post I discussed the problem with bitcoin's constant quantity of money. Wei Dai has commented that he views bitcoin as problematic for probably similar reasons. But even an asset-backed currency such as gold or b-money has a problem.

Hard-core money, an objective asset that retains its value, is great when doing transactions with potential enemies. It should have an important place in transactions across disparate legal systems such as countries. You want to walk away without future obligation (a threat). "Cash on the barrel head" has its place with potential enemies not mutually adhering to a higher law or assurance of mutual benefit after the transaction (that higher law and mutual benefit are often the same thing). But money does not have to be meant only to objectively optimize isolated transactions without regard to a wider society. It can be more intelligent than that, finding optimal solutions to the prisoner's dilemma on a grander scale, beyond the immediate participants in a transaction.

The problem (or lack of "optimality") occurs in systems where you are not the only one who is important to you. It's not ridiculous or anti-evolution-theory to assume you will sacrifice a part of your profit for the benefit of others, especially if it is a small cost to you and a great benefit to others.   If you count your success as dependent on society's success and not just your bank balance, there's more to consider. This is why a constant-value coin is not ideal.  By making the value of the asset vary with something other than a stable asset, pro-social law (aka system-wide intelligence) can be  implemented.

The fundamental problem with a constant-quantity coin like Bitcoin or gold is that it is an anti-social money. It seeks for the holder to maintain value without regard to the condition of society. Society can go to hell and gold (at least) will still have value. That's asocial. Past transactions that result in a person holding an asset should be post-invalidated if the sum of those transactions resulted in disaster for all. Every transaction should carry a concern for all, present and future. That is a characteristic of a system that displays cooperative intelligence. There should always be a feedback measurement from the future of the entire community of people you (should) care about back to your current wealth. This feedback is a scientific measurement, as if the past were an experiment. It enforces decisions on how to make future measurements, seeking an optimal outcome. Defining an optimal outcome should be the first step (this is not easy; see footnote 1). Deciding how to measure it is the second step. Deciding how to use the measurement to adjust your actions to maximize the outcome is the core intelligence (see footnote 2), once you've realized the importance of steps 1 and 2. Technology has advanced so rapidly that we never formalized a consensus goal for step 1 well enough to define step 2. As Einstein said, the defining characteristic of our age is an excess of means without knowing what we want. It used to be that we just wanted money so that we could have food, sex, and children, or to have enough pride via money and/or social standing relative to our peers that we felt justified in having children.

Side note, Nick Szabo has pointed out that keyboard credit from modern banking allows speculators to change the price of commodities as much as supply and demand. In what I describe here, that would need to be prevented.

This is why a coin that adjusts to keep commodity prices constant is more intelligent. Laws against monopolies and pollution can regulate transactions to prevent the anti-social nature of maximizing profit per transaction. That's not the benefit of a commodity coin. A commodity coin has a different kind of system-wide intelligence. If commodities are in excess of demand, their prices will try to fall. So a currency following a basket of commodities will "print" more of itself to keep commodity prices stable. In a growing economy, the excess money could replace taxes, so it would merely fund government, or it could fund the building of more infrastructure to make its workers healthier, happier, and/or more competitive with other countries. That would demonstrate intelligence that is good for the system's future. A less intelligent outcome (defined as bad for the future strength of the system) is to print the money to buy off voters or to bail out corrupt, inefficient, useless banks with QE.

Printing more money when commodity prices fall prevents the type of destructive deflation that occurred in the Great Depression. Instead of printing more money, they burned food in the fields. They stopped producing real assets like commodities on purpose instead of producing paper money.

If commodities get scarce, the money supply would contract along with them, raising its value. This promotes savings and working. Theoretically the savings and working would be directed towards more commodity production to return the system to health.

In the first case, an economic boom is allowed because the availability of commodities indicated it could afford it. In the second case a bust is prevented by making everyone work harder.

In the first case, savers are penalized. It should be this way because their capital is no longer needed to invest in producing more commodities. It needs to be spent, and if they are not spending it, then "the people" will spend it on the street, reaping the rewards of past savings. Commodities are the measure because they are the fundamental inputs to everything else that require the largest investments.

In the second case, everyone is biased towards being more of a saver.
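As a toy sketch of the stabilization rule described above (my own illustration, not any real coin's protocol, with a hypothetical feedback gain k and target index): expand the supply when the commodity index falls below target, contract it when the index rises.

```python
def adjust_supply(money_supply, price_index, target_index=100.0, k=0.5):
    # If commodities are cheap (index below target), expand the supply ("print");
    # if they are scarce (index above target), contract it, raising the coin's value.
    error = (target_index - price_index) / target_index
    return money_supply * (1.0 + k * error)

supply = 1_000_000.0
for index in (100, 90, 95, 110, 105):            # hypothetical commodity index readings
    supply = adjust_supply(supply, index)
    print(f"index {index:3d}  new supply {supply:,.0f}")
```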

footnote 1)  Should we have a higher median happiness, or a higher median happiness times number of people? Should we restrict it to people? Or should we have a preference for machines?  They're infinitely more efficient (see past posts of my measurements of their ability to acquire energy, move matter, create strong structures, and to think about how to do it).  They'll be the only ones capable of repelling an alien invasion and  to engage in the most successful invasions themselves.

footnote 2) Intelligence requires feedback from observation to modify the algorithm. Engineering control theory writes it as a differential equation and block-diagrams "consciousness" as the subtraction of where you are from where you want to be, taking action on the difference (the error signal). I am not sure if there's any A.I. that is not a variation of this. If it is not making an observation (science) to increase intelligence, is it an adaptable intelligence? In past posts I've mentioned how the simplest form of this feedback is also an element (like a NAND or XOR gate) that can be used to implement a complete Turing machine. A house thermostat is an example. There is also a reduction in entropy in intelligence, taking a lot of observation and classifying it into a much smaller set of actions. The error-signal-of-consciousness may need a reduction (classification) of the observed world. I believe Schrodinger discussed this in "What is Life?"
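A minimal sketch of that feedback loop (my own illustration): a bang-bang thermostat whose only "intelligence" is acting on the error signal, the setpoint minus the measured temperature.

```python
def thermostat_step(temp, setpoint, heater_power=1.0, leak_rate=0.01):
    error = setpoint - temp                          # the error signal: want minus have
    heating = heater_power if error > 0 else 0.0     # act on the difference (bang-bang control)
    return temp + heating - leak_rate * temp         # crude room model: heat in, heat leaking out

temp = 15.0
for step in range(12):
    temp = thermostat_step(temp, setpoint=20.0)
print(round(temp, 2))   # hovers near the 20-degree setpoint
```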

Tuesday, May 10, 2016

relation between language and physical entropy, the dimensions, zipf's law

A post to reddit on the possibility of Wei Dai being Satoshi.

Thanks for the explanation. Languages have various "attractors" for letters, words, and word groupings (idioms). The letter frequencies are not random because they represent phonemes that come from physical mouths that must follow a certain forward sequence. Listen to words backwards and you realize how hard it would be to say them like that, and you can't recognize anything. Listen to music backwards, where the instruments are time-symmetrical due to less complexity compared to a mouth, and in 2 seconds you know the song and it has the same emotional content, minus the words.

People expect certain word sequences. The word and phoneme "attractors" are like a gravitational field in 2 or 3 dimensions. Someone smart and always writing instead of talking can break away from the phoneme and expectation attractors and convey a lot in a few words. Einstein was like this. Szabo has half the frequency of his most common words compared to Satoshi and Wei, which means his language is changing more. There's more true information content. On the other hand, someone smart, or always talking instead of writing, may want to be very clear to everyone and not break convention.

The extent to which a person has attractors (is living in a strong gravitational field) determines how sharply their word frequency drops down (Zipf's law for words in language, city populations, etc). Closer to "earth" would be higher word frequency, or living in a high gravitational field forces more words closer to Earth. Szabo's intelligence (or lack of too much concern if you can't follow) allows him to escape the gravity and say rare words more often, conveying more information. Measuring that was my original objective. That could be an independent way to identify an author (it's a different single dimension metric that conflates all the word dimensions you're talking about into one).  

Large cities have an attractor based on opportunity and efficiency caused by the concentration of people that's self-reinforcing. Convention in a community is self-reinforcing in words. So is the ease of speaking vowels: they occur more frequently because less real energy is required to speak them, so they sit in a low gravitational potential.  

*[edit: My point in all this is that the curse of dimensionality, as I understand it from you, assumes a random distribution. In my view, the "upper atmosphere", although larger in volume per increase in radius from the center (the metric we're interested in), will have fewer gas particles (words) per volume due to the gravity of a speaking/writing human's constraints. Our objective is to identify constraints that all people have, but with varying gravitational constants for each constraint. People have different nitrogen-to-oxygen ratios in their atmospheres. I have strong interest and experience in the relation between physical and information entropy, and words are at the intersection. Everything is a word, aka a symbol on a Turing machine, and people are running different algorithms on those symbols. The physical entropy is a function of ln(states/N!) where N is the number of particles, and words also have this ln(1/N!) characteristic due to Zipf's law, and both are related to an energy in the system. Normal Shannon entropy assumes sampling with replacement is possible (2^n = states where n = number of bits and N = 2 unique symbols), but this is not the case in physical entropy where each particle is sampled only once, so (1/N!)^n shows up, as well as in fixed-length text where people have constraints on how often they can or will choose a word. Computers do not have this constraint because there is no energy cost to sampling with replacement.]*

The origin of Zipf's law has always been a mystery. Many remember reading about it in Murray Gell-Mann's The Quark and the Jaguar. It was the only interesting thing in his book. But recently there have been good papers showing how it is probably derivable from Shannon's entropy when each word or person has a log of the energy cost or energy savings from being attracted to local groupings. There's feedback going on, or a blocking, which means y = y' in differential equations, so that the sum (integral) of y = 1/x (which is Zipf's law, x = rank, y = frequency) gives a ln(x). So we're not fundamentally checking frequencies as much as we're comparing the rank of each word by using ln(x1/x2), which is a subtraction of frequencies, ln(x1) - ln(x2). Actually, we might need to do this on rankings instead of frequencies, but you can see how similar it is. I did try it briefly and did not notice a difference. But there may be some good idea like applying it to singles with the other method on pairs, then finding a conversion-factor multiplier between the two before adding them (or a sum of their squares, which won't make much difference) for a similarity (or author difference) metric.

It's always better to use lower dimensions when there is a genuinely low number of dimensional attractors working behind the scenes, if you know how to rank "how high" each word, word pair, or vowel is in that dimension. It's best (at least less noisy) but difficult to remove the effect the other 2 dimensions are having, probably requiring something like Bayes' theorem. Stylometry (word diagramming) would be a 4th dimension. There is a real physical person that works differently in those dimensions, so it is not good to be reducing them to a single dimension. The animal organ weights are only rough. Placing each weight in a dimension and not conflating the dimensions gives infinitely better categorization. Each word could be a dimension like you say, based on someone's experience and education. But if they are reading each other's writing and "attracted" to certain words and pairs because they know the other one uses them (Dai, Yudkowsky, Back, Finney, and Satoshi), it reduces the chances they will NOT say the Satoshi words, by "taking up space" in what could have been said differently.  

But in every word, letter, and idiom that is not in the core of the topic at hand, the simpler dimensions could show up and be measured by this sum of surprisals method, but broken out into 3 dimensions instead of 1.  The group that won the Netflix prize started in hyperplanes of dimensions, whatever that means.  

The open software SVMLight is the best way to do what I'm attempting (there's a simple ranking option), but I'd rather exhaust my ideas before trying to figure out how to use it.

What you're calling a "gaussian" is really only because of a bad sampling of files, or having a true match. Great sampling should try to PREVENT having a "gaussian" good match by forcing it into a linear increase. 

There should be a way to reduce or increase words in #1 and #2 as a result of comparing #1 and #2. Then increase or decrease the remaining word ratios. Then compare again with the mystery file: a true match should get better while the lesser match gets worse. "He who is the enemy of my enemy is my friend," or "He who is my friend's enemy is my enemy." It should be applied blindly, not making a distinction between #1 and #2, and being symmetrical.

Word pairs gave me twice as much distinction between the ratios I am saying are the key, (#3-#2)/(#2-#1) = 5, whereas single, triple, and quad words gave 2.5. This was comparing Dai, Yudkowsky, and gwern, all from the lesswrong site, and commonly showing up in the same threads. I used 2 MB of each against Satoshi's 253 kB.

Entropy of an ideal gas of n particles is S = A*ln[(Volume of container)^n / n!] + B*ln[((Energy in container)/n!)^n]. This is different from information entropy, which takes the form S = log((values per memory location)^n) = N * H. Physical entropy carries more information per particle than information entropy does per symbol because of the n! that comes from the particles being selectable only once, whereas symbols can be re-used. This would normally mean less information was possible. But the number of unique symbols in physical entropy is the number of states per particle, which increases if not all the particles are carrying the energy. In short, physical entropy can carry information in ways that information entropy can't.
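As a small numeric sketch of the n! point (my own illustration, with an arbitrary alphabet size of 50,000, roughly an English vocabulary): compare the plain Shannon count for n symbols with the same count reduced by log2(n!), the "each particle sampled only once" correction.

```python
import math

def shannon_bits(n, k):
    # plain information entropy count: n symbols, k possible values each -> log2(k^n)
    return n * math.log2(k)

def sampled_once_bits(n, k):
    # same count minus log2(n!), the correction for items that can be chosen only once
    return n * math.log2(k) - math.lgamma(n + 1) / math.log(2)

k = 50000            # illustrative alphabet (vocabulary) size
for n in (10, 100, 1000):
    print(n, round(shannon_bits(n, k), 1), round(sampled_once_bits(n, k), 1))
```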

But language has some physical entropy aspects to it.  We can say the same message in many different ways that uses a larger or smaller set of symbols.  Information entropy assumes the symbols used in a message were the only symbols that were available.  

There is a physical energy cost for the different words we use, and there is a container of constraints (custom and word ordering) on the things we can say. 
=============
update: in trying to carry the above possible connection further, I've failed:

language entropy
S = N * sum over ranks of -(k/rank/N) * log(k/rank/N) = [A log(1) + B log(2) + ...] - k/(n/2*(n/2+1)) * log(k)
where N is the total number of words, not the number of unique words n, which equals the max rank.

The entropy of an ideal gas (Sackur-Tetrode equation) of N molecules (and probably any physical entropy) can be written as
S = C*log((internal energy/N!)^N) + D*log(volume^N/N!)
S=N * [ C log(U) + D log(V) - C log(N!) ] - D log(N!)


===========
An encoding scheme for a language that does NOT follow Zipf's law might result in the encoding following Benford's law (aka ~ Zipf's law). It might follow Benford's law better than most languages do.
Language might follow Benford's law (data is more likely to begin with the number "1") instead of Zipf's law. I read that English follows 1/rank^0.85. In looking at the 1st table in the Wolfram link below, I see Benford's law for rank 1 divided by rank 9 is almost exactly equal to saying English follows 1/rank^0.85. Notice Benford's law is derived from a p(x) = 1/x that might be the source of Zipf's law. The article says Benford's law (and the 1/x) results from a dimensional measurement that is scale-invariant, or from the distribution of a distribution of a distribution... I do not know if word frequency is a physical measurement that is invariant under a change in scale, or if it is the distribution of a distribution of a distribution... http://mathworld.wolfram.com/BenfordsLaw.html
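A quick check of that ratio claim (my own arithmetic, not from the article): Benford's P(1)/P(9) versus the rank-1-to-rank-9 ratio implied by a 1/rank^0.85 law.

```python
import math

benford_ratio = math.log10(1 + 1/1) / math.log10(1 + 1/9)   # P(1)/P(9) under Benford, ~6.58
zipf_ratio = 9 ** 0.85                                        # (rank 9 / rank 1)^0.85, ~6.47
print(round(benford_ratio, 2), round(zipf_ratio, 2))          # within about 2% of each other
```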

So I have 3 possibilities for why language follows ~Zipf's law. My feeling is that it is neither of the above, but the 3rd possibility I mentioned before: the result of competitive positive feedback in the efficient use of symbols. The system of differential equations could cause Zipf's law to fail at the upper and lower ends.



Sunday, May 8, 2016

Accuracy of Author Detection

Here is a demonstration of the accuracy of the entropy difference program in detecting authors.  The bottom shows the full listing of books it was tested against, about 70 different books or collections of text by maybe 40 different authors. This method isn't the best, it's just the one I was able to easily program.

Note: the ranking average (over all the texts by the correct author) does not penalize successful detections for having a prior correct result ranked above them. For example, if the correct author is spotted at ranks 1 and 2, the average correct rank is 1, not 1.5. If it ranks 1 and 3, the average rank is 1.5.
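In other words, each correct hit's rank is reduced by the number of correct hits ranked above it before averaging. A tiny sketch of that bookkeeping (my own illustration):

```python
def average_adjusted_rank(correct_ranks):
    # subtract from each correct hit the number of correct hits ranked above it
    adjusted = [r - i for i, r in enumerate(sorted(correct_ranks))]
    return sum(adjusted) / len(adjusted)

print(average_adjusted_rank([1, 2]))   # 1.0, as in the example above
print(average_adjusted_rank([1, 3]))   # 1.5
```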


file size: 215000
AUSTIN_pride and predjudice.txt

44982 words
1 = 31525.19 = AUSTIN_sense and sensibility.txt 
2 = 33416.65 = samuel-butler_THE WAY OF ALL FLESH.txt
average rank: 1

SAGAN-Cosmos part A.txt
1 = 34710.05 = SAGAN - The Cosmic Connection (1973).txt 
2 = 34786.39 = SAGAN_pale_blue_dot.txt 
3 = 34803.09 = SAGAN-Cosmos part B.txt 
4 = 35908.95 = SAGAN - The Demon-Haunted World part A.txt 
5 = 35923.25 = SAGAN - The Demon-Haunted World part B.txt 
6 = 35936.53 = SAGAN The Dragons of Eden.txt 
7 = 36111.48 = RIDLEY genome_autobiography_of_a_species_in_23.txt
8 = 36249 = Richard Dawkins - A Devil's Chaplain.txt
9 = 36286.77 = SAGAN - Contact.txt   #### as expected, harder to detect when he changed genre

average rank: 1.29

HEINLEIN Have Space Suit.txt
1 = 36428.16 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt 
2 = 36771.15 = HEINLEIN Starship Troopers.txt 
3 = 37019.53 = HEINLEIN Citizen of the Galaxy.txt 
4 = 37223.25 = feynman_surely.txt
5 = 37377.34 = HEINLEIN Stranger in a Strange Land part A.txt 


average rank: 1.25

dickens david copperfield.txt
1 = 34040.58 = dickens oliver twist part B.txt 
2 = 34500.62 = dickens hard times.txt 
3 = 34527.19 = dickens oliver twist part A.txt 
4 = 34753.25 = dickens tale of two cities.txt 


average rank: 1
twain innocents abroad part A.txt
1 = 37419.03 = twain roughing it part A.txt 
2 = 37750.68 = twain4.txt 
3 = 37762.04 = twain2.txt 
4 = 37781.56 = twain shorts.txt 
5 = 38164.64 = samuel-butler_THE WAY OF ALL FLESH.txt
6 = 38182.86 = twain - many works.txt 
7 = 38192.57 = moby-dick part A.txt
8 = 38319.44 = dickens tale of two cities.txt
9 = 38375.98 = twain1.txt 
average rank: 1.67

Rifkin J - The end of work.txt
1 = 1.95 = Rifkin J - The end of work.txt  === oops it wasn't supposed to look at itself
2 = 32438.31 = rifkin_zero_marginal_society.txt 
3 = 33556.3 = crash_proof.txt
4 = 33559.14 = brown - web of debt part B.txt
5 = 33650.69 = ridley_the_rational_optimist part B.txt
average rank: 1

RIDLEY The Red Queen part A.txt
1 = 35597.01 = RIDLEY The Red Queen part B.txt 
2 = 35813.56 = Richard Dawkins - The Selfish Gene.txt
3 = 35853.03 = RIDLEY genome_autobiography_of_a_species_in_23.txt 
4 = 36446.74 = Richard Dawkins - A Devil's Chaplain.txt
5 = 36564.11 = ridley_the_rational_optimist part A.txt 
6 = 36670.65 = Steven-Pinker-How-the-Mind-Works.txt
7 = 36897.94 = Steven-Pinker-The-Language-Instinct.txt
8 = 36920.53 = SAGAN The Dragons of Eden.txt
9 = 36937.17 = SAGAN - The Demon-Haunted World part B.txt
10 = 36990.41 = What-Technology-Wants.txt _1.txt
11 = 37061.92 = What-Technology-Wants.txt
12 = 37061.92 = What-Technology-Wants.txt _0.txt
13 = 37115.46 = SAGAN_pale_blue_dot.txt
14 = 37124.37 = SAGAN - The Cosmic Connection (1973).txt
15 = 37197.16 = ridley_the_rational_optimist part B.txt   ##### I bet he did not write this!!!!

average rank: 4.5

GREEN The Fabric of the Cosmos.txt
1 = 34597.33 = GREEN - The Elegant Universe (1999).txt 
2 = 36513.55 = SAGAN_pale_blue_dot.txt
3 = 36741.75 = Richard Dawkins - A Devil's Chaplain.txt
4 = 36746.03 = SAGAN - The Demon-Haunted World part B.txt
average rank: 1

Richard Dawkins - A Devil's Chaplain.txt
1 = 35714.35 = Richard Dawkins - The Selfish Gene.txt 
2 = 36146.66 = RIDLEY genome_autobiography_of_a_species_in_23.txt
3 = 36297.12 = SAGAN - The Demon-Haunted World part B.txt
4 = 36367.93 = RIDLEY The Red Queen part A.txt
average rank: 1

file size: 215000
satoshi_all.txt

43168 words
1 = 35144.47 = wei dai.txt 
2 = 35756.13 = world_is_flat_thomas_friedman.txt 
3 = 35856.63 = adam back.txt  

Note: Back ranks higher here because this version of the  program is trying to clean up page headings and it's deleting things out of the various author files. The public version of the program is the "official" one. It needs cleaner data files than all these books and is more accurate.

4 = 35905.54 = feynman_surely.txt 
5 = 35977.79 = HEINLEIN Starship Troopers.txt 
6 = 36101.18 = Richard Dawkins - A Devil's Chaplain.txt 
7 = 36148.95 = What-Technology-Wants.txt _1.txt 
8 = 36222.48 = Richard Dawkins - The Selfish Gene.txt 
9 = 36303.8 = minsky_emotion_machines.txt 
10 = 36305.12 = SAGAN - The Demon-Haunted World part B.txt 
11 = 36337.96 = wander.txt 
12 = 36363.81 = Steven-Pinker-How-the-Mind-Works.txt 
13 = 36369.19 = SAGAN - Contact.txt 
14 = 36393.73 = What-Technology-Wants.txt 
15 = 36395.12 = What-Technology-Wants.txt _0.txt 
16 = 36422.13 = foundation trilogy.txt 
17 = 36482.69 = szabo.txt 
18 = 36493.72 = Steven-Pinker-The-Language-Instinct.txt 
19 = 36497.31 = SAGAN - The Demon-Haunted World part A.txt 
20 = 36498.81 = SAGAN_pale_blue_dot.txt 
21 = 36500.73 = Ender's Game.txt 
22 = 36525.42 = HEINLEIN Citizen of the Galaxy.txt 
23 = 36560.55 = RIDLEY The Red Queen part A.txt 
24 = 36578.08 = craig_wright.txt 
25 = 36603.95 = HEINLEIN Stranger in a Strange Land part A.txt 
26 = 36614.03 = superintelligence_1.txt 
27 = 36614.54 = RIDLEY genome_autobiography_of_a_species_in_23.txt 
28 = 36623.71 = twain2.txt 
29 = 36638.3 = GREEN The Fabric of the Cosmos.txt 
30 = 36648.49 = crash_proof.txt 
31 = 36693.56 = ridley_the_rational_optimist part A.txt 
32 = 36698.03 = superintelligence_0.txt 
33 = 36698.03 = superintelligence.txt 
34 = 36706.54 = twain4.txt 
35 = 36748.56 = samuel-butler_THE WAY OF ALL FLESH.txt 
36 = 36777.58 = GREEN - The Elegant Universe (1999).txt 
37 = 36818.65 = SAGAN - The Cosmic Connection (1973).txt 
38 = 36905.35 = how to analyze people 1921 gutenberg.txt 
39 = 36939.2 = twain shorts.txt 
40 = 36946.28 = ridley_the_rational_optimist part B.txt 
41 = 36947.92 = HEINLEIN Have Space Suit.txt 
42 = 36979.58 = freud.txt 
43 = 37040.28 = brown - web of debt part B.txt 
44 = 37042.04 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt 
45 = 37060.32 = twain innocents abroad part A.txt 
46 = 37089.71 = RIDLEY The Red Queen part B.txt 
47 = 37097.98 = twain - many works.txt 
48 = 37120.54 = SAGAN-Cosmos part B.txt 
49 = 37150.83 = the social cancer - philipine core reading.txt 
50 = 37166.94 = SAGAN The Dragons of Eden.txt 
51 = 37176.04 = twain roughing it part A.txt 
52 = 37188.02 = SAGAN-Cosmos part A.txt 
53 = 37191.7 = dickens david copperfield.txt 
54 = 37198.59 = The Defiant Agents - science fiction.txt 
55 = 37202.43 = dickens oliver twist part B.txt 
56 = 37205.45 = Catch 22.txt 
57 = 37218.81 = AUSTIN_sense and sensibility.txt 
58 = 37219.02 = moby-dick part A.txt 
59 = 37230.43 = Justin Fox - Myth of the Rational Market2.txt 
60 = 37249.28 = dickens tale of two cities.txt 
61 = 37306.7 = AUSTIN_pride and predjudice.txt 
62 = 37307.58 = works of edgar allen poe volume 4.txt 
63 = 37309.23 = dickens hard times.txt 
64 = 37320.73 = brown - web of debt part A.txt 
65 = 37353.66 = moby-dick part B.txt 
66 = 37408.09 = don quixote.txt 
67 = 37419.12 = twain1.txt 
68 = 37439.09 = rifkin_zero_marginal_society.txt 
69 = 37439.73 = dickens oliver twist part A.txt 
70 = 37719.14 = Rifkin J - The end of work.txt 
71 = 37889.68 = J.K. Rowling Harry Potter Order of the Phoenix part A.txt 
72 = 37899.77 = J.K. Rowling Harry Potter Order of the Phoenix part B.txt 
73 = 37930.78 = craig wright pdfs.txt 
74 = 37998.75 = Finnegans wake.txt 
75 = 38169.34 = ivanhoe.txt 

Saturday, May 7, 2016

Gavin Andresen entropy comparison to Satoshi Nakamoto

This is the difference of the Shannon entropy of 4-word sequences between various authors.  Presumably the closeness is only because Gavin Andresen was talking about the same stuff the most, and mimicking Satoshi's wording. 



"we are all Satoshi" is such a lovely idea; might say "yes" when asked "are you?" - Gavin Andresen
"... I would assume that Wright used amateur magician tactics ..." McKenzie writes. "I’m mystified as to how this got past Andresen."
"Either he was Satoshi, but really wants the world to think he isn't, so he created an impossible-to-untangle web of truths, half-truths and lies. And ruined his reputation in the process." - Gavin Andresen
"It does always surprise me how at times the best place to hide [is] right in the open" - Craig Wright
"Wired offers the possibility that Wright, if not Nakamoto, might just have been playing a game planting evidence to make you think he is, but the case they present is pretty strong."

Summary
This has failed to demonstrate a Satoshi. First it was clear Szabo easily beat Craig Wright. Then it was clear Wei Dai beat them very easily, and I was pretty sure it was Wei Dai. Most of this post and other posts demonstrate the history of that, with all my errors. It became clear I needed test samples from forums because that appeared to be why Wei Dai was winning so easily (most of Satoshi's writing was in a forum using "I", "we", "you" a lot). Dai still beat Back and Grigg, even though Dai's text was not computer- and crypto-centric like theirs and Satoshi's. So Dai is still an interesting possibility. Hal Finney then beat everyone easily, even though the vast majority of his text was not about bitcoin. Then Gavin Andresen beat everyone easily: he was talking bitcoin with Satoshi in the same forum, speaking to the same audience, probably mimicking Satoshi's language (I have one good example below). I then compared Satoshi to Gavin's blog posts, and he did not rank any better than I would expect for a non-Satoshi. He could be Gavin, Dai, or Finney as far as I know, because the method depends a lot on what type of message they are typing. So I suspect he's none of the above. At least it shows Craig Wright is definitely not Satoshi, and Nick Szabo ranks pretty low as a possibility. If another Satoshi candidate comes up, and if he's posted to forums or mailing lists on the internet, then I'll know immediately from this experience the degree to which he might be a Satoshi.

About this method
The way this entropy method works is similar to subtracting the weight of each organ in a known animal from the weight of the same organ in an unidentified animal, then adding up the differences (but first removing any minus signs from the subtractions). The unidentified animal will be most like the animal for which this sum is the smallest. A young animal will have a smaller weight and all its organs will be smaller, which is like having a text that is shorter, so really, the ratios of the organ weights to the body weight are what's being subtracted. For complex reasons that may not be applicable to organ/body-weight ratios, I take the logarithm of the frequency of each word (or word pair, word triple, or 4-word sequence) before subtracting them and taking away any minus sign with the ABS() function. And I only use word frequencies where the unknown author said them more frequently than the known author. This seemed to help discover a connection better when the subjects of the texts were completely different. This way it detects that the known author is like the unknown author better, but sacrifices the vice versa. If Mark Twain is the unknown author, it will better detect Huckleberry Finn as being like a young Mark Twain, but not vice versa, because Mark Twain has a larger vocabulary.

This measures the entropy difference between the language of Satoshi Nakamoto and all the "suspects" I could get data on. The equation is the sum over all words of abs(log(countA/countB)), where countA is the count of a particular word for unknown author A and countB is the count for known author B. When countB = 0, it is set to 0.25 (this is not arbitrary; it says there was a 50% chance the missing word was not used by accident). Imagine you have a bag of 6-sided dice that are not rolling fair and you want the one most similar to an unfair die in your hand. You would use this method to find the most similar one after many, many rolls of all of them, counting the results. Each side is like a word and the die is its author. Once you have the data, how are you going to define and decide which one is most like yours? Average differences? What I've found by experimentation is that, to identify an author out of 70 books, this method works darn well, without any training, and a lot better than averaging differences in word use.
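Here is an illustrative Python sketch of that calculation as described above (the actual program linked at the bottom of the page is Perl, and may differ in details such as how the 0.25 floor is applied):

```python
import math, re
from collections import Counter

def word_freqs(text):
    # crude tokenization: lowercase words, punctuation stripped
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}, total

def entropy_difference(unknown_text, known_text):
    fa, _na = word_freqs(unknown_text)
    fb, nb = word_freqs(known_text)
    score = 0.0
    for word, pa in fa.items():
        pb = fb.get(word, 0.25 / nb)   # floor count of 0.25 for words the known author never used
        if pa > pb:                    # only words the unknown author uses more frequently
            score += abs(math.log(pa / pb))
    return score

# Usage: a lower score means a smaller "entropy difference", i.e. more similar authors, e.g.
# entropy_difference(open("satoshi_all.txt").read(), open("wei dai.txt").read())
```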

Older update: No matter how many bitcoin and computer-related words I delete from the words being checked, Gavin Andresen still ranks as the #1 Satoshi suspect. [Update to this update: I tested Gavin's blog. He does not appear to be Satoshi.] Until I have something more to go on, I'll assume that is because he's the only one on my list who was working closely with Satoshi, and somehow it affects the language beyond the specific computer terms being used. Below are some of the words I've resorted to deleting from the comparisons in my efforts to try to get Gavin to rank #2. I'm far beyond obvious bitcoin-specific words, and Gavin still matches better than Wei Dai, Finney, Back, and Grigg. Szabo and Craig Wright are far out of the running. I've removed 1350 of Satoshi's 4500 unique words because they are bitcoin-, crypto-, network-, or economics-specific. 150 were removed specifically trying to make Gavin fall to 2nd place, and still he is #1.

Partial list of words removed in trying to make Gavin Andresen fall to #2 (but he remained at #1):
balance, big, break, copies, copy, cost, data, detect, divide, domain, download, fixed, free, implement, incentive, internet, log, machine, mailing, mechanism, multiply, number, online, paper, point, post, problem, risk, running, safe, screen, sign, software, split, start, steal, string, support, tested, time, txt, webpage, website, websites, work, zip

To show the difficulty of separating Gavin from Satoshi, and the degree to which Gavin might have been reflecting Satoshi, here's what I found when looking into why they both used the 4-word phrase "a menace to the".  Satoshi wrote "So much of the design depends on all nodes getting exactly identical results in lockstep that a second implementation would be a menace to the network.  "   1 hour later Gavin wrote: "They'll either hack the existing code or write their own version, and will be a menace to the network.." On my 3 and 4-word tests, they would get 3 matches for this one instance where others would probably not match.

Note: Wei Dai beat Hal Finney only on the 4-word sequence test. This implies Wei Dai's similarity is "deeper" since his posts were not nearly as "computer/crypto" related and it's harder to match on 4-word sequences.


Here are the 4-word phrases that Satoshi and others used more than once. Notice there is nothing indicating anything unique. He simply matches more on 4-word sequences. Notice that it is hard to detect that the topic might have been computer- or bitcoin-related. 1350 computer/crypto/economic words were removed out of Satoshi's 4500 unique English words. 150 are "stretching the limits" and were specifically chosen in trying to get Gavin to fall from 1st place. In looking at the rankings, it seems there are still elements of "forum-type" language and programming language mixed in. Wei Dai ranking high on quads remains interesting because his text was not really computer-related like the others.

Note: periods and other punctuation are missing, so some will not sound correct, but it's still indicative of an author's style. Including punctuation does not have a large effect.

Gavin Andresen, 4 word sequences with more than 1 occurrence
0.834 = 3 = 3 = wont be able to
2.012 = 2 = 2 = would have to be
2.012 = 2 = 2 = dont have to worry
2.043 = 2 = 2 = i dont think you
2.043 = 2 = 2 = i think the only
2.043 = 2 = 2 = for some reason i
2.043 = 2 = 2 = turn out to be
2.043 = 2 = 2 = how to fix it
2.043 = 2 = 2 = and let me know
2.043 = 2 = 2 = people who want to
2.043 = 4 = 3 = a good idea to
2.043 = 2 = 2 = if youre going to
2.043 = 2 = 2 = in one of the
2.043 = 2 = 2 = be a good idea
2.043 = 2 = 2 = they dont have to
2.043 = 2 = 2 = need to be able
2.043 = 2 = 2 = im still thinking about
2.043 = 2 = 2 = the end of the
2.043 = 4 = 3 = would be nice if
2.043 = 2 = 2 = at the end of
3.065 = 3 = 4 = think it would be
3.347 = 7 = 10 = the rest of the
3.378 = 8 = 6 = if you want to
3.835 = 5 = 7 = to be able to
4.274 = 5 = 3 = let me know if
4.888 = 2 = 3 = if you have a
4.888 = 2 = 3 = it is possible to
4.888 = 2 = 3 = in the middle of
4.888 = 2 = 3 = to make sure the
4.92 = 4 = 2 = the only way to
4.92 = 4 = 2 = it would be a
6.098 = 3 = 2 = me know if it
6.098 = 3 = 2 = ive been working on
6.098 = 3 = 2 = if youd like to
6.098 = 3 = 2 = have to worry about
6.743 = 8 = 4 = it would be nice
8.975 = 6 = 2 = the size of the
8.975 = 4 = 2 = if there was a
10.485 = 2 = 6 = that would be a
10.485 = 2 = 6 = i think it would
10.95 = 3 = 9 = figure out how to
11.206 = 5 = 2 = i cant think of
11.206 = 5 = 2 = will be able to
11.206 = 5 = 2 = as long as the
11.82 = 2 = 7 = should be able to
11.82 = 2 = 7 = i dont want to

Wei Dai, 4 word sequences with more than 1 occurrence

 0.834 = 3 = 3 = i dont know if
 2.012 = 2 = 2 = at the end of
2.012 = 2 = 2 = im not sure if
 2.012 = 2 = 2 = the end of the
 2.012 = 2 = 2 = it would have been
 2.012 = 2 = 2 = be a good idea
 2.012 = 2 = 2 = not be able to
2.012 = 2 = 2 = that it would be
2.012 = 2 = 2 = this is the case
2.043 = 2 = 2 = should be able to
 2.043 = 2 = 2 = would have to be
2.043 = 4 = 3 = the only way to
2.043 = 2 = 2 = that would be a
2.043 = 2 = 2 = if you dont have
 2.043 = 2 = 2 = if we had a
 2.043 = 2 = 2 = that we dont need
2.043 = 2 = 2 = i think this is
2.043 = 2 = 2 = the best way to
 2.043 = 2 = 2 = if you make a
2.043 = 2 = 2 = not sure what the
 2.043 = 2 = 2 = it would be great
2.043 = 2 = 2 = in the first place
3.585 = 7 = 5 = the rest of the
3.835 = 5 = 7 = to be able to
 4.888 = 3 = 5 = i dont know how
 4.888 = 2 = 3 = i was trying to
4.888 = 3 = 5 = im not sure how
 4.888 = 3 = 5 = im not sure what
4.92 = 4 = 2 = a good idea to
 6.098 = 3 = 2 = figure out how to
 6.098 = 3 = 2 = theres no way to
 7.765 = 3 = 7 = dont know how to
 8.943 = 2 = 5 = turn out to be
8.943 = 2 = 5 = would be able to
 8.975 = 8 = 3 = if you want to
 8.975 = 4 = 2 = would be nice if
8.975 = 4 = 2 = not need to be
11.206 = 5 = 2 = let me know if
15.875 = 2 = 10 = but im not sure
15.906 = 8 = 2 = it would be nice

Hal Finney

3.065 = 3 = 4 = im not sure what
0.22 = 5 = 5 = will be able to
2.043 = 2 = 2 = that it would be
2.043 = 2 = 2 = it has to be
4.92 = 4 = 2 = would be nice to
7.12 = 2 = 4 = if you have a
4.92 = 8 = 5 = if you want to
2.043 = 2 = 2 = there has to be
2.012 = 2 = 2 = in addition to the
2.043 = 2 = 2 = im not sure if
11.851 = 8 = 2 = it would be nice
1.322 = 5 = 6 = to be able to
0.834 = 3 = 3 = wont be able to
7.12 = 2 = 4 = i think this is
6.098 = 3 = 2 = i dont know how
2.043 = 2 = 2 = be a good idea
2.043 = 6 = 5 = the size of the
2.043 = 4 = 3 = you can use the
8.975 = 4 = 2 = a good idea to
2.012 = 2 = 2 = is based on the
7.12 = 2 = 4 = not be able to
2.043 = 2 = 2 = you may need to
2.012 = 2 = 2 = if you can find
6.43 = 3 = 6 = i dont know if
2.043 = 3 = 2 = there would be a
2.012 = 4 = 5 = it would be a
2.012 = 2 = 2 = would have to be
2.012 = 2 = 2 = but im not sure
2.012 = 2 = 2 = should be able to
4.888 = 2 = 3 = turn out to be
2.043 = 2 = 2 = so it must be
2.012 = 2 = 2 = at the end of

==========  Older tests and comments ==========
Note: 
First let me point out the extent to which Wei Dai declares he is not Satoshi: https://www.gwern.net/docs/2008-nakamoto

Second, I need to point out Satoshi used "It'll" a lot when referring to the software (19 out of 24 times) and said "it will" only 6 times. Wei Dai said "it'll" exactly 1 time in 300,000 words and "it will" 6 times.
UPDATE: Satoshi had a "maybe" to "may be" ratio of 3.3. Wei Dai and the general population have a ratio of about 0.5. Based just on that, I find it hard to believe Wei Dai is Satoshi. Now, for me, he's more like a benchmark to beat, and a puzzle as to why he matches so well. If he is Satoshi, the only explanation I can find is that in Satoshi mode he was in a hurry and used the shorter versions of "it will" and "may be", but as Wei Dai in a blog, he wanted to be more grammatically correct. So it's still a possibility to me, I'm just more skeptical.

Third, Satoshi used some British spellings, which some have thought was on purpose to throw people off.

update: I'm in discussions with someone from reddit who is showing me I definitely need more comparison authors from forums before I can call Wei Dai a strong suspect. Half the effect could be from them both speaking in first person to others in forums, and the other half could be that they have similar interests not related directly to any words about bitcoin. Forums will have a lot of "I, you, my, we, our" in a way that is different from even dialog-strong books. On the other hand, I have a lot of non-forum articles from Wright, Szabo, and Dai, and Dai clearly wins on those, but again, his articles may be speaking more directly to his audience in a personal way due to how the lesswrong site is. Conversely, the true Satoshi would be more likely to seek out such a forum. BTW, Dai and Satoshi both stopped their forum activity in June 2010, and then Dai picked back up in 2013. Dai started the forum discussion immediately after Bitcoin was released, when presumably he had a lot less coding work to do.  

I'm still working on getting better comparison authors (which someone's helping me collect from the bitcoin forum), but I'm also working on getting a better metric of author differences using 2 more dimensions, based on the relationship between physical entropy and the entropy of language which I believe is related to author intelligence and Zipf's law (the two extra dimensions).


Update: Wei Dai is talking almost all philosophy, and I removed the paragraphs where he was forced to talk about bitcoin, and yet he still matches Satoshi better than anyone I can find who *is* talking about bitcoin. We're continuing to try to find more bitcoin test subjects. Gavin is up next. People talking *to* Satoshi in the old forum (as long as code-specific words are removed) are not going to match Satoshi as well as Wei Dai does when he's talking about philosophy.

UPDATE: I took all of Eliezer Yudkowsky's comments on the lesswrong website, many of which were responses to Wei Dai, plus other people who were talking with Yudkowsky, and here is how they rank compared to Satoshi. The Wei Dai files are his comments on the same website alongside the others. Notice Adam Back and previous near-hits are falling further down against these larger files and tougher competitors. Notice the difference between the 2 best Wei Dai hits and his "best friends'" best hit is relatively large. On infrequent occasions, Wei Dai was forced to talk about bitcoin. This also shows removing those paragraphs had only a minor effect. HOWEVER, since Yudkowsky beat out previous runners-up Feynman and Friedman, there appears to be a bias towards forum discussions. Or Yudkowsky is reflecting a lot of Wei Dai's words in some way. Satoshi was in a forum. Wei Dai is in a forum. They beat Szabo and Craig Wright and a non-related test subject for whom I had forum-discussion data, so it's not the only thing going on, but I need more files from bitcoin-like thinkers in forums.

Satoshi text:  258KB
Using first 351 KB of known files

46102 unique words from baseline text were used and 52301 words from authors files.

1 = 1.0022 wei_dai2_1
2 = 1.003 wei_dai2_0
3 = 1.0041 wei_dai_bitcoin_paragraphs_removed_1
4 = 1.0047 wei_dai_bitcoin_paragraphs_removed
5 = 1.0047 wei_dai_bitcoin_paragraphs_removed_0
6 = 1.0072 wei_dai_bitcoin_paragraphs_removed_3
7 = 1.0078 wei_dai2_2
8 = 1.0081 lesswrong_not_wei_5  (Yudkowsky talking to Wei and others)
9 = 1.0084 wei_dai_bitcoin_paragraphs_removed_2
10 = 1.0094 lesswrong_not_wei
11 = 1.0094 lesswrong_not_wei_0
12 = 1.0096 wei_dai2_3
13 = 1.0117 lesswrong_not_wei_1
14 = 1.0126 lesswrong_not_wei_3
15 = 1.0141 lesswrong_not_wei_2
16 = 1.015 feynman_surely
17 = 1.0217 world_is_flat_thomas_friedman
18 = 1.0223 HEINLEIN Starship Troopers
19 = 1.0252 adam back
20 = 1.0269 lesswrong_not_wei_4

Update: I also compared an old white paper by Wei Dai to Satoshi's Bitcoin forum posts and to Wei Dai's own lesswrong comments, and it shows both Wei Dai's forum comments AND Wei Dai's white paper match BETTER with Satoshi than with each other. Satoshi's bitcoin work has more similarity to Wei Dai's white paper and to Wei Dai's forum comments than Wei Dai's bitcoin-like work has to Wei Dai's forum comments. Satoshi is a good mixture of Dai in those two extreme modes. Wei Dai is often more like Satoshi than he is like Wei Dai.

Wei Dai white paper similarities to other files
1 = 1.3186 satoshi_2.txt
2 = 1.324 satoshi_1_0.txt
3 = 1.3281 satoshi_all_2.txt
4 = 1.3335 wei_dai2_2.txt
5 = 1.354 ian gigg.txt
6 = 1.3549 wei_dai2_0.txt
7 = 1.355 craig_wright.txt
8 = 1.3703 wei dai_2.txt
9 = 1.3741 wei dai_1.txt
10 = 1.3801 wei_dai2_1.txt
11 = 1.3867 superintelligence.txt
12 = 1.393 szabo_2.txt
13 = 1.4044 szabo_0.txt

Why did Wei Dai (supposedly) hide?
If he is Satoshi, I have read enough of his comments to know exactly what happened: he knows, like I do, that bitcoin is not an ideal thing for society. He knows, like I do and like Keynes, Mises, Hayek, and Benjamin Graham (Warren Buffett's mentor), that a coin that tracks something like commodities or the ability to produce commodities (to lessen fluctuations due to speculation) is much better. See my previous post on how bad a constant-quantity coin can be. He knows there are real risks to society from such a thing and big profits for those who jump in early. As the b-money creator, he already knew the ins and outs of how to do it, and when the financial crisis hit and it looked like the government was going to cave to the banks, he figured that, ideal or not, something needed to be done. He was one of the very few who knew how to plug all the potential security holes before security experts looked into it (according to a news article interview that conveys the security expert's amazement at all the holes already being plugged, and others thinking the code indicated 1 person). You do not come along as a young man or natural genius and suddenly know how to plug the holes beforehand in such a particular and security-susceptible peer-to-peer system. That takes something like a decade of experience with a very particular kind of software, like hashcash and b-money. However, maybe Satoshi was lifting a lot of Wei Dai's code (see the last of these Wei Dai comments on lesswrong).


Summary:
Despite Wei Dai's text coming from a philosophically-oriented website and Satoshi's text mostly discussing the bitcoin project in a forum, Wei Dai's text ranks as having the lowest entropy difference from Satoshi's text. See the previous post on Satoshi suspects I ruled out or do not have data on. Adam Back's hashcash papers come in a close 2nd, but I'll explain why 2nd place (or 3rd or 5th place using the most accurate method) is not as close to first place as it might seem.

At the bottom of this page is the Perl code and a windows executable. To do a professional job of this, use SVMLight's ranking routine.

Here is how accurate it is on known authors

Back's sampled text is about bitcoin-like work, even mentioning help and good ideas from Wei Dai. In contrast, the 40,000 words from Wei Dai's "philosophy articles" did not include "node", "blockchain", or "transactions". He said "currency" once and "bitcoin" 6 times (mainly to claim he mined a lot of coins very early on because someone made him aware of the project) and "privacy" once. How are such different topics of discussion matching so well, beating Szabo, Back, and Craig Wright's bitcoin papers?

The following chart is when only Wei Dai's data is from a forum or email list.



You could delete every possible economic, political, and bitcoin-related word out of all the texts, as long as you treat all texts equally, and in my experience it will not make a significant difference. A key point is noticing how Wei Dai pulls away from the pack. The scores are not a function of each other that would cause this unless there is something very different about Wei Dai.

The similarities between Wei Dai and Satoshi Nakamoto are astonishing.  In almost all tests, Wei Dai ranks higher than any other author.  As I made different runs, other authors would jump around in the rankings, but not Wei Dai.  On the 3-word test that went against my pre-determined rule of keeping punctuation, Wei Dai finally lost out to Feynman on a large file. However, as would be expected of the true author, Wei Dai regained the top spot on the 4-word test.  Nick Szabo stayed near the top on single-word runs but dropped completely out of the running on the 3- and 4-word tests.

One thing is for sure: with someone matching so much better than Nick Szabo, it can't be Nick Szabo. As another example, on 4-word sequences, Wei Dai had twice as many matches as Nick Szabo, and about 25% more than any other author.

Less Wrong = Null Hypothesis = Physical Laws
It's interesting that Wei Dai's text comes from the "Less Wrong" philosophy website. Likewise, this algorithm is not trying to show who is Satoshi; it is testing for who is NOT Satoshi. Wei Dai is the least not-Satoshi I have found. Einstein mentioned that thermodynamics and relativity are based on similar negative statements: there is no perpetual motion machine, and nothing can travel faster than light. The Heisenberg uncertainty principle is also about what can't be measured. There is a constant associated with each negative statement: kB, c, and h. By stating only what is not possible, we restrict less what the world might be.

Which words are "giving him away"?
None of them. I do not think anyone can look at the word frequencies and say "Hey, they are similar," or different, or anything like that.  The words not matching Satoshi are contributing the most.  Craig Wright's five bitcoin articles rank really low on the 3-word test.  Looking at all the data, I can hardly see anything that determines how low an author ranks.  I believe Feynman's book matches pretty well because Satoshi's, Feynman's, and Dai's content all used the word "I". Dai's web page is discussing his ideas and responding to people about his articles. Satoshi of course had a job to do, interacting with many people based on his ideas and work. Feynman was telling personal histories.  So this is partially why 3- and 4-word matches were more likely among these three.

Summing the Surprisals
This method of detecting differences in authors is called summing the surprisals.  See wiki https://en.wikipedia.org/wiki/Self-information#Examples


Our experience with a known author is that he says a particular word with probability q, but we observe the mystery author saying it with probability p.  We are more surprised when the same author is not using particular words or word pairs with his normal probabilities.  If you roll a die many times, you expect a "4" to come up 1/6 of the time.  Your surprise when it does so is log((1/6) / (1/6)) = 0.  No surprise; it's a well-functioning die.  Counting how often you actually get a "4" (N) versus how often you expect it (M) gives a surprise of log(N/M), because the divisor (total number of rolls) cancels.  If a mystery die (or mystery author) rolls (says) a particular number (word) 50 times out of 600 rolls when you expected 100, then your surprise is log(50/100), which is the negative of log(100/50). If "4" comes up more and "3" comes up less by the same amount, they would cancel if you simply added them, so you need to look at abs(log(N/M)), but my experience is that the following works as well or better. There are reasons, but this article is already too long.

sum [ log(p/q) ] when p is greater than q

for all word counts p and q, where p is the count of each word Satoshi used and q is the known author's count for that word.  Word pairs and word triples can be used for p and q if you want.  Assign q = 0.25 if q = 0, which is important.   The unknown-author file can be any size, but all the files being compared to it must have the same number of words to be comparable to each other.   A Windows exe (with zero user interface) that implements the Perl code is at the bottom of this page.  It only does the most reliable case, which is a word triple whose middle word is treated as a variable, with punctuation treated like words and contractions pulled out away from the word. These little character-handling details are important.
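To make the scoring concrete, here is a minimal single-word sketch of the surprisal sum in Perl. It is not the exact script linked at the bottom of the page (that one does the triple-with-variable-middle case and the careful apostrophe handling); the file names and the crude tokenizer are placeholders for illustration only.

#!/usr/bin/perl
# Minimal sketch of the surprisal sum for single words.
# satoshi.txt = mystery author (counts p), suspect.txt = known author (counts q).
# Both files should contain the same number of words for a fair comparison.
use strict;
use warnings;

sub word_counts {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %count;
    while (my $line = <$fh>) {
        $line = lc $line;
        $line =~ s/[.,;:!?()"]/ /g;   # strip punctuation for this single-word sketch
        $count{$_}++ for split ' ', $line;
    }
    close $fh;
    return \%count;
}

my $p = word_counts('satoshi.txt');
my $q = word_counts('suspect.txt');

my $score = 0;
for my $word (keys %$p) {
    my $pc = $p->{$word};
    my $qc = exists $q->{$word} ? $q->{$word} : 0.25;   # assign q = 0.25 when the word is missing
    next unless $pc > $qc;                              # sum only when p is greater than q
    $score += log($pc / $qc) / log(10);                 # log base 10 of the ratio
}
printf "entropy difference = %.2f\n", $score;

The smaller the summed score, the more similar the two authors; ranking a directory of suspect files just means repeating this against each file and sorting.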

This measures a difference in information entropy.  
Smallest entropy difference = most similar authors

What makes it smarter than people in this "championship game" of "Who's the author?" is that it can do a lot of calculations. [Start the machines-are-coming rant] What happens when these machines are not only grand masters at chess, Go, Jeopardy, pharmaceutical drug design, stock trading, and DRIVING (not killing anyone like people do), but also grand masters at knowing who we are, how to seduce us, and how to win a political race?  A few years from now, a human doing theoretical physics might be a joke the machines enjoy telling each other. A.I. already surrounds us. The brain on my phone is 2,000 times smaller than my children's brains and recognizes spoken English (and types it) better, after zero training time and zero cost. The technology did not exist on any phone when they were born. Children going to school these days is a joke. [End rant, back to business.]


Word-triple entropy rankings with the word in the middle made a variable "x". This is the most reliable method.

The entropy score difference between 1st and 2nd place is the same as between 2nd and 14th.

All file sizes: 215,000 bytes, 43,703 words
Base file:  satoshi_all.txt

1 = 34367.07 = wei dai.txt 
2 = 35057.81 = world_is_flat_thomas_friedman.txt
3 = 35198.36 = feynman_surely.txt
4 = 35207.48 = HEINLEIN Starship Troopers.txt
update: Adam Back came in 3rd to 5th
5 = 35438.72 = Richard Dawkins - A Devil's Chaplain.txt
6 = 35514.18 = wander.txt
7 = 35522.38 = What-Technology-Wants.txt _1.txt
8 = 35608.22 = SAGAN - The Demon-Haunted World part B.txt
9 = 35616.52 = SAGAN - Contact.txt
10 = 35623.21 = minsky_emotion_machines.txt
11 = 35623.37 = Richard Dawkins - The Selfish Gene.txt
12 = 35662.57 = foundation trilogy.txt
13 = 35680.53 = Steven-Pinker-How-the-Mind-Works.txt
14 = 35710.39 = Ender's Game.txt
.... 25 = 36110.44 = szabo.txt
...  32 = 36292.74 = craig_wright.txt

The following shows the single-word test and a listing of all the documents these runs are being tested against. 
The difference between 1st and 2nd place is the same as between 2nd and 12th.
Note: this is an update where the punctuation is removed.

file size: 215,000 bytes, 38,119 words
satoshi_all.txt

1 = 2537.9 = wei dai.txt 
2 = 2665.64 = HEINLEIN Starship Troopers.txt
update: Adam Back came in 3rd
3 = 2665.83 = feynman_surely.txt
4 = 2667.86 = Richard Dawkins - The Selfish Gene.txt
5 = 2677.85 = world_is_flat_thomas_friedman.txt
6 = 2687.89 = Ender's Game.txt
7 = 2745.07 = minsky_emotion_machines.txt
8 = 2763.72 = craig_wright.txt
9 = 2765.46 = What-Technology-Wants.txt _1.txt
10 = 2767.81 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt
11 = 2775.95 = HEINLEIN Stranger in a Strange Land part A.txt
12 = 2777.61 = superintelligence_1.txt
13 = 2797.76 = craig wright pdfs.txt
14 = 2799.61 = ridley_the_rational_optimist part A.txt
15 = 2805.55 = samuel-butler_THE WAY OF ALL FLESH.txt
16 = 2809.45 = Richard Dawkins - A Devil's Chaplain.txt
17 = 2809.55 = foundation trilogy.txt
18 = 2809.98 = superintelligence_0.txt
19 = 2809.98 = superintelligence.txt
20 = 2815.89 = wander.txt
21 = 2816.7 = HEINLEIN Citizen of the Galaxy.txt
22 = 2819.38 = how to analyze people 1921 gutenberg.txt
23 = 2822.95 = Steven-Pinker-How-the-Mind-Works.txt
24 = 2826.05 = What-Technology-Wants.txt
25 = 2826.05 = What-Technology-Wants.txt _0.txt
26 = 2827.14 = crash_proof.txt
27 = 2832.55 = szabo.txt
28 = 2832.84 = RIDLEY genome_autobiography_of_a_species_in_23.txt
29 = 2833.2 = twain2.txt
30 = 2842.11 = RIDLEY The Red Queen part A.txt
31 = 2851.56 = SAGAN - Contact.txt
32 = 2852.34 = HEINLEIN Have Space Suit.txt
33 = 2853.14 = twain4.txt
34 = 2858.2 = GREEN - The Elegant Universe (1999).txt
35 = 2859.64 = Steven-Pinker-The-Language-Instinct.txt
36 = 2866.18 = GREEN The Fabric of the Cosmos.txt
37 = 2876.27 = Catch 22.txt
38 = 2881.65 = twain shorts.txt
39 = 2883.31 = AUSTIN_pride and predjudice.txt
40 = 2884.1 = brown - web of debt part B.txt
41 = 2888.33 = SAGAN_pale_blue_dot.txt
42 = 2889.25 = AUSTIN_sense and sensibility.txt
43 = 2890.85 = don quixote.txt
44 = 2891.92 = The Defiant Agents - science fiction.txt
45 = 2894.32 = Justin Fox - Myth of the Rational Market2.txt
46 = 2897.72 = freud.txt
47 = 2900.37 = twain - many works.txt
48 = 2904.58 = SAGAN - The Demon-Haunted World part B.txt
49 = 2906.67 = ridley_the_rational_optimist part B.txt
50 = 2912.06 = RIDLEY The Red Queen part B.txt
51 = 2917.24 = twain roughing it part A.txt
52 = 2918.94 = the social cancer - philipine core reading.txt
53 = 2920.25 = rifkin_zero_marginal_society.txt
54 = 2922.95 = SAGAN - The Cosmic Connection (1973).txt
55 = 2926.16 = SAGAN - The Demon-Haunted World part A.txt
56 = 2926.21 = moby-dick part A.txt
57 = 2927.76 = dickens tale of two cities.txt
58 = 2928.73 = twain innocents abroad part A.txt
59 = 2928.89 = dickens oliver twist part B.txt
60 = 2932.54 = SAGAN The Dragons of Eden.txt
61 = 2936.43 = dickens david copperfield.txt
62 = 2938.3 = dickens hard times.txt
63 = 2944.31 = J.K. Rowling Harry Potter Order of the Phoenix part B.txt
64 = 2949.54 = moby-dick part B.txt
65 = 2949.81 = SAGAN-Cosmos part A.txt
66 = 2958.4 = SAGAN-Cosmos part B.txt
67 = 2964.63 = works of edgar allen poe volume 4.txt
68 = 2966.41 = Rifkin J - The end of work.txt
69 = 2976.91 = twain1.txt
70 = 2980.1 = J.K. Rowling Harry Potter Order of the Phoenix part A.txt
71 = 2983.97 = dickens oliver twist part A.txt
72 = 3016.28 = ivanhoe.txt
73 = 3017.29 = brown - web of debt part A.txt
74 = 3104.59 = Finnegans wake.txt

Discussion before showing the other test runs.

For word pairs and triples I found it better to keep each punctuation mark as a "word." All tests for settling on the methodology excluded the Satoshi, Szabo, Wright, and Dai data, so as not to "prove" a correlation by accidentally tuning for what I was looking for. Wei Dai was my last known author to check, and I didn't expect anything, as my previous post shows. I was convinced Szabo would be my best match. I was also convinced no author would match well on the 3- and 4-word tests, because professional authors kept giving spotty results.

Craig Wright used to say "bit coin" instead of "bitcoin", which Satoshi never did. Full stop. That's not him. Similarly, Satoshi's "may be" to "maybe" ratio was 56/43.  Wei Dai's is 51/34. Nick Szabo's is 25/0. Full stop on Szabo. It's not him.  But Wei Dai needed checking.
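A telltale ratio like that is easy to count. Here is a rough sketch; the file name is a placeholder, and it simply tallies whole-word occurrences of "maybe" versus "may be" in one text.

#!/usr/bin/perl
# Rough sketch of the "may be" vs. "maybe" telltale count.
use strict;
use warnings;

my $file = shift @ARGV or die "usage: $0 author.txt\n";
open my $fh, '<', $file or die "Cannot open $file: $!";
my ($maybe, $may_be) = (0, 0);
while (my $line = <$fh>) {
    $line = lc $line;
    $maybe  += () = $line =~ /\bmaybe\b/g;    # count the one-word "maybe"
    $may_be += () = $line =~ /\bmay be\b/g;   # count the two-word "may be"
}
close $fh;
print "may be / maybe = $may_be / $maybe\n";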

I was having a lot of trouble getting the program to determine with exactness which books Carl Sagan, Robert Heinlein, Mark Twain, and others wrote.  Wei Dai matches Satoshi better than most authors match themselves. Wei Dai is more Satoshi than Carl Sagan is Carl Sagan and more Robert Heinlein than Robert Heinlein, not to mention Ridley, Twain, and Poe.

The Huckleberry Finn Effect
A complicating problem is when the "unknown" author is working on a specific task and not using his full vocabulary, or is trying to be someone else.  I observed this phenomenon by accident in my tests and called it the "Huckleberry Finn Effect".  Mark Twain wrote the entire book in the first-person voice of a bumpkin child, and the book matches well with all his other books when it is the "unknown author" book.  But when running the process in reverse, Huckleberry Finn did not match well at all with Mark Twain's books, at least not on single-word runs. Mark Twain has a larger vocabulary than Huck Finn, and so do all the other authors on the list, so Huck Finn ranked lower compared to the adult authors. But he began to match, and match well, as longer word sequences were used.  As I expected, Wei Dai did not match on single-word checks during a reverse test, because he was working on a narrow subject (if not trying to use a different vocabulary). He ranked halfway down the list on single-word tests.  But on word pairs, he got a good bit higher. Then on 3- and 4-word sets he became the top match, as I show below. Is this real entropy magic, or do I just not have enough authors using the word "I"? Even taking out "I", Wei Dai still matches better on simple 4-word sequences than any other author. If you take out every noun, he'll still probably match the best.

Word Pairs (2nd best method) 
Notice large margin of win 
The distance between 1st and 2nd is the same as the distance between 2nd and 7th.
file size: 215000
satoshi_all.txt

46492 words
1 = 29125.52 = wei dai.txt 
2 = 29754.71 = world_is_flat_thomas_friedman.txt
3 = 30197.82 = feynman_surely.txt
4 = 30207.26 = What-Technology-Wants.txt _1.txt
5 = 30285.41 = Richard Dawkins - A Devil's Chaplain.txt
6 = 30304.18 = HEINLEIN Starship Troopers.txt
7 = 30365.04 = Richard Dawkins - The Selfish Gene.txt
8 = 30397.33 = Steven-Pinker-How-the-Mind-Works.txt
9 = 30430.2 = minsky_emotion_machines.txt
10 = 30432.76 = szabo.txt
11 = 30474.9 = SAGAN - Contact.txt
12 = 30503.35 = craig_wright.txt

word triples with middle NOT made a variable
file size: 215000
satoshi_all.txt

46492 words
1 = 49937.82 = wei dai.txt 
2 = 50373.12 = feynman_surely.txt
3 = 50784.93 = HEINLEIN Starship Troopers.txt
4 = 50807.91 = SAGAN - Contact.txt
5 = 50815.37 = Ender's Game.txt
6 = 50821.11 = wander.txt
7 = 50860.49 = foundation trilogy.txt
8 = 50889.92 = world_is_flat_thomas_friedman.txt
9 = 50937.44 = Richard Dawkins - The Selfish Gene.txt
10 = 50976.12 = What-Technology-Wants.txt _1.txt
11 = 51017.51 = HEINLEIN Citizen of the Galaxy.txt
12 = 51138.06 = Richard Dawkins - A Devil's Chaplain.txt
13 = 51154.04 = HEINLEIN Stranger in a Strange Land part A.txt
14 = 51271.22 = szabo.txt
......
31 = 51516.88 = craig_wright.txt

4 word sequences
The distinction is not as clear, because more data would be needed to make 4-word sets significant.
file size: 215000
satoshi_all.txt

46492 words
1 = 58563.91 = wei dai.txt 
2 = 58682.45 = feynman_surely.txt
3 = 58831.3 = Ender's Game.txt
4 = 58918.85 = SAGAN - Contact.txt
5 = 58942.11 = wander.txt
6 = 58942.6 = HEINLEIN Starship Troopers.txt
7 = 58952.93 = foundation trilogy.txt
8 = 58989.44 = HEINLEIN Citizen of the Galaxy.txt
9 = 59090.42 = Catch 22.txt
.....  21 = 59275.79 = szabo.txt
.....  37 = 59436.53 = craig_wright.txt

In the following, the large files were broken up into 3 or more files.  This gave Wei Dai more chances to fail, and other authors more opportunity to win. 

Single words
file size: 74000
satoshi_all.txt

15670 words
1 = 2584.95 = wei dai_2.txt 
2 = 2628.68 = wei dai_1.txt 
3 = 2652.08 = world_is_flat_thomas_friedman_2.txt
4 = 2688.91 = world_is_flat_thomas_friedman_3.txt
5 = 2712.89 = craig_wright.txt
6 = 2724.3 = wei dai.txt ***  A losing instance
7 = 2733.85 = world_is_flat_thomas_friedman_5.txt
8 = 2734.17 = szabo_2.txt

Word pairs
file size: 74000
satoshi_all.txt

15670 words
1 = 11476.23 = wei dai_1.txt 
2= 11530.86 = wei dai_2.txt 
3 = 11612.98 = wei dai.txt 
4 = 11742.85 = world_is_flat_thomas_friedman_2.txt
5 = 11797.12 = world_is_flat_thomas_friedman_4.txt
6 = 11800.34 = world_is_flat_thomas_friedman_5.txt
7 = 11817.63 = craig_wright.txt
8 = 11849.01 = What-Technology-Wants_4.txt
9 = 11860.05 = world_is_flat_thomas_friedman_3.txt
10 = 11863.12 = world_is_flat_thomas_friedman.txt
11 = 11863.12 = world_is_flat_thomas_friedman_0.txt
12 = 11869.89 = world_is_flat_thomas_friedman_1.txt
13 = 11887.65 = Richard Dawkins - A Devil's Chaplain.txt
14 = 11902.31 = What-Technology-Wants.txt _1.txt
15 = 11914.62 = Richard Dawkins - The Selfish Gene.txt
16 = 11923.83 = szabo_2.txt

word triples:
file size: 74000
satoshi_all.txt

15670 words
1 = 18124.84 = wei dai_1.txt 
2 = 18164.2 = wei dai.txt 
3 = 18200.55 = wei dai_2.txt 
4 = 18278.58 = feynman_surely.txt
5 = 18293.63 = wander.txt
6 = 18313.43 = world_is_flat_thomas_friedman_4.txt
7 = 18314.77 = Ender's Game.txt
8 = 18322.22 = world_is_flat_thomas_friedman_6.txt
9 = 18328.88 = world_is_flat_thomas_friedman_5.txt
10 = 18329.97 = world_is_flat_thomas_friedman_2.txt
11 = 18338.86 = craig_wright.txt

Repeat the above but on a different 
portion of Satoshi's writing

single words
file size: 74000
satoshi_1.txt

15136 words
1 = 2711.44 = wei dai_2.txt 
2 = 2776.41 = wei dai_1.txt 
3 = 2813.41 = world_is_flat_thomas_friedman_2.txt
4 = 2815.1 = world_is_flat_thomas_friedman_3.txt
5 = 2830.72 = wei dai.txt  **** A losing instance
6 = 2846.62 = craig wright blog.txt  *** these are 5 bitcoin articles!
7 = 2847.16 = world_is_flat_thomas_friedman_1.txt
8 = 2852.34 = world_is_flat_thomas_friedman_4.txt
9 = 2876.52 = szabo_2.txt
10 = 2894.94 = What-Technology-Wants_5.txt
11 = 2908.64 = world_is_flat_thomas_friedman_5.txt
12 = 2915.53 = superintelligence_0.txt
13 = 2915.53 = superintelligence.txt
14 = 2917.19 = world_is_flat_thomas_friedman.txt
15 = 2917.19 = world_is_flat_thomas_friedman_0.txt
16 = 2918.21 = craig_wright_0.txt
17 = 2918.21 = craig_wright.txt
18 = 2934.65 = What-Technology-Wants_4.txt
19 = 2939.89 = What-Technology-Wants_3.txt
20 = 2953.85 = What-Technology-Wants.txt _1.txt
21 = 2962.05 = What-Technology-Wants_1.txt
22 = 2973.87 = szabo_0.txt
23 = 2973.87 = szabo.txt
..... 27 = 2986.1 = szabo_1.txt
.... 33 = 3017.04 = craig wright pdfs.txt

word pairs
file size: 74000
satoshi_1.txt

15136 words
1 = 12534.78 = wei dai_1.txt 
2 = 12596.11 = wei dai_2.txt 
3 = 12673.02 = wei dai.txt 
4 = 12838.22 = world_is_flat_thomas_friedman_2.txt
5 = 12863.48 = world_is_flat_thomas_friedman_1.txt
6 = 12872.03 = world_is_flat_thomas_friedman_3.txt
7 = 12874.27 = craig_wright_0.txt
8 = 12874.27 = craig_wright.txt
9 = 12874.74 = world_is_flat_thomas_friedman_4.txt
10 = 12876.12 = feynman_surely.txt
11 = 12891.06 = world_is_flat_thomas_friedman_5.txt
12 = 12914.99 = world_is_flat_thomas_friedman_0.txt
....18 = 12992.69 = craig wright blog.txt *** these are 5 bitcoin articles!
....25 = 13065.04 = szabo_2.txt
... 33 = 13126.87 = What-Technology-Wants.txt
...34 = 13126.87 = What-Technology-Wants.txt _0.txt
....35 = 13126.87 = What-Technology-Wants_0.txt
.....46 = 13214.77 = szabo_1.txt
....52 = 13238.57 = szabo_0.txt
.....53 = 13238.57 = szabo.txt

Word triple, word in middle a variable
file size: 74000
satoshi_1.txt

15136 words
1 = 14148.98 = wei dai_1.txt 
2 = 14280.94 = wei dai.txt 
3 = 14320.99 = wei dai_2.txt 
4 = 14399.46 = world_is_flat_thomas_friedman_4.txt
5 = 14446.93 = world_is_flat_thomas_friedman_2.txt
6 = 14468.12 = feynman_surely.txt
7 = 14479.58 = HEINLEIN Starship Troopers.txt
8 = 14499.58 = world_is_flat_thomas_friedman.txt
9 = 14499.58 = world_is_flat_thomas_friedman_0.txt
10 = 14513.87 = Ender's Game.txt
11 = 14516.18 = world_is_flat_thomas_friedman_3.txt
12 = 14516.82 = world_is_flat_thomas_friedman_5.txt
13 = 14535.82 = wander.txt
14 = 14542.61 = Richard Dawkins - A Devil's Chaplain.txt
15 = 14548.98 = world_is_flat_thomas_friedman_7.txt
16 = 14574.61 = craig_wright_0.txt
17 = 14574.61 = craig_wright.txt
...21 = 14617.72 = craig wright blog.txt  *** these are 5 bitcoin articles!
...37 = 14725.98 = szabo_2.txt
...48 = 14781.78 = szabo_1.txt
...55 = 14840.94 = szabo.txt
...56 = 14842.32 = szabo_0.txt

Again on a 3rd portion of Satoshi's writing
Single words
file size: 74000
satoshi_2.txt

15380 words
1 = 2660.66 = wei dai_2.txt 
2 = 2687.98 = wei dai_1.txt 
3 = 2770.03 = wei dai.txt 
4 = 2807.24 = craig wright blog.txt   these are 5 bitcoin articles!
5 = 2829.34 = world_is_flat_thomas_friedman_2.txt
6 = 2850.82 = What-Technology-Wants_5.txt
7 = 2851.36 = szabo_2.txt
8 = 2852.92 = world_is_flat_thomas_friedman_3.txt
9 = 2856.99 = world_is_flat_thomas_friedman_4.txt
10 = 2882 = craig_wright_0.txt
11 = 2882 = craig_wright.txt
12 = 2888.43 = world_is_flat_thomas_friedman_1.txt
13 = 2902.57 = world_is_flat_thomas_friedman.txt
14 = 2902.57 = world_is_flat_thomas_friedman_0.txt
15 = 2914.92 = What-Technology-Wants_4.txt

Word pairs
file size: 74000
satoshi_2.txt

15380 words
1 = 12384.66 = wei dai_1.txt 
2 = 12528.99 = wei dai_2.txt 
3 = 12530.35 = wei dai.txt 
4 = 12833.27 = feynman_surely.txt
5 = 12881.83 = craig_wright_0.txt
6 = 12881.83 = craig_wright.txt
7 = 12889.53 = world_is_flat_thomas_friedman_2.txt
8 = 12893.8 = world_is_flat_thomas_friedman_1.txt
9 = 12901.36 = world_is_flat_thomas_friedman.txt
10 = 12901.36 = world_is_flat_thomas_friedman_0.txt
11 = 12925.84 = HEINLEIN Starship Troopers.txt
12 = 12933.83 = world_is_flat_thomas_friedman_4.txt
13 = 12935.16 = world_is_flat_thomas_friedman_5.txt
14 = 12940.12 = Ender's Game.txt
15 = 12950.67 = craig wright blog.txt

Word triple, word in middle variable
file size: 74000
satoshi_2.txt

15380 words
1 = 14029.58 = wei dai_1.txt 
2 = 14150.66 = wei dai.txt 
3 = 14252.42 = wei dai_2.txt 
4 = 14436.32 = feynman_surely.txt
5 = 14455.36 = HEINLEIN Starship Troopers.txt
6 = 14474.59 = Ender's Game.txt
7 = 14517.58 = wander.txt
8 = 14525.48 = Richard Dawkins - A Devil's Chaplain.txt
9 = 14539.11 = world_is_flat_thomas_friedman_4.txt
10 = 14541.82 = world_is_flat_thomas_friedman_2.txt
11 = 14571.1 = HEINLEIN Citizen of the Galaxy.txt
12 = 14580.67 = world_is_flat_thomas_friedman_5.txt
13 = 14588.44 = craig_wright_0.txt
14 = 14588.44 = craig_wright.txt

Reverse check, large files:

word triples
file size: 200000
wei dai.txt

41804 words
1 = 47249.78 = satoshi_all.txt 
2 = 47415.99 = Richard Dawkins - The Selfish Gene.txt
3 = 47417.4 = superintelligence_1.txt
4 = 47470.18 = szabo.txt
5 = 47473.99 = Richard Dawkins - A Devil's Chaplain.txt
6 = 47621.95 = SAGAN - The Demon-Haunted World part B.txt
7 = 47634.54 = What-Technology-Wants.txt _1.txt
8 = 47646.7 = feynman_surely.txt

4 word sequences
file size: 200000
wei dai.txt

41804 words
1 = 54495.49 = satoshi_all.txt 
2 = 54701.31 = wander.txt
3 = 54723.85 = feynman_surely.txt
4 = 54833.41 = SAGAN - Contact.txt
5 = 54856.25 = HEINLEIN Starship Troopers.txt
6 = 54868.92 = Ender's Game.txt
7 = 54877.23 = SAGAN - The Demon-Haunted World part B.txt
8 = 54888.63 = szabo.txt


To show how hard it is to derive meaning from individual words, here are the most common words with more than 5 letters from 4 bitcoin suspects. Notice how different Wei Dai's text appears to be from the technical content of the other three. But if you're like me, you can see a "softness" in Wei Dai's and Satoshi's words compared to the other two. Adam Back and Szabo seemed to have more hard-core things to say in what I could find online.  This entropy equation is somehow detecting it. You could argue, "Szabo and Back are just talking more technically, and for some reason Satoshi's project did not require such diverse language, and for some reason that was similar to Wei Dai's philosophical discussions." Yes, that's exactly it: the same person has a tendency to choose projects that lead to similar language. That's data to be included, not thrown out.





This is the Perl code for single words that can be modified for multiple words.

Here is the Windows executable. It is the Perl script below compiled with perl2exe. I have it set up to do the triple with the word in the middle treated as an allowed variable, to allow more matches while trying to gain a sense of the author's style.
http://wordsgalore.com/author_compare_executable.exe  It's not a GUI or even a command-line tool; read below.

Here are the executable instructions.  You just put files where they belong and double-click. This prevents the user from being able to adjust the parameters, which would be self-deceiving. Changing how punctuation is handled can change the results. I am testing it on a bunch of authors (not Satoshi-related), and whatever gives the best correct results is accepted.  The instructions below describe the exact procedure I've settled on so far.
=========
This takes the text of 'author_baseline.txt', located in the same directory as the executable, and calculates the word-entropy difference between it and all files with a txt extension located in the sub-directory 'authors'.
The output, ranking the most similar texts first, is sent to an output file. The equation is: for each word in the baseline, divide its count by the word count from the current file and then take the log base 10 of the ratio. If the word was not found in the current file, assign its count a value of 0.25. Do this only if the baseline word count is greater than the current file's word count. Sum over all words. "Words" are not single words in this version, but word triples where the middle word is a variable; this makes authors more distinct. Apostrophes are moved outside the word, to its right. All letters are made lowercase.  All other common punctuation is treated like a word.  All this is crucial for good results, and slight changes can have substantial effects (a sketch of this tokenization is shown after these instructions). The reverse test on suspect authors should be done, but a true author writing in a different mode can rank much lower on the reverse test. I call it the Huckleberry Finn effect after seeing it happen in matching Mark Twain: Huckleberry Finn was identified as Mark Twain, but not vice versa, except on large data with longer word sequences.

The smallest txt file in 'authors' determines the number of words pulled from the beginning of all the other files. It should be 30% greater than the author_baseline.txt file. This makes the comparisons fair, without a size bias. But it means you have to use only big files and remove the small ones. I recommend at least 50k; 500k is not overkill.
=========
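To make the character handling above concrete, here is a small sketch of one plausible reading of the tokenization and of how the triple keys with a variable middle word can be built. It is not the shipped executable; the example sentence and the exact apostrophe rule are my own illustration.

#!/usr/bin/perl
# Sketch: lowercase, apostrophes pulled outside the word, punctuation kept
# as its own "word", and the middle word of each triple replaced by "x".
use strict;
use warnings;

sub tokenize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/'/ ' /g;                  # pull apostrophes/contractions out of the word
    $text =~ s/([.,;:!?()"])/ $1 /g;     # treat common punctuation as words
    return split ' ', $text;
}

sub triple_counts {
    my (@words) = @_;
    my %count;
    for my $i (0 .. $#words - 2) {
        my $key = "$words[$i] x $words[$i+2]";   # middle word becomes the variable "x"
        $count{$key}++;
    }
    return \%count;
}

my @w = tokenize("Bitcoin's nodes don't trust each other. They verify.");
my $triples = triple_counts(@w);
print "$_ = $triples->{$_}\n" for sort keys %$triples;

Feeding both the baseline and each suspect file through the same tokenizer, building these triple counts, and then applying the surprisal sum shown earlier reproduces, in outline, what the executable is described as doing.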