Saturday, May 7, 2016

Gavin Andresen entropy comparison to Satoshi Nakamoto

This chart shows the difference in Shannon entropy of 4-word sequences between Satoshi Nakamoto and various authors. Presumably Gavin Andresen's closeness is only because he was talking about the same topics the most, and mimicking Satoshi's wording.



"we are all Satoshi" is such a lovely idea; might say "yes" when asked "are you?" - Gavin Andresen
"... I would assume that Wright used amateur magician tactics ..." McKenzie writes. "I’m mystified as to how this got past Andresen."
"Either he was Satoshi, but really wants the world to think he isn't, so he created an impossible-to-untangle web of truths, half-truths and lies. And ruined his reputation in the process." - Gavin Andresen
"It does always surprise me how at times the best place to hide [is] right in the open" - Craig Wright
"Wired offers the possibility that Wright, if not Nakamoto, might just have been playing a game planting evidence to make you think he is, but the case they present is pretty strong."

Summary
This has failed to demonstrate a Satoshi. First it was clear Szabo easily beat Craig Wright. Then it was clear Wei Dai beat them very easily, and I was pretty sure it was Wei Dai. Most of this post and other posts demonstrate the history of that, with all my errors. It became clear I needed test samples from forums, because that appeared to be why Wei Dai was winning so easily (most of Satoshi's writing was in a forum, using "I", "we", and "you" a lot). Dai still beat Back and Grigg, even though Dai's text was not computer- and crypto-centric like theirs and Satoshi's. So Dai is still an interesting possibility. Hal Finney then beat everyone easily, even though the vast majority of his text was not about bitcoin. Then Gavin Andresen beat everyone easily: he was talking bitcoin with Satoshi in the same forum, speaking to the same audience, probably mimicking Satoshi's language (I have one good example below). I then compared Satoshi to Gavin's blog posts, and he did not rank any better than I would expect for a non-Satoshi. Satoshi could be Gavin, Dai, or Finney as far as I know, because the method depends a lot on what type of message they are typing. So I suspect he's none of the above. At least it shows Craig Wright is definitely not Satoshi, and Nick Szabo ranks pretty low as a possibility. If another Satoshi candidate comes up, and if he's posted to forums or mailing lists on the internet, then I'll know immediately from this experience the degree to which he might be a Satoshi.

About this method
The way this entropy method works is similar to subtracting the weight of each organ in a known animal from the weight of the same organ in an unidentified animal, then adding up the differences (but first remove any minus signs from the subtractions). The unidentified animal will be most like the animal for which this sum is the smallest. A young animal will have a smaller weight and all its organs will be smaller, which is like having a text that is shorter, so really, the ratios of the organ weights to the body weight are what's being subtracted. For complex reasons, which may not be applicable to organ/body weight ratios, I take the logarithm of the frequency of each word (or word pair, word triple, or 4-word sequence) before subtracting them and taking away any minus sign with the ABS() function. And I only use word frequencies where the unknown author said them more frequently than the known author. This seemed to help discover a connection better when the subjects of the texts were completely different. This way it better detects that the known author is like the unknown author, but sacrifices the vice versa. If Mark Twain is the unknown author, it will better detect Huckleberry Finn as being like a young Mark Twain, but not vice versa, because Mark Twain has a larger vocabulary.

This measures the entropy difference between the language of Satoshi Nakamoto and all the "suspects" I could get data on. The equation is the sum over all words of abs(log(countA/countB)), where countA is the count of a particular word for unknown author A and countB is for known author B. When countB=0, set it to 0.25 (this is not arbitrary, but says there was a 50% chance the missing word was not left out by accident). Imagine you have a bag of 6-sided dice that do not roll fair and you want the one most similar to an unfair die in your hand. You would use this method to find the most similar one after many, many rolls of all of them, counting the results. Each side is like a word, and the die is its author. Once you have the data, how are you going to define and decide which one is most like yours? Average differences? What I've found by experimentation is that to identify an author out of 70 books, this method works darn good, without any training, a lot better than averaging differences in word use.
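The actual implementation is the Perl code at the bottom of the page; as an illustration only, here is a minimal Python sketch of the scoring formula just described (the function and variable names are mine, not from the original code):

```python
from math import log

def entropy_distance(counts_a, counts_b):
    """Sum of abs(log(countA/countB)) over the unknown author's words.

    counts_a: word -> count for unknown author A
    counts_b: word -> count for known author B
    A word B never used gets countB = 0.25, as described above.
    """
    total = 0.0
    for word, count_a in counts_a.items():
        count_b = counts_b.get(word, 0) or 0.25
        total += abs(log(count_a / count_b))
    return total
```

A word both authors use equally often contributes 0, and the suspect with the smallest total is the closest match.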

Older update: No matter how many bitcoin and computer-related words I delete from the words being checked, Gavin Andresen still ranks as the #1 Satoshi suspect. [Update to this update: I tested Gavin's blog. He does not appear to be Satoshi.] Until I have something more to go on, I'll assume that is because he's the only one on my list who was working closely with Satoshi, and somehow it affects the language beyond the specific computer terms being used. Below are some of the words I've resorted to deleting from the comparisons in my efforts to get Gavin to rank #2. I'm far beyond obvious bitcoin-specific words, and Gavin still matches better than Wei Dai, Finney, Back, and Grigg. Szabo and Craig Wright are far out of the running. I've removed 1350 of Satoshi's 4500 unique words because they are bitcoin-, crypto-, network-, or economics-specific. 150 were removed specifically trying to make Gavin fall to 2nd place, and still he is #1.

Partial list of words removed in trying to make Gavin Andresen fall to #2 (but he remained at #1):
balance, big, break, copies, copy, cost, data, detect, divide, domain, download, fixed, free, implement, incentive, internet, log, machine, mailing, mechanism, multiply, number, online, paper, point, post, problem, risk, running, safe, screen, sign, software, split, start, steal, string, support, tested, time, txt, webpage, website, websites, work, zip

To show the difficulty of separating Gavin from Satoshi, and the degree to which Gavin might have been reflecting Satoshi, here's what I found when looking into why they both used the 4-word phrase "a menace to the". Satoshi wrote: "So much of the design depends on all nodes getting exactly identical results in lockstep that a second implementation would be a menace to the network." One hour later Gavin wrote: "They'll either hack the existing code or write their own version, and will be a menace to the network." On my 3- and 4-word tests, they would get 3 matches for this one instance where others would probably not match.
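To see how one shared phrase yields multiple 4-gram matches, here's a small Python sketch (my own illustration, not the post's Perl; punctuation is simply stripped here):

```python
import re
from collections import Counter

def four_grams(text):
    """Lowercase the text, keep only letters/apostrophes, and count
    every consecutive 4-word sequence."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + 4]) for i in range(len(words) - 3))

satoshi = "a second implementation would be a menace to the network"
gavin = "write their own version, and will be a menace to the network"

# The single shared phrase overlaps in three different 4-grams:
shared = four_grams(satoshi).keys() & four_grams(gavin).keys()
# -> ('be','a','menace','to'), ('a','menace','to','the'),
#    ('menace','to','the','network')
```

This is why one borrowed phrase can produce several hits on the 3- and 4-word tests.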

Note: Wei Dai beat Hal Finney only on the 4-word sequence test. This implies Wei Dai's similarity is "deeper" since his posts were not nearly as "computer/crypto" related and it's harder to match on 4-word sequences.


Here are the 4-word phrases that Satoshi and others used more than once. Notice there is nothing indicating anything unique. He simply matches more on 4-word sequences. Notice that it is hard to detect that the topic might have been computer- or bitcoin-related. 1350 computer/crypto/economic words were removed out of Satoshi's 4500 unique English words. 150 of those are "stretching the limits" and were specifically chosen in trying to get Gavin to fall from 1st place. In looking at the rankings, it seems there are still elements of "forum-type" language and programming language mixed in. Wei Dai ranking high on quads remains interesting because his text was not really computer-related like the others.

Note: periods and other punctuation are missing, so some will not sound correct, but it's still indicative of an author's style. Including punctuation does not have a large effect.

Gavin Andresen, 4 word sequences with more than 1 occurrence
0.834 = 3 = 3 = wont be able to
2.012 = 2 = 2 = would have to be
2.012 = 2 = 2 = dont have to worry
2.043 = 2 = 2 = i dont think you
2.043 = 2 = 2 = i think the only
2.043 = 2 = 2 = for some reason i
2.043 = 2 = 2 = turn out to be
2.043 = 2 = 2 = how to fix it
2.043 = 2 = 2 = and let me know
2.043 = 2 = 2 = people who want to
2.043 = 4 = 3 = a good idea to
2.043 = 2 = 2 = if youre going to
2.043 = 2 = 2 = in one of the
2.043 = 2 = 2 = be a good idea
2.043 = 2 = 2 = they dont have to
2.043 = 2 = 2 = need to be able
2.043 = 2 = 2 = im still thinking about
2.043 = 2 = 2 = the end of the
2.043 = 4 = 3 = would be nice if
2.043 = 2 = 2 = at the end of
3.065 = 3 = 4 = think it would be
3.347 = 7 = 10 = the rest of the
3.378 = 8 = 6 = if you want to
3.835 = 5 = 7 = to be able to
4.274 = 5 = 3 = let me know if
4.888 = 2 = 3 = if you have a
4.888 = 2 = 3 = it is possible to
4.888 = 2 = 3 = in the middle of
4.888 = 2 = 3 = to make sure the
4.92 = 4 = 2 = the only way to
4.92 = 4 = 2 = it would be a
6.098 = 3 = 2 = me know if it
6.098 = 3 = 2 = ive been working on
6.098 = 3 = 2 = if youd like to
6.098 = 3 = 2 = have to worry about
6.743 = 8 = 4 = it would be nice
8.975 = 6 = 2 = the size of the
8.975 = 4 = 2 = if there was a
10.485 = 2 = 6 = that would be a
10.485 = 2 = 6 = i think it would
10.95 = 3 = 9 = figure out how to
11.206 = 5 = 2 = i cant think of
11.206 = 5 = 2 = will be able to
11.206 = 5 = 2 = as long as the
11.82 = 2 = 7 = should be able to
11.82 = 2 = 7 = i dont want to

Wei Dai, 4 word sequences with more than 1 occurrence

0.834 = 3 = 3 = i dont know if
2.012 = 2 = 2 = at the end of
2.012 = 2 = 2 = im not sure if
2.012 = 2 = 2 = the end of the
2.012 = 2 = 2 = it would have been
2.012 = 2 = 2 = be a good idea
2.012 = 2 = 2 = not be able to
2.012 = 2 = 2 = that it would be
2.012 = 2 = 2 = this is the case
2.043 = 2 = 2 = should be able to
2.043 = 2 = 2 = would have to be
2.043 = 4 = 3 = the only way to
2.043 = 2 = 2 = that would be a
2.043 = 2 = 2 = if you dont have
2.043 = 2 = 2 = if we had a
2.043 = 2 = 2 = that we dont need
2.043 = 2 = 2 = i think this is
2.043 = 2 = 2 = the best way to
2.043 = 2 = 2 = if you make a
2.043 = 2 = 2 = not sure what the
2.043 = 2 = 2 = it would be great
2.043 = 2 = 2 = in the first place
3.585 = 7 = 5 = the rest of the
3.835 = 5 = 7 = to be able to
4.888 = 3 = 5 = i dont know how
4.888 = 2 = 3 = i was trying to
4.888 = 3 = 5 = im not sure how
4.888 = 3 = 5 = im not sure what
4.92 = 4 = 2 = a good idea to
6.098 = 3 = 2 = figure out how to
6.098 = 3 = 2 = theres no way to
7.765 = 3 = 7 = dont know how to
8.943 = 2 = 5 = turn out to be
8.943 = 2 = 5 = would be able to
8.975 = 8 = 3 = if you want to
8.975 = 4 = 2 = would be nice if
8.975 = 4 = 2 = not need to be
11.206 = 5 = 2 = let me know if
15.875 = 2 = 10 = but im not sure
15.906 = 8 = 2 = it would be nice

Hal Finney, 4 word sequences with more than 1 occurrence

3.065 = 3 = 4 = im not sure what
0.22 = 5 = 5 = will be able to
2.043 = 2 = 2 = that it would be
2.043 = 2 = 2 = it has to be
4.92 = 4 = 2 = would be nice to
7.12 = 2 = 4 = if you have a
4.92 = 8 = 5 = if you want to
2.043 = 2 = 2 = there has to be
2.012 = 2 = 2 = in addition to the
2.043 = 2 = 2 = im not sure if
11.851 = 8 = 2 = it would be nice
1.322 = 5 = 6 = to be able to
0.834 = 3 = 3 = wont be able to
7.12 = 2 = 4 = i think this is
6.098 = 3 = 2 = i dont know how
2.043 = 2 = 2 = be a good idea
2.043 = 6 = 5 = the size of the
2.043 = 4 = 3 = you can use the
8.975 = 4 = 2 = a good idea to
2.012 = 2 = 2 = is based on the
7.12 = 2 = 4 = not be able to
2.043 = 2 = 2 = you may need to
2.012 = 2 = 2 = if you can find
6.43 = 3 = 6 = i dont know if
2.043 = 3 = 2 = there would be a
2.012 = 4 = 5 = it would be a
2.012 = 2 = 2 = would have to be
2.012 = 2 = 2 = but im not sure
2.012 = 2 = 2 = should be able to
4.888 = 2 = 3 = turn out to be
2.043 = 2 = 2 = so it must be
2.012 = 2 = 2 = at the end of

==========  Older tests and comments ==========
Note: 
First let me point out the extent to which Wei Dai declares he is not Satoshi: https://www.gwern.net/docs/2008-nakamoto

Second, I need to point out Satoshi used "it'll" a lot when referring to the software (19 out of 24 times) and said "it will" only 6 times. Wei Dai said "it'll" exactly 1 time in 300,000 words.
UPDATE: Satoshi had a "maybe" to "may be" ratio of 3.3. Wei Dai and the general population have a ratio of about 0.5. Based just on that, I find it hard to believe Wei Dai is Satoshi. Now for me, he's more like a benchmark to beat, and a puzzle as to why he matches so well. If he is Satoshi, the only explanation I can find is that in Satoshi mode he was in a hurry and used the shorter versions of "it will" and "may be", but as Wei Dai on a blog, he wanted to be more grammatically correct. So it's still a possibility to me, I'm just more skeptical.
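Marker ratios like these are easy to compute; here's a hedged Python sketch (my helper, not from the post's code):

```python
import re

def phrase_ratio(text, a, b):
    """Count whole-phrase occurrences of a and b (case-insensitive)
    and return their ratio; infinity if b never appears."""
    def count(phrase):
        return len(re.findall(r"\b" + phrase + r"\b", text, re.IGNORECASE))
    n_a, n_b = count(a), count(b)
    return n_a / n_b if n_b else float("inf")

# e.g. compare phrase_ratio(satoshi_text, "maybe", "may be")
# against the same ratio for each suspect's corpus
```

The word-boundary anchors keep "maybe" from matching inside "may be" and vice versa.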

Third, Satoshi used some British spellings, which some have thought was on purpose to throw people off.

update: I'm in discussions with someone from reddit who is showing me I definitely need more comparison authors from forums before I can call Wei Dai a strong suspect. Half the effect could be from them both speaking in the first person to others in forums, and the other half could be that they have similar interests, not related directly to any words directly related to bitcoin. Forums will have a lot of "I, you, my, we, our" in a way that is different from even dialog-heavy books. On the other hand, I have a lot of non-forum articles from Wright, Szabo, and Dai, and Dai clearly wins on those, but again, his articles may be speaking more directly to his audience in a personal way due to how the lesswrong site is. Conversely, the true Satoshi would be more likely to seek out such a forum. BTW, Dai and Satoshi stopped their forum activity in June 2010, and then Dai picked back up in 2013. Dai started the forum discussion immediately after Bitcoin was released, when presumably he had a lot less coding work to do.

I'm still working on getting better comparison authors (which someone's helping me collect from the bitcoin forum), but I'm also working on getting a better metric of author differences using 2 more dimensions, based on the relationship between physical entropy and the entropy of language which I believe is related to author intelligence and Zipf's law (the two extra dimensions).


Update: Wei Dai is talking almost all philosophy, and I removed paragraphs where he was forced to talk about bitcoin, yet he still matches Satoshi better than anyone I can find who *is* talking about bitcoin. We're continuing to try to find more bitcoin test subjects. Gavin is up next. People talking *to* Satoshi in the old forum (as long as code-specific words are removed) are not going to match Satoshi as well as Wei Dai does when he's talking about philosophy.

UPDATE: I took all of Eliezer Yudkowsky's comments on the lesswrong website, many of which were responses to Wei Dai, plus other people who were talking with Yudkowsky, and here is how they rank compared to Satoshi. The Wei Dai files are his comments on the same website with the others. Notice Adam Back and previous near-hits are falling further down against these larger files and tougher competitors. Notice the difference between the 2 best Wei Dai hits and his "best friends'" best hit is relatively large. On infrequent occasions, Wei Dai was forced to talk about bitcoin. This also shows removing those paragraphs had only a minor effect. HOWEVER, since Yudkowsky beat out previous runner-ups Feynman and Friedman, there appears to be a bias towards forum discussions. Or Yudkowsky is reflecting a lot of Wei Dai's words in some way. Satoshi was in a forum. Wei Dai is in a forum. They beat Szabo and Craig Wright and a non-related test subject whose forum discussion data I had, so it's not the only thing going on, but I need more files from bitcoin-like thinkers in forums.

Satoshi text:  258KB
Using first 351 KB of known files

46102 unique words from the baseline text were used and 52301 words from the authors' files.

1 = 1.0022 wei_dai2_1
2 = 1.003 wei_dai2_0
3 = 1.0041 wei_dai_bitcoin_paragraphs_removed_1
4 = 1.0047 wei_dai_bitcoin_paragraphs_removed
5 = 1.0047 wei_dai_bitcoin_paragraphs_removed_0
6 = 1.0072 wei_dai_bitcoin_paragraphs_removed_3
7 = 1.0078 wei_dai2_2
8 = 1.0081 lesswrong_not_wei_5  (Yudkowsky talking to Wei and others)
9 = 1.0084 wei_dai_bitcoin_paragraphs_removed_2
10 = 1.0094 lesswrong_not_wei
11 = 1.0094 lesswrong_not_wei_0
12 = 1.0096 wei_dai2_3
13 = 1.0117 lesswrong_not_wei_1
14 = 1.0126 lesswrong_not_wei_3
15 = 1.0141 lesswrong_not_wei_2
16 = 1.015 feynman_surely
17 = 1.0217 world_is_flat_thomas_friedman
18 = 1.0223 HEINLEIN Starship Troopers
19 = 1.0252 adam back
20 = 1.0269 lesswrong_not_wei_4

Update: I also compared an old white paper by Wei Dai to Satoshi's Bitcoin forum posts and to Wei Dai's own lesswrong comments, and it shows both Wei Dai's forum comments AND Wei Dai's white paper match BETTER with Satoshi than with himself. Satoshi's bitcoin work has more similarity to Wei Dai's white paper and to Wei Dai's forum comments than Wei Dai's bitcoin-like work has to Wei Dai's forum comments. Satoshi is a good mixture of Dai in those two extreme modes. Wei Dai is often more like Satoshi than he is like Wei Dai.

Wei Dai white paper similarities to other files
1 = 1.3186 satoshi_2.txt
2 = 1.324 satoshi_1_0.txt
3 = 1.3281 satoshi_all_2.txt
4 = 1.3335 wei_dai2_2.txt
5 = 1.354 ian gigg.txt
6 = 1.3549 wei_dai2_0.txt
7 = 1.355 craig_wright.txt
8 = 1.3703 wei dai_2.txt
9 = 1.3741 wei dai_1.txt
10 = 1.3801 wei_dai2_1.txt
11 = 1.3867 superintelligence.txt
12 = 1.393 szabo_2.txt
13 = 1.4044 szabo_0.txt

Why did Wei Dai (supposedly) hide?
If he is Satoshi, I have read enough of his comments to know exactly what happened: he knows, like I do, that bitcoin is not an ideal thing for society. He knows, like I do and like Keynes, Mises, Hayek, and Benjamin Graham (Warren Buffett's mentor), that a coin that tracks something like commodities or the ability to produce commodities (to lessen fluctuations due to speculation) is much better. See my previous post on how bad a constant-quantity coin can be. He knows there are real risks to society from such a thing, and big profits for those who jump in early. As the b-money creator, he already knew the ins and outs of how to do it, and when the financial crisis hit and it looked like the government was going to cave to the banks, he figured ideal or not, something needed to be done. He was one of the very few who knew how to plug all the potential security holes before security experts looked into it (according to a news article interview conveying the security expert's amazement at all the holes already being plugged, and others thinking the code indicated 1 person). You do not just come along as a young man or natural genius and suddenly know how to plug the holes beforehand in such a particular and security-susceptible peer-to-peer system. This reflects something like a decade of experience in a very particular kind of software, like hashcash and b-money. However, maybe Satoshi was lifting a lot of Wei Dai's code (see the last of these Wei Dai comments on lesswrong).


Summary:
Despite Wei Dai's text coming from a philosophically-oriented website and Satoshi's text mostly discussing the bitcoin project in a forum, Wei Dai's text ranks as having the lowest entropy difference from Satoshi's text. See the previous post on Satoshi suspects I ruled out or do not have data on. Adam Back's hashcash papers come in a close 2nd, but I'll explain why 2nd place (or 3rd or 5th place using the most accurate method) is not as close to first place as it might seem.

At the bottom of this page is the Perl code and a windows executable. To do a professional job of this, use SVMLight's ranking routine.

Here is how accurate it is on known authors

Back's sampled text is about bitcoin-like work, even mentioning help and good ideas from Wei Dai. In contrast, the 40,000 words from Wei Dai's "philosophy articles" did not include "node", "blockchain", or "transactions". He said "currency" once, "bitcoin" 6 times (mainly to claim he mined a lot of coins very early on because someone made him aware of the project), and "privacy" once. How are such different topics of discussion matching so well, beating Szabo, Back, and Craig Wright's bitcoin papers?

The following chart is when only Wei Dai's data is from a forum or email list.



You could delete every possible economic, political, and bitcoin-related word out of all the texts, and as long as you treat all the texts equally, in my experience it will not make a significant difference. A key point is noticing how Wei Dai pulls away from the pack. The scores are not a function of each other in a way that would cause this, unless there is something very different about Wei Dai.

The similarities between Wei Dai and Satoshi Nakamoto are astonishing.  In almost all tests, Wei Dai ranks higher than any other author.  As I made different runs, other authors would jump around in the rankings, but not Wei Dai.  On the 3-word tests that went against my pre-determined rule of using punctuation, Wei Dai finally lost out to Feynman on a large file. However, as could be expected by the true author, Wei Dai regained the top spot on the 4-word test.  Nick Szabo stayed near top on single-word runs but dropped completely out of the running on the 3 and 4 word tests.

One thing is for sure: with someone matching so much better than Nick Szabo, it can't be Nick Szabo. As another example, on 4-word sequences, Wei Dai had twice as many 4-word matches as Nick Szabo, and about 25% more than any other author.

Less Wrong = Null Hypothesis = Physical Laws
It's interesting that Wei Dai's text comes from the "Less Wrong" philosophy website. Likewise, this algorithm is not trying to show who IS Satoshi; it is testing for who is NOT Satoshi. Wei Dai is the least not-Satoshi I have found. Einstein mentioned that thermodynamics and relativity are based on similar negative statements: there is no perpetual motion machine, and nothing can travel faster than light. The Heisenberg uncertainty principle is also about what can't be measured. There's a constant associated with each negative statement: kb, c, and h. By stating what is not possible, the world is restricted less in what it might be.

Which words are "giving him away"?
None of them. I do not think anyone can look at the word frequencies and say "Hey, they are similar" or different, or anything like that. The words not matching Satoshi are contributing the most. Craig Wright's five bitcoin articles rank really low on the 3-word test. In looking at all the data, I can hardly see anything that determines how low an author ranks. I believe Feynman's book matches pretty well because Satoshi's, Feynman's, and Dai's content all used the word "I" heavily. Dai's web page is discussing his ideas and responding to people about his articles. Satoshi of course had a job to do, interacting with many people based on his ideas and work. Feynman was telling personal histories. So this is partially why 3- and 4-word matches were more likely among these three.

Summing the Surprisals
This method of detecting differences in authors is called summing the surprisals.  See wiki https://en.wikipedia.org/wiki/Self-information#Examples


Our experience with a known author is that he says a particular word with probability q, but we observe the mystery author saying it with probability p. We're more surprised if the same author is not saying particular words or word pairs with his normal probabilities. If you roll a die many times, you expect the "4" to come up 1/6 of the time. Your surprise if it does this correctly is log((1/6) / (1/6)) = 0. No surprise. It's a well-functioning die. Counting how often you get a "4" (N) and how often you expect one (M) gives a surprise of log(N/M), because the divisor (total number of rolls) cancels. If a mystery die (or mystery author) rolls (says) a particular number (word) 50 times out of a total of 600 rolls, then your surprise is log(50/100), which is equal to the negative of log(100/50). If 4 comes up more and 3 comes up less by the same amount, they would cancel if you added them up, so you need to look at abs(log(N/M)), but my experience is that the following works as well or better. There are reasons, but this article is already too long.

sum [ log(p/q) ] when p is greater than q

for all word counts p and q, where each word Satoshi used has count p and a known author's count for that word is q. Word pairs and word triples can be used for p and q if you want. Assign q=0.25 if q=0, which is important. The unknown author file can be any size, but all the files to be compared to it must have the same number of words to be comparable to each other. A windows exe (with zero user interface) that implements the shown Perl code is at the bottom of this page. It only does the most reliable case, which is a triple that lets the middle word be a variable; punctuation is treated like words, and contractions are split away from the word. These little character-handling details are important.
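As an illustration of those character-handling details and the one-sided sum, here is a Python sketch (my approximation of the described preprocessing, not the actual Perl):

```python
import re
from math import log
from collections import Counter

def skip_triples(text):
    """Word triples with the middle word replaced by a wildcard.
    Punctuation marks count as separate tokens, and contractions are
    split off the word ("don't" -> "don" + "'t"), as described above."""
    tokens = re.findall(r"[a-z]+|'[a-z]+|[^\sa-z]", text.lower())
    return Counter((tokens[i], "x", tokens[i + 2])
                   for i in range(len(tokens) - 2))

def surprisal_sum(p_counts, q_counts):
    """Sum log(p/q) only where the unknown author (p) used a pattern
    more than the known author (q); a missing pattern gets q = 0.25."""
    total = 0.0
    for key, p in p_counts.items():
        q = q_counts.get(key, 0) or 0.25
        if p > q:
            total += log(p / q)
    return total
```

Running surprisal_sum over same-sized known-author files, the smallest total marks the most similar author.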

This measures a difference in information entropy.  
Smallest entropy difference = most similar authors

What makes it smarter than people in this "championship game" of "Who's the author?" is that it can do a lot of calculations. [Start "the machines are coming" rant] What happens when these machines are not only grand masters at chess, Go, Jeopardy, pharmaceutical drug design, stock trading, and DRIVING (not killing anyone like people do), but also grand masters at knowing who we are, how to seduce us, and how to win in a political race? A few years from now, a human doing theoretical physics might be a joke the machines enjoy telling each other. A.I. already surrounds us. The brain on my phone is 2000 times smaller than my children's brains and recognizes spoken English (and types it) better, after zero training time and zero cost. The technology did not exist on any phone when they were born. Children going to school these days is a joke. [End rant, back to business.]


Word triple entropy rankings with the word in the middle made a variable "x". This is the most reliable method.

The entropy score difference between 1st and 2nd place is the same as between 2nd and 14th.

All file sizes: 215,000 bytes, 43,703 words
Base file:  satoshi_all.txt

1 = 34367.07 = wei dai.txt 
2 = 35057.81 = world_is_flat_thomas_friedman.txt
3 = 35198.36 = feynman_surely.txt
4 = 35207.48 = HEINLEIN Starship Troopers.txt
update: Adam Back came in 3rd to 5th
5 = 35438.72 = Richard Dawkins - A Devil's Chaplain.txt
6 = 35514.18 = wander.txt
7 = 35522.38 = What-Technology-Wants.txt _1.txt
8 = 35608.22 = SAGAN - The Demon-Haunted World part B.txt
9 = 35616.52 = SAGAN - Contact.txt
10 = 35623.21 = minsky_emotion_machines.txt
11 = 35623.37 = Richard Dawkins - The Selfish Gene.txt
12 = 35662.57 = foundation trilogy.txt
13 = 35680.53 = Steven-Pinker-How-the-Mind-Works.txt
14 = 35710.39 = Ender's Game.txt
.... 25 = 36110.44 = szabo.txt
...  32 = 36292.74 = craig_wright.txt

The following shows the single-word test and a listing of all the documents these runs are being tested against. 
Difference between 1st and 2nd place is same as between 2nd and 12th
Note: this is an update where the punctuation is removed.

Base file: satoshi_all.txt
file size: 215,000 bytes, 38,119 words
1 = 2537.9 = wei dai.txt 
2 = 2665.64 = HEINLEIN Starship Troopers.txt
update: Adam Back came in 3rd
3 = 2665.83 = feynman_surely.txt
4 = 2667.86 = Richard Dawkins - The Selfish Gene.txt
5 = 2677.85 = world_is_flat_thomas_friedman.txt
6 = 2687.89 = Ender's Game.txt
7 = 2745.07 = minsky_emotion_machines.txt
8 = 2763.72 = craig_wright.txt
9 = 2765.46 = What-Technology-Wants.txt _1.txt
10 = 2767.81 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt
11 = 2775.95 = HEINLEIN Stranger in a Strange Land part A.txt
12 = 2777.61 = superintelligence_1.txt
13 = 2797.76 = craig wright pdfs.txt
14 = 2799.61 = ridley_the_rational_optimist part A.txt
15 = 2805.55 = samuel-butler_THE WAY OF ALL FLESH.txt
16 = 2809.45 = Richard Dawkins - A Devil's Chaplain.txt
17 = 2809.55 = foundation trilogy.txt
18 = 2809.98 = superintelligence_0.txt
19 = 2809.98 = superintelligence.txt
20 = 2815.89 = wander.txt
21 = 2816.7 = HEINLEIN Citizen of the Galaxy.txt
22 = 2819.38 = how to analyze people 1921 gutenberg.txt
23 = 2822.95 = Steven-Pinker-How-the-Mind-Works.txt
24 = 2826.05 = What-Technology-Wants.txt
25 = 2826.05 = What-Technology-Wants.txt _0.txt
26 = 2827.14 = crash_proof.txt
27 = 2832.55 = szabo.txt
28 = 2832.84 = RIDLEY genome_autobiography_of_a_species_in_23.txt
29 = 2833.2 = twain2.txt
30 = 2842.11 = RIDLEY The Red Queen part A.txt
31 = 2851.56 = SAGAN - Contact.txt
32 = 2852.34 = HEINLEIN Have Space Suit.txt
33 = 2853.14 = twain4.txt
34 = 2858.2 = GREEN - The Elegant Universe (1999).txt
35 = 2859.64 = Steven-Pinker-The-Language-Instinct.txt
36 = 2866.18 = GREEN The Fabric of the Cosmos.txt
37 = 2876.27 = Catch 22.txt
38 = 2881.65 = twain shorts.txt
39 = 2883.31 = AUSTIN_pride and predjudice.txt
40 = 2884.1 = brown - web of debt part B.txt
41 = 2888.33 = SAGAN_pale_blue_dot.txt
42 = 2889.25 = AUSTIN_sense and sensibility.txt
43 = 2890.85 = don quixote.txt
44 = 2891.92 = The Defiant Agents - science fiction.txt
45 = 2894.32 = Justin Fox - Myth of the Rational Market2.txt
46 = 2897.72 = freud.txt
47 = 2900.37 = twain - many works.txt
48 = 2904.58 = SAGAN - The Demon-Haunted World part B.txt
49 = 2906.67 = ridley_the_rational_optimist part B.txt
50 = 2912.06 = RIDLEY The Red Queen part B.txt
51 = 2917.24 = twain roughing it part A.txt
52 = 2918.94 = the social cancer - philipine core reading.txt
53 = 2920.25 = rifkin_zero_marginal_society.txt
54 = 2922.95 = SAGAN - The Cosmic Connection (1973).txt
55 = 2926.16 = SAGAN - The Demon-Haunted World part A.txt
56 = 2926.21 = moby-dick part A.txt
57 = 2927.76 = dickens tale of two cities.txt
58 = 2928.73 = twain innocents abroad part A.txt
59 = 2928.89 = dickens oliver twist part B.txt
60 = 2932.54 = SAGAN The Dragons of Eden.txt
61 = 2936.43 = dickens david copperfield.txt
62 = 2938.3 = dickens hard times.txt
63 = 2944.31 = J.K. Rowling Harry Potter Order of the Phoenix part B.txt
64 = 2949.54 = moby-dick part B.txt
65 = 2949.81 = SAGAN-Cosmos part A.txt
66 = 2958.4 = SAGAN-Cosmos part B.txt
67 = 2964.63 = works of edgar allen poe volume 4.txt
68 = 2966.41 = Rifkin J - The end of work.txt
69 = 2976.91 = twain1.txt
70 = 2980.1 = J.K. Rowling Harry Potter Order of the Phoenix part A.txt
71 = 2983.97 = dickens oliver twist part A.txt
72 = 3016.28 = ivanhoe.txt
73 = 3017.29 = brown - web of debt part A.txt
74 = 3104.59 = Finnegans wake.txt

Discussion before showing the other test runs.

For word pairs and triples, I found it better to keep each punctuation mark as a "word." All tests for determining methodology excluded the Satoshi, Szabo, Wright, and Dai data, so as to not "prove" a correlation by accidentally getting what I was looking for. Wei Dai was my last known author to check, and I didn't expect anything, as my previous post shows. I was convinced Szabo would be my best match. I was also convinced no authors would match well on 3- and 4-word tests because professional authors kept giving spotty results.

Craig Wright used to say "bit coin" instead of "bitcoin", which Satoshi never did. Full stop. That's not him. Similarly, Satoshi's "may be" to "maybe" ratio was 56/43. Wei Dai's is 51/34. Nick Szabo's is 25/0. Full stop on Szabo. It's not him. But Wei Dai needed checking.

I was having a lot of trouble getting the program to determine with exactness which books Carl Sagan, Robert Heinlein, Mark Twain, and others wrote. Wei Dai matches Satoshi better than most authors match themselves. Wei Dai is more Satoshi than Carl Sagan is Carl Sagan, and more Robert Heinlein than Robert Heinlein is, not to mention Ridley, Twain, and Poe.

The Huckleberry Finn Effect
A complicating problem arises when the "unknown" author is working on a specific task and not using his full vocabulary, or is trying to be someone else.  I observed this phenomenon by accident in my tests and called it the "Huckleberry Finn Effect".  Mark Twain wrote the entire book in the first-person voice of a bumpkin child, and the book matches well with all his other books when it is the "unknown author" book.  But when running the process in reverse, Huckleberry Finn did not match well at all with Mark Twain's books, at least not on single-word runs. Mark Twain has a larger vocabulary than Huck Finn, and so do all the other authors on the list, so Huck Finn ranked lower compared to the adult authors. But he began to match, and match well, as longer word sequences were used.  As I expected, Wei Dai did not match on single-word checks during a reverse test because he was working on a narrow subject (if not trying to use a different vocabulary). He ranked halfway down the list on single-word tests.  But on word pairs, he got a good bit higher. Then on 3- and 4-word sets he became the top match, as I show below. Is this real entropy magic, or do I just not have enough authors using the word "I"? Even taking out "I", Wei Dai still matches better on simple 4-word sequences than any other author. If you take out every noun, he'll still probably match the best.

Word Pairs (2nd best method) 
Notice the large margin of the win: the distance between 1st and 2nd is the same as the distance between 2nd and 7th.
file size: 215000
satoshi_all.txt

46492 words
1 = 29125.52 = wei dai.txt 
2 = 29754.71 = world_is_flat_thomas_friedman.txt
3 = 30197.82 = feynman_surely.txt
4 = 30207.26 = What-Technology-Wants.txt _1.txt
5 = 30285.41 = Richard Dawkins - A Devil's Chaplain.txt
6 = 30304.18 = HEINLEIN Starship Troopers.txt
7 = 30365.04 = Richard Dawkins - The Selfish Gene.txt
8 = 30397.33 = Steven-Pinker-How-the-Mind-Works.txt
9 = 30430.2 = minsky_emotion_machines.txt
10 = 30432.76 = szabo.txt
11 = 30474.9 = SAGAN - Contact.txt
12 = 30503.35 = craig_wright.txt

word triples with middle NOT made a variable
file size: 215000
satoshi_all.txt

46492 words
1 = 49937.82 = wei dai.txt 
2 = 50373.12 = feynman_surely.txt
3 = 50784.93 = HEINLEIN Starship Troopers.txt
4 = 50807.91 = SAGAN - Contact.txt
5 = 50815.37 = Ender's Game.txt
6 = 50821.11 = wander.txt
7 = 50860.49 = foundation trilogy.txt
8 = 50889.92 = world_is_flat_thomas_friedman.txt
9 = 50937.44 = Richard Dawkins - The Selfish Gene.txt
10 = 50976.12 = What-Technology-Wants.txt _1.txt
11 = 51017.51 = HEINLEIN Citizen of the Galaxy.txt
12 = 51138.06 = Richard Dawkins - A Devil's Chaplain.txt
13 = 51154.04 = HEINLEIN Stranger in a Strange Land part A.txt
14 = 51271.22 = szabo.txt
......
31 = 51516.88 = craig_wright.txt

4 word sequences
The distinction is not as clear here, because 4-word sets need more data to become significant.
file size: 215000
satoshi_all.txt

46492 words
1 = 58563.91 = wei dai.txt 
2 = 58682.45 = feynman_surely.txt
3 = 58831.3 = Ender's Game.txt
4 = 58918.85 = SAGAN - Contact.txt
5 = 58942.11 = wander.txt
6 = 58942.6 = HEINLEIN Starship Troopers.txt
7 = 58952.93 = foundation trilogy.txt
8 = 58989.44 = HEINLEIN Citizen of the Galaxy.txt
9 = 59090.42 = Catch 22.txt
.....  21 = 59275.79 = szabo.txt
.....  37 = 59436.53 = craig_wright.txt

In the following, the large files were broken up into 3 or more files.  This gave Wei Dai more chances to fail, and other authors more opportunity to win. 

Single words
file size: 74000
satoshi_all.txt

15670 words
1 = 2584.95 = wei dai_2.txt 
2 = 2628.68 = wei dai_1.txt 
3 = 2652.08 = world_is_flat_thomas_friedman_2.txt
4 = 2688.91 = world_is_flat_thomas_friedman_3.txt
5 = 2712.89 = craig_wright.txt
6 = 2724.3 = wei dai.txt ***  A losing instance
7 = 2733.85 = world_is_flat_thomas_friedman_5.txt
8 = 2734.17 = szabo_2.txt

Word pairs
file size: 74000
satoshi_all.txt

15670 words
1 = 11476.23 = wei dai_1.txt 
2= 11530.86 = wei dai_2.txt 
3 = 11612.98 = wei dai.txt 
4 = 11742.85 = world_is_flat_thomas_friedman_2.txt
5 = 11797.12 = world_is_flat_thomas_friedman_4.txt
6 = 11800.34 = world_is_flat_thomas_friedman_5.txt
7 = 11817.63 = craig_wright.txt
8 = 11849.01 = What-Technology-Wants_4.txt
9 = 11860.05 = world_is_flat_thomas_friedman_3.txt
10 = 11863.12 = world_is_flat_thomas_friedman.txt
11 = 11863.12 = world_is_flat_thomas_friedman_0.txt
12 = 11869.89 = world_is_flat_thomas_friedman_1.txt
13 = 11887.65 = Richard Dawkins - A Devil's Chaplain.txt
14 = 11902.31 = What-Technology-Wants.txt _1.txt
15 = 11914.62 = Richard Dawkins - The Selfish Gene.txt
16 = 11923.83 = szabo_2.txt

word triples:
file size: 74000
satoshi_all.txt

15670 words
1 = 18124.84 = wei dai_1.txt 
2 = 18164.2 = wei dai.txt 
3 = 18200.55 = wei dai_2.txt 
4 = 18278.58 = feynman_surely.txt
5 = 18293.63 = wander.txt
6 = 18313.43 = world_is_flat_thomas_friedman_4.txt
7 = 18314.77 = Ender's Game.txt
8 = 18322.22 = world_is_flat_thomas_friedman_6.txt
9 = 18328.88 = world_is_flat_thomas_friedman_5.txt
10 = 18329.97 = world_is_flat_thomas_friedman_2.txt
11 = 18338.86 = craig_wright.txt

Repeat the above but on a different 
portion of Satoshi's writing

single words
file size: 74000
satoshi_1.txt

15136 words
1 = 2711.44 = wei dai_2.txt 
2 = 2776.41 = wei dai_1.txt 
3 = 2813.41 = world_is_flat_thomas_friedman_2.txt
4 = 2815.1 = world_is_flat_thomas_friedman_3.txt
5 = 2830.72 = wei dai.txt  **** A losing instance
6 = 2846.62 = craig wright blog.txt  *** these are 5 bitcoin articles!
7 = 2847.16 = world_is_flat_thomas_friedman_1.txt
8 = 2852.34 = world_is_flat_thomas_friedman_4.txt
9 = 2876.52 = szabo_2.txt
10 = 2894.94 = What-Technology-Wants_5.txt
11 = 2908.64 = world_is_flat_thomas_friedman_5.txt
12 = 2915.53 = superintelligence_0.txt
13 = 2915.53 = superintelligence.txt
14 = 2917.19 = world_is_flat_thomas_friedman.txt
15 = 2917.19 = world_is_flat_thomas_friedman_0.txt
16 = 2918.21 = craig_wright_0.txt
17 = 2918.21 = craig_wright.txt
18 = 2934.65 = What-Technology-Wants_4.txt
19 = 2939.89 = What-Technology-Wants_3.txt
20 = 2953.85 = What-Technology-Wants.txt _1.txt
21 = 2962.05 = What-Technology-Wants_1.txt
22 = 2973.87 = szabo_0.txt
23 = 2973.87 = szabo.txt
..... 27 = 2986.1 = szabo_1.txt
.... 33 = 3017.04 = craig wright pdfs.txt

word pairs
file size: 74000
satoshi_1.txt

15136 words
1 = 12534.78 = wei dai_1.txt 
2 = 12596.11 = wei dai_2.txt 
3 = 12673.02 = wei dai.txt 
4 = 12838.22 = world_is_flat_thomas_friedman_2.txt
5 = 12863.48 = world_is_flat_thomas_friedman_1.txt
6 = 12872.03 = world_is_flat_thomas_friedman_3.txt
7 = 12874.27 = craig_wright_0.txt
8 = 12874.27 = craig_wright.txt
9 = 12874.74 = world_is_flat_thomas_friedman_4.txt
10 = 12876.12 = feynman_surely.txt
11 = 12891.06 = world_is_flat_thomas_friedman_5.txt
12 = 12914.99 = world_is_flat_thomas_friedman_0.txt
....18 = 12992.69 = craig wright blog.txt *** these are 5 bitcoin articles!
....25 = 13065.04 = szabo_2.txt
... 33 = 13126.87 = What-Technology-Wants.txt
...34 = 13126.87 = What-Technology-Wants.txt _0.txt
....35 = 13126.87 = What-Technology-Wants_0.txt
.....46 = 13214.77 = szabo_1.txt
....52 = 13238.57 = szabo_0.txt
.....53 = 13238.57 = szabo.txt

Word triple, word in middle a variable
file size: 74000
satoshi_1.txt

15136 words
1 = 14148.98 = wei dai_1.txt 
2 = 14280.94 = wei dai.txt 
3 = 14320.99 = wei dai_2.txt 
4 = 14399.46 = world_is_flat_thomas_friedman_4.txt
5 = 14446.93 = world_is_flat_thomas_friedman_2.txt
6 = 14468.12 = feynman_surely.txt
7 = 14479.58 = HEINLEIN Starship Troopers.txt
8 = 14499.58 = world_is_flat_thomas_friedman.txt
9 = 14499.58 = world_is_flat_thomas_friedman_0.txt
10 = 14513.87 = Ender's Game.txt
11 = 14516.18 = world_is_flat_thomas_friedman_3.txt
12 = 14516.82 = world_is_flat_thomas_friedman_5.txt
13 = 14535.82 = wander.txt
14 = 14542.61 = Richard Dawkins - A Devil's Chaplain.txt
15 = 14548.98 = world_is_flat_thomas_friedman_7.txt
16 = 14574.61 = craig_wright_0.txt
17 = 14574.61 = craig_wright.txt
...21 = 14617.72 = craig wright blog.txt  *** these are 5 bitcoin articles!
...37 = 14725.98 = szabo_2.txt
...48 = 14781.78 = szabo_1.txt
...55 = 14840.94 = szabo.txt
...56 = 14842.32 = szabo_0.txt

Again on a 3rd portion of Satoshi's writing
Single words
file size: 74000
satoshi_2.txt

15380 words
1 = 2660.66 = wei dai_2.txt 
2 = 2687.98 = wei dai_1.txt 
3 = 2770.03 = wei dai.txt 
4 = 2807.24 = craig wright blog.txt   these are 5 bitcoin articles!
5 = 2829.34 = world_is_flat_thomas_friedman_2.txt
6 = 2850.82 = What-Technology-Wants_5.txt
7 = 2851.36 = szabo_2.txt
8 = 2852.92 = world_is_flat_thomas_friedman_3.txt
9 = 2856.99 = world_is_flat_thomas_friedman_4.txt
10 = 2882 = craig_wright_0.txt
11 = 2882 = craig_wright.txt
12 = 2888.43 = world_is_flat_thomas_friedman_1.txt
13 = 2902.57 = world_is_flat_thomas_friedman.txt
14 = 2902.57 = world_is_flat_thomas_friedman_0.txt
15 = 2914.92 = What-Technology-Wants_4.txt

Word Pairs
file size: 74000
satoshi_2.txt

15380 words
1 = 12384.66 = wei dai_1.txt 
2 = 12528.99 = wei dai_2.txt 
3 = 12530.35 = wei dai.txt 
4 = 12833.27 = feynman_surely.txt
5 = 12881.83 = craig_wright_0.txt
6 = 12881.83 = craig_wright.txt
7 = 12889.53 = world_is_flat_thomas_friedman_2.txt
8 = 12893.8 = world_is_flat_thomas_friedman_1.txt
9 = 12901.36 = world_is_flat_thomas_friedman.txt
10 = 12901.36 = world_is_flat_thomas_friedman_0.txt
11 = 12925.84 = HEINLEIN Starship Troopers.txt
12 = 12933.83 = world_is_flat_thomas_friedman_4.txt
13 = 12935.16 = world_is_flat_thomas_friedman_5.txt
14 = 12940.12 = Ender's Game.txt
15 = 12950.67 = craig wright blog.txt

Word triple, word in middle variable
file size: 74000
satoshi_2.txt

15380 words
1 = 14029.58 = wei dai_1.txt 
2 = 14150.66 = wei dai.txt 
3 = 14252.42 = wei dai_2.txt 
4 = 14436.32 = feynman_surely.txt
5 = 14455.36 = HEINLEIN Starship Troopers.txt
6 = 14474.59 = Ender's Game.txt
7 = 14517.58 = wander.txt
8 = 14525.48 = Richard Dawkins - A Devil's Chaplain.txt
9 = 14539.11 = world_is_flat_thomas_friedman_4.txt
10 = 14541.82 = world_is_flat_thomas_friedman_2.txt
11 = 14571.1 = HEINLEIN Citizen of the Galaxy.txt
12 = 14580.67 = world_is_flat_thomas_friedman_5.txt
13 = 14588.44 = craig_wright_0.txt
14 = 14588.44 = craig_wright.txt

Reverse check, large files:

word triples
file size: 200000
wei dai.txt

41804 words
1 = 47249.78 = satoshi_all.txt 
2 = 47415.99 = Richard Dawkins - The Selfish Gene.txt
3 = 47417.4 = superintelligence_1.txt
4 = 47470.18 = szabo.txt
5 = 47473.99 = Richard Dawkins - A Devil's Chaplain.txt
6 = 47621.95 = SAGAN - The Demon-Haunted World part B.txt
7 = 47634.54 = What-Technology-Wants.txt _1.txt
8 = 47646.7 = feynman_surely.txt

4 word sequences
file size: 200000
wei dai.txt

41804 words
1 = 54495.49 = satoshi_all.txt 
2 = 54701.31 = wander.txt
3 = 54723.85 = feynman_surely.txt
4 = 54833.41 = SAGAN - Contact.txt
5 = 54856.25 = HEINLEIN Starship Troopers.txt
6 = 54868.92 = Ender's Game.txt
7 = 54877.23 = SAGAN - The Demon-Haunted World part B.txt
8 = 54888.63 = szabo.txt


To show how hard it is to derive meaning from individual words, here are the most common words with more than 5 letters from 4 bitcoin suspects. Notice how different Wei Dai's text appears to be from the technical aspects of the other three. But if you're like me, you can see a "softness" in Wei Dai's and Satoshi's words compared to the other two. Adam Back and Szabo seemed to have more hard-core things to say in what I could find online.  This entropy equation is somehow detecting it. You could argue, "Szabo and Back are just talking more technically, and for some reason Satoshi's project did not require such diverse language, and for some reason that was similar to Wei Dai's philosophical discussions." Yes, that's exactly it: the same person has a tendency to choose projects that lead to similar language. That's data to be included, not thrown out.
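Pulling the most common long words from a text is a one-liner with a counter. A sketch of my own (the sample text is made up for illustration):

```python
import re
from collections import Counter

def common_long_words(text, min_len=6, top=10):
    """Most frequent words with more than 5 letters, i.e. at least min_len=6."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_len)
    return counts.most_common(top)

# Hypothetical sample: 'nodes' (5 letters) is too short and gets filtered out.
sample = "network network timestamp nodes the the"
```

Comparing these lists side by side for each suspect is what makes the vocabulary differences visible at a glance.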

This is the Perl code for single words; it can be modified for multiple words.

Here is the Windows executable. It is the Perl script below compiled with perl2exe. I have it set up to do the triple with the middle word treated as an allowed variable, to allow more matches while trying to gain a sense of the author's style.
http://wordsgalore.com/author_compare_executable.exe (it's not a GUI or even a command-line tool; read below).

Here are the executable instructions.  You just put files where they belong and double-click. This prevents the user from being able to adjust the parameters, which would be self-deceiving. Changing how punctuation is handled can change the results. I am just testing it on a bunch of authors (not Satoshi-related), and whatever gives the best correct results is accepted.  The instructions below describe the exact procedure I've settled on so far.
=========
This takes the text of 'author_baseline.txt', located in the same directory as the executable, and calculates the word-entropy difference between it and all files with a txt extension located in the sub-directory 'authors'.
The output, ranking the most similar texts first, is sent to an output file. The equation is: for each word in the baseline, divide its count by the word count from the current file and then take the log base 10 of the ratio. If the word was not found in the current file, assign its count a value of 0.25. Do this only if the baseline word count is greater than the current file's word count. Sum for all words. Words are not single words in this version, but word triples where the middle word is a variable. This makes authors more distinct. Apostrophes are moved to the right of the word, outside of it. All letters are made lowercase.  All other common punctuation is treated like a word.  All this is crucial for good results, and slight changes can have substantial effects. The reverse test on suspect authors should be done, but a true author writing in a different mode can rank much lower on the reverse test. I call it the Huckleberry Finn effect after seeing it happen in matching Mark Twain: Huckleberry was identified as Mark Twain, but not vice versa, except on large data with longer word sequences.

The smallest txt file in 'authors' determines the number of words pulled from the beginning of all the other files. It should be 30% greater than the author_baseline.txt file. This makes the comparisons fair, without a size bias. But it means you have to get all big files and remove the small ones. I recommend at least 50k. 500k is not overkill.
=========
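The scoring equation in the instructions above is simple enough to sketch directly. This is my own Python rendering of it, not the original Perl; the tokens could equally be single words, pairs, or triples, since the equation only sees counts. Lower totals mean the two texts are closer, which is why rank 1 in all the listings above has the smallest number.

```python
from collections import Counter
from math import log10

def entropy_distance(baseline_counts, other_counts):
    """Word-entropy difference as described in the instructions: for each
    baseline token whose count exceeds its count in the other file (a missing
    token counts as 0.25), add log10(baseline_count / other_count).
    Lower totals mean more similar authors."""
    total = 0.0
    for word, b in baseline_counts.items():
        o = other_counts.get(word, 0.25)
        if b > o:
            total += log10(b / o)
    return total

# Tiny hypothetical example: 'node' appears 4 times in the baseline and twice
# in the other file; 'timestamp' is missing from the other file entirely.
base = Counter({"node": 4, "timestamp": 1})
other = Counter({"node": 2})
score = entropy_distance(base, other)   # log10(4/2) + log10(1/0.25)
```

The one-sided condition (only counting baseline tokens that are over-represented) and the 0.25 floor for missing tokens are both taken straight from the quoted instructions; the floor is what penalizes an author who never uses one of Satoshi's habitual phrases.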

