Monday, May 2, 2016

Text Analysis of Satoshi Nakamoto, Nick Szabo, and Craig Wright

update: I was working on the list of suspects below, going by the Wikipedia article. The last one on the list was Wei Dai. And guess what.

Nick Szabo does not appear to be Satoshi Nakamoto. 43 out of 56 times, Satoshi wrote "maybe"; 25 out of 25 times, Szabo wrote "may be". 7 out of 7 times, Satoshi wrote "optimisation", a non-American spelling. He also used "colour", "organise", "analyse", and "synchronise", which Szabo did not.
Shinichi Mochizuki also used "may be" on his web page several times, but not "maybe".
Ross Ulbricht and Hal Finney are/were American. 
Vili Lehdonvirta uses British English, but Satoshi ranks an incredibly low 42 against his texts, and conversely Lehdonvirta ranks a fairly low 19 when Satoshi is the baseline.
I could not find writings by Mike Clear.
The patent by Neal King, Vladimir Oksman, and Charles Bry ranks a respectable 9 against Satoshi, but still below Szabo and Wright.

I'm working on a Windows executable of this and will list it here, above the script, when it is done.

To do a professional job of this, use open source SVMLight with its ranking option.

If you think someone else might be Satoshi, send me 100 KB of their writing for a comparison.

I wrote the Perl program below to compare texts of various KNOWN authors to a single "MYSTERY" author. I originally tried Shannon entropy, K-L divergence, and Zipf's law, plus various power and constant modifications to them, but the method below came out best. I also tried word pairs, word triples, variable-in-the-middle, and including punctuation and/or capitalization. Word pairs with punctuation worked almost as well. (p/q)^0.2 instead of log(p/q) worked about as well, as it is a similar function over this range. Grouping words into nouns, pronouns, verbs, etc., while ignoring the actual word, and then scoring triples or longer runs of those categories, would give a completely different entropy measure of the difference. It would have enough data and, being a different "dimension" of the author, could be summed with the word-based score for better accuracy, but I do not have good word lists separated into those categories.

The equation
Sum the following for each word where the count of a mystery word is greater than the known author's count: log(count_mystery_word/count_known_word). The lowest total score wins. The score indicates a naive difference in entropy between the two authors. This is a Shannon entropy difference, a.k.a. K-L divergence, which is p*log(p/q), but apparently because of Zipf's law the p outside the log had a negative effect in identifying authors in my tests, so I left it out, as determined by experimentation. To prevent division by zero, if count_known = 0, then count_known = 0.25/number_of_words_in_text (half as likely as a 50% chance of the word having accidentally gone unused). Experimentally, 0.25 was nearly optimal, with values from 1 down to 0.05 not making much difference. Dropping the conditional and summing over all words had little effect, sometimes good, sometimes bad by up to 5%, but keeping it saves computation. The loop runs only through the mystery words, where the known author's count may be zero, but not vice versa. To get decent results by looping through the known author's words instead, the words not used by the mystery author must be skipped completely, which is very different from assigning a reasonable probability to missing words in the mystery-author loop. This is explained below.
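The scoring rule above can be sketched in a few lines. The original program is Perl; this is a hypothetical Python re-sketch (function name and whitespace tokenization are my own choices), using frequencies so texts of unequal length compare sensibly:

```python
import math
from collections import Counter

def author_distance(mystery_text, known_text):
    """One-sided score: sum log(p/q) over words the mystery author
    uses more frequently than the known author. Lower total = better
    match. Unseen known-author words get the 0.25/N smoothing from
    the post so log() stays finite."""
    m = Counter(mystery_text.lower().split())
    k = Counter(known_text.lower().split())
    n_m = sum(m.values())
    n_k = sum(k.values())
    score = 0.0
    for word, count in m.items():
        p = count / n_m                # mystery-author frequency
        q = k.get(word, 0) / n_k       # known-author frequency
        if q == 0.0:
            q = 0.25 / n_k             # divide-by-zero smoothing
        if p > q:                      # the one-sided conditional
            score += math.log(p / q)
    return score
```

Identical texts score exactly 0, and the one-sided conditional is what makes the measure asymmetric.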

The Huckleberry Finn Effect
Looping through all mystery words, but not vice versa, has an interesting effect. If you run it on "Huckleberry Finn", it is easy to see the narrator correlates with Mark Twain. But if you run other texts by Mark Twain as the mystery text, Huckleberry Finn ranks nearly LAST (see sample below). The larger the text sampled, the bigger the effect, because Huck's vocabulary was smaller than that of all the other authors, who matched Twain better and better at large sample sizes. The method is saying Huckleberry Finn is Mark Twain, but Mark Twain is not Huckleberry Finn. Twain wrote Huckleberry Finn in the 1st person as an endearing, uneducated but smart kid with a limited vocabulary. Finn could not get away from the simple words Twain is biased towards, but Twain could use words Finn did not know, more like other authors. So there is at least one way authors may not be disguising themselves as well as they think. Except for word pairs: Huck jumped up higher when word pairs were treated like words. Treating common punctuation as words was needed to make word pairs as good as single words, while single words seemed to do worse when punctuation was treated as words.
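The effect shows up even in a toy example (hypothetical texts, and a Python re-sketch of the post's Perl scoring rule with names of my own): a narrow-vocabulary text scores close to a broad-vocabulary one, but not the other way around.

```python
import math
from collections import Counter

def distance(mystery_text, known_text):
    # The post's one-sided rule: sum log(p/q) where the mystery
    # author's word frequency exceeds the known author's.
    m, k = Counter(mystery_text.split()), Counter(known_text.split())
    n_m, n_k = sum(m.values()), sum(k.values())
    score = 0.0
    for word, count in m.items():
        p, q = count / n_m, k.get(word, 0) / n_k
        if q == 0.0:
            q = 0.25 / n_k  # smoothing for unseen words
        if p > q:
            score += math.log(p / q)
    return score

# "huck": small vocabulary; every word also appears in "twain"
huck = "the dog ran and the dog sat and the dog ate"
# "twain": the same story told with a larger vocabulary
twain = ("the dog ran quickly and then sat pondering while "
         "the dog ate heartily beneath an oak")

# huck-as-mystery matches twain, but twain-as-mystery is penalized
# for every word huck never used:
print(distance(huck, twain) < distance(twain, huck))  # True
```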

If Satoshi was focused on the task at hand and had to repeat himself, that is a lot different from Szabo writing complex educational articles with a large vocabulary and pronouns. Satoshi matches Szabo well, but Szabo is not as good a match for Satoshi. Combining the two directions gives Satoshi rank #2 behind "The World is Flat", which has much more common, general language than the others that rank high when running it one way. Szabo, with his larger vocabulary, looked like a larger number of other authors, because Satoshi's language was restricted by his task.

In summary, it is one-sided because it only compares words the mystery author used more often than the known author. I could not get better results any other way in test runs (always excluding Satoshi, Wright, and Szabo). Apparently it needs to assume the mystery author could have written like any of the known authors if he had wanted to, but not vice versa. It measures the degree to which the known authors are not the mystery author, but not vice versa. People are complex, and on some days the mystery author could have been like any of the 60 known authors, but it is less likely that any particular one of the known authors was trying to be like the single mystery author on this particular day. Almost half the data is being thrown out, and I was not able to find a way to use it.

Which words give an author away?
All of them. It measures how much two texts are not similar. There is a long list of words for every pair of authors. Similar texts have a longer list of words with similar frequencies and a shorter list of dissimilar frequencies. But both lists are long, with each word carrying a different weight based on log(countA/countB). Since the samples are the same size, the counts are proportional to the frequencies, so this is the same as log(freqA/freqB).
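Pulling out the heaviest-weighted words makes this concrete. A hedged sketch in Python rather than the post's Perl (helper name and toy texts are my own):

```python
import math
from collections import Counter

def word_weights(mystery_text, known_text):
    """Return (word, weight) pairs, heaviest first, where weight is
    the word's log(p/q) contribution to the one-sided score."""
    m, k = Counter(mystery_text.split()), Counter(known_text.split())
    n_m, n_k = sum(m.values()), sum(k.values())
    weights = []
    for word, count in m.items():
        p, q = count / n_m, k.get(word, 0) / n_k
        if q == 0.0:
            q = 0.25 / n_k  # same smoothing as the scoring loop
        if p > q:
            weights.append((word, math.log(p / q)))
    return sorted(weights, key=lambda w: -w[1])

# A word the known author never uses carries the biggest weight:
mystery = "whale whale whale sea ship sea whale"
known = "ship sea voyage captain ship sea harbor"
print(word_weights(mystery, known)[0][0])  # whale
```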

Description of the test data
All files were at least 250 KB. The Craig Wright file does not include his recent 5 Bitcoin "blog" posts or his DNS pdf paper. The Satoshi file is emails, forum posts, then the white paper; I deleted any paragraphs that looked heavy on code talk. Szabo's files are from his blog going back to 2011 and the 6 or so links from that blog back to his older articles. I made sure not to include his "Trusted Third Parties" article, which is a dead ringer for Satoshi's white paper, or his bitcoin-centric "Dawn of Trustworthy Computing". There were also 4 paragraphs in recent papers that mentioned bitcoin, and I removed them. "Crypto" appears only 3 times in the remaining Szabo sample file. Obvious and common bitcoin-related words can be removed from all files and it usually has no effect, and yet any bitcoin-centric papers will have a large effect.

Craig Wright jumps to the top if I include his 5 new bitcoin articles. If the sample size is reduced below 150 KB, Wright gets ahead of Szabo. This indicates it is not a very accurate program. Anyone with a strong interest in the same area as Satoshi probably has a 50% chance of ranking higher than Szabo.

A couple of interesting notes
Doesn't the last name come first in Japan, so his nickname should be Naka Sato? Craig Wright wrote "bit coin" instead of "bitcoin" in about 4 of the 5 times he mentioned it in his 2011 text. Satoshi never did that. So human smarts can still give a better ranking.

Author Comparison

mystery text: satoshi_all.txt 240000 bytes. 
known directory: books
Words to ignore: 
Using only first 240000 x 1 bytes of known files

First 42991 words from mystery text above and known texts below were compared.

1 = 699 szabo.txt 
2 = 703 What-Technology-Wants.txt _1.txt 
3 = 710 superintelligence_0.txt 
4 = 716 world_is_flat_thomas_friedman_0.txt 
5 = 719 What-Technology-Wants.txt _0.txt 
6 = 722 Richard Dawkins - A Devil's Chaplain.txt 
7 = 723 superintelligence_1.txt 
8 = 724 brown - web of debt part B.txt 
9 = 724 craig_wright.txt 
10 = 727 SAGAN - The Demon-Haunted World part A.txt 
11 = 732 Steven-Pinker-How-the-Mind-Works.txt 
12 = 735 crash_proof.txt 
13 = 737 SAGAN - The Demon-Haunted World part B.txt 
14 = 738 ridley_the_rational_optimist part B.txt 
15 = 738 craig wright pdfs.txt 
16 = 740 world_is_flat_thomas_friedman_1.txt 
17 = 740 rifkin_zero_marginal_society.txt 
18 = 742 ridley_the_rational_optimist part A.txt 
19 = 747 Steven-Pinker-The-Language-Instinct.txt 
20 = 752 Justin Fox - Myth of the Rational Market2.txt 
21 = 755 Rifkin J - The end of work.txt 
23 = 764 RIDLEY genome_autobiography_of_a_species_in_23.txt 
24 = 771 SAGAN - The Cosmic Connection (1973).txt 
25 = 771 SAGAN_pale_blue_dot.txt 
26 = 773 Richard Dawkins - The Selfish Gene.txt 
27 = 783 SAGAN-Cosmos part A.txt 
28 = 783 brown - web of debt part A.txt 
29 = 784 HEINLEIN Stranger in a Strange Land part B.txt 
30 = 785 HEINLEIN Stranger in a Strange Land part A.txt 
31 = 786 RIDLEY The Red Queen part A.txt 
32 = 789 SAGAN-Cosmos part B.txt 
33 = 792 foundation trilogy.txt 
34 = 794 SAGAN The Dragons of Eden.txt 
35 = 794 GREEN - The Elegant Universe (1999).txt 
36 = 795 GREEN The Fabric of the Cosmos.txt 
37 = 796 minsky_emotion_machines.txt 
38 = 796 RIDLEY The Red Queen part B.txt 
39 = 805 HEINLEIN Citizen of the Galaxy.txt 
40 = 811 HEINLEIN Starship Troopers.txt 
41 = 813 wander.txt 
42 = 817 how to analyze people 1921 gutenberg.txt 
43 = 820 twain shorts.txt 
44 = 821 HEINLEIN Have Space Suit.txt 
45 = 822 twain roughing it part B.txt 
46 = 826 works of edgar allen poe volume 4.txt 
47 = 831 freud.txt 
48 = 836 feynman_surely.txt 
49 = 839 twain roughing it part A.txt 
50 = 840 twain innocents abroad part A.txt 
51 = 843 The Defiant Agents - science fiction.txt 
52 = 845 moby-dick part B.txt 
53 = 846 twain innocents abroad part B.txt 
54 = 846 dickens hard times.txt 
55 = 847 samuel-butler_THE WAY OF ALL FLESH.txt 
56 = 847 Catch 22.txt 
57 = 850 Finnegans wake.txt 
58 = 851 Ender's Game.txt 
59 = 854 moby-dick part A.txt 
60 = 854 the social cancer - philipine core reading.txt 
61 = 856 J.K. Rowling Harry Potter Order of the Phoenix part A.txt 
62 = 859 dickens oliver twist part A.txt 
63 = 861 dickens tale of two cities.txt 
64 = 862 J.K. Rowling Harry Potter Order of the Phoenix part B.txt 
65 = 867 ivanhoe.txt 
66 = 870 twain - many works.txt 
67 = 871 dickens oliver twist part B.txt 
68 = 884 dickens david copperfield.txt 
69 = 887 AUSTIN_pride and predjudice.txt 
70 = 890 don quixote.txt 
71 = 896 AUSTIN_sense and sensibility.txt 

Perl code:

Examples of its accuracy on 240 KB samples (bottom ~55 files not shown)

SAGAN The Dragons of Eden.txt

43149 words
1 = 41227.33 = SAGAN - The Demon-Haunted World part A.txt
2 = 41449.17 = SAGAN - The Demon-Haunted World part B.txt
3 = 42541.09 = SAGAN - The Cosmic Connection (1973).txt
4 = 43153.35 = What-Technology-Wants.txt
5 = 43544.19 = SAGAN-Cosmos part B.txt
6 = 43801.02 = Richard Dawkins - A Devil's Chaplain.txt
7 = 44435.02 = SAGAN_pale_blue_dot.txt
8 = 44544.71 = RIDLEY genome_autobiography_of_a_species_in_23.txt
9 = 44608.56 = Steven-Pinker-The-Language-Instinct.txt
10 = 44721.12 = Steven-Pinker-How-the-Mind-Works.txt
11 = 44805.36 = SAGAN - Contact.txt

twain innocents abroad part A.txt
1 = 49152.42 = twain innocents abroad part B.txt
2 = 52359.88 = twain shorts.txt
3 = 52479.89 = twain roughing it part A.txt
4 = 52761.57 = twain roughing it part B.txt
5 = 56852.54 = twain1.txt
6 = 57091.27 = moby-dick part A.txt
7 = 57402.24 = twain4.txt
8 = 57414.97 = works of edgar allen poe volume 4.txt
9 = 57454.54 = twain2.txt
10 = 57494.54 = moby-dick part B.txt
11 = 58166.21 = the social cancer - philipine core reading.txt
12 = 58468.89 = twain - many works.txt
60 = 66056.21 = twain huckleberry_finn.txt WEIRD, nearly last

twain huckleberry_finn.txt
1 = 579.23 = twain huckleberry_finn_0.txt
2 = 27604.16 = twain - many works.txt
3 = 29175.39 = twain roughing it part B.txt
4 = 29268.58 = twain roughing it part A.txt
5 = 29467.46 = twain shorts.txt
6 = 30673.77 = twain2.txt
7 = 30721.36 = moby-dick part A.txt
8 = 31124.08 = twain innocents abroad part A_0.txt
9 = 31124.08 = twain innocents abroad part A.txt
10 = 31465.82 = twain4.txt
11 = 31548.63 = twain innocents abroad part B.txt
12 = 31631.72 = Finnegans wake.txt
13 = 32100.17 = twain1.txt

dickens oliver twist part A.txt
1 = 32084.14 = dickens oliver twist part B.txt
2 = 35266.22 = dickens tale of two cities.txt
3 = 35764.28 = works of edgar allen poe volume 4.txt
4 = 36244.83 = dickens hard times.txt
5 = 36361.77 = twain shorts.txt
6 = 36497.48 = moby-dick part A.txt
7 = 36533.16 = twain roughing it part A.txt
8 = 36781.17 = dickens david copperfield.txt

HEINLEIN Starship Troopers.txt
1 = 37785.89 = HEINLEIN Stranger in a Strange Land part A.txt
2 = 38258.97 = HEINLEIN Stranger in a Strange Land part B.txt
3 = 38914.36 = HEINLEIN Citizen of the Galaxy.txt
5 = 40128.47 = twain shorts.txt
6 = 40552.46 = HEINLEIN Have Space Suit.txt

What Sherlock Holmes, Einstein, Heisenberg, the semantic web, the Null Hypothesis, Atheists, Scientists have in common
Short answer: like this equation, they demonstrate the usefulness of deducing the unlikely or the impossible. They measure what is not true, not what is true. Ruling out the impossible does not mean the alternatives you are aware of are the truth; only an omniscient being can know all the alternative possibilities. So unless you're a God, you can fall victim to this Holmesian fallacy in reasoning. Physics is also based on the null hypothesis, declaring what is not true rather than truths. Einstein said thermodynamics and relativity are based on simple negative statements: there is no perpetual motion machine, and nothing can go faster than light. Heisenberg's uncertainty principle similarly states a negative: it is impossible to know both position and momentum, or time and energy, precisely. The null hypothesis and "not believing" in science are a formal embodiment of this approach. Sherlock Holmes falls victim to the fallacy named after him in the same way the null hypothesis in drug screening can victimize innocents: "When you have eliminated the impossible, whatever remains, however improbable, must be the truth" - Sherlock Holmes. The problem is that the truth includes things you can't imagine, errors in measurement, or other compounds that have the same chemical reaction. Why is the null hypothesis so useful? Because reality is bigger and more complex than we can imagine, bigger than our physics can describe. Likewise, people and their writing are so diverse that it is a lot easier to measure the degree to which two texts are different. The semantic web absolutely depends on fuzzy definitions to be able to describe a world that is bigger than reason. See Gödel's theorem and why mathematicians have trouble walking and talking.
