Nick Szabo does not appear to be Satoshi Nakamoto. In 43 out of 56 instances Satoshi wrote "maybe"; Szabo wrote "may be" in all 25 of his instances. All 7 times, Satoshi wrote "optimisation", a non-American spelling, and he also used "colour", "organise", "analyse", and "synchronise", which Szabo did not.
Shinichi Mochizuki also used "may be" on his web page several times, but not "maybe".
Ross Ulbricht and Hal Finney are/were American.
Vili Lehdonvirta uses British English, but Satoshi ranks an incredibly low 42 against him, and conversely Lehdonvirta ranks a fairly low 19 when Satoshi is the baseline.
I could not find writings by Mike Clear.
The patent by Neal King, Vladimir Oksman, and Charles Bry ranks a respectable 9 against Satoshi, but still below Szabo and Wright.
I'm working on a Windows executable of this and will list it here, above the script, when it is done.
To do a professional job of this, use open source SVMLight with its ranking option.
If you think someone else might be Satoshi, send me 100KB of his writing for a comparison.
I wrote the Perl program below to compare texts of various KNOWN authors to a single "MYSTERY" author. I originally tried Shannon entropy, K-L divergence, and Zipf's law, with various power and constant modifications to them, but the following came out the best. I also tried word pairs, word triples, variable-in-the-middle, and including punctuation and/or capitalization. Word pairs with punctuation worked almost as well. (p/q)^0.2 instead of log(p/q) worked about as well, as it seems to be a similar function. Grouping words into nouns, pronouns, verbs, etc. while ignoring the actual word, and using triples or longer sequences of those categories, would make a good additional measure: a different entropy calculation of the difference that would still have enough data and, being a completely different measure (a different "dimension" of the author), could be summed with this one for better accuracy. But I do not have good word lists separated into those categories.
The equation
Sum the following for each word where the count in the mystery text is greater than the count in the known author's text: log(count_mystery_word/count_known_word). Lowest score wins. The score indicates a naive difference in entropy between the two authors. This is a Shannon entropy difference, aka K-L divergence, which is p*log(p/q), but it appears that because of Zipf's law the p outside the log had a negative effect on identifying authors in my tests, so I left it out, as determined by experimentation. To prevent divide by zero, if count_known = 0 then count_known = 0.25/number of words in the text (half as likely as a word that had a 50% chance of accidentally going unused). Experimentally, 0.25 was nearly optimal, with anything from 1 down to 0.05 not making much difference. Dropping the conditional and just running it straight had little effect, sometimes good, sometimes bad, by up to 5%, but the conditional saves computation. The loop only runs through the mystery words, where the known author's counts may be zero, but not vice versa. To get decent results by looping through the known author's words instead, the words not used by the mystery author must be completely skipped, which is very different from needing to assign a reasonable probability to the missing words in the mystery-author loop. This is explained below.
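As a toy illustration (a made-up word and counts, not from the real data): suppose the mystery text uses "ledger" 12 times and a known author's same-size sample uses it 3 times; that word adds log(12/3) ≈ 1.39 to that author's score. If the known author never used "ledger" and his sample has 40,000 words, his count is floored at 0.25/40000, so the word adds log(12/(0.25/40000)) ≈ 14.5, a heavy penalty for a word he never touched. Words the mystery author uses less often than the known author add nothing.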
The Huckleberry Finn Effect
Looping through all the mystery words but not vice versa has an interesting effect. If you run it on "Huckleberry Finn", it is easy to see it correlates with Mark Twain. But if you run it on other texts by Mark Twain as the mystery text, Huckleberry Finn ranks nearly LAST (see sample below). The larger the text sampled, the bigger the effect, because Huck's vocabulary was smaller than that of all the other authors, who started matching Twain better in large samples. This method is saying Huckleberry Finn is Mark Twain, but Mark Twain is not Huckleberry Finn. Twain wrote Huckleberry Finn in the first person as an endearing, uneducated but smart kid with a limited vocabulary. Finn was not able to get away from the simple words Twain is biased towards, but Twain was able to use words Finn did not know, more like other authors. So there is at least one way authors may not be disguising themselves as well as they think. Except for word pairs: Huck jumped up higher when word pairs were treated like words. Treating common punctuation as words was needed to make word pairs as good as single words, while single words seemed to do worse when punctuation was treated as words.
Satoshi was focused on the task at hand and had to repeat himself, which is a lot different from Szabo writing complex educational articles with a large vocabulary. Satoshi as the mystery text matches Szabo well, but Szabo as the mystery text is not as good a match with Satoshi. Combining the two gives Satoshi rank #2 behind "World is Flat", which has much more common, general language than the other texts that rank high when running it only one way. Szabo, with his larger vocabulary, was more like a larger number of other authors because of Satoshi's restricted task.
In summary, it is one-sided because it only compares words where the mystery author's count is higher than the known author's. I could not get better results any other way in the test runs (always excluding Satoshi, Wright, and Szabo). Apparently it needs to claim the mystery author could have been like any of the known authors if he had wanted, but not vice versa. It is measuring the degree to which the known authors are not the mystery author, but not vice versa. People are complex, and on some days the mystery author could have written like any of the 60 known authors, but it is less likely that any particular one of the known authors was trying to write like the single mystery author on this particular day. Almost half the data is being thrown out; I was not able to find a way to make use of it.
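A toy example of the asymmetry (made-up numbers, not from the actual runs): imagine a "Finn" text whose vocabulary is a subset of a same-length "Twain" text, and suppose Twain's sample contains 500 words Finn never uses, each appearing 5 times in a 40,000-word sample. With Finn as the mystery text, the loop only visits Finn's words, which Twain also uses at similar counts, so few terms qualify and the score stays small. With Twain as the mystery text, each of those 500 missing words gets the 0.25/40000 floor and contributes log(5/(0.25/40000)) ≈ 13.6, adding roughly 6,800 to Finn's score and pushing him toward the bottom of the list, which is exactly the pattern in the Huckleberry Finn rankings below.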
Which words give an author away?
All of them. The method measures how much two texts are not similar, across a long list of words for all authors. Similar texts have a longer list of words with similar frequencies and a shorter list of words with dissimilar frequencies, but both are long lists, with each word carrying a different weight based on log(countA/countB). Since the samples are the same size, the counts are proportional to the frequencies, so this is the same as log(freqA/freqB).
Description of the test data
All files were at least 250 KB. The Craig Wright file does not include his recent 5 Bitcoin "blog" posts or his DNS pdf paper. The Satoshi file is emails, forum posts, then the white paper; I deleted any paragraphs that looked heavy on code talk. Szabo's files are from his blog going back to 2011 and the 6 or so links from that blog back to his older articles. I made sure not to include his "Trusted Third Parties" article, which is a dead ringer for Satoshi's white paper, or his bitcoin-centric "Dawn of trustworthy computing". There were also 4 paragraphs in recent papers that mentioned bitcoin, and I removed them. "Crypto" appears only 3 times in the remaining Szabo sample file. Obvious and common bitcoin-related words can be removed from all files, usually with no effect, and yet any bitcoin-centric papers would have a large effect.
Craig Wright jumps to the top if I include his 5 new bitcoin articles. If the sample size is reduced below 150 KB, Wright gets ahead of Szabo. This indicates it's not a very accurate program. Anyone with a strong interest in the same area as Satoshi probably has a 50% chance of ranking higher than Szabo.
A couple of interesting notes
Doesn't the last name come first in Japan, so his nickname should be Naka Sato? Craig Wright wrote "bit coin" instead of "bitcoin" in about 4 of the 5 times he mentioned it in his 2011 text; Satoshi never did that. So human smarts can give a better ranking.
Author Comparison
mystery text: satoshi_all.txt 240000 bytes.
known directory: books
Words to ignore:
Using only first 240000 x 1 bytes of known files
First 42991 words from mystery text above and known texts below were compared.
1 = 699 szabo.txt
2 = 703 What-Technology-Wants.txt _1.txt
3 = 710 superintelligence_0.txt
4 = 716 world_is_flat_thomas_friedman_0.txt
5 = 719 What-Technology-Wants.txt _0.txt
6 = 722 Richard Dawkins - A Devil's Chaplain.txt
7 = 723 superintelligence_1.txt
8 = 724 brown - web of debt part B.txt
9 = 724 craig_wright.txt
10 = 727 SAGAN - The Demon-Haunted World part A.txt
11 = 732 Steven-Pinker-How-the-Mind-Works.txt
12 = 735 crash_proof.txt
13 = 737 SAGAN - The Demon-Haunted World part B.txt
14 = 738 ridley_the_rational_optimist part B.txt
15 = 738 craig wright pdfs.txt
16 = 740 world_is_flat_thomas_friedman_1.txt
17 = 740 rifkin_zero_marginal_society.txt
18 = 742 ridley_the_rational_optimist part A.txt
19 = 747 Steven-Pinker-The-Language-Instinct.txt
20 = 752 Justin Fox - Myth of the Rational Market2.txt
21 = 755 Rifkin J - The end of work.txt
22 = 763 HEINLEIN THE MOON IS A HARSH MISTRESS.txt
23 = 764 RIDLEY genome_autobiography_of_a_species_in_23.txt
24 = 771 SAGAN - The Cosmic Connection (1973).txt
25 = 771 SAGAN_pale_blue_dot.txt
26 = 773 Richard Dawkins - The Selfish Gene.txt
27 = 783 SAGAN-Cosmos part A.txt
28 = 783 brown - web of debt part A.txt
29 = 784 HEINLEIN Stranger in a Strange Land part B.txt
30 = 785 HEINLEIN Stranger in a Strange Land part A.txt
31 = 786 RIDLEY The Red Queen part A.txt
32 = 789 SAGAN-Cosmos part B.txt
33 = 792 foundation trilogy.txt
34 = 794 SAGAN The Dragons of Eden.txt
35 = 794 GREEN - The Elegant Universe (1999).txt
36 = 795 GREEN The Fabric of the Cosmos.txt
37 = 796 minsky_emotion_machines.txt
38 = 796 RIDLEY The Red Queen part B.txt
39 = 805 HEINLEIN Citizen of the Galaxy.txt
40 = 811 HEINLEIN Starship Troopers.txt
41 = 813 wander.txt
42 = 817 how to analyze people 1921 gutenberg.txt
43 = 820 twain shorts.txt
44 = 821 HEINLEIN Have Space Suit.txt
45 = 822 twain roughing it part B.txt
46 = 826 works of edgar allen poe volume 4.txt
47 = 831 freud.txt
48 = 836 feynman_surely.txt
49 = 839 twain roughing it part A.txt
50 = 840 twain innocents abroad part A.txt
51 = 843 The Defiant Agents - science fiction.txt
52 = 845 moby-dick part B.txt
53 = 846 twain innocents abroad part B.txt
54 = 846 dickens hard times.txt
55 = 847 samuel-butler_THE WAY OF ALL FLESH.txt
56 = 847 Catch 22.txt
57 = 850 Finnegans wake.txt
58 = 851 Ender's Game.txt
59 = 854 moby-dick part A.txt
60 = 854 the social cancer - philipine core reading.txt
61 = 856 J.K. Rowling Harry Potter Order of the Phoenix part A.txt
62 = 859 dickens oliver twist part A.txt
63 = 861 dickens tale of two cities.txt
64 = 862 J.K. Rowling Harry Potter Order of the Phoenix part B.txt
65 = 867 ivanhoe.txt
66 = 870 twain - many works.txt
67 = 871 dickens oliver twist part B.txt
68 = 884 dickens david copperfield.txt
69 = 887 AUSTIN_pride and predjudice.txt
70 = 890 don quixote.txt
71 = 896 AUSTIN_sense and sensibility.txt
Perl code:
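Below is a minimal sketch of the scoring and ranking described above, not the original program: the books directory and the 240000-byte cutoff follow the run shown above, while the tokenization and file handling are my own simplifications.

#!/usr/bin/perl
# Rank every known text in a directory against a single mystery text,
# using the one-sided log(count_mystery/count_known) score described above.
use strict;
use warnings;

my $mystery_file = shift @ARGV or die "usage: $0 mystery.txt [books_dir]\n";
my $dir          = shift @ARGV || 'books';
my $max_bytes    = 240000;    # use only the first 240000 bytes of each file

# Read a file, lowercase it, and return a hash of word counts plus the total.
sub word_counts {
    my ($file) = @_;
    open my $fh, '<', $file or die "cannot open $file: $!";
    my $text = '';
    read $fh, $text, $max_bytes;
    close $fh;
    $text = lc $text;
    my %count;
    my $total = 0;
    for my $w ($text =~ /[a-z']+/g) {
        $count{$w}++;
        $total++;
    }
    return (\%count, $total);
}

my ($mystery, $m_total) = word_counts($mystery_file);

my @results;
opendir my $dh, $dir or die "cannot open $dir: $!";
for my $file (sort grep { /\.txt$/ } readdir $dh) {
    my ($known, $k_total) = word_counts("$dir/$file");
    my $score = 0;
    # One-sided: loop over the mystery author's words only.
    for my $w (keys %$mystery) {
        my $p = $mystery->{$w};
        my $q = $known->{$w} // 0.25 / $k_total;    # floor for words the known author never used
        $score += log($p / $q) if $p > $q;
    }
    push @results, [$score, $file];
}
closedir $dh;

# Lowest score = most similar to the mystery text.
my $rank = 0;
for my $r (sort { $a->[0] <=> $b->[0] } @results) {
    printf "%d = %.2f = %s\n", ++$rank, $r->[0], $r->[1];
}

Save it as, say, rank.pl and run "perl rank.pl satoshi_all.txt books"; the lowest-scoring file is the closest match.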
Examples of its accuracy on 240kb (bottom ~55 files not shown)
SAGAN The Dragons of Eden.txt
43149 words
1 = 41227.33 = SAGAN - The Demon-Haunted World part A.txt
2 = 41449.17 = SAGAN - The Demon-Haunted World part B.txt
3 = 42541.09 = SAGAN - The Cosmic Connection (1973).txt
4 = 43153.35 = What-Technology-Wants.txt
5 = 43544.19 = SAGAN-Cosmos part B.txt
6 = 43801.02 = Richard Dawkins - A Devil's Chaplain.txt
7 = 44435.02 = SAGAN_pale_blue_dot.txt
8 = 44544.71 = RIDLEY genome_autobiography_of_a_species_in_23.txt
9 = 44608.56 = Steven-Pinker-The-Language-Instinct.txt
10 = 44721.12 = Steven-Pinker-How-the-Mind-Works.txt
11 = 44805.36 = SAGAN - Contact.txt
twain innocents abroad part A.txt
1 = 49152.42 = twain innocents abroad part B.txt
2 = 52359.88 = twain shorts.txt
3 = 52479.89 = twain roughing it part A.txt
4 = 52761.57 = twain roughing it part B.txt
5 = 56852.54 = twain1.txt
6 = 57091.27 = moby-dick part A.txt
7 = 57402.24 = twain4.txt
8 = 57414.97 = works of edgar allen poe volume 4.txt
9 = 57454.54 = twain2.txt
10 = 57494.54 = moby-dick part B.txt
11 = 58166.21 = the social cancer - philipine core reading.txt
12 = 58468.89 = twain - many works.txt
....
60 = 66056.21 = twain huckleberry_finn.txt WEIRD, nearly last
twain huckleberry_finn.txt
1 = 579.23 = twain huckleberry_finn_0.txt
2 = 27604.16 = twain - many works.txt
3 = 29175.39 = twain roughing it part B.txt
4 = 29268.58 = twain roughing it part A.txt
5 = 29467.46 = twain shorts.txt
6 = 30673.77 = twain2.txt
7 = 30721.36 = moby-dick part A.txt
8 = 31124.08 = twain innocents abroad part A_0.txt
9 = 31124.08 = twain innocents abroad part A.txt
10 = 31465.82 = twain4.txt
11 = 31548.63 = twain innocents abroad part B.txt
12 = 31631.72 = Finnegans wake.txt
13 = 32100.17 = twain1.txt
dickens oliver twist part A.txt
1 = 32084.14 = dickens oliver twist part B.txt
2 = 35266.22 = dickens tale of two cities.txt
3 = 35764.28 = works of edgar allen poe volume 4.txt
4 = 36244.83 = dickens hard times.txt
5 = 36361.77 = twain shorts.txt
6 = 36497.48 = moby-dick part A.txt
7 = 36533.16 = twain roughing it part A.txt
8 = 36781.17 = dickens david copperfield.txt
HEINLEIN Starship Troopers.txt
1 = 37785.89 = HEINLEIN Stranger in a Strange Land part A.txt
2 = 38258.97 = HEINLEIN Stranger in a Strange Land part B.txt
3 = 38914.36 = HEINLEIN Citizen of the Galaxy.txt
4 = 39771.37 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt
5 = 40128.47 = twain shorts.txt
6 = 40552.46 = HEINLEIN Have Space Suit.txt
What Sherlock Holmes, Einstein, Heisenberg, the semantic web, the Null Hypothesis, Atheists, Scientists have in common
Short answer: like this equation, they demonstrate the usefulness of deducing the unlikely or the impossible. They measure what is not true, not what is true. Ruling out the impossible does not mean the alternatives you are aware of are the truth. Only an omniscient being can know all the alternative possibilities, so unless you're a god, you can fall victim to this Holmesian fallacy in reasoning. Physics is also based on the null hypothesis, declaring what is not true rather than truths: Einstein said thermodynamics and relativity are based on simple negative statements, that there is no perpetual motion machine and nothing can go faster than light. Heisenberg's uncertainty principle similarly states a negative, that it is impossible to precisely state both position and momentum, or time and energy. The null hypothesis and "not believing" in science are a formal embodiment of this approach. Sherlock Holmes falls victim to the fallacy named after him in the same way the null hypothesis in drug screening can victimize innocents: "When you have eliminated the impossible, whatever remains, however improbable, must be the truth" - Sherlock Holmes. The problem is that the truth includes things you can't imagine, errors in measurement, or other compounds that produce the same chemical reaction. Why is the null hypothesis so useful? Because reality is bigger and more complex than we can imagine, bigger than our physics can describe. Likewise, people and their writing are so diverse that it is a lot easier to measure the degree to which two texts are different. The semantic web absolutely depends on fuzzy definitions to be able to describe a world that is bigger than reason. See Gödel's theorem and why mathematicians have trouble walking and talking.