Here is a demonstration of how accurately the entropy difference program detects authors. The full listing at the bottom shows everything it was tested against: about 70 books or collections of text by roughly 40 authors. Each result line reads "rank = score = file", and lower scores are closer matches. This method isn't the best; it's just the one I was able to program easily.
Note: the average rank (taken over all texts by the correct author) does not penalize a successful detection for coming after an earlier correct result. For example, if the correct author appears at ranks 1 and 2, the average rank is 1, not 1.5. If the correct author appears at ranks 1 and 3, the average rank is 1.5.
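The averaging rule in the note can be sketched in a few lines of Python. This is my reading of the note, not the program's actual code: each correct hit has the number of earlier correct hits subtracted from its rank before averaging. The function name `adjusted_average_rank` is made up for illustration. Under this reading, the sketch reproduces the "average rank" values in the listings below.

```python
def adjusted_average_rank(correct_ranks):
    """Average rank of the correct author's texts, discounting earlier hits.

    correct_ranks: the 1-based ranks at which the correct author's texts
    appeared in the sorted result list, e.g. [1, 3].
    Assumption: a hit's rank is reduced by the number of correct hits
    that ranked above it, so hits at ranks 1 and 2 both count as rank 1.
    """
    ranks = sorted(correct_ranks)
    if not ranks:
        raise ValueError("no correct-author texts were ranked")
    # enumerate() gives 0, 1, 2, ... which is exactly the number of
    # earlier correct hits for each rank in sorted order.
    adjusted = [rank - earlier_hits for earlier_hits, rank in enumerate(ranks)]
    return sum(adjusted) / len(adjusted)


# The two examples from the note.
print(adjusted_average_rank([1, 2]))  # 1.0
print(adjusted_average_rank([1, 3]))  # 1.5
```

For instance, the Heinlein test below has correct hits at ranks 1, 2, 3 and 5, which adjust to 1, 1, 1 and 2 and average to 1.25.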
file size: 215000
AUSTIN_pride and predjudice.txt
44982 words
1 = 31525.19 = AUSTIN_sense and sensibility.txt
2 = 33416.65 = samuel-butler_THE WAY OF ALL FLESH.txt
average rank: 1
SAGAN-Cosmos part A.txt
1 = 34710.05 = SAGAN - The Cosmic Connection (1973).txt
2 = 34786.39 = SAGAN_pale_blue_dot.txt
3 = 34803.09 = SAGAN-Cosmos part B.txt
4 = 35908.95 = SAGAN - The Demon-Haunted World part A.txt
5 = 35923.25 = SAGAN - The Demon-Haunted World part B.txt
6 = 35936.53 = SAGAN The Dragons of Eden.txt
7 = 36111.48 = RIDLEY genome_autobiography_of_a_species_in_23.txt
8 = 36249 = Richard Dawkins - A Devil's Chaplain.txt
9 = 36286.77 = SAGAN - Contact.txt #### as expected, harder to detect when he changed genre
average rank: 1.29
HEINLEIN Have Space Suit.txt
1 = 36428.16 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt
2 = 36771.15 = HEINLEIN Starship Troopers.txt
3 = 37019.53 = HEINLEIN Citizen of the Galaxy.txt
4 = 37223.25 = feynman_surely.txt
5 = 37377.34 = HEINLEIN Stranger in a Strange Land part A.txt
average rank: 1.25
dickens david copperfield.txt
1 = 34040.58 = dickens oliver twist part B.txt
2 = 34500.62 = dickens hard times.txt
3 = 34527.19 = dickens oliver twist part A.txt
4 = 34753.25 = dickens tale of two cities.txt
average rank: 1
twain innocents abroad part A.txt
1 = 37419.03 = twain roughing it part A.txt
2 = 37750.68 = twain4.txt
3 = 37762.04 = twain2.txt
4 = 37781.56 = twain shorts.txt
5 = 38164.64 = samuel-butler_THE WAY OF ALL FLESH.txt
6 = 38182.86 = twain - many works.txt
7 = 38192.57 = moby-dick part A.txt
8 = 38319.44 = dickens tale of two cities.txt
9 = 38375.98 = twain1.txt
average rank: 1.67
Rifkin J - The end of work.txt
1 = 1.95 = Rifkin J - The end of work.txt === oops it wasn't supposed to look at itself
2 = 32438.31 = rifkin_zero_marginal_society.txt
3 = 33556.3 = crash_proof.txt
4 = 33559.14 = brown - web of debt part B.txt
5 = 33650.69 = ridley_the_rational_optimist part B.txt
average rank: 1
RIDLEY The Red Queen part A.txt
1 = 35597.01 = RIDLEY The Red Queen part B.txt
2 = 35813.56 = Richard Dawkins - The Selfish Gene.txt
3 = 35853.03 = RIDLEY genome_autobiography_of_a_species_in_23.txt
4 = 36446.74 = Richard Dawkins - A Devil's Chaplain.txt
5 = 36564.11 = ridley_the_rational_optimist part A.txt
6 = 36670.65 = Steven-Pinker-How-the-Mind-Works.txt
7 = 36897.94 = Steven-Pinker-The-Language-Instinct.txt
8 = 36920.53 = SAGAN The Dragons of Eden.txt
9 = 36937.17 = SAGAN - The Demon-Haunted World part B.txt
10 = 36990.41 = What-Technology-Wants.txt _1.txt
11 = 37061.92 = What-Technology-Wants.txt
12 = 37061.92 = What-Technology-Wants.txt _0.txt
13 = 37115.46 = SAGAN_pale_blue_dot.txt
14 = 37124.37 = SAGAN - The Cosmic Connection (1973).txt
15 = 37197.16 = ridley_the_rational_optimist part B.txt ##### I bet he did not write this!!!!
average rank: 4.5
GREEN The Fabric of the Cosmos.txt
1 = 34597.33 = GREEN - The Elegant Universe (1999).txt
2 = 36513.55 = SAGAN_pale_blue_dot.txt
3 = 36741.75 = Richard Dawkins - A Devil's Chaplain.txt
4 = 36746.03 = SAGAN - The Demon-Haunted World part B.txt
average rank: 1
Richard Dawkins - A Devil's Chaplain.txt
1 = 35714.35 = Richard Dawkins - The Selfish Gene.txt
2 = 36146.66 = RIDLEY genome_autobiography_of_a_species_in_23.txt
3 = 36297.12 = SAGAN - The Demon-Haunted World part B.txt
4 = 36367.93 = RIDLEY The Red Queen part A.txt
average rank: 1
file size: 215000
satoshi_all.txt
43168 words
1 = 35144.47 = wei dai.txt
2 = 35756.13 = world_is_flat_thomas_friedman.txt
3 = 35856.63 = adam back.txt
Note: Back ranks higher here because this version of the program tries to clean up page headings and, in doing so, deletes text from the various author files. The public version of the program is the "official" one. It needs cleaner data files than these books provide, and it is more accurate.
4 = 35905.54 = feynman_surely.txt
5 = 35977.79 = HEINLEIN Starship Troopers.txt
6 = 36101.18 = Richard Dawkins - A Devil's Chaplain.txt
7 = 36148.95 = What-Technology-Wants.txt _1.txt
8 = 36222.48 = Richard Dawkins - The Selfish Gene.txt
9 = 36303.8 = minsky_emotion_machines.txt
10 = 36305.12 = SAGAN - The Demon-Haunted World part B.txt
11 = 36337.96 = wander.txt
12 = 36363.81 = Steven-Pinker-How-the-Mind-Works.txt
13 = 36369.19 = SAGAN - Contact.txt
14 = 36393.73 = What-Technology-Wants.txt
15 = 36395.12 = What-Technology-Wants.txt _0.txt
16 = 36422.13 = foundation trilogy.txt
17 = 36482.69 = szabo.txt
18 = 36493.72 = Steven-Pinker-The-Language-Instinct.txt
19 = 36497.31 = SAGAN - The Demon-Haunted World part A.txt
20 = 36498.81 = SAGAN_pale_blue_dot.txt
21 = 36500.73 = Ender's Game.txt
22 = 36525.42 = HEINLEIN Citizen of the Galaxy.txt
23 = 36560.55 = RIDLEY The Red Queen part A.txt
24 = 36578.08 = craig_wright.txt
25 = 36603.95 = HEINLEIN Stranger in a Strange Land part A.txt
26 = 36614.03 = superintelligence_1.txt
27 = 36614.54 = RIDLEY genome_autobiography_of_a_species_in_23.txt
28 = 36623.71 = twain2.txt
29 = 36638.3 = GREEN The Fabric of the Cosmos.txt
30 = 36648.49 = crash_proof.txt
31 = 36693.56 = ridley_the_rational_optimist part A.txt
32 = 36698.03 = superintelligence_0.txt
33 = 36698.03 = superintelligence.txt
34 = 36706.54 = twain4.txt
35 = 36748.56 = samuel-butler_THE WAY OF ALL FLESH.txt
36 = 36777.58 = GREEN - The Elegant Universe (1999).txt
37 = 36818.65 = SAGAN - The Cosmic Connection (1973).txt
38 = 36905.35 = how to analyze people 1921 gutenberg.txt
39 = 36939.2 = twain shorts.txt
40 = 36946.28 = ridley_the_rational_optimist part B.txt
41 = 36947.92 = HEINLEIN Have Space Suit.txt
42 = 36979.58 = freud.txt
43 = 37040.28 = brown - web of debt part B.txt
44 = 37042.04 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt
45 = 37060.32 = twain innocents abroad part A.txt
46 = 37089.71 = RIDLEY The Red Queen part B.txt
47 = 37097.98 = twain - many works.txt
48 = 37120.54 = SAGAN-Cosmos part B.txt
49 = 37150.83 = the social cancer - philipine core reading.txt
50 = 37166.94 = SAGAN The Dragons of Eden.txt
51 = 37176.04 = twain roughing it part A.txt
52 = 37188.02 = SAGAN-Cosmos part A.txt
53 = 37191.7 = dickens david copperfield.txt
54 = 37198.59 = The Defiant Agents - science fiction.txt
55 = 37202.43 = dickens oliver twist part B.txt
56 = 37205.45 = Catch 22.txt
57 = 37218.81 = AUSTIN_sense and sensibility.txt
58 = 37219.02 = moby-dick part A.txt
59 = 37230.43 = Justin Fox - Myth of the Rational Market2.txt
60 = 37249.28 = dickens tale of two cities.txt
61 = 37306.7 = AUSTIN_pride and predjudice.txt
62 = 37307.58 = works of edgar allen poe volume 4.txt
63 = 37309.23 = dickens hard times.txt
64 = 37320.73 = brown - web of debt part A.txt
65 = 37353.66 = moby-dick part B.txt
66 = 37408.09 = don quixote.txt
67 = 37419.12 = twain1.txt
68 = 37439.09 = rifkin_zero_marginal_society.txt
69 = 37439.73 = dickens oliver twist part A.txt
70 = 37719.14 = Rifkin J - The end of work.txt
71 = 37889.68 = J.K. Rowling Harry Potter Order of the Phoenix part A.txt
72 = 37899.77 = J.K. Rowling Harry Potter Order of the Phoenix part B.txt
73 = 37930.78 = craig wright pdfs.txt
74 = 37998.75 = Finnegans wake.txt
75 = 38169.34 = ivanhoe.txt