Sunday, May 8, 2016

Accuracy of Author Detection

Here is a demonstration of how accurately the entropy difference program detects authors. The listing at the bottom shows the full set of texts it was tested against: about 70 books or collections of text by roughly 40 authors. This method isn't the best; it's just the one I was able to program easily.

Note: the average rank (taken over all the texts by the correct author) does not penalize a correct detection for having earlier correct results ranked above it. For example, if the correct author is spotted at ranks 1 and 2, the average correct rank is 1, not 1.5. If the correct author is spotted at ranks 1 and 3, the average rank is 1.5.
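One reading of this rule that matches the note's two examples and the averages in the listings: each correct hit is charged only for the wrong-author texts ranked above it, so its rank is reduced by the number of correct hits already seen. The sketch below is an illustration under that assumption, not the program's actual code; the function name `average_correct_rank` is made up for this example.

```python
def average_correct_rank(ranks):
    """Average rank of the correct author's texts, without penalizing a hit
    for earlier correct hits above it.

    ranks: 1-based positions at which the correct author's texts appeared.
    Assumed rule (inferred from the note, not taken from the program):
    adjusted rank = rank - number of correct hits ranked above it.
    """
    if not ranks:
        raise ValueError("need at least one rank")
    ordered = sorted(ranks)
    # enumerate() gives the count of correct hits already seen (0, 1, 2, ...)
    adjusted = [rank - earlier_hits for earlier_hits, rank in enumerate(ordered)]
    return sum(adjusted) / len(adjusted)


print(average_correct_rank([1, 2]))  # 1.0
print(average_correct_rank([1, 3]))  # 1.5
print(round(average_correct_rank([1, 2, 3, 4, 5, 6, 9]), 2))  # 1.29
```

The last call uses the ranks at which the SAGAN texts appear in the Cosmos part A listing (1 through 6, then 9) and reproduces the 1.29 reported there.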


file size: 215000
AUSTIN_pride and predjudice.txt

44982 words
1 = 31525.19 = AUSTIN_sense and sensibility.txt 
2 = 33416.65 = samuel-butler_THE WAY OF ALL FLESH.txt
average rank: 1

SAGAN-Cosmos part A.txt
1 = 34710.05 = SAGAN - The Cosmic Connection (1973).txt 
2 = 34786.39 = SAGAN_pale_blue_dot.txt 
3 = 34803.09 = SAGAN-Cosmos part B.txt 
4 = 35908.95 = SAGAN - The Demon-Haunted World part A.txt 
5 = 35923.25 = SAGAN - The Demon-Haunted World part B.txt 
6 = 35936.53 = SAGAN The Dragons of Eden.txt 
7 = 36111.48 = RIDLEY genome_autobiography_of_a_species_in_23.txt
8 = 36249 = Richard Dawkins - A Devil's Chaplain.txt
9 = 36286.77 = SAGAN - Contact.txt   #### as expected, harder to detect when he changed genre

average rank: 1.29

HEINLEIN Have Space Suit.txt
1 = 36428.16 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt 
2 = 36771.15 = HEINLEIN Starship Troopers.txt 
3 = 37019.53 = HEINLEIN Citizen of the Galaxy.txt 
4 = 37223.25 = feynman_surely.txt
5 = 37377.34 = HEINLEIN Stranger in a Strange Land part A.txt 


average rank: 1.25

dickens david copperfield.txt
1 = 34040.58 = dickens oliver twist part B.txt 
2 = 34500.62 = dickens hard times.txt 
3 = 34527.19 = dickens oliver twist part A.txt 
4 = 34753.25 = dickens tale of two cities.txt 


average rank: 1

twain innocents abroad part A.txt
1 = 37419.03 = twain roughing it part A.txt 
2 = 37750.68 = twain4.txt 
3 = 37762.04 = twain2.txt 
4 = 37781.56 = twain shorts.txt 
5 = 38164.64 = samuel-butler_THE WAY OF ALL FLESH.txt
6 = 38182.86 = twain - many works.txt 
7 = 38192.57 = moby-dick part A.txt
8 = 38319.44 = dickens tale of two cities.txt
9 = 38375.98 = twain1.txt 
average rank: 1.67

Rifkin J - The end of work.txt
1 = 1.95 = Rifkin J - The end of work.txt  === oops it wasn't supposed to look at itself
2 = 32438.31 = rifkin_zero_marginal_society.txt 
3 = 33556.3 = crash_proof.txt
4 = 33559.14 = brown - web of debt part B.txt
5 = 33650.69 = ridley_the_rational_optimist part B.txt
average rank: 1

RIDLEY The Red Queen part A.txt
1 = 35597.01 = RIDLEY The Red Queen part B.txt 
2 = 35813.56 = Richard Dawkins - The Selfish Gene.txt
3 = 35853.03 = RIDLEY genome_autobiography_of_a_species_in_23.txt 
4 = 36446.74 = Richard Dawkins - A Devil's Chaplain.txt
5 = 36564.11 = ridley_the_rational_optimist part A.txt 
6 = 36670.65 = Steven-Pinker-How-the-Mind-Works.txt
7 = 36897.94 = Steven-Pinker-The-Language-Instinct.txt
8 = 36920.53 = SAGAN The Dragons of Eden.txt
9 = 36937.17 = SAGAN - The Demon-Haunted World part B.txt
10 = 36990.41 = What-Technology-Wants.txt _1.txt
11 = 37061.92 = What-Technology-Wants.txt
12 = 37061.92 = What-Technology-Wants.txt _0.txt
13 = 37115.46 = SAGAN_pale_blue_dot.txt
14 = 37124.37 = SAGAN - The Cosmic Connection (1973).txt
15 = 37197.16 = ridley_the_rational_optimist part B.txt   ##### I bet he did not write this!!!!

average rank: 4.5

GREEN The Fabric of the Cosmos.txt
1 = 34597.33 = GREEN - The Elegant Universe (1999).txt 
2 = 36513.55 = SAGAN_pale_blue_dot.txt
3 = 36741.75 = Richard Dawkins - A Devil's Chaplain.txt
4 = 36746.03 = SAGAN - The Demon-Haunted World part B.txt
average rank: 1

Richard Dawkins - A Devil's Chaplain.txt
1 = 35714.35 = Richard Dawkins - The Selfish Gene.txt 
2 = 36146.66 = RIDLEY genome_autobiography_of_a_species_in_23.txt
3 = 36297.12 = SAGAN - The Demon-Haunted World part B.txt
4 = 36367.93 = RIDLEY The Red Queen part A.txt
average rank: 1

file size: 215000
satoshi_all.txt

43168 words
1 = 35144.47 = wei dai.txt 
2 = 35756.13 = world_is_flat_thomas_friedman.txt 
3 = 35856.63 = adam back.txt  

Note: Back ranks higher here because this version of the program tries to clean up page headings, and in doing so it deletes material from the various author files. The public version of the program is the "official" one. It needs cleaner data files than all these books, and it is more accurate.

4 = 35905.54 = feynman_surely.txt 
5 = 35977.79 = HEINLEIN Starship Troopers.txt 
6 = 36101.18 = Richard Dawkins - A Devil's Chaplain.txt 
7 = 36148.95 = What-Technology-Wants.txt _1.txt 
8 = 36222.48 = Richard Dawkins - The Selfish Gene.txt 
9 = 36303.8 = minsky_emotion_machines.txt 
10 = 36305.12 = SAGAN - The Demon-Haunted World part B.txt 
11 = 36337.96 = wander.txt 
12 = 36363.81 = Steven-Pinker-How-the-Mind-Works.txt 
13 = 36369.19 = SAGAN - Contact.txt 
14 = 36393.73 = What-Technology-Wants.txt 
15 = 36395.12 = What-Technology-Wants.txt _0.txt 
16 = 36422.13 = foundation trilogy.txt 
17 = 36482.69 = szabo.txt 
18 = 36493.72 = Steven-Pinker-The-Language-Instinct.txt 
19 = 36497.31 = SAGAN - The Demon-Haunted World part A.txt 
20 = 36498.81 = SAGAN_pale_blue_dot.txt 
21 = 36500.73 = Ender's Game.txt 
22 = 36525.42 = HEINLEIN Citizen of the Galaxy.txt 
23 = 36560.55 = RIDLEY The Red Queen part A.txt 
24 = 36578.08 = craig_wright.txt 
25 = 36603.95 = HEINLEIN Stranger in a Strange Land part A.txt 
26 = 36614.03 = superintelligence_1.txt 
27 = 36614.54 = RIDLEY genome_autobiography_of_a_species_in_23.txt 
28 = 36623.71 = twain2.txt 
29 = 36638.3 = GREEN The Fabric of the Cosmos.txt 
30 = 36648.49 = crash_proof.txt 
31 = 36693.56 = ridley_the_rational_optimist part A.txt 
32 = 36698.03 = superintelligence_0.txt 
33 = 36698.03 = superintelligence.txt 
34 = 36706.54 = twain4.txt 
35 = 36748.56 = samuel-butler_THE WAY OF ALL FLESH.txt 
36 = 36777.58 = GREEN - The Elegant Universe (1999).txt 
37 = 36818.65 = SAGAN - The Cosmic Connection (1973).txt 
38 = 36905.35 = how to analyze people 1921 gutenberg.txt 
39 = 36939.2 = twain shorts.txt 
40 = 36946.28 = ridley_the_rational_optimist part B.txt 
41 = 36947.92 = HEINLEIN Have Space Suit.txt 
42 = 36979.58 = freud.txt 
43 = 37040.28 = brown - web of debt part B.txt 
44 = 37042.04 = HEINLEIN THE MOON IS A HARSH MISTRESS.txt 
45 = 37060.32 = twain innocents abroad part A.txt 
46 = 37089.71 = RIDLEY The Red Queen part B.txt 
47 = 37097.98 = twain - many works.txt 
48 = 37120.54 = SAGAN-Cosmos part B.txt 
49 = 37150.83 = the social cancer - philipine core reading.txt 
50 = 37166.94 = SAGAN The Dragons of Eden.txt 
51 = 37176.04 = twain roughing it part A.txt 
52 = 37188.02 = SAGAN-Cosmos part A.txt 
53 = 37191.7 = dickens david copperfield.txt 
54 = 37198.59 = The Defiant Agents - science fiction.txt 
55 = 37202.43 = dickens oliver twist part B.txt 
56 = 37205.45 = Catch 22.txt 
57 = 37218.81 = AUSTIN_sense and sensibility.txt 
58 = 37219.02 = moby-dick part A.txt 
59 = 37230.43 = Justin Fox - Myth of the Rational Market2.txt 
60 = 37249.28 = dickens tale of two cities.txt 
61 = 37306.7 = AUSTIN_pride and predjudice.txt 
62 = 37307.58 = works of edgar allen poe volume 4.txt 
63 = 37309.23 = dickens hard times.txt 
64 = 37320.73 = brown - web of debt part A.txt 
65 = 37353.66 = moby-dick part B.txt 
66 = 37408.09 = don quixote.txt 
67 = 37419.12 = twain1.txt 
68 = 37439.09 = rifkin_zero_marginal_society.txt 
69 = 37439.73 = dickens oliver twist part A.txt 
70 = 37719.14 = Rifkin J - The end of work.txt 
71 = 37889.68 = J.K. Rowling Harry Potter Order of the Phoenix part A.txt 
72 = 37899.77 = J.K. Rowling Harry Potter Order of the Phoenix part B.txt 
73 = 37930.78 = craig wright pdfs.txt 
74 = 37998.75 = Finnegans wake.txt 
75 = 38169.34 = ivanhoe.txt 
