Tuesday, May 29, 2018

Using coin fees to internally evolve intelligence and law


Posted to Zcash grant proposal issue

Each message (sent in the memo field of BTC-like coins) could be news or a proposed law. The community would automatically elect to finance or starve the writer or the proposed law by directing a required fee from each recipient. The fee is mandatory, but each recipient selects whether it goes to the sender of the message (or to finance the law), is destroyed, is sent to devs, or is sent back to the chain (giving the protocol permission to issue more coin in the future, enabling a cycling of coin through an internal economy).

Senders of news must also pay to send messages, with similar options. They would remove recipients who elected not to direct the fee to the sender. There would be a cap on messages per block, so only the highest-paying senders get to send. The result is an evolution towards the "best" (most popular) news or other information.
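A minimal sketch of the two mechanisms above (recipient-directed fees and the per-block cap on senders), using hypothetical names (FeeDestination, Message, select_messages_for_block) that are not part of any existing coin's code:

from enum import Enum
from dataclasses import dataclass

class FeeDestination(Enum):
    SENDER = 1      # finance the writer or the proposed law
    DESTROY = 2     # burn the fee
    DEVS = 3        # send it to the developers
    CHAIN = 4       # return it to the chain for future re-issuance

@dataclass
class Message:
    sender: str
    fee: float      # fee paid by the sender to get into the block
    memo: str       # news item, proposed law, or a decryption key

def route_recipient_fee(fee: float, choice: FeeDestination) -> dict:
    """Each recipient pays a required fee but chooses where it goes."""
    return {choice.name: fee}

def select_messages_for_block(pending: list[Message], cap: int) -> list[Message]:
    """Only the highest-paying senders fit under the per-block message cap."""
    return sorted(pending, key=lambda m: m.fee, reverse=True)[:cap]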

The coin would have a reason for existing that is not threatened by any other coin. There would have to be a different coin for each of the many different sources of information because the scheme needs 2 tx per recipient per message. The message would be the recipient's decryption key for an RSA-encrypted article somewhere on the internet.

This could be used to select and finance a president anonymously based on the quality of his "orations". He would reveal himself at the beginning of a coup that he plans, and possibly finances via the fees received.

This is representation in exchange for taxation, automatically selecting the best representation.

The fees would need to be expressed in constant value, which can be determined accurately over years from the coin's difficulty plus a bounded estimator that provides a Moore's-law adjustment, provided the POW does not change and no algorithmic breakthroughs occur.

Wednesday, April 11, 2018

POW idea optimized for cell phones and CPUs

Cell phones should be able to run the following algorithm more efficiently than ASICs and FPGAs, at about the same electrical cost as GPUs and CPUs ... if you can write a routine that creates a general class of functions from a seed with the following characteristics:

1) Input size equal to output size, say 64 bits.
2) The class is complex enough to not be optimizable as a class (hence ASICs and FPGAs are not useful)
3) The class has an upper and lower bound on computation time for a given "CPU"

The last item seems to require a similar distribution of operations and a restricted way of using them. This would make achieving the 2nd to last item more difficult. To relax the computation time bounds in the last item without risking miners abandoning slow functions and going to the next nonce, the nonces must be in sequence.

The POW solution is then:


myhash = 0xFF...FF   # 32 bytes of 0xFF, i.e. above any possible target
while ( myhash >= target ) {
   nonce++
   j = hash( previous_block_hash + nonce )
   output = j                           # seed the recursion
   for ( i = 1 to 100 ) {
       j++
       function = generate_function( hash( j ) )   # a new random function each pass
       for ( k = 1 to 100,000 ) {
           input = output
           output = function( input )   # feed each output back in as the next input
       }
   }
   myhash = hash( output )
}
if ( myhash < target ) { claim my block }
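The instruction set and its mixing rules are exactly the open design questions discussed below, so generate_function is not specified here. Just as a sketch of the shape such a generator could take, here is a Python version with a made-up set of four well-behaved 64-bit operations:

import hashlib

MASK = (1 << 64) - 1  # keep everything at 64 bits in = 64 bits out

# A tiny, hypothetical instruction set of simple 64-bit operations.
OPS = [
    lambda x, c: (x + c) & MASK,
    lambda x, c: (x ^ c) & MASK,
    lambda x, c: ((x << (c % 63 + 1)) | (x >> (64 - (c % 63 + 1)))) & MASK,  # rotate
    lambda x, c: (x * (c | 1)) & MASK,  # multiply by an odd constant (a bijection)
]

def generate_function(seed: bytes):
    """Turn a hash into a fixed-length sequence of simple 64-bit operations."""
    digest = hashlib.sha256(seed).digest()
    steps = []
    for i in range(0, len(digest) - 8, 9):
        op = OPS[digest[i] % len(OPS)]             # which instruction
        const = int.from_bytes(digest[i + 1:i + 9], "little")  # its constant
        steps.append((op, const))
    def f(x: int) -> int:
        for op, const in steps:
            x = op(x, const)
        return x
    return f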


The goal is to make memory useless and try to make the "CPU" burn as much electricity as possible. This is not more "polluting" or "wasteful" than investing in more hardware to avoid electrical expense.

That extra hardware cost is similarly a waste of resources, just like the electricity. For the most part that cost (like all costs) is also ultimately tied to a fuel cost or to fictitious asset prices.

The amount of economic resources wasted to mine a coin per block is directly proportional to the dollar reward per block, which is equal to the difficulty times a proportionality constant. With x = the number of leading zeros in maxTarget (so that 2^x * difficulty is the expected number of hashes per block), the following should be approximately true for all coins:

wasted dollars of fuel per block = j * 2^x * difficulty
dollar reward per block = k * 2^x * difficulty

where j is about 1/2 of k (fuel being roughly half the total cost of mining) and both are constants that should be about the same for all coins.
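A hedged back-of-envelope helper for this relation, assuming the convention that expected hashes per block = difficulty * 2^x with x the number of leading zero bits in maxTarget; the efficiency and price figures in the example are placeholders:

def expected_hashes_per_block(difficulty: float, leading_zeros_maxtarget: int) -> float:
    # Expected number of hashes needed to find one block.
    return difficulty * 2 ** leading_zeros_maxtarget

def fuel_dollars_per_block(difficulty: float, leading_zeros_maxtarget: int,
                           joules_per_hash: float, dollars_per_joule: float) -> float:
    # Electricity cost of mining one block.
    hashes = expected_hashes_per_block(difficulty, leading_zeros_maxtarget)
    return hashes * joules_per_hash * dollars_per_joule

# Example with made-up numbers: difficulty 1e12, 32 leading zero bits in
# maxTarget, 1e-10 J/hash, and $0.10 per kWh of electricity.
print(fuel_dollars_per_block(1e12, 32, 1e-10, 0.10 / 3.6e6))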

==========
https://github.com/monero-project/monero/issues/3545#issuecomment-380678092
https://github.com/zcash/zcash/issues/1211

>The rules that your automaton operates under are the actual instructions that matter, and again, they're fully known in advance.

They're not known in advance because I am changing the "automata" algorithm 100 times per nonce. See my pseudocode above. I thought of automata because I knew off-hand some classes can't be solved for "N loops" into the future without having already run the recursion itself. Then I remembered most real-world differential equations are the same. My pseudocode is more general than either. The recursiveness was needed to address the problem of unknown execution time. Otherwise, I'm not arguing for something different than your idea. I'm trying to make it work and optimize it for cell phones.

CryptoNight's algorithm is known in advance, but ours are not. My smaller instruction set is optional, but I wanted to pick instructions that are more efficient on RISC cell phones than on CPUs and GPUs. I didn't want complex instructions that are not implementable directly and completely on a RISC chip. A simpler instruction set may also enable better control of execution time and theoretical characterization of what the results will be.

To try to address your complaint about a simpler instruction set, suppose I use 5x more combinations of instructions than your scheme but only 4 instructions. Then I would have 4^5 = 1024 effective "advanced instructions" in each group of 5 instructions. Even 1 instruction is feasible: all javascript is expressible in NAND (or NOR) instructions in a specific type of arrangement.

Let's say I use only the NAND gate (a simple two-input binary operation). My output every 1,000 loops will be hashed to get a new set of NAND-gate combinations out of a possible 2^256 combinations. There could be more if needed. An FPGA, ASIC, or GPU program would have to figure out a simplified version of the NAND-gate network 100 times per nonce. This is possible if the inner loop of 1,000 is too many loops, so there's an upper limit on it. Also, the algorithm-generating loop of 100 has an upper bound: the generator is a slow algorithm that could be optimized by an ASIC or GPU programmer, so the time to generate the next algorithm must not be a large fraction of the time to do the inner loop.

A larger instruction set has the same problems to exactly the same extent if I use log2(javascript instructions) more combinations of NAND gates. Both schemes seem to need some bound on how the instructions can be combined in order to get approximately reasonable execution, but that should not change this point.

I found it necessary to change your scheme by making it change the algorithm 100 times per nonce in order to average the solvetimes out to be approximately the same, while preventing ASIC and GPU programmers from skipping nonces that give apparently slow algorithms. Generalized loops and skipping sections of code within our created algorithms are not feasible because they create an unknown completion time, which is a deal breaker. I have an interior loop of 1,000 to keep the created algorithm 1,000x smaller, but this is just to make sure no one benefits from larger memory. This is for cell-phone efficiency per instruction while negating the benefit of a large CPU cache.
=======
>Skipping nonces: what I'm saying is that I don't think you have bounds on the time-difficulty of the generated algorithms, but that an ASIC could be designed to quickly recognize if an algorithm is going to be difficult to solve, so they'll skip it.

I can't tell if you still disagree. In reference to ASIC complexity, as I was saying, I would just use the smaller set of instructions a larger number of times to equal your larger set of instructions with fewer steps. I want to send it assembly meant for ARM processors and for it to be as difficult as your javascript. We can go so far beyond what an ASIC can predict; I'm trying to cheat by redirecting its output to its input, enabling "only" 2^256 different functions on random 32-byte seeds.

I think the general method of a small instruction set is the only one that can produce predictably similar-difficulty functions, which, as I've been saying, seems absolutely necessary. With, say, 4 instructions (they might just be well-chosen bit operations) it seems possible to know their average amount of computation, and to prove the 4 selected will continually add complexity rather than heading towards patterns like all 0's or all 1's.

It's not that I know a larger instruction set won't work. The problem is that I view the task of getting similar execution time as less daunting with fewer instructions. BTC script had to exclude loops and other things for the same reason, and still many scripts can go bad. Fewer instructions might be useful in proving the algorithms are well-behaved.

I think not only limiting the type of instructions is needed, but probably also limiting how they can combine. This increases the risk of an ASIC optimization of the general class of algorithms you produce.

+, -, /, and * create something wider in scope than power series, which can approximate any function. But maybe they should be required to follow the rules of power series. Actually, power series are recursive like automata and D.E.s, so this fits in well with my desire that the only loop be external to the "function". I've been using "function" very unmathematically to mean "algorithm of any type", since I don't have those details worked out.

>@zawy12's point that it [fewer instructions] makes future optimization less likely.

I don't think I said that. I don't view my scheme as more difficult for ASICs to "hack", but that I have more hope it will help the design process in creating an "algorithm generator" that has well behaved output and similar execution time.

There's another potential advantage to doing recursive simple steps "on chip". When you have a large algorithm full of a wide range of instructions that are not going to be native to the chip, there are delays from going back and forth to RAM. If you have an instruction set of say 256, and if they all need to have 64 bit inputs and 64 bit outputs and follow each other sequentially (in order to be well behaved), then a lookup table 3GB in size (or much less with compression) could be used instead of actually computing each sequence of 4 instructions. There may be ways around this like changing input data before instructions by a 1 to 4 bit shift based on something unpredictable, or similarly every instruction could have two inputs where one is the output of a previous instruction and the other is seed data.

A smaller instruction set has the same problem here, but I'm hoping a chip can go through more instructions than can be pre-implemented in a lookup table, faster than the time for a lookup-memory transfer in an ASIC. I probably need to use as many on-chip instructions as possible, supporting @hyc's viewpoint. But my reasons for wanting to keep a limit on instructions remain.

GPUs as well as CPUs can use a lookup table and may not be improved upon by ASICs, but I feel this is a problem that should be avoided if possible, and my starting point (same-size inputs and outputs going from one instruction to the next) is seductively simple in terms of getting constant compute time and good behavior implemented.

If one of the two ways to foil lookup tables works, an ASIC may still implement the actual logic of every hard-to-compute pair, triple, and maybe quadruple sequence of complex instructions. It can't do every pair due to the large instruction set; just implementing all non-redundant pairs of 256 instructions could be difficult. It would have a lot more flexibility in exploiting sequences of fewer instructions, so that's a +1 for @hyc. I mean it could easily optimize and implement every 5-instruction sequence of a 4-instruction set, naively giving it a 4x speedup, if that can be done as fast as on-chip. So I'll make the instruction set as large as I can, as long as it stays on the chip and as long as I can make the instructions predictable in computation time and otherwise well-behaved. So maybe I'm in trevador's ball park.

My pseudocode (which I've been changing all day as I realized problems) applies as well to large sets as to small: I think the recursion outside the algorithm, as the only loop allowed, is a requirement for this to work. Without it, the algorithm will have to be a lot more complex, which means a lot of complicated memory usage, which means too much room for GPU optimization. It's not that I'm that strongly against GPUs, but I don't want the long drawn-out process of GPU devs slowly increasing hashpower for a percentage. It makes me think that if GPU optimization can be done, there's a loophole for ASICs. Assembly code doable solely on the chip seems to have the nice possibility of never becoming optimizable.

If the algorithm generator can make algorithms with solvetimes +/- 5%, then my 100 algos per nonce is not needed, and 1 algorithm per block might be sufficient. I don't think you can get a routine to optimize the algorithm and automatically reprogram an FPGA that fast. But since GPUs could, I might still prefer the per nonce method.

Tuesday, February 27, 2018

Herbie the love bug and rise of the machines

The 1969 Disney movie "The Love Bug" seemed to hit the nail on the head via the philosophizing of artist Tennessee Steinmetz played by Buddy Hackett who has a very likable speech impediment.


- Well then, if everything you say about this car is true, it's already starting to happen.
- What's starting to happen?
- Us human beings. We had a chance to make something out of this world. We blew it. Another kind of civilization is gonna take a turn ... I'm sitting up on top of this mountain, right?
- Right.
- I'm surrounded by these gurus and swamis and monks, right?
- Right.
- I got some contemplation goin'. I see things like they are. I coulda told you all this was comin'.
- What's coming?
- Jim, it's happening right under our noses and we can't see it. We take machines and stuff 'em with information until they're smarter than we are. ... Pretty soon, you know what? The machine starts to think it is somebody. .... You follow me?
- Yeah. I think you were up on that mountaintop too long.
- You're not listenin' to me.
- Don't lose your grip, old buddy. This little car didn't do one thing tonight that can't be explained...
- I don't think you got the picture.

Friday, January 26, 2018

Fundamental limits of computing

Moore's law has almost come to a halt in our typical computing devices. By "typical" I mean silicon-based, non-reversible, synchronous computing. Reversible computing might achieve a lot more, but it is a suspect idea due to quantum-based dispersion of precision when moving forwards and backwards. Switching to something like carbon nanotubes could extend the limits, but those and any other material will still have limits based on causes like those below. Asynchronous computing has no such limit, but breaking tasks up into parts for disparate computing has fundamental problems. On the other hand, brains are mostly asynchronous and capable.

The computing limit for irreversible synchronous non-quantum silicon computing is based on:

1) A logic device requires a certain minimum number of atoms.

2) Heat per bit change is released in irreversible computing, aka Landauer's principle, Q = T*k*ln(2). There may be a way around this. This does not change the basic limit in entropy cost that will ultimately come at an energy cost, but it might be capable of computing without a local increase in heat.

3) Probability of error increases as the energy we use to store or transmit a bit gets near Landauer's limit. I think the probability of error per computation or transmission is based on e^(-E/Q), where E is the energy per bit used to transmit or store the bit and Q is Landauer's limit. A modest 100 M transistor 3 GHz device would have an error every hour if E is roughly 40x Q (see the sketch after this list).

4) The speed of light is limited.

5) Atoms break apart if they get too hot.

6) The amount of heat a cold sink can extract out of a computing device is limited, even if it is only 1-layer thick.
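A small sketch of item 3's estimate, assuming the error probability per bit operation is e^(-E/Q); the transistor count, clock speed, and target error rate are the rough figures used in the text:

import math

k_B = 1.38e-23          # Boltzmann constant, J/K

def landauer_limit(T_kelvin: float) -> float:
    return k_B * T_kelvin * math.log(2)   # minimum energy Q per bit erased

def required_E_over_Q(transistors: float, clock_hz: float, hours: float) -> float:
    """E/Q needed so the whole device averages about one error per 'hours'."""
    ops = transistors * clock_hz * 3600 * hours
    return math.log(ops)   # solve ops * e^(-E/Q) = 1

# 100 M transistors at 3 GHz, one error per hour: E/Q comes out in the
# tens (order 40-50), in the same ballpark as the rough "40x Q" figure above.
print(required_E_over_Q(100e6, 3e9, 1))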

In order for transistors to change state in step with each other, the distance across the chip is limited by the speed of light and the processor speed. The faster the processor, the smaller the chip has to be. For example, if the longest trace from the clock input to a remote transistor is 4x the chip die diameter and every transistor has to acquire the needed state in 1/4 of a clock cycle, then the max diameter at 3 GHz owing to the speed of light is 3E8 m/s / 3 GHz / 4 / 4 = 6 mm. This is approximately what is seen in current chips. Since the number of transistors increases as diameter squared, and the available size limit due to the speed of light increases linearly with decreases in speed, more synchronous computing can be done with slower chip speeds. This also reduces the heat problem so that more layers of transistors can be used. This is why chip speed is not increasing. To take advantage of this (if we went to a slower speed), going to more channels (128-bit and higher instead of 64-bit) or more cores would be needed. More cores are currently the method that takes advantage of the largest die diameter allowed by the speed of light. A 3 MHz chip could do 1,000x more computing at a cost of 1,000,000x more transistors. It would not need to be 6 meters per side as this implies, because it would dissipate 1,000x less energy per second per unit area, which means it could be more layers thick.
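The same speed-of-light estimate as a few lines of Python, with the trace-length factor and the fraction of a clock cycle taken as the assumptions named above:

C = 3e8  # speed of light, m/s

def max_die_diameter(clock_hz: float, trace_factor: float = 4,
                     cycle_fraction: float = 4) -> float:
    """Max die diameter (m) if signals must settle in 1/cycle_fraction of a
    clock cycle over traces trace_factor times the die diameter."""
    return C / clock_hz / trace_factor / cycle_fraction

print(max_die_diameter(3e9))   # ~0.006 m = 6 mm at 3 GHz
print(max_die_diameter(3e6))   # ~6 m at 3 MHz, before heat is considered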

Based on the speed of light, the above was pretty good at predicting current chip diameter. Now I'll try to predict chip speeds based on Landauer's limit. I explained energy usage per bit needs to be at least 40x Q to have a 50% chance of error every hour. It only needs to be 50x to increase that to 1 error every 100,000 hours. I'll assume in practice it is currently 100x. I see a 14 nm AMD Ryzen 7 1800X chip has 4.8 billion transistors (14 mm per side) and uses a max of 95 watts. This is not an Intel chip, but Intel says a 14 nm die has a max of 105 C. I'll guess transistors actually reach about 150 C locally. It's 4 GHz. I'll guess the average transistor changes state only 1/4 of the time (once per 4 clock cycles). So, 100x Landauer's limit times the number of transistors times the clock speed divided by 4 = 100*(150+273 kelvins)*1.38E-23*ln(2)*4.8E9*4E9/4 = 2 watts. Actually, if every transistor needs to be able to avoid an error for 100,000 hours with my 2x energy safety/inefficiency factor, then the 1/4 maybe should not be applied, giving 8 watts on a weirdly adjusted basis. They seem to be using 95/8 = 12x more energy than this limit. The limit here seems to be chip temperature. They've made it as fast as they could without "melting". They've made the dies as big as they could for that speed. Smaller chip processes allow a squared factor (or more) of less energy usage while allowing a squared factor of more transistors per die of the same size. So they can continue to make them smaller at the same speed and the heat profile will not change. The 50x limit implies they could go SQRT(12) = 3.5x smaller with my 100x estimate of a hard limit. That would be down to 14/3.5 = 4 nm. This is far from the limit based on heat dissipation. My assumptions could have made it 1 nm. I see 5 nm is scheduled for about 2020. Since atoms are only 0.2 nm wide, that must be getting close to the size limit before they have to resort to slower processor speeds to enable larger dies (or more layers).
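And the Landauer-based power estimate from this paragraph as a sketch, where the 100x safety factor, 150 C junction temperature, and 1/4 activity factor are the same guesses made above:

import math

k_B = 1.38e-23   # Boltzmann constant, J/K

def landauer_power_watts(transistors: float, clock_hz: float,
                         temp_c: float = 150, safety_factor: float = 100,
                         activity: float = 0.25) -> float:
    """Rough lower-bound power if each switching transistor spends
    safety_factor times Landauer's limit per state change."""
    Q = k_B * (temp_c + 273) * math.log(2)
    return safety_factor * Q * transistors * clock_hz * activity

# Ryzen 7 1800X-style numbers: 4.8e9 transistors at 4 GHz -> ~2 W,
# versus the chip's actual ~95 W budget.
print(landauer_power_watts(4.8e9, 4e9))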

Sunday, January 21, 2018

Compressing short texts

I want to compress text to 512 bytes as much as possible for the Zcash and Hush memo field.

I didn't have any luck searching the internet for how to compress short texts ... until I had finished doing it from scratch. Normal compression algorithms are not useful because they need to include various types of look-up tables in the data, which can make small texts even bigger.

I used Huffman-like look-up tables for the language at hand, using the known frequency of letters and words to make decisions, combined with other tricks based on knowing things like a capital letter will probably follow a period. I used 2 bits to tell the decoder how many bits were about to follow, depending on which table contained the word or letter to replace: 4, 6, 7, or 12 bits followed the 2 bits. The 4-bit table was for the 16 most common characters (e, t, a, i, o, l, s, n, newline, period, comma ... from memory). The 12-bit table was for less-common words; I had a table of 4096 words. A 6-bit table was for the rest of the ASCII-printable characters, and the 7-bit table was for the most common words.
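A stripped-down sketch of that 2-bit prefix scheme; the table contents here are placeholders (the real tables were built from letter and word frequencies), but the framing of 2 selector bits followed by 4, 6, 7, or 12 code bits is the part being illustrated:

# Hypothetical tables; the real ones hold the 16 most common characters,
# the rest of printable ASCII, 128 common words, and 4096 less-common words.
TABLES = {
    0b00: (4,  list("etaiolsn\n.,?-'\"x")),                          # 16 entries, 4 bits
    0b01: (6,  [chr(c) for c in range(32, 96)]),                      # 64 entries, 6 bits
    0b10: (7,  ["the", "of", "and", "to"] + ["w%d" % i for i in range(124)]),   # 128 entries
    0b11: (12, ["word%d" % i for i in range(4096)]),                  # 4096 entries, 12 bits
}

def encode_token(token: str) -> str:
    """Return '<2 selector bits><n code bits>' as a bit string."""
    for selector, (nbits, table) in TABLES.items():
        if token in table:
            return format(selector, "02b") + format(table.index(token), "0%db" % nbits)
    raise ValueError("token not in any table: %r" % token)

print(encode_token("e"))     # 2 + 4 = 6 bits for a very common letter
print(encode_token("the"))   # 2 + 7 = 9 bits for a very common word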

But here's a better job of it in precise detail:
https://academic.oup.com/comjnl/article-pdf/24/4/324/1039097/240324.pdf

He and I both got compression down to about 45% of the original English files. He did it with only 250 words instead of 4096, so his method is really good. We're using almost exactly the same scheme, but he used only 4 bits for the most common letters, which is why he was able to do as well with a lot less. It's hard to do better no matter how big the tables are, but my next improvement could be to follow his method exactly and add another table of 4096 words with 12 bits after the 4-bit prefix.

His method has a big additional advantage: because the code lengths stay multiples of 4 bits, a common procedure can be applied: the Burrows-Wheeler transform (BWT) followed by move-to-front (MTF), then arithmetic coding (AC). BWT gets like symbols closer together, which enables MTF to assign low position numbers more frequently. BWT and MTF do not do compression; AC does the compression after MTF to make use of the lower entropy from the more frequent low position numbers. By itself, the BWT-MTF-AC method got the original file down to 60% using 4-bit sequences. It did not benefit my method because my odd-length codes offset the bits all the time, making the 4-bit sequences more random. 4-bit sequences are better than bytes for short messages because AC has to attach a "dictionary" to the data (i.e. a list of each symbol's count in the original), and 4 bits means I need only 16 counts of up to 256 if no symbol is more than 25% of a 1000-byte pre-compression text (since my goal is 512 bytes and I won't get more than 50% on text already pre-compressed). That means only 16 extra bytes are needed versus 256 for 8-bit symbols (I'll just list the counts in a logical order of fixed length instead of including both symbol and count). MTF will be a little longer since the ranking on 1000 bytes will require a distance of up to 2000 for 4-bit symbols instead of 1000.
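A compact illustration of the BWT -> MTF front end being described (a naive O(n^2 log n) BWT, fine for short memos); the arithmetic-coding back end is omitted:

def bwt(s: str, eof: str = "\x00") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    s += eof
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mtf(s: str) -> list[int]:
    """Move-to-front: recently seen symbols get low position numbers."""
    alphabet = sorted(set(s))
    out = []
    for ch in s:
        i = alphabet.index(ch)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return out

text = "the theme of the thesis"
print(bwt(text))        # like symbols cluster together
print(mtf(bwt(text)))   # so MTF emits mostly small numbers for the coder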

The 4-bit method is also immediately applicable to UTF-8.

Wednesday, December 27, 2017

Using difficulty to get constant-value dev fees

I posted this to the bitcoin-ml and bitcoin-dev mailing lists and to reddit and bitcointalk, but no one seems interested.
========
Has anyone used difficulty to get constant-dollar developer or node
fees? Difficulty is exactly proportional to network hashrate, and
network hashrate is closely proportional to coin price.

Say a coin is currently $1.23 and someone wants to get a fixed income
from the coin like $0.01 each time something occurs. To achieve this
they could use a constant that is multiplied by the difficulty:

fee = 0.0123 * difficulty_at_$1.23_per_coin / current_difficulty / reward_per_block_at_$1.23 * current_reward_per_block

Dollar value here is constant-value relative to when the ratio was
determined (when difficulty was at $1.23). If hash power is not able
to keep up with coin price (which is a temporary effect), the value
would be larger than expected. Otherwise, the real-world value slowly
decreases as hashing efficiency increases, which may be a desired
effect if it is for dev fees because software gets outdated. But
Moore's law has gotten very slow for computers. Hashing should get
closer to being a constant hardware cost per hash.
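A small Python version of the fee formula above, using made-up reference numbers just to show the scaling (the constants would be pinned at the moment the ratio is chosen):

def constant_value_fee(ref_fee_coins: float,
                       ref_difficulty: float, ref_reward: float,
                       current_difficulty: float, current_reward: float) -> float:
    """Scale a fee (denominated in coins at the reference price) so its
    real-world value stays roughly constant as difficulty and reward change."""
    return (ref_fee_coins * ref_difficulty / current_difficulty
            / ref_reward * current_reward)

# Reference point: a fee of 0.0123 coins when difficulty was 2e6 and reward 12.5.
# If price (and hence difficulty) has since doubled, the fee in coins halves.
print(constant_value_fee(0.0123, 2e6, 12.5, 4e6, 12.5))   # ~0.00615 coins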

Also, electricity is more than half the current cost of hashing and
could soon be 3/4 or more of the cost. Worldwide electricity cost is
very stable and possibly the best single-commodity measure of constant
value.

Also, all coins for a given POW (if not in an even more general sense) will have the same factor above, adjusted only by multiplying by 2^(x-y) where x is the number of leading zeros in maxTarget for the other coin, and y is the number in the coin above.

Tuesday, November 21, 2017

Best difficulty algorithm

I'm changing this as I find improvements. Last change: 12/1/2017.

Newest information is here:
https://github.com/zawy12/difficulty-algorithms/issues/1


This first one appears to be the best all around:
# Jacob Eliosoff's EMA (exponential moving average)
# https://en.wikipedia.org/wiki/Moving_average#Application_to_measuring_computer_performance
# ST = solvetime, T=target solvetime
# height = most recently solved block
# N=70
# MTP should not be used

# Find how much the newest block advances the maximum timestamp seen so far
# (handles out-of-order timestamps).
max_TS = timestamp[height - 11]          # initialize from just before the window
for (i = height - 10 to height) {        # check timestamps starting 10 blocks into the past
   previous_max_TS = max_TS
   max_TS = timestamp[i] if timestamp[i] > max_TS
}
ST = max_TS - previous_max_TS
ST = max(T/100, min(T*10, ST))           # limit solvetime to the range T/100 .. 10*T
next_D = previous_D * ( T/ST + e^(-ST/T/N) * (1 - T/ST) )
A word of caution with the above EMA: when converting between the bits field (aka target) and difficulty, it is important not to make a consistently wrong "rounding" error. For example, if the previous difficulty was 100,000, it is important that nothing in the code makes it consistently +50 more or consistently -50 less (a 0.05% error). That would cause the EMA at N=70 to have a 3.5% error in solvetime. At 0.5% error per block, there is 35% error in the solvetimes (difficulty is 30% too high or low). The error that develops seems to be based on about 1.005^70 = 41%. If half the time it is +1,000 too high and the other half -1,000 too low, then it's OK; just don't be consistently wrong in the same direction. Error in the value used for e = 2.7183 does not hurt it.

In a simple moving average (SMA), a multiplicative error like this only causes a proportional error in solvetime, not a compounded error.

There is a T/solvetime ratio in two places. It must be the same in both places. I don't know how it would be coded to give something different, but it's something to be aware of.
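For concreteness, here is the same EMA as a small Python function (a direct transcription of the pseudocode above, not tested against any coin's consensus code):

import math

def ema_next_difficulty(previous_D: float, timestamps: list[float],
                        T: float = 600, N: int = 70) -> float:
    """Jacob Eliosoff's EMA, using the rise in the max timestamp over the
    last ~11 blocks as the solvetime (timestamps[-1] is the newest block)."""
    window = timestamps[-12:]                      # timestamp[height-11 .. height]
    max_TS = previous_max_TS = window[0]
    for ts in window[1:]:
        previous_max_TS = max_TS
        if ts > max_TS:
            max_TS = ts
    ST = max_TS - previous_max_TS
    ST = max(T / 100, min(T * 10, ST))             # clamp to T/100 .. 10*T
    return previous_D * (T / ST + math.exp(-ST / T / N) * (1 - T / ST))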

==========
The following attempts to make it smoother while keeping some ability to respond faster to big hashrate changes.
# Dynamic EMA difficulty algo (Jacob Eliosoff's EMA and Zawy's adjustable window). 
# Bitcoin cash dev(?) came up with the median of three to reduce timestamp errors.
# For EMA origins see 
# https://en.wikipedia.org/wiki/Moving_average#Application_to_measuring_computer_performance
# "Dynamic" means it triggers to a faster-responding value for N if a substantial change in hashrate
# is detected. It increases from that event back to Nmax

Nmax=100 # max EMA window
Nmin=25 # min EMA window
A=10, B=2, C=0.37  # A,B,C = 10, 2, 0.37 or 20, 1.65, 0.45

# TS=timestamp, T=target solvetime, i.e. 600 seconds
# Find the most recent unusual 20-block event
unusual_event = Nmax   # default if no unusual 20-block event is found
for (i=height-Nmax to height) {  # height=current block index
    if ( (median(TS[i],TS[i-1],TS[i-2]) - median(TS[i-20],TS[i-21],TS[i-22]))/T/A  > B 
           or  
         (median(TS[i],TS[i-1],TS[i-2]) - median(TS[i-20],TS[i-21],TS[i-22]))/T/A  < C  ) 
      {   unusual_event = height - i + Nmin   } 
}
N = min(Nmax, unusual_event)

# now do the EMA shown above with this N 

Here's another algorithm that seems to be about as good as the EMA
#  Weighted-Weighted Harmonic Mean (WWHM) difficulty algorithm
# Original idea from Tom Harding (dgenr8) called "WT-144"  
# No limits, filtering, or tempering should be attempted.
# MTP should not be used.

# set constants
# N=60 # See Masari coin for live data with N=60
# T=600 # coin's Target solvetime. If this changes, nothing else needs to be changed.
# adjust=0.99 # 0.98 for N=30, 0.99 for N=60
#  k = (N+1)/2 *adjust * T # there is not a missing N. 
# height = most recently solved block

# algorithm
d=0, t=0, j=0
previous_max=timestamp[height - N] 
for ( i = height-N+1; i < height+1; i++) {  # (N most recent blocks)
    max_timestamp=max(timestamp[i], previous_max)
    solvetime = max_timestamp - previous_max
    solvetime=1 if solvetime < 1
   # for N=60, 10*T solvetimes drives difficulty too far down, so:
    solvetime = 10*T if solvetime > 10*T 
    previous_max=max_timestamp
    j++
    t +=  solvetime * j 
    d += D[i] # sum the difficulties this is how WWHM is different from Tom's WT.
}
t=T*N if t < T*N  # in case of startup weirdness, keep t reasonable
next_D = d * k / t 

# limits on next_D do not need to be used because of the above solvetime restrictions
# but if you still want limits on D's change per block & expect max 5x hash rate
# changes or want to replace the solvetime restrictions, use
# limit = 5^(3/N)
# next_D = previous_D*limit if next_D > previous_D*limit
# next_D = previous_D/limit if next_D < previous_D/limit
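And a Python transcription of the WWHM pseudocode above, again as an untested sketch rather than consensus-ready code (D and timestamp are lists indexed by block height):

def wwhm_next_difficulty(D: list, timestamp: list, height: int,
                         T: float = 600, N: int = 60, adjust: float = 0.99) -> float:
    """Weighted-Weighted Harmonic Mean difficulty over the last N blocks."""
    k = (N + 1) / 2 * adjust * T
    d = t = j = 0
    previous_max = timestamp[height - N]
    for i in range(height - N + 1, height + 1):
        max_timestamp = max(timestamp[i], previous_max)
        solvetime = max_timestamp - previous_max
        solvetime = max(1, min(10 * T, solvetime))   # clamp bad timestamps
        previous_max = max_timestamp
        j += 1
        t += solvetime * j        # more recent solvetimes get more weight
        d += D[i]                 # sum of difficulties (the WWHM twist on WT)
    t = max(t, T * N)             # guard against startup weirdness
    return d * k / t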