1) Input size equal to output size, say 64 bits.

2) The class is complex enough to not be optimizable as a class (hence ASICs and FPGAs are not useful)

3) The class has an upper and lower bound on computation time for a given "CPU"

The last item seems to require a similar distribution of operations and a restricted way of using them. This would make achieving the 2nd to last item more difficult. To relax the computation time bounds in the last item without risking miners abandoning slow functions and going to the next nonce, the nonces must be in sequence.

The POW solution is then:

myhash = ffx32 unless (myhash < target) { nonce++ j = hash(previous block hash + nonce) for (1 to 100) { j++ function = generate_function( hash( j ) ) for (k=1 to 100,000) { input=output output = function( input ) } } myhash = hash(output) } if ( myhash < target ) { claim my block }

The goal is to make memory useless and try to make the "CPU" burn as much electricity as possible. This is not more "polluting" or "wasteful" than investing in more hardware to avoid electrical expense.

The extra cost is similarly a waste of resources like electricity. For the most part that cost (like all costs) is also ultimately tied to a fuel cost or fictitious asset prices.

The amount of economic resources wasted to mine a coin per block is directly proportional to the dollar reward per block, which is equal to the difficulty times a proportionality constant. When x = the leading zeros of the target, the following should be approximately true for all coins:

2^x * difficulty = j * wasted dollars of fuel per block = k * dollar reward per block

where j is about 1/2 k and both are constants that should be about the same for all coins.

==========

https://github.com/monero-project/monero/issues/3545#issuecomment-380678092

https://github.com/zcash/zcash/issues/1211

>The rules that your automaton operates under are the actual instructions that matter, and again, they're fully known in advance.

They're not known in advance because I am changing the "automata" algorithm 100 times per nonce. See my psuedocode above. I thought of automata because I knew off-hand some classes can't be solved for "N loops" into the future without having already run the recursion itself. Then I remembered most real world differential equations are the same. My psuedocode is more general than either. The recursiveness was needed to address the problem of unknown execution time. Otherwise, I'm not arguing for something different than your idea. I'm trying to make it work and optimize it for cell phones.

Cryptnight's algorithm is known in advanced but ours' are not. My smaller instruction set is optional, but I wanted to pick instructions that are more efficient on RISC cell phones than on CPUs and GPUs. I didn't want complex instructions that are not implementable directly and completely on a RISC chip. A simpler instruction set may also enable better control of execution time and theoretical characterization of what the results will be.

To try to address your complaint about a simpler instruction set, suppose I have 5x more combinations of instructions than your scheme but use only 4 instructions. Then I would have 4^5 =1024 advanced instructions in each group of 5 instruction. Even 1 instruction is feasible: All javascript is expressible in NAND or XOR instructions in a specific type of arrangement.

Let's say I use only the NAND gate (a simple binary if-then statement). My output every 1,000 loops will be hashed to get a new set of NAND gate combinations out of a possible 2^256 combinations. There could be more if needed. An FPGA or ASIC or GPU program would have to figure out a simplified version of the NAND gate 100 times per nonce. This is possible if the inner loop of 1000 is too many loops. So there's an upper limit on it. Also, the algorithm generating function loop of 100 has an upper bound: it's a slow algorithm that could be optimized by ASIC or GPU programmer, so the time to generate the next algorithm must not be a large fraction of the time to do the inner loop.

A larger instruction set has the same problems to the exact same extent if I use log2(javascript instructions) more combinations of NAND gates. Both schemes seem to need some bound on how the instructions can be combined in order to get approximately reasonable exceution, but that should not change this point.

I found the need to change your scheme by making it change the algorithm 100 times per nonce in order to get an averaging out of solvetimes to be approximately the same while preventing ASIC and GPU programmers from skipping nonces that give apparently slow algorithms. Generalized loops and skipping sections of code within our created algorithms are not feasible due to creating an unknown completion time, which is a deal breaker. I have an interior loop of 1,000 to keep the created algorithm 1000x smaller, but thiis just to make sure no one benefits from larger memory. This is for cell phone efficiency per instruction while negating the benefit of a large CPU cache.

=======

Skipping nonces: what I'm saying is that I don't think you have bounds on the time-difficulty of the generated algorithms, but that an ASIC could be designed to quickly recognize if an algorithm is going to be difficult to solve, so they'll skip it.

I can't see if you still disagree. In reference to ASIC complexity, as I was saying, as I would just use the smaller set of instructions a larger number of times to equal your larger set of instructions with fewer steps. I want to send it assembly meant for ARM processors and for it to be as difficult as your javascript. We can go so far beyond what an ASIC can predict, I'm trying to cheat by redirecting it's output to its input, enabling "only" 2^256 different functions on random 32 byte seeds.

I think the general method of a small instruction set is the only one that can have predictably similarly difficult functions like I'm saying seems absolutely necessary. With say 4 instruction (they might just be well-chosen bit operations) it seems possible to know what average amount of computation they have, and to prove the 4 selected will continually add complexity rather than going towards patterns like all 0's or all 1's.

It's not that I know a larger instruction set won't work. The problem is that I view the task of getting similar execute time to be less daunting with fewer instructions. BTC script had to exclude loops and other things for the same reason. And still many scripts can go bad. Fewer instructions might be useful in proving the algorithms are well-behaved.

I think not only limiting the type of instructions is needed, but probably also limiting how they can combine. This increases the risk of an ASIC optimization of the gneral class of algorithms you produce.

+, -, /, and * creates something wider in scope than power series which can approximate any function. But maybe it should be required to follow the rules of power series. Actually power series are recursive like automata and D.E.s, it does fit in well with my desire that the only loop be external to the "function". I'm been using "function" very unmathematically to mean "algorithm of any type", since I don't have those details worked out.

@zawy12's point that it [fewer instruction] makes future optimization less likely.

I don't think I said that. I don't view my scheme as more difficult for ASICs to "hack", but that I have more hope it will help the design process in creating an "algorithm generator" that has well behaved output and similar execution time.

There's another potential advantage to doing recursive simple steps "on chip". When you have a large algorithm full of a wide range of instructions that are not going to be native to the chip, there are delays from going back and forth to RAM. If you have an instruction set of say 256, and if they all need to have 64 bit inputs and 64 bit outputs and follow each other sequentially (in order to be well behaved), then a lookup table 3GB in size (or much less with compression) could be used instead of actually computing each sequence of 4 instructions. There may be ways around this like changing input data before instructions by a 1 to 4 bit shift based on something unpredictable, or similarly every instruction could have two inputs where one is the output of a previous instruction and the other is seed data.

Fewer instructions has the same problem with this, but I'm hoping a chip can go through more instruction than can be pre-implemented in a lookup table, faster than the time for a lookup-memory transfer in an ASIC. I probably need to use as many on-chip instructions as possible, supporting @hyc 's viewpoint. But my reasons for wanting to keep a limit on instructions remain.

GPUs as well as CPUs can use a lookup table, and may not be improved upon by ASICs, but I feel like this is a problem that should tried to be avoided, and but my starting point (same size inputs and outputs going from 1 instruction to the next) is seductively simple in terms of getting constant compute time and good behavior implemented.

If one of the two ways to foil lookup tables works, an ASIC may still implement the actual logic of every hard-to-compute pair, triple, and maybe quadruple sequence of complex instructions. It can't do every pair due to the large instruction set. Just implementing all non-redundant pairs of 256 instructions could be difficult. It would have a lot more flexibility in exploiting sequences of fewer instructions, so that's a +1 for @hyc . I mean it could easily optimize and implement every 5-instruction sequence of a 4-instruction set, naively giving it a 4x speedup, if it can be done as a fast as on chip. So I'll say I'll make the instruction set as large as I can as long as it stays on the chip, and as long as I can make them predictable in computation time and otherwise well-behaved. So maybe I'm in trevador's ball park.

My pseudocode (that I've been changing all day as I realized problems) applies as well to large sets as small: I think the recursion outside the algorithm as the only loop allowed is a requirement for this to work. Without it, the algorithm will have to be a lot more complex which means a lot of complicated memory usage that means too much room for GPU optimization. It's not that I'm that strongly against GPUs, but I don't the long drawn out process of GPU devs slowly increasing hashpower for a pecentage. It makes me think if GPU optimization can be done, there's a loophole for ASICs. Assembly code doable solely on the chip seems like it has the nice possibility of never becoming optimizeable.

If the algorithm generator can make algorithms with solvetimes +/- 5%, then my 100 algos per nonce is not needed, and 1 algorithm per block might be sufficient. I don't think you can get a routine to optimize the algorithm and automatically reprogram an FPGA that fast. But since GPUs could, I might still prefer the per nonce method.