Tuesday, June 13, 2017

Difficulty algorithm and time warp attacks

I spent a lot of work on trying to figure out the best difficulty algorithm for Zcash and two other coins have run into my comments.  This is a summary of what I found out. It was a comment on github.  Frankly, I can't believe the amount of work people have spent on designing difficulty algorithms when they should have just been using a simple average, and I'm glad to see coins are starting to use the simple average, and would like to think it is due to my work.

After much consideration, I strongly recommend a simple average of the most recent solve times. By increasing N, you get stability at the expense of responsiveness. Digishield was using N= 17 or N=18 in a more complicated way. Last I checked, this is what Zcash was doing (N=17) but they might be doing something different. People still refer to them as using a "modified digishield v3", but it's just an average that I believe came as a result of my pestering them about it.
Next difficulty = (avg last N difficulties) x (target solve time) / (avg of past N solve times)
As a result of improper times being reported by miners, the "average" solve time is not used, but the median is. If you have accurate solve times, use the average.
I could not confirm the Zcash code was getting a good median (I could not trace the variables back to their origins), but it seemed to work.
The lack of a known time allows time-warp attacks especially when mining is concentrated. It was observed in the Zcash testnet. To help minimize this, you limit how far ahead or behind the previous time that the miner can report their time. I believe Zcash made the limit same as bitcoin, 3600 seconds, but the limit should have been 1/4 of bitcoin's since their blocks come 4x faster. 4x more blocks in that time limit means more opportunity for a time warp attack. The attack seems to apply only when mining is concentrated, having at least maybe 20% of network power. Also, it is used if electricity or computer time is their main expense. They crunch a lot to get maybe up to 1/2 the blocks in 1/2 of N, so that the calculated median is artificially old because they are reporting old times to trick the algorithm into lowering the difficulty. They continue for another 1/2 of N which is when the algo finally figures it out, thanks to the limit of how far back they can set times. Then they stop to let others suffer a high difficulty for 1 N, then they do it again. I think that's a good summary.
This is not to say a smaller N prevents or reduces a time warp attack. The median and the limit on the reported time helps reduce the possibility of it occurring. I'm describing the attack in order to check for it by watching the times being reported and to see if it makes the difficulty oscillate. You look for a series old times being reported on the scale of 1/4 to 1 N , then it stopping for a while to be repeated later.
The problem with many difficulty algorithms is that they try to "think too much" instead of just looking at recent data, and going from there. If you try to look at a recent increase in hash rate and let the difficulty "jump ahead" of the actual recent average in a predictive manner, then it invites oscillations, both natural and intentional. A similar thing occurs if you try to limit the increase or decrease in the difficulty, like Digishield did, especially if the limits are not the same (symmetrical). Just looking at the avg of the most recent past is the most scientific method.
You may be fully aware of all the above, but I wanted to distill what I had found out in the past.

The above does not explain this exactly. This post to bitcointalk is a little more clear on time warp attacks.

where Zcash uses N=17.  N is not too critical. Larger N is less responsive to network changes but more stable.  

I think Zcash retained some semblance to Digishield v3 by retaining a limit on the increase and/or decrease. This caused oscillation problems in Digishield V1, V2, and Zcash testnet. I was arguing for them to remove those limits completely.  I think they made them high enough that they are mostly irrelevant. But it's better and more scientific to just stick to the observed average.  If a big pool (10x hash rate increase) jumps on for an hour and then off, the penalty of everyone having to wait 10x longer for the next 8 blocks is the best option. If a limit is placed on the difficulty rise, they get blocks cheaper for longer and too many coins will be released per day (assuming you do not make other miners pay the price for the ill-gotten gains by making the fall equally slow).  The core problem is the lack of a valid timestamp. Change the word "median" in the equation to "avg" if you're not like bitcoin and Zcash and are somehow getting a known timestamp.  

If you're using 150 seconds between blocks, the maximum-allowed miner-reported time between blocks should be 1/4 of bitcoin's to minimize time-warp attacks. I think Zcash did not change the max time from bitcoin (3600 seconds?) even though their blocks are coming 4x faster. Time warps seems to be a danger only when there's a potential big miner (>15% network hash rate).  A big miner would want to cycle through small coins so that he has no down time while getting coins with a artificially low difficulty. It's made a little easier if they're all using N=17. Otherwise it seems only beneficial if his primary expense is computing time or electricity. 

Just sticking with Zcash is probably fine, but I hate seeing people refer to Digishield because it implies limits on the rise and fall when it's the simple "scientific" average that works. Limits on the rise and fall are a political and ideological influence, wanting to prevent users from suffering long transaction wait times. But it's a violation of the science that has a cost elsewhere (either the coin is released too fast, or small miners suffer). To be more "scientific" N=30 would be good. It would be more stable but less responsive. That's better than keeping a limit on the rise but has a similar effect.

To clarify and correct another post to github

I think N from 14 to N=50 might be OK. I would select N=30 to make difficulty more predictable for everyone and there is a statistical standard in letting 30 be a cut-off. A lot of research is immediately rejected if N < 30. Large N also makes it more obvious a big miner is present. It does not theoretically affect big-miner "cheating" because if N is small, they just come back more often. They profit during N/2 then have to go away for N/2 to wait for the median to go back to normal. Since it's the avg of N difficulties, it is better for them to wait N. This is without a time-warp attack. It's just what they can do if they have a large percentage of hashrate. There is no fix. It's only profitable if they can easily switch what coin they are mining, or if computer time or electricity are largest expense and it's OK to let their equipment go idle (unlikely).
I think 6x the target solve time to copy bitcoin should be the maximum difference in the timestamps. 6x4x60 = 1440 seconds. I think it is risky to adjust the median by combing it with the average, if I understand your comments.
I've said N value is not critical, but it may interact with the 6x. I can't figure out if N=17 or N=30 is better if the limit is 6x the 4 minutes. So I'm not sure N=30 is better.
A timewarp attack can be seen if timestamps are set as far ahead in the future as possible to cause the difficulty to drop, then they'll mine full power and set timestamps as low as possible. The difficulty will stay low until the median time-to-solve catches up to the forwarded time. This is why 6x is loses half as many blocks to the attacker as 12x. You can see the actual hashrate of the attacker by looking at how fast block solves are coming into your personal node. If he has 2x the network hashrate, you'll see solves coming in about 2x faster than your 4 minutes.
I think other miners are paying the price when someone has a lot of hash power and is cycling through several alt coins when it sees a low difficulty. So the other miners have to suffer a higher difficulty when they leave, and do not benefit when the difficulty is low. But it's hard to call it an attack. The big miner profits from the difficulty accidentally being low, and their actions cause the difficulty to rise and fall more adding excess variablility to solve times. If the big miner stayed around all the time, the other miners would get the same amount. So the other miners are paying the price in the sense that if they have 1% of average network hashrate, they get something like only 0.66% of the blocks if the big miner mines N/2 and then waits N to return. So he needs to find 3 small alt coins with N=17. Again, this is not related to a time warp attack.
I may have made an error in my previous post: The timewarp attack may only cause more coins to be issued than the protocol wants (not really harming other miners directly). It's an attack on the value of everyone's coins, not on other miners, unless they are combing it with cycling through to other coins. It's an "attack" because it is lying about the time. It's an unsolvable deficiency in the protocol in not being able to know the real time itself. It's unsolvable without relying on a trusted third party, like a group of peers it trusts (a trusted consensus) which I believe is what ETH does to get a an accurate timestamp.
The following is an actual timewarp attack on the Zcash testnet. The first chart is the rate at which blocks are being issued. The target is 24 per hour. You can see too many blocks were issued during the attack. The second chart shows a positive spike when the timestamp was set >2000 seconds into the future from previous block. It shows a negative spike when they blocks had timestamps less than 10 seconds apart. The darkness of the downward spikes shows they got a lot of blocks quickly. They stopped when the difficulty returned to normal.:


Your equation:
Next = (avg past 17 Diff) x 240 / (A + 1/4x(A - M))
is same as
Next = (avg past 17 Diff) x 240 / (1.25xA - 0.25xM)

Avg of past 17 or 16 Diff does not make much difference but if you're dividing by M and A of 16 then I would definitely also use 16 for Diff.
I don't like it because of the heavy reliance on the Avg for the reasons I gave in a previous post that describes the effect of using A instead of M. I like my previous equation better (at the bottom of this post I'll explain a problem with this and give another idea):

Next Diff = (avg last N Diff) x T / (Median of past N solve times)
with limits:
Max timestamp = Median-(N/2+0.5+6 )x T
Min timestamp =Median+(N/2+0.5+6) x T
The 0.5 is because Median is used. It's negative is N is odd.
T= target time = 240 in this case.
From deep thought over many days I came to the conclusion that the best when there is no timewarp attack is
Next Diff = Avg N Diff x Target / (Avg N SolvesTimes)
Timewarp (timestamp error) necessitates median and limits. The median is ugly, but it is as close to avg as you can get so I strongly favor the median. I saw how "insane" digishield v1 and v2 was in using rise and fall limits on Diff, so I'm trusting myself more than (1.25A-0.25M). My equation with the 3/4 is wrong, but I was thinking the code was using (0.75xM+0.25(most recent)) which seems OK but I still like the bold equation best.
The problem with my equation above is that it ignore a sudden increase in hashrate for 8x240 and then suddenly responds. So a smart big miner will want to come in for 8 blocks, then leave for 8 or 16 blocks to go to another alt, then come back. I don't know if they actually do that. Maybe N=12 is good, but it will have a lot of accidental variation. Here's an alternative in order to be more responsive and cut time warp profit in half
Next Diff = (avg of last N Diff) x T / (1/2 M + 1/2 A)
where M and A are median and average of last N block solve times.
Bitcoin takes forever in responding to hashrate changes, so median having an 8 block delay is a lot better than bitcoin even if it's not responding fast.

So my desired T/A and your T / (1.25A-0.25M) are basically the same thing but my timewarp argument is against these. Str4d on Zcash told me they used T/M instead of T/A because of timewarp and I had to think about why and came up with the explanation above. Since I can't read their code, maybe they actually use the same as you because they have that 1/4, so maybe they really do use something close to T/A.
I think p and q = 0.5 is fine. Your current algo might be fine. But be sure to use 6xT as a limit, and allow the negatives.
An idea I have that also slows responsiveness but dulls the effect of bad timestamps is to use the same idea that "the avg of truth and lies is better than risking two lies in a row" when measuring block solve time. Instead of block solve time = (current timestamp) - (previous timestamp) an idea I am just "throwing out there" is:
block solve time = (current timestamp) - [ (Avg of past N/2 timestamps) + (N/4) x T ]
So this skeptically dulls any extreme high or low solve times. The 6x limit would be enforced on the output of this equation, not the input.

What I've described may not be a timewarp attack. Here is a post that says it requires ignoring the interval between two blocks like bitcoin does which means it depends on not using a rolling average. In their example, they reset difficulty every 3 blocks. We are talking about a rolling average, so it's really different.
My description of the timewarp might be all wrong. Let's say and honest timestamp on block 1 is 0. The attacker takes ~240 but assigns 1440 for block 2. Then miner with an accurate takes ~240 and assigns ~480 for block 3 because it is two blocks after block 1. But the block solve times you have are 1440 and (480 - 1440) = -960. So the average of the two is 240. So maybe the M should not be used. I would just use A.
My "avg of a larger number of truth and lies is better than risking two lies in a row" is referring to a method of how to enforce the 1440 limit, or how to assign the block solve time. But because of how it works out nicely above, maybe that should not be tried either.
The problem I see with median is that it does nothing for N/2 then suddenly reacts. The output I'm seeing from it is not really good. The 1.1 factor has to be 1.5 if it's 100% median. As far as I know, I was completely wrong in sending you guys down the median path. p=1.25 and q=-0.75 has to use 0.9 correction.
time to 1 occurrence = -240*ln(rand()) is the random time of a single occurrence that will result from a Poisson distribution that has a mean of 240. Poisson is simply based on a standard random variable, so RAND() is what should be used. Derivation is below.
To include the 1440 limit I had to create another column F with =if(A1>1440, 1440, A1). Then let the median and average columns point to column F instead of A, but keep my measurements of the output and charting on column A. For example, to know if the solve time is averaging 240, do an average on all of column A.
I'm still working on trying to get the oscillations out of these averaging methods. I'm hoping the reason they seem to oscillate with 2N intervals is because the D x T / A method assumes a linearity and symmetry by using the A and that this is a false assumption in Poisson distributions.
I was lucky to find -ln(rand())*240 on the internet and I did not see a derivation, so I'll give one here.
The Poisson distribution is based strictly on the same basics as Gaussian and binomial. Gaussian is a continuous version of binomial. Binomial is a coin toss. Poisson is Gaussian in a sort of backwards way and summed up. That is why a simple formula like this can apply. The Poisson is not esoteric, but a direct "consequence" of this simple formula,
time to 1 occurrence = -T*LN(RAND())
which is apparently all you need to know to derive all Poisson results. But I want to do it backwards from wikipedia.
From Wikipedia Poisson 1st equation
P of k occurrences when on average there is L = 1 occurrence per some T = e^-L * L^k / k!
L=mean occurrences per interval = 1/240
k=observed occurrences in L
L can be use decimals but k must be an integer.
P is a random variable that is between 0 and 1.
This equation is also the "probability mass function".
We want random times as they vary around 240 with k=1. So the equation simplifies to
P = e^(-L) * L^1 / 1
Assume the mass function nature of it allows me to substitute
L=L*t and therefore
P(t) = e^(-Lt) * Lt
Now integrate P from t=0 to t=t.
With an integration table and log rules, the integral yields
Integral(P)/t = e^(-t/T)
Notice that the left side is an average P expected for a given t when k=1. Rearranging:
t= -T * ln( avg P per t)
Which is the equation I seek: RAND() appears to supply an average P per t. This makes sense but avg(P/t) is a weird kind of ratio.
The best I can find after trying many things is this:
next D = Avg of D x (T/A)^2
The cycles seem less and it responds to attacks a lot faster. It varies more but it does not go as low. It is kind of a hornet that I would think about twice before attacking. By not following other diff algos, attackers would not know as well what to do with it. It causes the difficulty to jump about 2x higher on occasion than without the ^2.
For constant hash rate, the frequency of of seeing solve times that are 240/2 is 6x more often than seeing solve times that are 240 x 2. The averages reduce this effect, but I believe this could be what causes the oscillations. I can't find a solution. I've done curve fitting to adjust each solve time before applying the average and tried adjusting the average, but I can't do better than the simple average. I've tried about 10 other things, using adjustable constants in many different ways.
N needs to be > than the length of time a flash attack occurs. They get N/2 for free.

Bitcoin timestamps:
Bitcoin wiki says:
A timestamp is accepted as valid if it is greater than the median timestamp of previous 11 blocks, and less than the network-adjusted time + 2 hours. "Network-adjusted time" is the median of the timestamps returned by all nodes connected to you. As a result, block timestamps are not exactly accurate, and they do not even need to be in order. Block times are accurate only to within an hour or two.
So they have something like a 6x to 12x. timestamp limit. I believe the method I've described is better (and is probably more strict so that if the same rule applies to cryptonote using BTC code, then my limits will override BTC). The reason I think it is better is because a manipulating of the timestamp may also have better control of the nodes connected to it and therefore the timestamp variation they will allow. So an honest miner with different node connections would not be able to reverse the bad timestamp as in my scheme.
Some knowledgeable says "if you have an accurate timestamp, you do not need mining" and people like Szabo have referred to using block height as their clock. I think if every peer has an accurate clock, then updating a distributed database without update conflicts ("double payment") is easy. For example, you could update the blockchain every minute and anyone wanting to make an update would have the first 15 seconds to submit changes and if you tried a "double payment" on the same coin in that 15 seconds then both payments would be rejected. The only reason GPS and time servers are not used is because people didn't want to rely on them. But nodes could use star trackers to occasionally validate their GPS signal and revert to the star trackers to correct their CPU time if GPS time failed and use 10 minute block time if phone cameras and accelerometers are limited to 2.5% accuracy. Tracking the stars for economic reasons goes back as far as knowing when to plant.

No comments:

Post a Comment