Over the last couple of days we have seen some major mempool spikes on the Chia network, knocking farmers offline. It looks similar to, although not exactly like, the Chia dust storm from a couple of months ago. My own farm has been knocked offline a couple of times, and the same is true for a number of farmers around the world.
So what is happening? It looks like someone is stuffing thousands of transactions into the mempool all at once, and that is causing a cascade of issues as slower nodes lose sync and start acting as zombie nodes, putting additional load on the nodes that do stay in sync.

As you can see, the normal transaction volume is being completely dwarfed by the spikes hitting it right now. But if you look at the Bitcoin or Ethereum mempools, they are significantly busier than the Chia network is even during these spikes. If Chia wants to supplant Ethereum as the programmable smart chain of choice, it needs to be able to easily support more than 10x this load at all times.

I don’t really have any answers here, but you can be sure I will be asking for some, both from Chia Network, who has responded to these issues in the past, and from others in the community, about why the Chia node falls over with what would seem to be a feather brush to an Ethereum node. If it is just low-powered nodes, then they need to go. Permanently. If it is something else, then that needs to be fixed, because right now, even with decent hardware, it's luck of the draw whether you or your peers get knocked out by one of these spikes.
“…right now, even with decent hardware, it's luck of the draw whether you or your peers get knocked out by one of these spikes.” Not true. If you have sufficient hardware power and network response, there is no issue, IMHO and in my experience. Chia’s insistence that Pis or similar are “good enough”, for example, is hurting Chia’s chances of graduating into the big league of crypto, such as the Ethereum you mention, in terms of transactions.
It is true. I’m farming on a Rocket Lake i5 with 32 GB of RAM and I’ve been knocked out by both.
Chia is not RAM-size bound, but rather RAM-speed bound. So most of your RAM is kind of being wasted (yes, the OS is using it for some general caching). Potentially, using it as a cache in front of your db folder would make better use of it.
Also, Chia is not CPU bound per se, but rather single-core bound. Once you have one core maxed out, your node is SOL, regardless of how many cores you have.
I have been knocked out with an AMD Ryzen 7 3700X (3.6/4.4 GHz, 32 MB cache, AM4, Wraith Prism, boxed) and 32 GB of RAM.
I guess, if we want to move this issue forward, there is no point in repeating empty statements that play well in Chia’s favor (general rambling about RPis or other unspecified “low spec” nodes). I would rather focus on the issues we see with our own setups than point fingers at “others” who don’t have “as good” setups as we have.
As for your statement that your “Rocket Lake i5 with 32 GB RAM” fell behind, what symptoms did you see?
My node (i5-8250U) fell behind today. The first problem was CPU usage. At first glance it looked OK, as it was running at “only” 30% (normal is ~5%); however, when I looked at the individual cores, one was fully choked while the others were mostly idling (4 physical cores, 8 threads). Behind that choked core was the start_full_node process. So it initially seemed obvious to me that my box could not handle all those transactions. Not so fast.

I use ChiaDog to monitor my farm, and it was complaining about my node struggling. Also, my pool was telling me that I had plenty of stales (usually I have next to zero). I checked my debug.log, and it was a mess: 99% of it was add_spendbundle entries, and the file was maxing out in about 5 minutes. Those lines were coming less than one millisecond apart. I dropped the log level to ERROR (basically disabling ChiaDog, or any other monitoring tool). I also dropped the number of peers down to 10 (my lesson from the original dust storm).
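For anyone who wants to put a number on it, here is a rough sketch (mine, not part of any Chia tooling) that counts add_spendbundle lines per second in debug.log. It assumes the default log location and the standard timestamp prefix at the start of each line; adjust both for your own setup.

# Count add_spendbundle log lines per second to gauge how hard the
# mempool traffic is hitting the node. Assumes the default log path and
# a timestamp like 2021-12-08T12:34:56.789 at the start of each line.
from collections import Counter
from pathlib import Path

LOG_PATH = Path.home() / ".chia" / "mainnet" / "log" / "debug.log"

def spendbundle_rate(path=LOG_PATH):
    per_second = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            if "add_spendbundle" in line:
                per_second[line[:19]] += 1  # first 19 chars = second resolution
    return per_second

if __name__ == "__main__":
    rate = spendbundle_rate()
    if rate:
        worst, count = max(rate.items(), key=lambda kv: kv[1])
        print(f"Busiest second: {worst} with {count} add_spendbundle lines")

The per-second counts make it easy to see whether a storm is ramping up or tailing off.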
After restarting chia (a few hours ago), it has been running smoothly. No issues, no stales, CPU usage a bit higher than normal, but no single core being choked. At least so far. At the moment my box can handle this small dust storm's traffic, but I guess it will not scale much beyond that.
The obvious question for me is: what kind of idiot leaves engineering logs in production code? Why are those log entries not forked off to a separate (low-priority) thread, so that even if more lines are generated they affect the main code less, and if the logger is overwhelmed it can drop some lines to let the main code run better? Why is there no higher-level code that suppresses repeating log entries and eventually emits a single line stating that the entry was repeated N times in the last M seconds? What is the main job of the node: logging, or running the node?
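To make it concrete, here is a generic Python sketch (not Chia's code) of the kind of logging I mean: records are handed to a background thread through a bounded queue, so the hot path never blocks on disk I/O, and a simple filter collapses identical messages into one "repeated N times" line. The class names and thresholds are illustrative only.

# Generic sketch of non-blocking, de-duplicated logging; thresholds and
# names are made up for illustration, not taken from chia-blockchain.
import logging
import logging.handlers
import queue
import time

class DedupFilter(logging.Filter):
    """Drop messages identical to the previous one; summarize on the next change."""
    def __init__(self, window=5.0):
        super().__init__()
        self.window = window
        self.last_msg = None
        self.dropped = 0
        self.first_seen = 0.0

    def filter(self, record):
        msg = record.getMessage()
        now = time.monotonic()
        if msg == self.last_msg and now - self.first_seen < self.window:
            self.dropped += 1
            return False                       # suppress the duplicate
        if self.dropped:
            record.msg = f"{msg} (previous message repeated {self.dropped} more times)"
            record.args = ()
        self.last_msg, self.dropped, self.first_seen = msg, 0, now
        return True

log_queue = queue.Queue(maxsize=10_000)        # bounded so logging cannot grow without limit
handler = logging.handlers.QueueHandler(log_queue)
handler.addFilter(DedupFilter())
listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
listener.start()

log = logging.getLogger("full_node")
log.addHandler(handler)
log.setLevel(logging.INFO)

for _ in range(1000):
    log.info("add_spendbundle took too long")  # only the first one is written out
log.info("node back to normal")                # triggers the "repeated N times" summary
listener.stop()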
As for that code choking one core, that is a sign the code is single-threaded. Looking at other parts of the code (e.g., the plotter), there is no multithreading expertise behind this project. This is why we had the first dust storm (if you read what was said during that storm and the post mortem of what was “fixed”) – basically, screwed-up thread synchronization between the peer-handling and db-handling processes (which I flagged during the dust storm, and which was acknowledged as one of the problems). It really will not matter what hardware we have if the code chokes on one core while the others sit mostly idle.
Another issue is the use of magic numbers. In this case it is the peer count, whose default value is set to 80. Anyone, and I mean anyone, who writes anything network-related understands that the code should not be based on arbitrary numbers, but rather on the state of the network. If the network handling is overwhelmed, there needs to be a back-off protocol. When the network is in good shape, you can increase your connections and handle more traffic.
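Here is a hypothetical sketch of what a state-driven peer limit could look like instead of a hard-coded 80. None of these names or thresholds come from chia-blockchain; it is just the shape of the idea (AIMD-style back-off).

# Hypothetical state-driven peer limit: halve under pressure, creep back up
# when healthy. Thresholds and function names are made up for illustration.
MIN_PEERS, MAX_PEERS = 10, 80

def next_peer_limit(current_limit, mempool_backlog, sync_lag_seconds):
    overloaded = mempool_backlog > 5_000 or sync_lag_seconds > 60
    if overloaded:
        return max(MIN_PEERS, current_limit // 2)   # multiplicative back-off
    return min(MAX_PEERS, current_limit + 2)        # slow additive recovery

# A node with 80 peers that falls 3 minutes behind drops to 40, then 20,
# and only climbs back by 2 peers per evaluation interval once it recovers.
print(next_peer_limit(80, mempool_backlog=12_000, sync_lag_seconds=180))  # 40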
So, should I blame the software engineer who wrote it? Possibly not, as those are clearly entry-level, lack-of-training problems. In a normal company, that code would go through a code review process. Then it would go through a QA process. Then it should be approved by both the head of engineering and the head of QA. If there are known problems, the team should get some extra training, or potentially some more experienced people should be hired. So, in my book, that is where the problem is – with the head of engineering.
Lastly, here is a quote from the Twitter team that rewrote their base code in a different (better) language (the original was Ruby) and on a different db:
“In 2008, Twitter search handled an average of 20 TPS [tweets per second] and 200 QPS. By October 2010, when we replaced MySQL with Earlybird, the system was handling 1,000 TPS and 12,000 QPS on average”
That was roughly a 50x better end result. However, it was a deliberate decision by the Twitter team, which acknowledged that hardware scaling was potentially at the end of the line and that some other solution was needed (a code rewrite, not just patches here and there).
Another way of looking at this problem is that we have over 300k nodes out there, and behind those nodes roughly the same number of people maintaining them. Is the solution to blame part of those 300k nodes, or rather to blame the company that is not hiring a couple of experienced engineers to rewrite the base code? Another option is to outsource the code, or parts of it, to third-party developers. Flex, Farmr, MadMax, and Bladebit are good examples of what can be done by more or less a single person who understands the problem at hand. Flex is already working on a full node solution.
Well, fully agreed on MadMax and Bladebit, but the extreme sensitivity of Flexpool’s implementations to dust storms does not make them the most obvious example of the resilient infrastructure we’re all expecting, IMO.
I fully agree with you about Flex. However, at some point people will need to decide whether to go with Flex or keep upgrading their hardware without any assurance of what will work and what won't. If Flex succeeds in running a full node on RPis (it looks like it will), we may see plenty of people switching to them, regardless of what approach they take. Potentially their approach is only relevant to people like you and me.
This is also why I said that Chia “should outsource” – i.e., give them explicit guidelines on what needs to be done, and pay for code that meets those requirements. Not give them a blank check to do whatever they want.
Also, having some other options for a full node would put more pressure on Chia to get their act together.
The net space dropped by about 3.5 EB. I was checking Flexpool's Top Farmers (10 farms, 2.5–8.9 PB), and they had about 2% stales. That leaves about 3 EB to mostly smaller farmers. Assuming that an average farm that was knocked off was around 50 TB, that would amount to about 60k farms going down – about 20% of the farms out there.
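The same back-of-envelope math in a few lines, with all inputs being the rough numbers above rather than measured values:

# Rough estimate only: how many ~50 TB farms does a 3 EB net space drop imply?
netspace_drop_tb = 3.0 * 1_000_000   # ~3 EB attributed to smaller farms, in TB
avg_farm_tb = 50                     # assumed size of a knocked-out farm
total_farms = 300_000                # approximate node count

farms_down = netspace_drop_tb / avg_farm_tb
print(farms_down, farms_down / total_farms)   # 60000.0 0.2 -> ~20% of farms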
It would be interesting to know whether there are any estimates of how many farms were severely affected, and what hardware they run.
By the way, I would really like Chia to run their own dust storms once a month or so (say, the first Saturday of the month), just to get some assessment of the network's status. Otherwise, we will get those script kiddies running them from their moms' basements, scoring brownie points.
So if I read those “graphs” correctly, 6k transactions were basically a major issue for the Chia blockchain while ETH was rocking 120k transactions? Please, someone correct me and educate this young padawan! Otherwise, how the F is Chia even hoping to go anywhere when it can’t even handle 1/20 of what the current ETH network can do?
Please understand this:
Multithreading vs. Multiprocessing.
The purpose of both multithreading and multiprocessing is to maximize CPU utilization and improve execution speed, but there are some fundamental differences between a thread and a process.
When a process creates threads to execute in parallel, those threads share the memory and other resources of the main process. This makes the threads dependent on each other.
Unlike threads, processes do not share resources among themselves, so they can run in a completely independent fashion.
Python provides functionality for both multithreading and multiprocessing. But multithreading in Python has a problem, and that problem is called the GIL (Global Interpreter Lock).
Because of the GIL, people choose multiprocessing over multithreading.
Python provides a multiprocessing package, which allows spawning processes from the main process that can run on multiple cores in parallel and independently.
The multiprocessing package provides a Pool class, which allows parallel execution of a function over multiple input values. Pool divides those inputs among multiple processes which run in parallel.
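A minimal Pool example of the idea (the workload here is just a stand-in for CPU-heavy work like validating a spend bundle, and the numbers are arbitrary):

# Spread the same CPU-bound function over several worker processes so that
# no single core (and no single GIL) becomes the bottleneck.
from multiprocessing import Pool

def hash_work(n):
    total = 0
    for i in range(n):
        total = (total * 31 + i) % 1_000_003
    return total

if __name__ == "__main__":
    inputs = [200_000] * 8
    with Pool(processes=4) as pool:      # 4 worker processes, roughly 4 cores
        results = pool.map(hash_work, inputs)
    print(results)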
Many (maybe almost all) farmers also farm multiple forks at once 🙁