Over the past few months, as the Chia blockchain has gotten a lot busier, many farmers have been having serious problems with their Chia full node. From database sync problems caused by power outages, to losing sync because of high transaction volume, a lot of Chia farmers and services are having significant issues with the full node. One of the key complaints is that it is written in Python, and therefore it sucks. That clearly isn't true, and farmer problems don't tell the whole story. The full node definitely needs some work, but it isn't all bad either.
One of the main issues with farming Chia right now is reliability. Even with a really good setup, you cannot be sure you will not be affected by transaction issues. During the Chia dust storm a couple of months ago I had no issues with my setup while people all over the world were unable to maintain sync. Most of the issues seemed to come from low-power farmers unable to keep up with the network. After that problem Chia Network promised, and delivered, an update that would prioritize new blocks and hopefully fix the sync issues.
It did not work. Not entirely.
Twice in the past week my farm has simply lost sync for hours while the GUI showed it was fully synced. A restart of all the services allowed it to reconnect and resync the 2,000 or so blocks I was behind. Judging by the 10% drop in netspace, I am clearly not the only one. Not only that, but services like XCHScan and ChiaExplorer have also been having sync issues recently.
My farming setup is not slow. I am running a 6-core Rocket Lake i5 with power limits disabled and 32GB of 3200MHz DDR4. It can haul ass when it needs to. And I don't think a faster system would have helped here, because I suspect the problem is that a significant portion of the network lagged and I got caught in a pocket that just never caught up. Nothing in my system can change the fact that all my peers were behind. I have since radically reduced my number of peers and gotten ruthless about disconnecting peers that look even a little behind, and it seems to have helped. For now.
Also, if I, or anyone in that pocket, had been running a fast timelord, we may have forked the chain. This is no joke. If the enterprise partners that Chia is courting end up running their own nodes and timelords for reliability, it can't be possible for them to fork the chain and run a lower-weight copy for two hours because some kid decides to flood the mempool with transactions. Reliability of the blockchain means reliability for everyone, always. Not just in general.
But why does the full node lag when transaction volumes spike? This is infuriating because, from what I can tell, the volume isn't spiking even close to the theoretical limits of the blockchain. Yes, 12,000 transactions in the mempool all at once is a lot, but if a database system can't handle 12k concurrent transactions once in a while on modern hardware without falling apart, then it isn't working right, full stop. The first time this happens to a business they will be enraged. The third time in a few months? They will walk away.
And there is at least one previously (and still) successful business that relies on the Chia full node to operate: Flexpool. Alex from Flexpool has been saying consistently for many months that the Chia full node is a huge risk to operations and that the Python codebase simply is not performant enough for prime time. Flexpool has been open about building a Golang-based Chia node for themselves to mitigate this risk. And it is getting pretty hard to argue with him about this after the last few "hiccups".
I also have my concerns, from a security perspective, with hundreds of thousands of end users running Python web servers, as it is really easy for an (admittedly rare) Python remote code execution (RCE) vulnerability to become a full-on remote shell when running as a privileged user (like Chia does on Windows). This is, of course, true of any language or environment, as Log4j has so painfully reminded us, and I would have concerns with any end-user web server. But this is not about security, and that's when I need to consult experts. When I spoke to Alex about this he mentioned Python as an inherent issue, but he also had specific problems he has identified with how the node operates. This is going to get technical because, well, it's Alex. And that's how he is.
The worst thing ever about the Chia node is that it is using Python. Each programming language has its own purpose, and Python is definitely not designed for this. Another problem here is that the Chia node is required to evaluate the spend bundle in order to include it in the mempool; other than that, the Chia node currently has no mempool limits whatsoever. Besides that, during the dust attack, all nodes start to propagate the same spend bundles to the entire network, thus effectively DDoSing it.
Looking at other blockchain node implementations, Ethereum's, for example, has a fixed mempool size limit, and when the mempool exceeds that limit, the node automatically drops low-fee transactions in favor of the ones that pay more.
But of course, the most significant flaw is one of the first I specified before: the node is required to evaluate the bundle in order to include it in the mempool. Neither Bitcoin nor Ethereum is required to do that.

Alex, Lead Engineer, Flexpool
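The fee-based eviction Alex describes for Ethereum-style mempools can be sketched with a simple priority queue. This is a hypothetical illustration of the policy, not code from any real node implementation, and the class and method names are my own:

```python
import heapq

class FeePrioritizedMempool:
    """Toy mempool that evicts the lowest-fee transaction when full.

    Hypothetical sketch of fee-based eviction; real nodes size their
    pools by bytes or execution cost, not transaction count.
    """

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._heap = []  # min-heap keyed on fee: lowest fee sits on top

    def add(self, tx_id: str, fee: int) -> bool:
        """Add a transaction; returns False if it was rejected."""
        if len(self._heap) < self.max_size:
            heapq.heappush(self._heap, (fee, tx_id))
            return True
        # Pool is full: only accept if the new fee beats the current minimum.
        lowest_fee, _ = self._heap[0]
        if fee > lowest_fee:
            heapq.heapreplace(self._heap, (fee, tx_id))  # evict the cheapest
            return True
        return False

    def contents(self) -> set:
        return {tx_id for _, tx_id in self._heap}

pool = FeePrioritizedMempool(max_size=2)
pool.add("a", fee=1)
pool.add("b", fee=5)
pool.add("c", fee=3)   # pool full: evicts "a" (fee 1)
pool.add("d", fee=2)   # rejected: 2 is below the lowest remaining fee (3)
```

Under a fixed cap like this, a dust storm of minimum-fee transactions simply cannot push out fee-paying ones, which is the behavior Alex is pointing at.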
I also asked him what he thought was going on with the disconnects and why the whole network was having issues, and he told me it comes down to slower nodes falling behind and then spamming the network with GetBlock requests and attempts to re-establish sync. He says a lot of why the node behaves this way is because the fee is defined dynamically, so it's not like there isn't a reason for this behavior. It's just the effect on the network that is problematic. I asked if their Go implementation of the Chia node would solve these issues and the answer was an emphatic "Yes!"
But there is more to this story. When I spoke to Gene Hoffman, COO and President of Chia Network, about the issue, he had a different take on it. Apparently, while clearing out the mempool the Chia network was handling around 50% more transactions per hour than the Ethereum network does, at 80,000 per hour to Ethereum's 50,000. If that's the case, then it's not so bad. I will be looking into this, but will need some time to look at the data.
As for Alex's comments about the problems with how Chia has built things, Gene obviously disagreed. As to the spend bundle evaluation, his response was simply "that's how blockchains work". He said he had no idea how Ethereum handles them, "But BTC evaluates every spend before mempool inclusion." I am trying to figure out who is right about Bitcoin and who is not here, but this actually isn't that well documented at this low a level, at least not anywhere I know about. If anyone independent knows, or has documentation to that effect, I would love to see it. I do trust that Alex knows the Ethereum node very well, and I think that's actually a better comparison to Chia than Bitcoin is because of the block speed. Bitcoin only forms one block every ten minutes, with Ethereum and Chia being orders of magnitude faster.
As to the mempool limits, well, they do seem to be set in Chia. The limit is currently set to 10 blocks in the code, and anyone building their own node from source (or just comfortable modifying a Python script) should be able to adjust it easily. It is available here on lines 41/42 as MEMPOOL_BLOCK_BUFFER. But 10 blocks' worth is a lot, and that limit is clearly not helping the farmers who are falling behind during these transaction spikes. So there is a definable limit, but it is set so high that it isn't helping small farmers stay online, which in turn is knocking a bunch more of us off too. I do not know what the implications of radically reducing that would be for small farmers, so don't say I told you to.
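To make that concrete: the cap is expressed in CLVM cost rather than in transactions, with the buffer multiplying the per-block cost limit. The constant names and values below are from my reading of the chia-blockchain source at the time of writing, so verify them against your own checkout before touching anything:

```python
# Approximate mainnet defaults from chia-blockchain (verify before relying
# on them) -- the mempool ceiling is a simple product of two constants.
MAX_BLOCK_COST_CLVM = 11_000_000_000   # max CLVM cost allowed in one block
MEMPOOL_BLOCK_BUFFER = 10              # the "10 blocks" limit discussed above

# Total CLVM cost the mempool will hold before refusing transactions.
mempool_max_total_cost = MAX_BLOCK_COST_CLVM * MEMPOOL_BLOCK_BUFFER

# A node operator building from source could shrink the buffer, e.g. to 3,
# to cut the work a small machine has to do during a dust storm.
smaller_cap = MAX_BLOCK_COST_CLVM * 3
```

That default works out to a very large pool for a Raspberry Pi-class machine to evaluate and gossip during a spike, which is why the limit technically exists but doesn't help.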
The main point here is that from Chia's perspective the network is operating just fine, overall. If some nodes go offline it's not an issue at all, because there are lots of nodes to secure the network. They are looking at the network a bit like a Kubernetes cluster, where no one node makes any difference. However, each one of those nodes is a person who has invested in the network and learned how things work. I don't think they are ignoring the issue per se, but it does look like they are treating the issues less seriously than the community is, because the overall network has been fine. And it has. During the first dust storm, pre-update, the high transaction volumes caused block issuance to slow down. This time, over the past few days, issuance has actually been a little above average.
All in all, I do not think the full node is trash. It is doing its job at scale, and I think a smaller network of performant multicore 100W CPUs would have no issue staying in sync together. However, that's not how Chia Network advertises their network, nor is it their design goal. There has been a lot of back-and-forth discussion in my Discord about whether they are going to end up refactoring the network code to C or abandoning the Raspberry Pi as their minimum specification. This is a bigger question than it might seem, as a huge amount of their corporate identity involves having lots of low-powered nodes.
I think eventually they will do both, increasing the minimum requirements as well as refactoring to a compiled language like C. Or maybe it will be Rust; that's a popular choice for high-performance network code. Either way, there is a lot of work to do before Chia can handle the load the Ethereum network handles all day, every day, and I think they have a long way to go before doing it with a fleet of Raspberry Pis, regardless of the language used.
9 thoughts on “Is the Chia Full Node hot garbage?”
I'm worried about what will happen if the dust stormers leave the script on for a week or a month. How would you feel if your node was down for 1, 2, or 4 weeks with no rewards and no way to sync up, because all you could connect to was zombie nodes dragging you down? It's not a happy prospect, I think.
I disagree that the problem is the networking code being in Python. If you actually trace through the code and measure the parts that take up the most CPU time, you will quickly discover that those parts are already written in C or ASM.
However, there is still some potential when it comes to multithreading. Chia 1.2.11 already made some improvements on that front (separating block validation from spend bundle validation and moving them to separate threads), but there is probably still room for improvement. As it stands right now, if you're running on a decent system, single-core speed is one of the bottlenecks. If you look at your core utilization during a dust storm, you will find that one core is usually maxed out; that is the block validation process, which is the main factor in slow node progression. Better multithreading is what needs the focus now.
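The separation the commenter describes, block validation and spend bundle validation running concurrently instead of fighting over one core, can be sketched with executors. The validator functions below are stand-ins I made up, and note that for CPU-bound CLVM work CPython's GIL means real gains need process pools rather than the threads shown here for simplicity:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in validators -- in the real node these are expensive CLVM runs.
def validate_block(block: dict) -> bool:
    return block.get("signature_valid", False)

def validate_spend_bundle(bundle: dict) -> bool:
    # Reject bundles over the per-block cost limit (illustrative value).
    return bundle.get("cost", 0) <= 11_000_000_000

# Separate pools so a flood of spend bundles cannot starve block
# validation -- the prioritization problem described above.
block_pool = ThreadPoolExecutor(max_workers=1)
mempool_pool = ThreadPoolExecutor(max_workers=2)

block_future = block_pool.submit(validate_block, {"signature_valid": True})
bundle_futures = [
    mempool_pool.submit(validate_spend_bundle, {"cost": c})
    for c in (1_000, 20_000_000_000)
]

block_ok = block_future.result()
bundle_results = [f.result() for f in bundle_futures]

block_pool.shutdown()
mempool_pool.shutdown()
```

Giving block validation its own dedicated worker is the cheap version of "prioritize new blocks": mempool spam queues up behind its own pool instead of in front of the chain tip.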
As Kent Beck once said: “Make it work, make it right, make it fast.” – The team did a great job on the first two parts. Now that those are in place, we’ll hopefully see some performance improvements.
It depends on how you look at networking code. The low-level code may be well optimized, but it is not what you have, it is how you use it. Looking at what happened over the past few days, I would divide the affected nodes into two categories: 1. those that clearly fell behind and couldn't recover, and 2. those that were struggling. Sure, the struggling nodes at some point got tipped over and became the left-behind nodes, but for the sake of this argument, let's keep it like that.
Looking at those that fell behind, what needs to be considered is how Chia does syncing. Basically, it is a serialized process, where data is downloaded from just one node at a time, slowly rotating through nodes with greater height. As such, when a node is behind (say 20 or 50 blocks), it should drop all activity but syncing. It should also drop its peer count to something like 5 or 10, as anything more than that is just a waste (processing one block doesn't take much data to download, but takes a long time, so rotation doesn't need to be fast and there is no need for a lot of peers). This would immediately reduce the burden both on those nodes and on the other nodes they were connecting to.
The other group (the struggling one) had plenty of peers with height 0. Those peers were not exactly peers that were behind or struggling; rather, they were an indication that the struggling node couldn't get through the handshake with all those peers, so it couldn't read their height, let alone read or write much (and what was read was potentially dropped due to synchronization problems). The other peers that showed good heights potentially had the same problem: those heights were basically stale counters. Therefore, nodes that see connections where the up/down MB rate stalls, or where the handshake never completes, should drop those connections after some timeout (say 1 minute). In addition, a new connection should not be made until after another timeout (say 5x the previous one, or better yet an adaptive timeout that grows as more nodes need to be dropped; standard practice). This would give those nodes a primitive back-off mechanism, and of course more CPU to stay fully synced.
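The adaptive timeout suggested above is a classic exponential backoff with decay. Here is a minimal sketch under the comment's own numbers (a 60-second base and a 5x growth factor); the class and its names are hypothetical, not anything in the Chia codebase:

```python
class ReconnectBackoff:
    """Adaptive reconnect delay of the kind the comment suggests.

    The delay grows each time a stalled peer is dropped and snaps back
    to the base once a connection stays healthy.
    """

    def __init__(self, base: float = 60.0, factor: float = 5.0,
                 cap: float = 3600.0):
        self.base = base      # the 1-minute stall timeout from the comment
        self.factor = factor  # the suggested 5x reconnect multiplier
        self.cap = cap        # never wait longer than an hour
        self.delay = base

    def on_peer_dropped(self) -> float:
        """A stalled peer was dropped: wait longer before the next dial."""
        self.delay = min(self.delay * self.factor, self.cap)
        return self.delay

    def on_peer_healthy(self) -> None:
        """A connection survived: relax back to the base delay."""
        self.delay = self.base

backoff = ReconnectBackoff()
first = backoff.on_peer_dropped()    # 300 seconds before redialing
second = backoff.on_peer_dropped()   # 1500 seconds after a second drop
backoff.on_peer_healthy()            # back to the 60-second base
```

The point of the growth is exactly what the comment says: a node drowning in dead handshakes stops spending CPU on redials and gets that CPU back for validation.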
So both things are rather simple to implement, and they have really nothing to do with how good (fast) or bad the network code is, but they obviously point to problems with how it is being used.
Also, the exact same problem showed up with logging on my box. Logs were pushed out less than one millisecond apart. Assuming that at some point the OS-level cache was exhausted (handling two databases of over 40 GB, plus writing 100-byte chunks to the same media every millisecond), that basically brings any process and any media down. The offending process (maxing out one core) was start_full_node. The minute I lowered the log level to ERROR, my node got a breather. Again, it is not how the code looks, but how badly it is being used. I have never in my life seen production code that pumps out disk-bound logs as fast as the CPU allows. It is just senseless coding.
And I fully agree with you on the multithreading issue. One of my cores was maxed out by one start_full_node process while about 10 other start_full_node processes were idling. The fact that those logs were not forked out to a low-priority thread is just moronic.
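Forking logs out to a background thread is a standard-library pattern in Python via `QueueHandler`/`QueueListener`. This is not how Chia actually configures its logging, just a sketch of the fix the comment is asking for; the `StringIO` sink stands in for the disk-bound file handler a real node would use:

```python
import io
import logging
import logging.handlers
import queue

log_queue: "queue.Queue" = queue.Queue(-1)

# Hot path: the node's logger only enqueues records, which is cheap,
# so validation threads never block on disk I/O.
logger = logging.getLogger("full_node_demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

# Background thread: drains the queue and does the slow writes.
# StringIO stands in here for the FileHandler a node would use.
sink = io.StringIO()
listener = logging.handlers.QueueListener(log_queue,
                                          logging.StreamHandler(sink))
listener.start()

logger.info("peak height updated")  # returns without touching the sink
listener.stop()                     # drains and flushes on shutdown
```

With this arrangement, a millisecond-interval log flood queues up in memory instead of stealing the validation core for disk writes, which is exactly the tradeoff the comment describes.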
Although, I think that we disagree on that Kent Beck quote, as far as “make it right” 🙂
Yes, Python is slow because it is human-friendly and intended to be more of a prototyping language, BUT let's not forget that it is an extremely flexible language, and once you know your prototype you can implement the typical Python speedup tricks like JIT/PyPy, C modules, etc. Disclaimer: I haven't checked the chia-blockchain Python code in detail…
Maybe the Chia company feels comfortable knowing that many nodes offline means there are still a lot more online, so the network is secure and the blockchain is progressing as normal. But they can't be happy with the delay in transactions getting through. There have been reports of transactions taking many hours or even longer to be confirmed on the blockchain. Applications such as the carbon rights deal with the World Bank might not be bothered, but it will be a problem for higher-volume, lower-latency applications. At least I think so.
I think the problem is that Chia team promises too much.
E.g. I have a full Daedalus Cardano wallet and it crashes my older iMac with 16GB of RAM as it eats up all the resources, whereas I can easily run a Chia full node plus several other forks on the same machine with no issues. These days I use an even more powerful machine, a Dell T7920, which I use for plotting too. With this setup everything works smoothly, and I had no problems whatsoever during the dust storms.
So, to be honest, I don't think the Chia full node is too bad, although refactoring it in Go or C would certainly improve it. So probably the main issue is that the Chia team promises a full node can run on a Raspberry Pi!
I know it would be frustrating to be a farmer knocked offline during a dust storm, especially for a long duration. It seems a bit excessive to change a lot just to keep a small subset of farmers online all the time when they have the option of upgrading their hardware. I feel like that is a market dynamic of the proof-of-work system. I didn't get knocked offline with my crappy i7, but if I did I would consider upgrading. I don't disagree that it would be good to keep everyone online all the time if there was a reasonable way to achieve it, however that would be achieved. That side of things definitely isn't my skillset. I do agree with you that it presents a possible future issue as the usage on the network grows and there is an attack.
I'm just happy the network didn't go down.
I'm kinda seeing the issue like living in the middle of a forest, miles away from the nearest town, and expecting your power to never go out in storms. If power reliability is an issue, buy a generator. If you can't afford a generator, then you'll just have to wait until all the trees get picked up off the power lines. It's an inconvenience, but there's no real harm done.
I've read that some people can weather the dust storms on a Raspberry Pi 4 with an SSD. But the full node is hot garbage on a Raspberry Pi 4 using an SD card. After fighting with it for months, I decided to upgrade to an i7 with an SSD, and it is a much better experience. If I'd started with the Raspberry Pi with an SSD it might have been better, but at this point I just want something that works and exceeds the minimum specs.