Over the past few months, as the Chia blockchain has gotten a lot busier, many farmers have been having a ton of problems with their Chia full node. From database sync problems caused by power outages to lost sync caused by high transaction volume, a lot of Chia farmers and services are having significant issues with the full node. One of the key complaints being made is that it is written in Python, and so it sucks. This clearly isn’t true, and farmer problems don’t tell the whole story. The full node definitely needs some work, but it isn’t all bad either.
One of the main issues with farming Chia right now is reliability. Even with a really good setup, you cannot be sure you will not be affected by transaction issues. During the Chia Dust Storm a couple of months ago I had no issues with my setup while people all over the world were unable to maintain sync. Most of the issues seemed to come from low-power farmers who were not able to keep up with the network. After that problem, Chia Network promised, and delivered, an update that would prioritize new blocks and hopefully fix the sync issues.
It did not work. Not entirely.
Twice in the past week my farm has lost sync with the network for many hours while the GUI still showed it as synced. Restarting all the services allowed it to reconnect and resync the 2000 or so blocks I was behind. Judging by the 10% drop in netspace, I am clearly not the only one. Not only that, but services like XCHScan and ChiaExplorer have also been having sync issues recently.
My farming setup is not slow. I am running a 6-core Rocket Lake i5 with power limits disabled and 32GB of 3200 MHz DDR4. It can haul ass when it needs to. And I don’t think a faster system would have helped here, because I suspect the problem is that a significant portion of the network lagged and I got caught in a pocket that just never caught up. Nothing in my system can change the fact that all my peers were behind. I have since radically reduced my number of peers and gotten ruthless about disconnecting peers that look a little behind, and it seems to have helped. For now.
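As a rough sketch of what "disconnecting peers that look a little behind" can mean in practice, the snippet below flags peers whose reported peak height trails the best peak among the current connections. The `Peer` class, the `lagging_peers` helper, the `max_lag` threshold and the sample heights are all hypothetical and are not taken from the Chia code base.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Hypothetical view of a connected peer; not Chia's own data structure."""
    host: str
    peak_height: int  # highest block height this peer has reported


def lagging_peers(peers: list[Peer], max_lag: int = 5) -> list[Peer]:
    """Return peers more than max_lag blocks behind the best peak we can see."""
    if not peers:
        return []
    best = max(p.peak_height for p in peers)
    return [p for p in peers if best - p.peak_height > max_lag]


# Made-up sample data: the third peer is 2000 blocks behind the best one.
peers = [
    Peer("10.0.0.1", 1_300_000),
    Peer("10.0.0.2", 1_299_998),
    Peer("10.0.0.3", 1_298_000),
]
print([p.host for p in lagging_peers(peers, max_lag=5)])  # ['10.0.0.3']
```

Note the limitation: this only compares peers against each other, so it cannot detect a pocket in which every connected peer is behind the rest of the network.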
Also, if I, or anyone else in that pocket, had been running a fast timelord, we may have forked the chain. This is no joke. If the enterprise partners that Chia is courting end up running their own nodes and timelords for reliability, that setup can’t be allowed to fork the chain and leave them running a lower-weight copy for 2 hours because some kid decides to flood the mempool with transactions. Reliability of the blockchain means reliability for everyone, always. Not just in general.
But why does the full node lag when transaction volumes spike? This is infuriating because, from what I can tell, it’s not spiking even close to the theoretical limits of the blockchain. Yes, 12 000 transactions in the mempool all at once is a lot, but if a database system can’t handle 12k concurrent transactions once in a while on modern hardware without falling apart, then it isn’t working right, full stop. The first time this happens to a business they will be enraged. The third time in a few months? They will walk away.
And there is at least one previously (and still) successful business that relies on the Chia full node to operate: Flexpool. Alex from Flexpool has been saying consistently for many months that the Chia full node is a huge risk to operations and that the Python code base simply is not performant enough for prime time. Flexpool has been open about working on a Golang-based Chia node for themselves so they can mitigate this risk. And it is getting pretty hard to argue with him about this after the last few “hiccups”.
I also have my concerns, from a security perspective, with hundreds of thousands of end-users running Python web servers, as it is really easy for an (admittedly rare) Python Remote Code Execution (RCE) vulnerability to become a full-on remote shell when running as a privileged user (like Chia does on Windows). This is, of course, true of any language or environment, as Log4j has so painfully reminded us, and I would have concerns with any end-user web server. But this piece is not about security, and for the performance side I need to consult experts. When I spoke to Alex about this he mentioned Python as an inherent issue, but he also had specific problems he has identified with how the node operates. This is going to get technical because, well, it’s Alex. And that’s how he is.
The worst thing ever about the Chia node is that it is using Python. Each programming language has its own purpose, and Python is definitely not designed for this. Another problem here is that the Chia node is required to evaluate the spend bundle in order to include it to the mempool; other than that, Chia node has no mempool limits as of now whatsoever. Besides that, during the dust attack, all nodes start to propagate the same spend bundles to the entire network, thus effectively DDoS’ing it.
Looking at other blockchain node implementations, Ethereum’s for example, has a fixed mempool size limit, and when the mempool size is exceeding the limits, the node is automatically shifting low fee transactions in favor of the ones who pay more.
But of course, the most significant flaw is one of the first I’ve specified before – is that the node is required to evaluate the bundle in order to include it to the mempool. Neither Bitcoin nor Ethereum are not required to do that.
Alex, Lead Engineer, Flexpool
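To make the fixed-size mempool idea from the quote concrete, here is a minimal sketch of a pool that, once full, only admits a transaction if it pays more than the cheapest one already held, and drops that cheapest one to make room. It is a toy model written for this article's discussion, not code from the Chia or Ethereum nodes; `PendingTx`, `BoundedMempool`, the `fee_per_cost` field and the sample transactions are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingTx:
    fee_per_cost: float                # sort key: the lowest fee rate is evicted first
    tx_id: str = field(compare=False)  # excluded from ordering


class BoundedMempool:
    """Toy fixed-capacity mempool; illustrative only."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap: list[PendingTx] = []  # min-heap keyed on fee_per_cost

    def add(self, tx: PendingTx) -> bool:
        """Try to admit tx; return True if it is in the pool afterwards."""
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, tx)
            return True
        # Pool is full: admit only if tx pays more than the current cheapest entry.
        if tx.fee_per_cost <= self._heap[0].fee_per_cost:
            return False
        heapq.heapreplace(self._heap, tx)  # pop the cheapest, push the newcomer
        return True

    def tx_ids(self) -> set[str]:
        return {tx.tx_id for tx in self._heap}


pool = BoundedMempool(capacity=2)
pool.add(PendingTx(1.0, "dust-a"))
pool.add(PendingTx(5.0, "normal"))
pool.add(PendingTx(0.5, "dust-b"))  # rejected: pays less than everything held
pool.add(PendingTx(9.0, "urgent"))  # admitted: evicts "dust-a"
print(sorted(pool.tx_ids()))        # ['normal', 'urgent']
```

The point of the design is that a flood of low-fee transactions cannot grow the pool past its cap; it can only compete on fee with what is already there.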
I also asked him what he thought was going on with the disconnects and why the whole network was having issues, and he told me that it comes down to slower nodes falling behind and then spamming the network with GetBlock requests and attempts to re-establish sync. He says a lot of the reason the node behaves this way is that the fee is defined dynamically, so it’s not like there isn’t a reason for this stuff. It’s just the effect on the network that is problematic. I asked if their Go implementation of the Chia node would solve these issues and the answer was an emphatic “Yes!”
But there is more to this story. When I spoke to Gene Hoffman, COO and President of Chia Network, about the issue he had a different take on it. Apparently, while clearing out the mempool, the Chia network was handling around 50% more transactions per hour than the Ethereum network does, with 80 000 per hour to Ethereum’s 50 000. If that’s the case, then it’s not so bad. I will be looking into this, but I will need some time to look at the data.
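For a sense of scale, the two hourly figures quoted above can be turned into a percentage gap and a per-second rate. This is plain arithmetic on the quoted numbers, not independent data; the variable names are just labels chosen here.

```python
chia_tx_per_hour = 80_000  # figure quoted above for Chia while clearing the mempool
eth_tx_per_hour = 50_000   # figure quoted above for Ethereum

ratio = chia_tx_per_hour / eth_tx_per_hour
percent_more = (ratio - 1) * 100

chia_tps = chia_tx_per_hour / 3600  # 3600 seconds per hour
eth_tps = eth_tx_per_hour / 3600

print(f"Chia handled {percent_more:.0f}% more transactions per hour")
print(f"Chia: {chia_tps:.1f} tx/s, Ethereum: {eth_tps:.1f} tx/s")
```

Taken at face value, 80 000 against 50 000 is a 60% gap, or roughly 22 transactions per second against roughly 14.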
As for Alex’s comments about the problems with how Chia has built things, Gene obviously disagreed. As to the spend bundle evaluation, his response was simply “that’s how blockchains work”. He said he had no idea how Ethereum handles them: “But BTC evaluates every spend before mempool inclusion.” I am trying to figure out who is right about Bitcoin here, but this actually isn’t that well documented at this low a level, at least not anywhere I know about. If anyone independent knows, or has documentation to that effect, I would love to see it. I do trust that Alex knows the Ethereum node very well, and I think that’s actually a better comparison to Chia than Bitcoin is because of the block speed. Bitcoin only forms 1 block every 10 minutes, with Ethereum and Chia being orders of magnitude faster.
As to the mempool limits, well, they do seem to be set in Chia. The limit is currently set to 10 blocks in the code, and anyone building their own node from source (or just comfortable modifying a Python script) should be able to adjust it easily. It is available here on lines 41/42 as MEMPOOL_BLOCK_BUFFER. But 10 blocks is a lot, and that limit is clearly not helping the farmers who are getting behind during these transaction spikes. So there is a definable limit, but it is set so high that it isn’t helping small farmers stay online, which in turn is knocking a bunch more of us off too. I do not know what the implications of radically reducing that would be for small farmers, so don’t say I told you to.
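To show how a limit expressed in blocks becomes an actual ceiling, the sketch below assumes the mempool cap is simply the block buffer multiplied by a maximum per-block cost. That formula, the `MAX_BLOCK_COST` value and the `mempool_max_total_cost` helper are assumptions made for illustration; only the `MEMPOOL_BLOCK_BUFFER` name and the value of 10 come from the text above, so check the constants in your own checkout before relying on any of it.

```python
# Assumed values for illustration only.
MAX_BLOCK_COST = 11_000_000_000  # hypothetical per-block cost ceiling
MEMPOOL_BLOCK_BUFFER = 10        # the 10-block buffer mentioned above


def mempool_max_total_cost(max_block_cost: int, block_buffer: int) -> int:
    """Total cost the mempool may hold if it is sized as a number of full blocks."""
    return max_block_cost * block_buffer


current_cap = mempool_max_total_cost(MAX_BLOCK_COST, MEMPOOL_BLOCK_BUFFER)
reduced_cap = mempool_max_total_cost(MAX_BLOCK_COST, 2)  # hypothetical smaller buffer

print(f"10-block buffer: {current_cap:,}")  # 110,000,000,000
print(f" 2-block buffer: {reduced_cap:,}")  # 22,000,000,000
```

Under these assumptions, cutting the buffer from 10 blocks to 2 shrinks the amount of pending work a node agrees to hold by a factor of five; what that does to fee behavior and to the rest of the network is a separate question.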
The main point here is that from Chia’s perspective the network is operating just fine, overall. If some nodes go offline it’s not an issue at all, because there are lots of nodes to secure the network. They are looking at the network a bit like a Kubernetes cluster where no one node makes any difference. However, each one of those nodes is a person who has invested in the network and learned how things work. I don’t think they are ignoring the issue per se, but it does look like they are treating the issues less seriously than the community does because the overall network has been fine. And it has. During the first dust storm, pre-update, the high transaction volumes caused block issuance to slow down. This time, over the past few days, issuance has actually been a little above average.
All in all I do not think the full node is trash. It is doing its job at scale, and I think a smaller network of performant multicore 100 W CPUs would have no issue staying in sync together. However, that’s not how Chia Network advertises their network, nor is it their design goal. There has been a lot of back-and-forth discussion in my Discord about whether they are going to end up refactoring the network code to C or abandoning the Raspberry Pi as their minimum specification. This is a bigger question than it might seem, as a huge amount of their corporate identity involves having lots of low-powered nodes.
I think eventually they will do both, increasing the minimum requirements as well as refactoring to a compiled language like C. Or maybe it will be Rust, which is a popular choice for high-performance network code. Either way, there is a lot of work to do before Chia can handle the load the Ethereum network handles all day every day, and I think they have a long way to go before doing it with a fleet of Raspberry Pis, regardless of what language is used.