Yesterday a couple of Chia pools were hit with a Denial of Service (DoS) attack and experienced some degree of downtime on their Chia nodes, and probably a ton of stress. Both PoolChia and FlexPool were running their pool Full Nodes in AWS. A couple of other pools also saw spikes in traffic, but nobody else I spoke to was taken offline.
But what can the pools do to protect themselves? The first thing is, do not attach a public IP to the Full Node. If at all possible, put the Node on a private network and forward requests back to it. If you have inbound ports NAT’d back to a private IP you are in a slightly better position, but not significantly, unless your firewall has enough intelligence to drop malformed packets. Most Chia farmers are potentially vulnerable to a DoS attack as well, due to the nature of the software.
This is about to get very technical, with diagrams and everything, so if that’s something that will put you to sleep then now is a good time to turn back.
Now, I haven’t seen any logs of what specifically happened to the pools, so the following is an educated guess based on experience dealing with issues like this. The Chia Full Node client is, at its heart, an HTTP web server. It uses TLS and TLS auth for communication with other nodes. You can check this out yourself by using Nmap or another network scanning tool, but below you will see the results from my Intrusion Detection System and Nmap output for sessions on port 8444.
The two most common types of DoS attack that hit web servers like this are Layer 7 and Layer 4 (HTTP flood and SYN flood, in the most general sense). CloudFlare has a good writeup about these at a high level. A SYN flood can be handled by most normal firewalls, assuming the volume doesn’t overwhelm their physical connections. If it does happen, what you will see is a whole pile of new half-open connections on your firewall, each waiting for a handshake that never completes. This consumes firewall resources until the firewall crashes or cannot accept new connections. This is the most traditional type of attack and the primary reason you should always run everything behind a firewall.
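To make "a pile of half-open connections" concrete, here is a minimal sketch of how you might count them on a Linux host. It assumes the `/proc/net/tcp` text format, where the `st` column holds a hex state code and `03` means SYN_RECV. The function name and the sample data are made up for illustration, and this looks at the host running the node, not at the firewall itself.

```python
SYN_RECV = "03"  # state code Linux uses for a half-open (SYN received) socket


def count_half_open(proc_net_tcp_text: str, local_port: int) -> int:
    """Count SYN_RECV sockets on local_port in /proc/net/tcp-formatted text."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first row is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state = fields[1], fields[3]
        port = int(local_addr.rsplit(":", 1)[1], 16)  # port is hex-encoded
        if port == local_port and state == SYN_RECV:
            count += 1
    return count


# Hypothetical sample: 0x20FC is port 8444, 0x01BB is port 443.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:20FC 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 12345
   1: 0A00000A:20FC 0101A8C0:D431 03 00000000:00000000 01:00000064 00000000     0        0 0
   2: 0A00000A:20FC 0201A8C0:D432 03 00000000:00000000 01:00000064 00000000     0        0 0
   3: 0A00000A:20FC 0301A8C0:D433 01 00000000:00000000 00:00000000 00000000  1000        0 12346
   4: 0A00000A:01BB 0401A8C0:D434 03 00000000:00000000 01:00000064 00000000     0        0 0
"""

half_open_on_8444 = count_half_open(SAMPLE, 8444)
```

On a real machine you would pass in the contents of `/proc/net/tcp` instead of `SAMPLE`. A number that keeps climbing on port 8444 while established connections stay flat is the pattern to watch for.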
The other possibility is that it was an HTTP flood attack, on Layer 7. This involves creating a huge number of HTTP requests that cost the server more processing time to handle than they cost the attacker to create. This will frequently use a botnet, but doesn’t have to, and will keep the application layer of the server so busy that it cannot perform normal tasks. This is the most common form of targeted DoS attack at this point because it is much more difficult to mitigate at the network level. This is the kind of attack I have seen most frequently against this site, and why I pay for a Web Application Firewall.
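One common building block for blunting a request flood is per-client rate limiting. The following is a minimal token-bucket sketch, not how any particular WAF does it; the class name and the `rate` and `burst` parameters are assumptions chosen for illustration.

```python
import time
from typing import Dict, Optional, Tuple


class PerClientRateLimiter:
    """Token bucket per client IP: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        # client_ip -> (tokens remaining, timestamp of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self._state.get(client_ip, (self.burst, now))
        # Refill for the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[client_ip] = (tokens - 1.0, now)
            return True  # request may proceed
        self._state[client_ip] = (tokens, now)
        return False  # client is over its budget, drop or delay the request
```

A proxy would call `limiter.allow(peer_ip)` before doing any expensive work for a request. Keep in mind that a large botnet spreads its requests over many addresses, so a per-IP limit on its own only raises the cost of an attack. It does not stop one.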
My guess is that the specific DoS yesterday was Layer 7, because I’m fairly certain AWS would stop a SYN flood at the network level, but I cannot be sure. Either way, the preferred solution would address both issues. In their post on the attack, FlexPool once again places the blame on the Chia Full Node code, which is Python-wrapped C and C++ from my understanding. While it is possible that their upcoming Golang Full Node could be more performant, there is no magical programming language that prevents flood attacks. The solution to attacks like this is architectural, not programmatic.
The key is to not expose the full node operating your pool directly to the internet, even though it needs to be connected to other nodes in order to actually do its job. There are 3 main ways to accomplish this, each one costing more in infrastructure and maintenance than the last. The biggest roadblock to deploying truly secure pool infrastructure is going to be cost. When I spoke to J. Eckert at Chia Network about this issue he told me that they had all their stuff behind Cloudflare, and that this should be standard. I did not argue with him, but that is not really a feasible solution for most people. Sure, the pool endpoints and other 443/HTTPS services can easily be put behind Cloudflare, and you can get a WAF with it for $20/month. But using a custom TCP reverse proxy with DDoS protection requires an Enterprise account, which is going to be very expensive: hundreds or thousands of dollars a month expensive.
The first and easiest solution to this problem, and what should be the minimum, is to simply stick a TCP reverse proxy in front of your Full Node and allow no direct traffic to it from the internet. The node should not even have a public IP address. All communication from your node goes to the internal address of the proxy, and all the other nodes talk to the proxy’s external address. Even better if the proxy has a Web Application Firewall module so you can address common Layer 7 attacks. There are a number of open source solutions here, and while setting one up is complicated, it is almost certainly worth it and not so complicated that pools will be unable to deploy it.
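To show what a TCP reverse proxy actually does, here is a bare-bones sketch using Python’s asyncio. It is for illustration only, a pool should use a hardened open source proxy rather than this. The function names, the `max_conns` cap, and the example addresses are all assumptions. Bytes are passed through untouched, so the TLS session still runs end to end between the remote peer and the node.

```python
import asyncio


async def _pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes one way until the reading side closes, then close the writer."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


def make_handler(backend_host: str, backend_port: int, max_conns: int = 512):
    """Build a connection handler that forwards to one private backend node."""
    active = 0

    async def handle(client_reader, client_writer):
        nonlocal active
        if active >= max_conns:  # shed load instead of passing it to the node
            client_writer.close()
            return
        active += 1
        try:
            backend_reader, backend_writer = await asyncio.open_connection(
                backend_host, backend_port
            )
        except OSError:
            active -= 1
            client_writer.close()
            return
        try:
            await asyncio.gather(
                _pipe(client_reader, backend_writer),
                _pipe(backend_reader, client_writer),
            )
        finally:
            active -= 1

    return handle


async def serve(listen_host, listen_port, backend_host, backend_port):
    server = await asyncio.start_server(
        make_handler(backend_host, backend_port), listen_host, listen_port
    )
    async with server:
        await server.serve_forever()

# Example (hypothetical addresses):
# asyncio.run(serve("0.0.0.0", 8444, "10.0.0.5", 8444))
```

The point of the connection cap is that the proxy, not the Full Node, absorbs the excess. One side effect of a plain TCP proxy like this is that the node sees every peer as coming from the proxy’s address.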
The second solution is a bit more extensive and requires configuring a second full node: one to talk to the outside world, still behind a TCP reverse proxy and WAF, and one for your pool application to talk to. This would probably require some reconfiguration of the existing setup and is a little more complicated, but it gives your pool a small buffer if the first node gets taken out by an attack. This setup still has some risk, but it is almost certainly going to be easier to sort out the first node while the pool is talking to the second, which is still safely up and just not talking to anyone. You can also, in an emergency, allow your protected node to reach outbound to other nodes in order to keep the pool up if your first node is under a sustained attack.
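The emergency switch described above is ultimately a judgment call, but the trigger can be sketched as a simple health monitor. This is a hypothetical example: the class name, the threshold of 3 failed checks, and the mode strings are assumptions, and actually changing the protected node’s peering is left to whatever tooling you use.

```python
import socket
from typing import Tuple


def tcp_reachable(node: Tuple[str, int], timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection(node, timeout=timeout):
            return True
    except OSError:
        return False


class FailoverMonitor:
    """Track consecutive failed checks of the public-facing node."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> str:
        # Any successful check resets the streak; failures accumulate.
        self.failures = 0 if healthy else self.failures + 1
        if self.failures >= self.threshold:
            return "emergency-outbound"  # let the protected node dial out
        return "normal"  # protected node peers only with the public node
```

A scheduled job could call `monitor.record(tcp_reachable(("10.0.0.5", 8444)))` once a minute (the address is a placeholder) and alert a human when the mode changes. Requiring several failures in a row avoids flapping on a single dropped check.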
The following is a little extreme. It is the correct way to architect a multilayer application like this, but the costs can be significant. It can get even more wild than this with zero-trust solutions, but that requires so much additional infrastructure I don’t consider it a relevant option for even the largest pools. The following diagram shows a load-balanced set of full nodes. This ensures a highly available front end connection to the wider Chia network and allows for an attack, or a typical maintenance window, to take one node out without any disruption to operations. It also includes a back end set of load balancers to ensure that the Application node can access a Public Node at any time. This will ensure proper availability for a 24/7/365 service.
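The core behavior of the load balancers in this design, spreading connections across the nodes and skipping any that are down, can be sketched in a few lines. This is an illustrative round-robin picker with made-up names, not a replacement for a real load balancer, which also handles health checking, connection draining, and its own redundancy.

```python
from typing import Hashable, Iterable, List, Optional, Set


class RoundRobinPool:
    """Rotate through nodes in order, skipping any marked unhealthy."""

    def __init__(self, nodes: Iterable[Hashable]) -> None:
        self.nodes: List[Hashable] = list(nodes)
        self.healthy: Set[Hashable] = set(self.nodes)
        self._next = 0

    def mark(self, node: Hashable, healthy: bool) -> None:
        """Record the result of a health check for one node."""
        if healthy:
            self.healthy.add(node)
        else:
            self.healthy.discard(node)

    def pick(self) -> Optional[Hashable]:
        """Return the next healthy node, or None if every node is down."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        return None
```

Taking a node out for maintenance is then just `pool.mark(node, False)`, and traffic keeps flowing to the rest, which is the property that makes the attack or maintenance scenario above non-disruptive.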
This is expensive. It might not happen. As we have discussed before, the pools are running on very slim margins. It’s possible Space Pool should adopt the third architecture to ensure constant uptime, but security costs money. It will be slightly cheaper for pools running on their own hardware, but anyone running on public cloud infrastructure will need to be careful about pushing operational costs too high with the price of XCH so low. But security cannot be an afterthought; it has to be built right into the design.