I ordered a set of 10 Compute Blades in April 2023 (two years ago), and they just arrived a few weeks ago. In that time Raspberry Pi upgraded the CM4 to a CM5, so I ordered a set of 10 16GB CM5 Lite…
There's another Pi-powered blade computer, the Xerxes Pi. It's smaller and cheaper, but it just wrapped up its own Kickstarter. Will it ship in less than two years? Who knows, but I'm a sucker for crowdfunded blade computers, so of course I backed it!
But my main question, after sinking a substantial amount of money into this: are Pi clusters even worth it anymore? There's no way this cluster could beat the $8,000, 4-node Framework Desktop cluster in raw performance. But what about in price per gigaflop, or in efficiency or compute density?
There's only one way to find out.
I made a video going over everything in this blog post—and the entire cluster build (and rebuild, and rebuild again) process. You can watch it here, or on YouTube:
But if you're on the blog, you're probably not the type to sit through a video anyway. So moving on...
In the course of going from 'everything's in the box' to 'running AI and HPC benchmarks reliably', I rebuilt the cluster basically three times:
The first benchmark I ran was my top500 High Performance Linpack cluster benchmark. This is my favorite cluster benchmark, because it's the traditional benchmark they'd run on massive supercomputers to get on the top500 supercomputer list.
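If you want to try it yourself, the gist is: build HPL against an MPI stack and a BLAS library, list the nodes in a hostfile, and launch one rank per core. A rough sketch, not my exact setup; the hostnames, slot counts, and a pre-tuned HPL.dat are assumptions for a 10-blade, 4-cores-per-blade cluster:

    # Hypothetical HPL launch across 10 blades (OpenMPI syntax).
    # Assumes the compiled xhpl binary and a tuned HPL.dat already
    # sit in this directory on every node.
    cat > hosts <<'EOF'
    blade01 slots=4
    blade02 slots=4
    # ... through blade10
    EOF
    mpirun -np 40 --hostfile hosts ./xhpl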
Before I installed heatsinks, the cluster got 275 Gflops, which is an 8.5x speedup over a single 8 GB CM5. Not bad, but I noticed the cluster was only using 105 Watts of power during the run. Definitely more headroom available.
After fixing the thermals, the cluster did not throttle, and used around 130W. At full power, I got 325 Gflops, which is a 10x performance improvement (for 10x 16GB CM5s) over a single 8 GB CM5.
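That also gives a quick answer on the efficiency front: 325 Gflops at about 130W works out to roughly 2.5 Gflops per watt for the whole cluster, network and all.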
Compared to this cluster, the $8,000 Framework cluster I benchmarked last month is about 4 times faster:
This is a bad cluster. Except for maybe blade 9, which dies every time I run a benchmark. But I will keep it going, knowing it's definitely easier to maintain than the 1,050-node Pi cluster at UC Santa Barbara, which to my knowledge is still the world's largest!
Before I go, I just wanted to give a special thanks to everyone who supports me on Patreon, GitHub, YouTube Memberships, and Floatplane. It really helps when I take on these months- (or years!) long projects.
You might not want to replicate my cluster setup — but I always get asked what parts I used (especially the slim Ethernet cables... everyone asks about those!), so here's the parts list:
For anyone interested in playing with distributed systems, I'd really recommend getting a single machine with a recent 16-core AMD CPU and just running 8 virtual machines on it: pin 4 hardware threads to each VM, and give each 1/8 of the total RAM. Create a virtual network between them within your virtualization software of choice (such as Proxmox).
And suddenly you can start playing with distributed software, even though it's running on a single machine. For resiliency tests you can "unplug" one machine at a time with a single click. It will annihilate a Pi cluster in Perf/W as well, and you don't have to assemble a complex web of components to make it work. Just a single CPU, motherboard, M.2 SSD, and two sticks of RAM.
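If you go the Proxmox route, carving the machine up is a short loop. A sketch only; the template ID, VM IDs, bridge name, and the 8 GB per VM are all assumptions (and --affinity needs a reasonably recent Proxmox):

    # Hypothetical: split a 16-core/32-thread host into 8 VMs with
    # 4 pinned hardware threads each. Assumes a cloud-image template
    # with VMID 9000 and a bridge named vmbr0 already exist.
    for i in $(seq 0 7); do
      vmid=$((101 + i))
      first=$((i * 4))   # first of this VM's 4 pinned host threads
      qm clone 9000 "$vmid" --name "node$i" --full
      qm set "$vmid" --cores 4 --memory 8192 \
        --affinity "$first-$((first + 3))" --net0 virtio,bridge=vmbr0
    done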
Naturally, using a high-core-count machine without virtualization will get you the best overall Perf/W in most benchmarks. What's also important, but often not highlighted in benchmarks, is idle power draw, if you'd like to keep your cluster running and only use it occasionally.
I've been saying this for years. When the last Raspberry Pi shortage happened people were scrambling to get them for building these toy clusters and it's such a shame. The Pi was made for pedagogy but I feel like most of them are wasted.
I run a K8s "cluster" on a single xcp-ng instance, but you don't even really have to go that far. Docker Machine could easily spin up docker hosts with a single command, but I see that project is dead now. Docker Swarm I think still lets you scale up/down services, no hypervisor required.
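The scale-up/scale-down part of Swarm really is just a couple of commands; the service name and image here are placeholders:

    # Single-node swarm: init, create a service, then rescale it.
    docker swarm init
    docker service create --name web --replicas 3 nginx
    docker service scale web=6
    docker service scale web=1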
> I've been saying this for years. When the last Raspberry Pi shortage happened people were scrambling to get them for building these toy clusters and it's such a shame. The Pi was made for pedagogy but I feel like most of them are wasted.
You're describing people using RPis to learn distributed systems, and you conclude that these RPis are wasted because RPis were made for pedagogy?
> I run a K8s "cluster" on a single xcp-ng instance, but you don't even really have to go that far.
That's perfectly fine. You do what works for you, just like everyone else. How would you handle someone else accusing your computer resources of being wasted?
I’ve learned so much setting up a pi cluster. There is something so cool about seeing code run across different pieces of hardware.
The point was you don't need to wait for 8 Pis to become available when most people can get going straight away with what they already have.
If you want to learn physical networking or really need to "see" things happening on physically separate machines, just get a free old PC from Gumtree or something.
> The point was you don't need to wait for 8 Pis to become available when most people can get going straight away with what they already have.
You also don't need RPis to learn anything about programming, networking, electronics, etc.
But people do it anyways.
I really don't see what point anyone thinks they are making regarding pedagogy. RPis are synonymous with tinkering, regardless of how you cut it. Distributed systems too.
I think you misread my comment, maybe it's clearer if I say "(admittedly) the pi is meant for pedagogy (however) I feel like most of them are wasted".
Old quad core won't have all the virtualisation extensions.
> Old quad core won't have all the virtualisation extensions.
Intel's first quad core was Kentsfield in 2006. It supports VT-x. AMD's first quad core likewise supports AMD-V. The newer virtualization extensions mostly just improve performance a little or do things you probably won't use anyway like SR-IOV.
Ivy Bridge is 13 years old today. You'd have to go out of your way to buy something older than that in 2025.
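If you're not sure what some old box supports, the CPU flags will tell you (vmx for Intel VT-x, svm for AMD-V):

    # Nonzero output means hardware virtualization is supported,
    # though it may still be disabled in the BIOS/UEFI.
    grep -Ec '(vmx|svm)' /proc/cpuinfo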
Virtualization existed long before virtualization instructions. Not strictly necessary.
An old Xeon then.
Newer CPUs have significantly better performance per watt under load, essentially by being a lot faster while using a similar amount of power. Idle CPU power consumption hasn't changed much in 10+ years simply because by that point it was already a single digit number of watts.
The thing that matters more than the CPU for idle power consumption is how efficient the system's power supply is under light loads. The variance between them is large and newer power supplies aren't all inherently better at it.
Also worth noting, as this is a common point for the homelabbers out there, fans in surplus enterprise hardware can actually be a significant source of not just noise, but power usage, even at idle.
I remember back in the R710 days (circa 2008, with Nehalem/Westmere CPUs) that under about 30% CPU load, most of your power draw came from fans that you couldn't spin down below a certain threshold without a firmware/iDRAC script, as well as what you mentioned about those PSUs being optimized for high sustained loads and thus being inefficient at near-idle and low usage.
IIRC the system idle power profile on those was only about 15% CPU (combined, for both CPUs), with the rest being fans, RAM, the various other vendor stuff (iDRAC, PERC, etc.), and low-load PSU inefficiencies.
Newer hardware has gotten better, but servers are still generally engineered for sustained loads above 50% rather than under, and the fans in those servers can still easily pull a dozen-plus watts each even at very low usage (depending on the exact model). Point being, splitting hairs over a dozen watts or so between CPUs is a bit silly when your power floor from fans and PSU inefficiencies alone puts you at 80W+ draw, not to mention the other components (NIC, drives, storage controller, OoB, RAM, etc.). This is primarily relevant for surplus servers, but a lot of people building systems at home for the use case in this discussion turn to (or are recommended) those servers, so I just wanted to add this food for thought.
Yeah, the server vendors give negative fucks about idle power consumption. I have a ~10 year old enterprise desktop quad core with a full-system AC power consumption of 6 watts while powered on and idle. I've seen enterprise servers of a similar vintage -- from the same vendor -- draw 40 watts when they're off.
If the point is a multi-tasking sandbox, not heavy/sustained data-crunching, those old CPUs with boosting turned off or a mild underclock/undervolt (or an L-spec part, which comes with that out of the box) really aren't any more power hungry than a newer Ryzen, unless you intend to run whatever you buy at high load for long stretches. On paper it could still be a double-digit percentage difference, but in reality we're talking a difference of 10W or 20W if you're not running stuff above 50% load for sustained periods.
Again, lots of variables there and it really depends on how heavily you intend to use/rely on that sandbox as to what's the better play. Regional pricing also comes into it.
Yeah, this is how I practiced Postgres hot standby and read replicas.
It was also how I learned to set up a Hadoop cluster, and a Cassandra cluster (this was 10 years ago, when those technologies were hot).
Having knowledge of these systems, and being able to talk about how I set them up and simulated recovery, directly got me jobs that 2x'd and then 3x'd my salary. I would highly recommend that any medium-skilled developer set up systems like this and get practicing if you want to get up to the next level.
Honestly, why do you need so much CPU power? You can play with distributed systems just by installing Erlang and running a couple of nodes on whatever potato-level Linux box you have lying around, including a single Raspberry Pi.
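The two-node version is about as minimal as distributed systems get. A sketch; the node names and host are placeholders, and both nodes need the same cookie (the default works on one box):

    # Terminal 1: start a named node.
    erl -sname alpha
    # Terminal 2: start a second node, then from its Erlang shell
    # ping the first:  net_adm:ping('alpha@yourhost').  % -> pong
    erl -sname beta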
Tangentially related: I really expected running old MPI programs on stuff like the AMD multi-chip workstation packages to become a bigger thing.
I actually worked with some MPI code way back. What MPI programs are you referring to?
I don't know, but when I was playing with finite difference code as an undergrad in Physics, all of the docs I could find (it was a while ago, though) assumed that I was going to use MPI to run a distributed workload across the university's supercomputer. My needs were less, so I just ran my Boost.Thread code on the four cores of one node.
What if you had a single server with a zillion cores in it? Maybe you could take some 15 year old MPI code and run it locally -- it'd be like a mini supercomputer with an impossibly fast network.
I’m not thinking of one code in particular. Just observing that with multi-chiplet designs, even inside a CPU package we’re already talking over a sort of little internal network anyway. Might as well use code that was designed to run on a network, right?
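Mechanically it's already trivial: with no hostfile, OpenMPI launches every rank on the local machine, so old MPI code runs unmodified on one big box. The binary and input here are placeholders:

    # Hypothetical: run a legacy MPI solver on local cores only.
    mpirun -np 32 --oversubscribe ./old_solver input.dat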
Yes, but this is boring. Saying this as the owner of a home server running Proxmox.
Reminds me a bit of one of my favorite NormConf sessions, "Just use one big machine for model training and inference." https://youtu.be/9BXMWDXiugg?si=4MnGtOSwx45KQqoP
Or the oldie-but-goodie paper "Scalability! But at what COST?": https://www.usenix.org/system/files/conference/hotos15/hotos...
Long story short, performance considerations with parallelism go way beyond Amdahl's Law, because supporting scale-out also introduces a bunch of additional work that simply doesn't exist in a single node implementation. (And, for that matter, multithreading also introduces work that doesn't exist for a sequential implementation.) And the real deep down black art secret to computing performance is that the fastest operations are the ones you don't perform.
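For reference, Amdahl's Law alone says that parallelizing a fraction p of the work across n workers caps the speedup at 1 / ((1 - p) + p/n); the COST observation is that scale-out then piles coordination and data-movement overhead on top of even that bound.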
> The first benchmark I ran was my top500 High Performance Linpack cluster benchmark. This is my favorite cluster benchmark, because it's the traditional benchmark they'd run on massive supercomputers to get on the top500 supercomputer list. […]
> After fixing the thermals, the cluster did not throttle, and used around 130W. At full power, I got 325 Gflops
I was sort of surprised to find that the top500 list on their website only goes back to 1993. I was hoping to find some ancient ’70s version of the list where his ridiculous Pi cluster could sneak on. Oh well, might as well take a look… I’ll pull from the sub-lists of
https://www.top500.org/lists/top500/
They give the top 10 immediately.
First list (June 1993):

    place  name              Rpeak (GFlop/s)
    1      CM-5/1024              131.00
    10     Y-MP C916/16256         15.24

Last list he wins, I think (June 1996):

    1      SR2201/1024            307.20
    10     SX-4/32                 64.00

First list he’s bumped out of the top 10 (November 1997):

    1      ASCI Red             1,830.40
    10     T3E                    326.40
I think he gets bumped off the full top500 list around 2002-2003. Unfortunately I made the mistake of going by Rpeak here, but they sort by Rmax, and I don’t want to go through the whole list. Apologies for any transcription errors.
Actually, pretty good showing for such a silly cluster. I think I’ve been primed by stuff like “your watch has more compute power than the Apollo guidance computer” or whatever to expect this sort of thing to go way, way back, instead of just to the ’90s.