I ordered a set of 10 Compute Blades in April 2023 (two years ago), and they just arrived a few weeks ago. In that time Raspberry Pi upgraded the CM4 to a CM5, so I ordered a set of 10 16GB CM5 Lite…
There's another Pi-powered blade computer, the Xerxes Pi. It's smaller and cheaper, but it just wrapped up its own Kickstarter. Will it ship in less than two years? Who knows, but I'm a sucker for crowdfunded blade computers, so of course I backed it!
But my main question, after sinking a substantial amount of money into this: are Pi clusters even worth it anymore? There's no way this cluster could beat the $8,000, 4-node Framework Desktop cluster in raw performance. But what about in price per gigaflop, or in efficiency or compute density?
There's only one way to find out.
I made a video going over everything in this blog post—and the entire cluster build (and rebuild, and rebuild again) process. You can watch it here, or on YouTube:
But if you're on the blog, you're probably not the type to sit through a video anyway. So moving on...
In the course of going from 'everything's in the box' to 'running AI and HPC benchmarks reliably', I rebuilt the cluster basically three times:
The first benchmark I ran was my top500 High Performance Linpack cluster benchmark. This is my favorite cluster benchmark, because it's the traditional benchmark they'd run on massive supercomputers to get on the top500 supercomputer list.
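If you want to try it yourself, the gist is: build HPL against an MPI stack and a BLAS library, list the nodes in a hostfile, and launch one rank per core. A rough sketch, not my exact setup; the hostnames, slot counts, and a pre-tuned HPL.dat are assumptions for a 10-blade, 4-cores-per-blade cluster:

    # Hypothetical HPL launch across 10 blades (OpenMPI syntax).
    # Assumes the compiled xhpl binary and a tuned HPL.dat already
    # sit in this directory on every node.
    cat > hosts <<'EOF'
    blade01 slots=4
    blade02 slots=4
    # ... through blade10
    EOF
    mpirun -np 40 --hostfile hosts ./xhpl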
Before I installed heatsinks, the cluster got 275 Gflops, which is an 8.5x speedup over a single 8 GB CM5. Not bad, but I noticed the cluster was only using 105 Watts of power during the run. Definitely more headroom available.
After fixing the thermals, the cluster did not throttle, and used around 130W. At full power, I got 325 Gflops, which is a 10x performance improvement (for 10x 16GB CM5s) over a single 8 GB CM5.
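That also gives a quick answer on the efficiency front: 325 Gflops at about 130W works out to roughly 2.5 Gflops per watt for the whole cluster, network and all.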
Compared to this cluster, the $8,000 Framework cluster I benchmarked last month is about 4 times faster:
This is a bad cluster. Except for maybe blade 9, which dies every time I run a benchmark. But I will keep it going, knowing it's definitely easier to maintain than the 1,050-node Pi cluster at UC Santa Barbara, which to my knowledge is still the world's largest!
Before I go, I just wanted to give a special thanks to everyone who supports me on Patreon, GitHub, YouTube Memberships, and Floatplane. It really helps when I take on these months- (or years!) long projects.
You might not want to replicate my cluster setup — but I always get asked what parts I used (especially the slim Ethernet cables... everyone asks about those!), so here's the parts list:
For anyone interested in playing with distributed systems, I'd really recommend getting a single machine with a recent 16-core AMD CPU and just running 8 virtual machines on it: pin 4 hardware threads to each VM, and give each 1/8 of the total RAM. Create a virtual network between them within your virtualization software of choice (such as Proxmox).
And suddenly you can start playing with distributed software, even though it's running on a single machine. For resiliency tests you can "unplug" one machine at a time with a single click. It will annihilate a Pi cluster in Perf/W as well, and you don't have to assemble a complex web of components to make it work. Just a single CPU, motherboard, M.2 SSD, and two sticks of RAM.
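If you go the Proxmox route, carving the machine up is a short loop. A sketch only; the template ID, VM IDs, bridge name, and the 8 GB per VM are all assumptions (and --affinity needs a reasonably recent Proxmox):

    # Hypothetical: split a 16-core/32-thread host into 8 VMs with
    # 4 pinned hardware threads each. Assumes a cloud-image template
    # with VMID 9000 and a bridge named vmbr0 already exist.
    for i in $(seq 0 7); do
      vmid=$((101 + i))
      first=$((i * 4))   # first of this VM's 4 pinned host threads
      qm clone 9000 "$vmid" --name "node$i" --full
      qm set "$vmid" --cores 4 --memory 8192 \
        --affinity "$first-$((first + 3))" --net0 virtio,bridge=vmbr0
    done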
Naturally, using a high-core-count machine without virtualization will get you the best overall Perf/W in most benchmarks. What's also important, but often not highlighted in benchmarks, is idle power draw, if you'd like to keep your cluster running and only use it occasionally.
I've been saying this for years. When the last Raspberry Pi shortage happened people were scrambling to get them for building these toy clusters and it's such a shame. The Pi was made for pedagogy but I feel like most of them are wasted.
I run a K8s "cluster" on a single xcp-ng instance, but you don't even really have to go that far. Docker Machine could easily spin up docker hosts with a single command, but I see that project is dead now. Docker Swarm I think still lets you scale up/down services, no hypervisor required.
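The scale-up/scale-down part of Swarm really is just a couple of commands; the service name and image here are placeholders:

    # Single-node swarm: init, create a service, then rescale it.
    docker swarm init
    docker service create --name web --replicas 3 nginx
    docker service scale web=6
    docker service scale web=1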
> I've been saying this for years. When the last Raspberry Pi shortage happened people were scrambling to get them for building these toy clusters and it's such a shame. The Pi was made for pedagogy but I feel like most of them are wasted.
You're describing people using RPis to learn distributed systems, and you conclude that these RPis are wasted because RPis were made for pedagogy?
> I run a K8s "cluster" on a single xcp-ng instance, but you don't even really have to go that far.
That's perfectly fine. You do what works for you, just like everyone else. How would you handle someone else accusing your computer resources of being wasted?
I’ve learned so much setting up a pi cluster. There is something so cool about seeing code run across different pieces of hardware.
The point was you don't need to wait for 8 Pis to become available when most people can get going straight away with what they already have.
If you want to learn physical networking or really need to "see" things happening on physically separate machines, just get a free old PC from Gumtree or something.
> The point was you don't need to wait for 8 Pis to become available when most people can get going straight away with what they already have.
You also don't need RPis to learn anything about programming, networking, electronics, etc.
But people do it anyways.
I really don't see what point anyone thinks they are making regarding pedagogy. RPis are synonymous with tinkering, regardless of how you cut it. Distributed systems too.
I think you misread my comment, maybe it's clearer if I say "(admittedly) the pi is meant for pedagogy (however) I feel like most of them are wasted".
Old quad core won't have all the virtualisation extensions.
> Old quad core won't have all the virtualisation extensions.
Intel's first quad core was Kentsfield in 2006. It supports VT-x. AMD's first quad core likewise supports AMD-V. The newer virtualization extensions mostly just improve performance a little or do things you probably won't use anyway like SR-IOV.
Ivy Bridge is 13 years old today. You'd have to go out of your way to buy something older than that in 2025.
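If you're not sure what some old box supports, the CPU flags will tell you (vmx for Intel VT-x, svm for AMD-V):

    # Nonzero output means hardware virtualization is supported,
    # though it may still be disabled in the BIOS/UEFI.
    grep -Ec '(vmx|svm)' /proc/cpuinfo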
Virtualization existed long before virtualization instructions. Not strictly necessary.
An old Xeon then.
Newer CPUs have significantly better performance per watt under load, essentially by being a lot faster while using a similar amount of power. Idle CPU power consumption hasn't changed much in 10+ years simply because by that point it was already a single digit number of watts.
The thing that matters more than the CPU for idle power consumption is how efficient the system's power supply is under light loads. The variance between them is large and newer power supplies aren't all inherently better at it.
Also worth noting, as this is a common point for the homelabbers out there, fans in surplus enterprise hardware can actually be a significant source of not just noise, but power usage, even at idle.
I remember back in the R710 days (circa 2008, with Nehalem/Westmere CPUs) that under about 30% CPU load, most of your power draw came from fans that you couldn't spin down below a certain threshold without a firmware/iDRAC script, as well as what you mentioned about those PSUs being optimized for high sustained loads and thus being inefficient at near-idle and low usage.
IIRC the system idle power profile on those was only about 15% CPU (combined, for both CPUs), with the rest being fans, RAM, the various other vendor stuff (iDRAC, PERC, etc.), and low-load PSU inefficiencies.
Newer hardware has gotten better, but servers are still generally engineered for sustained loads above 50% rather than under, and the fans in those servers can still easily pull a dozen-plus watts each even at very low usage (depending on the exact model). Point being, splitting hairs over a dozen watts or so between CPUs is a bit silly when your power floor from fans and PSU inefficiencies alone puts you at 80W+ draw, not to mention the other components (NIC, drives, storage controller, OoB, RAM, etc.). This is primarily relevant for surplus servers, but a lot of people building systems at home for the use case in this discussion turn to (or are recommended) those servers, so I just wanted to add this food for thought.
Yeah, the server vendors give negative fucks about idle power consumption. I have a ~10 year old enterprise desktop quad core with a full-system AC power consumption of 6 watts while powered on and idle. I've seen enterprise servers of a similar vintage -- from the same vendor -- draw 40 watts when they're off.
If the point is a multi-tasking sandbox, not heavy/sustained data-crunching, those old CPUs with boosting turned off or a mild underclock/undervolt (or an L-spec part, which comes with that out of the box) really aren't any more power hungry than a newer Ryzen, unless you intend to run whatever you buy at high load for long stretches. On paper it could still be a double-digit percentage difference, but in reality we're talking a difference of 10W or 20W if you're not running stuff above 50% load for sustained periods.
Again, lots of variables there and it really depends on how heavily you intend to use/rely on that sandbox as to what's the better play. Regional pricing also comes into it.
Yeah, this is how I practiced Postgres hot standby and read replicas.
It was also how I learned to set up a Hadoop cluster, and a Cassandra cluster (this was 10 years ago, when those technologies were hot).
Having knowledge of these systems, and being able to talk about how I set them up and simulated recovery, directly got me jobs that 2x'd and then 3x'd my salary. I would highly recommend that any medium-skilled developer set up systems like this and get practicing if you want to get up to the next level.
Honestly, why do you need so much CPU power? You can play with distributed systems just by installing Erlang and running a couple of nodes on whatever potato-level Linux box you have lying around, including a single Raspberry Pi.
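The two-node version is about as minimal as distributed systems get. A sketch; the node names and host are placeholders, and both nodes need the same cookie (the default works on one box):

    # Terminal 1: start a named node.
    erl -sname alpha
    # Terminal 2: start a second node, then from its Erlang shell
    # ping the first:  net_adm:ping('alpha@yourhost').  % -> pong
    erl -sname beta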
Tangentially related: I really expected running old MPI programs on stuff like the AMD multi-chip workstation packages to become a bigger thing.
I actually worked with some MPI code way back. What MPI programs are you referring to?
I don't know, but when I was playing with finite difference code as an undergrad in Physics, all of the docs I could find (it was a while ago, though) assumed that I was going to use MPI to run a distributed workload across the university's supercomputer. My needs were less, so I just ran my Boost.Thread code on the four cores of one node.
What if you had a single server with a zillion cores in it? Maybe you could take some 15 year old MPI code and run it locally -- it'd be like a mini supercomputer with an impossibly fast network.
I’m not thinking of one code in particular. Just observing that with multi-chiplet designs, even inside a CPU package we’re already talking over a sort of little internal network anyway. Might as well use code that was designed to run on a network, right?
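Mechanically it's already trivial: with no hostfile, OpenMPI launches every rank on the local machine, so old MPI code runs unmodified on one big box. The binary and input here are placeholders:

    # Hypothetical: run a legacy MPI solver on local cores only.
    mpirun -np 32 --oversubscribe ./old_solver input.dat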
Yes, but this is boring. Saying this as the owner of a home server running Proxmox.
Reminds me a bit of one of my favorite NormConf sessions, "Just use one big machine for model training and inference." https://youtu.be/9BXMWDXiugg?si=4MnGtOSwx45KQqoP
Or the oldie-but-goodie paper "Scalability! But at what COST?": https://www.usenix.org/system/files/conference/hotos15/hotos...
Long story short, performance considerations with parallelism go way beyond Amdahl's Law, because supporting scale-out also introduces a bunch of additional work that simply doesn't exist in a single node implementation. (And, for that matter, multithreading also introduces work that doesn't exist for a sequential implementation.) And the real deep down black art secret to computing performance is that the fastest operations are the ones you don't perform.
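For reference, Amdahl's Law alone says that parallelizing a fraction p of the work across n workers caps the speedup at 1 / ((1 - p) + p/n); the COST observation is that scale-out then piles coordination and data-movement overhead on top of even that bound.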
> The first benchmark I ran was my top500 High Performance Linpack cluster benchmark. This is my favorite cluster benchmark, because it's the traditional benchmark they'd run on massive supercomputers to get on the top500 supercomputer list. […]
> After fixing the thermals, the cluster did not throttle, and used around 130W. At full power, I got 325 Gflops
I was sort of surprised to find that the top500 list on their website only goes back to 1993. I was hoping to find some ancient ’70s version of the list where his ridiculous Pi cluster could sneak on. Oh well, might as well take a look… I’ll pull from the sub-lists of
https://www.top500.org/lists/top500/
They give the top 10 immediately.
First list (June 1993):

    place  name              Rpeak (GFlop/s)
    1      CM-5/1024              131.00
    10     Y-MP C916/16256         15.24

Last list he wins, I think (June 1996):

    1      SR2201/1024            307.20
    10     SX-4/32                 64.00

First list he’s bumped out of the top 10 (November 1997):

    1      ASCI Red             1,830.40
    10     T3E                    326.40
I think he gets bumped off the full top500 list around 2002-2003. Unfortunately I made the mistake of going by Rpeak here, but they sort by Rmax, and I don’t want to go through the whole list. Apologies for any transcription errors.
Actually, pretty good showing for such a silly cluster. I think I’ve been primed by stuff like “your watch has more compute power than the Apollo guidance computer” or whatever to expect this sort of thing to go way, way back, instead of just to the ’90s.