
So that we may educate as well as horrify: the internals of our new Sprites execution platform.
Image by Annie Ruygt
Replacement-level homeowners buy boxes of pens and stick them in “the pen drawer”. What the elites know: you have to think adversarially about pens. “The purpose of a system is what it does”; a household’s is to uniformly distribute pens. Months from now, the drawer will be empty, no matter how many pens you stockpile. Instead, scatter pens every place you could possibly think to look for one: drawers, ledges, desks. Any time anybody needs a pen, several are at hand, in exactly the first place they look.
This is the best way I’ve found to articulate the idea of Sprites, the platform we just launched at Fly.io. Sprites are ball-point disposable computers. Whatever mark you mean to make, we’ve rigged it so you’re never more than a second or two away from having a Sprite to do it with.
Sprites are Linux virtual machines. You get root. They create in just a second or two: so fast, the experience of creating and shelling into one is identical to SSH'ing into a machine that already exists. Sprites all have a 100GB durable root filesystem. They put themselves to sleep automatically when inactive, and cost practically nothing while asleep.
As a result, I barely feel the need to name my Sprites. Sometimes I’ll just type sprite create dkjsdjk and start some task. People at Fly.io who use Sprites have dozens hanging around.
There aren’t yet many things in cloud computing that have the exact shape Sprites do:
This is a post about how we managed to get this working. We created a new orchestration stack that undoes some of the core decisions we made for Fly Machines, our flagship product. Turns out, these new decisions make Sprites drastically easier for us to scale and manage. We’re pretty psyched.
Lucky for me, there happen to be three big decisions we made that get you 90% of the way from Fly Machines to Sprites, which makes this an easy post to write. So, without further ado:
This is the easiest decision to explain.
Fly Machines are approximately OCI containers repackaged as KVM micro-VMs. They have the ergonomics of Docker but the isolation and security of an EC2 instance. We love them very much and they’re clearly the wrong basis for a ball-point disposable cloud computer.
The “one weird trick” of Fly Machines is that they start and stop instantly, fast enough that they can wake in time to handle an incoming HTTP request. But they can only do that if you’ve already created them. You have to preallocate. Creating a Fly Machine can take over a minute. What you’re supposed to do is to create a whole bunch of them and stop them so they’re ready when you need them. But for Sprites, we need create to be so fast it feels like they’re already there waiting for you.
Most of what’s slow about creating a Fly Machine is containers. I say this with affection: your containers are crazier than a soup sandwich. Huge and fussy, they take forever to pull and unpack. The regional locality sucks; create a Fly Machine in São Paulo on gru-3838, and a create on gru-d795 is no faster. A truly heartbreaking amount of engineering work has gone into just allowing our OCI registry to keep up with this system.
It’s a tough job, is all I’m saying. Sprites get rid of the user-facing container. Literally: problem solved. Sprites get to do this on easy mode.
Now, today, under the hood, Sprites are still Fly Machines. But they all run from a standard container. Every physical worker knows exactly what container the next Sprite is going to start with, so it’s easy for us to keep pools of “empty” Sprites standing by. The result: a Sprite create doesn’t have any heavy lifting to do; it’s basically just doing the stuff we do when we start a Fly Machine.
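To make the pooling idea concrete, here is a minimal sketch of a warm pool. It is not our orchestrator's code: `WarmPool`, `_boot_vm`, and the `vm-N` identifiers are hypothetical names invented for illustration. The only point is that once every VM boots from the same standard image, "create" can reduce to claiming one that is already running.

```python
from collections import deque
from dataclasses import dataclass, field
import itertools


@dataclass
class WarmPool:
    """Keeps pre-booted, unassigned VMs around so create() is just a claim."""
    target_size: int
    _ids: itertools.count = field(default_factory=itertools.count)
    _idle: deque = field(default_factory=deque)

    def _boot_vm(self) -> str:
        # Stand-in for the slow part: booting a VM from the standard image.
        return f"vm-{next(self._ids)}"

    def refill(self) -> None:
        # Meant to run in the background; tops the pool up to target_size.
        while len(self._idle) < self.target_size:
            self._idle.append(self._boot_vm())

    def create(self, name: str) -> dict:
        # Fast path: hand over an already-booted VM and just label it.
        # Slow path (pool empty): boot one on demand.
        vm = self._idle.popleft() if self._idle else self._boot_vm()
        return {"name": name, "vm": vm}


pool = WarmPool(target_size=2)
pool.refill()
sprite = pool.create("dkjsdjk")
```

The design choice worth noticing is that pooling only works because the image is uniform; with user-supplied containers, a worker cannot know in advance what to boot.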
You can create a couple dozen Sprites right now if you want. It’ll only take a second.
Make a Sprite.→
Every Sprite comes with 100GB of durable storage. We’re able to do that because the root of storage is S3-compatible object storage.
You can arrange for 100GB of storage for a Fly Machine. Or 200, or 500. The catch:
flyctl); we can’t reasonably default it in.
We designed the storage stack for Fly Machines for Postgres clusters. A multi-replica Postgres cluster gets good mileage out of Fly Volumes. Attached storage is fast, but can lose data† — if a physical blows up, there’s no magic what rescues its stored bits. You’re stuck with our last snapshot backup. That’s fine for a replicated Postgres! It’s part of what Postgres replication is for. But for anything without explicit replication, it’s a very sharp edge.
Worse, from our perspective, is that attached storage anchors workloads to specific physicals. We have lots of reasons to want to move Fly Machines around. Before we did Fly Volumes, that was as simple as pushing a “drain” button on a server. Imagine losing a capability like that. It took 3 years to get workload migration right with attached storage, and it’s still not “easy”.
Sprites jettison this model. We still exploit NVMe, but not as the root of storage. Instead, it’s a read-through cache for a blob on object storage. S3-compatible object stores are the most trustworthy storage technology we have. I can feel my blood pressure dropping just typing the words “Sprites are backed by object storage.”
The implications of this for orchestration are profound. In a real sense, the durable state of a Sprite is simply a URL. Wherever he lays his hat is his home! They migrate (or recover from failed physicals) trivially. It’s early days for our internal tooling, but we have so many new degrees of freedom to work with.
I could easily do another 1500-2000 words here on the Cronenberg film Kurt came up with for the actual storage stack, but because it’s in flux, let’s keep it simple.
The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data (“chunks”) and metadata (a map of where the “chunks” are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage.
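The data/metadata split can be sketched in a few lines. This is a toy model under loud assumptions: two in-memory dicts stand in for the object store and the metadata database, the chunk size is tiny, and the `chunks/N` key scheme is made up. It is not the JuiceFS on-disk format, only the shape of the idea.

```python
import itertools

CHUNK_SIZE = 4  # absurdly small, so the example splits visibly

object_store: dict[str, bytes] = {}  # stand-in for an S3-compatible bucket
metadata: dict[str, list[str]] = {}  # stand-in for the local metadata DB
_next_id = itertools.count()


def write_file(path: str, data: bytes) -> None:
    """Split data into chunks, store each, then record where they went."""
    keys = []
    for offset in range(0, len(data), CHUNK_SIZE):
        key = f"chunks/{next(_next_id)}"  # hypothetical key scheme
        object_store[key] = data[offset:offset + CHUNK_SIZE]
        keys.append(key)
    metadata[path] = keys  # the only mutable state is this small map


def read_file(path: str) -> bytes:
    """Look up the chunk list in metadata, then fetch chunks in order."""
    return b"".join(object_store[key] for key in metadata[path])


write_file("/etc/motd", b"hello sprite")
```

Reading `/etc/motd` back walks the metadata map and reassembles the three chunks; the bulk bytes never live anywhere but the (simulated) object store.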
This also buys Sprites fast checkpoint and restore. Checkpoints are so fast we want you to use them as a basic feature of the system and not as an escape hatch when things go wrong; like a git restore, not a system restore. That works because both checkpoint and restore merely shuffle metadata around.
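Here is a rough sketch of why metadata-only checkpoints are cheap. Everything in it is an assumption made for illustration (one chunk per file, in-memory dicts, the function names); the real stack is more involved. What it shows is that when chunks are immutable, a checkpoint is a copy of the map and a restore is a swap back to that copy.

```python
chunks: dict[int, bytes] = {}    # immutable data; stand-in for object storage
live: dict[str, list[int]] = {}  # current metadata: path -> chunk ids
checkpoints: dict[str, dict[str, list[int]]] = {}


def write(path: str, data: bytes) -> None:
    chunk_id = len(chunks)
    chunks[chunk_id] = data   # new data always gets a new chunk
    live[path] = [chunk_id]   # one chunk per file, for brevity


def read(path: str) -> bytes:
    return b"".join(chunks[i] for i in live[path])


def checkpoint(name: str) -> None:
    # Copies the small map only; no chunk data is read or written.
    checkpoints[name] = {p: list(ids) for p, ids in live.items()}


def restore(name: str) -> None:
    # Points the live map back at the old chunk ids; the data is still there.
    live.clear()
    live.update({p: list(ids) for p, ids in checkpoints[name].items()})


write("app.py", b"print('v1')")
checkpoint("before-refactor")
write("app.py", b"print('v2')")
restore("before-refactor")
```

After the restore, `read("app.py")` returns the v1 bytes again, and both versions' chunks remain in the store untouched.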
Our stack sports a dm-cache-like feature that takes advantage of attached storage. A Sprite has a sparse 100GB NVMe volume attached to it, which the stack uses to cache chunks to eliminate read amplification. Importantly (I can feel my resting heart rate lowering) nothing in that NVMe volume should matter; stored chunks are immutable and their true state lives on the object store.
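A read-through cache over immutable chunks is simple enough to sketch. The class below is hypothetical, and the LRU eviction policy is my assumption, not a description of what the real cache does. The property to notice is the last one: you can throw the whole cache away and lose nothing but speed.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serves chunks locally when possible, else fetches and remembers them."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int) -> None:
        self._fetch = fetch            # slow path: the object store
        self._capacity = capacity
        self._cache: OrderedDict[str, bytes] = OrderedDict()
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        self.misses += 1
        value = self._fetch(key)
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return value

    def drop(self) -> None:
        # Simulates losing the local volume. Chunks are immutable and the
        # store is authoritative, so there is nothing to flush.
        self._cache.clear()


bucket = {"chunks/0": b"aaaa", "chunks/1": b"bbbb"}
cache = ReadThroughCache(bucket.__getitem__, capacity=8)
first = cache.get("chunks/0")   # miss: goes to the bucket
second = cache.get("chunks/0")  # hit: served locally
cache.drop()
third = cache.get("chunks/0")   # miss again, same bytes
```

Because nothing is ever written back from the cache, there is no dirty state to reconcile, which is what distinguishes this from a write-back cache.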
Our preference for object storage goes further than the Sprite storage stack. The global orchestrator for Sprites is an Elixir/Phoenix app that uses object storage as the primary source of metadata for accounts. We then give each account an independent SQLite database, again made durable on object storage with Litestream.
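The database-per-account pattern looks roughly like the sketch below. It uses Python's standard `sqlite3` module rather than our Elixir code, and the file layout, table, and column names are all hypothetical. Replication to object storage (Litestream, in our case) happens outside the application and is not shown.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_account_db(root: Path, account_id: str) -> sqlite3.Connection:
    """Open (or create) the database file that belongs to one account."""
    conn = sqlite3.connect(root / f"{account_id}.db")
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sprites (name TEXT PRIMARY KEY, status TEXT)"
    )
    return conn


def sprite_names(conn: sqlite3.Connection) -> list[str]:
    rows = conn.execute("SELECT name FROM sprites ORDER BY name")
    return [row[0] for row in rows]


root = Path(tempfile.mkdtemp())
alice = open_account_db(root, "alice")
bob = open_account_db(root, "bob")
with alice:  # commits on success, rolls back on error
    alice.execute("INSERT INTO sprites VALUES (?, ?)", ("dkjsdjk", "running"))
```

Each account's writes touch only its own file, so one account's load or corruption cannot spill into another's, and each file can be replicated or moved independently.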
In the cloud hosting industry, user applications are managed by two separate, yet equally important components: the host, which orchestrates workloads, and the guest, which runs them. Sprites flip that on its head: the most important orchestration and management work happens inside the VM.
Here’s the trick: user code running on a Sprite isn’t running in the root namespace. We’ve slid a container between you and the kernel. You see an inner environment, managed by a fleet of services running in the root namespace of the VM.
I wish we’d done Fly Machines this way to begin with. I’m not sure there’s a downside. The inner container allows us to bounce a Sprite without rebooting the whole VM, even on checkpoint restores. I think Fly Machines users could get some mileage out of that feature, too.
With Sprites, we’re pushing this idea as far as we can. The root environment hosts the majority of our orchestration code. When you talk to the global API, chances are you’re talking directly to your own VM. Furthermore:
*:8080, we’ll make it available outside the Sprite — yep, that’s in the root namespace too.
Platform developers at Fly.io know how much easier it can be to hack on init (inside the container) than things like flyd, the Fly Machines orchestrator that runs on the host. Changes to Sprites don’t restart host components or muck with global state. The blast radius is just new VMs that pick up the change. We sleep on how much platform work doesn’t get done not because the code is hard to write, but because it’s so time-consuming to ensure benign-looking changes don’t throw the whole fleet into metastable failure. We had that in mind when we did Sprites.
Sprites running on Fly.io take advantage of the infrastructure we already have. For instance: Sprites might be the fastest way that currently exists to get Claude or Gemini to build a full-stack application on the Internet.
That’s because Sprites plug directly into Corrosion, our gossip-based service discovery system. When you ask the Sprite API to make a public URL for your Sprite, we generate a Corrosion update that propagates across our fleet instantly. Your application is then served, with an HTTPS URL, from our proxy edges.
Sprites live alongside Fly Machines in our architecture. They include some changes that are pure wins, but they’re mostly tradeoffs:
Sprites are optimized for a different kind of computing than Fly Machines, and while Kurt believes that the future belongs to malleable, personalized apps, I’m not so sure. To me, it makes sense to prototype and acceptance-test an application on Sprites. Then, when you’re happy with it, containerize it and ship it as a Fly Machine to scale it out. An automated workflow for that will happen.
Finally, Sprites are a contract with user code: an API and a set of expectations about how the execution environment works. Today, they run on top of Fly Machines. But they don’t have to. Jerome’s working on an open-source local Sprite runtime. We’ll find other places to run them, too.
I can’t not sound like a shill. Sprites are the one thing we’ve shipped that I personally experience as addictive. I haven’t fully put my finger on why it feels so much easier to kick off projects now that I can snap my fingers and get a whole new computer. The whole point is that there’s no reason to parcel them out, or decide which code should run where. You just make a new one.
So to make this fully click, I think you should just install the sprite command, make a Sprite, and then run an agent in it. We’ve preinstalled Claude, Gemini, and Codex, and taught them how to do things like checkpoint/restore, registering services, and getting logs. Claude will run in --dangerously-skip-permissions mode (because why wouldn’t it). Have it build something; I built a “Chicago’s best sandwich” bracket app for a Slack channel.
Sprites bill only for what you actually use (in particular: only for storage blocks you actually write, not the full 100GB capacity). It’s reasonable to create a bunch. They’re ball-point disposable computers. After you get a feel for them, it’ll start to feel weird not having them handy.
I appreciate the Fly.io team’s enthusiasm and am optimistic this will mature into a product I’d pay for, but my initial impression was of a lack of polish.
Documentation is sparse, or not even available? The API docs don’t tell you much about the service itself, and a Google search for docs returns an inaccessible website as the first result (https://docs.sprites.dev). Blog posts and forum threads and Claude skills shouldn’t be a substitute.
The snappiness of the sprites is very cool and I can definitely see it integrating into future Claude Code workflows. But the lack of base container images means you’re still doing setup work on the sprite before you can begin development. I understand the philosophy is that sandboxes should be persistent, but Claude Code sessions also work better when isolated from each other, so it’d be nice to have some precepts to get a workspace set up quickly (given agentic coding is clearly a target).
I also found the CLI unintuitive but maybe that was just me!
So, very cool idea, but I’m left with the impression that the Fly.io team should have spent a couple of weeks on polish before shipping.
You're not wrong. The documentation actually had a hallucinated link to an Anthropic dependency in it when we shipped. Right now the attitude is mostly "if we have to document it extensively, we're doing something wrong". It's been in the works for a while, with a small team, and we're just getting it out there right now.
I've been needling Kurt for several months now that if we wait until it's polished enough that we don't see comments like this, we're doing it wrong.
For what it's worth, I evaluated Fly.io during a divorce from Heroku some time in mid 2022 (I think), found the platform was... way too rough around the edges at the time to want to migrate any real workloads. I kept it on my radar and shipped an MVP with it in 2024, found it was a lot more polished, and now have multiple production apps running there. I'm genuinely pumped about Sprites and have started building against the API—I did notice the weirdness with the docs, but you guys have been doing well on the "this thing that {was broken|I didn't like|was missing} now works the way I'd hoped it would" front.
Appreciate your perspective and totally understand that at some point you just have to ship it! From the outside it looks like a bit less time on XYZ feature and bit more time on marketing polish might have been a good call. But can only speculate what the trade offs were internally. Best of luck maturing the product!
The main things i think are missing are (1) how much am i spending, (2) why isn't my sprite paused, and (3) how can i get my stuff out (it would be nice to be able to mount in either direction or else integrate with git/git worktrees).
I ended up using it (and enjoying yolo mode!) but then my sprites weren't pausing and i was worried about spending too much so i deleted them.
I'm sure this is a difference-of-learning or whatever, but I'm usually unwilling to try a product until I can understand it and how it works from the documentation
Understandable. Our current take is that there's not really much to know, and that the people this will really light up are good with that. Of course, we'll flesh out documentation!
I'm really jazzed about this particular product as a product (I just really enjoy using it), but the post is mostly about how we built it, and deliberately not much about how best to use it.
I hate being negative but it sounds like par for the course for fly.
Incredible (truly, incredible, world-class) engineers that somehow lack that final 10% of polish/naming/documentation that makes things...well, seriously usable.
I remember last time I tried them the bizarre hoops/documentation around database creation. I _think_ they solved that but I remember at the time it felt almost like I was getting looked down upon as a user. Ugh, you need clarity? how amateurish!
+1. This thread, the thread about documentation, and the thread about turning off Sprites, when taken together, thoroughly illustrate why I'm not currently a Fly user.
The name is excellent.
Tried it. Docker wasn't preinstalled so asked Claude to do it and make sure it's running (supervisord or the likes isn't preinstalled).
It neatly did so and "registered" it as a sprite service. Then I exited my session, waiting for the sprite to go idle, but I don't think it ever does.. Still have it active. Don't know how to idle it.
Can't tell for sure if this means I'm losing credits as there is no billing usage shown anywhere.
Also waiting for the moment where I can launch a sprite from another's checkpoint.
We had a bug where some sprites would fail to properly suspend while entering their suspended state. You're not eating into credits so no worries there. We've been rolling out a fix across the fleet today so you should be seeing proper status soon.
This is great news! If we upgraded our sprite already how long should it take to suspend? I noticed the upgrade earlier and installed it but my sprite is still running.
ahh finally success—a fresh sprite goes to sleep as it should. unfortunately the original one i created doesn't, so I guess I'm going to have to kill that one off.
Ok so, "running" sprite status has had some cache consistency issues. You're not being charged for idle sprites, but they may show as "running" even when you're not using them. The UX has improved, and it reliably shows what you expect. Some of the existing sprites need an environment upgrade, but you'll see those improve over the next few days.
There's no way to stop sprites from the CLI.
Supposedly they auto-stop when inactive.
But I've tried multiple times and they don't stop, and it's not just Docker that prevents them from stopping.
I created a new sprite and installed ffmpeg. Then exited. Next day I run `sprite ls` and it's been running continuously for 23 hours.
No way to tell if I'm being billed for it or not.
And the per-hour pricing is extremely expensive.
So for now it's `sprite destroy -s spritename`.
Maybe I'll check it out again in a few months after the fly team has iterated on this a few times.
Sprites are active when:
* They're servicing an incoming HTTP request.
* You're interacting with a console.
They're hair-trigger inactive otherwise. They don't bill CPU unless they're active. The idea is that there isn't really any uncertainty about when it's running; when you stop interacting with it it stops metering.
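The rule as stated can be written down as a tiny model. This is only a sketch of the two conditions listed above, with hypothetical names (`Meter`, `tick`) and a made-up one-second granularity; it says nothing about daemons, cron jobs, storage charges, or how the real metering is implemented.

```python
from dataclasses import dataclass


@dataclass
class Meter:
    in_flight_requests: int = 0
    open_consoles: int = 0
    metered_seconds: int = 0

    @property
    def active(self) -> bool:
        # Active only while serving HTTP or while a console is attached.
        return self.in_flight_requests > 0 or self.open_consoles > 0

    def tick(self, seconds: int = 1) -> None:
        # CPU time is only counted while the Sprite is active.
        if self.active:
            self.metered_seconds += seconds


meter = Meter()
meter.tick(60)            # idle: nothing metered
meter.open_consoles += 1
meter.tick(30)            # console attached: metered
meter.open_consoles -= 1
meter.tick(60)            # idle again: nothing metered
```

In this model, two and a half minutes of wall-clock time produce 30 metered seconds.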
This is a new shape for a cloud computing thingy and there'll be snags this week with it, but we don't make our money by billing people for stuff they don't want. We've always gone out of our way not to nickel-and-dime casual users and we're trying hard to find new ways to lean into that here.
(Destroying a Sprite you're done with is a perfectly reasonable move; they're disposable.)
No console activity, no HTTP requests, but it doesn't stop.
Can't find any place in CLI or web UI to see how many minutes are charged for CPU, memory, storage.
$ sprite ls
Sprites in organization <redacted>:
┌────┬───────┬────────┐
│NAME│STATUS │CREATED │
├────┼───────┼────────┤
│duh │running│ 23h ago│
└────┴───────┴────────┘
Total: 1 sprite(s)
My read of his response is that, even though the sprite is in a running state, that doesn’t mean it’s in a billable state given you aren’t connected; that’s not said explicitly, and I’m making an inference, and so it would be helpful if you let us know if you are billed for these hours.
> Sprites are active when:
> * They're servicing an incoming HTTP request.
> * You're interacting with a console.
They are advocated as Linux machines. How about daemons then, or cron jobs? What semantics can we expect from them?
I think the idling feature still needs some work. I created one over the weekend that hasn't idled once, and I've run several tests with sprites that have nothing in them—just `sprite create` and log out, just to see what happens (which unfortunately is nothing, left alone it keeps on running as well.)
I love the idea and most of the execution, I've really enjoyed getting my first sprite configured just the way I want it. It just needs the idling feature to work as advertised before I think I can use it as cost-effectively as it promises.
What does "when you stop interacting with it" mean? Closing all TCP connections? When CPU usage drops to below some threshold?
There needs to be a way to see how much it is being used then and not simply the life of the Sprite.
You can. There’s a usage dashboard
Where is it?
For what it's worth, CPU pricing is based on CPU util. A sprite sitting idle CPU costs almost nothing, even when "active".
So is it similar to railway in this context?
Yeah. My sprites never idled in spite of having nothing running in them and had to be destroyed. Ideally there should be two settings:
1. A timeout after the last console session is exited
2. Force idle using the CLI
Just tried it again. Creation took a very long time, then errored.
$ sprite version
sprite version v0.0.1-rc30
$ sprite create quk
Error creating sprite: failed to create sprite: Post "https://api.sprites.dev/v1/sprites": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
But despite the error, the sprite was apparently created, and it runs continuously.
$ sprite ls
Sprites in organization <redacted>:
┌────┬───────┬────────┐
│NAME│STATUS │CREATED │
├────┼───────┼────────┤
│quk │running│14m ago │
└────┴───────┴────────┘
Total: 1 sprite(s)
Get a console in your sprite. Run “screen”. Run a loop in there: while date; do sleep 1; done. Detach screen and exit the session. Wait a few minutes and go back into the sprite. Reattach screen. You’ll see a gap in the timestamps.
They do suspend even when they say they are “running”.
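If you'd rather not eyeball the output, the gap check is easy to script. This is a rough sketch: `find_gaps` is a made-up helper, the two-second slack is an arbitrary assumption, and the sample timestamps are invented for illustration rather than taken from a real sprite.

```python
from datetime import datetime, timedelta


def find_gaps(stamps, expected=timedelta(seconds=1), slack=timedelta(seconds=2)):
    """Return (before, after) pairs where the loop clearly stopped ticking."""
    gaps = []
    for before, after in zip(stamps, stamps[1:]):
        if after - before > expected + slack:
            gaps.append((before, after))
    return gaps


# Invented sample: four ticks, a pause of about five minutes, two more ticks.
start = datetime(2025, 1, 1, 12, 0, 0)
stamps = [start + timedelta(seconds=s) for s in (0, 1, 2, 3, 304, 305)]
gaps = find_gaps(stamps)
```

With the sample data this finds a single gap of 301 seconds, which is the kind of hole a suspended VM would leave in a once-per-second loop.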
Just use container-os as your runtime image: https://hub.docker.com/r/miget/container-os
and you should be good
I created a "mcp server" for sprites yesterday through this new ecosystem I am working on, you can clone the collection, and then just add this https://tpmjs.com/api/mcp/ajax/sprites/sse mcp sse url to claude desktop, or anything you want.
tpmjs is a registry of ai sdk npm packages (i created them for sprites), which you can add to personal collections; we automatically serve your collections as mcp servers if you want.
Creating sprites in an example chat bot -> https://imgur.com/a/ETNxR1o
Creating sprites in claude desktop -> https://imgur.com/myC0U28
Listing out my sprites in claude desktop -> https://imgur.com/rgBU0jm
---
You can view the collection of tools here -> https://tpmjs.com/ajax/collections/sprites (fork it to use it yourself)
I'm looking into exe.dev and sprites.dev to build out extra features into tpmjs, agent sandboxes make a lot of sense.