A distributed queue in a single JSON file on object storage

turbopuffer.com

How to build a single global queue for distributed systems on object storage: Start with a single file on object storage, then add write batching, a stateless broker, and high availability.

February 12, 2026 · Dan Harrison (Engineer)

We recently replaced our internal indexing job queue, which notifies indexing nodes to build and update search indexes after data is written to the WAL. The queue is not part of the write path; it's purely a notification system used to schedule asynchronous indexing work. The prior version sharded queues across indexing nodes, so a slow node would block all jobs assigned to it even if other nodes were idle. The new version uses a single queue file on object storage with a stateless broker for FIFO execution, at-least-once guarantees, and 10x lower tail latency versus our prior implementation, so indexing jobs spend less time in the queue.

Why are we so obsessed with building on object storage? Because it's simple, predictable, easy to be on-call for, and extremely scalable. We know how it behaves, and as long as we design within those boundaries, we know it will perform.

Rather than present the final design of our new queue from the top down, let's build it from the bottom up, starting with the simplest thing that works and adding complexity as needed.

Step 1: queue.json

The total size of the data in a turbopuffer job queue is small, well under 1 GiB. This easily fits in memory, so the simplest functional design is a single file (e.g., queue.json) that is repeatedly overwritten with the full contents of the queue.

A queue pusher reads the contents of the queue, appends a new job to the end, and writes it using compare-and-set (CAS).

A queue worker similarly uses CAS to mark the first unclaimed job as in progress (○ → ◐), and then gets to work.

We'll call pushers and workers clients, and push and claim operations requests.

The compare-and-set (CAS) primitive makes this atomic. The write only succeeds if queue.json hasn't changed since it was read. If it has changed, the client reads the new contents and tries again. This gives strong consistency guarantees without complex locking.

queue.json                     
┌─────────────────────────────────┐ 
│ {"jobs":["◐","○","○","○","○",]} │ 
└─────────────────────────────────┘ 
            ▲                 ▲   
            │                 │
            │                 │
        CAS │             CAS │
      write │           write │
            │                 │
            │                 │
      ┌─────┴──┐        ┌─────┴──┐
      │ worker │        │ pusher │
      └────────┘        └────────┘
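A minimal sketch of this read-modify-write loop in Python, using an in-memory stand-in for the object store. The class and function names here are illustrative, not turbopuffer's actual code; a real implementation would use the store's conditional-write preconditions instead of `FakeObjectStore`:

```python
import json

class CasConflict(Exception):
    pass

class FakeObjectStore:
    """In-memory stand-in for one object guarded by a generation number."""
    def __init__(self):
        self.data, self.generation = json.dumps({"jobs": []}), 0

    def read(self):
        return self.data, self.generation

    def cas_write(self, data, expected_generation):
        # The write succeeds only if nobody wrote since we read.
        if expected_generation != self.generation:
            raise CasConflict
        self.data, self.generation = data, self.generation + 1

def push(store, job):
    while True:  # read-modify-write until the CAS lands
        raw, gen = store.read()
        state = json.loads(raw)
        state["jobs"].append({"job": job, "state": "unclaimed"})
        try:
            store.cas_write(json.dumps(state), gen)
            return
        except CasConflict:
            continue  # someone else won; re-read and retry

def claim(store):
    while True:
        raw, gen = store.read()
        state = json.loads(raw)
        for entry in state["jobs"]:
            if entry["state"] == "unclaimed":  # first unclaimed job: ○ → ◐
                entry["state"] = "in_progress"
                try:
                    store.cas_write(json.dumps(state), gen)
                    return entry["job"]
                except CasConflict:
                    break  # conflict: re-read and retry
        else:
            return None  # nothing left to claim
```

Against a real bucket, `cas_write` maps to a conditional PUT (a generation-match precondition on GCS, or the equivalent conditional-write header on S3-compatible stores).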

This simplest of queues works surprisingly well! For up to 1 request per second (GCS limits updates to a single object to roughly one per second), it's already production grade thanks to everything that object storage does for us.

But most queues (including ours) receive more than one request per second. We need more throughput.

Step 2: queue.json with group commit

Object storage has many virtues, but low write latency is not one of them. Replacing a file can take up to 200ms, so instead of writing jobs one-by-one, we need to batch. Whenever a write is in flight, we buffer incoming requests in memory. As soon as the write finishes, we flush the buffer as the next CAS write.

This technique is commonly called group commit, and it's the same pattern turbopuffer uses for batching writes to the WAL. Traditional databases also use this technique to coalesce fsync(2) calls to maximize the committed throughput to disk.

queue.json                     
┌─────────────────────────────────┐ 
│ {"jobs":["◐","◐","◐","○","○",]} │ 
└─────────────────────────────────┘ 
                ▲             ▲
          group │       group │   
         commit │      commit │ 
                │             │
    ┌─buffer────┴─┐ ┌─buffer──┴───┐
    │┌───┬───┬───┐│ │┌───┬───┬───┐│
    ││ ◐ │ ◐ │ ◐ ││ ││ ○ │ ○ │ ○ ││
    │└───┴───┴───┘│ │└───┴───┴───┘│
    └──────▲──────┘ └──────▲──────┘
           │               │
      ┌────┴───┐      ┌────┴───┐
      │ worker │      │ pusher │
      └────────┘      └────────┘

Group commit solves our throughput problem by decoupling write rate from request rate. The scaling bottleneck shifts from write latency (~200ms/write) to network bandwidth (~10 GB/s) – far greater than what turbopuffer needs to track indexing jobs.
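A minimal sketch of group commit, assuming a `write_fn` that performs one CAS write for a whole batch. The threading layout and names are illustrative, not turbopuffer's implementation:

```python
import threading

class GroupCommitter:
    def __init__(self, write_fn):
        self.write_fn = write_fn      # performs one CAS write of a batch
        self.lock = threading.Lock()
        self.buffer = []              # requests waiting for the next write
        self.write_in_flight = False

    def submit(self, job):
        done = threading.Event()
        with self.lock:
            self.buffer.append((job, done))
            if not self.write_in_flight:
                self.write_in_flight = True
                threading.Thread(target=self._flush).start()
        done.wait()  # caller blocks until its batch is durable

    def _flush(self):
        while True:
            with self.lock:
                if not self.buffer:
                    self.write_in_flight = False
                    return
                batch, self.buffer = self.buffer, []
            jobs = [job for job, _ in batch]
            self.write_fn(jobs)       # one ~200ms CAS write for N jobs
            for _, done in batch:
                done.set()            # acknowledge the whole batch
```

Each caller blocks only until the write containing its job completes, so N requests arriving during one in-flight write still cost a single write.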

However, there’s still a problem. In any turbopuffer region, tens or hundreds of clients will contend over the single queue object as new data is written to many namespaces.

Since CAS ensures strong consistency by forcing each write to be non-overlapping in time, we can only fit 1 / ~200ms = ~5 writes / second (and we still have the 1 RPS limit on GCS).

The problem is no longer throughput. We need fewer writers.

Note: This design, coupled with sharding to local queues, is roughly what we had in production prior to this update. The next sections describe turbopuffer's current production indexing queue.

Step 3: queue.json with a brokered group commit

To eliminate contention over the queue object, we introduce a stateless broker which is responsible for all interactions with object storage. All clients must now liaise with the broker instead of writing to object storage directly.

The broker runs a single group commit loop on behalf of all clients, so no one contends for the object. Critically, it doesn't acknowledge a write until the group commit has landed in object storage. No client moves on until its data is durably committed.

Now the broker is the bottleneck, but a single broker process can serve hundreds or thousands of clients without breaking a sweat because the writes are so small. It's just holding open connections and buffering requests in memory while waiting on I/O. Object storage does the heavy lifting.

queue.json                     
┌─────────────────────────────────┐ 
│ {"jobs":["◐","◐","◐","○","○",]} │ 
└─────────────────────────────────┘ 
                ▲                
                │ brokered
                │ group commit
                │
╔══ broker ═════╧═════════════════╗
║  ┌─ buffer ───────────────────┐ ║
║  │ ┌───┬───┬───┬───┬───┬───┐  │ ║
║  │ │ ◐ │ ◐ │ ◐ │ ○ │ ○ │ ○ │  │ ║
║  │ └───┴───┴───┴───┴───┴───┘  │ ║
║  └────────────────────────────┘ ║
╚════════╤═══════════════╤════════╝
         │               │
    ┌────┴────┐     ┌────┴────┐
    │ workers │     │ pushers │
    └─────────┘     └─────────┘
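Under these assumptions, the broker's commit loop can be sketched as follows: clients enqueue operations, the loop drains them, applies them to its copy of queue.json, performs one CAS write, and acknowledges only after that write lands. All names are illustrative:

```python
import json
import queue
import threading

class Store:
    """Tiny stand-in for the queue.json object."""
    def __init__(self):
        self.data, self.generation = json.dumps({"jobs": []}), 0

    def read(self):
        return self.data, self.generation

    def cas_write(self, data, gen):
        assert gen == self.generation  # no contention: the broker is sole writer
        self.data, self.generation = data, self.generation + 1

class Broker:
    def __init__(self, store):
        self.store = store
        self.pending = queue.Queue()  # (op, arg, reply, done)

    def request(self, op, arg=None):
        """Client-facing: blocks until the operation is durably committed."""
        done, reply = threading.Event(), {}
        self.pending.put((op, arg, reply, done))
        done.wait()
        return reply.get("result")

    def commit_loop_once(self):
        batch = [self.pending.get()]  # wait for at least one request
        while not self.pending.empty():
            batch.append(self.pending.get_nowait())  # drain the rest
        raw, gen = self.store.read()
        state = json.loads(raw)
        for op, arg, reply, _ in batch:
            if op == "push":
                state["jobs"].append({"job": arg, "state": "unclaimed"})
            elif op == "claim":
                free = next((j for j in state["jobs"]
                             if j["state"] == "unclaimed"), None)
                if free is not None:
                    free["state"] = "in_progress"
                reply["result"] = free["job"] if free else None
        self.store.cas_write(json.dumps(state), gen)  # durable first...
        for _, _, _, done in batch:
            done.set()  # ...then acknowledge the whole batch
```

Because every client waits on its `done` event, no push or claim is observed until the batch containing it has been written.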

That's it for scaling. The system can now handle turbopuffer's indexing traffic. But we need high availability.

Step 4: queue.json with an HA brokered group commit

The broker's machine might die at any time. Similarly, some worker might claim a job and then never finish it. The fix for each of these has the same shape — notice when something is gone and hand off the responsibility — but the details differ.

If any request from a client to the broker takes too long, we start a new broker. Clients will need a way to find the new broker, so we write the broker's address to queue.json.

The broker is stateless, so it's easy and inexpensive to move. And if we end up with more than one broker at a time? That's fine: CAS ensures correctness even with two brokers. The previous broker eventually discovers it's no longer the broker when it gets a CAS failure on queue.json. The only downside is a bit of contention, and thus slowness, for this brief duration.
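The fencing property can be sketched as follows: a new broker takes over by CAS-writing its address into queue.json, which bumps the object's generation, so the old broker's next CAS write fails and it knows to step down. The addresses and field names here are illustrative:

```python
import json

class CasConflict(Exception):
    pass

class Store:
    """In-memory stand-in for queue.json with a generation number."""
    def __init__(self, data):
        self.data, self.generation = data, 0

    def read(self):
        return self.data, self.generation

    def cas_write(self, data, gen):
        if gen != self.generation:
            raise CasConflict
        self.data, self.generation = data, self.generation + 1

def take_over(store, my_addr):
    """A new broker claims brokership by writing its address into the file."""
    while True:
        raw, gen = store.read()
        state = json.loads(raw)
        state["broker"] = my_addr
        try:
            store.cas_write(json.dumps(state), gen)
            return
        except CasConflict:
            continue

def commit_as_broker(store, jobs, snapshot):
    """Commit a batch against a previously read snapshot of queue.json.
    A CAS failure means the file changed underneath us, e.g. because a new
    broker took over; the caller should stop acting as the broker."""
    raw, gen = snapshot
    state = json.loads(raw)
    state["jobs"].extend(jobs)
    try:
        store.cas_write(json.dumps(state), gen)
        return True
    except CasConflict:
        return False
```

The old broker never needs to be told it was replaced; the generation check on its next write is the notification.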

For the job claims, we add a heartbeat. Periodically, the worker confirms that it's still on track by sending the broker a timestamp, which is then written to queue.json for that job (one heartbeat per claimed job). If a job's last heartbeat is ever older than some timeout, we assume the original worker is gone and the next worker takes over where it left off.
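The reclaim check can be sketched as a pure function over the queue state. The field names and the 30-second timeout here are illustrative, not turbopuffer's actual values:

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative

def heartbeat(state, job_id, now):
    """Record that the worker on job_id was alive at time `now`."""
    for job in state["jobs"]:
        if job["id"] == job_id:
            job["last_heartbeat"] = now

def next_claimable(state, now):
    """First job that is unclaimed, or claimed but presumed abandoned."""
    for job in state["jobs"]:
        if job["state"] == "unclaimed":
            return job
        if (job["state"] == "in_progress"
                and now - job["last_heartbeat"] > HEARTBEAT_TIMEOUT):
            return job  # original worker presumed gone; take over
    return None
```

A stale heartbeat doesn't prove the worker is dead, only that it's safe to run the job again, which is why the queue guarantees at-least-once rather than exactly-once execution.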

queue.json                     
┌─────────────────────────────────┐ 
│  {                              │
│   "broker":"10.0.0.42:3000",    │
│   "jobs":["◐(♥)","◐(♥)","○",]   │
│  }                              │
└─────────────────────────────────┘ 
                ▲               ▲
       brokered │          read │
   group commit │               │
                │               │
╔══ broker ═════╧═════════════════╗
║  ┌─ buffer ───────────────────┐ ║
║  │ ┌───┬───┬───┬───┬───┬───┐  │ ║
║  │ │ ◐ │ ◐ │ ○ │ ○ │ ○ │ ○ │  │ ║
║  │ └───┴───┴───┴───┴───┴───┘  │ ║
║  └────────────────────────────┘ ║
╚════════╤═══════════════╤════════╝
         │               │      │
    ┌────┴────┐     ┌────┴────┐ │
    │ workers │     │ pushers │─┘
    └─────────┘     └─────────┘

Ship it

We built a reliable distributed job queue with just a single file on object storage and a handful of stateless processes. It easily handles our throughput, guarantees at-least-once delivery, and fails over to any node as needed. Those familiar with turbopuffer's core architecture will see the parallels. Object storage offers few, but powerful, primitives. Once you learn how they behave, you can wield them to build resilient, performant, and highly scalable distributed systems with what's already there.

We host 2.5T+ documents, handle 10M+ writes/s, and serve 10k+ queries/s. We are ready for far more. We hope you'll trust us with your queries.

Comments

  • By pjc50 2026-02-24 11:17 | 2 replies

    Several things going on here:

    - concurrency is very hard

    - .. but object storage "solves" most of that for you, handing you a set of semantics which work reliably

    - single file throughput sucks hilariously badly

    - .. because 1Gb is ridiculously large for an atomic unit

    - (this whole thing resembles a project I did a decade ago for transactional consistency on TFAT on Flash, except that somehow managed faster commit times despite running on a 400Mhz MIPS CPU. Edit: maybe I should try to remember how that worked and write it up for HN)

    - therefore, all of the actual work is shifted to the broker. The broker is just periodically committing its state in case it crashes

    - it's not clear whether the broker ACKs requests before they're in durable storage? Is it possible to lose requests in flight anyway?

    - there's a great design for a message queue system between multiple nodes that aims for at least once delivery, and has existed for decades, while maintaining high throughput: SMTP. Actually, there's a whole bunch of message queue systems?

    • By jitl 2026-02-24 13:48 | 1 reply

      > The broker runs a single group commit loop on behalf of all clients, so no one contends for the object. Critically, it doesn't acknowledge a write until the group commit has landed in object storage. No client moves on until its data is durably committed.

      • By aduffy 2026-02-24 21:16

        Yea, the group commit is the real insight here.

        I read this blog post and to help wrap my head around it I put together a simple TCP-based KV store with group commit, helped make it click for me.

        https://github.com/a10y/group-commit/

    • By candiddevmike 2026-02-24 12:44 | 3 replies

      AFAIK you can kinda "seek" reads in S3 using a range header, WCGW? =D

      • By staticassertion 2026-02-24 14:11

        You can, and it's actually great if you store little "headers" etc to tell you those offsets. Their design doesn't seem super amenable to it because it appears to be one file, but this is why a system that actually intends to scale would break things up. You then cache these headers and, on cache hit, you know "the thing I want is in that chunk of the file, grab it". Throw in bloom filters and now you have a query engine.

        Works great for Parquet.

      • By Sirupsen 2026-02-24 15:27

        Yep! Other than random reads (~p99=200ms on larger ranges), it's essential to get good download performance of a single file. A single (range) request can "only" drive ~500 MB/s, so you need multiple offsets.

        https://github.com/sirupsen/napkin-math

      • By UltraSane 2026-02-24 14:17 | 1 reply

        Amazon S3 Select enables SQL queries directly on CSV, JSON, or Apache Parquet objects, allowing retrieval of filtered data subsets to reduce latency and costs

        • By staticassertion 2026-02-24 14:25 | 2 replies

          S3 Select is, very sadly, deprecated. It also supported HTTP RANGE headers! But they've killed it and I'll never forgive them :)

          Still, it's nbd. You can cache a billion Parquet header/footers on disk/ memory and get 90% of the performance (or better tbh).

          • By dotgov 2026-02-25 16:48 | 1 reply

            Caching Parquet headers/footers sounds super interesting. Can you say more about how you implemented it?

            • By staticassertion 2026-02-25 18:49 | 1 reply

              Currently there's nothing in my headers, but the footer is straightforward. There's the schema, row group metadata, some statistics, byte offsets for each column in a group, page index, etc. It's everything you'd want if you wanted to reject a query outright or, if necessary, query extremely efficiently.

              min/max stats for a column are huge because I pre-encode any low-cardinality strings into integers. This means I can skip entire row groups without ever touching S3, just with that footer information, and if I don't have it cached I can read it and skip decoding anything that doesn't have my data.

              Footers can get quite large in one sense - 10s-100s of KB for a very large file. But that's obviously tiny compared to a multi-GB Parquet file, and the data can compress extremely well for a second/ third tier cache. You can store 1000s of these pre-parsed in memory no problem, and store 10s of thousands more on disk.

              I've spent 0 time optimizing my footers currently. They can get smaller than they are, I assume, but I've not put much thought. In fact, I don't have to assume, I know that my own custom metadata overlaps with the existing parquet stats and I just haven't bothered to deal with it. TBH there are a bunch of layout optimizations I've yet to explore, like using headers would obviously have some benefits (streaming) whereas right now I do a sort of "attempt to grab the footer from the end in chunks until we find it lol". But it doesn't come up because... caching. And there are worse things than a few spurious RANGE requests.

              • By UltraSane 2026-02-25 19:18 | 1 reply

                Have you tried AWS S3 Tables, which is a managed Iceberg service?

                • By staticassertion 2026-02-25 22:20 | 1 reply

                  I haven't. I'm sort of aware of it but I guess I prefer to just have tight control over the protocol/ data layout. It's not that hard and it gives me a ton of room to make niche optimizations. I doubt I'd get the same performance if I used it, but I could be wrong. Usually the more you can push your use case into the protocol the better.

                  • By UltraSane 2026-02-26 1:20 | 1 reply

                    Like most managed services it is a trade off of control vs ease of operation. And like everything with S3 it scales to absurd levels with 10,000 tables per table bucket

                    • By staticassertion 2026-02-26 1:32

                      Makes sense and tbh there's a very good chance that I'd consider it if I were trying to stay more "standard" but I'd have to learn more.

          • By UltraSane 2026-02-24 17:53

            Wow I didn't know that. To be fair now that S3 tables exists it is rather redundant.

  • By Normal_gaussian 2026-02-24 11:10

    The original graph appears to simply show the blocking issue of their previous synchronisation mechanism; 10 min to process an item down to 6 min. Any central system would seem to resolve this for them.

    In any organisation it's good to make choices for simplicity rather than small optimisations - you're optimising maintenance, incident resolution, and development.

    Typically I have a small pg server for these things. It'll work out slightly more expensive than this setup for one action, yet it will cope with so much more - extending to all kinds of other queues and config management - with simple management, off the shelf diagnostics etc.

    While the object store is neat, there is a confluence of factors which make it great and simple for this workload, that may not extend to others. 200ms latency is a lot for other workloads, 5GB/s doesn't leave a lot of headroom, etc. And I don't want to be asked to diagnose transient issues with this.

    So I'm torn. It's simple to deploy and configure from a fresh deployment PoV. Yet it wouldn't be accepted into any deployment I have worked on.

  • By staticassertion 2026-02-24 14:09 | 4 replies

    Yeah, I mean, I think we're all basically doing this now, right? I wouldn't choose this design, but I think something similar to DeltaLake can be simplified down for tons of use cases. Manifest with CAS + buffered objects to S3, maybe compaction if you intend to do lots of reads. It's not hard to put it together.

    You can achieve stupidly fast read/write operations if you do this right with a system that is shockingly simple to reason about.

    > Step 4: queue.json with an HA brokered group commit
    >
    > The broker is stateless, so it's easy and inexpensive to move. And if we end up with more than one broker at a time? That's fine: CAS ensures correctness even with two brokers.

    TBH this is the part that I think is tricky. Just resolving this in a way that doesn't end up with tons of clients wasting time talking to a broker that buffers their writes, pushes them, then always fails. I solved this at one point with token fencing and then decided it wasn't worth it and I just use a single instance to manage all writes. I'd again point to DeltaLake for the "good" design here, which is to have multiple manifests and only serialize compaction, which also unlocks parallel writers.

    The other hard part is data deletion. For the queue it looks dead simple since it's one file, but if you want to ramp up your scale and get multiple writers or manage indexes (also in S3) then deletion becomes something you have to slip into compaction. Again, I had it at one point and backed it out because it was painful.

    But I have 40k writes per second working just fine for my setup, so I'm not worrying. I'd suggest others basically punt as hard as possible on this. If you need more writes, start up a separate index with its own partition for its own separate set of data, or do naive sharding.

    • By allknowingfrog 2026-02-24 17:07 | 1 reply

      This is news to me. What motivates you to reach for an S3-backed queue versus SQS?

      • By staticassertion 2026-02-24 17:52 | 1 reply

        I'm not building a queue, but a lot of things on s3 end up being queue-shaped (more like 'log shaped') because it's very easy to compose many powerful systems out of CAS + "buffer, then push". Basically, you start with "build an immutable log" with those operations and the rest of your system becomes a matter of what you do with that log. A queue needs to support a "pop", but I am supporting other operations. Still, the architecture overlap all begins with CAS + buffer.

        That said, I suspect that you can probably beat SQS for a number of use cases, and definitely if you want to hold onto the data long term or search over it then S3 has huge options there.

        Performance will be extremely solid unless you need your worst case latency for "push -> pop" to be very tight in your p90.

        • By loevborg 2026-02-24 18:55 | 1 reply

          This is fascinating. It sounds like you're building "cloud datastructures" based on S3+CAS. What are the benefits, in your view, of using S3 instead of, say, dynamo or postgres? Or reaching for NATS/rabbitmq/sqs/kafka. I'd love to hear a bit more about what you're building.

          • By staticassertion 2026-02-24 19:14

            It's just trade-offs. If you have a lot of data, s3 is just the only option for storing it. You don't want to pay for petabytes of storage in Dynamo or Postgres. I also don't want to manage postgres, even RDS - dealing with write loads that S3 handles easily is very annoying, dealing with availability, etc, all is painful. S3 "just works" but you need to build some of the protocol yourself.

            If you want consistently really low latency/ can't tolerate a 50ms spike, don't retain tons of data, have <10K/s writes, and need complex indexing that might change over time, Postgres is probably what you want (or some other thing). If you know how your data should be indexed ahead of time, you need to store a massive amount, you care more about throughput than a latency spike here or there, or really a bunch of other use cases probably, S3 is just an insanely powerful primitive.

            Insane storage also unlocks new capabilities. Immutable logs unlock "time travel" where you can ask questions like "what did the system look like at this point?" since no information is lost (unless you want to lose it, up to you).

            Everything about a system like this comes down to reducing the cost of a GET. Bloom filters are your best friend, metadata is your best friend, prefetching is a reluctant friend, etc.

            I'm not sure what I'm building. I had this idea years ago before S3 CAS was a thing and I was building a graph database on S3 with the fundamental primitive being an immutable event log (at the time using CRDTs for merge semantics, but I've abandoned that for now) and then maintaining an external index in Scylla with S3 Select for projections. Years later, I have fun poking at it sometimes and redesigning it. S3 CAS unlocked a lot of ways to completely move the system to S3.

    • By thomas_fa 2026-02-24 15:10 | 1 reply

      A lot of good insights here. I am also wondering if they could simply put different jobs (unclaimed, in-progress, deleted/done) into different directories/prefixes and rely on atomic object rename primitives [1][2][3] to solve the problem more gracefully (group commit can still be used if needed).

      [1] https://docs.cloud.google.com/storage/docs/samples/storage-m... [2] https://docs.aws.amazon.com/AmazonS3/latest/API/API_RenameOb... [3] https://fractalbits.com/blog/why-we-built-another-object-sto...

      • By staticassertion 2026-02-24 21:47

        I didn't know about atomic object rename... it's going to take me a long time to think through the options here.

        > RenameObject is only supported for objects stored in the S3 Express One Zone storage class.

        Ah interesting, I don't use this but I bet in a year+ AWS will have this everywhere lol S3 is just too good.

    • By tomnicholas1 2026-02-24 19:10 | 1 reply

      What you describe is very similar to how Icechunk[1] works. It works beautifully for transactional writes to "repos" containing PBs of scientific array data in object storage.

      [1]: https://icechunk.io/en/latest/

    • By zbentley 2026-02-24 15:06 | 1 reply

      > I solved this at one point with token fencing

      Could you expand on that? Even if it wasn't the approach you stuck with, I'm curious.

      • By staticassertion 2026-02-24 15:28

        Oof, I probably misspoke there just slightly. I attempted to solve this with token fencing, I honestly don't know if it worked under failure conditions. This was also a while ago. But the idea was basically that there were two tiers - one was a ring based approach where a single file determined which writer was allocated a 'space' in the ring. Then every write was prepended with that token. Even if a node dropped/ joined and others didn't know about it (because they hadn't re-read the ring file), every write had this token.

        Writes were not visible until compaction in this system. At compaction time, tokens would be checked and writes for older tokens would be rejected, so even if two nodes thought that they owned a 'place' in the ring, only writes for the higher value would be accepted. Soooomething like that. I ended up disliking this because it had undesirable failure modes like lots of stale/ wasted writes, and the code sucked.
