ANN v3: 200ms p99 query latency over 100B vectors


Our latest ANN release supports scales of 100+ billion vectors in a single search index, with 200ms p99 query latency at 1k QPS and 92% recall.

January 21, 2026 · Nathan VanBenschoten (Chief Architect)

The pursuit of scale is not vanity. When you take existing systems and optimize them from first principles to achieve a step change in scalability, you can create something entirely new.

Nothing has demonstrated that more clearly than the explosion in deep learning over the past decade. The ML community took decades-old ideas and combined them with advancements in hardware, new algorithms, and hyper-specialization to forge something remarkable.

Both inspired by the ML community and in service of it, we recently rebuilt vector search in turbopuffer to support scales of up to 100 billion vectors in a single search index. We call this technology Approximate Nearest Neighbor (ANN) Search v3, and it is available now.

In this post, I'll dive into the technical details behind how we built for 100 billion vectors. Along the way, we’ll examine turbopuffer’s architecture, travel up the modern memory hierarchy, zoom into a single CPU core, and then back out to the scale of a distributed cluster.

Let’s look at the numbers to get a sense of the challenge: 100 billion vectors, 1024 dimensions per vector, 2 bytes per dimension (f16). This is vector search over 200TiB of dense vector data. We want to serve a high rate (> 1k QPS) of ANN queries over this entire dataset, each with a latency target of 200ms or less.
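As a sanity check, the raw size of the workload falls out of a few lines of arithmetic (the constants below are the workload parameters stated above):

```python
# Workload parameters from the paragraph above.
NUM_VECTORS = 100_000_000_000  # 100 billion vectors
DIMS = 1024                    # dimensions per vector
BYTES_PER_DIM = 2              # f16

total_bytes = NUM_VECTORS * DIMS * BYTES_PER_DIM
print(f"{total_bytes / 10**12:.0f} TB")  # ~205 TB of dense vector data
print(f"{total_bytes / 2**40:.0f} TiB")  # ~186 TiB, roughly 200 TiB
```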

With a healthy dose of mechanical sympathy, let’s consider how our hardware will run this workload and where it will encounter bottlenecks. If one part of the system bottlenecks (disk, network, memory, or CPU), other parts of the system will go underutilized. The key to making the most of the available hardware is to push down bottlenecks and balance resource utilization.

turbopuffer’s architecture is simple and opinionated. This simplicity makes the exercise tractable. turbopuffer’s query tier is a stateless layer on top of object storage, consisting of a caching hierarchy and compute. That’s it.

      ╔════════════╗
      ║   client   ║░
      ╚════════════╝░
       ░░░░░║░░░░░░░░
            ▼
╔═ turbopuffer ════════════╗
║  ┏━━━━━━━━━━━━━━━━━━━━┓  ║░
║  ┃    Memory/SSD      ┃  ║░
║  ┃      Cache         ┃  ║░
║  ┗━━━━━━━━┳━━━━━━━━━━━┛  ║░
║           ▼              ║░
║  ┏━━━━━━━━━━━━━━━━━━━━┓  ║░
║  ┃    Object Storage  ┃  ║░
║  ┃      (S3)          ┃  ║░
║  ┗━━━━━━━━━━━━━━━━━━━━┛  ║░
╚══════════════════════════╝░
 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Identifying the bottleneck

When trying to process large amounts of data at high throughput, this system architecture could bottleneck in one of two ways. First, it could bottleneck on the CPU instructions needed to process the data ("compute-bound"). Second, it could bottleneck on the data path up the memory hierarchy feeding the CPU ("bandwidth-bound").

We can borrow a strategy from the GPU community in order to estimate where it will bottleneck by classifying our workload’s arithmetic intensity. Arithmetic intensity is the ratio of arithmetic operations to memory operations. It is often defined using GPU FLOPs and bytes transferred through GPU memory, but it is generalizable to other domains.

Different algorithms have different intensities. For example, a matrix-matrix multiplication is more intensive than a vector dot product. This is because in a matrix multiplication (SGEMM), each element in one matrix is multiplied against N elements (a full row or column) in the other matrix. In a vector dot product (SDOT), each element is multiplied only against one element from the other vector.


  performance
  (FLOPS/s)
  ▲
  │                       ╱
  │      memory bandwidth
  │       (bytes/second) ╱
  │                         arithmetic
  │                     ╱   bandwidth
  ├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ██──────┬──────
  │                   ╱ ridge │
  │                  ╱  point │
  │                 ╱         │
  │                ╱          │
  │               ╱           │
  │              ╱│           │
  │             ╱             │
  │            ╱  │           │
  │           ╱               │ 
  │          ╱    │ memory-   │ compute-
  │         ╱       bound     │ bound
  │        ╱      │           │
  │       ╱                   │
  │      ╱        │           │
  │     ╱                     │
  │    ╱          │           │
  │   ╱                       │
  └──╳────────────┴───────────┴──▶
              arithmetic intensity
                      (FLOPs/byte)

adapted from https://modal.com/gpu-glossary/perf/arithmetic-intensity

As a rule of thumb, a workload that has a small constant arithmetic intensity (e.g., SDOT) will be memory-bound. A workload that has a large constant or linear arithmetic intensity (e.g., SGEMM) will be compute-bound. The intuition is simple enough — if a byte pulled into a compute register is only used once or twice, more work goes into fetching the byte than is needed to operate on it. Meanwhile, if that byte is used many times, the memory fetch is amortized and the computation over it dominates.
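To make the rule of thumb concrete, here is a simplified FLOP/byte count for the two examples, assuming f32 inputs and ignoring operand reuse through caches:

```python
def sdot_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """Dot product of two n-vectors: 2n FLOPs (n multiplies + n adds)
    over 2n elements streamed from memory."""
    flops = 2 * n
    bytes_moved = 2 * n * bytes_per_elem
    return flops / bytes_moved

def sgemm_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """n x n matrix multiply: 2n^3 FLOPs over 3n^2 elements
    (two input matrices plus one output)."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_elem
    return flops / bytes_moved

print(sdot_intensity(1024))   # constant 0.25 FLOPs/byte, regardless of n
print(sgemm_intensity(1024))  # grows linearly with n (~170 FLOPs/byte here)
```

SDOT's intensity is a small constant, so it sits on the memory-bound side of the roofline; SGEMM's grows with the matrix size, so large matrix multiplies land on the compute-bound side.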

If we imagine the kernel of a vector search, the system fetches each data vector and performs a distance calculation between it and a query vector. This distance function is essentially a vector dot product, multiplying the data and query value in each of the corresponding vector dimensions.

    
  ╭  ╮   ╭  ╮    
  │d1│   │q1│ 
  │d2│   │q2│ 
  │d3│ • │q3│ = ∑ di • qi
  │••│   │••│
  │di│   │qi│
  ╰  ╯   ╰  ╯
            

Since each element in a data vector is used only once by the distance function, the arithmetic intensity of vector search is low. Most of the work goes into pulling many large data vectors into CPU registers. Recognizing this, we can predict that vector search will be bandwidth-bound, as is the case for many analytics and search systems.

Consequently, it doesn’t really matter how efficient the CPU instructions of our distance kernel are (within reason). If we are trying to maximize throughput (queries per second) on a machine, we are going to be limited by the number of data vectors we can run the kernel over each second.

With this insight in hand, our objective with ANN v3 is to utilize cache space efficiently and balance bandwidth demands to prevent the network, disk, or main memory from being a dominant bottleneck that limits the system’s ability to scale.

Let’s take a look at all of the places in turbopuffer’s memory hierarchy that might prevent a bandwidth-bound workload from scaling.


┌─────────────┐
│ CPU         │_ _│Size: <1KB
│ Registers   │   │  BW: >10 TB/s
├─────────────┤
│ L1/L2/L3    │_ _│Size: KB-MB
│ Cache       │   │  BW: 1-10s TB/s
├─────────────┤
│ Main Memory │_ _│Size: GB-TB
│ (DRAM)      │   │  BW: 100-500 GB/s
├─────────────┤
│ NVMe SSD    │_ _│Size: 1-10s TBs
│ (direct I/O)│   │  BW: 1-30 GB/s
├─────────────┤
│ Object      │_ _│Size: PB-EB
│ Storage     │   │  BW: 1-10 GB/s
└─────────────┘
    

We observe four distinct boundaries where bandwidth may become the limiting factor.

Take note of the difference in bandwidth between levels of the hierarchy, but also differences in size. Higher tiers are orders of magnitude smaller, but can service many orders of magnitude higher rates of data loads.

To balance bandwidth across this hierarchy, ANN v3 combines two complementary techniques: hierarchical clustering and binary quantization.

Each exploits the same general strategy of “approximation and refinement”. ANN v3 works by first quickly answering “roughly where is the answer?” and only then answering “out of that set, what exactly is the answer?”

Hierarchical clustering to narrow the search space

The first technique is hierarchical clustering in the index structure. Vector indexes in turbopuffer are based on SPFresh, a centroid-based approximate nearest neighbor index that supports incremental updates. In a centroid-based index, vectors are grouped into clusters, each represented by a single "centroid" vector (typically the mean of all vectors in that cluster). At query time, we first compare the query to centroids to identify promising clusters, then search only within those clusters. We extended the SPTAG graph-based index described in the original SPFresh paper, nesting clusters hierarchically in a multi-dimensional tree structure.

While hierarchical clustering is not new to v3, it is a very important aspect of cold query performance in turbopuffer. When a namespace is cold (not cached on SSD), turbopuffer must fetch some or all of it from object storage. Instead of traversing a graph with sequential object storage round-trips to locate the relevant data clusters, the hierarchy bounds the number of round-trips to object storage to the height of the SPFresh tree. This places a bound on tail latency, even for the coldest query.


      ┌───────────────┐
      │ root centroid │
      │       c0      │
      └───────────────┘
        ╱     │     ╲
       ╱      │      ╲
┌────────┐┌────────┐┌────────┐
│centroid││centroid││centroid│
|   c1   ││   c2   ││   c3   │
└────────┘└────────┘└────────┘
      ╱│╲       ╱│╲       ╱│╲
     ╱ │ ╲     ╱ │ ╲     ╱ │ ╲
    ╱  │  ╲   ╱  │  ╲   ╱  │  ╲
┌──────┐ ┌──────┐ ┌──────┐ ┌──────
│ data │ │ data │ │ data │ │ data  
|  v1  │ │  v2  │ │  v3  │ │ •••
└──────┘ └──────┘ └──────┘ └──────

In the case of 100 billion vector search, we can’t afford to contend with the low bandwidth of object storage (<5 GB/s) for even a fraction of data vector reads, so we size deployments to store the entire tree on SSD. Yet even when cached, the tree structure complements the hardware.

Clustering interacts well with the memory hierarchy because it provides spatial locality. Vectors closer in space - those likely to be accessed together - are stored contiguously. This makes memory and disk accesses efficient. Specifically, it means that there is very little amplification when reading from lower levels of the hierarchy, even when those levels enforce a minimum read granularity (e.g., 4KiB from disk). Every byte fetched will be put to good use.

Hierarchical clustering, specifically, interacts well with the memory hierarchy because it provides temporal locality. Vector clusters in the upper levels of the tree are accessed frequently, so they will naturally remain resident in main memory. We use a 100x branching factor between levels of our tree to balance tree depth with cluster size. Each node in the tree has approximately 100 children, creating a wide, shallow tree structure. This branching factor roughly matches the size ratio between DRAM and SSD (10x - 50x), meaning that if we can fit all data vectors on SSD, we can fit all centroid vectors in DRAM.

All of this gets us back to the original purpose of the approximate nearest neighbor index: reducing the search space for each query. Approximate indexes are a compromise between performance and recall. For centroid-based indexes, we navigate this compromise by controlling how many clusters are scanned at each level of the tree (often called the "probes" or "beam width" parameter).
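A toy sketch of one level of a centroid-based index and its recall knob (`nprobe` here; the names and the flat two-level structure are illustrative, not turbopuffer's actual implementation):

```python
import heapq
import math

def build_clusters(vectors, centroids):
    """Assign each vector to its nearest centroid (one level of the tree)."""
    clusters = [[] for _ in centroids]
    for v in vectors:
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(v, centroids[c]))
        clusters[nearest].append(v)
    return clusters

def search(query, centroids, clusters, nprobe, k):
    """Compare against centroids first, then scan only the nprobe
    closest clusters -- a narrowed search space."""
    probed = heapq.nsmallest(nprobe, range(len(centroids)),
                             key=lambda c: math.dist(query, centroids[c]))
    candidates = [v for c in probed for v in clusters[c]]
    return heapq.nsmallest(k, candidates, key=lambda v: math.dist(query, v))

centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(0.1, 0.2), (0.3, 0.1), (9.9, 10.1), (10.4, 9.7)]
clusters = build_clusters(vectors, centroids)
print(search((10.0, 10.0), centroids, clusters, nprobe=1, k=1))  # [(9.9, 10.1)]
```

Raising `nprobe` scans more clusters per level: more bandwidth consumed, higher recall.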

For vector search at this scale, we found experimentally that with good clustering, we needed to search about 500 data vector clusters (each holding ~100 vectors) on each machine to achieve our recall target. This equates to a bandwidth requirement of 100MB per level of the tree:


      100 vecs   1024D   2B
500 x ──────── x ───── x ── = 100MB 
      cluster    vector  dim  per level
    

We can use this to estimate throughput limits:

ANN tree levels                   levels   memory hierarchy tier   bandwidth   max qps
centroid vectors (upper levels)   3        DRAM                    300 GB/s    1,000 qps (300 GB / (3 x 100 MB))
data vectors (lowest level)       1        NVMe SSD                10 GB/s     100 qps (10 GB / (1 x 100 MB))

The derivation shows that with hierarchical clustering alone, we will end up disk-bound fetching data vector clusters and maxing out at around 100 queries per second. 100 qps over 100 billion vectors is admirable, but we can do better.
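The limits above fall out of a one-line bound (a sketch; real throughput also depends on concurrency and caching effects):

```python
def max_qps(tier_bandwidth: float, levels: int, bytes_per_level: float) -> float:
    """Upper bound on query throughput when every query must stream
    levels * bytes_per_level through a tier with the given bandwidth (bytes/s)."""
    return tier_bandwidth / (levels * bytes_per_level)

GB = 10**9
per_level = 500 * 100 * 1024 * 2               # ~100 MB scanned per tree level
print(round(max_qps(300 * GB, 3, per_level)))  # DRAM, 3 centroid levels: ~1,000 qps
print(round(max_qps(10 * GB, 1, per_level)))   # SSD, 1 data level: ~100 qps
```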

Binary quantization to compress vector sizes

The second technique is binary quantization of data vectors. When vectors are inserted into turbopuffer, the system stores both the full-precision vector (f32 or f16 per dimension) and a transparently computed, quantized form of the vector (1-bit per dimension).


╭       ╮               ╭   ╮
│  0.94 │               │ 1 │
│ -0.01 │               │ 0 │
│  0.39 │               │ 1 │
│ -0.72 │ ────────────▶ │ 0 │
│  0.02 │    binary     │ 1 │
│ -0.85 │ quantization  │ 0 │
│ -0.18 │               │ 0 │
│  0.99 │               │ 1 │
│  0.45 │               │ 1 │
╰       ╯               ╰   ╯
    

The math works out as expected; binary quantization provides a 16-32x compression for data vectors. This allows these quantized vectors to be stored higher in the memory hierarchy and minimizes their memory bandwidth demands.
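A minimal sign-based bit-packing sketch (real RaBitQ first applies a random rotation and normalization; this shows only the packing and the compression ratio):

```python
def binary_quantize(vec: list[float]) -> bytes:
    """Pack 1 bit per dimension: the sign of each component."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits.to_bytes((len(vec) + 7) // 8, "little")

# The example vector from the figure above.
q = binary_quantize([0.94, -0.01, 0.39, -0.72, 0.02, -0.85, -0.18, 0.99, 0.45])
print(f"{int.from_bytes(q, 'little'):09b}"[::-1])  # 101010011, as in the figure

f16_bytes, quantized_bytes = 1024 * 2, 1024 // 8
print(f16_bytes // quantized_bytes)  # 16x compression for f16 vectors
```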

To avoid this compression leading to a loss of search quality, ANN v3 employs the RaBitQ quantization method. RaBitQ exploits the mathematical properties of high-dimensional space (concentration of measure) to compress aggressively while preserving high recall. Specifically, in high dimensions, vector components naturally become more uniformly distributed, which means quantization errors are spread evenly across all dimensions rather than concentrated in problematic directions. This uniform error distribution enables RaBitQ to provide tight theoretical error bounds alongside distance estimations made on quantized vectors.

For example, consider a data vector Vd and query vector Vq. A full-precision distance computation might compute the cosine distance between Vd and Vq as 0.75. Now consider their binary quantized forms, Vd' and Vq'. Due to the loss of information from quantizing, RaBitQ cannot perfectly compute the true cosine distance by looking just at the quantized vectors. Instead, it will compute an estimated range like [0.69, 0.83].

Despite being imprecise, this confidence interval can be used to conclude that Vd is closer to Vq than some other vector Vd2 whose quantized distance estimate has a range [0.87, 0.91]. For other data vectors with overlapping distance estimate ranges (e.g., [0.51, 0.77]), RaBitQ makes no promises. Such vectors must be recompared using their full precision vectors.

During a vector search, turbopuffer first evaluates the search on quantized vectors, uses the error bounds to determine all vectors that could be in the true top-k, then fetches their corresponding full-precision vectors, and reranks over those to compute the final result. In practice, we find that less than 1% of data vectors in the narrowed search space need to be reranked to avoid an impact on recall.
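A sketch of that pruning step, with hypothetical helper names; simple interval logic stands in for RaBitQ's actual error bounds:

```python
def rerank_candidates(bounds, exact_dist, k):
    """bounds[i] is a (lo, hi) distance interval estimated from quantized
    vectors. Any candidate whose lower bound beats the k-th smallest upper
    bound could still be in the true top-k; only those are reranked at
    full precision."""
    kth_hi = sorted(hi for _, hi in bounds)[k - 1]
    survivors = [i for i, (lo, _) in enumerate(bounds) if lo <= kth_hi]
    return sorted(survivors, key=exact_dist)[:k]

# The intervals from the example above, plus the overlapping one.
bounds = [(0.69, 0.83), (0.87, 0.91), (0.51, 0.77)]
exact = {0: 0.75, 1: 0.89, 2: 0.60}
print(rerank_candidates(bounds, exact.get, k=1))  # [2]: only 0 and 2 are reranked
```

Candidate 1 ([0.87, 0.91]) is pruned without ever touching its full-precision vector; candidates 0 and 2 overlap, so both are reranked exactly.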

Putting it all together...

These two techniques compose, and their benefits multiply.

Upper levels of a quantized ANN tree are small but frequently accessed. As a result, they naturally remain resident all the way up in the L3 CPU cache. We can write out the math to demonstrate this to ourselves. With a branching factor of 100, each level L in the tree contains 100^L * 1024/8 bytes of quantized vector data (1 bit per dimension). With a 504MiB shared L3 cache, we can fit all three upper levels of the tree in L3 cache, the largest requiring 100^3 * 128 B = 128 MB of cache space.

The lowest level of the quantized ANN tree is stored in DRAM, as was the case before. However, because these vectors are compressed, they require less memory bandwidth to access.

Meanwhile, the full-precision vectors remain on local SSD. However, only a small fraction is fetched during the reranking phase, through a highly concurrent scatter-gather. This access pattern is ideal for modern NVMe drives, which have random read latencies around 50-100 microseconds and excel at handling many parallel I/O operations simultaneously, allowing us to access the full bandwidth of the disks.

We can again estimate throughput limits. Remember that quantized vectors are 16x smaller than unquantized f16 vectors, equating to a bandwidth requirement of 500 x 100 x 1024 x 2 / 16 ≈ 6MB per level of the tree.

ANN tree levels                              levels    memory hierarchy tier   bandwidth   max qps
quantized centroid vectors (upper levels)    3         L3 cache                600 GB/s    33,000 qps (600 GB / (3 x 6 MB))
quantized data vectors (lowest level)        1         DRAM                    300 GB/s    50,000 qps (300 GB / (1 x 6 MB))
full precision data vectors (lowest level)   1% of 1   NVMe SSD                10 GB/s     10,000 qps (10 GB / (0.01 x 100 MB))

The changes here demonstrate a remarkable dynamic in the memory hierarchy. Data compression both reduces bandwidth requirements and allows data to remain resident in higher bandwidth tiers at the same time, leading to a multiplicative effect. Whereas before the theoretical throughput limit was 100 qps, quantization unlocks a theoretical limit of 10,000 qps!

...to end up compute-bound...

By combining hierarchical clustering with binary quantization, ANN v3 makes efficient use of cache space and balances bandwidth demands across the tiers of the memory hierarchy.

However, if we run a production load against it, we notice something interesting. Instead of hitting the theoretical 10,000 qps, the system ends up saturated around 1,000 qps, 10% of where we’d like to be.

What is happening here? Why are we no longer limited by disk bandwidth, and instead limited somewhere else?

Returning to our framework for arithmetic intensity, we discover what changed. When we added in binary quantization, intensity shot up. With f16 vector elements, every two bytes fetched from memory were used for a single logical operation. With binary quantized elements, however, every two bytes fetched are used for sixteen logical operations (one per bit). In fact, the RaBitQ algorithm reuses each bit four times when computing distance estimates (section 3.3.2 of the paper), leading to 64x higher arithmetic intensity.

This large constant arithmetic intensity is enough to tip the scale towards the system being compute-bound. Each CPU core in our system cannot keep up with the rate of highly compressed vector data being fed to it, bottlenecking throughput.

Optimizing a compute-bound system is a different game, one that requires an obsessive focus on the efficiency of every instruction on the hot path.

As an example, we were recently surprised to find that AVX2 (256-bit x86 SIMD) does not have an accelerated popcount instruction. Counting the number of bits set in a series of bytes is an important operation for RaBitQ, so making this operation fast was essential. We switched a core distance estimation kernel over to AVX-512 to gain access to its VPOPCNTDQ instruction. This instruction is capable of counting the bits set to one across a 512-bit register in only 3 CPU cycles of latency. Better yet, it can be pipelined for a throughput of 1 instruction per cycle.

Making the switch improved the performance of the kernel in microbenchmarks by 30% and improved end-to-end production throughput by about 5%. When microbenchmarks translate to end-to-end performance gains, you have correctly identified the system’s bottleneck.

A more general discussion of optimizing a compute-bound system is a deeper topic than we have room for in this post, but it remains a focus of our work to this day as we continue to optimize ANN v3 for higher throughput.

...in a distributed cluster

Careful readers may have noticed one inconsistency in the description of ANN v3 so far. I mentioned up top that our goal was to handle vector search over 200 TiB of dense vector data. We also discussed that for this workload shape, all of this data should be cached on SSD to avoid the limited bandwidth and high latency of object storage. And yet, NVMe SSDs only get so large...

This is where distribution comes in. Modern clouds provide storage-dense VMs with fast locally attached NVMe drives (e.g., GCP's z3, AWS’s I7i, Azure’s Lsv4). These drives are large (10-40TiB) but not 200TiB large. To achieve the desired aggregate SSD capacity, we use a cluster of storage-optimized machines, each storing a subset of the index.

At this scale, simple random sharding works well. During ingestion, each vector is randomly assigned to one of the shards of the index. An ANN search query is broadcast to all shards, and the global top-k is stitched together from the sub-results.
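The stitching step can be sketched as a k-way merge of per-shard results (illustrative shapes; turbopuffer's wire format will differ):

```python
import heapq

def global_topk(shard_results, k):
    """Merge per-shard top-k lists of (distance, doc_id) pairs into the
    global top-k. Because vectors are randomly sharded, each shard's local
    top-k contains that shard's contribution to the global answer."""
    return heapq.nsmallest(k, heapq.merge(*shard_results))

shard_a = [(0.10, "v7"), (0.42, "v3")]  # already sorted, as each shard returns
shard_b = [(0.05, "v9"), (0.60, "v1")]
print(global_topk([shard_a, shard_b], k=2))  # [(0.05, 'v9'), (0.1, 'v7')]
```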

This technique can scale to arbitrarily large indexes, but its cost scales linearly with the number of machines in the cluster. This is why it is crucial to maximize the efficiency of a single machine before turning to distribution. Moving in the other direction leads to an unnecessarily expensive system.

turbopuffer customers can use this technique now by splitting their largest indexes into 4 TiB namespaces and randomly assigning vectors across them. Then, contact us to pin each shard to its own SSD (this option will soon be available directly in the dashboard). We're also working on making the sharding fully transparent in the future.

Conclusion

When performance at scale becomes cost-efficient, it stops being a benchmarking exercise and becomes a building block.

The ANN v3 architecture pushes turbopuffer to 100 billion-vector scale at thousands of QPS, while holding p99 latency at 200ms. More importantly, it does this while keeping costs low enough to run continuously in production. It achieves this through hierarchical clustering that matches the memory hierarchy, binary quantization that compresses vectors by 16-32x with reranking to maintain recall, and distribution across storage-dense machines - all working together to maximize hardware utilization and minimize bottlenecks.

We host 2.5T+ documents, handle 10M+ writes/s, and serve 10k+ queries/s. We are ready for far more. We hope you'll trust us with your queries.

Get started


Comments

  • By jascha_eng 2026-01-22 0:38

    This is legitimately pretty impressive. I think the rule of thumb is now, go with postgres(pgvector) for vector search until it breaks, then go with turbopuffer.

    • By sa-code 2026-01-25 15:39

      Qdrant is also a good default choice, since it can work in-memory for development, with a hard drive for small deployments and also for "web scale" workloads.

      As a principal eng, side-stepping a migration and having a good local dev experience is too good of a deal to pass up.

      That being said, turbopuffer looks interesting. I will check it out. Hopefully their local dev experience is good

      • By nostrebored 2026-01-25 16:43

        Qdrant is one of the few vendors I actively steer people away from. Look at the GitHub issues, look at what their CEO says, look at their fake “advancements” that they pay for publicity on…

        The number of people I know who’ve had unrecoverable shard failures on Qdrant is too high to take it seriously.

      • By benesch 2026-01-25 16:16

        For local dev + testing, we recommend just hitting the production turbopuffer service directly, but with a separate test org/API key: https://turbopuffer.com/docs/testing

        Works well for the vast majority of our customers (although we get the very occasional complaint about wanting a dev environment that works offline). The dataset sizes for local dev are usually so small that the cost rounds to free.

        • By lambda 2026-01-25 19:34

          > although we get the very occasional complaint about wanting a dev environment that works offline

          It's only occasional because the people who care about dev environments that work offline are most likely to just skip you and move on.

          For actual developer experience, as well as a number of use cases like customers with security and privacy concerns, being able to host locally is essential.

          Fair enough if you don't care about those segments of the market, but don't confuse a small number of people asking about it with a small number of people wanting it.

          • By nik9000 2026-01-25 20:44

            As someone who works for a competitor, they are probably right holding off on that segment for a while. Supporting both cloud and local deployments is somewhere between 20% harder and 300% harder depending on the day.

            I'm watching them with excitement. We all learn from each other. There's so much to do.

          • By sa-code 2026-01-25 20:31

            Can confirm. With a setup that works offline, one can

            - start small on a laptop. Going through procurement at companies is a pain

            - test things in CI reliably. Outages don’t break builds

            - transition from laptop scale to web scale easily with the same API with just a different backend

            Otherwise it’s really hard to justify not using S3 vectors here

            The current dev experience is to start with faiss for PoCs, move to pgvector and then something heavy duty like one of the Lucene wrappers.

          • By benesch 2026-01-25 20:59

            Yep, we're well aware of the selection bias effects in product feedback. As we grow we're thinking about how to make our product more accessible to small orgs / hobby projects. Introducing a local dev environment may be part of that.

            Note that we already have a in-your-own-VPC offering for large orgs with strict security/privacy/regulatory controls.

        • By sroussey 2026-01-25 16:32

          That’s not local though

        • By enigmo 2026-01-25 18:47

          having a local simulator (DynamoDB, Spanner, others) helps me a lot for offline/local development and CI. when a vendor doesn't offer this, I often end up mocking it out (one way or another) and have to wait for integration or e2e tests for feedback that could have been pushed further to the left.

          in many CI environments unit tests don't have network access, it's not purely a price consideration.

          (not a turbopuffer customer but I have been looking at it)

          • By benesch 2026-01-25 19:30

            > in many CI environments unit tests don't have network access, it's not purely a price consideration.

            I've never seen a hard block on network access (how do you install packages/pull images?) but I am sympathetic to wanting to enforce that unit tests run quickly by minimizing/eliminating RTT to networked services.

            We've considered the possibility of a local simulator before. Let me know if it winds up being a blocker for your use case.

            • By lambda 2026-01-25 19:36

              > how do you install packages/pull images

              You pre-build the images with packages installed beforehand, then use those images offline.

              • By benesch 2026-01-25 20:13

                My point is it's enough of a hassle to set up that I've yet to see that level of restriction in practice (across hundreds of CI systems).

                • By dzbarsky 2026-01-25 20:48

                  Look into Bazel, a very standard build system used at many large tech companies. It splits fetches from build/test actions and allows blocking network for build/test actions with a single CLI flag. No hassle at all.

                  The fact that you haven't come across this kind of setup suggests that your hundreds of CI systems are not representative of the industry as a whole.

                  • By benesch 2026-01-25 22:58

                    I agree our sample may not be representative but we try to stay focused on the current and next crop of tpuf customers rather than the software industry as a whole. So far "CI prohibits network access during tests" just hasn't come up as a pain point for any of them, but as I mentioned in another comment [0], we're definitely keeping an open mind about introducing an offline dev experience.

                    (I am familiar with Bazel, but I'll have to save the war stories for another thread. It's not a build tool we see our particular customers using.)

                    [0]: https://news.ycombinator.com/item?id=46758156

            • By enigmo 2026-01-26 14:45

              you pull packages from a trusted package repository, not from the internet. this is not rare in my experience (financial services, security) and will become increasingly common due to software supply chain issues.

        • By sa-code 2026-01-25 20:16

          I should have clarified, by local dev and testing I did in fact mean offline usage.

          Without that it’s unfortunately a non starter

          • By benesch 2026-01-25 20:30

            So I can note this down on our roadmap, what's the root of your requirement here? Supporting local dev without internet (airplanes, coffee shops, etc.)? Unit test speed? Something else?

    • By jauntywundrkind 2026-01-25 18:49

      I'd love to know how they compare versus MixedBread, what relative strengths each has. https://www.mixedbread.com/

      I really really enjoy & learn a lot from the mixedbread blog. And they find good stuff to open source (although the product itself is closed). https://www.mixedbread.com/blog

      I feel like there's a lot of overlap but also probably a lot of distinction too. Pretty new to this space of products though.

    • By _peregrine_ 2026-01-22 17:05

      seems like a good rule of thumb to me! though i would perhaps lump "cost" into the "until it breaks" equation. even with decent perf, pg_vector's economics can be much worse, especially in multi-tenant scenarios where you need many small indexes (this is true of any vector db that builds indexes primarily on RAM/SSD)

  • By kgeist 2026-01-25 14:31

    Are there vector DBs with 100B vectors in production which work well? There was a paper which showed that there's 12% loss in accuracy at just 1 mln vectors. Maybe some kind of logical sharding is another option, to improve both accuracy and speed.

    • By lmeyerov 2026-01-25 17:06

      I don't know at these scales, but in the 1M-100M range, we found switching from out-of-box embeddings to fine-tuning our embeddings gave less of a sting in the compression/recall trade-off. We had a 10-100X win here wrt comparable recall with better compression.

      I'm not sure how that'd work with the binary quantization phase though. For example, we use Matryoshka, and some of the bits matter way more than others, so that might be super painful.

    • By jasonjmcghee 2026-01-25 17:04

      So many missing details...

      Different vector indexes have very different recall and even different parameters for each dramatically impact this.

      HNSW can have very good recall even at high vector counts.

      There's also the embedding model, whether you're quantizing, if it's pure rag vs hybrid bm25 / static word embeddings vs graph connections, whether you're reranking etc etc

    • By _peregrine_ 2026-01-25 14:48

      the solution described in the blog post is currently in production at 100B vectors

  • By mmaunder 2026-01-25 15:04

    For those of us who operate on site, we have to add back network latency, which negates this win entirely and makes a proprietary cloud solution like this a nonstarter.

    • By benesch 2026-01-25 16:33

      Often not a dealbreaker, actually! We can spin up new tpuf regions and procure dedicated interconnects to minimize latency to the on-prem network on request (and we have done this).

      When you're operating at the 100B scale, you're pushing beyond the capacity that most on-prem setups can handle. Most orgs have no choice but to put a 100B workload into the nearest public cloud. (For smaller workloads, considerations are different, for sure.)
