Our latest ANN release supports scales of 100+ billion vectors in a single search index, with 200ms p99 query latency at 1k QPS and 92% recall.
January 21, 2026
Nathan VanBenschoten (Chief Architect)
The pursuit of scale is not vanity. When you take existing systems and optimize them from first principles to achieve a step change in scalability, you can create something entirely new.
Nothing has demonstrated that more clearly than the explosion in deep learning over the past decade. The ML community took decades-old ideas and combined them with advancements in hardware, new algorithms, and hyper-specialization to forge something remarkable.
Both inspired by the ML community and in service of it, we recently rebuilt vector search in turbopuffer to support scales of up to 100 billion vectors in a single search index. We call this technology Approximate Nearest Neighbor (ANN) Search v3, and it is available now.
In this post, I'll dive into the technical details behind how we built for 100 billion vectors. Along the way, we’ll examine turbopuffer’s architecture, travel up the modern memory hierarchy, zoom into a single CPU core, and then back out to the scale of a distributed cluster.
Let’s look at the numbers to get a sense of the challenge: 100 billion vectors, 1024 dimensions per vector, 2 bytes per dimension (f16). This is vector search over 200TiB of dense vector data. We want to serve a high rate (>1k QPS) of ANN queries over this entire dataset, each with a latency target of 200ms or less.
With a healthy dose of mechanical sympathy, let’s consider how our hardware will run this workload and where it will encounter bottlenecks. If one part of the system bottlenecks (disk, network, memory, or CPU), other parts of the system will go underutilized. The key to making the most of the available hardware is to push down bottlenecks and balance resource utilization.
turbopuffer’s architecture is simple and opinionated. This simplicity makes the exercise tractable. turbopuffer’s query tier is a stateless layer on top of object storage, consisting of a caching hierarchy and compute. That’s it.
╔════════════╗
║ client ║░
╚════════════╝░
░░░░░║░░░░░░░░
▼
╔═ turbopuffer ════════════╗
║ ┏━━━━━━━━━━━━━━━━━━━━┓ ║░
║ ┃ Memory/SSD ┃ ║░
║ ┃ Cache ┃ ║░
║ ┗━━━━━━━━┳━━━━━━━━━━━┛ ║░
║ ▼ ║░
║ ┏━━━━━━━━━━━━━━━━━━━━┓ ║░
║ ┃ Object Storage ┃ ║░
║ ┃ (S3) ┃ ║░
║ ┗━━━━━━━━━━━━━━━━━━━━┛ ║░
╚══════════════════════════╝░
░░░░░░░░░░░░░░░░░░░░░░░░░░░░

When trying to process large amounts of data at high throughput, this system architecture could bottleneck in one of two ways. First, it could bottleneck on the CPU instructions needed to process the data ("compute-bound"). Second, it could bottleneck on the data path up the memory hierarchy feeding the CPU ("bandwidth-bound").
We can borrow a strategy from the GPU community in order to estimate where it will bottleneck by classifying our workload’s arithmetic intensity. Arithmetic intensity is the ratio of arithmetic operations to memory operations. It is often defined using GPU FLOPs and bytes transferred through GPU memory, but it is generalizable to other domains.
Different algorithms have different intensities. For example, a matrix-matrix multiplication is more intensive than a vector dot product. This is because in a matrix multiplication (SGEMM), each element in one matrix is multiplied against N elements (a full row or column) in the other matrix. In a vector dot product (SDOT), each element is multiplied only against one element from the other vector.
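A quick sketch makes the asymmetry concrete. The byte counts below assume f32 (4-byte) elements and ideal caching, so they are optimistic lower bounds on data movement:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte moved), assuming
# f32 elements and ideal caching (each operand streamed from memory once).

def sdot_intensity(n: int, elem_bytes: int = 4) -> float:
    flops = 2 * n                        # n multiplies + n adds
    bytes_moved = 2 * n * elem_bytes     # read both vectors once
    return flops / bytes_moved

def sgemm_intensity(n: int, elem_bytes: int = 4) -> float:
    flops = 2 * n**3                     # n^2 output elements, 2n ops each
    bytes_moved = 3 * n**2 * elem_bytes  # read A and B, write C
    return flops / bytes_moved

print(sdot_intensity(1024))   # 0.25 FLOPs/byte, constant in n
print(sgemm_intensity(1024))  # ~170 FLOPs/byte, grows linearly with n
```

The dot product's intensity is a small constant no matter how long the vectors get, while SGEMM's grows with the matrix dimension.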
performance
(FLOPS/s)
▲
│ ╱
│ memory bandwidth
│ (bytes/second) ╱
│ arithmetic
│ ╱ bandwidth
├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ██──────┬──────
│ ╱ ridge │
│ ╱ point │
│ ╱ │
│ ╱ │
│ ╱ │
│ ╱│ │
│ ╱ │
│ ╱ │ │
│ ╱ │
│ ╱ │ memory- │ compute-
│ ╱ bound │ bound
│ ╱ │ │
│ ╱ │
│ ╱ │ │
│ ╱ │
│ ╱ │ │
│ ╱ │
└──╳────────────┴───────────┴──▶
arithmetic intensity
(FLOPs/byte)
adapted from https://modal.com/gpu-glossary/perf/arithmetic-intensity
As a rule of thumb, a workload that has a small constant arithmetic intensity (e.g., SDOT) will be memory-bound. A workload that has a large constant or linear arithmetic intensity (e.g., SGEMM) will be compute-bound. The intuition is simple enough — if a byte pulled into a compute register is only used once or twice, more work goes into fetching the byte than is needed to operate on it. Meanwhile, if that byte is used many times, the memory fetch is amortized and the computation over it dominates.
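The roofline in the figure above reduces to a one-line formula: attainable throughput is the lesser of peak compute and intensity times memory bandwidth. The hardware figures below are illustrative placeholders, not a specific machine:

```python
# Roofline model: attainable FLOP/s = min(peak compute, intensity x mem BW).
# PEAK_FLOPS and MEM_BW are illustrative, not measurements of real hardware.

PEAK_FLOPS = 2e12   # peak arithmetic bandwidth, FLOP/s
MEM_BW = 3e11       # memory bandwidth, bytes/s

def attainable_flops(intensity: float) -> float:
    # intensity in FLOPs/byte
    return min(PEAK_FLOPS, intensity * MEM_BW)

ridge_point = PEAK_FLOPS / MEM_BW   # ~6.7 FLOPs/byte for these numbers
# Below the ridge point, performance scales with intensity (memory-bound);
# above it, performance is flat at peak compute (compute-bound).
```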
If we imagine the kernel of a vector search, the system fetches each data vector and performs a distance calculation between it and a query vector. This distance function is essentially a vector dot product, multiplying the data and query value in each of the corresponding vector dimensions.
╭ ╮ ╭ ╮
│d1│ │q1│
│d2│ │q2│
│d3│ • │q3│ = ∑ di • qi
│••│ │••│
│di│ │qi│
╰ ╯ ╰ ╯
Since each element in a data vector is used only once by the distance function, the arithmetic intensity of vector search is low. Most of the work goes into pulling many large data vectors into CPU registers. Recognizing this, we can predict that vector search will be bandwidth-bound, as is the case for many analytics and search systems.
Consequently, it doesn’t really matter how efficient the CPU instructions of our distance kernel are (within reason). If we are trying to maximize throughput (queries per second) on a machine, we are going to be limited by the number of data vectors we can run the kernel over each second.
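To put a number on that ceiling: if the scan is bandwidth-bound, per-machine throughput is set by how fast vectors stream into the CPU, not by the kernel's instruction count. The DRAM bandwidth figure here is an illustrative round number:

```python
# Bandwidth ceiling on distance computations for the post's vector shape.
# Assumed: 1024-dim f16 vectors, ~300 GB/s of DRAM bandwidth (illustrative).

DIMS = 1024
BYTES_PER_DIM = 2             # f16
DRAM_BW = 300e9               # bytes/second

bytes_per_vector = DIMS * BYTES_PER_DIM           # 2048 bytes per vector
vectors_per_second = DRAM_BW / bytes_per_vector   # ~146M vectors/s

print(f"ceiling: ~{vectors_per_second / 1e6:.0f}M distance computations/s")
```

No amount of kernel tuning pushes the scan past this rate; only moving fewer bytes (or moving them from a faster tier) does.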
With this insight in hand, our objective with ANN v3 is to utilize cache space efficiently and balance bandwidth demands to prevent the network, disk, or main memory from being a dominant bottleneck that limits the system’s ability to scale.
Let’s take a look at all of the places in turbopuffer’s memory hierarchy that might prevent a bandwidth-bound workload from scaling.
┌─────────────┐
│ CPU │_ _│Size: <1KB
│ Registers │ │ BW: >10 TB/s
├─────────────┤
│ L1/L2/L3 │_ _│Size: KB-MB
│ Cache │ │ BW: 1-10s TB/s
├─────────────┤
│ Main Memory │_ _│Size: GB-TB
│ (DRAM) │ │ BW: 100-500 GB/s
├─────────────┤
│ NVMe SSD │_ _│Size: 1-10s TBs
│ (direct I/O)│ │ BW: 1-30 GB/s
├─────────────┤
│ Object │_ _│Size: PB-EB
│ Storage │ │ BW: 1-10 GB/s
└─────────────┘
We observe four distinct boundaries where bandwidth may become the limiting factor.
Take note of the difference in bandwidth between levels of the hierarchy, but also differences in size. Higher tiers are orders of magnitude smaller, but can service many orders of magnitude higher rates of data loads.
To balance bandwidth across this hierarchy, ANN v3 combines two complementary techniques: hierarchical clustering and binary quantization.
Each exploits the same general strategy of “approximation and refinement”. ANN v3 works by first quickly answering: roughly where is the answer? and only then answering: out of that set, exactly what is the answer?
The first technique is hierarchical clustering in the index structure. Vector indexes in turbopuffer are based on SPFresh, a centroid-based approximate nearest neighbor index that supports incremental updates. In a centroid-based index, vectors are grouped into clusters, each represented by a single "centroid" vector (typically the mean of all vectors in that cluster). At query time, we first compare the query to centroids to identify promising clusters, then search only within those clusters. We extended the SPTAG graph-based index described in the original SPFresh paper, nesting clusters hierarchically in a multi-dimensional tree structure.
While hierarchical clustering is not new to v3, it is a very important aspect of cold query performance in turbopuffer. When a namespace is cold (not cached on SSD), turbopuffer must fetch some or all of it from object storage. Instead of traversing a graph with sequential object storage round-trips to locate the relevant data clusters, the hierarchy bounds the number of round-trips to object storage to the height of the SPFresh tree. This places a bound on tail latency, even for the coldest query.
┌───────────────┐
│ root centroid │
│ c0 │
└───────────────┘
╱ │ ╲
╱ │ ╲
┌────────┐┌────────┐┌────────┐
│centroid││centroid││centroid│
│   c1   ││   c2   ││   c3   │
└────────┘└────────┘└────────┘
╱│╲ ╱│╲ ╱│╲
╱ │ ╲ ╱ │ ╲ ╱ │ ╲
╱ │ ╲ ╱ │ ╲ ╱ │ ╲
┌──────┐ ┌──────┐ ┌──────┐ ┌──────
│ data │ │ data │ │ data │ │ data
│  v1  │  │  v2  │  │  v3  │  │ •••
└──────┘ └──────┘ └──────┘ └──────
In the case of 100 billion vector search, we can’t afford to contend with the low bandwidth of object storage (<5 GB/s) for even a fraction of data vector reads, so we size deployments to store the entire tree on SSD. Yet even when cached, the tree structure complements the hardware.
Clustering interacts well with the memory hierarchy because it provides spatial locality. Vectors closer in space (those likely to be accessed together) are stored contiguously. This makes memory and disk accesses efficient. Specifically, it means that there is very little amplification when reading from lower levels of the hierarchy, even when those levels enforce a minimum read granularity (e.g., 4KiB from disk). Every byte fetched will be put to good use.
Hierarchical clustering, specifically, interacts well with the memory hierarchy because it provides temporal locality. Vector clusters in the upper levels of the tree are accessed frequently, so they will naturally remain resident in main memory. We use a 100x branching factor between levels of our tree to balance tree depth with cluster size. Each node in the tree has approximately 100 children, creating a wide, shallow tree structure. This branching factor roughly matches the size ratio between DRAM and SSD (10x - 50x), meaning that if we can fit all data vectors on SSD, we can fit all centroid vectors in DRAM.
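Sizing the levels of a 100-way tree over this workload's vector shape shows why the branching factor lines up with the DRAM/SSD split. This is a sketch using the post's figures (1024-dim f16 vectors):

```python
# Per-level sizing of a 100-way tree: level L holds ~100**L vectors.
# Vector shape follows the post: 1024 dims x 2 bytes (f16).

BRANCHING = 100
VEC_BYTES = 1024 * 2

sizes = {level: BRANCHING**level * VEC_BYTES for level in range(1, 6)}
for level, size in sizes.items():
    print(f"level {level}: {BRANCHING**level:>15,} vectors, "
          f"{size / 2**30:12,.2f} GiB")
# Levels 1-4 (up to ~191 GiB) fit in DRAM on storage-dense machines;
# level 5 (~18.6 TiB of data vectors) has to live on SSD.
```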
All of this gets us back to the original purpose of the approximate nearest neighbor index: reducing the search space for each query. Approximate indexes are a compromise between performance and recall. For centroid-based indexes, we navigate this compromise by controlling how many clusters are scanned at each level of the tree (often called the "probes" or "beam width" parameter).
For vector search at this scale, we found experimentally that with good clustering, we needed to search about 500 data vector clusters (each 100 vectors large) on each machine to achieve our recall target. This equates to a bandwidth requirement of 100MB per level of the tree:
100 vecs 1024D 2B
500 x ──────── x ───── x ── = 100MB
cluster vector dim per level
We can use this to estimate throughput limits:
| ANN tree levels | # levels | memory hierarchy tier | bandwidth | max qps |
|---|---|---|---|---|
| centroid vectors (upper levels) | 3 | DRAM | 300 GB/s | 1,000 qps (300GB/(3x100MB)) |
| data vectors (lowest level) | 1 | NVMe SSD | 10 GB/s | 100 qps (10GB/(1x100MB)) |
The derivation shows that with hierarchical clustering alone, we will end up disk-bound fetching data vector clusters and maxing out at around 100 queries per second. 100 qps over 100 billion vectors is admirable, but we can do better.
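The derivation above can be reproduced in a few lines, using the bandwidth figures from the table:

```python
# Pre-quantization estimate: bytes scanned per query per tree level, and
# the per-tier QPS ceilings that implies. Bandwidths match the table above.

CLUSTERS_SCANNED = 500
VECS_PER_CLUSTER = 100
VEC_BYTES = 1024 * 2                  # 1024 dims x 2 bytes (f16)

bytes_per_level = CLUSTERS_SCANNED * VECS_PER_CLUSTER * VEC_BYTES  # ~100 MB

dram_qps = 300e9 / (3 * bytes_per_level)  # 3 centroid levels served from DRAM
ssd_qps = 10e9 / (1 * bytes_per_level)    # 1 data level served from SSD

print(f"{bytes_per_level / 1e6:.1f} MB/level -> "
      f"DRAM ~{dram_qps:,.0f} qps, SSD ~{ssd_qps:,.0f} qps")
```

The SSD tier is the minimum of the two ceilings, which is why the whole system tops out near 100 qps.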
The second technique is binary quantization of data vectors. When vectors are inserted into turbopuffer, the system stores both the full-precision vector (f32 or f16 per dimension) and a transparently computed, quantized form of the vector (1 bit per dimension).
╭ ╮ ╭ ╮
│ 0.94 │ │ 1 │
│ -0.01 │ │ 0 │
│ 0.39 │ │ 1 │
│ -0.72 │ ────────────▶ │ 0 │
│ 0.02 │ binary │ 1 │
│ -0.85 │ quantization │ 0 │
│ -0.18 │ │ 0 │
│ 0.99 │ │ 1 │
│ 0.45 │ │ 1 │
╰ ╯ ╰ ╯
The math works out as expected: binary quantization provides 16-32x compression for data vectors (16x for f16 inputs, 32x for f32). This allows these quantized vectors to be stored higher in the memory hierarchy and minimizes their memory bandwidth demands.
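As an illustration, here is a minimal sign-based binary quantizer. This is a simplification: the RaBitQ scheme described below applies a random rotation before taking signs, which is what gives it its error bounds.

```python
# Minimal sign-based binary quantization: one bit per dimension, set when
# the component is positive. A 1024-dim f16 vector (2048 bytes) compresses
# to 128 bytes. Simplified relative to RaBitQ, which rotates vectors first.

def binary_quantize(vec: list[float]) -> bytes:
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits.to_bytes((len(vec) + 7) // 8, "little")

q = binary_quantize([0.94, -0.01, 0.39, -0.72, 0.02, -0.85, -0.18, 0.99])
assert q == bytes([0b10010101])  # bits set at dims 0, 2, 4, 7
```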
To avoid this compression leading to a loss of search quality, ANN v3 employs the RaBitQ quantization method. RaBitQ exploits the mathematical properties of high-dimensional space (concentration of measure) to compress aggressively while preserving high recall. Specifically, in high dimensions, vector components naturally become more uniformly distributed, which means quantization errors are spread evenly across all dimensions rather than concentrated in problematic directions. This uniform error distribution enables RaBitQ to provide tight theoretical error bounds alongside distance estimations made on quantized vectors.
For example, consider a data vector Vd and query vector Vq. A full-precision distance computation might compute the cosine distance between Vd and Vq as 0.75. Now consider their binary quantized forms, Vd' and Vq'. Due to the loss of information from quantizing, RaBitQ cannot perfectly compute the true cosine distance by looking just at the quantized vectors. Instead, it will compute an estimated range like [0.69, 0.83].
Despite being imprecise, this confidence interval can be used to conclude that Vd is closer to Vq than some other vector Vd2 whose quantized distance estimate has a range [0.87, 0.91]. For other data vectors with overlapping distance estimate ranges (e.g., [0.51, 0.77]), RaBitQ makes no promises. Such vectors must be recompared using their full precision vectors.
During a vector search, turbopuffer first evaluates the search on quantized vectors, uses the error bounds to determine all vectors that could be in the true top-k, then fetches their corresponding full-precision vectors, and reranks over those to compute the final result. In practice, we find that less than 1% of data vectors in the narrowed search space need to be reranked to avoid an impact on recall.
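The pruning step can be sketched with the distance intervals from the example above. The function and variable names here are illustrative, not turbopuffer's API:

```python
# "Approximate then refine" pruning with distance-estimate intervals.
# Each candidate carries an estimated range [lo, hi] from its quantized
# form; only candidates that could still crack the top-k get reranked.

def prune_candidates(estimates: list[tuple[str, float, float]], k: int):
    """estimates: (id, lo, hi) per candidate; returns ids needing rerank."""
    # k-th smallest upper bound: any candidate whose lower bound beats
    # this threshold could still belong in the true top-k.
    threshold = sorted(hi for _, _, hi in estimates)[k - 1]
    return [vid for vid, lo, _ in estimates if lo <= threshold]

survivors = prune_candidates(
    [("v1", 0.69, 0.83), ("v2", 0.87, 0.91), ("v3", 0.51, 0.77)], k=2)
# v2's best case (0.87) can't beat the 2nd-smallest upper bound (0.83),
# so only v1 and v3 move on to full-precision reranking.
```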
These two techniques compose, and their benefits multiply.
Upper levels of a quantized ANN tree are small but frequently accessed. As a result, they naturally remain resident all the way up in the L3 CPU cache. We can write out the math to demonstrate this to ourselves. With a branching factor of 100, each level L in the tree contains 100^L * 1024/8 bytes of quantized vector data (1 bit per dimension). With a 504MiB shared L3 cache, we can fit all three upper levels of the tree in L3 cache, the largest requiring 100^3 * 128 bytes ≈ 128 MB of cache space.
The lowest level of the quantized ANN tree is stored in DRAM, as was the case before. However, because these vectors are compressed, they require less memory bandwidth to access.
Meanwhile, the full-precision vectors remain on local SSD. However, only a small fraction is fetched during the reranking phase, through a highly concurrent scatter-gather. This access pattern is ideal for modern NVMe drives, which have random read latencies around 50-100 microseconds and excel at handling many parallel I/O operations simultaneously, allowing us to access the full bandwidth of the disks.
We can again estimate throughput limits. Remember that quantized vectors are 16x smaller than unquantized f16 vectors, equating to a bandwidth requirement of 500 x 100 x 1024 x 2 / 16 ≈ 6MB per level of the tree.
| ANN tree levels | # levels | memory hierarchy tier | bandwidth | max qps |
|---|---|---|---|---|
| quantized centroid vectors (upper levels) | 3 | L3 cache | 600GB/s | 33,000 qps (600GB/(3x6MB)) |
| quantized data vectors (lowest level) | 1 | DRAM | 300GB/s | 50,000 qps (300GB/(1x6MB)) |
| full precision data vectors (lowest level) | 1% of 1 | NVMe SSD | 10 GB/s | 10,000 qps (10GB/(0.01x100MB)) |
The changes here demonstrate a remarkable dynamic in the memory hierarchy. Data compression both reduces bandwidth requirements and allows data to remain resident in higher bandwidth tiers at the same time, leading to a multiplicative effect. Whereas before the theoretical throughput limit was 100 qps, quantization unlocks a theoretical limit of 10,000 qps!
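Rerunning the earlier estimate with quantization in place shows where the new ceiling comes from:

```python
# Post-quantization estimate: a 16x smaller scan per level, plus a rerank
# phase that touches ~1% of full-precision vectors on SSD.

bytes_per_level = 500 * 100 * 1024 * 2       # ~100 MB at f16
q_bytes_per_level = bytes_per_level // 16    # ~6.4 MB at 1 bit/dim

l3_qps = 600e9 / (3 * q_bytes_per_level)     # 3 centroid levels in L3 cache
dram_qps = 300e9 / (1 * q_bytes_per_level)   # quantized data level in DRAM
ssd_qps = 10e9 / (0.01 * bytes_per_level)    # 1% full-precision rerank on SSD

# min(l3_qps, dram_qps, ssd_qps) ~ 9,800 qps: the SSD rerank phase is now
# the theoretical ceiling, matching the ~10,000 qps figure in the table.
```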
By combining hierarchical clustering with binary quantization, ANN v3 makes efficient use of cache space and balances bandwidth demands across the tiers of the memory hierarchy.
However, if we run a production load against it, we notice something interesting. Instead of hitting the theoretical 10,000 qps, the system ends up saturated around 1,000 qps, 10% of where we’d like to be.
What is happening here? Why are we no longer limited by disk bandwidth, and instead limited somewhere else?
Returning to our framework for arithmetic intensity, we discover what changed. When we added in binary quantization, intensity shot up. With f16 vector elements, every two bytes fetched from memory were used for a single logical operation. With binary quantized elements, however, every two bytes fetched are used for sixteen logical operations (one per bit). In fact, the RaBitQ algorithm reuses each bit four times when computing distance estimates (section 3.3.2 of the paper), leading to 64x higher arithmetic intensity.
This large constant arithmetic intensity is enough to tip the scale towards the system being compute-bound. Each CPU core in our system cannot keep up with the rate of highly compressed vector data being fed to it, bottlenecking throughput.
Optimizing a compute-bound system is a different game, requiring an obsessive focus on the instructions the CPU executes and how efficiently it executes them.
As an example, we were recently surprised to find that AVX2 (256-bit x86 SIMD) does not have an accelerated popcount instruction. Counting the number of bits set in a series of bytes is an important operation for RaBitQ, so making this operation fast was essential. We switched a core distance estimation kernel over to AVX-512 to gain access to its VPOPCNTDQ instruction. This instruction is capable of counting the bits set to one across a 512-bit register in only 3 CPU cycles of latency. Better yet, it can be pipelined for a throughput of 1 instruction per cycle.
Making the switch improved the performance of the kernel in microbenchmarks by 30% and improved end-to-end production throughput by about 5%. When microbenchmarks translate to end-to-end performance gains, you have correctly identified the system’s bottleneck.
A more general discussion of optimizing a compute-bound system is a deeper topic than we have room for in this post, but it remains a focus of our work to this day as we continue to optimize ANN v3 for higher throughput.
Careful readers may have noticed one inconsistency in the description of ANN v3 so far. I mentioned up top that our goal was to handle vector search over 200 TiB of dense vector data. We also discussed that for this workload shape, all of this data should be cached on SSD to avoid the limited bandwidth and high latency of object storage. And yet, NVMe SSDs only get so large...
This is where distribution comes in. Modern clouds provide storage-dense VMs with fast locally attached NVMe drives (e.g., GCP's z3, AWS's I7i, Azure's Lsv4). These drives are large (10-40TiB) but not 200TiB large. To achieve the desired aggregate SSD capacity, we use a cluster of storage-optimized machines, each storing a subset of the index.
At this scale, simple random sharding works well. During ingestion, each vector is randomly assigned to one of the shards of the index. An ANN search query is broadcast to all shards, and the global top-k is stitched together from the sub-results.
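The scatter-gather merge can be sketched in a few lines. Shard fan-out is shown here as already-collected per-shard results; production would issue the shard queries concurrently, and the names are illustrative, not turbopuffer's API:

```python
# Merge per-shard top-k results into a global top-k. Each shard returns
# its candidates sorted by distance, so heapq.merge can combine them
# without re-sorting everything.

import heapq

def global_top_k(shard_results: list[list[tuple[float, str]]], k: int):
    """shard_results: per-shard lists of (distance, id), each pre-sorted."""
    return [vid for _, vid in heapq.merge(*shard_results)][:k]

merged = global_top_k(
    [[(0.12, "a"), (0.80, "b")], [(0.30, "c")], [(0.05, "d"), (0.95, "e")]],
    k=3)
# -> ["d", "a", "c"]
```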
This technique can scale to arbitrarily large indexes, but its cost scales linearly with the number of machines in the cluster. This is why it is crucial to maximize the efficiency of a single machine before turning to distribution. Moving in the other direction leads to an unnecessarily expensive system.
turbopuffer customers can use this technique now by splitting their largest indexes into 4 TiB namespaces and randomly assigning vectors across them. Then, contact us to pin each shard to its own SSD (this option will soon be available directly in the dashboard). We're also working on making the sharding fully transparent in the future.
When performance at scale becomes cost-efficient, it stops being a benchmarking exercise and becomes a building block.
The ANN v3 architecture pushes turbopuffer to 100 billion-vector scale at thousands of QPS, while holding p99 latency at 200ms. More importantly, it does this while keeping costs low enough to run continuously in production. It achieves this through hierarchical clustering that matches the memory hierarchy, binary quantization that compresses vectors by 16-32x with reranking to maintain recall, and distribution across storage-dense machines, all working together to maximize hardware utilization and minimize bottlenecks.
We host 2.5T+ documents, handle writes at 10M+ writes/s, and serve 10k+ queries/s. We are ready for far more. We hope you'll trust us with your queries.
This is legitimately pretty impressive. I think the rule of thumb is now: go with Postgres (pgvector) for vector search until it breaks, then go with turbopuffer.
Qdrant is also a good default choice, since it can work in-memory for development, with a hard drive for small deployments and also for "web scale" workloads.
As a principal eng, side-stepping a migration and having a good local dev experience is too good of a deal to pass up.
That being said, turbopuffer looks interesting. I will check it out. Hopefully their local dev experience is good
Qdrant is one of the few vendors I actively steer people away from. Look at the GitHub issues, look at what their CEO says, look at their fake “advancements” that they pay for publicity on…
The number of people I know who’ve had unrecoverable shard failures on Qdrant is too high to take it seriously.
I’m curious about this. Could you please point to some things the CEO has said, or reports of shard failures?
The bit about paying for publicity doesn’t bother me.
Edit: I haven’t found anything egregious that the CEO has said, or anything really sketchy. The shard failure warnings look serious, but the issues look closed
https://x.com/nils_reimers/status/1809334134088622217?s=46
https://x.com/generall931/status/1809303448837582850?s=46
There used to be a benchmarking issue with a founder that was particularly egregious but I can’t find it anymore.
The sharding and consensus issues were from around a year and a half ago, so maybe it’s gotten better.
There are just so many options in the space, I don’t know why you’d go with one of the least correct vendors (whether or not the correctness is deception is a different question that I can’t answer)
> issue with a founder
That would be me
What do I say? Happy to talk about "fakes". Here is my calendar. Feel free to book a slot. https://qdrant.to/andre-z
For local dev + testing, we recommend just hitting the production turbopuffer service directly, but with a separate test org/API key: https://turbopuffer.com/docs/testing
Works well for the vast majority of our customers (although we get the very occasional complaint about wanting a dev environment that works offline). The dataset sizes for local dev are usually so small that the cost rounds to free.
> although we get the very occasional complaint about wanting a dev environment that works offline
It's only occasional because the people who care about dev environments that work offline are most likely to just skip you and move on.
For actual developer experience, as well as a number of use cases like customers with security and privacy concerns, being able to host locally is essential.
Fair enough if you don't care about those segments of the market, but don't confuse a small number of people asking about it with a small number of people wanting it.
As someone who works for a competitor, they are probably right holding off on that segment for a while. Supporting both cloud and local deployments is somewhere between 20% harder and 300% harder depending on the day.
I'm watching them with excitement. We all learn from each other. There's so much to do.
Can confirm. With a setup that works offline, one can
- start small on a laptop. Going through procurement at companies is a pain
- test things in CI reliably. Outages don’t break builds
- transition from laptop scale to web scale easily with the same API with just a different backend
Otherwise it’s really hard to justify not using S3 vectors here
The current dev experience is to start with faiss for PoCs, move to pgvector and then something heavy duty like one of the Lucene wrappers.
Yep, we're well aware of the selection bias effects in product feedback. As we grow we're thinking about how to make our product more accessible to small orgs / hobby projects. Introducing a local dev environment may be part of that.
Note that we already have a in-your-own-VPC offering for large orgs with strict security/privacy/regulatory controls.
That’s not local though
having a local simulator (DynamoDB, Spanner, others) helps me a lot for offline/local development and CI. when a vendor doesn't offer this, I often end up mocking it out (one way or another) and have to wait for integration or e2e tests for feedback that could have been pushed further to the left.
in many CI environments unit tests don't have network access, it's not purely a price consideration.
(not a turbopuffer customer but I have been looking at it)
> in many CI environments unit tests don't have network access, it's not purely a price consideration.
I've never seen a hard block on network access (how do you install packages/pull images?) but I am sympathetic to wanting to enforce that unit tests run quickly by minimizing/eliminating RTT to networked services.
We've considered the possibility of a local simulator before. Let me know if it winds up being a blocker for your use case.
> how do you install packages/pull images
You pre-build the images with packages installed beforehand, then use those image offline.
My point is it's enough of a hassle to set up that I've yet to see that level of restriction in practice (across hundreds of CI systems).
Look into Bazel, a very standard build system used at many large tech companies. It splits fetches from build/test actions and allows blocking network for build/test actions with a single CLI flag. No hassle at all.
The fact that you haven't come across this kind of setup suggests that your hundreds of CI systems are not representative of the industry as a whole.
I agree our sample may not be representative but we try to stay focused on the current and next crop of tpuf customers rather than the software industry as a whole. So far "CI prohibits network access during tests" just hasn't come up as a pain point for any of them, but as I mentioned in another comment [0], we're definitely keeping an open mind about introducing an offline dev experience.
(I am familiar with Bazel, but I'll have to save the war stories for another thread. It's not a build tool we see our particular customers using.)
you pull packages from a trusted package repository, not from the internet. this is not rare in my experience (financial services, security) and will become increasingly common due to software supply chain issues.
I should have clarified, by local dev and testing I did in fact mean offline usage.
Without that it’s unfortunately a non starter
So I can note this down on our roadmap, what's the root of your requirement here? Supporting local dev without internet (airplanes, coffee shops, etc.)? Unit test speed? Something else?
I listed some reasons in another comment: https://news.ycombinator.com/item?id=46757853
I appreciate your responsiveness and open mind
I'd love to know how they compare versus MixedBread, what relative strengths each has. https://www.mixedbread.com/
I really really enjoy & learn a lot from the mixedbread blog. And they find good stuff to open source (although the product itself is closed). https://www.mixedbread.com/blog
I feel like there's a lot of overlap but also probably a lot of distinction too. Pretty new to this space of products though.
seems like a good rule of thumb to me! though i would perhaps lump "cost" into the "until it breaks" equation. even with decent perf, pg_vector's economics can be much worse, especially in multi-tenant scenarios where you need many small indexes (this is true of any vector db that builds indexes primarily on RAM/SSD)
Are there vector DBs with 100B vectors in production which work well? There was a paper which showed a 12% loss in accuracy at just 1 million vectors. Maybe some kind of logical sharding is another option, to improve both accuracy and speed.
I don't know at these scales, but at 1M-100M, we found switching from out-of-box embeddings to fine-tuning our embeddings gave less of a sting in the compression/recall trade-off. We had a 10-100X win here wrt comparable recall with better compression.
I'm not sure how that'd work with the binary quantization phase though. For example, we use Matryoshka embeddings, and some of the bits matter way more than others, so that might be super painful.
So many missing details...
Different vector indexes have very different recall and even different parameters for each dramatically impact this.
HNSW can have very good recall even at high vector counts.
There's also the embedding model, whether you're quantizing, if it's pure rag vs hybrid bm25 / static word embeddings vs graph connections, whether you're reranking etc etc
the solution described in the blog post is currently in production at 100B vectors
For what/who?
unfortunately i'm not able to share the customer or use case :( but the metrics that you see in the first charts in the post are from a production cluster
this is actually not how cursor uses turbopuffer, as they index per codebase and thus need many mid-sizes indexes as opposed to one massive index as this post describes
For those of us who operate on site, we have to add back network latency, which negates this win entirely and makes a proprietary cloud solution like this a nonstarter.
Often not a dealbreaker, actually! We can spin up new tpuf regions and procure dedicated interconnects to minimize latency to the on-prem network on request (and we have done this).
When you're operating at the 100B scale, you're pushing beyond the capacity that most on-prem setups can handle. Most orgs have no choice but to put a 100B workload into the nearest public cloud. (For smaller workloads, considerations are different, for sure.)