450× Faster Joins with Index Condition Pushdown

2025-08-19 15:43 · readyset.io


Introduction

Readyset is designed to serve queries from cached views with sub-millisecond latency. This post focuses on the cold path—cases where a cache miss forces execution against the base tables. In these scenarios, Readyset must evaluate the query from scratch, including materializing intermediate results. The focus here is on straddled joins, where filtering predicates apply to both sides of the join in addition to the ON clause.

Example:

SELECT u.id, u.name, o.order_id, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.email = 'some@provide.com'
  AND o.status = 'SHIPPED';

In a previous optimization, we transitioned from nested loop joins to hash joins (see https://readyset.io/blog/introducing-hash-join-algorithm), eliminating the need to repeatedly scan one side of the join. This brought significant performance improvements for many workloads; however, some queries remained problematic.

Straddled joins are common in production workloads; for example, filtering users by some property on the left table and orders by a status flag on the right table, then joining on user_id. While our hash join algorithm improved over nested loops, it was still inefficient in these cases, especially when one side’s predicate had low cardinality (like a boolean flag or status column).

This post explains why the old execution strategy struggled, and how our new optimization using Index Condition Pushdown (ICP) changes the execution model to make straddled joins much more efficient.

Previous algorithm

During evaluation of straddled joins on cache misses, we observed significant latency in upqueries. To identify the underlying bottleneck, we profiled query execution and analyzed the resulting flamegraphs.

Approximately 30% of the query execution time was attributed to data decompression. We initially suspected the RocksDB compression algorithm as the bottleneck and proceeded to benchmark alternative compression methods to validate this hypothesis.

Switching to ZSTD did not improve performance. Decompression remained a dominant contributor to query execution time. As the next step, we disabled compression in RocksDB entirely to isolate its impact. This came with a space tradeoff but was necessary to confirm whether compression was the root cause:

Disabling compression didn't eliminate the problem; it merely shifted the bottleneck. The system began spending the majority of execution time on disk reads, as evidenced by increased I/O activity on ext4.

This made it clear that compression wasn’t the issue; rather, the excessive amount of data being read from disk was the primary cause. Isolated iostat output for the query confirmed this:

Device        r/s      rkB/s   rrqm/s  %rrqm  r_await  rareq-sz  aqu-sz  %util
nvme1n1  10068.00   55728.00     0.00   0.00     0.08      5.54    0.79  81.00

(All write, discard, and flush columns were 0.00 and are omitted for readability.)

The NVMe device was handling approximately 10K IOPS, with the majority of reads around 5KB in size. Despite their small size, the device reached ~81% utilization while executing a single join query, indicating I/O saturation.

Upon examining the query pattern, we identified that one side of the join had a low-cardinality index, causing the engine to scan nearly the entire table on each execution. Due to internal constraints (such as maintaining key provenance and supporting incremental view maintenance), the join engine was independently evaluating both sides of the join, regardless of selectivity.

Execution example

SELECT u.id, u.name, o.order_id, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.email = 'some@provide.com'
  AND o.status = 'SHIPPED';

Scenario:

  • Users table: The predicate u.email = 'some@provide.com' is highly selective, returning exactly one row.
  • Orders table: The predicate o.status = 'SHIPPED' has very low selectivity: ~99% of all rows match.

Old Execution Strategy: Hash Joins

The old algorithm relied on the hash join approach:

  1. Lookup both sides independently:
    1. users: filter by email → 1 row.
    2. orders: filter by status → ~99% of the table.
  2. Materialize both result sets:
    1. Millions of orders rows, plus 1 users row.
  3. Build and probe a hash table: 
    1. Hash on o.user_id.
    2. Probe with the user row. 
    3. Discard nearly all rows after the join.
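The three steps above can be sketched as a toy model in Python (a hypothetical illustration of the strategy, not Readyset's actual dataflow code; the row layout and data are invented for the example):

```python
# Toy model of the old strategy: filter both sides independently,
# materialize both result sets, then build and probe a hash table.

def hash_join_straddled(users, orders):
    # Step 1: look up both sides independently.
    left = [u for u in users if u["email"] == "some@provide.com"]  # 1 row
    right = [o for o in orders if o["status"] == "SHIPPED"]        # ~99% of rows

    # Step 2/3: materialize the large side into a hash table on o.user_id...
    by_user = {}
    for o in right:
        by_user.setdefault(o["user_id"], []).append(o)

    # ...then probe with the single user row; nearly everything
    # materialized above is discarded here.
    return [(u, o) for u in left for o in by_user.get(u["id"], [])]
```

The cost is dominated by materializing `right`, which is proportional to the size of the whole orders table no matter how selective the left side is.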

Why This Is Inefficient:

  • High I/O: We had to read essentially the entire orders table.
  • High memory usage: Millions of rows materialized only to discard them.
  • Wasteful CPU work: The join considered far more rows than necessary.

New Execution Strategy: Index Condition Pushdown

The new algorithm issues an initial upquery to one side of the join, then combines the resulting join key with the original predicates on the other side. This composite condition is pushed down to RocksDB, allowing index-based retrieval of only the rows required to satisfy the join. This eliminates unnecessary data reads and avoids full-table scans.

Step‑by‑Step:

  1. Apply the left predicate first: u.email = 'some@provide.com' → 1 row (u.id = 123).
  2. Group rows by join key: collect distinct values {123}.
  3. For each join key, build a right‑side lookup:

WHERE o.status = 'SHIPPED' AND o.user_id = 123;

  4. Leverage storage engine indexes: with an index on (user_id, status), fetch only matching rows.
  5. Build the result set incrementally: combine left row(s) with right rows per join key; no full materialization needed.
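Under the same toy model as before, the ICP strategy looks like this (a hypothetical sketch; the dictionary stands in for a compound (user_id, status) index in the storage engine):

```python
# Toy model of the new strategy: evaluate the selective left
# predicate first, then issue keyed lookups on the right side.

def icp_join_straddled(users, orders_index):
    # Step 1: apply the left predicate -> one row (u.id = 123).
    left = [u for u in users if u["email"] == "some@provide.com"]

    # Steps 2-5: for each distinct join key, fetch only the right-side
    # rows matching both the join key and the pushed-down predicate.
    results = []
    for u in left:
        for o in orders_index.get((u["id"], "SHIPPED"), []):
            results.append((u, o))
    return results
```

Because the right side is only ever touched through keyed lookups, the work done is proportional to the number of matching rows, not to the size of the orders table.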

Benchmark: Measuring the Impact

To quantify the performance improvement brought by the new Index Condition Pushdown strategy, we ran a controlled benchmark comparing the old hash join algorithm against the new execution model on the same hardware and dataset.

Old Algorithm: Hash Join

Under the previous strategy, Readyset executed straddled joins using independent filtering on both sides, followed by full materialization and hash-based probing. This approach was highly inefficient when one side had low-cardinality filters.

Results:

  • Throughput: 7.0 events/s
  • Latency (avg): 2,284 ms
  • 95th percentile latency: 4,129 ms
  • Max latency: 6,831 ms
  • Total events processed: 4,203 over 10 minutes

The system was CPU- and IO-bound, spending excessive time reading and decompressing unnecessary rows from disk, most of which were discarded after join evaluation.

New Algorithm: Index Condition Pushdown

With the ICP-enabled join model, Readyset instead defers right-side lookups until left-side predicates are evaluated, allowing the use of compound indexes (e.g. (user_id, status)) to fetch only relevant rows. This minimizes materialization and disk reads.
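A compound index turns the right-side lookup into a bounded range scan rather than a table scan. A minimal sketch of why an ordered key space supports this, using a sorted Python list of hypothetical (user_id, status, order_id) keys in place of the storage engine's sorted key space:

```python
import bisect

# Hypothetical compound-index keys: (user_id, status, order_id),
# kept in sorted order as an LSM store like RocksDB would.
keys = sorted([
    (42, "PENDING", 5),
    (123, "PENDING", 7),
    (123, "SHIPPED", 3),
    (123, "SHIPPED", 9),
    (500, "SHIPPED", 4),
])

def prefix_scan(keys, user_id, status):
    # Seek to the first key >= (user_id, status), then read forward
    # until the prefix no longer matches -- no full scan required.
    start = bisect.bisect_left(keys, (user_id, status))
    out = []
    for k in keys[start:]:
        if k[:2] != (user_id, status):
            break
        out.append(k)
    return out
```

Here `prefix_scan(keys, 123, "SHIPPED")` touches only the two matching entries, which is the access pattern the pushed-down composite condition enables.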

Results:

  • Throughput: 3,214.4 events/s
  • Latency (avg): 4.98 ms
  • 95th percentile latency: 11.87 ms
  • Max latency: 3,467.2 ms
  • Total events processed: 1,928,645 over 10 minutes

This represents a >450x improvement in both throughput and average latency. The join strategy is now highly cache- and index-efficient, with the storage engine returning only rows that match precise key lookups.
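The headline ratio follows directly from the two benchmark runs above:

```python
# Speedup derived from the reported benchmark figures.
throughput_gain = 3214.4 / 7.0   # old: 7.0 events/s, new: 3214.4 events/s
latency_gain = 2284 / 4.98       # old avg: 2,284 ms, new avg: 4.98 ms
# Both ratios come out to roughly 459x, hence ">450x".
```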

Conclusion

While Readyset excels in delivering low-latency results via caching, optimizing the cold path is critical for consistent performance during cache misses. By rethinking how straddled joins are executed and leveraging Index Condition Pushdown, we’ve significantly improved real-world performance for these workloads.



Comments

  • By jamesblonde · 2025-08-23 15:01

    We call these pushdown joins in rondb. They only support an equality condition for the index condition. Joins with index condition pushdown is a bit of a mouthful.

    We also went from like 6 seconds to 50ms. Huge speedup.

    Reference

    https://docs.rondb.com/rondb_parallel_query/#pushdown-joins

  • By _1tem · 2025-08-23 19:09

    Shouldn't the query planner catch things like this? Sounds like a performance bug if this happens in Postgres.

    • By HDThoreaun · 2025-08-23 20:39

      Yea this is pretty fucking basic stuff. Any competent optimization engine should be doing this. "push down indexes as much as possible" is literally the first thing a query planner should be trying to do

      • By ncruces · 2025-08-23 21:30

        Yes.

        But here they are deciding between "pushdown o.status==shipped" and "pushdown u.email==address@", in parallel both, then join (which they already did) or first doing "u.email==address@" then pushing down "u.id==o.user_id" mostly.

        This is a judgment call. Their planner is pretty dumb to not know which one is better, but “push down as much as possible” doesn't cut it: you need to actually decide what to push down and why.

        • By HDThoreaun · 2025-08-23 21:56

          No, it is not a judgement call. The query planner should be storing the distributions of the values in every index. This makes it obvious which pushdown to do here. Again, basic stuff. You're right though, it's not quite as simple as "push down as much as possible"; it is one step past that.

          • By Jupe · 2025-08-24 02:02

            Agreed. Isn't this precisely why key statistics (table statistics) are maintained in many DB systems? Essentially, always "push down" the predicate with the worst statistics and always execute (early) the predicates with high selectivity.

            I'd be very surprised if virtually every RDBMS doesn't do this already.

          • By jmalicki · 2025-08-23 22:13

            Without storing the joint distribution of the values corresponding to the conditions that span multiple tables, it can be hard to know what's a win.

            • By Seattle3503 · 2025-08-24 02:53

              Postgres by default computes univariate stats for each column and uses those. If this is producing bad query plans, you can extend the statistics to be multivariate for select groups of columns manually. But to avoid combinatorial growth of stats-related storage and work, you have to pick the columns by hand.

              • By jmalicki · 2025-08-24 16:29

                That still isn't multivariate stats across a join.

      • By zmmmmm · 2025-08-24 01:07

        I had to dig through to see the details of what database was really in play here, and sure enough, it's a wrapper around a key-value store (RocksDB). So while I'll confess I know little about RocksDB it does sound an awful lot like they threw out a mature relational database engine with built in optimization and now are in the process of paying the price for that by manually optimizing each query (against a key-value store no less, which probably fundamentally limits what optimizations can be done in any general way).

        Would be curious if any RocksDB knowledgeable people have a different analysis.

        • By hxtk · 2025-08-24 02:09

          > against a key-value store no less, which probably fundamentally limits what optimizations can be done in any general way

          I would disagree with this assumption for two reasons: first, theoretically, a file system is a key value store, and basically all databases run on file systems, so it stands to reason that any optimization Postgres does can be achieved as an abstraction over a key-value store with a good API because Postgres already did.

          Second, less theoretically, this has already been done by CockroachDB, which stores data in Pebble in the current iteration and previously used RocksDB (pebble is CRDB’s Go rewrite of RocksDB) and TiDB, which stores its data in TiKV.

          A thin wrapper over a KV store will only be able to use optimizations provided by the KV store, but if your wrapper is thick enough to include abstractions like adding multiple tables or inserting values into multiple cells in multiple tables atomically, then you can build arbitrary indices into the abstraction.

          I wouldn’t tend to call a KV store a bad database engine because I don’t think of it as a database engine at all. It might technically be one under the academic definition of a database engine, but I mostly see it being used as a building block in a more complicated database engine.

        • By marceloaltmann · 2025-08-25 19:10

          Readyset is an Incremental View Maintenance cache that is powered by a dataflow graph to keep caches (result sets) up-to-date as the underlying data changes on the database (MySQL/PostgreSQL). RocksDB is only used as the persistent storage here, and the whole optimization is done for the DFG execution and not related to the persistent storage itself.

  • By gvsg-rs · 2025-08-25 19:44

    Hi, I’m from Readyset. We hadn’t realized this post had picked up traction here, but I wanted to share a bit more context.

    Some folks pointed out that index pushdowns and join optimizations aren’t novel. That’s fair. In a traditional database engine, pushdowns and access path selection are standard. But Readyset isn’t a conventional engine.

    When you create a materialized view in Readyset, the query is compiled into a dataflow graph designed for high-throughput incremental updates, not per-request planning. We receive changes from the upstream database’s replication stream and propagate deltas through the graph. Reads typically hit the cache directly at sub-ms latency.

    But when a key hasn’t yet been materialized, we perform what we call an upquery -- a one-off pull from the base tables (stored in RocksDB) to hydrate the missing result. Since we don’t re-plan queries on each request, the structure of that upquery, including filter pushdowns and join execution, is precompiled into the dataflow.

    Straddled joins, where filtering is required on both sides of the join, are especially tricky in this model. Without smarter pushdown, we were overfetching data and doing unnecessary join work. This optimization pushes composite filters into both sides of the join to reduce RocksDB scans and hash table size.

    It’s a well-known idea in the context of traditional databases, but making it work in a static, incrementally maintained dataflow system is what makes it unique here.

    Happy to go deeper if folks are curious. Appreciate the thoughtful feedback.

HackerNews