We made Postgres writes faster, but it broke replication

2025-07-21 11:26 · paradedb.com

Zero-ETL search and analytics for Postgres

When we built pg_search, the Postgres extension for search and analytics, write throughput was a priority. To be an effective alternative to Elasticsearch we needed to support high ingest workloads in real time. This is because many Elasticsearch use cases — like real-time dashboards, e-commerce search, and recommendation systems — involve continuous writes that must be indexed and made searchable immediately.

In a vanilla Postgres implementation, full-text search is backed by a B-tree or GIN (Generalized Inverted Index) structure. These indexes are good for relatively fast lookups, but they aren’t so fast for writes.

We opted for a data structure optimized for writes: a Log-Structured Merge (LSM) tree. That was great for write throughput, but it broke Postgres replication! Specifically, it broke physical replication, one of the two mechanisms that allow Postgres to replicate data from a primary node across one or more read replicas. The other mode is logical replication, which sends individual row changes to replicas instead of copying the database byte-for-byte.

It turned out that Postgres's out-of-the-box support for physical replication, built on Write-Ahead Log (WAL) shipping, isn't quite enough for an advanced data structure like an LSM tree to be replication-safe. We were surprised to learn this, so we decided to write up our experience and describe how we fixed the problem. In this post, we'll do a deep dive into:

  1. What is an LSM tree?
  2. What it means for an LSM tree to be replication-safe
  3. How Postgres's WAL shipping guarantees physical consistency
  4. Why atomic logging was necessary for physical consistency
  5. How we leveraged a little-known but powerful Postgres setting called hot_standby_feedback

What is an LSM Tree?

A Log-Structured Merge Tree (LSM tree) is a write-optimized data structure commonly used in systems like RocksDB and Cassandra.

The core idea behind an LSM tree is to turn random writes into sequential ones. Incoming writes are first stored in an in-memory buffer called a memtable, which is fast to update. Once the memtable fills up, it is flushed to disk as a sorted, immutable segment file (often called an SSTable).

These segment files are organized by size into layers or levels. Newer data is written to the topmost layer. Over time, data is gradually pushed down into lower levels through a process called compaction, where data from smaller segments is merged, deduplicated, and rewritten into larger segments.
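The write path above can be sketched in a toy model. This is illustrative Python, not pg_search's actual implementation (which is written in Rust on Postgres block storage); the class name and flush threshold are invented for the example:

```python
import bisect

class ToyLSM:
    """Toy LSM tree: a mutable in-memory memtable plus sorted, immutable segments."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # illustrative flush threshold
        self.memtable = {}                    # fast, mutable write buffer
        self.segments = []                    # newest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value            # a random write becomes a cheap in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort once, then write out sequentially, like an SSTable.
        self.segments.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in self.segments:         # newest first, so the first hit wins
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments, keeping only the newest version of each key.
        merged = {}
        for segment in reversed(self.segments):  # oldest first; newer data overwrites
            merged.update(dict(segment))
        self.segments = [sorted(merged.items())]
```

Note that nothing is updated in place: writes land in the memtable, flushes append whole sorted segments, and compaction rewrites segments wholesale. These multi-block rewrites are exactly what makes replication tricky later on.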

What Do We Mean by Replication Safety?

A reliable distributed datastore (one which guarantees “replication safety”) must demonstrate both physical and logical consistency [1][2] across database replicas.

  1. Physical consistency means the replica contains structurally valid data — each page or block on disk is well-formed and corresponds to a state that did exist on the primary at some point.
  2. Logical consistency ensures that the data on the replica reflects a coherent and stable view of the database, something that could have been seen by a transaction on the primary.

A physically consistent state is not always a logically consistent state. Specifically, if you take a snapshot of a physically consistent replica while replicating an in-flight transaction, it may not be logically consistent. A good analogy is to imagine replicating a book. Physical consistency is like copying every page exactly, even if you're in the middle of a chapter — you're guaranteed to have real pages, but you might end up with half a sentence or a missing footnote. Logical consistency is like waiting until the chapter is finished before copying it, ensuring the result makes sense to a reader.

WAL Shipping: How Postgres Guarantees Physical Consistency

In a primary-standby physical replication setup, a primary server is paired with a standby server that acts as a read-only copy of its leader. The servers remain in sync by using a Write-Ahead Log (WAL) to record all binary-level changes to storage blocks on the primary server before they occur. Changes to this append-only WAL file are then streamed to the standby (a process called “log shipping”) and applied in the order received. This process enables near-realtime data synchronization between the two servers, hence the phrase “hot standby”.
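As a mental model, WAL shipping reduces to an append-only log of page changes that the standby replays in the order received. A toy sketch (illustrative Python, not Postgres internals):

```python
class Node:
    """A server modeled as a bag of fixed-size pages (blocks)."""
    def __init__(self):
        self.pages = {}  # page number -> page contents

    def apply(self, record):
        page_no, data = record
        self.pages[page_no] = data

primary, standby = Node(), Node()
wal = []  # the append-only Write-Ahead Log

def primary_write(page_no, data):
    wal.append((page_no, data))      # write-ahead: record the change first...
    primary.apply((page_no, data))   # ...then modify the page itself

primary_write(0, b"hello")
primary_write(1, b"world")
primary_write(0, b"HELLO")

for record in wal:                   # "log shipping": replay in order
    standby.apply(record)

assert standby.pages == primary.pages  # byte-identical state
```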

Why Atomicity is a Requirement for Physical Consistency

Atomicity is a requirement for physical consistency because Postgres locks are not replayed on replica servers: replaying every lock taken on the primary would require strict timing synchronization, significantly impacting performance and hindering the standby's ability to serve reads. Instead, WAL replay uses per-buffer locks to apply edits incrementally, in log order: it acquires an exclusive lock on the buffer (the in-memory representation of a block), makes its change, and then releases it.

The problem arises when modifying data structures that span many Postgres buffers. Without the guarantee that operations are atomic over the entire structure, these modifications can lead to structural corruption.

For example: pg_search uses an unrolled linked list of Postgres buffers where each node holds the read validity of a batch of segments in our LSM tree. To ensure that the primary can never observe a broken linked list, we use hand-over-hand locking (also known as lock coupling) to guarantee that the list remains physically consistent on the primary. After each buffer in the list is modified, its WAL entry becomes visible atomically on the replica.
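Hand-over-hand locking is easiest to see in code. Here is a minimal singly-linked-list sketch in Python, where `threading.Lock` stands in for an exclusive Postgres buffer lock (the real list lives in Postgres buffers):

```python
import threading

class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next
        self.lock = threading.Lock()  # one lock per node, like one lock per buffer

def traverse(head):
    """Hand-over-hand (lock coupling): hold at most two locks at a time.

    A writer editing node N must hold N's lock, so a reader following
    this discipline can never step onto a node that is mid-edit."""
    values = []
    prev = head
    prev.lock.acquire()
    while prev.next is not None:
        curr = prev.next
        curr.lock.acquire()   # grab the next lock...
        prev.lock.release()   # ...before letting go of the previous one
        values.append(curr.value)
        prev = curr
    prev.lock.release()
    return values
```

With `head = ListNode(None, ListNode(1, ListNode(2)))`, `traverse(head)` walks the list safely alongside writers that follow the same lock order.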

But what happens when we want to edit multiple entries in the list “at once” (atomically), such as when a newly compacted segment replaces multiple old segments? If only the primary server mattered, then we could protect the logical integrity of multiple list nodes by applying a global lock on the list itself, ensuring that the contents of the list were only visible in valid state. But replicas don’t have access to global locks, so there’s no way to coordinate edits across multiple nodes (and multiple buffers) at once.

Instead, for multi-node operations pg_search uses a Copy-on-Write (CoW) clone of the list, and atomically swaps in the head. More generally, atomic operations insulate you from danger by eliminating reliance on coarse-grained locks.
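One way to picture the swap, using an immutable tuple as the segment list and a single reference assignment as the atomic head swap (the segment names and compaction policy here are invented for illustration):

```python
# The shared list head: a reader grabs a reference and sees a frozen version.
segment_list = ("seg1", "seg2", "seg3", "seg4")

def compact(old_head, merged_name, replaced):
    """Copy-on-Write: build a brand-new list with the replaced run
    collapsed into one merged segment; never mutate the old list."""
    remaining = tuple(s for s in old_head if s not in replaced)
    return (merged_name,) + remaining

# Publish with one atomic swap of the head reference. A concurrent
# reader sees either the old tuple or the new one, never a mix.
segment_list = compact(segment_list, "seg1+seg2", {"seg1", "seg2"})
assert segment_list == ("seg1+seg2", "seg3", "seg4")
```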

A Problem: Vacuums Undermine Logical Consistency

Adapting algorithms to work atomically at the block level is table stakes for physical replication: if you don’t do it, your data structures are broken, and you won’t be able to use them consistently.

But even when individual WAL operations and data structures are atomically compatible, VACUUM can interfere with the execution of concurrent transactions spread across multiple WAL entries and compromise logical consistency.

To illustrate this problem, let's imagine the primary has a table with some number of rows. To ensure that concurrent writes can safely write without blocking each other, Postgres uses a mechanism called Multi-Version Concurrency Control (MVCC), which creates multiple versions of a modified row (or tuple) instead of updating the tuple in place. When a row is updated or deleted, the previous tuple is not immediately removed but instead flagged as “dead”.

Those “dead” tuples remain on disk until a periodic maintenance operation called VACUUM is run. Like any other operation, VACUUMs are recorded in the WAL, shipped to the standby, and replayed there.

Since the “deadness” of a tuple is local to the server but the VACUUM operation is replayed globally, faults can occur when a tuple is VACUUMed from the standby prematurely. The standby might be mid-read, iterating (across multiple WAL entries) over a tuple that it considers valid, while the primary might concurrently decide to VACUUM that tuple out of existence. The standby, lacking any locking coordination or awareness of concurrent operations, replays the VACUUM while the prior transaction is still in progress. This can lead to query failure if a long-running query attempts to access a VACUUMed tuple.

Why LSM Trees are Particularly Vulnerable to this Problem

If your Postgres is configured with read replicas and experiences a high volume of writes, you may have already seen this problem, even when using B-tree indexes. If a VACUUM is running on the primary at the same time that a query hits a read replica, it's possible for Postgres to abort the read. However, these errors are likely infrequent and tolerable in a typical Postgres setup where VACUUMs run once every few hours.

The same cannot be said for an LSM tree, where compaction is a core, ongoing part of the system. In a high-write throughput system, compaction can happen many times per minute, even per second. This increases the number of opportunities for conflicts to occur.

Similar to a VACUUM, a compaction rewrites data on the primary server, and needs to know when ongoing queries will no longer need that data in order to be able to safely delete old segments.

The Logical Solution: Hot Standby Feedback

This is where an optional Postgres setting called hot_standby_feedback comes in. When enabled, hot_standby_feedback allows the standby to tell the primary what data is safe to clean up from the replica’s point of view. This information significantly reduces the likelihood of a tuple being prematurely VACUUMed, and allows pg_search to determine when it is safe to delete segments.

To understand what information hot_standby_feedback actually relays, we must first understand how tuple versioning works in Postgres. Every tuple in Postgres has two key metadata attributes: xmin and xmax. xmin stores the Transaction ID (XID) of the transaction that created or inserted that specific tuple version, while xmax stores the XID of the transaction that updated or deleted the tuple version, marking it as obsolete. Since XIDs are assigned sequentially, later transactions receive larger XIDs, so another way to think about xmin is as a proxy for the tuple’s “creation time” and xmax for its “last updated or deleted time”.
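A simplified visibility check makes the roles of xmin and xmax concrete. This toy version treats a snapshot as a single XID horizon and ignores aborted and in-progress transactions:

```python
def visible(tuple_version, snapshot_xid):
    """Can a snapshot taken at snapshot_xid see this tuple version?"""
    xmin, xmax = tuple_version
    created_before_snapshot = xmin <= snapshot_xid
    not_yet_deleted = xmax is None or xmax > snapshot_xid
    return created_before_snapshot and not_yet_deleted

old_version = (100, 205)        # inserted by xid 100, deleted by xid 205
assert visible(old_version, 200)        # a snapshot at 200 still sees it
assert not visible(old_version, 210)    # a later snapshot does not
assert visible((100, None), 210)        # a live tuple is visible to all later snapshots
```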

When hot_standby_feedback is enabled, the replica will periodically communicate the smallest xmin (oldest “creation time”) that any of its active queries is currently pinned to. This xmin identifies the oldest tuple still in use on the standby.

Armed with this information, the primary can make smarter decisions about when to permit cleanup operations (i.e. VACUUMs). If it sees that a standby query is still operating on a tuple that would otherwise be considered “dead,” it can defer cleanup until that query has finished.
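Putting the pieces together, the feedback loop can be sketched as follows (a toy model; the real feedback message rides on the streaming-replication connection and carries more state):

```python
def standby_feedback(active_query_xmins):
    """The standby reports the oldest xmin any of its active queries is pinned to."""
    return min(active_query_xmins) if active_query_xmins else None

def can_vacuum(tuple_xmax, feedback_xmin):
    """The primary may reclaim a dead tuple only if it was deleted
    before every snapshot the standby is still holding open."""
    return feedback_xmin is None or tuple_xmax < feedback_xmin

horizon = standby_feedback([150, 180, 220])
assert horizon == 150
assert not can_vacuum(160, horizon)  # deleted at 160: the query pinned at 150 may still need it
assert can_vacuum(120, horizon)      # deleted before any open snapshot began
```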

Final Thoughts

Even with the help of hot_standby_feedback, standby servers are fundamentally at the mercy of the WAL to provide instructions that are safe to execute in the order and moment they are received. The tension between local incentives and global requirements is just one dimension of the difficulty of achieving full replication safety in a distributed Postgres system.

To achieve physical consistency, pg_search implements an atomically logged LSM tree; to achieve logical consistency, we rely on hot_standby_feedback.

This challenge was worth tackling because it enables the fastest possible search performance, without sacrificing consistency. To see it in action, check out our documentation or our open source project!

  1. Postgres supports both "physical replication" and "logical replication", and ParadeDB does too! This post only discusses physical replication.

  2. Physical and logical consistency are also referred to as structural integrity and transactional consistency.



Comments

  • By timetoogo 2025-07-21 14:44

    Love this style of no-fluff technical deep dive. HN needs more content like this.

  • By kikimora 2025-07-22 1:32

    Agree with others saying HN needs more content like this!

    After reading I don’t get how locks held in memory affect WAL shipping. WAL reader reads it in a single thread, updates in-memory data structures periodically dumping them on disk. Perhaps you want to read one big instruction from WAL and apply it to many buffers using multiple threads?

    >Adapting algorithms to work atomically at the block level is table stakes for physical replication

    Why? To me the only thing you have to do atomically is WAL write. WAL readers read and write however they want given that they can detect partial writes and replay WAL.

    >If a VACUUM is running on the primary at the same time that a query hits a read replica, it's possible for Postgres to abort the read.

    The situation you're referring to is: 1. Record inserted. 2. Standby long query started. 3. Record removed. 4. Primary vacuum started. 5. Vacuum replicated. 6. Vacuum on standby cannot remove the record because it is being read by the long query. 7. PG cancels the query to let the vacuum proceed.

    I guess your implementation generates a lot of dead tuples during compaction. You're clearly fighting PG here. Could a custom storage engine be a better option?

    • By stuhood 2025-07-22 3:51

      Thanks for the questions!

          After reading I don’t get how locks held in memory affect WAL shipping.
          WAL reader reads it in a single thread, updates in-memory data structures
          periodically dumping them on disk. Perhaps you want to read one big
          instruction from WAL and apply it to many buffers using multiple threads?
      
      We currently use an un-modified/generic WAL entry, and don't implement our own replay. That means we don't control the order of locks acquired/released during replay: and the default is to acquire exactly one lock to update a buffer.

      But as far as I know, even with a custom WAL entry implementation, the maximum in one entry would still be ~8k, which might not be sufficient for a multi-block atomic operation. So the data structure needs to support block-at-a-time atomic updates.

          I guess your implementation generates a lot of dead tuples during
          compaction. You clearly fighting PG here. Could a custom storage
          engine be a better option?
      
      `pg_search`'s LSM tree is effectively a custom storage engine, but it is an index (Index Access Method and Custom Scan) rather than a table. See more on it here: https://www.paradedb.com/blog/block_storage_part_one

      LSM compaction does not generate any dead tuples on its own, as what is dead is controlled by what is "dead" in the heap/table due to deletes/updates. Instead, the LSM is cycling blocks into and out of a custom free space map (that we implemented to reduce WAL traffic).

  • By otterley 2025-07-21 15:46

    > To be an effective alternative to Elasticsearch we needed to support high ingest workloads in real time.

    Why not just use OpenSearch or ElasticSearch? The tool is already in the inventory; why use a screwdriver when a chisel is needed and available?

    This is another one of those “when you have a hammer, everything looks like your thumb” stories.

    • By iw2rmb 2025-07-21 15:47

      Because you don’t need to sync and you have ACID with joins.

      • By otterley 2025-07-21 15:49

        Is there a whole business to be had with those advantages alone? I’m curious as to who the target market is.

        • By levkk 2025-07-21 16:12

          My last big co, we had a team of 10 whose entire job was to sync data from Postgres into Elastic. It would take weeks and fall over regularly due to traffic.

          If we could have a DB that could do search and be a store of record, it would be amazing.

          • By otterley 2025-07-21 16:46

            They're different access patterns, though. Are there no concerns about performance and potentially blocking behavior? Decoupling OLTP and analytics is frequently done with good reason: 1/to allow the systems to scale independently, and 2/to help prevent issues with one component from impacting the other (i.e., contain blast radius). I wouldn't want a failure of my search engine to also take down my transaction system.

            • By philippemnoel 2025-07-21 17:11

              You don't need to. Customers usually deploy us on a standalone replica(s) on their Postgres cluster. If a query were to take it down, it would only take down the replica(s) dedicated to ParadeDB, leaving the primary and all other read replicas dedicated to OLTP safe.

              • By otterley 2025-07-21 17:34

                Are you saying that the cluster isn't homogenous? It sounds like you're describing an architecture that involves a cluster that has two entirely different pieces of software on it, and whose roles aren't interchangeable.

                • By philippemnoel 2025-07-21 17:54

                  Bear with me, this will be a bit of a longer answer. Today, there are two topologies under which people deploy ParadeDB.

                  - <some managed Postgres service> + ParadeDB. Frequently, customers already use a managed Postgres (e.g. AWS RDS) and want ParadeDB. In that world, they maintain their managed Postgres service and deploy a Kubernetes cluster running ParadeDB on the side, with one primary instance and some number of replicas. The AWS RDS primary sends data to the ParadeDB primary via logical replication. You can see a diagram here: https://docs.paradedb.com/deploy/byoc

                  In this topology, the OLTP and search/OLAP workloads are fully isolated from each other. You have two clusters, but you don't need a third-party ETL service since they're both "just Postgres".

                  - <self-hosted Postgres> + ParadeDB. Some customers, typically larger ones, prefer to self-host Postgres and want to install our Postgres extension directly. The extension is installed in their primary Postgres, and the CREATE INDEX commands must be issued on the primary; however, they may route reads only to a subset of the read replicas in their cluster.

                  In this topology, all writes could be directed to the primary, all OLTP read queries could be routed to a pool of read replicas, and all search/OLAP queries could be directed to another subset of replicas.

                  Both are completely reasonable approaches and depend on the workload. Hope this helps :)

                  • By adsharma 2025-07-21 19:59

                    Which of these two is the higher order bit?

                    * ParadeDB speaks postgres protocol

                    * These setups don't have a complex ETL pipeline

                    If you have an ETL pipeline specialized for PG logical replication (as opposed to generic JVM-based Debezium/Kafka setups), you get some fraction of the same benefits. I'm curious about Conduit and its postgres plugin.

                    That leaves: ParadeDB uses vanilla postgres + rust extension. This is a technology detail. I was looking for an articulation of the customer benefit because of this technologically appealing architecture.

                    • By philippemnoel 2025-07-21 21:30

                      The value prop for customers vs Elasticsearch are:

                      - ACID w/ JOINs

                      - Real-time indexing under UPDATE-heavy workloads. Instacart wrote about this, they had to move away from Elasticsearch during COVID because of this problem: https://tech.instacart.com/how-instacart-built-a-modern-sear...

                      Beyond these two benefits, then the added benefits are:

                      - Infrastructure simplification (no need for ETL)

                      - Lower costs

                      Speaking the wire protocol is nice, but it's not worth much.

                • By dangoodmanUT 2025-07-21 21:03

                  they both sound like postgres to me, just with different extensions

            • By ucarion 2025-07-21 18:51

              Since we both worked there: I can think of a few places at Segment where we'd have added more reporting/analytics/search if it weren't such a pain to set up a OLAP copy of our control plane databases. Remember how much engineering effort we spent on teams that did nothing but control plane database stuff?

              Data plane is a different story, but not everything is 1m+ RPS.

          • By gtowey 2025-07-21 17:14

            It's not going to happen anytime soon, because you simply cannot cheat physics.

            A system that supports OLAP/ad-hoc queries is going to need a ton of IOPs & probably also CPU capacity to do your data transformations. If you want this to also scale beyond the capacity limits of a single node, then you're going to run into distributed joins and network becomes a huge factor.

            Now, to support OLTP at the same time, your big, distributed system needs to support ACID, be highly fault-tolerant, etc.

            All you end up with is a system that has to be scaled in every dimension. It needs to support the maximum possible workloads you can throw at it, or else a random, expensive reporting query is going to DOS your system and your primary customer-facing system will be unusable at the same time. It is sort of possible, but it's going to cost A LOT of money. You have to have tons and tons of "spare" capacity.

            Which brings us to the core of engineering -- anyone can build a system that burns dump trucks full of venture capital dollars to create the one-system-to-rule-them-all. But businesses that want to succeed need to optimize their costs so their storage systems don't break the bank. This is why the current status-quo of specialized systems that do one task well isn't going to change. The current technology paradigm cannot be optimized for every task simultaneously. We have to make tradeoffs.

            • By throwaway7783 2025-07-21 18:54

              I don't know. For me, I need

              * a primary transactional DB that I can write fast, with ACID guarantees and a read-after-write guarantee, and allows failover

              * one (or more) secondaries that are optimized for analytics and search. This should also tell me how caught up the system is, with the primary.

              If they all can talk the same language (SQL) and can replicate from the primary with no additional tools/technology (Postgres replication, for example), I will take it any day.

              It is about operational simplicity and not needing intimately to know multiple technologies. Granted, even if this is "just" postgresql, it really is not and all customizations will have their own tuning and whatnot, but the context is all still postgresql.

              Yes, this will not magically solve the CAP theorem, but for most cases we don't need to care too much

          • By mathewpregasen 2025-07-21 16:27

            Yeah, in general, I think a lot of businesses would love to skip ETL pipelines if possible / consolidate data. Postgres is very much a neutral database to extend upon; maybe a wild analogy, but it's the canola oil of databases.

            • By strbean 2025-07-21 17:20

              Total tangent, but I think "Canola is a neutral oil" is a lie. It's got the most distinctive (and in my opinion, bad) flavor of the common cooking oils.

              • By retakeming 2025-07-21 17:27

                What would you say is the most neutral oil then?

                • By ahartmetz 2025-07-21 18:04

                  Sunflower oil? It seems to very reliably taste like nothing.

                  • By mathewpregasen 2025-07-21 23:06

                    Personally I have Canola and Sunflower oil tied. Vegetable Oil I guess deserves a mention here too.

                    • By ahartmetz 2025-07-21 23:40

                      If canola oil tastes like something, it's really disgusting IMO. I kinda hate the stuff even though my dad made good money growing it. OTOH, the very sweet smell of the plant's flowers is pleasant enough if pretty basic and the honey is similar.

                  • By zombodb 2025-07-22 0:53

                    Shout out to grapeseed oil

          • By mikegreenberg 2025-07-21 16:36

            Once upon a time, I was using postgres for OLTP and OLAP purposes combined with in-database transforms using TimescaleDB. I had a schema for optimized ingestion and then several aggregate views which produced a bunch of purpose-specific "materialized" tables for efficient analysis based on the ingestion tables.

            Timescale had a nice way of abstracting away the cost of updating these views without putting too much load on ingestion (processing multiple TBs of data at a time on a single instance, with about 500 GB of data churn daily).

          • By worldsayshi 2025-07-21 16:36

            One db that could be interesting here is CrateDB. It's a Lucene based DB that supports the postgres wire protocol. So you can run SQL queries against it.

            I've tried figuring out if it supports acting as a pg read-replica, which sounds to me like the ideal set up - but it doesn't seem to be supported.

            I have no affiliation to them, just met the team at an event and thought it sounded cool.

            • By philippemnoel 2025-07-21 17:02

              One of the ParadeDB maintainers here -- Being PostgreSQL wire protocol compatible is very different from being built inside Postgres on top of the Postgres pages, which is what ParadeDB does. You still need the "T" in ETL, e.g. transforming data from your source into the format of the sink (in your example CrateDB). This is where ETL costs and brittleness come into play.

              You can read more about it here: https://www.paradedb.com/blog/block_storage_part_one

              • By worldsayshi 2025-07-22 6:32

                Sounds very interesting! Unfortunately AGPL license makes it hard to bring into projects.

                • By philippemnoel 2025-07-22 14:09

                  How so? Many popular projects are AGPL. MinIO, Grafana, etc.

                  We wrote about this here: https://www.paradedb.com/blog/agpl

                  • By worldsayshi 2025-07-24 14:23

                    So, I'm not versed enough in legal matters to be certain about this, so I tend to fallback to caution, but (A) customers I've worked with in the past seem to be wary of such copyleft licenses and (B) the contagious nature of such license would make me think twice about using it in a project of my own as well.

                    It would be nice to have such notion challenged but I'm not sure what would change my mind.

                    I would expect that most commercial companies that use Grafana would obtain a commercial license?

        • By cryptonector 2025-07-21 16:32

          For JOINs? Absolutely! Who wants to hand-code queries at the executor level?! It's expensive!

          You need a query language.

          You don't necessarily need ACID, and you don't necessarily need a bunch of things that SQL RDBMSes give you, but you definitely need a QL, and it has to support a lot of what SQL supports, especially JOINs and GROUP BY w/ aggregations.

          NoSQLs tend to evolve into having a QL layered on top. Just start with that if you really want to build a NoSQL.

          • By otterley 2025-07-21 16:52

            To be clear here, I'm not arguing that OpenSearch/ElasticSearch is an adequate substitute for Postgres. They're different databases, each with different strengths and weaknesses. If you need JOINs and ACID compliance, you should use Postgres. And if you need distributed search, you should use OpenSearch/ElasticSearch.

            Unless they're building for single-host scale, you're not going to get JOINs for free. Lucene (the engine upon which ES/OS is based) already has JOIN capability. But it's not used in ES/OS because the performance of JOINs is absolutely abysmal in distributed databases.

            • By cryptonector 2025-07-21 17:42

              I'm arguing that sometimes you don't need ACID, or rather, sometimes you accept that ACID is too painful so you accept not having ACID, but no one ever really doesn't want a QL -- they only think that they don't want a QL until they learn better.

              I.e., NoACID does not imply NoQueryLanguage, and you can always have a QL, so you should always get a QL, and you should always use a QL.

              > Unless they're building for single-host scale, you're not going to get JOINs for free.

              If by 'free' you mean not having to code them, then that's wrong. You can always have or implement a QL.

              If by 'free' you mean 'performant', then yes, you might have to denormalize your data so that JOINs vanish, though at the cost of write amplification. But so what, that's true whether you use a QL or not -- it's true in SQL RDBMSes too.

            • By philippemnoel 2025-07-21 17:10

              Our customers typically deploy ParadeDB in a primary-replicas topology, with one primary Postgres node and 2 or more read replicas, depending on read volume. Queries are executed on a single node today, yes.

              We have plans to eventually support distributed queries.

          • By strbean 2025-07-21 17:25

            Obligatory whine that the term NoSQL got co-opted to mean "no relational". There's tons of space for a better query language for querying relation databases.

          • By derefr 2025-07-21 17:21

            It's funny; as someone who is exactly pg_search's market, I actually often want the opposite: ACID, MVCC transactions, automatic table and index management... but no query language.

            At the data scale + level of complexity our OLAP queries operate at, we very often run into situations where Postgres's very best plan [with a well-considered schema, with great indexes and statistics, and after tons of tuning and coaxing], still does something literally interminable — not for any semantic reason to do with the query plan, but rather due to how Postgres's architecture executes the query plan[1].

            The last such job, I thought would be simple enough to run in a few hours... I let it run for six days[2], and then gave up and killed it. Whereas, when we encoded the same "query plan" as a series of bulk-primitive ETL steps by:

            1. dumping the raw source data from PG to CSV with a `COPY`,

            2. whipping out simple POSIX CLI tools like sort/uniq/grep/awk (plus a few hand-rolled streaming aggregation scripts) to transform/reduce/normalize the source data into the shape we want it in,

            3. and then loading the resulting CSVs back into PG with another `COPY`,

            ...then the runtime of the whole operation was reduced to just a few hours, with the individual steps completing in ~30 minutes each. (And that's despite the overhead of parsing and/or emitting non-string fields from/to CSV with almost every intermediate step!)

            Honestly, if Postgres would just let us program it the way one programs e.g. Redis through Lua, or ETS tables in Erlang — where the tables and indices are ADTs with low-level public APIs, and you set up your own "query plan" as a set of streaming-channel actors making calls to these APIs — then we would be a lot happier. But even in PL/pgSQL (which we do use, here and there), the only APIs are high-level ones.

            • Sure, you can get a cursor on a query; but you can't e.g. get an LMDB-like B-tree cursor on a target B-tree index, and ask it to jump [i.e. re-nav down from root] or walk [i.e. nav up from current pos to nearest common ancestor then back down] to "the first row-tuple greater-than-or-equal to [key]".

            • You can't write your own efficient implementation of TABLESAMPLE semantics to set up your own Bigtable-esque balanced cluster-order-partitioned parallel seq scan.

            • You can't collect pointers to row-tuples, partially materialize them, filter them by some criterion on the read (but perhaps not parsed!) columns, and then more-fully materialize those same row-tuples "directly" from the references to them you still hold.

            ---

            [1] One example of what I mean by "execution": did you know that Postgres doesn't use any form of concurrency for query plans — not even the most basic libuv-like "This Merge Append node's child-node A is in a blocking-wait on IO; that blocking-wait should yield, so that the Merge Append node's child-node B can instead send row-tuple batches for a while" kind of concurrency?

            ---

            [2] If you're wondering, the query that ran for six days was literally just this (anonymized):

                SELECT a, b, SUM(value) AS total_value
                FROM (
                  SELECT a, b, value FROM source1
                  UNION ALL
                  SELECT a, b, value FROM source2
                ) AS u
                GROUP BY a, b;
            
            `source1` and `source2` are ~150GB tables. (Or at least, they're 150GB when dumped to CSV.) Two integer keys (a,b), and a bigint value. With a b-tree index on `(a,b) INCLUDE (value)`, with correct statistics.

            And its EXPLAIN query plan looked like this (with `SET enable_hashagg = OFF;`) — nominally pretty good:

                GroupAggregate  (cost=1.17..709462419.92 rows=40000 width=40)
                  Group Key: a, b
                  ->  Merge Append  (cost=1.17..659276497.84 rows=6691282944 width=16)
                        Sort Key: a, b
                        ->  Index Only Scan using source1_a_b_idx on source1  (cost=0.58..162356175.31 rows=3345641472 width=16)
                        ->  Index Only Scan using source2_a_b_idx on source2  (cost=0.58..162356175.31 rows=3345641472 width=16)
            
            Each one of the operations here is "obvious." It's what you'd think you'd want! You'd think this would finish quickly. And yet.

            (And no, the machine it ran on was not resource-bottlenecked. It had 1TB of RAM with no contention from other jobs, and this PG session was allowed to use much of it as work memory. But even if it was spilling to disk at every step... that should have been fine. The CSV equivalent of this inherently "spills to disk", for everything except the nursery levels of sort(1)'s merge-sort. And it does fine.)

• By cryptonector 2025-07-21 17:50 | 1 reply

              > At the data scale + level of complexity our OLAP queries operate at, we very often run into situations where Postgres's very best plan [with a well-considered schema, with great indexes and statistics, and after tons of tuning and coaxing], still does something literally interminable — not for any semantic reason to do with the query plan, but rather due to how Postgres's architecture executes the query plan[1].

              Well, ok, this is a problem, and I have run into it myself. That's not a reason for not wanting a QL. It's a reason for wanting a way to improve the query planning. Query hints in the QL are a bad idea for several reasons. What I would like instead is out-of-band query hints that I can provide along with my query (though obviously only when using APIs rather than `psql`; for `psql` one would have to provide the hints via some \hints commnad) where I would address each table source using names/aliases for the table source / join, and names for subqueries, and so really something like a path through the query and subqueries like `.<sub_query_alias0>.<sub_query_alias1>.<..>.<sub_query_aliasN>.<table_source_alias>` and where the hint would indicate things like what sub-query plan type to use and what index to use.

              • By derefr 2025-07-21 20:28 | 1 reply

                I mean, in my case, I don't think what I want could be implemented via query hints. The types of things I would want to communicate to the server, are pragmas entirely divorced from the semantics of SQL: pragmas that only make sense if you can force the query's plan to take a specific shape to begin with, because you're trying to tune specific knobs on specific plan nodes, so if those plan nodes aren't part of the final query, then your tuning is meaningless.

                And if you're pinning the query plan to a specific shape, then there's really no point in sending SQL + hints; you may as well just expose a lower-level "query-execution-engine abstract-machine bytecode" that the user can submit, to be translated in a very low-level — but contractual! — way into a query plan. Or, one step further, into the thing a query plan does, skipping the plan-node-graph abstraction entirely in favor of arbitrarily calling the same primitives the plan nodes call [in a sandboxed way, because such bytecode should be low-level enough that it can encode invalid operation sequences that will crash the PG connection backend — and this is fine, the user signed up for that; they just want to be assured that such a crash won't affect data integrity outside the current transaction.]

                Such a bytecode wouldn't have to be used as the literal compiled internal representation of SQL within the server, mind you. (It'd be ideal if it was, but it doesn't need to be.) Just like e.g. the published and versioned JVM bytecode spec isn't 1:1 with the bytecode ISA the JVM actually uses as its in-memory representation for interpretation — there's module-load-time translation/compilation from the stable public format, to the current internal format.

                • By cryptonector 2025-07-22 3:16

                  But your mental model of your query is still in a language, even if it's only natural language. Why wouldn't you write a QL and compiler for it that outputs a query plan AST/bytecode/whatever to your liking? The PG SQL compiler and query planner just aren't to your liking, but do you really want to be writing queries by hand? I guess what you're saying is that you want something like LINQ that lets you build complex plans/ASTs without the complexity of hand-coded NoSQL queries.

            • By cryptonector 2025-07-22 20:04

              Oh, and BTW, PG is getting async I/O in the next release. It's not the same as concurrency, but if your workloads are I/O-bound (and they likely are) then it's as good as concurrency.

        • By beoberha 2025-07-21 17:13

          Interestingly enough, it looks like the team was just hacking on an open source extension and organically attracted some customers, which snowballed into raising capital. So definitely seems like there’s a market.

    • By izietto 2025-07-21 20:24

      Because it's a whole other architectural component for data that is already there, in a tool that is made just for that and lacks only a `SELECT full_text_search('kitty pictures');`.

    • By bastawhiz 2025-07-22 1:13

      Running Elasticsearch is a miserable experience. If you can instead run one tool that you already have, with slightly more effort, amazing. And you never need to think about rebuilding indexes, or tuning the garbage collector, or planning an ES major version migration.

    • By sandeepkd 2025-07-21 18:00

      There can be multiple reasons; the one I can think of right away would be to keep the stack as simple as possible for as long as you can. Realistically speaking, most companies do not operate at a scale where they would need the specialized tools.

    • By cheema33 2025-07-21 23:14

      > Why not just use OpenSearch or ElasticSearch?

      There is a cost associated with adopting and integrating another tool like ElasticSearch. For some orgs, the ROI might not be there. And if their existing database provides some additional capabilities in this space, that might be preferable.

      > This is another one of those “when you have a hammer, everything looks like your thumb” stories.

      Are you referring to people who think that every reporting problem must be solved by a dedicated OLAP database?

    • By AtlasBarfed 2025-07-21 19:06

      Why not the Cassandra-based Elasticsearch alternatives, if you need ingest?

HackerNews