
# Why DuckDB is my first choice for data processing
Over the past few years, I've found myself using DuckDB more and more for data processing, to the point where I now use it almost exclusively, usually from within Python.
We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets.2
This post sets out some of my favourite features of DuckDB that set it apart from other SQL-based tools. In a nutshell, it's simple to install, ergonomic, fast, and more fully featured.
An earlier post explains why I favour SQL over other APIs such as Polars, pandas or dplyr.
DuckDB is an open source in-process SQL engine that is optimised for analytics queries.
The performance difference of analytics-optimised engines (OLAP) vs. transactions-optimised engines (OLTP) should not be underestimated. A query running in DuckDB can be 100 or even 1,000 times faster than exactly the same query running in (say) SQLite or Postgres.
A core use-case of DuckDB is where you have one or more large datasets on disk in formats like csv, parquet or json which you want to batch process. You may want to perform cleaning, joins, aggregation, derivation of new columns - that sort of thing.
But you can also use DuckDB for many other simpler tasks like viewing a csv file from the command line.
DuckDB consistently benchmarks as one of the fastest data processing engines. The benchmarks I've seen3 show there's not much in it between the leading open source engines - which at the moment seem to be polars, DuckDB, DataFusion, Spark and Dask. Spark and Dask can be competitive on large data, but slower on small data.
DuckDB itself is a single precompiled binary. In Python, it can be pip installed with no dependencies. This makes it a joy to install compared to other more heavyweight options like Spark. Combined with uv, you can stand up a fresh DuckDB Python environment from nothing in less than a second - see here.
With its speed and almost-zero startup time, DuckDB is ideally suited for CI and testing of data engineering pipelines.
Historically this has been fiddly and running a large suite of tests in e.g. Apache Spark has been time consuming and frustrating. Now it's much simpler to set up the test environment, and there's less scope for differences between it and your production pipelines.
This simplicity and speed also applies to writing new SQL, and getting syntax right before running it on a large dataset. Historically I have found this annoying in engines like Spark (where it takes a few seconds to start Spark in local mode), or even worse when you're forced to run queries in a proprietary tool like AWS Athena.4
There's even a DuckDB UI with autocomplete - see here.
The DuckDB team has implemented a wide range of innovations in its SQL dialect that make it a joy to use. See the following blog posts 1 2 3 4 5 6.
Some of my favourites are the EXCLUDE keyword, and the COLUMNS keyword which allows you to select and regex-replace a subset of columns.5 I also like QUALIFY and the aggregate modifiers on window functions, see here.
Another is the ability to function chain, like first_name.lower().trim().
You can query data directly from files, including on s3, or on the web.
For example to query a folder of parquet files:
```sql
select *
from read_parquet('path/to/*.parquet')
```
or you can even run SQL directly against a file on the web (for browser use, the file must be served with CORS enabled):

```sql
select *
from read_parquet('https://raw.githubusercontent.com/plotly/datasets/master/2015_flights.parquet')
limit 2;
```
Click here to try this query yourself in the DuckDB web shell.
One of the easiest ways to cause problems in your data pipelines is to fail to be strict about incoming data types from untyped formats such as csv. DuckDB provides lots of options here - see here.
Many data pipelines effectively boil down to a long sequence of CTEs:
```sql
WITH
input_data AS (
    SELECT * FROM read_parquet('...')
),
step_1 AS (
    SELECT ... FROM input_data JOIN ...
),
step_2 AS (
    SELECT ... FROM step_1
)
SELECT ... FROM step_2;
```
When developing a pipeline like this, we often want to inspect what's happened at each step.
In Python, we can write
```python
input_data = duckdb.sql("SELECT * FROM read_parquet('...')")
step_1 = duckdb.sql("SELECT ... FROM input_data JOIN ...")
step_2 = duckdb.sql("SELECT ... FROM step_1")
final = duckdb.sql("SELECT ... FROM step_2;")
```
This makes it easy to inspect what the data looks like at step_2 with no performance loss, since these steps will be executed lazily when they're run all at once.
This also facilitates easier testing of SQL in CI, since each step can be an independently-tested function.
DuckDB offers full ACID compliance for bulk data operations, which sets it apart from other analytical data systems - see here. You can listen to more about this in this podcast, transcribed here.
This is a very interesting new development, making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data.
A longstanding difficulty with data processing engines has been writing high-performance user-defined functions (UDFs).
For example, in PySpark, you will generally get the best performance by writing custom Scala, compiling it to a JAR, and registering it with Spark. But this is cumbersome, and in practice you will encounter a lot of issues around Spark version compatibility and security restrictions in environments such as Databricks.
In DuckDB, high-performance custom UDFs can be written in C++. Whilst writing these functions is certainly not trivial, DuckDB community extensions offer a low-friction way of distributing the code. Community extensions can be installed almost instantly with a single command such as `INSTALL h3 FROM community`, which installs hierarchical hexagonal indexing for geospatial data.
The team provide documentation as a single markdown file so it can easily be provided to an LLM.
My top tip: if you load this file in your code editor, and use code folding, it's easy to copy the parts of the documentation you need into context.
Much of this blog post is based on my experience supporting multiple SQL dialects in Splink, an open source library for record linkage at scale. We've found that transitioning towards recommending DuckDB as the default backend choice has increased adoption of the library and significantly reduced the number of problems faced by users, even for large linkage tasks, whilst speeding up workloads very substantially.
We've also found it's hugely increased the simplicity and speed of developing and testing new features.
pg_duckdb allows you to embed the DuckDB computation engine within Postgres. This seems potentially extremely powerful, enabling Postgres to be simultaneously optimised for analytics and transactional processing. I think it's likely to see widespread adoption, especially once they iron out a few of the current shortcomings around enabling and optimising the use of Postgres indexes and pushing filters down to Postgres.
What I love about duckdb:
-- Support for .parquet, .json, .csv (note: Spotify listening history comes in multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- if the files don't have the same schema, union_by_name is amazing.
-- The .csv parser is amazing. Auto assigns types well.
-- It's small! The Web Assembly version is 2mb! The CLI is 16mb.
-- Because it is small you can add duckdb directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like Typescript transpiles to Javascript.
>> The .csv parser is amazing
Their csv support coupled with lots of functions and fast & easy iterative data discovery has totally changed how I approach investigation problems. I used to focus a significant amount of time on understanding the underlying schema of the problem space first, and often there really wasn't one - but you didn't find out easily. Now I start with pulling in data, writing exploratory queries to validate my assumptions, then cleaning & transforming data and creating new tables from that state; rinse and repeat. Aside from getting much deeper much quicker, you also hit dead ends sooner, saving a lot of otherwise wasted time.
There's an interesting paper out there on how the CSV parser works, and some ideas for future enhancements. I couldn't seem to find it but maybe someone else can?
not a paper but I found this: https://duckdb.org/2025/04/16/duckdb-csv-pollock-benchmark
Been playing around with Clickhouse a lot recently and have had a great experience, particularly because it hits many of these same points. In my case the "local files" hasn't been a huge feature, but the Parquet and JSON ingestion have been very convenient, and I think CH intends for `clickhouse-local` to be some sort of analog to the "add duckdb" point.
One of my favorite features is `SELECT ... FROM s3Cluster('<ch cluster>', 'https://...<s3 url>.../data//.json', ..., 'JSON')`[0] which lets you wildcard ingest from an S3 bucket and distributes the processing across nodes in your configured cluster. Also, I think it works with `schema_inference_mode` (mentioned below) though I haven't tried it. Very cool time for databases / DB tooling.
(I actually wasn't familiar with `union_by_name` but it looks to be like Clickhouse has implemented that as well [1,2] Neat feature in either case!)
[0] https://clickhouse.com/docs/sql-reference/table-functions/s3... [1] https://clickhouse.com/docs/interfaces/schema-inference [2] https://github.com/ClickHouse/ClickHouse/pull/55892
Malloy and PRQL (https://prql-lang.org/book/) are quite cool
I built Shaper following Malloy's idea of combining data queries and visualizations. But Shaper uses SQL instead of a custom language. It turns DuckDB into a dashboard builder for when all you need is SQL.
I built https://zenquery.app and have used duckdb internally to do all processing. The speed is crazy, schema auto-detection works correctly (most of the time), and LLMs generate correct SQL for queries given in plain English.
This is a great sell. I have this annoyingly manual approach with a SQLite import and so on. This is great. Thank you!
Thanks for the excellent comment! Now excuse me while I go export my spotify history to play around with duckdb <3
Spotify says it will take 30 days for the export... it really only takes about 48 hours if I remember correctly. While you wait for the download here is an example listening history exploration in malloy - I converted the listening history to .parquet: https://github.com/mrtimo/spotify-listening-history
Polars has all of these benefits (to some degree), but also allows for larger-than-memory datasets.
DuckDB supports this as well, and depending on which benchmark you look at, it regularly performs better on those datasets than Polars.
-- hive partitioning
It's 32 MB uncompressed and around 6 MB compressed, so it's not that small: https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm/dist/
It is also difficult to customise compared to SQLite; for example, if you want to use your own CSV parser, that becomes hard.
But yes, it provides a lot of convenience out of the box, as you have already listed.
It has become a favourite tool for me as well.
I work with scientists who research BC's coastal environment, from airborne observation of glaciers to autonomous drones in the deep sea. We've got heaps of data.
A while back I took a leap of faith with DuckDB as the data-processing engine for a new tool we're using to transform and validate biodiversity data. The goal is to take heaps of existing datasets and convert them to valid Darwin Core data. Keyword being valid.
DuckDB is such an incredible tool in this context. Essentially I dynamically build duckdb tables from schemas describing the data, then import it into the tables. If it fails, it explains why on a row-by-row basis (as far as it's able to, at least). Once the raw data is in, transformations can occur. This is accomplished entirely in DuckDB as well. Finally, validations are performed using application-layer logic if the transformation alone isn't assurance enough.
I've managed to build an application that's way faster, way more capable, and much easier to build than I expected. And it's portable! I think I can get the entire core running in a browser. Field researchers could run this on an iPad in a browser, offline!
This is incredible to me. I've had so much fun learning to use DuckDB better. It's probably my favourite discovery in a couple of years.
And yeah, this totally could have been done any number of different ways. I had prototypes which took much different routes. But the cool part here is I can trust DuckDB to do a ton of heavy lifting. It comes with the cost of some things happening in SQL that I'd prefer it didn't sometimes, but I'm content with that tradeoff. In cases where I'm missing application-layer type safety, I use parsing and tests to ensure my DB abstractions are doing what I expect. It works really well!
edit: For anyone curious, the point of this project is to allow scientists to analyze biodiversity and genomic data more easily using common rather than bespoke tools, as well as publish it to public repositories. Publishing is a major pain point because people in the field typically work very far from the Darwin Core spec :) I'm very excited to polish it a bit and get it in the hands of other organizations.
What's the advantage over using Polars for the same task? It seems to me the natural competitor here and I vastly prefer the Polars syntax over SQL any day. So I was curious if I should try duckdb or stick with polars
Polars would be better in some ways for sure. It was in one of my early prototypes. What put me off was that I was essentially designing my own database which I didn't trust as much as something like DuckDB.
Polars would let me have a lot of luxuries that are lost at the boundaries between my application and DuckDB, but those are weighed in the tradeoffs I was talking about. I do a lot of parsing at the boundaries to ensure data structures are sound, and otherwise DuckDB is enforcing strict schemas at runtime which provides as much safety as a dataset's schema requires. I do a lot of testing to ensure that I can trust how schemas are built and enforced as well.
Things like foreign keys, expressions that span multiple tables effortlessly, normalization, check constraints, unique constraints, and primary keys work perfectly right off the shelf. It's kind of perfect because the spec I'm supporting is fundamentally about normalized relational data.
Another consideration was that while Polars is a bit faster, we don't encounter datasets that require more speed. The largest dataset I've processed, including extensive transformations and complex validations (about as complex as they get in this spec), takes ~3 seconds for around 580k rows. That's on an M1 Max with 16GB of RAM, for what it's worth.
Our teams have written countless R scripts to do the same work with less assurance that the outputs are correct, having to relearn the spec each time, and with much worse performance (these people are not developers). So, we're very happy with DuckDB's performance despite that Polars would probably let us do it faster.
Having said that, if someone built the same project and chose Polars I wouldn't think they were wrong to do so. It's a great choice too, which is why your question is a good one.
Familiarity with SQL is a plus in my opinion. Also, DuckDB has SDKs in more languages compared to Polars.
I wasn't all that excited about SQL at first, but I've come around to it. Initially I really wanted to keep all of my data and operations in the application layer, and I'd gone to great lengths to model that to make it possible. I had this vision of all types of operations, queries, and so on being totally type safe and kept in a code-based registry, such that I could do things like provide a GUI on top of data and functions I knew were 100% valid at compile-time. The only major drawback was that some kinds of changes to the application would require updating the registry.
I still love that idea but SQL turns out to be so battle-proven, reliable, flexible, capable, and well-documented that it's really hard to beat. After giving it a shot for a couple of weeks it became clear that it would yield a way more flexible and capable application. I'm confident enough that I can overcome the rough edges with the right abstractions and some polish over time.
Polars has all of the benefits of DuckDB (to some degree), but also allows for larger-than-memory datasets.
DuckDB has this capability as well: https://duckdb.org/docs/stable/guides/performance/how_to_tun...
Interesting, I wasn't aware; thanks for that. I will say, Polars' implementation is much more centered on out-of-core processing, and bypasses some of DuckDB's limitations ("DuckDB cannot yet offload some complex intermediate aggregate states to disk"). Both incredible pieces of software.
To expand on this, Polars' `LazyFrame` implementation allows for simple addition of new backends like GPU, streaming, and now distributed computing (though it's currently locked to a vendor). The DuckDB codebase just doesn't have this flexibility, though there are ways to get it to run on GPU using external software.
Thanks for that insight as well! My needs don't tend to be so demanding, so I've gotten away without knowing these details, but I suspect in the not-so-distant future this could be useful to know.
Being able to use distributed backends to process frames sounds kind of incredible, but I can't imagine my little projects ever making use of it. Still, very cool stuff.
Have you seen Ibis[1]? It's a dataframe API that translates calls to it into various backends, including Polars and DuckDB. I've messed around with it a little for cases where data engineering transforms had to use pyspark but I wanted to do exploratory analysis in an environment that didn't have pyspark.
What file formats are the existing datasets you have? I also work on data processing in a scientific domain where HDF5 is a common format. Unfortunately Duckdb doesn't support HDF5 out of the box, and the existing hdf5 extension wasn't fast enough and didn't have the features needed, so I made a new one based on the c++ extension template. I'd love to collaborate on it if anyone is interested.
That's really fascinating. Is your format open source? I don't know if I'd have overlapping needs for something like that (though I did investigate hdf5 early on, it seemed very promising as a place to store our outputs) but I'd be curious to explore it and see what you're doing with it.
Right now we typically read from CSV or Excel, because that's what the scientists prefer to work with. For better or worse. There's a bit of parquet kicking around. The wrappers around handling imports for DuckDB are very, very thin. It handles just about everything seamlessly
Is this the extension that was too slow? https://duckdb.org/community_extensions/extensions/hdf5
We use DuckDB to process analytics and feeds for Bluesky (https://bluefacts.app)
To get fast access to the query results we use the Apache Arrow interface and generate the code directly from DuckDB SQL queries using the SQG tool ( https://sqg.dev/generators/java-duckdb-arrow/)
I'm very interested to see a behind-the-scenes look at your tech - do you guys have a writeup somewhere? Your HN tool is also promising!