Show HN: Stratum - a pure JVM columnar SQL engine using the Java Vector API

2026-03-0622:3831datahike.io

Stratum Fork any table in O(1). Query past snapshots. Pure JVM, no native dependencies. Faster than DuckDB on 35 of 46 queries via the Java Vector API. PostgreSQL wire protocol - connect with psql,…

Show article

Stratum

Fork any table in O(1). Query past snapshots. Pure JVM, no native dependencies. Faster than DuckDB on 35 of 46 queries via the Java Vector API. PostgreSQL wire protocol - connect with psql, JDBC, or DBeaver.

GitHub Benchmarks Read the announcement →

35/46 queries faster than DuckDB (10M rows, 1T)

8.7× faster on H2O db-benchmark Q10

Pure JVM no native dependencies

Fused SIMD execution over copy-on-write columnar data. Single-threaded comparison vs DuckDB v1.4.4 (JDBC in-process) on 10M rows, Intel Core Ultra 7 258V, JVM 25. Median of 10 iterations, 5 warmup:

Query	Stratum	DuckDB	Ratio
TPC-H Q6 (filter + sum-product)	13ms	28ms	2.2x faster
Filtered COUNT (NEQ pred)	3ms	12ms	4.0x faster
H2O Q3 (100K string groups)	71ms	362ms	5.1x faster
LIKE '%search%' (string scan)	47ms	240ms	5.1x faster
H2O Q6 (STDDEV group-by)	30ms	81ms	2.7x faster
H2O Q10 (10M groups, 6 cols)	832ms	7056ms	8.5x faster
AVG(LENGTH(URL))	38ms	170ms	4.5x faster

Stratum wins 35 of 46 queries at 10M rows (single-threaded, median of 10 runs, Intel Core Ultra 7). DuckDB wins on sparse-selectivity filters, high-cardinality hash group-by, and global COUNT(DISTINCT). Full methodology and raw results in the benchmark docs.

st/fork creates an O(1) copy-on-write branch - no data copied, only a pointer to shared chunks. st/sync! persists a branch to storage. st/load restores any named branch. Pass column data as a table map to SQL queries, or register live storage-backed tables in the server with register-live-table!.

(require '[stratum.api :as st]
 '[konserve.file-store :as fs]
 '[clojure.core.async :refer [<!!]])

;; Open storage and load the orders dataset (10M rows)
(def store (<!! (fs/new-fs-store "/data/stratum")))
(def orders (<!! (st/load store "orders")))

;; Fork in O(1) - structural sharing, zero data copied
(def experiment (st/fork orders))

;; Persist fork as a named branch - shares chunks with main
(<!! (st/sync! experiment store "experiment"))

;; Query both branches via SQL - pass column data as table map
(st/q "SELECT SUM(price * qty) FROM t" {"t" (st/columns orders)})
;; => {:SUM(price * qty) 4821903.40} ← main branch unchanged

(st/q "SELECT SUM(price * qty) FROM t" {"t" (st/columns experiment)})
;; => {:SUM(price * qty) 4401238.66} ← experiment branch

;; Time-travel: load any historical branch from storage
(def baseline (<!! (st/load store "orders-baseline")))
(st/q "SELECT COUNT(*) FROM t" {"t" (st/columns baseline)})
;; => {:COUNT(*) 9847233}

Yggdrasil extends branching across your whole stack - fork Datahike, Stratum, and Proximum together for consistent snapshot isolation across SQL, Datalog, and vector search. Yggdrasil → · Dataset API docs →

Datahike is the system-of-record: Datalog queries, immutable transactions, time travel. Stratum is the SIMD SQL engine for scans and analytics over those same snapshots - fast on group-bys, joins, and window functions where a triple index isn't the right memory layout.

SIMD-accelerated via Java Vector API. Fused single-pass execution, dense group-by indexing, zone-map pruning. 35 of 46 queries faster than DuckDB at 10M rows.

Fork any table in O(1) via copy-on-write structural sharing. Named branches, time-travel, CoW snapshots - built into the engine, not bolted on.

No JNI, no native compilation. Runs anywhere a JVM does. SIMD acceleration without deployment complexity or platform lock-in.

PostgreSQL wire protocol. Full DML, CTEs, window functions, aggregates, FROM read_csv / read_parquet. psql, JDBC, DBeaver, psycopg2.

Datasets implement IEditableCollection, ILookup, IPersistentCollection. tablecloth and tech.ml.dataset work directly. DSL or SQL strings.

Branch Datahike, Stratum, and Proximum together via Yggdrasil. Consistent snapshots across SQL, Datalog, and vector search.

# Java 21+, no Clojure needed
java --add-modules jdk.incubator.vector \
 --enable-native-access=ALL-UNNAMED \
 - jar stratum-standalone.jar \
 --index orders:/data/orders.csv

# Or try the built-in demo tables (lineitem, taxi -100K rows each)
java --add-modules jdk.incubator.vector \
 --enable-native-access=ALL-UNNAMED \
 - jar stratum-standalone.jar --demo

# Connect with any PostgreSQL client
psql - h localhost - p 5432 - U stratum

(require '[stratum.api :as st])

;; Query with DSL
(st/q {:from {:price prices :qty quantities}
 :where [[:> :price 100]]
 :group [:region]
 :agg [[:sum [:* :price :qty]]
 [:count]]})

;; Or with SQL
(st/q "SELECT region, SUM(price * qty), COUNT(*)
 FROM orders WHERE price > 100 GROUP BY region"
 {"orders" {:price prices :qty quantities
 :region regions}})

Fused SIMD - predicate evaluation + aggregation in a single pass, no intermediate arrays
Dense group-by - direct array indexing for low-cardinality groups, hash tables for high-cardinality
Zone map pruning - skip entire chunks based on per-chunk min/max statistics
Parallel execution - cache-friendly work partitioning across all cores

Chunked B-tree - copy-on-write chunks with structural sharing across snapshots
Konserve backend - pluggable storage: filesystem, S3, or custom
Lazy loading - only accessed chunks are loaded from disk on demand
Mark-and-sweep GC - prune unreachable snapshots without manual cleanup

DML - SELECT, INSERT, UPDATE, DELETE, UPSERT (INSERT ON CONFLICT), UPDATE FROM, CREATE TABLE, DROP TABLE
Aggregates - SUM, COUNT, AVG, MIN, MAX, STDDEV, VARIANCE, CORR, MEDIAN, PERCENTILE_CONT, APPROX_QUANTILE, COUNT(DISTINCT), FILTER clause
Group-by - any number of columns, string and numeric, with HAVING
Joins - INNER, LEFT, RIGHT, FULL with multi-column keys
Window functions - ROW_NUMBER, RANK, DENSE_RANK, NTILE, PERCENT_RANK, CUME_DIST, LAG, LEAD, SUM/COUNT/AVG/MIN/MAX OVER with frame clauses
Composition - CTEs (WITH), subqueries, IN/EXISTS, UNION/INTERSECT/EXCEPT
Expressions - CASE WHEN, COALESCE, NULLIF, GREATEST, LEAST, CAST
Date/time - DATE_TRUNC, EXTRACT, DATE_ADD, DATE_DIFF
String - LIKE/ILIKE, UPPER/LOWER, LENGTH, SUBSTR
Files - FROM read_csv('file.csv'), FROM read_parquet('file.parquet')
Analytics - ANOMALY_SCORE, ANOMALY_PREDICT, ANOMALY_PROBA, ANOMALY_CONFIDENCE (isolation forest via SQL; online rotation for concept drift - docs)
Other - EXPLAIN, SELECT DISTINCT, IS NULL/IS NOT NULL, LIMIT/OFFSET

If you need help getting Stratum into production, we can help with integration, custom development, and support contracts.

; deps.edn (Clojure CLI)
; check https://clojars.org/org.replikativ/stratum for latest version
{:deps {org.replikativ/stratum {:mvn/version "0.1.114"}}}

; JVM flags required (add to :jvm-opts or alias)
:jvm-opts ["--add-modules=jdk.incubator.vector"
 "--enable-native-access=ALL-UNNAMED"]

;; Leiningen - project.clj
;; [org.replikativ/stratum "0.1.114"]

Requires JDK 21+. Clojure 1.12+. Apache 2.0 license. Latest version on Clojars.

GitHub Benchmark docs SQL interface Dataset API

Read the original article

whilo

Karma: 13

@Hacker__News
@hacker._news

Hacker News

Show HN: Stratum - a pure JVM columnar SQL engine using the Java Vector API

Show article

whilo

Comments

By whilo 2026-03-0622:39

HackerNews