Show HN: 22 GB of Hacker News in SQLite

Comments

By simonw 2025-12-3019:279 reply

Don't miss how this works. It's not a server-side application - this code runs entirely in your browser using SQLite compiled to WASM, but rather than fetching a full 22GB database it instead uses a clever hack that retrieves just "shards" of the SQLite database needed for the page you are viewing.

I watched it in the browser network panel and saw it fetch:

  https://hackerbook.dosaygo.com/static-shards/shard_1636.sqlite.gz
  https://hackerbook.dosaygo.com/static-shards/shard_1635.sqlite.gz
  https://hackerbook.dosaygo.com/static-shards/shard_1634.sqlite.gz

As I paginated to previous days.

It's reminiscent of that brilliant SQLite.js VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range headers, this one uses sharded files instead.

The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against, there are 1636 total.

By ncruces 2025-12-3023:482 reply

A read-only VFS doing this can be really simple, with the right API…

This is my VFS: https://github.com/ncruces/go-sqlite3/blob/main/vfs/readervf...

And using it with range requests: https://pkg.go.dev/github.com/ncruces/go-sqlite3/vfs/readerv...

And having it work with a Zstandard compressed SQLite database, is one library away: https://pkg.go.dev/github.com/SaveTheRbtz/zstd-seekable-form...

By keepamovin 2025-12-319:291 reply

Your page is served over sqlitevfs with Range queries? Let's try this here.

By ncruces 2025-12-3121:331 reply

I did a similar VFS in Go. It doesn't run client-side in a browser.

But you can use it (e.g.) in a small VPS to access a multi-TB database directly from S3.

By keepamovin 2026-01-015:14

That is cool. Maybe i look at the go code

By pdyc 2025-12-315:071 reply

this does not caches the data right? it would always fetch from network? by any chance do you know of solution/extension that caches the data it would make it so much more efficient.

By ncruces 2025-12-315:56

The package I'm using in the HTTP example can be configured to cache data: https://github.com/psanford/httpreadat?tab=readme-ov-file#ca...

But, also, SQLite caches data; you can simply increase the page cache.

By keepamovin 2025-12-314:451 reply

Thanks! I'm glad you enjoyed the sausage being made. There's a little easter egg if you click on the compact disc icon.

And I just now added a 'me' view. Enter your username and it will show your comments/posts on any day. So you can scrub back through your 2006 - 2025 retrospective using the calendar buttons.

By oblosys 2025-12-3114:341 reply

I almost got tricked into trying to figure out what was Easter eggy about August 9 2015 :-) There's a clarifying tooltip on the link, but it is mostly obscured by the image's "Archive" title attribute.

By keepamovin 2025-12-3114:371 reply

Oh, shit that was the problem! You solved the bug! I was trying to figure out why the right tooltip didn't display. A linked wrapped in an image wrapped in an easter egg! Or something. Ha, thank you. Will fix :)

edit: Fixed! Also I just pushed a new version with a Dec 29th Data Dump, so ... updates - yay!

By oblosys 2025-12-3120:34

Happy to help!

By nextaccountic 2025-12-3021:339 reply

Is there anything more production grade built around the same idea of HTTP range requests like that sqlite thing? This has so much potential

By Humphrey 2025-12-3022:252 reply

Yes — PMTiles is exactly that: a production-ready, single-file, static container for vector tiles built around HTTP range requests.

I’ve used it in production to self-host Australia-only maps on S3. We generated a single ~900 MB PMTiles file from OpenStreetMap (Australia only, up to Z14) and uploaded it to S3. Clients then fetch just the required byte ranges for each vector tile via HTTP range requests.

It’s fast, scales well, and bandwidth costs are negligible because clients only download the exact data they need.

https://docs.protomaps.com/pmtiles/

By simonw 2025-12-3022:322 reply

PMTiles is absurdly great software.

By Humphrey 2025-12-3022:361 reply

I know right! I'd never heard of HTTP Range requests until PMTiles - but gee it's an elegant solution.

By keepamovin 2025-12-318:25

Hadn't seen PMTiles before, but that matches the mental model exactly! I chose physical file sharding over Range Requests on a single db because it felt safer for 'dumb' static hosts like CF. - less risk of a single 22GB file getting stuck or cached weirdly. Maybe it would work

By hyperbolablabla 2025-12-319:202 reply

My only gripe is that the tile metadata is stored as JSON, which I get is for compatibility reasons with existing software, but for e.g. a simple C program to implement the full spec you need to ship a JSON parser on top of the PMTiles parser itself.

By seg_lol 2025-12-3112:561 reply

A JSON parser is less than a thousand lines of code.

By Diti 2025-12-3117:502 reply

And where most of CPU time will be wasted in, if you care about profiling/improving responsiveness.

By monerozcash 2026-01-016:44

At that point you're just io bound, no? I can easily parse json at 100+GB/s on commodity hardware, but I'm gonna have a much harder time actually delivering that much data to parse.

By keepamovin 2026-01-016:27

What's a better way?

By keepamovin 2025-12-3115:22

How would you store it?

By nextaccountic 2025-12-311:231 reply

That's neat, but.. is it just for cartographic data?

I want something like a db with indexes

By jtbaker 2025-12-316:561 reply

Look into using duckdb with remote http/s3 parquet files. The parquet files are organized as columnar vectors, grouped into chunks of rows. Each row group stores metadata about the set it contains that can be used to prune out data that doesn’t need to be scanned by the query engine. https://duckdb.org/docs/stable/guides/performance/indexing

LanceDB has a similar mechanism for operating on remote vector embeddings/text search.

It’s a fun time to be a dev in this space!

By nextaccountic 2026-01-021:441 reply

> Look into using duckdb with remote http/s3 parquet files. The parquet files are organized as columnar vectors, grouped into chunks of rows. Each row group stores metadata about the set it contains that can be used to prune out data that doesn’t need to be scanned by the query engine. https://duckdb.org/docs/stable/guides/performance/indexing

But, when using this on frontend, are portions of files fetched specifically with http range requests? I tried to search for it but couldn't find details

By jtbaker 2026-01-043:37

Yes, you should be able to see the byte range requests and 206 responses from an s3 compatible bucket or http server that supports those access patterns.

By simonw 2025-12-3021:372 reply

There was a UK government GitHub repo that did something interesting with this kind of trick against S3 but I checked just now and the repo is a 404. Here are my notes about what it did: https://simonwillison.net/2025/Feb/7/sqlite-s3vfs/

Looks like it's still on PyPI though: https://pypi.org/project/sqlite-s3vfs/

You can see inside it with my PyPI package explorer: https://tools.simonwillison.net/zip-wheel-explorer?package=s...

By simonw 2025-12-3022:303 reply

I recovered it from https://archive.softwareheritage.org/browse/origin/directory... and pushed a fresh copy to GitHub here:

https://github.com/simonw/sqlite-s3vfs

This comment was helpful in figuring out how to get a full Git clone out of the heritage archive: https://news.ycombinator.com/item?id=37516523#37517378

Here's a TIL I wrote up of the process: https://til.simonwillison.net/github/software-archive-recove...

By QuantumNomad_ 2025-12-3023:121 reply

I also have a locally cloned copy of that repo from when it was on GitHub. Same latest commit as your copy of it.

From what I see in GitHub in your copy of the repo, it looks like you don’t have the tags.

Do you have the tags locally?

If you don’t have the tags, I can push a copy of the repo to GitHub too and you can get the tags from my copy.

By simonw 2025-12-3023:231 reply

I don't have the tags! It would be awesome if you could push that.

By QuantumNomad_ 2025-12-3023:291 reply

Uploaded here:

https://github.com/Quantum-Nomad/sqlite-s3vfs

By simonw 2025-12-3023:331 reply

Thanks for that, though actually it turns out I had them after all - I needed to run:

  git push --tags origin

By QuantumNomad_ 2025-12-3023:35

All the better :)

By bspammer 2025-12-311:001 reply

Doing all this in an hour is such a good example of how absurdly efficient you can be with LLMs.

By socialcommenter 2025-12-3118:35

From reading the TIL, it doesn't appear as if Simon used LLM for a large portion of what he did; only the initial suggestion to check the archive, and the web tool to make his process reproducible. Also, if you read the script from his chat with Claude code, the prompt really does the heavy lifting.

Sure, the LLM fills in all the boilerplate and makes an easy-to-use, reproducible tool with loads of documentation, and credit for that. But is it not more accurate to say that Simon is absurdly efficient, LLM or sans LLM? :)

By AceJohnny2 2025-12-3022:461 reply

didn't you do something similar for Datasette, Simon?

By simonw 2025-12-3023:021 reply

Nothing smart with HTTP range requests yet - I have https://lite.datasette.io which runs the full Python server app in the browser via WebAssembly and Pyodide but it still works by fetching the entire SQLite file at once.

By AceJohnny2 2025-12-3023:04

oh! I must've been confused with your TIL where you linked to an explainer of this technique

https://simonwillison.net/2021/May/2/hosting-sqlite-database...

https://phiresky.github.io/blog/2021/hosting-sqlite-database...

https://news.ycombinator.com/item?id=27016630

By billywhizz 2025-12-313:13

i played around with this a while back. you can see a demo here. it also lets you pull new WAL segments in and apply them to the current database. never got much time to go any further with it than this.

https://just.billywhizz.io/sqlite/demo/#https://raw.githubus...

By ericd 2025-12-3021:43

This is somewhat related to a large dataset browsing service a friend and I worked on a while back - we made index files, and the browser ran a lightweight query planner to fetch static chunks which could be served from S3/torrents/whatever. It worked pretty well, and I think there’s a lot of potential for this style of data serving infra.

By turbobrew 2025-12-313:05

gdal vsis3 dynamically fetches chunks of rasters from s3 using range requests. It is the underlying technology for several mapping systems.

There is also a file format to optimize this https://cogeo.org/

By omneity 2025-12-311:33

I tried to implement something similar to optimize sampling semi-random documents from (very) large datasets on Huggingface, unfortunately their API doesn't support range requests well.

By mootothemax 2025-12-314:04

This is pretty much well what is so remarkable about parquet files; not only do you get seekable data, you can fetch only the columns you want too.

I believe that there are also indexing opportunities (not necessarily via eg hive partitioning) but frankly - am kinda out of my depth pn it.

By 6510 2025-12-310:011 reply

I want to see a bittorrent version :P

By nextaccountic 2025-12-311:21

Maybe webtorrent-based?

By tlarkworthy 2025-12-314:28

Parquet/iceberg

By keepamovin 2025-12-3118:57

A recent change is I added date spans to the shard checboxes on query view so it's easier to zero dates you want if you have that in mind. Because if your copy isn't local all those network pulls take a while.

The sequence of shards you saw when you paginated to days is faciliated by the static-manifest which maps HN item ID ranges to shards, and since IDs are increasing and a pretty good proxy of time (a "HN clock"), we can also map the shards that we cut up by ID to the time spans their items cover. An in memory table sorted by time is created from the manifest on load so we can easily look up which shard we need when you pick a day.

Funnily enough, this system was thrown off early on by a handful of "ID/timestamp" outliers in the data: items with weird future timestamps (offset by a couple years), or null timestamps. To cleanse our pure data from this noise, and restore proper adjacent-in-time shard cuts we just did a 1/99 percentile grouping and discarded the outliers leaving shards with sensible 'effective' time spans.

Sometimes we end up fetching two shards when you enter a new day because some items' comments exist "cross shard". We needed another index for that and it lives in cross-shard-index.bin which is just a list of 4-byte item IDs that have children in more than 1 shard (2-bytes), which occurs when people have the self-indulgence to respond to comments a few days after a post has died down ;)

Thankfully HN imposes a 2 week horizon for replies so there aren't that many cross-shard comments (those living outside the 2-3 days span of most, recent, shards). But I think there's still around 1M or so, IIRC.

By maxloh 2025-12-313:321 reply

I am curios why they don't use a single file and HTTP Range Requests instead. PMTiles (a distribution of OpenStreetMap) uses that.

By keepamovin 2025-12-314:48

This would be a neat idea to try. Want to add a PR? Bench different "hackends" to see how DuckDB, SQLite shards, or range queries perform?

By meander_water 2025-12-312:181 reply

I love this so much, on my phone this is much faster than actual HN (I know it's only a read-only version).

Where did you get the 22GB figure from? On the site it says:

> 46,399,072 items, 1,637 shards, 8.5GB, spanning Oct 9, 2006 to Dec 28, 2025

By amitmahbubani 2025-12-312:572 reply

> Where did you get the 22GB figure from?

The HN post title (:

By keepamovin 2025-12-314:46

22GB is non-gzipped.

By meander_water 2025-12-312:59

Hah, well that's embarrassing

By sodafountan 2025-12-313:131 reply

The GitHub page is no longer available, which is a shame because I'm really interested in how this works.

How was the entirety of HN stored in a single SQLite database? In other words, how was the data acquired? And how does the page load instantly if there's 22GB of data having to be downloaded to the browser?

By keepamovin 2025-12-314:501 reply

You can see it now, forgot to make it public.

- 1. download_hn.sh - bash script that queries BigQuery and saves the data to *.json.gz

- 2. etl-hn.js - does the sharding and ID -> shard map, plus the user stats shards.

- 3. Then either npx serve docs or upload to CloudFlare Pages.

The ./toool/s/predeploy-checks.sh script basically runs the entire pipeline. You can do it unattended with AUTO_RUN=true

By sodafountan 2025-12-316:47

Awesome, I'll take a look

By dzhiurgis 2025-12-3122:28

Is it possible to implement search this way?

By tehlike 2025-12-3019:47

Vfs support is amazing.

By yread 2025-12-3021:176 reply

I wonder how much smaller it could get with some compression. You could probably encode "This website hijacks the scrollbar and I don't like it" comments into just a few bits.

By Rendello 2025-12-3022:271 reply

The hard-coded dictionary wouldn't be much stranger than Brotli's:

https://news.ycombinator.com/item?id=27160590

By maxbond 2025-12-315:45

You can use a BPE variant like SentencePiece to identify these patterns rather than hard coding them.

By jacquesm 2025-12-3022:23

That's at least 45%, then you can leave out all of my comments and you're left with only 5!

By hamburglar 2025-12-312:31

It might be a neat experiment to use ai to produce canonicalized paraphrasings of HN arguments so they could be compared directly and compress well.

By keepamovin 2026-01-0123:22

Dear it is already compressed using G zip – nine for every SQLlight shard and manifest

22 GB is uncompressed and compressed the entire things about 9 GB

By rossant 2025-12-316:45

Guilty.

By kamranjon 2025-12-310:583 reply

It'd be great if you could add it to Kiwix[1] somehow (not sure what the process is for that but 100rabbits figured it out for their site) - I use it all the time now that I have a dumb phone - I have the entirety of wikipedia, wiktionary and 100rabbits all offline.

https://kiwix.org/en/

By codazoda 2025-12-313:43

I love that you have 100r.ca on that short list.

By endofreach 2025-12-315:151 reply

what dumb phone do you use?

and why do you want wikipedia in your pocket, but not a smartphone? where do you draw the line?

(doing a lot of work in that area, so i am asking to learn from someone who might think alike)

By kamranjon 2025-12-315:341 reply

I use the Mudita Kompakt specifically cause it allows sideloading so I can still have a few extras. Right now I have Kiwix and Libby. It works really well.

I have a $10 a month plan from US cellular with only 2gigs so I try to keep everything offline that I can.

Honestly it's mostly the news... so I draw the line at browser, I'll never install a browser, that's basically something I can do when I sit down at a PC. I read quite a bit and I like to have the ability to look up a word or a historical event or some reference from something I read using Kiwix and it's been great for that, just needed to add a 512gb micro sd card. And Libby I just use at the gym when I'm on the treadmill.

By endofreach 2026-01-0121:32

interesting. thank you. any way i can reach out to you regardibg a project i am working on?

your input would be very valuable.

By keepamovin 2026-01-0122:00

This is a good idea — we should do it.

I also want to make sure we can build this in CI. My goal is to have this updated every day using the BigQuery update process, so it becomes a 1–2 day delayed static archive of the current state of Hacker News, which is honestly very cool.

I can probably run the build for free on GitHub Actions runners, as long as the runner has about 40 GB of disk space available. If needed, I can free up space on the runner before the build starts.

I’ll also write to GitHub support and ask if they can sponsor the cost of a larger runner, mainly because I need the extra disk space to run the build reliably.