An extensible, state of the art columnar file format. Formerly at @spiraldb, now an Incubation Stage project at LFAI&Data, part of the Linux Foundation. - vortex-data/vortex
The cuDF interop in the roadmap [1] will be huge for my workloads. XGBoost has the fastest inference time on GPUs, so a fast path straight from these Vortex files to GPU memory seems promising.
Can you explain how it’s faster? GPU memory is just a blob with an address. Is it because the loading algorithms for Vortex align better with XGBoost, or is it just plain uploading to the GPU?
If you have a GPU-friendly format, you can send compressed data over PCI-E and then decompress it on the GPU. Your overall throughput increases because PCI-E bandwidth is the limiting factor of the overall system.
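A back-of-the-envelope sketch of that argument: if the link is the bottleneck, shipping compressed bytes multiplies the effective rate by the compression ratio, up to the point where GPU-side decompression becomes the limit. The function name and all numbers below are illustrative assumptions, not measurements of Vortex or of any particular GPU.

```python
def effective_throughput_gb_s(link_gb_s: float,
                              compression_ratio: float,
                              gpu_decompress_gb_s: float) -> float:
    """Uncompressed GB/s arriving in GPU memory when compressed data crosses the link.

    link_gb_s:            raw link bandwidth, counted in compressed bytes
    compression_ratio:    uncompressed size / compressed size
    gpu_decompress_gb_s:  rate at which the GPU can emit uncompressed output
    """
    # The link delivers link_gb_s of compressed data, which expands by the ratio;
    # the GPU decompressor caps whatever the link can feed it.
    return min(link_gb_s * compression_ratio, gpu_decompress_gb_s)


# Hypothetical numbers: a 25 GB/s link, 4x compression, 200 GB/s GPU decode.
uncompressed_path = effective_throughput_gb_s(25.0, 1.0, float("inf"))
compressed_path = effective_throughput_gb_s(25.0, 4.0, 200.0)
print(uncompressed_path, compressed_path)
```

Under these assumed numbers the compressed path is limited by the link times the ratio, not by the decoder. With a slower decoder the `min` would flip and decompression would become the bottleneck instead.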
That doesn’t explain how Vortex is faster. Yes, you should send compressed data to the GPU and let it decompress. You should maximize your PCI-E throughput to minimize latency in execution, but what does Vortex bring? Other than Parquet bad, Vortex good.
XGBoost is just faster on the GPU, regardless of the file format. A sibling post also pointed out compression helping out on bandwidth.
One thing I found interesting is the logical type system doesn't seem to include sum types or unions, unlike Arrow etc.
I'd generally encourage new type systems to include sum types as a first-class concept.
I wonder if a columnar storage format should implement sum types with a struct of arrays where only one array has a non-null value for each index.
Arrow has two variants of it, and this is one of them. The other variant has a separate offsets array that you use to index into the active “field” array, so it is slower to process in most cases but more compact.
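To make the two layouts concrete, here is a minimal pure-Python sketch of the idea behind Arrow's sparse and dense unions. It does not use the Arrow library; the class names and the `build` helper are hypothetical, and plain lists stand in for buffers. In the sparse layout every child array has the full length and only the active child holds a meaningful value at each index. In the dense layout each child holds only its own values, and an offsets array points into the active child.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class SparseUnion:
    type_ids: List[int]                   # which child is active at each index
    children: List[List[Optional[Any]]]   # every child has the full length

    def get(self, i: int) -> Any:
        return self.children[self.type_ids[i]][i]


@dataclass
class DenseUnion:
    type_ids: List[int]        # which child is active at each index
    offsets: List[int]         # position inside the active child
    children: List[List[Any]]  # each child stores only its own values

    def get(self, i: int) -> Any:
        return self.children[self.type_ids[i]][self.offsets[i]]


def build(values: List[Any]) -> Tuple[SparseUnion, DenseUnion]:
    """Encode a list of int-or-str values; child 0 holds ints, child 1 holds strs."""
    type_ids = [0 if isinstance(v, int) else 1 for v in values]
    n = len(values)
    sparse_children: List[List[Optional[Any]]] = [[None] * n, [None] * n]
    dense_children: List[List[Any]] = [[], []]
    offsets: List[int] = []
    for i, (t, v) in enumerate(zip(type_ids, values)):
        sparse_children[t][i] = v              # inactive slots stay as placeholders
        offsets.append(len(dense_children[t]))
        dense_children[t].append(v)
    return (SparseUnion(type_ids, sparse_children),
            DenseUnion(type_ids, offsets, dense_children))


values = [1, "a", 2, "b", "c"]
sparse, dense = build(values)
print(sparse.children)
print(dense.offsets, dense.children)
```

The trade-off mentioned above shows up directly: the sparse form stores `len(values)` slots per child but reads with a single index, while the dense form stores each value once at the cost of an extra offsets lookup per read.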
Can you append new columns to a file stored on disk without reading it all into memory? Somehow this is beyond Parquet's capabilities.
The default writer will decompress the values. However, right now you can implement your own write strategy that avoids doing so. We plan on adding that as an option since it’s quite common.