Show HN: Luminal – Open-source, search-based GPU compiler

2025-08-20 16:01 · github.com

Deep learning at the speed of light.



Luminal is a deep learning library that uses search-based compilation to achieve high performance.

To run the demo shown on HN on macOS, clone this repo and run:

cd demos/matmul
cargo run --release

Important

We're undergoing a large transition to "2.0", which introduces large-scale kernel search. This radically simplifies the compiler stack and allows us to discover complex optimizations entirely automatically. Please keep an eye out for breaking changes, which are usually staged in crates/luminal_2 before being merged into the main crate.

use luminal::prelude::*;

// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.tensor((3, 1)).set([[1.0], [2.0], [3.0]]);
let b = cx.tensor((1, 4)).set([[1.0, 2.0, 3.0, 4.0]]);

// Do math...
let mut c = a.matmul(b).retrieve();

// Compile and run graph
cx.compile(<(GenericCompiler, CPUCompiler)>::default(), &mut c);
cx.execute();

// Get result
println!("Result: {:?}", c);

Llama 3 8B

  • Below is a quick example of how you can run Llama 3 8B locally with Luminal:
cd ./examples/llama
# Download the model
bash ./setup/setup.sh
# Run the model
cargo run --release --features metal # MacOS (Recommended)
cargo run --release --features cuda # Nvidia
cargo run --release # CPU

Luminal can run Q8 Llama 3 8B on M-series MacBooks at 15-25 tokens per second. The goal is to become the fastest ML framework for any model on any device.

The core of luminal is and always will be minimal. It should be possible to understand the entire core library in an afternoon.

Everything in luminal boils down to 12 primitive ops:

  • Unary - Log2, Exp2, Sin, Sqrt, Recip
  • Binary - Add, Mul, Mod, LessThan
  • Other - SumReduce, MaxReduce, Contiguous

These ops are enough to support transformers, convnets, etc.
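To make that decomposition concrete, here is a sketch in plain Rust (not Luminal's actual API) of a few primitives as functions on f32 slices, composed into a numerically stable softmax. Add and Mul appear inline as + and *, and subtraction is Add of a Mul by -1:

```rust
// Toy versions of a few of the 12 primitives, operating on f32 slices.
fn exp2(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v.exp2()).collect()
}
fn recip(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| 1.0 / v).collect()
}
fn sum_reduce(x: &[f32]) -> f32 {
    x.iter().sum()
}
fn max_reduce(x: &[f32]) -> f32 {
    x.iter().fold(f32::NEG_INFINITY, |m, &v| m.max(v))
}

// exp(x) = 2^(x * log2(e)): natural exp built from Mul and Exp2.
fn exp(x: &[f32]) -> Vec<f32> {
    let scaled: Vec<f32> = x.iter().map(|v| v * std::f32::consts::LOG2_E).collect();
    exp2(&scaled)
}

// Numerically stable softmax from MaxReduce, Add/Mul, Exp2, SumReduce, Recip.
fn softmax(x: &[f32]) -> Vec<f32> {
    let m = max_reduce(x);
    let shifted: Vec<f32> = x.iter().map(|v| v - m).collect(); // Add(x, Mul(m, -1))
    let e = exp(&shifted);
    let denom = recip(&[sum_reduce(&e)])[0];
    e.iter().map(|v| v * denom).collect()
}
```

The same composition idea extends to the rest of the high-level op surface: mean is SumReduce times Recip of the element count, division is Mul by Recip, and so on.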

We compile these ops into complex GPU kernels, so even though our ops are simple, we get high performance through the power of compilers. This is how we overcome the typical RISC disadvantages.

The best heuristic is no heuristic. We try to search every possible decision to give the compiler the most flexibility to discover complex optimizations. This allows us to automatically derive Flash Attention and other similarly complex rewrites. It also allows us to stay extremely small long into the future and beat the performance of far larger frameworks with tons of handwritten kernels.
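As an illustration of the idea only (Luminal's real search works over e-graphs of rewritten kernels, not this), here is a hypothetical brute-force autotuner that benchmarks every candidate in a tiny search space of matmul tile sizes and keeps the fastest:

```rust
use std::time::Instant;

// Square matmul with a configurable tile size over the i and k loops.
fn matmul_tiled(a: &[f32], b: &[f32], n: usize, tile: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; n * n];
    for ii in (0..n).step_by(tile) {
        for kk in (0..n).step_by(tile) {
            for i in ii..(ii + tile).min(n) {
                for k in kk..(kk + tile).min(n) {
                    let av = a[i * n + k];
                    for j in 0..n {
                        c[i * n + j] += av * b[k * n + j];
                    }
                }
            }
        }
    }
    c
}

// Brute-force "search": run every candidate, keep the fastest by wall clock.
// A real search-based compiler prunes this combinatorial space instead.
fn pick_best_tile(n: usize, candidates: &[usize]) -> usize {
    let a = vec![1.0f32; n * n];
    let b = vec![1.0f32; n * n];
    let mut best = (candidates[0], f64::INFINITY);
    for &t in candidates {
        let start = Instant::now();
        let c = matmul_tiled(&a, &b, n, t);
        let dt = start.elapsed().as_secs_f64();
        assert_eq!(c[0], n as f32); // sanity-check the candidate's output
        if dt < best.1 {
            best = (t, dt);
        }
    }
    best.0
}
```

Runtime is the gold standard here: correctness is checked against a reference, and the winner is whatever actually runs fastest on this hardware.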

The current ML ecosystem is too fragmented, and the solution isn't another layer of abstraction. Luminal is written in Rust and interacts directly with the CUDA / Metal APIs. No indirection layers, Docker containers, or virtual environments. Just a statically-linked Rust crate.

Correctness matters, so we write as many tests as possible to cover all ops and verify they behave the same as an equivalent PyTorch implementation. (Improvements needed!)

Most deep learning libraries are eager-first, meaning each op call directly operates on the data. In PyTorch, when you see x + y, the addition actually happens right there. This is great for debugging because it works exactly as most developers expect.

However, this isn't great for performance. What makes sense for a developer doesn't work well for the machine, in the same way that no one writes assembly by hand. Most libraries try to fix this problem by tacking on operator fusion or JIT compilation to reshape the computation into something better for the machine. This turns out to be super difficult, even for PyTorch!

A core tenet of Luminal is ahead-of-time compilation. Whenever possible, push everything to compile time and leave nothing to run time. Luminal takes an approach more similar to XLA and tinygrad. Everything's static here. When you write out an expression like x + y, no actual computation happens. The operation is recorded to a directed acyclic computation graph for execution later. Only once graph.execute() is run does the computation happen. But isn't that just lazy execution? Yes it is! But in Luminal everything is done this way. All neural networks are built up as one or a few static computation graphs, compiled, and executed later.
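A minimal sketch of that recording step, with made-up names rather than Luminal's real types: each op call appends a node to a graph, and nothing computes until execute() walks the DAG:

```rust
// Ops are recorded as nodes referencing earlier nodes by index.
enum Node {
    Input(Vec<f32>),
    Add(usize, usize),
    Mul(usize, usize),
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    fn new() -> Self {
        Graph { nodes: Vec::new() }
    }
    fn input(&mut self, data: Vec<f32>) -> usize {
        self.nodes.push(Node::Input(data));
        self.nodes.len() - 1
    }
    // "x + y" records a node; no arithmetic happens here.
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.nodes.push(Node::Add(a, b));
        self.nodes.len() - 1
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        self.nodes.push(Node::Mul(a, b));
        self.nodes.len() - 1
    }
    // Only here does computation happen, in topological (insertion) order.
    fn execute(&self, out: usize) -> Vec<f32> {
        let mut vals: Vec<Vec<f32>> = Vec::new();
        for node in &self.nodes {
            let v = match node {
                Node::Input(d) => d.clone(),
                Node::Add(a, b) => vals[*a].iter().zip(&vals[*b]).map(|(x, y)| x + y).collect(),
                Node::Mul(a, b) => vals[*a].iter().zip(&vals[*b]).map(|(x, y)| x * y).collect(),
            };
            vals.push(v);
        }
        vals[out].clone()
    }
}
```

Because the whole graph exists before anything runs, a compiler can inspect, rewrite, and fuse it freely before execution.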

But why?

A consequence of this is that the actual computation that gets run can be radically different from the code that was written. Since we have an entire neural network fully represented in a compute graph, our compilers have global knowledge. This means we can push most ML complexity to the compilers. For instance, devices, datatypes, and execution schedules are all handled by compilers. Even autograd is handled by a compiler!

Now we can do:

  • Aggressive kernel fusion
  • Shape-specific kernels compiled at runtime
  • Devices and dtypes handled through compilers (run the CUDA compiler to convert the graph to CUDA kernels, then the fp16 compiler to convert to half-precision kernels)
  • Networks written in generic code, compiled and run fast on hyper-specific architectures (try writing a PyTorch network that works with both TF32 dtypes and TPUs; get ready for if-statement hell...)

Current status:

  • Search is partially merged. We are between 1.0 and 2.0 (search), which should be completed within the next month or so.
  • Metal and CUDA are supported for running models on Macs and Nvidia GPUs respectively, in both full and half precision.
  • Full training support with graph-based autograd.
  • Llama 3, Phi 3, Whisper, and YOLOv8 are implemented in examples/. See the instructions above for running them.
  • We have a small library of NN modules in luminal_nn, including transformers.
  • A significant number of high-level ops are implemented in hl_ops. We aim to match the most-used ~80% of the PyTorch API.
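To illustrate what kernel fusion buys (a toy CPU sketch, not Luminal's generated kernels): the unfused version makes two passes over memory and materializes an intermediate buffer, while the fused version does everything in one pass:

```rust
// Unfused: two "kernels", each a full pass over memory, plus a temp buffer.
fn unfused(x: &[f32]) -> Vec<f32> {
    let doubled: Vec<f32> = x.iter().map(|v| v * 2.0).collect(); // kernel 1
    doubled.iter().map(|v| v + 1.0).collect()                    // kernel 2
}

// Fused: one kernel, one pass, no intermediate allocation.
fn fused(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v * 2.0 + 1.0).collect()
}
```

On a GPU the savings are larger still, since each unfused kernel also pays a launch and a round trip through global memory.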

Some things on the roadmap:

  • Expand the search space to utilize Tensor Cores more flexibly
  • Bring cuda to parity with Metal
  • Add Blackwell intrinsics, such as TMEM and TMA
  • Build a ROCm backend
  • Build benchmarking suite to test against other libs
  • Distributed data, pipeline, and tensor parallelism
  • Beat PyTorch 2.0 performance on LLM inference and training
  • Write compiler for quantum photonic retro encabulator
  • Build dyson swarm

Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.



Comments

  • By Alifatisk 2025-08-20 17:05 · 3 replies

    So wait, am I understanding this correctly?

    Instead of applying just predetermined optimization rules or patterns, the compiler formulates the problem as searching through many possible configurations or versions of the code. Each possible version can have different arrangements, tiling sizes, thread block configurations, memory access patterns, and instruction sequences, right?

    And from my understanding, the “search space” is just a collection of all potential versions of the code (kernels) that the compiler can generate from the original input. So for example, the space might include

    - Different ways to partition workloads among GPU threads and blocks

    - Varying memory access strategies (using shared memory, global memory)

    - Various instruction-level optimizations or reordering

    - Alternative loop unroll factors or vectorization strategies

    The compiler then programmatically produces a large number of candidate kernels by combining different optimizations and configurations. Among these millions of candidates, the compiler tries to find the one that performs best.

    In that case, can the compiler print out which gpu configuration works the best for that computer? And will that configuration be applicable to all computers with the same setup?

    This is such an interesting technique.

    • By jakestevens2 2025-08-20 17:31 · 3 replies

      Your description is exactly right. We create a search space of all possible kernels and find the best ones based on runtime. The best heuristic is no heuristic.

      This obviously creates a combinatorial problem that we mitigate with smarter search.

      The kernels are run on the computer the compiler is running on. Since runtime is our gold standard it will search for the best configuration for your hardware target. As long as the setup is mostly the same, the optimizations should carry over, yes.

      • By erichocean 2025-08-21 7:11 · 3 replies

        > that we mitigate with smarter search

        aka "a heuristic"

        • By jakestevens2 2025-08-21 15:19

          See my other comments about static profiling of kernels. There are ways of improving the search that keep runtime at the heart of it.

        • By jafioti 2025-08-21 16:30

          mcts / rl isn't really a heuristic. but yes heuristics can be used temporarily to keep the search space small, and removed over time as the search algorithm improves.

        • By gregorygoc 2025-08-21 12:05

          Exactly, I was going to ask about this bit…

      • By UncleOxidant 2025-08-20 17:55 · 3 replies

        How long does this typically take? It sounds time consuming. Also, it seems like this could be similar to doing a GA?

        • By jakestevens2 2025-08-20 18:01 · 1 reply

          That depends on the model architecture and how it was written since that informs the size of the search space.

          The typical range is 10 mins to 10 hours. It won't be fast but you only have to do it once and then those optimizations are set for every forward pass.

          • By sitkack 2025-08-21 0:09 · 1 reply

            Do you learn the capabilities of the underlying hardware relative to the kernel src? You should be able to start predicting perf using learned static profiling.

            • By jakestevens2 2025-08-21 0:51

              Not today but we will implement memoization of kernels for each hardware backend, yes.

        • By jakestevens2 2025-08-20 18:04

          You can also set a time budget for how long you'd like the search to run for to avoid wasting time on diminishing returns.


      • By pilooch 2025-08-20 21:53

        Is this a bit similar to what tensorrt does, but in a more opened manner ?

    • By jafioti 2025-08-20 22:24 · 2 replies

      yup! we build a search space by iteratively applying rewrite rules in every possible order (using e-graphs to do this efficiently). the rewrites alter stuff like looping / tiling structures, as well as algebraic rewrites like softmax to online softmax (and then flash attention).

      yes optimized kernels for one system will work on other systems with the same hardware. its fine to take a long time compiling if you just compile once and run a lot.

      • By _0ffh 2025-08-20 23:09

        Is/will it be possible to just write a model component with Luminal and then use that as a building block in e.g. Torch or JAX?

      • By almostgotcaught 2025-08-21 2:12 · 1 reply

        > take a long time compiling

        Lol np-hard is still np-hard no matter how you slice it (especially given vague objective functions).

  • By diggan 2025-08-20 17:00 · 2 replies

    > Luminal can run Q8 Llama 3 8B on M-series Macbooks at 15-25 tokens per second. The goal is to become the fastest ML framework for any model on any device.

    Great that some numbers are provided, but in isolation, I'm not sure what they provide. It would be helpful to also share what tok/s you'd get with llama.cpp or something else on the same hardware, so we can actually understand if it's faster or not :) Also including the prompt processing would be a bonus!

    • By Reubend 2025-08-21 2:14

      Yeah those numbers look very low to me for something that's supposed to represent a state of the art optimization technique. I think that's lower than other implementations, although it depends on the MacBook.

      Nonetheless this project looks very cool, and I hope they can continue improving it to the point where it indeed beats human-led optimizations.

    • By jafioti 2025-08-20 17:36

      a lot of the search is still being optimized so we don't match super hand-optimized kernels like llama.cpp has, so we def don't match their tps yet, but i want to make a perf tracking page to see improvements over time and prevent regressions

  • By sakras 2025-08-20 18:38 · 1 reply

    I see you guys are using Egg/Egglog! I've been mildly interested in egraphs for quite a while, glad to see they're gaining traction!

    • By PoignardAzur 2025-08-20 21:06 · 1 reply

      Right, my first thought when reading the blurb was "kinda sounds like e-graphs?"

      • By jafioti 2025-08-20 22:20

        e-graphs are awesome! none of this would be possible without them.
